Dhvani detects the language of the input text automatically, mainly from the Unicode code point ranges of its characters. The exact steps used for language detection are as follows:
- Tokenize the input into words. This is done by the synthesizer.
- For each word, scan from left to right until a character in the Unicode range of any supported language is found.
- If the detected script is Devanagari, which is shared by several languages, check the system locale. If the system locale is a supported language written in Devanagari, return that language. Marathi uses the Devanagari script, so this step is important for the Marathi module.
- If the language of a word cannot be detected, try to read it using the language of the previous word. Take, for example, a string like "മലയാളം 123".
Here the language of 123 cannot be determined using the above rules, so the language of the previous word is tried. In this case it is ml_IN, so 123 is passed to the number reading function of the Malayalam module.
- If none of the above succeeds, exit the program with an error message.
- Language detection is done for each word. That means that even if the input sentence mixes words from several supported languages, Dhvani will read each word in the corresponding language. For example, a single sentence may contain Hindi, Marathi, Gujarati and Malayalam words.
- A language specified by the user takes priority. If the user specifies the language using the -l or --lang switch, the above algorithm is not used.
- Since detection is not based on the first letter, a word may start with any symbols. For example, "---മലയാളം" will be recognized as a Malayalam word.
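
The steps above can be sketched roughly as follows. This is only an illustration in Python, not Dhvani's actual code: the function and table names are hypothetical, only three scripts are listed, and the language codes other than ml_IN are assumed to follow the same pattern. The fallback to Hindi when the locale is not a Devanagari language is also an assumption, since the steps above do not say what happens in that case.

```python
from typing import List, Optional, Tuple

# Unicode block ranges for a few Indic scripts (from the Unicode code charts).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Malayalam": (0x0D00, 0x0D7F),
}

# Hypothetical tables: codes other than ml_IN are assumed, not taken from Dhvani.
DEFAULT_LANG = {"Devanagari": "hi_IN", "Gujarati": "gu_IN", "Malayalam": "ml_IN"}
DEVANAGARI_LANGS = {"hi_IN", "mr_IN"}


def script_of(word: str) -> Optional[str]:
    """Scan left to right and return the script of the first character
    that falls inside a known range, or None if there is none."""
    for ch in word:
        cp = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= cp <= high:
                return script
    return None


def detect_word(word: str, previous: Optional[str], system_locale: str,
                forced: Optional[str] = None) -> str:
    """Return the language code for one word."""
    if forced is not None:
        # A language given by the user bypasses detection entirely.
        return forced
    script = script_of(word)
    if script is None:
        # No known script in this word: reuse the previous word's language.
        if previous is None:
            raise ValueError("could not detect language of %r" % word)
        return previous
    if script == "Devanagari" and system_locale in DEVANAGARI_LANGS:
        # Devanagari is shared, so the system locale breaks the tie.
        return system_locale
    # Assumption: otherwise use a per-script default language.
    return DEFAULT_LANG[script]


def detect_text(text: str, system_locale: str = "en_US",
                forced: Optional[str] = None) -> List[Tuple[str, str]]:
    """Detect the language of every whitespace-separated word."""
    result = []
    previous = None
    for word in text.split():
        previous = detect_word(word, previous, system_locale, forced)
        result.append((word, previous))
    return result


if __name__ == "__main__":
    print(detect_text("മലയാളം 123"))
    # [('മലയാളം', 'ml_IN'), ('123', 'ml_IN')]
```

In this sketch the input is split on whitespace, whereas in Dhvani tokenization is done by the synthesizer. Raising an exception stands in for exiting the program with an error message.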