Language Identification
Language identification is the problem of determining which language a document, or a part of a document, was written in. The OrcaTec language identification module contains a language model that describes the patterns of character sequences typical of each language it supports. For example, the letter sequence “eux” is much more common in French text than in English text, and “ery” is much more common in English than in French.
The OrcaTec language identification module uses N-grams to recognize both the encoding of documents and their language. An N-gram is a sequence of N consecutive letters. For example, “ery” is a 3-gram, or trigram, because it is a sequence of three letters. The frequencies with which the various N-grams occur in the different languages are used to guess, with high accuracy, the language in which the text was written. The OrcaTec language model for language identification incorporates these frequencies and then uses Kullback-Leibler divergence to determine which language's N-gram distribution a particular sample most closely matches. The document is then routed to the appropriate process or reader based on its language.
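The approach described above can be illustrated with a minimal sketch. It is not OrcaTec's implementation: the function names, the tiny training sentences, the additive smoothing constant, and the choice of trigrams are all assumptions made for illustration. The sketch counts character trigrams, treats the counts as probability distributions, and picks the language whose distribution has the smallest Kullback-Leibler divergence from the sample.

```python
import math
from collections import Counter


def ngram_counts(text, n=3):
    """Count the character n-grams in text (lowercased, whitespace collapsed)."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def kl_divergence(sample, model, vocab_size, alpha=0.5):
    """D(P || Q) = sum over x of P(x) * log(P(x) / Q(x)).

    P is the sample's n-gram distribution. Q is the language model's
    distribution with additive smoothing (alpha is an assumed constant),
    so n-grams the model has never seen still get a nonzero probability.
    """
    sample_total = sum(sample.values())
    denom = sum(model.values()) + alpha * vocab_size
    divergence = 0.0
    for gram, count in sample.items():
        p = count / sample_total
        q = (model.get(gram, 0) + alpha) / denom
        divergence += p * math.log(p / q)
    return divergence


def identify(text, models, n=3):
    """Return the language whose model diverges least from the text."""
    sample = ngram_counts(text, n)
    if not sample:
        raise ValueError("text is shorter than n characters")
    # Shared vocabulary: every n-gram seen in any model or in the sample.
    vocab = set(sample)
    for counts in models.values():
        vocab.update(counts)
    return min(
        models,
        key=lambda lang: kl_divergence(sample, models[lang], len(vocab)),
    )


# Toy training text, invented for this example; a real model would be
# estimated from a large corpus for each language.
TRAINING = {
    "english": (
        "the quick brown fox jumps over the lazy dog every day and the "
        "bakery on the corner sells very fresh bread every morning"
    ),
    "french": (
        "le renard brun saute par dessus le chien paresseux et les deux "
        "vieux messieurs heureux mangent du pain frais chez eux chaque matin"
    ),
}
MODELS = {lang: ngram_counts(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    for phrase in ("every morning the dog jumps", "les vieux chiens heureux"):
        print(phrase, "->", identify(phrase, MODELS))
```

In this sketch the smoothing step matters: without it, a single trigram absent from a language's model would make the divergence infinite. Very short samples give few trigrams to compare, so accuracy in practice depends on both the sample length and the size of the training corpus.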