Concept Search (includes Boolean)
Two of the biggest problems in using a search engine to identify relevant documents are synonymy and polysemy. Synonymy means that there are multiple ways of saying the same thing (there are over 200 words that mean roughly “think”), and polysemy means that the same word can have multiple meanings (there are 82 definitions for the word “strike”). Language modeling takes advantage of the fact that the words in a text are not chosen completely randomly. If a document has the word “lawyer” in it, then it is also likely to have words like “judge,” “matter,” “case,” “court,” etc. Conversely, if a document has words like “judge,” “matter,” “case,” or “court,” then it is likely to be about the same topic, whether it uses “lawyer,” a synonym such as “attorney,” or neither.
For each word in the set of documents, the language model derives the probability that this word occurs, given the other words in the same paragraph. Typically, the set of documents used to compute the language model is the collection that is indexed, but it can be a different set if desired. The use of a language model allows relevant documents to be retrieved even if they do not happen to contain the query word or words. It also allows documents to be ranked, not just by whether they contain a specific word, but by whether that word appears in the appropriate context, thereby reducing the effects of both synonymy and polysemy.
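The idea can be illustrated with a minimal Python sketch. It is not the Concept Search implementation: it assumes a simple paragraph-level co-occurrence estimate of the probability of one word given another, and the names build_model, score, tokenize, and STOPWORDS, the tiny stop-word list, the sample paragraphs, and the scoring formula are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stop-word list, not a real one.
STOPWORDS = {"the", "to", "for", "was", "a", "of"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words, keep unique words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def build_model(paragraphs):
    """Estimate P(word | context word) from paragraph-level co-occurrence."""
    para_count = Counter()             # paragraphs containing each word
    pair_count = defaultdict(Counter)  # pair_count[c][w]: paragraphs with both
    for text in paragraphs:
        words = tokenize(text)
        for c in words:
            para_count[c] += 1
            for w in words:
                if w != c:
                    pair_count[c][w] += 1
    return {c: {w: n / para_count[c] for w, n in related.items()}
            for c, related in pair_count.items()}

def score(model, query, text):
    """Score a text by direct query matches plus contextually related words."""
    words = tokenize(text)
    total = 0.0
    for q in tokenize(query):
        if q in words:
            total += 1.0               # the query word itself appears
        related = model.get(q, {})
        # Average association between the query word and the text's words.
        context = sum(related.get(w, 0.0) for w in words if w != q)
        total += context / max(len(words), 1)
    return total

# Hypothetical sample collection, used both to build the model and to search.
docs = [
    "the lawyer asked the judge to dismiss the case",
    "the attorney told the court the case was closed",
    "the lawyer addressed the court",
    "the chef cooked pasta for dinner",
]
model = build_model(docs)
ranked = sorted(docs, key=lambda d: score(model, "lawyer", d), reverse=True)
```

In this sketch, the second paragraph never mentions “lawyer,” yet it receives a nonzero score for that query because “court” and “case” co-occur with “lawyer” elsewhere in the collection, while the unrelated cooking paragraph scores zero. A paragraph that contains the query word in its usual context still ranks above both.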
The Concept Search module also includes the capabilities for full Boolean search that users have come to expect.