Interesting Phrase Finder

In information retrieval, it is often necessary to get a quick idea of what a text or collection is about. One method for doing this is to identify the interesting words and phrases. In this context, "interesting" means that the terms are characteristic of specific documents: they are useful for distinguishing a document or group of documents from the rest of a collection, or for distinguishing a collection from the universe of all possible documents in that language.

For example, in electronic discovery, attorneys may need a quick overview of a collection of documents that tells them roughly what the documents are about and how they differ from other documents. In commercial applications, it may be useful to identify the words that are typical of a specific document, either for summarization or to link to other documents that may be related. Interesting terms can also be used to label clusters of documents, email threads, or other conversational threads.

Some sample applications for Interesting Phrases include:

  • Help me to summarize what a group of documents is about
  • Help me to determine whether a document might be responsive to my interests
  • Help me to identify terms that would be useful to find more documents like this one
  • Help me to identify terms that are related to the search term I used to find this document
  • Help me to identify terms that other people may use to search for this document

The OrcaTec Interesting Phrase Finder uses a language model to determine which words in a text are likely to be contextually interesting. The language model describes which words are expected in general in a collection of documents. Words or phrases that are characteristic of the document or documents, but are otherwise atypical of the collection, are the most likely to be interesting in that context. The OrcaTec Interesting Phrase Finder is trained specifically on each collection of documents, because a concept may be interesting in one context but not in another.
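
To make the idea concrete, here is a minimal sketch of one way such scoring could work. It is not the OrcaTec implementation, whose language model is not described here. It assumes a simple smoothed unigram model built from the collection and scores each word by how much more frequent it is in the document than in the collection as a whole. The names `tokenize` and `interesting_terms` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def interesting_terms(document, collection, top_n=5):
    """Rank the words of `document` by how atypical they are of `collection`.

    Assumption: a unigram model with add-one smoothing stands in for the
    product's language model, which this page does not specify.
    """
    doc_counts = Counter(tokenize(document))
    coll_counts = Counter()
    for text in collection:
        coll_counts.update(tokenize(text))

    vocab_size = len(set(doc_counts) | set(coll_counts))
    doc_total = sum(doc_counts.values())
    coll_total = sum(coll_counts.values())

    scores = {}
    for term, count in doc_counts.items():
        # Smoothed probability of the term in the document and in the collection.
        p_doc = (count + 1) / (doc_total + vocab_size)
        p_coll = (coll_counts[term] + 1) / (coll_total + vocab_size)
        # Positive when the term is over-represented in the document.
        scores[term] = count * math.log(p_doc / p_coll)

    return sorted(scores, key=scores.get, reverse=True)[:top_n]


collection = [
    "the contract was signed by the parties",
    "the meeting was moved to the morning",
    "the patent patent claim covers the patent filing",
]
top_terms = interesting_terms(collection[2], collection, top_n=3)
```

Because the model is rebuilt from whatever collection is passed in, the same word can score high in one collection and low in another, which mirrors the per-collection training described above.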

Based on the language model it has generated, the OrcaTec Interesting Phrase Finder computes the likelihood that each word or phrase in a specific document is interesting. The Interesting Phrase Finder can also include a dull words list, containing words and phrases known not to be interesting no matter how characteristic they are, and a hot words list, containing words and phrases that are interesting whenever they are found, regardless of how characteristic they are.
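
The effect of the dull and hot words lists can be sketched as a filter applied on top of per-term scores. This is an illustration only: the function name `apply_word_lists`, the score threshold, and the rule that a term appearing on both lists is treated as dull are all assumptions, not documented behavior of the product.

```python
def apply_word_lists(scores, dull_words=(), hot_words=(), threshold=0.0):
    """Select interesting terms from `scores`, honoring dull and hot lists.

    Assumptions: `scores` maps each term found in the document to a number
    where higher means more interesting, and a term on both lists is dull.
    """
    dull = {word.lower() for word in dull_words}
    hot = {word.lower() for word in hot_words}

    selected = []
    for term, score in scores.items():
        key = term.lower()
        if key in dull:
            # Dull terms are dropped no matter how characteristic they are.
            continue
        if key in hot or score > threshold:
            # Hot terms are kept whenever they occur, regardless of score.
            selected.append((term, score, key in hot))

    # Hot terms first, then the remaining terms by descending score.
    selected.sort(key=lambda item: (not item[2], -item[1]))
    return [term for term, _, _ in selected]


example_scores = {"merger": 2.5, "regards": 1.8, "confidential": -0.3, "the": -1.2}
labels = apply_word_lists(
    example_scores, dull_words=["regards"], hot_words=["confidential"]
)
```

In this example "regards" is removed despite its high score, and "confidential" is kept despite its low one.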