
FAQ: OrcaTec Information Discovery Toolkit

Operations and Set Up

Q: The toolkit provides a web service. Does that mean that we have to access our data over the internet from your facility?
A: No, the Toolkit runs on your servers in your facility. A web service is a specific kind of software that is designed to support interoperable machine-to-machine interaction over a network. The Toolkit uses the HTTP protocol and a RESTful interface to manage this communication. HTTP is the same protocol used to send web pages from servers to browsers. A RESTful interface means that it uses the standard HTTP operations (e.g., GET, PUT, POST and DELETE) to communicate operational information. REST stands for REpresentational State Transfer, because the resources that are accessed represent information that could change state as the process goes on. For example, submitting a document for processing changes the state of the system once the results of that processing are available.

Q: Does the Toolkit run on Windows?
A: No. In principle, the Toolkit could run on Windows, but we distribute it as a software appliance, like a router or firewall. Indexing, clustering and so forth are highly computationally intensive. To obtain maximum processing speed, it is essential that all of the server’s resources be dedicated to these tasks. The appliance runs on one or more dedicated servers, so there is no chance that some unknown combination of other software packages will interact with the Toolkit and cause it to behave badly (e.g., crash). We have tailored the operating system so that only the features required to run the Toolkit are available. This also increases the speed and reliability of the system and makes testing and maintenance much easier.

Q: Is the system difficult to set up?
A: No. You load a CD into a machine with no operating system, boot the machine and answer a few questions. That’s all there is.

Q: Is the system difficult to use?
A: No. It has a simple but powerful API (Application Programming Interface) that runs as a set of RESTful resources over HTTP. The API document has many examples.

Q: Will the system work with .NET or C# or Macs or …
A: Yes. Software tools are available in just about any language to create the HTTP requests that are used to control the system and retrieve information.

Q: We are a Windows shop. Will my IT staff be able to manage the system?
A: Yes. All management is done through a web interface (i.e., using a browser) in the same way as other IT software and appliances that you probably already have. Even backups are managed through this web interface. No special skills are needed to manage the system. Windows IT support should have no trouble with it.

Q: Does the system need access to the Internet for updates and license verification?
A: No. Updates and license management can be handled locally or over the internet, whichever is more convenient for you. During operations, the system should have access just to a restricted set of computers on your network. Your IT staff should know how to set up your network to restrict access.

Q: Does the system return results as XML?
A: It can return any results as either XML or JSON. Data are also submitted to the system as either XML or JSON.

Q: What is JSON?
A: JSON stands for JavaScript Object Notation. JSON is a lightweight format for computer data exchange that is not specific to the JavaScript language. It is similar to XML, but is much less verbose (less descriptive markup and more data than XML). JSON is text-based and is easier to read and to process than XML. Practically any programming language and environment has tools for accessing JSON. The JSON format was originally specified in RFC 4627; the current specification is RFC 8259.

Q: Can you give some examples of using the system?
A: Sure. Here is how you create a new collection. The example uses cURL, which is a widely available open-source program for sending and receiving information over the HTTP interface. These examples show a command-line version of cURL. libcurl is available to manage the same kind of messages from within most common programming languages.

[curl]:http://curl.haxx.se/

curl -H 'Accept: application/json' -H 'Content-Type: application/json' \
-d '{ "description": "Collection description goes here.",
      "collection_prefs": {
        "default_language": "en",
        "near_dupes": "True",
        "concept_clusters": "False",
        "field_store": "True",
        "field_tokenize": "True",
        "field_default_search": "_body"
      }
    }' \
http://otsearch:8090/project01/collections

The first line of this example is communication “housekeeping.” It adds two headers to the message that is being transmitted to the OrcaTec server. The first one (Accept) says that the client would like to receive its information in JSON format, and the second (Content-Type) says that it is sending information in JSON format. The next several lines are the message body. The curly braces (“{“ and “}”) mark the start and end of a JSON object. The outer JSON object consists of two parts, a “description” and a set of “collection_prefs.” The value of “collection_prefs” is itself another object, consisting of six parts, “default_language,” etc. The value of “default_language” is set to “en” for English.
The last line of the example is the URL to which this message should be sent.
The same example using XML is:

curl -H 'Accept: text/xml' -H 'Content-Type: text/xml' \
-d ' Collection description goes here. en True True True True

http://otsearch:8090/project01/collections
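
The JSON request above can also be assembled from within a program rather than from the command line. The following is a minimal sketch in Python using only the standard library. The host name, project name, and field values are copied from the cURL example; the function name build_create_collection_request is hypothetical and chosen only for illustration. The sketch builds the request without sending it, so it can be inspected first.

```python
import json
import urllib.request

# Host and project copied from the cURL example; adjust for your installation.
COLLECTIONS_URL = "http://otsearch:8090/project01/collections"


def build_create_collection_request(description, url=COLLECTIONS_URL):
    """Build (but do not send) the POST request that creates a collection.

    Hypothetical helper for illustration; it mirrors the cURL example.
    """
    payload = {
        "description": description,
        "collection_prefs": {
            "default_language": "en",
            "near_dupes": "True",
            "concept_clusters": "False",
            "field_store": "True",
            "field_tokenize": "True",
            "field_default_search": "_body",
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",        # format we want back
            "Content-Type": "application/json",  # format we are sending
        },
        method="POST",  # cURL's -d option also sends a POST
    )


request = build_create_collection_request("Collection description goes here.")
# To send it: response = urllib.request.urlopen(request)
```

The same pattern should carry over to other languages, since any HTTP client library can set the two headers and post the body.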

Usage—General

Q: Does the OrcaTec Information Retrieval Toolkit work in languages other than English?
A: It works in any language. One tool, the Interesting Word Finder, works more completely when it has some language-specific resources, but all of the tools will work with any language. The Language Identifier distinguishes among a large number of languages (with more being added), but it can only identify the languages that it has models for.

Q: What document formats does the system support?
A: The system accepts only documents that are plain Unicode (UTF-8 encoded) text in the form of either XML or JSON. The host application is responsible for converting text formats and extracting the text. This ensures that all processes reflect exactly the same text.

Usage—Near Duplicates

Q: Can the near-duplicate clustering algorithm be run in batch mode?
A: Yes. It is possible to run just the near-duplicate clustering algorithms and none of the other tools. You can also run a collection of documents through the clusterer and then export the clusters as JSON or XML, which can be read by other applications.

Q: How are near duplicates determined?
A: Briefly, the process involves finding how similar each pair of documents is. Documents are broken into “shingles,” where a shingle is a sequence of consecutive words. For example, if the shingles are three words long, then words 1-3 would be the first shingle, 2-4 would be the second, and so forth. The similarity of two documents is measured as the number of shingles they share divided by the number of unique shingles in the two documents combined. More similar documents are those that share more shingles.
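
The shingle comparison described above can be sketched in a few lines of Python. This is an illustration of the general technique only, not the Toolkit's implementation: the function names, the whitespace tokenization, and the lowercasing are all assumptions made for the example.

```python
def shingles(text, size=3):
    """Return the set of word shingles (runs of `size` consecutive words)."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shingle_similarity(doc_a, doc_b, size=3):
    """Shared shingles divided by unique shingles across both documents."""
    a = shingles(doc_a, size)
    b = shingles(doc_b, size)
    union = a | b
    if not union:  # both documents are shorter than one shingle
        return 0.0
    return len(a & b) / len(union)


similarity = shingle_similarity(
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
)
# Each document has three shingles; two are shared and four are unique overall.
```

An application would then compare this value against the configured threshold to decide whether two documents belong in the same cluster.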

Q: Can we set the threshold for how much overlap there must be for documents to be clustered together by the near-duplicate tool?
A: Yes. This threshold can be set during project configuration.

Q: Does near-duplicate clustering work only with English documents?
A: Near-duplicate clustering will work with any language.

Q: Can a document be in more than one cluster?
A: No, a document can be in only one cluster. Some documents will be in no clusters.

Usage—Concept Search

Q: Do I need a separate tool to do Boolean and proximity searches?
A: No, Boolean, proximity, fuzzy (change a few letters), phrase, and wildcard searches are all included. You can also specify fields to search in.

Q: What words are indexed?
A: All words in all of the fields specified during the Collection creation are searchable.

Q: Are stopwords indexed?
A: Stopwords are indexed. By default, they are excluded from queries, but the application can also force them to be included in the query.

Q: Are words stemmed during indexing?
A: No, words are not stemmed during indexing.

Q: Does the system assign concepts to pages?
A: The system does not assign concepts to documents. Rather it learns the meanings of the words in those documents relative to the other words it has learned. The Interesting Phrase Finder can be used to identify the most significant words and phrases in a document or collection of documents.

Q: Do I have to build a taxonomy, thesaurus, or ontology to use concept search?
A: No. The system learns the meanings of words from how they are used in the collection.

Q: How does concept searching work?
A: The system constructs a language model from the documents that it is fed. This language model represents the patterns of word usage in the document collection. The meaning of words is given largely by their pattern of usage. So, for example, the word “support” will tend to occur in patterns with other words such as “pricing,” “subsidies,” “controls” and “floor” in an economics context. In another context, such as engineering, “support” will occur in patterns involving words like “weight,” “bearing,” and “stability.” In a collection about economics, you would expect that documents that contain the word “support” would also contain some of these other words. Conversely, if a document contained one or more of these related words, you would expect it to be more or less about the same topic, even if the word “support” did not appear in the document. The language model learns these patterns and uses them to retrieve relevant documents.

Q: I searched for the word “tired” and got some documents. I searched for the word “fatigued” and got different documents. Why did the system not find these synonyms?
A: The language model does not construct a thesaurus. It learns how words are related in the context of the documents it has been fed. Perhaps these words were used differently in the context of the documents. You know that these words are related, but unless they appear in similar contexts, the language model has no opportunity to learn this relation. The word “tired” could be used to talk about golfers being tired after a round of golf. The word “fatigued” could be used to talk about metal fatigue or stress fractures. In the context of the documents they might not be used interchangeably.
The relationships learned by the language model are strongly dependent on the specific documents it has been fed. This means that the model is not misled by idiomatic patterns of word use (e.g., using the word “apple” to refer to a computer, not a fruit), but it also means that people know far more about words in general than the computer can.

Q: Can I “teach” the language model?
A: Yes. You can feed it documents that are to be added to the language model, but not indexed. It will then incorporate the language patterns in those documents and use them to find related information in the search collection. This teaching is rarely necessary for collections larger than a few thousand documents.

Q: If I add documents in batches, won’t the language model change? For example, I may get 100,000 on Monday and another 200,000 on Thursday. What happens to the searches I do after each batch of documents is loaded?
A: If you tell the system to continue to update the language model, then a search after the second batch may yield different documents than a search after the first batch. This is normal. In addition to finding documents that were not present in the first batch, new word patterns may have been learned. This is the same as would happen if you had people reading the documents. As they read more, they learn more. You can turn off additional training and force the language model to stay as it was following the first batch. Alternatively, your application may mark or tag those documents that are found after the first batch. Once identified, it is an easy matter to track any changes.

Q: I did a search for the exact word “chewco” in my collection and the system reported that there were 10 documents containing this word. I did a concept search for “chewco” and the system reported 4,054 documents. Can you explain this?
A: When you submit a concept search, the system searches for the exact terms that you entered along with other terms that the language model says are related. For example, with a subset of the Enron emails as our collection, the chewco query might be expanded to include words like “way,” “issues,” etc. Each of these terms is given a weight (for “chewco” it is 10.0; for “investments” it is 0.279). These weights indicate how important each word is for the query.

["chewco", "10.0"], ["way", "0.02"], ["issues", "0.021"], ["deal", "0.027"], ["including", "0.034"], ["ebs", "0.037"], ["azurix", "0.052"], ["mails", "0.061"], ["accounting", "0.16"], ["investments", "0.279"], ["relate", "0.31"]

A document containing any of these terms is then included in the results list following a concept search. The results are ranked, however, by the degree of match between the expanded concept search query and document. As we proceed down the ranks, subsequent documents are judged by the computer to be less and less relevant. Your application can impose a cutoff score to limit the number of results returned. The list of related words and their weights can be retrieved using the debug search variable.
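
To make the effect of the weights and the cutoff concrete, here is a small Python sketch. The weights are the ones listed above. The scoring rule (adding up the weights of the query terms found in a document) is an assumption chosen for simplicity and is not a description of the Toolkit's actual ranking function; the function names and sample documents are likewise hypothetical.

```python
# Expanded query terms and weights, taken from the example above.
EXPANDED_QUERY = {
    "chewco": 10.0, "way": 0.02, "issues": 0.021, "deal": 0.027,
    "including": 0.034, "ebs": 0.037, "azurix": 0.052, "mails": 0.061,
    "accounting": 0.16, "investments": 0.279, "relate": 0.31,
}


def score(document_text, weighted_terms):
    """Assumed scoring rule: sum the weights of the terms present in the text."""
    words = set(document_text.lower().split())
    return sum(weight for term, weight in weighted_terms.items() if term in words)


def concept_search(documents, weighted_terms, cutoff=0.0):
    """Return ids of documents scoring above `cutoff`, best match first."""
    scored = [(score(text, weighted_terms), doc_id)
              for doc_id, text in documents.items()]
    hits = [(s, doc_id) for s, doc_id in scored if s > cutoff]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in hits]


documents = {
    "a": "chewco accounting memo",
    "b": "investments accounting",
    "c": "lunch on friday",
}
all_hits = concept_search(documents, EXPANDED_QUERY)
strong_hits = concept_search(documents, EXPANDED_QUERY, cutoff=1.0)
```

With no cutoff, any document that contains at least one expanded term is returned, which is why a concept search can report far more documents than an exact-word search. Raising the cutoff keeps only the documents that match the heavily weighted terms.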

Usage—Language Identification

Q: What languages does the system identify?
A: The languages that the system can identify can be retrieved through the API.

curl -H 'Accept: application/json' \
http://otsearch:8090/v1/langid

Q: How does language identification work?
A: Each language has a characteristic pattern of character usage. For example, French words are much more likely to end in “eux” than are English words. The language identification tool contains a model of these letter use patterns. When a text is presented for identification, it compares the distribution of letter sequences in the text with the distribution for each of the languages that it knows. The text is then classified as the language where the distribution of characters in the text best matches the distribution in that language’s model. Longer texts are more accurately classified than shorter texts because longer texts give a better estimate of the character sequence distribution.
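
The idea of matching letter-sequence distributions can be illustrated with a toy Python sketch. Everything here is an assumption made for the example: the use of three-character sequences, the cosine measure for comparing distributions, the function names, and the tiny one-sentence "models." A real model would be trained on far more text, and the Toolkit's own method may differ in its details.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Relative frequency of each three-character sequence in the text."""
    padded = f" {text.lower()} "  # pad so word starts and ends are captured
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def identify(text, models):
    """Return the language whose model profile best matches the text."""
    profile = trigram_profile(text)
    return max(models, key=lambda lang: cosine(profile, models[lang]))


# Toy models built from one sentence each, for illustration only.
MODELS = {
    "en": trigram_profile(
        "the quick brown fox jumps over the lazy dog "
        "and then the other dogs bark at the fox"),
    "fr": trigram_profile(
        "le renard brun saute par dessus le chien paresseux "
        "et les deux chiens sont heureux"),
}

english_guess = identify("the dog and the fox", MODELS)
french_guess = identify("les chiens heureux", MODELS)
```

Very short inputs produce only a handful of character sequences, so their profiles are noisy, which is consistent with longer texts being classified more accurately.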

Usage—Interesting Phrase Finder

Q: How are interesting phrases identified?
A: Three methods are used to identify interesting words and phrases in a submitted text sample. The first is to find those words that are characteristic of the text, but not characteristic of the average document in the collection. Atypical words are usually interesting.
The second method finds those word sequences that occur together more often than would be predicted by the language model for the collection. This technique identifies statistically improbable phrases. Both of these techniques are language-independent.
The third technique uses part-of-speech information to identify phrases containing nouns. The use of this technique can be applied to any language, but it requires some language-specific rules. Currently, only English rules are included.
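
The first method, finding words that are characteristic of a text but not of the collection as a whole, can be sketched as follows. This is a simplified illustration under stated assumptions, not the Toolkit's algorithm: the function name, the whitespace tokenization, and the smoothed frequency ratio used as the "interestingness" measure are all chosen for the example.

```python
from collections import Counter


def interesting_words(text, collection_texts, top_n=5):
    """Rank words by how much more frequent they are in `text` than in the collection."""
    text_counts = Counter(text.lower().split())
    collection_counts = Counter(
        word for doc in collection_texts for word in doc.lower().split())
    text_total = sum(text_counts.values())
    collection_total = sum(collection_counts.values())
    vocabulary = len(set(text_counts) | set(collection_counts))

    def ratio(word):
        in_text = text_counts[word] / text_total
        # Add-one smoothing so words unseen in the collection do not divide by zero.
        in_collection = (collection_counts[word] + 1) / (collection_total + vocabulary)
        return in_text / in_collection

    return sorted(text_counts, key=ratio, reverse=True)[:top_n]


collection = [
    "the meeting is on monday",
    "the report is due on friday",
    "the meeting is about the report",
]
sample = "the chewco partnership report is about chewco"
top_words = interesting_words(sample, collection, top_n=2)
```

Common words such as “the” and “is” score low because they are just as frequent in the rest of the collection, while words that are rare or absent elsewhere rise to the top.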