Near-Duplicate Clustering
The OrcaTec Information Discovery Toolkit includes a patent-pending near-duplicate clusterer. It identifies and groups together documents that are nearly identical to one another. In many collections, 30% or more of the documents may be very similar to other documents in the collection.
For example, if Sally sends an email to Don, and Don replies to it by adding “sounds good” at the beginning, the result is two different documents that differ mainly in their headers and in those two added words. Similarly, revisions of a contract may contain nearly the same text. Near-duplicate clustering arranges these similar documents into groups.
Unless managed, these near duplicates can clog search results lists or place an undue burden on those seeking to analyze the content of the documents. In electronic discovery, for example, reducing the number of documents that need to be read, or grouping similar documents together, can save enormous amounts of review time and expense.
Hashing techniques, which are typically used to identify exact duplicate documents, cannot identify near duplicates because these documents, by definition, differ from one another by some number of words. A hash function returns a different value for each of these documents, and the hash values give no indication of how different the documents are.
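As an illustration of this limitation, the following minimal sketch hashes two nearly identical messages with SHA-256 from Python's standard library. The message text and the helper name sha256_hex are made up for the example; nothing here is taken from the toolkit itself.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of the text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical example messages: the reply only prepends two words.
original = "Can we meet on Tuesday to review the contract?"
reply = "sounds good\n" + original

h1 = sha256_hex(original)
h2 = sha256_hex(reply)

# Exact duplicates always hash to the same value...
print(sha256_hex(original) == h1)  # True
# ...but a two-word change yields an unrelated digest.
print(h1 == h2)  # False

# Count how many hex characters differ between the two digests.
# The count reflects the hash function, not how similar the texts are.
differing = sum(a != b for a, b in zip(h1, h2))
print(differing, "of", len(h1), "hex characters differ")
```

The two digests tell us only that the documents are not byte-for-byte identical. They cannot distinguish a two-word edit from a completely different document.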
Instead, the OrcaTec Information Discovery Toolkit uses probability statistics to identify near-duplicate documents. It compares the word sequences of two documents and computes a probability based on how much they overlap. The more sequences the two documents share, the more likely they are to be near duplicates of one another. Documents that match above a configurable threshold are grouped into clusters at the time they are first added to the system.
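The idea of word-sequence overlap can be sketched as follows. This is only an illustration under stated assumptions: it uses three-word sequences and the Jaccard ratio (shared sequences divided by all distinct sequences) as a stand-in for the toolkit's probability computation, which is not described here. The names shingles, overlap, and THRESHOLD, the sequence length, and the threshold value are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences that occur in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def overlap(text_a, text_b, k=3):
    """Jaccard ratio of the two documents' word sequences (0.0 to 1.0)."""
    seqs_a, seqs_b = shingles(text_a, k), shingles(text_b, k)
    if not seqs_a and not seqs_b:
        return 1.0  # two empty documents are treated as identical
    return len(seqs_a & seqs_b) / len(seqs_a | seqs_b)


THRESHOLD = 0.5  # assumed value; the real threshold is configurable

original = "please review the attached contract before our meeting on tuesday"
reply = "sounds good " + original
unrelated = "the quarterly sales report is due next friday"

print(overlap(original, reply))      # 0.8
print(overlap(original, unrelated))  # 0.0
print(overlap(original, reply) >= THRESHOLD)  # True
```

The reply shares all eight of the original's three-word sequences and adds two of its own, giving 8 / 10 = 0.8, while the unrelated message shares none.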
The patent-pending clustering system does not directly compare documents word for word. Instead, it computes an intermediate sketch of each document's word sequences, which it can process very quickly. A proprietary algorithm performs the clustering in only one pass through the documents, which makes it very fast.
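To give a feel for how a sketch-based, single-pass approach might work, the example below uses MinHash, a published sketching technique, together with a simple greedy clustering loop. It is a stand-in under stated assumptions, not the toolkit's proprietary algorithm: the sketch length, the threshold, the three-word sequence length, and every function name are hypothetical.

```python
import hashlib

NUM_HASHES = 64  # sketch length (assumed value)
THRESHOLD = 0.5  # clustering threshold (assumed value)


def shingles(text, k=3):
    """Return the set of k-word sequences that occur in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(gram, seed):
    """Deterministic 64-bit hash of one word sequence, varied by seed."""
    digest = hashlib.blake2b(
        " ".join(gram).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def sketch(text, num_hashes=NUM_HASHES):
    """MinHash signature: the smallest hash value under each seed."""
    grams = shingles(text)
    return tuple(
        min((seeded_hash(g, seed) for g in grams), default=0)
        for seed in range(num_hashes)
    )


def estimate(sig_a, sig_b):
    """Fraction of matching positions, which estimates sequence overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cluster(documents, threshold=THRESHOLD):
    """One pass: each document joins the first close cluster or starts one."""
    leaders = []   # sketch of the first document in each cluster
    clusters = []  # lists of document indices
    for index, text in enumerate(documents):
        sig = sketch(text)
        for cluster_id, leader in enumerate(leaders):
            if estimate(sig, leader) >= threshold:
                clusters[cluster_id].append(index)
                break
        else:
            leaders.append(sig)
            clusters.append([index])
    return clusters


original = ("please review the attached contract and send your comments to "
            "the legal team before our meeting on tuesday afternoon")
documents = [
    original,
    "sounds good " + original,
    "the quarterly sales report is due next friday",
]
print(cluster(documents))
```

Each document is reduced to a fixed-length tuple of integers, so comparing two documents costs the same regardless of their length, and each document is read only once as it arrives. In this sketch every new document is still compared against one representative per existing cluster; a production system would likely need an index to avoid that scan.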