Email Threads

Emails do not exist in isolation. Rather, they occur in conversations, also called threads. One person sends an email, another replies or forwards it to a third person, and so on. Each thread has a history, which provides context. An email thread is a collaborative effort among a group of participants, and each message is likely best understood in the context of the conversation rather than as an isolated document.
Email threads collate related messages to provide context for understanding an exchange. For example, an email that said only “sounds good” would be very difficult to interpret in isolation and much easier to understand alongside the email that elicited the statement of agreement. Email threads allow one to understand each email as part of an ongoing conversation embedded in the flow of an organization’s activity, rather than as an isolated event.

Email threads are common. Estimates of thread length (the number of emails in a thread) range from around three to four or more.
The structure of emails presents certain challenges for reconstructing their threads. Nominally, the subject line of an email gives its topic and thus the topic of its thread. In practice, people are inconsistent in their use of subject lines. They use the reply function to start wholly new topics, simply because it is usually easier to click reply than to find a person’s email address and start a new thread. Emails on a wide variety of topics may also share the same subject line because the sender leaves it unchanged or puts something uninformative in it (e.g., “for your consideration”).
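As an illustration of why subject lines are only weak evidence, here is a minimal Python sketch that strips reply and forward prefixes so subjects can be compared. The function name `base_subject` and the prefix pattern are assumptions made for this example; real mail clients use many more prefix variants.

```python
import re

# Assumed set of reply/forward prefixes; actual clients and languages vary.
PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)

def base_subject(subject: str) -> str:
    """Strip leading Re:/Fw:/Fwd: prefixes, collapse whitespace, lowercase."""
    return " ".join(PREFIX.sub("", subject).split()).lower()

# Two messages with the same base subject MAY belong to the same thread,
# but an uninformative subject will match unrelated emails too.
same = base_subject("RE: Fwd: Budget  review") == base_subject("Budget review")
```

A matching base subject is a hint, not proof: “for your consideration” would group unrelated emails, and a reply that changes topic keeps the old subject.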

Emails in a thread often quote earlier emails in the thread. There are a number of conventions for identifying quoted content, but none of them is used with complete reliability. For example, some emails include a line that separates original from quoted content, such as: On Apr 21, 2008, at 14:46 , Jim Johnson wrote:

Others use “>” to mark quoted lines (with or without the demarcation line). Still other patterns may be used to signal quotations. In addition, in many organizations most or all emails include a disclaimer at the bottom. This disclaimer is usually intended to convey that the email should be treated as a privileged communication, that the email does not provide tax advice, or some other standard message. The simple fact that two emails contain the same text therefore does not show that they are part of the same thread, and it may not be possible to say with complete certainty that an email that shares text with another one actually quotes it.
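The two quoting conventions mentioned above can be sketched in a few lines of Python. This is an illustrative heuristic only: the `ATTRIBUTION` pattern and the function name `split_quoted` are assumptions for this example, and, as noted, neither convention is used reliably in real mail.

```python
import re

# Assumed pattern for attribution lines such as
# "On Apr 21, 2008, at 14:46 , Jim Johnson wrote:"
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")

def split_quoted(body):
    """Return (original_lines, quoted_lines) for a plain-text email body."""
    original, quoted = [], []
    in_quote = False
    for line in body.splitlines():
        if ATTRIBUTION.match(line.strip()):
            in_quote = True  # treat everything after the attribution line as quoted
            continue
        if in_quote or line.lstrip().startswith(">"):
            # Remove leading ">" markers and spaces from the quoted line.
            quoted.append(line.lstrip("> ").rstrip())
        else:
            original.append(line)
    return original, quoted

body = (
    "Sounds good.\n"
    "\n"
    "On Apr 21, 2008, at 14:46 , Jim Johnson wrote:\n"
    "> Shall we meet Friday?\n"
)
original, quoted = split_quoted(body)
```

A heuristic like this will misclassify inline replies written below an attribution line, and it does nothing about shared disclaimers, which is one reason quoted text alone cannot settle thread membership.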

The best-known algorithm for identifying the email messages in a thread was originally created by Jamie Zawinski. This algorithm, which is freely available, uses email headers to link emails. As an email is sent from one system to another, the message is augmented with a set of headers (metadata) that describe the email and how it should be routed. Most emails contain an “In-Reply-To:” and a “References:” field. According to the email standard, these fields are intended to indicate where in a thread a particular email stands. Zawinski notes, however, that there is a high degree of variability in how these fields are used in practice; he identified hundreds of variations in a small sample of emails. As a result, it is difficult to say with 100% confidence, based on an email’s header fields alone, whether the email is part of a thread or not.
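The header-linking idea can be illustrated with Python's standard `email` module. This is a simplified sketch, not Zawinski's full algorithm: it only follows “References:” and “In-Reply-To:” to a root message, and the helper names `parent_id` and `group_threads` are assumptions for this example.

```python
import re
from email import message_from_string

MSG_ID = re.compile(r"<[^<>\s]+>")  # message IDs are written in angle brackets

def parent_id(msg):
    """Return the Message-ID this message most likely replies to, or None."""
    refs = MSG_ID.findall(msg.get("References", "") or "")
    if refs:
        return refs[-1]  # by convention the last reference is the direct parent
    reply = MSG_ID.findall(msg.get("In-Reply-To", "") or "")
    return reply[0] if reply else None

def group_threads(raw_messages):
    """Map each root Message-ID to the Message-IDs found in its thread."""
    parents = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        found = MSG_ID.findall(msg.get("Message-ID", "") or "")
        if found:
            parents[found[0]] = parent_id(msg)

    def root(mid):
        seen = set()  # stops the walk if headers form a loop
        while parents.get(mid) and mid not in seen:
            seen.add(mid)
            mid = parents[mid]
        return mid

    threads = {}
    for mid in parents:
        threads.setdefault(root(mid), []).append(mid)
    return threads

raw = [
    "Message-ID: <a@x>\nSubject: Plan\n\nShall we meet Friday?",
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nReferences: <a@x>\nSubject: Re: Plan\n\nSounds good",
    "Message-ID: <c@x>\nReferences: <a@x> <b@x>\nSubject: Re: Plan\n\nOK",
    "Message-ID: <d@y>\nSubject: Other\n\nHi",
]
threads = group_threads(raw)
```

Because real headers are often missing or malformed, a sketch like this will split or merge some threads incorrectly, which is why header evidence is usually combined with other signals.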

The bottom line is that there is no perfect way to identify the emails in a thread. Neither the content nor the headers of an email are always adequate to identify the members of a thread. As a result, OrcaTec uses multiple techniques to identify emails in a thread. We employ Zawinski’s algorithm to examine the emails’ headers and use our own near-duplicate technology to compare email contents. In addition, the system can identify when one email fully quotes another. We call that subsumption. Knowing that an email is fully contained within another means that the contained email does not need to be examined separately.
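To make the idea of subsumption concrete, here is a minimal Python sketch that checks whether the full text of one email appears inside another after quote markers and whitespace are normalized. It is an assumption-laden illustration, not OrcaTec's implementation; the names `normalize` and `subsumes` are hypothetical.

```python
import re

def normalize(body):
    """Drop leading '>' quote markers, collapse whitespace, and lowercase."""
    lines = [re.sub(r"^[>\s]+", "", line) for line in body.splitlines()]
    return " ".join(" ".join(lines).split()).lower()

def subsumes(later, earlier):
    """True if the normalized text of `earlier` appears as one contiguous
    run inside the normalized text of `later`."""
    needle = normalize(earlier)
    return bool(needle) and needle in normalize(later)

earlier = "Shall we meet Friday?\nI am free after 2."
later = (
    "Sounds good.\n"
    "\n"
    "On Apr 21, 2008, at 14:46 , Jim Johnson wrote:\n"
    "> Shall we meet Friday?\n"
    "> I am free after 2.\n"
)
is_subsumed = subsumes(later, earlier)
```

An exact substring test is deliberately strict: re-wrapped lines with inserted markers, edited quotes, or appended disclaimers would defeat it, which is where fuzzier near-duplicate comparison becomes useful.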