
eDiscovery 101

What is Email threading?

Email threading is a tool used to link email chains and reduce the volume of data that needs to be examined.

 


At its base level, email threading is an analytic tool that provides insight into a series of documents (emails) efficiently. It greatly reduces the time and complexity of reviewing emails by gathering all forwards, replies, and reply-all messages together. Email threading identifies email relationships and extracts and normalizes email metadata. Relationships that email threading can readily reveal include:

  • Email threads
  • People involved in an email conversation
  • Email attachments (if the Parent ID is provided along with the attachment item)
  • Duplicate emails

An email thread is a single email conversation that starts with an original email (the beginning of the conversation) and includes all of the subsequent replies and forwards pertaining to that original email. The analytics engine uses a combination of email headers and email bodies to determine whether emails belong to the same thread. It allows for data inconsistencies that can occur, such as timestamp differences generated by different servers. The engine then determines which emails are inclusive, meaning that they contain unique content and should be reviewed.

At a high level, an email threading algorithm performs the following operations:

  1. Segments emails into their component emails, splitting at each point where a reply, reply-all, or forward occurred.
  2. Examines and normalizes the header data, such as senders, recipients, and dates. This happens for both the component emails and the parent email, whose headers are usually passed explicitly as metadata.
  3. Recognizes when emails belong to the same conversation (referred to as the Email Thread Group), using the body segments along with the headers, and determines where in the conversation each email occurs.
  4. Determines inclusiveness (see next section) within conversations by analyzing the text, the sent time, and the sender of each email.
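
The first operation, segmentation, can be sketched roughly as follows. This is a minimal illustration, not how any particular analytics engine works: the `SEPARATOR` pattern, `normalize_text`, and `segment_email` are hypothetical names, and the sketch assumes replies and forwards are marked by an "Original Message" or "Forwarded message" separator line, which covers only some mail clients.

```python
import re

# Assumed separator lines; real mail clients emit many more variants.
SEPARATOR = re.compile(
    r"^-{2,}\s*(?:Original Message|Forwarded message)\s*-{2,}\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def normalize_text(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences are ignored."""
    return " ".join(text.split()).lower()


def segment_email(body: str) -> list[str]:
    """Split a body into its component emails (newest first) and normalize each."""
    parts = SEPARATOR.split(body)
    return [normalize_text(part) for part in parts if part.strip()]


body = "Sounds good.\n\n-----Original Message-----\nCan we meet?\n"
segments = segment_email(body)  # newest segment first, original email last
```

Quoted header lines such as "From:" or "Sent:" are left inside each segment here; a fuller implementation would parse them out and normalize them as described in the second operation.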


In structured analytics, there are two types of email messages:

  1. Inclusive – An email that contains unique content not included in any other email and therefore MUST be reviewed. For example, emails with no replies or forwards are by definition inclusive, as is the last email in a thread.
  2. Non-Inclusive – Any email whose text and attachments are fully contained in another inclusive email (see Footnote 1 for a complete description of email inclusion).
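
As a rough illustration of the two definitions above, the sketch below marks an email as non-inclusive when its text segments and attachments are fully contained in a larger email of the same thread. The `ThreadEmail` type and both function names are hypothetical, and the containment test (the quoted tail of the larger email must match exactly) is a simplifying assumption rather than the rule any specific product applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadEmail:
    email_id: str
    segments: tuple[str, ...]  # normalized component emails, newest first
    attachments: frozenset[str] = frozenset()


def is_contained(inner: ThreadEmail, outer: ThreadEmail) -> bool:
    """True when `outer` quotes all of `inner` and carries all of its attachments."""
    n, m = len(inner.segments), len(outer.segments)
    if inner.email_id == outer.email_id or n > m:
        return False
    quotes_all_text = outer.segments[m - n:] == inner.segments
    has_all_attachments = inner.attachments <= outer.attachments
    # Require `outer` to be strictly larger so exact duplicates never hide each other.
    strictly_larger = m > n or inner.attachments < outer.attachments
    return quotes_all_text and has_all_attachments and strictly_larger


def inclusive_ids(thread: list[ThreadEmail]) -> set[str]:
    """Return the ids of emails that are not contained in any other email."""
    return {
        email.email_id
        for email in thread
        if not any(is_contained(email, other) for other in thread)
    }


thread = [
    ThreadEmail("e1", ("can we meet?",), frozenset({"budget.xlsx"})),
    ThreadEmail("e2", ("sounds good.", "can we meet?")),
    ThreadEmail("e3", ("see you then.", "sounds good.", "can we meet?")),
]
result = inclusive_ids(thread)
```

In this example the last email (`e3`) is inclusive, the middle reply (`e2`) is not because `e3` quotes it completely, and the original (`e1`) stays inclusive because its attachment was dropped from the replies.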

Focusing staff review on only the inclusive emails, and removing duplicates, dramatically shortens the review process without any loss in accuracy. The role of the email threading analytics engine is to derive the email threads and determine which subset of each conversation constitutes the minimal inclusive set. Inclusiveness analysis ensures that even the smallest changes in content will not be missed by reviewers.

Early Case Assessment (ECA) has come back to the forefront of eDiscovery. For a number of years, people focused solely on speeding up the document review process and forgot about the importance of culling data to reduce the volume of material that needs to be reviewed. Given the rapidly increasing volume of electronically stored information (ESI), the demand is to get more done in less time and for less money, without sacrificing the quality of the end product. Early Case Assessment tools to the rescue.

Email threading analysis is an extremely effective tool to reduce the volume of data that needs to be examined. Due to the nature of email threading, the team can achieve both significant increases in efficiency and quality. Combined with some workflow improvements, impressive results can be obtained.

Think about how people generally create document review sets. They assign material by custodian to coding/review staff. The problem with this approach is that email chains may be broken when document sets are arbitrarily created in batches of “X” documents. Reviewers get only part of the story, since not all relevant information is included in their review set. Reviewers can’t possibly understand complex chains of communication if they only have access to parts of the whole. Organizing data by complete conversation, combined with the ability to read only the “inclusive” (most content-rich) emails, can dramatically reduce the total number of documents that must be reviewed.
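
One way to keep conversations intact is to build review batches from whole threads instead of fixed-size slices. The sketch below is only an illustration under assumed inputs: `batch_by_thread` is a hypothetical helper, and each document is assumed to arrive as a `(doc_id, thread_group_id)` pair produced by an earlier threading step.

```python
from collections import defaultdict


def batch_by_thread(docs, batch_size):
    """Group documents into review batches without ever splitting a thread.

    docs: iterable of (doc_id, thread_group_id) pairs.
    batch_size: target size; a batch exceeds it only when one thread alone is larger.
    """
    threads = defaultdict(list)
    for doc_id, thread_id in docs:
        threads[thread_id].append(doc_id)

    batches, current = [], []
    for thread_docs in threads.values():
        # Close the current batch if adding this whole thread would overflow it.
        if current and len(current) + len(thread_docs) > batch_size:
            batches.append(current)
            current = []
        current.extend(thread_docs)
    if current:
        batches.append(current)
    return batches


docs = [("d1", "T1"), ("d2", "T2"), ("d3", "T1"), ("d4", "T3"), ("d5", "T2")]
batches = batch_by_thread(docs, batch_size=3)
```

The target size is treated as soft so that a long thread is never cut in two; a reviewer assigned a batch always sees complete conversations.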

Email threading assessment during an ECA process increases awareness of the content of the document corpus by:

  • Helping you understand the scope of your case more quickly, since it lets you focus on entire conversations rather than just snippets. This can improve search term selection, custodian prioritization, and concept cluster choices earlier in the process.
  • Improving consistency in coding decisions. Because reviewers see the entire context, small snippets of a conversation have a clearer meaning, which should result in more consistent and accurate coding calls.
  • Providing a great Quality Assessment/Quality Control tool: reviewing all the documents in a thread together exposes inconsistent coding calls. In particular, coding decisions on privilege are easier to review and validate.
  • Reducing dramatically the number of individual records that have to be reviewed, because only the inclusive emails are read, which results in obvious cost savings. It also improves reviewers' understanding of the overall content within a single email chain.

Footnote 1

Common Inclusive email definitions included in analytics

  • The last email in a thread: The last email in a particular thread should always be marked inclusive, because any text added in this last email (even just a "forwarded" indication) will be unique to this email and this one alone. If there were no attachments, and no changes to the subject line or the body of the mail, this would be the only type of inclusiveness.
  • The end of attachments: When an email has attachments, and the recipient replies, the attachments are often dropped from the display. For this reason, the end of the thread will not contain all of the text and attachments of the email. Structured Analytics will flag one of the emails containing the attachments as inclusive to make sure that all the important information is reviewed.
  • Change of text: Email threading analytics capture any changes in the body of the email and display both versions for review. One can imagine that an employee wishing to eliminate negative information might attempt to change a word or two by modifying the original email when replying to it or forwarding it to a third party. In this case, the Analytics engine would recognize that the email from Person A to Person B contained different text than that from Person B to Person C, and flag both emails as inclusive. This rule regarding text changes would also mark two emails as inclusive if someone writes their responses to questions within the body of the original email text.
  • Change of sender or time: If the Analytics engine finds what looks like a prior email, but the sender or time of that email doesn't match what's expected, it can trigger an extra inclusiveness tag. Note that there is a certain amount of tolerance built in for things like different email address display formats ("Johnson, Jeffrey" versus "Jeffrey Johnson" versus "jeffrey.johnson@google.com"). There is also the understanding that date stamps can be deceiving due to clock discrepancies on different email servers and time zone changes.
  • Duplication: While not necessarily a reason for inclusiveness, when duplicate emails exist, either both or neither are marked inclusive by most Structured Analytics tools. Duplicates most commonly occur in a situation where person A sends an email to B and to C, and you collect data from two or more of the three people. To avoid redundant review, you should be sure to remove email duplicates from the population before creating review sets.
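
The sender and time tolerance and the duplicate check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary keys, and the five-minute tolerance are all hypothetical, and the name normalization handles only the three display formats mentioned in the footnote.

```python
import re
from datetime import datetime, timedelta


def normalize_sender(raw: str) -> str:
    """Reduce 'Last, First', 'First Last', and 'first.last@domain' to one key."""
    raw = raw.strip().lower()
    if "@" in raw:
        local_part = raw.split("@", 1)[0]
        raw = re.sub(r"[._]", " ", local_part)
    elif "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
        raw = f"{first} {last}"
    return " ".join(raw.split())


def same_send_time(a: datetime, b: datetime,
                   tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """Treat small clock differences between servers as the same send time.

    Both values must be naive or both timezone-aware; aware values also
    absorb time zone differences.
    """
    return abs(a - b) <= tolerance


def is_duplicate(a: dict, b: dict) -> bool:
    """Compare two emails on normalized sender, normalized body, and sent time."""
    same_body = " ".join(a["body"].split()).lower() == " ".join(b["body"].split()).lower()
    return (
        normalize_sender(a["sender"]) == normalize_sender(b["sender"])
        and same_body
        and same_send_time(a["sent"], b["sent"])
    )


copy_from_b = {"sender": "Johnson, Jeffrey", "body": "Please review.",
               "sent": datetime(2020, 1, 1, 9, 0)}
copy_from_c = {"sender": "jeffrey.johnson@google.com", "body": "Please  review.\n",
               "sent": datetime(2020, 1, 1, 9, 2)}
duplicate = is_duplicate(copy_from_b, copy_from_c)
```

Here the two copies collected from different custodians differ in display name, whitespace, and timestamp by two minutes, yet compare as duplicates, so only one of them needs to enter a review set.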