As a follow-up to our previous article on search, which covered filtering and keyword searching as two ways to narrow or broaden your data set, the next step is to use threading and concept search to reduce the data set further. Eliminating duplicates, setting aside redundant thread members and selecting a unique set of concepts gives you better insight into the performance of keyword searches and filtering activities.
Deduplication, threads and concepts
What does deduplication have to do with all this? What is the relationship between deduplication, threading and concept search? Exact deduplication is repeatable, reproducible and based on well-known mathematics. Concept search and email threading are both based on artificial intelligence (AI).
Although deduplication, email threading and concept indexing all aim to identify groups of documents, whether exact, similar, or inclusive, the technologies used to compute and interpret the results are very different.
Deduplication in practice
The deduplication method applied is an essential question to ask before searching. It is essential to deduplicate during processing, so that only one copy of each exact duplicate is searched. Whether the duplicate copies come from other custodians or from other sources, it is important to provide only one exact copy for threading, concept analysis and searching. In the past, all dupes were loaded into a system and a filtering technique was used so that only primaries were indexed, threaded or searched by concept. Now, most tools remove the duplicates and update fields on the remaining copy to indicate that it has duplicates, along with the duplicate custodians and other details.
From a cost perspective, loading duplicates into a system that bills per GB means you risk paying twice for the same document. Another reason to use exact deduplication is to get the same results regardless of platform. Since each platform may have its own algorithm for calculating email threads or concept groups, differences may arise when more than one platform is used. For example, if two loose files share the same MD5 hash, most platforms would detect them as exact duplicates, regardless of their deduplication technology. However, two emails grouped in the same thread based on metadata on one platform may be interpreted differently by another analytics engine and grouped into different threads.
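The hash-based comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular eDiscovery platform works: the `deduplicate` function, the record fields and the sample documents are all hypothetical, and real tools typically hash emails on a combination of metadata fields rather than on raw bytes alone.

```python
import hashlib

def deduplicate(documents):
    """Keep one primary per MD5 hash and record its duplicates.

    documents: iterable of (doc_id, custodian, content_bytes) tuples.
    Returns a dict keyed by MD5 hex digest (hypothetical record layout).
    """
    primaries = {}
    for doc_id, custodian, content in documents:
        digest = hashlib.md5(content).hexdigest()
        if digest not in primaries:
            # First copy seen becomes the primary that gets indexed.
            primaries[digest] = {
                "primary_id": doc_id,
                "custodians": [custodian],
                "duplicate_ids": [],
            }
        else:
            # Later copies are not loaded again; only the fields are updated.
            record = primaries[digest]
            record["duplicate_ids"].append(doc_id)
            if custodian not in record["custodians"]:
                record["custodians"].append(custodian)
    return primaries

docs = [
    ("DOC-1", "Mary", b"Quarterly forecast v1"),
    ("DOC-2", "Joe", b"Quarterly forecast v1"),
    ("DOC-3", "Sue", b"Quarterly forecast v2"),
]
deduped = deduplicate(docs)
for record in deduped.values():
    print(record["primary_id"], record["custodians"], record["duplicate_ids"])
```

Because the result depends only on the bytes being hashed, any platform that hashes the same content the same way should reach the same grouping, which is the repeatability the article refers to.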
The purpose of email threading has evolved to detect whether all content in one part of an email thread is included in a later part of that thread. This identifies the later email as inclusive, and the earlier email can be ignored because all of its content is contained elsewhere. For example, suppose Mary sent Joe an email with an attachment, and he replied to her. The reply would not contain all of the original content because the attachment would be missing, so the original cannot be ignored. If, instead, Joe had forwarded Mary's email to Sue, that forward would contain all of the original content because it would include the attachment.
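The Mary and Joe scenarios can be modeled with a small sketch. It assumes, purely for illustration, that each email has already been broken into a set of content units (body segments and attachments); real threading engines work from message text and metadata, and the `find_inclusive` helper and the unit labels are hypothetical.

```python
def find_inclusive(thread):
    """Return the ids of emails whose content is not fully contained elsewhere.

    thread: dict mapping email id -> frozenset of content units.
    An email can be ignored only if some other email holds a strict
    superset of its content (exact duplicates are assumed to be removed).
    """
    inclusive = []
    for email_id, content in thread.items():
        covered = any(
            other_id != email_id and content < other
            for other_id, other in thread.items()
        )
        if not covered:
            inclusive.append(email_id)
    return inclusive

# Joe replies to Mary: the reply drops the attachment.
reply_thread = {
    "mary_original": frozenset({"body:mary", "att:budget.xlsx"}),
    "joe_reply": frozenset({"body:mary", "body:joe"}),
}
# Joe forwards Mary's email to Sue: the attachment travels with it.
forward_thread = {
    "mary_original": frozenset({"body:mary", "att:budget.xlsx"}),
    "joe_forward": frozenset({"body:mary", "att:budget.xlsx", "body:joe"}),
}

print(find_inclusive(reply_thread))    # ['mary_original', 'joe_reply']
print(find_inclusive(forward_thread))  # ['joe_forward']
```

In the reply case both emails must be reviewed, because the attachment exists only on the original. In the forward case the original can be ignored.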
These examples of common mail operations do not necessarily complicate the process of calculating email threads. The harder case is when a user edits or deletes the original content of a received email. The system should then detect that the last email in the thread does not include all of the original content, such as when someone responds "inline" to a numbered list by putting their comments inside the original author's list.
In all these examples, the system must be able to calculate a thread and a tree showing the lineage of an email and its different branches through the use of metadata. Some tools may even be able to detect gaps in the email tree where emails are missing.
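As a rough sketch of how a tree and its gaps might be derived from metadata, the example below links emails through the standard `Message-ID` and `In-Reply-To` header values. The dictionary keys, the `build_thread_tree` function and the sample ids are assumptions made for illustration; commercial tools use additional signals that are not shown here.

```python
from collections import defaultdict

def build_thread_tree(emails):
    """Link emails into a tree using reply metadata.

    emails: list of dicts with 'message_id' and 'in_reply_to'
    (None when the email starts a thread).
    Returns (roots, children, gaps), where gaps lists parent ids that
    are referenced by a reply but missing from the collection.
    """
    known = {e["message_id"] for e in emails}
    children = defaultdict(list)
    roots = []
    gaps = set()
    for e in emails:
        parent = e["in_reply_to"]
        if parent is None:
            roots.append(e["message_id"])
        else:
            children[parent].append(e["message_id"])
            if parent not in known:
                # A reply exists, but the email it answers was never collected.
                gaps.add(parent)
    return roots, dict(children), sorted(gaps)

emails = [
    {"message_id": "m1", "in_reply_to": None},
    {"message_id": "m2", "in_reply_to": "m1"},
    {"message_id": "m3", "in_reply_to": "m2"},
    {"message_id": "m5", "in_reply_to": "m4"},  # m4 is not in the set
]
roots, children, gaps = build_thread_tree(emails)
print(roots)     # ['m1']
print(children)  # {'m1': ['m2'], 'm2': ['m3'], 'm4': ['m5']}
print(gaps)      # ['m4']
```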
Improvements to searching using threads
Using both the Tree ID and the Inclusive Email ID can improve the ability to locate documents or reduce review load. As we discussed in the previous blog post, it’s important to know whether you’re using search to try to narrow down or broaden your results.
The simplest example of threading improving search results is to limit a search to inclusive emails, so that an original email does not produce multiple hits within the same thread. Imagine that the initial email in a ten-email thread contains the keyword “John”, but none of the later emails add this term in their own new content. A search of the entire thread would still return every email, because even though John only appeared in the first one, that text is carried forward and a reviewer would see hits in every subsequent email. An inclusive-only search would mean that you only have to review one email. However, this does not work for filtering by date, because multiple dates are represented in a thread.
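The effect of restricting a keyword search to inclusive emails can be shown with a toy thread. The sketch assumes each reply quotes the full text of the email before it and that an `inclusive` flag has already been computed by a threading engine; the field names and the `keyword_search` function are hypothetical.

```python
def keyword_search(emails, term, inclusive_only=False):
    """Return ids of emails containing term, optionally inclusives only."""
    term = term.lower()
    return [
        e["id"]
        for e in emails
        if term in e["text"].lower() and (e["inclusive"] or not inclusive_only)
    ]

# Build a ten-email thread where every reply quotes everything before it.
thread = []
quoted = ""
for i in range(1, 11):
    new_text = "Please ask John about the contract." if i == 1 else f"Reply number {i}."
    text = new_text + "\n" + quoted
    thread.append({"id": f"E{i}", "text": text, "inclusive": i == 10})
    quoted = text

all_hits = keyword_search(thread, "John")
inclusive_hits = keyword_search(thread, "John", inclusive_only=True)
print(len(all_hits), inclusive_hits)  # 10 ['E10']
```

The keyword appears once in new content, yet the unrestricted search returns all ten emails. The inclusive-only search returns the single email that contains the whole conversation.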
Using the other output of email threading, the tree, allows the coding of entire threads. Combining the two outputs enables the production of non-inclusive emails that were ignored for review purposes, without a reviewer having to look at those documents. For example, once a reviewer has marked an inclusive email as responsive and not privileged, the ignored emails whose content it contains can be produced without individual review.
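One way to picture this propagation is the sketch below. It assumes each non-inclusive email carries a hypothetical `contained_in` field naming the inclusive email that holds all of its content, and it only propagates the "responsive and not privileged" decision described above. Any other decision is left for individual review.

```python
def propagate_coding(emails, decisions):
    """Copy a reviewer's decision from inclusive emails to ignored ones.

    emails: list of dicts with 'id', 'inclusive' and 'contained_in'.
    decisions: dict mapping inclusive email id -> coding dict.
    Only 'responsive and not privileged' is inherited.
    """
    result = dict(decisions)
    for e in emails:
        if e["inclusive"]:
            continue
        parent = decisions.get(e["contained_in"])
        if parent and parent["responsive"] and not parent["privileged"]:
            result[e["id"]] = dict(parent, inherited_from=e["contained_in"])
    return result

emails = [
    {"id": "E1", "inclusive": False, "contained_in": "E3"},
    {"id": "E2", "inclusive": False, "contained_in": "E3"},
    {"id": "E3", "inclusive": True, "contained_in": None},
    {"id": "E4", "inclusive": False, "contained_in": "E5"},
    {"id": "E5", "inclusive": True, "contained_in": None},
]
decisions = {
    "E3": {"responsive": True, "privileged": False},
    "E5": {"responsive": True, "privileged": True},
}
coded = propagate_coding(emails, decisions)
print(sorted(coded))  # ['E1', 'E2', 'E3', 'E5']
```

E4 is deliberately left uncoded here because its inclusive email was marked privileged.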
Concept search relies on technology that groups items based on their textual content. In most cases, a by-product of the indexing behind concept search is the calculation of near-duplicates. Near-dupes are like exact dupes except that they differ by a few indexable characters or words, and a similarity percentage is often provided. Near-duplicates of 95% or more can reasonably be considered versions of the same document. But the optimal percentage to use as the threshold for skipping near-duplicate items depends on the technology and the length of the documents. The longer the document, the higher the required percentage can be. Choosing it may require practice and familiarity with how a particular eDiscovery tool identifies near-dupes.
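To see why the threshold depends on document length, consider a simple word-level similarity score. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in; it is not the algorithm used by any particular eDiscovery tool, and the `similarity_pct` helper is hypothetical.

```python
from difflib import SequenceMatcher

def similarity_pct(text_a, text_b):
    """Word-level similarity between two texts, as a percentage."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    ratio = SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()
    return round(100 * ratio, 1)

# A one-word change in a five-word document.
short_a = "the contract is signed today"
short_b = "the contract is signed tomorrow"

# The same one-word change in a hundred-word document.
long_words = [f"w{i}" for i in range(100)]
long_a = " ".join(long_words)
long_b = " ".join(long_words[:-1] + ["changed"])

print(similarity_pct(short_a, short_b))  # 80.0
print(similarity_pct(long_a, long_b))    # 99.0
```

A single changed word costs the short document twenty points but the long one only a single point. A 95% threshold would therefore treat the long pair as versions of one document and the short pair as different documents, even though the edit is the same.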
Conceptual indexing takes the near-dupe idea one step further, by identifying documents that share the same ideas but may not share the same vocabulary. For example, one document may discuss the rules of the game of soccer (the name of this sport in America), while another discusses the same rules but calls it football (as it is called in the rest of the world). These documents would not be identified as near-dupes but would be identified as conceptually similar. By contrast, a document describing the rules of American football and one describing football as the rest of the world knows it would not be identified as conceptually similar.
Note that in most conceptual indexing, a document is only included in the concept group where it is most similar to the rest of the documents; it would not be part of several groups. Many platforms apply cluster visualization to concept groups, which presents larger groupings as more abstract ideas.
Search Enhancements Using Conceptual Search
Once documents are in concept groups and near-duplicates have been identified, searching can be improved in several ways. The most obvious is to continue to narrow the search set by eliminating near-dupes that share a high similarity percentage. This is much like ignoring emails once the inclusives have been identified. When searching only the primary near-duplicates within a set of inclusive emails, the results can be broken down into the number of distinct concepts versus the discrete number of documents with responsive keywords.
For example, if the reduced set contains 10,000 items and a search term returns only a dozen items, it is fair to say that this is a targeted search. If it returned 20% of the set, the term may need to be refined, or the data may simply be rich with that term. Rather than simply counting results, this type of search identifies how many kinds of documents exist in the search result set. Once the documents to be reviewed are identified, it is possible to review everything in the concept groups where the hits occur. This helps identify relevant documents that do not contain a particular keyword. In the previous example, if a reviewer searched for the word soccer, the search would not return the rulebook where the game was called football. But looking at the whole concept group would lead the reviewer to understand that it is also called football. Each industry has acronyms, colloquialisms and specific vocabulary; if a user is unfamiliar with them, concept search can help.
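The arithmetic and the group expansion described above can be sketched as follows. The concept group labels, document ids and helper functions are hypothetical; in practice the group assignment would come from the platform's concept index.

```python
def hit_rate_pct(hit_count, set_size):
    """Share of the reduced set returned by a search, as a percentage."""
    return 100 * hit_count / set_size

def expand_to_concept_groups(hit_ids, concept_group_of):
    """Return every document in any concept group that contains a hit."""
    groups_with_hits = {concept_group_of[doc_id] for doc_id in hit_ids}
    return sorted(
        doc_id
        for doc_id, group in concept_group_of.items()
        if group in groups_with_hits
    )

# A dozen hits in 10,000 items is a targeted search; 2,000 hits is not.
print(hit_rate_pct(12, 10_000))    # 0.12
print(hit_rate_pct(2_000, 10_000))  # 20.0

# Each document belongs to exactly one concept group.
concept_group_of = {
    "R1": "rules-of-the-game",  # rulebook that says "soccer"
    "R2": "rules-of-the-game",  # rulebook that says "football"
    "F1": "finance",
}
keyword_hits = ["R1"]  # a search for "soccer" only finds R1
print(expand_to_concept_groups(keyword_hits, concept_group_of))  # ['R1', 'R2']
```

Reviewing the expanded set surfaces R2, the rulebook that never uses the searched keyword.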
The other major benefit of concept indexing is the ability to search by concepts rather than keywords. In some systems, this is called concept search; in others it is called searching for copies. In either case, a reviewer provides sentences or paragraphs and instructs the system to identify documents conceptually similar to what was provided. Generally, the more text provided, the fewer conceptually similar documents will be found.
A final aspect of conceptual search and indexing is when an eDiscovery team may have no idea what they are looking for. Grouping helps clarify the contents of the set. First, documents are randomly selected from each concept group and reviewed. Next, concept groups that have responsive documents in the sample set should be examined in their entirety to identify additional responsive documents. This common approach is actually the start of the workflow for most technology-assisted review (TAR) or predictive coding engines on the market. This technique is used by both plaintiffs and defendants when they have a set of documents with unknown content.
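The sample-then-expand workflow can be outlined in a short sketch. It assumes concept groups are available as lists of document ids and that a reviewer's judgment can be represented by an `is_responsive` function; both are stand-ins, and the fixed seed is only there to make the sample reproducible.

```python
import random

def sample_groups(groups, per_group, seed=0):
    """Randomly pick up to per_group documents from each concept group."""
    rng = random.Random(seed)
    return {
        name: rng.sample(doc_ids, min(per_group, len(doc_ids)))
        for name, doc_ids in groups.items()
    }

def groups_to_review_in_full(samples, is_responsive):
    """Return groups where at least one sampled document was responsive."""
    return [
        name
        for name, doc_ids in samples.items()
        if any(is_responsive(doc_id) for doc_id in doc_ids)
    ]

groups = {
    "pricing": [f"P{i}" for i in range(50)],
    "lunch-orders": [f"L{i}" for i in range(30)],
}
samples = sample_groups(groups, per_group=5)

# Stand-in for human review: here every pricing document is responsive.
def is_responsive(doc_id):
    return doc_id.startswith("P")

print(groups_to_review_in_full(samples, is_responsive))  # ['pricing']
```

Only the groups that produced a responsive sample are then examined in their entirety.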
Keyword searching and date filtering are ubiquitous and understandable to most users, and they generally produce the same results across platforms. Threading and concept search can enhance keyword searching and filtering to improve overall search efficiency and effectiveness. Identifying email threads and grouping documents by concept using artificial intelligence is not a new idea, but it has become simpler to use and easier to understand in modern eDiscovery platforms. Combining all of these search technologies is the most effective way to reduce the review burden and locate the most relevant documents for your case.