Concept Clustering in eDiscovery

Topic: eDiscovery
By: Jeffrey R. Schaefer & Megan B. Gramke

Imagine yourself in a complex litigation in which your opponent finally responds to requests for production. You receive requested email files from a group of 10 employees. The email images produced, including attachments, comprise 200,000 files. If printed, these materials would be more than one million pages.

For these 200,000 files, you have metadata about the emails, such as the identities of the senders and recipients, the dates of the emails and the email subjects. You also have searchable text for each email. You know what the attachments are. Still, you have 200,000 electronic files to sort through. That’s a big number. Where do you start?

Maybe key employees’ emails should be selected for initial review. Perhaps there are certain key terms that can be used to search for useful evidence among all of the emails. These are certainly valid approaches.

However, what if it were possible to quickly sort the 200,000 items into about 20 “buckets” where the items in each bucket shared certain textual similarities shown on the bucket labels? Buckets whose labels revealed highly relevant terms could be targeted for priority review.

The software tool that makes this possibility real is called “concept clustering.” It is an analytic tool available in eDiscovery platforms such as the Relativity platform implemented at Ulmer. No case-specific details are required to “cluster” a set of materials. Once the materials to be clustered are specified, one or more algorithms within the clustering tool evaluate the data, find the textual similarities, and create the “clusters” and their labels. For clusters containing large numbers of files, “sub-clusters” are created that further refine the data. Our experience is that clustering is a useful tool for quickly locating materials that share the most relevant and important concepts in a dataset, whether applied to email or other text-based file types.
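For readers curious about what happens inside such a tool, the general idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not the algorithm used by Relativity or any other eDiscovery platform, the six sample emails are invented, and the function names (`vectorize`, `cluster`, `label`) and the short stopword list are hypothetical. The sketch uses a simple k-means style grouping on word counts, then labels each bucket with its most heavily weighted terms.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; real tools use far larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "for", "on", "in", "is"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def normalize(weights):
    """Scale a term -> weight mapping to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def vectorize(text):
    """Turn a document into a unit-length term-frequency vector."""
    return normalize(Counter(tokenize(text)))

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def centroid(vectors):
    """Average direction of a group of vectors, rescaled to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalize(total)

def cluster(texts, k, iterations=10):
    """Group texts into k buckets by textual similarity (k-means style)."""
    vectors = [vectorize(t) for t in texts]
    # Deterministic seeding: start with the first document, then repeatedly
    # add the document least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(
            candidates,
            key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
        ))
    centers = [vectors[s] for s in seeds]
    assignment = None
    for _ in range(iterations):
        # Assign each document to the most similar bucket center.
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centers[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the buckets are stable
        assignment = new_assignment
        # Recompute each center from the documents now in its bucket.
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignment, centers

def label(center, n=3):
    """Label a bucket with its n most heavily weighted terms."""
    ranked = sorted(center.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]

# Invented sample emails, for illustration only.
emails = [
    "Pricing proposal for the supplier contract renewal",
    "Revised supplier contract pricing attached",
    "Contract renewal pricing terms from supplier",
    "Lunch on Friday for the team outing",
    "Team outing lunch menu and Friday schedule",
    "Friday team lunch headcount",
]

assignment, centers = cluster(emails, k=2)
for c, center in enumerate(centers):
    print("Bucket", c, "label:", ", ".join(label(center)))
    for text, a in zip(emails, assignment):
        if a == c:
            print("   ", text)
```

On this toy set, the contract and pricing emails land in one bucket and the lunch emails in the other, and nothing about the case had to be specified in advance. Production tools work at a much larger scale and typically use more sophisticated language models than raw word counts, so the sketch should be read as a picture of the concept rather than of any product.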

Clustering is just one analytic tool available to evaluate electronically stored information, in cases large or small. It can speed human review of materials and make the review process more efficient and cost-effective. For more information, contact one of Ulmer’s eDiscovery Practice Group members.