Cracking eDiscovery with Help from AI.
John Pryor: With me today is Alan Lockett, senior director of research at CS DISCO, an eDiscovery company and one of Georgian's investments. eDiscovery is a fascinating market space supporting legal proceedings such as litigation, government investigations, et cetera, where the information being looked through and sought is in electronic format. No boxes of paper like we've seen in those lawyer movies, and it's more challenging than sifting through all that paper, because now we have metadata. Think about how the timestamp of a particular document could be very relevant in a copyright lawsuit, for example. Plus, we have all these issues of provenance and relevance that you have to be on top of before you release this information to requesting parties. It really is an amazing space. I'm John Pryor, and welcome to the Impact podcast. Now, Alan runs CS DISCO's AI lab and is responsible for developing their AI and ML strategies. He's got an eclectic background with a love of languages and education that includes robotics, neural networks, ontologies, network graphs. I don't know about you, but NLP comes to my mind. Alan, welcome. So CS DISCO, as an eDiscovery company, is obviously for legal. I bet there are a million movies or TV shows, and I could probably rattle off half a dozen right now, telling the classic David versus Goliath story where the little law firm gets the hundreds of boxes delivered. That was real though, right? Wasn't that the case pre-technology?
Alan Lockett: I would say so. Certainly they would deliver the documents in banker's boxes. There would be rooms full of them and you would have to go through them. So I've certainly seen those movies as well. It's still the case, except that the volume of information is increasing. One of the challenges for large and small players alike in the discovery arena is that, when you go and sue a big company, they have a lot of data. It's very expensive for them to go through that data and find what's relevant to you. And it's easy to be on the side of the little guy all the time, but we have to keep in mind that many times there are people who will go around just suing big companies for very specious reasons in order to make the company pay, because that's cheaper than going through your documents. So this hurdle of discovery, of having to find all the information you have that's pertinent to the case, is a reason why companies will often just settle and walk away. In fact, a company that I did some work for a few years ago as a software consultant was sued on the basis of supposedly violating a patent for writing out information into XML for later use. So this is basically just writing into a standard file format and then pulling it back in. The company that I did work for got sued, I was interviewed by the lawyers, and they ultimately settled. I never had to go further with that, but they just paid money to these people; the claim seemed irrelevant to me. So it's actually on both sides. It's a big challenge, and of course if you're the little guy with an actual complaint, one of the standard strategies in eDiscovery is to dump a few million documents on you that passed some sort of weak test for relevance. So you do have to go through them, because what you want is in there, but they don't have to give you more.
So there are a lot of issues of dealing with volume of data, and it's only getting worse as we're starting to see all of the emails end up in these records. Previously, none of the phone calls ended up being discoverable, but now all your emails are discoverable, and it's getting to the point where people use chat apps and their phones for all these communications and all that data is now discoverable. So the volumes of data only increase.
John Pryor: So we had these boxes and boxes of paper, and that was really what it was. Things got a lot better with machine learning. Things got a lot better with electronic documents: clearly all the emails and the texts and the instant messaging that you referred to, those are all online and available. So at least they're already digitized. We're not talking about scanned paper, at least.
Alan Lockett: Typically we're talking about things that are digitized, but nonetheless, some scanning and optical character recognition is typically involved. The reason why is simple. If you give to your opponent, if you give to the other attorney, these original documents, say a PowerPoint file or a Microsoft Word document, you don't necessarily know what's in the metadata or hidden in comments; you may not have looked through them all. So there might be extra information. So one thing that a lot of attorneys will often do is, when they produce documents, they will strip them of the information that might be there, so that they don't have to hand over extra hidden information they didn't know about. So then they'll turn it into a scanned document or a printout that will then be OCR'd back into a digitized form, so that they know what is actually being digitized, and so that it's limited to what they would prefer to share.
John Pryor: And I don't remember the example, and it would probably just give me a headache, but there was something in the US government that had to do with a Word doc, and when they found the metadata in there, things turned out a lot different.
Alan Lockett: That's the kind of thing I'm talking about.
John Pryor: Oh, that's really cool. So let's talk about a term, and I don't know if it's a CS DISCO term or it's an industry term, but TAR, Technology Assisted Review, how does that work?
Alan Lockett: It's an industry term. The history of eDiscovery, and you might be more familiar with it than I am, but the basic concept is, initially what people were doing was they were trying to train machine learning models that would then be able to classify the remainder of the documents. So if I have a million documents, I might go through and label 10,000 documents so that I could predict on the remaining 990,000 documents. And they would just let the machine run on those 990,000 documents with some light validation. So, if their light validation determined that their model wasn't good enough, they would go through and retrain the model with another 10,000 labels, and then they would test on the remaining 980,000. The purpose of this was just to try to reduce the workload, but usually in these scenarios, what we're talking about is a set of a million documents where only 10,000 or so might be relevant. One to five percent is pretty typical for the percentage of a database that has to be turned over to the other side of the case. And in that scenario, it's okay to look at one to five percent of the data. So, instead what's evolved now is the idea that lawyers want to put eyes on every document that they're going to hand over to the opposing counsel. Because that way you're safer; you know what information is going over to the other side. Whereas in the case where you were only using a machine learning model to decide relevance for 90-some percent of your documents, you had no idea what you were giving away...
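The iterative label-train-validate loop Alan describes can be sketched in a few lines of Python. This is a purely hypothetical illustration, not DISCO's system: the function name `tar_one`, the toy word-overlap "model", and the `oracle` callback standing in for human reviewers are all invented for this sketch.

```python
# Hypothetical sketch of TAR 1.0: humans label a seed batch, a model is trained
# and lightly validated on unseen documents, and labeling repeats until the
# model looks good enough; the rest is then classified by machine alone.
import random

def train(labeled):
    # Toy "model": predict relevant if a doc shares any word with a known-relevant doc.
    vocab = set()
    for text, is_relevant in labeled:
        if is_relevant:
            vocab.update(text.split())
    return lambda text: bool(vocab & set(text.split()))

def tar_one(documents, oracle, batch_size=4, target_accuracy=0.9, max_rounds=5):
    """documents: list of texts; oracle: the reviewer's judgment (text -> bool)."""
    random.seed(0)                        # fixed shuffle for reproducibility
    pool = list(documents)
    random.shuffle(pool)
    labeled = []
    for _ in range(max_rounds):
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled += [(doc, oracle(doc)) for doc in batch]   # humans label a batch
        model = train(labeled)
        sample = pool[:batch_size]                          # light validation
        correct = sum(model(doc) == oracle(doc) for doc in sample)
        if not sample or correct / len(sample) >= target_accuracy:
            break
    # Everything still unreviewed is classified by the model and, in the TAR 1.0
    # workflow, would be produced sight unseen.
    return [doc for doc in pool if model(doc)]
```

The point of the sketch is the shape of the workflow, not the classifier: in practice the model would be a real text classifier and validation would use proper statistical sampling.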
John Pryor: Wow.
Alan Lockett: And so what happened is that the term Technology Assisted Review was also introduced as a marketing and branding term to refer to more general processes that might involve machine learning, but generally involve technology. In its weakest form, what it means is just using technology to assist in review of documents for discovery. So they refer now to TAR 1.0, which typically means where you train a model and then just run it on a set of documents and turn it over sight unseen. And sometimes people want to do this still for cases where there's low risk. For example, they might be responding to an SEC investigation or something similar, where they're going to see everything, so what's the point of withholding? And if you can save time by simply running a machine, then it's cheaper than doing a full discovery with eyes on every relevant document. TAR 2.0 typically involves a different set of procedures. It can be anything from simply applying keyword search, to employing topic clustering in order to pre-filter, to a process known as continuous active learning or CAL, which is what it usually means. And in a CAL scenario, what you're doing is, you are still doing this iterative labeling of documents. So the lawyer or reviewer will look over documents and code them, or mark them for the categories of relevance that they have decided need to be shared or that they want to use for internal processes, and then once they've coded these documents, a machine learning model is trained, and then it's used to guide their work. In the active learning scenario, what happens is that the machine identifies the documents that are most likely to change its judgments, so most likely to result in a different model, and will put these in front of the user. But in actual fact, people use the term CAL whether or not active learning is officially involved, to essentially mean that there's a machine in the loop and a human in the loop.
And the human is coding documents, and the machine is suggesting documents or providing a way of sorting documents or simply informing the user and guiding their decisions.
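The CAL loop Alan outlines can also be sketched. Again this is a hypothetical toy, not DISCO AI: the `cal_review` function, the word-overlap scorer, and the `reviewer` callback are invented for illustration, and the selection rule shown is simple uncertainty sampling (surface the documents whose scores sit closest to the decision boundary, i.e. the ones most likely to change the model's judgments).

```python
# Minimal hypothetical sketch of continuous active learning (CAL):
# the reviewer codes a batch, the "model" updates, and the next batch
# is chosen by uncertainty -- scores nearest 0.5 come first.
def score(doc, relevant_words, irrelevant_words):
    # Toy scorer: share of the doc's known words that came from relevant docs.
    words = set(doc.split())
    hits = len(words & relevant_words)
    misses = len(words & irrelevant_words)
    total = hits + misses
    return 0.5 if total == 0 else hits / total

def cal_review(documents, reviewer, batch_size=2, rounds=3):
    unreviewed = list(documents)
    relevant_words, irrelevant_words = set(), set()
    coded = []
    for _ in range(rounds):
        if not unreviewed:
            break
        # Uncertainty sampling: most ambiguous documents go in front of the human.
        unreviewed.sort(key=lambda d: abs(score(d, relevant_words, irrelevant_words) - 0.5))
        batch, unreviewed = unreviewed[:batch_size], unreviewed[batch_size:]
        for doc in batch:                 # human in the loop codes the batch
            label = reviewer(doc)
            coded.append((doc, label))
            (relevant_words if label else irrelevant_words).update(doc.split())
    return coded
```

As Alan notes, real systems may use the machine's scores merely to sort or prioritize rather than to strictly select, and the term CAL is used loosely for any such human-plus-machine loop.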
John Pryor: So let me go through two parts. I want to talk about kind of the basic process, and then we'll talk about looking at these documents. First one: are there two sides to discovery, where the company that has to provide the paper has to first figure out what to provide, and then the recipient gets to do a different degree of analysis to find out if they have a case or not? Are there two sides of discovery? I'd never really thought about that before. I always thought it was somebody looking at documents, but I hadn't really thought about the before and the after.
Alan Lockett: It's a mutual obligation. And I have to emphasize, I'm not a lawyer. I've been around them a bunch and I'm familiar a bit with this process, but I'm not. I couldn't give you, for example, the exact number in the Federal Rules of Civil Procedure that tells you what to do, Rule 26 or something. But the basic concept is that each side has a responsibility or an obligation to share all relevant information with the other side. The entire US justice system, as well as the British system, is an adversarial system. So the plaintiff gets to assert that the defendant has certain documents and they say, well, we want these different categories of documents. They're required to meet and confer in order to determine what the criteria of the search will be. So again, they have to negotiate. Once they've negotiated, they decide on perhaps a set of keywords, perhaps a set of criteria, and this then becomes an RFP, a request for production, that is written out for what things they have to provide. Both sides get to do this. The defendants can also request documents from the plaintiff. Usually, the plaintiff is a smaller entity with fewer documents that the defendant may already have in their entirety anyway, but in a situation, for example, with Apple suing Samsung, what you have is two big companies that are both going to request information from each other. And furthermore, when two big companies are involved in a lawsuit, it's never one way. Apple may have sued Samsung first, I don't remember the exact details, but I would be willing to bet that Samsung countersued-
John Pryor: Absolutely.
Alan Lockett: ...in order to make sure that there was some reciprocity and that both of them had a gun pointed to their head, and that's normal operating procedure. So yes, you're always in a position where both have to produce. Now, if the receiving party of a production believes that the producer has not produced everything they're obligated to produce, they can make a motion to the judge to look into it. The judge may then sanction the producing party, or force them to redo their production. There's a variety of things that can happen there. And the recipient of the production can also complain if they feel they've received too much, because the producer isn't allowed to snow the receiver of the production under. And so you can have court sanctions that are imposed in that case as well.
John Pryor: So we do have the David and Goliath, but we also have Goliath versus Goliath. I'd like to understand that a little more, kind of when someone's now processing the documents, whether they're giving them or they've received them. I like staying with the Samsung and Apple thing, because it's simpler for me; if I remember correctly, there was just an issue of rounded corners. So when someone's given a document, let's say Samsung gives it to Apple, Apple starts looking for perhaps the use of the phrase "rounded corners" to see who suggested it. Are they looking for that level of words? Or are we first doing kind of a higher level of categorization, that there are emails and patents? How deep do we go? And it really does seem like an iterative process to me. They start with something like you said, maybe 10,000 documents; they look at a thousand, they learn from those thousand, maybe they iterate and grab the next 500. What's kind of the macro-level process for going through this?
Alan Lockett: Well, there's two sides to it. You mentioned the side where, say, I've received a production. We have a lot of, we call them matters; that's generally any action, could be an investigation, could be a lawsuit. We have a number of matters that are in fact people receiving productions who want to analyze what they've received, either to dispute them or just to know what's there. And so in those cases, usually a very high percentage of the documents are relevant. It's also usually a smaller number of documents, and often a smaller firm that's having to do the analysis. In that case, they might just go through all of the documents. In the case on the other side, where you're the large corporation, you've been sued and now you have to take all the communications of some large proportion of your major executives, plus several other people, and decide what might be relevant, which is something that they have to negotiate, whose documents they will look at. Then this will all get fed into a system and might involve several tens or hundreds of terabytes of data. They will hire a review manager, who might even work for a separate consulting company, who then will be responsible for assembling a team of reviewers that could number anywhere from the tens to the dozens or even the hundreds, and for that they're going to need software to control the workflow. They will start by doing some exploration. We recommend that they go and do a random sample on the data in order to try to establish what percentage they're looking for in general. So that can give them a sense of how hard it's going to be to find the data, a sense of when they'll be done, a sense of what the proper budget and timelines are. And then we also recommend that they do some targeted keyword searches. Lawyers are addicted to keywords. They use keywords all the time. A lot of times during those Rule 26(f) meet-and-confer meetings, they will establish the keywords that are used.
They can actually establish a set of keywords that are used to throw out any documents that don't have at least one of these keywords, that's often done. So keywords are a very core part of what they do, so we recommend that they do these targeted searches. Often they've talked to the clients, they know what's in there, they should look for what's in there. So that gives you the starting point, a labeled set that's based on random exploration, as well as a targeted set, an informed set of documents that are relevant to your process.
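The opening random sample Alan recommends reduces to a standard proportion estimate: review a random sample, compute the relevant fraction (richness), and put a confidence interval around it to drive budget and timeline. A small illustrative sketch follows; the function name `estimate_richness`, the 400-document sample, and the normal-approximation interval are assumptions for illustration, not DISCO's actual statistical procedure.

```python
# Hypothetical sketch: estimate richness (fraction of relevant documents)
# from reviewer judgments on a simple random sample, with a 95% confidence
# interval via the normal approximation to the binomial.
import math

def estimate_richness(population_size, sample_labels, z=1.96):
    """sample_labels: True/False reviewer judgments on a simple random sample."""
    n = len(sample_labels)
    p = sum(sample_labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)        # normal approximation
    low, high = max(0.0, p - margin), min(1.0, p + margin)
    return {"richness": p,
            "interval": (low, high),
            "expected_relevant": (population_size * low, population_size * high)}

# e.g. 12 relevant documents found in a 400-document random sample
# drawn from a one-million-document corpus:
est = estimate_richness(1_000_000, [True] * 12 + [False] * 388)
```

With richness around one to five percent, as Alan says is typical, even a few hundred sampled documents give the review manager a workable range for how many relevant documents are out there.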
John Pryor: So someone did decide, for example, that rounded corners might be an appropriate search parameter in that example that we talked about, that rounded corners matter.
Alan Lockett: Sure. Sure.
John Pryor: There's no doubt in my mind that it needs to be controlled by a human, but there's also technology that is going to process and extract critical information from documents as well. Is that what's meant by tagging? How does that work?
Alan Lockett: The first process, whenever we get documents, is to pass them through our ingest process, which involves doing a variety of things. For example, determining a lot of the attributes of the file, like here are the emails, here are the PDFs, here are the technical drawings, and sorting everything out into different categories and assigning lots of metadata. If we know things are emails, we'll also parse out where the conversations are so that we can put together the full email chain. There's a lot of pre-processing that goes into it. That is a crucial part of TAR, where people are able to search on these attributes. And I think a lot of eDiscovery systems provide some similar breakdown of the documents at first. The tagging comes in later. For us, tags are very informal. We started off working with smaller firms where there's just less process and procedure, but essentially a tag is just like a tag on Instagram or something. They'll create a tag, they'll name it, and it'll incorporate a whole lot of documents. It'll have some meaning, but that meaning is opaque to our system; we don't have it written down anywhere. However, in a large case with a large law firm and teams of tens or hundreds, those tags are actually very formally decided in advance and they're intended to be very precise, and the reviewers are given very precise instructions for how they are to tag documents.
John Pryor: How does that influence the technology then? The human is tagging or is it influencing the discovery software?
Alan Lockett: The human applies all tags in our system. Our machine learning proposes scores that the reviewers can see; they can choose to apply those tags or not. In some cases they may not even see the scores. We might use the scores in a hidden fashion in order to sort the documents that they're seeing. It would be the review manager who would control that, because we don't want to unduly influence what the reviewer is going to decide. Instead, they set up these stages, these blocks of work that each reviewer has to do, and then the review manager decides that those are going to be fed in order of the most likely to be tagged according to our machine's judgment. So that's where the machine learning can come in.
John Pryor: But you're really augmenting and adding value to the human being, but you are absolutely not replacing the human in this case at all.
Alan Lockett: No, and we certainly aren't trying to do that. Actually, one of our key branding characteristics is that we say that the lawyer's in control. So whereas with some of the earlier systems, some other systems, what they'll do is, once you enter their eDiscovery review process, you have to follow their rules or else you don't get a good result. Now, it's still true in our process that if you don't follow our recommendations, you'll get a worse result, but not necessarily a horrible result, and if you're looking at every document that's relevant anyway, it'll be the right judgment coming out. But in general, we are lower touch. It's very easy: they can just click a button, and then our machine learning is turned on, DISCO AI is turned on and available to provide them with recommendations. So we try to be out of their way so that the lawyer can follow their process instead of ours.
John Pryor: So you're behind the scenes. So if you're going to demonstrate kind of a return on investment, would it be the number of documents that you could look at in a period of time? Would you be using less of the high-priced legal resources and more of the associates, a cheaper resource? Is it across the board? I'm just thinking about how a process might take X months. How long would it take adding your software to the mix here?
Alan Lockett: We say all of those things. Fundamentally, the biggest gain is that because we're using the latest technology in the cloud and we have very good engineers, a lot of the gain is from the ease of use. We also have a very large focus on tailoring the product to how lawyers actually use it. Our software versus our competitors', even without the machine learning, is far more efficient, just because it does what they want to do, in the order they want to do it, and it does it fast and efficiently. That said, when you look at why you would use DISCO AI, why you would use our machine learning on top of that process, it's all of those things. In some cases, we actually now are running some reviews ourselves to be able to develop our processes and understand how to use our tool and what gains we can actually get using our tool. And what we're finding is we're able to review as little as 20 to 30% of the documents using our DISCO AI. In some cases we've even reviewed as little as 5% of the documents. Again, the size of the matter is important. If you've got 10,000 documents, then reviewing 4,000 is not really so onerous, but if you've got a million and you're able to be done after reviewing 10,000, you save a lot of money. And in fact, the margins that I'm hearing are up to 80% that we're able to turn into margin just by using our AI in order to front-load things, and then using statistical testing at the beginning and end of the review in order to determine how much was there to find, and did you find it all?
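The end-of-review statistical test Alan mentions is often done as what the industry calls an elusion test: sample the discard pile (the documents the review never looked at) to estimate how much relevant material was missed, and from that a lower bound on recall. The sketch below is a hedged illustration under that assumption; the function `elusion_test`, its parameters, and the normal-approximation bound are inventions for this example, not DISCO's actual procedure.

```python
# Hypothetical sketch of an elusion test: review a random sample of the
# unreviewed discards, estimate the relevant rate there, and derive an
# approximate lower bound on the recall of the overall review.
import math

def elusion_test(found_relevant, discarded_count, discard_sample_labels, z=1.96):
    """discard_sample_labels: reviewer judgments on a random sample of discards."""
    n = len(discard_sample_labels)
    elusion = sum(discard_sample_labels) / n          # relevant rate among discards
    margin = z * math.sqrt(elusion * (1 - elusion) / n)
    missed_upper = discarded_count * min(1.0, elusion + margin)
    recall_lower = found_relevant / (found_relevant + missed_upper)
    return {"elusion": elusion, "recall_lower_bound": recall_lower}
```

Paired with the richness estimate from the opening random sample, this is how a review can answer "how much was there to find, and did you find it all?" without reading every document.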
John Pryor: So we've got a pretty cool running DISCO system, and in comes your new investor, Georgian Partners. Talk to me a little about your work with the team.
Alan Lockett: Sure, you bet. Well, the system was built in 2015 based on convolutional neural networks. At the time, that was a state-of-the-art approach to text classification. Around the summer of 2018, the technology fundamentally changed. You had technologies such as ULMFiT and BERT. These were transfer learning technologies where you would first train neural networks on, say, a corpus of English literature, so that you would have a model that already knows English, and then you would apply that to a classification task. And you would find that instead of getting 90% accuracy, you'd get 95% accuracy with these transfer learning techniques. So it was a fundamental change, but it was also fundamentally different from the system that we had in 2015. One of the reasons we were so excited about working with Georgian Partners, at least in the AI lab, is because we had a shared philosophy that transfer learning and representation learning were going to be fundamentally transformative in terms of how AI is used in commercial applications, and these were going to provide much higher value. And so even during the phase while we were discussing Georgian Partners' investment, the fact that we had similar views of where AI was going, and the fact that Georgian had on hand talented and experienced personnel who knew how to work with these models, was very important to us agreeing to work together. So when we were able to actually take advantage of Georgian's impact team, the primary advantage was that now we would have the ability, in view of all the other things that we had to do, to have people who are available to do some of these longer-range tasks that you couldn't necessarily get done with a small team in a startup.
So, the advantage is that Georgian could provide experienced researchers who did not need to come up to speed, because they already knew the technologies, in order to address strategic opportunities that otherwise might have waited even longer or sat on the back burner, just because we wouldn't have been able to get to them.
John Pryor: That's neat. How are things looking compared to the original convolutional neural network with this use of BERT, say? How does it run differently? What do your end users perceive?
Alan Lockett: Well, the end users should just perceive better scores. In other words, the judgments that the machine is making should simply be better. And they should be able to sense the quality. It's substantial enough that they should just notice that the machine now seems to hone in on what's happening faster. And it really is faster. The big advantage of these methods is that we believe they will give us a 14% relative increase in accuracy using 25% of the data. And that's based on experiments on about a dozen of our actual production matters. So we're now able to field a model with high quality in a quarter of the time. And the net result of that is that our modeling suggests that we will be able to reduce the number of reviewed documents by a further 20%, which is just further savings and margin that we'll be able to capture through our managed review business, and that our clients who use us would be able to capture with appropriate processes. And so, now what we're doing is building out the infrastructure to make this operate efficiently and cheaply before we're able to turn it on for everyone. But by the end of the year at the latest, all of our customers should be benefiting from these new transfer-learning-based models.
John Pryor: Really fascinating. Really interesting. Thank you so much Alan for taking the time to be with us.
Alan Lockett: You're welcome. It's been my pleasure.
Imagine having to sort through millions of documents to find the one detail that matters most. This is the discovery process, a legal requirement for most corporate lawsuits. It’s expensive, time-consuming, and fraught with risk – a perfect job for AI – as long as you can get it right.
Alan Lockett is our guest on this episode of the Georgian Impact Podcast. Alan is Senior Director of Research at CS DISCO, an eDiscovery company with an AI-assisted review product and a Georgian company. DISCO is dramatically reducing the time and money that goes into discovery.
You’ll hear about:
- How lawyers are using CS DISCO to sort through tens or hundreds of terabytes of data.
- Why emails and chat transcripts are making the discovery process more complex than ever before.
- How Alan’s team is setting new standards for technology-assisted review (TAR).
- How DISCO collaborated with the R&D team at Georgian.
Who is Alan Lockett?
Alan Lockett is Senior Director of Research at CS DISCO, an eDiscovery AI company with a focus on working the way lawyers work rather than reinventing the wheel. His research focuses on artificial intelligence, natural language processing, and cognitive architectures - especially methods including neural networks, graphical models, and probability theory. He holds a Ph.D. in Computer Science from the University of Texas at Austin and completed a postdoc in robotics and deep learning at IDSIA in Switzerland.