Transcript
Hi, I’m Josh from Nomio, but my friends and family prefer to call me that guy who has done nothing but build and maintain contract repositories for seven years.
So pick whichever name you prefer. Now, we’ve already covered what happens if you naively try to stick Claude or another LLM on top of your contracts, and then some of the ways that we get around the extreme cost and slowness that come about as a result.
But all the way through this discussion, we’ve been talking as if we’re simply putting contracts into whatever system we’re talking about.
But that’s not actually true. What we’re really talking about is documents, and they are not the same as contracts. You can think of a contract as one or more documents that together make up the contract.
But what we deal with as our actual input is a big pile of documents, individual PDFs or Word files. This changes the whole game: it makes the problem much harder, and it means that there’s a big step that we have to solve before we start doing any of our fancy data extraction and question answering stuff. So let’s dive in.
The first thing you have to do is recognise that if you’re plugging something on top of your file system, whether it’s Claude, whether it’s another kind of contract management system, etc., you first have to clean up your big pile of documents.
Only some of them are going to be relevant for building your contract repository. There are going to be lots of duplicates in there and even more completely irrelevant documents.
These are often kind of very ancillary documents that you don’t care about. Lots of things like drafts, unsigned versions of the docs, out-of-date terms, random invoices and receipts, so stuff that’s just not relevant at all.
And all of this stuff is going to really pollute and actually compromise the accuracy of the answers you can expect to receive if you don’t get rid of them. And they’re also just going to inflate the cost of any kind of processing that you try and run on top.
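As a rough illustration of that clean-up step, here is a minimal Python sketch. It assumes each document already carries a `kind` label (the label values, the `Document` class and the `clean_pile` function are all hypothetical names for this example), and it only catches byte-identical duplicates; near-duplicates such as rescans would need something smarter.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    name: str
    content: bytes
    kind: str  # hypothetical label, e.g. "signed_contract", "draft", "invoice"


# Assumption: these are the kinds we treat as contractual documents.
RELEVANT_KINDS = {"signed_contract", "order_form", "amendment", "schedule"}


def clean_pile(docs):
    """Drop irrelevant kinds and exact duplicates, keeping first occurrences."""
    seen_digests = set()
    kept = []
    for doc in docs:
        if doc.kind not in RELEVANT_KINDS:
            continue  # drafts, invoices, receipts and so on
        digest = hashlib.sha256(doc.content).hexdigest()
        if digest in seen_digests:
            continue  # byte-identical copy of a document we already kept
        seen_digests.add(digest)
        kept.append(doc)
    return kept
```

Running this before any LLM processing means duplicates and irrelevant files never reach the expensive step.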
So you first need to isolate what are just the contractual documents. And then the problem has only just started because you have to do the very difficult job of figuring out how you group these documents together to form the contracts that are actually the objects that you’re going to be extracting data from and answering questions about. If you just look at the naked documents, you’re going to be drawing a whole load of incorrect conclusions and we’ll see why if we look at some of these examples.
So sometimes we’ll get lucky and we’ll just have a standalone contract. So the contract is a single document and everything is contained within - nice and simple.
This is what we’ve been assuming throughout. But you go one step up and you might have a set of terms and conditions that are generic, so they might be stored on a website somewhere or in a PDF, and you then have an order form that sits on top of them.
And so the terms and conditions contain some of the information. The order form contains other bits of information like when it starts, what the price is, what the initial term is, and the only way that you can fully understand the entire contract is when you consider both of the documents together.
And then you’ve also got to take account of the fact that the order is likely to override or take precedence over some of the terms in the terms and conditions.
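To make the precedence idea concrete, here is a small Python sketch. It assumes the terms of each document have already been extracted into plain dictionaries, and that the order form wins wherever both documents define the same term; in a real contract the precedence clause itself decides that. The function name and keys are hypothetical.

```python
def resolve_terms(terms_and_conditions: dict, order_form: dict) -> dict:
    """Combine both documents; order form values override the generic terms."""
    # Later entries win in a dict merge, so the order form takes precedence.
    return {**terms_and_conditions, **order_form}


generic = {"payment_days": 30, "governing_law": "England"}
order = {"payment_days": 45, "start_date": "2024-01-01", "price": 12000}
contract = resolve_terms(generic, order)
# contract["payment_days"] is 45: the order form overrode the generic term.
```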
Then you’ve got cases where you might have the main agreement and additional schedules, like a schedule of prices or schedule of services, that’s important to the agreement and you need to make sure that all of those are grouped together.
You might have cases where an amendment has come in. So you’ve got your original standalone agreement and then a couple of years later an amendment is signed.
It extends the term. It increases the price. It updates a whole load of the existing terms, and you need to make sure that your system isn’t naively looking at these as two separate contracts, because then it’s going to be drawing incorrect conclusions from one out-of-date document and one incomplete document. So just like with the order and the terms and conditions, you need to combine these together into a single contract and make sure that that contract as a whole reflects any of the changes that the amendment made.
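A minimal Python sketch of folding amendments into the original agreement follows. It assumes each amendment has been reduced to a signing date plus a dictionary of changed terms; the `Amendment` class, `effective_terms` function and the field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Amendment:
    signed_on: date
    changes: dict  # only the terms this amendment alters


def effective_terms(original: dict, amendments: list) -> dict:
    """Apply amendments oldest first so the latest change to a term wins."""
    terms = dict(original)  # copy, so the original record is left untouched
    for amendment in sorted(amendments, key=lambda a: a.signed_on):
        terms.update(amendment.changes)
    return terms


original = {"end_date": "2023-12-31", "price": 10000, "notice_days": 30}
amendments = [
    Amendment(date(2023, 6, 1), {"end_date": "2025-12-31", "price": 12000}),
]
current = effective_terms(original, amendments)
```

Terms the amendment does not mention, like `notice_days` here, carry over from the original unchanged.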
And then we go a step up and we have things like MSAs and statements of work. And the difference here is that this is not just one contract.
We have here what I would call multiple contractual instruments so we’ve got an overarching agreement that governs how all child agreements will be handled between two parties and then each statement of work is effectively its own contractual instrument.
It has its own term. It has its own start date, its own end date, its own renewal provisions, its own billing terms.
But every statement of work still shares a bunch of common terms from the MSA. And it’s at this point that you might realise, hold on a second, neither a spreadsheet nor my file system, with nested folders, is actually capable at all of modelling these relationships between documents.
And this is the tough part. Unless you have a purpose-built system or database that actually knows how to model all of these relationships between documents, you’re never going to be able to accurately extract information from these documents in the context of the contract that they’re in.
It’s just impossible. So if you’re just chucking Claude at a naked pile of documents, you’re going to be getting things wrong simply by the fact of not having your documents organised properly, even if it’s pulling correct information from each individual document.
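As a sketch of what modelling these relationships might look like, here is a small Python data structure in which each statement of work points at its parent MSA and inherits the shared terms. The `Instrument` class and its fields are hypothetical, and the rule that a child's terms override the parent's is an assumption for illustration; a real MSA may state the opposite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instrument:
    name: str
    terms: dict
    parent: Optional["Instrument"] = None  # e.g. the MSA a SOW sits under

    def resolved_terms(self) -> dict:
        """Own terms layered over everything inherited from the parent chain."""
        inherited = self.parent.resolved_terms() if self.parent else {}
        return {**inherited, **self.terms}


msa = Instrument("MSA", {"governing_law": "England", "payment_days": 30})
sow_1 = Instrument("SOW 1", {"start_date": "2024-01-01", "payment_days": 45}, parent=msa)
sow_2 = Instrument("SOW 2", {"start_date": "2024-06-01"}, parent=msa)
```

Each statement of work keeps its own dates and billing terms while still answering questions about governing law from the MSA, which a flat spreadsheet row or a folder path cannot express.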
And then we have the very real possibility of any combination of these kinds of structures that you see here.
So you might have an amended MSA. Or you might have an order, terms and conditions and a bunch of ancillary documents.
And very often you will get things like an order form and terms and conditions that someone has stuck together in one single document, or you might get a contract and then a separate agreement like an NDA or a data processing agreement stuck on top.
And so now you’ve got two concurrent sets of terms but whatever system you’re using has to be smart enough to understand whether it’s looking at a single contract in one document or whether that document might contain multiple contracts.
So it gets very hairy very quickly. Then we have this issue of permissions. If you are just sticking an LLM on top of a file system, it’s going to be impossible for you to enact any kind of access control.
Remember that large language models are probabilistic systems, and if you want the guarantee of security, if you want to make sure that certain people only see certain things, there is simply no substitute for having a database that says who can see what.
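One way to picture this is a deterministic filter that runs before anything reaches the model. The Python sketch below uses in-memory dictionaries as a stand-in for a real database table of who can see what; the function name, user names and contract IDs are all hypothetical.

```python
def visible_contracts(user: str, contracts: dict, acl: dict) -> dict:
    """Return only the contracts this user is explicitly allowed to see."""
    allowed_ids = acl.get(user, set())  # unknown users see nothing
    return {cid: text for cid, text in contracts.items() if cid in allowed_ids}


contracts = {"c1": "Supplier agreement", "c2": "Executive employment contract"}
acl = {"alice": {"c1", "c2"}, "bob": {"c1"}}

# Only this filtered set would ever be passed to the LLM as context.
bob_context = visible_contracts("bob", contracts, acl)
```

Because the check is an exact lookup rather than an instruction to the model, a contract the user is not permitted to see is never in the prompt in the first place.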
There is actually a solution to all of these problems and it’s by intelligently combining the solutions that we’ve looked at so far.
So that is going to be the topic of the next video. In the meantime, have a wonderful day. Maybe go spend some time outside. See you.