Why “data extraction” doesn’t work

Data extraction doesn’t work because:

  1. The data you actually care about can’t be extracted.
  2. You have no link back to the source document, so you can’t easily validate the output.
  3. In extracting data, you’ve thrown away the critical surrounding linguistic context.

Imagine you’re creating a spreadsheet to act as your contract database.

Your colleague reads through each contract and enters the key information into the rows and columns of the spreadsheet. Or they use a fancy AI tool. This is data “extraction”.

Then you realise that what you actually care about can’t be extracted.

So your colleague has to work it out manually.

After they’ve done that, you start using the spreadsheet. Two weeks later, you spot a number that doesn’t look right.

The only way to know whether it’s a mistake, or whether there’s important context behind it, is to find the contract and then find the clause that number came from.

A few months later, you’ve found a few more mistakes. Now you don’t trust the spreadsheet, especially since your colleague has left the business and you can’t ask them any questions.

You end up needing to repeat the whole exercise.
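
For readers who think in code, here is a minimal sketch of points 2 and 3 from the list at the top. It is an illustration under stated assumptions, not a description of any particular tool: the class names, field names, and the example contract, clause reference, and clause wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedCell:
    """A spreadsheet cell as in the story: a bare value and nothing else.
    (Hypothetical structure, for illustration only.)"""
    contract: str
    field: str
    value: str


@dataclass
class TracedCell(ExtractedCell):
    """The same value, but carrying a pointer to its source clause and the
    surrounding wording. (Also hypothetical.)"""
    clause_ref: Optional[str] = None
    clause_text: Optional[str] = None


def can_check_without_rereading(cell: ExtractedCell) -> bool:
    """True only if the cell says where its value came from AND keeps the
    wording around it, so a reader can validate it without opening the
    contract and hunting for the clause."""
    return bool(getattr(cell, "clause_ref", None)
                and getattr(cell, "clause_text", None))


# Made-up example data.
bare = ExtractedCell(
    contract="Supplier Agreement (hypothetical)",
    field="liability_cap",
    value="1,000,000",
)
traced = TracedCell(
    contract="Supplier Agreement (hypothetical)",
    field="liability_cap",
    value="1,000,000",
    clause_ref="clause 12.3",
    clause_text="The Supplier's total liability shall not exceed 1,000,000, "
                "except in respect of the matters listed in clause 12.4.",
)

print(can_check_without_rereading(bare))    # the spreadsheet from the story
print(can_check_without_rereading(traced))  # value plus source and context
```

The bare cell is what the spreadsheet in the story holds. Nothing in it tells you where the number came from or that an exception applies, which is why checking it means going back to the contract.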