Earlier this year, a whistleblower secretly leaked documents from a law firm in Mauritius to a group of investigative journalists. The documents provide a rare look at how multinational companies avoid paying taxes when they do business in Africa, the Middle East, and Asia. But with 200,000 documents, some hundreds of pages long, the trove was too large for the journalists to simply sit down and read by themselves.
Enter artificial intelligence. To support the reporters working on the project, Quartz built a machine-learning model that identifies similar documents among a set. For instance, when reporters found one particularly useful business form or tax filing on their own, the model could help them find others like it. Suddenly, the vast array of documents was a lot more manageable, and the journalists were able to complete their reporting and publish their findings with dozens of news organizations starting today.
The project, dubbed the Mauritius Leaks, involved 54 journalists around the world secretly coordinating online over many months in an encrypted workspace built by the International Consortium of Investigative Journalists (ICIJ), which received the original leak. The most important work was done by human reporters making sense of the documents and uncovering the magnitude of corporate tax-avoidance, including our Quartz colleagues Max de Haldevang, Justin Rohrlich, and Abdi Latif Dahir, who wrote about Bob Geldof’s African investment fund, Sequoia Capital, and more.
Our AI aided the investigation by applying a journalist’s judgment at scale, identifying a particular kind of document (like a tax return or a business plan) across the entire trove. While the AI didn’t do anything a human couldn’t do (after all, knowledgeable journalists know what a tax return looks like), it did the job a lot faster, freeing the humans for other tasks.
“So, so helpful,” Will Fitzgibbon, a senior reporter at ICIJ, said after Quartz’s model found several financial statements in the trove. “I hadn’t seen some of these, largely because they are hidden way, way down in bundled PDFs that mean you need to scroll through the whole thing to find it.”
This new model of AI-assisted investigative journalism is the focus of the Quartz AI Studio, which is pursuing several such projects this year with the support of a grant from the Knight Foundation. While most investment in AI for media and other industries has focused on automation at large scale, we’ve found potential in applying the same technology for more bespoke research in close coordination with humans.
The Mauritius Leaks provided an opportunity to apply that idea to a complex investigative project with unique challenges.
Fitzgibbon led the investigation for ICIJ and had identified a few interesting types of documents, including financial statements, business plans, and a particular kind of tax return we came to call the CTX return. He wanted us to figure out how to find more of them.
Documents with “2015 Financial Statements.pdf” in the filename or “Notes to the Financial Statements” in the first few paragraphs aren’t the problem; ordinary search tools can find those. But words can get mangled when a document has been printed, scanned, and read back into a computer with optical character recognition software. In one case, we saw the phrase “CTX Return” become “C’I’X Return,” with apostrophes around an I instead of a T, which a regular search would miss.
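To see why, here’s a toy illustration of the failure mode (our own, not code from the investigation): an exact match on the OCR-mangled phrase fails, even though a fuzzy string comparison shows the two strings are nearly identical.

```python
from difflib import SequenceMatcher

query = "CTX Return"
ocr_text = "C'I'X Return"  # what OCR produced from the scanned page

print(query in ocr_text)                               # False: an exact search misses it
print(SequenceMatcher(None, query, ocr_text).ratio())  # ~0.82: clearly similar strings
```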
An ordinary search might also miss slight variations in phrasing, which are especially likely in less structured documents such as business plans. Those documents are valuable to reporters because they describe what a company does and why it chose to incorporate in Mauritius.
How we did it
In broad strokes, we built a model that, given an interesting document, lets us find similar documents.
What does “similar” mean? Good question.
One option would be to train a model to spot the difference between “tax returns” and “not tax returns.” This is known as “supervised” learning, because humans must classify the training data as one or the other to start with.
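For contrast, here’s a minimal sketch of what that supervised route could look like using scikit-learn; the example texts and labels are invented, and none of this is code from the project.

```python
# The supervised route we didn't take: hand-label examples, then train
# a classifier to predict the label of unseen documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Corporate income tax annual return for the year ended ...",  # invented examples
    "Minutes of the board meeting held in Port Louis ...",
]
labels = ["tax_return", "not_tax_return"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["Annual return of corporate income tax ..."]))
```

The catch, as we explain below, is that this approach needs labeled examples, and we didn’t have any.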
The other option, which is what we chose, was to train a model that essentially fingerprints each document, allowing us to find clusters of similar fingerprints. We used doc2vec, which is called an “unsupervised” machine-learning method because it learns useful things about the documents just by “reading” them, with no labels required. Doc2vec is an extension of the more widely known word2vec algorithm, which maps words into a 100-dimensional space where similar words sit close together and the relationships between words are represented spatially. Doc2vec maps documents into that same space, so similar documents end up close together. After we fed it a few hundred thousand documents, the model had built a “mental” map of everything in the Mauritius Leaks.
Training the model took about 13 hours on my MacBook Pro.
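We can’t share the project’s actual code (more on that below), but doc2vec has a widely used implementation in the gensim Python library, and the training step looks roughly like this sketch. The document texts, IDs, parameters, and filename here are placeholders, not Quartz’s real values.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# `docs` maps a document ID to its extracted plain text (placeholder data;
# the real corpus was a few hundred thousand leaked documents).
docs = {
    "doc-001": "Corporate Income Tax - Annual Return for the year ended ...",
    "doc-002": "Notes to the Financial Statements ...",
}

corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[doc_id])
    for doc_id, text in docs.items()
]

# 100-dimensional vectors, matching the dimensionality described above.
model = Doc2Vec(vector_size=100, min_count=5, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save("leak.doc2vec")
```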
Once the model was trained, we fed it a seed set of CTX returns so that, metaphorically, it could “average” those documents and find what they had in common. If we had fed it only one CTX return from, say, the software company Esri (which appears in the leaked documents), the model would have considered other Esri documents that weren’t CTX returns to be “similar.” That isn’t what we wanted.
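In the terms of the sketch above, the querying step might look like this; the “averaging” is literal, a mean of the seed documents’ vectors, and the document IDs are hypothetical.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("leak.doc2vec")

# Hypothetical IDs for the known CTX returns in the seed set.
seed_ids = ["ctx-return-01", "ctx-return-02", "ctx-return-03"]

# Average the seed vectors so no single company's quirks dominate, then
# ask for the nearest neighbors of that averaged "fingerprint".
seed_vector = np.mean([model.dv[doc_id] for doc_id in seed_ids], axis=0)
candidates = model.dv.most_similar([seed_vector], topn=300)

for doc_id, score in candidates[:10]:
    print(f"{score:.3f}  {doc_id}")
```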
When the results came back, I looked over the filenames of the documents the model found. Many of the business plans had “Business Plan” right in the filename, but where the model shone was in finding business plans with filenames like “BPlan” or “BP.”
We then posted a list of those documents into the shared workspace the journalists used, along with internal links to the documents themselves.
How we measured success
The initial results seemed great. From a “seed set” of 14 CTX returns, the model gave me 300 documents, many of which were CTX returns I hadn’t seen yet.
But how great were the results?
We couldn’t use traditional methods of measuring success. For one, we didn’t know how many CTX returns were in the Mauritius Leaks, so we couldn’t calculate how many we had missed. And unlike supervised classification, this method doesn’t give us a sharp division between what the model thinks are and aren’t CTX returns. If I asked the model for the 300 documents most similar to a known CTX return, I would get back 300 documents, whether or not the 300th looked like a CTX return.
We didn’t care so much about false positives (a measure known as “precision”), since reporters are good at ignoring irrelevant documents. But we cared very much about not missing documents (“recall”); we didn’t want CTX returns to sit unfound in the dataset.
With this in mind, I chose one CTX return and set it apart from my seed set. Then I asked the doc2vec model for the documents most similar to that return and measured how many from my list showed up in the results, and at what rank.
I had originally found my seed set of CTX returns by searching the ICIJ database for “Corporate Income Tax – Annual Return” and spot-checking a handful to confirm they were the right thing. To simulate how well the doc2vec model could find CTX returns lacking that phrase, I redacted the term from each of the other CTX returns.
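That test is straightforward to script. Here’s a hedged sketch of the idea; the document IDs and the held_out_texts dictionary are hypothetical, and the redacted copies are re-embedded with infer_vector because their trained vectors had seen the unredacted text.

```python
import numpy as np
from gensim.utils import simple_preprocess

# Query with the one CTX return I set apart (hypothetical ID)...
query_vec = model.dv["ctx-return-01"]

# ...and re-embed the other known returns with the telltale phrase removed.
# `held_out_texts` is a hypothetical {doc_id: text} dict of the others.
redacted_vecs = {
    doc_id: model.infer_vector(
        simple_preprocess(text.replace("Corporate Income Tax – Annual Return", "")))
    for doc_id, text in held_out_texts.items()
}

# Where would each redacted return rank among the query's 300 nearest neighbors?
neighbor_scores = [s for _id, s in model.dv.most_similar([query_vec], topn=300)]
for doc_id, vec in redacted_vecs.items():
    sim = float(np.dot(query_vec, vec) /
                (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
    rank = 1 + sum(s > sim for s in neighbor_scores)
    print(f"{doc_id}: cosine similarity {sim:.3f}, would rank ~{rank}")
```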
My 14 “other” CTX returns, with the search term redacted, showed up in the top 53 results, with 11 of the 14 in the top 26. Many of the documents that ranked higher than members of my list of 14 were right answers, too: CTX returns I hadn’t seen, exactly what I was looking for.
That said, we can’t confirm whether we found all of the documents we were looking for.
Hurdle one: Training computers on secret documents
Using machine learning on a document leak poses a lot of challenges. One is generating “training” data to teach the computer model what you want to find.
If you’re making a spam detector, for example, you’d start with a bunch of emails people have already marked as “spam” or “not spam.” The documents in the Mauritius Leaks had no such labels, and contained little label-like information at all. And we had only a few weeks to work, so laboriously reading a sample of the documents to label manually was out of the question. Plus the documents had to remain secret to protect the source who leaked them, so we couldn’t use workers from online services such as Amazon’s Mechanical Turk to label them.
Hurdle two: Documents within documents
Documents buried within other documents pose another problem our approach helped solve. Imagine that a company’s financial statements appear as 3 pages in a 160-page PDF of documents the company used to establish its first Mauritian bank account.
In this case, the model might miss the financial statements altogether because the rest of the PDF is more similar to something else, like maybe a tax return.
So how would we find that buried financial statement?
I took a stab at solving this by dividing each document into overlapping 1,000-word slices, so that each word is contained in exactly two slices, and training a model that treated those slices as documents, too. (Using 1,000 words as the cutoff was somewhat arbitrary; 500 words might have been a better choice.)
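Here’s a minimal sketch of that slicing step; the stride of half the slice size is what puts each interior word in exactly two slices, and the tagging scheme is hypothetical.

```python
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

def overlapping_slices(words, size=1000):
    """Split a token list into `size`-word slices with 50% overlap, so every
    word (except those near a document's edges) lands in exactly two slices."""
    step = size // 2
    return [words[i:i + size] for i in range(0, max(len(words), 1), step)]

def slice_documents(docs):
    """Yield each slice as its own "document," tagged with its parent
    document's ID and slice number (hypothetical tagging scheme)."""
    for doc_id, text in docs.items():
        for n, chunk in enumerate(overlapping_slices(simple_preprocess(text))):
            yield TaggedDocument(words=chunk, tags=[f"{doc_id}/slice-{n}"])
```

When a slice ranks highly, its tag points straight to the region of the long PDF that’s worth reading.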
This worked fairly well for detecting financial statements. For instance, the Mauritius Leaks included a “retrocession agreement” in which one company sold access to another company’s investment fund. Appended to the retrocession agreement was a “due diligence questionnaire” about the first company, and attached to that were its financial statements, beginning on page 35.
A reporter searching for financial statements might not bother to read a seemingly unrelated contract to find the financial statements hidden on page 35. But with our model, we have a far stronger signal that the document contains what we’re looking for and is actually worth reading.
Try it yourself
We can’t share our exact steps and code for this project because we can’t publish the Mauritius Leaks themselves. However, if you’re willing to play with some Python code, we’ve put together an equivalent model and a Jupyter notebook you can experiment with. The “leak” in our example case is a trove of emails written by New York Mayor Bill de Blasio, which were released under a freedom of information request.
Want to learn more? Read about how to use artificial intelligence for reporting projects that involve big piles of documents.
Read more of Quartz’s reporting on how the Mauritius Leaks expose global tax avoidance.