More than 700,000 leaked documents, weighing in at 356 gigabytes, reveal how Isabel dos Santos, the wealthiest woman in Africa and the daughter of Angola’s former president, siphoned hundreds of millions of dollars in public money out of one of the poorest countries on the planet. Digging into such leaks is something the International Consortium of Investigative Journalists (ICIJ) has plenty of experience with. But they had a problem:
That’s a heck of a lot of files.
ICIJ partnered with Quartz’s AI Studio to find a solution. We built a system using artificial intelligence to “read” all the documents and help journalists from Quartz, ICIJ and other partner organizations find the kinds of documents they expected in the cache of leaks—regardless of file format, spelling, transcription errors, phrasing, or even the language of the document.
The investigation found that:
- Western accountants and consultants played a role in legitimizing Isabel dos Santos’s empire, and that a failure to regulate these companies in the West lets them do so.
- A major American consulting firm, Accenture, did work valued at around $50 million for dos Santos-linked companies, and emails show one of its executives making light of corruption allegations against her.
- Big banks tried to crack down on her, but dos Santos sidestepped the problem by buying massive stakes in Portuguese banks for herself. Once inside the EU, she could move her allegedly stolen money anywhere around the world.
- Dos Santos was a distributor for major fashion brands, including Dolce & Gabbana, giving her a veneer of respectability among European elites.
Here’s how we did it.
Computers can’t really understand meaning. But a piece of software called the Universal Sentence Encoder does a decent job pretending. It transforms any sentence into a list of 512 numbers, called a vector. The numbers in the vector aren’t really meaningful on their own, but taken together, sentences that mean about the same thing have vectors that are close to each other.
Even better, those vectors are similar for sentences with similar meaning—even if the sentences are in different languages! That was important for this project, which included documents in both English and Portuguese (the language spoken in Angola).
For instance, searching for sentences similar to “establishing a new corporation” found these sentence fragments as the top two matches:
- “nova entidade para a sociedade,” Portuguese for “a new entity for the company” in an email discussing creating a new corporate structure.
- “of the firm as newly constituted,” from a law authorizing secretive corporate structures in Mauritius.
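To make the idea of "close" vectors concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way to measure closeness. The 4-number vectors are made up for illustration (real Universal Sentence Encoder vectors have 512 numbers and come from the model itself), and the unrelated third sentence is a hypothetical addition.

```python
import math

def cosine_similarity(a, b):
    """Return a score near 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Assumption: toy 4-number vectors standing in for real 512-number embeddings.
toy_vectors = {
    "establishing a new corporation": [0.9, 0.1, 0.3, 0.0],
    "nova entidade para a sociedade": [0.8, 0.2, 0.4, 0.1],
    "the weather was lovely today": [0.0, 0.9, 0.1, 0.7],  # hypothetical unrelated sentence
}

query = "establishing a new corporation"
# Score every other sentence against the query sentence.
scores = {
    sentence: cosine_similarity(toy_vectors[query], vector)
    for sentence, vector in toy_vectors.items()
    if sentence != query
}
best_match = max(scores, key=scores.get)
print(best_match)
```

With these toy numbers, the Portuguese fragment scores far higher than the unrelated sentence, which is the behavior the real model is designed to produce across languages.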
To build the system, we calculated a vector for every sentence in every file in the leak and stored them in an index built with Annoy, an open-source library for nearest-neighbor search. That makes it quick and easy to calculate “nearest neighbors.” That’s a technical term, but it means about what it sounds like: for a given input sentence’s vector, the vectors for sentences that are closest to it.
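A nearest-neighbor lookup can be sketched in a few lines. This is not the Annoy-based code from the project: it swaps in a plain brute-force search that checks every stored vector, which is fine for a handful of sentences but far too slow for a leak of this size. The file names, the third sentence, and the 3-number vectors are all hypothetical.

```python
import heapq
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query_vector, index, k=2):
    """Return the k index entries whose vectors are closest to query_vector."""
    return heapq.nsmallest(
        k, index, key=lambda entry: euclidean_distance(query_vector, entry["vector"])
    )

# Assumption: a tiny hand-made index with invented file names and toy vectors.
index = [
    {"file": "email_a.txt", "sentence": "nova entidade para a sociedade", "vector": [0.8, 0.2, 0.1]},
    {"file": "law_b.txt", "sentence": "of the firm as newly constituted", "vector": [0.7, 0.3, 0.2]},
    {"file": "memo_c.txt", "sentence": "lunch is at noon on Friday", "vector": [0.1, 0.1, 0.9]},
]

# Stand-in vector for the query "establishing a new corporation".
query_vector = [0.9, 0.1, 0.1]
matches = nearest_neighbors(query_vector, index, k=2)
matched_files = [entry["file"] for entry in matches]
print(matched_files)
```

A dedicated index trades a little accuracy for speed by avoiding the comparison against every stored vector, which is what makes this kind of search practical across hundreds of thousands of files.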
So when reporters working on the project wanted to pinpoint companies that dos Santos was involved with, we first found a few examples of one company’s board meeting minutes, just through manual searching. Then, we plugged in the vectors for each sentence in the board meeting minutes we already had, and examined the nearby documents. Many of these were other corporate board-related documents, but ones that a keyword search would’ve missed because they didn’t contain the word “minutes,” like one proxy-voting authorization letter.
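The "start from a known document" workflow might look something like the sketch below. It takes every sentence vector from a seed file, finds each one's nearest neighbors in other files, and ranks those files by how often they turn up. The vote-counting step is an assumption made for illustration, not a description of how the project ranked results, and the file names and vectors are invented.

```python
import heapq
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_similar_files(seed_file, index, k=3):
    """Rank other files by how many of their sentences sit near the seed file's sentences."""
    seed_vectors = [e["vector"] for e in index if e["file"] == seed_file]
    candidates = [e for e in index if e["file"] != seed_file]
    votes = Counter()
    for seed_vector in seed_vectors:
        nearest = heapq.nsmallest(
            k, candidates, key=lambda e: euclidean_distance(seed_vector, e["vector"])
        )
        for entry in nearest:
            votes[entry["file"]] += 1  # one vote per nearby sentence
    return votes.most_common()

# Assumption: hypothetical file names with one toy vector per sentence.
index = [
    {"file": "minutes_seed.txt", "vector": [0.9, 0.1, 0.0]},
    {"file": "minutes_seed.txt", "vector": [0.8, 0.2, 0.1]},
    {"file": "proxy_letter.txt", "vector": [0.85, 0.15, 0.05]},
    {"file": "proxy_letter.txt", "vector": [0.7, 0.3, 0.1]},
    {"file": "invoice.txt", "vector": [0.1, 0.9, 0.2]},
    {"file": "newsletter.txt", "vector": [0.0, 0.2, 0.9]},
]

ranked = find_similar_files("minutes_seed.txt", index, k=3)
print(ranked)
```

In this toy data the proxy letter ranks first even though nothing about it was matched by keyword, mirroring the kind of result described above.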
We’ll be using this approach on Quartz’s investigative team to search through political Facebook ads, so we can find ads making various interesting or problematic claims, even if we don’t know any of the keywords that might be present. This approach should be applicable to other cases where journalists or researchers have huge numbers of documents and don’t know what’s in them. To help that along, we’ve published an open-source demo for our workflow on GitHub.