The US Justice department quietly uploaded a new version of the Mueller report to its website today.
You wouldn’t notice any difference just by looking at the report, which is available for download from the department’s website. What’s new is a layer of data that makes it possible, finally, to access the underlying text of the document.
When the Mueller report came out on April 18, it was essentially a giant file of images. (At 140 MB, the file was more than 300 times larger than an ebook of Crime and Punishment.) The Justice department appears to have scanned a paper copy of the report—using a Ricoh MP C6502 Color Laser Multifunction Printer, for what it’s worth—and released the PDF of that scan to the public. That’s why the text is blurry, you can see the edges of some pages, and there’s a fuzzy yellow line through the middle of the entire report.
The decision immediately elicited groans from people trying to search the report for juicy details. A giant file of images has no text to search. It was also condemned by a group involved in setting technical specifications for the portable document format: “This deliberate and unnecessary act made the document substantially harder for anyone and everyone to use, forever,” wrote Duff Johnson, executive director of the PDF Association, in a delightful review of the file’s nerdiest details.
News organizations and Mueller fanatics quickly addressed this problem by running the PDF through a process known as optical character recognition (OCR) to add searchable text to the document. So, to review: The Mueller report was written on a computer, then printed out on paper, scanned back into digital images, and finally regenerated into text using software.
The Justice department’s image-only PDF also seemed to violate the US government’s own guidelines for making documents accessible to all readers. If PDFs don’t come with a layer of text and other metadata, “persons with disabilities who utilize assistive technology such as screen readers or speech-to-text tools may find it difficult or impossible to access essential or critical information,” explains the federal agency in charge of such things.
The website for Robert Mueller’s special counsel investigation acknowledged this shortcoming—“The Department recognizes that these documents may not yet be in an accessible format”—and offered to send a text file of the report to people who would have trouble reading it. It’s not clear if anyone requested such a file or received one. In any event, that offer was removed from the special counsel’s website today when the new version of the PDF was uploaded.
Alerted to the updated PDF today, Johnson found himself both impressed and disappointed.
“They really did try to do a good job here to try to make it accessible,” he said in a phone interview. “Unfortunately, there are still a lot of errors in the tags.”
The software the Justice department used to OCR the report, Adobe Acrobat, generated some jumbled or incomplete text in the new PDF, especially around large redactions and photos. And many of the invisible tags, the structural markers that tell assistive software how to read a document, were applied incorrectly.
The file is also still a 140 MB set of images, albeit now with text underlying it. A never-scanned, native-text PDF of the report would likely be less than 5 MB.
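For a sense of scale, the size figures in this story can be put side by side with a rough back-of-the-envelope sketch in Python. The constant and function names are illustrative, and the inputs are only the approximate numbers quoted here (140 MB, "more than 300 times," "less than 5 MB"), so the results are bounds rather than measurements.

```python
# Approximate figures quoted in the article; none of these are measured values.
SCANNED_MB = 140.0       # size of the scanned Mueller report PDF
EBOOK_RATIO = 300        # the PDF is "more than 300 times larger" than the ebook
NATIVE_MB_UPPER = 5.0    # a native-text PDF "would likely be less than" this


def implied_ebook_mb(scanned_mb: float, ratio: float) -> float:
    """Upper bound on the ebook size implied by the quoted ratio."""
    return scanned_mb / ratio


def min_size_ratio(scanned_mb: float, native_mb_upper: float) -> float:
    """Lower bound on how many times larger the scan is than a native PDF."""
    return scanned_mb / native_mb_upper


ebook_mb = implied_ebook_mb(SCANNED_MB, EBOOK_RATIO)
bloat = min_size_ratio(SCANNED_MB, NATIVE_MB_UPPER)

print(f"Implied ebook size: under about {ebook_mb:.2f} MB")
print(f"Scanned file vs. native-text estimate: at least {bloat:.0f}x larger")
```

On those numbers, the ebook would come in under half a megabyte, and the scanned report would be at least 28 times the size of a native-text version.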
“They made an earnest effort to improve it,” said Johnson, who worked in politics in the 1990s before getting into the nerdy world of document formats. He was rightly proud that his analysis of the earlier PDF had received such attention and seemingly prompted the Justice department to try again, and he promised to write another post analyzing the new effort.
But one of the most important PDFs in American history still contained a few mysteries.
“It remains a scanned file,” Johnson observed, picking over the document on his computer. “Why is it scanned at all? One possibility is that when it was received from Mueller, it was on paper. That’s weird, if true. Why wouldn’t Mueller send over a digital file? Why would DoJ not say, ‘Excuse me, could you send over a PDF?’”