Artificial intelligence can scour code to find accidentally public passwords

Hiding out.
Image: Reuters/Maxim Shemetov

Sometimes sensitive data, like passwords or keys that unlock encrypted communications, are accidentally left open for anybody to see. It’s happened everywhere from the Republican National Committee to Verizon, and as long as information can be public on the internet the trend isn’t going to stop.

But researchers at software infrastructure firm Pivotal have taught AI to locate this accidentally public sensitive information in a surprising way: By looking at the code as if it were a picture. Since modern artificial intelligence is arguably better than humans at identifying minute differences in images, telling a password from normal code is, for a computer, much like telling a dog from a cat.

The best way to check whether private passwords or sensitive information has been left public today is to use hand-coded rules called “regular expressions.” These rules tell a computer to find any string of characters that meets specific criteria, like length and included characters. But passwords are all different, and this method means that the security engineer has to anticipate every kind of private data they want to guard against.
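To make the limitation concrete, here is a minimal sketch of rule-based scanning in Python. The two patterns and the names `RULES` and `find_secrets` are illustrative assumptions, not the rules Pivotal or any particular scanner uses.

```python
import re

# Hypothetical hand-written rules; a real scanner would maintain many more.
RULES = {
    # The string "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # password = "..." or password: '...' with at least 8 quoted characters.
    "password_assignment": re.compile(
        r"\bpassword\s*[:=]\s*['\"][^'\"]{8,}['\"]", re.IGNORECASE
    ),
}


def find_secrets(text):
    """Return (rule_name, matched_text) pairs for every rule that fires."""
    hits = []
    for name, pattern in RULES.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


sample = 'password = "hunter2hunter2"\ntoken = "q9X$vT2#pL8@"'
print(find_secrets(sample))
```

Only the `password` line is reported. The random-looking `token` value slips through because nobody wrote a rule for it, which is the gap the article describes.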

To automate the process, the Pivotal team first turned the text of passwords and code into matrices, or lists of numbers describing each string of characters. This is the same process used when AI interprets images. Similar to how the images reflected into our eyes are turned into electrical signals for the brain, images and text need to be in a simpler form for computers to process.
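The article does not say exactly how Pivotal encoded the text, so the following Python sketch is only one plausible reading. It assumes each character becomes a number between 0 and 1 based on its character code, and that 100 characters are laid out as a 10-by-10 grid, like a tiny grayscale image. The function name `text_to_matrix` and the `side` parameter are hypothetical.

```python
def text_to_matrix(text, side=10):
    """Encode the first side*side characters as a side x side grid of floats in [0, 1]."""
    size = side * side
    # Truncate or pad with NUL characters so every input yields the same shape.
    padded = text[:size].ljust(size, "\0")
    # Clamp to the 8-bit range so each cell behaves like a grayscale pixel.
    values = [min(ord(ch), 255) / 255 for ch in padded]
    # Slice the flat list into rows.
    return [values[row * side:(row + 1) * side] for row in range(side)]


matrix = text_to_matrix("for item in items: print(item)")
print(len(matrix), len(matrix[0]))
```

A grid like this can be fed to the same kind of model that classifies photographs, because both are just arrays of numbers.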

When the team visualized the matrices, private data looked different from the standard code. Since passwords or keys are often randomized strings of numbers, letters, and symbols—called “high entropy”—they stand out against non-random strings of letters.
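One common way to put a number on "high entropy" is Shannon entropy, measured in bits per character. The sketch below is an illustration of that idea only; the article does not say Pivotal computed this measure, and the function name `shannon_entropy` is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Bits of entropy per character, from the observed character frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy("for item in list"))   # ordinary code, repeated characters
print(shannon_entropy("q9X$vT2#pL8@zR5!"))   # random-looking key, no repeats
```

The random-looking string scores higher because no character repeats, while ordinary code reuses spaces and common letters.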

Below you can see a GIF of the matrix with 100 characters of simulated secret information.

A matrix with confidential information.

And then here’s another with 100 characters of normal, non-secret code:

Image: Pivotal

The two patterns are completely different, with patches of higher entropy appearing lighter in the top example of “secret” data.

Pivotal then trained a deep learning algorithm typically used for images on the matrices, and, according to Pivotal chief security officer Justin Smith, the end result performed better than the regular expressions the firm typically uses.