Audio transcription is finally being automated

By Dave Gershgorn
Updated July 20, 2022

I have far more interviews recorded than I ever want to transcribe. I speak with people for a living, and record almost all my calls and conversations. There are a few ways to tackle this problem, and I’ve tried most of them: transcribing everything myself (the best method, but only if you can block out a few days), using expensive transcription services, using cheap speech-to-text AI developer tools from Google and IBM, jotting down roughly where in the recording each important topic comes up (my current method), and even just trying to keep interviews more concise (it’s like herding cats).

Trint wants to end my pain—and yours. Upload an audio or video file, wait a minute or two for it to process, and Trint will spit out a rough transcription in its online text editor. The text is tagged to the audio; click on a word, and you jump to that point in the recording. Once you fix the AI’s inevitable errors in the online editor (the amount varies based on the speaker and quality of the audio), the text can be exported as captions on a video, or as a complete transcript.
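For readers curious how text can be "tagged to the audio," here is a minimal sketch in Python. It assumes each transcribed word carries start and end timestamps in seconds; the `Word` class and both function names are hypothetical illustrations, not Trint's actual data model or API.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its position in the recording (hypothetical model)."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def seek_time_for_click(words, index):
    """Clicking the word at `index` jumps playback to that word's start time."""
    return words[index].start


def word_at(words, t):
    """Return the index of the word being spoken at playback time `t`.

    Assumes `words` is sorted by start time, as a transcript would be.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1
    return max(i, 0)
```

With a structure like this, an editor can highlight the current word during playback with `word_at`, and corrections to a word's text leave its timestamps untouched.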

I recently recorded a two-hour interview with a well-known computer scientist, but balked at the idea of having to transcribe all that tape. Transcription services would have been too expensive for the scope of the story, but along came Trint. It cost $15 for the rough transcript, and I spent about 45 minutes polishing the copy I needed, saving at least five hours of work. (I transcribe at 1/3 speed to match my slow typing.) The most common corrections I had to make were to punctuation, and to passages where either the researcher or I started a new sentence without finishing the first one (this required some manual rewriting). It also had a tough time with proper nouns, and sometimes stumbled on fast or mumbled speech. I attribute some of these issues to my relatively poor-quality microphone (my laptop).
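The time savings can be checked with a little arithmetic. The sketch below uses only the figures from the paragraph above (a two-hour recording, playback at one-third speed, 45 minutes of cleanup) and assumes manual transcription takes exactly the slowed-down playback time, with no pauses or rewinds. The function name is hypothetical.

```python
def hours_saved(recording_hours, slowdown_factor, cleanup_hours):
    """Estimate hours saved by correcting a rough transcript instead of typing it all.

    slowdown_factor is how many times slower than real time the tape is played
    (3 means 1/3 speed). Assumes manual transcription takes exactly the
    slowed-down playback time.
    """
    manual_hours = recording_hours * slowdown_factor
    return manual_hours - cleanup_hours


# Two hours of tape at 1/3 speed is six hours of typing; cleanup took 45 minutes.
print(hours_saved(2, 3, 0.75))  # 5.25
```

That works out to a little over five hours, in line with the estimate above.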

Jeffrey Kofman, CEO of Trint and former on-air correspondent for ABC and CBS, says Trint is especially suited for television and radio, where interviews are recorded in quiet spaces with quality microphones, but also useful for lawyers, doctors, or anyone who conducts interviews in controlled environments. One of Trint’s specialities is US president Donald Trump, Kofman says, pulling up a transcript and video of the president’s Jan. 31 Supreme Court nomination speech.

“Trump comes out 99.5% correct, because he uses small words and speaks slowly,” Kofman said. “It’s actually not a good example, because it sets an expectation that everyone will be this good.”

An AI transcript of the president would never go immediately live, though, says Kofman. Trint isn’t meant to replace human-checked transcription, just do most of the heavy lifting. Trint doesn’t have specific numbers on its accuracy for customers—it can’t see their information for privacy reasons, so it can’t make the calculation. But on internal tests, Trint was shown to have 95-98% accuracy, according to the company.

Long-form speech-to-text has been a large problem for artificial intelligence in the past. While Microsoft and Google tout the accuracy of their speech-to-text tools that developers use to build apps, those methods are largely geared toward short snippets of audio. The open-source tool autoEdit, which is funded by the journalism non-profit Knight Foundation and uses IBM’s Bluemix to transcribe speech, was nearly unusable when I stacked it against Trint (albeit in my own anecdotal test). At least two other startups, Transcribe Online and AI-Media, claim to be using AI for transcription, but neither allows users to easily sign up and upload audio. Chinese search giant Baidu just announced SwiftScribe, a long-form transcription tool, but the service so far is in an extremely exclusive beta launch of 30-50 transcriptionists.

Kofman says Trint will eventually be available for integration into video and audio content management systems for TV and radio stations, and the company is also working on a portable version of its video player that could be published to websites with fully tagged transcripts.