Data scientists are trying to make the internet accessible in every language

Pro-democracy protesters light up their mobile phones as they attend a mass rally in Thailand.
Reuters/Soe Zeya
Access for all.
By Nicolás Rivero

Tech Reporter based in New York

If you speak one of about two dozen dominant languages, the internet is your oyster.

You can navigate the web in your own tongue. If you come across a website or document in an unfamiliar language, your browser will instantly translate for you. Search engines can intuit what you’re looking for from a few cryptic keywords. Digital keyboards carry the special characters and diacritical marks you need. Voice assistants understand you. Spell checkers automatically catch your mistakes, while predictive text helps you craft memos and emails.

All these conveniences are powered by language-specific AI programs, and they’re available to more than 4 billion speakers of languages like English, Spanish, Mandarin, and Hindi. But this is not how the other half browses. Roughly 3.5 billion speakers of around 7,000 other languages don’t have access to some or any of these AI-powered tools—shutting them out of the internet’s most powerful benefits, many of which we take for granted.

Left unchecked, this gap will create a group of digital haves and have-nots: those who can browse nearly any corner of the web with ease, and those who will struggle to access information that isn’t written in their languages.

“People say the internet democratized information, but it didn’t—it democratized information in English,” said Ari Ramkilowan, a data scientist developing machine translation tools for South African languages. “For people to truly get access to it they need to see it in a language that they’re comfortable with.”

Ramkilowan is one of many AI researchers across the globe working to advance natural language processing (NLP) for so-called “low-resource” languages—that is, languages that don’t have massive databases of digital text that can be used to train well-tuned algorithms. Languages become low-resource for a variety of reasons: the size of a speaker population, its access to the internet, how much text its members write online, and how much publishers translate for them. These are real-world problems that scientists can’t solve directly. Instead, researchers are compiling troves of training text and tinkering with techniques that make algorithms do more with less data, so that more people can move around the web with ease.

While the researchers share a common goal, they’re working with vastly different levels of funding and institutional support. A few have come up with model approaches that other languages could emulate.

🇮🇳 In India, startups have taken the lead. After cellular data prices fell dramatically in 2016, hundreds of millions of Indian consumers got access to the internet for the first time—and promptly found themselves confronted by websites, apps, and digital services available only in English and Hindi. Since then, a slew of tech startups have raced to develop NLP for all 22 official Indian languages.

The startups hope they can cash in by helping businesses reach millions of potential customers in their own languages—and some have already been snapped up by the likes of retail giant Flipkart and telecom behemoth Reliance. The sheer number of Indian language speakers made investing in them an attractive business proposition the moment smartphone penetration shot up, which offers a glimmer of hope for other widely spoken languages like Javanese and Igbo. For smaller language communities, the startup approach may be a tougher sell.

🌍  African data scientists are collaborating across the continent. The Masakhane project is a group of 144 researchers from 19 countries working to advance NLP for African languages. The researchers face similar challenges: meager funding, few opportunities to meet and learn from far-flung peers, and trouble finding training text locked away behind copyright claims, obscure file formats, or the doors of private libraries. So they’ve created a group to pool resources, share knowledge, and help each other move their projects forward.

Many Masakhane initiatives are volunteer efforts, fueled by passionate researchers willing to dedicate countless hours to collecting songs, stories, and prayers, or convincing local news outlets to turn over their archives. They’ve built language databases from scratch and created preliminary translation models for dozens of languages. But progress is painstaking. Contributors working nights and weekends can gather training text hundreds or thousands of words at a time; the state-of-the-art language models, by comparison, draw on hundreds of billions of words.

💰 Google is NLP’s corporate Goliath. The company’s Translate tool now supports 108 languages, although performance varies widely between high-resource languages like Spanish and German and low-resource languages like Yoruba and Malagasy. Still, the tool has enabled billions of internet users to access online content posted in scores of languages they don’t speak—a massive contribution to the internet’s promise of creating a more open and connected world.

Google funnels some of its vast wealth toward developing techniques to improve translation for low-resource languages. These include tricks like back-translation, in which an existing translator generates synthetic training text when real parallel data are thin, and multilingual models that can transfer basic grammatical knowledge between several related languages. While these methods have the benefit of improving NLP performance for many languages at once, they have limits: so far, none of these hacks has allowed low-resource language models to approach the performance of their high-resource peers.
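The first of those tricks, often called back-translation, can be sketched in a few lines. The idea: take monolingual text in the low-resource language, run it through a (usually weak) reverse translator, and treat the output as synthetic parallel data for training the forward model. The `translate` function below is a hypothetical stand-in for a real reverse translation model, and the toy word-for-word lexicon is purely illustrative—not actual Google tooling.

```python
def back_translate(monolingual_target, translate):
    """Build synthetic (source, target) pairs from target-language-only text.

    `translate` is any target -> source translation function; in practice it
    would be a trained reverse model, often of modest quality.
    """
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        synthetic_source = translate(target_sentence)
        # The pair is then used as if it were human-translated parallel text
        # when training the forward (source -> target) model.
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs


def toy_reverse_model(sentence):
    # Illustrative word-for-word lookup; a real reverse model would be a
    # full translation system, not a dictionary.
    lexicon = {"mo": "I", "dupe": "thank"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())


pairs = back_translate(["mo dupe"], toy_reverse_model)
```

Even when the synthetic source sentences are rough, they give the forward model far more examples of fluent target-language output than the scarce genuine parallel data alone.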

💪 The Basque government funds NLP as a point of national pride. At the University of the Basque Country, roughly 60 computer scientists and linguists known as the Ixa group are on a nationalist mission to develop NLP for the Basque language, with funding from the region’s semi-autonomous government. Beginning in the 1990s, Ixa created a spellchecker, a digital dictionary, and later a Basque translator, all with the goal of standardizing the language and promoting its use.

The research is motivated by the Basque community’s fierce commitment to preserving its language, which faced extinction during decades of cultural repression under Spanish dictator Francisco Franco. As a result, a community of just 750,000 speakers has developed a remarkably robust set of NLP tools. “Our goal is to be able to use Basque in everyday life, without facing any problems because other people don’t understand our language,” said Ixa co-founder Kepa Sarasola.

For companies and individuals alike, expanding NLP means expanding access—to new markets, to more products and services, and to the internet’s founding promise that connecting people across the globe will make the world a better place. “It becomes particularly important when you think about the internet as a primary way of spreading information,” said Masakhane researcher Jade Abbott. “We need to support these languages because we’re now excluding a large portion of the population from not only understanding the conversation that’s happening online but also contributing to it.”