If an AI researcher wants to build a natural language processing model in English, there’s no shortage of data to train her algorithms.
With a click, she could have 1.8 million articles from the New York Times archives, carefully tagged by topic. She might throw in 800,000 stories from the Reuters archives, or 30 million words of text from the Wall Street Journal. Of course, she could also just use the state-of-the-art GPT-3 language model, which cut its teeth on more than 290 billion English words scraped from around the web.
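To make that contrast concrete, here is a minimal sketch of how quickly a tagged English news corpus loads, assuming NLTK is installed. NLTK's bundled "reuters" corpus is the small, freely redistributable ApteMod subset, not the full 800,000-story archive described above:

```python
# Minimal sketch: loading a topic-tagged English news corpus is a one-liner.
# Assumes NLTK is installed; its "reuters" corpus is the small, freely
# redistributable ApteMod subset, not the full 800,000-story archive.
import nltk

nltk.download("reuters")  # fetches the corpus on first use
from nltk.corpus import reuters

print(len(reuters.fileids()))       # ~10,788 stories, tagged by topic
print(reuters.categories()[:5])     # topic labels like 'acq' and 'barley'
print(reuters.raw(reuters.fileids()[0])[:200])  # first 200 characters
```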
But if she wants to build a model that will work for Setswana or Sepedi, two of South Africa’s 11 official languages, her best bet might be a nascent dataset of a few hundred headlines drawn from the Facebook page of the South African Broadcasting Corporation (SABC). The corpus is the work of researchers from seven South African universities, who aspire to build, for their own languages, the kind of massive datasets that US newspapers already supply to power natural language processing (NLP) programs.
This isn’t just an academic exercise. Vukosi Marivate, who led the project and holds the chair of data science at the University of Pretoria in South Africa, says building NLP models in a wider range of languages can increase access to vast swaths of the internet. These algorithms power machine translators, the chatbots that have taken over customer service on retail websites, and the AI aggregators that bring you personalized streams of news.
“We keep on thinking that the internet is democratic and that all the data is available everywhere, and it’s not. It’s available in English,” Marivate said. “For some communities, just having the simple translate function on Twitter or Facebook has made things more accessible because they can look at content and read it in their language. The impact of that cannot be overstated.”
To develop the NLP algorithms that make those tools possible, you need lots of training data. But that data doesn’t exist for low-resource languages, which may have many speakers but few archives of digital text to feed into AI algorithms.
To train their model, Marivate and his collaborators first used the few documents that were available online in Setswana and Sepedi. There wasn’t much: the South African constitution, a few thousand Wikipedia pages, translations of Jehovah’s Witness texts. Then they sprinkled in headlines posted to the SABC Facebook page to build a tool that could group news stories in each language into categories like “sports,” “politics,” and “business.”
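The team’s published pipeline is more involved, but the core task, mapping a headline to a topic, can be sketched with a simple baseline. The snippet below is an illustration under stated assumptions, not the researchers’ actual code: the English headlines and labels are invented placeholders standing in for the annotated Setswana and Sepedi data, and TF-IDF features with logistic regression are just one common choice for small corpora.

```python
# Illustrative baseline for headline topic classification; NOT the
# researchers' published pipeline. The headlines below are invented
# English placeholders standing in for annotated Setswana/Sepedi data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "National side clinches late victory in cup final",
    "Parliament debates amendments to energy bill",
    "Rand slides as mining output disappoints",
    "Sprinter sets new national 200m record",
    "President announces cabinet reshuffle",
    "Retailer posts record quarterly profit",
]
labels = ["sports", "politics", "business",
          "sports", "politics", "business"]

# Character n-grams need no language-specific tokenizer, which helps
# for languages that lack off-the-shelf NLP tooling.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(headlines, labels)

print(model.predict(["Striker doubtful for weekend derby"]))
```

With only a few hundred labeled headlines, a lightweight model like this is often the realistic starting point; the bottleneck the researchers are trying to remove is the scarcity of labeled text, not the modeling itself.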
Since they published their results, one of the grad students on the team has been working on expanding the dataset to include headlines in isiXhosa and isiZulu. Marivate hopes the tool will demonstrate what the researchers could do with access to the broadcaster’s archives, which hold decades of news in all 11 of South Africa’s official languages.
“We think this is a public good,” Marivate said, “and if we make the data available…it would allow us to build up tools for these languages that could make things a lot more accessible.”