After a thorough scientific, ethical, and business review of all projects submitted to the 2022 IRCAI Global Top 100 call, IRCAI has deemed ten submissions "outstanding" based on their AI integrity, potential impact on the SDGs, business sustainability, and ethical design. In this article, we focus on one of these ten submissions, "Afrocentric NLP", a project that has produced a neural language identification toolkit for 517 African languages. In early May, we had the privilege of welcoming team member Ife Adebara, a PhD researcher at the University of British Columbia, to our UN STI Forum side event.
Underrepresented voices in the digital sphere
The speed at which we access information from the internet has changed in remarkable ways. Whether you are looking for a guide to improving your algebra skills, an article on the history of your hometown, or PDF instructions on how to use that old camera you found in your grandmother's pantry, the internet offers a wealth of valuable information. However, language barriers still put much of this content out of reach. In the digital sphere, a handful of dominant languages account for the bulk of online content, leaving many others severely underrepresented; the dominant languages are amplified and end up speaking for everyone else. Web translators can help bridge part of the gap, but many languages remain excluded from their automated capabilities, further exacerbating the issue.
An Afrocentric approach to technological development
Motivated to take an "Afrocentric approach to technology development," Ife and her team are developing various language identification models. These are machine learning models used to automatically determine the language of a given text or speech sample. AfroLID has been trained on large datasets containing text and language samples from a wide range of African languages; it can thus learn patterns, statistical features, and linguistic properties specific to each of these languages, enabling it to make accurate predictions about the language of new, unseen samples. Ife explains that AfroLID represents an "important first step in human language processing." Language identification models are an important prerequisite for decomposing texts into smaller units such as individual words, characters, and other linguistic units (called "tokens" in NLP), which in turn facilitates the development of multilingual models and machine translation services. "AfroLID is a multi-domain web dataset manually curated from 14 language families domiciled in 50 African countries across five orthographic systems," Ife explains. With this, the language identification model covers an astounding 517 languages and language varieties across the continent.
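To make the idea of a language identification model concrete, the toy sketch below trains a simple character n-gram classifier on a handful of labelled sentences and then predicts the language of an unseen one. This is an illustration of the general principle only, not AfroLID's architecture (AfroLID is a neural model trained on a large, manually curated corpus); the example sentences, language labels, and library choices are our own assumptions.

```python
# Illustrative sketch only: a toy character n-gram language identifier.
# AfroLID itself is a neural model trained on far larger, manually curated
# data; this snippet only demonstrates the general idea of learning
# language-specific character patterns from labelled text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: (text, language label) pairs.
train_texts = [
    "bawo ni o se wa loni",   # Yoruba (toy example)
    "habari yako leo",        # Swahili (toy example)
    "sannu da zuwa gida",     # Hausa (toy example)
]
train_labels = ["yor", "swa", "hau"]

# Character n-grams capture orthographic patterns that differ across languages.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# Predict the language of a new, unseen sample.
print(model.predict(["habari za asubuhi"]))  # likely ['swa'] given overlapping character patterns
```

A production system like AfroLID relies on far richer data and a neural architecture, but the underlying intuition, that languages leave distinctive statistical fingerprints in their text, is the same.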
A publicly available toolkit
The LID toolkit is publicly available to “aid the continued development of natural language processing models for African countries,” she notes. The data are of high quality and are manually curated to “ensure that languages are represented correctly.” “This is especially important for Africa,” she adds, “where African people need to continue to be taught in the languages they prefer to speak and learn.”
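For readers who want to try the toolkit, the authoritative installation and usage instructions are in the GitHub repository linked below. As a loose sketch only, calling such a toolkit from Python might look like the following; the import path, class name, method, and output format are assumptions for illustration, not AfroLID's documented interface.

```python
# Loose illustration only: the import path, class name, method, and output
# format below are assumptions, not AfroLID's documented API. Consult the
# project's GitHub README for the actual installation steps and interface.
from afrolid import classifier  # assumed import; the real package layout may differ

lid = classifier()  # assumed constructor; real usage may require a model path
# Ask the model for its best guess among the 517 supported languages and varieties.
print(lid.classify("Habari za asubuhi"))  # toy Swahili sample
```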
Afrocentric NLP is a group project by Ife Adebara (PhD researcher at the University of British Columbia), Muhammad Abdul-Mageed (Canada Research Chair in Natural Language Processing at the University of British Columbia), AbdelRahim Elmadany (Postdoctoral researcher at the University of British Columbia) and Alcides Alcoba (Research Assistant at the University of British Columbia). Take a look at AfroLID's GitHub, working demo, and installation requirements. For more background, see the related publication on the neural language identification tool. Ife has also co-authored articles on massively multilingual language models for Africa, on the linguistic and sociopolitical challenges of developing NLP technologies for African languages, and on using transfer learning based on pre-trained neural machine translation models to translate between similar low-resource languages.