
Presenting the Global Top 100 outstanding projects: Afrocentric NLP

Published on May 30, 2023


After a thorough scientific, ethical, and business review of all projects submitted to the 2022 IRCAI Global Top 100 call, IRCAI has deemed ten submissions “outstanding” based on their AI integrity, potential impact on the SDGs, business sustainability, and ethical design. This article focuses on one of these ten submissions, “Afrocentric NLP”, a project that has produced a neural language identification toolkit for 517 African languages. In early May, we had the privilege of welcoming team member Ife Adebara (a PhD researcher at the University of British Columbia) to our UN STI Forum side event.

Underrepresented voices in the digital sphere

The speed at which we access information from the internet has changed in remarkable ways. Whether you are looking for a guide to improving your algebra skills, an article on the history of your hometown, or PDF instructions on how to use that old camera you found in your grandmother’s pantry, the internet offers a wealth of valuable information. However, language barriers make it difficult to access online content. In the digital sphere, a small number of languages dominate, leaving many others severely underrepresented. Dominant languages are amplified and end up speaking for others. Web translators can at least help translate content, but many languages remain excluded from their automated capabilities, further exacerbating the issue.

An Afrocentric approach to technological development

Motivated to take an “Afrocentric approach to technology development,” Ife and her team are developing language identification models: machine learning models that automatically determine the language of a given text or speech sample. Their toolkit, AfroLID, has been trained on large datasets containing text and language samples from a wide range of African languages. It can thus learn patterns, statistical features, and linguistic properties specific to each of these languages, enabling it to make accurate predictions about the language of new, unseen samples. Ife explains that AfroLID represents an “important first step in human language processing.” Language identification is an important prerequisite for decomposing texts into smaller units such as individual words, characters, and other linguistic units (called “tokens” in NLP), which in turn facilitates the development of multilingual models and machine translation services. “AfroLID is a multidomain web dataset manually curated from 14 language families domiciled in 50 African countries across 5 orthographic systems,” Ife explains. As a result, the language identification model covers an astounding 517 languages and language varieties across the continent.
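To make the idea of learning “patterns and statistical features” concrete, here is a minimal sketch of language identification in Python. It is not AfroLID’s method (AfroLID is a neural toolkit); it swaps in a much simpler classical technique, character trigram profiles compared by cosine similarity. The function names and the handful of Swahili and Hausa phrases are illustrative assumptions, and a real system would need far more data per language.

```python
from collections import Counter
import math

# Toy training phrases (illustrative assumption, not AfroLID data).
TOY_SAMPLES = {
    "swahili": ["habari ya asubuhi", "asante sana rafiki yangu", "karibu nyumbani"],
    "hausa": ["ina kwana", "na gode sosai", "sannu da zuwa"],
}

def char_ngrams(text, n=3):
    """Split lowercased text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples):
    """Build one n-gram frequency profile (a Counter) per language."""
    return {
        lang: Counter(g for sentence in sentences for g in char_ngrams(sentence))
        for lang, sentences in samples.items()
    }

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the language whose profile is most similar to the input text."""
    query = Counter(char_ngrams(text))
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))

profiles = train(TOY_SAMPLES)
print(identify("asante sana", profiles))
print(identify("na gode", profiles))
```

With only three phrases per language the profiles are tiny, so this sketch is reliable only on text that closely resembles the training phrases. The point is the overall shape of the task: build a statistical profile per language from labelled text, then assign unseen text to the closest profile.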

A publicly available toolkit

The LID toolkit is publicly available to “aid the continued development of natural language processing models for African countries,” she notes. The data are of high quality and are manually curated to “ensure that languages are represented correctly.” “This is especially important for Africa,” she adds, “where African people need to continue to be taught in the languages they prefer to speak and learn.”

Afrocentric NLP is a group project by Ife Adebara (PhD researcher at the University of British Columbia), Muhammad Abdul-Mageed (Canada Research Chair in Natural Language Processing at the University of British Columbia), AbdelRahim Elmadany (Postdoctoral researcher at the University of British Columbia) and Alcides Alcoba (Research Assistant at the University of British Columbia). Take a look at AfroLID’s GitHub, working demo and installation requirements. For more background, see the related publication on the neural language identification tool. Ife has also co-authored articles on massively multilingual language models for Africa, linguistic and sociopolitical challenges in developing NLP technologies for African languages, and on using transfer learning based on pre-trained neural machine translation models to translate between similar low-resource languages.

