Afrocentric NLP

IRCAI Global Top 100 List

2022 | Canada | Communications/Electronics | Outstanding | SDG10 | SDG11 | SDG16 | SDG17 | SDG3 | SDG4 | SDG9

Company or Institution

The University of British Columbia & Mohamed Bin Zayed University of Artificial Intelligence

Industry

Communications/Electronics

Website

https://afrolid.readthedocs.io/en/latest/

Country

Canada

Sustainable Development Goals (SDGs)

SDG 3: Good Health and Well-being

SDG 4: Quality Education

SDG 9: Industry, Innovation and Infrastructure

SDG 10: Reduced Inequalities

SDG 11: Sustainable Cities and Communities

SDG 16: Peace, Justice and Strong Institutions

SDG 17: Partnerships for the Goals

General description of the AI solution

To motivate and advocate for an Afrocentric approach to technology development, in which the choice of technologies to build and the way they are built, evaluated, and deployed arise from the needs of local African communities, we develop AfroLID, a neural language identification (LID) toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families domiciled in 50 African countries and written in five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an F1-score of 96.16. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, and find that it outperforms them on many languages. We show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. We also offer several controlled case studies and perform a linguistically motivated error analysis, which together allow us to showcase both AfroLID’s capabilities and its limitations.

Developing LID tools is vital for all natural language processing (NLP), since automatic LID is an important first step in processing human language appropriately. It can be a prerequisite for determining appropriate tokenization. Furthermore, some preprocessing approaches may be necessary for certain languages but may hurt performance in others. LID has facilitated the development of widely multilingual models such as mT5 and of large multilingual datasets such as CCAligned, ParaCrawl, WikiMatrix, OSCAR, and mC4, which have advanced research in NLP. We focus on LID for African languages because of the dearth of resources available for them. To the best of our knowledge, AfroLID is the first language identification tool that supports 517 African languages and language varieties.
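The role of LID as a first step before language-specific preprocessing can be illustrated with a minimal sketch. This is not AfroLID and does not use its API: AfroLID is a neural model, whereas the toy identifier below compares character-trigram counts by cosine similarity. The two training snippets, the language codes, and every function name are hypothetical choices made for illustration. The routing policy is also an assumption: it strips diacritics for Swahili-like input but preserves them for Yoruba, whose orthography relies on tone marks and underdots.

```python
import math
import unicodedata
from collections import Counter

# Hypothetical toy training snippets (NOT AfroLID data), one per language code.
TRAIN = {
    "swa": "habari ya asubuhi ninafurahi kukuona leo rafiki yangu",
    "yor": "báwo ni ṣé dáadáa ni mo dúpẹ́ púpọ̀ ọ̀rẹ́ mi",
}


def char_ngrams(text, n=3):
    """Count character n-grams of NFC-normalized, lowercased, space-padded text."""
    padded = " " + unicodedata.normalize("NFC", text.lower()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One trigram profile per language, built once from the toy snippets.
PROFILES = {lang: char_ngrams(sample) for lang, sample in TRAIN.items()}


def identify(text):
    """Return the language code whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


def strip_diacritics(text):
    """Remove all combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Identify the language first, then apply a language-specific policy."""
    lang = identify(text)
    if lang == "yor":
        # Assumed policy: keep tone marks and underdots, only normalize to NFC.
        return lang, unicodedata.normalize("NFC", text)
    # Assumed policy for the other toy language: lowercase and strip diacritics.
    return lang, strip_diacritics(text.lower())


if __name__ == "__main__":
    for sentence in ["rafiki yangu anafurahi leo", "mo dúpẹ́ púpọ̀"]:
        print(preprocess(sentence))
```

Applying the diacritic-stripping step to every input would turn the Yoruba word "dúpẹ́" into "dupe", which is the kind of preprocessing choice that suits one language and damages another. A real pipeline would replace `identify` with a trained LID model.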

GitHub, open data repository, prototype or working demo

https://github.com/UBC-NLP/afrolid
https://demos.dlnlp.ai/afrolid/

Publications

• Adebara, I., Elmadany, A., Abdul-Mageed, M., & Alcoba Inciarte, A. (2022). AfroLID: A Neural Language Identification Tool for African Languages. Accepted at EMNLP 2022.

• Adebara, I., & Abdul-Mageed, M. (2022). Linguistically-Motivated Yoruba-English Machine Translation. Accepted at COLING 2022.

• Adebara, I., & Abdul-Mageed, M. (2022). Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go. Accepted at the main conference. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3814–3841, Dublin, Ireland. Association for Computational Linguistics.

• Adebara, I., & Abdul-Mageed, M. (2021). Improving Similar Language Translation with Transfer Learning. arXiv preprint arXiv:2108.03533.

• Adebara, I., Elmadany, A., Abdul-Mageed, M., & Alcoba Inciarte, A. (2023). SERENGETI: Massively Multilingual Language Models for Africa. Accepted at ACL 2023.

Needs

Funding

Personnel

Customers

Public Exposure

Mentorship Program

HPC resources and/or Cloud Computing Services
