Company or Institution
The University of British Columbia & Mohamed Bin Zayed University of Artificial Intelligence
Sustainable Development Goals (SDGs)
SDG 3: Good Health and Well-being
SDG 4: Quality Education
SDG 9: Industry, Innovation and Infrastructure
SDG 10: Reduced Inequalities
SDG 11: Sustainable Cities and Communities
SDG 16: Peace, Justice and Strong Institutions
SDG 17: Partnerships for the Goals
General description of the AI solution
To motivate and advocate for an Afrocentric approach to technology development, in which decisions about what technologies to build and how to build, evaluate, and deploy them arise from the needs of local African communities, we develop AfroLID, a neural language identification (LID) toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families domiciled in 50 African countries and written in five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an F1-score of 96.16. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, and find that it outperforms them on many languages. We show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. We also offer several controlled case studies and perform a linguistically motivated error analysis, which together allow us to showcase both AfroLID’s powerful capabilities and its limitations. Developing LID tools is vital for all natural language processing (NLP), since automatic language identification is an important first step in processing human language appropriately. For example, it can be a prerequisite for determining appropriate tokenization. Furthermore, some preprocessing approaches may be necessary for certain languages but may hurt performance in others. LID has facilitated the development of widely multilingual models such as mT5 and of large multilingual datasets such as CCAligned, ParaCrawl, WikiMatrix, OSCAR, and mC4, which have advanced research in NLP. We focus on LID for African languages because of the dearth of resources available for them. To the best of our knowledge, AfroLID is the first language identification tool that supports 517 African languages and language varieties.
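The point that LID comes before language-specific preprocessing can be illustrated with a minimal sketch. It is not AfroLID's API: `toy_identify` is a hypothetical stand-in for a real LID model, the language codes and the `PREPROCESSORS` table are assumptions made for illustration, and the only claim relied on is that Yoruba diacritics carry meaning, so stripping them (harmless for some languages) would hurt Yoruba.

```python
import unicodedata
from typing import Callable, Dict, Tuple


def strip_diacritics(text: str) -> str:
    """Remove combining marks; acceptable only where diacritics are not contrastive."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keep_diacritics(text: str) -> str:
    """Preserve diacritics and only unify the Unicode representation."""
    return unicodedata.normalize("NFC", text)


# Hypothetical routing table: language code -> preprocessing step.
PREPROCESSORS: Dict[str, Callable[[str], str]] = {
    "yor": keep_diacritics,  # Yoruba: tone and under-dot marks distinguish words
}


def toy_identify(text: str) -> str:
    """Hypothetical stand-in for an LID model; a real tool would replace this."""
    normalized = unicodedata.normalize("NFC", text.lower())
    yoruba_letters = "\u1eb9\u1ecd\u1e63"  # e, o, s with dot below
    return "yor" if any(ch in yoruba_letters for ch in normalized) else "und"


def preprocess(text: str, identify: Callable[[str], str]) -> Tuple[str, str]:
    """Identify the language first, then apply the matching preprocessing."""
    lang = identify(text)
    handler = PREPROCESSORS.get(lang, strip_diacritics)
    return lang, handler(text.lower())
```

Calling `preprocess(text, toy_identify)` returns the predicted label together with the processed text; replacing `toy_identify` with a call into an actual LID tool leaves the routing logic unchanged.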
Github, open data repository, prototype or working demo
• Adebara, I., Elmadany, A., Abdul-Mageed, M., & Alcoba Inciarte, A. (2022). AfroLID: A Neural Language Identification Tool for African Languages. Accepted at EMNLP 2022.
• Adebara, I., & Abdul-Mageed, M. (2022). Linguistically-Motivated Yoruba-English Machine Translation. Accepted at COLING 2022.
• Adebara, I., & Abdul-Mageed, M. (2022). Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3814–3841, Dublin, Ireland. Association for Computational Linguistics.
• Adebara, I., & Abdul-Mageed, M. (2021). Improving Similar Language Translation with Transfer Learning. arXiv preprint arXiv:2108.03533.
• Adebara, I., Elmadany, A., Abdul-Mageed, M., & Alcoba Inciarte, A. (2023). SERENGETI: Massively Multilingual Language Models for Africa. Accepted at ACL 2023.
HPC resources and/or Cloud Computing Services