Organisation Name
UBC Deep Learning and Natural Language Processing (DLNLP) Group
Industry
Education
Organisation Website
Country
Canada
Sustainable Development Goals (SDGs)
SDG 4: Quality Education
SDG 9: Industry, Innovation and Infrastructure
SDG 10: Reduced Inequality
General Description of the AI tool
Our tool is a state-of-the-art Language Identification (LID) system built for Africa’s linguistically diverse communities. It accurately detects 700+ languages using a curated dataset and a novel hierarchical framework. By improving data quality and reducing mislabeling, it supports inclusive NLP development, enables local-language digital access, and provides a foundational resource for building fair, culturally aligned AI systems across the continent.
Relevant Research and Publications
1. AfroLID: A Neural Language Identification Tool for African Languages (Adebara et al., 2022)
Paper: https://aclanthology.org/2022.emnlp-main.128.pdf
Code: https://github.com/UBC-NLP/afrolid/tree/main
This paper helped shape our project by highlighting the limits of existing LID coverage and accuracy for African languages, motivating the need for more robust, inclusive models and high-quality curated datasets.
2. SERENGETI: Massively Multilingual Language Models for Africa (Adebara et al., 2023)
Paper: https://arxiv.org/pdf/2212.10785
Models: https://huggingface.co/UBC-NLP/serengeti
This paper provided key insights into benchmark creation, evaluation standards, and multilingual modeling for African languages, directly informing our methodology and evaluation design.
3. Cheetah: Natural Language Generation for 517 African Languages (Adebara et al., 2024)
Paper: https://arxiv.org/pdf/2401.01053
Cheetah introduces a massively multilingual NLG model for 500+ African languages. It informs our project by showing how large-scale, Africa-centric models can be built and evaluated, reinforcing the need for reliable LID and clean language-specific data as a foundation for downstream generation.
4. The State and Fate of Linguistic Diversity and Inclusion in the NLP World (Joshi et al., 2020)
Paper: https://aclanthology.org/2020.acl-main.560.pdf
This paper documents how NLP research overwhelmingly neglects many of the world’s languages and highlights structural barriers to linguistic inclusion. It directly shapes our project by motivating a focus on African languages, framing the “language gap” as an equity issue, and grounding our work in linguistic justice and representation.
5. AfroBench: How Good are Large Language Models on African Languages? (Ojo et al., 2025)
Paper: https://arxiv.org/pdf/2311.07978
This benchmark evaluates LLMs across 64 African languages and 15 tasks, revealing large performance gaps and highlighting data scarcity. It guided our project’s focus on dataset quality, language coverage, and hierarchical modelling.
Needs
Funding
Public Exposure
HPC resources and/or Cloud Computing Services