2025 | Canada | Education | Excellent | SDG10 | SDG4 | SDG9
Language Identification Tool for African Languages

Organisation Name

UBC Deep Learning and Natural Language Processing (DLNLP) Group

Industry

Education

Organisation Website

https://www.dlnlp.ai/

Country

Canada

Sustainable Development Goals (SDGs)

SDG 4: Quality Education

SDG 9: Industry, Innovation and Infrastructure

SDG 10: Reduced Inequality

General Description of the AI tool

Our tool is a state-of-the-art Language Identification (LID) system built for Africa’s linguistically diverse communities. It accurately detects 700+ languages using a curated dataset and a novel hierarchical framework. By improving data quality and reducing mislabeling, it supports inclusive NLP development, enables local-language digital access, and provides a foundational resource for building fair, culturally aligned AI systems across the continent.

Relevant Research and Publications

1. AfroLID: A Neural Language Identification Tool for African Languages (Adebara et al., 2022)

Paper: https://aclanthology.org/2022.emnlp-main.128.pdf
Code: https://github.com/UBC-NLP/afrolid/tree/main
This work helped shape our project by highlighting the limited coverage and accuracy of existing LID tools for African languages, motivating the need for more robust, inclusive models and high-quality curated datasets.

2. Serengeti: Multilingual Benchmarks and Models for African Languages (Adebara et al., 2023)

Paper: https://arxiv.org/pdf/2212.10785
Models: https://huggingface.co/UBC-NLP/serengeti

This work provided key insights into benchmark creation, evaluation standards, and multilingual modeling for African languages, directly informing our methodology and evaluation design.

3. Cheetah: Natural Language Generation for 517 African Languages (Adebara et al., 2024)

Paper: https://arxiv.org/pdf/2401.01053

Cheetah introduces a massively multilingual NLG model for 500+ African languages. It informs our project by showing how large-scale, Africa-centric models can be built and evaluated, reinforcing the need for reliable LID and clean language-specific data as a foundation for downstream generation.

4. The State and Fate of Linguistic Diversity and Inclusion in the NLP World (Joshi et al., 2020)
Paper: https://aclanthology.org/2020.acl-main.560.pdf

This paper documents how NLP research overwhelmingly neglects many of the world’s languages and highlights structural barriers to linguistic inclusion. It directly shapes our project by motivating a focus on African languages, framing the “language gap” as an equity issue, and grounding our work in linguistic justice and representation.

5. AfroBench: How Good are Large Language Models on African Languages? (Ojo et al., 2025)
Paper: https://arxiv.org/pdf/2311.07978

This benchmark evaluates LLMs across 64 African languages and 15 tasks, revealing large performance gaps and highlighting data scarcity. It guided our project’s focus on dataset quality, language coverage, and hierarchical modelling.

Needs

Funding

Public Exposure

HPC resources and/or Cloud Computing Services

