
UCT releases MzansiLM for 11 South African languages

May 4, 2026

UCT highlights a small language model for South Africa's 11 official languages

The University of Cape Town reported on MzansiText and MzansiLM on May 4, 2026. The research project pairs a curated multilingual corpus with a 125M-parameter language model trained from scratch for all 11 official written South African languages.

Why low-resource languages need different models

The arXiv paper explains that nine of the 11 languages are low-resource. Large global models often respond poorly in languages such as isiNdebele or Sepedi because far less training data exists than for English or major European languages.

What MzansiLM does technically

Hugging Face describes MzansiLM as a decoder-only LlamaForCausalLM with 125,008,384 parameters, 30 layers, a 2,048-token context length and a custom BPE tokenizer with a 65,536-token vocabulary. The paper reports 20.65 BLEU for isiXhosa data-to-text and 78.5% macro-F1 on isiXhosa news classification.
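
Since the model is a standard LlamaForCausalLM, it should load with the usual Hugging Face transformers calls. A minimal sketch follows; the repository id is hypothetical, as the article does not name the exact model path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- check Hugging Face for the actual MzansiLM checkpoint.
MODEL_ID = "uct/mzansilm-125m"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)  # LlamaForCausalLM weights

# The custom BPE tokenizer has a 65,536-token vocabulary; context is 2,048 tokens.
prompt = "Molo!"  # isiXhosa greeting
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```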

What the limits are

The authors state openly that few-shot reasoning remains near chance at this model size. MzansiLM is therefore mainly a reproducible research baseline, not a universal ChatGPT replacement for South Africa.

Why it matters

The topic matters because AI usefulness depends heavily on language. If tools work well only in English, citizens, public agencies and companies in many regions are excluded. Small open models and datasets create a foundation for local applications, auditability and later improvements.

Practical example

A South African bank could use MzansiText in 2026 to classify customer service messages in isiXhosa, Sesotho and Sepedi. A pilot with 50,000 anonymized messages would first measure where MzansiLM beats a generic large model and where humans still need to review outputs.
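
A minimal sketch of how such a pilot could be scored: compare each model's predictions against human-reviewed gold labels using macro-F1, the metric the paper reports for isiXhosa news classification. The labels and predictions below are invented placeholders:

```python
from sklearn.metrics import f1_score

# Hypothetical pilot data: human-reviewed gold labels for anonymized messages,
# plus aligned predictions from MzansiLM and from a generic large model.
gold = ["billing", "fraud", "general", "billing", "fraud"]
mzansilm_preds = ["billing", "fraud", "general", "general", "fraud"]
generic_preds = ["billing", "general", "general", "billing", "general"]

# Macro-F1 weights each class equally, so rare intents count as much as common ones.
for name, preds in [("MzansiLM", mzansilm_preds), ("generic LLM", generic_preds)]:
    print(name, f1_score(gold, preds, average="macro"))
```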

💡 In plain English

Many AI programs handle English much better than smaller languages. MzansiLM is a small model built specifically for South African languages. It is like a practice book that helps researchers build better local AI.

Key Takeaways

  • UCT reported on MzansiLM and MzansiText on May 4, 2026.
  • MzansiLM has 125,008,384 parameters.
  • The model covers all 11 official written South African languages.
  • The paper reports 20.65 BLEU for isiXhosa data-to-text.
  • The authors say few-shot reasoning remains weak at this model size.

Sources & Context