The UCT researchers behind MzansiLM, a new artificial intelligence (AI) language model trained specifically on South Africa’s 11 official written languages. From left: Simbarashe Mawere, Anri Lombard, Dr Jan Buys and Dr Francois Meyer.
Image: Supplied
A groundbreaking initiative from the University of Cape Town (UCT) is poised to transform access to artificial intelligence tools throughout South Africa. A team of researchers has unveiled a new AI language model specifically designed to recognise and understand all 11 official written languages of the country, a remarkable step towards bridging the digital divide that has long left many South Africans underrepresented in the AI narrative.
This innovative research, spearheaded by Anri Lombard and Dr Jan Buys from UCT’s Department of Computer Science, along with Dr Francois Meyer and a committed team of collaborators, will be showcased at the prestigious Language Resources and Evaluation Conference (LREC) in Mallorca, Spain, this month.
The researchers have developed a dual contribution to the advancement of African language AI: MzansiText — a meticulously curated multilingual dataset that encapsulates South Africa's linguistic diversity — and MzansiLM, a language model trained from scratch on this dataset. As AI-powered language tools increasingly dictate our access to information and communication globally, the need for a solution that caters to South Africa’s unique languages has never been more urgent.
Dr Buys highlighted a pivotal issue within the field, stating, “In language modelling, languages are considered low resource due to the limited textual datasets available for training.” His assertion underscores the challenges faced by speakers of many South African languages, who often encounter inadequate responses from popular AI services when using languages such as isiNdebele or Sepedi.
The researchers say the reality is that nine out of South Africa’s 11 official languages fall into this low-resource category. While some languages like isiZulu and isiXhosa have garnered attention on the global stage, others remain largely overlooked. MzansiLM emerges as the first publicly available decoder-only language model able to support all 11 official written languages within a single framework.
Dr Meyer addressed the need for inclusivity, stating, “With MzansiLM, we aimed to construct a singular model specifically focused on South Africa, covering all 11 official languages, including those often neglected.” This ambitious goal is rooted in Lombard's master’s research, which investigates language-model architectures for low-resource languages — an area still ripe for exploration and innovation.
The MzansiLM model, though modest with its 125 million parameters, has demonstrated outstanding performance across targeted tasks, surpassing even some larger open-source models in benchmarks involving multiple South African languages. In tests on isiXhosa text generation, MzansiLM produced quality outputs that rivalled models up to ten times its size.
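The article does not specify MzansiLM's architecture, but a decoder-only model of roughly 125 million parameters is consistent with a GPT-2-small-style configuration. As a rough illustration (the hyperparameters below are assumptions for the sketch, not taken from the paper), the parameter count of such a model can be tallied directly:

```python
def decoder_param_count(vocab_size, context_len, d_model, n_layers, d_ff):
    """Count learnable parameters in a GPT-2-style decoder-only transformer
    (tied output embeddings, learned positional embeddings, biases throughout)."""
    # Token and positional embedding tables.
    embeddings = vocab_size * d_model + context_len * d_model
    # One transformer block: two layer norms, a fused QKV projection,
    # the attention output projection, and a two-layer feed-forward network.
    per_layer = (
        2 * d_model                            # layer norm 1 (scale + bias)
        + d_model * 3 * d_model + 3 * d_model  # fused Q, K, V projection
        + d_model * d_model + d_model          # attention output projection
        + 2 * d_model                          # layer norm 2
        + d_model * d_ff + d_ff                # feed-forward up-projection
        + d_ff * d_model + d_model             # feed-forward down-projection
    )
    final_norm = 2 * d_model                   # final layer norm
    return embeddings + n_layers * per_layer + final_norm

# GPT-2-small-like hyperparameters (assumed, not MzansiLM's actual config).
total = decoder_param_count(vocab_size=50_257, context_len=1_024,
                            d_model=768, n_layers=12, d_ff=3_072)
print(f"{total:,}")  # → 124,439,808 (~125M)
```

One design pressure worth noting: with a vocabulary this size, the embedding table alone accounts for roughly a third of the budget, and a multilingual model covering 11 languages typically needs a large shared vocabulary, which is part of what makes small multilingual models challenging to build well.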
MzansiLM is not designed for general conversation like ChatGPT or Claude. Instead, it serves as a foundational model that developers can fine-tune for specific applications, such as summarising texts or annotating raw data in South African languages, offering a promising alternative to mass-market, proprietary language tools.
While MzansiLM itself lays the groundwork, its greater benefits are expected to come from larger iterations and from systems built on its foundation. The value of this research also extends beyond South Africa, shedding light on a broader global challenge: why advanced AI systems remain less effective for languages other than English.
Dr Buys noted that existing language models struggle with general-purpose user interactions in these languages due to limited training data, underscoring the gap that MzansiLM seeks to help close.
Moving forward, the UCT team recognises that their work is merely the beginning. Lombard remarked, “Closing the gap in capabilities between South African languages and English requires ongoing collaborative efforts.” Echoing this sentiment, Meyer reinforced the importance of an open research community, stating that sharing datasets and models fosters progress that can’t be achieved in isolation.
In a commitment to support further research and innovation, UCT has made both MzansiText and MzansiLM publicly available. Their publication, MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages, can be accessed on arXiv, and the MzansiLM model itself is available for download, paving the way for a more equitable future in AI.