Social media company trains LLM on 40,000 sentences in sub-Saharan African languages

RWS’s TrainAI data collection services played a pivotal role in enabling a social media giant to train its large language model on underrepresented languages.

Our client, a leading social media company, wanted to build a large language model (LLM) for low-resource languages focusing on the sub-Saharan Africa market.

To train the LLM, the client needed a large volume of monolingual data from published texts of varied content types in two long-tail languages, Wolof and Oromo. Data collection also had to comply with all applicable copyright regulations.

The project was complex, and the client didn’t know where to start.

Challenges

Text data had to be collected in Wolof and Oromo, both low-resource, long-tail languages
A significant volume of text data was required to properly train the LLM
Data collection activities had to comply with all relevant copyright laws

Solution

TrainAI from RWS

Data collection, content creation, and data generation
TrainAI partnered with a legal firm to conduct extensive legal research into cross-jurisdiction copyright license transfers
Suitable publications in Wolof and Oromo were sourced, their rightsholders contacted, and license transfers arranged between rightsholders, RWS, and the client
Our TrainAI community of skilled AI data experts then prepared all acquired texts for the purpose of training the client’s LLM

Results

After tailored post-processing of all acquired texts, we delivered a corpus of more than 40,000 sentences in Wolof and Oromo to train the client’s LLM
We successfully supported a complex, locale-specific expansion of the client’s LLM capabilities for the sub-Saharan Africa market
