Social media company trains LLM on 40,000 sentences in sub-Saharan African languages

RWS’s TrainAI data collection services played a pivotal role in enabling a social media giant to train its large language model on underrepresented languages.

Our client, a leading social media company, wanted to build a large language model (LLM) for low-resource languages, with a focus on the sub-Saharan Africa market.

To train the LLM, the client needed a large volume of monolingual data from published texts of varied content types in two long-tail languages, Wolof and Oromo. Data collection also had to comply with all applicable copyright regulations.

The project was a complex one, and the client didn’t know where to start.

Challenges

  • Text data had to be collected in the low-resource, long-tail languages Wolof and Oromo
  • A significant volume of text data was required to properly train the LLM
  • Data collection activities had to comply with all relevant copyright laws

Solution

  • Data collection, content creation, and data generation
  • TrainAI partnered with a legal firm to conduct extensive legal research into cross-jurisdiction copyright license transfers
  • Suitable publications in Wolof and Oromo were sourced, their rightsholders contacted, and license transfers arranged between rightsholders, RWS, and the client
  • Our TrainAI community of skilled AI data experts then prepared all acquired texts for the purpose of training the client’s LLM

Results

  • After tailored post-processing of all acquired texts, we delivered a corpus of more than 40,000 sentences in Wolof and Oromo to train the client’s LLM
  • We successfully supported a complex, locale-specific expansion of the client’s LLM capabilities for the sub-Saharan Africa market