Social media company trains LLM on 40,000 sentences in sub-Saharan African languages
RWS’s TrainAI data collection services played a pivotal role in enabling a social media giant to train its large language model on underrepresented languages.
Our client, a leading social media company, wanted to build a large language model (LLM) for low-resource languages, with a focus on the sub-Saharan Africa market.
To train the LLM, the client needed a large volume of monolingual data from published texts of varied content types in two long-tail languages, Wolof and Oromo. Data collection also had to comply with all applicable copyright regulations.
The project was a complex one, and the client didn’t know where to start.
Challenges
- Collect text data in the low-resource, long-tail languages Wolof and Oromo
- Gather a large enough volume of text to properly train the LLM
- Comply with all relevant copyright laws throughout data collection
Solution
- Data collection, content creation, and data generation
- TrainAI partnered with a legal firm to conduct extensive legal research into cross-jurisdiction copyright license transfers
- Suitable publications in Wolof and Oromo were sourced, their rightsholders contacted, and license transfers arranged between rightsholders, RWS, and the client
- Our TrainAI community of skilled AI data experts then prepared all acquired texts for the purpose of training the client’s LLM
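The case study does not describe how the acquired texts were prepared. As a purely illustrative sketch, and not a description of TrainAI's actual workflow, preparing licensed texts for a sentence-level corpus might involve normalizing the text, splitting it into sentences, and dropping duplicates and very short fragments. The function names, the punctuation-based splitting rule, and the `min_words` threshold below are all assumptions made for this example.

```python
import re
import unicodedata

# Assumed rule: a sentence ends at ".", "!" or "?" followed by whitespace.
# Real Wolof or Oromo text would likely need language-specific handling.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split normalized text into sentences using the naive rule above."""
    return [s for s in _SENTENCE_END.split(normalize(text)) if s]


def build_corpus(documents, min_words=3):
    """Return unique sentences, in first-seen order, across all documents.

    Sentences shorter than `min_words` (a hypothetical threshold) are
    dropped, and duplicates are detected case-insensitively.
    """
    seen = set()
    corpus = []
    for doc in documents:
        for sentence in split_sentences(doc):
            key = sentence.casefold()
            if len(sentence.split()) < min_words or key in seen:
                continue
            seen.add(key)
            corpus.append(sentence)
    return corpus
```

Calling `build_corpus` on a list of document strings returns a flat list of sentences whose length can be checked against a target corpus size.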
Results
- After tailored post-processing of all acquired texts, we delivered a corpus of more than 40,000 sentences in Wolof and Oromo to train the client’s LLM
- We successfully supported a complex, locale-specific expansion of the client’s LLM capabilities for the sub-Saharan Africa market