TrainAI Study 2.0: Benchmarking leading LLMs on multilingual synthetic data generation
05 Mar 2026
3 mins

Large language models (LLMs) are only as good as the data they're trained on, and getting high-quality data in languages beyond English remains one of AI's toughest challenges. Not only is the internet largely Anglocentric, but real-world multilingual datasets are scarce, and much of what does exist is copyrighted, biased or locked behind restrictive licenses.
That's why synthetic data generation has become central to how organizations build and scale global AI models.
But how well do today's leading LLMs perform when you ask them to generate synthetic training data across a range of languages and tasks? That's the question we set out to answer. The results might surprise you.
The questions we explored
Our previous synthetic data generation study gave us a first look at how LLMs handle multilingual synthetic data generation. However, the LLM landscape moves fast: new models arrive, old ones get updated and performance can shift dramatically between releases.
Version 2.0 of our study was designed to build on that foundation with sharper analyses and harder questions.
We wanted to know the following:
- Which LLMs produce the most natural and grammatical outputs across specific languages and tasks?
- Can synthetic data generation now work for underrepresented languages, not just English and French?
- To what extent do LLMs accurately follow task-specific instructions?
- How do LLM outputs compare to human creators working under realistic conditions?
- What is the true cost-efficiency of LLMs?
- Do model upgrades translate into better multilingual performance?
Methodology at a glance
We tested eight LLMs: Claude Sonnet 4.5, DeepSeek V3.1, Gemini 2.5 Pro, GPT-5, Llama 4 Maverick, Mistral Medium 3.1, Mistral Small 3.2 and Qwen3 235B.
Each model was assigned four complex tasks: domain-specific paragraph generation, conversation generation, text normalization and translation.
Performance was evaluated across eight carefully selected languages, spanning both high-resource and underrepresented contexts: English, Arabic, Simplified Chinese, French, Kinyarwanda, Polish, Tagalog and Tamil.
To establish a human baseline, our own linguists performed the same tasks under realistic constraints: limited time, minimal research and no secondary review.
For each language, three native-speaking professional linguists evaluated every output blindly, without knowing whether it came from a human or a model. All annotators were required to first pass a qualification step to ensure scoring consistency.
The scale of the effort speaks for itself: 25,600 samples, 76,800 annotations, 211,200 data annotator ratings and 120 linguists. Performance was graded on a scale of 1-5, with 5 representing the best possible score and 1 the worst.
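For readers curious how blind ratings like these roll up into the averages reported below, here is a minimal sketch of one way to aggregate them, assuming a simple tabular export. The file name and column layout are illustrative assumptions, not the study's actual pipeline.

```python
import pandas as pd

# Each row: one linguist's blind rating (1-5) for one model or human output.
# (The columns shown here are an assumed layout for illustration only.)
ratings = pd.read_csv("ratings.csv")  # sample_id, model, language, task, rating

# Average the three blind ratings per sample, then roll up per model, language and task.
per_sample = ratings.groupby(["sample_id", "model", "language", "task"])["rating"].mean()
summary = per_sample.groupby(["model", "language", "task"]).mean().round(2)

print(summary.sort_values(ascending=False).head(10))
```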
Key results and insights
The findings revealed a great deal about how LLMs perform on data generation and manipulation tasks across different languages, and about how the latest model generations compare with their predecessors.
Overall: top performers across all tasks and languages
We evaluated the models based on their performance across all tasks and languages. Gemini 2.5 Pro took the top spot with an average score of 4.73 out of 5, followed by Claude Sonnet 4.5 at 4.61 and DeepSeek V3.1 at 4.51.
Multilingual proficiency: the language gap is closing fast
Gemini 2.5 Pro scored above 4.5 in Kinyarwanda across multiple tasks. That’s important because this is a language in which previous model generations could barely produce coherent text.
GPT-5 and Claude Sonnet 4.5 also showed meaningful improvements in long-tail languages. This means that synthetic data generation and text manipulation are becoming viable for a far broader range of languages than ever before.
Task alignment: one model doesn't fit all
No single model dominated across tasks and languages. GPT-5 excelled at text normalization and translation but struggled with content generation, especially in Chinese and Polish – a shift from our previous data generation study, in which GPT-4o was one of the better choices for complex tasks across languages.
Gemini 2.5 Pro led in conversation generation and translation, but it was matched by Claude Sonnet 4.5 on domain-specific paragraph generation. Even smaller models like Mistral Small 3.2 exceeded expectations in certain languages.
Practitioners should expect to evaluate models against their specific use cases rather than relying on overall rankings.
Humans vs. machines: LLMs surpass single-pass human quality
When human creators worked under realistic constraints – limited time, no secondary review, and no extensive research – the top LLMs consistently outscored them. Humans delivered steady results around 4.5 out of 5 but didn't rank first in any language.
This doesn't make human expertise obsolete. It suggests a shift in roles.
LLMs now produce stronger first drafts at scale. Humans can then add value through review, refinement and specialized judgment.
Cost: thinking harder costs more – tokenizer efficiency matters again
Reasoning models generate up to 10x more tokens during their "thinking" process, making tokenizer efficiency a renewed cost consideration.
Gemini 2.5 Pro leads in tokenizer performance at 3.67 characters per token. That makes it roughly 10% more efficient than the closest contender, GPT-5. It’s even up to 3.5x more efficient than Claude Sonnet 4.5 on non-Latin scripts.
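To make the metric concrete, here is an illustrative snippet that computes characters per token for short samples. It uses tiktoken's cl100k_base encoding purely as a stand-in tokenizer; each vendor ships its own, and the figures quoted above come from the tokenizers of the models we tested.

```python
import tiktoken

def chars_per_token(text: str, encoding_name: str = "cl100k_base") -> float:
    """Average number of characters covered by one token of the given encoding."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(text) / len(enc.encode(text))

samples = {
    "English": "Synthetic data generation has become central to building global AI models.",
    "Simplified Chinese": "合成数据生成对构建多语言人工智能模型至关重要。",
}
for language, text in samples.items():
    print(f"{language}: {chars_per_token(text):.2f} characters per token")
```

Non-Latin scripts typically yield fewer characters per token, which is why tokenizer efficiency matters most for languages such as Chinese, Arabic or Tamil.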
Benchmark drift: today's upgrade might be tomorrow's downgrade
Performance doesn’t improve linearly from one model generation to the next – and weaknesses don’t always persist. GPT-5 fell behind smaller models on content generation tasks where GPT-4o had been competitive, as shown in our previous study.
Meanwhile, the Llama family went from Llama 3.1, the slowest model in our last study, to Llama 4 Maverick, the fastest model in this one.
Model upgrades often mean improvements on specific tasks, but they can also reshuffle strengths and weaknesses unpredictably, reinforcing the need to re-evaluate the latest model versions.
What these findings mean for AI teams
The most important takeaway is that there is no single "best" LLM for multilingual synthetic data generation. The right model depends entirely on your tasks, languages and cost priorities.
A model that excels at translation may underperform at conversation generation. A model that tops the charts in French may falter in Tamil.
This makes routine LLM evaluation essential. As models evolve between releases, the assumptions you made six months ago may no longer hold. Organizations building enterprise AI models need to treat multilingual LLM benchmarking as a continuous discipline.
It also reinforces the growing role of hybrid workflows that integrate both machines and humans. LLMs can handle the heavy lifting of first-pass data creation at scale, but human-in-the-loop AI review remains critical for catching nuance, cultural context and edge cases that models still miss.
Recommendations and next steps
If you're defining your AI training data strategy, here's where to focus:
- Evaluate multiple models against your specific use cases rather than relying on general rankings. Performance differences across tasks and languages are significant enough to make one model ideal for one workflow and unsuitable for another.
- Factor in tokenizer efficiency alongside AI data quality scores, especially for high-volume or reasoning-intensive workloads, where longer reasoning traces create a “reasoning tax” by increasing token usage and compounding costs (see the cost sketch after this list).
- Plan for re-evaluation. The benchmark drift we observed means even familiar model families can surprise you (positively or negatively) with each new release.
- Treat synthetic data generation as one piece of a broader AI data strategy. Its primary strength lies in filling gaps in niche scenarios or meeting requirements that rarely occur naturally or are difficult to collect, and it works best alongside human validation and domain expertise.
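To put the “reasoning tax” in numbers, here is a back-of-the-envelope sketch. The per-token price, the 10x reasoning multiplier and the characters-per-token figures are illustrative assumptions; substitute your own workload volumes and each provider's current pricing.

```python
def estimated_cost(input_chars: int, chars_per_token: float, output_ratio: float,
                   reasoning_multiplier: float, price_per_1k_tokens: float) -> float:
    """Rough cost estimate: a coarser tokenizer and longer reasoning traces both inflate token counts."""
    input_tokens = input_chars / chars_per_token
    output_tokens = input_tokens * output_ratio * reasoning_multiplier
    return (input_tokens + output_tokens) / 1000 * price_per_1k_tokens

# Same 10M-character workload, two hypothetical models: one with an efficient
# tokenizer and no extra reasoning, one with a coarser tokenizer that "thinks" 10x longer.
print(f"Efficient tokenizer, no reasoning: ${estimated_cost(10_000_000, 3.67, 1.0, 1.0, 0.01):,.0f}")
print(f"Coarse tokenizer, 10x reasoning:   ${estimated_cost(10_000_000, 2.50, 1.0, 10.0, 0.01):,.0f}")
```

Even with identical per-token pricing, the combination of a less efficient tokenizer and long reasoning traces can multiply the bill several times over for the same workload.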
Continuous evaluation in a moving landscape
The LLM landscape continues to evolve at extraordinary speed. By the time you read this, new models may already have been released that supersede some of our findings.
Continuous benchmarking isn't optional anymore. It's how you stay ahead.
At TrainAI by RWS, we combine technological understanding with human intelligence to deliver end-to-end data services for AI. Whether you need multilingual synthetic training data, expert data annotation or LLM benchmarking tailored to your use case, we're here to help.
Want the full breakdown?
Get all the details from TrainAI’s multilingual LLM synthetic data generation study 2.0, including the full methodology, LLM comparison results and model recommendations.
Check out our next blog, ‘Why there’s no “best” LLM for multilingual synthetic data’, where we explore how general benchmark leaderboards obscure sharp differences in LLM performance by task, language and workload.
