Understand how today’s LLMs perform across multiple tasks and languages

You’ll learn:
  • How leading LLMs perform across high- and low-resource languages
  • Which models excel at specific synthetic data generation tasks
  • Where LLMs outperform humans
  • How to choose the right LLM for global AI applications

Preview our previous study findings

Download TrainAI’s multilingual LLM synthetic data generation study 2.0


How does your preferred LLM stack up against the rest?

Our in-depth study used human expert evaluators to assess the outputs of 8 LLMs, benchmarked against human-created content, across 4 data generation tasks and 8 languages.
  • 25,600 samples of paragraphs, conversations or sentences
  • 76,800 annotations submitted
  • 211,200+ annotator ratings
  • 120 linguists

Here are 6 key takeaways from our study:

Overall: top performers across all tasks and languages

Here are the top performers overall across all 4 data generation tasks and 8 languages from our study.

Rank  Model              Average overall score (1–5, higher is better)
1     Gemini 2.5 Pro     4.73
2     Claude Sonnet 4.5  4.61
3     DeepSeek V3.1      4.51

Multilingual proficiency: the language gap is closing fast

The disparity between LLM performance on well-supported vs. underrepresented languages has narrowed dramatically. Gemini 2.5 Pro achieved scores above 4.5 out of 5 across multiple tasks in Kinyarwanda – a language in which previous model generations could barely produce coherent text. 

GPT-5 and Claude Sonnet 4.5 also showed meaningful improvements in long-tail languages. While challenges remain, the current generation of models signals that synthetic data generation and translation are becoming viable for a far broader range of languages than ever before.

Task alignment: one model doesn’t fit all

No single model dominated across all tasks and languages. GPT-5 excelled at text normalization and translation but struggled with content generation, particularly in Chinese and Polish. 

Gemini 2.5 Pro led in conversation generation and translation but was matched by Claude Sonnet 4.5 on domain-specific paragraph generation. Smaller models like Mistral Small 3.2 performed surprisingly well in certain languages while failing in others. Practitioners should evaluate models against their specific use cases rather than rely on overall rankings alone, as sketched below.
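To make that concrete, here is a minimal Python sketch of how a per-task leaderboard might be built from human ratings. The data layout and example rows are purely illustrative, not taken from the study; the point is to rank models per task rather than rely on a single overall average.

from collections import defaultdict
from statistics import mean

# Hypothetical human ratings on a 1-5 scale, one tuple per rating:
# (model, task, language, score). Example rows are illustrative only.
ratings = [
    ("gemini-2.5-pro", "conversation_generation", "kinyarwanda", 4.6),
    ("gpt-5", "text_normalization", "polish", 4.7),
    ("gpt-5", "content_generation", "polish", 3.8),
    ("claude-sonnet-4.5", "paragraph_generation", "german", 4.6),
]

# Group scores by (model, task) so each task gets its own ranking.
by_model_task = defaultdict(list)
for model, task, language, score in ratings:
    by_model_task[(model, task)].append(score)

# Print a per-task leaderboard instead of one overall average.
for task in sorted({t for _, t in by_model_task}):
    scored = sorted(
        ((mean(scores), model)
         for (model, t), scores in by_model_task.items() if t == task),
        reverse=True,
    )
    for avg, model in scored:
        print(f"{task:24s} {model:20s} {avg:.2f}")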

Humans vs. machines: LLMs surpass single-pass human quality

When human creators worked under realistic constraints – limited time, no secondary review, and no extensive research – the top LLMs consistently outscored them across most languages. 

Humans achieved steady scores of around 4.5 out of 5 but did not rank first in any language. This doesn't render human expertise obsolete; rather, it points to workflows in which LLMs produce strong first drafts and humans add value through review, refinement, and specialized judgment.

Cost: thinking harder costs more – tokenizer efficiency matters again

Reasoning models generate up to 10x more tokens during their "thinking" process, amplifying the cost impact of tokenizer efficiency. Gemini 2.5 Pro now leads with 3.67 characters per token – roughly 10% more efficient than GPT-5 and up to 3.5 times more efficient than Claude Sonnet 4.5 on non-Latin scripts. 

For high-volume or reasoning-intensive workloads, these differences translate directly into significant cost variation, making tokenizer efficiency a renewed consideration when selecting models.
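As a rough back-of-the-envelope illustration, the Python sketch below converts the characters-per-token figures above into relative output costs. Only the Gemini 2.5 Pro figure (3.67) comes from the study; the other two values are back-calculated from the relative claims quoted above, and the per-token price and 10x reasoning multiplier are assumptions for illustration only.

# Back-of-the-envelope cost comparison based on tokenizer efficiency.
# Only the 3.67 chars/token figure is from the study; the other two
# are derived from the relative claims above and are approximate.
CHARS_PER_TOKEN = {
    "Gemini 2.5 Pro": 3.67,
    "GPT-5 (derived)": 3.67 / 1.10,                        # ~3.34
    "Claude Sonnet 4.5, non-Latin (derived)": 3.67 / 3.5,  # ~1.05
}

PRICE_PER_MILLION_TOKENS = 10.0  # hypothetical flat price in USD
REASONING_MULTIPLIER = 10        # "up to 10x more tokens" while thinking

def output_cost(characters: int, chars_per_token: float,
                reasoning: bool = False) -> float:
    """Estimated cost of emitting `characters` characters of output."""
    tokens = characters / chars_per_token
    if reasoning:
        tokens *= REASONING_MULTIPLIER
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Cost to generate 1 million characters with a reasoning-heavy workload.
for model, cpt in CHARS_PER_TOKEN.items():
    print(f"{model:40s} ~${output_cost(1_000_000, cpt, reasoning=True):.2f}")

Under these assumptions, a 3.5x tokenizer gap becomes a 3.5x cost gap for the same visible output, which is why efficiency matters again for reasoning-heavy workloads.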

Benchmark drift: today’s upgrade might be tomorrow’s downgrade

Performance doesn't always improve linearly from one model generation to the next – and weaknesses don't always persist. GPT-5 fell behind smaller models on several content generation tasks where GPT-4o had been competitive. Conversely, Mistral and Llama models closed a 3–4x tokenizer efficiency gap that plagued their predecessors. 
 
Perhaps most striking: Llama 3.1 was the slowest model we measured in our previous study, yet Llama 4 Maverick now ranks as the fastest of all tested models. Model upgrades reshuffle strengths and weaknesses unpredictably, reinforcing the need to re-evaluate even familiar model families with each new release.

Get the full LLM study today