Every time a new model drops, the same ritual plays out. Teams scan the leaderboard, identify the top scorer and update their stack accordingly. The process is fast and decisive, but it often leads to wasted spend, misaligned performance and avoidable rework at scale.
In TrainAI’s multilingual LLM synthetic data generation study 2.0, we evaluated how leading large language models (LLMs) perform under real-world multilingual conditions. Each model was rated on a scale of one to five. The higher the score, the better the performance.
Overall, Gemini 2.5 Pro leads with an average score of 4.73 out of 5, followed by Claude Sonnet 4.5 at 4.61 and DeepSeek V3.1 at 4.51.
Those numbers are real, and they matter. But they only tell part of the story. Read on to learn why enterprise teams building multilingual AI systems can’t assume that a single “best” LLM will work best in every context.
What our LLM study 2.0 actually tested
In this study, our team tested 8 LLMs on 4 complex text generation and manipulation tasks across 8 languages. We established a robust baseline by enlisting human creators to work on the same tasks under realistic conditions.
To give you a sense of the study’s scale, here are some hard numbers on the volume of outputs, annotations and evaluators (a quick breakdown of how these totals decompose follows the list):
- 25,600 output samples of paragraphs, conversations or sentences
- 76,800 annotations submitted
- 211,200 annotator ratings
- 120 linguists
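
For readers who like to check the arithmetic, here is a minimal sketch of one way those totals could decompose. The per-combination sample count and annotations-per-sample figure are inferences that make the published totals line up; they are assumptions for illustration, not numbers stated in the study.

```python
# Back-of-the-envelope decomposition of the study's published totals.
# The 100 samples per combination and 3 annotations per sample are inferred
# assumptions that make the arithmetic work; the study does not state them.
models, tasks, languages = 8, 4, 8
combinations = models * tasks * languages          # 256 model/task/language cells

samples_per_combination = 25_600 // combinations   # -> 100 (inferred)
annotations_per_sample = 76_800 // 25_600          # -> 3 (inferred)
ratings_per_annotation = 211_200 / 76_800          # -> 2.75, average across the dataset

print(combinations, samples_per_combination, annotations_per_sample, ratings_per_annotation)
# -> 256 100 3 2.75
```

The ratings-per-annotation figure is only a dataset-wide average; the study does not break down how individual ratings map to annotations.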
Some languages, such as English and French, are considered high-resource and have strong LLM support. However, the project also included low-resource languages like Kinyarwanda and Tamil.
The four tasks we tested model performance on were:
- Domain-specific paragraph generation
- Conversation generation
- Text normalization (converting text to spell out numbers, acronyms and more for better text-to-speech performance; see the short sketch after this list)
- Translation
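
To make the text normalization task concrete, here is a toy Python sketch of the kind of transformation involved. The expansion tables and the example sentence are illustrative only; they are not the rules or data used in the study.

```python
import re

# Toy illustration of text normalization for text-to-speech:
# spell out numbers and expand acronyms/abbreviations so a TTS voice
# reads them naturally. These mappings are hypothetical examples.
EXPANSIONS = {"LLM": "L L M", "GPU": "G P U", "approx.": "approximately"}
NUMBER_WORDS = {"2": "two", "4": "four", "8": "eight"}

def normalize(text: str) -> str:
    # Expand known acronyms and abbreviations first.
    for written, spoken in EXPANSIONS.items():
        text = text.replace(written, spoken)
    # Spell out standalone digits found in the lookup table.
    def spell(match: re.Match) -> str:
        return NUMBER_WORDS.get(match.group(0), match.group(0))
    return re.sub(r"\b\d+\b", spell, text)

print(normalize("The LLM ran on 8 GPUs for approx. 2 days."))
# -> "The L L M ran on eight G P Us for approximately two days."
```

A production normalizer would also need to handle dates, currencies, ordinals and language-specific conventions; the toy version above only shows the general idea.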
The result is a large dataset of annotations from many different linguists across multiple languages and tasks. That breadth gives us several perspectives on the data and reveals interesting patterns. Most importantly, it lets us go beyond aggregated overall scores, which can hide crucial differences among models.
LLM multilingual performance is improving – but not uniformly
Task alignment reshuffles the rankings
Cost and tokenizer efficiency reshape the trade-offs
Benchmark drift makes one-time decisions risky
Rethinking AI model selection: context over rank
- Which languages are mission-critical, and where does quality failure have the highest business impact?
- Which task dominates your pipeline, and does your leading model actually excel at that task?
- Could you use multiple LLMs for specific tasks in your workflow and measure each separately? (See the routing sketch after these questions.)
- What's your tolerance for cost variance, particularly for reasoning-intensive workflows or workloads in non-Latin scripts?
- How often are you re-benchmarking, given the pace of model releases?
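
If routing different tasks to different models sounds abstract, here is a minimal Python sketch of what per-task routing with separate measurement could look like. The task keys, placeholder model names and the rating flow are assumptions for illustration, not an implementation from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task-to-model routing table; the model names are placeholders,
# not recommendations from the study.
ROUTING = {
    "translation": "model-a",
    "text_normalization": "model-b",
    "conversation_generation": "model-c",
}

ratings = defaultdict(list)  # task -> human quality ratings on a 1-5 scale

def pick_model(task: str) -> str:
    # Route each pipeline task to the model that benchmarks best for it.
    return ROUTING[task]

def record_rating(task: str, score: int) -> None:
    # Track quality per task so each routed model is measured separately.
    ratings[task].append(score)

def report() -> dict:
    # Average rating per task, mirroring a 1-5 scoring scale.
    return {task: round(mean(r), 2) for task, r in ratings.items()}

record_rating("translation", 5)
record_rating("translation", 4)
record_rating("text_normalization", 3)
print(pick_model("translation"))  # -> model-a
print(report())                   # -> {'translation': 4.5, 'text_normalization': 3}
```

The point is less the code than the discipline: each task gets its own model choice and its own quality metric, so a later re-benchmark can swap one route without disturbing the others.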
Build smarter, not just faster: LLM selection for multilingual AI workloads

Author
Tomáš Burkert
Head of Innovation, TrainAI by RWS
