One model, all languages, all tasks? The myth of the universal LLM

Every time a new model drops, the same ritual plays out. Teams scan the leaderboard, identify the top scorer and update their stack accordingly. The process is fast and decisive, and it often leads to wasted spend, misaligned performance and avoidable rework at scale.

In TrainAI’s multilingual LLM synthetic data generation study 2.0, we evaluated how leading large language models (LLMs) perform under real-world multilingual conditions. Each model was rated on a scale of one to five; the higher the score, the better the performance.

Overall, Gemini 2.5 Pro leads with an average score of 4.73 out of 5, followed by Claude Sonnet 4.5 at 4.61 and DeepSeek V3.1 at 4.51.

Those numbers are real, and they matter. But they only tell part of the story. Read on to learn why enterprise teams building multilingual AI systems can’t assume that a single “best” LLM will work best in every context.

What our LLM study 2.0 actually tested

In this study, our team tested 8 LLMs on 4 complex text generation and manipulation tasks across 8 languages. We established a robust baseline by enlisting human creators to work on the same tasks under realistic conditions.

To give you a sense of the study’s scale, here are some hard numbers on volume of outputs, annotations and evaluators:

  • 25,600 output samples of paragraphs, conversations or sentences
  • 76,800 annotations submitted
  • 211,200 annotator ratings
  • 120 linguists

Some languages, such as English and French, are considered high-resource and have strong LLM support. However, the project also included low-resource languages like Kinyarwanda and Tamil.

The four tasks we tested model performance on were: 

  • Domain-specific paragraph generation
  • Conversation generation
  • Text normalization (converting text to spell out numbers, acronyms and more for better text-to-speech performance; see the sketch after this list)
  • Translation
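
To make the text normalization task concrete, here is a minimal sketch of the kind of transformation involved. It uses the open-source num2words package plus a hand-rolled acronym map; both are our own illustration, not the pipeline used in the study.

```python
import re
from num2words import num2words

# Hypothetical acronym expansions -- a real system would use a curated lexicon.
ACRONYMS = {"LLM": "large language model", "TTS": "text to speech"}

def normalize(text: str, lang: str = "en") -> str:
    # Spell out standalone integers, e.g. "8" -> "eight".
    text = re.sub(r"\b\d+\b", lambda m: num2words(int(m.group()), lang=lang), text)
    # Expand known acronyms so a TTS engine reads them naturally.
    for short, full in ACRONYMS.items():
        text = re.sub(rf"\b{short}\b", full, text)
    return text

print(normalize("We tested 8 models for TTS."))
# -> "We tested eight models for text to speech."
```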

The result is a large dataset of annotations from different linguists across different languages and tasks. These multiple perspectives on the data reveal interesting patterns and, most importantly, let us go beyond aggregated overall scores, which can hide crucial differences among models.
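
To see how an aggregate can hide those differences, consider a toy example with invented scores (not the study’s data): two models can share an identical overall mean while behaving very differently per language.

```python
import pandas as pd

# Toy 1-5 scores, invented for illustration only.
df = pd.DataFrame({
    "model":    ["A", "A", "B", "B"],
    "language": ["English", "Tamil", "English", "Tamil"],
    "score":    [4.8, 3.2, 4.0, 4.0],
})

print(df.groupby("model")["score"].mean())
# Models A and B both average 4.0 overall...

print(df.pivot(index="model", columns="language", values="score"))
# ...but the per-language view shows A is strong in English and weak in Tamil,
# while B is uniformly middling -- a difference invisible in the aggregate.
```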

LLM multilingual performance is improving – but not uniformly

Here’s the good news: the gap between high-resource and underrepresented languages is closing faster than most practitioners expected. Gemini 2.5 Pro scored above 4.5 across multiple tasks in Kinyarwanda, a language that earlier model generations struggled to handle coherently. GPT-5 and Claude Sonnet 4.5 showed meaningful gains there, too.
 
But improvement isn't uniform, and that distinction matters. An LLM that excels in English conversation generation may perform very differently when generating domain-specific content in Polish or handling translation in Tamil.
 
Multilingual AI performance depends on the specific combination of language, task and model, so results don’t necessarily carry over cleanly across use cases. Treating multilingual performance like a checkbox is a planning risk that can compound problems when you start to scale your AI project.

Task alignment reshuffles the rankings

The LLMs’ overall scores also break down once you start looking at task-specific workloads. As with language-specific performance, the leaderboard shifts significantly when you drill down to the task level.
 
GPT-5, for example, excelled at text normalization and translation but struggled with content generation, particularly in Chinese and Polish. Meanwhile, Gemini 2.5 Pro did extremely well with conversation generation, and Claude Sonnet 4.5 matched it on domain-specific paragraph generation.
 
Mistral Small 3.2 was competitive with state-of-the-art models in English and French and didn’t trail too far behind them in Arabic and Chinese. This could be an indicator that small models can work well for specialized tasks, languages or workflows.
 
Regardless, no single LLM dominated across the board. If your primary task is building conversational training data for a multilingual virtual assistant, your model ranking would look completely different than if you were normalizing text at high volume.
 
Task alignment should be the first filter in data annotation and labeling workflows before the overall score enters the conversation.

Cost and tokenizer efficiency reshape the trade-offs

Reasoning models generate significantly more tokens per request than non-reasoning ones, sometimes up to 10 times more, because they produce internal “thinking” tokens before delivering an answer. This additional chain of thought provides a measurable increase in performance, but the extra computation further amplifies any inefficiencies caused by poor tokenizers, especially at scale.
 
Tokenizer performance, measured as the average number of characters encoded by a single token, can vary widely even within Latin‑script languages. For instance, all major LLMs achieve roughly half the tokenizer efficiency in Kinyarwanda compared to English.
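
Measuring this is straightforward once you have access to a model’s tokenizer. Here is a minimal sketch using OpenAI’s open-source tiktoken package as one concrete example (other vendors ship their own tokenizers, and the sample sentences are our own):

```python
import tiktoken

# One publicly available tokenizer, used here purely for illustration.
enc = tiktoken.get_encoding("cl100k_base")

def chars_per_token(text: str) -> float:
    # Tokenizer efficiency: average number of characters encoded per token.
    return len(text) / len(enc.encode(text))

# Comparable content in different languages exposes the efficiency gap.
print(chars_per_token("The weather is lovely today."))  # English
print(chars_per_token("الطقس جميل اليوم."))              # Arabic, typically lower
```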
 
LLM usage is typically billed by the number of tokens processed, so tokenizer efficiency directly affects costs.
 
That means even the most capable model isn’t always the most cost-effective one. Choosing a model on multiple criteria, even if it isn’t the leader on quality metrics alone, can unlock substantial savings without significantly compromising quality.
 
For example, while Claude Sonnet 4.5 competes with the best models in terms of quality, it exhibits poor tokenizer efficiency in Arabic, encoding only 1.48 characters per token. By comparison, Gemini 2.5 Pro is significantly more efficient in Arabic, encoding 3.23 characters per token.
 
In practice, if an identical sentence is processed by both models, Claude Sonnet 4.5 will consume more than twice as many tokens as Gemini 2.5 Pro, and therefore incur more than twice the cost.
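
A quick back-of-the-envelope check, using the Arabic efficiency figures above and assuming equal per-token pricing for simplicity, shows where that multiple comes from:

```python
chars = 100_000                # the same Arabic input sent to both models

claude_tokens = chars / 1.48   # ~67,600 tokens at 1.48 characters per token
gemini_tokens = chars / 3.23   # ~31,000 tokens at 3.23 characters per token

print(f"Cost multiple: {claude_tokens / gemini_tokens:.2f}x")  # -> 2.18x

# If a reasoning model also emits up to 10x more "thinking" tokens, that
# overhead multiplies on top of the tokenizer gap.
```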
 
We observed significant differences between models, especially for languages using non-Latin scripts, which can reshuffle the rankings for cost-sensitive or large-scale AI deployments.
 
We urge any company offering LLM-supported products to consider the tokenizer efficiency of the models it uses, especially if it serves global audiences.

Benchmark drift makes one-time decisions risky

Here's where leaderboard-driven decision-making becomes risky. Our study results demonstrate that LLM performance doesn't improve linearly from one model generation to the next.
 
Model weaknesses don’t always persist across releases, but there’s no guarantee that new ones won’t emerge.
 
Consider the trajectory of Meta’s Llama. In our previous LLM benchmarking study, Llama 3.1 ranked as the slowest model tested. In study 2.0, Llama 4 Maverick ranks as the fastest of all models evaluated, a complete and unexpected reversal.
 
GPT-5, on the other hand, fell behind smaller models on several content generation tasks where GPT-4o had been competitive.
 
Strengths and weaknesses shuffle unpredictably with each model release. A model selection decision made six months ago may already be stale, and today’s winner won’t necessarily be tomorrow’s best choice.
 
This means model selection isn’t a one-time decision – it’s an ongoing operational risk that requires continuous evaluation.

Rethinking AI model selection: context over rank

So, what does a smarter evaluation process actually look like? Start by replacing the general question of "which model is best?" with a set of more specific ones.
 
Before committing to any LLM for a multilingual use case, your team should be asking:
  • Which languages are mission-critical, and where does quality failure have the highest business impact?
  • Which task dominates your pipeline, and does your leading model actually excel at that task?
  • Could you use multiple LLMs for specific tasks in your workflow and measure each separately?
  • What's your tolerance for cost variance, particularly for reasoning-intensive workflows or workloads in non-Latin scripts?
  • How often are you re-benchmarking, given the pace of model releases?

The answers reframe LLM benchmarking and selection as an ongoing discipline rather than an IT procurement decision.
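
One way to operationalize these questions is to keep a benchmark matrix keyed by language and task, and pick the best model per cell rather than one global winner. Here is a minimal sketch with placeholder scores we invented for illustration:

```python
# Per-(language, task) model selection instead of a single global winner.
# All scores below are invented placeholders -- substitute your own benchmarks.
scores = {
    ("English", "conversation"):  {"gemini-2.5-pro": 4.8, "gpt-5": 4.6},
    ("Tamil",   "translation"):   {"gemini-2.5-pro": 4.2, "gpt-5": 4.5},
    ("Arabic",  "normalization"): {"gemini-2.5-pro": 4.4, "gpt-5": 4.7},
}

def pick_model(language: str, task: str) -> str:
    cell = scores[(language, task)]
    return max(cell, key=cell.get)

print(pick_model("English", "conversation"))  # -> gemini-2.5-pro
print(pick_model("Tamil", "translation"))     # -> gpt-5
```

Re-running the benchmarks and refreshing this matrix on a schedule is what turns model selection into the ongoing discipline described above.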
 
Furthermore, human expertise remains central to this process, both from an evaluation standpoint and from an error reduction standpoint. For example, human-in-the-loop data validation ensures that even top-performing models produce training data you can actually depend on, with expert review identifying the gaps that automated scoring misses.
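
That review step can be wired directly into a pipeline: anything the automated scorer is unsure about gets routed to a linguist queue. A minimal sketch, with an illustrative threshold of our own choosing:

```python
# Route low-scoring synthetic samples to human review rather than accepting
# automated scores blindly. The threshold is illustrative, not prescriptive.
REVIEW_THRESHOLD = 4.0  # on the study's 1-5 scale

def triage(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    accepted, needs_review = [], []
    for sample in samples:
        bucket = accepted if sample["auto_score"] >= REVIEW_THRESHOLD else needs_review
        bucket.append(sample)
    return accepted, needs_review

batch = [{"id": 1, "auto_score": 4.6}, {"id": 2, "auto_score": 3.1}]
ok, review = triage(batch)
print(f"{len(ok)} accepted, {len(review)} sent to expert linguists")
```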

Build smarter, not just faster: LLM selection for multilingual AI workloads

The key lesson from TrainAI’s multilingual LLM synthetic data generation study 2.0 is that a single, aggregate score is an incomplete basis for decision-making. Language variability, task alignment, tokenizer efficiency and benchmark drift all shape real-world outcomes in ways that overall rankings don't reflect.
 
RWS's TrainAI team has spent years building the infrastructure to support exactly this kind of contextual evaluation, from synthetic data generation and collection to human validation at scale, across 500+ language pairs and variants.
 
If you're currently evaluating LLMs for a multilingual deployment, download the full study today. And if you need LLM benchmarking data tailored to your specific use case, get in touch with TrainAI. We'll help you find the right model for the right job.

Author

Tomáš Burkert

Head of Innovation, TrainAI by RWS

Tomáš leads innovation for RWS's TrainAI data services practice, which delivers complex, cutting-edge AI training data solutions to global clients operating across a broad range of industries. His mission is to understand even the most complex client needs and work with the TrainAI team to successfully design, execute and deliver a wide range of AI data services projects.
 
Tomáš has over a decade of experience in localization and several years serving major big tech clients in the AI data services space. He holds a master's degree in English Language Translation from Masaryk University in Brno, Czechia.