LLM benchmarking leaderboard: Languages, creativity and tasks


If you’ve spent any time working with large language models (LLMs) lately, you’ve probably noticed the competition is fiercer and more confusing than ever.
TrainAI’s LLM Synthetic Data Generation Study set out to provide some clarity by benchmarking LLMs across different scenarios and use cases, from generating synthetic textual data to translation and text normalization. And the results? While Claude, GPT-4o, and Gemini took the top spots, there were some eyebrow-raising surprises, especially around instruction following and linguistic proficiency. For teams evaluating LLMs, especially in multilingual settings, our LLM benchmarking report provides valuable insights.
What makes our study unique?
Firstly, in a field where benchmarks predominantly test English, our study evaluated the LLMs in eight languages across diverse language families, including less represented languages such as Tamil and Kinyarwanda.
Additionally, we tested the models’ natural language generation and processing capabilities through synthetic data generation and natural language processing (NLP) tasks. This approach provides a more nuanced view of how well a model “speaks” a given language than evaluating closed questions or natural language understanding alone. To draw a parallel from human language learning, this is akin to testing for both passive and active knowledge of the language.
Moreover, we deliberately avoided the common "AI evaluates AI" paradigm by employing professional linguistic experts to perform the LLM benchmarking. Human evaluation, surprisingly rare in current LLM evaluations, ensures reliable judgment of language quality, which is particularly relevant for human end-users.
We evaluated state-of-the-art models (Anthropic Claude, Google Gemini, OpenAI GPT-4o) alongside models of various sizes (Meta Llama), architectures (AI21 Jamba), and provenances (Mistral Large). All models underwent testing on six tasks ranging from easy (generate a simple sentence of at least 10 words) to complex (generate a 10-turn conversation or normalize text for text-to-speech engines). Let’s explore what the results revealed across tasks, creativity and languages.
Language: Claude leads the pack, with GPT-4o and Gemini Pro not far behind
Claude 3.5 Sonnet decisively took the lead, winning or tying for first place in six of the eight languages we tested. On our 5-point Likert scale, it consistently scored above 4.0 in all languages except Kinyarwanda. Despite its strong performance, however, it was in some cases bested or closely followed by other models, most commonly the silver and bronze medalists: GPT-4o and Gemini 1.5 Pro, respectively. In multiple cases, Claude Sonnet either tied with or narrowly outperformed the models from OpenAI and Google; in Tagalog, for example, it ended up in third place behind Meta Llama 405B and GPT-4o. The differences among the three medalists tended to be slim, especially in higher-scoring languages such as English, French, or Polish.
Smaller models, surprising savings
Surprisingly, OpenAI’s GPT-4o showed only minor quality differences compared to its smaller GPT-4o mini variant (4.28 vs. 4.11 overall score), despite the latter being roughly 15 times cheaper. For many use cases, this tradeoff between a slight quality reduction and substantial cost savings could be advantageous. For example, Arabic task scores were identical for GPT-4o and GPT-4o mini.
Creativity: Conversations challenge models more than single sentences
A key aspect of synthetic data generation at scale is variability, or how different individual sentences are from each other. Our analysis, split between single-sentence generation and conversation generation, revealed significant differences (up to 0.4 points). Most models showed higher variability in single-sentence tasks, but the Meta Llama models, possibly due to extensive conversational training, had higher variability in conversations. Gemini 1.5 Pro nonetheless emerged as the most creative overall.
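The study’s exact variability metric isn’t reproduced here, but as a rough illustration of the idea, a diversity score can be computed from pairwise lexical overlap between generated sentences. The Python sketch below is a minimal, hypothetical example (a simple Jaccard-based measure, not the metric used in the study): higher values mean the model repeats itself less.

from itertools import combinations

def diversity(sentences):
    # Illustrative diversity score: 1 minus the average pairwise Jaccard
    # overlap of word sets. Higher means the sentences differ more.
    # (A simplified stand-in, not the study's variability metric.)
    word_sets = [set(s.lower().split()) for s in sentences]
    overlaps = [len(a & b) / len(a | b)
                for a, b in combinations(word_sets, 2) if a | b]
    return 1 - sum(overlaps) / len(overlaps) if overlaps else 0.0

samples = [
    "The cat dozed quietly on the warm windowsill this morning.",
    "A delivery truck rumbled past the bakery just before dawn.",
    "The cat dozed quietly on the warm windowsill this evening.",
]
print(round(diversity(samples), 3))  # near 1.0 = varied, near 0.0 = repetitive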
Task instruction adherence: Claude excels where others stumble
LLMs are notoriously unreliable at counting characters (try asking one to count the letter “r” in “strawberry”). Perhaps more interestingly, we observed that LLMs also struggle with counting words. Our data generation tasks required the models to generate sentences of at least 10 words; most models produced at least some sentences that fell short, even in English, with failure rates in the single-digit percentages. The only exception was Claude 3.5 Sonnet, which consistently met the word count instruction in every language except Tamil. As we discuss in detail in the study, words are not the most reliable cross-lingual measure of information load or density, but word counting still remains a very practical tool in most languages.
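To illustrate how simple this check is for an evaluator, and how fragile it is for a model, the hypothetical snippet below (not part of the study’s evaluation pipeline) flags sentences that fall short of a whitespace-based word count. Whitespace tokenization itself breaks down for languages that do not delimit words with spaces, which is part of why word counts travel poorly across languages.

def meets_min_words(sentence: str, minimum: int = 10) -> bool:
    # Naive whitespace word count; illustrative only, and unreliable for
    # languages without space-delimited words.
    return len(sentence.split()) >= minimum

outputs = [
    "Large language models still miscount their own words surprisingly often today.",
    "Short answer here.",
]
failures = [s for s in outputs if not meets_min_words(s)]
print(f"{len(failures)} of {len(outputs)} sentences fall short of 10 words")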
Overall guidance and recommendations
Bigger isn’t always better: Rethinking scaling laws
Scaling laws suggest that larger neural networks with more data and compute yield better performance. Put simply: bigger is better. While our Meta Llama 3.1 evaluations (8B, 70B and 405B) mostly confirmed this, we observed instances where the 70B model matched or slightly outperformed the 405B model. This was especially evident in English, where the largest model tended to overinterpret instructions or select an inappropriate language register.
Why rigorous, multilingual testing matters
Companies, LLM practitioners and AI departments must rigorously test for their specific AI use cases instead of relying on generic LLM benchmarks. With LLMs increasingly deployed in customer-facing scenarios, comprehensive multilingual testing is critical. Because LLM use is open-ended and unpredictable, extensive functional testing is essential to ensure reliable performance across diverse inputs and languages. Ideally, LLM benchmarking should also include in-language safety evaluations (red-teaming) to ensure that fine-tuning modifications maintain or enhance foundational safety.
Dive deeper into the full study
Interested in additional takeaways, specific model results, or performance across individual languages? Download the full 138-page TrainAI LLM Synthetic Data Generation Study, which includes our prompts and annotation guidelines.
Stay tuned for our next blog, LLM Benchmarking Beyond the Basics, where we'll examine current benchmarking practices, highlight their limitations, and share best practices for more effective LLM evaluations.