How leading LLMs compare: Key takeaways from TrainAI’s LLM benchmarking study

Did you know that even the biggest companies behind today’s state-of-the-art large language models (LLMs) are running out of data¹ to train their newest models? One approach to mitigating this risk, already used by LLM companies like OpenAI, Anthropic, and Google, is synthetic data generated by the LLMs themselves.
To evaluate whether LLMs are viable sources of synthetic data, TrainAI tested the ability of several popular LLMs to generate sentences and conversations, and assessed their general natural language processing (NLP) skills across a variety of languages using human expert evaluators. Our LLM benchmarking study aims to provide insights that can serve as a starting point for further validating the use of LLM-generated synthetic data for specific AI use cases.
The methodology behind the study
TrainAI’s LLM synthetic data generation study tested nine leading LLMs across practical data generation tasks in eight carefully selected languages. Unlike typical automated LLM benchmarks that rely on closed-question formats or AI judge models, this study focused on open-ended, natural language tasks evaluated by human expert evaluators.
The LLMs tested included AI21’s Jamba 1.5 Large, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini Pro 1.5, Mistral’s Mistral Large 2, Meta’s Llama 3.1 405B, 70B, and 8B models, and OpenAI’s GPT 4o and GPT 4o mini.
Each model was evaluated across six tasks: simple sentence generation, generation of sentences with specific entities, domain-specific sentence creation, conversation generation, text normalization, and translation.
The LLM evaluation was conducted in English, Arabic, Simplified Chinese, French, Kinyarwanda, Polish, Tagalog, and Tamil. Prompts were originally crafted in English and then translated into the remaining languages by professional linguists. For each language, three native-speaking linguists evaluated the LLM-generated outputs against specific criteria (such as grammar and naturalness) without knowing which model generated which outputs.
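The report doesn’t publish its scoring pipeline, but the aggregation it implies is simple. Below is a minimal Python sketch, assuming a 1-to-5 rating scale averaged across the three blind raters and then rolled up per model; the rating values and dictionary keys are hypothetical, used only to illustrate the roll-up:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (model, language, task) -> the three linguists' 1-5 scores
# for one generated output. The real study has many such entries per combination.
ratings = {
    ("Claude 3.5 Sonnet", "French", "conversation_generation"): [5, 4, 5],
    ("GPT 4o", "French", "conversation_generation"): [4, 4, 5],
}

# Average the three blind ratings for each output, then average per model.
per_model = defaultdict(list)
for (model, language, task), scores in ratings.items():
    per_model[model].append(mean(scores))

for model, output_scores in per_model.items():
    print(f"{model}: {mean(output_scores):.2f} / 5")
```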
Now let’s compare today’s most popular LLMs and look at what the results mean for anyone using them to create content, answer questions, or work in multiple languages.
The LLM benchmarking study reveals 6 key takeaways
1. Overall: top performers across all tasks and languages
Claude 3.5 Sonnet emerged as the overall front-runner in our study, achieving the highest average score (4.40 out of 5) across six data generation tasks and eight languages. It ranked first or tied for first in six of the eight languages evaluated and consistently scored above 4.0 in all but one – Kinyarwanda.
However, while Claude 3.5 Sonnet set the benchmark, it wasn’t untouchable. GPT 4o (4.28) and Gemini Pro 1.5 (4.26) followed closely, even outperforming Claude on specific tasks. The key takeaway? While Claude may lead overall, the “best” model depends on your specific use case and language requirements. No single model dominated every scenario, reinforcing the importance of benchmarking LLMs against the tasks and locales that matter most to your AI application.
2. Language proficiency: English and French are easy – for others, choose wisely
Overall scores show that all models perform well in languages such as English and French, especially on simpler tasks. However, performance across less represented languages (Arabic, Chinese, Polish, Tagalog) was more mixed, highlighting the importance of testing multiple LLMs when targeting specific markets.
Some models (Llama 3.1 70B and 8B, and Jamba) scored below 4.0 out of 5.0 in less represented languages, making them unsuitable for high-quality data generation in those languages. In low-resource languages like Kinyarwanda, every model we tested proved largely unusable. Notably, Claude Sonnet was the only model that consistently delivered high-quality generated data in Tamil.
In addition, some models underperformed on more complex tasks, even in English. Claude Sonnet and GPT 4o handled these tasks more reliably, making them the stronger choices across languages.
3. Instruction adherence: some LLMs listen better than others
Not all LLMs are equally good at following instructions, and performance on synthetic data generation varies by task and use case. For instance, when prompted to generate sentences with a minimum length of 10 words, most models failed to meet this requirement consistently, even in English.
While specifying sentence length by word count isn’t ideal in a multilingual context, since the information density of a word varies across languages, it is still notable that Claude Sonnet was the only model to reliably meet the minimum word count in all languages except Tamil.
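Adherence checks like this are easy to automate. The sketch below is a rough approximation rather than the study’s actual harness: it flags generated sentences that fall short of a 10-word minimum, using whitespace splitting as a crude word count (which, as noted above, breaks down for languages written without spaces):

```python
def meets_min_word_count(sentence: str, min_words: int = 10) -> bool:
    """Crude adherence check: split on whitespace and count the pieces.

    Whitespace splitting only approximates 'words' and fails for languages
    written without spaces (e.g. Chinese), which is one reason word-count
    constraints are awkward in multilingual prompts.
    """
    return len(sentence.split()) >= min_words

generated = [
    "The weather forecast predicts light rain across the northern coast tomorrow.",
    "Please confirm the booking.",  # too short: violates the instruction
]

for sentence in generated:
    status = "ok" if meets_min_word_count(sentence) else "too short"
    print(f"{status}: {sentence}")
```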
Regarding sentence length, Claude Sonnet, the Meta Llama models, and Gemini Pro tended to produce longer outputs than other models, with up to a 20% difference in character count, despite receiving identical prompts.
That said, model choice should always be guided by your specific use case. While Gemini Pro showed strong results in translation and text normalization tasks, it underperformed in conversation generation, highlighting the importance of task-specific evaluation.
4. Creativity: natural language variability declines with output length
When it comes to generating natural language at scale, variety matters. To evaluate the creativity of the LLMs, we used a variability score: a measure of how different individual sentences or conversations are from one another. All the models we tested delivered less varied data for tasks requiring longer outputs, with a distinct decline in variability between single-sentence generation and conversation generation. This limitation should be considered when using LLM synthetic data generation for real-world applications.
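The report doesn’t define the variability score precisely; a common proxy is average pairwise dissimilarity between outputs. The sketch below uses Jaccard distance over word sets purely to illustrate the idea, not as the study’s actual metric:

```python
from itertools import combinations

def jaccard_distance(a: str, b: str) -> float:
    """1 minus word-set overlap: 0 for identical wording, 1 for no shared words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def variability_score(outputs: list[str]) -> float:
    """Mean pairwise distance across all outputs in a batch."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

sentences = [
    "The train to Lyon leaves at seven in the morning.",
    "Our flight to Madrid departs just after midnight.",
    "The train to Lyon leaves at seven in the morning.",  # duplicate drags the score down
]
print(f"variability: {variability_score(sentences):.2f}")
```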
Overall, Gemini Pro outperformed the rest in variability. As well as scoring the highest average across six languages, it also produced the smallest volume of low-variability outputs by far.
5. Speed: not all LLMs win the race
Speed matters, especially if you're deploying LLMs in real-time environments. When measuring the speed at which sentences were generated, we found that the Meta Llama models generated data at a noticeably slower pace than the other models. The 70B and 405B models were, respectively, up to 3x and 10x slower than most of the rest. GPT 4o, however, won the race, even occasionally exceeding the speed of GPT 4o mini.
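Generation speed is also straightforward to measure yourself. Here is a minimal sketch, assuming a generate() callable that wraps whichever model API you are testing; the callable and sample prompt are placeholders, not part of the study:

```python
import time

def benchmark(generate, prompts, runs_per_prompt=3):
    """Time a generation callable and return mean seconds per output.

    `generate` stands in for whatever client call wraps the model under test,
    e.g. a function that sends one prompt and returns one completion string.
    """
    timings = []
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            start = time.perf_counter()
            generate(prompt)
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example with a stub in place of a real model call:
mean_seconds = benchmark(lambda p: p.upper(), ["Generate one short sentence about trains."])
print(f"mean latency: {mean_seconds:.4f}s per output")
```

In practice you may also want to normalize by output length (characters or tokens per second), since, as noted above, some models produce noticeably longer outputs for the same prompt.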
6. Cost: token bloat leads to budget bloat
Lower tokenizer efficiency means more tokens for the same text, and since usage is typically billed per token, higher costs. We saw striking differences in tokenizer efficiency between models, even in English. For languages with complex writing systems (such as Tamil), the differences were significant: we found up to a 450% increase in token usage between the most and least efficient model tokenizers, which equates to a 450% increase in costs.
GPT 4o’s tokenizer was the top performer. Gemini Pro’s tokenizer slightly outperformed GPT 4o’s on Polish and Simplified Chinese and was competitive in most other languages. The remaining model tokenizers demonstrated significantly lower performance compared to Gemini Pro and GPT 4o.
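You can compare tokenizer efficiency on your own text before committing to a model. The sketch below uses the tiktoken library, which only ships OpenAI encodings, so it contrasts GPT 4o’s tokenizer (o200k_base) with the older cl100k_base encoding as an illustration; other vendors’ tokenizers require their own libraries, and the Tamil sample is just an example string:

```python
# pip install tiktoken
import tiktoken

gpt4o_enc = tiktoken.get_encoding("o200k_base")   # encoding used by GPT 4o
older_enc = tiktoken.get_encoding("cl100k_base")  # older OpenAI encoding, for contrast

samples = {
    "English": "Hello, how are you today?",
    "Tamil": "வணக்கம், இன்று எப்படி இருக்கிறீர்கள்?",  # the same greeting in Tamil
}

for language, text in samples.items():
    print(
        f"{language}: o200k_base={len(gpt4o_enc.encode(text))} tokens, "
        f"cl100k_base={len(older_enc.encode(text))} tokens"
    )
```

Because API pricing is per token, the ratio of those counts is, to a first approximation, the ratio of costs for the same volume of generated text.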
LLM recommendations for synthetic data generation
No single model came out on top across every task and language, but some consistently performed better on key priorities like instruction adherence, speed, creativity, and cost efficiency. With the LLM landscape evolving rapidly, new models may already be reshaping the picture. That’s why it’s essential to benchmark multiple models against your specific use case, because what works best for one application or language might not be the right fit for another.
Download the full report now
Get all the details from TrainAI’s LLM synthetic data generation study, including the full methodology, LLM comparison results, and model recommendations.
Don’t miss our next blog, The LLM Benchmarking Leaderboard, where we take a closer look at how each model performed across tasks, creativity, and languages to reveal who’s really leading the pack.
References:
¹ Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Position: Will we run out of data? Limits of LLM scaling based on human-generated data. Proceedings of Machine Learning Research, 235:49523–49544. proceedings.mlr.press/v235/villalobos24a