How does your preferred LLM stack up against the rest?
Our in-depth study assessed 9 LLMs across 6 data generation tasks and 8 languages using human expert evaluators.

38,000 sentences generated
115,000 submitted annotations
>250,000 ratings
27 linguists

Here are 6 key takeaways from our study:


Overall: top performers across all tasks and languages
Here are the top performers overall across all 6 data generation tasks and 8 languages from our study.
Rank | Model             | Average overall score (out of 5.0)
-----|-------------------|------------------------------------
1    | Claude 3.5 Sonnet | 4.40
2    | GPT-4o            | 4.28
3    | Gemini 1.5 Pro    | 4.26


Language proficiency: English and French are easy – for others, choose wisely
Overall scores indicate that all the models perform very well in languages such as English and French, especially on simpler tasks. Some models underperformed on more complex tasks, even in English.
Performance in less-represented languages (Arabic, Chinese, Polish and Tagalog) was more varied, reinforcing the need to test multiple LLMs against your specific use case.


Instruction adherence: some LLMs listen better than others
LLMs vary in how reliably they follow instructions for synthetic data generation, depending on the task and use case. For example, when we tasked the models with generating sentences at least 10 words long, most could not reliably meet this requirement, even in English. Quality and performance declined further in languages such as Arabic, Polish and Tamil.
Your choice of model will depend on your specific use case.
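
To make the check concrete, here is a minimal sketch of how a length-instruction test like this can be scripted. It is not our evaluation harness: the sample sentences are hypothetical, and splitting on whitespace is only a rough word-count proxy that breaks down for languages without space-delimited words.

```python
# Minimal sketch: check generated sentences against an
# "at least 10 words" instruction. Sample sentences are hypothetical;
# whitespace splitting is a rough word-count proxy that fails for
# languages without space-delimited words (e.g. Chinese).

def meets_min_words(sentence: str, min_words: int = 10) -> bool:
    return len(sentence.split()) >= min_words

generated = [
    "The delivery driver left the package by the side gate this morning.",
    "Please call me back later.",  # only 5 words: fails the instruction
]

adherence = sum(meets_min_words(s) for s in generated) / len(generated)
print(f"Adherence to the length instruction: {adherence:.0%}")  # 50%
```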


Creativity: natural language variability declines with output length
To evaluate the creativity of the LLMs, we used a variability score: a measure of how different individual sentences or conversations are from one another. Every model we tested produced less varied data on tasks requiring longer outputs, with variability declining from single-sentence generation to conversation generation.
This limitation should be factored in when using LLM-generated synthetic data in real-world applications.
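
The report does not spell out how the variability score is computed, but one simple way to quantify variability is the average pairwise Jaccard distance between outputs' word sets. The sketch below, with hypothetical sample outputs, is an assumption of that kind, not the study's metric.

```python
# Minimal sketch of one possible variability score: average pairwise
# Jaccard distance between the word sets of generated outputs
# (0.0 = all outputs identical, 1.0 = no shared words at all).
# An illustrative stand-in, not the metric used in the study.
from itertools import combinations

def variability(outputs: list[str]) -> float:
    token_sets = [set(text.lower().split()) for text in outputs]
    pairs = list(combinations(token_sets, 2))  # needs >= 2 outputs
    return sum(1 - len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

samples = [
    "Book a table for two at the Italian restaurant tonight.",
    "Reserve a table for two at the Italian place tonight.",
    "What time does the pharmacy on Main Street close?",
]
print(f"Variability score: {variability(samples):.2f}")
```

Note that any overlap-based score of this kind tends to fall as outputs grow longer, since longer texts inevitably share more common words.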


Speed: not all LLMs win the race
When we measured the speed at which sentences were generated, some models produced data at a noticeably slower pace than others, with the slowest performers up to 10x slower than the fastest models.
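
Throughput like this is simple to measure yourself. The sketch below is a hypothetical harness, not the one used in the study: generate() stands in for whatever model API you call, and the prompts are placeholders.

```python
# Minimal sketch: time a batch of generation calls and report
# throughput. generate() is a hypothetical stand-in for a model
# API call; the prompts are placeholders, not study data.
import time

def sentences_per_second(generate, prompts: list[str]) -> float:
    start = time.perf_counter()
    for prompt in prompts:
        generate(prompt)  # one synthetic sentence per request
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed

def demo_generate(prompt: str) -> str:
    time.sleep(0.01)  # simulate network + model latency
    return "a generated sentence"

print(f"{sentences_per_second(demo_generate, ['prompt'] * 20):.1f} sentences/sec")
```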


Cost: token bloat leads to budget bloat
Even in English, we saw striking differences in tokenizer efficiency between models, and for languages with complex writing systems (such as Tamil) the gaps were significant. The least efficient model tokenizer used up to 450% more tokens than the most efficient one, which equates to a 450% increase in costs.
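
Token counts are easy to compare for yourself. The sketch below uses OpenAI's open-source tiktoken encodings as stand-ins; the other models in the study use their own proprietary tokenizers, and the sample strings are illustrative, not study data.

```python
# Minimal sketch: count the tokens two different tokenizers need for
# the same text. tiktoken's public encodings stand in for the models'
# proprietary tokenizers; the sample strings are illustrative only.
import tiktoken  # pip install tiktoken

samples = {
    "English": "Hello, how is the weather today?",
    "Tamil": "வணக்கம், இன்று வானிலை எப்படி இருக்கிறது?",
}

for language, text in samples.items():
    for encoding_name in ("cl100k_base", "o200k_base"):
        n_tokens = len(tiktoken.get_encoding(encoding_name).encode(text))
        print(f"{language:8} {encoding_name}: {n_tokens} tokens")
```

Multiplying counts like these by per-token prices shows how tokenizer efficiency feeds directly into cost.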


Get the full LLM study today