How does your preferred LLM stack up against the rest?
Our in-depth study assessed 9 LLMs across 6 data generation tasks and 8 languages using human expert evaluators.

38,000 sentences generated
115,000 submitted annotations
>250,000 ratings
27 linguists

Here are 6 key takeaways from our study:


Overall: top performers across all tasks and languages
Here are the top performers overall across all 6 data generation tasks and 8 languages from our study.
Rank | Model             | Average overall score (out of 5.0)
-----|-------------------|------------------------------------
1    | Claude 3.5 Sonnet | 4.40
2    | GPT-4o            | 4.28
3    | Gemini 1.5 Pro    | 4.26


Language proficiency: English and French are easy – for others, choose wisely
Overall scores indicate that all the models perform very well in languages such as English and French, especially on simpler tasks. Some models underperformed on more complex tasks, even in English.
Performance in less-represented languages (Arabic, Chinese, Polish and Tagalog) was more varied, reinforcing the need to test multiple LLMs against your specific use case.


Instruction adherence: some LLMs listen better than others
LLMs vary in how reliably they follow instructions for synthetic data generation, depending on the task and use case. For example, when we tasked the models with generating sentences at least 10 words long, most could not reliably meet this requirement, even in English. Quality and performance declined further in languages such as Arabic, Polish and Tamil.
Your choice of model will depend on your specific use case.
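
To make the check concrete, here is a minimal sketch of how a length-instruction test like this can be scripted. It is not our evaluation harness: the sample sentences are hypothetical, and splitting on whitespace is only a rough word-count proxy that breaks down for languages without space-delimited words.

```python
# Minimal sketch: check generated sentences against an
# "at least 10 words" instruction. Sample sentences are hypothetical;
# whitespace splitting is a rough word-count proxy that fails for
# languages without space-delimited words (e.g. Chinese).

def meets_min_words(sentence: str, min_words: int = 10) -> bool:
    return len(sentence.split()) >= min_words

generated = [
    "The delivery driver left the package by the side gate this morning.",
    "Please call me back later.",  # only 5 words: fails the instruction
]

adherence = sum(meets_min_words(s) for s in generated) / len(generated)
print(f"Adherence to the length instruction: {adherence:.0%}")  # 50%
```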


Creativity: natural language variability declines with output length
To evaluate the creativity of the LLMs, we used a variability score: a measure of how different individual sentences or conversations are from one another. Every model we tested produced less varied data on tasks requiring longer outputs, with variability declining from single-sentence generation to conversation generation.
This limitation should be factored in when using LLM-generated synthetic data in real-world applications.
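
The report does not spell out how the variability score is computed, but one simple way to quantify variability is the average pairwise Jaccard distance between outputs' word sets. The sketch below, with hypothetical sample outputs, is an assumption of that kind, not the study's metric.

```python
# Minimal sketch of one possible variability score: average pairwise
# Jaccard distance between the word sets of generated outputs
# (0.0 = all outputs identical, 1.0 = no shared words at all).
# An illustrative stand-in, not the metric used in the study.
from itertools import combinations

def variability(outputs: list[str]) -> float:
    token_sets = [set(text.lower().split()) for text in outputs]
    pairs = list(combinations(token_sets, 2))  # needs >= 2 outputs
    return sum(1 - len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

samples = [
    "Book a table for two at the Italian restaurant tonight.",
    "Reserve a table for two at the Italian place tonight.",
    "What time does the pharmacy on Main Street close?",
]
print(f"Variability score: {variability(samples):.2f}")
```

Note that any overlap-based score of this kind tends to fall as outputs grow longer, since longer texts inevitably share more common words.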


Speed: not all LLMs win the race
When we measured the speed at which sentences were generated, some models produced data at a noticeably slower pace than others, with the slowest performers up to 10x slower than the fastest models.
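
Throughput like this is simple to measure yourself. The sketch below is a hypothetical harness, not the one used in the study: generate() stands in for whatever model API you call, and the prompts are placeholders.

```python
# Minimal sketch: time a batch of generation calls and report
# throughput. generate() is a hypothetical stand-in for a model
# API call; the prompts are placeholders, not study data.
import time

def sentences_per_second(generate, prompts: list[str]) -> float:
    start = time.perf_counter()
    for prompt in prompts:
        generate(prompt)  # one synthetic sentence per request
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed

def demo_generate(prompt: str) -> str:
    time.sleep(0.01)  # simulate network + model latency
    return "a generated sentence"

print(f"{sentences_per_second(demo_generate, ['prompt'] * 20):.1f} sentences/sec")
```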


Cost: token bloat leads to budget bloat
Even in English, we saw striking differences in tokenizer efficiency between models, and for languages with complex writing systems (such as Tamil) the gaps were significant. The least efficient model tokenizer used up to 450% more tokens than the most efficient one, which equates to a 450% increase in costs.
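
Token counts are easy to compare for yourself. The sketch below uses OpenAI's open-source tiktoken encodings as stand-ins; the other models in the study use their own proprietary tokenizers, and the sample strings are illustrative, not study data.

```python
# Minimal sketch: count the tokens two different tokenizers need for
# the same text. tiktoken's public encodings stand in for the models'
# proprietary tokenizers; the sample strings are illustrative only.
import tiktoken  # pip install tiktoken

samples = {
    "English": "Hello, how is the weather today?",
    "Tamil": "வணக்கம், இன்று வானிலை எப்படி இருக்கிறது?",
}

for language, text in samples.items():
    for encoding_name in ("cl100k_base", "o200k_base"):
        n_tokens = len(tiktoken.get_encoding(encoding_name).encode(text))
        print(f"{language:8} {encoding_name}: {n_tokens} tokens")
```

Multiplying counts like these by per-token prices shows how tokenizer efficiency feeds directly into cost.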


Get the full LLM study today