
        LLM synthetic data generation study

        How does your preferred LLM stack up against the rest?

        Our in-depth study assessed 9 LLMs across 6 data generation tasks and 8 languages using human expert evaluators.
        38,000 sentences generated
        115,000 submitted annotations
        >250,000 ratings
        27 linguists

        Here are 6 key takeaways from our study:

        Overall: top performers across all tasks and languages

        Here are the top performers overall across all 6 data generation tasks and 8 languages from our study.


        Rank  Model              Average overall score (out of 5.0)
        1     Claude 3.5 Sonnet  4.40
        2     GPT-4o             4.28
        3     Gemini Pro 1.5     4.26

        Language proficiency: English and French are easy – for others, choose wisely

        Overall scores indicate that all the models perform very well in languages such as English and French, especially on simpler tasks. Some models underperformed on more complex tasks, even in English.

        Performance in less represented languages (Arabic, Chinese, Polish, Tagalog) was more varied, reinforcing the need to test multiple LLMs for specific use cases.

        Instruction adherence: some LLMs listen better than others

        LLMs differ in how reliably they follow instructions for synthetic data generation, depending on the task or use case. For example, when we asked the models to generate sentences at least 10 words long, most could not consistently meet this requirement, even in English. Quality and adherence declined further in languages such as Arabic, Polish and Tamil.

        Your choice of model will depend on your specific use case.
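
        As a rough illustration of the kind of adherence check described above, here is a minimal sketch in Python; the sample sentences, the 10-word threshold and the whitespace word-splitting are illustrative assumptions, not the study's actual evaluation harness.

```python
# Minimal instruction-adherence check: given sentences a model was asked to
# make at least 10 words long, report the share that actually comply.
MIN_WORDS = 10  # illustrative threshold from the example above

def compliance_rate(sentences: list[str], min_words: int = MIN_WORDS) -> float:
    """Fraction of sentences meeting the minimum word-count instruction."""
    if not sentences:
        return 0.0
    compliant = sum(1 for s in sentences if len(s.split()) >= min_words)
    return compliant / len(sentences)

# Hypothetical model outputs; whitespace splitting is a rough proxy and
# would need a language-aware segmenter for e.g. Chinese or Tamil.
generated = [
    "The committee reviewed the proposal carefully before voting on it today.",
    "Short sentence here.",  # violates the 10-word instruction
]
print(f"Compliance: {compliance_rate(generated):.0%}")  # -> Compliance: 50%
```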

        Creativity: natural language variability declines with output length

        To evaluate the creativity of the LLMs, we used a variability score: a measure of how different individual sentences or conversations are from one another. All the models we tested delivered less varied data on tasks requiring longer outputs, with variability declining from single-sentence generation to conversation generation.

        This limitation should be considered when using LLM synthetic data generation for real-world applications.
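
        The report does not spell out the variability formula, but a common stand-in is mean pairwise lexical dissimilarity. The sketch below uses Jaccard distance over word sets; that metric, and the sample sentences, are assumptions for illustration rather than the study's own measure.

```python
from itertools import combinations

def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A intersect B| / |A union B|: 0 for identical word sets, 1 for disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def variability_score(texts: list[str]) -> float:
    """Mean pairwise Jaccard distance over lowercased word sets.

    Higher means more varied output; outputs that keep reusing the same
    vocabulary score lower, mirroring the decline described above.
    """
    word_sets = [set(t.lower().split()) for t in texts]
    pairs = list(combinations(word_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical generated sentences:
samples = [
    "The cat sat on the mat.",
    "A dog ran across the sunny park.",
    "The cat sat on the rug.",
]
print(f"Variability: {variability_score(samples):.2f}")
```

        Other metrics, such as distinct-n or embedding cosine distance, would serve the same purpose; the key point is comparing scores across output lengths.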

        Speed: not all LLMs win the race

        When measuring the speed at which sentences were generated, we found that some models produced data at a noticeably slower pace than others, with the slowest up to 10x slower than the fastest models.
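
        A throughput comparison of this kind can be timed in a few lines; `generate_one` below is a placeholder for whatever client call produces a sentence, not an interface from the study.

```python
import time
from typing import Callable

def sentences_per_second(generate_one: Callable[[], str], runs: int = 5) -> float:
    """Average sentences generated per second over a few runs."""
    start = time.perf_counter()
    for _ in range(runs):
        generate_one()
    return runs / (time.perf_counter() - start)

def fake_model() -> str:
    """Stand-in for a real model call; simulates ~100 ms latency."""
    time.sleep(0.1)
    return "A generated sentence."

print(f"{sentences_per_second(fake_model):.1f} sentences/s")  # ~10.0
```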

        Cost: token bloat leads to budget bloat

        Even in English, we saw striking differences in tokenizer efficiency between models. For languages with complex writing systems, such as Tamil, the differences were even larger: the least efficient tokenizer used up to 450% more tokens than the most efficient one for the same content. Since providers bill per token, that translates directly into up to 450% higher costs.
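
        To gauge this effect on your own content, a sketch along the following lines works. It uses the open-source tiktoken library, with two public OpenAI encodings standing in for a more and a less efficient tokenizer, and a hypothetical sentence pair; the study's actual model-tokenizer pairings are not reproduced here.

```python
import tiktoken  # pip install tiktoken

# Two public OpenAI encodings as illustrative stand-ins for a more and a
# less efficient tokenizer.
encodings = {name: tiktoken.get_encoding(name)
             for name in ("o200k_base", "cl100k_base")}

# The same (hypothetical) sentence in English and Tamil.
samples = {
    "English": "The weather is very pleasant today.",
    "Tamil": "இன்று வானிலை மிகவும் இனிமையாக உள்ளது.",
}

for lang, text in samples.items():
    counts = {name: len(enc.encode(text)) for name, enc in encodings.items()}
    cheapest, dearest = min(counts.values()), max(counts.values())
    # At a fixed per-token price, extra tokens translate 1:1 into extra cost.
    print(f"{lang}: {counts} -> up to {(dearest - cheapest) / cheapest:.0%} more tokens")
```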
        Get the full LLM study today
        RWS is trusted by HP and Accenture.

        Download the full LLM study