Should AI train AI? Weighing the risks and benefits

Vasagi Kothandapani, President, Enterprise Services and TrainAI, RWS · 23 Apr 2025 · 6 mins

Artificial intelligence is evolving at a rapid pace, with new models and techniques constantly emerging. As the demand for high-quality AI training data grows and its availability declines, and as enterprises struggle to navigate data privacy and regulatory hurdles, companies have started exploring alternative ways to source data. One approach that is gaining traction is generating synthetic data using AI itself. 

This method raises an important question: Should AI be trained using data generated by other AI models? While synthetic data can provide cost efficiency and scalability, it can also present challenges, such as model collapse, where AI models degrade over time as they repeatedly learn from AI-generated outputs. 

This article explores the benefits and risks of AI training AI, the impact of model collapse, and why human-in-the-loop (HITL) methodologies are vital to AI's long-term viability.

The role of synthetic data in AI training

Synthetic data is artificially generated rather than collected from real-world sources. AI models can create synthetic images, text, audio/speech and other complex datasets that mimic real-world data. Let's explore the benefits of data generated by AI models:

  • Lower costs and faster generation: Unlike real-world data, which requires time-consuming collection, organization and labeling, synthetic data can be generated at scale with minimal manual effort, significantly reducing costs. 
  • Privacy protection: Real-world datasets often contain personally identifiable information, requiring strict compliance with privacy regulations. Synthetic data, by contrast, can be generated without sensitive user data, reducing privacy risks. 
  • Expanded dataset diversity: In industries such as autonomous driving and healthcare, capturing rare but critical scenarios in real-world data can be impractical or costly. Synthetic data can be used to simulate edge cases and improve model robustness.
  • Unlimited data supply: As AI development shifts focus from model scale to data quality and quantity, synthetic data offers a virtually limitless supply of training material, accelerating experimentation and iteration without the bottlenecks of real-world data sourcing. 
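To ground this in something concrete, here is a minimal Python sketch (an illustration of mine, not a method described in this article) of the simplest possible synthetic data pipeline: fit a distribution to a handful of real measurements, then sample as many artificial records as needed:

```python
import numpy as np

rng = np.random.default_rng(7)

# A small set of "real" measurements, e.g. anonymized transaction amounts.
real = np.array([12.4, 15.1, 9.8, 14.2, 11.7, 13.9, 10.5, 16.0])

# "Train" a trivial generative model: estimate the empirical mean and spread.
mu, sigma = real.mean(), real.std()

# Generate as many synthetic records as needed; no record corresponds to a
# real individual, which is the privacy appeal of synthetic data.
synthetic = rng.normal(mu, sigma, size=10_000)
print(f"real mean={mu:.2f}, synthetic mean={synthetic.mean():.2f}")
```

Real pipelines use far more capable generators (GANs, diffusion models, LLMs), but the principle is the same: the synthetic data is only as faithful as the model that produced it.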

As a result, synthetic data is increasingly used in industries such as finance and healthcare. While these advantages are attractive, dependence on AI-generated data raises concerns. So, what happens when AI repeatedly learns from its own outputs instead of real-world data?

Understanding model collapse

One concerning risk of AI training AI is model collapse: a phenomenon in which AI models degrade in accuracy, quality and diversity through overreliance on synthetic data. Imagine a model that generates data, which is then used to train a newer version of itself. This process repeats over multiple iterations, introducing small distortions, biases or errors that are amplified with each pass. The result? The AI drifts further from reality, producing increasingly unreliable, biased or repetitive outputs. Researchers often explain this phenomenon with the analogy of making a photocopy of a photocopy: each copy loses a little clarity, and the distortions become more pronounced with every iteration.

A 2024 study of AI models trained on recursively generated data observed early signs of this effect. Language models trained primarily on synthetic outputs showed performance degradation, particularly on tasks requiring a nuanced understanding of human intent and context. This highlights the importance of grounding AI development in high-quality, human-validated data to avoid reinforcing flaws at scale.
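To make the photocopy analogy concrete, here is a toy simulation (a sketch of my own, not the study's methodology). Each "generation" trains only on the previous generation's outputs, modeled here as resampling with replacement, so diversity, measured as the number of distinct values, can only shrink:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=1000)  # generation 0: real-world data

for gen in range(51):
    if gen % 10 == 0:
        print(f"gen {gen:2d}: distinct values={len(np.unique(data)):4d}, "
              f"std={data.std():.3f}")
    # Each new "model" learns only from the previous model's outputs,
    # modeled here as sampling with replacement from the last dataset.
    data = rng.choice(data, size=1000, replace=True)
```

Rare values disappear first and can never return, while common values are amplified. This loosely mirrors how the tails of a data distribution vanish when language models are trained recursively on their own outputs.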

Downstream effects: why it matters

If AI training AI leads to model collapse, the downstream consequences could be significant:

  1. Loss of diversity in AI outputs

    AI models are designed to produce creative, varied and insightful outputs. However, when models are repeatedly trained on synthetic data, they risk homogenization, leading to responses that lack diversity and originality. This also introduces noticeable patterns that make AI-generated content easier to spot.

    For example, many LLMs develop signature quirks—such as overused phrases—that are increasingly recognizable. A recent study by Copyleaks highlighted how different AI models display distinct writing "fingerprints," making it easier for both experts and laypeople to identify AI-generated text.

  2. Error propagation and bias amplification

    AI models are naturally prone to biases inherited from their training data. If synthetic data contains errors or biases, those flaws can be reinforced, making models less reliable and less accurate over time. For instance, if a language model trained on synthetic data develops a subtle bias in how it interprets emotion, that bias can be magnified with each training cycle, distorting how chatbots understand user inputs.

  3. Decreasing reliability and real-world applicability

    AI models trained mostly on synthetic data may struggle with generalization: the ability to apply what they have learned to real scenarios. Without exposure to real-world data, AI systems can become overfitted to synthetic patterns, a problem that often arises when a dataset is too narrow to represent the real world, leaving models less effective in unpredictable conditions. As an illustration, imagine an AI model trained solely on synthetic speech data. It might perform well in controlled environments, but it could struggle to process natural conversations with accents and background noise. (A toy sketch of this failure mode appears after this list.)

  4. Ethical and regulatory concerns

    As AI systems play a larger role in decision-making across critical sectors like healthcare, human resources and law enforcement, the ethical risks of relying on synthetic data become more pressing. It is important to consider what can happen when an AI model trained on flawed or biased synthetic data makes decisions that affect people's lives, jobs or access to services. 

    The growing dependence on AI-generated data raises key questions:

  • Who is accountable if AI-generated data leads to discriminatory or harmful outcomes?
  • How do we ensure transparency in decision-making if AI models are built on layers of synthetic data?
  • What are the legal implications of AI-generated datasets in regulated industries like finance, healthcare or law enforcement?
  • Can we audit how an AI model learned what it knows or trace back its decision logic through layers of synthetic data?
  • Would you trust a critical system, like a medical AI, trained entirely by another AI instead of humans?
These concerns highlight the need for careful human oversight when incorporating synthetic data into AI training.
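Returning to the generalization problem in point 3, here is a contrived sketch (not an example from the article) of how a flexible model fitted to a narrow, synthetic-like slice of input space fails on wider, real-world-like inputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Synthetic" training data covers only a narrow slice of input space.
x_syn = rng.uniform(-1, 1, 200)
y_syn = np.sin(x_syn) + rng.normal(0, 0.05, 200)

# "Real-world" test data spans a wider, noisier range.
x_real = rng.uniform(-3, 3, 200)
y_real = np.sin(x_real) + rng.normal(0, 0.2, 200)

# A high-degree polynomial is flexible enough to memorize the narrow slice.
coeffs = np.polyfit(x_syn, y_syn, deg=9)

def mse(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(f"error on synthetic-like data: {mse(x_syn, y_syn):.4f}")
print(f"error on real-world-like data: {mse(x_real, y_real):.4f}")
```

The fit looks excellent inside the training range and fails badly outside it: exactly the pattern to watch for when models learn mostly from synthetic data.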

Striking the right balance: AI + human oversight

While AI-generated data offers clear advantages, relying on it exclusively presents real risks. Even state-of-the-art large language models achieve only 80–90% accuracy on comprehension tasks in various benchmarks (such as MMLU). And since generation tasks are often more complex than understanding tasks, error rates could be even higher when synthetic data is used for training. 

This reinforces the need for human involvement—not just to validate AI outputs, but to ensure quality, accuracy and reduced bias. Rather than allowing AI models to train each other in isolation, the best approach is a hybrid model where AI-generated data is enhanced with human contribution and real-world data.

The role of humans-in-the-loop in AI training

A human-in-the-loop (HITL) approach ensures that AI systems remain grounded in reality. 

Humans can: 

  1. Validate AI-generated data – identifying and correcting biases, inaccuracies and inconsistencies. 
  2. Introduce real-world context – ensuring AI models learn from diverse, high-quality and ethically sourced data. 
  3. Provide ethical oversight – reducing the risk of unintended results in AI decision-making. 

By integrating human expertise, AI systems can maintain accuracy, fairness and adaptability—even as synthetic data starts to play a larger role in model training.
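As a minimal sketch of what this gating could look like in code (hypothetical names and thresholds, not an RWS or TrainAI workflow): synthetic samples above a confidence threshold are auto-accepted, the rest are queued for human review, and only vetted samples join real-world data in the next training set:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    confidence: float  # hypothetical score from the generating model

def load_real_data() -> list[Sample]:
    # Hypothetical stand-in for a curated, human-validated dataset.
    return [Sample("Real, human-verified example.", 1.0)]

def triage(samples: list[Sample], threshold: float = 0.9):
    """Split synthetic samples into auto-accepted and human-review queues."""
    auto, review = [], []
    for s in samples:
        (auto if s.confidence >= threshold else review).append(s)
    return auto, review

synthetic = [
    Sample("The claim was approved within two days.", 0.96),
    Sample("Patient expressed mild concern about dosage.", 0.71),
]
accepted, needs_review = triage(synthetic)

# Only vetted synthetic samples are blended with real-world data; everything
# in needs_review goes to human reviewers before it can enter training.
training_set = load_real_data() + accepted
print(f"{len(accepted)} auto-accepted, {len(needs_review)} queued for review")
```

In practice the threshold, the reviewers' rubric, and the blend ratio of real to synthetic data are the levers that keep the system grounded.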

A thoughtful approach to AI training

The question is not just whether AI can train AI, but whether it should, and to what extent. While synthetic data offers efficiency, scalability and privacy benefits, overreliance on AI-generated outputs can lead to model collapse, bias amplification and reduced accuracy, weakening AI's long-term reliability.

To prevent these risks, AI must be developed with a balanced approach of synthetic and real-world data, human-in-the-loop methodologies, and ethical safeguards. Maintaining human involvement and grounding AI in real-world data is essential for ensuring its continued progress. 

The future of AI depends not on replacing human expertise but on combining artificial intelligence with human intelligence to create genuine intelligence. Learn more about GENUINE INTELLIGENCE™ – the future of human-machine collaboration. 

Interested in building AI models that remain accurate and adaptable? Let’s talk about how to combine AI and human expertise for better results. Get in touch with our TrainAI team, and we’ll set up a call to discuss your project.

Glossary

  • Synthetic data: Data that is artificially generated by algorithms rather than collected from real-world events. It mimics the structure and characteristics of real data and is often used to augment or replace real-world datasets in AI training. 
  • Model collapse: A degradation process in which AI models lose accuracy, diversity and quality due to repeated training on synthetic or AI-generated data. Errors and biases compound over time, making the model less useful and reliable. 
  • Overfitting: A modeling issue where an AI system performs well on its training data but poorly on new or unseen data. It occurs when the model learns patterns too precisely, failing to generalize. 
  • MMLU (Massive Multitask Language Understanding): A benchmark designed to evaluate the reasoning and knowledge capabilities of AI models across 57 academic and professional subjects. Often used to compare model performance against human-level accuracy. 
  • Human-in-the-loop (HITL): A model development approach where human experts are involved throughout the AI training or evaluation process. HITL ensures AI systems stay grounded in real-world knowledge, with ethical oversight and quality control.
Author

Vasagi Kothandapani
President, Enterprise Services and TrainAI, RWS
Vasagi is President of Enterprise Services, responsible for multiple global client accounts at RWS, as well as RWS's TrainAI data services practice, which delivers leading-edge AI training data solutions to global clients across a broad range of industries. She has 27 years of industry experience and has held various leadership positions in business delivery, technology, sales, product management, and client relationship roles in both product development and professional services organizations globally. She spent most of her career working in the technology and banking sectors, supporting large-scale technology and digital transformation initiatives.

Prior to joining RWS, Vasagi worked with Appen, where she was responsible for managing a large portfolio of AI data services business for the company's top global clients. Before that she spent two decades at Cognizant and CoreLogic in their banking and financial services practice, managing several banks and fintech accounts. Vasagi holds a Master's degree in Information Technology, a Post Graduate Certificate in AI, and several industry certifications in Data Science, Architecture, Cybersecurity, and Business Strategy.