How AI is trained: the critical role of AI training data

Nayna Jaen · 26 Mar 2024 · 5 minute read

From medical diagnosis to planning a holiday, from battling climate change to writing a cover letter for a job application, the uses of AI are multiplying at mind-boggling speed. It's mostly generative AI in the headlines these days, but whatever AI model or use case we're talking about, its success comes down to how well the AI has been trained. This in turn depends on having the right kind of data for AI training. 

Good-quality AI training data – LOTS of it – is the key ingredient for developing AI applications that can reliably do what we want them to do. So let's explore what AI training data is, how it's prepared and how it's applied in the AI training process. In other blogs we explore some common AI training challenges and how to ensure that your AI training data is of sufficient quality to meet your needs.

What is AI training data?

AI training data is a set of information, or inputs, used to teach AI models to make accurate predictions or decisions. For instance, if a model is being trained to identify images of dogs, its AI training dataset will comprise pictures containing dogs, with each dog labelled 'dog'. This data is fed into the AI model as learning inputs, allowing it ultimately to identify dogs accurately in other, previously unseen images (more on how this learning happens later). 

Data for AI training may be naturally generated by human activity and collected for use in an AI training dataset, or it may be manufactured for the purpose, creating synthetic data that mimics real-world training data. Synthetic training data is especially useful when real-world data is limited or sensitive.

Training data formats

Depending on the purpose of the AI model, its training data may be: 

  • Text data. Anything from tweets and web pages to literary works, academic papers and government documents can be used to teach AI models to process and generate human language. 
  • Audio, including speech data. Voice-activated AI models or speech-to-text applications need to be trained to identify and respond appropriately to human speech, including dealing with different accents and speech patterns. They may even need to identify different emotions. Other types of audio, such as animal sounds, music, traffic or other environmental noises, can also be used to train AI applications such as virtual assistants or environmental monitoring systems. 
  • Image data. Computer vision applications such as facial recognition, driverless vehicles or medical imaging analysis will use AI training datasets containing relevant digital images. 
  • Video data. Like still images, video formats can be used to train computer vision applications such as facial recognition, driverless vehicles or surveillance systems. 
  • Sensor data. Signals from devices that capture physical information such as temperature, biometrics or an object’s acceleration are known as sensor data, and are used to train AI models for driverless vehicles, industrial automation and internet of things (IoT) devices, among other use cases.

Labelled vs unlabelled AI training data

Whatever its format, data can be used in a labelled or unlabelled form – or a combination of both – to train AI models: 

  • Labelled data is information tagged with labels that act as signposts to help guide the AI model in its learning. For instance, a photo of a cat might be labelled 'cat', helping the AI model to identify what a cat looks like. Labelled data is typically used in a type of training known as supervised learning, with the labels providing critical context for the AI's learning (more on this later). 
  • Unlabelled data is raw data, such as photographs or text without any tags or labels to provide context. It’s primarily used in unsupervised learning (more on this later). 

Both these types of data are often essential in crafting a well-rounded AI model.
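In code, the difference is simply whether each input carries a target tag. A toy sketch (file names and labels are invented for illustration):

```python
# A labelled dataset pairs each input with a target tag (used in supervised learning).
labelled_data = [
    ("photo_001.jpg", "cat"),
    ("photo_002.jpg", "dog"),
    ("photo_003.jpg", "dog"),
]

# An unlabelled dataset is just the raw inputs (used in unsupervised learning).
unlabelled_data = ["photo_004.jpg", "photo_005.jpg", "photo_006.jpg"]

# For supervised training, the inputs and labels are typically split apart.
inputs = [x for x, _ in labelled_data]
labels = [y for _, y in labelled_data]
```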

Data for different stages of the process

Training an AI model is an iterative process that isn't complete until the initial training work has been validated and the model's performance tested on previously unseen data. We can distinguish initial AI training datasets from validation datasets (or dev sets), which are used to develop and fine-tune the model by evaluating it during training. Finally, test datasets are used to evaluate and 'prove' the final model. All are regarded as data used for AI training.

Preparing data for AI training

Before data can be used for AI training, it needs to be collected and processed for use as an AI training dataset. This involves data collection, annotation, validation and pre-processing: 

Data collection 

Data collection for AI training is not nearly as simple as it sounds, because you need a lot of it and it needs to represent the full variety of scenarios that the AI may encounter. After all, if your training images only include dogs in a standing position, you shouldn't be surprised if your AI fails to identify any dog that is sitting, lying down, running, jumping or swimming. The data format(s) you need will depend on your intended AI application, and if you can't collect enough real-world data to cover training, validation and testing, then synthetic data may be a viable option to fill the gaps. 

Data annotation (or labelling) 

AI training usually requires at least some of the training data to be labelled or tagged (as explained previously). For example, parts of an image may be labelled ‘dog’, ‘cat’, ‘tree’, ‘flower’ or ‘fruit’, or elements of text may be labelled ‘person’, ‘city’, ‘country’ or ‘date’. Annotation is a labour-intensive process requiring human judgement, and is essential for AI to be able to learn by example (see the discussion on ‘supervised learning’ below). 
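Concretely, an annotation record might look something like the sketch below (the file name, box coordinates and character offsets are invented for illustration; real annotation schemas vary by tool and project):

```python
# Image annotation: labelled regions within a picture (pixel coordinates).
image_annotation = {
    "file": "garden_scene.jpg",
    "regions": [
        {"label": "dog",    "box": [34, 120, 210, 380]},   # [x, y, width, height]
        {"label": "tree",   "box": [400, 10, 180, 500]},
        {"label": "flower", "box": [250, 300, 60, 80]},
    ],
}

# Text annotation: labelled spans within a sentence (character offsets).
text = "Ada Lovelace was born in London in 1815."
text_annotation = [
    {"label": "person", "start": 0,  "end": 12},  # "Ada Lovelace"
    {"label": "city",   "start": 25, "end": 31},  # "London"
    {"label": "date",   "start": 35, "end": 39},  # "1815"
]

# Span offsets can be validated against the text itself.
assert text[25:31] == "London"
```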

Data validation 

The next step is to ensure that the data is fit for purpose. This is about ensuring the quality of the AI training data and may include both automated and human-in-the-loop checks for errors, irrelevancies, inconsistencies and biases in the data that could affect AI performance. 

Data pre-processing 

Before it can be used, the data must be cleaned and organized to optimize it for AI training. This includes responding to issues discovered during data validation – for example, by correcting errors, removing irrelevant data, resolving inconsistencies and handling missing or incomplete data. 

Pre-processing may also involve data normalization or standardization to help the AI model process the data in a consistent manner, reducing the risk of bias and improving its performance. For example, you might normalize a text dataset to balance the frequencies of words such as ‘apple’ and ‘banana’, so that one doesn’t appear five times more often than the other and the model can compare them effectively. Similarly, for an audio dataset you might adjust volume levels so the AI encounters the same range of volume throughout and can analyze the audio consistently. For images, you might ensure that they all have similar brightness and contrast, so the AI can work with them uniformly. 
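As a sketch of the idea, min-max normalization rescales any set of values onto a common range, so that, say, a dark image and a bright image end up on the same scale (the pixel values below are illustrative only):

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # avoid division by zero for constant data
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Pixel brightness values from two images with very different exposure:
dark_image   = [10, 20, 30, 40]
bright_image = [100, 150, 200, 250]

# After normalization, both span the same [0, 1] range,
# so the model sees them on a consistent scale.
dark_norm   = min_max_normalize(dark_image)
bright_norm = min_max_normalize(bright_image)
```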

Finally, the data is split into training, validation and test datasets to be used for training and evaluating the AI model.
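A minimal sketch of such a split in pure Python, shuffling with a fixed seed for reproducibility (the 80/10/10 proportions are a common convention, not a rule):

```python
import random

def split_dataset(data, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a dataset and split it into training, validation and test sets.
    Whatever remains after the training and validation fractions goes to test."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```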

How is AI trained?

With your AI training data prepared, you can move on to the actual training. This involves feeding the training data into AI algorithms designed to learn from it in specific ways. Broadly speaking, there are three ways to do this, and they are often combined: 

  • In supervised learning, the AI algorithm is given labelled data and the labels are the output that the AI must learn to produce. It's akin to a teacher-student relationship, with the model (the student) learning from the examples provided (the teacher). Feeding images labelled 'dog' to an AI model is an example of supervised learning using labelled image data. The model learns to associate a range of dog-relevant features with the label 'dog', improving its ability to reliably produce the output 'dog' when shown an unlabelled image of a dog. 
  • In unsupervised learning, the AI model is given unlabelled data, requiring it to find patterns or structure within the data without any guidance. This allows for more exploratory learning, which is especially useful when the outcomes are unknown or when we want the model to uncover structure in the data that isn't obvious or wouldn't typically be captured by human labels. Allowing the model to create its own clusters of similar data, or to identify anomalies and outliers, can reveal patterns that would otherwise stay hidden. For example, an AI model trained through unsupervised learning might identify unusual patterns in patient health data that indicate potential disease or health issues, aiding early diagnosis and personalized treatment planning. 
  • In reinforcement learning, the AI model performs a series of actions and is regularly given feedback in the form of a reward or punishment. This helps the model understand the consequences of its actions and make better decisions over time. A common example of reinforcement learning is when an AI model learns how to play a game well by playing it many times and adjusting its strategy based on the outcomes (feedback) of winning (reward) or losing (punishment). Over time it learns what works and what doesn't.
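To make the supervised case concrete, here is a toy 'learner': a nearest-centroid classifier that averages the labelled examples for each class, then labels an unseen input by whichever average it sits closest to. The feature values (imagine simple measurements such as ear length and tail length) are invented for illustration, and real models are far more sophisticated:

```python
def train_centroids(examples):
    """'Learn' from labelled data by averaging the feature vectors per label."""
    sums, counts = {}, {}
    for features, label in examples:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Label a new, unseen example by its closest learned centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))

# Labelled training examples: ([feature_1, feature_2], label)
training_data = [([9.0, 30.0], "dog"), ([11.0, 34.0], "dog"),
                 ([4.0, 25.0], "cat"), ([5.0, 27.0], "cat")]
model = train_centroids(training_data)
print(predict(model, [10.0, 32.0]))  # dog
```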

Evaluating the AI model

Evaluating an AI's performance, especially its ability to apply its learning to previously unseen scenarios, is an essential part of the overall AI training process. It allows us to understand how well the model has learned from the AI training data and how it might perform in real-world situations.

Performance metrics 

There are many ways to assess an AI model's performance, but first you need to decide what measures to use to evaluate it. The choice of appropriate performance metrics will depend on what the AI is trained to do. 

For example, to evaluate a classification task such as email spam filtering (where the AI model classifies emails as spam or non-spam), you can use measures such as accuracy to assess how many emails were correctly classified, precision to assess how many emails marked as spam were actually spam, recall to assess how many actual spam emails were correctly identified as spam, or the F1 score, which combines precision and recall in a single measure. 
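Sticking with the spam-filtering example, all four measures can be computed directly from the lists of actual and predicted labels. A minimal pure-Python sketch with made-up data:

```python
def classification_metrics(actual, predicted, positive="spam"):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy = correct / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of flagged spam, how much was spam?
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual spam, how much was caught?
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return accuracy, precision, recall, f1

actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]
acc, prec, rec, f1 = classification_metrics(actual, predicted)
```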

For a regression task such as a real estate AI model predicting house prices, you might use mean absolute error (MAE) or root mean squared error (RMSE) to measure the difference between predicted and actual prices. 
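Both measures are straightforward to compute. A sketch with hypothetical prices in thousands (RMSE penalizes the one large error more heavily than MAE does):

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalizes large errors more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical house prices (in thousands): actual vs model predictions.
actual_prices    = [250, 300, 410, 520]
predicted_prices = [260, 290, 400, 560]
print(mae(actual_prices, predicted_prices))   # 17.5
print(rmse(actual_prices, predicted_prices))  # larger, because of the 40k miss
```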

These metrics provide a quantitative measure of the model's performance, helping identify areas of strength and areas needing improvement. 


Cross-validation

One common technique for assessing how a model might perform on an independent dataset is cross-validation. It involves dividing the AI training dataset into several groups, called 'folds'. With the data divided into k folds, the model is then trained k times, each time using a different fold as the test set and the remaining folds as the training set. 

Each time cross-validation is performed, relevant performance metrics (as discussed above) are computed to measure how well the model performs on the test set. The metrics from each iteration are typically aggregated, for example by taking the mean, median or another summary statistic, to provide an overall estimate of the model's performance. The result is a more reliable indication of whether the model is robust and can generalize well to unseen data. 
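A minimal sketch of the k-fold mechanics. The train_and_score callback is a placeholder for whatever training-plus-metric routine fits the task; here the focus is on how the folds rotate through the held-out role:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds

def cross_validate(data, k, train_and_score):
    """Train k times, each time holding out one fold for evaluation,
    and return the mean of the per-fold scores."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        held_out = set(folds[i])
        test_set  = [data[j] for j in sorted(held_out)]
        train_set = [data[j] for j in range(len(data)) if j not in held_out]
        scores.append(train_and_score(train_set, test_set))
    return sum(scores) / k
```

With 10 items and k=3, for example, the folds have sizes 4, 3 and 3, and every item is held out exactly once across the k runs.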

Overfitting and underfitting 

During AI model evaluation, it’s important to check for overfitting and underfitting. Overfitting occurs when the model learns the training data too well, to the extent that it struggles to generalize to new data. Underfitting occurs when the model fails to capture the underlying patterns in the data, resulting in poor performance on both the training and testing datasets. An evaluation technique like cross-validation can help detect both overfitting and underfitting. If identified, they should be mitigated during the evaluation stage to ensure the model's reliability and generalizability. 
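One rough way to spot both conditions is to compare the model's error on the training set with its error on a validation (or held-out) set. The heuristic and thresholds below are purely illustrative, not standard values:

```python
def diagnose_fit(train_error, validation_error,
                 good_enough=0.10, gap_threshold=0.05):
    """Illustrative heuristic:
    - high error on BOTH sets suggests underfitting (patterns not captured);
    - low training error but much higher validation error suggests overfitting
      (the model memorized the training data instead of generalizing)."""
    if train_error > good_enough and validation_error > good_enough:
        return "underfitting"
    if validation_error - train_error > gap_threshold:
        return "overfitting"
    return "reasonable fit"

print(diagnose_fit(train_error=0.02, validation_error=0.25))  # overfitting
print(diagnose_fit(train_error=0.30, validation_error=0.32))  # underfitting
print(diagnose_fit(train_error=0.05, validation_error=0.07))  # reasonable fit
```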

Overfitting and underfitting are just two things to look out for during model evaluation. There’s also model robustness, scalability, interpretability and more. But I want to end by highlighting the issue of bias. 


Bias

Evaluation of an AI model isn't complete without assessing it for potential bias, which can occur for several reasons, including biases in the AI training data or biases embedded in the algorithms themselves. 

We deal with bias in training data in greater depth, with examples, in a separate blog on AI training data quality.

Bias embedded in the algorithms themselves is often inadvertent and can occur due to unconscious biases of programmers or assumptions made during the algorithm design process. A notable instance is the case of predictive policing software used by several police departments in the US to predict where crimes were likely to occur. The algorithm was trained on historical crime data, but during the training process unconscious human biases ended up causing the model to give more weight to certain types of crimes, such as drug-related offences. This resulted in a disproportionate focus on neighbourhoods with a higher incidence of drug-related crimes, many of which were lower-income or predominantly minority neighbourhoods. The results weren’t necessarily indicative of a higher overall crime rate but rather a reflection of bias built into the software. 

Overcoming AI training challenges 

It takes specific effort to avoid bias in AI models. We cover this further, with practical advice on what to do, in a separate blog addressing some of the most common AI training challenges. But it all starts with preparing the data for AI training. Data is the foundation on which AI models are built, supporting the development of accurate, reliable, and robust AI. For AI training data you can depend on, contact RWS’s TrainAI team today.

Nayna Jaen

Senior Marketing Manager, TrainAI by RWS
Nayna is Senior Marketing Manager of RWS’s TrainAI data services practice, which delivers complex, cutting-edge AI training data solutions to global clients operating across a broad range of industries.  She leads all TrainAI marketing initiatives, and supports the TrainAI sales and production teams to effectively deliver for clients.
Nayna has more than 25 years’ experience working in marketing, communications, digital marketing, and IT project management roles within the AI, technology, industrial, creative, and professional services industries. She holds a Bachelor of Fine Arts (BFA) degree from Boston University and a Master of Business Administration (MBA) degree with a specialization in Marketing and Information Technology (IT) from the University of British Columbia.