Structured content: a secret weapon for finetuning AI

Alex Abey 26 Mar 2024 5 mins

Once strictly the domain of technical documentation departments, the use of structured content is fast spreading through enterprises, as an increasing number of functions seek better control of their content. Many want to publish information to multiple formats and channels without duplicating content. Others need machine-readable information for workflow automation initiatives. There are many other drivers too, including simplifying compliance and improving information findability. 

Whatever their primary reason for investing in a structured content solution, I don’t think any business that has done so thought they might suddenly find themselves with a powerful resource for making the most of the ‘next big thing’ in AI development. But that’s what has happened, as the sudden rise of generative AI (GenAI) has businesses everywhere scrambling to adapt LLMs or other models to their own needs. For those seeking to finetune AI, structured content – if they have it – can be a secret weapon.

What is AI finetuning and why does it matter?

Finetuning takes a pre-trained AI model – OpenAI’s GPT model, for example – and refines it for a specific enterprise task through additional training with more relevant, targeted data. It's akin to reshaping a universal tool to perform a specialized task with precision. 

Finetuning is critical for many enterprise scenarios because the familiar problem of hallucination that plagues GenAI models can have much more serious consequences for a business relying on the AI’s output than for a consumer having a bit of fun with ChatGPT. Customers, partners, shareholders and regulators won’t be impressed if your explanation for an error is that ‘our LLM said so’. By ensuring that the AI is trained on data that is more relevant to its purpose, finetuning makes it less likely to generate a falsehood to ‘fill a gap’. A finetuned model will be more likely to use information and language specific to your business, because that is what it has been preferentially trained on. 

Before I get into why structured content is so good for finetuning, it’s worth noting that there are other ways to adapt a generic AI model for a specific enterprise use case, including retrieval-augmented generation (RAG), which we’ve used to power our new ‘trustable chat’ capability in Tridion.

Precision finetuning: the role of structured content

But let’s get back to finetuning. As with all AI training, nothing is more important to the success of finetuning than the quality of the data. Structured content has several characteristics, not shared by unstructured data, that make it good quality AI training data: 

  • Consistency of the structure. Just the fact that structured content is meticulously organized in a consistent, hierarchical way (defined by a set of rules called a schema or content model) makes it easier for an AI model to parse and process it. One of the enterprise drivers for structured content is that it is more machine-friendly than unstructured content, which matters when automating workflows. This machine-friendliness matters in AI training too. Well-structured data has less ambiguity and noise for the AI, enabling it to identify patterns and relationships more easily. For example, an ‘if-then’ structure in the data might prevent the model from hallucinating a step in a procedure that doesn’t exist. 
  • Context provided by granular metadata. While unstructured data may have some document-level metadata, structured data is characterized by metadata applied at a much more granular level. Since each individual content component – a paragraph, image, video or table, for example – has been enriched with descriptive metadata (typically applied by a content specialist), there is a lot of relevant context to guide the AI model. Depending on the authoring tool used, the metadata may even have been applied with the help of semantic AI to ensure consistency of the metadata itself – a feature we call ‘smart tagging’ in Tridion. Additionally, structured content metadata will typically cover not just the topic of the content but its purpose, for example by identifying a block of text as a ‘procedure’. Added to the structure of the content, this really helps the AI model to build a more complete and accurate picture and ultimately deliver more informed, context-aware outputs. 
  • Enhanced context if using taxonomy. Not all enterprises using structured content are organizing their metadata into a formal taxonomy or knowledge graph. But for those that are, this extra dimension of organization creates an even richer interconnected web of information for the AI model to learn from. Typically used to enrich online experiences with more relevant search results and recommendations, taxonomies and knowledge graphs can be just as beneficial for AI training, exposing the semantic relationships within the data to guide the AI with even greater nuance. 
  • Potential for personalization. The context provided by metadata in structured content can potentially help the AI model trained on it to respond differently to two similar prompts that reveal a difference in audience or intent. For example, imagine an LLM trained with a dataset that includes two versions of a user manual, one tagged for an audience of mechanics or service engineers and one for non-technical customers. Faced with a prompt that reveals the intended use or nature of the user, the LLM could respond with the more relevant information. 
  • Data accuracy. There’s no guarantee that just because content is structured, it is accurate and up to date. But one of the most fundamental drivers for structured content is that it simplifies content reuse and updates (with the flexibility to create and control derivative content), and this significantly reduces the risk of out-of-date versions of content persisting. Enterprises implementing a structured content solution are also more likely to implement content governance processes that improve the likelihood of their content remaining accurate and up-to-date. For these reasons, structured content tends to be a more accurate repository of data for AI training than unstructured content, helping to build a more reliable AI model with less effort to cleanse the training data.

Solid foundations for AI finetuning

If your organization is exploring AI finetuning and wants to know more about how your structured content can play a part – or if you have no structured content but still need to build a reliable enterprise AI model – our TrainAI team is here to help. Contact us to learn more. 

If you’re interested in structured content as a potential solution for any other content management problem, our Tridion team is here to help. Contact us to discuss your needs and learn more.

Alex Abey
Author

Alex Abey

General Manager of Tridion and Fonto
As General Manager for Tridion and Fonto, Alex Abey leads RWS’s content management portfolio. These technologies allow clients to create, manage, translate and deliver any type of content to any device – providing customers with an outstanding digital experience in their own language.
All from Alex Abey