What are the best methods and tools for term extraction?

Patrick Beßler 26 Aug 2022 6 mins
Using consistent terminology in all your content – be it product descriptions, newsletters or product manuals – is no less important for your brand image than your use of colour, font, tone of voice and other branding choices. It projects a professional image to your audience, builds trust and ensures a good user experience. On the other hand, inconsistency or ambiguity in the way you refer to products, parts or concepts will confuse and frustrate your customers at best – at worst, it could even pose a health and safety risk.
 
Proper management of terminology is particularly essential when drafting and localizing large amounts of technical documentation. In order for the content writers, editors and translators to be able to ensure consistency, the correct definitions and target-language equivalents of concepts need to be recorded in a central terminology database, or ‘termbase’.
 
So how do you go about creating a termbase? Let’s examine and compare three of the main methods.

Method 1: Manual term extraction

At a very basic level, you might take an ad hoc, manual approach: each time the writers, editors and translators come across a new, inconsistent or ambiguous term, they add it to the termbase as a new entry. The terminologist then cleans up the list of term candidates and supplements the entries with extra information such as notes on context, grammar or usage.
 
With this method, the termbase expands with each text. And for small-scale work, it may be sufficient. However, it quickly becomes untenable as the number of documents and people involved grows. As the undertaking becomes more complex, terminology questions increasingly go unanswered – and unclear source-language terminology can quickly balloon into larger problems in translations and future content. At this point, effort and cost start to outweigh the benefits.
 
Fortunately, you can automate your terminology collection using specific software, which we’ll look at in our next method.

Method 2: Automatic term extraction

Using software tools specially designed for term extraction makes for a much quicker, more scalable method of building a termbase. These tools scrape anything they see as a ‘term’ from a source and create a list of term candidates. Once the terminologist has cleaned the list up and added any supplementary information in the termbase, the translators complete the entries for the target-language equivalents.
 
These automated tools identify which words and phrases might constitute terms based on linguistic or statistical factors:
  • In linguistic term extraction, the tool identifies features such as the part of speech or principal parts of each word and attaches them to the text as a form of on-the-fly annotation. This approach generally delivers better results than the statistical approach because the system analyzes the linguistic structure of the source text rather than only counting occurrences. However, it requires the annotation tools to be available in the relevant language, and developing each language-specific system is resource-intensive.
  • In statistical term extraction, the tool identifies a word or phrase as a term based on the number of times it appears in the text. Because this system doesn’t analyze the linguistic features of the source text, it can be applied to any language, making it a lower-cost option.
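To make the statistical approach concrete, here is a minimal sketch in Python. It assumes that "number of times it appears" means a raw count of word sequences of one to three words. The function name, the thresholds and the sample text are illustrative assumptions, not how any particular tool works.

```python
import re
from collections import Counter

def extract_candidates(text, max_len=3, min_count=2):
    """Count word n-grams (1..max_len words) and keep those seen at least min_count times."""
    # Lowercase and keep runs of letters, allowing internal hyphens.
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        # Note: this simple version lets n-grams run across sentence boundaries.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {term: c for term, c in counts.items() if c >= min_count}

sample = (
    "Open the pressure valve. Check the pressure valve seal. "
    "Replace the seal if the pressure valve leaks."
)
candidates = extract_candidates(sample)
```

On this sample, "pressure valve" is counted three times, but so are "the pressure" and "the pressure valve", and "the" on its own is counted four times. A purely statistical pass cannot tell these apart, which is why its output still needs cleaning up by the terminologist.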
 
To capitalize on the strengths and mitigate the weaknesses of each approach, many companies take a hybrid approach that combines the two.
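A hybrid pass might look like the following sketch, which applies a linguistic filter first and a frequency threshold second. The tiny hand-written tag lexicon is a hypothetical stand-in for a real part-of-speech tagger, and the pattern used (a run of adjectives and nouns ending in a noun) is only one common assumption about what a term looks like.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_TAGS = {
    "hydraulic": "ADJ", "pressure": "NOUN", "valve": "NOUN", "seal": "NOUN",
    "the": "DET", "open": "VERB", "check": "VERB", "replace": "VERB",
}

def hybrid_candidates(tokens, min_count=2):
    """Linguistic step: collect maximal ADJ/NOUN runs that end in a noun.
    Statistical step: keep only runs seen at least min_count times."""
    counts = Counter()
    run = []
    for tok in tokens + [None]:  # the trailing None flushes the final run
        if TOY_TAGS.get(tok) in ("ADJ", "NOUN"):
            run.append(tok)
            continue
        # Drop trailing adjectives so that the run ends in a noun.
        while run and TOY_TAGS[run[-1]] != "NOUN":
            run.pop()
        if run:
            counts[" ".join(run)] += 1
        run = []
    return {term: c for term, c in counts.items() if c >= min_count}

tokens = (
    "open the hydraulic pressure valve check the hydraulic pressure valve "
    "replace the seal"
).split()
result = hybrid_candidates(tokens)
```

Here the linguistic step discards fragments such as "the hydraulic", and the frequency step discards "seal", which occurs only once, leaving "hydraulic pressure valve" as the single candidate.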
 
If you don’t yet have a termbase in place, these tools enable you to scrape terms from all your existing content to quickly fill one with most of the terminology already in use at your company. This should also allow you to spot and anticipate inconsistencies or ambiguities.
 
Every subsequent text you produce will also need to be checked for terms. After all, terminology never stays still; your termbase will never be complete. Maintaining a disciplined system will reduce the amount of time your translators have to spend on avoidable terminology queries and research and improve the quality of your content in all languages.
 
These tools cover monolingual automatic term extraction from source-language texts – but what about texts that have already been translated? This brings us to our final method.

Method 3: Bilingual automatic term extraction

You can also automatically extract terms from bilingual texts, such as translation memories or aligned source and target texts. This method produces a list of source-language term candidates alongside their target-language equivalents – removing the need for translators to enter the terms manually. The list is checked and edited by both the terminologists and the translation department to prevent any errors.
 
Bilingual term extraction is often done in two stages – source-language terms first, and target-language equivalents second – to allow the source-language terms to be edited if necessary. Alternatively, it can be done by extracting the source-language and target-language term candidates simultaneously. The resulting lists are linked using word alignment.
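As a rough illustration of linking the two lists, the sketch below scores each source and target candidate pair by how often they occur in the same aligned segment pair, using the Dice coefficient. This co-occurrence scoring is a simplified stand-in for true word alignment, and the function name, the substring matching and the English and German sample segments are all assumptions made for the example.

```python
from collections import Counter

def link_terms(pairs, source_terms, target_terms):
    """For each source term, pick the target term with the highest Dice score
    across aligned (source segment, target segment) pairs."""
    src_n, tgt_n, both_n = Counter(), Counter(), Counter()
    for src_seg, tgt_seg in pairs:
        # Crude substring matching; a real tool would match on tokens or lemmas.
        s_hits = [s for s in source_terms if s in src_seg.lower()]
        t_hits = [t for t in target_terms if t in tgt_seg.lower()]
        src_n.update(s_hits)
        tgt_n.update(t_hits)
        both_n.update((s, t) for s in s_hits for t in t_hits)
    links = {}
    for s in source_terms:
        # Dice = 2 * joint count / (source count + target count)
        scored = [
            (2 * both_n[(s, t)] / (src_n[s] + tgt_n[t]), t)
            for t in target_terms
            if both_n[(s, t)]
        ]
        if scored:
            links[s] = max(scored)[1]
    return links

pairs = [
    ("Open the pressure valve.", "Öffnen Sie das Druckventil."),
    ("Check the seal.", "Prüfen Sie die Dichtung."),
    ("Replace the pressure valve seal.", "Ersetzen Sie die Dichtung des Druckventils."),
]
links = link_terms(pairs, ["pressure valve", "seal"], ["druckventil", "dichtung"])
```

Even on three segments the scores separate the pairs: "pressure valve" links to "druckventil" and "seal" to "dichtung". The proposed links are still only candidates, so the review by terminologists and translators described above remains necessary.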

Final tips on optimizing your termbase

All of the methods discussed can be made more efficient in a number of ways. For example, you can reduce the amount of processing of term candidates your terminologist has to do by using stopword lists. These tell the tools to ignore general words and phrases – such as everyday nouns, prepositions and conjunctions – which are not considered ‘terms’. Similarly, you can set the tools to ignore terms that are already stored in your termbase.
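Both filters can be sketched in a few lines of Python. The stopword list, the rule of rejecting candidates that begin or end with a stopword, and the sample data are assumptions made for illustration; real tools expose these settings in their own ways.

```python
# Hypothetical stopword list; a real one would be far longer and language-specific.
STOPWORDS = {"the", "a", "of", "and", "in", "if"}

def filter_candidates(candidates, stopwords, known_terms):
    """Drop candidates that begin or end with a stopword,
    and candidates that are already stored in the termbase."""
    known = {term.lower() for term in known_terms}
    kept = []
    for cand in candidates:
        words = cand.lower().split()
        if not words or words[0] in stopwords or words[-1] in stopwords:
            continue  # stopword at the edge: not a clean term candidate
        if cand.lower() in known:
            continue  # already recorded in the termbase
        kept.append(cand)
    return kept

raw = ["the", "pressure valve", "the pressure valve", "seal", "valve of", "torque wrench"]
remaining = filter_candidates(raw, STOPWORDS, known_terms=["Seal"])
```

Of the six raw candidates, only "pressure valve" and "torque wrench" reach the terminologist. Checking only the first and last word keeps terms that contain a stopword in the middle.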
 
It’s also important to record specific metadata about each term, such as its status – that is, whether the term is the preferred, alternative, obsolete or forbidden designation. This not only provides crucial guidance on usage to the writers and translators, but also allows automated verification systems to recognize the designations and therefore suggest preferred terms first, or flag any forbidden terms that have been used. To make it easier for the terminologist to assign the right designations or identify gaps in the terminology, termbase entries should include definitions and descriptions to explain the concept.
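The sketch below shows how a status field could drive such an automated check. The entry structure, the field names and the concept identifier are hypothetical and do not reflect the data model of MultiTerm or any other product.

```python
import re
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PREFERRED = "preferred"
    ALTERNATIVE = "alternative"
    OBSOLETE = "obsolete"
    FORBIDDEN = "forbidden"

@dataclass
class Entry:
    concept: str          # identifier of the concept the term designates
    term: str
    status: Status
    definition: str = ""  # explains the concept to writers and translators

def check_text(text, termbase):
    """Return (found term, preferred term) for each obsolete or forbidden term in the text."""
    preferred = {e.concept: e.term for e in termbase if e.status is Status.PREFERRED}
    findings = []
    for e in termbase:
        if e.status in (Status.OBSOLETE, Status.FORBIDDEN):
            # Whole-word, case-insensitive match.
            if re.search(r"\b" + re.escape(e.term) + r"\b", text, re.IGNORECASE):
                findings.append((e.term, preferred.get(e.concept)))
    return findings

termbase = [
    Entry("C001", "pressure valve", Status.PREFERRED, "Valve that regulates line pressure."),
    Entry("C001", "pressure tap", Status.FORBIDDEN),
]
findings = check_text("Close the pressure tap before servicing.", termbase)
```

Because both entries share the same concept identifier, the check can flag the forbidden "pressure tap" and suggest "pressure valve" in its place.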
 
Regardless of your method, one golden rule applies: start as soon as possible. At the latest, you should have an up-to-date termbase in place by the time you begin localization – otherwise your translation quality and speed will suffer. Ideally, however, your terminology system should be in place before you even begin drafting your technical documentation, because the first decisions on terminology are made early, during planning and development.
 
Handled right, your terminology can make your products user-friendly and your branding effective; handled wrong, it can damage how your customers perceive you. Whether your method is manual or automatic, statistical or linguistic, monolingual or bilingual, terminology management can’t be ignored.
 
As a world-leading provider of language services and technology, RWS is uniquely qualified to help you tackle your terminology challenges. Find out how you can build and manage your own termbases with MultiTerm – or get in touch today to discover our specialized services for establishing and optimizing your terminology management.
Author: Patrick Beßler

Patrick is a researcher at RWS, and works closely with customers on their machine translation and post-editing programs.