The “Provenance” of Global Content: An Interview with Professor Dave Lewis
Today’s localized content is often the result of deep and complex supply chains, which pose a challenge not just for workflow and quality management but also for meeting the growing social demand for transparency, accountability and governance. How can the localization industry optimize its use of supply chains and minimize risk when there are so many moving parts?
Enter “provenance,” a concept with potentially profound implications for the language industry. We sat down with Professor Dave Lewis from the ADAPT Research Centre to discuss.
A new idea that’s always been around
The term “provenance” comes from the fine art world and describes a piece’s documented history. How and when was it created? Who has bought and sold it? Where has it been stored? Has it ever been damaged? Stolen?
In the fine art world, the provenance of a piece of art is directly related to its market value. It’s used to verify authenticity, reveal historical significance and limit the risk that a buyer unintentionally purchases ill-gotten gains.
Supporting the notion that life imitates art, parallels to the provenance concept can be found everywhere outside the art world. The wine community has adopted the practice directly. Real estate transactions in the US involve title searches and seller disclosure statements. Companies like Starbucks and Apple advertise the traceability and transparency of their global supply chains. One might even argue that the explosion of interest in personal genomics reflects a desire to understand our own provenance as individuals.
Modern society tends to value history, context, data and transparency. It shouldn’t come as a surprise, then, that the concept of provenance should find relevance in the world of computer science and data management.
Professor Dave Lewis is working to expand the practice of provenance within the localization services industry, where the journey of content through a supply chain might be long and complex, and the answer to the question “Who contributed what?” may not be straightforward.
We were excited to have the opportunity to discuss this rich and interesting topic with him.
What does provenance mean in terms of how it relates to localization?
The idea of provenance is becoming increasingly important in areas where we’re dealing with a lot of data, or more specifically, content. Essentially, it’s answering the question, “Where did this come from, and who was involved in its transition to its current state?”
For localization workflows, we recognize that translation and localization are multi-skilled, collaborative activities that involve both human and, increasingly, automated components. Some translators are better at translating certain types of content than others, so we consider in our quality control not only how well something’s been translated, but who translated it.
The same applies to using machine translation or automated terminology extraction. It matters which technologies were used, because we know that some work better with certain types of content than others.
The introduction of provenance, especially in localization workflows, is about capturing such things in sufficient detail to really be able to make proper workflow and quality management decisions.
We are interested in how we can introduce this more broadly into localization workflows, so we’ve done a lot of work to introduce the idea of provenance into relevant standards. We worked with the World Wide Web Consortium to introduce the idea of provenance into the Internationalization Tag Set 2.0, which we can add to both XML and HTML content to better inform web-based localization processes. There are similar capabilities built into XLIFF 2.0.
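To make this concrete, the ITS 2.0 Provenance data category can be attached directly to HTML content. The following is only a minimal sketch; the people, tools and reference URL are invented purely for illustration.

  <p>
    <span its-person="Jane Doe"
          its-org="Example LSP"
          its-tool="ExampleCAT v2.3"
          its-rev-person="John Smith"
          its-rev-tool="ExampleQA v1.0"
          its-prov-ref="https://example-lsp.com/prov/records/1234">
      The translated segment goes here.
    </span>
  </p>

Here its-person and its-org record who produced the translation, its-rev-person and its-rev-tool record who revised it and with what, and its-prov-ref points to a fuller provenance record held elsewhere.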
The capability is there; it’s now a case of how well it’s being used in practice as we pass things back and forth in the localization workflow.
Let’s discuss the relationship between provenance and process improvement. Can you give an example where a process would be improved by having provenance information?
I think one of the hot spots is quality assurance and doing quality assessments of translations. Projects rarely have the resources to run quality assessments on everything, but perhaps we can better allocate limited quality assurance resources to the places where we think there is higher risk. This is often done quite informally, but as we introduce more steps in the chain and use newer techniques like terminology extraction and automated post-editing, we want to know where these techniques are and aren’t being used.
Could provenance data be useful for AI-driven intelligent workflows?
I think so. But I still think we’re probably a long way from AI micromanaging our workflows.
We tend to talk about this in terms of having sufficient metadata inserted by the tools so that collecting it isn’t a big manual overhead. As we start amassing that metadata, it allows us to make workflow decisions.
We did an EU project a few years ago where we asked: if you could improve the quality of your machine translation over the duration of a project by feeding more of the manually post-edited output back into the machine translation engine, how does that affect progress and workflow management decisions? Do we perhaps start a project with the most difficult parts, or those that are most distinct from the data our engine was trained on? Do we assign our best post-editors to that task?
There are potential optimizations that you can start contemplating if you have access to provenance information.
With AI in general, a common concern among buyers of services or tools with an AI component is: how do I know that it works? Some of my colleagues in ADAPT are looking at things like automated quality assessment and quality prediction techniques.
We’re working to introduce into standards like XLIFF some space in the metadata for confidence metrics.
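ITS 2.0 already defines an MT Confidence data category along these lines, which gives a sense of what such a confidence slot can look like; in this minimal sketch the engine URL and score are invented for illustration.

  <p its-annotators-ref="mt-confidence|https://example.com/mt-engine-v4">
    <span its-mt-confidence="0.82">This sentence was machine translated.</span>
  </p>

Here its-mt-confidence carries the engine’s own quality estimate on a scale from 0 to 1, and its-annotators-ref identifies the engine that produced the score, so downstream tools know where the number came from.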
Do you see blockchain having a role in provenance modelling?
Blockchain is a bit like machine learning. It’s one of these massively over-hyped technologies, and I think people are still searching around to find where the technology really addresses the business problem. There are a lot of people who are looking at blockchain as a solution, when in fact a database with an API and a sensible level of security will do the job just fine.
We still work within a model where people are remunerated on a word rate. Things like how long you spent post-editing, what corrections you made, whether you corrected terminology: those sorts of things have value but aren’t always very clearly remunerated.
Some of it we can sort of track already with existing models of provenance, but I think where blockchain might help is if we start coming up with different models of the way people with linguistic skills are remunerated or acknowledged for their contribution—not just to the job at hand, but also to the contribution they make, including to future MT engines.
We don’t really have a good model of doing that at the moment. We can certainly track the provenance of someone’s input into a particular translation, but we rarely take it further than assessing that translation as part of a corpus of text that is then used to train another MT engine. The translators who are involved in providing that text as part of a translation job get no additional remuneration for how that gets used in a future MT engine, or perhaps in an iteration of an MT engine that they are using already.
You could use blockchain to keep a more ‘in the wild,’ immutable record of what a set of translators has contributed to a particular resource, translation memory or text. Some companies manage those quite strategically, others don’t. And often it just gets lost or is filed somewhere, but nobody really knows who has a stake in it.
You can see there is potential there, but I think the business model isn’t there yet.
I’d like you to put on your futurist hat and imagine that all the objectives and goals of your research have been perfectly actualized in the real world. What does that look like?
To sort of reverse your question a little bit, the nightmare scenario for us is that young people look at the world of machine translation and say, “Why on earth would I go and spend three or four years in a university learning to be a linguist if the job is going to be automated? How am I going to build a career out of that?”
We hope that our work on provenance, speculation about potential uses of blockchain and consideration of other forms of ownership over intellectual work can be shaped into something that allows the human skills of translation to flourish in combination with machine translation and various forms of AI.
The assumption that once AI has learned everything we won’t need human translators anymore is a complete fallacy. Language changes, and the supply of new human translations that engines learn from will dry up if we don’t keep producing them. There is a lot of work to go around. The issue is what kind of work it is, how people are remunerated and how they claim credit, not just for straight translations but for the other linguistic skills they bring to the table.
That is our end goal: to get a bit of harmony between the AI side and the human side, because we can’t see them being divorced any time soon.
Any parting thoughts?
I think it’s important to maintain forums where we can have these discussions. LocWorld and events like that are useful in that they provide a space for the feedback and input that are important to the research community, and especially to the standardization community. We need to make sure we keep talking to each other about the best way to go forward, and don’t fall into the trap of thinking any one mega-big technology solution is going to solve the problem for everybody.
Thanks to Dave and the folks at the ADAPT Research Centre for making this interview possible. They’re working on things that are potentially transformational to the language industry and have been generous in sharing their ideas with us (and by extension with you).
I’m all about keeping the conversation going. As someone interested in maximizing the value of metadata in general, I think there’s a lot to be harvested through the concept of provenance.