Over the next two weeks, we’re taking a slightly different approach on the blog. In today’s article, the first of two parts, we will hear from Jérôme Torres-Lozano of Inventus, a user of Iconic’s Neural MT solutions for e-discovery. He gives us an entertaining look at his experiences with the challenges of language, particularly in the legal field, and how language technology is helping to overcome these obstacles. Enjoy!
“Un type spécial de beauté existe qui est né dans la langue, de la langue et pour la langue.” – Gaston Bachelard (1884-1962), French philosopher and poet.
As a Belgian national born to Spanish parents, my love affair with languages started at a young age. I was brought up bilingual in a country with three national languages, namely Dutch, French and German; this meant that I was constantly exposed to different languages and cultures. Growing up in Brussels, reading traffic signs, street names and publicity boards displayed in both Dutch and French became second nature.
Family summer holidays spent in Galicia or Valencia, my mother’s and father’s respective home regions, added yet another layer to my life’s linguistic fabric. Galician, an official language closely related to Portuguese, is spoken in my mother’s region, and Valencian, the official name given to the local variant of Catalan, is spoken in my father’s. These additional languages punctuated our extended family conversations as cousins, uncles and aunts rapidly turned back, out of habit, to their own regional vernacular, meaning my siblings and I had to quickly pick up the semantics, syntax and vocabulary of these languages in order to keep up with the family chatter.
Unsurprisingly, I went on to study English and Italian in the Home Counties and Dutch in the Brabançon town of Tilburg in the very south of the Netherlands, just 90 minutes from Brussels. My first job in the UK, some twenty years ago, was at a major translation corporation in Berkshire, programme managing translations of automotive and telecom manuals into Norwegian and Swedish. No, before you ask, I do not speak either of those languages.
This exposure to the translation world cemented my belief that “a special kind of beauty exists which is born in language, of language, and for language” (G. Bachelard). 15 years on, the beauty of languages and translation is now even more prevalent in my world of e-Discovery and Computer Forensics. The rise in global commerce has seen an increase in cross-border litigation, arbitration and compliance investigations. This in turn has meant that the industry has had to develop new technical and human solutions, or tweak existing ones, to deal with the processing, keyword searching, review and potential translation of documents for these types of legal matters.
It’s all Caesar Salad to me!
Firstly, the industry was faced with the technical limitations of processing data that was not in simple Latin script. Processing documents in languages that use the Roman alphabet but with extended Latin script (i.e. accents and special characters, which most languages have) was a real challenge. My French name is a perfect example. Even to this day, when I complete an online form, the rendering of it can turn from Jérôme to a rather disturbing Jirtme on some language-unfriendly sites.
The Japanese kindly refer to this garbling of characters as Mojibake (文字化け), literally “character transformation”. The Russians coined the term krakozyabry (кракозя́бры), and Germans may call it Buchstabensalat (a salad of letters, or think alphabetti spaghetti). Whatever the name, any language not written in simple Latin characters would suffer from some level of text scrambling. E-discovery providers had to painstakingly apply the correct language encoding to each document to eradicate the issue or at the very least minimise it.
And then came Unicode, the so-called panacea for all encoding problems, or so it seemed. Its UTF-8 encoding has been adopted as the de facto worldwide standard, and while it eliminated many of the Mojibake issues encountered in the past, these issues are still present when dealing with legacy systems and with proprietary or Asian email systems that still have their own encoding or only partly use Unicode.
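For readers curious about the mechanics, here is a minimal Python sketch of how this kind of garbling can arise and how trying encodings in turn can avoid it. The function names and the list of candidate encodings are purely illustrative assumptions, not a description of how any particular e-discovery platform works.

```python
def garble(text, written_as="utf-8", read_as="latin-1"):
    """Simulate mojibake: text is saved in one encoding and read back in another."""
    return text.encode(written_as).decode(read_as)


def repair(garbled, written_as="utf-8", read_as="latin-1"):
    """Undo the mistake by reversing the wrong decode, then decoding correctly."""
    return garbled.encode(read_as).decode(written_as)


def decode_with_fallback(raw, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order (hypothetical list) and return
    the decoded text together with the encoding that worked."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep what we can and mark unreadable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


if __name__ == "__main__":
    scrambled = garble("Jérôme")
    print(scrambled)
    print(repair(scrambled))
    print(decode_with_fallback("Jérôme".encode("latin-1")))
```

The fallback order matters in this sketch: Latin-1 can decode any sequence of bytes without raising an error, so it only makes sense as the last candidate, and a successful decode does not by itself prove the guess was right.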
Once most e-Disclosure specialists were able to process foreign language documents, the next challenge was how to identify specific languages in the universe of data, to assist legal teams in planning their review, estimating the number of documents that may need translation and allocating foreign language review resources.