It’s all Greek to me!
Early options for language detection were crude. Most were based on alphabets or writing systems, so they could detect and differentiate Arabic, Cyrillic, Chinese and Latin scripts, for example, but could not provide any more granularity. Is it Farsi or Arabic, Russian or Ukrainian, Cantonese or Mandarin, Spanish or Italian...or is it all Greek?
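To make that limitation concrete, here is a minimal Python sketch of script-based detection. It is an illustration only, not how any particular e-Discovery product works: the function name dominant_script is hypothetical, and taking the first word of each character's Unicode name is a rough stand-in for a proper script lookup.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script among the letters of `text`, or None.

    Assumption: the first word of a character's Unicode name (LATIN,
    CYRILLIC, ARABIC, CJK, GREEK...) is a good enough proxy for its script.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


# Scripts are easy to tell apart...
print(dominant_script("Привет, мир"))   # Russian text
print(dominant_script("Ελληνικά"))      # Greek text
# ...but Spanish and Italian share one script, as do Arabic and Farsi.
print(dominant_script("hola amigo"), dominant_script("ciao amico"))
```

A detector of this kind answers "which alphabet?" but, as noted above, it has nothing to say about which language within that alphabet.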
Then it all got a bit cleverer, using dictionaries, but as always with languages, nothing is that straightforward. A dictionary is of little use when you need to differentiate the Spanish word fresco from the Portuguese, Italian, Dutch or English fresco (meaning fresh, insolent, wall painting…). Languages within the same family tree share many common roots and attributes, which made automated differentiation and detection a challenge.
As a small linguistic digression, did you know that the word butterfly is one of the few words that disprove my comment about the sharing of roots? Butterfly turns into a French papillon, an Italian farfalla, a Portuguese borboleta, a Spanish mariposa, a Romanian fluture, a German Schmetterling, a Dutch vlinder, a Danish sommerfugl, a Swedish fjäril… you get the gist! Exceptions are what languages are made of.
Now, language identification uses intelligent algorithms that combine dictionaries with the analysis of character sets, accentuation, spelling, single letters, letter groupings and so on. Still not perfect, but a little less Greek to us.
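As a rough illustration of how such signals can be combined, the Python sketch below scores a text against three languages using common words, characteristic accented characters and typical letter groupings. Everything in it is an assumption made for the example: the HINTS table is a tiny hand-picked sample, the scoring is a simple count, and guess_language is a hypothetical name. Real tools rely on far larger data and statistical models.

```python
import re

# Tiny hand-picked hints, for illustration only (not real linguistic data).
HINTS = {
    "spanish": {
        "words": {"el", "la", "los", "y", "es", "que"},
        "chars": "ñ¿¡",
        "groups": ("ción", "ll"),
    },
    "portuguese": {
        "words": {"o", "os", "e", "é", "que", "não", "um"},
        "chars": "ãõç",
        "groups": ("ção", "lh", "nh"),
    },
    "italian": {
        "words": {"il", "gli", "e", "è", "che", "non", "un"},
        "chars": "",
        "groups": ("zione", "gli", "cch"),
    },
}


def guess_language(text):
    """Return the best-scoring language in HINTS, or None if nothing matches."""
    lowered = text.lower()
    words = re.findall(r"[^\W\d_]+", lowered)  # runs of letters, accents included
    scores = {}
    for lang, hint in HINTS.items():
        score = sum(1 for w in words if w in hint["words"])    # dictionary hits
        score += sum(lowered.count(c) for c in hint["chars"])  # accentuation
        score += sum(lowered.count(g) for g in hint["groups"]) # letter groupings
        scores[lang] = score
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("El pan fresco está en la cocina"))
print(guess_language("O pão fresco não está na cozinha"))
print(guess_language("Il pane fresco non è nella cucina"))
print(guess_language("fresco"))  # a single shared word gives no evidence
```

The last call shows why context matters: the word fresco on its own matches none of the hints, so the sketch declines to guess, while a full sentence around it supplies enough signals to decide.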
It takes all Kinds to make a World!
Now, what do you do with all of these documents in so many different languages? Translate them, machine translate them or have them reviewed by a native-speaking lawyer? In fairness, there are no right or wrong answers, as it all depends.
Machine translation has come a long way. It is said that in the early days some maverick providers used tools available online to machine translate documents, thereby sending their client data out onto the Internet. Most serious outfits, however, used in-house hosted hybrid solutions based on tried and tested, though limiting, rule-based and statistical translation technology.
The latest development is Neural Machine Translation, a predictive model that takes into account full sentences or paragraphs rather than a sequence of a few words. It learns the nuances of each language from that language’s specific morphology and characteristics, achieving higher levels of quality. A game changer that soon enough will be adopted at large in the e-Discovery world.
Neural Machine Translation is a valuable tool to prioritise documents and give you high-level insight into their content at a fraction of the cost of human translation, but if you need to provide a translation to the court or the other side, a certified human translation will often be the standard. From my translation industry days, the golden rule was to use translators who translate into their mother tongue. This is still valid today, as translators will always have a more profound linguistic and cultural attachment to their mother tongue than to the source language, however well they understand it, speak it and write it.
Another option that bypasses these solutions, or can be deployed in combination with them, is the increasingly popular use of native or language-proficient document reviewers: qualified lawyers who have the necessary language and industry-specific skills to perform a first-level review of the multilingual document population. Most contract lawyers are proficient in several different languages, which befits our modern multilingual world, provides further flexibility and increases the speed of review.
It’s a Full Circle! (*)
I would not have thought that 15 years after leaving the translation industry, now working in the realm of legal IT, I would be the resident linguist assisting corporations and law firms with their multilingual e-Disclosure needs, whether it be speaking to the clients’ IT representatives in their own language, breaking cultural barriers by explaining the process of discovery in layman’s terms, or strategising on the best search methodology to apply when dealing with other languages. La boucle est bouclée ! (*).