High-resolution, low-context English is already the preferred pivot language for multilingual localization. But recent advances in computer science and linguistics now enable us to get much more out of the English language — and thus help improve automatic natural-language translation and speech recognition.
It all started in 2013, when Yevgeni Berzak at the Massachusetts Institute of Technology (MIT) began working on an algorithm that could automatically determine the native language of someone writing in English. The objective was to develop grammar-correcting software that users could tailor to their own linguistic backgrounds.
This research, in turn, yielded linguistic insights into other languages. English text written by non-native speakers contains tell-tale grammatical idiosyncrasies, such as dropping or adding prepositions, substituting one tense for another, or misusing particular auxiliary verbs, that point back to the writer's native language and can even reveal the linguistic proximity between those languages.
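To make the idea concrete, here is a minimal, purely illustrative sketch of such a classifier: it compares a writer's observed mix of error types against per-language "profiles" and picks the closest one. The feature names, the frequencies, and the nearest-profile approach are all our own invention for illustration, not Berzak's actual method or data.

```python
# Toy native-language-identification (NLI) sketch. The error features and
# the per-language "profiles" below are invented for illustration only.
from collections import Counter
import math

# Hypothetical profiles: relative frequency of each grammatical
# idiosyncrasy in English text by native speakers of each language.
PROFILES = {
    "Russian": {"dropped_article": 0.50, "extra_preposition": 0.20, "aux_misuse": 0.30},
    "Spanish": {"dropped_article": 0.10, "extra_preposition": 0.55, "aux_misuse": 0.35},
}

def classify(observed: Counter) -> str:
    """Return the language whose error profile is closest (Euclidean)
    to the observed mix of error types."""
    total = sum(observed.values()) or 1
    freqs = {k: v / total for k, v in observed.items()}
    def dist(profile):
        keys = set(profile) | set(freqs)
        return math.sqrt(sum((profile.get(k, 0.0) - freqs.get(k, 0.0)) ** 2
                             for k in keys))
    return min(PROFILES, key=lambda lang: dist(PROFILES[lang]))

errors = Counter({"dropped_article": 5, "aux_misuse": 3, "extra_preposition": 2})
print(classify(errors))  # dropped articles dominate: matches the Russian profile
```

Real systems use far richer features and learned models, but the underlying intuition is the same: the pattern of mistakes is a fingerprint of the writer's first language.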
Inspiration from imperfection
Then, much like a popular TV series that yields one hit spinoff after another, it led to a new project that could eventually boost the accuracy of machine translation (MT). Berzak’s new research focuses on the fact that although English is the most commonly used language on the internet, with over 1 billion speakers, most of those speakers are non-native. According to Berzak, “This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English.”
The culmination of this latest effort was the release of MIT’s first major database of English sentences written by non-native speakers. The researchers’ dataset, consisting of 5,124 sentences written by ESL (English as a Second Language) students, is now one of the 59 datasets available from the organization that oversees the Universal Dependencies (UD) syntactic annotation standard. As more data is amassed and annotated in UD format, it will enable more robust training of MT engines for use in the localization field.
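For readers curious what UD annotation looks like in practice: data in the UD ecosystem is distributed in the tab-separated CoNLL-U format, where each token carries (among other fields) its form, part of speech, syntactic head, and dependency relation. The short ESL-style sentence and the minimal reader below are our own illustration, not an excerpt from the MIT dataset.

```python
# A minimal sketch of reading one sentence in CoNLL-U, the format used by
# the Universal Dependencies project. The ESL-style sentence ("She go to
# school") and its annotation are an invented example for illustration.
CONLLU = """\
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tgo\tgo\tVERB\t_\t_\t0\troot\t_\t_
3\tto\tto\tADP\t_\t_\t4\tcase\t_\t_
4\tschool\tschool\tNOUN\t_\t_\t2\tobl\t_\t_
"""

def parse(block: str):
    """Return (form, upos, head, deprel) for each token line.
    Columns in CoNLL-U: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC."""
    rows = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        rows.append((cols[1], cols[3], int(cols[6]), cols[7]))
    return rows

tokens = parse(CONLLU)
root = next(form for form, _, head, _ in tokens if head == 0)
print(root)  # "go": the verb keeps its root role despite the agreement error
```

Note that the annotation scheme still applies cleanly even though the sentence is ungrammatical, which is exactly what makes a fully annotated ESL corpus usable for training.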
What is most striking about this research is that science is now incorporating human imperfection into the equation. We are getting past the “garbage in, garbage out” paradigm that we all grew up with. Grammatically incorrect sentences, which offered no value in the past, are now a source of insight and inspiration, thanks to thousands of hours of hard work by MIT researchers to fully annotate them and give them value.
Linguistics on the map
In addition to providing a rich source of parallel texts for linguistic insight, the internet also enables linguists to track the birth and spread of new words through tweets and other social media. In fact, between 2009 and 2011, a team of researchers at the Georgia Institute of Technology, led by Jacob Eisenstein, mapped out the phenomenon.
They discovered that new words (and even emoticons) tended to originate in urban areas, spreading across Twitter first to cities with similar economic and ethnic makeups, then to a broader audience. Demographic similarity proved a stronger factor than geographical proximity in the propagation of new words. Not surprising in today’s digital world.
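As a back-of-the-envelope illustration of that finding, the toy sketch below shows how the city ranked nearest to an origin city by demographics can differ from the one ranked nearest by geography. The coordinates and demographic vectors are invented, and nothing here reflects the Georgia Tech team's actual model.

```python
# Toy contrast between geographic and demographic "nearness". All numbers
# are invented for illustration.
import math

# city -> ((lat, lon), demographic vector, e.g. [share young, share urban-core])
CITIES = {
    "Atlanta":  ((33.7, -84.4), [0.80, 0.90]),
    "Memphis":  ((35.1, -90.0), [0.75, 0.85]),  # demographically close, farther away
    "Columbus": ((32.5, -84.9), [0.30, 0.40]),  # geographically close, less similar
}

origin = "Atlanta"
others = [c for c in CITIES if c != origin]

# Rough planar distance is fine for a toy example.
by_geo = min(others, key=lambda c: math.dist(CITIES[origin][0], CITIES[c][0]))
# Euclidean distance between demographic vectors.
by_demo = min(others, key=lambda c: math.dist(CITIES[origin][1], CITIES[c][1]))

print(by_geo, by_demo)  # Columbus is nearer on the map, but Memphis is the closer match
```

Under the study's finding, a new word coined in the origin city would be expected to reach the demographically similar city first, even though it is geographically farther away.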
Technology tells us what we are
From big data to little data and everything in between, the internet is full of information. The projects by MIT and Georgia Tech show how scientists who are creative thinkers can take data that may have been filed away as errors and noise, and leverage it to gain insight into how the world actually works. We hope that more young and bright minds will show us our reflections in the mirrors of language and data.