Speech-to-Speech Translation: Its Present and Future

Speech-to-Speech (S2S) technology seems to have finally stepped out of the realm of science fiction, yet it’s not ready for prime time. In a report published earlier this year, the Translation Automation User Society (TAUS) identifies this as the paradox in which the technology currently finds itself.

The report outlines the current status, future directions, challenges, and opportunities of speech translation. It also includes interviews with 13 people who represent institutes and companies researching and working in this field. We present highlights from the report.

New directions and possibilities

Ike Sagie of Lexifone believes that existing engines for Machine Translation (MT) and Speech Recognition (SR) cannot be used straightaway. Optimization layers and other modifications are also required. Since people speak continuously, there must be an acoustic solution that cuts the flow into sentences or segments and sends the output to an audio optimization layer. Linguistic optimization is needed in the next stage to ensure translation accuracy, such as making sure interrogative sentences are annotated with question marks.
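The two-stage pipeline Sagie describes can be sketched in a few lines. This is only an illustrative toy, not Lexifone's actual system: the segmenter here splits on an explicit pause marker (a real system would detect silences in the audio), and the question-cue list and all function names are assumptions made for the example.

```python
# Toy sketch of the pipeline: segment a continuous transcript, then run a
# linguistic-optimization pass that annotates interrogatives before MT.
QUESTION_CUES = ("what", "where", "when", "why", "how", "who",
                 "do", "does", "did", "is", "are", "can", "could", "will")

def segment_stream(transcript: str) -> list[str]:
    """Stand-in for the acoustic segmenter: naively split on a pause
    marker that a real system would detect from the audio signal."""
    return [s.strip() for s in transcript.split("<pause>") if s.strip()]

def annotate_questions(segment: str) -> str:
    """Linguistic optimization: mark likely questions so the MT engine
    receives well-formed punctuation."""
    if segment.rstrip()[-1:] in ".?!":
        return segment
    first_word = segment.split()[0].lower()
    return segment + ("?" if first_word in QUESTION_CUES else ".")

def optimize(transcript: str) -> list[str]:
    return [annotate_questions(seg) for seg in segment_stream(transcript)]

print(optimize("where is the station <pause> turn left at the corner"))
# -> ['where is the station?', 'turn left at the corner.']
```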

Chris Wendt of Microsoft/Skype states that SR, MT, and Text-to-Speech (TTS) by themselves are not enough to make a translated conversation work. Because clean input is necessary for translation, elements of spontaneous language—hesitations, repetitions, corrections, etc.—must be cleaned between automatic SR and MT. For this purpose, Microsoft has built a function called TrueText to turn what you said into what you wanted to say. Because it’s trained on real-world data, it works best on the most common mistakes, Wendt says.
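Microsoft's TrueText is a model trained on real-world data, so the following is only a rule-based imitation of the idea: dropping hesitation fillers and collapsing immediate word repetitions before the text reaches MT. The filler list and function name are assumptions for this sketch.

```python
import re

# Illustrative cleanup between speech recognition and MT:
# remove hesitations ("um", "uh", ...) and collapse repeated words.
HESITATIONS = re.compile(r"\b(um+|uh+|er+|ah+)\b[ ,]*", re.IGNORECASE)
REPEATS = re.compile(r"\b(\w+)( \1\b)+", re.IGNORECASE)

def clean_disfluencies(utterance: str) -> str:
    text = HESITATIONS.sub("", utterance)      # drop filler words
    text = REPEATS.sub(r"\1", text)            # "I I" -> "I"
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_disfluencies("um I I want to to go uh to the airport"))
# -> 'I want to go to the airport'
```

Real disfluency correction also handles self-corrections ("to Boston, I mean to Austin"), which simple rules like these cannot reliably catch; that is where a trained model earns its keep.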

According to Chengqing Zong from the Chinese Academy of Sciences, future advancements in S2S technology may also include new ways of evaluating quality beyond current automatic metrics such as BLEU scores. In the future, Zong says, “We’ll rely more on human judgment. Work on neural networks will continue, despite problems with speed and data sparseness.”
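For context on the metric Zong mentions, here is a minimal, self-contained sketch of sentence-level BLEU: modified n-gram precision combined with a brevity penalty. Real evaluations use corpus-level BLEU with multiple references and standardized tooling; this single-pair version exists only to show the mechanics.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions, scaled by a brevity penalty for short hypotheses."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))
# -> 1.0 (identical sentences score perfectly)
```

One reason interviewees look beyond BLEU: it rewards n-gram overlap with a reference, so a fluent translation worded differently from the reference can score poorly, which matters even more for spontaneous speech.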

Use cases

The most popular use case for S2S is, of course, in travel. Siegfried “Jimmy” Kunzmann of EML says the enterprise market is interesting for its scalability, yet there are privacy and company data security issues. Specifically, there is potential in the enterprise messaging or texting market.

Kunzmann says another attractive area is the transcription and translation of telephone conversations. However, because phone bandwidth is limited, accurate performance depends upon customizing the domain vocabulary with appropriate tools.

Another speech-intensive area is that of speech analytics, which can be used, for instance, to gauge the performance of customer service agents. So far, this has been done only on recorded content, but S2S technology should enable extension to real-time analysis soon.

Alex Waibel of Jibbigo fame, representing Carnegie Mellon University (CMU) / Karlsruhe Institute of Technology in the report, cites medical exchanges, humanitarian missions, broadcast news, and interpretation of lectures, seminars, and political speeches as other important use cases. He says, “Given the formidable language barriers in this setting, the question isn’t whether technology is better than humans, but whether it’s better than nothing.” 

The promise of breaking down language barriers

To a large extent, this still remains a promise. The cost of getting the technology to work for new languages (much less all the world’s 7,000 languages) is still too high. Extensive speech and translation databases must be collected in order to support a new language; also, vocabularies, language models, and acoustic models must be built, trained, and adapted for each language, domain, and task. Most long-tail languages are not well researched, and for many, no data can be found on the internet.

Yuqing Gao of IBM/Microsoft Applied Sciences Group feels that the hope to develop general-purpose speech translation, usable in any situation, is misguided—and in fact, a liability to the field—because it has prompted false expectations. It’s best to aim instead for Spoken Language Translation (SLT) for specific use cases, Gao says. 

Challenges are many

More than once in the report, the biggest challenge to the future development of S2S technology is cited as the consumer expectation that it be a free service. This stems from the fact that several companies are already providing it for free.

Take voice search, for example: it is the weakest market in terms of revenue, yet the one with the biggest potential. Because large platforms presently give it away for free, other, perhaps smaller, companies may be deterred from dedicating resources to research.

Interviewees also frequently mention the misconception that SLT is a solved problem. This, in turn, leads to a lack of funding for advanced research.

One interesting challenge that Chris Wendt of Microsoft/Skype mentions is that speech as a medium of communication may be losing out to text for some use cases. Young people tend to send Instant Messages (IM) or use Snapchat more than they call each other, for instance.

Another interesting point in the report comes from John Frei and Yan Auerbach of SpeechTrans, who feel that emerging devices and S2S hardware are important factors. In China, holding a phone up to someone as part of a translated interchange may be seen as threatening, whereas a wristband or watch may be better.


Universal translators have long been fantasized about, and speech technologies are big steps in that direction. Yet the paradox mentioned at the beginning of this post both aids and ails future development in this field. In the meantime, we’ll see how S2S advances.