Me: OK, Google. Can you tell me how text-to-speech technology is being used today?
Google: Sure. You’re listening to one example of it right now.
Text-to-speech technology has come a long way. From accessibility to education to customer service, it has a broad range of applications, and it looks set to reach even further, and into more languages, in the future.
What’s it used for?
Speech synthesis, the process of artificially generating human speech, dates back to automatons in the 12th century, and the first computers to produce human-like voices from text appeared in the 1950s. For those of us old enough to remember the Speak & Spell educational toy from the 1970s, that too was a text-to-speech (TTS) device. These days, any piece of tech that speaks to you is using this technology.
Text-to-speech engines are built around a database of sound bites, either human-recorded or generated synthetically by computer acoustic models. When text is entered, it is converted to the appropriate phonetic sounds from the database and grouped into phrases, clauses, and sentences. Then, this content is analyzed to determine the correct duration, pitch, and intonation, and run through a synthesizer that speaks it aloud.
Diagram of the text-to-speech workflow from the Wikipedia article on speech synthesis
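For the curious, here is a rough sketch of the front half of that workflow in Python: splitting text into phrases, looking up phonetic sounds, and attaching duration and pitch. It is a toy under stated assumptions, not how any real engine is implemented. The mini-lexicon, the `Unit` type, the function names, and the pitch and duration numbers are all made up for illustration, and the final synthesizer step (turning the plan into audio) is left out.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon (ARPAbet-style symbols), for illustration only.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "how": ["HH", "AW"],
    "are": ["AA", "R"],
    "you": ["Y", "UW"],
}


@dataclass
class Unit:
    """One sound plus the prosody a synthesizer would need (assumed shape)."""
    phoneme: str
    duration_ms: int
    pitch_hz: float


def split_phrases(text):
    """Group text into (phrase, closing punctuation) pairs."""
    pairs = []
    for match in re.finditer(r"([^.?!,;]+)([.?!,;]?)", text):
        phrase = match.group(1).strip()
        if phrase:
            pairs.append((phrase, match.group(2)))
    return pairs


def to_phonemes(phrase):
    """Look each word up in the lexicon; unknown words are spelled out."""
    phonemes = []
    for word in re.findall(r"[a-z]+", phrase.lower()):
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes


def assign_prosody(phonemes, closing_mark, base_pitch=120.0):
    """Toy rule: pitch rises over a question, falls otherwise."""
    slope = 30.0 if closing_mark == "?" else -20.0
    last = len(phonemes) - 1
    units = []
    for i, phoneme in enumerate(phonemes):
        progress = i / max(last, 1)
        duration = 120 if i == last else 80  # lengthen the phrase-final sound
        units.append(Unit(phoneme, duration, base_pitch + slope * progress))
    return units


def synthesis_plan(text):
    """Return one list of Units per phrase, ready for a synthesizer."""
    return [assign_prosody(to_phonemes(p), mark) for p, mark in split_phrases(text)]


plan = synthesis_plan("Hello world. How are you?")
```

Here `plan` holds two phrases. The statement ends lower than it starts, while the question ends higher, which is the kind of intonation decision the analysis step has to make.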
Uses include: assisting people who have reading or visual impairments, language learners with pronunciation, or students who learn better through auditory lessons; automated announcements (think train stations in European cities); audiobooks; and those lovely automated menus (called IVRS or interactive voice response systems) when you call your bank or customer service.
Lastly, we’re all using text-to-speech technology in our mobile devices (Siri, OK Google, and Cortana), cars (infotainment systems read texts or emails to you from your paired phone and provide driving directions), and homes (Amazon Echo and Google Home recite news articles, weather forecasts, and your calendar). When you talk to the device, an additional speech-to-text component converts the analog sound to digital, breaks it down into phonemes, and translates them to text. The system then executes a command or searches for the requested content and uses TTS to read it to you.
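That round trip (speech in, text in the middle, speech out) can be sketched in a few lines of Python. This is a minimal illustration, assuming the audio has already been digitized and broken into phonemes. The lookup tables, the function names, and the `[spoken]` marker standing in for real audio output are all hypothetical.

```python
# Hypothetical reverse lexicon: phoneme sequences mapped back to words.
PHONEMES_TO_WORD = {
    ("W", "EH", "DH", "ER"): "weather",
    ("N", "UW", "Z"): "news",
}

# Hypothetical content the device can look up and read back.
CONTENT = {
    "weather": "Today will be sunny.",
    "news": "Here are the headlines.",
}


def speech_to_text(phoneme_groups):
    """Translate groups of phonemes to words, skipping unknown groups."""
    words = []
    for group in phoneme_groups:
        word = PHONEMES_TO_WORD.get(tuple(group))
        if word is not None:
            words.append(word)
    return " ".join(words)


def find_content(request_text):
    """Search for the requested content by keyword."""
    for keyword, reply in CONTENT.items():
        if keyword in request_text.split():
            return reply
    return "Sorry, I did not understand that."


def text_to_speech(text):
    """Stand-in for a TTS engine: returns what would be spoken."""
    return f"[spoken] {text}"


def assistant_turn(phoneme_groups):
    """One full exchange: speech-to-text, lookup, then text-to-speech."""
    return text_to_speech(find_content(speech_to_text(phoneme_groups)))


reply = assistant_turn([["W", "EH", "DH", "ER"]])
```

A real assistant replaces every one of these functions with something far more capable, but the order of the steps is the same as in the paragraph above.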
Take trainings to the next level
Adding sound to elearning, demos, how-to videos, trainings, and presentations seems like a no-brainer: people can watch or listen, and a play-by-play of a task or process allows for listening while doing—think of the printer technician knee-deep in the machine, who can’t stop to look at a screen.
But professional narration takes time and money. Just as machine translation is a way to translate content that would otherwise be too expensive to localize, TTS can provide audio for training and how-to videos that just need information read aloud, not polished, marketing-esque fluency. According to JBI Studios, a text-to-speech system can convert 10,000 words of content to audio files in about 5 minutes, whereas a voice talent needs about 8 hours to do the same. Imagine localizing a video script and recording it effortlessly in a variety of TTS voices versus the time and cost of sourcing the talent.
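To put the JBI Studios figures side by side, here is the arithmetic as a short Python sketch. The three constants come from the numbers quoted above. The `estimate_minutes` helper is hypothetical and assumes time scales linearly with word count and number of languages, which is a simplification.

```python
WORDS = 10_000
TTS_MINUTES = 5
VOICE_TALENT_MINUTES = 8 * 60  # 8 hours expressed in minutes


def words_per_minute(words, minutes):
    """Throughput for a given job."""
    return words / minutes


def estimate_minutes(words, minutes_per_10k, languages=1):
    """Hypothetical estimate, assuming time grows linearly with words and languages."""
    return words / 10_000 * minutes_per_10k * languages


speedup = VOICE_TALENT_MINUTES / TTS_MINUTES
tts_rate = words_per_minute(WORDS, TTS_MINUTES)
talent_rate = words_per_minute(WORDS, VOICE_TALENT_MINUTES)
four_language_tts = estimate_minutes(WORDS, TTS_MINUTES, languages=4)
```

On these figures, TTS works out to roughly 96 times faster: about 2,000 words per minute against about 21 for a voice talent. Under the linear assumption, the same script in four languages would take around 20 minutes of TTS time.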
There are many TTS options for languages besides English. All of the major computer and mobile operating systems have some level of text-to-speech capabilities built in. Windows and Google have 26 languages, whereas Apple supports 30. Linguatec’s Voice Reader software comes in 4 different versions and offers over 70 voices in 45 languages. iSpeech offers online and cloud TTS solutions in 30 different languages including Hong Kong Cantonese and Arabic, some offering both male and female voices.
Some machine translation software also provides TTS — great for language learners. Google Translate provides text-to-speech in 32 languages. LEC’s offerings include TTS in 10 bidirectional language pairs. PROMT’s mobile apps for both offline and online translation provide speech-to-text as well as text-to-speech. ImTranslator’s online MT tool supports only 12 languages, but it can back-translate and read the translations aloud.
OK, but do they sound like robots?
Some do and some don’t, but gone are the days of the truly robotic voices from WarGames. IVONA’s SpeechCloud is a web service providing TTS in 51 voices across 23 languages, and the voices sound great.
Hello there, my name is Salli. I am one of the IVONA voices... Credit: IVONA Software (An Amazon Company)
In fact, when running text in various languages through demos of iSpeech’s and Linguatec’s offerings, I found that the English voices were actually the choppiest. The other languages also had more accurate pronunciation as well as intonation.
Text-to-speech technology seems to have embedded natural-sounding aural content into all aspects of our lives, and it looks set to further bridge the language gap with affordable multilingual audio solutions. Looks like the future is going to be spoken to us.