Language technology lovers have cause for celebration this week. Microsoft announced that its conversational speech recognition technology has reached parity with professional human transcribers. At a 5.1 percent word error rate, the system represents a 12 percent relative reduction in errors over last year's measurement, sets a new industry standard, and is expected to be a boon to a wealth of Microsoft business services, including those in the translation space.
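For readers unfamiliar with how such figures are scored: word error rate (WER) is the minimum number of word substitutions, insertions, and deletions needed to turn a system's transcript into the human reference transcript, divided by the number of words in the reference. A minimal sketch in Python (the sentences and the old/new rates below are invented illustrations, not Switchboard data or Microsoft's actual measurements):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    hypothesis and the reference, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,            # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words ≈ 0.167 WER:
print(wer("the cat sat on the mat", "the cat sat on mat"))

# A "12 percent" improvement is a *relative* reduction between two
# measured rates (hypothetical numbers for illustration):
old_rate, new_rate = 0.058, 0.051
print((old_rate - new_rate) / old_rate)  # ≈ 0.12
```

In practice, benchmark scoring also handles details such as text normalization and alignment of multiple reference transcripts, but the core arithmetic is the above.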
According to Xuedong Huang, the chief speech scientist of Microsoft's Speech and Dialog Research Group, one star of this success story is Microsoft Cognitive Toolkit 2.1. The tool, distributed for free on GitHub under an open-source license, is built for processing massive datasets. In this case, it was trained on Switchboard, a dataset of 260 hours of recorded American English telephone conversations. The conversations were collected by Texas Instruments in 1990 and 1991 and have since been made available to a wide variety of industry and academic projects in the speech recognition sector.
A number of Microsoft products have already benefited from its research group's work. Among them is Presentation Translator, launched in July. A PowerPoint add-in powered by the Microsoft Translator live feature, Presentation Translator translates live presentations from ten spoken languages—specifically Arabic, Chinese (Mandarin), English, French, German, Italian, Japanese, Portuguese, Russian, and Spanish—into 60 supported text languages, output as slide subtitles. Moreover, for English and Chinese speakers, Presentation Translator allows presenters to customize the speech input to handle their industry-specific jargon and terminology, boosting accuracy by as much as 30 percent, according to Microsoft.
Getting started with Presentation Translator. Source: Microsoft Research
As Huang notes in his blog post on the group’s achievement, such improved accuracy in conversational speech recognition comes with some caveats. Heavily accented speech, multilingual and multi-party conversations, and even noisy background environments continue to challenge the technology. Additionally, as machine translation users can readily attest, not all languages are as well supported as the world’s most spoken languages.
Nevertheless, the implications of this and other successes in the speech recognition space for translation and localization customers are impressive. Global players, including the likes of Microsoft, Apple, and Google, are bringing together AI, deep learning tech, and machine translation engines to offer seamless multilingual product and service delivery—and accompanying multilingual marketing—to business customers worldwide.
Even end consumers are benefiting: these speech recognition systems drive intelligent virtual assistants (IVAs) such as Microsoft's Cortana (for Windows 10), Apple's Siri, and Amazon's Alexa, which are making their way into an increasing number of homes.
Just last week, in fact, Amazon announced the launch of the Alexa Voice Service Device SDK, opening Alexa to outside developers. Also recently, Mozilla announced a project called Common Voice, for which it is seeking volunteers to contribute to an open-source voice recognition system as a non-proprietary alternative. According to research firm Global Market Insights, the multilingual and global IVA market will reach more than USD 7.5 billion by 2024, driven (unsurprisingly) by developments in voice recognition technology and growth in mobile technology markets worldwide.
Getting started with the Alexa Voice Service Device SDK. Source: Amazon Alexa Developers
Whether developed for the private sector or the public one, and whether used in our workplaces or our automobiles, advances in speech recognition technology are set to transform multilingual markets worldwide. Kudos to the Microsoft research team for their contribution.