Building Voice Apps with Data Annotation

When most of us think of voice assistants, we view them as a recent phenomenon. And to some extent, that is true: The past decade has been full of innovation. Apple launched its ubiquitous voice assistant, Siri, in 2011. Microsoft announced a voice assistant of its own, Cortana, in 2014. And Amazon released its voice assistant, Alexa, with the Amazon Echo smart speaker device, also in 2014.

But the groundwork for modern voice assistants was laid nearly six decades ago, when IBM developed the first digital speech-recognition tool, IBM Shoebox, in 1961. Other digital speech-recognition systems followed during the 1970s and 1980s, including Carnegie Mellon’s Harpy program. The release of Dragon Dictate in 1990 made speech-recognition software available to consumers for the first time, for the not-so-modest price tag of $9,000.

The acceleration of voice assistants in recent years is largely attributable to changing consumer preferences and improved technology. And research clearly shows that voice assistants are not only here to stay, but will become the chief method by which consumers search for products and services online. Already, nearly 60 percent of Americans have relied on voice search to find out about local businesses.

How voice assistants work

A voice assistant works by converting spoken language into text. For that to happen, the human user must first speak a predefined signal phrase that essentially wakes up the assistant; for example, “Hey, Siri” or “Hey, Google”. Once the signal phrase is used, the device begins recording and stops when it detects a pause.

Once the recorded speech reaches the system and is converted to text, the system breaks the text into several small pieces to identify the user’s intent, a process known as parsing. For example, suppose a user says, “Hey, Siri. I really want to go on a vacation this Christmas”. The assistant would focus on the words in the phrase that convey intent: “vacation” and “Christmas”. The assistant would then fulfill the request and convert its text response to speech. In our example, the assistant would provide the user with travel options for December 25th.
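The wake-phrase and parsing steps described above can be sketched in a few lines of Python. This is only a simplified illustration that works on an already transcribed sentence: the wake-phrase list, the stop-word list and the function names are assumptions made for this example, and real assistants rely on trained language models rather than a fixed word list.

```python
import re
from typing import List, Optional

# Hypothetical wake phrases and stop words, chosen only for this illustration.
WAKE_PHRASES = ("hey siri", "hey google")
STOP_WORDS = {"i", "really", "want", "to", "go", "on", "a", "this", "the"}


def normalize(text: str) -> str:
    """Lowercase the transcript, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(cleaned.split())


def strip_wake_phrase(transcript: str) -> Optional[str]:
    """Return the request that follows a wake phrase, or None if there is none."""
    normalized = normalize(transcript)
    for phrase in WAKE_PHRASES:
        if normalized.startswith(phrase):
            return normalized[len(phrase):].strip()
    return None


def intent_keywords(request: str) -> List[str]:
    """Keep only the words that are not in the stop-word list."""
    return [word for word in request.split() if word not in STOP_WORDS]


request = strip_wake_phrase(
    "Hey, Siri. I really want to go on a vacation this Christmas"
)
print(intent_keywords(request))  # prints ['vacation', 'christmas']
```

A sentence that does not begin with one of the wake phrases returns None from strip_wake_phrase, which mirrors the assistant staying idle until it hears its signal phrase.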

Data annotation and voice-assistant technology

While it would be nice for voice assistants to come pre-packaged with the ability to understand different dialects and nuances of human language, it is not quite that simple. For Artificial Intelligence (AI) and Machine Learning (ML) to work, humans must feed the system relevant labeled data sets, a process known as supervised learning. The system uses these data sets essentially to teach itself how to understand and respond to human speech.

Data annotation is a technique used to label digital data sets so they can be understood and processed through ML. Human analysts are generally required to oversee the data annotation process, adding tags, a form of metadata, to data sources such as text, images, videos and audio. Next, machines use an algorithm to process the annotated data, which allows them to recognize patterns in new data sets. The annotations must therefore be highly accurate for the algorithms to learn effectively.

There are several types of data annotation methods, including:

  • Semantic annotation, which involves identifying and annotating concepts such as names or objects within text files. Using semantically annotated data, machines learn how to categorize new concepts.
  • Text categorization, which assigns categories to specific documents. An analyst will tag portions of a document based on a topic such as sports.
  • Video/image annotation, which takes many forms. One common method of image annotation, semantic segmentation, involves assigning a label to every pixel in an image to help the machine recognize the annotated area.
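
To make the three annotation types above more concrete, here is a minimal Python sketch of what the annotated data might look like. The class, the label names and the tiny 3x4 "image" are all hypothetical; actual annotation tools use their own, much richer formats.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class EntitySpan:
    """A semantic annotation: a labeled stretch of characters in a text."""
    start: int
    end: int
    label: str


def annotate(text: str, phrase: str, label: str) -> EntitySpan:
    """Locate the first occurrence of `phrase` in `text` and label it."""
    start = text.index(phrase)
    return EntitySpan(start, start + len(phrase), label)


# 1. Semantic annotation: tag concepts such as names inside a text.
sentence = "Amazon released Alexa in 2014."
entities: List[EntitySpan] = [
    annotate(sentence, "Amazon", "ORGANIZATION"),
    annotate(sentence, "Alexa", "PRODUCT"),
]

# 2. Text categorization: assign a topic to a whole document.
document = {"text": "The home team won in extra time.", "category": "sports"}

# 3. Semantic segmentation: one class id per pixel of a tiny 3x4 "image".
CLASS_NAMES = {0: "background", 1: "speaker"}
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pixels_per_class = Counter(CLASS_NAMES[c] for row in mask for c in row)

for span in entities:
    print(sentence[span.start:span.end], "->", span.label)
print(document["category"])
print(dict(pixels_per_class))
```

In each case the raw data (a sentence, a document, a grid of pixels) is paired with human-supplied labels, and it is these pairs that a supervised learning algorithm trains on.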

Google Actions and Alexa Skills

To connect with their customers through voice, businesses can use the developer tools that platforms such as Google and Amazon provide for their voice assistants. The Google developer platform is called Actions. Actions are essentially functions or intentions that tell Google Assistant what to do, allowing brands to create voice apps that meet their unique needs. For example, a restaurant could build an Action that allows customers to place a food order.

Google Actions are separated into three broad categories: functional (or contact) actions, home-based actions and templates. Functional actions are used for things such as recipes. Home-based actions control Google Smart Home devices, and templates allow users to create games and quizzes. There are currently more than 30,000 Google Actions, with 3,617 added in the first quarter of 2020 alone.

English is by far the predominant language for Google Actions, accounting for 18,828 of the total as of 2020, but other languages are increasing in popularity. For instance, Hindi has the second-highest number of Actions at 7,554. The expansion of Google Actions to other languages will be key for brands looking to expand their reach to other markets.

Amazon has a similar developer platform called Alexa Skills. As of 2019, Alexa offered more than 100,000 skills in categories such as business, finance, news and weather. Both Google and Amazon require developers to annotate training data so that the relevant Action or Skill is paired with a specific parameter. Google offers the example phrase “Book a room on Tuesday.” Here, “Tuesday” is the annotated term that is paired with the intent of booking a room.
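
The idea behind the “Book a room on Tuesday” example can be sketched in Python as follows. This is not the format that Google or Amazon actually use; the AnnotatedPhrase class, the book_room intent name and the date parameter are hypothetical names chosen to show how an annotated word lets one training phrase cover other values.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class AnnotatedPhrase:
    """A training phrase with an intent and its annotated parameter words."""
    text: str
    intent: str
    parameters: Dict[str, str]  # parameter name -> annotated word in the text


def to_pattern(phrase: AnnotatedPhrase) -> "re.Pattern[str]":
    """Turn the phrase into a regex where each annotated word is a slot."""
    pattern = re.escape(phrase.text)
    for name, value in phrase.parameters.items():
        pattern = pattern.replace(re.escape(value), f"(?P<{name}>\\w+)", 1)
    return re.compile(pattern, re.IGNORECASE)


def match(
    utterance: str, phrase: AnnotatedPhrase
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent, parameters) if the utterance fits the annotated phrase."""
    found = to_pattern(phrase).fullmatch(utterance.strip())
    if found is None:
        return None
    return phrase.intent, found.groupdict()


example = AnnotatedPhrase(
    text="Book a room on Tuesday",
    intent="book_room",          # hypothetical intent name
    parameters={"date": "Tuesday"},
)

print(match("book a room on Monday", example))
print(match("Play some music", example))
```

Because “Tuesday” is annotated as a parameter rather than treated as fixed text, the same phrase also recognizes a request for Monday, while an unrelated request does not match at all. Production platforms generalize far beyond this exact-wording match, which is why they need many annotated phrases per intent.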

Building a voice app through data annotation is a major undertaking, especially for brands planning to offer functionality in multiple languages. As voice assistants continue to increase in popularity, platforms such as Google and Amazon will add more and more developer tools and features. These tools will allow brands to offer their customers voice functionality and superior user experiences in markets around the world, so adopting voice search may need to happen sooner rather than later.


Thanks to Hinde Lamrani, RWS Moravia’s International Search Subject Matter Expert, for her input on this post.