What exactly is Voice Technology?
Date: 02 March 2022
Author: Leslie Willis
Every time somebody asks me what I study, there’s a little pause and then I start to drop some keywords like “speech synthesis”, “Alexa, Siri”, “voice recognition”. Sometimes the person I’m talking to will go “ahh, cool”, but most of the time I get a pair of eyes staring back at me in confusion.
The standard sentence I have come to use is: “Oh, Voice Technology is like a mixture of Linguistics and Programming.” And then the faces continue to stare. With this blog post I hope to lift the curtain a little and give you a general idea of what Voice Technology is and how Speech Synthesis and Speech Recognition work. Don’t shy away if it seems bizarre at first. It takes some time to really get it.
In case you want to know more, and even learn how to apply this yourself, I will provide some helpful resources at the end of the blog, so just scroll or read your way down!
“Alexa, how old was Mozart when he got famous?” - But what else?
Thinking about conversational interfaces such as Alexa, the main purpose of Voice Technology seems to be asking questions, setting timers or creating shopping lists. But what else can we do with Voice Technology? We can ask it to read newspaper articles (or anything else) out loud,
use it to build synthetic voices for people with speech impairments, rely on it for tasks where we cannot use our hands (driving, housework, ...) but still need to get things done on the computer, and use it for information, for automation, or for detecting sentiment or even diseases in speech. The list goes on. Since the first speech synthesizers and recognizers were invented, Voice Technology has come a long way. Yet there are still many areas to explore and research before its full potential is unlocked.
Speech Synthesis vs Speech Recognition - How does this work?
As with every other discipline, you can break Voice Technology down into its thematic components. Here, we start with Speech Synthesis and Speech Recognition. Devices such as Alexa and Siri use both. When you say: “Hey Siri, what’s the capital of Ireland?”, Siri uses Speech Recognition to understand what you said. Of course, this does not work the way it does in our human minds. What happens is the following: the classical speech recognizer takes all the sounds from the sentence as input and then tries to match them to its built-in library of sounds, which in turn are matched to meanings. If you dive deeper into this, you will learn that what I referred to as ‘sounds’ are really ‘features’ of speech, which are extracted from the input audio. Have I lost you yet? I hope you get the idea.
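If you are curious what this “match the sounds to a library” step could look like in code, here is a very simplified Python sketch. Everything in it is made up for illustration: the two “features” are deliberately crude, and real recognizers use far richer features (such as MFCCs) and statistical models.

```python
# Toy illustration of the "match sounds to a library" idea in speech recognition.
# The features are deliberately simplistic; this is not how a real recognizer works.
import numpy as np

def toy_features(signal, frame_len=400):
    """Cut the signal into frames and describe each frame with two simple numbers."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, frame_len)]
    return np.array([[np.mean(f ** 2),                       # frame energy
                      np.mean(np.abs(np.diff(np.sign(f))))]  # rough zero-crossing rate
                     for f in frames])

def recognize(signal, templates):
    """Pick the word whose stored features are closest to the input's features."""
    feats = toy_features(signal).mean(axis=0)
    return min(templates, key=lambda word: np.linalg.norm(feats - templates[word]))

# Hypothetical 'library of sounds': average features of two pre-recorded words.
templates = {"dublin": np.array([0.02, 0.11]), "berlin": np.array([0.05, 0.30])}
audio = np.random.randn(16000) * 0.1   # stand-in for a real recording
print(recognize(audio, templates))
```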
If we now consider Speech Synthesis, this is how Siri gets her voice. Let us stick to the example above. So, Siri got it: you want to know the capital of Ireland. But how does she answer? Exactly, with a synthesized voice. How does that work? Well, inside Siri’s technological brain, she has a whole dictionary of sounds. As you know by now, the words in a sentence are built from sounds that are put together, and this is exactly what happens in Speech Synthesis. Let’s assume Siri found the answer with a quick search on the internet: she knows it’s Dublin and she wants to tell you. Now she needs to look up all the sounds that are needed to build “Dublin”.
In case you are familiar with the International Phonetic Alphabet (IPA), this is what it would look like:
[ˈdʌblɪn]
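By the way, that “dictionary of sounds” can be pictured as a simple lookup table. Here is a tiny Python sketch with hypothetical entries, just to make the idea concrete:

```python
# A minimal sketch of the 'dictionary of sounds' idea: look up which phones
# (speech sounds) a word is made of. The entries are hypothetical IPA examples;
# real systems use large pronunciation lexicons.
lexicon = {
    "dublin":  ["d", "ʌ", "b", "l", "ɪ", "n"],
    "ireland": ["aɪ", "ə", "l", "ə", "n", "d"],
}

def phones_for(word):
    """Return the list of sounds the synthesizer would need to glue together."""
    return lexicon[word.lower()]

print(phones_for("Dublin"))   # ['d', 'ʌ', 'b', 'l', 'ɪ', 'n']
```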
So, you need all these sounds to make “Dublin”. However, it is not as easy when you want the speech to sound natural. Try it out! Record all the sounds you think are necessary to create “Dublin” if you glue them together. You can use simple software such as Audacity (it’s free and pretty intuitive!), or any other sound/music editing software. Paste in the individual sounds you recorded, get rid of the silence in between and rate your result. Does it sound natural? Most probably it will be okay-ish, and this is just a single word. Imagine you wanted to build a whole sentence using this technique. All the sounds you use have to match the intonation you have in mind. And if you want to change that intonation? That would be a whole new task.
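If you prefer code over clicking around in an audio editor, here is a rough Python sketch of the same gluing exercise, using only the standard library. The file names are hypothetical; you would record one short WAV per sound yourself (all with the same sample rate and format):

```python
# Glue several short recordings together, back to back, into one WAV file.
import wave

def concatenate(input_paths, output_path):
    """Append the given WAV recordings into a single output file."""
    frames, params = [], None
    for path in input_paths:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()   # keep the first file's format settings
            frames.append(w.readframes(w.getnframes()))
    with wave.open(output_path, "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)

# Hypothetical recordings of the individual sounds of "Dublin":
concatenate(["d.wav", "uh.wav", "b.wav", "l.wav", "i.wav", "n.wav"], "dublin.wav")
```

Listen to the result and judge for yourself whether it sounds like a person saying “Dublin” or like six separate noises in a row.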
Luckily, there are algorithms that do the job more smoothly by considering not just one bit of sound but also the sounds around it. This way we might take the “Du”, the “ub”, the “bl” and so on from “Dublin” to make it sound natural. Machine Learning plays a big role in automating this. It is part of the artificial intelligence family and definitely something to look into when you want to explore the Voice Technology world.
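To make the idea of “considering the sounds around it” concrete, here is a tiny Python sketch that lists such overlapping units (often called diphones) for “Dublin”. A real system would then fetch a recorded snippet for each unit; this is only an illustration:

```python
# Turn a phone sequence into diphone units, each spanning the transition
# from one sound into the next. These transitions are what single-phone
# gluing gets wrong, and what diphone synthesis preserves.
phones = ["d", "ʌ", "b", "l", "ɪ", "n"]
diphones = [phones[i] + "-" + phones[i + 1] for i in range(len(phones) - 1)]
print(diphones)   # ['d-ʌ', 'ʌ-b', 'b-l', 'l-ɪ', 'ɪ-n']
```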
Alright, now you have a very basic understanding of what Voice Technology is and how it (Speech Synthesis and Speech Recognition) works. So, what now?
Interested? Here are some Resources:
Could I spark your interest in Voice Technology and perhaps demystify it a little? Great!
If you are interested in looking into it more, here are some very helpful resources, as promised.
In case you do decide to dive into the world of code and voice technology: Have fun, embrace the frustration if something does not work out (it’s normal!) and don’t stop trying things out!
Online courses by the University of Edinburgh:
YouTubers:
For getting into coding:
About the author
I am Leslie, 23 years old, and currently studying the MSc Voice Technology at Campus Fryslân. Before that I studied in Germany, which is also where I am from. I’m a language enthusiast and I love music and coffee... and ginger beer!