What is Deep Learning and How Will it Change Text-to-Speech?
Text-to-speech technology has advanced greatly over the past two decades. Once known for robotic-sounding voices, text-to-speech engines can now produce voices that sound nearly as lifelike as an actual human.
Today, making a natural-sounding text-to-speech voice is labor-intensive and expensive. The two most popular methods, hidden Markov model (HMM) synthesis and unit selection synthesis (USS), require hours of recordings from a voice actor. Then, computer programmers with an understanding of linguistics must break all of that audio down into small units of speech, such as phonemes, tag them appropriately, and define the rules for when each individual unit should be used.
This is a very, very long process.
But what if instead of manually writing code to teach a text-to-speech voice how to speak, it could just learn how to do it on its own? Much like how we learned how to communicate as kids in school, the engine could use machine learning to learn how to speak without external input from a programmer.
Thanks to deep learning, a sub-field of machine learning that attempts to mimic the way our brains function, this has become a possibility.
In the past 6 months, Google’s DeepMind and Baidu have announced breakthroughs by using deep neural networks (DNNs) to create synthesized speech. The research from these companies has shown that deep learning will change the way we approach text-to-speech.
With deep learning, we’ll be able to make text-to-speech voices more quickly and with much less labor. These voices also have the potential to sound even more realistic than anything on the market today.
Before we go into how machine learning, deep learning, and DNNs are going to revolutionize the text-to-speech industry, let’s define what each of these terms means and how they are related to each other.
By its most basic definition, a machine is considered artificially intelligent (AI) if it is able to convince a human that it’s another human, whether through talking or through a text chat. A more specific form of AI, called narrow AI, is when a computer is able to perform a specific task as well as or even better than a human.
In the case of text-to-speech, a narrow AI text-to-speech engine should be able to speak, and learn how to speak, as well as or better than a human.
Machine Learning is a type of AI that gives computers the ability to learn on their own. Computers do this by using algorithms to analyze data, learn from it, and make determinations and even predictions. Computers that have machine learning capabilities are able to learn things that weren’t originally written into them by a computer programmer.
As discussed above, when creating a text-to-speech engine a programmer has to manually break down speech recordings, tag them, and define how each unit of speech is supposed to be used. With machine learning, the goal is to have a text-to-speech engine be able to tag units of speech itself and determine the rules for each of them on its own.
Deep learning is a sub-field of machine learning that focuses on using algorithms that were inspired by the biology of the human brain. By taking a look at the structure of our brains and how they work, computer scientists built these algorithms to mimic the ways our brains function.
These algorithms are called artificial neural networks.
Artificial Neural Networks
Our brains use roughly 86 billion interconnected neurons to process raw data. Let’s use image recognition as an example to see how a neural network works. If you’re looking at an object, what you’re seeing is the raw data. Then, your neurons start communicating with each other to process what you’re seeing: the object is very large, yellow, moving on four wheels, and has about a dozen windows in a row with children looking out of them.
By communicating all of these pieces of information with each other, your neurons helped you conclude that you were looking at a school bus.
Artificial neural networks work in a similar way. They are usually drawn as columns of circles connected by lines.
Each circle is an artificial neural unit, and each column is a layer. The first layer is the input layer that receives the raw data. Each neural unit then analyzes a tiny aspect of the raw data, makes a determination, and sends that information to the next layer.
Using the example above, one neural unit could have determined if the object was on wheels or not. By analyzing the data it has received and weighing the options, it determined that the object was indeed on wheels. The neural unit then sent that information to the next layer of neural units.
Many neural units across many layers made these small determinations until the data reached the output layer, which made the final determination. In our example, the final determination was that the object was a school bus.
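To make the idea of units and layers more concrete, here is a minimal sketch in Python. It is only an illustration, not how any real image recognition system is built: the four input features, the weights, and the biases are all made up by hand rather than learned, and the network has just one hidden layer.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neural_unit(inputs, weights, bias):
    """One neural unit: weigh each input, add a bias, squash the total."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)

def layer(inputs, weight_rows, biases):
    """A layer is a list of units that all receive the same inputs."""
    return [neural_unit(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical input features: [is_large, is_yellow, has_wheels, has_many_windows]
features = [1.0, 1.0, 1.0, 1.0]

# Hidden layer with two units. The weights are invented for this example.
hidden = layer(
    features,
    weight_rows=[
        [2.0, 2.0, 0.0, 0.0],  # responds to "large and yellow"
        [0.0, 0.0, 2.0, 2.0],  # responds to "wheels and many windows"
    ],
    biases=[-3.0, -3.0],
)

# Output layer with one unit: how strongly does the network believe
# that the object is a school bus?
bus_score = layer(hidden, weight_rows=[[4.0, 4.0]], biases=[-5.0])[0]
print(f"school bus score: {bus_score:.2f}")
```

A score above 0.5 can be read as "probably a school bus". If you set all four features to 0.0, the score drops well below 0.5, because neither hidden unit sees anything it responds to.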
There are many different types of artificial neural networks that have different structures and functions. In their research, DeepMind and Baidu used deep neural networks (DNNs) to create their text-to-speech prototypes. A DNN is an artificial neural network with multiple hidden layers between the input layer and the output layer. A DNN can have anywhere from a handful to hundreds of hidden layers.
Until recently, training and running a DNN required so much processing power that DNNs weren’t considered practical. But in the last few years, GPUs have made parallel processing cheaper and faster, so the use of DNNs and other types of artificial neural networks has become more feasible.
When an artificial neural network is first created, it must be trained. A programmer feeds the artificial neural network raw data known as “training data” and also tells it what the output for that data should be. This lets the system compare the output it produced on its own with the output it was supposed to produce, and adjust its neural units accordingly. Over time, the artificial neural network learns from its mistakes and becomes more and more accurate.
As the computer keeps analyzing more data over time, it will continue to learn on its own until it can perform the task just as well as or even better than an actual human. At that point, the artificial neural network has achieved narrow AI status for the task that it is performing.
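Here is a rough sketch of what that training process can look like for a single neural unit. It is a toy example under several assumptions: the training data is invented, the learning rate and number of passes are arbitrary, and the update rule is plain gradient descent on one unit, which is far simpler than what is used to train a full DNN.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, bias, inputs):
    """The unit's own answer for one piece of data, between 0 and 1."""
    return sigmoid(sum(w * i for w, i in zip(weights, inputs)) + bias)

# Made-up training data: ([is_yellow, has_wheels], expected output).
# Only an object that is both yellow and on wheels counts as a "bus" here.
training_data = [
    ([0.0, 0.0], 0.0),
    ([0.0, 1.0], 0.0),
    ([1.0, 0.0], 0.0),
    ([1.0, 1.0], 1.0),
]

weights, bias = [0.0, 0.0], 0.0  # the unit starts out knowing nothing
learning_rate = 0.5              # arbitrary choice for this sketch

for epoch in range(2000):
    for inputs, expected in training_data:
        # Compare the unit's answer with the answer it was supposed to give.
        error = expected - predict(weights, bias, inputs)
        # Nudge each weight in the direction that shrinks the error.
        weights = [w + learning_rate * error * i for w, i in zip(weights, inputs)]
        bias += learning_rate * error

for inputs, expected in training_data:
    print(inputs, "expected:", expected, "predicted:", round(predict(weights, bias, inputs), 2))
```

After training, the prediction for a yellow object on wheels ends up close to 1 and the other three end up close to 0, even though nobody wrote that rule into the code.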
Now that we understand the basic terms and definitions, let’s take a look at how DNNs and deep learning in general were used to create synthesized speech and what it means for the future of the industry.
If you’re also interested in terms and definitions frequently used in the text-to-speech industry, check out our Ultimate Text-to-Speech Glossary!
Deep Neural Networks and Text-to-Speech
Google’s DeepMind was the first to release its research, in September of 2016. The researchers announced the creation of WaveNet, a “deep generative model of raw audio waveforms”. They claimed that WaveNet was able to generate speech that sounded more realistic than Google’s existing text-to-speech systems.
WaveNet is a DNN designed to generate audio. To start, the researchers had to “train” WaveNet. They did this by feeding it raw audio files, which were recordings of real human speakers. WaveNet broke the audio down into the smallest possible samples and analyzed them. At each step, it drew a value from a probability distribution that it computed, and then fed back that information to the input so it could make a new prediction for the next step.
Basically, WaveNet analyzed recordings of human speech, broke them down into tiny units, and then determined when and how each unit of speech is supposed to be used based on probabilities, so it could learn how to generate its own coherent speech.
DeepMind had human testers compare the speech generated by WaveNet to Google’s existing text-to-speech engines. The testers concluded that WaveNet produced the most natural sounding synthesized speech out of all the text-to-speech engines. However, it still wasn’t rated as highly as actual human speech, meaning there’s still a long way to go until text-to-speech sounds exactly like human speech.
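The "draw a value, then feed it back in" loop can be sketched in a few lines of Python. This is not WaveNet: the function `next_sample_distribution` is a hand-written stand-in for the trained network, and the number of amplitude levels (`LEVELS`) is an arbitrary choice for the example. Only the shape of the loop is the point.

```python
import math
import random

LEVELS = 8  # hypothetical number of quantized amplitude levels

def next_sample_distribution(history):
    """Stand-in for the trained network: one probability per level.

    This toy rule simply prefers levels close to the previous sample.
    A real model would learn these probabilities from recorded speech.
    """
    previous = history[-1]
    scores = [math.exp(-abs(level - previous)) for level in range(LEVELS)]
    total = sum(scores)
    return [s / total for s in scores]

def generate(num_samples, seed=0):
    """Generate num_samples (at least 1) values, one step at a time."""
    rng = random.Random(seed)
    samples = [LEVELS // 2]  # start in the middle of the range
    while len(samples) < num_samples:
        probabilities = next_sample_distribution(samples)
        # Draw the next value from the probability distribution.
        drawn = rng.choices(range(LEVELS), weights=probabilities, k=1)[0]
        # Feed the drawn value back in so it shapes the next prediction.
        samples.append(drawn)
    return samples

waveform = generate(20)
print(waveform)
```

Because every new value depends on the values before it, the output has to be generated one step at a time, which is part of why this style of generation takes so much processing power.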
In the end, they were able to create a better sounding text-to-speech voice. But instead of having programmers manually break down, tag, and classify each unit of speech when building WaveNet, they just had to train the system to do that on its own.
Baidu’s Deep Voice, announced in February 2017, is very similar to WaveNet. It is a text-to-speech engine built entirely from DNNs.
Baidu looked at the pipelines for how traditional text-to-speech engines are built. They then replaced each component of that pipeline with DNNs. By doing this, Baidu claims that Deep Voice will be able to create new text-to-speech voices with much less labor than current methods for building text-to-speech engines.
When Baidu published its research, Deep Voice wasn’t end-to-end yet, but Baidu claims that text-to-speech is at the tipping point for being revolutionized by deep learning.
So what does all of this mean for the future of text-to-speech? It means that down the road, most new text-to-speech engines will probably be built using DNNs. This will make building text-to-speech voices less labor intensive, cheaper, and faster.
Also, since DNNs get better as they are trained on more data and as the techniques behind them improve, you can expect that the quality of text-to-speech voices will keep improving, perhaps until it’s impossible to differentiate between real speech and synthesized speech.
However, a roadblock that is keeping DNNs from taking over the industry today is that they are still computationally expensive, meaning that it takes a lot of processing power to run a text-to-speech engine built with DNNs. Because of this, engines like WaveNet and Deep Voice are not yet feasible for most commercial or personal users.
Eventually, as our devices become able to handle higher processing demands and the technique for using DNNs is fine-tuned and perfected, you can bet that deep learning will take over the text-to-speech industry.
What do you think?
Do you think deep learning is the future of text-to-speech? How do you think the industry will be impacted by DNNs? Let us know in the comments!