NeoSpeech’s Text-to-Speech Is More Intelligent Than You Think
A look at how NeoSpeech’s Text-to-Speech engines are able to understand the meaning of the text that they convert into speech.
What separates a good text-to-speech engine from a great text-to-speech engine?
If you were asked that question, you’d probably say that the quality of the voice is what makes a text-to-speech engine great. Good text-to-speech engines can take in any text and convert it into speech that you’d be able to understand, but great text-to-speech engines do it with a voice that’s so natural sounding and realistic that you think another human is actually talking to you.
So what determines the quality of the voice? Several factors go into this. The quality of the recordings from the original voice actor can have an effect. However, it’s not difficult to create high quality voice recordings. Just about every text-to-speech provider out there is able to do this.
Another factor that could affect the quality of the voice is the speech synthesis technique that the engine uses. The two most common techniques are the HTS technique (HMM-based speech synthesis, which uses a statistical model to generate speech units) and the USS technique (unit selection synthesis, which takes speech units from a large database of voice recordings and puts them together).
Today, the USS technique produces the most natural sounding voices, but as the HTS technique keeps improving over time, it’ll eventually be as good as, if not better than, the USS technique.
A new speech synthesis technique that uses Deep Neural Networks (DNNs) is currently being researched by several companies, including Google’s DeepMind and Baidu. This technique could produce the most realistic text-to-speech voices ever made, but it is still a long way from being commercially available.
While both of these factors and several others can affect the quality of a text-to-speech voice, perhaps the most important factor in the quality of the voice is how intelligent the text-to-speech engine is. When we say “intelligent”, we mean the ability of the engine to understand the meaning of text in order to generate speech that effectively and accurately conveys the true meaning of the text.
Why is “intelligence” important when it comes to text-to-speech?
A great text-to-speech engine must be smart enough to know the meaning of the text it receives. The process of converting text into speech isn’t as simple as looking at each individual letter and applying a sound, or phoneme, to it.
For example, the letter “a” in the word “walk” sounds very different from the letter “a” in “game”. Every language has spelling and pronunciation rules that dictate how certain letters and words are supposed to be pronounced. A text-to-speech engine has to know these rules and apply them to the text it receives.
In addition to understanding how individual words are supposed to be pronounced, it’s important for a text-to-speech engine to be smart enough to understand the context of a group of words, or sentences.
Heteronyms are a good example of why this is important. These are words that have identical spellings but are pronounced differently and have different meanings. Read this sentence aloud to yourself to see what we’re talking about:
“The wound in my knee hurt so bad that I wound up falling over.”
Did you notice that you read over the word “wound” twice? And did you also notice that each time you did, you pronounced it differently?
That’s a heteronym. You knew that the first “wound” meant “injury” and that “wound up” at the end of the sentence meant “ended up”.
By understanding the context that each word was used in, you were able to pronounce them correctly. This is exactly what a high quality text-to-speech engine needs to be able to do on a consistent basis.
How NeoSpeech’s Text-to-Speech engines understand the meaning of text
Here’s that same sentence again, but this time, read aloud by NeoSpeech’s James voice.
As you just heard, James was able to identify how each “wound” was being used and applied the correct pronunciation to each one. Being able to do this and ensuring that all of the other pieces of speech in the sentence fit together in a natural way is what makes James a great text-to-speech voice.
So how do NeoSpeech’s text-to-speech voices do this?
In the seemingly instantaneous moment between receiving the text and turning it into speech, the engine breaks down the text and analyzes every part of it to understand the meaning.
Here is a simplified overview of the process:
Part-Of-Speech (POS) Tagging
One of the first things our text-to-speech engine does is assign a part-of-speech to each word. It analyzes entire sentences to determine what the subject is, and what words are nouns, verbs, adjectives, etc. Once the engine knows this, it can begin estimating how each word is supposed to be pronounced.
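To make the idea concrete, here is a minimal Python sketch of how a part-of-speech guess can pick the right pronunciation for “wound” in the example sentence from earlier. It is only an illustration, not NeoSpeech’s actual method: the word lists, the `guess_pos` rule, and the `HETERONYMS` table are hypothetical, and a real tagger looks at far more than the previous word. The pronunciations are written in ARPAbet-style phoneme symbols.

```python
import re

# Hypothetical, hand-made word lists for this toy example only.
DETERMINERS = {"the", "a", "an", "my", "your", "his", "her"}
PRONOUNS = {"i", "you", "he", "she", "we", "they", "it"}

# Heteronym table: one spelling, different phonemes per part of speech.
HETERONYMS = {
    "wound": {"NOUN": "W UW1 N D", "VERB": "W AW1 N D"},
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def guess_pos(tokens, i):
    """Guess a part of speech for tokens[i] from the previous word only."""
    prev = tokens[i - 1] if i > 0 else ""
    if prev in DETERMINERS:
        return "NOUN"   # "the wound", "my wound"
    if prev in PRONOUNS:
        return "VERB"   # "I wound", "they wound"
    return "UNKNOWN"

def pronounce_heteronyms(text):
    """Return (word, guessed POS, phonemes) for each heteronym found."""
    tokens = tokenize(text)
    results = []
    for i, token in enumerate(tokens):
        if token in HETERONYMS:
            pos = guess_pos(tokens, i)
            results.append((token, pos, HETERONYMS[token].get(pos)))
    return results

sentence = "The wound in my knee hurt so bad that I wound up falling over."
for word, pos, phonemes in pronounce_heteronyms(sentence):
    print(word, pos, phonemes)
```

The first “wound” follows “The”, so the sketch treats it as a noun, while the second follows “I” and is treated as a verb. Each one then gets its own pronunciation.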
Grapheme-To-Phoneme (G2P) Conversion
A grapheme is the smallest unit in a writing system. In our case, letters are graphemes. Phonemes, as hinted at earlier, are the smallest units of speech. As you can probably guess by this point, Grapheme-To-Phoneme conversion is the process of converting a sequence of letters into a sequence of sounds, which in the end will create the speech.
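As a rough illustration, the Python sketch below converts the words “walk” and “game” from earlier into phonemes using a few context-sensitive rules. Everything in it is an assumption made for the example: the `RULES` list and the `DEFAULT` letter table are hypothetical and cover only a handful of letters, whereas a production engine relies on large pronunciation dictionaries and far richer rules or models.

```python
import re

# Hypothetical context-sensitive rules, tried in order at each position.
# Each rule maps a spelling pattern to a list of ARPAbet-style phonemes.
RULES = [
    (re.compile(r"alk"), ["AO", "K"]),           # "walk": the l is silent
    (re.compile(r"a(?=[^aeiou]e$)"), ["EY"]),    # "game": a + consonant + final e
    (re.compile(r"e$"), []),                     # silent final e
]

# Fallback sound for single letters (deliberately tiny).
DEFAULT = {"a": "AE", "c": "K", "e": "EH", "g": "G", "k": "K",
           "l": "L", "m": "M", "t": "T", "w": "W"}

def g2p(word):
    """Convert a word to phonemes with rule lookup and a letter fallback."""
    word = word.lower()
    phonemes = []
    pos = 0
    while pos < len(word):
        for pattern, output in RULES:
            match = pattern.match(word, pos)
            if match:
                phonemes.extend(output)
                pos = match.end()
                break
        else:
            # No rule applied here, so fall back to the single-letter table.
            phonemes.append(DEFAULT.get(word[pos], "?"))
            pos += 1
    return phonemes

for word in ["walk", "game", "cat"]:
    print(word, g2p(word))
```

The same letter “a” comes out as three different phonemes depending on the letters around it, which is the point the “walk” and “game” example makes.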
Next, the engine works out the prosody of the speech. There’s a lot that goes on here, but in simplified terms, this is when the text-to-speech engine determines the duration, timing, and pitch of the speech. This is an essential part of the process as it enables a text-to-speech voice to deliver a line of speech in the same manner that a human would.
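Here is a minimal Python sketch of what assigning duration and pitch to a sequence of phonemes could look like. The numbers and rules are assumptions chosen for illustration (pitch drifts down over a statement, rises over a question, and the last phoneme is lengthened). The `ProsodyTarget` type and `assign_prosody` function are hypothetical names, not part of any NeoSpeech product.

```python
from dataclasses import dataclass

@dataclass
class ProsodyTarget:
    phoneme: str
    duration_ms: float
    pitch_hz: float

def assign_prosody(phonemes, is_question=False,
                   base_pitch=120.0, base_duration=80.0):
    """Attach a toy duration and pitch target to every phoneme."""
    count = len(phonemes)
    slope = 0.2 if is_question else -0.2   # rise for questions, fall otherwise
    targets = []
    for i, phoneme in enumerate(phonemes):
        # progress runs from 0.0 at the first phoneme to 1.0 at the last.
        progress = i / (count - 1) if count > 1 else 0.0
        pitch = base_pitch * (1 + slope * progress)
        # Lengthen the final phoneme, as speakers tend to do at phrase ends.
        duration = base_duration * (1.5 if i == count - 1 else 1.0)
        targets.append(ProsodyTarget(phoneme, duration, pitch))
    return targets

for target in assign_prosody(["W", "AO", "K"]):
    print(target)
```

Real engines predict these values from the part-of-speech and sentence analysis done in the earlier steps rather than from fixed rules like these.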
The final step in the process is the speech synthesis itself. This is when the engine puts together the pieces of speech from the speech database. The database is where all the recordings from the voice actor are stored. At this point, the engine knows which specific pieces of speech it needs and where to place them.
If you’re using a USS-based speech synthesizer, the engine will have a very large database and be able to pull every phoneme it needs from it. If you’re using an HTS-based engine, it’ll have a smaller database and have to manipulate some of the phonemes it takes to make them sound the way the engine thinks they are supposed to sound.
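For the USS case, the core idea can be sketched in Python as a search for the sequence of recorded units that best matches the desired pitch while also joining smoothly from one unit to the next. This is a simplified sketch under stated assumptions: the `Unit` type, the clip names, and both cost functions (plain pitch differences) are hypothetical, and real systems compare many more acoustic features.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phoneme: str
    pitch_hz: float
    clip_id: str   # hypothetical ID of a recording in the speech database

def select_units(targets, database):
    """Pick one unit per target so that total cost is as low as possible.

    targets:  list of (phoneme, desired_pitch_hz)
    database: dict mapping a phoneme to its list of candidate Units
    Cost = |unit pitch - desired pitch| + |pitch jump between neighbours|.
    """
    if not targets:
        return []
    best = []   # best[i][j] = (lowest cost ending in candidate j, backpointer)
    for i, (phoneme, desired_pitch) in enumerate(targets):
        row = []
        for unit in database[phoneme]:
            target_cost = abs(unit.pitch_hz - desired_pitch)
            if i == 0:
                row.append((target_cost, None))
                continue
            previous_units = database[targets[i - 1][0]]
            # Cheapest way to arrive at this unit from any previous candidate.
            cost, back = min(
                (best[i - 1][k][0] + abs(prev.pitch_hz - unit.pitch_hz), k)
                for k, prev in enumerate(previous_units)
            )
            row.append((cost + target_cost, back))
        best.append(row)

    # Trace the cheapest path backwards from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(database[targets[i][0]][j])
        j = best[i][j][1]
    return path[::-1]

database = {
    "W": [Unit("W", 100.0, "w1"), Unit("W", 125.0, "w2")],
    "AO": [Unit("AO", 118.0, "ao1"), Unit("AO", 140.0, "ao2")],
    "K": [Unit("K", 110.0, "k1"), Unit("K", 90.0, "k2")],
}
targets = [("W", 120.0), ("AO", 118.0), ("K", 112.0)]
print([unit.clip_id for unit in select_units(targets, database)])
```

The more candidate recordings the database holds for each sound, the more likely it is that a close match already exists, which is one reason USS engines rely on very large databases.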
What do you think?
As you can see, there’s a lot more that goes into NeoSpeech’s text-to-speech engines than you might think. We pride ourselves on providing the highest quality and most natural sounding text-to-speech voices on the market. To get to this point, we worked very hard on making sure our text-to-speech engines are the most intelligent ones in the world. Our goal is to continue pushing the limits of synthesized speech and making our voices sound as realistic as possible.
Did you know how much goes into a text-to-speech engine? What are your thoughts on the process? Let us know in the comments!
Learn More about NeoSpeech’s Text-to-Speech
Want to learn more about all the ways Text-to-Speech can be used? Visit our Text-to-Speech Areas of Application page. And check out our Text-to-Speech Products page to find the right package for any device or application.
If you’re interested in integrating Text-to-Speech technology into your product, please fill out our short Sales Inquiry form and we’ll get you all the information and tools you need.