HTS vs. USS: Which Speech Synthesis Technique is Better?
By now you probably know about Text-to-Speech and how it works (if not, feel free to check out our What is Text-to-Speech and How Does it Work post to learn more). You also know that there are two main speech synthesis techniques: HTS (HMM-based Speech Synthesis System) and USS (Unit Selection Synthesis). So the next big question on everyone’s mind is: which speech synthesis technique is better?
Here at NeoSpeech, we use USS because it is considered to produce the most natural sounding synthesized speech. But HTS has its merits too, such as a much shorter recording time and a smaller database.
Let’s dive into examining these two techniques to find out the pros and cons of each one and help answer the question: which speech synthesis technique is better?
Unit Selection Synthesis (USS)
Let’s start with USS. As I said before, USS produces the most natural sounding synthesized speech. This means that USS produces synthesized voices that sound the most human-like, the most realistic, and the most pleasant to listen to. This is the main reason that NeoSpeech uses USS – because no one wants to listen to a robotic voice that sounds like it is from an 80’s movie!
How does USS Produce Speech?
In basic terms, USS involves inviting the voice actor in, recording hours and hours of audio, categorizing the actor’s speech into linguistic segments (e.g. words, phrases, and morphemes) and then putting all this information into a huge database. Our Text-to-Speech engine can then search this database for speech units that match the text you’ve typed in, concatenate them, and produce your audio file.
The process is much more complicated than this, but this gives you a good overall idea of how USS works to produce speech. If you’d like to learn more about how USS produces synthesized speech, check out this blog post.
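To make the "search the database and concatenate" idea a little more concrete, here is a minimal sketch in Python. It is not NeoSpeech's engine: the `Unit` fields, the pitch-only cost functions and the tiny database are all assumptions made up for illustration. It picks one recorded unit per requested sound so that the total of a target cost (how well a unit fits what we asked for) and a join cost (how well neighbouring units fit each other) is as low as possible, using dynamic programming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str    # label of the recorded segment
    pitch: float  # average pitch in Hz (toy stand-in for real acoustic features)
    source: str   # hypothetical id of the recording the unit was cut from


def target_cost(unit, wanted_pitch):
    """How far the unit is from what the text asks for (assumed: pitch only)."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    """How audible the seam between two units would be (assumed: pitch jump)."""
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """targets is a list of (phone, wanted_pitch); returns the cheapest unit sequence."""
    candidates = [[u for u in database if u.phone == phone] for phone, _ in targets]
    if not candidates or any(not c for c in candidates):
        raise ValueError("empty request, or no recorded unit for a requested phone")
    # best[i][j] = (cheapest total cost ending in candidate j at position i, backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(unit, targets[i][1]), back))
        best.append(row)
    # Walk the backpointers from the cheapest final candidate to recover the path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chosen = []
    for i in range(len(targets) - 1, -1, -1):
        chosen.append(candidates[i][j])
        j = best[i][j][1]
    return chosen[::-1]


# Toy database: two takes of the same two sounds, spoken at different pitches.
DATABASE = [
    Unit("h", 120.0, "take_a"), Unit("h", 200.0, "take_b"),
    Unit("ay", 125.0, "take_a"), Unit("ay", 190.0, "take_b"),
]
chosen = select_units([("h", 160.0), ("ay", 190.0)], DATABASE)
print([u.source for u in chosen])
```

In this toy request both "h" units are equally far from the wanted pitch, so it is the join cost that decides which one is used. A real engine would compare far richer features than a single pitch value.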
Let’s take a look at the main advantages and disadvantages of USS to get a grasp on whether this technique is the best approach for speech synthesis or if its main competitor, HTS, would win the fight.
The Advantages of USS
- By now, you’ve probably figured out that USS produces the most natural sounding speech.
- Preservation of the Original Actor’s Voice: The Text-to-Speech Engine chooses speech units that best fit the text you have typed in. USS involves pulling these speech units directly from the voice database, thus preserving the original voice of the actor at all times.
- Higher quality audio files are produced: The TTS engine has at least 20 hours of recorded voice to choose from when it approaches the database to pull out speech units that match your text. This means that the audio files produced are of much higher quality than HTS audio files.
- Sophisticated techniques can be used to smooth the joins between speech units to make each sentence sound as natural as possible.
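As a rough illustration of what smoothing a join can mean, the sketch below blends the end of one unit into the start of the next with a simple linear crossfade. This is only the most basic stand-in for the sophisticated techniques mentioned above, and the function name and the plain lists of samples are assumptions for the example.

```python
def crossfade_join(left, right, overlap):
    """Join two lists of audio samples, blending `overlap` samples at the seam."""
    if overlap <= 0:
        return left + right  # plain concatenation, no smoothing
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap is longer than one of the units")
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right unit rises across the seam
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return left[:-overlap] + blended + right[overlap:]


# A loud unit followed by a silent one: the seam ramps down instead of jumping.
joined = crossfade_join([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=3)
print(joined)
```

The joined result is shorter than the two units placed end to end, because the overlapping samples are merged rather than played twice.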
The Disadvantages of USS
- Very long recording and development time: And I mean really long. Every new voice actor needs to record at least 20 hours of audio. Sometimes we need 60 hours of recording – that’s a lot of talking! Then, the development of the TTS engine can take many months, even years.
- Large database / footprint size: Because we need so many more hours of recording, the database that hosts all the audio files needs to be much bigger than the database for an HTS based voice.
- Because USS does not tamper with the original actor’s voice, it is impossible to change the emotion of the voice once the voice has been made. This means we cannot change the voice to be sad, happy or angry. Instead we would have to pre-record the actor saying each sentence with an angry or happy tone and then develop the voice using those audio files.
Great! Now we know a bit more about the pros and cons of USS, but what about its competitor, HTS? What does HTS bring to the table?
HMM Based Speech Synthesis System (HTS)
HMM based speech synthesis system (also known as HTS) is a totally different ball game. HTS is a statistical parametric synthesis technique built on hidden Markov models (HMMs). The model is called parametric because it describes speech using a set of parameters, unlike USS, which uses stored speech segments. It is called statistical because those parameters are described using statistics, such as the means and variances of probability density functions.
For those of you who are not majoring in linguistics or statistics, the HMM process is most simply described as using a statistical model to generate the speech sounds that the computer believes are the most likely match for the text you have written.
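Here is a heavily simplified sketch of the "statistical" and "parametric" ideas in Python. It is not a real HMM and not how HTS is implemented: it assumes a single feature (pitch in Hz), one Gaussian per sound, and made-up function names. It only shows that what is kept after training is a handful of numbers per sound rather than the recordings, and that generation returns the most likely value, which for a Gaussian is the mean.

```python
from statistics import mean, pvariance


def train(observations):
    """observations maps a phone label to the pitch values measured in the recordings.

    Returns phone -> (mean, variance): the only thing this toy model stores.
    """
    return {phone: (mean(values), pvariance(values))
            for phone, values in observations.items()}


def generate(model, phones):
    """Emit the most likely pitch for each phone, i.e. the mean of its Gaussian."""
    return [model[phone][0] for phone in phones]


# Made-up measurements for two sounds.
model = train({"a": [100.0, 120.0, 140.0], "t": [90.0, 110.0]})
print(generate(model, ["t", "a"]))
```

Notice that every "a" comes out at the same averaged pitch, however varied the original recordings were. That averaging is worth keeping in mind when we get to the disadvantages of HTS.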
How does HTS Produce Speech?
All speech synthesis requires voice recordings from an actor, but HTS synthesis requires only 2-3 hours of recorded voice to be able to create a Text-to-Speech engine (compared to at least 20 hours for USS).
The voice actor records 2-3 hours’ worth of speech, and a database is built that holds all the speech units from that voice actor. Statistical models are then trained on this database. When the TTS engine is asked to turn text into speech, the HTS engine selects the trained statistical models that best match the text and uses them to generate a synthetic audio file.
Once again, let’s take a look at some of the pros and cons of HTS so we can gain the knowledge needed to decide if HTS or USS is the king of speech synthesis.
The Advantages of HTS
- Much less development time: HTS only requires 2-3 hours of voice actor recording time and much less development time. This means a new voice can be produced in a matter of months rather than years.
- Much lower development cost: Because the company only needs to pay a voice actor for a fraction of the time that they would for USS, the cost of creating an HTS voice is much lower. This is also true of paying all the other participants in the creation process, such as computer scientists and linguists.
- Potential to add different emotions to a voice: This is quite an exciting idea: imagine being able to create an avatar that can tell you bad news in a sad voice and good news with a happy voice! Unfortunately, the technology to create emotional voices is not yet perfected, but this is something that HTS is capable of. Besides emotional voices, HTS has a lot of potential in areas such as speaker adaptation and speaker interpolation. It will be interesting to see how this technology progresses in the future.
- Smaller database/footprint size: About 2-3 hours of audio need to be stored in the training database, which means it can be much smaller than a USS-based voice database. And because the finished HTS voice stores statistical models rather than the recordings themselves, its footprint is much smaller than that of any USS-based voice.
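To get a feel for the difference in recording volume, here is a back-of-the-envelope calculation in Python. The audio format (16 kHz, 16-bit, mono, uncompressed) is purely an assumption for the arithmetic; the post does not say how the recordings are actually stored, and a shipped voice may well be compressed.

```python
def pcm_bytes(hours, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio for the given duration (assumed format)."""
    return int(hours * 3600 * sample_rate_hz * bytes_per_sample * channels)


uss_bytes = pcm_bytes(20)  # the 20-hour minimum quoted for a USS voice
hts_bytes = pcm_bytes(3)   # the upper end of the 2-3 hours quoted for HTS

print(f"USS recordings: {uss_bytes / 1e9:.2f} GB")
print(f"HTS recordings: {hts_bytes / 1e9:.2f} GB")
```

Under these assumptions the raw recordings alone differ by a factor of nearly seven, and that is before counting the fact that an HTS voice keeps statistical models instead of the audio itself.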
The Disadvantages of HTS
- The biggest disadvantage of HTS is that the quality of the speech produced is lower than USS. You may have noticed when you listen to some TTS voices that they sound muffled and do not reflect the intonation and flow of a normal sentence. This is due to the limitations of using a statistical model, which does not express fine variations in pitch well. Additionally, unlike USS, HTS has to use a vocoder, which compromises the quality of the voice.
- Does not conserve the original voice of the actor: HTS uses a statistical model to take the average sound of a selection of speech units. This means that the audio that is produced is no longer the original voice of the actor. This is one of the reasons why HTS voices sound less realistic and human-like when compared to USS voices. Again, using a Vocoder doesn’t help the quality of the voice.
- The voice can sound robotic: Because HTS synthesizes speech from a statistical model, the output is stable and consistent, but the muffled sound can make the voice seem unnatural and robotic.
HTS vs. USS: Which one wins?
So, which one is better? Basically, it is a trade-off of time vs. quality.
USS takes a lot of time – hours and hours of recording the voice actor’s speech and then hours and hours of computer programming. It can take several months or even a year to produce a truly high quality Text-To-Speech engine. However, it produces higher quality audio than HTS based synthesis.
The HTS development process can be faster and cheaper than USS, but the resulting TTS engine has more computation to do and more data to process when it generates speech. In other words, it generally requires more computing resources, resulting in slower processing than USS.
In short, the answer is that it depends on what matters most to you. If you want to develop a high quality text-to-speech engine with a voice that is smooth, natural and human-like then choose USS. But if you want to produce a TTS engine quickly and cheaply, then choose HTS.
Regardless of which speech algorithm we choose at NeoSpeech, we focus on high quality, natural sounding voices, getting the best out of each algorithm through our long R&D experience. NeoSpeech is now developing HTS-based emotional voices, speaker adaptation, and other applications. Drawing on that experience and expertise, we are very close to releasing emotional voices. Stay tuned.
Learn More about Speech Synthesis
Looking to become a speech synthesis expert? Take a look at some of these academic articles on speech synthesis techniques:
- A beginner’s guide to Statistical Parametric Speech Synthesis by Simon King
- Statistical Parametric Speech Synthesis by Alan Black, Heiga Zen and Keiichi Tokuda
- Text-to-Speech Synthesis by Paul Taylor
- Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database by Andrew Hunt and Alan Black
Let Us Know What You Think
What are your thoughts on TTS? Is HTS the best or are you a USS fan? If you have any questions, comments or ideas, please feel free to contact us at firstname.lastname@example.org or comment below.