Text2Speech Blog

NeoSpeech: Text-to-Speech Solutions.

HTS vs. USS: Which Speech Synthesis Technique is Better?



By now you probably know about Text-to-Speech and how it works (if not, feel free to check out our What is Text-to-Speech and How Does it Work post to learn more). You also know that there are two main speech synthesis techniques – HTS (HMM-based speech synthesis) and USS (unit selection synthesis). So, the next big question on everyone’s mind is: which speech synthesis technique is better?

Here at NeoSpeech, we use USS because it is considered to produce the most natural sounding synthesized speech. But HTS has its merits too, such as a much shorter recording time and a smaller database.

Let’s dive into examining these two techniques to find out the pros and cons of each one and help answer the question: which speech synthesis technique is better?

Unit Selection Synthesis (USS)

Let’s start with USS. As I said before, USS produces the most natural sounding synthesized speech: voices that sound the most human-like, the most realistic, and the most pleasant to listen to. This is the main reason NeoSpeech uses USS – no one wants to listen to a robotic voice that sounds like it’s from an ’80s movie!


How does USS Produce Speech?

In basic terms, USS involves inviting the voice actor in, recording hours and hours of audio, categorizing the actor’s speech into linguistic segments (e.g. words, phrases, and morphemes), and then putting all this information into a huge database. Our Text-to-Speech engine can then search this database for speech units that match the text you’ve typed in, concatenate them, and produce your audio file.

The process is much more complicated than this, but this gives you a good overall idea of how USS works to produce speech. If you’d like to learn more about how USS produces synthesized speech, check out this blog post.
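To make the idea concrete, here is a minimal, hypothetical sketch of the select-and-concatenate step. The phoneme labels, sample values, and in-memory `DATABASE` are all invented for illustration; a real engine stores hours of indexed audio and scores candidates carefully.

```python
# Toy "voice database": each phoneme maps to one or more recorded units,
# where a unit is just a list of audio sample values (all values invented).
DATABASE = {
    "HH": [[0.10, 0.20], [0.11, 0.19]],  # two candidate recordings of "HH"
    "AH": [[0.30, 0.40]],
    "L":  [[0.50, 0.60]],
    "OW": [[0.70, 0.80]],
}

def synthesize(phonemes):
    """Pick one stored unit per phoneme and concatenate the audio samples."""
    audio = []
    for ph in phonemes:
        candidates = DATABASE[ph]
        # A real engine scores candidates by target and join cost;
        # here we naively take the first one.
        unit = candidates[0]
        audio.extend(unit)
    return audio

# "Hello" as a phoneme sequence, concatenated from stored units:
samples = synthesize(["HH", "AH", "L", "OW"])
```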

Let’s take a look at the main advantages and disadvantages of USS to get a grasp on whether this technique is the best approach for speech synthesis or if its main competitor, HTS, would win the fight.


The Advantages of USS

  1. Most natural sounding speech: By now, you’ve probably figured out that USS produces the most natural sounding speech.
  2. Preservation of the Original Actor’s Voice: The Text-to-Speech Engine chooses speech units that best fit the text you have typed in. USS involves pulling these speech units directly from the voice database, thus preserving the original voice of the actor at all times.
  3. Higher quality audio files are produced: The TTS engine has at least 20 hours of recorded voice to choose from when it searches the database to pull out speech units that match your text. This means the audio files produced are of much higher quality than HTS audio files.
  4. Sophisticated techniques can be used to smooth the joins between speech units to make each sentence sound as natural as possible.
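As a rough illustration of point 4, one simple smoothing technique is to cross-fade the tail of one unit into the head of the next instead of butting them together. This is only a sketch under invented assumptions (real engines use far more sophisticated signal processing and pitch-synchronous methods):

```python
def crossfade(a, b, overlap=3):
    """Join two speech units (lists of samples) by linearly cross-fading
    the last `overlap` samples of `a` into the first `overlap` of `b`."""
    if overlap <= 0:
        return a + b  # no smoothing: plain concatenation
    head, tail_a = a[:-overlap], a[-overlap:]
    head_b, rest = b[:overlap], b[overlap:]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # fade weight rises across the overlap
        mixed.append((1 - w) * tail_a[i] + w * head_b[i])
    return head + mixed + rest
```

The joined signal is shorter than the two units combined by `overlap` samples, and the seam ramps gradually from one unit’s level to the other’s instead of clicking.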


The Disadvantages of USS

  1. Very long recording and development time: And I mean really long. Every new voice actor needs to record at least 20 hours of audio. Sometimes we need 60 hours of recording – that’s a lot of talking! Then, the development of the TTS engine can take many months, even years. 
  2. Large database / footprint size: Because we need so many more hours of recording, the database that hosts all the audio files needs to be much bigger than the database for an HTS based voice.
  3. Because USS does not tamper with the original actor’s voice, it is impossible to change the emotion of the voice once the voice has been made. This means we cannot change the voice to be sad, happy or angry. Instead we would have to pre-record the actor saying each sentence with an angry or happy tone and then develop the voice using those audio files.


[Infographic: The pros and cons of USS speech synthesis]


Great! Now we know a bit more about the pros and cons of USS, but what about its competitor, HTS? What does HTS bring to the table?

HMM Based Speech Synthesis System (HTS)

HMM-based speech synthesis (also known as HTS) is a totally different ball game. HMM synthesis is a statistical parametric technique: it is parametric because it describes speech with a set of parameters rather than with stored speech segments, as USS does, and statistical because those parameters are modeled statistically, using quantities such as means and probability densities.

For those of you not majoring in linguistics or statistics, the HMM process is most simply described as using a statistical model to generate speech that the computer believes matches your text as closely as possible.


How does HTS Produce Speech?

All speech synthesis requires voice recordings from an actor, but HTS synthesis requires only 2-3 hours of recorded voice to be able to create a Text-to-Speech engine (compared to at least 20 hours for USS).

The voice actor records 2-3 hours’ worth of speech, and a database is built that holds all the speech units from that actor. When the TTS engine is asked to turn text into speech, the HTS engine searches for the statistical model most relevant to the text (the models having been trained on the recordings beforehand) and uses that model to generate a synthetic audio file.
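The generate-from-a-model step can be sketched as follows. This is a deliberately simplified, hypothetical example: one invented Gaussian per phoneme over a single parameter, whereas real HTS models many acoustic parameters per HMM state and then feeds the parameter track through a vocoder to produce a waveform.

```python
# Toy per-phoneme Gaussian models over one acoustic parameter (say, pitch
# in Hz). The numbers are invented for illustration.
MODELS = {
    "HH": {"mean": 120.0, "var": 25.0},
    "AH": {"mean": 140.0, "var": 30.0},
}

def generate_parameters(phonemes, frames_per_phoneme=3):
    """Emit each model's mean for a few frames. The maximum-likelihood
    output of a Gaussian is simply its mean, which is one reason HTS
    speech can sound over-smoothed compared to real recordings."""
    track = []
    for ph in phonemes:
        track.extend([MODELS[ph]["mean"]] * frames_per_phoneme)
    return track
```

A vocoder would then turn a parameter track like this into audible speech; notice that nothing here stores or replays the actor’s actual recordings, only statistics learned from them.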

Once again, let’s take a look at some of the pros and cons of HTS so we can gain the knowledge needed to decide if HTS or USS is the king of speech synthesis.


The Advantages of HTS

  1. Much less development time: HTS only requires 2-3 hours of voice actor recording time and much less development time. This means a new voice can be produced in a matter of months rather than years.
  2. Much lower development cost: Because the company only needs to pay a voice actor for a fraction of the time that they would for USS, the cost of creating an HTS voice is much lower. This is also true of paying all the other participants in the creation process, such as computer scientists and linguists.
  3. Potential to add different emotions to a voice: This is quite an exciting idea – imagine an avatar that can tell you bad news in a sad voice and good news in a happy voice! The technology to create emotional voices is not yet perfected, but it is something HTS is capable of. Beyond emotional voices, HTS also shows much potential in areas such as speaker adaptation and speaker interpolation. It will be interesting to see how this technology progresses in the future.
  4. Smaller database/footprint size: Only 2-3 hours of audio are needed for training, and because HTS ultimately stores statistical models rather than the recordings themselves, the footprint is much smaller than that of any USS-based voice.


The Disadvantages of HTS

  1. The biggest disadvantage of HTS is that the quality of the speech produced is lower than USS. You may have noticed that some TTS voices sound muffled and do not reflect the intonation and flow of a natural sentence. This is due to the limitations of a statistical model, which does not capture fine pitch detail well. Additionally, unlike USS, HTS has to use a vocoder, which further compromises the quality of the voice.
  2. Does not preserve the original voice of the actor: HTS uses a statistical model to take the average sound of a selection of speech units. This means the audio produced is no longer the original voice of the actor, which is one reason HTS voices sound less realistic and human-like than USS voices. Again, the use of a vocoder doesn’t help the quality of the voice.
  3. The voice can sound robotic: As HTS synthesizes speech based on a statistical model, the muffled sound makes the voice sound stable but unnatural and robotic.


[Infographic: The pros and cons of HTS speech synthesis]


HTS vs. USS: Which one wins?

So, which one is better? Basically, it is a trade-off of time vs. quality.

USS takes a lot of time – hours and hours of recording the voice actor’s speech and then hours and hours of computer programming. It can take several months or even a year to produce a truly high quality Text-To-Speech engine. However, it produces higher quality audio than HTS based synthesis.

The HTS development process can be faster and cheaper than USS, but the TTS engine requires more computation at synthesis time, since it must generate speech from statistical models rather than simply retrieve stored units. In other words, it generally requires more computing resources, resulting in slower processing than USS.

In short, the answer is that it depends on what matters most to you. If you want to develop a high quality text-to-speech engine with a voice that is smooth, natural and human-like then choose USS. But if you want to produce a TTS engine quickly and cheaply, then choose HTS.

Regardless of which speech algorithm we choose at NeoSpeech, we focus on high quality, natural sounding voices by getting the best out of each algorithm through our long R&D experience. NeoSpeech is now developing HTS-based emotional voices, speaker adaptation, and other applications. Leveraging our experience and expertise, we are very close to releasing emotional voices. Stay tuned.


Learn More about Speech Synthesis

Looking to become a speech synthesis expert? Take a look at some of these academic articles on speech synthesis techniques:


Let Us Know What You Think

What are your thoughts on TTS? Is HTS the best or are you a USS fan? If you have any questions, comments or ideas, please feel free to contact us at marketing@neospeech.com or comment below.

  • Andrew Cameron Morris

    September 12, 2015 at 4:10 am

    Thanks for the article. It set me thinking again about these competing speech synthesis techniques, which have both been around for some time. When I first heard about HMM techniques, in the 1990s, I thought it would be a short time before they started to compete with unit selection in terms of speech quality. From what I have heard since, there have been some advances in HMM based synthesis, but it still hasn’t anywhere near fulfilled its early promise. I am a firm believer in the power of statistical models, so I suspect the reason for this is mainly because the success of unit selection has taken most of the pressure off the drive to improve model based speech synthesis.

    My own experience is mostly in automatic speech recognition, which is predominantly HMM based. In recognition the emphasis is on speech to text, with no attempt to understand speech, or even to recognise manner of speech. This is useful enough for dictation and information retrieval requiring very little understanding, but will never be sufficient for intelligent human-machine dialogue, where information contained in the manner of speech (which standard text does not capture) is highly important. If more effort was spent developing statistical models which converted speech to a graphical representation which captured manner of speech information, then HMM based synthesis could also benefit from that. In fact, unit selection could also benefit, because “text to speech” is off to a bad start when the intended manner of speech is not specified in the text, unless it is laboriously marked up by humans.

    Regarding the possibility of changing the emotional expression of speech in unit selection, you start by saying that is not possible, but end saying that you are working on that and may have a working system soon. I spent a few months with a group at IRCAM in Paris in 2006 who were working on a project for emotional speech synthesis, based on unit selection. While they did record speech with a small number of different emotional states, they also worked on the automatic generation of manner of speech from text, and for morphing selected speech units to fit the manner required. As you must know, there is a big market, especially in Europe, for dubbing films and adverts into different languages and it is expensive to pay actors to do that. There is therefore also a big market for emotional speech synthesis, although for dubbing that still requires humans to provide suitably segmented and marked up text.

    I suspect that the problem of generating natural sounding speech from text which has already been marked up to include manner or speech, by speech morphing, is relatively much easier than the problem of generating accurate manner of speech markup from flat text. To do that well requires a level of text understanding which cannot be captured by any small number of simple rules. Genuine natural language understanding is perhaps making progress, but is still a long way off.

    • neoadmin

      September 14, 2015 at 10:08 am

      Hi Andrew. Thanks for your detailed comment, it is great to hear from you and you make some interesting points. I think it will be interesting to see how HTS techniques improve over the next few years as USS has now got to a point where it has little room to improve and grow. Some USS synthesized voices sound very realistic, whereas the majority of HTS voices still have a way to come. Regarding the emotional expression of speech, I apologize if the wording was confusing. We are looking into producing emotional voices through HTS at the moment, although investigation into emotional voices through USS would be interesting.

  • Ruthie Rainbow

    November 1, 2015 at 7:17 pm

    About all the voices on here…Just like the ones over at Loquendo, will Oddcast be able to give emotions to the voices on here?

    • neoadmin

      November 9, 2015 at 5:39 pm

      Hi Ruthie – Thanks so much for commenting! This is quite a complicated question, which I’d love to discuss more with you over email if you like (email me at marketing@neospeech.com anytime). Short answer, no, Oddcast won’t be adding any emotions to the voices on here. However, if you need a custom voice with certain emotions please get in contact with our sales team and they’ll be happy to help you out!

  • Kuldeep Singh

    April 27, 2017 at 6:37 am

    Very nice informative blog about HTS And USS. Thanks for sharing. I am also share hindi voice actor in india

  • akshay barge

    June 8, 2018 at 8:41 pm


    Thanks for the information.

    So from what I understood for USS we record speech from one person for 10 hours and go with the concatenative approach to convert any text to his voice.
    But how do you convert speech into small units. Is it just based on phonemes or letters ? How do you know how to divide a speech when the duration is varying.
    Eg HELLO can be said as HELLO or HEEELLOOO.

    Now for HTS we require less time to record .
    So from what I understand , we record speeches from some people for 2-3 hours and then perform some averaging. Could you please explain what exactly does averaging mean ? Also When a new text comes in with some voice of a new person , we just adjust the averaged statistics to match the voice of a new person. Am I right ? Please correct me if wrong.I am totally new to this field.

    • neoadmin

      June 20, 2018 at 11:44 am

      Hello Akshay,

      We are glad you enjoyed our article.

      To answer your question about USS,
      The speech units in the USS engine database are stored as phonemes. The engine is able to divide speech with varying duration because each phoneme has its unique prosodic characteristics (pitch, power, duration).
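      The matching described above could be sketched roughly like this. The feature order (pitch, power, duration), weights, and numbers are all invented for illustration; a real engine combines a target cost like this with a join cost between neighbouring units.

```python
def target_cost(candidate, target, weights=(1.0, 0.5, 0.5)):
    """Weighted distance between a candidate unit's prosodic features and
    the target's. Features here are (pitch, power, duration); the weights
    are invented for illustration."""
    return sum(w * abs(c - t) for w, c, t in zip(weights, candidate, target))

def best_unit(candidates, target):
    """Choose the stored unit whose prosody best matches the target."""
    return min(candidates, key=lambda c: target_cost(c, target))
```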

      As for HTS,
      The term “averaging” is used because the HTS engine is based on statistical models. The model is trained to “average” sequences of speech units into the synthesized speech, which can cause the speech to sound less natural.
      I am not sure about your last question, but I do know that the HTS engine uses voice data from only one voice actor at a time.

      You can read more about the HTS and USS approach in TTS in these following articles:

