What is Text-to-Speech and How Does It Work?
Some of our readers are linguistic or software geniuses with a complete understanding of Text-to-Speech software. But some of you may be relatively new to this part of speech technology and you may have found yourself wondering, what is text-to-speech and how does it work?
What is TTS?
TTS stands for Text-to-Speech (also written as Text to Speech) – a form of speech synthesis that converts text into voice output. Text-to-Speech software takes the text you write and turns it into speech files that you can use. From text to speech – nice and simple!
Okay, but how does Text to Speech work?
There are numerous ways to create audio from text. At NeoSpeech, we use a process called Unit Selection Synthesis (USS). The process starts from both ends (building the voice database on one side and processing the language of your text on the other) and meets in the middle to produce speech. To make it easier to follow, we’re going to break it down into a simple 7-step process to show you how we create such high quality speech.
Step 1: Record, Record, Record
First, we choose a voice actor with a great sounding voice who is fluent in a certain language. Then we bring him or her in to talk with us – for hours and hours and hours.
We record the voice actor saying a range of speech units, from whole sentences to syllables. These can be recipes, sports results, magazine articles or anything that lets us capture the natural sound of the actor’s voice. Together, the recordings cover examples of all the possible sounds in a given language.
Steps 2 and 3: Sorting the Speech Units and Building a Voice Database
Now that we have thousands of recorded sound files, we need to sort them out and organize them. The speech units are labeled and segmented by phones, syllables, morphemes, words, phrases, and sentences.
These speech units are used to build a large voice database. The voice is now ready for you to use.
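To make the idea of a voice database more concrete, here is a minimal sketch in Python. It is an illustration only, not NeoSpeech’s actual data format: the Unit fields, the file names and the ARPAbet-style phone labels are all assumptions made for the example. It shows labeled units being grouped by phone so that every recorded example of a sound can be looked up later.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One labeled speech unit (hypothetical layout, for illustration)."""
    phone: str       # phone label, e.g. "k"
    recording: str   # made-up name of the source recording
    start_ms: int    # where the unit starts inside the recording
    end_ms: int      # where the unit ends inside the recording


def build_voice_database(units):
    """Group labeled units by phone so candidates can be looked up quickly."""
    database = defaultdict(list)
    for unit in units:
        database[unit.phone].append(unit)
    return dict(database)


# A few invented units, as if segmented from two recordings.
labeled_units = [
    Unit("k", "recipe_001.wav", 120, 190),
    Unit("ae", "recipe_001.wav", 190, 310),
    Unit("t", "recipe_001.wav", 310, 370),
    Unit("k", "sports_014.wav", 40, 105),
]
voice_db = build_voice_database(labeled_units)
```

A real database would hold many thousands of units and index them at several levels (phones, syllables, words and so on), but the lookup idea is the same.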
Step 4: So, let’s create some audio files
You sit down at your computer and open one of NeoSpeech’s products. You type in the text you want transformed to speech.
Step 5: Natural Language Processing
On the language processing end, your text is first normalized and broken down into phonetic sounds. It then goes through a series of analyses that work out the structure of each sentence and the context of each word, which determines how the word should be pronounced. This is called Natural Language Processing.
Through these processes, we are able to establish prosody (rhythm, stress, and intonation) and produce natural sounding speech.
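For a rough feel of what normalizing text and breaking it down into phonetic sounds can look like, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the abbreviation table, the digit-by-digit number reading and the two-word lexicon are invented, and a production system would use far larger dictionaries plus context analysis to choose between pronunciations.

```python
# Hypothetical lookup tables, invented for this example.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
TOY_LEXICON = {
    "cat": ["k", "ae", "t"],
    "doctor": ["d", "aa", "k", "t", "er"],
}


def normalize(text):
    """Expand abbreviations and digits into plain lowercase words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,!?;:")
        if token.isascii() and token.isdigit():
            # Crude simplification: read numbers one digit at a time.
            words.extend(DIGITS[digit] for digit in token)
        elif token:
            words.append(token)
    return words


def to_phones(words):
    """Look each word up in the toy lexicon; unknown words become "<unk>"."""
    phones = []
    for word in words:
        phones.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phones


words = normalize("Dr. Smith lives at 42 Main St.")
phones = to_phones(["doctor", "cat"])
```

Note that this sketch always reads "St." as "street". Deciding between "street" and "saint" is exactly the kind of context question the analyses described above are there to answer.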
Step 6: Choosing the Right Speech Units
This is where the Natural Language Processing (NLP) part and the Voice Database come together to start producing speech.
Once the NLP is complete, our software searches the voice database and chooses the speech units that best fit together to produce the sounds associated with your text. This is called Unit Selection (hence the name, Unit Selection Synthesis).
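A common textbook way to frame unit selection is as a search for the cheapest path through the candidate units, where each unit has a "target cost" (how well it matches what the text calls for) and each pair of neighbors has a "join cost" (how smoothly they connect). The Python sketch below follows that framing with a Viterbi-style search. It is a simplified illustration, not NeoSpeech’s engine: the Candidate fields, the pitch-only cost functions and the sample values are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate unit from the voice database (hypothetical fields)."""
    phone: str
    unit_id: str
    start_pitch: float  # pitch in Hz at the start of the unit
    end_pitch: float    # pitch in Hz at the end of the unit


def target_cost(candidate, wanted_pitch):
    """How far the unit's average pitch is from the pitch we want."""
    mean_pitch = (candidate.start_pitch + candidate.end_pitch) / 2
    return abs(mean_pitch - wanted_pitch)


def join_cost(left, right):
    """How big the pitch jump is where two units meet."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, database):
    """Return the lowest-total-cost unit sequence for (phone, pitch) targets."""
    if not targets:
        return []
    # best[i][j] = (cheapest cost of a path ending in candidate j, backpointer)
    best = []
    for i, (phone, wanted) in enumerate(targets):
        row = []
        for cand in database[phone]:
            cost = target_cost(cand, wanted)
            if i == 0:
                row.append((cost, None))
            else:
                previous = database[targets[i - 1][0]]
                path_cost, back = min(
                    (best[i - 1][k][0] + join_cost(prev, cand), k)
                    for k, prev in enumerate(previous)
                )
                row.append((cost + path_cost, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    index = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(database[targets[i][0]][index])
        index = best[i][index][1]
    path.reverse()
    return path


# Invented candidates and targets for the word "cat".
database = {
    "k": [Candidate("k", "k1", 100, 110), Candidate("k", "k2", 125, 145)],
    "ae": [Candidate("ae", "ae1", 112, 118), Candidate("ae", "ae2", 150, 160)],
    "t": [Candidate("t", "t1", 119, 115), Candidate("t", "t2", 90, 85)],
}
targets = [("k", 105.0), ("ae", 115.0), ("t", 117.0)]
chosen = [unit.unit_id for unit in select_units(targets, database)]
```

Real systems weigh many more features than pitch, such as duration, energy and the surrounding sounds, but the principle of balancing fit against smooth joins carries over.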
Step 7: The Creation of Sound Files!
Voila! After a bit of technical processing, you have your new audio file with one of NeoSpeech’s top quality voices.
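This post does not spell out what that technical processing involves. One common ingredient in concatenative synthesis generally is smoothing the points where units are joined, and the Python sketch below shows a simple linear crossfade as an example. Treat it as an assumption about the kind of step involved, not a description of NeoSpeech’s pipeline; the function name and the sample values are invented.

```python
def crossfade_concatenate(units, overlap):
    """Join lists of audio samples, blending `overlap` samples at each join."""
    if not units:
        return []
    output = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(output), len(unit))
        for i in range(n):
            weight = (i + 1) / (n + 1)  # ramps toward the incoming unit
            tail = len(output) - n + i
            output[tail] = (1 - weight) * output[tail] + weight * unit[i]
        output.extend(unit[n:])
    return output


# Two invented "units": a loud one followed by a silent one.
joined = crossfade_concatenate([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], 2)
```

With an overlap of 2, the last two samples of the first unit fade down into the second unit instead of cutting off abruptly, which is the kind of audible click a crossfade is meant to avoid.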
Why do you use Unit Selection Synthesis? What are the other methods of Speech Synthesis?
We use USS because it is considered to produce the most natural sounding speech. The other main speech synthesis technique used today is HTS, the HMM-based speech synthesis system (HMM stands for hidden Markov model).
So, which one is better? Basically, it is a trade off of time vs. quality.
Using USS takes a lot of time to produce the TTS engine. We need many hours of voice recording from the actor and then hours of computer programming. It can take up to a year from start to finish to produce a truly high quality text-to-speech engine.
Despite this, we believe USS is worth the wait. It produces much higher quality audio than HTS-based synthesis, which requires only between 30 minutes and 3 hours of voice recording and takes much less time to produce a new TTS engine.
Hungry for more Text-to-Speech?
Want to learn more about speech synthesis techniques? Keep an eye out for our next blog post: HTS vs. USS: Which Speech Synthesis Technique is Better? This is a more detailed analysis of HTS and USS and the pros and cons of each one.
If you’d like to learn more about why you should consider using TTS, check out our Advantages to Using Text-to-Speech. Or, for those of you who want a more in-depth explanation of how Text-to-Speech works, take a look at Text-to-Speech Synthesis by Paul Taylor.
Lastly, feel free to check out our text-to-speech products page to see what TTS options we have to offer.
Let Us Know What You Think!
What are your thoughts on TTS? If you have any questions, comments or ideas, please feel free to contact us at email@example.com or comment below. Or, if you’d like to discuss adding text-to-speech to your product or service, please fill out our Sales Inquiry form and our sales team will be happy to help.