The Ultimate Text-to-Speech Glossary
Ever wondered what a TTS SDK could be? Or what TTS stands for? You’re not alone. Text-to-speech technology is full of complicated phrases and industry acronyms, and if you’re new to the field, it can get a bit overwhelming.
In this blog post we’re going back to the basics and looking at the definitions and uses of common speech technology terms. So if you’re ever a bit confused about the meaning of an acronym or word, you can come back to this post and discover what you need to know.
For those of you who are just looking for an overview, we’ve put together this infographic as a summary of all the key terms and definitions you need to know. And for those of you searching for a specific phrase or just looking to learn more, scroll down to see the complete and ultimate text-to-speech glossary.
General Text-to-Speech Technology Terms
Let’s start with the general speech technology terms you’ll need to know and recognize, starting with the most general term, Speech Technology.
Speech Technology: Speech technology refers to all technologies that aim to duplicate and respond to the human voice. This includes technologies that mirror the human voice, like text-to-speech, and technologies that aim to understand and process the human voice, like speech-to-text. Speech technology is the general field into which the study and development of text-to-speech falls.
Text-to-Speech: Text-to-Speech technology does exactly what it says – turns text into speech. We recently wrote an article specifically about text-to-speech technology and how it works if you’d like to learn more.
Also Known As: Voice Synthesis, Speech Synthesis
TTS: A common acronym for text-to-speech.
Text-to-Speech Engine: This is the core TTS technology that turns your text into speech. In basic terms, the text-to-speech engine takes your text, sorts it into linguistic segments (such as phrases and syllables) and assembles these into a large database. When you type text into the text-to-speech software, the engine analyzes your text and searches the database for the closest sounding speech units to your text, strings them together and produces them for you to hear.
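The lookup process described above can be sketched in a few lines of Python. This is a deliberately tiny, hypothetical example: the unit "database" holds only four entries and the matching is naive, whereas a real TTS engine stores thousands of recorded units and scores candidates on context, pitch, and duration.

```python
# Toy sketch of a TTS engine's unit lookup. The database maps text
# fragments (speech units) to hypothetical pre-recorded audio files.
SPEECH_UNIT_DB = {
    "hel": "hel.wav",
    "lo": "lo.wav",
    "wor": "wor.wav",
    "ld": "ld.wav",
}

def segment(text):
    """Naively split input text into the units found in the database."""
    text = text.lower().replace(" ", "")
    units, i = [], 0
    while i < len(text):
        # Greedily match the longest database unit starting at position i.
        for length in range(len(text) - i, 0, -1):
            if text[i:i + length] in SPEECH_UNIT_DB:
                units.append(text[i:i + length])
                i += length
                break
        else:
            i += 1  # no unit matches here; skip this character
    return units

def synthesize(text):
    """Return the audio units to string together, in order."""
    return [SPEECH_UNIT_DB[u] for u in segment(text)]

print(synthesize("Hello world"))
# ['hel.wav', 'lo.wav', 'wor.wav', 'ld.wav']
```

The engine would then concatenate those audio units and play the result back to you.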
Also Known As: TTS Engine, Voice Synthesizer, Speech Synthesizer
Unit Selection Synthesis: Unit Selection Synthesis, or USS, is a text-to-speech method used to turn text into speech. While there are many different methods for turning text into speech, this is one of the most respected and commonly used. It is known for producing the most natural, human-like voices because it preserves the recorded voice of the original voice actor.
USS: A common acronym for Unit Selection Synthesis
HMM based Speech Synthesis System: HMM based Speech Synthesis, often shortened to HTS, is another well-known speech synthesis technique. HTS uses a statistical model to generate the set of speech units that sounds most similar to the text input. It produces a lower quality synthesized voice than USS. And, contrary to popular belief, HTS requires more computing power than USS, which means it runs slower despite its smaller database. To learn more, check out this article about the differences and similarities between HTS and USS.
HTS: A common acronym for HMM based Speech Synthesis System. HTS is read as “H and three S’s,” referring to the first letter of each word in HMM based Speech Synthesis System.
HMM: The HMM in HMM based Speech Synthesis System stands for “Hidden Markov Model,” a type of statistical model.
Speech Recognition: Speech recognition is the process of translating spoken words into text. An example is the voice activated assistant in your phone that you can instruct to call a friend or order a taxi. To learn more, LumenVox put together a great glossary of speech recognition terms.
Also Known As: Speech-to-Text (STT), Automatic Speech Recognition (ASR), Computer Speech Recognition
Linguistics Terms for Text-to-Speech
Linguistics plays a huge role in text-to-speech as we develop new synthesized voices. Linguistics allows us to transform language and speech into something that a computer, namely the text-to-speech engine, can understand and utilize. Here is a guide to all the important linguistics terms used for TTS.
Linguistics: The scientific study of language and its structure.
Natural Language: Natural language is a human language, as opposed to a computer programming language. For example, Spanish and English are both natural languages, whereas HTML is not.
Natural Language Processing: Often shortened to NLP, Natural Language Processing is a field of study that stretches across linguistics, computer science and artificial intelligence. NLP is primarily concerned with the interactions between computers and natural languages. This is critical in text-to-speech production as the TTS Engine must be able to understand and interpret natural languages.
Natural Language Understanding: Often shortened to NLU, this is a subtopic of NLP that deals with machine reading comprehension. NLU enables computers to derive meaning from natural language input or from human interactions. This is the aspect of linguistics that allows the text-to-speech engine to understand the natural language inputs you have typed in.
Phonetic Alphabet: A Phonetic Alphabet, also sometimes known as a Pronunciation Alphabet, is a set of symbols that represent the correct way to pronounce sounds in a certain spoken language.
Prosody: Prosody refers to a collection of phonological features that are used to define the characteristics of a spoken language. Features include pitch, range, volume, rate and duration.
Vocoder: An audio processor that analyzes and synthesizes the human voice. In speech synthesis, the vocoder generates the final audio waveform, making it an essential component of many TTS systems.
Concatenation: The linking of two or more items end to end so that they can be treated as one thing. In TTS, recorded speech units are concatenated to form a complete utterance.
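The idea is easy to see in code. In this hypothetical sketch, short byte strings stand in for the raw audio samples of two recorded speech units, and concatenation joins them into one utterance:

```python
# Byte strings standing in for raw audio samples of two speech units.
unit_a = b"\x01\x02"  # hypothetical samples for the first unit
unit_b = b"\x03\x04"  # samples for the unit that follows it

# Concatenation: the two units are now treated as one utterance.
utterance = unit_a + unit_b
print(utterance)  # b'\x01\x02\x03\x04'
```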
This next section relates to morphology, which is the area of study that enables us to examine the internal structure of words. Some of these definitions you will have heard before, such as syllables, but many of you might be new to the other definitions or even morphology itself. Note that we haven’t included definitions of morphology terms you definitely already know, such as “word”.
Morphology: The study of the form and internal structure of words
Morpheme: A morpheme is the smallest possible unit of grammar in a language that still holds meaning. For example, the word “processed” is made up of 2 morphemes – “process” and “ed”.
Syllable: Most of you will remember learning this early on in life, but just as a refresher, a syllable is a unit of pronunciation containing one vowel sound, with or without surrounding consonants, that forms part of a word or sometimes a whole word. For example, the word “synthesis” has 3 syllables – “syn”, “the” and “sis”.
Phone: When discussing TTS software, a phone is no longer the cool mobile device that lets you play games on your break. Instead, a phone is a notation that represents a specific sound in a spoken language. Phones are usually written as letters, numbers or other characters and are used to create the phonetic spellings that indicate how a word should be pronounced.
Diphone: In phonetics, a diphone refers to an adjacent pair of phones. This term is most commonly used to refer to a recording of the transition between two phones.
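Extracting diphones from a phone sequence is just pairing each phone with its neighbor. A minimal sketch, using a rough, illustrative phone sequence for “synthesis” (the transcription itself is only an approximation):

```python
def diphones(phones):
    """Return each adjacent pair of phones: the transitions a
    diphone-based synthesizer would record."""
    return list(zip(phones, phones[1:]))

# A rough phone sequence for "synthesis" (illustrative, not exact IPA):
print(diphones(["s", "i", "n", "th", "e", "s", "i", "s"]))
```

Note that an n-phone word yields n−1 diphones, one for each transition between neighboring sounds.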
Technical Text-to-Speech Terms
Text-to-speech technology is full of technical terms that can be overwhelming if you don’t have a computer science degree! Below you’ll find a guide to all the technical TTS terms you’ll ever need to understand.
Sampling Rate: The number of samples of audio carried per second. This is usually measured in Hz or kHz. This is used to describe the quality of the audio files produced by the text-to-speech engine. For example, NeoSpeech has 3 sampling rates you can choose from. 8kHz is best suited for IVR systems and emergency notifications, while the 16kHz and 44.1kHz sampling rates work best for other applications.
TTS SDK: What a mouthful! While this acronym may seem like a lot, it is quite simple. TTS SDK stands for Text-to-Speech Software Development Kit. If you’re a computer scientist, you’ll know quite a bit about SDKs already; if not, a TTS SDK is basically the toolkit that allows developers to incorporate text-to-speech functionality into their applications.
TTS API: TTS API stands for Text-to-Speech Application Programming Interface, a set of tools that aids in building TTS software applications.
SAPI: SAPI stands for Speech Application Programming Interface, an API developed by Microsoft to enable the use of speech synthesis and speech recognition in Windows applications.
SSML: Stands for Speech Synthesis Markup Language. Many of you may have heard of HTML, which is also a markup language and is used to write websites. SSML is an XML-based markup language specifically designed for speech synthesis applications.
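To make this concrete, here is a minimal SSML document built as a Python string. The `<speak>`, `<break>` and `<prosody>` tags are standard SSML elements; how you submit the markup to a TTS engine depends on that engine’s API, so treat this purely as a sketch of what the markup looks like:

```python
# A minimal SSML document. <break> inserts a pause and <prosody>
# adjusts how the enclosed text is spoken.
ssml = (
    '<speak>'
    'Welcome to the glossary.'
    '<break time="500ms"/>'
    '<prosody rate="slow" pitch="low">This part is spoken slowly.</prosody>'
    '</speak>'
)
print(ssml)
```

Because SSML is XML-based, the same document could equally be produced with an XML library rather than string literals.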
VTML: This is NeoSpeech’s version of SSML, where VTML stands for VoiceText Markup Language. Any of our users can use VTML tags to edit the speed, volume, pitch and other aspects of the voice that pronounces the words typed into our text-to-speech engine.
Applications of Text-to-Speech
One of the most exciting things about text-to-speech software is the huge range of circumstances where it can be of use. TTS can be used in education, transportation, announcement systems, broadcasting, entertainment, finance and more. Let’s take a look at some of the more common uses of TTS and the definitions of those fields.
Computer Telephony Integration: Computer Telephony Integration, or CTI, is a technology that enables computers to interact with telephone systems. You may have encountered this technology when calling a customer service number. Here at NeoSpeech we have a lot of CTI and IVR customers.
Interactive Voice Response: This technology is a subset of CTI and allows the caller to respond by pressing keys on the telephone keypad or by speaking. IVR can also reroute a call to an appropriate employee if need be.
E-learning: E-learning is any learning that takes place through the use of electronic technology, typically through the internet. These lessons usually take place outside of the classroom, but recent years have seen teachers start to integrate e-learning into their courses. E-learning that utilizes text-to-speech software is often centered around learning a new language or a course where it is beneficial to hear the words on the screen being said out loud.
Assistive technologies: These are technologies that aid people with disabilities. For example, text-to-speech can be used to read text out loud so that blind or visually impaired users of technology can consume content.
Artificial Intelligence: Often shortened to AI, artificial intelligence is intelligence displayed by machines or software. Text-to-speech is used to give a voice to robots and machines.
Interactive Audio Kiosk: These kiosks provide access to necessary information in high traffic places, such as museums, stores, tourist attractions and shopping malls. This is where you would go to get directions to a store within a mall or to find out the history of an artifact in a museum.
Audio Books: With the rise of e-books, you probably know what an audio book is. But just in case you don’t, an audio book allows you to listen to your favorite novel or textbook without having to read the words.
Learn More about Text-to-Speech
To learn more about the different areas in which Text-to-Speech technology can be used, visit our Text-to-Speech Areas of Application page.
If you’re interested in adding text-to-speech software to your application or would like to learn more about TTS, please fill out our Sales Inquiry form and one of our friendly team members will be happy to help.