Text2Speech Blog

NeoSpeech: Text-to-Speech Solutions.

How speech technology immortalized Joey from Friends

A look into the technology that made it possible to recreate a TV character’s voice.

Joey Tribbiani from Friends is remembered as one of the most lovable characters in television history. His good-natured yet dim-witted personality made him a fan favorite and had viewers at home wishing they could be friends with him too.

That dream may become a reality in the near future.

Using original footage from the show, which aired from 1994 to 2004, researchers at the University of Leeds sought to resurrect Joey by building a virtual talking avatar that captures his style of speech, visual appearance, and language.

The virtual talking avatar would work a lot like a voice-only virtual assistant such as Apple’s Siri. It would be able to understand what you say to it and then respond using text-to-speech technology. Except this virtual avatar would sound and talk like Joey, and you would actually be able to see Joey on your screen talking to you with matching mouth movements.

Nothing quite like this had ever been done before, so the researchers at the University of Leeds were determined to develop a model for building virtual talking avatars from unconstrained data.

How did they do this? A lot of time and effort went into every part of this project. But for the purposes of this blog, we’ll focus on the speech technology and look at how they used text-to-speech to immortalize Joey’s voice.

Recreating Joey’s Voice

To recreate Joey’s voice, the researchers had to build a text-to-speech engine using Joey as the voice actor. All modern text-to-speech voices are created by using speech samples from a voice actor. Depending on the quality of the finished text-to-speech engine, it can sound exactly like the actor’s voice.

The main part of a text-to-speech engine is the speech database. This is where all the recorded units of speech are stored. When a working text-to-speech engine needs to convert text into speech, it takes the needed units of speech from the database and puts them together to generate the audible speech.
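To make the lookup-and-join idea concrete, here is a minimal Python sketch. It is only an illustration, not the researchers' code: the unit labels, the sample values, and the `synthesize` function are all made up for this example, and a real engine stores far more units and smooths the joins between them.

```python
# Toy speech database: each unit label maps to a short list of audio samples.
# The labels and sample values are invented purely for illustration.
SPEECH_DB = {
    "HH": [0.1, 0.2],
    "AW": [0.3, 0.4, 0.5],
    "Y": [0.0, -0.1],
    "UW": [-0.2, -0.3],
}


def synthesize(units, db):
    """Look up each requested unit and concatenate its samples in order."""
    waveform = []
    for label in units:
        if label not in db:
            # A real engine would fall back to a similar unit instead of failing.
            raise KeyError(f"no recording for unit {label!r}")
        waveform.extend(db[label])
    return waveform


audio = synthesize(["HH", "AW"], SPEECH_DB)
```

Asking for a unit that was never recorded raises an error here, which hints at why the size of the database matters so much later on.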

A big challenge in building a speech database with Joey’s voice was that the recordings had to come from the actual show. Normally, voice actors recite anywhere from 2 to 20 hours’ worth of scripted lines in a controlled studio environment. Then, computer programmers can break the recordings down into units of speech and label each one appropriately.

The researchers had to go through all 236 episodes in the series (about 97 hours of video) to get as many samples of Joey’s speech as possible.

Text-to-speech with Joey's voice

A hurdle the researchers faced was the difficulty of getting clean speech samples. Background noises such as an audience’s laughter, cars driving by, and music were prevalent in the show. A critical task in collecting high quality speech recordings was muting all non-spoken audio from the show.

To overcome this problem, they created a speech detection system specifically for the show. Using one of the episodes from the series, they trained their system to recognize the voices of each main character and mute all other noises. Once this system was working, they were able to isolate Joey’s voice for the speech database and build a text-to-speech engine capable of saying new things with Joey’s voice.
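The muting step can be sketched in Python as follows. The post does not describe how the detector itself works, so this sketch assumes it has already assigned a label to every audio frame; the `isolate_speaker` function, the frame layout, and the label names are hypothetical.

```python
def isolate_speaker(frames, labels, speaker="joey"):
    """Keep frames labeled as the target speaker and silence everything else.

    frames: list of audio frames, each a list of samples.
    labels: one label per frame, assumed to come from a trained detector.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    return [
        frame if label == speaker else [0.0] * len(frame)
        for frame, label in zip(frames, labels)
    ]


# Hypothetical example: three frames labeled by the (assumed) detector.
frames = [[0.2, 0.3], [0.9, 0.8], [0.1, -0.1]]
labels = ["joey", "laughter", "joey"]
clean = isolate_speaker(frames, labels)
```

Silencing frames rather than dropping them keeps the audio aligned with the video, which matters when the avatar's mouth movements have to match the speech.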

With the text-to-speech engine built, the researchers successfully developed a model for recreating a TV character’s voice! While the quality of the speech wasn’t quite up to the standard it needs to be (more on that below), this model will allow others to build virtual talking avatars from other TV characters and perfect the process.

Methods and voice quality

To generate the speech, the researchers used the Unit Selection Synthesis (USS) method. USS is known to produce the most natural-sounding synthesized speech, and to be the best at preserving the original actor’s voice.

However, text-to-speech engines built using the USS method can require up to 20 hours of speech recordings for the database. So much speech is required because the engine needs every variation of every sound a person can make in its database in order to consistently produce high-quality synthesized speech. Unfortunately, Joey does not have 20 hours’ worth of speech throughout the show.

So what does this mean? It means whenever Joey’s text-to-speech voice is speaking, it’s selecting units from a much smaller sample size, which leads to poorer quality due to a lack of having the exact sounds required for the specific word or sentence he’s saying.
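Here is a small Python sketch of why a smaller database hurts. It is a simplification under stated assumptions: real unit selection weighs many features plus the cost of joining neighboring units, while this example scores candidates on pitch alone. The `select_unit` function, the pitch values, and the two toy databases are invented for illustration.

```python
def select_unit(candidates, target_pitch):
    """Pick the candidate whose pitch is closest to the target.

    This is a one-feature stand-in for the target cost used in unit selection.
    """
    return min(candidates, key=lambda unit: abs(unit["pitch"] - target_pitch))


# A well-stocked database: the same sound recorded at many pitches (80 to 200).
large_db = [{"id": i, "pitch": p} for i, p in enumerate(range(80, 201, 10))]

# A sparse database: only two recordings of that sound.
small_db = [{"id": 0, "pitch": 90}, {"id": 1, "pitch": 180}]

target = 130
mismatch_large = abs(select_unit(large_db, target)["pitch"] - target)
mismatch_small = abs(select_unit(small_db, target)["pitch"] - target)
```

With the large database the engine finds an exact match, so the mismatch is 0. With the small one, the best it can do is a unit that is 40 away from the target, and that gap is what you hear as lower quality.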

You can hear it in the example video below (around the 1:00 mark). While the USS-based engine does produce speech that preserves Joey’s original voice, the quality of his speech isn’t the best.

This isn’t to say that all hope is lost for recreating our favorite TV characters’ voices. We’re always improving the way we build text-to-speech engines to develop natural-sounding voices. Plus, there’s a different method the researchers could’ve used to build Joey’s text-to-speech engine that would have produced different results.

The HMM-based Speech Synthesis System (HTS) is another method for developing text-to-speech engines. It is a statistical parametric synthesis technique. Basically, it uses a statistical model to generate speech units that the engine believes are as close as possible to what the text input requires.

In other words, an HTS engine can take a unit of speech from the database that sounds close to what it needs, and morph it (via pitch, tempo, speed, etc.) so it sounds like what is needed for that particular word or phrase. This is good and bad news for those building virtual talking avatars out of TV characters.

The good news is that HTS-based text-to-speech engines require only 2-3 hours of recorded speech for the voice database. The bad news is that it doesn’t preserve the original actor’s voice as well.
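The statistical idea can be sketched in a few lines of Python. This is not HTS itself, which trains hidden Markov models over much richer acoustic features. It is a toy stand-in that simply averages the pitch and duration observed for each unit, and the `train` and `generate` functions and all the numbers are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def train(observations):
    """Average the observed pitch (Hz) and duration (ms) for each unit label."""
    grouped = defaultdict(list)
    for label, pitch, duration in observations:
        grouped[label].append((pitch, duration))
    return {
        label: (mean(p for p, _ in values), mean(d for _, d in values))
        for label, values in grouped.items()
    }


def generate(model, labels):
    """Produce one (pitch, duration) pair per requested unit from the model."""
    return [model[label] for label in labels]


# Made-up training data: (unit label, pitch in Hz, duration in ms).
observations = [("AW", 120, 100), ("AW", 140, 140), ("HH", 100, 50)]
model = train(observations)
params = generate(model, ["HH", "AW"])
```

Notice that the generated "AW" has a pitch of 130 and a duration of 120, values that appear in neither original recording. That is why a model like this can get by with less data, and also why the result drifts away from the actor's exact voice.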

In the case of Joey from Friends, building his text-to-speech engine using the HTS technique might have produced speech that flowed more naturally. However, you would have noticed that his voice was a little off and did not sound like Joey’s authentic voice.

Down the road, it’ll be interesting to see others immortalizing TV characters (or anyone else from the past) with virtual talking avatars using either text-to-speech technique. Other techniques such as Deep Neural Networks (DNNs) could also be used for this purpose as they become more popular.

It’s great to see speech technology used in such an interesting way, and we can’t wait to have our own conversations with Joey! We’re looking forward to seeing more developments in this type of speech technology work.

What do you think?

Which TV character or person from the past would you like turned into a virtual talking avatar? Let us know in the comments!

Learn More about NeoSpeech’s Text-to-Speech

To learn more about the different areas in which Text-to-Speech technology can be used, visit our Text-to-Speech Areas of Application page. And to learn more about the products we offer, visit our Text-to-Speech Products page.

If you’re interested in adding Text-to-Speech software to your application or would like to learn more about TTS, please fill out our Sales Inquiry form and one of our friendly team members will be happy to help.
