
Learning Machine

Generate Your Favourite Characters’ Voice Lines using Machine Learning

Natural emotive high-quality faster-than-real-time text-to-speech synthesis with minimal data

Photo by Artur Tumasjan on Unsplash

One beautiful Sunday morning, I saw an unexpected video being recommended by YouTube.

It was a video about Team Fortress 2, an FPS game from 2007 that I spent 1,300 hours on during my high school years.

There are 9 playable characters in the game, each with unique voice lines and a distinct accent. The Scout has a Boston accent, the Medic speaks German most of the time, the Spy dons a French accent; you get the point.

I was very familiar with all the voice lines from the game, and also from Team Fortress’ official videos, so I was surprised to hear these characters speaking new lines in that particular video.

My first thought was that the video creator had used cameo.com to pay the original voice actors for new dialogue. I was wrong.

The voices were created using machine learning.

15.ai

Screenshot of 15.ai by Author

The project was created as part of MIT’s Undergraduate Research Opportunities Program, but for now nobody knows for sure who is behind it.

The author has not yet released any source code or published a paper, but they claim that this new technique beats SV2TTS in data efficiency and naturalness.

Natural emotive high-quality faster-than-real-time text-to-speech synthesis with minimal data

That is the author's description of the website. The one part I cannot agree with is the “faster-than-real-time” claim, because in my experience voice generation is quite slow. However, this might be caused by the queue of requests sent by other users rather than by the model itself.
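To make the “faster-than-real-time” claim concrete: it is usually read as a real-time factor below 1, meaning the audio takes less time to generate than it takes to play back. The snippet below is only a sketch of that arithmetic. The function names and the timing numbers are my own illustrative assumptions, not measurements of 15.ai.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when the clip was generated in less time than it takes to play."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


# Hypothetical numbers: a 4-second clip that the model synthesizes in 1 second,
# and the same clip when the request also waits 9 seconds in a queue.
model_only = real_time_factor(1.0, 4.0)
with_queue = real_time_factor(1.0 + 9.0, 4.0)

print(f"model only: {model_only:.2f}")
print(f"with queue: {with_queue:.2f}")
```

Under these made-up numbers, the model alone would be faster than real time while the user-facing wait would not be, which is one way both the author's claim and my experience could be true at once.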

Characters

At the time of writing, the lineup of character voices is fairly wide but still limited.

We have SpongeBob himself from SpongeBob SquarePants. The Tenth Doctor from Doctor Who is also available. Even Sans from Undertale is there. Team Fortress 2 has 8 of its 9 characters (all except Pyro, who always sounds muffled), plus the Administrator and Miss Pauling from the TF2 clips.

One unexpected lineup is from My Little Pony, which has more than 40 characters to choose from.

I was going to assume that the creator must be a huge fan of MLP, but then I found this tweet under the FAQ entry titled “Why are there so many MLP voices?”

I have never watched any MLP before, but I searched for a clip and their voices are quite expressive.

You can check out which characters are planned for the upcoming release on this page.

Small Training Data

Compared to other TTS techniques, 15.ai is able to mimic a character with very little data:

…a voice can be convincingly cloned — emotion and all — with as little as 15 seconds of data

I could not verify how good the result is with only 15 seconds of data, but the voice of Portal’s Sentry Turret has only ~100 seconds of data and sounds surprisingly good.

My hypothesis is that the model also benefits from the training data of other characters, which would explain why a character with very little data can still produce quality results.

Screenshot of MLP Character in 15.ai by Author

However, the difference in quality between characters with small and large training data is still quite apparent.

MLP characters such as Twilight Sparkle and Fluttershy, which have upwards of 120 minutes of training data, sound way, way better and more natural than SpongeBob, who has only 27 minutes of training data.

Characters with large amounts of training data produce more natural dialogue with clearer inflections and pauses between words, especially in longer sentences.

The author also noted that, for technical reasons (approximately uniformly distributed vocal frequencies), high-pitched/feminine voices work best. I suspect this is because a lot of the training data comes from MLP characters, which generally have these characteristics.

Intonation & Emotion

Another interesting thing about 15.ai is how it uses DeepMoji to predict the emotion of a sentence.

Currently, we cannot manually set the emotion of the voice, as the only available choice for emotion is “Contextual”, which uses DeepMoji.

Screenshot of DeepMoji Usage in 15.ai by Author

Personally, I am very curious how it would handle a manually set emotion, because that could force the model to generate combinations it has not seen before, such as saying “Today is a great day” with a sad or angry emotion. That sentence would usually be said happily, so it would be interesting to hear how the model voices it with other emotions.
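To illustrate the difference between a “Contextual” emotion and a manual override, here is a toy sketch. It is not how 15.ai or DeepMoji works: DeepMoji is a trained neural network, while `predict_emotion` below is a hypothetical keyword lookup that I made up purely as a stand-in, along with the emotion labels and cue words.

```python
import math

# Hypothetical emotion labels and cue words, for illustration only.
CUES = {
    "happy": {"great", "love", "wonderful"},
    "sad": {"miss", "lost", "alone"},
    "angry": {"hate", "stupid", "furious"},
}


def predict_emotion(text):
    """Toy stand-in for a text-to-emotion predictor: label -> probability."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    # Score each emotion by how many of its cue words appear in the text.
    scores = {label: len(words & cues) for label, cues in CUES.items()}
    # Softmax turns the raw scores into probabilities that sum to 1.
    total = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / total for label, s in scores.items()}


def choose_emotion(text, manual=None):
    """Use the manual emotion if given, otherwise the most likely predicted one."""
    if manual is not None:
        if manual not in CUES:
            raise ValueError(f"unknown emotion: {manual}")
        return manual
    probs = predict_emotion(text)
    return max(probs, key=probs.get)


sentence = "Today is a great day"
print(choose_emotion(sentence))                # contextual choice
print(choose_emotion(sentence, manual="sad"))  # manual override
```

In the contextual case the sentence and the emotion always agree, while the manual override produces exactly the kind of mismatched pair (a cheerful sentence with a sad label) that I would like to hear the real model attempt.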

Final Thoughts

I first found out that GANs could be used to create voice lines through Lyrebird. It seems to have changed over the years, because right now there is no option to train a model on your own voice.

Lyrebird used to have its own website, but it now seems to have been acquired by Descript and is focused on building text-to-speech models to help content creators.

Compared to what I remember from creating voices with Lyrebird, the quality of 15.ai's voices is miles ahead. The intonation is much more natural than Lyrebird's was at the time, especially for characters with a lot of training data.

Just like deepfakes, this technology has the potential to become a dangerous tool for creating fake speech, but I believe it also opens up a lot of possibilities, such as creating your own voice assistant to replace Google Assistant’s, Siri’s or Alexa’s voice.

You can try it out yourself here, but please keep in mind that every request you make costs the developer money. Here’s the developer’s Patreon page if you want to help out.

