Audio Synthesis, what’s next? – Mellotron

Expressive voice synthesis with rhythm and pitch transfer. Mellotron can make a person sing without that person's voice ever having been recorded performing a song. Interested? Read on...

Some time ago we suggested paying a visit to the virtual ICASSP 2020 conference. For those of you who couldn’t make it, here’s a short recap of one of the most exciting research papers we stumbled upon. Please give a warm welcome to Mr. Mellotron!

Tacotron

If you are interested in deepfakes, you have probably heard of the impressive audio deepfakes created with the Tacotron model and its successor, Tacotron 2. Both models aim to turn an input text into a time-frequency matrix (a spectrogram), which is then converted into an audio file. The birth of “modern” text-to-speech applications, which led to audio deepfakes, is largely due to these two papers.
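To make the idea of a time-frequency matrix more concrete, here is a minimal sketch in plain Python. It is not Tacotron code: the function name `spectrogram`, the frame length, the hop size and the test tone are all assumptions chosen for illustration. It only shows what such a matrix looks like when computed from audio, which is the kind of representation these models learn to predict from text.

```python
import cmath
import math

def spectrogram(signal, frame_len=256, hop=128):
    """Return a list of frames; each frame is a list of magnitudes per frequency bin."""
    # A Hann window reduces leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive discrete Fourier transform: slow, but has no dependencies.
        mags = [
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ]
        frames.append(mags)
    return frames

# Illustrative input: a quarter second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(2000)]

spec = spectrogram(tone)

# Average each frequency bin over time and find the strongest one.
n_bins = len(spec[0])
mean_per_bin = [sum(frame[k] for frame in spec) / len(spec) for k in range(n_bins)]
peak_bin = mean_per_bin.index(max(mean_per_bin))
peak_hz = peak_bin * sample_rate / 256
```

Each row of `spec` is one short slice of time and each column is one frequency band, so the strongest column sits close to the 440 Hz of the test tone. Real systems use a fast Fourier transform and usually a perceptually scaled frequency axis; the naive version here is only meant to be readable.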

Singing

Both Tacotron and Tacotron 2 were able to learn how a voice sounded and could reproduce speech from text with that very voice. This ability, by itself, was already remarkable. At ICASSP 2020 this year, four researchers from NVIDIA went a step further: they managed to make a person sing without that person's voice ever having been recorded performing a song. The Mellotron neural network is able to vary the pace and the intonation of any (singing or speaking) voice according to the user input, which opens up countless possibilities for variation and expressiveness.
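As a rough illustration of what "pace" and "intonation" can look like as user input, the sketch below describes a melody as plain data, separate from any voice. This is not Mellotron's interface: the note list, the frame rate and the helper names `pitch_contour`, `change_pace` and `transpose` are all hypothetical. It only shows that rhythm and pitch can be edited independently before anything is synthesized.

```python
def pitch_contour(notes, frame_rate=100):
    """Expand (frequency_hz, duration_s) notes into one pitch value per frame."""
    contour = []
    for freq_hz, duration_s in notes:
        n_frames = round(duration_s * frame_rate)
        contour.extend([freq_hz] * n_frames)
    return contour

def change_pace(notes, factor):
    """Stretch (factor > 1) or compress (factor < 1) every note duration."""
    return [(freq_hz, duration_s * factor) for freq_hz, duration_s in notes]

def transpose(notes, semitones):
    """Shift every note by a number of semitones, leaving durations untouched."""
    ratio = 2 ** (semitones / 12)
    return [(freq_hz * ratio, duration_s) for freq_hz, duration_s in notes]

# Hypothetical three-note melody: (frequency in Hz, duration in seconds).
melody = [(440.0, 0.5), (494.0, 0.25), (523.0, 0.25)]

original = pitch_contour(melody)
slower = pitch_contour(change_pace(melody, 2.0))   # same notes, half the pace
higher = pitch_contour(transpose(melody, 12))      # same rhythm, one octave up
```

Changing the pace alters only the length of the contour, and transposing alters only its values. A model conditioned on this kind of information can, in principle, combine it with a separately learned voice timbre.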

Before & After

Before Mellotron, reproducing a lively and expressive voice required gathering plenty of audio material from a speaker, covering all possible variations of the voice. With Mellotron, much less material is needed: “only” enough to learn a person’s voice timbre. Whether that person sounds happy, angry or sad is then up to the network. A couple of years ago this would have been impossible. Thanks to this research, it has just become reality.

If you are interested in what this sounds like, do not miss out on the audio examples produced by Mellotron, and have a look at the original paper:

R. Valle, J. Li, R. Prenger and B. Catanzaro, “Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens,” ICASSP 2020, Barcelona, Spain.

Happy Digging and keep an eye on our future “Audio Synthesis: What’s next?” posts!

Don’t forget: be active and responsible in your community – and stay healthy!

