
Audio Synthesis, what’s next? – Mellotron


Expressive voice synthesis with rhythm and pitch transfer: Mellotron can make a person sing without ever recording their voice performing any song. Interested? Here is more...

Some time ago we suggested paying a visit to the virtual ICASSP 2020 conference. For those of you who couldn’t make it, here’s a short recap of one of the most exciting research papers we stumbled upon. Please give a warm welcome to Mr. Mellotron!

Tacotron

If you are interested in deepfakes, you have probably heard of the impressive audio deepfakes created by the Tacotron model and its rightful successor, Tacotron 2. Both models turn an input text into a complex time-frequency matrix, which is then translated into an audio file. The birth of “modern” text-to-speech applications, which led to audio deepfakes, is largely due to these two papers.
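That “complex time-frequency matrix” is a spectrogram: the audio is chopped into short overlapping frames, and each frame is transformed into the frequency domain. A minimal NumPy sketch (purely illustrative, not the Tacotron code) of how such a matrix is computed from a waveform and translated back into audio:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform: waveform -> complex time-frequency matrix."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=512, hop=128):
    """Inverse STFT with overlap-add: time-frequency matrix -> waveform."""
    window = np.hanning(n_fft)
    out = np.zeros(hop * (len(S) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(S):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(frame) * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

S = stft(audio)          # the complex time-frequency matrix
reconstructed = istft(S)

# Away from the edges, the audio comes back essentially unchanged
err = np.max(np.abs(audio[512:-1024] - reconstructed[512:len(audio) - 1024]))
print(round(err, 6))  # → 0.0
```

Tacotron’s trick is that it predicts such a matrix directly from text, so the “inverse” step is the only thing standing between a sentence and a voice.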

Singing

Both Tacotron and Tacotron 2 were able to learn how a voice sounded and could reproduce speech from text with that very voice. This ability, by itself, was already remarkable. At this year’s ICASSP, three researchers from NVIDIA went a step further: they managed to make a person sing without ever recording their voice performing any song. The Mellotron neural network can vary the pace and the intonation of any (singing or speaking) voice according to the user input, leaving infinite possibilities for variation and expressiveness.
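The “pitch transfer” works because Mellotron is conditioned on a pitch (F0) contour taken from the reference performance, alongside rhythm information. As a rough illustration of what such a contour is built from (our own toy estimator, not the authors’ code), here is a minimal autocorrelation F0 estimator in NumPy:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=800.0):
    """Estimate the fundamental frequency of one audio frame via autocorrelation."""
    frame = frame - frame.mean()
    # Autocorrelation; keep only non-negative lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Strongest peak within the plausible range of the human voice
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# A synthetic 200 Hz "voice" sampled at 16 kHz
sr = 16000
t = np.arange(2048) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 200 * t), sr)
print(f0)  # → 200.0
```

Running an estimator like this frame by frame over a sung melody yields the pitch contour; feed that contour (plus the target speaker’s timbre) to the network, and the output follows the melody in the target voice.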

Before & After

Before Mellotron, reproducing a lively and expressive voice required gathering plenty of audio material of a speaker and exploring all possible variations of the voice. After Mellotron, much less material is needed: “only” enough to learn a person’s voice timbre. Whether that person sounds happy, angry or sad is then up to the network. A couple of years ago this would have been impossible. Thanks to this research, it has just become reality.

If you are interested in what this sounds like, do not miss out on the audio examples produced by Mellotron, and have a look at the original paper:

Happy Digging and keep an eye on our future “Audio Synthesis: What’s next?” posts!

Don’t forget: be active and responsible in your community – and stay healthy!

Related Content

In-Depth Interview – Sam Gregory


Sam Gregory is Program Director of WITNESS, an organisation that works with people who use video to document human rights issues. WITNESS focuses on how people create trustworthy information that can expose abuses and address injustices. How is that connected to deepfakes?

In-Depth Interview – Jane Lytvynenko


We talked to Jane Lytvynenko, senior reporter with Buzzfeed News, focusing on online mis- and disinformation about how big the synthetic media problem actually is. Jane has three practical tips for us on how to detect deepfakes and how to handle disinformation.

ICASSP 2020 International Conference on Acoustics, Speech, and Signal Processing


Here is what we think are the most relevant upcoming audio-related conferences, and which sessions you should attend at ICASSP 2020.

To keep up to date with the latest audio technology for our software development, we follow other researchers’ studies and usually visit many conferences. Sadly, this time we cannot attend them in person. Nevertheless, we can visit them virtually, together with you. Here is what we think are the most relevant upcoming audio-related conferences:

Let’s take a more detailed look at:

ICASSP 2020 International Conference on Acoustics, Speech, and Signal Processing

Date: 4th – 8th of May, 2020
Live schedule: https://2020.ieeeicassp.org/program/schedule/live-schedule/

This is a list of panels we recommend during the ICASSP 2020:

Date: Tuesday, 5th of May 2020

  • Opening Ceremony (9:30 – 10:00h)
  • Plenary by Yoshua Bengio on “Deep Representation Learning” (15:00 – 16:00h)
    • Note: may be pretty technical, one for deep learning enthusiasts
    • Note: He’s one of the fathers of deep learning

Date: Wednesday, 6th of May 2020

Date: Thursday, 7th of May 2020

We’re looking forward to seeing you there!

The Digger project aims:

  • to develop a video and audio verification toolkit, helping journalists and other investigators to analyse audiovisual content, in order to be able to detect video manipulations using a variety of tools and techniques.
  • to develop a community of people from different backgrounds interested in the use of video and audio forensics for the detection of deepfake content.


All sorts of video manipulation


What is the difference between a ‘face swap’, a ‘speedup’ or even a ‘frame reshuffling’ in a video? At the end of the day, they are all manipulations of video content. We want to take a closer look at the different kinds of manipulation – whether they are audio changes, face swapping, visual tampering, or simply taking content out of context.

In Digger we look at synthetic media and how to detect manipulation in all its forms.

This is not a tutorial on how to manipulate video. We want to highlight the different technical sorts of manipulation and raise awareness, so that you might recognise one when it crosses your path. Let’s start with:

Tampering of visuals and audio

Do you remember the Varoufakis finger?! Did he show it, or didn’t he?

This clip was manipulated by pasting in a layer showing another person’s arm. Any element can be cropped out of, or added to, a video.

Specific parts of an audio track can likewise be deleted from a speech or conversation to mislead you. Be careful: background noises can also be added to change the whole context of a scene. It is therefore important to find the original version, so you can compare the videos with each other.

Synthetic audio and lip synchronisation

Imagine being able to say anything fluently in 7 different languages, like David Beckham did.

It is incredible, but the larger part of this video is completely synthetic. The creators built a 3D model of Beckham’s face and reanimated it: a machine learned what David looks like and how he moves when speaking, in order to reproduce David saying anything in any language. One tip from Hany Farid: watch the mouth and lip movements and compare them with your own human behaviour. This is one example of English-speaking lip movements.

Cloned voices are already offered online, so make sure you search for the original version (yes, again) and trusted media reports, or try to get access to an official transcript if it was a public speech.

Shallowfakes or Cheapfakes

Just by slowing down or speeding up a video, the whole context can change. In this example the speed of the video has been lowered: Nancy Pelosi, US Speaker of the House and Democratic Congresswoman, seems to be drunk in an interview.

To compensate for the lowered voice, the pitch has been turned up. All this effort was made to make you believe that Nancy Pelosi was drunk during an interview.
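The slowdown-plus-pitch-correction trick is easy to understand: slowing audio down by plain resampling stretches every waveform period, which lowers the pitch by the same factor – that is why the forger has to raise the pitch again afterwards. A small NumPy sketch (purely illustrative) showing a slowdown to 75% speed shifting a tone’s dominant frequency from 200 Hz down to roughly 150 Hz:

```python
import numpy as np

def dominant_freq(x, sr):
    """Frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(x))
    return np.argmax(spectrum) * sr / len(x)

sr = 16000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 200 * t)   # stand-in for a 200 Hz voice

# Slow down to 75% speed by resampling: the same samples are spread
# over more time, so every period gets longer and the pitch drops
slow = np.interp(np.arange(0, len(voice), 0.75),
                 np.arange(len(voice)), voice)

print(dominant_freq(voice, sr))  # → 200.0
print(dominant_freq(slow, sr))   # ≈ 150.0 (the voice sounds deeper)
```

Raising the pitch afterwards hides the most obvious artefact, but the slowed rhythm of the speech remains – and that is exactly what gives such cheapfakes away when you compare against the original.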

In the case of Jim Acosta part of a video has been sped up in order to suggest that he is making an aggressive movement in the situation where a microphone is being taken away from him.

This shows that even low-tech manipulations can do harm and be challenging to detect. How can you detect them? Again: find the original and compare. Try playing around with the speed in your own video player, for example the VLC player.

Face swap or Body swap 

Imagine dancing like Bruno Mars or Beyonce Knowles without any training – a dream come true.

This highly intelligent system captures the poses and motions of Bruno Mars and maps them onto the body of the amateur. Copying dance moves – arms and legs, torso and head all at once – is still challenging for artificial intelligence. If you focus on the details you will be able to see the manipulation. It is still far from perfect, but it is possible, and just a matter of time until the technology is trained better.

Synthetic video and synthetic voice

You can change and tamper with video and audio, but what happens when you do all of it in one video? What if you could generate a video completely synthetically? One could recreate a person who died many years ago. Please meet Salvador Dalí, anno 2019:

Hard to believe, right? Therefore, always ask yourself if what you see could be true. Check the source and search for more context on the video. Maybe a trustworthy media outlet already reported about it. If you cannot find anything, just do not share it.

The Liar’s Dividend

We also need to be prepared for people claiming that a video or audio is manipulated when it actually isn’t. This is called “The Liar’s Dividend”.

When accused of having said or done something they really said or did, liars may generate and spread altered sound or images to create doubt – or even claim that the authentic footage is a deepfake.

Make sure you have your facts checked. Ask colleagues or experts for help if needed and always watch a video more than twice. 

Have you recently watched a music video? Musicians seem to be among the first professional customers of the deepfake industry. Have a look – this is where the industry is currently being built up.

Did we forget any techniques for video manipulation? Let us know and we will add them to our collection in this article.

