Audio Synthesis, what’s next? – Mellotron

Expressive voice synthesis with rhythm and pitch transfer. Mellotron managed to let a person sing, without ever recording his/her voice performing any song. Interested? Here is more...

Some time ago we suggested paying a visit to the virtual ICASSP 2020 conference. For those of you who couldn’t make it, here’s a short recap of one of the most exciting research papers we stumbled upon. Please give a warm welcome to Mr. Mellotron!

Tacotron

If you are interested in deepfakes, you have probably heard of the impressive audio deepfakes created by the Tacotron model and by its successor, Tacotron 2. The goal of both models is to turn an input text into a time-frequency matrix (a spectrogram), which is then converted into an audio signal. The birth of “modern” text-to-speech applications, which led to audio deepfakes, is largely due to these two papers.

Singing

Both Tacotron and Tacotron 2 were able to learn how a voice sounded and could reproduce speech from text with that very voice. This ability, by itself, was already remarkable. At ICASSP 2020, researchers from NVIDIA went a step further: they managed to let a person sing without ever recording that person’s voice performing any song. The Mellotron neural network can vary the pace and the intonation of any (singing or speaking) voice according to the user input, which opens up a wide range of variation and expressiveness.

Before & After

Before Mellotron, reproducing a lively and expressive voice required gathering plenty of audio material from a speaker and exploring all possible variations of the voice. After Mellotron, much less material will be needed: “only” enough to learn a person’s voice timbre. Whether that person sounds happy, angry or sad is then up to the network, steered by the user input. A couple of years ago this would have been impossible. Thanks to this research, it has become reality.

If you are interested in what this sounds like, do not miss out on the audio examples produced by Mellotron, and have a look at the original paper.

Happy Digging and keep an eye on our future “Audio Synthesis: What’s next?” posts!

 Don’t forget: be active and responsible in your community – and stay healthy!

Related Content

From Rocket-Science to Journalism

In the Digger project we aim to implement scientific audio forensic functionalities in journalistic tools to detect both shallow- and deepfakes. At the Truth and Trust Online Conference 2020 we explained how we are doing this.

Video verification step by step

What should you do if you encounter a suspicious video online? Although there is no golden rule for video verification and each case may present its own particularities, the following steps are a good way to start.

Digger – Detecting Video Manipulation & Synthetic Media

What happens when we cannot trust what we see or hear anymore? First of all: don’t panic! Question the content: Could that be true? And when you are not 100 percent sure, do not share, but search for other media reports about it to double-check.

How do professional journalists and human rights organisations do this? Every video out there could be manipulated. With video editing software anyone can edit a video.

It is challenging to verify content that has been edited, mislabeled or staged. Verifying content whose images or sound have themselves been manipulated is even more complex. We roughly see two kinds of manipulation:

  1. Shallow fakes: manipulated audiovisual content (image, audio, video) generated with ‘low tech’ technologies like Cut & Paste or speed adjustments. 
  2. Deepfakes: artificial (synthetic) audiovisual content (image, audio, video) generated with technologies like Machine Learning.

Deepfakes and synthetic media are among the most feared developments in journalism today. “Deepfake” is a term that describes audio and video files created using artificial intelligence. Synthetic media is artificially generated media, and at the moment it is often simply referred to as deepfakes. Algorithms can create or swap faces and places, and generate synthetic voices that realistically mimic human speech and facial expressions but do not actually exist. That means machine-learning technology can fabricate a video with audio to make people do and say things they never did or said. Such synthetic media can be extremely realistic and convincing, yet they are entirely artificial.

Detection of synthetic media

Face or body swapping, voice cloning and modifying the speed of a video are new forms of manipulating content, and the technology is becoming widely accessible.

At the moment, the real challenge is the so-called shallow fakes. Remember the video in which Nancy Pelosi appeared to be drunk during a speech? It turned out the video had simply been slowed down, with the pitch turned up to cover up the manipulation. Video manipulation and the creation of synthetic media are not the end of the truth, but they make us more cautious before using content in our reporting.
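Why would the pitch need to be turned up at all? Playing audio at a different speed scales every frequency by the same factor, so a slowed-down voice sounds unnaturally deep unless the pitch is corrected. The following is a minimal sketch of that arithmetic, not a description of how any particular video was produced; the function names and the speed factor of 0.75 are illustrative assumptions.

```python
import math


def semitone_shift(speed_factor: float) -> float:
    """Pitch change in semitones when audio is played at `speed_factor`
    times its original speed without any pitch correction."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    # One octave (12 semitones) corresponds to a factor of 2 in frequency.
    return 12.0 * math.log2(speed_factor)


def compensating_shift(speed_factor: float) -> float:
    """Pitch correction in semitones that restores the original pitch."""
    return -semitone_shift(speed_factor)


# Hypothetical example: a clip slowed to 75 % of its original speed drops by
# roughly five semitones, so about five semitones must be added back.
shift = compensating_shift(0.75)
print(f"raise pitch by {shift:.2f} semitones")
```

A pitch correction like this leaves the slower tempo untouched, which is why the speaker still sounds sluggish even though the voice no longer sounds unnaturally low.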

On the technology side it is a rat race. Forensic journalism can help detect altered media. DW’s Research & Cooperation team works together with ATC, a technology company from Greece, and the Fraunhofer Institute for Digital Media Technology to detect manipulation in videos.

Digger – Audio forensics

In the Digger project we focus on using audio forensics technologies to detect manipulation. Audio is an essential part of video, and with the synthetic voice of a politician or the tampered noise of a gunshot, a story can change completely. Digger aims to provide functionalities to detect audio tampering and manipulation in videos.

Our approach makes use of:

  1. Microphone analysis: Analysing the device used to record the audio.
  2. Electrical Network Frequency (ENF) analysis: Detecting editing (cut & paste) of audio by extracting ENF traces.
  3. Codec analysis: Following the digital footprint of the audio.
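To give a feel for the second item, here is a minimal sketch of the general idea behind ENF analysis. It is not Digger’s implementation. It assumes that a recording contains a faint mains hum near 50 Hz whose frequency drifts slowly over time, so that an abrupt jump in the estimated hum frequency between neighbouring frames may hint at a cut. All names, the synthetic test signal and the threshold are illustrative assumptions.

```python
import math


def estimate_enf(samples, sample_rate, nominal_hz=50.0, search_hz=1.0,
                 step_hz=0.05, frame_seconds=2.0):
    """Return one estimated hum frequency (Hz) per non-overlapping frame.

    For each frame, the signal is correlated with sinusoids on a fine
    frequency grid around `nominal_hz`; the strongest one is the estimate.
    """
    frame_len = int(frame_seconds * sample_rate)
    n_steps = int(round(2 * search_hz / step_hz))
    candidates = [nominal_hz - search_hz + i * step_hz
                  for i in range(n_steps + 1)]
    track = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        best_freq, best_power = candidates[0], -1.0
        for freq in candidates:
            w = 2.0 * math.pi * freq / sample_rate
            re = sum(x * math.cos(w * n) for n, x in enumerate(frame))
            im = sum(x * math.sin(w * n) for n, x in enumerate(frame))
            power = re * re + im * im
            if power > best_power:
                best_freq, best_power = freq, power
        track.append(best_freq)
    return track


def find_jumps(track, max_step_hz=0.075):
    """Indices of frames whose estimate jumps by more than `max_step_hz`
    relative to the previous frame (illustrative threshold)."""
    return [i for i in range(1, len(track))
            if abs(track[i] - track[i - 1]) > max_step_hz]


# Synthetic example: a 220 Hz tone stands in for the wanted audio, and the
# hum frequency changes abruptly after two seconds, as it might at a splice.
RATE = 1000  # samples per second
signal = []
for n in range(4 * RATE):
    hum_hz = 49.95 if n < 2 * RATE else 50.10
    t = n / RATE
    signal.append(0.2 * math.sin(2 * math.pi * hum_hz * t)
                  + 0.5 * math.sin(2 * math.pi * 220.0 * t))

track = estimate_enf(signal, RATE)
print(track, find_jumps(track))
```

Real recordings are far messier than this synthetic signal: the hum may be weak, masked or absent altogether, so a sketch like this only illustrates the principle.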

Synthetic media in reality

Synthetic media technologies can have a positive as well as a negative impact on society.

It is exciting and scary at the same time to think about the ability to create audio-visual content in the way we want it and not in the way it exists in reality. Voice synthesis will allow us to speak in hundreds of languages in our own voice.

Or we could bring the master of surrealism back to life.

With the same technology you can also make politicians say things they never said or place people in scenes they have never been in. These technologies are used a lot in pornography, but their impact is also showcased in short clips in which actors are placed in films they have never acted in. Possibly one of the most harmful effects is that perpetrators can easily claim “that’s a deepfake” in order to dismiss any contested information.

How can the authenticity of information be proven reliably? This is exactly what we aim to address with our project Digger.

Stay tuned and get involved

We will publish regular updates about our technology and external developments, and we will interview experts to learn from their ethical, legal and hands-on expertise.

The Digger project is developing a community to share knowledge and initiate collaboration in the field of synthetic media detection. Interested? Follow us on Twitter @Digger_project and send us a DM or leave a comment below. 

Related Content

In-Depth Interview – Jane Lytvynenko

We talked to Jane Lytvynenko, senior reporter with BuzzFeed News focusing on online mis- and disinformation, about how big the synthetic media problem actually is. Jane has three practical tips for us on how to detect deepfakes and how to handle disinformation.
