Audio Synthesis, what’s next? – Parallel WaveGAN

The Parallel WaveGAN is a neural vocoder producing high quality audio faster than real-time. Are personalized vocoders possible in the near future with this speed of progress?

In our previous post of the “Audio Synthesis: What’s next?” series, we started talking about the latest advancements in audio synthesis. In this post we introduce the Parallel WaveGAN network.

Not easy

You have probably already guessed that text-to-speech is a very hard task. It is so hard, in fact, that it has been split into two separate problems. Some researchers focus on translating the input text into a time-frequency representation; the spectrograms below, for example, are generated by Tacotron-like networks. Other researchers focus on translating those pictures into proper audio files that sound as natural as possible. This second step is the general goal of “neural vocoders” such as the Parallel WaveGAN.

Time-frequency spectrograms generated by Tacotron-like networks.
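
To make the idea of a time-frequency “picture” more concrete, here is a minimal sketch in plain Python. It is not Tacotron and not the Parallel WaveGAN: it simply analyses a pure test tone with a naive short-time Fourier transform, so that each row is a moment in time and each column a frequency band. The sample rate, tone frequency, frame length and function name are our own illustrative assumptions. A neural vocoder has to solve the opposite problem, going from such a matrix back to a waveform.

```python
import cmath
import math

FRAME_LEN = 64      # samples per analysis frame (illustrative assumption)
HOP = 32            # step between consecutive frames (illustrative assumption)
SAMPLE_RATE = 8000  # Hz (illustrative assumption)
TONE_HZ = 1000      # frequency of the test tone (illustrative assumption)


def magnitude_spectrogram(samples, frame_len=FRAME_LEN, hop=HOP):
    """Return one list of frame_len // 2 + 1 magnitudes per time frame."""
    # A Hann window reduces leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive discrete Fourier transform, non-negative frequencies only.
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# A pure sine tone stands in for real speech.
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]
spec = magnitude_spectrogram(signal)

# The strongest column of the first frame sits at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames x", len(spec[0]), "bins, peak at", peak_hz, "Hz")
```

Real systems typically use mel-scaled spectrograms computed with optimized FFT libraries; the naive transform above is only meant to show what the rows and columns of the picture stand for.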

Fast

The Parallel WaveGAN was proposed by three researchers from the LINE (Japan) and NAVER (South Korea) corporations, with the goal of improving on existing neural vocoders. The researchers focused on one of the most demanding requirements of neural vocoders: producing high quality audio in a “reasonable” amount of time. Their efforts were rewarded with a system that works faster than real-time and can be trained four times as fast as the competition.

Before & After

Before Parallel WaveGAN, achieving this level of audio quality faster than real-time required at least 2 weeks of training a neural vocoder on a very high-end GPU. With Parallel WaveGAN, it is possible to create remarkably high quality audio files with only 3 days of training on the same high-end GPU. On top of that, it can produce audio 28 times faster than real-time! At this speed of progress, sooner or later consumer GPUs will be able to train such models, which would mean that a new era of personalized vocoders is quickly getting closer.
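
To put these numbers into perspective, here is a small back-of-the-envelope sketch. It only uses the figures quoted above (2 weeks versus 3 days of training, 28 times faster than real-time); the function names are our own, and the 2 weeks are treated as exactly 14 days for the sake of the calculation.

```python
def synthesis_time_seconds(audio_seconds, speedup_over_real_time):
    """Time needed to generate a clip at a given speed-up over real-time."""
    return audio_seconds / speedup_over_real_time


def training_speedup(days_before, days_after):
    """How many times faster training became."""
    return days_before / days_after


# One minute of audio at 28 times real-time takes roughly 2.1 seconds.
one_minute = synthesis_time_seconds(60.0, 28.0)

# 14 days versus 3 days of training is a speed-up of roughly 4.7.
speedup = training_speedup(14.0, 3.0)

print(f"60 s of audio in {one_minute:.2f} s, training {speedup:.2f}x faster")
```

In other words, a one-minute clip would be ready in about two seconds, and the training time shrinks by a factor between four and five.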

If you are interested in what this sounds like, do not miss out on the audio examples produced by the Parallel WaveGAN, and have a look at the original paper:

    Happy Digging and keep an eye on our future “Audio Synthesis: What’s next?” posts!

     Don’t forget: be active and responsible in your community – and stay healthy!

    Related Content

    In-Depth Interview – Jane Lytvynenko

    We talked to Jane Lytvynenko, senior reporter with Buzzfeed News, focusing on online mis- and disinformation about how big the synthetic media problem actually is. Jane has three practical tips for us on how to detect deepfakes and how to handle disinformation.

    From Rocket-Science to Journalism

    In the Digger project we aim to implement scientific audio forensic functionalities in journalistic tools to detect both shallow- and deepfakes. At the Truth and Trust Online Conference 2020 we explained how we are doing this.

    Audio Synthesis, what’s next? – Mellotron

    Expressive voice synthesis with rhythm and pitch transfer. Mellotron managed to let a person sing, without ever recording his/her voice performing any song. Interested? Here is more...

    Some time ago we suggested paying a visit to the virtual ICASSP 2020 conference. For those of you who couldn’t make it, here’s a short recap of one of the most exciting research papers we stumbled upon. Please give a warm welcome to Mr. Mellotron!

    Tacotron

    If you are interested in deepfakes, you have probably heard of the impressive audio deepfakes created by the Tacotron model and by its rightful successor, Tacotron 2. The goal of both models is to turn an input text into a complex time-frequency matrix, which is then translated into an audio file. The birth of “modern” text-to-speech applications, which led to audio deepfakes, is largely due to these two papers.

    Singing

    Both Tacotron and Tacotron 2 were able to learn how a voice sounded and could reproduce speech from text with that very voice. This ability, by itself, was already remarkable. At this year’s ICASSP 2020, researchers from NVIDIA went a step further: they managed to let a person sing without ever recording their voice performing any song. The Mellotron neural network is able to vary the pace and the intonation of any (singing or speaking) voice according to the user input, opening up endless possibilities for variation and expressiveness.

    Before & After

    Before Mellotron, reproducing a lively and expressive voice required gathering plenty of audio material from a speaker and exploring all possible variations of the voice. After Mellotron, much less material is needed: “only” enough to learn a person’s voice timbre. Whether that person sounds happy, angry or sad is up to the network to decide. A couple of years ago this would have been impossible. Thanks to this research, it just became reality.

    If you are interested in what this sounds like, do not miss out on the audio examples produced by Mellotron, and have a look at the original paper:

    Happy Digging and keep an eye on our future “Audio Synthesis: What’s next?” posts!

     Don’t forget: be active and responsible in your community – and stay healthy!

    ICASSP 2020 International Conference on Acoustics, Speech, and Signal Processing

    Here is what we think are the most relevant upcoming audio-related conferences, and which sessions you should attend at ICASSP 2020.

    To keep up to date with the latest in audio technology for our software development, we follow other researchers’ studies and usually visit many conferences. Sadly, this time we cannot attend them in person. Nevertheless, we can visit them virtually, together with you. Here is what we think are the most relevant upcoming audio-related conferences:

    Let’s take a more detailed look at:

    ICASSP 2020 International Conference on Acoustics, Speech, and Signal Processing

    Date: 4th – 8th of May, 2020
    Location: https://2020.ieeeicassp.org/program/schedule/live-schedule/

    This is a list of sessions we recommend at ICASSP 2020:

    Date: Tuesday, 5th of May 2020

    • Opening Ceremony (9:30 – 10:00h)
    • Plenary by Yoshua Bengio on “Deep Representation Learning” (15:00 – 16:00h)
      • Note: may be pretty technical, aimed at deep learning enthusiasts
      • Note: he is one of the fathers of deep learning

    Date: Wednesday, 6th of May 2020

    Date: Thursday, 7th of May 2020

    We’re looking forward to seeing you there!

    The Digger project aims:

    • to develop a video and audio verification toolkit that helps journalists and other investigators analyse audiovisual content and detect video manipulations using a variety of tools and techniques.
    • to develop a community of people from different backgrounds interested in the use of video and audio forensics for the detection of deepfake content.

    Related Content

    In-Depth Interview – Sam Gregory

    Sam Gregory is Program Director of WITNESS, an organisation that works with people who use video to document human rights issues. WITNESS focuses on how people create trustworthy information that can expose abuses and address injustices. How is that connected to deepfakes?
