Voice fakes

September 6, 2018 — June 17, 2021

dynamical systems
machine learning
signal processing
time series

A placeholder. Generating speech, without a speaker, or possibly style transferring speech.

1 Style transfer

You have a recording of me saying something self-incriminating. You would prefer it to be a recording of Hillary Clinton saying something incriminating. This is achievable.

The open source implementations have tended to be fairly mediocre, while the pay-to-play options produce provocative demos but do not let you use them.

As of December 2020, there are two impressive nearly-released methods, Qian et al. (2019) and Qian et al. (2020). They are nearly released in the sense that the authors have published the full models, although not the trained weights.

The “vocoder free” approach of kaen2891 (kaen2891.github.io, “Research Results”) might be good, although their website is fried.

For an overview of recent history see Kyle Kastner’s suggestions:

VoCo seems to be a classic concatenative synthesis method for doing “voice cloning” which generally will work on small datasets but won’t really generalize beyond the subset of sound tokens you already have. I did a blog post on a really simple version of this.

There’s another cool webdemo of how this works. Improving concatenative results to get VoCo level is mostly a matter of better features, and doing a lot of work on the DSP side to fix obvious structural errors, along with probably adding a language model to improve transitions and search.

You can see an example of concatenative audio for “music transfer” here.
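To make the concatenative idea concrete, here is a minimal sketch. It is not VoCo: it rebuilds a source signal out of frames from a target-speaker corpus by nearest-neighbour matching on a crude log-spectrum feature, then overlap-adds the chosen frames. The function names, frame sizes and the feature itself are illustrative assumptions, and there is no language model or transition cost, which is the part the discussion above says needs the real work.

```python
import numpy as np

def frames(signal, size, hop):
    """Slice a 1-D signal into overlapping frames of length `size`."""
    n = 1 + (len(signal) - size) // hop
    idx = np.arange(size)[None, :] + hop * np.arange(n)[:, None]
    return signal[idx]

def features(frame_matrix):
    """Log-magnitude spectrum of each Hann-windowed frame (a crude feature)."""
    window = np.hanning(frame_matrix.shape[1])
    return np.log1p(np.abs(np.fft.rfft(frame_matrix * window, axis=1)))

def concatenative_resynthesis(source, corpus, size=512, hop=256):
    """Rebuild `source` using only frames taken from `corpus`."""
    src_frames = frames(source, size, hop)
    corpus_frames = frames(corpus, size, hop)
    src_feat = features(src_frames)
    corpus_feat = features(corpus_frames)
    # Squared Euclidean distance between every source and corpus frame.
    # This is O(n_source * n_corpus) memory, so keep the signals short.
    dists = ((src_feat[:, None, :] - corpus_feat[None, :, :]) ** 2).sum(axis=2)
    choice = dists.argmin(axis=1)
    # Overlap-add the chosen corpus frames at the source frame positions.
    window = np.hanning(size)
    out = np.zeros(hop * (len(src_frames) - 1) + size)
    for i, j in enumerate(choice):
        out[i * hop : i * hop + size] += corpus_frames[j] * window
    return out, choice
```

A quick sanity check is to pass the same noise signal as both source and corpus; every frame should then select itself. With a real corpus the output can only contain sounds the corpus already has, which is the generalization limit mentioned above.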

I personally think Apple’s hybrid approach has a lot more potential than plain VoCo for style transfer—I like this paper a lot!

For learning about the prerequisites to Lyrebird, I recommend Alex Graves’ monograph, then watching Alex Graves’ lecture which shows the extension to speech, and maybe checking out our paper char2wav. There’s a lot of background we couldn’t fit in 4 pages for the workshop, but reading Graves’ past work should cover most of that, along with WaveNet and SampleRNN. Lyrebird itself is proprietary, but going through these works should give you a lot of ideas about techniques to try.

I wouldn’t recommend GAN for audio, unless you are already quite familiar with GAN in general. It is very hard to get any generative model working on audio, let alone GAN.

By Corentin Jemine, Real-Time-Voice-Cloning will “clone a voice in 5 seconds to generate arbitrary speech in real-time.” It needs a multi-gigabyte GPU and tedious training, but the quality is OK.

There are various networks in the CycleGAN/StarGAN families that do voice style transfer. The original authors do not release implementations, but there are community versions, e.g. 1, 2, 3. None of them sound great. If you were going to bother with StarGAN, why not instead do it no-frills, poor-man's style, using RandomCNN? It performs voice style transfer via random feature matching, and is not worse than the fancier ones to my ears.
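A rough sketch of what random feature matching could look like, assuming it means Gatys-style content and style losses computed on the output of an untrained convolutional layer. None of this is taken from the RandomCNN code: the layer shape, filter count, seed handling and loss definitions are my illustrative assumptions. The spectrogram is treated as a 1-D sequence over time with frequency bins as input channels.

```python
import numpy as np

def random_conv_features(spec, filters=32, width=11, seed=0):
    """ReLU output of one 1-D convolution with fixed random weights.

    spec has shape (n_freq, n_time). The same seed gives the same
    weights, so two spectrograms are compared in the same feature space.
    """
    rng = np.random.default_rng(seed)
    n_freq = spec.shape[0]
    weights = rng.standard_normal((filters, n_freq, width))
    weights /= np.sqrt(n_freq * width)
    # windows has shape (n_freq, n_time - width + 1, width).
    windows = np.lib.stride_tricks.sliding_window_view(spec, width, axis=1)
    feats = np.einsum("fcw,ctw->ft", weights, windows)
    return np.maximum(feats, 0.0)

def gram(feats):
    """Time-averaged filter co-activation matrix, shape (filters, filters)."""
    return feats @ feats.T / feats.shape[1]

def style_loss(spec_a, spec_b, filters=32):
    """Mismatch of Gram matrices; ignores where in time things happen."""
    ga = gram(random_conv_features(spec_a, filters))
    gb = gram(random_conv_features(spec_b, filters))
    return float(np.mean((ga - gb) ** 2))

def content_loss(spec_a, spec_b, filters=32):
    """Mismatch of the features themselves; needs equal-length inputs."""
    fa = random_conv_features(spec_a, filters)
    fb = random_conv_features(spec_b, filters)
    return float(np.mean((fa - fb) ** 2))
```

An actual transfer would then adjust a generated spectrogram by gradient descent on a weighted sum of the content loss (against the source utterance) and the style loss (against the target speaker). That step needs an autodiff framework and is not shown here.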

sprocket is old-school, manually designed mixture-model voice conversion. It looks like it needs only modest hardware but a lot of steps.

Voice conversion (VC) is a technique to convert a speaker identity of a source speaker into that of a target speaker. This software enables the users to develop a traditional VC system based on a Gaussian mixture model (GMM) and a vocoder-free VC system based on a differential GMM (DIFFGMM) using a parallel dataset of the source and target speakers.
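For orientation, the core of traditional GMM-based conversion is the conditional mean of the target features given the source features under a joint GMM fitted on time-aligned parallel data. The following is a minimal sketch of that textbook mapping only. It is not sprocket's API, and the function name and argument layout are assumptions for illustration. It takes already-fitted parameters and converts a single feature vector.

```python
import numpy as np

def gmm_convert(x, weights, means, covs):
    """Return E[y | x] under a joint GMM over z = [x, y].

    x: (D,) source feature vector.
    weights: (M,), means: (M, 2D), covs: (M, 2D, 2D).
    """
    d = x.shape[0]
    n_mix = len(weights)
    log_post = np.empty(n_mix)
    cond_means = np.empty((n_mix, d))
    for m in range(n_mix):
        mu_x, mu_y = means[m, :d], means[m, d:]
        cov_xx, cov_yx = covs[m, :d, :d], covs[m, d:, :d]
        diff = x - mu_x
        solved = np.linalg.solve(cov_xx, diff)
        _, logdet = np.linalg.slogdet(cov_xx)
        # log(w_m * N(x; mu_x, cov_xx)), dropping the 2*pi constant
        # because it is shared by all components and cancels below.
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + diff @ solved)
        # Per-component regression of y on x.
        cond_means[m] = mu_y + cov_yx @ solved
    # Normalize the component posteriors stably, then mix the regressions.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ cond_means
```

With a single component this reduces to ordinary linear regression of target on source features. The differential (vocoder-free) variant, as far as I understand it, estimates the feature difference between target and source and applies it as a filter to the source waveform; that part is not sketched here.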

2 Text to speech


This is Uberduck. It’s a synthetic speech toy. I started working on Uberduck in 2020 with the goal of creating a friendly, creative, open-ended dialog agent. I built an interactive audio chatbot over WebRTC that generated text responses with a Transformer model and synthesized them to audio, but I found that speech synthesis was the most exciting part of the project.

char2wav shows how to condition an acoustic model.

Lyrebird had a flashy launch but then vanished (because, AFAICT, the product was commercialised, selling voice fakery to podcasters and, for all I know, QAnon).

How Siri does speech is kinda interesting.

WaveGAN does do GANs for audio, at least over short sequences; Kastner might be right about long sequences being hard for GANs. But it does have an online demo.

nnmnkwii (nanamin kawaii)

Library to build speech synthesis systems designed for easy and fast prototyping.


Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).


PyTorch implementation of Generative Adversarial Networks (GAN) based text-to-speech (TTS) and voice conversion (VC).

3 References

Arik, Chen, Peng, et al. 2018. “Neural Voice Cloning with a Few Samples.” arXiv:1802.06006 [Cs, Eess].
Chaudhuri, Roth, Ellis, et al. 2018. “AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies.” arXiv:1808.00606 [Cs, Eess].
Donahue, McAuley, and Puckette. 2018. “Synthesizing Audio with Generative Adversarial Networks.” arXiv:1802.04208 [Cs].
Jin, Mysore, Diverdi, et al. 2017. “VoCo: Text-Based Insertion and Replacement in Audio Narration.” ACM Transactions on Graphics.
Kalchbrenner, Elsen, Simonyan, et al. 2018. “Efficient Neural Audio Synthesis.” arXiv:1802.08435 [Cs, Eess].
Kaneko, and Kameoka. 2017. “Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks.” arXiv:1711.11293 [Cs, Eess, Stat].
Kobayashi, and Toda. n.d. “Sprocket: Open-Source Voice Conversion Software.”
Kumar, Kumar, de Boissiere, et al. 2019. “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.” arXiv:1910.06711 [Cs, Eess].
Lee, Ko, Lee, et al. 2020. “Many-To-Many Voice Conversion Using Conditional Cycle-Consistent Adversarial Networks.” In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Mehri, Kumar, Gulrajani, et al. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Prenger, Valle, and Catanzaro. 2018. “WaveGlow: A Flow-Based Generative Network for Speech Synthesis.” arXiv:1811.00002 [Cs, Eess, Stat].
Qian, Zhang, Chang, et al. 2019. “AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss.” In International Conference on Machine Learning.
Qian, Zhang, Chang, et al. 2020. “Unsupervised Speech Decomposition via Triple Information Bottleneck.” arXiv:2004.11284 [Cs, Eess].
van den Oord, Dieleman, Zen, et al. 2016. “WaveNet: A Generative Model for Raw Audio.” In 9th ISCA Speech Synthesis Workshop.
Zhou, Horgan, Kumar, et al. 2018. “Voice Conversion with Conditional SampleRNN.” arXiv:1808.08311 [Cs, Eess].