A placeholder. Generating speech without a speaker, or possibly style-transferring speech.
You have a recording of me saying something self-incriminating. You would prefer it to be a recording of Hillary Clinton saying something incriminating. This is achievable.
There has been a tendency for the open-source options to be fairly mediocre, while the pay-to-play ones leave provocative demos about but do not let you use them.
As of December 2020, there are two impressive nearly-released ones, AutoVC and SpeechSplit [@QianAutoVC2019;@QianUnsupervised2020] (nearly released in the sense that they have published the full models, although not the trained weights).
The “vocoder free” approach of kaen2891 (kaen2891.github.io) might be good, although their website is fried.
For an overview of recent history see Kyle Kastner’s suggestions:
VoCo seems to be a classic concatenative synthesis method for “voice cloning”, which generally works on small datasets but won’t really generalize beyond the subset of sound tokens you already have. I did a blog post on a really simple version of this.
There’s another cool web demo of how this works. Improving concatenative results to VoCo’s level is mostly a matter of better features, a lot of work on the DSP side to fix obvious structural errors, and probably a language model to improve transitions and search.
You can see an example of concatenative audio for “music transfer” here.
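The nearest-neighbour idea behind such concatenative methods fits in a few lines. Here is a toy sketch (my own illustration, not VoCo’s actual algorithm): chop a corpus recording and a target recording into frames, then rebuild the target out of the spectrally closest corpus frames.

```python
import numpy as np

def concatenative_transfer(corpus, target, frame_len=256):
    """Rebuild `target` from frames of `corpus` by nearest-neighbour matching
    on crude spectral features — a toy version of concatenative synthesis."""
    def frames(x):
        n = len(x) // frame_len
        return x[: n * frame_len].reshape(n, frame_len)

    corpus_f, target_f = frames(corpus), frames(target)

    # Features: magnitude spectrum of each frame (phase is ignored).
    feat = lambda f: np.abs(np.fft.rfft(f, axis=1))
    cf, tf = feat(corpus_f), feat(target_f)

    # For each target frame, pick the corpus frame with the closest spectrum.
    out = [corpus_f[np.argmin(np.linalg.norm(cf - t, axis=1))] for t in tf]
    return np.concatenate(out)

# Usage: reconstruct a 330 Hz "phrase" from a corpus of three pure tones.
sr = 8000
t = np.arange(sr) / sr
corpus = np.concatenate([np.sin(2 * np.pi * f * t) for f in (220, 330, 440)])
target = np.sin(2 * np.pi * 330 * t)
resynth = concatenative_transfer(corpus, target)
```

The matcher correctly grabs the 330 Hz corpus frames, but because phase is ignored the output has audible clicks at frame boundaries; fixing exactly that kind of structural error is the DSP work mentioned above.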
I personally think Apple’s hybrid approach has a lot more potential than plain VoCo for style transfer — I like this paper a lot!
For learning about the prerequisites to Lyrebird, I recommend Alex Graves’ monograph, then watching his lecture, which shows the extension to speech, and maybe checking out our paper char2wav. There’s a lot of background we couldn’t fit into 4 pages for the workshop, but reading Graves’ past work should cover most of it, along with WaveNet and SampleRNN. Lyrebird itself is proprietary, but going through these works should give you a lot of ideas about techniques to try.
I wouldn’t recommend GANs for audio unless you are already quite familiar with GANs in general. It is very hard to get any generative model working on audio, let alone a GAN.
By Corentin Jemine, Real-Time-Voice-Cloning will “clone a voice in 5 seconds to generate arbitrary speech in real-time.” It needs a multi-gigabyte GPU and tedious training, but the quality is OK.
There are various networks in the CycleGAN/StarGAN families that do voice style transfer. The original authors do not release implementations, but there are community versions, e.g. 1, 2, 3. None of them sound great. If you were going to bother with StarGAN, why not instead do it no-frills, poor-man’s style, using RandomCNN? It does voice style transfer via random feature matching, and is no worse than the fancier ones to my ears.
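For intuition, the core of that random-feature-matching trick is small: push spectrograms through one fixed, *untrained* convolution layer and compare Gram matrices of the activations. A minimal numpy sketch of just the loss (all names are my own; the real method then optimizes the content spectrogram against this loss and inverts it back to audio):

```python
import numpy as np

def make_random_conv(n_freq, n_filters=16, k=3, seed=0):
    """Return a feature extractor built from one fixed, untrained 1-D conv
    layer (random weights + ReLU) sliding over the time axis of a
    (n_freq, n_time) spectrogram."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_filters, n_freq, k)) / np.sqrt(n_freq * k)

    def features(spec):
        n_time = spec.shape[1]
        out = np.empty((n_filters, n_time - k + 1))
        for i in range(n_time - k + 1):
            out[:, i] = np.maximum(
                0.0, np.tensordot(W, spec[:, i : i + k], axes=([1, 2], [0, 1]))
            )
        return out

    return features

def gram(f):
    """Time-averaged second-order feature statistics ("texture")."""
    return f @ f.T / f.shape[1]

def style_loss(feats, content_spec, style_spec):
    """Gram-matrix distance of random features: small when the two
    spectrograms share timbre statistics, regardless of time alignment."""
    return np.linalg.norm(gram(feats(content_spec)) - gram(feats(style_spec)))

# Two toy "spectrograms" with different energy statistics.
rng = np.random.default_rng(1)
spec_a = np.abs(rng.standard_normal((32, 40)))
spec_b = 3.0 * np.abs(rng.standard_normal((32, 40)))
feats = make_random_conv(n_freq=32)
```

The surprising part is that the conv weights are never trained; random filters already give features whose second-order statistics separate voices reasonably well.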
Voice conversion (VC) is a technique for converting the speaker identity of a source speaker into that of a target speaker. This software enables users to develop a traditional VC system based on a Gaussian mixture model (GMM), and a vocoder-free VC system based on a differential GMM (DIFFGMM), using a parallel dataset of the source and target speakers.
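The GMM conversion rule itself is compact: fit a joint GMM over stacked [source; target] features, then map each source frame to its expected target frame, i.e. a responsibility-weighted sum of per-component linear regressions. A numpy sketch of the standard minimum-mean-square-error mapping (my own illustration of the textbook formula, not this package’s code):

```python
import numpy as np

def gmm_convert(x, weights, means, covs):
    """MMSE GMM voice conversion: given a joint GMM over stacked
    [source; target] features, map a source frame x to E[target | x].

    weights: (M,)               mixture weights
    means:   (M, dx + dy)       joint means [mu_x; mu_y]
    covs:    (M, dx+dy, dx+dy)  joint covariances
    """
    dx = x.shape[0]
    M = weights.shape[0]
    logp = np.empty(M)                        # log w_m * N(x | mu_x, Sxx)
    cond = np.empty((M, means.shape[1] - dx)) # per-component E[y | x, m]
    for m in range(M):
        mu_x, mu_y = means[m, :dx], means[m, dx:]
        Sxx, Syx = covs[m, :dx, :dx], covs[m, dx:, :dx]
        diff = x - mu_x
        Sxx_inv = np.linalg.inv(Sxx)
        logp[m] = (np.log(weights[m])
                   - 0.5 * (diff @ Sxx_inv @ diff)
                   - 0.5 * np.log(np.linalg.det(2 * np.pi * Sxx)))
        cond[m] = mu_y + Syx @ Sxx_inv @ diff  # linear regression of y on x
    p = np.exp(logp - logp.max())
    p /= p.sum()                               # responsibilities P(m | x)
    return p @ cond

# One component whose joint covariance [[1, 2], [2, 4]] encodes y = 2x.
y_hat = gmm_convert(np.array([1.5]), np.array([1.0]),
                    np.array([[0.0, 0.0]]),
                    np.array([[[1.0, 2.0], [2.0, 4.0]]]))
```

The DIFFGMM variant mentioned above applies the same machinery to the *difference* between source and target features, which is what lets it skip the vocoder.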
Text to speech
char2wav shows how to condition an acoustic model.
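In its simplest form, “conditioning” just means the decoder sees an utterance-level embedding at every step. A toy unrolled decoder with random weights (purely to illustrate the wiring; not char2wav’s architecture, which adds attention over the text on top of this):

```python
import numpy as np

def run_conditioned_decoder(cond, n_steps, frame_dim=4, hidden=8, seed=0):
    """Unroll a toy acoustic decoder whose every step sees a fixed
    conditioning vector (e.g. a text/speaker embedding) concatenated with
    the previously generated frame. Weights are random: this shows the
    wiring only, not a trained model."""
    rng = np.random.default_rng(seed)
    d_in = frame_dim + cond.shape[0]
    Wh = rng.standard_normal((hidden, hidden)) * 0.1
    Wx = rng.standard_normal((hidden, d_in)) * 0.1
    Wo = rng.standard_normal((frame_dim, hidden)) * 0.1
    h = np.zeros(hidden)
    frame = np.zeros(frame_dim)
    out = []
    for _ in range(n_steps):
        # The conditioning vector enters the recurrence at every step.
        h = np.tanh(Wh @ h + Wx @ np.concatenate([frame, cond]))
        frame = Wo @ h
        out.append(frame)
    return np.stack(out)

# Different conditioning vectors yield different acoustic trajectories.
a = run_conditioned_decoder(np.array([1.0, 0.0]), n_steps=5)
b = run_conditioned_decoder(np.array([0.0, 1.0]), n_steps=5)
```

Everything downstream (attention, speaker adaptation, Lyrebird-style cloning) is a refinement of this basic move: feed a learned embedding into the generator at every step.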
How Siri does speech is kinda interesting.
nnmnkwii (nanamin kawaii)
Library to build speech synthesis systems designed for easy and fast prototyping.
Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).
PyTorch implementation of Generative Adversarial Network (GAN) based text-to-speech (TTS) and voice conversion (VC).