Voice fakes

A placeholder. Generating speech without a speaker, or possibly style-transferring existing speech.

Style transfer

You have a recording of me saying something self-incriminating. You would prefer it to be a recording of Hillary Clinton saying something incriminating. This is achievable: the open-source options are not impressive, but the pay-to-play options are getting very good.

Kyle Kastner’s suggestions

VoCo seems to be a classic concatenative synthesis method for doing “voice cloning”, which generally will work on small datasets but won’t really generalize beyond the subset of sound tokens you already have. I did a blog post on a really simple version of this.

There’s another cool webdemo of how this works. Improving concatenative results to get to VoCo level is mostly a matter of better features, and doing a lot of work on the DSP side to fix obvious structural errors, along with probably adding a language model to improve transitions and search.

You can see an example of concatenative audio for “music transfer” here.
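To make the concatenative idea concrete, here is a minimal sketch, not VoCo itself. It assumes toy features (RMS energy and zero-crossing rate, where a real system would use something like MFCCs) and a crude join cost that only rewards staying on consecutive corpus frames. All the function names and the `join_weight` parameter are made up for illustration. Note that the output can only ever contain frames that already exist in the corpus, which is the generalization limit mentioned above.

```python
import math

def frames(signal, size):
    """Cut a signal into non-overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def feature(frame):
    """Toy feature vector: (RMS energy, zero-crossing rate)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return (rms, zcr)

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_units(source_feats, corpus_feats, join_weight=0.1):
    """Viterbi search: choose one corpus unit per source frame, minimising
    target cost (feature mismatch) plus join cost (jumps within the corpus)."""
    n = len(corpus_feats)
    cost = [dist(source_feats[0], c) for c in corpus_feats]
    back = []
    for s in source_feats[1:]:
        new_cost, pointers = [], []
        for j in range(n):
            # Consecutive corpus units (i + 1 == j) join for free.
            def step(i):
                return cost[i] + (0.0 if i + 1 == j else join_weight)
            best_i = min(range(n), key=step)
            new_cost.append(step(best_i) + dist(s, corpus_feats[j]))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards.
    j = min(range(n), key=lambda k: cost[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]

def concatenate(path, corpus_frames):
    """Glue the selected corpus frames together (no cross-fading)."""
    out = []
    for j in path:
        out.extend(corpus_frames[j])
    return out
```

In use, you would compute `feature` for every frame of the source utterance and of the target speaker's corpus, call `select_units`, and pass the resulting path to `concatenate`. The "better features" and "DSP work" in the text correspond to replacing `feature` and the hard splice in `concatenate` respectively.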

I personally think [Apple’s hybrid approach](https://machinelearning.apple.com/2017/08/06/siri-voices.html) has a lot more potential than plain VoCo for style transfer. I like this paper a lot!

For learning about the prerequisites to Lyrebird, I recommend Alex Graves’ monograph, then watching Alex Graves’ lecture which shows the extension to speech, and maybe checking out our paper char2wav. There’s a lot of background we couldn’t fit in 4 pages for the workshop, but reading Graves’ past work should cover most of that, along with WaveNet and SampleRNN. Lyrebird itself is proprietary, but going through these works should give you a lot of ideas about techniques to try.

I wouldn’t recommend GAN for audio, unless you are already quite familiar with GAN in general. It is very hard to get any generative model working on audio, let alone GAN.

Corentin Jemine’s Real-Time-Voice-Cloning will “clone a voice in 5 seconds to generate arbitrary speech in real-time”. It needs a GPU with multiple gigabytes of memory, but the quality is OK.

There are various networks in the CycleGAN/StarGAN families that do voice style transfer. The original authors do not release implementations, but there are community versions, e.g. 1, 2, 3.
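The ingredient that lets the CycleGAN-style methods train without parallel recordings is the cycle-consistency loss: convert source to target and back again, and penalize any difference from the original. Below is a minimal sketch of that one term. The converters `g_xy` and `g_yx` are hypothetical stand-ins (plain functions on feature frames) for what would be neural networks, and the adversarial and identity losses that the real models also use are left out.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length feature frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_loss(xs, ys, g_xy, g_yx):
    """Cycle-consistency loss over unpaired source frames xs and target frames ys.

    g_xy maps source-speaker features to target-speaker features;
    g_yx maps back. Neither needs xs[i] and ys[i] to be the same utterance.
    """
    forward = sum(l1(g_yx(g_xy(x)), x) for x in xs) / len(xs)
    backward = sum(l1(g_xy(g_yx(y)), y) for y in ys) / len(ys)
    return forward + backward

# Toy converters: a perfectly invertible pair has zero cycle loss.
double = lambda frame: [2.0 * v for v in frame]
halve = lambda frame: [v / 2.0 for v in frame]
perfect = cycle_loss([[1.0, 2.0]], [[4.0, 6.0]], double, halve)

# A back-converter with a bias no longer round-trips.
biased_halve = lambda frame: [v / 2.0 + 1.0 for v in frame]
imperfect = cycle_loss([[1.0, 2.0]], [[4.0, 6.0]], double, biased_halve)
```

On its own this loss is satisfied by the identity map, which is why the real systems pair it with a discriminator that pushes `g_xy` outputs to sound like the target speaker.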

RandomCNN voice transfer is the no-frills, poor-man’s option: it does voice style transfer via random feature matching.
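As a rough sketch of what "random feature matching" can mean, here is the usual image-style-transfer recipe moved to spectrograms: pass both spectrograms through untrained convolution filters, then compare time-averaged Gram matrices for style and raw activations for content. That this is how RandomCNN does it in detail is my assumption. The function names and filter shapes are invented for illustration, and the gradient-descent loop that would actually update the output spectrogram is omitted.

```python
import random

def random_filters(n_filters, n_bins, width, seed=0):
    """Untrained filters, indexed as filters[k][time_offset][frequency_bin]."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(n_bins)]
             for _ in range(width)]
            for _ in range(n_filters)]

def conv_features(spec, filters):
    """Valid 1-D convolution over time followed by ReLU.

    spec is a list of frames, each a list of per-bin magnitudes.
    Returns feats[k][t], the activation of filter k at time t.
    """
    width, n_bins = len(filters[0]), len(spec[0])
    feats = []
    for f in filters:
        row = []
        for t in range(len(spec) - width + 1):
            act = sum(f[d][b] * spec[t + d][b]
                      for d in range(width) for b in range(n_bins))
            row.append(max(act, 0.0))
        feats.append(row)
    return feats

def gram(feats):
    """Time-averaged channel co-activations; forgets when things happened."""
    n_steps = len(feats[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n_steps for fj in feats]
            for fi in feats]

def style_loss(spec_a, spec_b, filters):
    """Squared difference between Gram matrices (the timbre-like statistics)."""
    ga = gram(conv_features(spec_a, filters))
    gb = gram(conv_features(spec_b, filters))
    return sum((x - y) ** 2 for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))

def content_loss(spec_a, spec_b, filters):
    """Squared difference between activations (what is said, and when)."""
    fa, fb = conv_features(spec_a, filters), conv_features(spec_b, filters)
    return sum((x - y) ** 2 for ra, rb in zip(fa, fb) for x, y in zip(ra, rb))
```

The transfer itself would minimize `content_loss` against the source recording plus a weighted `style_loss` against the target speaker, with respect to the output spectrogram. That step needs an autodiff framework, which is why it is not shown.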

sprocket is old-school, manually designed mixture-model conversion. It looks like it needs only simple hardware but lots of steps?

Voice conversion (VC) is a technique to convert a speaker identity of a source speaker into that of a target speaker. This software enables the users to develop a traditional VC system based on a Gaussian mixture model (GMM) and a vocoder-free VC system based on a differential GMM (DIFFGMM) using a parallel dataset of the source and target speakers.
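The GMM mapping at the core of that description is short enough to sketch. This is a toy under heavy assumptions, not sprocket's code: features are scalars rather than spectral vectors, the joint GMM over (source, target) features is taken as already fitted on time-aligned parallel data, and the dictionary keys (`w`, `mx`, `my`, `vxx`, `vxy`) are names I made up. Conversion is the posterior-weighted conditional mean of the target given the source.

```python
import math

def gauss(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def convert(x, components):
    """E[y | x] under a joint GMM over (source x, target y).

    Each component is a dict with mixture weight `w`, means `mx` and `my`,
    source variance `vxx`, and source-target covariance `vxy`.
    """
    resp = [c["w"] * gauss(x, c["mx"], c["vxx"]) for c in components]
    total = sum(resp)
    return sum(r / total * (c["my"] + c["vxy"] / c["vxx"] * (x - c["mx"]))
               for r, c in zip(resp, components))

def convert_diff(x, components):
    """DIFFGMM idea: model the difference d = y - x, then add it to the source.

    For d = y - x, the mean is my - mx and cov(x, d) is vxy - vxx.
    """
    diff = [dict(c, my=c["my"] - c["mx"], vxy=c["vxy"] - c["vxx"])
            for c in components]
    return x + convert(x, diff)

gmm = [
    {"w": 0.5, "mx": 0.0, "my": 1.0, "vxx": 1.0, "vxy": 0.5},
    {"w": 0.5, "mx": 4.0, "my": 3.0, "vxx": 2.0, "vxy": 1.0},
]
direct = convert(1.5, gmm)
via_difference = convert_diff(1.5, gmm)
```

At the feature level the two routes give the same number. The practical appeal of the differential form is that the predicted difference can be applied to the source waveform as a filter, so no vocoder has to resynthesize the speech from scratch.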

Text to speech

char2wav shows how to condition an acoustic model.

Lyrebird had a flashy launch but then vanished (because, AFAICT, the product has been commercialised, selling voice fakery to podcasters and, for all I know, QAnon).

How Siri does speech is kinda interesting.

wavegan does do GANs for audio, but the sequences are quite short. Kastner might be right about long sequences being hard for GANs. But it does have an online demo.

nnmnkwii ([nanamin kawaii])

Library to build speech synthesis systems designed for easy and fast prototyping.


Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).


PyTorch implementation of Generative adversarial Networks (GAN) based text-to-speech (TTS) and voice conversion (VC).

Arik, Sercan O., Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. 2018. “Neural Voice Cloning with a Few Samples.” February 14, 2018. http://arxiv.org/abs/1802.06006.

Chaudhuri, Sourish, Joseph Roth, Daniel P. W. Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru, et al. 2018. “AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies.” August 1, 2018. http://arxiv.org/abs/1808.00606.

Donahue, Chris, Julian McAuley, and Miller Puckette. 2018. “Synthesizing Audio with Generative Adversarial Networks.” February 12, 2018. http://arxiv.org/abs/1802.04208.

Jin, Zeyu, Gautham J. Mysore, Stephen Diverdi, Jingwan Lu, and Adam Finkelstein. 2017. “VoCo: Text-Based Insertion and Replacement in Audio Narration.” ACM Transactions on Graphics 36 (4): 1–13. https://doi.org/10.1145/3072959.3073702.

Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. “Efficient Neural Audio Synthesis.” February 23, 2018. http://arxiv.org/abs/1802.08435.

Kaneko, Takuhiro, and Hirokazu Kameoka. 2017. “Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks.” November 30, 2017. http://arxiv.org/abs/1711.11293.

Kobayashi, Kazuhiro, and Tomoki Toda. n.d. “Sprocket: Open-Source Voice Conversion Software,” 8.

Kumar, Kundan, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. 2019. “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.” October 8, 2019. http://arxiv.org/abs/1910.06711.

Lee, Shindong, BongGu Ko, Keonnyeong Lee, In-Chul Yoo, and Dongsuk Yook. 2020. “Many-to-Many Voice Conversion Using Conditional Cycle-Consistent Adversarial Networks.” In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6279–83. https://doi.org/10.1109/ICASSP40776.2020.9053726.

Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017. http://arxiv.org/abs/1612.07837.

Oord, Aäron van den. 2016. “Wavenet: A Generative Model for Raw Audio.”

Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. 2018. “WaveGlow: A Flow-Based Generative Network for Speech Synthesis.” October 30, 2018. http://arxiv.org/abs/1811.00002.

Zhou, Cong, Michael Horgan, Vivek Kumar, Cristina Vasco, and Dan Darcy. 2018. “Voice Conversion with Conditional SampleRNN.” August 24, 2018. http://arxiv.org/abs/1808.08311.