Voice fakes

A placeholder notebook on generating speech without a speaker, or possibly transferring the style of existing speech.

Style transfer

You have a recording of me saying something self-incriminating. You would prefer it to be a recording of Hillary Clinton saying something incriminating. This is achievable.

There has been a tendency for the open-source options to be fairly mediocre, while the pay-to-play ones leave provocative demos about but do not let you use them.

As of December 2020, there are two impressive nearly-released ones (in the sense that they have released full models, although not the trained weights): AutoVC and SpeechSplit.

The “vocoder-free” approach of kaen2891/kaen2891.github.io: Research Results might be good, although their website is fried.

For an overview of recent history see Kyle Kastner’s suggestions:

VoCo seems to be a classic concatenative synthesis method for doing “voice cloning”, which generally will work on small datasets but won’t really generalize beyond the subset of sound tokens you already have. I did a blog post on a really simple version of this.

There’s another cool webdemo of how this works. Improving concatenative results to VoCo level is mostly a matter of better features, plus a lot of work on the DSP side to fix obvious structural errors, along with probably adding a language model to improve transitions and search.
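The core of such a simple concatenative approach can be sketched in a few lines: chop a corpus recording into short frames, then rebuild the target utterance by greedy nearest-neighbour unit selection. This is a toy illustration of my own, not VoCo’s method; it matches raw frames by Euclidean distance, where a real system would use perceptual features (e.g. MFCCs) plus a transition cost between consecutive units.

```python
import numpy as np

def frame(signal, frame_len):
    """Chop a 1-D signal into non-overlapping frames."""
    n = len(signal) // frame_len
    return signal[:n * frame_len].reshape(n, frame_len)

def concatenative_transfer(target, corpus, frame_len=160):
    """Greedy unit selection: replace each target frame with its
    nearest-neighbour frame from the corpus (Euclidean distance)."""
    t_frames = frame(target, frame_len)
    c_frames = frame(corpus, frame_len)
    out = np.empty_like(t_frames)
    for i, tf in enumerate(t_frames):
        dists = np.linalg.norm(c_frames - tf, axis=1)
        out[i] = c_frames[np.argmin(dists)]
    return out.ravel()
```

The “language model” mentioned above would enter as a second cost term penalising implausible transitions between chosen units, turning the greedy per-frame search into a Viterbi-style path search.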

You can see an example of concatenative audio for “music transfer” here.

I personally think Apple’s hybrid approach has a lot more potential than plain VoCo for style transfer — I like this paper a lot!

For learning about the prerequisites to Lyrebird, I recommend Alex Graves’ monograph, then watching Alex Graves’ lecture, which shows the extension to speech, and maybe checking out our paper char2wav. There’s a lot of background we couldn’t fit in 4 pages for the workshop, but reading Graves’ past work should cover most of that, along with WaveNet and SampleRNN. Lyrebird itself is proprietary, but going through these works should give you a lot of ideas about techniques to try.

I wouldn’t recommend GAN for audio, unless you are already quite familiar with GAN in general. It is very hard to get any generative model working on audio, let alone GAN.

Corentin Jemine’s Real-Time-Voice-Cloning will “clone a voice in 5 seconds to generate arbitrary speech in real-time.” It needs a multi-gigabyte GPU and tedious training, but quality is OK.

There are various networks in the CycleGAN/StarGAN families that do voice style transfer. The original authors do not release implementations, but there are community versions, e.g. 1, 2, 3. None of them sound great. If you were going to bother with StarGAN, why not instead do it no-frills, poor-man style, using RandomCNN? It does voice style transfer via random feature matching, and to my ears is no worse than the fancier ones.
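The idea behind random feature matching is simple enough to sketch: push spectrograms through *untrained* random convolutional filters, summarise the responses with Gram-matrix statistics, and call the distance between those statistics a “style loss”. The sketch below (my own forward-pass-only illustration, not the RandomCNN code) computes that loss with NumPy; the actual method then optimises a spectrogram by gradient descent to minimise it, which is omitted here.

```python
import numpy as np

def random_features(spec, filters):
    """Apply random conv filters (sliding along time) to a spectrogram
    (freq x time) and take a ReLU. No training involved."""
    n_filt, n_freq, width = filters.shape
    n_t = spec.shape[1] - width + 1
    feats = np.empty((n_filt, n_t))
    for f in range(n_filt):
        for t in range(n_t):
            feats[f, t] = np.sum(filters[f] * spec[:, t:t + width])
    return np.maximum(feats, 0.0)

def gram(feats):
    """Gram matrix: time-averaged correlations between filter channels."""
    return feats @ feats.T / feats.shape[1]

def style_loss(spec_a, spec_b, filters):
    """Distance between random-feature style statistics of two spectrograms."""
    g_a = gram(random_features(spec_a, filters))
    g_b = gram(random_features(spec_b, filters))
    return np.sum((g_a - g_b) ** 2)
```

Because the Gram matrix averages over time, the loss captures timbre-like statistics while discarding the temporal layout, which is roughly why this works as a style (rather than content) distance.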

sprocket is old-school, manually designed mixture-model conversion. It looks like it requires modest hardware but lots of steps?

Voice conversion (VC) is a technique to convert a speaker identity of a source speaker into that of a target speaker. This software enables the users to develop a traditional VC system based on a Gaussian mixture model (GMM) and a vocoder-free VC system based on a differential GMM (DIFFGMM) using a parallel dataset of the source and target speakers.
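The GMM approach described above fits a joint density over time-aligned (source, target) feature pairs, then converts by taking the conditional expectation E[y | x]. Here is a minimal sketch of that classic recipe using scikit-learn’s `GaussianMixture`; note this is my own toy illustration of the joint-GMM idea, not sprocket’s actual pipeline, which adds things like trajectory-based conversion and uses real spectral features rather than the synthetic ones below.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, y, n_components=2, seed=0):
    """Fit a GMM on stacked (source, target) feature vectors.
    x, y: (n_frames, d) time-aligned features from a parallel dataset."""
    z = np.hstack([x, y])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(z)

def convert(gmm, x):
    """MMSE conversion: E[y | x] under the joint GMM."""
    d = x.shape[1]
    # Responsibilities of each component given x alone, via the
    # marginal Gaussian over the source block of the joint density.
    resp = np.empty((x.shape[0], gmm.n_components))
    for m in range(gmm.n_components):
        mu_x = gmm.means_[m, :d]
        S_xx = gmm.covariances_[m, :d, :d]
        resp[:, m] = gmm.weights_[m] * multivariate_normal.pdf(x, mu_x, S_xx)
    resp /= resp.sum(axis=1, keepdims=True)
    out = np.zeros_like(x, dtype=float)
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m, :d], gmm.means_[m, d:]
        S_xx = gmm.covariances_[m, :d, :d]
        S_yx = gmm.covariances_[m, d:, :d]
        # Per-component conditional mean: mu_y + S_yx S_xx^-1 (x - mu_x)
        cond_mean = mu_y + (x - mu_x) @ np.linalg.solve(S_xx, S_yx.T)
        out += resp[:, m:m + 1] * cond_mean
    return out
```

The “vocoder-free” differential variant mentioned in the quote instead models the feature *difference* between speakers, so the conversion can be applied as a filter to the source waveform without full analysis/resynthesis.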

Text to speech

char2wav shows how to condition an acoustic model.

Lyrebird had a flashy launch, but then vanished from view (because, AFAICT, the product has been commercialised, selling voice fakery to podcasters and, for all I know, QAnon).

How Siri does speech is kinda interesting.

WaveGAN does do GANs for audio, at least over short sequences; Kastner might be right about long sequences being hard for GANs. But it does have an online demo.

nnmnkwii (nanamin kawaii)

A library for building speech synthesis systems, designed for easy and fast prototyping.


Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).


A PyTorch implementation of generative adversarial network (GAN)-based text-to-speech (TTS) and voice conversion (VC).


Arik, Sercan O., Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. 2018. “Neural Voice Cloning with a Few Samples.” February 14, 2018. http://arxiv.org/abs/1802.06006.
Chaudhuri, Sourish, Joseph Roth, Daniel P. W. Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru, et al. 2018. “AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies.” August 1, 2018. http://arxiv.org/abs/1808.00606.
Donahue, Chris, Julian McAuley, and Miller Puckette. 2018. “Synthesizing Audio with Generative Adversarial Networks.” February 12, 2018. http://arxiv.org/abs/1802.04208.
Jin, Zeyu, Gautham J. Mysore, Stephen Diverdi, Jingwan Lu, and Adam Finkelstein. 2017. “VoCo: Text-Based Insertion and Replacement in Audio Narration.” ACM Transactions on Graphics 36 (4): 1–13. https://doi.org/10.1145/3072959.3073702.
Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. “Efficient Neural Audio Synthesis.” February 23, 2018. http://arxiv.org/abs/1802.08435.
Kaneko, Takuhiro, and Hirokazu Kameoka. 2017. “Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks.” November 30, 2017. http://arxiv.org/abs/1711.11293.
Kobayashi, Kazuhiro, and Tomoki Toda. n.d. “Sprocket: Open-Source Voice Conversion Software.”
Kumar, Kundan, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. 2019. “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.” October 8, 2019. http://arxiv.org/abs/1910.06711.
Lee, Shindong, BongGu Ko, Keonnyeong Lee, In-Chul Yoo, and Dongsuk Yook. 2020. “Many-To-Many Voice Conversion Using Conditional Cycle-Consistent Adversarial Networks.” In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6279–83. https://doi.org/10.1109/ICASSP40776.2020.9053726.
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017. http://arxiv.org/abs/1612.07837.
Oord, Aäron van den. 2016. “WaveNet: A Generative Model for Raw Audio.”
Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. 2018. “WaveGlow: A Flow-Based Generative Network for Speech Synthesis.” October 30, 2018. http://arxiv.org/abs/1811.00002.
Qian, Kaizhi, Yang Zhang, Shiyu Chang, David Cox, and Mark Hasegawa-Johnson. 2020. “Unsupervised Speech Decomposition via Triple Information Bottleneck.” August 11, 2020. http://arxiv.org/abs/2004.11284.
Qian, Kaizhi, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson. 2019. “AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss.” In International Conference on Machine Learning, 5210–19. PMLR. http://proceedings.mlr.press/v97/qian19c.html.
Zhou, Cong, Michael Horgan, Vivek Kumar, Cristina Vasco, and Dan Darcy. 2018. “Voice Conversion with Conditional SampleRNN.” August 24, 2018. http://arxiv.org/abs/1808.08311.
