Voice fakes

A placeholder. Generating speech without a speaker.

Kyle Kastner’s suggestions

VoCo seems to be a classic concatenative synthesis method for doing “voice cloning”, which generally will work on small datasets but won’t really generalize beyond the subset of sound tokens you already have. I did a blog post on a really simple version of this (http://kastnerkyle.github.io/posts/bad-speech-synthesis-made-simple/).
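For concreteness, here is a minimal sketch of that grain-matching idea, in the spirit of the post above. This is my gloss, not Kastner’s code or VoCo: the file names, frame size, and log-spectrum features are all assumptions.

```python
import numpy as np
import librosa
import soundfile as sf
from scipy.spatial import cKDTree

SR, FRAME = 16000, 1024  # assumed sample rate and (non-overlapping) grain size

def grains(y, frame=FRAME):
    """Chop a waveform into fixed-size, non-overlapping grains."""
    n = len(y) // frame
    return y[: n * frame].reshape(n, frame)

def features(g):
    """Cheap per-grain spectral features: log-magnitude FFT."""
    return np.log1p(np.abs(np.fft.rfft(g, axis=1)))

corpus, _ = librosa.load("corpus_voice.wav", sr=SR)   # voice we have lots of
target, _ = librosa.load("target_speech.wav", sr=SR)  # phrase to re-voice

cg, tg = grains(corpus), grains(target)
tree = cKDTree(features(cg))       # index corpus grains by their spectra
_, idx = tree.query(features(tg))  # nearest corpus grain per target grain
sf.write("resynthesized.wav", cg[idx].ravel(), SR)
```

Each target grain is simply swapped for its nearest corpus grain, which is why the output only ever contains sounds the corpus already had.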

There’s another cool web demo of how this works (http://jungle.horse/#). Improving concatenative results to VoCo level is mostly a matter of better features, plus a lot of work on the DSP side to fix obvious structural errors, along with probably adding a language model to improve transitions and search.
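The “better search” part usually means unit selection: keep several candidate grains per target frame and pick a path through them that trades match quality against smooth joins. A sketch of that dynamic program, with plain Euclidean distances standing in for real target/concatenation costs (`k` and `join_weight` are made-up knobs):

```python
import numpy as np

def viterbi_select(target_feats, corpus_feats, k=10, join_weight=1.0):
    """Pick one corpus unit per target frame, trading match vs. smooth joins."""
    # K nearest candidates per target frame (brute force, for clarity)
    d = np.linalg.norm(target_feats[:, None, :] - corpus_feats[None, :, :], axis=-1)
    cand = np.argsort(d, axis=1)[:, :k]          # (T, K) candidate unit indices
    tcost = np.take_along_axis(d, cand, axis=1)  # (T, K) target costs

    T = len(target_feats)
    best = tcost[0].copy()
    back = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        # concatenation cost: spectral jump between consecutive chosen units
        jump = np.linalg.norm(
            corpus_feats[cand[t - 1]][:, None, :]
            - corpus_feats[cand[t]][None, :, :],
            axis=-1,
        )                                          # (K, K)
        total = best[:, None] + join_weight * jump
        back[t] = total.argmin(axis=0)
        best = total.min(axis=0) + tcost[t]

    # trace back the cheapest path of candidates
    path = [int(best.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return cand[np.arange(T), path]                # corpus index per frame
```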

You can see an example of concatenative audio for “music transfer” here (http://spectrum.mat.ucsb.edu/~b.sturm/sand/VLDCMCaR/VLDCMCaR.html).

I personally think Apple’s hybrid approach has a lot more potential than plain VoCo for style transfer (https://machinelearning.apple.com/2017/08/06/siri-voices.html) – I like this paper a lot!

For learning about the prerequisites to Lyrebird, I recommend Alex Graves’ monograph (https://arxiv.org/abs/1308.0850), then watching his lecture, which shows the extension to speech (https://www.youtube.com/watch?v=-yX1SYeDHbg&t=37m00s), and maybe checking out our paper char2wav (https://mila.quebec/en/publication/char2wav-end-to-end-speech-synthesis/). There’s a lot of background we couldn’t fit in 4 pages for the workshop, but reading Graves’ past work should cover most of that, along with WaveNet and SampleRNN (https://arxiv.org/abs/1612.07837). Lyrebird itself is proprietary, but going through these works should give you a lot of ideas about techniques to try.

I wouldn’t recommend GANs for audio unless you are already quite familiar with GANs in general. It is very hard to get any generative model working on audio, let alone a GAN.

char2wav shows how to condition an acoustic model.
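A toy sketch of that conditioning idea, not char2wav’s actual architecture: the dimensions, the GRU, and where the context vector comes from are all assumptions.

```python
import torch
import torch.nn as nn

class ConditionalAcousticModel(nn.Module):
    def __init__(self, n_mels=80, ctx_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels + ctx_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, prev_frames, context):
        # prev_frames: (B, T, n_mels); context: (B, ctx_dim), e.g. an
        # attention read over character encodings, or a speaker embedding
        ctx = context.unsqueeze(1).expand(-1, prev_frames.size(1), -1)
        h, _ = self.rnn(torch.cat([prev_frames, ctx], dim=-1))
        return self.proj(h)  # predicted next frames, trained with e.g. MSE

model = ConditionalAcousticModel()
frames = torch.randn(2, 100, 80)  # teacher-forced previous mel frames
ctx = torch.randn(2, 128)         # conditioning vector (assumed given)
pred = model(frames, ctx)         # (2, 100, 80)
```

In char2wav the context comes from an attention-based reader over the text; a speaker embedding can be concatenated in the same way, which is the hook that voice cloning exploits.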

Lyrebird (see Kastner’s notes above).

How Siri does it (see the Apple post linked above).

WaveGAN does do GANs for audio, but the sequences are quite short. Kastner might be right about long sequences with GANs. It does have an online demo, though.
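For flavour, a cut-down sketch of a WaveGAN-style generator: DCGAN-ish transposed convolutions flattened to 1-D. The stride-4, length-25 filters match the paper, but the layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

class TinyWaveGANGenerator(nn.Module):
    def __init__(self, z_dim=100, d=64):
        super().__init__()
        self.fc = nn.Linear(z_dim, 16 * d * 16)  # -> (16*d channels, 16 steps)

        def up(cin, cout):  # one stride-4 transposed-conv upsampling block
            return nn.Sequential(
                nn.ConvTranspose1d(cin, cout, kernel_size=25, stride=4,
                                   padding=11, output_padding=1),
                nn.ReLU(),
            )

        self.net = nn.Sequential(
            up(16 * d, 8 * d), up(8 * d, 4 * d), up(4 * d, 2 * d), up(2 * d, d),
            nn.ConvTranspose1d(d, 1, 25, 4, 11, output_padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        x = self.fc(z).view(z.size(0), -1, 16)
        return self.net(x)  # (B, 1, 16384) waveform in [-1, 1]

g = TinyWaveGANGenerator()
audio = g(torch.randn(4, 100))  # four ~1 s clips
```

Five stride-4 upsamplings from 16 latent time steps give 16 × 4^5 = 16384 samples, about one second at 16 kHz, which is why the clips are short.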

Corentin Jemine’s Real-Time-Voice-Cloning will clone a voice in 5 seconds to generate arbitrary speech in real time. It needs a multi-gigabyte GPU, though.

CycleGAN-VC (Kaneko and Kameoka 2017) is pure voice conversion.
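The core trick in a few lines: two generators map between the speakers’ feature spaces with no parallel data, held honest by a cycle-consistency loss. `G_ab`, `G_ba`, the feature batches, and `lam` are placeholders, not their code.

```python
import torch.nn.functional as F

def cycle_loss(G_ab, G_ba, feats_a, feats_b, lam=10.0):
    """Cycle-consistency term for unpaired A<->B voice conversion."""
    fake_b = G_ab(feats_a)  # speaker A features mapped towards B
    fake_a = G_ba(feats_b)  # speaker B features mapped towards A
    # round trips should reconstruct the input, even though no paired
    # (A, B) utterances exist in the training data
    cyc = F.l1_loss(G_ba(fake_b), feats_a) + F.l1_loss(G_ab(fake_a), feats_b)
    return lam * cyc        # added to the usual adversarial losses
```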

AVA-Speech is a handy data set of speech on YouTube, but it’s not clear where to download it from. The dataset it is based on, AVA, doesn’t have the speech part.

Apparently AudioSet also includes speech?

Arik, Sercan O., Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. 2018. “Neural Voice Cloning with a Few Samples,” February. http://arxiv.org/abs/1802.06006.

Chaudhuri, Sourish, Joseph Roth, Daniel P. W. Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru, et al. 2018. “AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies,” August. http://arxiv.org/abs/1808.00606.

Donahue, Chris, Julian McAuley, and Miller Puckette. 2018. “Synthesizing Audio with Generative Adversarial Networks,” February. http://arxiv.org/abs/1802.04208.

Jin, Zeyu, Gautham J. Mysore, Stephen Diverdi, Jingwan Lu, and Adam Finkelstein. 2017. “VoCo: Text-Based Insertion and Replacement in Audio Narration.” ACM Transactions on Graphics 36 (4): 1–13. https://doi.org/10.1145/3072959.3073702.

Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. “Efficient Neural Audio Synthesis,” February. http://arxiv.org/abs/1802.08435.

Kaneko, Takuhiro, and Hirokazu Kameoka. 2017. “Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks,” November. http://arxiv.org/abs/1711.11293.

Kumar, Kundan, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. 2019. “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” October. http://arxiv.org/abs/1910.06711.

Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017. http://arxiv.org/abs/1612.07837.

Oord, Aäron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. “WaveNet: A Generative Model for Raw Audio,” September. http://arxiv.org/abs/1609.03499.

Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. 2018. “WaveGlow: A Flow-Based Generative Network for Speech Synthesis,” October. http://arxiv.org/abs/1811.00002.

Zhou, Cong, Michael Horgan, Vivek Kumar, Cristina Vasco, and Dan Darcy. 2018. “Voice Conversion with Conditional SampleRNN,” August. http://arxiv.org/abs/1808.08311.