A placeholder. Generating speech without a speaker, or possibly style-transferring speech.
You have a recording of me saying something self-incriminating. You would prefer it to be a recording of Hillary Clinton saying something incriminating. This is achievable.
There has been a tendency for the open-source options to be fairly mediocre, while the pay-to-play ones leave provocative demos about but do not let you use them.
As of December 2020, there are two impressive nearly-released ones, AutoVC and SpeechSplit [@QianAutoVC2019;@QianUnsupervised2020] (nearly released in the sense that they have published the full models, although not the trained weights).
The “vocoder free” approach of kaen2891 (kaen2891.github.io) might be good, although their website is fried.
For an overview of recent history see Kyle Kastner’s suggestions:
VoCo seems to be a classic concatenative synthesis method for “voice cloning”, which generally works on small datasets but won’t really generalize beyond the subset of sound tokens you already have. I did a blog post on a really simple version of this.
There’s another cool web demo of how this works. Improving concatenative results to VoCo’s level is mostly a matter of better features, a lot of work on the DSP side to fix obvious structural errors, and probably a language model to improve transitions and search.
You can see an example of concatenative audio for “music transfer” here.
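The nearest-neighbour idea behind such concatenative methods fits in a few lines. Here is a toy sketch (my own illustration, not VoCo’s actual algorithm): chop a corpus recording and a target recording into frames, then rebuild the target out of the spectrally closest corpus frames.

```python
import numpy as np

def concatenative_transfer(corpus, target, frame_len=256):
    """Rebuild `target` from frames of `corpus` by nearest-neighbour matching
    on crude spectral features — a toy version of concatenative synthesis."""
    def frames(x):
        n = len(x) // frame_len
        return x[: n * frame_len].reshape(n, frame_len)

    corpus_f, target_f = frames(corpus), frames(target)

    # Features: magnitude spectrum of each frame (phase is ignored).
    feat = lambda f: np.abs(np.fft.rfft(f, axis=1))
    cf, tf = feat(corpus_f), feat(target_f)

    # For each target frame, pick the corpus frame with the closest spectrum.
    out = [corpus_f[np.argmin(np.linalg.norm(cf - t, axis=1))] for t in tf]
    return np.concatenate(out)

# Usage: reconstruct a 330 Hz "phrase" from a corpus of three pure tones.
sr = 8000
t = np.arange(sr) / sr
corpus = np.concatenate([np.sin(2 * np.pi * f * t) for f in (220, 330, 440)])
target = np.sin(2 * np.pi * 330 * t)
resynth = concatenative_transfer(corpus, target)
```

The matcher correctly grabs the 330 Hz corpus frames, but because phase is ignored the output has audible clicks at frame boundaries; fixing exactly that kind of structural error is the DSP work mentioned above.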
I personally think Apple’s hybrid approach has a lot more potential than plain VoCo for style transfer — I like this paper a lot!
For learning about the prerequisites to Lyrebird, I recommend Alex Graves’ monograph, then watching his lecture, which shows the extension to speech, and maybe checking out our paper char2wav. There’s a lot of background we couldn’t fit into 4 pages for the workshop, but reading Graves’ past work should cover most of it, along with WaveNet and SampleRNN. Lyrebird itself is proprietary, but going through these works should give you a lot of ideas about techniques to try.
I wouldn’t recommend GANs for audio unless you are already quite familiar with GANs in general. It is very hard to get any generative model working on audio, let alone a GAN.
By Corentin Jemine, Real-Time-Voice-Cloning will “clone a voice in 5 seconds to generate arbitrary speech in real-time.” It needs a multi-gigabyte GPU and tedious training, but the quality is OK.
There are various networks in the CycleGAN/StarGAN families that do voice style transfer. The original authors do not release implementations, but there are community versions, e.g. 1, 2, 3. None of them sound great. If you were going to bother with StarGAN, why not instead do it no-frills, poor-man’s style, using RandomCNN? It does voice style transfer via random feature matching, and is no worse than the fancier ones to my ears.
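For intuition, the core of that random-feature-matching trick is small: push spectrograms through one fixed, *untrained* convolution layer and compare Gram matrices of the activations. A minimal numpy sketch of just the loss (all names are my own; the real method then optimizes the content spectrogram against this loss and inverts it back to audio):

```python
import numpy as np

def make_random_conv(n_freq, n_filters=16, k=3, seed=0):
    """Return a feature extractor built from one fixed, untrained 1-D conv
    layer (random weights + ReLU) sliding over the time axis of a
    (n_freq, n_time) spectrogram."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_filters, n_freq, k)) / np.sqrt(n_freq * k)

    def features(spec):
        n_time = spec.shape[1]
        out = np.empty((n_filters, n_time - k + 1))
        for i in range(n_time - k + 1):
            out[:, i] = np.maximum(
                0.0, np.tensordot(W, spec[:, i : i + k], axes=([1, 2], [0, 1]))
            )
        return out

    return features

def gram(f):
    """Time-averaged second-order feature statistics ("texture")."""
    return f @ f.T / f.shape[1]

def style_loss(feats, content_spec, style_spec):
    """Gram-matrix distance of random features: small when the two
    spectrograms share timbre statistics, regardless of time alignment."""
    return np.linalg.norm(gram(feats(content_spec)) - gram(feats(style_spec)))

# Two toy "spectrograms" with different energy statistics.
rng = np.random.default_rng(1)
spec_a = np.abs(rng.standard_normal((32, 40)))
spec_b = 3.0 * np.abs(rng.standard_normal((32, 40)))
feats = make_random_conv(n_freq=32)
```

The surprising part is that the conv weights are never trained; random filters already give features whose second-order statistics separate voices reasonably well.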
Voice conversion (VC) is a technique for converting the speaker identity of a source speaker into that of a target speaker. This software enables users to develop a traditional VC system based on a Gaussian mixture model (GMM), and a vocoder-free VC system based on a differential GMM (DIFFGMM), using a parallel dataset of the source and target speakers.
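The GMM conversion rule itself is compact: fit a joint GMM over stacked [source; target] features, then map each source frame to its expected target frame, i.e. a responsibility-weighted sum of per-component linear regressions. A numpy sketch of the standard minimum-mean-square-error mapping (my own illustration of the textbook formula, not this package’s code):

```python
import numpy as np

def gmm_convert(x, weights, means, covs):
    """MMSE GMM voice conversion: given a joint GMM over stacked
    [source; target] features, map a source frame x to E[target | x].

    weights: (M,)               mixture weights
    means:   (M, dx + dy)       joint means [mu_x; mu_y]
    covs:    (M, dx+dy, dx+dy)  joint covariances
    """
    dx = x.shape[0]
    M = weights.shape[0]
    logp = np.empty(M)                        # log w_m * N(x | mu_x, Sxx)
    cond = np.empty((M, means.shape[1] - dx)) # per-component E[y | x, m]
    for m in range(M):
        mu_x, mu_y = means[m, :dx], means[m, dx:]
        Sxx, Syx = covs[m, :dx, :dx], covs[m, dx:, :dx]
        diff = x - mu_x
        Sxx_inv = np.linalg.inv(Sxx)
        logp[m] = (np.log(weights[m])
                   - 0.5 * (diff @ Sxx_inv @ diff)
                   - 0.5 * np.log(np.linalg.det(2 * np.pi * Sxx)))
        cond[m] = mu_y + Syx @ Sxx_inv @ diff  # linear regression of y on x
    p = np.exp(logp - logp.max())
    p /= p.sum()                               # responsibilities P(m | x)
    return p @ cond

# One component whose joint covariance [[1, 2], [2, 4]] encodes y = 2x.
y_hat = gmm_convert(np.array([1.5]), np.array([1.0]),
                    np.array([[0.0, 0.0]]),
                    np.array([[[1.0, 2.0], [2.0, 4.0]]]))
```

The DIFFGMM variant mentioned above applies the same machinery to the *difference* between source and target features, which is what lets it skip the vocoder.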
Text to speech
char2wav shows how to condition an acoustic model.
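In its simplest form, “conditioning” just means the decoder sees an utterance-level embedding at every step. A toy unrolled decoder with random weights (purely to illustrate the wiring; not char2wav’s architecture, which adds attention over the text on top of this):

```python
import numpy as np

def run_conditioned_decoder(cond, n_steps, frame_dim=4, hidden=8, seed=0):
    """Unroll a toy acoustic decoder whose every step sees a fixed
    conditioning vector (e.g. a text/speaker embedding) concatenated with
    the previously generated frame. Weights are random: this shows the
    wiring only, not a trained model."""
    rng = np.random.default_rng(seed)
    d_in = frame_dim + cond.shape[0]
    Wh = rng.standard_normal((hidden, hidden)) * 0.1
    Wx = rng.standard_normal((hidden, d_in)) * 0.1
    Wo = rng.standard_normal((frame_dim, hidden)) * 0.1
    h = np.zeros(hidden)
    frame = np.zeros(frame_dim)
    out = []
    for _ in range(n_steps):
        # The conditioning vector enters the recurrence at every step.
        h = np.tanh(Wh @ h + Wx @ np.concatenate([frame, cond]))
        frame = Wo @ h
        out.append(frame)
    return np.stack(out)

# Different conditioning vectors yield different acoustic trajectories.
a = run_conditioned_decoder(np.array([1.0, 0.0]), n_steps=5)
b = run_conditioned_decoder(np.array([0.0, 1.0]), n_steps=5)
```

Everything downstream (attention, speaker adaptation, Lyrebird-style cloning) is a refinement of this basic move: feed a learned embedding into the generator at every step.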
How Siri does speech is kinda interesting.
nnmnkwii (nanamin kawaii)
Library to build speech synthesis systems designed for easy and fast prototyping.
Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor (e.g., Festival) and a vocoder (e.g., STRAIGHT or WORLD).
PyTorch implementation of Generative Adversarial Network (GAN) based text-to-speech (TTS) and voice conversion (VC).