Synthesis and Cloning Human Voices (Eliya Nachmani, DSI Learning Club)

March 22, 2018

Thu, Mar 22nd, 10:00, Eliya Nachmani.

Tel-Aviv University / Facebook AI Research (PhD Student).

Location: Gonda Building (901), Room 101.

Synthesis and Cloning Human Voices


Text-to-speech (TTS) systems convert written text into spoken audio. In this talk we present a new neural TTS method for voices that are sampled in the wild. We introduce a new network architecture, VoiceLoop, which is simpler than those in the existing literature and is based on a novel shifting-buffer working memory. Our solution handles unconstrained voice samples and does not require aligned phonemes or linguistic features. We also show how to control the emotional variability of the generated speech by priming the network buffer.
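To make the idea of a shifting buffer concrete, here is a minimal sketch in plain Python. It is not the VoiceLoop implementation: the class name `ShiftingBuffer`, its methods, and the choice to model priming as overwriting the buffer contents before generation are all assumptions made for illustration. It shows only the basic mechanism of a fixed-size memory in which each new vector enters at the front and the oldest one is discarded.

```python
from typing import List

Vector = List[float]


class ShiftingBuffer:
    """Hypothetical fixed-size working memory (illustration only).

    Each update inserts a new vector in the first slot and shifts the
    others back by one, dropping the oldest vector.
    """

    def __init__(self, size: int, dim: int) -> None:
        self.size = size
        self.dim = dim
        # Start with an all-zero buffer of `size` slots, each of length `dim`.
        self.slots: List[Vector] = [[0.0] * dim for _ in range(size)]

    def update(self, new: Vector) -> None:
        """Shift the buffer by one slot and place `new` at the front."""
        if len(new) != self.dim:
            raise ValueError("vector has the wrong dimension")
        self.slots = [list(new)] + self.slots[:-1]

    def prime(self, slots: List[Vector]) -> None:
        """Overwrite the buffer contents (an assumed model of priming)."""
        if len(slots) != self.size or any(len(s) != self.dim for s in slots):
            raise ValueError("priming content has the wrong shape")
        self.slots = [list(s) for s in slots]


buf = ShiftingBuffer(size=3, dim=2)
for vec in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]):
    buf.update(vec)
# The buffer holds the three most recent vectors, newest first;
# the first vector, [1.0, 2.0], has been shifted out.
print(buf.slots)
```

In a real model the vector written at each step would be computed by a network from the buffer, the attention context, and the speaker representation; here it is supplied by hand to keep the sketch self-contained.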

TTS systems also have the potential to generalize from one speaker to another, given a relatively short sample of a new voice. We present a method that is designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that, given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate greatly improved performance both on the dataset speakers and, more importantly, when fitting new voices, even from very short samples.



