Realistic Voice Cloning with Artificial Intelligence

Submitted by OodlesAI on Mon, 05/25/2020 - 00:03

In a bid to sound more human, artificial intelligence (AI) is set to break new ground, quite literally. A new technology called ‘voice cloning’ is replacing the robotic tonality of virtual assistants with natural human voices. Voice cloning with artificial intelligence can reproduce unique human voices, making chatbots, video clips, and other interactions more intuitive and engaging.

In this article, we take a closer look at how deep learning and AI development services power voice cloning to build effective business solutions.

The Science Behind Voice Cloning with Artificial Intelligence
AI’s underlying technologies, machine learning and deep learning, have consistently demonstrated significant potential for text-to-speech (TTS) interaction, also called speech synthesis. Coupled with speech recognition, the technology forms the backbone of virtual assistants such as Siri and Alexa. However, providers of chatbot development services still struggle to eliminate the robotic tonality associated with voice-controlled assistants.

With voice cloning, deep neural networks move a step closer to delivering high-quality, personalized, and highly intuitive human-chatbot interactions.

A recent research paper, Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis by Jia, Zhang, and others, introduces an arguably easier and more efficient approach to voice cloning. The paper proposes a technique, SV2TTS (speaker verification to text-to-speech), that generates speech audio closely resembling a target voice using only a few seconds of sample audio. Unlike highly expensive traditional training methods that require several hours of professionally recorded speech, SV2TTS can (a minimal pipeline sketch follows this list):

a) Clone voices without excessive training or retraining

b) Produce high-quality audio results, and

c) Synthesize natural speech from speakers unseen during training.
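To make this concrete, below is a minimal Python sketch of the overall flow the paper describes. The callables speaker_encoder, synthesizer, and vocoder are hypothetical stand-ins for trained models, not the API of any real library; the final vocoder stage, which turns spectrograms back into audio, comes from the paper and completes the pipeline.

```python
import numpy as np

def clone_voice(reference_wav: np.ndarray, text: str,
                speaker_encoder, synthesizer, vocoder) -> np.ndarray:
    """Generate speech in the reference speaker's voice (SV2TTS flow)."""
    # Stage 1: derive a fixed-size embedding of the speaker's voice
    # from just a few seconds of reference audio.
    embedding = speaker_encoder(reference_wav)   # e.g. a 256-d vector

    # Stage 2: condition the synthesizer on the input text and the
    # embedding to produce mel spectrogram frames.
    mel = synthesizer(text, embedding)           # shape (n_mels, n_frames)

    # Stage 3 (per the paper): a neural vocoder converts the mel
    # spectrogram back into a time-domain waveform.
    return vocoder(mel)
```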

1) Speaker Encoder Network
In the first stage, the speaker encoder takes an audio sample from a single speaker as input and derives an embedding. Representing the speaker’s voice, the embedding captures unique characteristics such as pitch, tone, and accent with high fidelity from only a short audio clip.
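As a rough illustration of this stage, here is a minimal PyTorch sketch: a recurrent network reads log-mel frames from the sample audio and emits a fixed-size, L2-normalized embedding. The layer sizes are illustrative assumptions, not the paper’s exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a sequence of log-mel frames from one speaker to a
    fixed-size embedding that represents that speaker's voice."""

    def __init__(self, n_mels: int = 40, hidden: int = 256, emb_dim: int = 256):
        super().__init__()
        # A stack of LSTMs reads the mel frames of the sample audio.
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels); a few seconds of audio suffice.
        _, (h, _) = self.lstm(mels)
        # Project the final hidden state and L2-normalize it so that
        # embeddings of the same speaker cluster tightly together.
        return F.normalize(self.proj(h[-1]), p=2, dim=1)

# Example: a 1.6 s clip at 100 frames/s -> a single 256-d embedding.
emb = SpeakerEncoder()(torch.randn(1, 160, 40))
print(emb.shape)  # torch.Size([1, 256])
```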

2) Synthesizer
The synthesizer constitutes the second stage of the SV2TTS model. It analyzes the input text to create mel spectrograms, wherein sound frequencies are mapped onto the mel scale. The synthesizer combines the smallest units of human speech, called phonemes, with the speaker embedding to generate mel spectrogram frames.
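The mel-scale conversion itself is straightforward to reproduce. The following sketch uses the librosa library to turn a waveform into the kind of log-mel spectrogram frames a synthesizer like this predicts; the file name sample.wav and the parameter values are placeholders.

```python
import librosa
import numpy as np

# Load a speech clip and compute its mel spectrogram: a short-time
# Fourier transform whose frequency axis is warped onto the mel scale,
# which spaces frequencies the way human hearing perceives them.
wav, sr = librosa.load("sample.wav", sr=16000)  # placeholder audio path
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(mel + 1e-6)  # log compression, standard for TTS targets
print(log_mel.shape)          # (80, n_frames): 80 mel channels per frame
```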
