Microsoft’s new AI needs just 3 seconds of audio to clone a voice

VALL-E can even mimic a speaker’s emotions and acoustic environment.

Microsoft’s new voice-cloning AI can simulate a speaker’s voice with remarkable accuracy — and all it needs to get started is a three-second sample of them talking.

Voice cloning 101: Voice cloning isn’t new. Google the term, and you’ll get a long list of links to websites and apps offering to train an AI to produce audio that sounds just like you. You can then use the clone to hear yourself “read” any text you like.

For a writer, this can be useful for creating an author-narrated audio version of their book without spending days in a recording studio. A voice actor, meanwhile, might clone their voice so that they can rent out the AI for projects they don’t have time to tackle themselves.

Depending on the service, the voice cloning process might start with you reciting 50 predetermined sentences or uploading a clip of you saying anything at all. Some services will ask for hours of audio to train their AI, while others will boast about needing just 5 seconds.

Often, you get out of these voice cloning services what you put into them — a shorter sample typically leads to a clone that sounds like a robot trying to impersonate a person, while longer clips can result in AI-generated audio that sounds just like the original speaker.

Short and sweet: Microsoft’s new voice-cloning AI, VALL-E, bucks this trend, generating audio that sounds remarkably like the original speaker from a voice sample just three seconds long. 

You can’t clone your own voice with VALL-E, but Microsoft has shared a research paper on arXiv and created a GitHub page where you can compare snippets of human voices to speech generated by VALL-E and a “baseline” voice-cloning AI (YourTTS).

On this page, Microsoft also demonstrates how the AI can mimic a speaker’s emotion and the acoustic environment of a sample — if the speaker sounds angry, VALL-E can generate angry-sounding audio, and if the original clip sounds like it was recorded over the phone, the AI can generate audio that matches those acoustics.

How it works: An AI is typically only as good as its training data, and Microsoft opted to use Meta’s LibriLight — an audio library containing 60,000 hours of speech from more than 7,000 English speakers — to train VALL-E.

This means the AI’s training set was “hundreds of times larger” than those used to train existing voice cloning systems, according to the research paper.

When VALL-E is presented with a new voice to clone, it breaks the three-second audio clip into bits Microsoft calls “acoustic tokens.” Using those tokens and its training data, it can then predict what the voice would sound like saying other phrases.
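
To make the idea of turning a clip into tokens more concrete, here is a toy sketch in Python. It is not VALL-E's tokenizer, which the research paper describes as a learned neural audio codec. This version simply slices a three-second clip into short frames and maps each frame's loudness to the nearest entry in a tiny lookup table. The sample rate, frame size, codebook values, and function names are all assumptions made up for illustration.

```python
import math

# All values below are assumptions for this toy example, not VALL-E's settings.
SAMPLE_RATE = 16_000                     # samples per second
FRAME_SIZE = 320                         # 20 ms of audio per frame
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical loudness levels


def frame_rms(frame):
    """Return the root-mean-square loudness of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def tokenize(samples, frame_size=FRAME_SIZE, codebook=CODEBOOK):
    """Map each full frame to the index of its nearest codebook entry."""
    tokens = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        level = frame_rms(samples[start:start + frame_size])
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - level))
        tokens.append(nearest)
    return tokens


def make_tone(seconds, freq=220.0, amplitude=0.7, sample_rate=SAMPLE_RATE):
    """Generate a plain sine tone as a stand-in for a recorded voice clip."""
    count = int(seconds * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq * n / sample_rate)
            for n in range(count)]


clip = make_tone(3.0)      # a three-second "recording"
tokens = tokenize(clip)    # one small integer per 20 ms frame
print(len(tokens), "tokens; first five:", tokens[:5])
```

The point of the sketch is only that a short stretch of continuous audio can be reduced to a sequence of small integers, which is the kind of input a language-model-style system can work with. A real codec learns its codebook from data and captures far more than loudness.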

The big picture: If you go back to that list of “voice cloning” search results, you’ll likely find links to articles detailing how the AIs are being used for nefarious purposes.

There’s the cybercriminal who cloned a boss’s voice to trick an employee into transferring company cash into the criminal’s bank account, and there are warnings to seniors that bad actors can now clone the voices of their grandchildren to extort money.

The Microsoft team addresses the potential for people to misuse VALL-E in their research paper, noting that such risks could be mitigated by the creation of a “detection model” capable of determining if a clip was generated by the AI. 

Even if bad actors find ways around such tools, though, other people will use the tech for good: creating synthetic voices for ALS patients, helping people connect with deceased loved ones, or doing something so remarkable we can’t yet imagine it.
