New voice cloning AI lets “you” speak multiple languages

And it only needs to hear 4 seconds of your voice to do it.

March 18, 2023

This article is an installment of Future Explored, a weekly guide to world-changing technology.

In January, Microsoft unveiled an AI that can clone a speaker’s voice after hearing them talk for just three seconds. While this system, VALL-E, was far from the first voice cloning AI, its accuracy and need for such a small audio sample set a new bar for the tech.

Microsoft has now raised that bar again with an update called “VALL-E X,” which can clone a voice from a short sample (4 to 10 seconds) and then use it to synthesize speech in a different language, all while preserving the original speaker’s voice, emotion, and tone.


Microsoft hasn’t released VALL-E X to the public yet, but it has published a demo page featuring translations between English and Chinese, along with a preprint paper in which it reveals plans to expand the AI to include other languages.

If Microsoft decides to make the tool available, or if similar tools are rolled out by the myriad other AI firms out there, we could soon be living in a world where anyone can generate audio that sounds like anyone saying anything in any language — and that could have massive consequences.

Good talk: Dozens of voice cloning AIs are already available online, and like VALL-E, they’re trained on large datasets of speech. Given a sample of a new voice, they can then use their training to predict what it would sound like reading a text prompt and generate the audio.

Some can even do what VALL-E X does and produce audio in languages other than the one originally spoken.

These services often require longer samples than Microsoft’s AI — a person might need to recite a few dozen sentences or even provide hours of audio — and the quality of the output can vary, but having a voice clone can be hugely useful, especially for content creators.

An author, for example, might use their voice clone to generate an audiobook, saving them from having to spend days in the recording studio or hire a professional. They could even feed it written translations of their book to generate author-read audiobooks in multiple other languages.

Speech, again: Aside from helping writers, filmmakers, podcasters, and other creators reach new audiences — and new revenue streams — voice clones can also help people who’ve lost their own voices to disease or injury still sound like themselves.

University of Edinburgh spin-out SpeakUnique, for example, creates voice clones for people with ALS and other forms of motor neurone disease. If samples from before the disease started to affect the person’s speech aren’t available, SpeakUnique can even repair minor impairments in the training recordings.

While SpeakUnique requires users to recite between 150 and 300 sentences to create a voice clone, advances like VALL-E could eventually allow them to do so with just a sentence, which could make the tech more accessible to people for whom speaking doesn’t come easily.

Once they have their voice clone, they can pair it with text-to-speech apps or eye-tracking software to communicate in their own voice. As mind-reading technology gets better, users might eventually be able to use their clones after they’ve lost the ability to move even their eyes.

Actor Val Kilmer famously took advantage of voice cloning. After a battle with throat cancer left him unable to speak clearly, AI company Sonantic used 30 minutes of audio from his past films to create a voice clone for him.

Kilmer can now use it to dub his acting performances, which he recently did in “Top Gun: Maverick.”

“[Val] and his team knew that building a custom voice model would help him explore new ways to communicate, connect, and create in the future,” John Flynn, Sonantic’s co-founder and CTO, wrote in a 2021 blog post.

Deepfake audio: While voice cloning is giving Kilmer more work opportunities, it can have the opposite effect on other performers.

Motherboard recently reported that studios are pressuring actors with less cachet than Kilmer into agreeing to have their voices cloned. In theory, those actors could get paid for one session in the recording studio and then see their clone replace them for future work.

Tim Friedlander, president and founder of the National Association of Voice Actors, told Motherboard that some studios even use confusing language in contracts so that they can get away with cloning actors’ voices without the actors knowing.

“Many voice actors may have signed a contract without realizing language like this had been added,” he said.

Other actors are told they can either agree to the clause or get passed over for work, according to Friedlander, but some performers are never given the opportunity to decide whether they’re OK having their voices cloned or for what purpose.

https://youtu.be/17_xLsqny9E

In January, internet users took advantage of startup ElevenLabs’ free voice cloning app, which needs just one minute of audio to create a clone, to generate clips of Emma Watson, Joe Rogan, and other celebs “saying” hateful things they never really said.

Pair a voice clone with deepfake visuals, and you have content that looks and sounds real but is anything but, making it easier for bad actors to not only tarnish a celeb’s reputation, but also create convincing propaganda and spread misinformation.

ElevenLabs now requires users to pay for the service, but before it added that safeguard, a Motherboard journalist demonstrated how he was able to create a free voice clone of himself with just five minutes of audio and then use it to get past his bank’s voice-recognition system.

If systems like VALL-E and VALL-E X become widely available, something as short as your voicemail message might be enough for criminals to breach your bank accounts, hack your tech, or scam your loved ones.

The bottom line: Microsoft seems keenly aware that people might misuse its voice cloning AIs — the demo pages for both VALL-E and VALL-E X end with ethics statements highlighting the potential for spoofing.

The VALL-E preprint also mentions the possibility of creating a system to detect the AI’s voice clones to mitigate risks. While that hasn’t come to fruition yet, we are already seeing other researchers devise novel ways to distinguish AI-generated voices from human ones.

For such detection systems to be useful, though, we’ll need to figure out how to deploy them, and it’s not yet clear how that will work.

For now, combining voice-based passwords with other, less easily spoofed authentication methods can help us avoid getting hacked by voice clones. We can also encourage our less skeptical loved ones to hang up and call us back if they think we’re in trouble, and remind them (and ourselves) that we shouldn’t believe everything we see and hear online.
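To make the "voice plus something else" idea concrete, here is a minimal Python sketch of a login check that refuses to trust a voice match on its own. It is an illustration under stated assumptions, not a description of how any bank mentioned above works: the `allow_login` function, the `voice_score` input, and the 0.9 threshold are all hypothetical, and the second factor shown is a standard time-based one-time code (RFC 6238) of the kind generated by authenticator apps.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, at: float, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password per RFC 6238 (HMAC-SHA1 variant)."""
    counter = int(at // step)  # number of 30-second steps since the Unix epoch
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, as defined in RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def allow_login(voice_score: float, submitted_code: str, secret: bytes,
                now: float, voice_threshold: float = 0.9) -> bool:
    """Hypothetical policy: a voice match alone is never sufficient.

    voice_score is assumed to come from some speaker-verification system
    (0.0 = no match, 1.0 = perfect match). The threshold is illustrative.
    """
    voice_ok = voice_score >= voice_threshold
    expected = totp(secret, now)
    # Constant-time comparison avoids leaking how many digits matched.
    code_ok = hmac.compare_digest(submitted_code.encode(), expected.encode())
    return voice_ok and code_ok
```

With a check like this, a cloned voice that scores 0.99 would still be rejected unless the caller also supplied the current code from a device the real account holder controls.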
