April 24, 2024

Best AI voice cloning tools: Which tools pass the mom test?

AI voice cloning tools claim to create a carbon copy of your voice. But would your mom think so? One writer decided to find out.
Briana Brownell

Now that AI can do pretty much anything, I try to use it to do all the things I don't really like. And one of the things I don't really like is recording myself. I find it time-consuming and, as an introvert, kind of exhausting.

So when Descript asked me to test out some voice cloning tools, I was thrilled.

For the experiment, I tested four different tools on how easy they were to set up and how well their text-to-speech functions imitated my voice. As my test text, I used an excerpt from a recent keynote I gave on generative AI. Here’s what I actually sound like reading it:

To train the tools, you typically need to upload recorded audio or to read an example script. I tried two different clips:

  • Voice 1: An iPhone recording of my side of a conversation I had with a good friend. In it, I'm talking pretty informally.
  • Voice 2: An audio recording for a course that I taught where I am deliberately speaking more slowly on a technical topic.

Would the tools pass the ultimate test: The mom test? Would my mom think it sounded like me?

Here are my results.

Descript

Descript's AI voice feature is one part of a much larger suite of tools. To train it, you need to record a specific statement live, or you can record it on another device and upload it. I tried the second option and had some trouble with my recording being an unsupported file type, but I got there eventually.

I tried the voice generator on its own, where it used the statement I read to generate the voice, and I also generated a voice based on a project. For the second option, the workflow is quite different from the other tools. You can't just click "create voice" and upload all your files; you have to create the voice within the project itself. Once you get the hang of it, it's easy to generate the AI voices, but I found myself on Descript's help site trying to figure it out.

Here’s what they sounded like:

The voices generated with the two options were very similar to each other—so similar I had to load them into different tracks in Audition to see if they were exactly the same (they aren't). So I'm not sure how much adding additional audio factored into the voice.
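If you'd rather not eyeball two tracks in an audio editor, a byte-level comparison is one quick way to check whether two generated clips are literally identical. This is a minimal sketch, assuming both clips have been exported as uncompressed PCM WAV files; the function name `same_audio` is made up for illustration, and it only answers "exactly the same or not," not "how similar."

```python
import wave


def same_audio(path_a, path_b):
    """Return True if two PCM WAV files hold exactly the same audio data."""
    with wave.open(path_a, "rb") as a, wave.open(path_b, "rb") as b:
        # Compare channel count, sample width, and sample rate first.
        if a.getparams()[:3] != b.getparams()[:3]:
            return False
        # Then compare the raw audio frames byte for byte.
        return a.readframes(a.getnframes()) == b.readframes(b.getnframes())
```

Two clips that sound alike will almost always come back `False` here, since any tiny difference in the generated samples changes the bytes.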

But once I got it running, Descript’s AI voice generator was straightforward to use. It was the most robotic-sounding of all the voices and didn’t have any direct controls to change the pacing or expressiveness. For that, Descript recommends creating multiple voices with different delivery styles—I could have made a second voice where I read the statement faster, or with more expressiveness. But for this test, I stuck with the garden-variety voice. 

Despite having less control over the output, Descript did well on the mom test. My mom said the voices sounded good, albeit less expressive and less like me than other voices.

Mom test: ✅ Passed

Pros:

  • Fast process
  • Ability to edit the recording without switching programs
  • Part of a larger suite of audio and video editing tools
  • Passed the mom test

Cons:

  • Creating my first voice involved a learning curve
  • No direct style controls; must record multiple voices with different delivery styles

Pricing: All Descript tiers include use of the entire suite of editing tools. 

  • Free: 1 hour of transcription per month and AI voices with a 1,000-word vocabulary
  • $12/month: 10 hours of transcription per month and AI voices with a 1,000-word vocabulary
  • $24/month: 30 hours of transcription per month and AI voices with an unlimited vocabulary

ElevenLabs

To create my voice, ElevenLabs let me upload up to 25 samples but each of them had to be below 10MB. I was using uncompressed audio, so I had to chop up my samples in Audition. It was a really time-consuming process, but I could have saved some of the cutting by using a compressed format like MP3.
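The chopping step can also be scripted. Here's a minimal sketch using Python's standard `wave` module, assuming the samples are uncompressed PCM WAV files and reading "below 10MB" as 10 × 1024 × 1024 bytes (that interpretation is an assumption, not something the tool documents here). The function name `split_wav` and the 1 KB header margin are illustrative choices.

```python
import wave

MAX_BYTES = 10 * 1024 * 1024  # assumed reading of the "below 10MB" limit


def split_wav(path, out_prefix, max_bytes=MAX_BYTES, header_margin=1024):
    """Split a PCM WAV file into consecutive chunks that each stay under max_bytes."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frame_size = params.nchannels * params.sampwidth  # bytes per frame
        # Leave a little room for the WAV header in each output file.
        frames_per_chunk = (max_bytes - header_margin) // frame_size
        if frames_per_chunk <= 0:
            raise ValueError("max_bytes is too small to hold any audio")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:02d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, width, and rate
                dst.writeframes(frames)  # header frame count is fixed up on close
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Note that this cuts purely by size, so a chunk boundary can land mid-word. Cutting by hand in an editor, or converting to a compressed format first, avoids that.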

The tool allows you to change a number of settings to affect the way the voice sounds, including the stability, clarity + similarity, and style exaggeration. You then input your text and voila! It generates the audio sample, which you can listen to and download. 

I generated several with both audio samples using different settings, and the samples gave me a bit of a shock. Here’s what they sounded like:

The tool added some undesirable effects to the speech, like laughter, breathing, and at one point even an "um"! The speed was off too, with strange pauses between passages that were extremely fast. It was basically the opposite of the too-robotic problem: the voice was way too casual. I generated a second voice using the steadier audio sample, and it gave me a new, nasal tone and a strange accent. Playing with the settings changed the kinds of artifacts it added to the passage, but it consistently added them to all of the samples.

The default settings worked the best, so I'd recommend not straying too far from those. My mom called the default version good, though she thought it was a little "too monotone." The more expressive settings did not pass the mom test. She called the various versions "irritating," "jerky," and "too hard to follow."

Mom test: ⚠️ 1 of 2 voices passed

Pros: 

  • Includes style settings like stability, clarity + similarity, and style exaggeration
  • Allows you to upload up to 25 audio samples

Cons: 

  • More expressive styles sound unrealistic
  • Added undesirable effects to the speech, like laughter, breathing, and filler words

Pricing: 

  • Free plan: ~10 minutes of audio per month
  • $5/month: ~30 minutes of audio per month, access to an editing tool and commercial use license
  • $11/month: ~2 hours of audio per month, “professional voice cloning” plus everything in lower tiers
  • $99/month: ~10 hours of audio per month, 44.1 kHz PCM audio output, plus everything in lower tiers
  • $330/month: ~40 hours of audio per month, priority support, plus everything in lower tiers

Play.ht 

Play.ht allows you to create an “Instant” voice clone using a minimum of 30 seconds of audio (up to 50MB), or a “High Fidelity” clone with more audio. I tried both. The High Fidelity Clone recommends 2–3 hours or more of audio, so you need to be more prolific than me to take advantage of the larger potential training files.

Once the voice is cloned, you can input your text. It also gives you three settings to control the voice: stability, similarity, and intensity.

Instead of generating everything all at once like ElevenLabs, Play.ht allows you to generate multiple clips and stitch them together. I really liked this feature. You can regenerate each clip individually to give you greater control over the output, then download it as a single file or in multiple parts. You can also change the settings by paragraph rather than universally so you can add intensity to specific sentences.

Here’s what it sounded like.

The high fidelity voice was definitely better. However, it made a number of mispronunciations that needed to be regenerated. To get it right, you would have to add phonetic spellings for those particular words. 
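One way to manage phonetic spellings is to keep a small lookup table and apply it to the script before pasting it into the tool. The sketch below is only an illustration: the function name `apply_respellings` is made up, the example respelling is hypothetical, and what spelling actually produces the right sound will depend on the tool.

```python
import re


def apply_respellings(text, respellings):
    """Replace whole words in text with their phonetic respellings (case-insensitive)."""
    if not respellings:
        return text
    # Longest words first so longer entries win over shorter ones.
    words = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    lookup = {word.lower(): spelling for word, spelling in respellings.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


# Hypothetical example entry; not taken from the test described in this article.
RESPELLINGS = {"data": "day-tuh"}
```

Keeping the table separate from the script means the original text stays readable, and the same fixes can be reapplied every time a clip is regenerated.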

There were also fewer options for changing the settings of the voice, and the pacing was really off for some of the outputs. 

The clones didn't do that well on the mom test: She didn't think the default settings sounded like me, and felt they were too monotone. Like me, she also felt the voice was too fast in places. However, even if the voices didn't sound like me, she felt they were clear and expressive.

Mom test: ✅ Passed, with notes

Pros: 

  • Ability to generate multiple clips to stitch together for more control
  • Ability to change settings by paragraph to add intensity to certain phrases

Cons:

  • Mispronounced some words
  • Few options for changing voice settings
  • Pacing is off

Pricing: 

  • Free plan: ~10 minutes of audio per month, one Instant voice clone, attribution required
  • $39/month: ~5.5 hours of audio per month, 10 Instant voice clones, commercial use allowed
  • $99/month: Unlimited audio, regeneration, and Instant voice clones; one High Fidelity clone, commercial use allowed

Resemble AI

I signed up for Resemble AI to test both the Rapid voice clone and the Professional clone, but the Rapid voice clone wasn't available, so I created the Professional voice clone instead. 

Like Descript, Resemble AI makes you record a specific sentence confirming you have permission to clone the voice you're uploading. While I appreciate that this is for security purposes, it was a real pain in the butt.

Resemble AI also requires a single file upload which must be in WAV/AIFF/FLAC format. It takes about an hour to generate the voice, the longest of all the tools.

Here’s what it sounded like:

Once I got it running, I noticed that Resemble AI had several desirable features. It splits the generated audio into smaller chunks so you can regenerate certain parts, rather than the whole thing. You can also specify the part of speech of a word. For instance, for the word "live," which was mispronounced in one of the other generations, it allowed me to specify whether I meant the adjective or the verb.

Resemble AI also had an interesting feature that none of the others had: localization. I'm Canadian and most people can't differentiate between our accent and that of the US Pacific Northwest. But there are indeed differences. That meant that I could "translate" my text into Canadian English. When I used this feature it did indeed change the couple of American-sounding vowels to their Canadian counterparts.

Unfortunately, it was sort of buggy. Sometimes it would skip words, and the spacing between words was odd.

Mom's verdict: She didn't think it sounded like me, calling it too "sing-song" and saying that the "speaker sounds like she is bored."

Mom test: 🛑 Failed

Pros:

  • Splits audio into chunks for more control over regeneration
  • Specify parts of speech for more accurate pronunciation
  • Localization for accurate accents

Cons:

  • Buggy; skipped some words and had odd pacing
  • Longest generation time of all the tools
  • Didn’t pass the mom test

Pricing: 

  • Pay-as-you-go pricing: $0.0006/second of audio (3.6 cents per minute)
  • $29/month: 10,000 seconds of audio per month, 5 Rapid voice clones, 1 Professional voice clone
  • $99/month: 80,000 seconds of audio per month, 25 Rapid voice clones, 3 Professional voice clones, localization
  • $299/month: 200,000 seconds of audio per month, 100 Rapid voice clones, 5 Professional voice clones, localization
  • $499/month: 320,000 seconds of audio per month, 500 Rapid voice clones, 10 Professional voice clones, localization, API access, authorized partner program
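
To compare these tiers on a like-for-like basis, here's a rough sketch that converts each plan into an effective cost per minute of audio. It assumes you use the full monthly allotment and ignores the voice clones and other features bundled into each plan, so treat the results as a ballpark. The names `PLANS` and `cents_per_minute` are just for illustration.

```python
# Monthly price in dollars and included seconds of audio, from the list above.
PLANS = {
    "$29/month": (29, 10_000),
    "$99/month": (99, 80_000),
    "$299/month": (299, 200_000),
    "$499/month": (499, 320_000),
}
PAY_AS_YOU_GO_PER_SECOND = 0.0006  # dollars per second of audio


def cents_per_minute(price_dollars, seconds):
    """Effective cost in cents per minute if the whole allotment is used."""
    return price_dollars / seconds * 60 * 100


if __name__ == "__main__":
    print(f"Pay-as-you-go: {PAY_AS_YOU_GO_PER_SECOND * 60 * 100:.1f} cents/min")
    for name, (price, seconds) in PLANS.items():
        print(f"{name}: {cents_per_minute(price, seconds):.1f} cents/min")
```

At full usage, the monthly plans work out to roughly 17.4, 7.4, 9.0, and 9.4 cents per minute, compared with 3.6 cents per minute on pay-as-you-go, so the subscriptions are mostly paying for the bundled voice clones and features rather than cheaper audio.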

Drawbacks to AI voice cloning tools

Although I'm not someone who hates hearing recordings of my own voice, I definitely did hate it when it was put through some of these tools. It was like a vocal funhouse mirror: some of the tools added strange mannerisms I don't have, added a specific quality to my voice I didn't like, or made pronunciation mistakes I wouldn't make. They ranged from way too robotic to completely unhinged, and the inability to make the changes I wanted left me frustrated with the output.

The pacing was also almost always off: the pauses seemed to be generated almost at random, and they would have to be edited to sound natural. Pronunciation was an issue with many of the generations, too, and it would be time-consuming to edit the text with phonetic spellings.

The tools also left me frustrated with their varied and conflicting requirements for training the voice: some required specific file types, some limited file size, and some required all audio to be in a single file.

Benefits to AI voice cloning tools 

There were some good points, though, and this is where I see some possibilities in the future. I appreciated the tools that let me play with different settings for emotionality and tone, so that I could at least attempt to recalibrate what I heard. I also liked being able to split it up into parts and regenerate only a piece of the audio. Some AI voice cloning tools even let me, at least partially, control pronunciation.

But it was tough to see how any of them would fit into a workflow. Even with the ones I liked the best, it took me about an hour of regenerating the audio to find an example that I liked. It's time-consuming to have to listen to the passage over and over to see whether it worked out right, and each regeneration takes processing time too.

Compare that with just recording it: When I recorded it myself, I didn't do anything fancy, but rather did a single take on my iPhone, which took about 2 minutes to read, including re-doing a few passages. I then uploaded and put it into Descript, where I had to wait a minute or so for the transcription, and then I had to listen to it once more to do the edits. Overall, the process of just recording it and editing it took me less than ten minutes and sounded better than any of the AI-generated options, even if it did have some obvious audio issues I didn't correct. I snuck it in with the samples of the AI voice cloning tools and my mom liked it the best (but she did dunk on my hissing s's in the recording).

Conclusion

I think I would use the AI voice cloning tools as a second possibility when doing especially heavy edits. But even the tools I liked took me far, far longer than it would have taken for me to just record a half dozen takes of the passage and edit them together. Plus, for some of the higher-quality options, you need quite a bit of audio already recorded to make a feasible voice.

If I were editing someone else's audio and couldn't ask them to re-record, I could see myself using an AI voice cloning tool to generate some passages and editing them in (with their permission, of course). Generating a longer passage entirely from the voice clone just didn't work well, at least given the current state of the tools.

But we are pretty early in this game, so I think there is a possibility that these tools will prove useful eventually. After all, they are getting better and better all the time.

My mom agreed, at least partially: "This was tough; when I liked the pace of one, then something else was off. But the four highlighted ones were my best choices to sound like you…good pacing…and clear diction."

Her four favorites were (in no particular order):

  • Descript - Voice 1
  • Descript - Voice 2
  • Play.ht - Voice 2 - Default Settings
  • My real voice (that I secretly snuck in)
Briana Brownell is a Canadian data scientist and multidisciplinary creator who writes about the intersection of technology and creativity.

