Recording yourself is one of those things that sounds easier than it is. As an introvert, I find it time-consuming and honestly, kind of exhausting. It's why I've been eyeing AI voice cloning with a mix of skepticism and desperate hope.
So when Descript asked me to test some voice cloning tools, I didn't just say yes—I practically volunteered as tribute. Could these tools actually deliver on their promise, or was this just another case of AI overhype?
I put four different voice cloning tools through their paces, testing how easy they were to set up and how convincingly they could mimic my voice in text-to-speech. For my test, I used an excerpt from a recent keynote I gave on generative AI. Here's what I actually sound like reading it:
For training these tools, you typically upload recorded audio or read a sample script. I experimented with two different audio samples:
- Voice 1: An iPhone recording of my side of a conversation I had with a good friend. In it, I'm talking pretty informally.
- Voice 2: An audio recording for a course that I taught where I am deliberately speaking more slowly on a technical topic.
But would any of these tools pass the ultimate test: The mom test? Would my own mother recognize these AI voices as mine? Because if you can fool Mom, you can fool anyone.
Here are my results.
Descript
Descript's AI voice cloning feature is one part of its comprehensive suite of audio editing tools. To train your AI voice clone, you need to record a specific statement live, or you can record it on another device and upload it. I tried the second option and encountered some issues with unsupported file types, but eventually got it working after checking the required formats.
I tested the AI voice generator in two ways: first as a standalone voice clone using the statement I recorded, and second by generating the voice within a project. For the project-based option, the voice cloning workflow differs significantly from other tools. Instead of simply clicking "create voice" and uploading files like with other AI voice cloning software, you must create the voice within the project itself. While there's a learning curve, once you understand the process, creating AI voices becomes straightforward—though I did need to consult Descript's help documentation to figure it out initially.
Here's what they sounded like:
The voices generated with the two options were very similar to each other—so similar I had to load them into different tracks in Audition to see if they were exactly the same (they aren't). This raised questions about how much the additional audio actually contributed to the voice cloning quality.
Once set up, Descript's AI voice cloning tool was straightforward to use. Among all the tools tested, it produced the most robotic-sounding voice and lacked direct controls for adjusting pacing or expressiveness. Descript's recommended workaround is creating multiple voice clones with different delivery styles—I could have created a second voice where I read the statement faster or with more emotion. For this evaluation, however, I stuck with the standard voice clone.
Despite having less control over the output, Descript did well on the mom test. My mom said the voices sounded good, albeit less expressive and less like me than other voices.
Mom test: ✅ Passed
Pros: Integrated within a full-featured editing suite; simple to use once set up; consistent voice output; secure voice authorization process; commercial usage rights included.
- Fast process
- Ability to edit the recording without switching programs
- Part of a larger suite of audio and video editing tools
- Passed the mom test
Cons: More robotic-sounding than competitors; limited expressiveness controls; requires learning Descript's specific workflow; voice training process can be finicky with file formats.
- Creating my first voice involved a learning curve
- No direct style controls; must record multiple voices with different delivery styles
Pricing: All Descript tiers include use of the entire suite of editing tools. Free plan includes 1 hour of AI voice generation per month, Creator plan ($15/month) includes 4 hours, Pro plan ($30/month) includes 10 hours, and Enterprise plans offer custom quotas.
- Free: 1 hour of transcription per month and AI voices with a 1,000-word vocabulary
- $12/month: 10 hours of transcription per month and AI voices with a 1,000-word vocabulary
- $24/month: 30 hours of transcription per month and AI voices with an unlimited vocabulary
ElevenLabs
To create my voice clone with ElevenLabs' AI voice cloning software, I could upload up to 25 samples, but each had to be below 10MB. Since I was using uncompressed audio, I had to manually segment my samples in Audition. This proved to be time-consuming, though I could have simplified the process by using compressed formats like MP3 from the start.
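If you hit the same size limit with uncompressed audio, a short script can do the segmenting instead of an audio editor. Here's a minimal sketch using Python's standard `wave` module. It assumes the source is a PCM WAV file and treats "10MB" as 10,000,000 bytes, which is a guess at how the limit is counted. The function name `split_wav` and the file names in the usage note are hypothetical.

```python
import wave
from pathlib import Path

MAX_BYTES = 10_000_000  # assumption: "10MB" counted as 10,000,000 bytes
HEADER_BYTES = 44       # size of the header Python's wave module writes for PCM


def split_wav(src, out_dir, max_bytes=MAX_BYTES):
    """Split a PCM WAV file into numbered chunks that each stay below max_bytes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        bytes_per_frame = params.nchannels * params.sampwidth
        # Whole frames that fit in one chunk, leaving room for the header
        # and keeping the file strictly below the limit.
        frames_per_chunk = (max_bytes - HEADER_BYTES - 1) // bytes_per_frame
        if frames_per_chunk < 1:
            raise ValueError("max_bytes is too small to hold a single frame")
        index = 1
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            target = out_dir / f"{Path(src).stem}_part{index:02d}.wav"
            with wave.open(str(target), "wb") as writer:
                writer.setparams(params)
                writer.writeframes(frames)  # frame count in the header is fixed on close
            chunks.append(target)
            index += 1
    return chunks
```

Calling `split_wav("voice1.wav", "voice1_parts")` would write `voice1_part01.wav`, `voice1_part02.wav`, and so on. The cuts land at fixed byte offsets rather than at pauses, so a chunk can start or end mid-word.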
The voice cloning tool allows you to adjust several settings that affect how your AI voice sounds, including stability, clarity + similarity, and style exaggeration. After configuring these parameters, you simply input your text, and the system generates your audio sample, which you can immediately preview and download.
I generated several clips from both audio samples using different settings, and the results gave me a bit of a shock. Here's what they sounded like:
The tool added some undesirable effects to the speech, like laughter, breathing, and at one point even an "um"! The speed was off too, with strange pauses between passages that were extremely fast. It was basically the opposite of the robotic-voice problem: way too casual. I generated a second voice using the steadier audio sample, and it gave me a new, nasal tone and a strange accent. Playing with the settings changed the kinds of artifacts it added to the passage, but it consistently added them to all of the samples.
The default settings worked the best, so I'd recommend not straying too far from those. My mom called the default version good, though she thought it was a little "too monotone." The more expressive settings did not pass the mom test. She called the various versions "irritating," "jerky," and "too hard to follow."
Mom test: ⚠️ 1 of 2 voices passed
Pros: Highly customizable voice settings; fine control over voice characteristics; supports multiple languages; allows for expressive voice cloning; offers high-quality voice output with proper settings.
- Includes style settings like stability, clarity + similarity, and style exaggeration
- Allows you to upload up to 25 audio samples
Cons: Can add unwanted speech artifacts (laughter, breathing, "ums"); inconsistent pacing issues; time-consuming setup process; requires careful adjustment of settings to avoid unnatural results.
- More expressive styles sound unrealistic
- Added undesirable effects to the speech, like laughter, breathing, and filler words
Pricing: Free tier includes limited minutes of voice generation; Creator tier ($5/month) includes 10,000 characters; Pro tier ($22/month) includes 30,000 characters; Enterprise plans available for higher volume needs.
- Free plan: ~10 minutes of audio per month
- $5/month: ~30 minutes of audio per month, access to an editing tool and commercial use license
- $11/month: ~2 hours of audio per month, “professional voice cloning” plus everything in lower tiers
- $99/month: ~10 hours of audio per month, 44.1 kHz PCM audio output, plus everything in lower tiers
- $330/month: ~40 hours of audio per month, priority support, plus everything in lower tiers
Play.ht
Play.ht's voice cloning software offers two options: an "Instant" voice clone using a minimum of 30 seconds of audio (up to 50MB), or a "High Fidelity" clone with more extensive training data. I tested both approaches. The High Fidelity Clone recommends 2–3 hours or more of audio, which requires significantly more recorded content than most casual users would have readily available.
Once the voice is cloned, you can input your text. It also gives you three settings to control the voice: stability, similarity, and intensity.
Instead of generating everything all at once like ElevenLabs, Play.ht allows you to generate multiple clips and stitch them together. I really liked this feature. You can regenerate each clip individually to give you greater control over the output, then download it as a single file or in multiple parts. You can also change the settings by paragraph rather than universally so you can add intensity to specific sentences.
Here's what it sounded like.
The High Fidelity voice was definitely better. However, it mispronounced a number of words, and those clips needed to be regenerated. To get them right, you would have to add phonetic spellings for those particular words.
There were also fewer options for changing the settings of the voice, and the pacing was really off for some of the outputs.
The clones didn't do that well on the mom test: She didn't think the default settings sounded like me, and felt it was too monotone. Like me, she also felt it was too fast in places. However, even if the voices didn't sound like me, she felt the voice was clear and expressive.
Mom test: ✅ Passed, with notes
Pros: Paragraph-by-paragraph generation control; ability to regenerate specific sections; supports phonetic spelling adjustments; clear voice output; expressive voice capabilities with proper settings.
- Ability to generate multiple clips to stitch together for more control
- Ability to change settings by paragraph to add intensity to certain phrases
Cons: Pronunciation issues requiring manual correction; limited voice customization options compared to competitors; inconsistent pacing in outputs; high-fidelity option requires extensive audio samples.
- Mispronounced some words
- Few options for changing voice settings
- Pacing is off
Pricing: Free tier available with limited features; Creator tier ($9/month) includes 200 minutes of audio; Pro tier ($29/month) includes 600 minutes; Business plans available for higher volume needs.
- Free plan: ~10 minutes of audio per month, one Instant voice clone, attribution required
- $39/month: ~5.5 hours of audio per month, 10 Instant voice clones, commercial use allowed
- $99/month: Unlimited audio, regeneration, and Instant voice clones; one High Fidelity clone, commercial use allowed
Resemble AI
I signed up for Resemble AI to test both their Rapid voice clone and Professional voice cloning options, but discovered the Rapid voice clone wasn't available, so I proceeded with creating the Professional voice clone instead.
Like Descript, Resemble AI requires recording a specific consent statement confirming you have permission to clone the voice you're uploading. While this security measure is important for ethical voice cloning, it did create an additional hurdle in the setup process.
Resemble AI also requires a single file upload, which must be in WAV, AIFF, or FLAC format. It takes about an hour to generate the voice, the longest of all the tools.
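If your training audio is spread across several recordings, the single-file requirement means joining them first. Here's a minimal sketch with Python's standard `wave` module. It assumes every source is a PCM WAV file with the same channel count, sample width, and sample rate. The function name `concat_wavs` and the file names in the usage note are hypothetical.

```python
import wave


def concat_wavs(sources, target):
    """Join PCM WAV files that share one format into a single WAV file."""
    sources = list(sources)
    if not sources:
        raise ValueError("need at least one source file")
    fmt = None
    audio = []
    for src in sources:
        with wave.open(str(src), "rb") as reader:
            this_fmt = (
                reader.getnchannels(),
                reader.getsampwidth(),
                reader.getframerate(),
            )
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"{src} does not match the format of the first file")
            # Holds each file's audio in memory; fine for short voice samples.
            audio.append(reader.readframes(reader.getnframes()))
    with wave.open(str(target), "wb") as writer:
        writer.setnchannels(fmt[0])
        writer.setsampwidth(fmt[1])
        writer.setframerate(fmt[2])
        for frames in audio:
            writer.writeframes(frames)
    return target
```

For example, `concat_wavs(["take1.wav", "take2.wav"], "training.wav")` would write one file containing both takes back to back. Files with mismatched formats raise an error rather than being resampled.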
Here's what it sounded like:
Once I got it running, I noticed that Resemble AI had several desirable features. It splits the generated audio into smaller chunks so you can regenerate certain parts, rather than the whole thing. You can also specify the part of speech of a word. For instance, for the word "live," which was mispronounced in one of the other generations, it allowed me to specify whether I meant the adjective or the verb.
Resemble AI also had an interesting feature that none of the others had: localization. I'm Canadian and most people can't differentiate between our accent and that of the US Pacific Northwest. But there are indeed differences. That meant that I could "translate" my text into Canadian English. When I used this feature it did indeed change the couple of American-sounding vowels to their Canadian counterparts.
Unfortunately, it was sort of buggy. Sometimes it would skip words, and the spacing between words was odd.
Mom's verdict: She didn't think it sounded like me, calling it too "sing-song" and saying that the "speaker sounds like she is bored."
Mom test: 🛑 Failed
Pros: Part-of-speech specification for better pronunciation; regional accent support (Canadian English); regeneration of specific sections; strong security and consent features.
- Splits audio into chunks for more control over regeneration
- Specify parts of speech for more accurate pronunciation
- Localization for accurate accents
Cons: Buggy performance with skipped words; odd spacing between words; long generation time (about an hour); limited file format support; unnatural speech cadence.
- Buggy; skipped some words and had odd pacing
- Longest generation time of all the tools
- Didn't pass the mom test
Pricing: Professional plan starts at $25/month for 100 minutes; Business plan at $100/month for 500 minutes; Enterprise plans available for custom needs; no free tier available.
- Pay-as-you-go pricing: $0.0006/second of audio (3.6 cents per minute)
- $29/month: 10,000 seconds of audio per month, 5 Rapid voice clones, 1 Professional voice clone
- $99/month: 80,000 seconds of audio per month, 25 Rapid voice clones, 3 Professional voice clones, localization
- $299/month: 200,000 seconds of audio per month, 100 Rapid voice clones, 5 Professional voice clones, localization
- $499/month: 320,000 seconds of audio per month, 500 Rapid voice clones, 10 Professional voice clones, localization, API access, authorized partner program
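Because these plans are quoted in seconds, it can help to convert them to a per-minute price before comparing. The sketch below does that arithmetic using only the figures in the list above. It assumes you use a tier's full monthly allowance, and it ignores the voice clones and other extras bundled with each tier, so it isn't a complete comparison.

```python
PAY_AS_YOU_GO_PER_SECOND = 0.0006  # dollars per second, from the list above

# Monthly price in dollars and seconds of audio included, from the list above.
TIERS = {
    "$29/month": (29, 10_000),
    "$99/month": (99, 80_000),
    "$299/month": (299, 200_000),
    "$499/month": (499, 320_000),
}


def per_minute(price, seconds):
    """Dollars per minute of audio if the full allowance is used."""
    return price / seconds * 60


def cost_per_minute_table():
    """Map each pricing option to its effective per-minute cost in dollars."""
    rows = {"pay-as-you-go": PAY_AS_YOU_GO_PER_SECOND * 60}
    for name, (price, seconds) in TIERS.items():
        rows[name] = per_minute(price, seconds)
    return rows


if __name__ == "__main__":
    for name, dollars in cost_per_minute_table().items():
        print(f"{name:>14}: ${dollars:.3f} per minute")
```

On these numbers, pay-as-you-go comes to 3.6 cents per minute, while the $29 tier comes to about 17 cents per minute at full use. The tiers presumably justify that through their bundled clones and features rather than through audio volume alone.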
Legal and ethical considerations
Voice cloning requires explicit user consent in many jurisdictions, ensuring that individuals are aware their voice is being reproduced. This is crucial for compliance with privacy laws like the GDPR, which mandate transparent handling of personal data. If companies or individuals misuse cloned voices, they could face lawsuits over identity theft or defamation. Additionally, intellectual property rights may protect a person’s voice from unauthorized replication. For these reasons, some voice cloning software includes built-in consent mechanisms to help users ethically manage voice data.
Limitations of AI voice cloning tools
Although I don't typically dislike hearing recordings of my own voice, I definitely hated hearing it processed through some of these AI voice cloning tools. The experience was like looking into a vocal funhouse mirror: some tools added strange mannerisms I don't have, altered voice qualities in unflattering ways, or made pronunciation errors I wouldn't make. The results ranged from excessively robotic to completely unhinged, and the limited customization options left me frustrated with the output quality.
The pacing was also consistently problematic across most voice cloning software—pauses seemed randomly generated rather than naturally placed. These would require significant editing to sound natural. Additionally, pronunciation issues were common in many of the generated voices, which would necessitate time-consuming phonetic spelling edits to correct.
The best AI voice cloning tools also frustrated me with their varied and conflicting technical requirements for training voices. Some demanded specific file formats, others imposed strict file size limits, and some required all audio to be consolidated into a single file—making the initial setup process unnecessarily complicated.
Technical criteria for evaluating voice clones
When it comes to assessing audio realism, metrics like the Mean Opinion Score (MOS) provide valuable feedback from real listeners. Advanced measures such as Perceptual Evaluation of Speech Quality (PESQ) can offer objective insights into speech distortion. Some platforms even evaluate the True Acceptance Rate (TAR) to determine how frequently a cloned voice is deemed authentic by users. Likewise, setting thresholds for False Acceptance Rate (FAR) helps developers prevent unintentional acceptance of voices that do not match the original. In practice, factors like file size, supported languages, and training data requirements all play a role in how well a given tool can replicate a speaker’s unique speech patterns.
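Two of these metrics are simple enough to compute by hand. The sketch below shows one common way to do it, not how any particular platform does it. It assumes MOS ratings on the usual 1-to-5 listener scale, and it assumes some speaker-verification system has already produced a similarity score for each clip. All the ratings and scores are made up for illustration. PESQ is left out because it depends on a standardized reference implementation.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings on a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


def acceptance_rate(scores, threshold):
    """Fraction of similarity scores at or above the acceptance threshold."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Illustrative numbers only; none of these come from the tools tested here.
    listener_ratings = [4, 3, 5, 4, 2]
    matching_scores = [0.91, 0.84, 0.77, 0.62]      # clips that should be accepted
    non_matching_scores = [0.40, 0.55, 0.80, 0.30]  # clips that should be rejected
    threshold = 0.75

    print(f"MOS: {mean_opinion_score(listener_ratings):.2f}")              # MOS: 3.60
    print(f"TAR: {acceptance_rate(matching_scores, threshold):.2f}")      # TAR: 0.75
    print(f"FAR: {acceptance_rate(non_matching_scores, threshold):.2f}")  # FAR: 0.25
```

TAR and FAR are the same calculation applied to two different sets of clips. Raising the threshold lowers both, which is why the threshold is a trade-off rather than a setting with one right value.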
Benefits of AI voice cloning tools
Despite these limitations, I did see promising aspects that suggest potential for AI voice cloning technology in the future. I appreciated tools that offered adjustable settings for emotionality and tone, allowing some recalibration of the output. The ability to split audio into sections and regenerate only specific parts was also valuable. Some voice cloning software even provided partial control over pronunciation, which helped address common issues.
However, it was difficult to envision how any of these AI voice generators would integrate efficiently into a professional workflow. Even the best-performing tools required about an hour to generate and refine audio samples to an acceptable quality—listening to passages repeatedly to check quality and waiting for processing time between each regeneration attempt quickly adds up.
Compare that with simply recording it myself: I recorded a single take on my iPhone, taking about 2 minutes to read the passage and redo a few sections. After uploading to Descript, I waited briefly for transcription, then spent a few minutes editing. The entire process took less than ten minutes and produced better results than any of the AI voice cloning tools, even with some audio imperfections I didn't correct. I included this recording alongside the AI-generated samples, and my mom preferred it over all the voice clones (though she did critique my sibilant s's).
Final verdict on best AI voice cloning tools
I could see using these voice cloning applications as a secondary option when making extensive edits to existing content. However, even the tools I preferred took significantly longer than simply recording multiple takes and editing them together. Additionally, the higher-quality voice cloning options typically require substantial pre-recorded audio to create an effective voice model.
If I were editing someone else's audio and couldn't request a re-recording, an AI voice cloning tool could be useful for generating specific passages to edit in (with their explicit permission, addressing important legal and ethical concerns). However, generating longer content exclusively with voice clones didn't produce satisfactory results with the current state of the technology.
We're still in the early stages of voice cloning technology development, so there's significant potential for these tools to become more useful in the future. The best AI voice cloning software is continuously improving, with each update bringing better quality and more natural-sounding results.
My mom agreed, at least partially: "This was tough; when I liked the pace of one, then something else was off. But the four highlighted ones were my best choices to sound like you…good pacing…and clear diction."
Her four favorites were (in no particular order):
- Descript - Voice 1
- Descript - Voice 2
- Play.ht - Voice 2 - Default Settings
- My real voice (that I secretly snuck in)
FAQs
Are there legal risks in using AI voice clones?
Yes, unauthorized voice cloning can lead to potential lawsuits for violations of privacy or intellectual property. Regulations like GDPR also require explicit consent, so ignoring them could result in hefty fines. Many reputable tools have built-in processes to secure user authorization, aiming to mitigate these risks. Still, it’s crucial for end-users to clearly disclose and obtain permission whenever cloning someone's voice.
How do cost and quality relate in voice cloning software?
Generally, higher-priced plans provide more advanced training data options and higher fidelity outputs. Some premium tiers also allow for bigger audio files and advanced metrics like Perceptual Evaluation of Speech Quality (PESQ). In contrast, free or low-cost plans often limit the number of voice samples you can upload, affecting the clone's accuracy. Assessing signal-to-noise ratios and false acceptance rates can help you gauge if a costlier plan is truly needed.
Do I need advanced technical skills to use these tools?
Many AI voice cloning platforms focus on user-friendly interfaces, making them accessible to non-technical individuals. However, optimizing training data or adjusting advanced settings like style exaggeration can be more complex. Consulting tool-specific guides or support resources often resolves most setup issues. Professional use cases, such as audiobook narration, might still require some audio-editing know-how.
