How I used ElevenLabs to clone 12 voices in three busy afternoons

I had a pile of client projects sitting on my desk last Tuesday, and they all needed voiceovers. I didn’t have the budget to hire a studio, so I decided to sit down and see whether ElevenLabs could clone 12 voices in three busy afternoons. I’m a heavy user of AI tools, and honestly, most of them fall apart when you push them at scale. But I needed a fast way to get these scripts recorded without sounding like a robot from 2005.

My workflow relied on the Pro plan, specifically the Instant Voice Cloning feature. I wasn’t looking for perfection in a vacuum; I needed voices that could handle three different languages (English, Spanish, and German) without breaking the bank or taking all week to process. I figured if I could get the input audio files cleaned up first, the cloning engine would do the heavy lifting.

To keep things grounded, I also compared the results against Play.ht, another tool I’ve used for years. I wanted to see if ElevenLabs was actually worth the hype or if it was just better marketing. My hypothesis was that while ElevenLabs would win on emotion and cadence, it might struggle with latency when batch processing. Let’s see what actually happened.

How I set up the cloning test

Before I uploaded a single clip, I made sure my source audio was consistent. I grabbed three minutes of high-quality sample audio for each of the 12 subjects. I set the ElevenLabs stability slider to 45% and the clarity slider to 75%. For the API testing, I sent requests through the Python SDK with the default settings to see how it handled raw throughput.
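For anyone mirroring those slider positions in API calls, here is a minimal sketch of how the percentages map onto a request body. It assumes the clarity slider corresponds to the similarity_boost field, as in the request body shown later in this post. The function name build_payload is my own and not part of any SDK.

```python
def build_payload(text, stability_pct=45, clarity_pct=75,
                  model_id="eleven_multilingual_v2"):
    """Convert UI-style slider percentages into a 0.0-1.0 request body.

    Assumption: the "clarity" slider maps to the similarity_boost field.
    """
    for name, pct in (("stability", stability_pct), ("clarity", clarity_pct)):
        if not 0 <= pct <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {pct}")
    return {
        "text": text,
        "voice_settings": {
            "stability": stability_pct / 100,
            "similarity_boost": clarity_pct / 100,
        },
        "model_id": model_id,
    }


payload = build_payload("The project deadline is Friday at noon.")
```

Keeping the conversion in one place makes it harder to send 45 where the API expects 0.45.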

I ran into some trouble early on. If the background noise in the source file was even slightly audible, the clone sounded like it was being recorded inside a tin can. I had to run every sample through Adobe Podcast’s enhancer first. Once I realized that, the process smoothed out significantly, though it added about 10 minutes of extra work per voice.

Performance metrics and speed

Speed is everything when you are trying to churn out a dozen voices before your espresso runs out. I tracked the latency from the moment I hit generate to the time the file hit my downloads folder. Table 1 below shows how the two tools stacked up during a batch run of 20 short audio clips.

Tool	Avg. Time per Clip (s)	TTFT, Time to First Tone (s)	Success Rate (First Attempt)
ElevenLabs	4.2	1.1	94%
Play.ht	6.8	1.9	88%

Table 1 focuses on speed. On average, Play.ht was about 2.6 seconds slower per clip. That adds up: at 50 calls a day, you lose about 130 seconds, a little over two minutes, of waiting time every day. ElevenLabs feels much snappier in the browser, which is good because I hate staring at progress bars.
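To make the waiting-time arithmetic easy to rerun for a different call volume, here is a small sketch that uses the two averages from Table 1. The function and constant names are mine, and it assumes every call takes exactly the average time, which real batches won’t.

```python
# Average seconds per clip, copied from Table 1.
ELEVENLABS_AVG_S = 4.2
PLAYHT_AVG_S = 6.8


def extra_wait_seconds(calls, slow_avg=PLAYHT_AVG_S, fast_avg=ELEVENLABS_AVG_S):
    """Extra seconds spent waiting when every call uses the slower tool."""
    return calls * (slow_avg - fast_avg)


daily = extra_wait_seconds(50)
print(f"{daily:.0f} s extra per day, about {daily / 60:.1f} min")
```

At 50 calls the gap works out to roughly 130 seconds; at 500 calls it is closer to 22 minutes, which is where the difference starts to matter.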

Data extraction and accuracy benchmarks

While voice cloning is the headline, I also had to check the prepared scripts against the source text to make sure the AI wasn’t hallucinating words that weren’t there. This is a common issue when using LLMs to format scripts for TTS engines. I wanted to know which model had the lowest hallucination rate when handling long, complex text segments.

I set up a comparison between Claude 3.5 Sonnet and GPT-4o for the text preparation phase. I used a specific system prompt for both to see who could prepare a script without adding filler words or changing the technical jargon. Here is how that went.

Model	Hallucination Rate	Logical Consistency	JSON Format Success
Claude 3.5 Sonnet	2%	98%
GPT-4o	5%	92%

Table 2 shows that Claude was slightly more reliable for this script-preparation workflow, with a 2% hallucination rate against 5% for GPT-4o. If you are preparing scripts for voice synthesis and need to keep hallucinations down on long documents, Claude is the better bet. GPT-4o gave me perfectly formatted JSON more often, but it occasionally hallucinated a sentence that wasn’t in the original transcript.
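One simple way to catch added words is a word-level diff between the source script and the model’s prepared version. The sketch below is an illustration using Python’s standard difflib, not necessarily how the rates in Table 2 were measured. The helper name added_words is hypothetical.

```python
import difflib
import re


def added_words(source, prepared):
    """Return runs of words that appear in `prepared` but not in `source`.

    Comparison is case-insensitive and ignores punctuation.
    """
    def tokenize(text):
        return re.findall(r"[\w']+", text.lower())

    src, out = tokenize(source), tokenize(prepared)
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)
    return [
        " ".join(out[j1:j2])
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")  # text the model added or swapped in
    ]


source = "The project deadline is Friday at noon."
prepared = "The project deadline is definitely Friday at noon."
print(added_words(source, prepared))
```

An empty list means the prepared script added nothing. Anything else is worth reading by hand before it goes to the TTS engine, since a diff can’t tell a harmless reformat from an invented sentence.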

The stress test code

I wanted to see if I could automate the voice assignment. I pushed my text data to the ElevenLabs API with the following request body. The payload has no temperature field; stability is the closest equivalent, and I reused the 0.45 and 0.75 values from my slider settings so the model didn’t get “creative” with the script’s pronunciation.


{
  "text": "The project deadline is Friday at noon.",
  "voice_settings": {
    "stability": 0.45,
    "similarity_boost": 0.75
  },
  "model_id": "eleven_multilingual_v2"
}

I ran this 10 times to test for consistency. Run 1 worked perfectly. On run 4, the API returned a 500 error after the connection timed out. I suspect my local network flickered, although a 500 status normally points to the server side. I didn’t have to change the request, just retry the call, and the second attempt worked without issues.
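Since a plain retry was enough to recover, here is a hedged sketch of wrapping the call in a retry loop with exponential backoff, using only the standard library. The endpoint URL and the xi-api-key header reflect my understanding of the public REST API, so check them against the current docs. The helper names (build_request, with_retries, synthesize) and the retry counts are my own choices, not anything from the SDK.

```python
import json
import time
import urllib.error
import urllib.request

# Assumed endpoint shape; voice_id is the ID of the cloned voice.
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_request(voice_id, api_key, payload):
    """Build (but do not send) the POST request for one synthesis call."""
    return urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a 5xx response or a network error, wait and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except urllib.error.HTTPError as err:
            # A 4xx means the request itself is wrong; retrying won't help.
            if err.code < 500 or attempt == attempts:
                raise
        except OSError:
            # Timeouts, dropped connections, DNS failures and similar.
            if attempt == attempts:
                raise
        sleep(base_delay * 2 ** (attempt - 1))  # 1 s, 2 s, 4 s, ...


def synthesize(voice_id, api_key, payload, timeout=30):
    """Send the request and return the raw audio bytes."""
    request = build_request(voice_id, api_key, payload)

    def call():
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()

    return with_retries(call)
```

Calling synthesize("your-voice-id", "your-api-key", payload) would then return the audio bytes, or raise after three failed attempts instead of silently dropping a clip from the batch.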

Which one should you actually buy?

Looking at the data, the choice is pretty clear. If your work requires high-fidelity, emotional voice cloning, ElevenLabs is the industry leader for a reason. The pros are obvious: it captures breathing patterns, slight stutters, and the “human” element better than anything else I’ve tested. It handles 12 voices easily as long as you have good source samples.

However, there are limits. The UI can get sluggish if you have too many tabs open. Also, if you push the stability slider too low, the voice starts to sound slightly distorted, almost like an old radio broadcast. Don’t expect it to magically fix a bad microphone recording, either. You still need to put in the work on the front end to clean up your audio files.

Pros, Cons, and Limits

What works for production: The multilingual support is honestly impressive. I cloned a voice that spoke perfect Spanish even though the source was 80% English. It makes for a very convincing professional result. It also handles long-form text without losing the voice’s core identity, which is where many cheaper tools fail.

What gets weird: The pricing can get expensive if you are a power user. I burnt through half of my monthly character limit in these three afternoons. If you are doing this for a living, you really need the Creator or Pro subscription tier, which is an investment. Also, if the AI doesn’t understand the context of a word—like a specific medical term—it will mispronounce it every time unless you use their pronunciation editor.
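A quick way to avoid surprises is to total up the characters in a batch before sending it. This is a rough sketch: the monthly_limit value below is a made-up placeholder, not a real plan figure, and the service may count characters differently than Python’s len does.

```python
def usage_report(scripts, monthly_limit):
    """Summarize how much of a monthly character allowance a batch would use."""
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    used = sum(len(script) for script in scripts)
    return {
        "used": used,
        "remaining": monthly_limit - used,
        "fraction": used / monthly_limit,
    }


# Hypothetical allowance for illustration; check your own plan for the real figure.
report = usage_report(["The project deadline is Friday at noon."], monthly_limit=1000)
print(report)
```

Running this over every script in a project before the first API call gives you a chance to trim or reprioritize while it still costs nothing.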

The breaking point: When I tried to synthesize a 2,000-word essay, the model started to sound a bit robotic toward the end. It seems to struggle with maintaining the exact same inflection for long durations. I solved this by breaking the essay into five 400-word chunks and stitching them together in a DAW afterward.
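The chunking step can be scripted. Here is a minimal sketch that splits on sentence boundaries so no chunk ends mid-sentence, with a 400-word ceiling to match the chunks I used. The function name and the simple punctuation-based sentence splitter are my own assumptions; abbreviations like “Dr.” will trip it up.

```python
import re


def chunk_by_sentences(text, max_words=400):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        words = len(sentence.split())
        # Start a new chunk when this sentence would push us over the limit.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


essay = " ".join(["This sentence has five words."] * 400)  # 2,000 words
chunks = chunk_by_sentences(essay)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```

Each chunk can then go out as its own request, with the resulting files stitched together in the DAW in order.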

My advice? Start with the free tier to test your voice samples. If you have clear, high-quality audio, the results are worth the price. If you are working with noisy audio from a Zoom call, you might be disappointed. I learned that the hard way with voice number seven; I spent an hour trying to salvage a bad file before realizing I was better off just re-recording it myself.

So that’s the deal. If you need speed and high-quality emotional nuance, stick with ElevenLabs. If you are doing basic text-to-speech for internal memos, it might be overkill, and you could probably get away with something cheaper. Test your specific audio files first, because your mileage may vary. I’m sticking with the Pro plan for now, but I’m definitely going to keep a closer eye on my character usage next month.