How I used ElevenLabs to clone 12 voices in three busy afternoons

I had 12 brand videos to localize for a client, and the deadline was three afternoons away. Each video needed a different narrator to match the regional feel of the target audience. I sat down with my laptop, a lukewarm coffee, and an ElevenLabs account to see if I could actually clone 12 voices in three busy afternoons without losing my mind.

My goal was simple: get 12 distinct, high-quality audio files that didn’t sound like a robot reading a tax return. I had been using ElevenLabs for simple tasks before, but this was a production-level stress test. Honestly, I wasn’t sure if the cloning quality would hold up across that many different vocal ranges, but I was about to find out.

How I used ElevenLabs to clone 12 voices

I started by gathering the source audio. I needed about 60 seconds of clean speech for each person. I made sure there was no background music or fan noise—which is where most people mess up. If you upload a file with echo, your clone will echo, and there’s no “fix it” button for that.
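A quick length check before uploading saves a round trip. This is a minimal sketch that assumes the clips are WAV files and uses only Python's standard `wave` module; the helper names are mine, and it only checks duration, not noise or echo.

```python
import wave

MIN_SECONDS = 60  # the "about 60 seconds" target mentioned above


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """True when the clip meets the minimum length. Says nothing about noise."""
    return wav_duration_seconds(path) >= min_seconds
```

Calling `is_long_enough("narrator_01.wav")` (a made-up file name) returns False for anything shorter than a minute, which is a cheap way to catch a truncated export before it reaches the upload step.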

For this project, I used the Instant Voice Cloning feature, which is the mode built for short samples like these. I uploaded the files, tagged them by tone and gender, and let the tool do its thing. It took about 15 minutes per voice to process. While I waited, I started drafting the scripts for the videos. The interface is pretty straightforward, but I did have to re-upload two files because the system flagged them as having too much background noise.

Here is a breakdown of how the processing speeds looked when I ran these clones in batches.

Batch Size	Processing Time (Minutes)	Success Rate (%)	Note
3 Voices	42	100	Clean, consistent results.
6 Voices	98	83	One upload error due to file format.
12 Voices	210	92	Had to refresh twice during upload.

Table 1 shows how the processing time scales. Per voice, it works out to about 14 minutes in the batch of 3, 16 in the batch of 6, and 17.5 in the batch of 12, and doing 12 at once also creates some drag in the UI. If you are doing batch processing, I recommend doing them in groups of four. It keeps the browser from choking, and you aren’t staring at a spinning progress wheel for an hour.
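For anyone who wants to check the arithmetic behind Table 1 or plan their own batches, here is a small sketch. The numbers are copied from the table; the file names and the `in_groups` helper are made up for illustration.

```python
# Rows from Table 1: (voices in batch, total processing minutes)
TABLE_1 = [(3, 42), (6, 98), (12, 210)]


def minutes_per_voice(rows):
    """Average processing minutes per voice for each batch size."""
    return {voices: round(minutes / voices, 1) for voices, minutes in rows}


def in_groups(items, size=4):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(minutes_per_voice(TABLE_1))

# Hypothetical file names for the 12 source clips
files = [f"voice_{n:02d}.wav" for n in range(1, 13)]
for group in in_groups(files):
    print(group)
```

With 12 files and a group size of four, this gives three upload rounds of four clips each.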

The technical side of things

I also wanted to know how this compared to basic TTS (text-to-speech) models, so I ran a quick comparison between ElevenLabs and a standard API-based model I’ve used for internal docs. The question was whether the “human” quality was actually worth the extra cost and setup time.

Here is how the costs and file limits compare for someone doing this at scale.

Tool	Cost per 1M Characters	Max Input Size	Human-Like Score (1-10)
ElevenLabs	$22.00	50,000 chars	9.5
Generic API	$2.00	10,000 chars	4.0
Custom Local LLM	$0.00 (Self-hosted)	Unlimited	2.0

Table 2 looks at cost versus quality. If you are just doing internal memos, the generic API is fine. But for client-facing work, the jump from a 4.0 to a 9.5 score is massive. It literally changes whether the audience listens to the whole video or clicks away in three seconds.
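To put Table 2 in project terms, here is a rough estimator. It is a sketch under simple assumptions: the rates and input limits are the ones from the table, pricing is treated as strictly proportional to character count, and the 120,000-character workload is a made-up example, not my actual script length.

```python
import math

# Figures copied from Table 2: (USD per 1M characters, max characters per request)
TABLE_2 = {
    "ElevenLabs": (22.00, 50_000),
    "Generic API": (2.00, 10_000),
}


def estimate(tool, total_chars):
    """Return (estimated cost in USD, minimum number of requests)."""
    rate, max_chars = TABLE_2[tool]
    cost = total_chars / 1_000_000 * rate
    requests = math.ceil(total_chars / max_chars)
    return round(cost, 2), requests


# Hypothetical workload: 120,000 characters of script in total
for tool in TABLE_2:
    cost, requests = estimate(tool, 120_000)
    print(f"{tool}: ${cost:.2f} across at least {requests} requests")
```

The request count matters as much as the price: a smaller input limit means more splitting and more stitching afterwards.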

Head-to-head: data doesn’t lie

People keep asking if they should use ElevenLabs or just stick with cheaper alternatives. If you are a solo creator or a small shop, the answer depends on your output. Looking at Table 2, ElevenLabs is clearly more expensive. If you are budget-constrained, you might feel that sting.

However, Table 1 shows that while it takes a bit of time to prep, the success rate is high: between 83% and 100% across my batches. My biggest frustration was the UI freezing when I tried to bulk-upload 12 files at once. I learned the hard way that you need to be patient. If the progress bar stalls, don’t just mash the button. Give it a minute to breathe.

If you need speed and high-end results, ElevenLabs wins. If you just need “good enough” audio, there are cheaper ways to do it. For my project, the 12 voices were for a high-end corporate presentation. I couldn’t afford a “robotic” outcome, so the extra cost was worth it.

The stress test: pushing the system

I wanted to see how it handled specific tone directions. I used the following request settings on the API side to see if I could steer the emotion of the clones.


{
  "voice_id": "clone_001",
  "text": "The Q3 earnings report shows a steady increase in user retention.",
  "model_id": "eleven_multilingual_v2",
  "voice_settings": {
    "stability": 0.45,
    "similarity_boost": 0.85,
    "style": 0.3
  }
}

I ran this request 10 times with different stability settings. At 0.45 stability, it sounded natural. When I bumped stability up to 0.90, it sounded like a news anchor. When I dropped it to 0.10, it got erratic, adding odd pauses and breathing noises that weren’t in the script. Keep it between 0.4 and 0.6 for the most professional-sounding output.
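If you want to repeat the stability sweep, something like the sketch below is a starting point. It only builds the requests and sends nothing. The endpoint URL and the `xi-api-key` header reflect my understanding of the public REST API and should be checked against the current docs; the helper names are mine, and the sketch varies only stability while keeping similarity_boost at the value from my request.

```python
import json
import urllib.request

# Assumed endpoint shape; verify against the current API reference.
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
TEXT = "The Q3 earnings report shows a steady increase in user retention."


def build_payload(text, stability):
    """Request body with one stability value; other settings stay fixed."""
    return {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {"stability": stability, "similarity_boost": 0.85},
    }


def build_request(voice_id, payload, api_key):
    """Prepare (but do not send) the HTTP request for one payload."""
    return urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# The three settings discussed above: erratic, natural, news anchor
STABILITIES = [0.10, 0.45, 0.90]
payloads = [build_payload(TEXT, s) for s in STABILITIES]
```

Passing a prepared request to `urllib.request.urlopen` would actually send it, so keep a real API key out of the script until you are ready to spend characters.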

Pros, cons, and the reality check

The biggest pro is the sheer quality. I didn’t have to spend hours in an audio booth, and I didn’t have to hire 12 different voice actors. The voices were indistinguishable from the source material after I tweaked the settings. The multilingual support was also a shock—I tried a test clip in Spanish, and it kept the “personality” of the voice perfectly.

The con? You hit the character limits pretty fast if you are doing long-form content. Also, the “cloning” isn’t magic. If your source audio is bad, the output is bad. I tried to clone a voice using a recording from a Zoom call, and it sounded hollow. Do not skip the step of getting high-quality, studio-grade source audio.

One breaking point I found: the system struggles if you ask it to do too much “performance” in a single prompt. If I wrote a script with tons of weird punctuation and long, complex sentences, it would eventually lose the cadence. I found it worked best when I kept sentences short and used simple punctuation. If I had to process a 50-page document, I would definitely break it into smaller chunks, or I’d be dealing with AI hallucinations in the pacing.

If you’re wondering how to stop this kind of drift when processing long documents, the answer is the same as with text LLMs: chunking. Don’t throw a massive wall of text at it and expect a perfect performance. Process it in sections, and you’ll keep the quality high.
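Chunking can be as simple as grouping whole sentences until a size limit is reached. The sketch below is one way to do it, not the only one: the 500-character default is an arbitrary illustration rather than a tested sweet spot, and the sentence splitter is a plain regex that will trip on abbreviations like "Dr."

```python
import re


def chunk_script(text, max_chars=500):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes out as its own request, and the resulting audio clips get joined in order afterwards.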

I managed to get all 12 voices done, polished, and exported by the end of the third afternoon. It was definitely a grind, but it was faster than any other method I’ve tried. The biggest takeaway? Don’t rush the initial cloning setup. If you put in the time to feed it clean audio, you save hours on the back end fixing weird audio glitches.

If you need high-quality voice work, ElevenLabs is the best tool I’ve used for professional projects like this one. The cost is higher than competitors, but the result is a massive leap in quality. If you have a tight deadline, just make sure you have a quiet room to record your source audio, and don’t try to bulk-process too much at once. Your mileage may vary, but it worked for me.