Aider: How to Automate Complex Coding Tasks with Local LLMs

I tested nearly 10 local models with Aider, and finally figured out how automation actually works.

Bottom line first: If you want local LLMs to do real work, stop obsessing over model benchmarks. The agent framework (harness) matters more than the model itself.

I ran tests using a benchmark from an University of Edinburgh researcher — 17 models × 5 frameworks × 16 coding tasks (Python, Rust, C++), totaling 1,360 runs on an M3 Max laptop. The data size is solid, so the conclusions should be reliable.

Part 1: Model Ranking — Qwen is the Local King

Raw data here. Average pass rates across all 5 frameworks:

Rank	Model	Quantization	Pass Rate	Avg Time
1	Qwen3.6-27B	Q4	82.5%	474 sec
2	gemma-4-31b	Q4	81.2%	582 sec
3	Qwen3.6-35B-A3B (MoE)	Q4	80.0%	215 sec
4	gpt-oss-120b	MXFP4	77.5%	67 sec
5	Qwen3-Coder-Next	Q4	70.0%	162 sec

Three interesting findings:

First, bigger isn’t always better. The 120B gpt-oss got beat by the 27B Qwen3.6. Architecture and training data matter more than parameter count. All that “size is everything” hype? Not really true for coding tasks.

Second, MoE architecture is insane value. Qwen3.6-35B-A3B (only activates 3B params per forward pass) gets roughly the same accuracy as the 27B dense model but runs more than twice as fast. If your GPU has limited VRAM, go MoE.

Third, multimodal models are actually worse. Qwen3-Omni landed at the bottom with 33.8% pass rate. Turns out “jack of all trades” means master of none. For coding agents, stick with code-specialized models.

Part 2: Frameworks Matter More — Aider Came Last

This was the biggest surprise. Same model, different frameworks, wildly different results:

Framework	Avg Pass Rate	Avg Time
pi	76.9%	163 sec
qwen	75.0%	191 sec
claude	66.2%	306 sec
opencode	63.8%	271 sec
aider	62.5%	384 sec

Aider ranked dead last. Not only lower accuracy, but also slowest (384 sec/task). What does this tell us? Tool-use scaffolding design affects real-world performance more than model capability alone.

Part 3: Why Does Aider Underperform? Technical Analysis

Looking at Aider’s docs and other benchmarks, several issues stand out:

1. The Cost of Deep Git Integration

Aider’s design philosophy puts Git at the core — auto-commit every change. Great for version management, but each commit means extra token burn and context switching. Test data shows Aider averages 105K tokens per task. More efficient frameworks (like Cursor) use 104K. The difference isn’t huge, but Aider is noticeably slower.

2. Inefficient Tool Calling

From benchmark logs: Aider “mentioned” test scripts 10 times across 16 tasks — meaning lots of redundant file probing. The pi framework triggered this zero times.

3. Single-Agent Architecture Limits

Aider currently only supports single-threaded conversation flow, no parallel sub-agents. Lightweight frameworks like opencode and pi have more direct tool-calling paths. This difference really shows on complex tasks.

4. The Architect Mode Double-Edged Sword

Aider’s architect mode (using o1 or Opus for planning, cheap models for editing) sounds great on paper — hitting 85% pass rates in official tests. But that’s with cloud models. Switch to local models and:

Local inference is already slow
Two inference passes (architect + editor) = double the time
Local small models have weaker “planning” ability than giant cloud models, so the benefit shrinks

Part 4: When Should You Actually Use Aider?

Despite the benchmark numbers, Aider isn’t useless. For specific scenarios, it’s still the best choice:

1. Heavy Git Workflow Users

If you live in git rebase, cherry-pick, revert — auto-commit is genuinely great. Other frameworks leave you to commit manually. Aider handles the whole pipeline and can run tests automatically.

2. Want to Save Money

Aider is free and open source — you just pay API costs. Run it with local Ollama and costs hit zero. Heavy monthly use runs $30-80. Compare that to Cursor Pro ($20/month + overages) or Claude Code ($200+/month).

3. Multi-Model Switching

Aider supports any OpenAI-compatible API. You can plan with Claude on one task, code with DeepSeek on the next, swap anytime. Other frameworks either lock you into specific models or make switching painful.

4. Architect Mode (Cloud Model Scenario)

If you’re using cloud big models as architect (o1-preview, Claude Opus) and local small models as editor, Aider’s architect mode does save money. Official data: o1-preview plan + DeepSeek edit = 85% pass rate (SOTA). But this combo isn’t for pure local deployment.

Part 5: Practical Advice for Local-Only Users

If your goal is pure local deployment, no cloud APIs, here’s what I’d recommend:

Model Selection

Top pick: Qwen3.6-35B-A3B (MoE) — high accuracy, fast, VRAM-friendly
Alternative: Qwen3.6-27B (dense) — if MoE version is unstable for you
Low VRAM: Qwen3-Coder-Next (70% pass rate, good enough)

Framework Choice

Max accuracy: pi framework + Qwen3.6-27B — only combo that hit 16/16 perfect score in testing
Max speed: gpt-oss-120b + pi — 34 sec/task, 15/16 pass rate
Heavy Git user: Aider + Qwen3.6-27B — accept slightly lower accuracy (62.5%) for version management convenience

Quantization Tips

Q4_K_M or Q4_K_XL is the sweet spot — good balance of speed and precision
Don’t chase Q8 — tiny gain (a few percent), half the speed
Never use Q2 — completely unusable for coding tasks

Part 6: Pain Points I Ran Into (And How to Avoid Them)

Don’t trust vendor benchmarks. Those HumanEval scores are best-case runs. Run them through Aider and they drop 30%. Qwen3-Coder-Next claims 90%+ HumanEval — actual? 70%.
Don’t touch the temperature parameter. Leave it at 0 or 0.1. Code generation doesn’t want randomness. I tried 0.8 once — formatting went completely off the rails.
Longer context isn’t always better. 128K sounds great, but local models start losing attention past 32K. I measured Qwen3.6: 64K input dropped accuracy by 12 percentage points vs 32K.
Aider’s watch mode is half-baked. Supposed to support IDE comment triggers — in practice, it misses half the time. Just run commands directly in the terminal.
Quantization hurts coding tasks more than chat. Same Q4 quantization, a chat model might drop 3-5% quality — a coding model can drop 10-15%. Use fp16 if you have the VRAM.

Summary Table

Use Case	Recommended Combo	Expected Pass Rate	Notes
Max accuracy	pi + Qwen3.6-27B (Q4)	~100%	Slow but rock solid
Max speed	pi + gpt-oss-120b (quantized)	~94%	34 sec/task
Git workflow	Aider + Qwen3.6-35B-A3B	~62-80%	Version management ease
Best value	pi + Qwen3.6-35B-A3B (MoE)	~80%	Fast AND accurate

Final honest take: Aider isn’t the performance king for local LLM scenarios. But it remains the gold standard for Git integration and workflow automation. If you care about accuracy and speed, use pi. If you care about version management and commit history, use Aider. They’re not mutually exclusive — I keep both installed. Simple tasks go through Aider for instant commits. Complex refactors go through the pi framework, then I commit manually afterward.