I tested nearly 10 local models with Aider, and finally figured out how automation actually works.
Bottom line first: If you want local LLMs to do real work, stop obsessing over model benchmarks. The agent framework (harness) matters more than the model itself.
I ran tests using a benchmark from an University of Edinburgh researcher — 17 models × 5 frameworks × 16 coding tasks (Python, Rust, C++), totaling 1,360 runs on an M3 Max laptop. The data size is solid, so the conclusions should be reliable.
Part 1: Model Ranking — Qwen is the Local King
Raw data here. Average pass rates across all 5 frameworks:
| Rank | Model | Quantization | Pass Rate | Avg Time |
|---|---|---|---|---|
| 1 | Qwen3.6-27B | Q4 | 82.5% | 474 sec |
| 2 | gemma-4-31b | Q4 | 81.2% | 582 sec |
| 3 | Qwen3.6-35B-A3B (MoE) | Q4 | 80.0% | 215 sec |
| 4 | gpt-oss-120b | MXFP4 | 77.5% | 67 sec |
| 5 | Qwen3-Coder-Next | Q4 | 70.0% | 162 sec |
Three interesting findings:
First, bigger isn’t always better. The 120B gpt-oss got beat by the 27B Qwen3.6. Architecture and training data matter more than parameter count. All that “size is everything” hype? Not really true for coding tasks.
Second, MoE architecture is insane value. Qwen3.6-35B-A3B (only activates 3B params per forward pass) gets roughly the same accuracy as the 27B dense model but runs more than twice as fast. If your GPU has limited VRAM, go MoE.
Third, multimodal models are actually worse. Qwen3-Omni landed at the bottom with 33.8% pass rate. Turns out “jack of all trades” means master of none. For coding agents, stick with code-specialized models.
Part 2: Frameworks Matter More — Aider Came Last
This was the biggest surprise. Same model, different frameworks, wildly different results:
| Framework | Avg Pass Rate | Avg Time |
|---|---|---|
| pi | 76.9% | 163 sec |
| qwen | 75.0% | 191 sec |
| claude | 66.2% | 306 sec |
| opencode | 63.8% | 271 sec |
| aider | 62.5% | 384 sec |
Aider ranked dead last. Not only lower accuracy, but also slowest (384 sec/task). What does this tell us? Tool-use scaffolding design affects real-world performance more than model capability alone.
Part 3: Why Does Aider Underperform? Technical Analysis
Looking at Aider’s docs and other benchmarks, several issues stand out:
1. The Cost of Deep Git Integration
Aider’s design philosophy puts Git at the core — auto-commit every change. Great for version management, but each commit means extra token burn and context switching. Test data shows Aider averages 105K tokens per task. More efficient frameworks (like Cursor) use 104K. The difference isn’t huge, but Aider is noticeably slower.
2. Inefficient Tool Calling
From benchmark logs: Aider “mentioned” test scripts 10 times across 16 tasks — meaning lots of redundant file probing. The pi framework triggered this zero times.
3. Single-Agent Architecture Limits
Aider currently only supports single-threaded conversation flow, no parallel sub-agents. Lightweight frameworks like opencode and pi have more direct tool-calling paths. This difference really shows on complex tasks.
4. The Architect Mode Double-Edged Sword
Aider’s architect mode (using o1 or Opus for planning, cheap models for editing) sounds great on paper — hitting 85% pass rates in official tests. But that’s with cloud models. Switch to local models and:
- Local inference is already slow
- Two inference passes (architect + editor) = double the time
- Local small models have weaker “planning” ability than giant cloud models, so the benefit shrinks
Part 4: When Should You Actually Use Aider?
Despite the benchmark numbers, Aider isn’t useless. For specific scenarios, it’s still the best choice:
1. Heavy Git Workflow Users
If you live in git rebase, cherry-pick, revert — auto-commit is genuinely great. Other frameworks leave you to commit manually. Aider handles the whole pipeline and can run tests automatically.
2. Want to Save Money
Aider is free and open source — you just pay API costs. Run it with local Ollama and costs hit zero. Heavy monthly use runs $30-80. Compare that to Cursor Pro ($20/month + overages) or Claude Code ($200+/month).
3. Multi-Model Switching
Aider supports any OpenAI-compatible API. You can plan with Claude on one task, code with DeepSeek on the next, swap anytime. Other frameworks either lock you into specific models or make switching painful.
4. Architect Mode (Cloud Model Scenario)
If you’re using cloud big models as architect (o1-preview, Claude Opus) and local small models as editor, Aider’s architect mode does save money. Official data: o1-preview plan + DeepSeek edit = 85% pass rate (SOTA). But this combo isn’t for pure local deployment.
Part 5: Practical Advice for Local-Only Users
If your goal is pure local deployment, no cloud APIs, here’s what I’d recommend:
Model Selection
- Top pick: Qwen3.6-35B-A3B (MoE) — high accuracy, fast, VRAM-friendly
- Alternative: Qwen3.6-27B (dense) — if MoE version is unstable for you
- Low VRAM: Qwen3-Coder-Next (70% pass rate, good enough)
Framework Choice
- Max accuracy: pi framework + Qwen3.6-27B — only combo that hit 16/16 perfect score in testing
- Max speed: gpt-oss-120b + pi — 34 sec/task, 15/16 pass rate
- Heavy Git user: Aider + Qwen3.6-27B — accept slightly lower accuracy (62.5%) for version management convenience
Quantization Tips
- Q4_K_M or Q4_K_XL is the sweet spot — good balance of speed and precision
- Don’t chase Q8 — tiny gain (a few percent), half the speed
- Never use Q2 — completely unusable for coding tasks
Part 6: Pain Points I Ran Into (And How to Avoid Them)
- Don’t trust vendor benchmarks. Those HumanEval scores are best-case runs. Run them through Aider and they drop 30%. Qwen3-Coder-Next claims 90%+ HumanEval — actual? 70%.
- Don’t touch the temperature parameter. Leave it at 0 or 0.1. Code generation doesn’t want randomness. I tried 0.8 once — formatting went completely off the rails.
- Longer context isn’t always better. 128K sounds great, but local models start losing attention past 32K. I measured Qwen3.6: 64K input dropped accuracy by 12 percentage points vs 32K.
- Aider’s watch mode is half-baked. Supposed to support IDE comment triggers — in practice, it misses half the time. Just run commands directly in the terminal.
- Quantization hurts coding tasks more than chat. Same Q4 quantization, a chat model might drop 3-5% quality — a coding model can drop 10-15%. Use fp16 if you have the VRAM.
Summary Table
| Use Case | Recommended Combo | Expected Pass Rate | Notes |
|---|---|---|---|
| Max accuracy | pi + Qwen3.6-27B (Q4) | ~100% | Slow but rock solid |
| Max speed | pi + gpt-oss-120b (quantized) | ~94% | 34 sec/task |
| Git workflow | Aider + Qwen3.6-35B-A3B | ~62-80% | Version management ease |
| Best value | pi + Qwen3.6-35B-A3B (MoE) | ~80% | Fast AND accurate |
Final honest take: Aider isn’t the performance king for local LLM scenarios. But it remains the gold standard for Git integration and workflow automation. If you care about accuracy and speed, use pi. If you care about version management and commit history, use Aider. They’re not mutually exclusive — I keep both installed. Simple tasks go through Aider for instant commits. Complex refactors go through the pi framework, then I commit manually afterward.