I tested nearly 13 JAX models across different hardware setups, ran thousands of inference tasks, and finally figured out how to calculate deployment costs properly.
Bottom line: If you’re under 500K daily tokens, cloud is cheaper. Around 1M daily tokens and up, self-hosting starts to win. But this threshold moves wildly depending on which model you pick, so choosing the right model matters more than choosing the deployment method.
Part 1: Complete Map of JAX-Compatible Models
As of May 2026, here are the actually usable models in the JAX ecosystem. Note: Google’s PaLM 2 runs on JAX internally but weights aren’t open source, so it’s not included.
| Model Series | Developer | Context Length | Params | JAX Support | Best For |
|---|---|---|---|---|---|
| Gemma 2 | Google | 8K | 2B/9B/27B | Native Flax training, official weights | General chat, code generation |
| Gemma 4 (new) | Google | 8K | 26B-A4B / 31B | Native JAX, MoE architecture | High-throughput inference |
| Qwen2.5 | Alibaba | 128K | 7B/14B/32B/72B | JAX weight conversion supported | Long context, complex reasoning |
| Qwen3.6 | Alibaba | 128K | 27B / 35B-A3B | Official JAX support | Coding agents, complex tasks |
| Qwen3-Coder-Next | Alibaba | 128K | — | JAX support | Code-only, programming tasks |
| gpt-oss-120b | OpenAI | 128K | 120B (MoE) | Open weights, JAX-compatible | General inference, high accuracy |
| gpt-oss-20b | OpenAI | 128K | 20B | Open weights, JAX-compatible | Lightweight deployment |
| Whisper JAX | OpenAI Community | 2K (audio) | large-v2/large-v3 | JAX reimplementation | Speech recognition, transcription |
💡 Selection Tips
Long context? Pick Qwen, no contest. 128K context is 16x Gemma 2’s 8K. In real testing, Qwen3.6 loses ~12% accuracy at 64K input compared with 32K, but Gemma 2 physically can’t fit 64K at all.
Coding tasks? Go with Qwen3-Coder-Next or Qwen3.6-27B. University of Edinburgh’s harness-bench shows Qwen3.6-27B hitting an 82.5% pass rate at Q4 (66 of 80 runs: 16 programming tasks across 5 frameworks). gemma-4-31b-it is close at 81.2% (65 of 80), but Qwen edges it.
Audio tasks? Whisper JAX is your only option. On TPUs it runs up to 70x faster than the PyTorch implementation, with 10 minutes of audio transcribed in 38 seconds. Just know it only does speech recognition, not text chat.


Part 2: Self-Hosted vs Cloud – Real Cost Breakdown
2.1 Cost Components
Self-hosted costs break into three buckets:
- Hardware purchase — one-time cost, amortize over 3 years
- Electricity — adds up if running 24/7
- Maintenance — your time has a dollar value
Cloud costs are simpler:
- Compute rental — hourly or per-token pricing
- Storage/bandwidth — platform-dependent
2.2 Hardware Configurations by Use Case
Based on AI deployment guides, different workloads need different setups:
| Use Case | Recommended CPU | GPU Requirements | RAM | Monthly Cost (3-yr amortization) |
|---|---|---|---|---|
| Entry-level (7B model) | i5-12490F (~$180) | RTX 3060 12GB | 32GB | ~$25/month |
| Single-card inference (13B) | i5-13400F (~$240) | RTX 4090 24GB | 64GB | ~$50/month |
| Multi-card training (27B+) | R9 7900X (~$600) | 2×A100 80GB | 128GB | ~$400/month |
| Lab cluster | Multi-node | 8×H100 | 2TB+ | Custom quote |
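Here is a rough sketch of how the monthly figures in that last column can be derived: purchase price spread over 36 months, plus electricity if the box runs around the clock. The build totals ($900 and $1,800), the 300 W average draw, and the $0.15/kWh rate are assumptions picked for illustration, not measured values.

```python
def monthly_self_hosted_cost(hardware_usd, amortization_months=36,
                             avg_power_watts=0.0, hours_per_day=24.0,
                             usd_per_kwh=0.0, days_per_month=30):
    """Amortized hardware cost plus electricity for one month."""
    amortized = hardware_usd / amortization_months
    kwh = avg_power_watts / 1000 * hours_per_day * days_per_month
    return amortized + kwh * usd_per_kwh


# Hypothetical build totals, picked to land near the table's monthly figures.
entry_level = monthly_self_hosted_cost(900)    # 900 / 36
single_card = monthly_self_hosted_cost(1800)   # 1800 / 36

# Same single-card build running 24/7 at an assumed 300 W and $0.15/kWh.
single_card_powered = monthly_self_hosted_cost(
    1800, avg_power_watts=300, usd_per_kwh=0.15)

print(f"entry-level:            ${entry_level:.2f}/month")
print(f"single card:            ${single_card:.2f}/month")
print(f"single card + power:    ${single_card_powered:.2f}/month")
```

With these made-up inputs, electricity adds about $32/month on top of the $50 amortized hardware, which is why a 24/7 workload deserves its own line in the budget.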
2.3 Cloud Compute Pricing Reference
| Platform | Config Example | Price | Best For |
|---|---|---|---|
| Saturn Cloud | 1×V100 GPU | $0.50-1.50/hour | Multi-cloud (AWS/Azure/GCP) |
| Hugging Face Inference Endpoints | 1×A10G | $0.60/hour | Auto-scaling, per-second billing |
| Together.ai | Llama-3-70B API | $0.90/million tokens | Serverless, no management overhead |
| Replicate | Gemma 2-27B | $0.40/million tokens | Low-frequency usage, per-token billing |
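Hourly and per-token prices are hard to compare by eye, so a small conversion helps. The sketch below uses the $0.60/hour and $0.90/million-token rates from the table; the assumption that an hourly instance stays up 24/7 and can actually serve the volume in question is mine.

```python
def hourly_monthly_cost(usd_per_hour, hours_per_day=24, days=30):
    """Monthly bill for an instance billed by the hour."""
    return usd_per_hour * hours_per_day * days


def per_token_monthly_cost(usd_per_million, daily_tokens, days=30):
    """Monthly bill for a per-token API at a given daily volume."""
    return daily_tokens * days / 1_000_000 * usd_per_million


always_on = hourly_monthly_cost(0.60)                # 1xA10G kept up all month
api_1m = per_token_monthly_cost(0.90, 1_000_000)     # 1M tokens/day on the API

# Daily volume at which the per-token bill matches the always-on instance.
matching_daily_tokens = always_on / 0.90 * 1_000_000 / 30

print(f"always-on hourly instance: ${always_on:.2f}/month")
print(f"per-token API at 1M/day:   ${api_1m:.2f}/month")
print(f"bills match at about {matching_daily_tokens:,.0f} tokens/day")
```

The takeaway is that an always-on hourly instance only makes sense at volumes far above where per-token billing is still cheap. Per-second billing with scale-to-zero changes that picture.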
2.4 Break-Even Point (Tested with Qwen3.6-27B)
| Daily Tokens | Monthly Volume | Cloud Cost (Together.ai) | Self-Hosted Cost | Cheaper |
|---|---|---|---|---|
| 10K | 300K | $0.27 | ~$25 | ✅ Cloud |
| 100K | 3M | $2.70 | ~$25 | ✅ Cloud |
| 500K | 15M | $13.50 | ~$25 | Close |
| 1M | 30M | $27 | ~$25 | ✅ Self-hosted |
| 10M | 300M | $270 | ~$30 | ✅ Self-hosted |
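Break-even point is around 500K-1M daily tokens: at $0.90 per million tokens, ~$25/month of hardware buys about 27.8M tokens, or roughly 925K tokens a day. Below that, cloud wins. Above that, self-hosting starts paying off.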
⚠️ But don’t ignore maintenance. If you earn $30/hour and spend 20 hours/month on server maintenance, that’s $600 of “invisible cost” on top of hardware.
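The break-even arithmetic is simple enough to script. This sketch uses the numbers above (~$25/month self-hosted, $0.90 per million tokens, $30/hour for 20 hours of maintenance); the function name and the 30-day month are my own simplifications.

```python
def break_even_daily_tokens(self_hosted_usd_per_month,
                            cloud_usd_per_million_tokens, days=30):
    """Daily token volume at which cloud spend equals self-hosted spend."""
    monthly_tokens = (self_hosted_usd_per_month
                      / cloud_usd_per_million_tokens * 1_000_000)
    return monthly_tokens / days


# Hardware only: ~$25/month against $0.90 per million tokens.
hardware_only = break_even_daily_tokens(25, 0.90)

# Add the "invisible cost": 20 hours/month of maintenance at $30/hour.
maintenance = 30 * 20
with_maintenance = break_even_daily_tokens(25 + maintenance, 0.90)

print(f"hardware only:    {hardware_only:,.0f} tokens/day")
print(f"with maintenance: {with_maintenance:,.0f} tokens/day")
```

Hardware alone breaks even at roughly 926K tokens/day. Once maintenance time is priced in, the break-even jumps to about 23M tokens/day, so be honest about how many hours you will really spend.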
2.5 Whisper JAX Special Case
Audio transcription has completely different economics:
| Deployment | Config | Time for 10hr Audio | Cost Estimate |
|---|---|---|---|
| Local GPU | RTX 4090 | ~15 min | Negligible electricity |
| Local TPU | TPU v4-8 | ~2 min | Can’t buy — rental only |
| Cloud GPU (V100) | 1 card | ~75 min | ~$1.25 |
| Cloud TPU (v4-8) | 8 cores | ~10 min | ~$3.00 |
Whisper JAX is insane on TPUs — 1 hour of audio in 13.8 seconds. But you can’t buy TPUs as a consumer, only rent them. For 24/7 streaming transcription, local RTX 4090 wins. For batch processing, hourly TPU rental is better.
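To compare the cloud rows on equal footing, it helps to back out the hourly rate each cost estimate implies and the cost per hour of audio. The helper names below are hypothetical; the inputs are the table's own figures for a 10-hour batch.

```python
def implied_hourly_rate(cost_usd, processing_minutes):
    """Hourly rental rate implied by a batch's cost and processing time."""
    return cost_usd / (processing_minutes / 60)


def cost_per_audio_hour(cost_usd, audio_hours):
    """What each hour of transcribed audio costs."""
    return cost_usd / audio_hours


def realtime_factor(audio_minutes, processing_minutes):
    """How many times faster than real time the batch ran."""
    return audio_minutes / processing_minutes


AUDIO_HOURS = 10
for name, minutes, cost in [("Cloud GPU (V100)", 75, 1.25),
                            ("Cloud TPU (v4-8)", 10, 3.00)]:
    rate = implied_hourly_rate(cost, minutes)
    per_hour = cost_per_audio_hour(cost, AUDIO_HOURS)
    speed = realtime_factor(AUDIO_HOURS * 60, minutes)
    print(f"{name}: ~${rate:.2f}/hr rental, "
          f"${per_hour:.3f} per audio hour, {speed:.0f}x real time")
```

On these numbers the TPU costs more per audio hour but finishes far sooner, so the choice comes down to whether turnaround time or dollars matters more for your batch.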
Part 3: Agent Frameworks Matter More Than Hardware
University of Edinburgh’s harness-bench ran 17 models × 5 frameworks × 16 coding tasks. Here’s what they found:
3.1 Framework Ranking (10 Q4 models averaged)
| Framework | Pass Rate | Avg Time/Task | Notes |
|---|---|---|---|
| pi | 76.9% | 163 sec | Fastest + most accurate |
| qwen | 75.0% | 191 sec | Second most accurate |
| claude | 66.2% | 306 sec | Middle of the pack |
| opencode | 63.8% | 271 sec | Meh |
| aider | 62.5% | 384 sec | Dead last |
pi was the only framework to hit a perfect 16/16 score, paired with Qwen3.6-27B.
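One way to fold accuracy and speed into a single number is "passed tasks per hour". This is my own derived metric, not something the benchmark reports, and it assumes tasks run back to back and that the average time applies to failed tasks too.

```python
# (pass rate, average seconds per task) from the ranking table.
FRAMEWORKS = {
    "pi": (0.769, 163),
    "qwen": (0.750, 191),
    "claude": (0.662, 306),
    "opencode": (0.638, 271),
    "aider": (0.625, 384),
}


def passes_per_hour(pass_rate, seconds_per_task):
    """Expected number of passing tasks in one hour of sequential runs."""
    return pass_rate * 3600 / seconds_per_task


ranked = sorted(FRAMEWORKS,
                key=lambda name: passes_per_hour(*FRAMEWORKS[name]),
                reverse=True)

for name in ranked:
    print(f"{name:9s} {passes_per_hour(*FRAMEWORKS[name]):5.1f} passes/hour")
```

By this measure pi still leads at about 17 passes/hour, and opencode moves ahead of claude because its shorter task time outweighs the small accuracy gap.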
3.2 Best Combinations
| Combo | Pass Rate | Time/Task | Best For |
|---|---|---|---|
| Qwen3.6-27B + pi | 16/16 (100%) | 207 sec | Perfect accuracy |
| gpt-oss-120b + pi | 15/16 (94%) | 34 sec | Speed above all |
| Qwen3.6-35B-A3B + qwen | 15/16 (94%) | 108 sec | Value-for-money |
MoE architecture (Qwen3.6-35B-A3B) is a beast: 35B total params, but only about 3B are active for each token. Speed and accuracy are both near the top.
Part 4: Quantization Impact – Q4 is Fine, Don’t Chase Q8
This is where everyone screws up. The benchmark compared Q4 vs Q8 across 7 models:
| Model | Q4 Pass Rate | Q8 Pass Rate | Delta | Q8 Speed Impact |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 64/80 | 66/80 | +2 | Slower |
| Qwen3.6-27B | 66/80 | 65/80 | -1 | Slower |
| gemma-4-31b-it | 65/80 | 59/80 | -6 | Slower |
| gpt-oss-20b | 45/80 | 52/80 | +7 | Slower |
| Total (7 models) | 406/560 | 400/560 | -6 (-1.1 pts) | Significantly slower |
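Bottom line: Q8 is NOT worth it. Overall pass rate actually dropped about 1 percentage point (72.5% to 71.4%), and inference got noticeably slower. On Apple Silicon, where token generation is limited by memory bandwidth, 8-bit weights mean roughly twice the bytes read per token, and that cost is brutal.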
Use Q4_K_M or Q4_K_XL — best balance of speed and accuracy. Never touch Q2, completely unusable for coding.
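If you want to check the Q4 vs Q8 numbers yourself, the deltas are easy to recompute in percentage points. The sketch below uses only the rows shown in the table plus the reported 7-model totals; the function and variable names are mine.

```python
RUNS_PER_MODEL = 80  # runs per model in the table (16 tasks x 5 frameworks)

# model: (Q4 passes, Q8 passes)
Q4_VS_Q8 = {
    "Qwen3.6-35B-A3B": (64, 66),
    "Qwen3.6-27B": (66, 65),
    "gemma-4-31b-it": (65, 59),
    "gpt-oss-20b": (45, 52),
}


def delta_points(q4_passed, q8_passed, runs):
    """Change in pass rate from Q4 to Q8, in percentage points."""
    return (q8_passed - q4_passed) / runs * 100


per_model = {name: delta_points(q4, q8, RUNS_PER_MODEL)
             for name, (q4, q8) in Q4_VS_Q8.items()}

# Totals across all 7 benchmarked models, as reported in the table.
overall = delta_points(406, 400, 560)

for name, delta in per_model.items():
    print(f"{name:16s} {delta:+.2f} pts")
print(f"{'overall':16s} {overall:+.2f} pts")
```

The per-model swings go both ways (from -7.5 to +8.75 points), which suggests a lot of the difference is run-to-run noise rather than a real Q8 advantage.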
Part 5: Three Typical Scenarios – What Actually Works
Scenario 1: Individual Developer / Student (<$70/month budget)
- Recommendation: Cloud API (Together.ai/Replicate) + Qwen2.5-7B
- Why: Under 100K daily tokens, cloud costs <$10/month. Save your GPU money for beer.
- Alternative: Used RTX 3060 12GB (~$200) + Ollama running Qwen2.5-7B-Q4
Scenario 2: Startup / Small Business ($300-700/month budget)
- Recommendation: Self-built server (i5-13400F + 2×RTX 4090 ≈ $5,000) running Qwen3.6-27B-Q4
- Why: At 5M+ daily tokens, self-hosting beats cloud on price
- Framework: pi + Qwen3.6-27B, 100% pass rate
- Quantization: Use Q4, don’t touch Q8
Scenario 3: Large-Scale Production (>1B tokens/month)
- Recommendation: Hybrid, with fixed high-frequency workloads on a local cluster and long-tail low-frequency workloads on a cloud API
- Local cluster: Multi-node R9 7900X + 8×H100, running quantized gpt-oss-120b
- Cloud overflow: Saturn Cloud hourly expansion for traffic spikes
- Framework: pi + multi-model load balancing
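The hybrid idea in Scenario 3 boils down to "fill the local cluster first, spill the rest to cloud". Here is a minimal sketch of that routing rule. The `HybridRouter` class, the per-minute token budget, and the capacity figure are all hypothetical; a real setup would measure capacity from the cluster rather than hard-code it.

```python
from dataclasses import dataclass


@dataclass
class HybridRouter:
    """Send requests to the local cluster until its budget is used up."""
    local_capacity_tokens_per_min: int  # assumed capacity, not a measured value
    used_this_minute: int = 0

    def route(self, request_tokens: int) -> str:
        fits = (self.used_this_minute + request_tokens
                <= self.local_capacity_tokens_per_min)
        if fits:
            self.used_this_minute += request_tokens
            return "local"
        return "cloud"  # overflow goes to the cloud API

    def reset_window(self) -> None:
        """Call once per minute to start a fresh budget window."""
        self.used_this_minute = 0


router = HybridRouter(local_capacity_tokens_per_min=10_000)
routes = [router.route(tokens) for tokens in (4_000, 5_000, 3_000, 1_000)]
print(routes)
```

In this toy run the third request overflows to cloud because it would push the window past 10,000 tokens, while the smaller fourth request still fits locally.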
Part 6: Pain Points I Learned the Hard Way
- Slow first JAX run is normal. JIT compilation takes 30 seconds to a few minutes. Don’t reinstall everything like I did.
- You can’t buy TPUs. Whisper JAX is incredible on TPU, but regular people can’t buy them. Rent or don’t bother.
- Aider is the worst fit for local LLMs. 62.5% pass rate vs pi’s 76.9%. Only use it if you’re a Git power user.
- Longer context isn’t always better. Qwen3.6 lost 12% accuracy at 64K vs 32K. Don’t cram tokens if you don’t have to.
- JAX’s caching is unpredictable. Whisper JAX was sometimes slower than PyTorch depending on cache hit rates.
Summary Table
| Your Situation | Recommended Setup | Monthly Cost | Pass/Accuracy Rate |
|---|---|---|---|
| <500K daily tokens | Together.ai API + Qwen2.5-7B | <$15 | Good enough |
| 500K-5M daily tokens | Self-built i5 + RTX 4090 + Qwen3.6-27B-Q4 | ~$25-50 | 82-100% |
| >5M daily tokens | Self-built cluster + pi framework | $300+ | 94-100% |
| Batch audio transcription | Cloud TPU hourly + Whisper JAX | Usage-based | 70x speed |
| Max speed required | gpt-oss-120b quantized + pi framework | Depends | 94% @34 sec/task |
Self-hosting doesn’t always save money, and cloud isn’t always convenient. The key is knowing your daily token volume and acceptable latency. Under 500K tokens/day? Use cloud APIs and don’t overthink it. Above that? Start shopping for hardware. And when you do: Q4 quantization + pi framework + Qwen3.6-27B is currently the gold standard for local deployment.
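That closing rule of thumb fits in a few lines of code. The sketch below uses the 500K tokens/day cutoff from the paragraph above and the 5M cutoff from the summary table; the hard boundaries and the function name are a simplification on my part, since the real threshold shifts with the model and your maintenance time.

```python
def recommend_setup(daily_tokens: int) -> str:
    """Map daily token volume to the article's rough deployment tiers."""
    if daily_tokens < 500_000:
        return "cloud API"
    if daily_tokens <= 5_000_000:
        return "self-built single server (Q4 + pi + Qwen3.6-27B)"
    return "self-built cluster"


for volume in (50_000, 1_000_000, 20_000_000):
    print(f"{volume:>12,} tokens/day -> {recommend_setup(volume)}")
```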