Qwen3.6-27B 4-bit on Mac M4 48 GB

Qwen3.6-27B at 4-bit on a MacBook Pro M4 48 GB: 9.8 t/s solo, 26.9 t/s at 8× batching, 25 GB peak RAM. Full numbers below — single-request at three prompt sizes, continuous batching to 8×, and a capacity table for other Mac configs.

Setup

Machine     MacBook Pro M4, 48 GB unified memory
Runtime     oMLX
Model       Qwen3.6-27B-UD-MLX-4bit (Unsloth Dynamic quant on MLX)
Prompts     pp1024 / pp4096 / pp8192, generation 128 tokens

Weights: unsloth/Qwen3.6-27B-UD-MLX-4bit. Runtime: oMLX.

Single request

Test           TTFT (ms)   tg TPS   pp TPS    Peak RAM
───────────────────────────────────────────────────────
pp1024/tg128       8249      9.8    124.1     25.22 GB
pp4096/tg128      31673      9.6    129.3     26.63 GB
pp8192/tg128      64209      9.3    127.6     27.25 GB

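tg TPS (generation speed) barely drops as the context grows (9.8 → 9.3 from 1K to 8K), and peak RAM rises in line with the KV cache. TTFT grows almost linearly with the prompt because pp TPS (prefill speed) stays flat at roughly 124–129 t/s, so at 8K you wait about 64 seconds for the first token. Prefill is the slow part.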
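A quick sanity check on the table, as a minimal sketch: if TTFT is almost entirely prefill, it should be close to prompt tokens divided by pp TPS. The function name `estimated_ttft_ms` is made up for illustration; the numbers are the three rows above.

```python
# Rows from the single-request table: (prompt tokens, pp TPS, measured TTFT in ms).
RUNS = [
    (1024, 124.1, 8249),
    (4096, 129.3, 31673),
    (8192, 127.6, 64209),
]


def estimated_ttft_ms(prompt_tokens: int, pp_tps: float) -> float:
    """Prefill-only estimate: time to push the whole prompt through at pp_tps."""
    return prompt_tokens / pp_tps * 1000.0


for tokens, pp_tps, measured in RUNS:
    est = estimated_ttft_ms(tokens, pp_tps)
    print(f"pp{tokens}: estimated {est:.0f} ms, measured {measured} ms")
```

The estimates land within about 1% of the measured TTFT, which is consistent with prefill accounting for nearly all of the wait.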

Continuous batching (pp1024 / tg128)

Batch    tg TPS    Speedup    pp TPS
─────────────────────────────────────
1×        9.8      1.00×      124.1
2×       12.8      1.31×      125.9
4×       20.0      2.04×      125.5
8×       26.9      2.74×      124.5

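At 8× concurrency, total throughput is 2.74× the single-request baseline, while prompt processing stays flat. Scaling is sublinear: each of the 8 requests gets about 3.4 t/s, so batching buys aggregate throughput, not per-request speed. If you run agents, evals, or anything concurrent, that’s where the model earns its keep on a Mac.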
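The batching table can be unpacked a little further. This sketch recomputes the speedup column and adds two derived figures that are not in the original table: per-request tokens/s and scaling efficiency (speedup divided by batch size). The helper name `batch_stats` is hypothetical.

```python
# Aggregate tg TPS from the batching table, keyed by batch size.
BASELINE_TPS = 9.8
BATCH_TPS = {1: 9.8, 2: 12.8, 4: 20.0, 8: 26.9}


def batch_stats(batch: int, total_tps: float, baseline: float = BASELINE_TPS):
    """Return (speedup vs. solo, tokens/s per request, scaling efficiency)."""
    speedup = total_tps / baseline
    per_request = total_tps / batch
    efficiency = speedup / batch  # 1.0 would mean perfectly linear scaling
    return speedup, per_request, efficiency


for batch, tps in BATCH_TPS.items():
    speedup, per_request, efficiency = batch_stats(batch, tps)
    print(f"{batch}x: {speedup:.2f}x speedup, "
          f"{per_request:.1f} t/s per request, {efficiency:.0%} efficiency")
```

The efficiency figure is one way to see that batching raises total throughput while each individual stream slows down.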

Why bandwidth wins on Apple Silicon

Token generation on Apple Silicon is bandwidth-bound, not compute-bound. Each generated token reads the full weights once from unified memory, so smaller weights mean less data to move per token and faster generation. That’s why 4-bit is the working quant here. Prompt processing barely changes between quants: prefill pushes many tokens through each weight read, so it is limited by compute rather than by weight movement.
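The bandwidth argument can be written as a one-line bound. This is a rough sketch, not a model of this machine: the function name and both inputs are placeholders (the post does not state the memory bandwidth or the on-disk weight size), and real throughput sits below the bound because of KV-cache reads and runtime overhead.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream tokens/s if every token streams all weights once."""
    return bandwidth_gb_s / weights_gb


# Hypothetical inputs for illustration only; substitute your chip's published
# memory bandwidth and the actual size of the weight files you load.
BANDWIDTH_GB_S = 300.0
for label, weights_gb in [("smaller quant", 16.0), ("larger quant", 24.0)]:
    ceiling = decode_ceiling_tps(BANDWIDTH_GB_S, weights_gb)
    print(f"{label}: {weights_gb:.0f} GB of weights -> at most {ceiling:.1f} t/s")
```

The point is the ratio: with bandwidth fixed, the ceiling moves inversely with weight size, which is why a smaller quant generates faster.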

Context capacity

The KV cache grows linearly with context length. I measured ~0.46 MB/token at fp16. Base RAM at 4-bit is ~24.7 GB.

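Formula: peak ≈ 24.7 GB + (tokens × 0.46 MB). Leave 3–4 GB for the OS. The 8K run peaked at 27.25 GB against a predicted ~28.5 GB, so the formula errs on the conservative side at longer contexts.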

Mac unified RAM    Max usable context
──────────────────────────────────────
32 GB              ~8K
36 GB              ~16K
48 GB              ~40K          ← this machine
64 GB              ~80K
96 GB              ~150K+
128 GB             model cap

48 GB is the smallest Mac that runs this comfortably for solo work. 64 GB is the sweet spot.
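The capacity table above can be approximated from the formula. A minimal sketch, assuming decimal units (1 GB = 1000 MB) and a 4 GB OS reserve; the function names and the optional `model_cap` parameter are illustrative, and the model's actual context limit is not stated here, so it is left unset.

```python
from typing import Optional

BASE_GB = 24.7          # base RAM at 4-bit, from the text
KV_MB_PER_TOKEN = 0.46  # measured KV cost per token at fp16, from the text
MB_PER_GB = 1000.0      # assumption: decimal units


def peak_gb(tokens: int) -> float:
    """Estimated peak RAM in GB for a given context length."""
    return BASE_GB + tokens * KV_MB_PER_TOKEN / MB_PER_GB


def max_context_tokens(ram_gb: float, os_reserve_gb: float = 4.0,
                       model_cap: Optional[int] = None) -> int:
    """Largest context that fits after reserving RAM for the OS and the weights."""
    budget_gb = ram_gb - os_reserve_gb - BASE_GB
    tokens = max(0, int(budget_gb * MB_PER_GB / KV_MB_PER_TOKEN))
    return min(tokens, model_cap) if model_cap is not None else tokens


for ram in (32, 36, 48, 64, 96, 128):
    print(f"{ram} GB -> ~{max_context_tokens(ram) / 1000:.0f}K tokens")
```

Changing `os_reserve_gb` to 3.0 shifts every row up by roughly 2K tokens, which explains most of the rounding in the table.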

Why 4-bit, not 6-bit?

Quick check, since “more bits = better quality” is the easy assumption. Same machine, same prompts, 6-bit run for reference:

Metric              4-bit       6-bit       Δ (4-bit vs 6-bit)
──────────────────────────────────────────────────────────────
Peak RAM (pp1024)   25.22 GB    29.24 GB    -14%
tg TPS (solo)       9.8         8.2         +20%
tg TPS (8× batch)   26.9        19.1        +41%

The gap grows under batching — the bandwidth advantage adds up as you run more requests at once. On coding tasks, I can’t see a quality drop on the small evals I trust. Anyone telling you 4-bit “feels dumber” on this model owes you a benchmark.
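For reference, the Δ column is the 4-bit value relative to the 6-bit value. A small sketch that recomputes it from the table; `pct_delta` is a hypothetical helper name.

```python
# (4-bit value, 6-bit value) for each row of the comparison table.
ROWS = {
    "Peak RAM (pp1024), GB": (25.22, 29.24),
    "tg TPS (solo)": (9.8, 8.2),
    "tg TPS (8x batch)": (26.9, 19.1),
}


def pct_delta(four_bit: float, six_bit: float) -> float:
    """Percent change of the 4-bit value relative to the 6-bit value."""
    return (four_bit - six_bit) / six_bit * 100.0


for metric, (four_bit, six_bit) in ROWS.items():
    print(f"{metric}: {pct_delta(four_bit, six_bit):+.0f}%")
```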

The call

4-bit Qwen3.6-27B is the working choice for solo Mac use. Plenty of headroom on 48 GB, batching that scales to 8×, usable contexts up to ~40K. Run it.