Eight agent frameworks, one task, one model

CrewAI burns roughly 7× the input tokens of pydantic-ai on the exact same task, and still finishes #6 of 8. Pydantic-ai tops the board at NDCG@3 0.857 with a clean 100% Hit@1. Vercel AI SDK is the cheapest by roughly 3× per run, then loses on justification quality. Same model, same tools, same prompt, and the framework still moves NDCG@3 from 0.570 to 0.857, a 50% spread.

same-model-same-task is an open benchmark I built to settle one question. When every framework calls the same model through the same tools on the same prompt, does the framework actually move the needle — or is it glue code with a brand?

The setup

                 same-model-same-task
─────────────────────────────────────────────────────────
Model            gemini-2.5-flash
Tools            4 (identical across frameworks)
Task             candidate–job ranking
Trials           30 per framework
Frameworks       8 (5 Python · 3 TypeScript)
Total runs       240
Scoring          NDCG@3 · Hit@1 · LLM-judged JustifQ

Five Python contenders (pydantic-ai, LangGraph, CrewAI, Google ADK, baseline-python) and three TypeScript ones (Vercel AI SDK, Mastra, baseline-typescript). NDCG@3 is the standard ranking score (0–1, higher is better): how well the agent’s top-3 picks match the rule-based gold top-3. A separate LLM judge rates justification quality on a 5-point scale.
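For readers who have not met the two ranking metrics, here is a minimal sketch of how NDCG@3 and Hit@1 can be computed for a single trial. It is not the benchmark's scoring code. The gain scheme (gold rank 1 earns 3, rank 2 earns 2, rank 3 earns 1) and the function names are assumptions made for illustration; the benchmark may weight relevance differently.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))


def ndcg_at_k(predicted, gold, k=3):
    """NDCG@k of a predicted ranking against a gold ranking.

    Assumption: the gold item at rank 1 has gain k, rank 2 has gain k - 1,
    and so on; anything outside the gold top-k has gain 0.
    """
    gains = {candidate: k - i for i, candidate in enumerate(gold[:k])}
    actual = dcg([gains.get(candidate, 0) for candidate in predicted[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True))
    return actual / ideal if ideal else 0.0


def hit_at_1(predicted, gold):
    """True when the agent's first pick is the gold first pick."""
    return bool(predicted) and bool(gold) and predicted[0] == gold[0]


# Hypothetical candidate IDs, not benchmark data.
gold_top3 = ["cand_a", "cand_b", "cand_c"]
agent_top3 = ["cand_b", "cand_a", "cand_c"]

print(round(ndcg_at_k(agent_top3, gold_top3), 3))
print(hit_at_1(agent_top3, gold_top3))
```

Under this scheme a perfect ordering scores 1.0, a top-3 with no overlap scores 0.0, and swapping the first two picks lands in between while failing Hit@1.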

The leaderboard

#  Framework             NDCG@3   Hit@1   Tokens in/out    Cost/run
───────────────────────────────────────────────────────────────────
1  pydantic-ai (Py)       0.857  100.0%    6,149 /   480   $0.0181
2  langgraph (Py)         0.823   92.6%    5,167 /   502   $0.0164
3  vercel-ai-sdk (TS)     0.662   77.8%    1,605 /   228   $0.0060
4  google-adk (Py)        0.621   72.4%    6,128 /   510   $0.0184
5  mastra (TS)            0.610   73.3%    6,154 /   548   $0.0189
6  crewai (Py)            0.598   69.2%   42,785 / 1,806   $0.1072
7  baseline-typescript    0.589   69.0%    5,897 /   495   $0.0177
8  baseline-python        0.570   65.2%    7,027 /   515   $0.0202

Two frameworks clear 0.80. The other six cluster between 0.57 and 0.66. The naked baselines finish last — frameworks do earn their keep, just not equally.
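The headline ratios in this post follow from the leaderboard rows by plain division. The sketch below copies the relevant columns from the table and recomputes them; the dictionary layout and helper name are illustrative choices, not part of the benchmark code.

```python
# Columns copied from the leaderboard: (NDCG@3, input tokens, cost per run in USD).
NDCG, TOKENS_IN, COST = 0, 1, 2

LEADERBOARD = {
    "pydantic-ai": (0.857, 6149, 0.0181),
    "langgraph": (0.823, 5167, 0.0164),
    "vercel-ai-sdk": (0.662, 1605, 0.0060),
    "google-adk": (0.621, 6128, 0.0184),
    "mastra": (0.610, 6154, 0.0189),
    "crewai": (0.598, 42785, 0.1072),
    "baseline-typescript": (0.589, 5897, 0.0177),
    "baseline-python": (0.570, 7027, 0.0202),
}


def ratio(metric, numerator, denominator):
    """Ratio of one framework's metric to another's."""
    return LEADERBOARD[numerator][metric] / LEADERBOARD[denominator][metric]


quality_spread = ratio(NDCG, "pydantic-ai", "baseline-python")    # best vs worst NDCG@3
crewai_tokens = ratio(TOKENS_IN, "crewai", "pydantic-ai")         # input-token overhead
crewai_cost = ratio(COST, "crewai", "pydantic-ai")                # cost overhead
vercel_saving = ratio(COST, "pydantic-ai", "vercel-ai-sdk")       # cheapest vs the leader
cost_spread = ratio(COST, "crewai", "vercel-ai-sdk")              # most vs least expensive

print(f"quality spread      {quality_spread:.2f}x")
print(f"crewai input tokens {crewai_tokens:.1f}x")
print(f"crewai cost         {crewai_cost:.1f}x")
print(f"vercel saving       {vercel_saving:.1f}x")
print(f"cost spread         {cost_spread:.1f}x")
```

The quality spread comes out near 1.5 (the 50% figure), CrewAI's input tokens near 7× and its cost near 6× pydantic-ai's, and the gap between the cheapest and most expensive framework is well over 10×.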

Where the framework actually shows up

Token mix. CrewAI sends ~43k input tokens per run when the same task fits in about 6k elsewhere: roughly 7× the input tokens and about 6× the cost per run ($0.1072 against $0.0181 for pydantic-ai), for a worse score. The orchestrator’s internal chatter (agent personas, role descriptions, planner-executor loops) leaks straight into the model context, and you pay for every round trip.

Latency tail. Google ADK posts a clean p50 of 19.9s and a p95 of 471.8s. Something in the framework hits a retry-or-stall cliff in the long tail that the others avoid. A median that looks fine can still hide an SLA bomb.
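To see how a healthy median can coexist with a painful p95, here is a small sketch using the nearest-rank percentile. The latency list is synthetic and purely illustrative, not the benchmark's raw data, and the benchmark may compute its percentiles with a different interpolation method.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in seconds (hypothetical): 28 ordinary runs and 2 stalls.
latencies = [20.0] * 28 + [450.0, 480.0]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)

print(f"p50 = {p50:.1f}s, p95 = {p95:.1f}s")
```

With 30 trials, two stalled runs are enough to drag p95 into the hundreds of seconds while p50 does not move at all, which is why the tail deserves its own column.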

Justification quality. Vercel AI SDK wins on cost by 3× but lands at 2.89/5 on justification: a short output budget (228 tokens) saves money and starves the reasoning. LangGraph posts the best score, 3.70/5, with 502 output tokens. You can buy cheaper answers; you cannot buy cheaper good answers.

The takeaway

Same model, same tools, same prompt — and the framework controls a 50% spread on quality and an order-of-magnitude spread on cost. The boring conclusion holds: pydantic-ai and LangGraph are the safe bets. The interesting one is that “agent framework” is mostly a token-discipline contract dressed up as orchestration.

Full methodology, per-framework drill-downs and raw runs at same-model-same-task.vercel.app.