Good Code, Wrong Model: How to Benchmark AI Coding Agents for Distributed SQL
We’ve been testing major AI coding agents against YugabyteDB’s distributed gotchas. This blog covers the design of our new distributed SQL benchmark, examines what breaks and why it breaks, and provides a handy open-source fix you can drop in today.
Lost in Translation
Our team has been using AI coding tools every day for three years. Reviewing internal PRs recently has felt like staring into an uncanny valley of database engineering: the code looked like PostgreSQL, but it wasn’t optimized for a distributed system.
The AI wasn’t failing because it lacked data. It was failing because it was too well-trained on the wrong database.
That observation turned into a question: how do you measure this disconnect? That question became a new benchmark.
We built this benchmark to be more rigorous than existing code generation benchmarks, which test syntax and logic but rarely evaluate whether a model understands the system it’s writing for. We wanted execution-based scoring, grounding ablation, and real cluster validation.
YugabyteDB is the right test case because the gap between what models produce and what distributed SQL requires is measurable, reproducible, and consequential.
The PostgreSQL Surface Trap
YugabyteDB runs the actual PostgreSQL query layer. This is not a reimplementation; it is the real thing on top of a distributed storage engine. Your existing PostgreSQL code runs, migrations work, and drivers connect. Range sharding means most queries behave exactly as expected coming from Postgres.
This is precisely what creates the AI problem. Models are trained on millions of single-node PostgreSQL examples, and most of that code works fine on YugabyteDB.
The failures live at the edges, in patterns that are valid on a single node but expensive or broken in a distributed system:
- SERIAL primary keys create write hotspots
- LISTEN/NOTIFY is a silent no-op (in versions before 2025.1)
- Missing retry logic drops transactions under concurrent writes
None of these produce an error at generation time. Most don’t produce one at test time either. They wait for production.
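The retry gap is a good example of how silent these failures are. Below is a minimal sketch of the shape such retry logic takes, assuming a DB-API-style connection and a driver that surfaces conflicts as an exception; the `SerializationError` class and `run_transaction` helper here are illustrative stand-ins, not part of any real driver or of the benchmark itself.

```python
import random
import time

class SerializationError(Exception):
    """Stand-in for a driver conflict error (e.g. SQLSTATE 40001)."""

def run_transaction(conn, txn_fn, max_attempts=5, base_delay=0.05):
    """Run txn_fn(conn) with retries, exponential backoff, and jitter.

    On conflict: ROLLBACK first, then sleep, then retry. This is the
    step single-node PostgreSQL code usually omits, because conflicts
    are rare on one node and common under distributed concurrency.
    """
    for attempt in range(max_attempts):
        try:
            result = txn_fn(conn)
            conn.commit()
            return result
        except SerializationError:
            conn.rollback()  # always roll back before retrying
            if attempt == max_attempts - 1:
                raise
            # exponential backoff with jitter to avoid retry stampedes
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The important detail is the rollback-before-retry: without it, the connection stays in an aborted transaction and every subsequent retry fails too.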
The full list of unsupported PostgreSQL features is in YugabyteDB docs. When models keep getting it wrong despite all the publicly available data, the problem isn’t data availability. The PostgreSQL training signal is strong enough to drown out the distributed systems context.
Benchmark Structure
The benchmark is built around three key pillars:
- Prompts that tempt real anti-patterns
- Grounding levels that isolate what the model knows from what we tell it
- A scoring engine that checks code against a live cluster rather than comparing strings
12 Prompts
Each prompt describes a real-world database scenario whose requirements naturally tempt PostgreSQL idioms that break or underperform in YugabyteDB. The traps aren’t labeled; they are baked into the problem description, exactly the way they appear in a real ticket.
A few examples:
- A prompt asking for “real-time notification through the database” tempts LISTEN/NOTIFY
- “No overlapping time ranges, enforced declaratively” tempts EXCLUDE USING
- “Durability is not needed, optimize for throughput” tempts UNLOGGED TABLE
- “Insert or update in one pass, most natural SQL construct” tempts MERGE INTO
Generated code is executed against a live YugabyteDB cluster. If execution fails, all dimension scores zero out. You can’t score well on awareness if your code doesn’t run. Comments and docstrings are stripped before regex checks, so a model can’t pass checklist items by stuffing keywords in comments.
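The comment-stripping step matters more than it sounds, because models love to narrate best practices they don’t follow. Here is a simplified sketch of the idea; the regexes and names are illustrative, not the benchmark’s actual checker:

```python
import re

def strip_sql_comments(sql: str) -> str:
    """Remove block comments (/* ... */) and line comments (-- ...)
    so keywords inside comments can't trigger or dodge a check."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)
    sql = re.sub(r"--[^\n]*", " ", sql)
    return sql

# Two illustrative anti-pattern checks; the real engine has many more.
ANTI_PATTERNS = {
    "serial_pk": re.compile(r"\bSERIAL\b", re.IGNORECASE),
    "listen_notify": re.compile(r"\b(LISTEN|NOTIFY)\b", re.IGNORECASE),
}

def find_anti_patterns(sql: str) -> set[str]:
    code = strip_sql_comments(sql)
    return {name for name, rx in ANTI_PATTERNS.items() if rx.search(code)}
```

With this ordering, a comment like `-- we avoid SERIAL here` scores nothing either way; only the executable SQL counts.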
6 Context Levels
We run every prompt through six levels of grounding, from zero-shot to skill file plus MCP. This lets us measure the exact delta between a raw model and one equipped with specific YugabyteDB knowledge.
| Level | Prompt includes |
|---|---|
| zero-shot (C0) | Description only |
| schema-aware | + setup SQL and schema context |
| expert-context (C1) | + curated YugabyteDB documentation snippets |
| skill-file (C2) | Skill file prepended + schema context |
| mcp (C3) | yugabyte-mcp server active, no skill file |
| skill-file-mcp (C4) | Skill file + MCP |
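As a toy illustration of how the prompt-side levels compose (the `build_prompt` helper and its layer ordering are ours, not the harness’s; MCP grounding is live tool access rather than prompt text, so it doesn’t appear here):

```python
def build_prompt(task: str, schema: str = "", skill_file: str = "") -> str:
    """Compose a benchmark-style prompt from optional grounding layers.

    task only     -> zero-shot
    + schema      -> schema-aware
    + skill_file  -> skill-file grounding
    """
    parts = []
    if skill_file:
        parts.append(skill_file)          # highest-priority context first
    if schema:
        parts.append("-- Schema context:\n" + schema)
    parts.append(task)                    # the actual ticket-style prompt
    return "\n\n".join(parts)
```

The point of structuring it this way is that each level is a strict superset of the one below it, so score deltas between levels isolate the contribution of one grounding layer at a time.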
Scoring Engine: 4 Dimensions
No vibe checks. The scoring engine uses regex, AST analysis, and cluster runtime evidence to calculate a weighted total.
D1 – Anti-Pattern Avoidance (25%): Static and AST analysis across Python, SQL, Java, Go, Node. Security findings fold in: SQL injection, hardcoded credentials, and missing TLS.
D2 – Positive Pattern Presence (35%): UUID primary keys, retry with backoff and jitter, ROLLBACK before retry, smart driver config, covering indexes, explicit sharding strategy. This dimension carries the most weight because this is where distributed-native engineering lives.
D3 – Architectural Quality (30%): Deterministic heuristic over structure: function decomposition, error handling, resource cleanup. It starts from a baseline and only adjusts modestly, so it tells you whether code is structured, not whether it’s brilliantly designed. We’re working on something better: feeding real cluster metrics to an LLM council and asking them to interpret numbers rather than review code.
D4 – Efficiency Proxy (10%): Real execution data from the cluster. Sequential scans are penalized, index scans are rewarded, pulled from pg_stat_statements after each run. Weighted at 10% so it informs the score without dominating it. We want the benchmark to reward awareness first.
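Combining the four dimensions is then a straight weighted sum, zeroed out entirely when execution fails. The sketch below uses the weights above with made-up dimension scores on the 0-5 scale the results tables suggest; it is not the benchmark’s actual implementation:

```python
# Dimension weights from the scoring engine described above.
WEIGHTS = {"d1_anti_pattern": 0.25, "d2_positive": 0.35,
           "d3_quality": 0.30, "d4_efficiency": 0.10}

def weighted_total(scores: dict) -> float:
    """Combine per-dimension scores (0-5 scale) into one weighted total.

    A failed execution zeroes every dimension first, so a non-running
    submission scores 0.0 regardless of what the code contains.
    """
    assert set(scores) == set(WEIGHTS)
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Hypothetical example scores, not taken from the results tables:
example = {"d1_anti_pattern": 4.0, "d2_positive": 3.0,
           "d3_quality": 2.0, "d4_efficiency": 5.0}
```

With these made-up inputs the total works out to 0.25·4.0 + 0.35·3.0 + 0.30·2.0 + 0.10·5.0 = 3.15, which is the kind of mid-range score a grounded model lands in the tables below.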
Benchmark Results
These tables compare zero-shot (C0) against skill-file (C2) only, and show Cursor-based models. The benchmark also has results for Claude Code, Gemini, and others.
This is a sample of current runs, not the full picture.
Baseline
The following table shows the raw, ungrounded performance of major models.
| Model | Avg Score | D1 Anti-Pattern | D2 Positive | D3 Quality | D4 Efficiency |
|---|---|---|---|---|---|
| GPT 5.3 | 2.66 | 0.87 | 3.91 | 2.63 | 2.81 |
| Gemini 3.1 Pro | 2.65 | 1.26 | 3.63 | 2.54 | 2.88 |
| Composer 1.5 | 2.99 | 2.83 | 3.53 | 2.50 | 2.56 |
| Claude 4.6 Opus | 2.70 | 0.79 | 3.92 | 2.67 | 3.50 |
| Grok | 2.91 | 2.07 | 4.07 | 2.33 | 2.50 |
D1 near zero means these models hit almost every anti-pattern when ungrounded.
GPT 5.3 and Claude 4.6 Opus score 0.87 and 0.79 respectively: SERIAL keys, OFFSET pagination, string-concatenated SQL on the majority of prompts.
D2 is higher because models produce some distributed-native patterns incidentally, not by design.
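Those zero-shot habits have concrete shapes. Below is an illustrative sketch (our SQL, not actual benchmark output) contrasting string-built OFFSET pagination with the parameterized keyset-pagination rewrite that scores as a positive pattern:

```python
# Single-node habit: string-built OFFSET pagination. OFFSET forces the
# database to fetch and discard rows, and concatenation invites injection.
def page_offset(page: int, size: int = 20) -> str:
    return ("SELECT id, title FROM posts ORDER BY id "
            f"LIMIT {size} OFFSET {page * size}")  # anti-pattern

# Distributed-friendly rewrite: parameterized keyset pagination.
# Each page seeks past the last key seen, so reads stay index-driven
# no matter how deep the page, and values travel as bind parameters.
PAGE_KEYSET = (
    "SELECT id, title FROM posts "
    "WHERE id > %(last_id)s ORDER BY id LIMIT %(size)s"
)
```

The table and column names are hypothetical; the shape is what matters, and it is exactly the kind of rewrite the skill file pushes models toward.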
Grounded Models (With Skill Files)
| Model | Avg Score | D1 Anti-Pattern | D2 Positive | D3 Quality | D4 Efficiency |
|---|---|---|---|---|---|
| GPT 5.3 | 3.77 | 4.66 | 4.26 | 2.63 | 2.69 |
| Gemini 3.1 Pro | 3.74 | 4.63 | 4.20 | 2.54 | 3.00 |
| Composer 1.5 | 3.59 | 4.92 | 3.68 | 2.50 | 2.56 |
| Claude 4.6 Opus | 3.49 | 3.75 | 4.10 | 2.67 | 2.63 |
| Grok | 3.35 | 3.04 | 4.59 | 2.33 | 2.50 |
D1 jumps from near zero to between 3.0 and 4.9. That’s the grounding doing its job.
D3 and D4 don’t move meaningfully between grounded and ungrounded. This is expected: D3 measures structural code quality, which skill files don’t target, and D4 is low-weight by design.
The real signal is D1 + D2: does the model know YugabyteDB, or does it just write PostgreSQL?
The Skills
The skill file closes the distributed systems gap in two directions: new code gets the right patterns from the start, and existing code gets concrete suggestions on what to change and why.
This is published on skills.sh under yugabyte/skills. You can install it with:
npx skills add yugabyte/skills
The difference between models at zero-shot is real but small. The difference between ungrounded and grounded is the gap between 0.79 and 3.75 on anti-pattern avoidance. Grounding is the higher-leverage intervention.
What This Means
Your AI isn’t broken; it’s just well-trained on a database that isn’t distributed. The skills are open source and available on our GitHub.
Stay tuned: in future posts we’ll deep-dive into how individual models handle specific traps, what MCP versus skill files actually change in generated code, efficiency analysis from real cluster metrics, and results from additional models and agent configurations.
Follow the Yugabyte blog for the latest insights, or join our new Discord server to share your experiences with agentic AI.