Good Code, Wrong Model: How to Benchmark AI Coding Agents for Distributed SQL

We’ve been testing major AI coding agents against YugabyteDB’s distributed gotchas. This blog covers the design of our new distributed SQL benchmark, examines what breaks and why, and provides an open-source fix you can drop in today.

Lost in Translation

Our team has been using AI coding tools every day for three years. Reviewing internal PRs recently has felt like staring into an uncanny valley of database engineering. The code looked like PostgreSQL, but it was not written for a distributed system. The AI wasn't failing because it lacked data; it was failing because it was too well-trained on the wrong database. That observation turned into a question: how do you measure this disconnect? That question became a new benchmark.

We built this benchmark to be more rigorous than existing code-generation benchmarks, which test syntax and logic but rarely evaluate whether a model understands the system it’s writing for. We wanted execution-based scoring, grounding ablation, and validation against a real cluster. YugabyteDB is the right test case because the gap between what models produce and what distributed SQL requires is measurable, reproducible, and consequential.

The PostgreSQL Surface Trap

YugabyteDB runs the actual PostgreSQL query layer. This is not a reimplementation, but the real thing on top of a distributed storage engine. Your existing PostgreSQL code runs, migrations work, and drivers connect. Range sharding means most queries behave exactly as expected coming from Postgres. This compatibility is precisely what creates the AI problem. Models are trained on millions of single-node PostgreSQL examples, and most of that code works fine on YugabyteDB. The failures are at the edges: patterns that are valid on a single node but expensive or broken in a distributed system:
  • SERIAL primary keys create write hotspots
  • LISTEN/NOTIFY is a silent no-op (in versions before 2025.1)
  • Missing retry logic drops transactions under concurrent writes
None of these produce an error at generation time. Most don't produce one at test time either. They wait for production. The full list of unsupported PostgreSQL features is in YugabyteDB docs. When models keep getting it wrong despite all the publicly available data, the problem isn't data availability. The PostgreSQL training signal is strong enough to drown out the distributed systems context.
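The third bullet is worth making concrete. Below is a minimal Python sketch of client-side transaction retry with exponential backoff and jitter, the pattern a distributed database needs under concurrent writes. It assumes a psycopg2-style connection object with `commit()`/`rollback()`; `TransientTxnError`, the helper name, and the delay constants are illustrative, not part of the benchmark's harness.

```python
import random
import time

# SQLSTATE codes that are safe to retry: serialization failure and deadlock.
RETRYABLE_SQLSTATES = {"40001", "40P01"}

class TransientTxnError(Exception):
    """Stand-in for a driver error that carries a SQLSTATE code."""
    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate

def run_txn_with_retry(conn, txn_fn, max_attempts=5, base_delay=0.05):
    """Run txn_fn(conn) in a transaction, retrying transient failures.

    Rolls back before every retry and backs off exponentially with
    jitter, so concurrent writers don't retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = txn_fn(conn)
            conn.commit()
            return result
        except TransientTxnError as err:
            conn.rollback()  # ROLLBACK before retry, never reuse the aborted txn
            if err.sqlstate not in RETRYABLE_SQLSTATES or attempt == max_attempts:
                raise
            # exponential backoff with jitter
            time.sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

A single-node mindset omits this wrapper entirely, which is exactly the kind of silent gap that waits for production.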

Benchmark Structure

The benchmark is built around three key pillars:
  1. Prompts that tempt real anti-patterns
  2. Grounding levels that isolate what the model knows from what we tell it
  3. A scoring engine that checks code against a live cluster rather than comparing strings

12 Prompts

Each prompt describes a real-world database scenario whose requirements naturally tempt PostgreSQL idioms that break or underperform in YugabyteDB. The traps aren't labeled; they are baked into the problem description, exactly the way they appear in a real ticket. A few examples:
  • A prompt asking for "real-time notification through the database" tempts LISTEN/NOTIFY
  • "No overlapping time ranges, enforced declaratively" tempts EXCLUDE USING
  • "Durability is not needed, optimize for throughput" tempts UNLOGGED TABLE
  • "Insert or update in one pass, most natural SQL construct" tempts MERGE INTO
Generated code is executed against a live YugabyteDB cluster. If execution fails, all dimension scores zero out. You can't score well on awareness if your code doesn't run. Comments and docstrings are stripped before regex checks, so a model can't pass checklist items by stuffing keywords in comments.
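The comment-stripping step can be sketched in a few lines. This is a simplified illustration of the idea, not the benchmark's actual checker: regexes like these don't handle comment markers inside string literals, which a production scorer would need to.

```python
import re

# Strip SQL comments first so keyword checks only see executable code,
# not documentation a model might stuff with the right words.
SQL_LINE_COMMENT = re.compile(r"--[^\n]*")
SQL_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
SERIAL_PK = re.compile(r"\bSERIAL\b", re.IGNORECASE)

def strip_sql_comments(sql: str) -> str:
    """Remove block comments, then line comments (naive sketch)."""
    return SQL_LINE_COMMENT.sub("", SQL_BLOCK_COMMENT.sub("", sql))

def hits_serial_antipattern(sql: str) -> bool:
    """True only if SERIAL appears outside comments."""
    return bool(SERIAL_PK.search(strip_sql_comments(sql)))
```

With this in place, mentioning SERIAL in a comment earns nothing; only the executable DDL is judged.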

5 Context Levels

We run every prompt through five levels of grounding. This lets us measure the exact delta between a raw model and one equipped with specific YugabyteDB knowledge.
| Level | Prompt includes |
| --- | --- |
| zero-shot (C0) | Description only |
| schema-aware | + setup SQL and schema context |
| expert-context (C1) | + curated YugabyteDB documentation snippets |
| skill-file (C2) | Skill file prepended + schema context |
| mcp (C3) | yugabyte-mcp server active, no skill file |
| skill-file-mcp (C4) | Skill file + MCP |
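The levels above can be pictured as a simple prompt assembler. This is a hypothetical sketch to show how the ablation isolates each ingredient; the benchmark's actual harness (and the MCP levels, which work through tool calls rather than prompt text) is more involved.

```python
def build_prompt(description, schema=None, skill_file=None, docs=None):
    """Assemble a prompt for one grounding level (illustrative only).

    zero-shot (C0) passes the description alone; richer levels prepend
    the skill file and append schema or curated doc snippets.
    """
    parts = []
    if skill_file:   # skill-file (C2) and skill-file-mcp (C4)
        parts.append(skill_file)
    parts.append(description)
    if schema:       # schema-aware and above
        parts.append("-- schema --\n" + schema)
    if docs:         # expert-context (C1)
        parts.append("-- docs --\n" + docs)
    return "\n\n".join(parts)
```

Holding the description fixed while toggling the other arguments is what makes the delta between levels attributable to the grounding, not the task.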

Scoring Engine: 4 Dimensions

No vibe checks. The scoring engine uses regex, AST analysis, and cluster runtime evidence to calculate a weighted total.

D1 - Anti-Pattern Avoidance (25%): Static and AST analysis across Python, SQL, Java, Go, and Node. Security findings fold in: SQL injection, hardcoded credentials, and missing TLS.

D2 - Positive Pattern Presence (35%): UUID primary keys, retry with backoff and jitter, ROLLBACK before retry, smart driver config, covering indexes, explicit sharding strategy. This is the most heavily weighted dimension because this is where distributed-native engineering lives.

D3 - Architectural Quality (30%): A deterministic heuristic over structure: function decomposition, error handling, resource cleanup. It starts from a baseline and only adjusts modestly, so it tells you whether code is structured, not whether it’s brilliantly designed. We’re working on something better: feeding real cluster metrics to an LLM council and asking it to interpret numbers rather than review code.

D4 - Efficiency Proxy (10%): Real execution data from the cluster, pulled from pg_stat_statements after each run. Sequential scans are penalized; index scans are rewarded. Weighted at 10% so it informs the score without dominating it. We want the benchmark to reward awareness first.
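Under the stated weights, the total reduces to a short formula. The sketch below is a simplified model of the aggregation only; the real engine also folds security findings into D1 and applies the zero-out rule when generated code fails to execute on the cluster.

```python
# Dimension weights from the benchmark: D1 25%, D2 35%, D3 30%, D4 10%.
WEIGHTS = {"d1": 0.25, "d2": 0.35, "d3": 0.30, "d4": 0.10}

def weighted_total(scores, execution_ok):
    """Combine per-dimension scores (0-5 scale) into the weighted total.

    If the code failed to run against the live cluster, every dimension
    zeroes out: you can't score well on awareness if your code doesn't run.
    """
    if not execution_ok:
        return 0.0
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
```

For example, plugging in GPT 5.3's baseline dimension scores (0.87, 3.91, 2.63, 2.81) reproduces its reported average of roughly 2.66.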

Benchmark Results

These tables compare zero-shot (C0) against skill-file (C2) only, and show Cursor-based models. The benchmark also has results for Claude Code, Gemini, and others. This is a sample of current runs, not the full picture.

Baseline

The following table shows the raw zero-shot performance of the major models:
| Model | Avg Score | D1 Anti-Pattern | D2 Positive | D3 Quality | D4 Efficiency |
| --- | --- | --- | --- | --- | --- |
| GPT 5.3 | 2.66 | 0.87 | 3.91 | 2.63 | 2.81 |
| Gemini 3.1 Pro | 2.65 | 1.26 | 3.63 | 2.54 | 2.88 |
| Composer 1.5 | 2.99 | 2.83 | 3.53 | 2.50 | 2.56 |
| Claude 4.6 Opus | 2.70 | 0.79 | 3.92 | 2.67 | 3.50 |
| Grok | 2.91 | 2.07 | 4.07 | 2.33 | 2.50 |
D1 near zero means these models hit almost every anti-pattern when ungrounded. GPT 5.3 and Claude 4.6 Opus score 0.87 and 0.79 respectively: SERIAL keys, OFFSET pagination, string-concatenated SQL on the majority of prompts. D2 is higher because models produce some distributed-native patterns incidentally, not by design.

Grounded Models (With Skill Files)

| Model | Avg Score | D1 Anti-Pattern | D2 Positive | D3 Quality | D4 Efficiency |
| --- | --- | --- | --- | --- | --- |
| GPT 5.3 | 3.77 | 4.66 | 4.26 | 2.63 | 2.69 |
| Gemini 3.1 Pro | 3.74 | 4.63 | 4.20 | 2.54 | 3.00 |
| Composer 1.5 | 3.59 | 4.92 | 3.68 | 2.50 | 2.56 |
| Claude 4.6 Opus | 3.49 | 3.75 | 4.10 | 2.67 | 2.63 |
| Grok | 3.35 | 3.04 | 4.59 | 2.33 | 2.50 |
D1 jumps from near zero to 3.0–4.9. That's the grounding doing its job. D3 and D4 don't move meaningfully between grounded and ungrounded runs. This is expected: D3 measures structural code quality, which skill files don't target, and D4 is low-weight by design. The real signal is D1 + D2: does the model know YugabyteDB, or does it just write PostgreSQL?

The Skills

The skill file closes the distributed systems gap in two directions: new code gets the right patterns from the start, and existing code gets concrete suggestions on what to change and why. It is published on skills.sh under yugabyte/skills, and you can install it with `npx skills add yugabyte/skills`.

The difference between models at zero-shot is real but small. The difference between ungrounded and grounded is the gap between 0.79 and 3.75 on anti-pattern avoidance. Grounding is the higher-leverage intervention.

What This Means

Your AI isn't broken; it’s just well-trained on a database that isn’t distributed. The skills are open source and available on our GitHub. Stay tuned for more insights: we will be deep-diving into how individual models handle specific traps, what MCP vs. skill files actually change in generated code, efficiency analysis from real cluster metrics, and results from additional models and agent configurations. Follow the Yugabyte blog for the latest insights, or join our new Discord server to share your experiences with agentic AI.
