Benchmarking AI Coding Agents for Distributed SQL: What We Learned

We ran 350 evaluations across every major model and tool. Here's what the data shows!

May 20, 2026

In part 1 of this 2-part blog series, we made a bold claim: The AI wasn’t failing because it lacked data. It was failing because it was too well-trained on the wrong database.

AI models write vanilla PostgreSQL. If your database is distributed, providing the AI model with a YugabyteDB skill file closes the gap and ensures it writes code that works for your application.

In this post, we break down the benchmarking results across 17 model configurations. We share what moved the score, what regressed, and the one structural finding that changes how every team should think about delivering context to an AI coding tool.

The Benchmark in Action

We conducted 350+ evaluations across 17 model configurations: Claude 4.5, 4.6, and 4.7, Gemini 3.1 Pro, GPT-5.x, Composer 2, Codex CLI – through Claude Code CLI, Cursor, and Codex.

Our claim that your Postgres-focused AI lacks the correct data holds. But three other things surfaced that we didn’t expect:

The tool wrapping the model matters as much as the model itself.
Skill-file rules reliably regress below zero-shot when the pattern requires working control flow, not just a prohibition.
Six iterations of tuning a skill file for one specific workload quietly made it worse everywhere else.

How The Skill File Works

Add a YugabyteDB skill file and anti-pattern avoidance jumps from 2.42 to 3.79 – a +57% lift.

We scored three dimensions, each 0-5:

D1 – anti-pattern avoidance: the silent failures (SERIAL PKs, UNLOGGED tables, system columns). Checked by regex against a 38-item checklist – deterministic, no judgment calls.
D2 – positive pattern adoption: UUID keys, retry logic, smart driver config.
D3 – architectural quality: heuristic static analysis of code structure, error handling, resource management, and distributed awareness. Five sub-dimensions scored by regex, averaged to 0-5.

We ran 55 tasks across:

schema design
application code
query optimization
fault tolerance
operations

With a minimum of 11 prompts per model configuration.

Context	Overall	D1 Anti-Pattern	D2 Positive	D3 Quality
Zero-shot	2.72	2.42	3.35	2.38
Skill-file	3.25	3.79	3.44	2.53

Every dimension improves. The effect concentrates in D1: the things models don’t know, not the things they have to reason about. You cannot deduce that ctid and xmin don’t exist in YugabyteDB’s storage layer. You cannot infer that CREATE UNLOGGED TABLE silently writes through Raft anyway. Unless the model is specifically trained on this edge case, it will default to the most common pattern. You either supply the facts, or you get silent failures.

The D1 patterns that lift most – UNLOGGED TABLE, SERIAL PKs, system columns – are documented in YugabyteDB’s compatibility notes. The zero-shot score is 2.42. The gap isn’t due to missing documentation; it’s because general training data doesn’t override the dominant PostgreSQL pattern at inference time. The skill file operates differently – it injects context at inference time, when the specific task, schema, and cluster are known. Those aren’t in any training set.

From Part 1: the scoring was rebuilt. The original single “awareness” dimension split into D1 and D2 – avoidance and adoption move independently, and averaging them hid both signals. The checklist grew from 26 items to 38. Execution efficiency (D4) is excluded from the scores above – 96% of runs scored 2.5 or 3.0, producing no useful differentiation across the 55-task benchmark. Execution efficiency is covered separately in the TPC-H experiment section below.

What It Fixes. And What It Doesn’t

The biggest lifts are all knowledge gaps. These are PostgreSQL features that compile on YugabyteDB, but behave differently than expected.

SELECT id, ctid, xmin FROM orders tracks row versions in standard PostgreSQL. YugabyteDB’s DocDB storage layer doesn’t expose those columns – the query errors out. The model never saw the failure during training, so it repeats the pattern. Zero-shot avoidance: 13%. With skill file: 71%.

CREATE UNLOGGED TABLE is a standard PostgreSQL optimization for scratch tables. On YugabyteDB, the syntax parses, the table is created, and every write still goes through Raft consensus. No error, no warning, no benefit. Zero-shot avoidance: 27%. With skill file: 81%.

Three items had near-zero lift, and all three are procedural patterns – not knowledge gaps:

UUID primary keys – 77% zero-shot, 75% with skill file. Models already reach for UUIDs when the task mentions scale – the skill file adds no lift here. Part 1 showed why this matters: SERIAL routes every insert to the same tablet, defeating hash sharding entirely.
Retry with backoff – 79% zero-shot, 64% with skill file. The rule adds the distributed-specific error codes (40001, 40P01) but doesn’t teach the control flow, and actually regresses below zero-shot.
ROLLBACK before retry – 69% zero-shot, 50% with skill file. The same regression. A detail that requires working code, not a rule.

The skill file fixes knowledge gaps, but it doesn’t fix procedural patterns, which can actively make things worse. That distinction is the main finding in the next section.

The Tool Tax

Here’s where the benchmark got interesting.

Claude Opus 4.6 scores 4.62 overall through the Claude Code CLI with a skill file. The same model, same context, through an older Claude Code build: 2.01.

Access path	Overall	D1	D2	D3
Claude Code CLI (current, Opus 4.7)	4.35	4.73	4.46	3.85
Claude Code CLI (Opus 4.6)	4.62	4.95	4.46	4.46
Cursor (Opus 4.6)	3.91	3.75	4.14	3.83
Claude Code CLI (2026-03, Opus 4.6)	2.01	4.65	0.61	0.77

The original Claude Code result – 2.01 – came from a version where the agentic scaffold burned its entire turn budget (default: 2 turns) on filesystem exploration, then returned a JSON error log instead of code. 87% of runs returned {"subtype":"error_max_turns"}. D1 stayed high because the checklist found no anti-patterns in an error log, but there was no code to score for D2 or D3.

We reran the full 55-task benchmark with the current Claude Code (Opus 4.7, agentic mode, 10-turn budget). Every single task was completed in 1 turn. Zero max_turns failures. Overall score: 4.35 – above GPT-5.3 and Gemini 3.1 Pro.

The Haiku Subagent

Claude Opus 4.7 uses Haiku 4.5 as a fast subagent for tool-use turns. The large model plans; Haiku executes the tool calls. We ran Haiku 4.5 standalone to understand how much of the output quality comes from each layer.

Model	Tool	Overall	D1	D2	D3
Claude Opus 4.7	Claude Code	4.35	4.73	4.46	3.85
Claude Haiku 4.5	Claude Code	2.14	4.57	0.14	2.44

Haiku’s D2 is 0.14 – near zero on YugabyteDB-idiomatic code. That’s expected: Haiku is the implementer, not the planner. It follows instructions faithfully and avoids mistakes (D1: 4.57), but it doesn’t generate UUID keys, implement retry logic, or configure smart drivers on its own. Those come from the Opus planning layer.

In any agentic system where a large model plans and a smaller one executes, the skill file needs to reach the planner.

Rules vs Recipes

The skill file lifts the overall score. but not evenly. Some patterns jump, others fall below zero-shot.

Pattern	Zero-shot	Skill-file
Batch inserts	30%	79%
UNLOGGED TABLE avoid	27%	84%
Retry with backoff	79%	64%
ROLLBACK handling	69%	50%

The distinction is what the pattern requires of the model:

Pattern	The rule	The recipe
Retries	“Use backoff on 40001”	`try/except` loop with `ROLLBACK`, jitter, bounded attempts, `40001`+`40P01` catch
Primary keys	“Avoid SERIAL”	`id UUID DEFAULT gen_random_uuid() PRIMARY KEY`
Indexes	“Use covering indexes”	`CREATE INDEX ON t(col_a) INCLUDE (col_b, col_c)`
UNLOGGED TABLE	“Don’t use UNLOGGED”	– (prohibition is enough, no control flow needed)

Skill files win on concrete prohibitions. “Don’t use UNLOGGED TABLE” maps cleanly to a rule that stays in working memory. They also win on batch inserts (79% vs 30%), because “batch your writes” is specific enough to follow directly.

They regress on anything that requires control flow. ROLLBACK handling with our skill file (50%) falls below zero-shot (69%). We had “always rollback before retrying” as a text rule. Without showing why it belongs inside a loop, models place rollback() in the wrong place. They know the token belongs somewhere, but can’t construct the surrounding control flow from the prohibition alone.

The fix is to replace the rule with a working example in the skill file directly, as below:

# Recipe: retry with backoff for YugabyteDB transient errors
TRANSIENT = {"40001", "40P01"}
for attempt in range(MAX_RETRIES):
    try:
        cur.execute(sql)
        conn.commit()
        break
    except psycopg2.Error as e:
        conn.rollback()   # must rollback before retry
        if e.pgcode not in TRANSIENT:
            raise
        time.sleep((2 ** attempt) * 0.01 + random.random() * 0.01)

This goes in CLAUDE.md or .cursor/rules/ verbatim – not as a prose description, as runnable code the model can copy and adapt.

Models follow working code examples. They’re bad at turning prose rules into working control flow. Every round, patterns that regressed as text rules get promoted to code examples.

Not all models respond the same way. Codex (gpt-5.5) with a skill file hits D2=3.97 – the highest D2 of any Codex CLI run – while D1 stays at 2.76. It readily adopts prescriptions (40001/40P01 error codes, load_balance=true) but still generates anti-patterns that the skill file explicitly prohibits.

Compare that to Claude Haiku, which avoids anti-patterns well (D1=4.57) but originates almost nothing on its own (D2=0.14). The same skill file produces opposite failure modes depending on the model’s role: one that follows recipes but ignores rules, one that follows rules but doesn’t know any recipes. Both need the other half.

The Leaderboard

Provider	Model	Tool	Overall	D1	D2	D3
Anthropic	Claude Opus 4.6	Claude Code	4.62	4.95	4.46	4.46
	Claude Opus 4.7	Claude Code	4.35	4.73	4.46	3.85
	Claude Opus 4.6	Cursor	3.91	3.75	4.14	3.83
	Claude Sonnet 4.6	Cursor	3.42	3.73	3.75	2.79
	Claude Opus 4.7 High	Cursor	3.29	3.41	4.41	1.88
	Claude Opus 4.6 High	Cursor	3.24	3.43	4.07	2.10
OpenAI	GPT-5.3	Cursor	4.22	4.66	4.30	3.71
	GPT-5.4 High	Cursor	3.14	3.28	3.60	2.54
	GPT-5.5	Codex	3.12	2.76	*3.97*	2.42
	GPT-5.4	Codex	3.04	3.20	3.70	2.12
Google	Gemini 3.1 Pro	Cursor	4.17	4.63	4.25	3.63
xAI	Grok	Cursor	3.67	3.63	4.64	2.75
Cursor	Composer 2	Cursor	2.93	3.16	3.40	2.23

All entries run at skill-file context level with a minimum of 12 prompts per configuration. The top three results, Claude Opus 4.6, GPT-5.3, Gemini 3.1 Pro, are within half a point of each other.

The coding tool is another variable that impacts skill performance, even when running the same model and provider. GPT-5.3 via Cursor scores 4.22; gpt-5.5 via Codex CLI scores 3.12 – both OpenAI, same context type, 1.10 point gap. Claude Opus 4.6 shows the same pattern: 4.62 via Claude Code, 3.91 via Cursor. The model family doesn’t explain the spread. The tool does.

Newer Isn’t Always Better

The same pattern in the Claude family through Cursor: Sonnet 4.6 (3.42) outscores Opus 4.7 High (3.29).

Opus 4.7 High actually adopts more YugabyteDB idioms – D2 is 4.41 vs 3.75 – but D3 quality collapses to 1.88 against Sonnet’s 2.79. More reasoning budget, lower architectural coherence.

Extended reasoning appears to spend its tokens analyzing trade-offs rather than writing coherent code. D2 climbs because the model checks more pattern boxes. D3 drops because the resulting structure is fragmented – correct pieces, incoherent whole.

The same effect appears across providers: GPT-5.4 High (3.14 overall) scores well below GPT-5.3 (4.22) on D1/D2/D3 despite the higher model tier. For schema design, a model that reaches for the obvious correct DDL beats one that over-engineers it.

The models with the biggest D1 lift all started lowest on zero-shot – they had the most room. Claude Opus 4.6 goes from 1.40 zero-shot to 4.95 with a skill file (+3.55). The strongest general models gain the most from the right domain context.

TPC-H Experiment: Does Structure Predict Execution?

D1, D2, and D3 score structure. The natural follow-up question is: do those structural choices hold up when the DDL actually runs on a live cluster? We ran a separate experiment to find out.

Each model receives a 3-node YugabyteDB cluster with the canonical TPC-H schema at SF=0.01:

roughly 60,000 rows on lineitem
15,000 on orders
Standard dbgen DDL: SERIAL primary keys
no extra indexes
no sharding hints

The same prompt to every model:

Optimize this schema for distributed execution of the 22 standard TPC-H queries. You may change primary keys, add indexes, change sharding, add colocation, add tablespaces. You may not modify query text, table names, column names, column types, constraints, or row counts.

Every model’s DDL is checked before the timing begins. Any violation hard-zeroes the result – fast-but-wrong doesn’t win: every original table, column, type, and constraint must still be present; per-table COUNT(*) must exact-match the pre-run snapshot; a deterministic hash over Q1 and Q3 result sets must match.

geo_mean  = (∏ speedup[q]) ^ (1/22)     # geometric mean - one query can't mask another
penalty   = 0.02 × new_indexes_count     # every new index taxes writes
exec_score = clamp( 2.5 + log2(geo_mean) − penalty , 0 , 5 )

2.5 is neutral. 3.5 is a 2x average speedup. 4.5 is 4x.

Speedups are measured by running each of the 22 queries before and after applying the DDL and taking the median wall time. The geometric mean is the load-bearing choice: an index that doubles Q3 and halves Q1 scores neutral – exactly the trade-off a naive average hides.

EXPLAIN (ANALYZE, DIST) is available to the model for plan inspection, exposing storage RPCs and rows scanned across tablet servers. These are metrics that don’t exist in standard PostgreSQL plans.

Initial Results

All models below ran via Cursor unless noted as (CLI). Timing on SF=0.01 (≈60K line item rows, 15K orders), 22×3 query repeats.

Rank	Provider	Model	Zero-shot	With Skill File	Delta
1	OpenAI	GPT 5.4 High	3.03	3.15	+0.12
2	Anthropic	Claude Opus 4.7	0 (schema fail)	2.95	—
3	Google	Gemini 3.1 Pro	2.66	2.91	+0.25
4	Anthropic	Claude Opus 4.6	0 (schema fail)	2.74	—
5	OpenAI	GPT 5.5 (Codex CLI)	2.71	2.71	0.00
6	Anthropic	Claude Sonnet 4.6	2.50	2.49	-0.01
7	Cursor	Composer 2	2.52	2.40	-0.12
8	OpenAI	GPT 5.3 Codex	2.10	2.56	+0.46
9	Google	Gemini 2.5 Flash (CLI)	0 (schema fail)	2.39	—
10	Google	Gemini Flash Lite (CLI)	2.84	2.38	-0.46
11	xAI	Grok 4.20	0 (schema fail)	0 (schema fail)	—

GPT 5.4 High is the only model to break 3.0 in both modes: 3.03 zero-shot, 3.15 with the skill file. It’s targeted indexes lift query throughput without broad write amplification.

GPT 5.3 Codex benefits most from the skill file proportionally – zero-shot over-indexes and slightly slows the benchmark; the skill file cuts index count and brings the score from 2.10 to 2.56.

Gemini Flash Lite (Gemini CLI) zero-shot: 2.84. This was the highest zero-shot score of any native CLI model, and the sharpest skill-file regression in the benchmark. Its DDL was pure indexes plus ALTER TABLE ... SPLIT INTO N TABLETS – no table rebuilds, no schema changes, only 7 new indexes. By staying conservative, it avoided every schema invariant pitfall. When the skill file was applied, performance dropped sharply (2.84 → 2.38, -0.46): the additional guidance pushed the model toward more indexes (7 → 10), increasing the write-amplification penalty without commensurate query speedup. The model had already internalized a near-optimal strategy; the skill file disrupted it.

GPT-5.5 via Codex CLI: 2.71 zero-shot, 2.71 with skill file. Consistent and solid – above GPT-5.3 Codex zero-shot (2.10) and close to its skill-file score (2.56), no regression from the skill file. The stability across context levels suggests the model’s index strategy is robust enough that additional guidance neither helps nor hurts this workload.

Composer 2’s skill-file result (2.40) came in below zero-shot (2.52). The same pattern as Gemini Flash Lite, but in a different direction. The skill file added index guidance that didn’t translate to this workload.

Gemini 2.5 Flash (Gemini CLI): 0 zero-shot, 2.39 with skill file. Zero-shot failed: the model re-specified the full table schema with NOT NULL on comment columns – the same mistake as Opus 4.7. With the skill file, Gemini 2.5 Flash used CREATE TABLE ... (LIKE source INCLUDING DEFAULTS), which copies nullability from the live table (nullable comments), passing invariants and scoring 2.39.

Grok 4.20 (Cursor): 0 zero-shot, 0 with skill file. Both runs failed invariants. The skill-file DDL attempted to recreate tables with colocation enabled. This feature is not available on a non-colocated database. The CREATE TABLE for region and nation failed, leaving those tables dropped. The zero-shot attempt made no schema changes at all (0 indexes, 0 statements applied). Neither recovered.

Multiple models hard-zeroed on zero-shot. The problem is different in each case:

Opus 4.7 retyped column definitions and set nullable comment columns to NOT NULL, breaking the schema contract.
Opus 4.6 used CREATE TABLE ... AS SELECT, which strips all NOT NULL constraints from non-PK columns – the opposite error.
When Opus 4.7 attempted to reshard distribution keys with SPLIT INTO N TABLETS, it hit a YugabyteDB-specific wall: SPLIT INTO N TABLETS requires a hash-partitioned leading column; colocation = true requires a colocated database; ALTER TABLE ... DROP PRIMARY KEY is unsupported on hash-sharded tables. Partial failures left tables dropped without replacements, zeroing the run regardless of the indexes that succeeded.
Gemini 2.5 Flash zero-shot replicated the same NOT NULL mistake as Opus 4.7.
Grok hit another wall – unsupported DDL operations crashed the rebuild mid-execution, leaving the schema in a broken state.
Claude Sonnet 4.6 and Gemini Flash Lite took the conservative path – only CREATE INDEX statements, no table rebuilds – and passed.

With the skill file, Opus models recovered (Opus 4.7 to 2.95, Opus 4.6 to 2.74), and Gemini 2.5 Flash recovered to 2.39. Simple, conservative DDL won.

The pattern is consistent: any model that rewrites the table schema explicitly risks mis-specifying nullability.

Models were trained on the standard TPC-H schema, where comment columns are NOT NULL, but this benchmark’s reset script made them nullable for test flexibility. Models that reused the live schema via CREATE TABLE ... (LIKE source INCLUDING DEFAULTS) were immune to this; models that typed the schema from memory were not.

This is a knowledge gap, not a reasoning failure. The models chose the correct approach for a distributed cluster. None of them had the specific execution constraints in their skill file. The right fix is a project-skill entry listing unsupported DDL operations for your cluster setup.

Tuning the Skill

D1, D2, and D3 improved reliably with context – the patterns were clear.

Execution was the outlier: several models hard-zeroed because their DDL hit YugabyteDB-specific constraints we had never mentioned in the skill file. If those failures were knowledge gaps – missing CBO flags, unsupported DDL operations, wrong index cost assumptions – then adding that knowledge should close the gap.

So we ran six iterations of the skill file, each targeting a specific failure pattern, and measured what actually moved.

What Worked

Adding yb_enable_cbo = on, yb_enable_derived_saops = true, and ANALYZE as a mandatory end-of-script block lifted every model. These settings are off by default in YugabyteDB, and not one model included them without being told. Pure knowledge gap, clean gain – the kind of thing that belongs in any skill file.

Framing index cost in distributed terms – each index is a tablet-bearing Raft-replicated object, not a lightweight B-tree – brought Claude Sonnet 4.6 from 33 indexes to 11. Composer 2 went from 25 indexes and 2.40 to 12 and 2.92 for the same reason. When models understand why something is expensive in a distributed system, they make better tradeoffs without being told what to choose.

Where Prescriptive Rules Backfired

Gemini 3.1 Pro was already at 14 indexes and 2.91 – near-optimal for its strategy. When we added a hard index cap (<=15), it cut indexes that were genuinely useful, reducing the geo-mean speedup from 1.62x to 1.30x. Gemini never fully recovered across all six versions. Any rule that constrains a model’s strategy – as opposed to correcting a knowledge gap – risks hurting models that have already internalized the right approach.

Across six iterations, knowledge additions were transferred to most models. Hard rules produced the most variable results – helping models that lacked the underlying knowledge, hurting the ones that had already internalized it.

The v5 skill added a full Distributed Query Planning section. It pushed Claude Sonnet 4.6 (Cursor) to 3.12 – above Opus 4.7 and Sonnet 4.6 on this workload. It made things measurably worse for Opus 4.7, Composer 2, and Sonnet 4.6 on the general benchmark. A longer skill file gives the model more surface area to be distracted from what matters for the task.

(Score scale: 2.5 = no change from baseline, 3.5 ≈ 2x average query speedup. v2 = universal win – CBO settings and distributed cost framing with no workload-specific rules. v3-v6 = TPC-H-specific experiments; gains on this workload, regressions on the 55-task general benchmark.)

Provider	Model	original	v2	v3	v4	v5	v6	peak
Anthropic	Claude Sonnet 4.6 (Cursor)	2.75	2.89	–	–	3.12	2.75	3.12
	Claude Opus 4.7 (Claude Code)	2.95	3.09	–	–	2.84	2.92	3.09
	Claude Opus 4.6 High (Cursor)	2.71	2.73	2.82	2.65	2.71	2.54	2.82
	Claude Sonnet 4.6 (Claude Code)	2.49	2.80	–	–	2.62	2.74	2.80
OpenAI	GPT 5.4 High	3.15	–	–	–	–	–	3.15
Google	Gemini 3.1 Pro	2.91	2.72	2.65	2.77	2.85	2.86	2.91
Cursor	Composer 2	2.40	2.92	–	–	2.80	2.96	2.96

No single version is best for all models. The only universal win: v2’s CBO settings and distributed cost framing – pure knowledge, no strategy constraints – produced gains across the broadest set with the fewest regressions.

The split is workload-based, not model-based.

The v5 Distributed Query Planning section lifted Claude Sonnet 4.6 (Cursor) because it was running more scan-heavy TPC-H prompts. But it hurt Opus 4.7 and Sonnet 4.6 on general OLTP tasks, where the same content was noise. The divergence tracks what the model was asked to do, not which model it was. The two-layer model handles this naturally: analytical guidance belongs in a project skill for analytical workloads, not in a universal skill tuned to a specific model version.

The Wrong Layer

When we validated the v6 skill against the full 55-task benchmark – schema design, application code, fault tolerance, operations – the damage was clear:

Model	Dimension	Original skill	v6 skill	Delta
Sonnet 4.6	D1 anti-pattern	4.88	4.91	+0.03
	D2 positive pattern	1.80	0.54	-1.26
	D3 quality	3.37	2.80	-0.57
Opus 4.7	D1 anti-pattern	4.72	4.87	+0.15
	D2 positive pattern	4.34	3.38	-0.96
	D3 quality	4.01	3.72	-0.29

D2 dropped by more than a point for Sonnet 4.6 and nearly a point for Opus 4.7. The skill made models better at avoiding prohibitions and worse at applying the YugabyteDB idioms that actually matter across general workloads. TPC-H execution scores improved, but everything else got worse.

The tuning itself wasn’t wrong – the layer was.

TPC-H-specific guidance (CBO settings, tablet cost framing, index count heuristics) is exactly what should live in a project skill for an analytical workload. What broke things was putting it in the universal skill and letting it crowd out the general-purpose patterns that every YugabyteDB workload needs.

What D4 Should Be

The TPC-H experiment taught us that execution efficiency cannot be fairly measured alongside structure and code quality within the same benchmark. There are three problems:

Workload coupling. TPC-H is one workload. A model that wins on analytical scans may do little to improve OLTP point reads. A single execution score doesn’t generalize.
Hard-zero fragility. One unsupported DDL call wipes the entire execution result, regardless of what else the model did correctly. This is structurally different from how D1/D2/D3 handle partial correctness.
Measurement mismatch. D1/D2/D3 score the 55-task general benchmark. Folding a single TPC-H run into the same number creates a false equivalence between “schema design awareness across 55 tasks” and “query throughput on one analytical workload.”

The right split:

A static performance intent score. Does the code include CBO flags, cost-aware indexing, ANALYZE calls, and distributed-aware hints? This runs inside the 55-task benchmark, is deterministic, and rewards knowledge of YugabyteDB’s execution model without requiring a live cluster.
A separate live execution benchmark covers workload-specific tuning – TPC-H for analytical, pgbench-style workloads for OLTP – reported independently with its own methodology.

The Two-Layer Fix

The right response wasn’t to revert everything; it was to separate what belongs in a universal skill from what belongs in a project-specific one.

The universal skill (yugabyte/yugabytedb-skills) covers what every model needs to know about YugabyteDB, regardless of project: anti-patterns, PK strategies, sharding, colocation, covering indexes, retry recipes, and smart-driver configuration. Don’t tune it to your specific workload.

The project skill covers what a model needs to know about your codebase: your schema, your cluster configuration, your observed performance characteristics, and your team’s conventions. This is where TPC-H-specific guidance belongs, and where any team running YugabyteDB in production should add context that no general skill can carry.

What might go in a project skill:

## Cluster Configuration
- 3-node RF=3, us-east-1 (a/b/c zones), YugabyteDB 2025.2
- yb_enable_cbo = on already set at cluster level

## Schema Context
- orders (o_orderkey HASH, 15M rows) + lineitem (l_orderkey HASH, 60M rows)
  -> same hash key: joins between these two are tablet-local
- partsupp is colocated with part (low write throughput, frequent joins)
- Q20 is the slowest query (7s baseline); focus index effort here first

## Team Conventions
- All new indexes: prefix idx_ + table + column pattern
- Use INCLUDE columns for all covering indexes - no bare secondary indexes
- CBO block runs via migration hook - do not add it to DDL

Add it to your tool:

Claude Code – CLAUDE.md at the repo root, under a ## Database section. Loaded automatically for every session in that directory.

Cursor – .cursor/rules/yugabyte-project.md with alwaysApply: true in the frontmatter. Cursor injects it into every chat and Composer session.

# Universal skill
npx skills add yugabyte/yugabytedb-skills

# Add project context to CLAUDE.md or .cursor/rules/

The universal skill carries the D1/D2 gains: anti-pattern avoidance, UUID keys, retry logic, and smart driver config. These are the patterns that apply to every YugabyteDB workload.

The project skill carries what only you know: your schema shape, your cluster’s observed bottlenecks, and your team’s index conventions.

Six TPC-H iterations taught us exactly where that boundary sits. The universal skill doesn’t know your slowest query. The project file doesn’t know that UNLOGGED TABLE is a silent no-op. Neither should try to do the other’s job.

The Bottom Line

Give the model the list of things that silently break. Models default to PostgreSQL patterns that compile fine on YugabyteDB but fail silently in production. You must tell them what to avoid, or they will repeat the mistake.
The tool wrapping the model can undo the lift. The same model, through different tools, produced a point of variance in our benchmark. Check your tool config, turn budgets, and routing before assuming a model upgrade changed anything.
Tell the model what to avoid; show it how to handle the rest. Rules work for prohibitions. Anything requiring working code – retry loops, error handling – regresses when written as a text rule. Paste working examples, not descriptions.
Knowledge gaps belong in the universal skill. Strategy belongs in the project skill. What works for one workload is noise for another. Keep workload-specific guidance in your project skill, not the universal one.
Benchmark scores measure knowledge, not what runs on your cluster. The model that scored highest on code quality got hard-zeroed when its DDL hit unsupported operations. Validate the generated DDL on your actual cluster before shipping it.

Conclusion

The model family is not the primary variable; context is. The tool wrapping the model shifts the score more than a version upgrade.

The two-layer approach handles this directly: a universal skill in database fundamentals, a project skill for your cluster, and a schema. Neither should be a substitute for the other.

The next blog in this series discusses local models and agent memory. Whether they show the same skill-file lift for air-gapped or privacy-sensitive deployments is an open question. Stay tuned!

Don’t miss part one of this blog series: Good Code, Wrong Model. How to Benchmark AI Coding Agents for Distributed SQL

May 20, 2026