Developer

Writing SQL with AI in 2026: what works, what does not

An honest review of LLM-generated SQL after two years in production: where the models are excellent, where they still fail, and the engineering patterns that close the gap.

By Shlok Zanwar · 4 May 2026 · 5 min read

After running LLM-generated SQL in production for over two years across PostgreSQL, MySQL, SQL Server and ClickHouse, the picture is clearer than it was. The models got dramatically better. The failure modes got more specific.

A practical, opinionated review.

What works well

The current generation of frontier models (mid-2026) is excellent at:

  • Single-table aggregations: totals, counts, group-bys with one or two filters. Essentially never wrong.
  • Two- and three-table joins when the schema is reasonably normalised and column names are clear.
  • Window functions (ROW_NUMBER, LAG, LEAD, SUM OVER), including the right PARTITION BY 90%+ of the time.
  • CTE structure: when given a complex question, models tend to decompose into 2–4 CTEs that are readable. This was unreliable in 2024.
  • Translating between dialects: Postgres → MySQL, MySQL → SQL Server. With minor cleanup.
  • Explaining existing SQL in plain language. Very strong.

For these cases, prompt the model with the schema and a few-shots from your domain and you are 95%+ accurate.

What still does not work

The failure list is more interesting:

1. Implicit business logic

“Active customers” means different things in different companies: paying, not churned, with at least one order in 90 days, with non-zero balance. The model can’t know which. It picks one and runs.

Fix: encode business definitions as views (active_customers, core_revenue, gross_margin) and only expose those views to the NL layer. The semantic layer earns its keep here.

2. Date handling

Astonishingly still a top failure mode. “Last quarter”: fiscal or calendar? “Yesterday”: IST or UTC? “Month-to-date”: including today or up to yesterday?

Fix: in the system prompt, explicitly define fiscal year start, timezone, “today” reference and what “last X” means. Two paragraphs, prevents 30% of the bugs.

3. NULL semantics

The classic ones still trip models:

  • WHERE x != 'a' excludes NULL rows. The user usually wanted them included.
  • COUNT(col) ignores NULL but COUNT(*) does not. Subtle, often wrong.
  • LEFT JOIN ... WHERE right.col IS NULL for “missing from right” gets confused with regular filters.

Fix: train a few-shots set with explicit NULL handling and reject SQL that uses != on a nullable column without a IS NULL companion clause.

4. Performance-unaware joins

Models will happily write a five-table cross-join that runs in 40 seconds. They don’t see EXPLAIN.

Fix: run EXPLAIN before execution. If estimated cost > threshold, route to a “this looks expensive, confirm?” path.

5. Subtle math errors

Percentage growth where the model writes (new − old) / new instead of / old. Ratios where it forgets to cast to a float and gets integer division.

Fix: integration tests with golden questions, run nightly. When a model version changes, the bad SQL gets caught before the user does.

The engineering patterns that close the gap

Five patterns that take an LLM-SQL system from “demo-grade” to “production-grade”:

Pattern 1: Semantic layer

Pre-build views (or dbt models) that encode business definitions. The LLM only sees the semantic layer, not raw tables. This is the highest-ROI investment.

Pattern 2: Few-shot rotation

A curated set of 30–80 (question, SQL, output) tuples per major data domain. Retrieve the 3–5 most relevant to the user’s question before prompting. Curate weekly from real user failures.

Pattern 3: Two-pass generation

For complex questions, two LLM calls:

  1. Plan: list the tables, the filters, the joins, the metric, the date range. No SQL.
  2. SQL: given the plan, write the SQL.

The plan pass is fast (small output) and surfaces ambiguity. If the plan is wrong, ask the user before generating SQL.

Pattern 4: Validation layer

Before execution:

  • Parse to AST.
  • Reject DDL/DML.
  • Allow-list tables and schemas.
  • Inject mandatory predicates (tenancy, soft-delete).
  • Wrap in LIMIT if not present.

This is non-negotiable for production.

Pattern 5: Result review loop

After execution, a quick LLM pass to sanity-check the result:

  • Does the row count look reasonable?
  • Does the unit match? (₹ vs %)
  • Are there obviously wrong values (negative for non-signed metrics)?

If the check fails, retry the SQL with a corrected prompt before showing the user.

When to NOT use LLMs for SQL

Three cases where deterministic SQL beats LLM SQL:

  1. Production reports: daily owner dashboard, monthly board pack. Write the SQL once, schedule it, version-control it. LLMs are for ad-hoc.
  2. Compliance numbers: GST returns, statutory filings. Use deterministic SQL with explicit sign-off.
  3. Customer-facing metrics: what your app shows to users. Hardcode it; don’t have an LLM mediate.

LLM SQL shines for the long tail of unique questions that nobody will ask twice. That is where it earns its place.

Doing this with AnalytAI

AnalytAI runs the patterns above by default: semantic layer, two-pass generation, validation, result review, scheduled reports for the things that should be deterministic. Connect a SQL database, ask anything, see the SQL and the rationale.

Book a 20-minute demo. Bring a read replica connection string.

Related reads:


More on Developer

See all Developer posts →

See AnalytAI on your data

Bring one Tally company or one SQL database. We will turn it into a live dashboard on a 20-minute call.

Book a demo