If you've been following the text-to-SQL space, you'll know the benchmark numbers have climbed steeply over the past two years. GPT-4 hit 85% on Spider. Claude 3 nudged ahead on join-heavy tests. Gemini Ultra claimed improvements on complex subquery generation. But benchmark numbers and real-world accuracy are two very different things.
This article breaks down how the leading large language models actually perform when given real production schemas: ambiguous column names, multi-table joins, and the kind of nuanced business logic that shows up in day-to-day queries, not in cleaned-up academic datasets.
How Text-to-SQL Works Under the Hood
When you ask an AI "show me revenue by country last 30 days," the model doesn't magically know which tables to query. It needs schema context: which tables exist, what the columns are, their data types, and ideally some sample data or cardinality hints.
The typical pipeline looks like this:

1. Parse the natural-language question.
2. Retrieve schema context: tables, columns, data types, and ideally sample data.
3. Build a prompt combining the question and the schema context.
4. Generate SQL with the LLM.
5. Execute the query and format the results.
Each step can fail. But the SQL generation step is where LLM choice matters most.
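This pipeline can be sketched in a few lines. The function names here (retrieve_schema_context, generate_sql, answer) are illustrative, not from any specific library, and the generation step is a placeholder where a real system would call the LLM:

```python
import sqlite3

def retrieve_schema_context(conn):
    """Introspect tables and columns to build schema context for the prompt."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    context = []
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        col_desc = ", ".join(f"{c[1]} {c[2]}" for c in cols)
        context.append(f"-- {table}: {col_desc}")
    return "\n".join(context)

def generate_sql(question, schema_context):
    """Placeholder for the LLM call: a real system sends the question
    plus the schema context to the model and gets SQL back."""
    return "SELECT COUNT(*) FROM orders"

def answer(conn, question):
    schema_context = retrieve_schema_context(conn)
    sql = generate_sql(question, schema_context)  # generation: where LLM choice matters
    return conn.execute(sql).fetchall()           # execution: where errors surface

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 10, 'paid'), (2, 11, 'cancelled')")
print(answer(conn, "How many orders do we have?"))  # [(2,)]
```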
What "Accurate" Actually Means for SQL Generation
Benchmarks like Spider and BIRD score models on exact match or execution accuracy against a reference query. In production, you care about a different set of criteria:
- Column disambiguation: columns like status, type, or created appearing in multiple tables. Can the model figure out the right one?
- Business-logic awareness: if your orders table has a deleted_at soft-delete column, a model that ignores it will inflate your numbers.

GPT-4, Claude, and Gemini: Practical Differences
Testing these models on production-style schemas reveals consistent patterns.
GPT-4 (including GPT-4o) produces clean, valid SQL in most cases. It handles JOINs well and has solid table disambiguation. Its main weakness is aggressive assumption-making: it picks a table when uncertain rather than flagging the ambiguity, which can produce subtly wrong queries. It also handles very large schemas (100+ tables) less gracefully as context fills up.
Claude 3.5 and Claude 3 Opus show stronger performance on queries requiring careful reasoning about schema relationships. Claude tends to produce more conservative queries: it will add explicit filters like WHERE deleted_at IS NULL even when not asked, which is usually the right behaviour. It also generates better inline comments explaining its reasoning, which helps you catch errors before running.
Gemini 1.5 Pro and Ultra handle long context better than either GPT-4 or Claude on a per-token basis, which is an advantage with large schemas. However, testing against complex schemas shows a higher rate of syntactically valid but semantically wrong queries in JOIN-heavy scenarios.
Here's a concrete example. Given this schema fragment:
-- tables: orders, order_items, products, customers
-- orders: id, customer_id, created_at, status, deleted_at
-- order_items: id, order_id, product_id, quantity, unit_price
-- products: id, name, category, price
-- Question: "What was our total revenue by product category last month?"

GPT-4 typically generates:
SELECT p.category, SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
AND o.created_at < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY p.category
ORDER BY total_revenue DESC;

Clean and correct, but it missed the deleted_at filter, which means cancelled or deleted orders inflate the revenue numbers.
Claude typically adds:
WHERE o.created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
AND o.created_at < DATE_TRUNC('month', CURRENT_DATE)
AND o.deleted_at IS NULL
AND o.status NOT IN ('cancelled', 'refunded')

Which is almost certainly what you wanted, even if you didn't say it.
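The impact of that missing filter is easy to demonstrate. This is a toy sqlite3 reproduction of the schema fragment above (not the article's actual test harness), with one live order and one soft-deleted order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, created_at TEXT, status TEXT, deleted_at TEXT);
CREATE TABLE order_items (order_id INTEGER, quantity INTEGER, unit_price REAL);
-- One live order and one soft-deleted cancelled order, each worth 100.
INSERT INTO orders VALUES (1, '2024-05-01', 'paid', NULL);
INSERT INTO orders VALUES (2, '2024-05-02', 'cancelled', '2024-05-03');
INSERT INTO order_items VALUES (1, 2, 50.0);
INSERT INTO order_items VALUES (2, 2, 50.0);
""")

# GPT-4-style query: no soft-delete filter, so the deleted order counts.
naive = conn.execute("""
    SELECT SUM(oi.quantity * oi.unit_price)
    FROM orders o JOIN order_items oi ON o.id = oi.order_id
""").fetchone()[0]

# Claude-style query: filters out soft-deleted and cancelled orders.
careful = conn.execute("""
    SELECT SUM(oi.quantity * oi.unit_price)
    FROM orders o JOIN order_items oi ON o.id = oi.order_id
    WHERE o.deleted_at IS NULL AND o.status NOT IN ('cancelled', 'refunded')
""").fetchone()[0]

print(naive, careful)  # 200.0 100.0 -- the naive query doubles revenue
```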
How Schema Context Changes Everything
The model you choose matters less than how well you feed it schema context. An improperly prompted GPT-4 will underperform a well-prompted smaller model.
Key schema context elements that improve accuracy:
- Table and column names with data types, not just names.
- Relationships between tables, so the model knows how they join.
- Example column values: instead of just status VARCHAR, include status VARCHAR -- values: active, trial, cancelled, churned.

Tools like AI for Database handle schema enrichment automatically: they introspect your database, cache schema metadata, and inject the right context per query. This is why purpose-built NL-to-SQL tools often outperform copy-pasting a schema into ChatGPT, even when using the same underlying model.
When to Trust the Generated SQL
Generated SQL should always be treated as a first draft, not a final answer. Before running any query against production data, check:

- Does it respect soft-delete columns like deleted_at and status filters?
- Do the joins match the schema's actual relationships, without fan-out that double-counts rows?
- Does the date range match what you actually asked for?
The most dangerous scenario is a query that runs, returns a plausible-looking number, but is subtly wrong: for example, double-counting due to a missing DISTINCT.
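Here is that failure mode in miniature, again as a toy sqlite3 example rather than a real production case. A customer with two orders fans out to two rows in the join, and a count without DISTINCT silently inflates the answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, country TEXT);
CREATE TABLE orders (id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'DE'), (2, 'DE');
-- Customer 1 placed two orders, so the join fans out to three rows.
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Plausible-looking but wrong: the join duplicates customer 1.
wrong = conn.execute("""
    SELECT COUNT(c.id) FROM customers c
    JOIN orders o ON o.customer_id = c.id
""").fetchone()[0]

# Correct: deduplicate before counting customers who ordered.
right = conn.execute("""
    SELECT COUNT(DISTINCT c.id) FROM customers c
    JOIN orders o ON o.customer_id = c.id
""").fetchone()[0]

print(wrong, right)  # 3 2
```

Both queries run without error and both numbers look plausible, which is exactly why this class of bug survives a casual review.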
AI for Database includes a query preview step before execution, letting you review the generated SQL and catch issues before they hit your data.
The Full Loop: Generation Is Only Half the Problem
Text-to-SQL accuracy is only one part of getting useful answers from your database. The other parts (database connection management, result formatting, error handling, and iteration) are equally important.
If the first query fails or returns the wrong thing, a good system should let you refine in plain English rather than requiring SQL edits. "Actually, exclude the test accounts" should produce a corrected query automatically.
This is where dedicated tools differ most from general-purpose AI assistants. When you ask ChatGPT to write SQL, you're on your own for running it, handling errors, and iterating. A dedicated platform like AI for Database handles the whole loop: it generates, validates, executes, catches errors, reformats, and lets you ask follow-up questions with full conversation context.
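That generate-execute-retry loop can be sketched as a small cycle that feeds execution errors back into the next generation attempt. generate_sql here is a stand-in for the actual LLM call, hard-coded to fail once and then correct itself so the retry path is visible:

```python
import sqlite3

def generate_sql(question, schema_context, error=None):
    """Stand-in for the LLM call. A real system would include any prior
    error message in the prompt so the model can correct itself."""
    if error:  # second attempt: the "model" fixes the bad table name
        return "SELECT COUNT(*) FROM orders"
    return "SELECT COUNT(*) FROM ordres"  # first attempt has a typo

def run_with_retry(conn, question, schema_context, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        sql = generate_sql(question, schema_context, error)
        try:
            return conn.execute(sql).fetchall()  # execute the draft query
        except sqlite3.OperationalError as exc:  # catch the database error...
            error = str(exc)                     # ...and feed it back to the model
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("INSERT INTO orders VALUES (1), (2)")
print(run_with_retry(conn, "How many orders?", "-- orders: id"))  # [(2,)]
```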
For teams that query databases regularly, that full loop matters more than which underlying model scores highest on Spider.
The Bottom Line
There is no single best LLM for SQL generation in all situations. The gap between top models has narrowed significantly, and what separates good text-to-SQL implementations from mediocre ones is schema context quality, error handling, and iteration speednot raw model performance.
If you want accurate answers from your database without managing any of this complexity yourself, try AI for Database free at aifordatabase.com. It connects to your existing database in minutes, handles schema enrichment automatically, and lets you query in plain English without worrying about which AI model is doing the work.