Using DSPy Optimization Framework to Evaluate and Refine Production SQL System Prompts
Simon Willison demonstrated using the DSPy framework to optimize Datasette Agent's production system prompts. By running automated evaluations on a gold-standard dataset, the framework exposed critical prompting flaws like column guessing.
Impact: Medium
Why it matters
Instead of manually tweaking prompts and hoping for the best, you can use DSPy's structured evaluations and metrics to detect and fix the prompt flaws that lead models to hallucinate.
TL;DR
- DSPy automates prompt evaluation against a static gold-standard dataset.
- Prompt instructions written to save context can inadvertently trigger model hallucinations.
- Running agents against in-process databases removes the need for mocked validation environments.
Key facts
- Optimized system: Datasette Agent SQL execution prompt
- Evaluation framework: DSPy
- Evaluation LLMs: GPT-4.1-mini and nano
The Core Harness Setup
To evaluate system prompts reliably, the architecture bypasses expensive mock objects. DSPy agents invoke Datasette Agent's real Python tool implementations against a live, in-process SQLite engine loaded with test databases.
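A minimal sketch of this kind of harness, using only Python's standard `sqlite3` module. The function names `make_test_db`, `make_tools`, `list_tables`, and `execute_sql`, as well as the `plants` table, are hypothetical stand-ins for illustration; they are not Datasette Agent's actual tool implementations. Only the `describe_table` name is taken from the text above.

```python
import sqlite3


def make_test_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical test data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE plants (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
        INSERT INTO plants (name, state)
        VALUES ('Diablo Canyon', 'CA'), ('Vogtle', 'GA');
        """
    )
    return conn


def make_tools(conn: sqlite3.Connection):
    """Return plain Python callables an agent could use as tools.

    These are illustrative stand-ins, not the real Datasette Agent tools.
    """

    def list_tables() -> list:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [row[0] for row in rows]

    def describe_table(table: str) -> list:
        # Only accept known table names so nothing unexpected reaches PRAGMA.
        if table not in list_tables():
            raise ValueError(f"unknown table: {table}")
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        return [f"{row[1]} {row[2]}" for row in rows]

    def execute_sql(sql: str) -> dict:
        # Return errors as data so an evaluation can score them
        # instead of crashing the run.
        try:
            return {"ok": True, "rows": conn.execute(sql).fetchall()}
        except sqlite3.Error as exc:
            return {"ok": False, "error": str(exc)}

    return list_tables, describe_table, execute_sql
```

Because the tools run against a real SQLite engine, a query that names a nonexistent column fails the same way it would in production, which is the failure an evaluation needs to observe.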
Prompts vs. Performance Realities
The baseline tests evaluated the prompts using GPT-4.1-mini and nano. The evaluation showed that context-minimization tactics can backfire:
1. The Culprit: A guideline in the system prompt strictly warned against redundant metadata lookups via describe_table.
2. The Consequence: The LLM frequently hallucinated column names, resulting in SQL errors.
3. The Solution: Include column signatures directly in the table list, or explicitly permit dynamic database inspection.
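The first option, putting column signatures into the table list, can be sketched with the standard `sqlite3` module. The helper name `table_list_with_columns` and the `table(col TYPE, ...)` output format are assumptions chosen for illustration, not the format Datasette Agent uses.

```python
import sqlite3


def table_list_with_columns(conn: sqlite3.Connection) -> str:
    """Render one line per table, including its column names and types.

    Hypothetical helper: the output could be placed in a system prompt so
    the model does not need to guess column names.
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    ]
    lines = []
    for name in names:
        if name.startswith("sqlite_"):
            continue  # skip SQLite's internal bookkeeping tables
        # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        signature = ", ".join(f"{col[1]} {col[2]}".strip() for col in cols)
        lines.append(f"{name}({signature})")
    return "\n".join(lines)
```

The trade-off is a longer prompt in exchange for fewer tool calls and fewer guessed columns; for very wide schemas, permitting inspection on demand may be the cheaper option.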
Structuring LLM Optimization
Rather than editing prompt files by hand, you define objective metrics (e.g., verifying SQL validity and the accuracy of final responses) and let DSPy compile prompts programmatically from training data.
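As a rough illustration, a metric of this kind can be written as a plain function that follows DSPy's metric calling convention of `(example, pred, trace=None)`. The field names `sql` and `answer`, the helper names, and the 0.0 / 0.5 / 1.0 scoring scheme are assumptions made for this sketch, not the metric used in the original experiment. The code avoids importing `dspy` so it runs on its own.

```python
import sqlite3
from types import SimpleNamespace


def sql_is_valid(conn: sqlite3.Connection, sql: str) -> bool:
    """Check that SQLite can compile the statement, without running it."""
    try:
        # EXPLAIN compiles the query, so unknown tables or columns
        # raise an error here, but the statement itself is not executed.
        conn.execute(f"EXPLAIN {sql}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False


def make_metric(conn: sqlite3.Connection):
    """Build a metric closed over a test database (hypothetical scoring)."""

    def metric(example, pred, trace=None) -> float:
        # Invalid SQL (for instance, a guessed column) scores zero.
        if not sql_is_valid(conn, pred.sql):
            return 0.0
        gold = str(example.answer).strip().lower()
        got = str(pred.answer).strip().lower()
        # Full credit if the gold answer appears in the final response,
        # partial credit for valid SQL with a wrong answer.
        return 1.0 if gold in got else 0.5

    return metric


# Stand-ins for a DSPy example and prediction, for local experimentation.
demo_example = SimpleNamespace(answer="2")
demo_pred = SimpleNamespace(sql="SELECT 1", answer="There are 2 plants.")
```

With DSPy installed, a function of this shape can be supplied as the `metric` argument to `dspy.Evaluate` or to an optimizer such as `dspy.BootstrapFewShot`. Substring matching is a deliberately crude accuracy check; a real evaluation would likely need something stricter.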
Try it in 2 minutes
pip install datasette datasette-agent dspy
✓ When to use
- Refining production LLM agent system prompts with strict, repeatable quality metrics.
- Migrating agent workflows to cheaper, smaller models (e.g., GPT-4.1-mini, Claude Haiku) while preserving accuracy.
✕ When NOT to use
- Simple, single-turn chatbot interfaces where prompt engineering plays a minor role in structured output success.
- Projects lacking pre-validated, gold-standard answer datasets to score prompts against.
What to do today
- Install datasette, datasette-agent, and dspy to explore structured prompt evaluation.
- Check your system prompts for 'anti-patterns' that force models to guess missing information to save tokens.
Sources