Abstract
Large Language Models (LLMs) have shown impressive capabilities in transforming natural language questions about relational databases into SQL queries. Despite recent improvements, small LLMs struggle to handle questions involving multiple tables and complex SQL patterns under a Zero-Shot Learning (ZSL) setting. Supervised Fine-Tuning (SFT) partially compensates for the knowledge deficits in pretrained models but falls short when dealing with queries that require multi-hop reasoning. To bridge this gap, different LLM training strategies to reinforce reasoning capabilities have been proposed, ranging from leveraging a thinking process within ZSL and including reasoning traces in SFT to adopting Reinforcement Learning (RL) strategies. However, the influence of reasoning on Text2SQL performance is still largely unexplored.
This paper investigates to what extent LLM reasoning capabilities influence Text2SQL performance on four benchmark datasets. To this end, it considers the following LLM settings: (1) ZSL, with and without general-purpose reasoning; (2) SFT, with and without task-specific reasoning traces; (3) RL, exploring different reward functions, both the established EXecution accuracy (EX) and a mix with fine-grained ones that also account for the precision, recall, and cardinality of partially correct answers; (4) SFT+RL, i.e., a two-stage approach that combines SFT and RL.
The results show that general-purpose reasoning under ZSL proves ineffective in tackling complex Text2SQL cases. Small LLMs benefit from SFT with reasoning much more than larger ones. RL is generally beneficial across all tested models and datasets. The use of the fine-grained metrics turns out to be the most effective RL strategy. Thanks to RL and the novel Text2SQL rewards, the 7B Qwen-Coder-2.5 model performs on par with 400+ billion-parameter ones (including GPT-4o) on the BIRD dataset.
Key Contributions
1. Comprehensive Training Strategy Analysis
We systematically evaluate four different training approaches for Text2SQL:
- Zero-Shot Learning (ZSL) with and without general-purpose reasoning
- Supervised Fine-Tuning (SFT) with task-specific reasoning traces (an illustrative sample format follows this list)
- Reinforcement Learning (RL) with novel reward functions
- Hybrid SFT+RL approach combining both strategies
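To make the reasoning-augmented SFT setting concrete, the snippet below shows one hypothetical training sample: the field names, tag names, and schema serialization are illustrative assumptions, not the exact format used in the paper.

```python
# A hypothetical SFT sample pairing a question and schema with a reasoning trace
# and the target SQL. Field and tag names are illustrative, not the paper's format.
sft_sample = {
    "prompt": (
        "Schema:\n"
        "CREATE TABLE singer(singer_id INT, name TEXT, country TEXT);\n"
        "CREATE TABLE concert(concert_id INT, singer_id INT, year INT);\n\n"
        "Question: How many concerts did singers from France give in 2024?"
    ),
    "completion": (
        "<think>The question needs a join between concert and singer on singer_id, "
        "a filter on country = 'France' and year = 2024, then a COUNT.</think>\n"
        "<answer>SELECT COUNT(*) FROM concert c "
        "JOIN singer s ON c.singer_id = s.singer_id "
        "WHERE s.country = 'France' AND c.year = 2024;</answer>"
    ),
}
```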
2. Novel Reward Functions for Text2SQL
We introduce fine-grained reward functions based on QATCH metrics (a computation sketch follows the list):
- Cell Precision (CP): Fraction of predicted cells that are correct
- Cell Recall (CR): Fraction of target cells that are predicted
- Tuple Cardinality (TC): Ratio between the sizes of the predicted and target result sets
- Format Reward (FR): Adherence to reasoning tag structure
- Tag Count Reward (TCR): Prevention of reward hacking
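The sketch below shows one minimal way to compute the table-level metrics from executed query results, assuming each result is a list of row tuples; the function names are illustrative and do not reproduce the QATCH implementation.

```python
from collections import Counter

def _cells(rows):
    """Flatten a query result (list of row tuples) into a multiset of cell values."""
    return Counter(cell for row in rows for cell in row)

def cell_precision(pred_rows, target_rows):
    """CP: fraction of predicted cells that also appear in the target result."""
    pred, target = _cells(pred_rows), _cells(target_rows)
    if not pred:
        return 0.0
    return sum((pred & target).values()) / sum(pred.values())

def cell_recall(pred_rows, target_rows):
    """CR: fraction of target cells that are covered by the prediction."""
    pred, target = _cells(pred_rows), _cells(target_rows)
    if not target:
        return 0.0
    return sum((pred & target).values()) / sum(target.values())

def tuple_cardinality(pred_rows, target_rows):
    """TC: ratio of the smaller result size to the larger one (1.0 = same row count)."""
    if not pred_rows or not target_rows:
        return 0.0
    return min(len(pred_rows), len(target_rows)) / max(len(pred_rows), len(target_rows))

# Example: half of the predicted cells are correct, row counts match.
pred = [("Alice", 30), ("Bob", 25)]
target = [("Alice", 30), ("Carol", 41)]
assert cell_precision(pred, target) == 0.5
assert cell_recall(pred, target) == 0.5
assert tuple_cardinality(pred, target) == 1.0
```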
3. State-of-the-Art Results
Our Think2SQL-7B model achieves:
- 56.1% weighted average on the BIRD dataset
- Performance comparable to 400+ billion parameter models (including GPT-4o)
- 8.5% improvement over the base Qwen-Coder-2.5-7B model
- Superior performance on challenging multi-hop reasoning queries
4. Key Insights
- Task-specific reasoning is essential; general reasoning capabilities alone are insufficient
- Dense rewards (QATCH metrics) outperform sparse rewards (execution accuracy) in RL (see the combined-reward sketch after this list)
- Small LLMs benefit more from reasoning traces than larger models
- RL training is particularly effective for complex SQL patterns involving multiple tables
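To illustrate the dense-versus-sparse distinction, the sketch below mixes execution accuracy with the QATCH-style metrics from the earlier snippet into a single scalar reward. The weights and the format check are illustrative placeholders, not the configuration reported in the paper, and the metric helpers (cell_precision, cell_recall, tuple_cardinality) are reused from the previous sketch.

```python
import sqlite3
from collections import Counter

def execute(db_path, sql):
    """Run a SQL query against a SQLite database; return its rows, or None if it fails."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def text2sql_reward(db_path, pred_sql, gold_sql, format_ok):
    """Mix sparse execution accuracy with dense QATCH-style metrics (illustrative weights)."""
    pred_rows, gold_rows = execute(db_path, pred_sql), execute(db_path, gold_sql)
    if pred_rows is None or gold_rows is None:
        return 0.0  # an unexecutable query earns no table-level credit
    # Sparse signal: exact, order-insensitive match between the two result multisets.
    ex = 1.0 if Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows)) else 0.0
    # Dense signal: partial credit for partially correct answers.
    dense = (cell_precision(pred_rows, gold_rows)
             + cell_recall(pred_rows, gold_rows)
             + tuple_cardinality(pred_rows, gold_rows)) / 3.0
    fr = 1.0 if format_ok else 0.0  # adherence to the reasoning-tag structure
    return 0.5 * ex + 0.4 * dense + 0.1 * fr  # placeholder weights
```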
Conclusion
This work demonstrates that small, efficiently trained models can compete with much larger ones when equipped with appropriate reasoning capabilities and training strategies. The introduction of fine-grained rewards for Text2SQL RL opens new avenues for improving structured query generation.
Future directions include:
- Extending the approach to more complex database schemas
- Investigating the combination of sparse and dense rewards
- Applying the methodology to other structured generation tasks
- Developing more sophisticated schema linking techniques
Citation
@article{papicchio2025think2sql,
title={Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL},
author={Papicchio, Simone and Rossi, Simone and Cagliero, Luca and Papotti, Paolo},
journal={arXiv preprint arXiv:2504.15077},
year={2025}
}