Hasan Alp Caferoğlu

SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation

October 7, 2025

Overview

SING-SQL is a fully automated, two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data for any target database — without relying on SQL logs or manual annotations. The framework partitions schemas hierarchically, synthesizes complexity-controlled SQL paired with natural language, validates with LLM-as-a-judge, checks executability, repairs automatically, and generates reasoning traces. To boost coverage, it balances table-level with column-focused generation so underrepresented attributes are included.

Figure 1: SING-SQL quality-aware SQL–Text generation pipeline with validation, executability checks, repair, and reasoning traces.
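The executability check in the pipeline can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a SQLite target database, and the `schools` table, its columns, and the helper name `is_executable` are hypothetical and chosen only for illustration. The idea is to run each generated query and keep the error message when it fails, which is the kind of signal an automatic repair step could use.

```python
import sqlite3
from typing import Optional, Tuple


def is_executable(conn: sqlite3.Connection, sql: str) -> Tuple[bool, Optional[str]]:
    """Run the query; return (True, None) on success or (False, error message)."""
    try:
        conn.execute(sql).fetchall()
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)


# Hypothetical toy schema standing in for a real target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT, county TEXT)")
conn.execute("INSERT INTO schools VALUES (1, 'Alpha High', 'Alameda')")

candidates = [
    "SELECT name FROM schools WHERE county = 'Alameda'",  # valid query
    "SELECT nme FROM schools",                             # misspelled column
]
results = [is_executable(conn, q) for q in candidates]

for sql, (ok, error) in zip(candidates, results):
    print("OK  " if ok else "FAIL", sql, "" if ok else f"-> {error}")
```

A check like this only confirms that a query runs. Whether the query matches the natural language question is a separate concern, which the framework addresses with LLM-as-a-judge validation.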



Main Results

Below we highlight performance on the California Schools subset of BIRD Dev and on SING-SQL's synthetic California Schools Dev/Test sets with 8 candidate SQLs:

Comparison on BIRD California Schools (Dev)

Figure 2: System comparison on California Schools subset of BIRD Dev.

Comparison on SING-SQL Synthetic Dev/Test (California Schools)

Figure 3: System comparison on SING-SQL Synthetic Dev/Test (California Schools), 8 candidate SQLs.


Data Statistics

SING-SQL produces datasets with comprehensive schema coverage and broader SQL diversity. Below, we include the question counts by difficulty level and compare join/aggregation characteristics across datasets.

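Question Count Comparison

| Dataset | Overall | Simple | Moderate | Challenging | Window |
|---|---|---|---|---|---|
| BIRD-Dev | 89 | 54 | 30 | 5 | 2 |
| Synthetic Train | 34,266 | 8,685 | 8,556 | 8,046 | 9,286 |
| Synthetic Dev | 1,124 | 297 | 259 | 259 | 319 |
| Synthetic Test | 1,124 | 299 | 248 | 248 | 340 |

Join Count Comparison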

The average number of joins per SQL query is a key indicator of relational reasoning complexity in Text-to-SQL datasets. As shown in Figure 4, the synthetic data generated by SING-SQL exhibits comparable or slightly higher join complexity than the BIRD development set across most difficulty levels, except for simple queries. This finding indicates that the proposed generation framework effectively integrates complex multi-table reasoning patterns into the synthetic data. Such diversity is crucial for training models that can handle realistic database interactions, where queries often require joining information from multiple related entities.

Figure 4: Average number of joins per SQL query across difficulty levels. SING-SQL shows comparable or slightly higher join counts than BIRD-Dev at most levels, except for simple queries.
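As a rough illustration of the metric, the sketch below computes the average number of joins per query by counting `JOIN` keywords with a regular expression. This is an assumption about how such a statistic could be measured, not the authors' actual procedure, and the sample queries and table names are hypothetical.

```python
import re
from statistics import mean

# Matches JOIN as a whole word, covering INNER JOIN, LEFT JOIN, and so on.
JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)


def count_joins(sql: str) -> int:
    """Count JOIN keywords in a single SQL string."""
    return len(JOIN_RE.findall(sql))


def average_joins(queries) -> float:
    """Average number of JOIN keywords per query."""
    return mean(count_joins(q) for q in queries)


# Hypothetical sample queries with 0, 1, and 2 joins.
sample = [
    "SELECT name FROM schools",
    "SELECT s.name FROM schools s JOIN scores t ON s.id = t.school_id",
    "SELECT s.name FROM schools s "
    "JOIN scores t ON s.id = t.school_id "
    "LEFT JOIN meals m ON s.id = m.school_id",
]
print(average_joins(sample))
```

A keyword count is only an approximation: it would also match the word JOIN inside a string literal and would miss implicit joins written as comma-separated tables in the FROM clause. A SQL parser would be more robust for real measurements.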

Aggregation Comparison

Aggregation operations such as SUM, AVG, and COUNT reflect the analytical depth of SQL queries. Figure 5 compares the proportion of queries containing aggregation operators across datasets. While BIRD-Dev shows a higher share of aggregations in simple queries (31.48%), SING-SQL provides a more balanced and representative distribution. In particular, it uses aggregation more heavily at the moderate, challenging, and window levels, reaching up to 78.76% for challenging queries. This broader aggregation coverage ensures that the synthetic dataset captures advanced analytical reasoning patterns, helping models trained on it generalize better to real-world data analysis scenarios where aggregations are prevalent.

Figure 5: Aggregation usage across difficulty levels. SING-SQL provides broader aggregation coverage, especially at moderate, challenging, and window levels.
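The proportion of queries containing aggregation can be approximated in a similar way. The sketch below is a hedged illustration rather than the authors' method: it treats a query as aggregating if it calls SUM, AVG, or COUNT (the operators named above), and additionally MIN or MAX, which is an assumption made here. The sample queries are hypothetical.

```python
import re

# SUM, AVG, COUNT are named in the text; MIN and MAX are an added assumption.
AGG_RE = re.compile(r"\b(SUM|AVG|COUNT|MIN|MAX)\s*\(", re.IGNORECASE)


def uses_aggregation(sql: str) -> bool:
    """True if the query calls at least one of the listed aggregate functions."""
    return AGG_RE.search(sql) is not None


def aggregation_share(queries) -> float:
    """Percentage of queries that contain an aggregate function call."""
    queries = list(queries)
    if not queries:
        return 0.0
    return 100.0 * sum(uses_aggregation(q) for q in queries) / len(queries)


# Hypothetical sample: two of the four queries aggregate.
sample = [
    "SELECT name FROM schools",
    "SELECT COUNT(*) FROM schools WHERE county = 'Alameda'",
    "SELECT county, AVG(score) FROM scores GROUP BY county",
    "SELECT name FROM schools ORDER BY name LIMIT 5",
]
print(f"{aggregation_share(sample):.2f}%")
```

Pattern matching has limits here as well. In SQLite, for example, MAX with two arguments is a scalar function rather than an aggregate, so a parser-based check would give more reliable figures.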


Models

We release SingSQL-LM: compact, specialized LMs fine-tuned on the synthetic California Schools data, achieving strong in-domain performance.

| Model | Specialized DB | Base Model | Train Method | HuggingFace |
|---|---|---|---|---|
| SingSQL-LM-1.5B-R32_CS | California Schools | Qwen2.5-Coder-1.5B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-1.5B-R64_CS | California Schools | Qwen2.5-Coder-1.5B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-3B-R32_CS | California Schools | Qwen2.5-Coder-3B-Instruct | SFT | 🤗 HuggingFace |
| SingSQL-LM-3B-R64_CS | California Schools | Qwen2.5-Coder-3B-Instruct | SFT | 🤗 HuggingFace |

Dataset

The synthetic California Schools dataset includes train, dev, and test splits with full schema coverage and balanced complexity levels.

| Dataset | Link |
|---|---|
| California Schools | 🤗 HuggingFace |

Citation

@misc{caferoğlu2025singsqlsyntheticdatageneration,
  title={SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation},
  author={Hasan Alp Caferoğlu and Mehmet Serhat Çelik and Özgür Ulusoy},
  year={2025},
  eprint={2509.25672},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.25672}
}

