USING LARGE LANGUAGE MODELS TO CONVERT NATURAL LANGUAGE INTO SQL QUERIES

Author(s)

Борисюк В. М., Козловський А. В.

Keywords:

Text-to-SQL, large language models, code generation, machine learning, natural language processing, SQL

Abstract

The problem of automatically converting questions formulated in natural language into structured SQL queries (Text-to-SQL) has remained relevant for decades, driven by the constant growth of data volumes and the need for non-expert users to access them. The main challenges of this task are the complexity of interpreting user queries, the need to account for database structures, the ambiguity of natural language, and ensuring high accuracy of SQL query synthesis. Traditional approaches, which combine manually crafted rules with deep neural networks, have demonstrated significant progress, but they often require substantial human effort to create and maintain the rules and show a limited ability to generalize to new domains. Further development in this area came with the emergence of pre-trained language models (PLMs), which brought substantial improvements on Text-to-SQL tasks through a deeper understanding of natural language semantics. However, as database schemas and linguistic formulations grow more complex, a problem arises: size-constrained models often generate incorrect SQL queries, which necessitates complex optimization strategies and reduces the scalability of such solutions. In this context, large language models (LLMs) open up new possibilities thanks to their strong natural language understanding, multitask ability, context awareness, and deep semantic analysis, all of which improve as model size grows. The article provides a thorough analysis of the key technical challenges, such as linguistic ambiguity, understanding and representing the database schema, generating rare SQL operations, and generalizing across domains.
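To illustrate the task setting, the sketch below shows a minimal zero-shot Text-to-SQL prompt of the kind LLM-based systems build from a schema and a question. It is not taken from the article; the schema, the `build_prompt` helper, and the prompt wording are hypothetical placeholders for any concrete LLM pipeline.

```python
# Illustrative sketch only: combine a database schema and a user
# question into a single zero-shot Text-to-SQL prompt for an LLM.
def build_prompt(schema_ddl: str, question: str) -> str:
    """Return a prompt asking the model to translate the question to SQL."""
    return (
        "Given the database schema:\n"
        f"{schema_ddl}\n"
        "Write a single SQL query that answers the question.\n"
        f"Question: {question}\n"
        "SQL:"
    )

# Hypothetical example schema and question (Spider-style).
schema = "CREATE TABLE singer (singer_id INT, name TEXT, age INT);"
print(build_prompt(schema, "How many singers are older than 30?"))
```

In practice, the prompt is then sent to the model, and the completion after `SQL:` is parsed out and executed against the database.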
The article describes the main stages in the development of the Text-to-SQL field, reviews current datasets covering multi-domain, multilingual, context-dependent, and knowledge-augmented tasks, and characterizes the metrics used to evaluate the quality of SQL generation, including Component Matching, Exact Matching, Execution Accuracy, and Valid Efficiency Score. It also summarizes recent research advances, in particular the integration of large language models and the use of in-context learning, fine-tuning, data augmentation, and multitask tuning strategies.
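The Execution Accuracy (EX) metric mentioned above can be sketched as follows: a prediction counts as correct when executing it yields the same result set as the reference query on the same database. This is an illustrative approximation, not an official evaluation script; the `execution_match` helper, the toy schema, and the order-insensitive comparison are assumptions.

```python
# Illustrative sketch of Execution Accuracy: compare the result sets
# of the predicted and gold SQL queries on the same database.
import sqlite3

def execution_match(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    try:
        pred = db.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # an invalid predicted query counts as a miss
    gold = db.execute(gold_sql).fetchall()
    # Ignore row order; real evaluators also handle duplicates and ordering
    # clauses more carefully.
    return sorted(map(repr, pred)) == sorted(map(repr, gold))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (name TEXT, age INT)")
db.executemany("INSERT INTO singer VALUES (?, ?)", [("A", 35), ("B", 25)])
print(execution_match(db,
                      "SELECT name FROM singer WHERE age > 30",
                      "SELECT name FROM singer WHERE age >= 31"))  # True
```

This is why EX tolerates syntactically different but semantically equivalent queries, unlike Exact Matching, which compares query components textually.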



Published

2025-05-29

How to Cite

Борисюк, В. М., & Козловський, А. В. (2025). USING LARGE LANGUAGE MODELS TO CONVERT NATURAL LANGUAGE INTO SQL QUERIES. Таврійський науковий вісник. Серія: Технічні науки, (2), 28–46. Retrieved from http://journals.ksauniv.ks.ua/index.php/tech/article/view/868

Issue

Section

COMPUTER SCIENCE AND INFORMATION TECHNOLOGIES