Abstract
Text-to-SQL, a computational linguistics task, seeks to facilitate the conversion of natural language queries into SQL queries. Recent methodologies have leveraged the concept of slot-filling in conjunction with predetermined SQL templates to effectively bridge the semantic gap between natural language questions and structured database queries, achieving commendable performance by harnessing the power of multi-task learning. However, employing identical features across diverse tasks is an ill-suited practice, fraught with inherent drawbacks. Firstly, based on our observation, there are clear boundaries in the natural language corresponding to SELECT and WHERE clauses. Secondly, the exclusive features integral to each subtask are inadequately emphasized and underutilized, thereby hampering the acquisition of discriminative features for each specific subtask. In an endeavor to rectify these issues, the present work introduces an innovative approach: the hierarchical feature decoupling model for SQL query generation from natural language. This novel approach involves the deliberate separation of features pertaining to subtasks within both SELECT and WHERE clauses, further dissociating these features at the subtask level to foster better model performance. Empirical results derived from experiments conducted on the WikiSQL benchmark dataset reveal the superiority of the proposed approach over several state-of-the-art baseline methods in the context of text-to-SQL query generation.
Introduction
Relational databases, repositories of extensive data, persist as inaccessible inaccessible to non-technical stakeholders who lack proficiency in Structured Query Language (SQL) and the foundational database schema. Over the course of the preceding decades, the realm of text-to-SQL, which endeavors to transmute natural language queries into SQL equivalents, has experienced an ascending wave of scholarly. This surge in interest is notably propelled by the contemporary strides made in the domain of deep neural networks and the amassment of substantial corpora, thereby yielding enhanced solutions for the text-to-SQL paradigm. In view of the intrinsic diversity in the nature of inquiries that may traverse either a solitary table or multiple tables, the text-to-SQL process can be broadly classified into two distinguishable categories: single-table-based and multiple-table-based. Within the purview of this paper, our particular focus is on the single-table-based text-to-SQL approach.
Contents in database table 1-10416547-1
Contents in database table 1-10416547-1
In the realm of single-table-based text-to-SQL models, a classification can be made into three distinct categories: sequence-to-sequence based, syntax-based, and sketch-slot-filling based methodologies. The sequence-to-sequence based models operate by generating a sequence of SQL tokens and schema elements, ultimately culminating in the production of the final SQL query prediction or a substantial portion thereof. In essence, they endeavor to undertake the conversion of the input natural language sequence into a SQL sequence. Notably, an early pioneer in the text-to-SQL domain, known as Seq2SQL [43], employed a sequence-to-sequence paradigm. However, it failed to fully account for the rigorous syntactic constraints inherent to SQL, leading to the ineluctable occurrence of syntactically flawed queries during the prediction phase. Subsequent to the sequence-to-sequence paradigm, the syntax-based approaches emerged as an evolution, which generate a sequence of syntax rules rather than mere tokens. These syntax rules manifest as clauses, and upon their application, conduce to the formation of a valid SQL query [42, 5, 9, 2, 25, 32]. An inherent advantage of syntax-based techniques is their ability to substantially mitigate the likelihood of generating erroneous tokens or syntactically flawed queries. Nonetheless, these methods exhibit diminished recall and necessitate substantial human intervention in formulating the pertinent rules, a process inherently challenging in its own right. Concurrently, the sketch-slot-filling based approaches endeavor to transmute the intricate endeavor of SQL query generation into a more tractable task, focusing on predicting specific segments within queries (e.g., determining which table columns will populate the SELECT clause). This paradigm effectively reimagines SQL generation as a classification task with accompanying methodologies, as witnessed in the methodological works of [37, 41, 11, 10, 17, 18, 38, 3]. A body of research has conclusively demonstrated the efficacy of sketch-slot-filling based approaches in handling single-table-based text-to-SQL challenges, yielding markedly superior results when juxtaposed with their aforementioned counterparts.
For the single-table-based text-to-SQL, every SQL query is composed of a “SELECT” (column/aggregation) clause with a “WHERE” (condition) clause that consists of one or multiple conditions. It can be observed from the WikiSQL dataset that there is a clear decoupling of natural language questions corresponding to SQL queries. For example, as shown in Fig. 1, “What is the total number of lyricist” clearly corresponds to the SELECT clause, while “where the lyrics theme is romance and the song lasts 3:50” corresponds to the WHERE clause. Then, turn attention to slot-fillings in SELECT and WHERE clauses where each slot can correspond to a specific word or feature in the natural language question. Herein, “Lyricist” clearly corresponds to the column in the SELECT clause. Table 1 shows the table schema corresponding to the question in Fig. 1. The feature representation acquired by the slot-filling-based approach is slightly useless or even negative for individual subtasks.
Example diagram of hierarchical feature decoupling.
Moreover, Qin et al. [22] found that some words or features associated with some class labels usually exist in text classification, but there are also many other common features that cannot distinguish different classes. They proposed a feature decoupling method to enhance the feature representation and make it more discriminative in classification. In addition, a series of work in computer vision, acoustics, and other areas where representation learning are performed by decoupling [34, 35, 27, 14].
Based on our observations from the dataset and the problems with current slot-filling-based methods, we propose a text-to-SQL approach with hierarchical feature decoupling. Specifically, we designed a hierarchical feature decoupling model based on HydraNet [17]. The model achieves a two-part feature decoupling for the subtasks in SELECT and WHERE clauses. Decoupled features are employed for slot prediction in corresponding clauses. In slot prediction, we draw on the work [28] to achieve feature decoupling between subtasks.
We conclude our contributions as follows.
Based on our observations from the dataset and the problems with current slot-filling-based methods, we propose a text-to-SQL approach with hierarchical feature decoupling. The validity of the model was verified on the benchmark dataset WikiSQL, and the proposed model achieves better performance than several state-of-the-art baselines. Our source code will be publicly available on GitHub.2
The remainder of this paper is structured as follows: Section 2 succinctly reviews pertinent prior work, Section 3 offers an exposition of the HFD model’s implementation, Section 4 expounds upon the conducted experiments and their corresponding outcomes, and, lastly, Section 5 delivers a comprehensive denouement to this paper.
Natural language to structured query language
In the domain of Text-to-SQL, the formalization of the concept was preceded by earlier developmental efforts. Norouzifard et al. [20] introduced an algorithm that enabled the selection of a query statement from a set of multiple query statements by parsing and subsequently converting the input statement into an SQL query statement. This marked an initial step in advancing the field. Rao et al. [23] later put forth an architectural framework aimed at transforming queries into SQL statements through a consideration of semantics and syntax. Subsequent research further refined the text-to-SQL process. Chaudhari [21] proposed a system with the capability to handle both simple queries and aggregate functions, broadening the scope of its applicability. Ghosh et al. [8] made strides in developing an automatic query generation system that could support more complex multi-table queries. Reinaldha and Widagdo [24] employed semantic rules to establish mapping relationships between words in a question and attributes in a database, further enhancing the precision of text-to-SQL conversion. Singh and Solanki [26] adopted an approach that utilized verb lists, noun lists, and predefined rules to facilitate the mapping of attributes and tables to words within sentences, even in cases of ambiguous textual input. Kate et al. [12] introduced a model designed to transform natural language problems into SQL queries, contributing to the evolution of text-to-SQL methodologies. Utama et al. [30] presented the DBPal system, an innovative solution that harnessed deep learning to convert natural language statements into SQL, effectively allowing users to interact with databases without needing a detailed understanding of the database schema or query characteristics.
Recent advancements in the field of text-to-SQL have seen a shift towards the utilization of large pre-trained models such as BERT [13], XLNet [40] as natural language problem solvers and feature extractors for database tables, resulting in substantial enhancements in model effectiveness. Hwang et al. [11] introduced the BERT-based network model SQLova, and He et al. proposed the MT-DNN [15] based network model X-SQL, drawing inspiration from the BERT-style architecture. Furthermore, Microsoft Dynamics365AI presented X-SQL [10] in 2019, and in 2020, Microsoft introduced the HydraNet model [17], both of which are based on pre-trained models. Building upon the progress of the HydraNet model [17], our work in this paper aligns with the HydraNet model and puts forth a novel approach to text-to-SQL conversion that leverages slot-filling techniques, contributing to the ongoing evolution of text-to-SQL methodologies.
Multi-task learning
Multi-task learning (MTL) stands as a pivotal paradigm within the realm of machine learning, manifesting its utility by harnessing knowledge from one or more source tasks to enhance the performance of a designated target task. The overarching objective of MTL is to fortify the learning process by encapsulating the modeling of multiple tasks within a singular computational framework. Notably, the inception of a deep multi-task learning network with hard parameter sharing can be attributed to Caruana et al. [1]. This methodology for MTL pertains to the concurrent training of multiple tasks while internally intertwining their performance metrics. This approach renders the data derived from supplementary tasks as a form of induced error, serving as a constraint that fosters advancements in overall accuracy and learning efficiency.
Of particular relevance, hard parameter sharing stands as a preeminent MTL technique within neural networks, prominently addressing concerns of overfitting, as elucidated by Thung et al. [29]. Various avenues for enhancing the efficacy of deep multi-task learning methodologies have been explored. A noteworthy innovation encompasses the cross-stitch network, an end-to-end trainable architecture that offers generalizability across multiple tasks, as advanced by Misra et al. [19]. In the context of convolutional neural networks, Fang et al. have introduced a dynamic multi-task learning approach characterized by the Task Transfer Connection (TTC) framework [7]. This innovative structure introduces TTC factors between analogous network layers pertaining to multiple tasks, thereby facilitating the sharing of knowledge among these tasks. Yang and Hospedales [39] have proposed a novel tensor decomposition technique within the domain of deep multi-task learning, grounded in matrix decomposition principles and seamlessly amalgamating tensor decomposition methods with multi-task learning strategies. Moving forward, the work of Chen et al. [4] has bolstered the facet of multi-task learning from a model training perspective, through the presentation of a gradient normalization algorithm engineered to adapt deep multi-task learning networks and optimize adaptive loss balancing. Furthermore, the research by Tang et al. [28] has introduced a pioneering MTL model, denoted as Progressive Layered Extraction (PLE), which explicitly segregates task-sharing and task-specific parameters. This demarcation mitigates negative transfer and the see-saw phenomenon, thereby fostering more efficient information exchange and joint representation learning. In the context of slot prediction, insights from Tang et al. [28] have been harnessed to achieve feature decoupling across subtasks.
Hierarchical feature decoupling model
We present a novel neural network model characterized by hierarchical feature decoupling, as depicted in Fig. 2. The framework’s application is directed toward the text-to-SQL task, a computational architecture composed of four integral modules, namely: a semantic representation module, a clause feature decoupling module, a slot feature decoupling module, and a slot prediction module.
Architecture of hierarchical feature decoupling model.
In the ensuing sections, we embark upon a comprehensive exposition of these components. Initially, we expound upon the intricacies of semantic representation, encompassing the encoding of natural language inquiries and the tabular schema. Subsequently, we delve into the intricate details of the clause feature decoupling module. This is succeeded by an exhaustive examination of the slot feature decoupling module, which specializes in the isolation of individual slot features within clauses. Finally, we furnish a detailed discourse on the slot prediction module, dedicated to the anticipation of the content resident within corresponding slots.
In our formalization, we represent a natural language query as
In this study, a dual multi-head self-attention mechanism was employed for the purpose of feature extraction, drawing upon the seminal work by Vaswani et al. [31]. It should be noted that the parameters of these two distinct networks were not shared, a crucial design choice. This segregation of parameters allowed for the specific exploration of semantic representations within the SELECT and WHERE clauses of single-table-based SQL queries. Consequently, these two networks were found to facilitate the decoupling of the SELECT and WHERE clauses, and the application of multi-head self-attention was faithfully replicated as presented below.
where
where
In accordance with the fundamental principles of the attention mechanism, the vectors denoted as
In this way, we obtain two decoupled feature representations,
SELECT clause.
WHERE clause.
In Section 3.2, we derive the decoupled features denoted as
It is worth mentioning that a more efficient network architecture for achieving effective decoupling, referred to as the customized gate control model, has been previously proposed by Tang et al. [28]. We incorporate their methodology into our current investigation with the aim of attaining a decoupled feature representation of slot fillings.
As illustrated in Fig. 2, the network module within the present Section comprises multiple expert modules and task-specific tower networks. Each expert module encompasses numerous sub-networks, known as experts, with the number of experts in each module being a tunable hyper-parameter. Similarly, a tower network is a multi-layered network with width and depth as configurable hyper-parameters. In particular, shared experts are responsible for capturing common patterns, whereas task-specific patterns are extracted by dedicated task-specific experts. Each tower network assimilates knowledge from both the shared experts and its task-specific counterparts. This implies that the parameters of the shared experts are influenced by all tasks, while the parameters of the task-specific experts are exclusively shaped by their respective tasks.
The integration of shared experts and task-specific experts for selective fusion is accomplished through a gated network structure. The architecture of the gated network is based on a single-layer feed-forward network, employing SoftMax as the activation function, and the input serves as a selector that calculates the weighted sum of the selection vectors, representing the output of the expert. More precisely, the output of the gated network for task
In the context of this study, the variable denoted as
The discrete realization of the particular implementation exhibits a bifurcation into two distinct segments, namely the SELECT clause and the WHERE clause. Employing the formalism denoted as Eq. (5), we are able to derive the disentangled slot features intrinsic to both the SELECT and WHERE clauses, encompassing entities denoted as
SELECT clause.
WHERE clause.
where
The SQL prognostication module, as delineated in Lyu et al.’s work [17], is comprised of an assortment of sub-modules designed to autonomously anticipate discrete segments of SQL statements. An all-encompassing SQL statement can be dissected into two fundamental constituents, namely the Select Clause and the Where Clause. As expounded upon in Section 3.3, the feature representation vectors corresponding to the disentangled constituent elements, denoted as
SELECT Clause. The S-col module is designed with the objective of identifying the column that corresponds to the specific inquiry and schema, while the S-agg module is designed to ascertain the appropriate aggregation operator for the given context.
where the
The S-agg module endeavors to determine an optimal aggregation operator from a set of six potential selections, namely: {None, Max, Min, Count, Sum, Avg}.
where
WHERE Clause. In a manner analogous to the
where
Moreover, it is imperative to elucidate that the W-operator (W-op) presents a trinary selection, encompassing the options
where
where
W-no module predicts the number of columns in the where clause.
where
In the course of the training process, our focus centers on optimizing the objective function, which comprises the amalgamation of losses associated with individual sub-tasks. Specifically, for the tasks denoted as S-col, S-agg, W-col, W-no, W-op, and W-val, we employ the cross-entropy loss. Notably, the framework outlined in this paper is rooted in the utilization of the pre-trained RoBERTa model. To facilitate the acquisition of a holistic feature representation by the HFD model prior to its decoupling stage, we directly input the output features generated by RoBERTa into the slot prediction module. This incorporation introduces a regular term via the application of the cross-entropy loss, thereby constraining the feature representation originating from RoBERTa.
It is worth acknowledging that during the inference process when converting text to SQL, the predicted SQL queries may exhibit errors related to schema linkage or syntax. To address these issues, we draw inspiration from the concept of execution-guided decoding, as introduced by Wang et al. [33] in their work on execution-guided decoding. The proposed methodology leverages iterative execution, involving a sequence of SQL queries from the candidate list being passed to the executor, with subsequent elimination of those queries that cannot be executed or return null results. In line with this approach, we incorporate execution-guided decoding subsequent to the model’s initial prediction.
Experiment
Dataset
In our study, we employ the WikiSQL dataset, a single-table-based text-to-SQL dataset meticulously documented by Zhong et al. under the reference [43]. This dataset comprises a substantial corpus of 56,355, 8,421, and 15,878 question-SQL query pairs allocated for the purposes of training, development, and testing, respectively. An essential characteristic of the dataset is the stringent division of tables between training and development sets, ensuring that any table found in either of these sets will not be replicated in the test set.
Evaluation metrics
Two prevailing evaluation metrics employed in the assessment of text-to-SQL systems are as follows: the first metric pertains to Logic Form Accuracy (LFA), which serves the purpose of ascertaining the fidelity of SQL statements engendered by the model with respect to their correctness. The quantification of LFA is facilitated by the mathematical formulation presented in Eq. (18).
The subsequent objective involves ascertaining the concordance between the anticipated outcomes of SQL statement executions and the actual results achieved during SQL statement executions, with particular emphasis on assessing the Execution Accuracy, as elucidated by the mathematical expression presented in Eq. (20).
Within the formula, denoted as
The present study represents a notable advancement over the prior research endeavor, known as HydraNet, as delineated in the seminal work by Lyu et al. [17]. The investigation herein is underpinned by the instantiation of the PyTorch deep learning framework, featuring an intricately orchestrated training regimen spanning ten epochs. A learning rate of 3e-5 is employed, accompanied by a judiciously selected batch size of 128. The optimization process is orchestrated via the AdamW algorithm, with an additional safeguard against overfitting being realized through the deployment of the Dropout regularization method. Notably, all experimental facets transpire within the computational realm, orchestrated across a triad of Nvidia GeForce RTX 3090Ti Graphics Processing Units, thereby attesting to the computational intensity underpinning this inquiry.
Compared models
The current study presents a comparative analysis of the novel method under investigation in relation to several contemporary state-of-the-art models. These models include: (1) Seq2SQL, characterized as a generation-based approach in natural language interface to databases (NLIDB) [43]; (2) SQLnet, acknowledged as a sketch-based methodology for semantic parsing in NLIDB systems [37]; (3) TypeSQL, an innovative model that leverages type information for enhanced comprehension of infrequent entities and numerical values within user queries [41]; (4) Coarse-to-Fine, a sophisticated neural architecture, designed to address the semantic parsing task in two distinct stages, thereby enhancing structural awareness and interpretability [5]; (5) SQLova, a pioneering model that integrates pre-trained language models into the sketch-based approach for NLIDB [11]; (6) X-SQL, which enriches the representation of structural database schemas by incorporating contextual embeddings [10]; (7) HydraNet, an innovative approach that partitions the NLIDB task into column-wise ranking and decoding processes [17]; (8) IE-SQL, a distinctive model proposing an information extraction framework for Text-to-SQL conversion, emphasizing sequence-labeling-based relation extraction techniques [18]; (9) QA-SQL, which reformulates the NLIDB task as a question-answering problem, employing a unified machine reading comprehension (MRC) model for multi-slot prediction [38]; (10) MC-SQL, a unique approach tailored for zero-shot text-to-SQL conversion tasks, notable for its independence from additional manual annotations [3]; (11) RLC, a model that simplifies the decoder layer of existing methods and introduces a novel loss re-weighting technique for enhanced performance [36]; and (12) UniSAr, a recent extension of autoregressive language models, incorporating non-invasive structural enhancements to promote interpretability and performance in NLIDB tasks [6].
Experimental results
Logical form (LF) and execution (EX) accuracy on WikiSQL dataset. The HFD’s batch size is set to 128 due to limitations in memory resources, and in the HydraNet model, we followed they set the batch size to 256. In their work, they describe that larger batch size gives better results.3 In the later section of the ablation experiments, we reproduced HydraNet at a batch size of 128 for comparison
Logical form (LF) and execution (EX) accuracy on WikiSQL dataset. The HFD’s batch size is set to 128 due to limitations in memory resources, and in the HydraNet model, we followed they set the batch size to 256. In their work, they describe that larger batch size gives better results.3 In the later section of the ablation experiments, we reproduced HydraNet at a batch size of 128 for comparison
In this study, we conduct a comprehensive comparative analysis of our proposed methodology against established models that have previously reported their performance in the context of the WikiSQL task. A summary of this comparison is provided in Table 2. Specifically, our method is contrasted with the Sequence-to-sequence model Seq2SQL [37] and the syntax-based approach Coarse-to-Fine [5]. Notably, we also assess our approach against several sketch-based slot-filling techniques, including SQLNet [37], SQLova [11], TypeSQL [41], X-SQL [10], HydraNet [17], QA-SQL [38], MC-SQL [3], and RLC [36], which exhibit similarities to the HFD model proposed in this manuscript.
It is imperative to emphasize that the HFD model places a heightened emphasis on feature decoupling across subtasks, setting it apart from existing slot-filling-based methods. Our comparative analysis, as depicted in Table 2, reveals that our approach demonstrates a notably promising enhancement in the performance of logic form accuracy concerning SQL statements. Moreover, in the context of execution-guided decoding, our method exhibits superior performance when compared to existing approaches, including SQLovaEG, X-SQLEG, and HydraNetEG_256. Furthermore, we observe that our methodology holds its ground favorably when compared to methods employing larger pre-trained models, such as UniSArT5-Large and UniSArBART-Large.
Ablation experiments
Logical form (LF) and execution (EX) accuracy on WikiSQL dataset in ablation experiments
Logical form (LF) and execution (EX) accuracy on WikiSQL dataset in ablation experiments
In this section, we present the findings of a series of ablation experiments, accompanied by a comprehensive description of the ablation model, which is expounded upon in the subsequent content.
HFD: It is the benchmark model, which contains all the components described in Section 3, which is our proposed hierarchical features decoupling model. HFD-SFD: It removes the features decoupling of the slot-fillings in SELECT and WHERE clauses, and the part that’s removed is the approach we describe in Section 3.3. HFD-SFD-CFD: It removes our proposed hierarchical features decoupling method, which merely ignores the irrelevance ranking score in HydraNet. HydraNet128: HFD is based on an improved version of HydraNet, so in the table, we show the difference between the two models for the same batch size.
Table 3 provides a comprehensive illustration of the noteworthy enhancements in Hierarchical Feature Decoupling (HFD) when contrasted with the benchmark model, HydraNet128, which was chosen as the foundation for this study. Furthermore, the ablation experiments conducted herein substantiate the viability of our proposed methodology grounded in the concept of hierarchical feature decoupling. Significantly improved experimental outcomes are also observed with the deployment of a single-layer decoupling approach, HFD-SFD, in comparison to non-decoupled methods, specifically HFD-CFD-SFD and HydraNet128. Our experimental findings underscore the essentiality of decoupling techniques among diverse subtasks within the domain of single-table-based text-to-SQL tasks. Furthermore, the conceptual framework outlined in this study offers a transferable paradigm for multi-task learning scenarios.
Fine-grained analysis for various methods on the test set of WikiSQL dataset
Fine-grained analysis for various methods on the test set of WikiSQL dataset
To facilitate a comprehensive exploration of the operational efficacy of individual sub-modules within diverse computational methodologies, as exemplified in Table 4, an in-depth scrutiny was conducted. In this context, we designate
In comparison to HydraNet, HFD demonstrated results congruent with its performance in the slot prediction task. Specifically, concerning the HydraNet256 variant, HFD exhibited a marked enhancement in its effectiveness across all subtasks. It is worth noting that resource limitations in terms of GPU capacity precluded the execution of experiments with a batch size of 256; nonetheless, it is evident from prior research that the model’s performance improves concomitantly with an increase in batch size.4
This paper introduces an innovative multi-task framework denoted as Hierarchical Feature Decoupling (HFD), designed to address the intricacies of single-table-based text-to-SQL tasks. HFD incorporates a bifurcated feature decoupling strategy for subtasks related to the SELECT and WHERE clauses, wherein features specific to each subtask within these clauses are independently disentangled. The HFD model, as proposed, demonstrates noteworthy performance outcomes when evaluated on the WikiSQL benchmark.
Footnotes
Acknowledgments
The authors extend their gratitude to the anonymous reviewers for their valuable and perceptive commentary. Financial support for this research endeavor was provided by the National Natural Science Foundation of China (Grant No. 62176053). This research work is supported by the Big Data Computing Center of Southeast University.
