Abstract
Empirically, symbolic regression tries to identify, through genetic programming and within the sphere of mathematical expressions, a model which best explains the relationship between variables in a given set of data, in terms of precision and simplicity. Virtual teaching and learning environments focused on evaluation have been previously investigated, as they offer teachers an effective teaching and learning tool and the student the possibility of computer-assisted evaluation and customized learning. Within this context, the present paper introduces an alternative approach to automatic evaluation in virtual teaching and learning environments, which offers the following improvements when compared to other methods: a) superior accuracy when compared with the linear regression method; b) simplicity of implementation; c) possible deduction of final student grades; and d) context adaptive. To this extent, a case study was applied to the LabSQL environment, with the purpose of clarifying the benefits of symbolic regression via genetic programming, while emphasizing its efficiency and simplicity of implementation.
Keywords
Introduction
Artificial intelligence (AI) offers automatic computer-based solutions to problems through machine learning, automated reasoning and knowledge representation [1]. In AI, genetic programming (GP) is an evolutionary computation (EC) technique which automatically solves problems without requiring the form or structure of the solution from the user in advance. In its most abstract level, GP constitutes a systematic, domain-independent automatic solving method from a high level statement of a problem [2, 3].
John Koza introduced a concept of GP where individuals are represented by computer programs, such as mathematical expressions, logical expressions, amongst others. This way, instead of researching a set of solutions, programs are automatically generated to solve the problem without a previous knowledge of the structure of the solution [3, 4].
Symbolic regression is a GP which manipulates mathematical expressions in order to find the equation which best explains the relationship between the variables of a given set of data, adopting a measure that minimizes error, for instance, the square root of the sum of the squares of the differences [4].
In virtual teaching and learning environments, one difficulty involves finding a function that expresses the performance of the student associating a set of data available in such environments. A number of scholars have solved this problem via linear regression, whose prerequisites include knowledge of statistics and a series of tests to validate the identified function. On the other hand, addressing the same problem via symbolic regression renders the process simpler, as the goal of the GP is to find the solution without requiring previous knowledge of statistics, and it may provide more accurate results.
In the present investigation work we propose a symbolic function with a good or perfect fit with the set of data available in the LabSQL environment, a virtual teaching and learning tool for SQL used by top-level students, the identified solution is a function in symbolic form, composed from a set of functions and terminals to be defined.
Therefore, what is herein proposed is an alternative approach to automatic evaluation in virtual teaching and learning environments, which offers a number of improvements when compared with other methods: a) superior accuracy when compared with the conventional regression method; b) simplicity of implementation; c) the possible deduction of final student grades; and d) context adaptive.
Beyond this introduction, the paper is organized in the following sections: 2) Symbolic Regression, 2.1) Gene Expression Programming, 3 Evaluation in Virtual Teaching and Learning Environments, 3.1) SQL Teaching Laboratory, 4) Case Study, 4.1) Software Engineering Metrics, 4.2) Database Metrics, 4.3) Performance Evaluation Parameters, 4.4) Development of and Empirical Model using GEP, 4.5) Symbolic Predictor Function of SQL Complexity for LabSQL, 5) Sensitivity Analysis, 6) Conclusions.
Symbolic regression
In statistics, regression is used to find the coefficients of a predefined function that are best suited for certain data. One of the problems surrounding regression analyses is that, when the fitness is not good, the analyst has to keep trying different functions by hand until a good model is found for the available data. This requires a lot of work and the results of the analysis are highly dependent on the skills and creativity of the analyst. Moreover, even the most experienced user is strongly biased when selecting the functions to be suited. For instance, in many application fields there is a strong tendency towards the adoption of linear or quadratic models, even when a more complex model offers a better adjustment to the available data.
Symbolic regression goes beyond this aspect. It finds a function which adjusts data points without making any assumptions about the structure of this function. GP does not make any such assumption, and suits this type of task perfectly.
It was introduced by Koza [4] as an extension of genetic algorithms (GAs). The main difference between GP and GAs lies in the representation of the solution. GP solutions are computer programs, while GA solutions are created as number sequences [4]. GP works with randomly generated populations of individuals (computer programs). Each program represents a possible solution for a certain problem. The classic GP technique is also known as a tree based GP pattern [9, 10]. A program developed by classic GP constitutes a hierarchically structured tree with functions and terminals. Functions and terminals are randomly selected and built together to form a computer model of a tree structure with a root point where branches grow from each function and end in a terminal [9, 10]. Figure 1 illustrates a simple and classic GP tree representation.
Genetic expression programming is a linear GP variant. Individuals created through linear GP variants are represented as linear chains, which are decoded and expressed as nonlinear entities (trees) [12, 13].
Once a population of models is randomly created, the GP algorithm evaluates individuals, selecting them for reproduction and generating new individuals and ultimately creating a new generation from all interactions [11] e [4]. New individuals are created through mutation, crossover and direct reproduction. Figure 2 shows a typical GP crossover operation. In the course of this operation, a point on a branch of each program is randomly selected and the set of terminals and/or functions of each program is then exchanged to create two new programs. During the mutation process, a function or terminal from a model is, occasionally, randomly selected and mutated (see Fig. 3). The best program obtained in any generation defines the output of the GP algorithm [4, 11].
Genetic expression programming
Genetic expression programming (GEP) was developed by Ferreira [14]. Most of the genetic operators used in GAs can be implemented in GEP with minor modifications. GEP consists of five main components, including: (1) set of functions, (2) terminal set, (3) fitness function, (4) control parameters, and (5) stopping criterion. GEP uses a fixed length of strings to represent solutions for problems, which are subsequently expressed as trees of analysis with different sizes and shapes [10, 15] and. These trees are known as expression trees (ETs).
A technical advantage of GEP is that the creation of genetic diversity is rendered extremely simple when genetic operators work on a chromosome level. Another strong point of GEP concerns its unique multigene nature, which allows for the evolution of more complex programs comprised by several subprograms [10, 15]. Each GEP gene contains a list of symbols with a fixed length, which can be any element of a function defined as {+ , − , x } and a terminal defined as {a, b, c, 1 } . The set of functions and the set of terminals should display the closure property: each function should be capable of assuming any data type value, which can be returned by a function or assumed by a terminal [10, 15]. A typical GEP gene with data sets and a terminal function can be:
The conversion starts from the first position in the K-expression, which corresponds to the ET root, and is read one by one along the chain. The previously offered GEP gene can also be expressed mathematically, as:
Inversely, an ET can be converted into a K-expression recording nodes from left to right in each layer of the ET, from the root node to the deepest level of the tree, forming a chain. As was already mentioned, the GEP genes have a fixed length, which is predetermined for a given problem. Therefore, what varies in GEP is not the length of the genes, but the size of the corresponding ETs. This means that there is a specific number of redundant elements, which are not useful when mapping the genome. Thus, the valid length of a K-expression can be equal or inferior to the length of the GEP gene.
To ensure the validity of a randomly selected genome, GEP applies the head-tail method. Each GEP gene is composed by a head and a tail. The head can hold both the function and terminal symbols, while the tail can only hold terminal symbols [10, 15] and. A basic representation of the GEP algorithm is shown in Fig. 5. In GEP, individuals are selected and copied into the next generation according to aptitude via elitist roulette sampling. This ensures the survival and cloning of the best individual for the next generation. Variation of population is introduced by carrying out unique or multiple genetic operators selected in chromosomes, which include crossover, mutation and rotation. The rotation operator is used to rotate two subparts of a genome element sequence in relation to a randomly selected point. It can also drastically reformulate the ETs [10, 15].
This type of regression finds the forming equation behind the set of mapped data in an input/output relationship. GP is a commonly used evolutionary technique in this type of regression. Several tools, such as GPTIPS [16], JGAP framework [17] and EpochX [18], GeneXProtools [14], have been proposed to minimize implementation difficulties, and they offer a set of features which are visually easy to adapt in GP and allow for a speedy evaluation of generated models. The GeneXPro tool was adopted here because it implements GEP, it is a reference in scientific works, it offers the most satisfactory results in terms of accuracy, it encompasses features which allow for a more detailed analysis, which have been maintained from its inception to the present days [19]. Additionally, this tool can be used in Windows, 32 and 64 bit versions [19].
The parameters to be defined in these systems form part of the preparatory steps for the application of GP in problem solving, which involve the identification of: a) the set of terminals; b) the set of functions; c) the fitness measure; d) the parameters and variables which control execution; e) the result definition method and the stopping criterion.
In (a) the set of terminals is identified. In this problem, the information that will be processed by the mathematical expression corresponds to the value of the independent variable x. Thus, the set of terminals, known as T, is T ={ 2 }.
In (b) the set of functions used to generate mathematical expressions is defined, in an attempt to adjust the set of given points. For instance, linear regression functions are comprised by addition and multiplication operations, and in GP this set of functions would be enough to solve the problem. A more general choice could include subtraction and division functions. Besides, in order to solve a greater variety of problems, it would be advisable to include sine, cosine, exponential and logarithm functions. Therefore, for this problem the set of functions, represented by F, is: F ={ + , − , * , % , sin, cos, exp, rlog }.
In (c) the fitness measure is determined. The direct fitness for this problem is the sum, taken from the fitness cases, of the absolute value of the difference between the value obtained with the symbolic expression and the correct value. The lower this difference, the better the individual program. In this case, standardized fitness equals direct fitness. The adjustment measure counts the number of fitness cases where numeric values returned by the symbolic expression display a small tolerance, known as the adjustment criterion, in relation to the correct value. As an example, we can define the adjustment criterion as 0,01, that is, the difference between the estimated and the correct value should the inferior to 0,01 in module. The adjustment measurement is more intuitive than fitness, hence its usefulness when the user observes the evolution of GP generations.
In (d) the parameters and variables that control the expression are determined, restricted to a population size of 500 individuals and a number of generations equal to 51, with a random initial generation and 50 more for evolution. These numbers were proposed by Koza [4] when applying genetic programming to problems in general, including symbolic regression.
In (e) and for each execution, the relative frequency per generation of individuals that met the predicate for success is observed, and the cumulative probability is obtained.
A common method of representing computer programs in GP is through syntax trees which can be easily transformed into known programming languages. The adopted programming language in the original implementation of GP is List Processing (LISP) [4].
The main goal of educational evaluation is to determine if the students know, understand and are able to perform [20]. The subdivision of evaluation into three different classes is widely used and accepted, namely: a) diagnostic – before the learning process begins, this is used to determine the level of pre-required knowledge and skills and previous knowledge of the contents to be addressed (for instance, reading readiness test); b) formative – in the course of a unit and within a class cycle, used to measure progress (for instance, practical student problems, weekly evaluation tests) and; c) summative – in the end of a unit or year, used to measure growth and formal accomplishment (for instance, final evaluation).
In the context of virtual teaching and learning environments (VLE), diagnostic, formative and summative assessments can be used to ensure the quality of teaching. In this sense, in terms of validity and reliability, besides minimizing subjectivity, standardized tests are frequently adopted and grades assigned, following a process where test results are converted in grades. Grades are then recognized as student learning measurements [20].
The following subsection describes the LabSQL environment, the case study of the present investigation work, which automatically evaluates SQL questions and assigns a final grade that can be used in formative, diagnostic and summative assessment.
SQL teaching laboratory
LabSQL, developed in 2005, is an interactive environment that helps students learn the database Structured Query Language (SQL) and can be used as a support tool for teachers, which automatically evaluates laboratory activities. For students involved in online e-courses, this tool can help assess learning levels, helping them regulate their study and learning path. In his environment evaluation, Lino [21] observed that 96,67% of the students considered that LabSQL increased their learning when compared with the traditional teaching-learning process.
In LabSQL there are four types of questions: multiple choice (or T/F), open-ended questions, conceptual maps and SQL programming exercises. When the student interacts with the system and submits his SQL query, the system executes and evaluates the complexity of this query in relation to the mediator query. This way, the student receives an automatic feedback, containing: a) the result of his query, allowing him to evaluate if the answer is correct; b) an automatic evaluation of the student response, considering the result of the execution and the complexity level when compared with the best answer from the system; c) the number of trials, and d) the global evaluation of the test or exercise.
With this tool, 1.448 students from the computer science graduation or post-graduation fields used this environment, amounting to 517 thousand answers by the end of 2014, as shown in Table 1. This volume of data provided the opportunity to design other queries using AI techniques: knowledge discovery through data mining [22–24]; development of an automatic student concept with fuzzy logic [25–27], automatic evaluation of SQL questions [28, 29] and open-ended questions [22, 30]. These last two approaches form the core of this investigation work.
Case study
LabSQL proposed a solution to automatically evaluate open-ended and closed-ended questions in SQL, based in software engineering (SE) metrics and multiple linear regression [29]. In [28] different qualitative aspects of evaluation are covered through tips and feedback, encouraging students to find better solutions.
The case study adopts SE and database (DB) metrics, defined in [29] to obtain ETs via GEP with the software [14], and confronts the results obtained in the first investigation work [21] against the GEP model, based on accuracy and simplicity of implementation.
Software engineering metrics
SE metrics, according to Basili [31], allow for the characterization, evaluation, prediction and analysis of a computer program. Therefore, these metric offer a quantitative approach to measure software quality.
In [21], the following set of SE metrics was defined: The Halstead Volume (VL) is calculated as presented in formula 3, where the variable N represents the sum of operators/operands total occurrences and n the number of the sum of single operators/operands [32]. The cyclomatic McCabe (MC) number, developed for procedural programs, counts the number of alternative paths of the execution flow existing in the program unit. Building the execution flow of the program, metrics are calculated using formula 4, where V (G) represents the graphic that is associated to the flowchart and DE the number of decision points. For non-procedural languages an approximation to the McCabe number is assumed: the number of comparisons plus 1 [33]. The length (NM) is similar to row count [34], however in SQL any query can be described in a single row. To overcome this problem, this metric counts the number of tokens (words or symbols) found in the query.
Specific DB metrics were included in order to evaluate the complexity of SQL queries [21], as substantiated in [35, 36]. DB metric are: The number of tables (NT) counts the number of tables used in the query. The number of columns (CN) counts the number of available columns in tables involved in the query. The number of columns (CI) counts the number of columns used in the query, without repetition. The number of tables (NT) counts the number of tables used in the query, multiplied by 10 and adding 5 if the query has an aggregate function.
The training and trial corpus is comprised by a subset of answers to the exercises proposed in two SQL programming textbooks [29]: the paper [37] containing 30 questions concerning a DB with seven relational tables and 28 attributes, the Kroenk book [38] on DB, with 27 questions from chapter 10, based on three relational tables, with ten attributes.
The corpus was categorized into 4 different complexity classes for SQL commands [21]: the 1st class uses a single table; the 2nd class used a table and aggregated functions; the 3rd class uses two or more tables; the 4th class uses two or more tables with aggregated functions. Classes are ranked from the less to the most complex.
The complexity to be predicted (CX), was defined from the mean of two grades assigned by DB experts, in a scale of 40 to 90, assuming that: a complexity value was assigned to each query, ranking queries from the less complex to the most complex, no two queries should not display the same complexity level; the 4 query classes are respected, where the lowest value of a higher class is above the highest value of a lower class.
Next, we demonstrate a descriptive statistics of the variables (Table 2).
Some of the predictor variables, SE and DB metrics, can be fundamentally interdependent. The first step in the data interdependence analysis involves a careful study of the measurement variables, and the identification of any highly correlated pair. HA high positive or negative correlation coefficient between pairs can lead to poor model performances and complicate the interpretation of the effects of explanatory variables on the response. This interdependency can originate problems during the analysis, as they tend to exaggerate the strength of relationships between variables. This is a simple case commonly known as the multicollinearity problem [39]. Thus, the correlation coefficient between every possible pair was determined as shown in Table 3. As shown in this table, a high correlation was identified between the predictor variables NM and VL, however, in this case, the conditions that generate this multicollinearity are determined by the VL formula, which means that it does not negatively affect the model to the extent that the conditions that generate this model are maintained in the prediction [40].
In [21, 29], we obtained a regression function with an error rate of 3,21% and an R-squared rate of 95,82% for 30 observations (Table 4).
In GEP and regression based analyses, data is randomly divided into training and validation subsets. Training data is used in learning (genetic evolution). Validation data is used to measure the performance of models with data that does not play any role in their generation. In order to obtain a consistent data division, multiple training and validation set combinations are considered. For the analysis, data values (80%) are used in the generation process and the remaining values (20%) are used to validate the generalizability of the models.
Performance evaluation parameters
The best model was selected based on a multiple-goal strategy, as follows: Selection of the simplest model, although this does not constitute a predominant factor. Provide the best fitness value based on training data. Provide the best fitness value with the validation data.
The first goal can be controlled by the user through the definition of parameters (for instance, the size or number of genes). For other goals, the following objective function (OBJ) is adopted as a measure of the extent to which the output of the predicted model complies with the experimentally measured output. The best GEP model selection is deduced by minimizing the following function [17]:
The NM, VL, MC, TB, CN, CI and CX metrics are used to create the GEP model. Several parameters involved in the GEP algorithm are shown in Table 5. In the present study, the basic arithmetic operators (+ , − , x, /) and a number of basic mathematical functions are used to obtain the optimal GEP model. The size of the population (number of chromosomes) defines the number of programs for the population, where a larger population leads to a slower execution time. The suitable population size depends on the number of possible solutions and the complexity of the problem.
A significant number of chromosomes was tested in order to find models with a minimum error and the program was run until no significant performance improvements were identified and the races were automatically completed. The chromosome architectures of the models developed by GEP include head size and number of genes. Head size determines the complexity of each term within the evolved model. The number of terms in the model is determined by the number of genes per chromosome. Each gene codes a different sub-expression tree or sub-ET. Different values are tested for head size and number of genes. For a number of genes above one, the function connecting multiplication is used to link mathematical terms coded in each gene. The MAE function is used to calculate the general fitness of evolved programs.
The acceptable time span for an evolution without a best fitness improvement is defined through generations without a changing parameter. After 2.000 generations, herein adopted, a mass extinction or a neutral gene is automatically added to the model [10, 15] and. The values of the other parameters involved are selected based on the previously suggested values [9, 15], and a new trial and error approach. For the development of GEF based empirical models we used the GeneXproTools [19] software.
Symbolic predictor function of SQL complexity for LabSQL
The symbolic GEP based function predicts the SQL complexity through SE and DB metrics, and the best obtained function is shown in Fig. 6 assuming the format of four sub trees (Sub-ETs). The elements represented by the letter “d” in each Sub-ET correspond to metrics (d0 to d5, respectively NM, VL, MC, TB and CI). The constants are represented by the letter “c” (c0, c7 to c9). The basic arithmetic operators and functions used to generate this model were described in the previous section.
The first goal, that is, selecting the simplest model, can be controlled by the user through the definition of parameters (for instance, the size or the number of genes). For other goals, the following objective function (OBJ) is adopted as a measure of the extent to which the output of the predicted model complies with the experimentally measured output. The best GEP model selection is deduced by minimizing the following function [17].
A comparison between the symbolic function model and the value predicted by the GEP is shown in Fig. 7. The model reveals a mean percentage error below 1,5, which in the graphic is translated in the short distance observed between the generated model and the CX values for each observation.
The CX (y) classification is shown in Fig. 8, with a square error of 0,9877 and a growing angular line. In this graphic we can clearly observe that the conceived model, represented by the straight line, displays a small margin of error in CX classifications, revealing a good square error adjustment.
The descriptive statistics of Table 6 show the values of the obtained symbolic regression model. Confronting this model against the model obtained in [21], and described in, we can see that the proposed model surpasses the prediction in approximately 3%, while displaying a smaller margin of error for all coefficients.
Sensitivity analysis
The contribution of the most relevant predictor variables (NM, VL, MC, TB, E CI) in the best GEP models is measured by a sensitivity analysis. For this purpose, frequency of input variable values are obtained. A frequency of input value of 1,00 indicates that the variable appeared in 100% of the best programs developed by GEP, a commonly adopted methodology in GP based analyses [15]. The values of frequency predictor variables are shown in Fig. 9. According to these results, we can observe that the VL, TB and CI metrics influence complexity variations to a greater extent when compared to other inputs. The sensitivity analysis results conform to the expected structural point of view.
Conclusions
This study addressed a GP variant, that is, a GEP used to assess the complexity of an SQL code. We obtained a symbolic model, through empirical means, in order to predict the SQL complexity from a database of results obtained in complexity tests and develop a predictive model. The validity of the model was tested using part of the trials and the trial database. This validation stage confirms the general efficiency in the application of the model when estimating the complexity of the SQL code. Moreover, the GEP predictor model effectively meets the conditions of different criteria adopted for its external validation.
The proposed model was assessed against the linear regression model and significantly surpassed the later in terms of statistical analysis for using nonlinear functions. Owing to the nonlinearity of its behaviour, GEP leads to better results when compared to regression based models. Its concept also proved to be superior, owing to its simplicity of implementation, in measurement results, besides being adaptable to a different evaluation system, such as open-ended questions or conceptual maps.
Additionally, we can easily calculate the complexity of the model from SE and DB metrics. Therefore, this model doesn’t require validation tests by DB experts. Similarly, as the creation of other models for different evaluations requires a minimum of metrics, for example, to evaluate open-ended questions coefficients would be based in n-grams. The model can thus be safely adapted and implemented for other evaluation and pre-practical purposes to the extent that it derives from tests with a wide variety of metrics, and this can be considered the main advantage of the GEP model.
Footnotes
Acknowledgments
This work was conduct out with CNPq support, National Council for Scientific and Technological Development - Brazil. We also appreciate the financial support of AISTI (Iberian Association for Information Systems and Technologies), which permitted the registration in the WorldCIST’16 (4th World Conference on Information Systems and Technologies), held in Recife, Brazil, 22 - 24 March 2016, and consequently this publication.
