SQLiDDS: SQL injection detection using document similarity measure

Abstract

SQL injection attack has been a major security threat to web applications for over a decade. Nowadays, attackers use automated tools to discover vulnerable websites from search engines and launch attacks on multiple websites simultaneously. Because these attacks are extremely heterogeneous in nature, accurate run-time detection, particularly of previously unseen attacks, is still a challenge for the regular-expression and parse-tree matching techniques suggested in the literature. In this paper, we present a novel approach for real-time detection of SQL injection attacks by applying a document similarity measure to run-time queries after normalizing them into sentence-like form. The proposed approach acts as a database firewall and can protect multiple web applications using the database server. With additional inputs from a human expert, the system can also become more robust over time. We implemented the approach in a tool named SQLiDDS and the experimental results are very encouraging. The approach can effectively detect all types of SQL injection attacks, including previously unseen attacks, with substantial accuracy yet negligible impact on the overall performance of web applications. The tool was built with PHP and tested on web applications built with PHP and MySQL, but it can be adapted to other platforms with minimal changes.

Keywords

SQL injection detection, query normalization, document similarity, database firewall, phrase similarity

1. Introduction

Web applications are exposed to different types of security threats like Denial of Service (DoS), Structured Query Language (SQL) injection, Cross Site Scripting (XSS), etc. Among these, SQL injection attack is predominantly used against web databases. The Open Web Application Security Project (OWASP) ranks it first among the Top-10 security threats [38]. According to the TrustWave [44] Global Security Report, SQL injection was the number one attack method for four consecutive years. Attacking a website using SQL injection has become much easier ever since attackers started using sophisticated tools and Botnets [33] which automatically discover URLs of vulnerable web pages from search engines like Google and launch mass SQL injection attacks on them simultaneously from distributed sources. About 97% of data breaches across the world occur due to SQL injection attacks alone [10]. As per a recent study [13] conducted by the Ponemon Institute¹ and Symantec, a data breach can cost an enterprise about $5.4 million per incident.

¹ Ponemon Institute conducts research on privacy and information security – http://www.ponemon.org/.

Research on SQL injection in the literature can be broadly categorized into three classes: defensive coding approaches, vulnerability testing approaches, and prevention based approaches. Defensive coding approaches proposed by Boyd and Keromytis [5], Johns et al. [22], Livshits and Erlingsson [32], Nguyen-Tuong et al. [37], etc., consist of techniques applied during application development so that SQL injections cannot happen. These require programmers to write their application code in a specific manner to secure it against SQL injection attacks, but in practice, security aspects are often ignored during programming. On the other hand, vulnerability testing approaches developed by Benedikt et al. [3], Kals et al. [23], Wassermann et al. [50], etc., rely on extensively testing web applications to find possible SQL injection vulnerabilities so that they can be fixed before release to production. The effectiveness of this approach is limited by the ability of the tools to discover all possible security holes. The majority of research on SQL injection has been done under the prevention based approaches. In general, prevention based approaches by Bisht et al. [4], Buehrer et al. [6], Halfond and Orso [19], Lee et al. [29], Wang and Li [49], etc., consist of preparing a model of SQL queries and/or the application’s behavior during normal use in a secured environment, and then utilizing that model at run-time to detect anomalies and prevent SQL injection attempts. The major drawback of this approach is that any later modification or enhancement to the application code (which is quite probable) requires rebuilding the normal-use model. Further, these approaches are usually designed to protect only one web application from SQL injection attacks, and are mostly suitable for a specific language and database platform.

The above discussion re-establishes the necessity for creating a system that (1) does not require writing code in a specific way, (2) does not need extensive vulnerability testing, (3) does not require building a normal-use model, (4) can protect multiple web applications interfacing with one database server (as in a shared hosting environment), (5) is able to detect new forms of SQL injection attacks, and finally, (6) is adaptive and learns by itself (or with the help of a human expert) to become more robust over time. Accuracy of detection and minimal performance overhead are obvious requirements for any such system.

In this paper, we present a novel technique to detect SQL injection attacks by applying the concept of document similarity in an attempt to address the above requirements. Document similarity measures are typically used for web search, text categorization, information retrieval and in several other domains. We extend the query transformation scheme proposed by Kar and Panigrahi [24] to normalize SQL queries into a sentence-like form, which facilitates comparison between SQL queries by applying a document similarity measure. The technique was implemented in a tool named SQLiDDS (SQL injection Detection using Document Similarity) and validated experimentally on five sample web applications. The query normalization scheme and the application of document similarity for identifying SQL injection attacks are the main contributions of our work. Based on interesting observations made during the course of research, we postulate that it is sufficient to examine only the WHERE clause part of run-time queries for detecting SQL injection attacks, which is another important contribution as it greatly reduces the scope of processing required. Acting as a database firewall, SQLiDDS can protect multiple web applications interacting with a database server, such as in shared hosting environments, which gives it an edge over existing methods. The accuracy of the proposed system is very encouraging, and the performance overhead is almost imperceptible to the end user.

This article is an extension of our previous work [25] in which we first introduced the idea of using document similarity measure for detecting SQL injection attacks at the database firewall layer. Besides elaborating the approach in a more comprehensible manner and providing fresh results of experiments, this version also incorporates the following additions and improvements to the earlier system:

The query normalization scheme has been modified to address a few identified weaknesses that could enable an attacker to bypass detection, as discussed towards the end of Section 4.1.

Various other similarity measures available in the literature have been looked into in Section 4.2 and the best measure suitable for our approach has been selected by performing experiments discussed in Section 4.3, thereby almost doubling the run-time performance.

A new clustering algorithm has been developed for clustering of normalized injection patterns, discussed in Section 4.5, that produces cohesive and spherical clusters improving accuracy.

The threshold similarity for clustering is now automatically determined through analysis and smoothing of histogram created from the similarity matrix, which is described in Section 4.6.

The architecture of the system has been enhanced by introducing a repository for storing genuine query structures (shown in Figs 4 and 5), so that known benign queries are instantly identified without going through the similarity matching process.

We compare our approach in Section 6.5 with some notable existing methods considering various criteria and type of SQL injection attacks detected.

The rest of the paper is organized as follows. Section 2 introduces the mechanism of SQL injection attack by taking a simple example. Section 3 states the SQL injection attack detection problem from the view point of a database firewall and lays down our motivation behind this study. Section 4 describes the overview of our approach in detail, covering query normalization, document similarity measures, detection strategy based on two interesting observations, and a new clustering algorithm for clustering of normalized queries. The system architecture is explained in Section 5. Experimental evaluation and results are discussed in Section 6 along with assessment of performance overhead. Quoting related works in Section 7, we conclude the paper in Section 8 with a note on future directions of research.

2. SQL injection attack

SQL injection attacks occur due to a commonly found vulnerability of constructing dynamic SQL queries using raw input data received through web forms or URL query string parameters without proper validation. An attacker exploits this vulnerability by inserting carefully crafted SQL keywords, values and symbols through the parameters or form fields, which effectively alters the structure and semantics of the dynamic queries, causing them to behave differently. The injected query returns results intended by the attacker instead of the results expected by the programmer. To understand the basic mechanism of SQL injection, consider a product detail page of an E-commerce website accessed by the URL:

http://www.ecomstore.com/product_details.php?pid=24

The script product_details.php receives the product ID through the query string parameter pid. The programmer uses the parameter to construct an SQL query and executes it on the database using the following PHP code:

$query = "SELECT * FROM products WHERE prod_id = " . $_GET['pid'];
$result = mysql_query($query, $dbconn);

When a genuine request with a normal integer value is received through the parameter pid, for example 24 as shown in the example URL, the run-time query generated by this code would be:

SELECT * FROM products WHERE prod_id = 24;

Upon execution, the query returns one row containing all attribute values for the product ID 24 from the products table (assuming that the product exists in the database), which is what the programmer intended. It may be observed that in the PHP code, the value of pid received through the URL parameter has been used without any validation or type-checking. Suppose an attacker enters the following URL in his browser:

http://www.ecomstore.com/product_details.php?pid=24+OR+1+=+1

Each ‘+’ sign in the query string part is the url-encoded form of a space character. Therefore, the parameter pid now carries the string value “24 OR 1 = 1”. Since the code directly uses the parameter’s value to construct the dynamic query without any validation, the SQL query would be generated as:

SELECT * FROM products WHERE prod_id = 24 OR 1 = 1;

When executed, this query will fetch all records from the products table because the WHERE condition evaluates to true for every row. The record pointer will be positioned at the first record, which could be any product, and the web page will show its details instead of those of product ID 24. By appending “+OR+1+=+1” to the query string parameter, the attacker could change the intended structure and hence the behavior of the dynamic query. The resulting query is said to have been SQL injected. This is an example of the simplest type of SQL injection known as tautological attack – a tautology is a formula that always evaluates to true. There are several other types of SQL injection attacks as classified by Halfond et al. [20], such as piggy-backed queries, UNION based attacks, blind SQL injection, time-delay based injection, stored procedure attacks, alternate encoding, etc. By manipulating input values through form fields and query string parameters, an attacker can make the web page display sensitive information from other database tables. Exploiting just one vulnerable parameter of a single web page of a web application, an attacker can potentially extract the contents of the entire back-end database. Many programmers are not aware of this serious threat or do not take care to validate and properly sanitize the received parameter values before using them in dynamic SQL queries. SQL injection vulnerability is therefore the most commonly found flaw in web applications.
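The effect of the missing validation can be reproduced without a database by building the query string alone. The sketch below is illustrative only: the helper names `build_query_unsafe()` and `build_query_checked()` are hypothetical and not part of the example application, and the integer check via `filter_var()` is just one possible form of validation.

```php
<?php
// Builds the query the same way product_details.php does: raw concatenation.
function build_query_unsafe(string $pid): string
{
    return "SELECT * FROM products WHERE prod_id = " . $pid;
}

// Hypothetical variant: accept the parameter only if it is a plain integer.
function build_query_checked(string $pid): ?string
{
    $id = filter_var($pid, FILTER_VALIDATE_INT);
    if ($id === false) {
        return null; // reject anything that is not an integer
    }
    return "SELECT * FROM products WHERE prod_id = " . $id;
}

// PHP decodes '+' in the query string to a space before the script sees it.
$benign   = "24";
$injected = "24 OR 1 = 1";

echo build_query_unsafe($benign), "\n";
echo build_query_unsafe($injected), "\n";
var_dump(build_query_checked($injected)); // NULL: the input is rejected
```

With the unchecked builder the tautology becomes part of the query text, whereas the checked variant refuses the request before any query is formed.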

Given that SQL injection vulnerability stems from the use of unvalidated inputs from external sources to construct dynamic SQL queries, prepared statements are highly recommended as they offer a good degree of protection against such attacks. As a defensive coding method, prepared statements can prevent most SQL injection attempts, especially when the bound parameters in the prepared queries are of numeric data types. However, improper use of prepared statements can still expose SQL injection vulnerabilities. For example, consider a web page showing products from a category ordered by a key field in ascending or descending order, accessed by the following URL:

http://www.ecomstore.com/products.php?cid=5&key=price&order=ASC

Suppose the programmer writes the PHP code using prepared statements in the following manner:

$qry = "SELECT * FROM products WHERE category_id = ? ORDER BY ? ?";
$stmt = $mysqli->prepare($qry);
$stmt->bind_param("iss", $_GET['cid'], $_GET['key'], $_GET['order']);
$stmt->execute();

In the code, the programmer correctly binds the three parameters with type specification “iss” (integer, string, string) in the prepared statement. While the first parameter ‘cid’ bound as integer is immune to SQL injection, the ‘key’ and ‘order’ parameters are vulnerable, particularly to UNION-based attacks. Unlike dynamically constructed queries, prepared statements require two round-trips with the database server, which makes them unsuitable in shared hosting environments. They generally have a larger memory footprint and may end up with a less optimized execution plan. Under certain situations, e.g., in an IN(?,?,...) clause with a variable list of values, it is cumbersome to use prepared statements. They are also not suitable for dynamically binding table/column names in a query. Injection attacks can happen through quoted string parameters in prepared statements using SQL smuggling [15]. Popular PHP/MySQL applications like Wordpress,² phpBB, osCommerce, etc., do not use prepared statements. Interestingly, on examining the codebase of Wordpress we found that it uses $wpdb->prepare() throughout the application code, but the prepare() function of the wpdb class does not truly prepare the statement at the database; rather, it performs basic sanitization of the values supplied for the placeholders.

² Over 74 million websites built with Wordpress exist on the Internet.
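Because identifiers and keywords such as a column name or ASC/DESC cannot be supplied as bound parameters, code like the products.php example typically ends up interpolating them into the query text. One common way to keep such interpolation safe, shown here only as a hedged sketch and not as part of the proposed system, is to map the request values onto a fixed allow-list. The function name `build_product_query()` and the list of sortable columns are assumptions made for illustration.

```php
<?php
// Hypothetical helper: only cid remains a bound placeholder; the ORDER BY
// parts are chosen from fixed lists and never copied from the request.
function build_product_query(string $key, string $order): string
{
    $sortable = ['price', 'name', 'prod_id'];   // assumed column names
    $column   = in_array($key, $sortable, true) ? $key : 'prod_id';
    $dir      = strtoupper($order) === 'DESC' ? 'DESC' : 'ASC';

    return "SELECT * FROM products WHERE category_id = ? ORDER BY $column $dir";
}

echo build_product_query('price', 'asc'), "\n";
// An injected sort key falls back to the default column.
echo build_product_query('price UNION SELECT 1', 'DESC'), "\n";
```

Any value outside the lists is silently replaced by a default, so attacker-controlled text never reaches the query string.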

Many organizations deploy commercially available Network Intrusion Detection Systems (NIDS) or Web Application Firewalls (WAF) to protect their web infrastructure against various attacks including SQL injection attacks. While these can be an effective component of a layered defense strategy, they are still penetrable. Most WAFs require a tremendous amount of expert configuration and tuning before they can provide adequate protection. These WAF systems mostly rely on regular expression based rule-sets and filters created from previously known attack signatures. However, signature-based detection is not foolproof and can be circumvented using different techniques [11,35,39], such as alternate character encoding, mixing of upper and lowercase alphabets, whitespace spreading, and comment embedding. An attacker usually starts with running simple tests to determine if any NIDS or WAF has been deployed. After identifying the characteristics of the perimeter security, the subsequent attacks are devised for bypassing detection resulting in successful SQL injection. For example, consider the following attacks, showing only the injection code added by an attacker:

OR 'A' = 'A'
OR -5.24 = 17.43 - 22.67
OR 419320 = 0x84B22 - 0x1E52A
OR 'ABC' = CONCAT('A', 'B', 'C')
OR 'ABC' = CoNcAt(ChAr(0x28 + 25), cHaR(0x42), chAr(80 - 0x0D))
OR 'XYZ' = sUbStRiNg(cOncAt('AB', 'CX', 'YZ', 'EF'), 4, 3)
OR 'XYZ' = /*!SuBsTrInG(*/CoNcAt('aB',/*!'cX','YZ',*/'eF')/*!,4,3)*/

All of these expressions evaluate to true, hence they are tautological attacks. Variations only in the RHS of the expressions have been shown here, but they can also be used in the LHS or on both sides, making it possible to bypass detection by the WAF. Similar bypassing techniques can also be used in other types of SQL injection attacks. In fact, SQL injection attack expressions can be formed in so many ways that constructing regular expression based filters becomes almost impossible. With little effort, an attacker can elude signature based systems by making creative changes to the expressions.
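To make the limitation concrete, the sketch below runs a deliberately naive signature against some of the tautologies listed above. The pattern is an assumption made for illustration, not a rule taken from any real WAF; it flags only the form `OR <literal> = <literal>`.

```php
<?php
// Naive, hypothetical signature: OR followed by literal = literal.
const NAIVE_TAUTOLOGY = '/\bOR\s+(\'?)\w+\1\s*=\s*(\'?)\w+\2\s*$/i';

function matches_signature(string $injection): bool
{
    return preg_match(NAIVE_TAUTOLOGY, $injection) === 1;
}

$attacks = [
    "OR 'A' = 'A'",
    "OR -5.24 = 17.43 - 22.67",
    "OR 'ABC' = CONCAT('A', 'B', 'C')",
    "OR 'XYZ' = /*!SuBsTrInG(*/CoNcAt('aB',/*!'cX','YZ',*/'eF')/*!,4,3)*/",
];

foreach ($attacks as $attack) {
    printf("%-8s %s\n", matches_signature($attack) ? 'caught' : 'missed', $attack);
}
```

Only the first expression matches; each variant would need its own additional rule, which is the maintenance problem described above.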

SQL injection vulnerability was originally published by Jeff Forristal [26] under the alias RFP (rain.forest.puppy) in Phrack Magazine [40] in a section titled “ODBC and MS SQL Server 6.5.” Numerous websites have been attacked in the past using SQL injection, causing leakage of sensitive data and huge losses to organizations. According to the Web Hacking Incidents Database³ (WHID), SQL injection attacks account for more than 17% of all web attacks reported. Even today, it is one of the most popular exploits used by attackers. Sophisticated and automated SQL injection attack tools are freely and abundantly available on the Internet, which even a novice hacker can use.

³ http://projects.webappsec.org/w/page/13246995/Web-Hacking-Incident-Database.

3. Problem statement and motivation

An SQL injection attack attempt is successful only when an injected SQL query gets executed on the database. Therefore, it must be detected before the query is sent to database server for execution. This involves examining an incoming SQL query at the database firewall level to determine if it contains any injected malicious code. The problem of SQL injection detection can therefore be stated as:

“Given an SQL query, determine if it is injected.”

Let $Q = \{q_{1}, q_{2}, \dots, q_{n}\}$ be the set of queries issued to the database server for execution, irrespective of the source web application in a shared hosting environment. Some of the queries in $Q$ may contain SQL injection attacks. We denote the set of queries generated due to SQL injection attacks as $Q_{a}$, and genuine queries generated by benign inputs as $Q_{g}$. The SQL injection detection problem can then be defined as: given any SQL query $q_{i}$, determine whether $q_{i} \in Q_{a}$ or not. If $q_{i} \notin Q_{a}$, it means that the query is genuine and does not contain any SQL injection attack, i.e., $q_{i} \in Q_{g}$. Therefore, SQL injection detection is essentially a binary classification problem from the perspective of a database firewall.

We intend to take a step away from the traditional path of string, regular expression or parse-tree matching because these approaches generally fail to detect previously unseen attacks. Our main motivation in this research is to apply the notion of similarity instead. This is driven by the fact that new forms of SQL injection attacks often do not qualify as entirely new, but minor and creative variations of previously seen attack vectors. We postulate that a run-time query can be identified as injected with sufficient accuracy, if it is highly similar to one or more previously known injected queries. Our focus is on practical implementation and ease of deployment in a shared hosting scenario.

A part of the problem is “how to compute similarity between SQL queries?” as there may be special characters, operators, brackets, etc. The solution to this is to normalize SQL queries completely into text form that is suitable for applying a document similarity measure without losing their syntactic structure. This is much simpler and more straightforward than generating parse trees of queries and determining similarity between them by matching sub-trees. We propose a query normalization scheme which converts an SQL query into a sentence-like form so that similarities can be appropriately computed using available document similarity measures.

4. Overview of proposed approach

As stated already, our approach is centered around two key ideas: the query normalization scheme and application of document similarity measure. In brief, the proposed system begins with a preloaded set of known SQL injection attacks normalized into text form, which are grouped into clusters using a document similarity measure. The attack vectors in each cluster are merged into a document. At run-time, SQL injection attacks are detected by comparing the similarity of an incoming query with the documents containing the clustered attack vectors. The rest of this section elaborates various components of the proposed approach.
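The run-time decision described above can be sketched as a loop over the attack documents. This is a minimal illustration under stated assumptions: `is_injected()` and `token_overlap()` are hypothetical names, the token-overlap function is only a stand-in for the document similarity measure the paper selects later, and the threshold of 0.75 is an arbitrary value chosen for the example.

```php
<?php
// Stand-in similarity (shared tokens / all tokens); NOT the paper's measure.
function token_overlap(string $a, string $b): float
{
    $ta = array_unique(explode(' ', $a));
    $tb = array_unique(explode(' ', $b));
    $common = count(array_intersect($ta, $tb));
    $union  = count(array_unique(array_merge($ta, $tb)));
    return $union === 0 ? 0.0 : $common / $union;
}

// Flags a normalized query if it is similar enough to any attack document.
function is_injected(string $normalized, array $attackDocs,
                     callable $sim, float $threshold): bool
{
    foreach ($attackDocs as $doc) {
        if ($sim($normalized, $doc) >= $threshold) {
            return true;
        }
    }
    return false;
}

// One toy "document" holding a normalized tautology pattern.
$docs = ['USRCOL EQ INT OR INT EQ INT'];

var_dump(is_injected('USRCOL EQ INT OR DEC EQ DEC', $docs, 'token_overlap', 0.75));
var_dump(is_injected('USRCOL GT DEC AND USRCOL LT INT', $docs, 'token_overlap', 0.75));
```

Passing the similarity function as a callable keeps the control flow independent of whichever measure is eventually chosen.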

4.1. The query normalization scheme

Query transformation was proposed by Kar and Panigrahi [24] to convert SQL queries into structural form, with an important argument that the identifiers in a query, such as table names, column names, etc., are irrelevant with regard to its overall structure. In this work, we extend the transformation scheme to normalize a query purely into text form, i.e. a series of words separated by spaces, facilitating application of document similarity measures. In information retrieval, punctuation marks and symbols appearing in text are ignored while computing similarity because they do not contribute towards the textual content of the documents. However, in an SQL query, various symbols, operators, brackets, etc., are important structural and syntactic elements. It would be inappropriate to compute similarity between two SQL queries without considering these.

The normalization scheme uses only capital letters A–Z for all tokens and the space character as token separator. Symbols, special characters, operators, etc., are also converted into words. Digits are also converted into tokens except where they are a part of an SQL keyword or function, such as SHA1(), MD5(), etc. The only other symbol that is not converted is the underscore (_) character, as it is frequently used in MySQL system databases, system tables and functions, like information_schema, DATE_ADD, CURRENT_USER, LAST_INSERT_ID, etc. Splitting such tokens at the underscore character would result in over-tokenization and negatively affect the similarity values. Additionally, tokens and symbols need to be substituted in proper order so that the syntactic structure of a query is not lost. The step by step process of query normalization is shown in Table 1.

Table 1
The query normalization scheme

Step	Token/Symbol	Substitution
1.	White-space characters (\r, \n, \t)	Space
2.	Anything within single/double quotes
(a) Hexadecimal value	HEX
(b) Decimal value	DEC
(c) Integer value	INT
(d) IP address	IPADDR
(e) Single alphabet character	CHR
(f) General string (none of the above)	STR
3.	Anything outside single/double quotes
(a) Hexadecimal value	HEX
(b) Decimal value	DEC
(c) Integer value	INT
(d) IP address	IPADDR
4.	System objects
(a) System databases	SYSDB
(b) System tables	SYSTBL
(c) System table column	SYSCOL
(d) System variable	SYSVAR
(e) System views	SYSVW
(f) System stored procedure	SYSPROC
5.	User-defined objects
(a) User databases	USRDB
(b) User tables	USRTBL
(c) User table column	USRCOL
(d) User-defined views	USRVW
(e) User-defined stored procedures	USRPROC
(f) User-defined functions	USRFUNC
6.	SQL keywords, functions and reserved words	To uppercase
7.	Any token/word not substituted so far
(a) Single alphabet	CHR
(b) Alpha-numeric without space	STR
8.	Other symbols and special characters	As per Table 2
9.	The entire query	To uppercase
10.	Multiple spaces	Single space

Table 2

Normalization of special characters and symbols (Step-8 of Table 1)

Symbol	Name	Substitution
`	Backquote	Remove
/* */	Empty comment	Remove
!= or <>	Not equals	NEQ
&&	Logical AND	AND
||	Logical OR	OR
/*	Comment start	CMTST
*/	Comment end	CMTEND
~	Tilde	TLDE
!	Exclamation	EXCLM
@	At-the-rate	ATR
#	Pound	HASH
$	Dollar	DLLR
%	Percent	PRCNT
^	Caret	XOR
&	Ampersand	BITAND
|	Pipe or bar	BITOR
*	Asterisk	STAR
-	Hyphen/minus	MINUS
+	Addition/plus	PLUS
=	Equals	EQ
()	Matching parentheses	Remove
(	Orphan opening parenthesis	LPRN
)	Orphan closing parenthesis	RPRN
{	Opening brace	LCBR
}	Closing brace	RCBR
[	Opening bracket	LSQBR
]	Closing bracket	RSQBR
\	Back slash	BSLSH
:	Colon	CLN
;	Semi-colon	SMCLN
"	Double quote	DQUT
'	Single quote	SQUT
<	Less than	LT
>	Greater than	GT
,	Comma	CMMA
.	Stop or period	DOT
?	Question mark	QSTN
/	Forward slash	SLSH

Each step of query normalization is a find-and-replace operation, done by appropriately using the preg_replace() and str_replace() built-in functions available in PHP. For example, the following lines of code implement the steps 3(a), 3(b) and 3(c) of the normalization process:

$query = preg_replace("/\b0x[0-9a-f]+\b/i", ' HEX ', $query);
$query = preg_replace("/\b[0-9]*\.[0-9]+\b/", ' DEC ', $query);
$query = preg_replace("/\b[0-9]+\b/", ' INT ', $query);

It may be noticed that the substitutions (e.g., ’ HEX ’) are done with a space character attached on either side of the normal tokens. In fact, every substitution is done with space characters on both sides so that the tokens are well separated. This however results in accumulation of multiple spaces between adjacent tokens as the normalization proceeds step by step. In Step-8, all special characters and symbols are converted as per Table 2 and the entire query is changed to uppercase in Step-9. Finally in Step-10, all multi-spaces are replaced with single spaces neutralizing the space accumulation side-effect of previous steps. The complete vocabulary including all MySQL keywords, functions, reserved words and the substitutions used in normalization scheme contains 686 distinct words for MySQL version 5.5. We did not consider some deprecated keywords in this work, but ideally they should be included.
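The find-and-replace steps described above can be combined into one small function. The sketch below is a partial illustration, not the SQLiDDS implementation: `normalize_partial()` is a hypothetical name, the table and column lists are supplied by the caller as an assumption about how schema names become known, and only Steps 1, 3(a)–(c), 5(b)–(c), a subset of Table 2, and Steps 9–10 are covered. Quoted strings, system objects and the remaining symbols are omitted.

```php
<?php
// Partial, hypothetical normalizer; see the lead-in for what is left out.
function normalize_partial(string $query, array $tables, array $columns): string
{
    // Step 1: white-space characters become plain spaces.
    $query = preg_replace('/[\r\n\t]/', ' ', $query);

    // Step 3: numeric literals outside quotes (order matters: HEX, DEC, INT).
    $query = preg_replace('/\b0x[0-9a-f]+\b/i', ' HEX ', $query);
    $query = preg_replace('/\b[0-9]*\.[0-9]+\b/', ' DEC ', $query);
    $query = preg_replace('/\b[0-9]+\b/', ' INT ', $query);

    // Step 5 (b), (c): user tables and columns, matched as whole words.
    foreach (['USRTBL' => $tables, 'USRCOL' => $columns] as $token => $names) {
        foreach ($names as $name) {
            $pattern = '/\b' . preg_quote($name, '/') . '\b/i';
            $query = preg_replace($pattern, " $token ", $query);
        }
    }

    // Step 8 (subset of Table 2); two-character operators come first.
    $symbols = ['<>' => 'NEQ', '!=' => 'NEQ', '=' => 'EQ', '<' => 'LT',
                '>' => 'GT', '*' => 'STAR', ',' => 'CMMA', ';' => 'SMCLN',
                '#' => 'HASH', '+' => 'PLUS', '-' => 'MINUS'];
    foreach ($symbols as $symbol => $word) {
        $query = str_replace($symbol, " $word ", $query);
    }

    // Steps 9 and 10: uppercase, then collapse the accumulated spaces.
    return trim(preg_replace('/ +/', ' ', strtoupper($query)));
}

echo normalize_partial(
    'sEleCt * fRoM products wHeRe price > 10.00 aNd discount < 8',
    ['products'],
    ['price', 'discount']
), "\n";
```

Every substitution is padded with spaces, as in the original snippets, so the final collapse in Step 10 is what restores single-space separation.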

To visualize working of the normalization process, consider the following SQL queries, intentionally written in mixed-case to demonstrate the effect of normalization:

sEleCt * fRoM products wHeRe price > 10.00 aNd discount < 8
SeLeCt email FrOm customers WhErE fname LIKE 'john%'
sELeCT CoUnT(*), sUm(`amount`) fROm `orders` OrDeR bY SuM(`amount`)

By applying the normalization scheme, these are converted into the following forms respectively:

SELECT STAR FROM USRTBL WHERE USRCOL GT DEC AND USRCOL LT INT
SELECT USRCOL FROM USRTBL WHERE USRCOL LIKE SQUT STR PRCNT SQUT
SELECT COUNT STAR CMMA SUM USRCOL FROM USRTBL ORDER BY SUM USRCOL

It may be observed that the normalized queries are like readable sentences (in a way), still preserving the semantics of the original queries. Consider an injected query generated due to a cleverly crafted injection attack to bypass detection (taken from the examples given in Section 2):

SELECT * FROM products WHERE prod_id = 24 OR 'ABC' = CoNcAt(ChAr(0x28 + 25), cHaR(0x42), chAr(80 - 0x0D));#

This query contains a number of symbols, operators and SQL function calls. The normalization scheme converts it into the following form:

SELECT STAR FROM USRTBL WHERE USRCOL EQ INT OR SQUT STR SQUT EQ CONCAT CHAR HEX PLUS INT CMMA CHAR HEX CMMA CHAR INT MINUS HEX SMCLN HASH

Any SQL query, irrespective of its length and complexity, is thus normalized into a series of words separated by spaces, just like a sentence in English. The syntactic structure of the query is correctly maintained. Due to the way various symbols, operators, values, etc., are normalized, many different queries get converted into the same form, which helps reduce the size of query repositories.

The normalization scheme is also designed to take care of common techniques used by attackers to bypass detection in the following ways:

Newline, carriage-return and tab characters in the query are replaced with normal space character (see Step-1 in Table 1) which neutralizes bypassing attempt by whitespace spreading.

MySQL allows identifiers such as database, table, and column names to be delimited by the backquote (`) character wherever necessary to differentiate from reserved keywords. An attacker can unnecessarily delimit every identifier in the injection attack to bypass detection. As backquotes do not contribute towards the structural form of a query, they are removed from the query before substituting other symbols (see Table 2).

Parentheses are used to enclose function parameters and subqueries in SQL, but it is syntactically correct to use additional parenthesis-pairs even if not required. For example, CHAR(65) returns the character A, which can be augmented with extra parentheses as CHAR((((65)))) producing the same return value. Similarly, a tautology 2 = 5 - 3 can be written as ((2)) = (((5) - ((3)))) which still evaluates to true. If all parentheses are converted to tokens, these expressions after normalization would exhibit very low document similarity with known attack patterns, thereby providing an opportunity to bypass detection. However, attackers also sometimes inject one or two opening or closing parentheses to guess the parenthetical structure (if any) of the dynamic query. Therefore, matching parenthesis-pairs are removed but any mismatching parentheses are preserved and converted into tokens.

MySQL allows version-specific commands to be embedded inside a query within inline comments. For example, in the query “SELECT /*!50525 DISTINCT*/ retail_price FROM books”, the DISTINCT command inside the comment will execute only if the MySQL version is 5.5.25 or higher, but is treated as a comment on older versions. This feature is rarely used by programmers; however, if an attacker can somehow determine the version of the database server, he can exploit it by embedding malicious code inside comments in an attempt to bypass detection. Comments inside queries are therefore preserved and converted into tokens. On the other hand, an attacker can try to stuff empty comments inside the injected code to bypass detection. Therefore, empty comments are removed from the query during normalization.
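The backquote, parenthesis and empty-comment rules above can be sketched as two small helpers. This is an illustration under assumptions rather than the tool's code: the names `strip_noise()` and `normalize_parentheses()` are hypothetical, comments containing only whitespace are assumed to count as empty, and parentheses inside quoted strings are not treated specially.

```php
<?php
// Removes backquotes and empty comments such as /**/ or /* */.
function strip_noise(string $query): string
{
    $query = str_replace('`', '', $query);
    return preg_replace('~/\*\s*\*/~', ' ', $query);
}

// Drops matching parenthesis pairs and tokenizes orphan parentheses.
function normalize_parentheses(string $query): string
{
    $chars = str_split($query);
    $open  = [];                                  // positions of unmatched '('
    foreach ($chars as $i => $ch) {
        if ($ch === '(') {
            $open[] = $i;
        } elseif ($ch === ')') {
            if ($open) {
                $chars[array_pop($open)] = ' ';   // matched pair: remove both
                $chars[$i] = ' ';
            } else {
                $chars[$i] = ' RPRN ';            // orphan closing parenthesis
            }
        }
    }
    foreach ($open as $i) {
        $chars[$i] = ' LPRN ';                    // orphan opening parenthesis
    }
    return trim(preg_replace('/ +/', ' ', implode('', $chars)));
}

echo normalize_parentheses('((2)) = (((5) - ((3))))'), "\n";
echo normalize_parentheses('24) OR (1 = 1'), "\n";
echo strip_noise('SELECT/**/`price`/* */FROM `products`'), "\n";
```

A stack of open positions is enough to tell matched pairs from orphans in one pass, so padded and unpadded forms of the same expression normalize identically.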

4.2. Document similarity measures

Document similarity measures are typically used in the information retrieval domain, which is an extensively researched area. Several document similarity measures, defined for different applications, exist in the literature. We look into some commonly used symmetric similarity measures bounded in $[0, 1]$, conduct experiments on linearity and execution time, and choose the most suitable one for our approach.

4.2.1. Vector space model

The Vector Space Model (VSM) is quite popular in information retrieval [34, Chap. 6]. In this model, documents are represented as feature vectors in a high-dimensional space. Similarity between two documents $d_{i}$ and $d_{j}$ , is computed as the cosine of the angle θ between the document vectors $\vec{d_{i}}$ and $\vec{d_{j}}$ . Formally, cosine similarity is given by: $\begin{matrix} (1) & Sim (d_{i}, d_{j}) = cos (θ) = \frac{\vec{d_{i}} \cdot \vec{d_{j}}}{‖ \vec{d_{i}} ‖ ‖ \vec{d_{j}} ‖} . \end{matrix}$

A popular method in VSM, known as the bag-of-words model, represents documents as term vectors using term-frequency (TF) and inverse-document-frequency (IDF) weighting. The TF weight of a term $t_{k}$ in a document $d_{i}$ and its IDF weight in the document corpus are respectively given by: $\begin{array}{l} (2) & {TF}_{t_{k}, d_{i}} = \{\begin{matrix} 1 + log ({tf}_{t_{k}, d_{i}}) & if {tf}_{t_{k}, d_{i}} > 0, \\ 0 & otherwise; \end{matrix} and \\ (3) & {IDF}_{t_{k}} = log \frac{N + 1}{1 + f_{d, t_{k}}}, \end{array}$ where ${tf}_{t_{k}, d_{i}}$ is the frequency of term $t_{k}$ in document $d_{i}$ , N is the total number of documents in the corpus and $f_{d, t_{k}}$ is the number of documents in which the term $t_{k}$ occurs. Equation (3) is adjusted to take care of any term that does not occur in any document (i.e. $f_{d, t_{k}} = 0$ ), by adding one document containing all terms only once to the corpus. The TF-IDF weight of a term $t_{k}$ in a document $d_{i}$ is the product of TF and IDF weights, given by: $\begin{matrix} (4) & w_{t_{k}, d_{i}} = {TF}_{t_{k}, d_{i}} \times {IDF}_{t_{k}} . \end{matrix}$

Two documents $d_{i}$ and $d_{j}$ are represented in vector form by the TF-IDF weights of the terms they contain in n-dimensions, where n is the total number of terms in the vocabulary, as: $\begin{array}{l} \vec{d_{i}} = ⟨ w_{t_{1}, d_{i}}, w_{t_{2}, d_{i}}, \dots, w_{t_{n}, d_{i}} ⟩ and \\ \vec{d_{j}} = ⟨ w_{t_{1}, d_{j}}, w_{t_{2}, d_{j}}, \dots, w_{t_{n}, d_{j}} ⟩ . \end{array}$

Equation (1) can therefore be expanded to: $\begin{matrix} (5) & Sim (d_{i}, d_{j}) = \frac{\sum_{k = 1}^{n} w_{t_{k}, d_{i}} w_{t_{k}, d_{j}}}{\sqrt{\sum_{k = 1}^{n} w_{t_{k}, d_{i}}^{2}} \times \sqrt{\sum_{k = 1}^{n} w_{t_{k}, d_{j}}^{2}}} . \end{matrix}$

Although TF-IDF weighting is popular in the information retrieval domain and has been used for intrusion detection [7,17,30,42], our experiments using this measure on normalized queries did not produce results within acceptable limits. A major shortcoming of the TF-IDF based measure is that it ignores the order of occurrence of terms in the document. The two sentences “John is older than Mary” and “Mary is older than John” are determined to be identical documents (i.e. similarity = 1.0) by this measure. Our initial reasoning was that, because SQL syntax follows a strict grammar, the order of terms should be automatically accounted for. For example, a query “SELECT price FROM products” cannot be written as “FROM products SELECT price”; it is syntactically incorrect even if the meaning is the same in English. However, we found that the TF-IDF weighted cosine similarity between a genuine query and an injected query (after normalization) turned out to be high in many cases. Deeper investigation revealed that: (1) except for a few major SQL keywords, other terms do not necessarily follow a strict order, and (2) an injected query may not be very different from a genuine query in terms of the frequency of individual terms, since a minor difference can potentially contain an SQL injection attack. Therefore, we conclude that TF-IDF weighted term-based cosine similarity with the proposed normalization scheme is not appropriate for identifying SQL injected queries with acceptable accuracy.
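The order-blindness of the term-based measure can be reproduced with a small sketch of Eqs. (2) to (5). The three-sentence corpus is an assumption made for this illustration (the third sentence is invented so that the IDF weights are not all zero), and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_weight(tf):
    """Eq. (2): sub-linear term-frequency weight."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def idf_weights(corpus):
    """Eq. (3): IDF of every term, computed over a list of tokenized documents."""
    n_docs = len(corpus)
    vocab = {term for doc in corpus for term in doc}
    return {
        term: math.log((n_docs + 1) / (1 + sum(term in doc for doc in corpus)))
        for term in vocab
    }


def tfidf_vector(doc, idf):
    """Eq. (4): sparse TF-IDF vector of one document."""
    return {t: tf_weight(c) * idf.get(t, 0.0) for t, c in Counter(doc).items()}


def cosine(u, v):
    """Eq. (5): cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


d1 = "john is older than mary".split()
d2 = "mary is older than john".split()
d3 = "mary is tall".split()          # invented filler document
idf = idf_weights([d1, d2, d3])
print(cosine(tfidf_vector(d1, idf), tfidf_vector(d2, idf)))
```

Because both sentences contain exactly the same terms with the same counts, their vectors coincide and the similarity is 1 up to floating-point rounding, regardless of word order.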

The shortcomings of term-based similarity can be overcome by considering phrases instead of individual terms. By phrase, we mean a contiguous sequence of terms, not the syntactical or structural clauses of SQL. A phrase in this context is the same as an N-gram where each gram corresponds to a term; we prefer to use the term phrase for simplicity. Considering that a document d is a sequence of terms $(t_{1}, t_{2}, \dots, t_{n})$ , phrases are extracted by sliding a window of size w across the document, i.e., every phrase contains w consecutive terms. We call w the phrase length. For example, if the phrase length is 2, then the phrases extracted are $(t_{1} t_{2}, t_{2} t_{3}, t_{3} t_{4}, \dots, t_{n - 1} t_{n})$ . The number of phrases in a document therefore equals $(n - w + 1)$ , where n is the number of terms in the document. Let the phrases extracted from two documents $d_{i}$ and $d_{j}$ be $P_{i} = (p_{i_{1}}, p_{i_{2}}, \dots, p_{i_{m}})$ and $P_{j} = (p_{j_{1}}, p_{j_{2}}, \dots, p_{j_{n}})$ respectively, where m and n now denote the numbers of phrases in the two documents. Then the set of unique phrases in both documents is $P = P_{i} \cup P_{j} = \{p_{1}, p_{2}, \dots, p_{k}\}$ where $k ⩽ m + n$ ; the upper bound occurs when the documents do not have any phrase in common. The documents can now be represented in vector form in k-dimensions by the frequency of the phrases they contain as: $\begin{array}{l} \vec{d_{i}} = ⟨ f_{p_{1}, d_{i}}, f_{p_{2}, d_{i}}, \dots, f_{p_{k}, d_{i}} ⟩ and \\ \vec{d_{j}} = ⟨ f_{p_{1}, d_{j}}, f_{p_{2}, d_{j}}, \dots, f_{p_{k}, d_{j}} ⟩ . \end{array}$

Following Eq. (5), cosine similarity based on phrase frequencies can be written as: $\begin{matrix} (6) & Sim (d_{i}, d_{j}) = \frac{\sum_{q = 1}^{k} f_{p_{q}, d_{i}} f_{p_{q}, d_{j}}}{\sqrt{\sum_{q = 1}^{k} f_{p_{q}, d_{i}}^{2}} \times \sqrt{\sum_{q = 1}^{k} f_{p_{q}, d_{j}}^{2}}} . \end{matrix}$

It can be seen that the similarity computed by Eq. (6) is indirectly dependent on the phrase-length w. Taking $w = 1$ is the same as computing similarity by unweighted term frequencies. Unless the two documents being compared are identical, increasing w results in a lower similarity value as the number of phrases common to the documents decreases. A natural question arises: how should the appropriate phrase length be chosen? A rational answer is obtained by looking at the classic tautological attack “OR 1 = 1.” By normalization, it gets converted to “OR INT EQ INT” which contains four terms. Therefore, taking phrase length $w = 4$ should be appropriate. More justification on the phrase length is provided at the end of Section 4.4. For the rest of the paper, we compute similarity only by phrases with phrase-length = 4 without mentioning it specifically.
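Phrase extraction and Eq. (6) can be sketched as follows. The function names are hypothetical, and the two normalized strings are illustrative inputs built from tokens that appear in this paper (USRCOL, EQ, INT, OR, SQUT) rather than patterns taken from the SQLiDDS repository.

```python
import math
from collections import Counter


def phrases(terms, w=4):
    """Slide a window of w terms over the document: n - w + 1 phrases."""
    return [" ".join(terms[k:k + w]) for k in range(len(terms) - w + 1)]


def phrase_cosine(terms_i, terms_j, w=4):
    """Cosine similarity over raw phrase frequencies, following Eq. (6)."""
    f_i = Counter(phrases(terms_i, w))
    f_j = Counter(phrases(terms_j, w))
    dot = sum(f * f_j[p] for p, f in f_i.items())
    norm_i = math.sqrt(sum(f * f for f in f_i.values()))
    norm_j = math.sqrt(sum(f * f for f in f_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumption: a document shorter than w matches nothing
    return dot / (norm_i * norm_j)


d_i = "USRCOL EQ INT OR INT EQ INT".split()
d_j = "USRCOL EQ INT OR INT EQ INT SQUT".split()
print(phrases(d_i))                    # four phrases of four terms each
print(phrase_cosine(d_i, d_j, w=1))    # term-level similarity
print(phrase_cosine(d_i, d_j, w=4))    # lower, as fewer phrases coincide
```

With $w = 2$ the two sentences “John is older than Mary” and “Mary is older than John” share only two of their four phrases, so the phrase-based measure no longer treats them as identical.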

4.2.2. Set based models

Set based models, also known as Binary Similarity Coefficients, are based on presence or absence of elements between two sets without considering their frequency of occurrence. These similarity coefficients are widely used in systematic biology, entomology, anthropology etc., to compare similarity between two sets of samples for taxonomy and classification. Binary similarity coefficients also find applications in text categorization, clustering and information retrieval. We look into some popular binary similarity coefficients which produce similarity values in the range $[0, 1]$ .

Considering phrases as basic elements of a document, every document can be represented as a set of phrases. Similarity between two documents can be determined by presence or absence of phrases in them. Recall that, if the phrases extracted from two documents $d_{i}$ and $d_{j}$ are $P_{i} = (p_{i_{1}}, p_{i_{2}}, \dots, p_{i_{m}})$ and $P_{j} = (p_{j_{1}}, p_{j_{2}}, \dots, p_{j_{n}})$ respectively, then the set of phrases in both documents is $P = P_{i} \cup P_{j} = \{p_{1}, p_{2}, \dots, p_{k}\}$ . The following notations are defined based on presence or absence of phrases $p \in P$ in the two documents: $\begin{array}{l} a = | \{p : p \in P_{i} \land p \in P_{j}\} | = | P_{i} \cap P_{j} |, \\ b = | \{p : p \in P_{i} \land p \notin P_{j}\} | = | P_{i} - (P_{i} \cap P_{j}) | and \\ c = | \{p : p \notin P_{i} \land p \in P_{j}\} | = | P_{j} - (P_{i} \cap P_{j}) | . \end{array}$

In other words, a represents number of phrases common to $d_{i}$ and $d_{j}$ , b represents number of phrases present in $d_{i}$ but not in $d_{j}$ , and c represents number of phrases absent in $d_{i}$ but present in $d_{j}$ . The following binary similarity coefficients are expressed using these notations.

Jaccard coefficient (JC). Also known as the Jaccard Index, it was developed by Paul Jaccard (1912) to compare regional flora. It basically computes the ratio of intersection to union of two sets. Similarity between $d_{i}$ and $d_{j}$ by this coefficient is given by: $\begin{matrix} (7) & Sim (d_{i}, d_{j}) = \frac{a}{a + b + c} . \end{matrix}$

Second Kulczynski coefficient (2KS). Proposed in 1927, the second Kulczynski similarity coefficient averages the proportion of common elements from the two sets, given by: $\begin{matrix} (8) & Sim (d_{i}, d_{j}) = \frac{1}{2} (\frac{a}{a + b} + \frac{a}{a + c}) . \end{matrix}$

Sørensen–Dice coefficient (DC). Developed independently by Raymond Dice (1945) and Thorvald Sørensen (1948), the coefficient gives higher weight to items common between two sets. The similarity coefficient is given by: $\begin{matrix} (9) & Sim (d_{i}, d_{j}) = \frac{a}{a + (\frac{b + c}{2})} . \end{matrix}$

Ochiai/Otsuka coefficient (OC). The Ochiai or Otsuka similarity coefficient developed in 1957 is same as cosine similarity in VSM where each component of the vectors is either 0 (absent) or 1 (present). The coefficient is given by: $\begin{matrix} (10) & Sim (d_{i}, d_{j}) = \frac{a}{\sqrt{(a + b) (a + c)}} . \end{matrix}$

Sokal and Sneath coefficient (SSC). Unlike the Sørensen–Dice coefficient, the Sokal and Sneath coefficient (1963) considers penalty for mismatching items. The coefficient is given by: $\begin{matrix} (11) & Sim (d_{i}, d_{j}) = \frac{a}{a + 2 (b + c)} . \end{matrix}$

Tripartite similarity index (TSI). Tulloss [45] critically examined the properties of various similarity coefficients under different situations and suggested three cost functions U, S and R, taking ideas from the manufacturing industry. The Tripartite Similarity Index is defined using these cost functions as: $\begin{array}{rcl} Sim (d_{i}, d_{j}) = \sqrt{U \times S \times R}, \\ where U = {log}_{2} (1 + \frac{min (b, c) + a}{max (b, c) + a}), \\ S = \frac{1}{\sqrt{{log}_{2} (2 + \frac{min (b, c)}{a + 1})}}, \\ (12) & R = {log}_{2} (1 + \frac{a}{a + b}) \times {log}_{2} (1 + \frac{a}{a + c}) . \end{array}$

Positive matching index (PMI). Dos Santos and Deutsch [14] examined the properties of TSI and proposed a new similarity index named Positive Matching Index (PMI) with optimal characteristics, given by: $\begin{array}{l} (13) & Sim (d_{i}, d_{j}) = \{\begin{matrix} 0 & if a = 0, \\ \frac{a}{a + b} or \frac{a}{a + c} & if b = c, \\ \frac{a}{| b - c |} \times ln (\frac{a + max (b, c)}{a + min (b, c)}) & otherwise . \end{matrix} \end{array}$
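For reference, Eqs. (7) to (13) can be transcribed directly in terms of the counts a, b and c. This is a minimal sketch with hypothetical function names; returning 0 whenever $a = 0$ is stated explicitly only for PMI in Eq. (13) and is applied to the other coefficients here as an assumption, to avoid division by zero for empty documents.

```python
import math


def abc_counts(phrases_i, phrases_j):
    """Return (a, b, c): shared phrases, only in d_i, only in d_j."""
    p_i, p_j = set(phrases_i), set(phrases_j)
    return len(p_i & p_j), len(p_i - p_j), len(p_j - p_i)


def jaccard(a, b, c):        # Eq. (7)
    return a / (a + b + c) if a else 0.0


def kulczynski2(a, b, c):    # Eq. (8)
    return 0.5 * (a / (a + b) + a / (a + c)) if a else 0.0


def dice(a, b, c):           # Eq. (9)
    return a / (a + (b + c) / 2) if a else 0.0


def ochiai(a, b, c):         # Eq. (10)
    return a / math.sqrt((a + b) * (a + c)) if a else 0.0


def sokal_sneath(a, b, c):   # Eq. (11)
    return a / (a + 2 * (b + c)) if a else 0.0


def tsi(a, b, c):            # Eq. (12)
    if a == 0:
        return 0.0
    lo, hi = min(b, c), max(b, c)
    u = math.log2(1 + (lo + a) / (hi + a))
    s = 1 / math.sqrt(math.log2(2 + lo / (a + 1)))
    r = math.log2(1 + a / (a + b)) * math.log2(1 + a / (a + c))
    return math.sqrt(u * s * r)


def pmi(a, b, c):            # Eq. (13)
    if a == 0:
        return 0.0
    if b == c:
        return a / (a + b)
    return a / abs(b - c) * math.log((a + max(b, c)) / (a + min(b, c)))


a, b, c = abc_counts(["p1", "p2", "p3", "p4"], ["p3", "p4", "p5"])
for coeff in (jaccard, kulczynski2, dice, ochiai, sokal_sneath, tsi, pmi):
    print(coeff.__name__, round(coeff(a, b, c), 4))
```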

4.3. Selection of similarity measure

All the similarity measures discussed in Section 4.2 fulfill the basic requirement of being symmetric, i.e., $Sim (d_{i}, d_{j}) = Sim (d_{j}, d_{i})$ , and produce a real-valued index in the range $[0, 1]$ where 0 means the documents are completely dissimilar and 1 means they are identical. Linearity is another desirable property of a similarity measure, which means equal amounts of change in the value of the coefficient when values of joint occurrence change by a factor of one [21]. It is intuitive and easier to interpret when the degree of similarity varies linearly with respect to number of phrases common between two documents. For our system, it is important that if two queries have x phrases in common and their similarity is computed as s, then the similarity should be exactly $2 s$ when they have $2 x$ phrases common between them.

Behavior of the similarity measures was tested by taking two identical strings of 100 words (taken randomly from the vocabulary of 686 words mentioned in Section 4.1) and then changing one word at a time until the strings are completely different. For each word changed, the similarity was computed by each of the binary similarity measures, and the values were plotted against number of mismatching words as shown in Fig. 1.

Fig. 1.

Linearity testing of binary similarity coefficients.

It is observed that SSC exhibits a strongly nonlinear (concave upwards) behavior. The similarity drops at a higher rate when only a few words have changed and then at a lower rate as the documents tend towards complete dissimilarity. JC exhibits similar behavior, though to a lesser degree than SSC. The TSI is close to linear but is slightly concave downward. The remaining five similarity measures behave uniformly and are perfectly linear from 1 to 0. Therefore, JC, SSC and TSI are excluded as candidate measures from our approach.
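These shapes are consistent with a short calculation, offered here as an informal check under a simplifying assumption that is not part of the experiment itself: both strings yield the same number m of distinct phrases, so that $b = c = m - a$ . Substituting into Eqs. (7) to (11) and (13) gives: $\begin{array}{l} 2KS = \frac{1}{2} (\frac{a}{m} + \frac{a}{m}) = \frac{a}{m}, \quad DC = \frac{a}{a + (m - a)} = \frac{a}{m}, \quad OC = \frac{a}{\sqrt{m \cdot m}} = \frac{a}{m}, \quad PMI = \frac{a}{a + b} = \frac{a}{m}, \\ JC = \frac{a}{a + 2 (m - a)} = \frac{a}{2 m - a}, \quad SSC = \frac{a}{a + 4 (m - a)} = \frac{a}{4 m - 3 a} . \end{array}$

Under this assumption 2KS, DC, OC and PMI are exactly proportional to a, whereas JC and SSC lie below the line $a / m$ for $0 < a < m$ . At $a = m / 2$ , for instance, they give $1 / 3$ and $1 / 5$ respectively, against $1 / 2$ for the linear measures, which matches the ordering of the curves described above.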

The remaining five similarity measures were tested for execution time by applying each to pairs of strings of 100 to 2000 words selected randomly from the same vocabulary. The average execution time over 1000 runs on our experimental setup (described in Section 6) is shown in Fig. 2. As the vocabulary used in the normalization scheme consists of 686 words, once the length of the strings exceeds about 700 words the number of distinct phrases no longer increases, and the execution time of all similarity measures becomes almost constant.

Fig. 2.

Execution time of similarity measures.

Computing cosine similarity by the vector space model was found to take nearly twice as long as the set-based measures. The set-based measures exhibit almost the same execution time, with PMI taking slightly less time than 2KS, DC and OC. Considering that several queries may need similarity matching at run-time per web page, even a marginal saving in execution time can improve performance. Therefore, PMI is selected as the most appropriate similarity measure for our system.

4.4. Detection strategy of SQLiDDS

The proposed approach for SQL injection detection is strategically guided by two interesting observations which we were unable to find mentioned anywhere in the literature. The first observation surfaced while looking for an answer to “where do SQL injections most commonly occur in an injected query?” Looking at the general programming practices for developing web applications, we find that dynamic SQL queries are usually constructed in two parts: (1) a static part that is directly hard coded by the programmer, and (2) a dynamic part produced by concatenation of SQL elements, delimiters and received input values. Since the objective of constructing a dynamic query is to fetch a different set of records depending on the input, the parameter values must be used to specify the selection criteria, i.e., in the WHERE clause of the query. For instance, the query string parameter “pid=24” (see Section 2) is used to construct the “WHERE prod_id = 24” portion of the query. Dynamic queries using form field values (except where they are purely used for inserting data into the database), such as a login form, also have to use the values in the WHERE clause of the query as: WHERE username = 'input1' AND password = 'input2'. Using the input values in the WHERE clause of dynamic SQL queries is both a practical necessity and the de facto practice in web application development. In some cases, parameter values are passed to control ordering of the results. For example, in the URL http://www.ecomstore.com/product_listing.php?cid=6&sort=asc, the value of the sort parameter is used in the ORDER BY clause, which comes after the WHERE clause in the query. Only if database table or column names are used as names or values of form fields (such as option values in a select box) and then used to construct a dynamic query can an SQL injection attack occur before the WHERE clause.

Consider a ‘Search’ form on an e-commerce site that allows visitors to search for products or brands by name. Suppose a novice programmer codes the HTML in the following manner:

<form name="frmSearch" method="POST">
  Search for:
  <select name="tblname" size="1">
    <option value="tbl_products">Products</option>
    <option value="tbl_brands">Brands</option>
  </select>
  Keywords: <input type="text" size="20" name="keywords" />
  <input type="submit" value="Go" name="btnGo" />
</form>

Here, the dropdown box ‘tblname’ on the form directly exposes the actual table names in the database as option values. If the name of the column to search is ‘name’ in both the tables, the programmer may write the corresponding PHP code as:

$query = "SELECT * FROM ".$_POST['tblname'];
$query = $query . " WHERE `name` LIKE '%".$_POST['keywords']."%'";
$result = mysql_query($query, $dbconn);

In this case, both the parameters ‘tblname’ and ‘keywords’ are vulnerable to SQL injection attack. If the attacker exploits the ‘tblname’ parameter, then the attack vector will occur before the WHERE clause. However, this is an extreme case of poor programming; a programmer with minimum sense of security would never use database table or column names as form field or parameter values because the HTML source can be viewed on any browser. He would rather use code words such as ‘PRD’ and ‘BRND’ or integers like 1 and 2 for the option values in the dropdown box, and generate appropriate SQL query in the PHP code using a switch-case or if-else construct. The ‘keywords’ parameter still remains vulnerable to SQL injection attack, but it will anyway occur in the WHERE clause part of the query.

This observation establishes that SQL injections happen after the WHERE keyword in almost all cases. In fact, SQL injection before the WHERE keyword is extremely rare: not a single instance was found in over 16,500 SQL injected queries we examined in this research. We also inspected some popular open-source PHP/MySQL software such as WordPress and phpBB, and did not find any instance of parameter values used before the WHERE clause in dynamic queries. Hereafter, for ease of reference, we shall use the term “tail end” to mean the portion after the first WHERE keyword up to the end of an SQL query.

The second interesting observation stems from the basic intention behind an SQL injection attack – stealing sensitive information from the back-end database such as credit card numbers. By carefully formulating the injection code, the attacker tries to get the intended data displayed on the web page itself. Consider a product detail page that displays information of the selected product on the web page, such as product name, price, description, etc., which is determined by the parameter passed in the query string (e.g., pid=24). An attacker would inject additional SQL to change the results of the query forcing the web page to display data from other tables in the locations where the product name, price or description would have normally appeared. This is the only instrument for data breach through SQL injection. In SQL, fetching information from database is done by SELECT queries, which implies that unless the injectable parameter is used to construct a dynamic SELECT query that delivers data displayed on the web page, it is not useful for data breach by the attacker. This points to another assertion that SQL injection attacks mostly occur through dynamically generated SELECT queries.

These two observations lead to the strategic conclusion that, it is sufficient to examine the tail end of an SQL query for presence of any injection attack. This strategy significantly narrows down the scope of processing. Another plausible strategy may be to intercept only SELECT queries for examination and ignore UPDATE, DELETE, and INSERT queries. However, this may enable the attacker to cause damage to the data, if not breach. As such, by considering the tail end of a query, we automatically include UPDATE and DELETE queries in the investigation, because they also support a WHERE clause by SQL syntax.
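Extracting the tail end, as defined above, can be sketched in a few lines. The function name is hypothetical, and two simplifications are assumed: a query without a WHERE keyword yields an empty tail end, and the word WHERE is not expected to appear inside a string literal or identifier before the actual clause.

```python
import re

_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def tail_end(query):
    """Return the portion after the first WHERE keyword, or '' if there is none."""
    match = _WHERE.search(query)
    return query[match.end():].strip() if match else ""


print(tail_end("SELECT * FROM products WHERE prod_id = 24 OR 1 = 1"))
print(tail_end("UPDATE users SET active = 0 where user_id = 7"))
```

Since UPDATE and DELETE statements also carry a WHERE clause, the same function covers them without any special handling.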

A pertinent question may be raised here concerning a special kind of SQL injection, known as Second Order SQL Injection, which is initiated through INSERT queries [20]. In this method, the attacker submits values containing injected code through form fields (e.g., a registration form); the code is not executed at that point of time, but is silently stored in the database like any other normal data. When some other part of the web application (e.g., a change password form) uses that stored value in a dynamic query, the injected code comes into effect. Performing a second order SQL injection attack requires the attacker to have adequate knowledge about how the submitted values are used in other parts of the web application. It is interesting to note that the dynamic query that falls victim to a second order injection must be a SELECT query displaying information on the web page if the purpose of the attacker is to steal data. Additionally, the stored value containing injected code must be used in its WHERE clause. Therefore, if the WHERE clause is examined to detect injection attacks, the secondary injection would be rendered practically ineffective. Thus, the strategy to examine only the tail end of a query is rationally correct and sufficient for detecting SQL injection attacks.

Apart from greatly reducing the processing overhead, limiting the investigation only to the tail end of queries also enhances the detection capability using document similarity. Since the portion before the WHERE keyword is hard-coded and remains unchanged after an injection attack, it would undesirably inflate the similarity scores when the entire queries are compared. Particularly for long JOIN queries collating several related tables, but with a few conditions in the WHERE clause, the contribution to similarity score by the portion before the WHERE keyword could be significant, making it difficult to choose a reasonably sensitive threshold. Therefore, dropping the portion before the WHERE keyword and comparing similarity of the tail ends gives a realistic scale for identifying injection attacks.

Continuing the discussion from Section 4.2.1 on choosing a suitable value for phrase-length, we now give more justification on why $w = 4$ is appropriate for computing similarity between two queries. Reconsider the URL of the product detail page (cited in Section 2) receiving the benign input “pid=24” through the query string. The tail end of the dynamic query generated is: prod_id = 24, which is genuine. By normalization, it is converted to “USRCOL EQ INT” – containing three terms. To determine if this web page is vulnerable to SQL injection attack, an attacker would first perform basic tests such as adding a single quote (’) character or supplying a negative value of the parameter as shown below:

http://www.ecomstore.com/product_details.php?pid=24'
http://www.ecomstore.com/product_details.php?pid=-24

If the web page displays error messages or does not display normally, the attacker concludes that the parameter pid has been directly used in a dynamic query without any sanitization. The SQL queries that are generated by these abnormal requests are respectively:

SELECT * FROM products WHERE prod_id = 24'
SELECT * FROM products WHERE prod_id = -24

The tail ends of these queries are normalized to “USRCOL EQ INT SQUT” and “USRCOL EQ MINUS INT” respectively. It may be immediately noticed that both contain four terms, suggesting that the tail end must contain at least four terms for an SQL injection attempt. Therefore, phrase length $w = 4$ is most appropriate for computing similarity.

4.5. Clustering of normalized injected queries

Normalization of the tail-end of previously collected SQL injected queries produces a set of injection attack patterns in text form which are considered as small documents. These are clustered into groups based on their document similarity and the patterns in each cluster are merged. The purpose of clustering is to create a set of larger documents, each containing a number of highly similar SQL injection patterns. This enables similarity matching against a small set of documents at run-time.

Text clustering has been extensively researched and many clustering algorithms have been proposed in the literature [1]. Clustering algorithms used in information retrieval have also been described by Manning et al. [34, Chaps 16–17]. Flat clustering algorithms like k-means are not suitable for our purpose primarily because of two reasons. Firstly, injection vectors are extremely heterogeneous, due to which an expert guess of k is likely to be inappropriate. Secondly, SQLiDDS is intended to be adaptive and the number of clusters would vary over time when re-clustering is done after addition of new injection patterns to the repository. Hierarchical agglomerative clustering (HAC) algorithm does not require the number of clusters to be prespecified. Though it has at least $Θ (n^{2})$ complexity, it is not a bottleneck because the size of the data set is greatly reduced by the query normalization process (see Section 5.1).

Algorithm 1

Clustering of normalized injection patterns

We developed a variation of the priority-queue based efficient HAC [see 34, p. 386] for clustering of the normalized injection patterns as shown in Algorithm 1. The process begins with computation of an $n \times n$ similarity matrix $S$ where $s_{i j} = Sim (d_{i}, d_{j})$ , n is the number of documents (normalized tail-ends) and $1 ⩽ i, j ⩽ n$ . Because the similarity measure is symmetric, $s_{i j} = s_{j i}$ , and self-similarity is $s_{i i} = 1.0$ , the matrix $S$ is symmetric about its main diagonal with all diagonal elements equal to 1.0: $S = \begin{bmatrix} 1.0 & s_{1 2} & s_{1 3} & \dots & s_{1 n} \\ s_{2 1} & 1.0 & s_{2 3} & \dots & s_{2 n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ s_{n 1} & s_{n 2} & s_{n 3} & \dots & 1.0 \end{bmatrix}$

The algorithm takes the similarity matrix $S$ and a threshold similarity $s_{θ}$ as input and produces the set of clusters $C$ . Initially, each document is placed in its own cluster. For each document $d_{i}$ , a preference queue $P_{i}$ is then created by choosing the documents whose similarity with $d_{i}$ is above $s_{θ}$ , sorting them in descending order by similarity. The document $d_{i}$ itself is excluded from the preference queue as self-similarity is always the highest. For example, if the preference queue $P_{1} \to {d_{7}, d_{3}, d_{8}, d_{5}, \dots}$ , then it means that document $d_{1}$ has highest similarity with $d_{7}$ , then with $d_{3}$ , then $d_{8}$ , $d_{5}$ and so on. If the similarity with two documents is same, then we follow a tie-breaker rule that the indexes of the documents are also in descending order. In this example, if similarity of $d_{1}$ with $d_{5}$ and $d_{8}$ are exactly same, then $d_{8}$ comes before $d_{5}$ in the preference queue. In case of a document whose similarity with all other documents is below the threshold $s_{θ}$ , the preference queue is empty – such documents are unique in nature and result in a singleton cluster.

Like the simple bottom-up HAC [34, Section 17.1], the algorithm forms clusters by successively merging pairs of clusters. However, it does not continue up to merging all clusters into a single cluster containing all documents; rather it terminates when the set of preference queues $P$ becomes empty. Merge pairs are determined by mutual matching of the first preferences of two documents. For example, if the first preference of $d_{1}$ is $d_{7}$ and first preference of $d_{7}$ is $d_{1}$ , then the clusters containing $d_{1}$ and $d_{7}$ can potentially merge. However, merging of two clusters is allowed only when the minimum similarity $s_{min}$ between the existing documents in the two clusters interested to merge is $⩾ s_{θ}$ . This is similar to the complete-link HAC and avoids chaining effect. If two clusters are allowed to merge only by mutual matching of first preferences of two documents, which is a local condition, the similarity between the boundary points in the resulting cluster may fall below $s_{θ}$ . Since the documents (injection patterns) in each cluster are finally collated into one larger document (used for run-time comparison), it is essential that the documents in each cluster are highly similar to each other. Checking the minimum similarity before merging two clusters adds a non-local condition to the merge criterion. This ensures that the intra-cluster similarity is always within the threshold and produces cohesive spherical clusters.

In each iteration, whether two clusters get a chance to merge or not, the first preference of the potential merge pairs are removed from the queues, allowing the clusters to try merging with some other cluster in the next iteration. This way each preference queue gradually reduces and finally becomes empty. Empty preference queues are eliminated from subsequent iterations. The algorithm terminates when there are no more preference queues left. The time complexity of the algorithm still turns out to be $Θ (n^{2})$ although it completes earlier than efficient HAC.

A question may be raised about the situation in which none of the first preferences mutually match, which could result in a deadlock causing the algorithm to run into an infinite loop. However, it can be easily proved that such a deadlock situation will never arise. Let the first preference of $d_{1}$ be $d_{2}$ because the highest similarity of $d_{1}$ is with $d_{2}$ . We denote this first-preference property of $d_{1}$ by $d_{1} ≻ d_{2}$ . Suppose the first preference of $d_{2}$ is $d_{3}$ instead of $d_{1}$ , i.e., $d_{2} ≻ d_{3}$ . Extending this, let $d_{3} ≻ d_{4}$ , $d_{4} ≻ d_{5}$ , …, $d_{n - 1} ≻ d_{n}$ . Now, if $d_{n} ≻ d_{1}$ , it would be a contradiction because in that case $d_{1} ≻ d_{2}$ cannot be true due to the symmetric nature of the similarity measure, and hence either $d_{1} ≻ d_{n}$ or $d_{n} ≻ d_{n - 1}$ must be true. Therefore, during any iteration, there is at least one pair of documents whose first preferences mutually match.
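The clustering procedure described in this section can be sketched as below. This is a reading of the prose above rather than a transcription of Algorithm 1: the function and variable names are hypothetical, documents are indexed from 0, and the early exit when no mutual match exists is a defensive assumption for asymmetric input (the argument above shows it is not reached for a symmetric matrix).

```python
def cluster_patterns(S, s_theta):
    """Group documents by mutual first preference with a complete-link check.

    S is a symmetric n x n similarity matrix (list of lists) and s_theta the
    threshold similarity. Returns a sorted list of sorted index lists.
    """
    n = len(S)
    cluster_of = list(range(n))              # document index -> cluster id
    members = {i: [i] for i in range(n)}     # cluster id -> its documents

    # Preference queues: documents above the threshold, best first;
    # ties are broken by descending document index.
    pref = {}
    for i in range(n):
        queue = [j for j in range(n) if j != i and S[i][j] > s_theta]
        queue.sort(key=lambda j: (S[i][j], j), reverse=True)
        if queue:
            pref[i] = queue

    while pref:
        # Pairs whose first preferences point at each other.
        pairs = [(i, q[0]) for i, q in pref.items()
                 if i < q[0] and pref.get(q[0], [None])[0] == i]
        if not pairs:
            break                            # defensive: asymmetric input only
        for i, j in pairs:
            ci, cj = cluster_of[i], cluster_of[j]
            if ci != cj:
                # Non-local condition: weakest link between the two clusters.
                s_min = min(S[x][y] for x in members[ci] for y in members[cj])
                if s_min >= s_theta:
                    for y in members[cj]:
                        cluster_of[y] = ci
                    members[ci].extend(members.pop(cj))
            # Merged or not, both first preferences are consumed.
            pref[i].pop(0)
            pref[j].pop(0)
        pref = {i: q for i, q in pref.items() if q}   # drop empty queues

    return sorted(sorted(docs) for docs in members.values())


S = [[1.0, 0.9, 0.3],
     [0.9, 1.0, 0.8],
     [0.3, 0.8, 1.0]]
print(cluster_patterns(S, 0.6))
```

In this three-document example the first two documents merge, but the third stays in a singleton cluster: although it is close to the second document, its similarity to the first falls below the threshold, which is the chaining effect the minimum-similarity check is meant to prevent.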

4.6. Determination of threshold similarity

The threshold similarity $s_{θ}$ plays an important role in the proposed clustering algorithm. Apart from limiting the length of the preference queues for each document, it also controls merging of clusters. It is therefore necessary to choose the threshold such that it defines a natural cluster boundary. In many applications, thresholds are usually fixed empirically or based on expert guess. In our approach, SQLiDDS needs to have a dynamic threshold that is dependent on the known injection patterns available in its database. Since the goal is to make it adaptive (partly by using its own history of detection and partly with inputs from the DBA), a fixed threshold is likely to undermine this aspect of our system.

In the image processing domain, threshold selection by histogram analysis is quite popular [18]. The main advantage of threshold selection by histogram analysis is that it makes the threshold data dependent, i.e. when the data changes, the threshold changes automatically. The similarity range $[0, 1]$ is divided into equal intervals called bins. For example, taking the interval as 0.005 results in 200 bins as $[0, 0.005), [0.005, 0.01), [0.01, 0.015), \dots, [0.995, 1.0)$ . From the similarity matrix, the frequency of similarity values within each bin is counted and a histogram is generated. Since the similarity matrix is symmetric about its diagonal, the frequency at each bin should ideally be halved, but this is not essential.

The histogram is iteratively smoothed using moving average method by taking the frequency at any bin as the average of previous, current and next bin, i.e., $f_{b_{i}} = \frac{1}{3} (f_{b_{i - 1}} + f_{b_{i}} + f_{b_{i + 1}})$ . In about 30 iterations the histogram is adequately smooth as shown in Fig. 3. The smoothed histogram usually exhibits a “U” shape with a flat portion in the middle. This is because injection patterns of one type using nearly same SQL technique are highly similar to each other but different from the rest. Therefore, high values closer to 1.0 and low values closer to 0 occur very frequently in the similarity matrix. Medium values (0.45 to 0.85) occur less frequently in comparison and therefore suggest a natural cluster boundary. As depicted in Fig. 3, the similarity corresponding to the bin at the mid-point of the flat portion between the points A and B of the smoothed histogram is selected as the threshold similarity $s_{θ}$ for clustering.

Fig. 3.

Histogram smoothing for determination of threshold similarity.
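The threshold selection described above can be sketched as follows. This is a minimal illustration, not the SQLiDDS implementation: the function names are hypothetical, the handling of the two edge bins during smoothing is an assumption, and the rule used to locate the flat portion between points A and B (the longest run of bins lying within a small tolerance of the histogram minimum) is an assumed stand-in for the visual criterion of Fig. 3.

```python
def build_histogram(similarities, n_bins=200):
    """Count similarity values in equal-width bins over [0, 1]."""
    counts = [0.0] * n_bins
    for s in similarities:
        # Assumption: a value of exactly 1.0 is placed in the last bin.
        index = min(int(s * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def smooth(counts, iterations=30):
    """Apply the 3-point moving average repeatedly.

    Assumption: an edge bin reuses its own value for the missing neighbour.
    """
    current = list(counts)
    for _ in range(iterations):
        padded = [current[0]] + current + [current[-1]]
        current = [
            (padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)
        ]
    return current


def flat_region_midpoint(smoothed, tolerance=0.10):
    """Return the similarity at the centre of the longest low, flat run.

    A bin counts as 'flat' when its frequency is within `tolerance` times
    the histogram height of the minimum (assumed criterion for A and B).
    """
    floor = min(smoothed)
    limit = floor + tolerance * (max(smoothed) - floor)
    best_start, best_len, start = 0, 0, None
    # The infinite sentinel closes a run that reaches the last bin.
    for i, value in enumerate(smoothed + [float("inf")]):
        if value <= limit:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    mid_bin = best_start + best_len // 2
    return (mid_bin + 0.5) / len(smoothed)


def select_threshold(similarities, n_bins=200, iterations=30, tolerance=0.10):
    """Histogram, smoothing and mid-point selection in one call."""
    smoothed = smooth(build_histogram(similarities, n_bins), iterations)
    return flat_region_midpoint(smoothed, tolerance)
```

Because the threshold is recomputed from the similarity values themselves, adding new injection patterns to the database shifts it automatically, which is the property the adaptive design relies on.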

5. Architecture of SQLiDDS

SQLiDDS operates in two distinct phases: (1) offline phase, and (2) run-time phase. The offline phase can be considered as the training phase and is required only once. At run-time, the system performs hash-table lookup to detect known SQL injection attacks while previously unseen injection attacks are detected by similarity matching. The system thus re-trains itself from its detection history and also with inputs from the DBA. The rest of this section explains the two phases of our approach in detail.

Fig. 4.

The offline phase of SQLiDDS.

5.1. Offline phase

The offline phase, as shown in Fig. 4, begins with an adequate collection of SQL injected queries. The process of collecting injected queries is discussed later in Section 6.1. The tail-ends of the injected queries are extracted and converted into text form using the query normalization scheme. A major advantage of the normalization scheme is that several queries get converted into the same form. For example, the two attacks “prod_id = 24 OR 1 = 1” and “brand_id = 6 OR 7 = 7” are different if compared as strings, but are normalized to the same form “USRCOL EQ INT OR INT EQ INT”, i.e., both WHERE clauses actually contain exactly the same injection pattern. Therefore, duplicates are removed after query normalization and the distinct structures thus obtained constitute the Injected Structures Database (ISD). The MD5 hash value of each structure is also computed and stored separately in an Attack Hash Table (AHT). The reason for storing the MD5 hashes is twofold: (1) computing similarity for every incoming query is avoided at run-time, and (2) looking up a hash table takes constant time and is very fast. The injected patterns in the ISD are clustered into groups based on their phrase similarity (as discussed in Section 4.5); the structures in each cluster are then merged into a single document and saved in the Attack Documents Repository (ADR). Each document thus contains a set of highly similar injection patterns. Clustering enables similarity matching with a smaller set of documents at run-time.
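The deduplication and hashing step can be illustrated with a short sketch. It assumes the tail-ends have already been normalized by the query normalization scheme; the function names are hypothetical, and an in-memory list and dictionary stand in for the ISD and the AHT. The tool itself was built with PHP, so Python is used here only for illustration.

```python
import hashlib


def md5_hex(pattern):
    """MD5 digest of a normalized pattern as a 32-character hex string."""
    return hashlib.md5(pattern.encode("utf-8")).hexdigest()


def build_structure_store(normalized_patterns):
    """Deduplicate normalized tail-ends and index them by MD5 digest.

    Returns (structures, hash_table): the distinct patterns in first-seen
    order (the role of the ISD) and a digest-to-index map (the role of the AHT).
    """
    structures = []
    hash_table = {}
    for pattern in normalized_patterns:
        digest = md5_hex(pattern)
        if digest not in hash_table:
            hash_table[digest] = len(structures)
            structures.append(pattern)
    return structures, hash_table


# Both example attacks from the text normalize to the same structure,
# so only one entry survives deduplication.
patterns = [
    "USRCOL EQ INT OR INT EQ INT",   # from "prod_id = 24 OR 1 = 1"
    "USRCOL EQ INT OR INT EQ INT",   # from "brand_id = 6 OR 7 = 7"
    "USRCOL EQ INT AND USRCOL BETWEEN DEC AND DEC",
]
isd, aht = build_structure_store(patterns)
```

The same routine applies unchanged to the genuine queries, producing the GSD and GHT.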

A collection of genuine SQL queries is also prepared in the same manner. The process of collecting genuine SQL queries is much simpler. The sample web applications are run in a secured environment and the queries generated by the web pages are recorded. The tail-ends of these queries are extracted and normalized. For example, the query “SELECT * FROM products WHERE category_id = 7 AND price BETWEEN 10.00 AND 20.00” is a genuine query. Its tail-end after normalization produces the pattern “USRCOL EQ INT AND USRCOL BETWEEN DEC AND DEC”. All genuine queries are processed in the same way. After duplicate removal, the distinct structures are stored in the Genuine Structures Database (GSD). The MD5 hashes of the genuine patterns are generated and stored in a Genuine Hash Table (GHT). Unlike injected structures, clustering is not required for the genuine structures.

The offline phase completes with creation of the injected and genuine structure databases, the hash tables and the documents of clustered injected structures. SQLiDDS is then ready to be deployed as a database firewall between a web server and a database server and to start intercepting incoming SQL queries. Two thresholds, (1) Rejection Threshold ( $R_{th}$ ), and (2) Suspicious Threshold ( $S_{th}$ ) are defined in the SQLiDDS configuration, such that $S_{th} < R_{th}$ .

5.2. Run-time phase

In the run-time phase (shown in Fig. 5), SQLiDDS intercepts queries issued by web applications to the database server. The tail-end of each intercepted query is extracted and normalized using the query normalization scheme. This produces the pattern which needs to be checked for any SQL injection attack. The MD5 hash of the pattern is computed and first the GHT is looked up for a match. If a match is found, the query is confirmed as a genuine query and forwarded to database server for execution. In case a match in GHT is not found, the AHT is looked up next. If a match is found in the AHT, then it is confirmed as a known SQL injection attack and the query is rejected without any further check. When a match is not found in the AHT, its phrase similarity with the documents in the ADR is computed. If the highest similarity with any document is above the predefined rejection threshold $R_{th}$ , the query is determined as an injection attack and rejected. At the same time, the pattern is added to the ISD and its MD5 hash is added to the AHT. Due to addition of a new structure to the database, clustering process is repeated and a new set of documents is generated by merging structures of each cluster which are stored in the ADR replacing the existing set of documents. The ISD and AHT are thus updated every time a new SQL injection attack pattern is detected. Since computing the similarity matrix and clustering is a time consuming task, the ADR is currently updated through a separate process that runs every 10 min. We are however considering incremental clustering to be able to update the ADR on-the-fly.

Fig. 5.

The run-time phase of SQLiDDS.

If the highest similarity is $< R_{th}$ , but $⩾ S_{th}$ , the query is allowed to execute, but tagged as suspicious and logged separately for examination by the Database Administrator (DBA). The DBA examines the SQLiDDS suspicious queries log from time to time. If the DBA marks a suspicious query as genuine, then its normalized tail-end is added to the GSD and its MD5 hash is added to the GHT. On the other hand, if the DBA marks it as injected, then the normalized tail-end is added to the ISD and its MD5 hash is added to the AHT. This ensures that the same attack is detected instantly the next time. As a new pattern is added to the ISD, the clustering process is repeated, and a fresh set of documents is generated by merging the injection patterns in each cluster. The new documents replace the old documents in the ADR. This increases the probability of detecting similar but unseen injection attacks in future. In this manner, the system learns new attack vectors over time from its detection history as well as expert help.
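The run-time decision flow of Fig. 5 can be summarized in the following sketch. It is an illustration under stated assumptions rather than the actual implementation: the function and enum names are hypothetical, sets of MD5 digests stand in for the GHT and AHT, a score exactly equal to $R_{th}$ is treated as a rejection, and the similarity measure is passed in as a parameter. The `token_overlap` function is only a toy stand-in (Jaccard overlap of tokens) so that the example runs; it is not the phrase similarity measure used by SQLiDDS.

```python
import hashlib
from enum import Enum


class Decision(Enum):
    GENUINE = "genuine"            # GHT hit: forward to the database
    KNOWN_ATTACK = "known attack"  # AHT hit: reject
    NEW_ATTACK = "new attack"      # similarity >= R_th: reject and learn
    SUSPICIOUS = "suspicious"      # S_th <= similarity < R_th: execute, log for DBA
    ALLOWED = "allowed"            # similarity < S_th: execute


def md5_hex(pattern):
    return hashlib.md5(pattern.encode("utf-8")).hexdigest()


def classify(pattern, ght, aht, isd, adr, similarity, r_th, s_th):
    """Classify one normalized tail-end following the order of checks in Fig. 5."""
    digest = md5_hex(pattern)
    if digest in ght:
        return Decision.GENUINE
    if digest in aht:
        return Decision.KNOWN_ATTACK
    best = max((similarity(pattern, doc) for doc in adr), default=0.0)
    if best >= r_th:
        # Learn the new structure; re-clustering of the ADR is left to a
        # separate periodic process, as described in the text.
        isd.append(pattern)
        aht.add(digest)
        return Decision.NEW_ATTACK
    if best >= s_th:
        return Decision.SUSPICIOUS
    return Decision.ALLOWED


def token_overlap(pattern, document):
    """Toy stand-in similarity: Jaccard overlap of whitespace tokens."""
    a, b = set(pattern.split()), set(document.split())
    return len(a & b) / len(a | b) if a | b else 0.0


ght = {md5_hex("USRCOL EQ INT")}
aht = {md5_hex("USRCOL EQ INT OR INT EQ INT")}
isd = ["USRCOL EQ INT OR INT EQ INT"]
adr = ["USRCOL EQ INT OR INT EQ INT USRCOL EQ STR OR STR EQ STR"]


def check(pattern):
    return classify(pattern, ght, aht, isd, adr, token_overlap, r_th=0.885, s_th=0.682)
```

Checking the GHT first means that the large majority of queries, which are genuine and already known, never reach the comparatively expensive similarity step.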

6. Experimental setup and evaluation

Our experimental setup consisted of a standard desktop computer with an Intel® Core™ i3-2100 CPU @ 3.10 GHz and 2 GB RAM, running CentOS 5.3 Server OS, Apache 2.2.3 web server with PHP 5.3.28 set up as an Apache module, and MySQL 5.5.29 database server. Five client PCs of the same hardware configuration were connected through a passive switch to the server, creating a 100 Mbps LAN environment.

In the literature, some researchers have used the AMNESIA testbed⁴ to validate their approaches [31,43]. The testbed consists of five sample web applications written in Java. We found that each application has only a few web pages and database tables. Many authors have used only the sqlmap tool on a sample application to generate the attack vectors [27,28,49]. In order to closely simulate real-world attack scenarios, we decided to apply as many of the available SQL injection tools as possible on medium-scale web applications of the kind generally seen on the Internet. Using PHP and MySQL, we developed five web applications, namely Bookstore (e-commerce), Forum, Classifieds, NewsPortal, and JobPortal, with features and functionalities similar to real websites. However, the code was deliberately written without any input validation so that the applications are entirely vulnerable to SQL injection attacks. All web applications along with the back-end databases were installed on the server, simulating a shared hosting environment.

⁴ http://www-bcf.usc.edu/~halfond/testbed.html.

6.1. Collection of SQL injected queries

SQLiDDS requires an adequate collection of known injected queries to begin the offline phase, which was generated using a honeypot based technique. First, we switched on the General Query Log⁵ option of the MySQL database server. When this option is enabled, the MySQL server writes every connect, disconnect and SQL query received from client web applications into a log file. SQL injection attacks were then launched from the client PCs on three web applications, namely Bookstore, Classifieds, and JobPortal, using a number of automated vulnerability scanners and SQL injection tools downloaded from the Internet, such as HP Scrawler, WebCruiser Pro, Wapiti, Skipfish, NetSparker (community edition), SQL Power Injector, BSQL Hacker, NTO SQL Invader, sqlmap, sqlsus, The Mole, IronWasp, jSQL Injector, OWASP ZAP, etc. An advanced and versatile penetration testing platform based on Debian, known as Kali Linux,⁶ is bundled with several additional website scanners and database exploitation tools such as grabber, bbqsql, nikto, w3af, vega, etc. Some SQL injection scripts were also downloaded from hacker and community sites, such as Simple SQLi Dumper (Perl), darkMySQLi (Python), SQL Sentinel (Java), etc., and applied on the websites. Over a two month period, each tool was applied multiple times on the three websites with different settings to ensure that all possible types of injection patterns are captured. Pointers to each of these tools are not being provided here as their download locations may be searched easily on Google.

⁵ http://dev.mysql.com/doc/refman/5.5/en/query-log.html.

⁶ http://www.kali.org/ (formerly known as BackTrack Linux).

In addition to these tools, several SQL injection tips and tricks were collected from tutorials, cheat-sheets, security forums, black hat sites, etc., and applied manually on the websites. As the websites were made entirely vulnerable, all the automated and manual injection attacks were successful. In the process, every SQL query generated by the web applications was logged by the MySQL server in the general query log as planned. Over 4.52 million lines were written by MySQL into the log files, from which 3.35 million SQL queries were extracted. After removing duplicates, INSERT queries, and queries without a WHERE clause, a total of 245,356 unique SELECT, UPDATE, and DELETE queries were obtained, containing a mixture of genuine as well as injected queries.

Each query was manually examined and the injected queries were carefully separated out. Out of the 49,273 injected queries identified, a total of 16,702 unique queries were obtained. The collection contained a good mix of all types of injection attacks, though the percentage of UNION-based, blind injection, and time-based blind attacks was observed to be higher than others. In fact, these three SQL injection techniques are most commonly used by attackers. The tail-ends (i.e., the portion after the first WHERE keyword up to the end) of these queries were extracted and normalized using the query normalization scheme. After removing duplicates, 3896 unique injection patterns were obtained for creating the ISD (the query normalization scheme reduced the size of the ISD by 76.6%). The MD5 hashes of the 3896 injected structures were computed using PHP’s md5() function and stored in the AHT.
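Tail-end extraction, as defined above, amounts to keeping everything after the first WHERE keyword. The sketch below illustrates this with a regular expression; the function name is hypothetical, and the simplification that a literal containing the word WHERE before the real clause is not handled is an assumption of this example, not a statement about SQLiDDS.

```python
import re

# Match WHERE as a whole word, in any letter case.
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def extract_tail_end(query):
    """Return the portion of a query after its first WHERE keyword.

    Returns None for queries without a WHERE clause, which are not examined.
    """
    match = _WHERE.search(query)
    if match is None:
        return None
    return query[match.end():].strip()


genuine = ("SELECT * FROM products WHERE category_id = 7 "
           "AND price BETWEEN 10.00 AND 20.00")
tail = extract_tail_end(genuine)
```

Since injected code generally ends up after the first WHERE keyword, the extracted tail-end is the part that is normalized and compared.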

For preparing the ADR, the normalized injection patterns were clustered using Algorithm 1. The threshold was determined as 0.645 from the similarity matrix by histogram smoothing as described in Section 4.6. A total of 108 clusters were obtained, including 17 singleton clusters. The patterns in each cluster were merged into a document and the documents were stored in the ADR. The phrases of each document were pre-extracted for efficiency and stored in comma-separated format.

6.2. Run-time evaluation

For run-time evaluation, all five sample web applications were used without any changes to their input handling. As we collected injected queries from three web applications during the offline (training) phase, this may be regarded as using approximately 60% of the data for training and 40% for testing. The rejection and suspicious thresholds were determined empirically by testing phrase similarity between injected and genuine patterns selected randomly from the structure databases, and set as 0.885 and 0.682 respectively. With SQLiDDS activated to intercept incoming queries, the web applications were attacked from the client PCs using the automated tools and manual techniques mentioned in Section 6.1. In addition, to quickly simulate genuine browsing of the web applications, we used Xenu Link Sleuth⁷ which automatically discovers links from each web page and sends requests for each link. SQLiDDS examines an incoming query only if it has a WHERE clause, and writes details such as the timestamp, original query, normalized tail-end, MD5 hash, GHT and AHT lookup results, maximum similarity score with documents in the ADR, and the decision into a log file. Table 3 shows the results of the first run of SQLiDDS (i.e., without DBA involvement), where true positive (TP) refers to injected queries correctly detected, false negative (FN) refers to injected queries incorrectly identified as genuine or suspicious, true negative (TN) refers to genuine queries correctly identified, and false positive (FP) refers to genuine queries incorrectly determined as injected or suspicious. The figures shown in Table 3 are in terms of unique instances extracted from the SQLiDDS debug log. Since the evaluation is done by actually launching injection attacks on the web applications using several automated tools under various configurations along with manual attacks, the injected queries outnumber the genuine queries. For a fair comparison, the precision and recall have been class averaged to avoid skewing towards the majority class.

⁷ http://home.snafu.de/tilman/xenulink.html (a broken link checker).

Table 3

Results of SQL injection detection by SQLiDDS

Web application	#Queries intercepted	TP	FN	FP	TN	TPR	TNR	FPR	FNR	Avg. precision	Avg. recall	F1 score
Bookstore	4541	3912	44	4	581	98.89%	99.32%	0.68%	1.11%	96.43%	99.10%	97.75%
Forum	3708	3128	65	6	509	97.96%	98.83%	1.17%	2.04%	94.24%	98.40%	96.28%
Classifieds	2056	1622	24	4	406	98.54%	99.02%	0.98%	1.46%	97.09%	98.78%	97.93%
NewsPortal	1593	1265	44	2	282	96.64%	99.30%	0.70%	3.36%	93.17%	97.97%	95.51%
JobPortal	3179	2691	29	3	456	98.93%	99.35%	0.65%	1.07%	96.95%	99.14%	98.04%
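The class-averaged figures in Table 3 can be recomputed from the four counts of each row. The sketch below assumes that class averaging means the unweighted mean over the injected and genuine classes, and that the F1 score is the harmonic mean of the averaged precision and recall; this reading reproduces the Bookstore and NewsPortal rows to the two decimals shown. The function name is hypothetical.

```python
def class_averaged_scores(tp, fn, fp, tn):
    """Return (precision, recall, f1), each averaged over both classes."""
    precision_injected = tp / (tp + fp)
    precision_genuine = tn / (tn + fn)
    recall_injected = tp / (tp + fn)   # true positive rate (TPR)
    recall_genuine = tn / (tn + fp)    # true negative rate (TNR)
    precision = (precision_injected + precision_genuine) / 2
    recall = (recall_injected + recall_genuine) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Bookstore row of Table 3: TP = 3912, FN = 44, FP = 4, TN = 581.
precision, recall, f1 = class_averaged_scores(tp=3912, fn=44, fp=4, tn=581)
print(f"{precision:.2%} {recall:.2%} {f1:.2%}")
```

Averaging over the two classes keeps the much larger injected class from dominating the reported precision.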

The experimental results show that SQLiDDS exhibits good accuracy of detection. For all test applications, the true positive rate (TPR) is above 96.5% and the true negative rate (TNR) is above 98.8%. The false positive rate (FPR) is below 1% except for the Forum application. The false negative rate (FNR) is below 1.5% except for the Forum and NewsPortal applications. The class averaged precision varies from 93.17% for NewsPortal to 97.09% for Classifieds. The class averaged recall is close to 98% for NewsPortal, and is above that for the other applications. The F1 score for all applications is above 95.5%. Out of the false positives and negatives, a total of 96 queries (42.5%) were tagged as suspicious queries for the DBA to examine. Considering all applications together, the overall precision, recall, and F1 score of the system come out as 95.70%, 98.78%, and 97.22% respectively.

It may be observed that the performance is better for the Bookstore, Classifieds and JobPortal applications. This is no surprise because these three applications were used as honeypots for collection of injected queries. However, the F1 scores for the Forum and NewsPortal applications, which were not used as honeypots, are still above 95.5%, which is interesting as well as encouraging, and supports the efficacy of the approach. The main reason behind this is that web applications are generally written using similar programming practices, databases are created using standard relational database design principles, and parameter values are used in the same manner to construct dynamic queries. While normal SQL queries issued by different web applications may be quite different depending on their database schema and intended function, when the WHERE clause parts of the queries (genuine or injected) are compared after normalization, they usually appear closely similar. This indicates that injection patterns collected from a few honeypot applications can be used globally on the server for protecting other web applications and deliver adequate accuracy right from the time of deployment.

6.3. Two-fold validation

The knowledge-base of SQLiDDS consisting of injection attack vectors (stored in the AHT and ADR) was built by launching injection attacks on three honeypot web applications using over 23 different automated tools along with large number of manually constructed attacks. The run-time evaluation was conducted upon all five web applications to simulate real-world attacks in a shared hosting environment. During the study, we observed that, many of the attack tools randomly determine the vulnerable parameters (or form fields) and accordingly formulate the injection vectors. If a web page has multiple vulnerabilities, the same tool may attack it differently in successive runs. In addition, many tools also randomize the attack vectors to avoid detection by an IDS. Therefore, although the same set of attack tools are used for run-time evaluation, the attack vectors are not guaranteed to remain same; otherwise the system would yield 100% accuracy for the three honeypot applications simply by hash-matching with the AHT. The objective of the study was to establish that the attacks on any web application in a shared hosting environment are most likely to be similar to previously seen attacks which can be identified by comparing document similarity of normalized tail-ends of run-time queries with the knowledge-base. Considering the breadth of attack tools used to train the system, we believe that this objective is well realized.

Nevertheless, a question may be raised concerning the tests being incestuous, because the same set of tools is used for training as well as evaluation, though we include the non-honeypot applications in the evaluation. Since the injection vectors during training and testing are generated from the same set of attack tools, it may be possible that the system is learning the attack patterns of the specific tools instead of structural characteristics of SQL injection. A popular way to adjust for this, as suggested by Dietterich [12], is to conduct a repeated two-fold validation by partitioning the injection vectors according to the attack tools used. In other words, training and testing may be conducted on mutually exclusive sets of attack tools to ensure that the injection vectors during testing are not generated from the same set of tools used for training, thus avoiding the incestuous relationship. However, this requires extensive bookkeeping to track the provenance of each injected query. To collect the injected queries for training (see Section 6.1), we attacked the honeypot web applications simultaneously from the five client computers, so all queries were recorded in the same MySQL log file. Though the web application that issued the query is recorded in the log, the attack tool that produced the injected query is not known. To conduct a two-fold validation encompassing all attack tools, the attacks must be launched serially on the web applications, each run starting with an empty log file. Each log file, containing several thousands of queries (genuine and injected), must then be scrutinized to extract the injected queries generated by that particular tool. Since many of the attacks (particularly blind and time-based blind injection attacks) take on the order of days to complete, this would be extremely time consuming as well as cumbersome.

Therefore, we conducted a small scale two-fold validation using four attack tools on one web application. The Bookstore application was chosen as it was the largest among the five sample applications. The four attack tools selected were: (1) The Mole, (2) sqlmap, (3) darkMySQLi, and (4) sqlsus. These tools were selected because they are almost equally capable of generating various types of injection attacks, focus mainly on MySQL databases, and exhibit randomness in the attack vectors. Dietterich [12] recommends five iterations in total; by taking injection attacks generated from two tools for training and the rest for testing, six iterations were conducted, since two of the four tools can be chosen for training in $\binom{4}{2} = 6$ ways.

The Bookstore application was serially attacked by these four tools, each attack session starting with a fresh installation of its database and an empty query log file. After completion of each attack tool, the log file was carefully examined and injected queries were separated out. The tail-ends were normalized and duplicates arising due to normalization were removed. This way, four sets of samples were prepared, each containing injection patterns generated from one source. In each iteration, two sets of samples were taken together and clustered to generate the documents of the ADR. The injection patterns in the other two sets were verified against these documents. The same similarity measure, rejection and suspicious thresholds as during run-time evaluation were used. Table 4 shows the results of the two-fold validation, where SPR refers to the percentage of samples determined as suspicious. Since the validation is between injected queries only, true negatives (TN) or false positives (FP) are not applicable.
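The six iterations correspond to the six ways of choosing two of the four tools for training. The following sketch enumerates these tool-disjoint splits and computes the rates reported per iteration; the function names are hypothetical, and the order in which the splits are generated is not claimed to match the iteration numbering of Table 4.

```python
from itertools import combinations

TOOLS = ("The Mole", "sqlmap", "darkMySQLi", "sqlsus")


def two_fold_splits(tools):
    """Enumerate every split of the tools into equal training and testing halves."""
    splits = []
    for training in combinations(tools, len(tools) // 2):
        testing = tuple(t for t in tools if t not in training)
        splits.append((training, testing))
    return splits


def outcome_rates(identified, suspicious, failed):
    """Return (TPR, SPR, FNR) as percentages of the testing set."""
    total = identified + suspicious + failed
    return tuple(100.0 * n / total for n in (identified, suspicious, failed))


splits = two_fold_splits(TOOLS)
for training, testing in splits:
    print("train:", training, "| test:", testing)

# Counts of iteration 1 in Table 4: 352 identified, 84 suspicious, 34 failed.
tpr, spr, fnr = outcome_rates(352, 84, 34)
```

Each split appears together with its mirror image, so every tool contributes to training in three iterations and to testing in the other three.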

Table 4
Results of two-fold validation using four attack tools on the Bookstore application

Iteration	Training set	Testing set	Identified (TP)	Suspicious (SP)	Failed (FN)	TPR	SPR	FNR
1	480	470	352	84	34	74.89%	17.87%	7.23%
2	422	528	379	112	37	71.78%	21.21%	7.01%
3	496	454	345	74	35	75.99%	16.30%	7.71%
4	470	480	357	87	36	74.38%	18.13%	7.50%
5	528	422	326	61	33	77.25%	14.45%	7.82%
6	454	496	364	99	35	73.39%	19.96%	7.06%
Average	475	475	354	86	35	74.61%	17.99%	7.39%

At first glance, the results may appear to be on the lower side, but they are actually quite encouraging. We find that, when only four attack tools are used to generate the disjoint training and testing samples, on average 74.61% of injected queries are correctly identified. The suspicious rate indicates that, on average, 17.99% of the samples exhibit good similarity, though not up to the rejection threshold. On the other hand, the average false negative rate shows that only 7.39% of injection vectors across disjoint sets are dissimilar and fall below the required similarity score. If the thresholds are recalculated considering only the samples in hand, then the results would improve because several instances which are determined as suspicious or failed would move towards being correctly identified or suspicious respectively. For example, if 10 instances move from being suspicious to true positive, and 5 instances move from false negative to suspicious, the true positive rate improves by about 2 percentage points (10 out of an average testing set of 475). Also, because the number of training samples is much smaller, the clustering process generates more singleton clusters. As a result, several documents in the ADR contain only one injection pattern, which adversely affects the similarity scores during testing. Presumably, if all attack tools were included in the two-fold validation, the training and testing sets would be about four times larger, and the figures would be expected to converge towards the run-time evaluation results.

6.4. Performance overhead

Like any other SQL injection detection system, SQLiDDS also adds some processing overhead to the run-time performance of web applications, which consists of four components: (1) extracting the tail-end of a query, (2) normalization of the tail-end into text form, (3) lookup of the reference hash tables, and (4) similarity matching with clustered documents. Out of these, (1) and (3) consume negligible amount of time and can be ignored for calculating the net performance impact of the system.

The time consumed for query normalization was measured by applying it on 1000 queries selected randomly from the 245,356 queries collected. Average time of 1000 such runs was found to be 0.625 ms. Time consumed for similarity matching was measured by computing phrase-based similarity using Positive Matching Index between an injection pattern (out of the 3896) and a document of clustered structures (out of the 108), both selected randomly. The average time of 1000 runs was found to be 1.917 ms. To further improve run-time performance, instead of storing the clustered structures as simple text documents, the phrases were pre-extracted and stored in comma-separated format. With this optimization in the storage structure of the ADR, the average time reduced to 0.228 ms. Considering that similarity with half of the documents needs to be computed on average, total processing overhead comes to ≃ 12.7 ms per incoming query. Assuming that a standard web page issues 10 queries on average for execution and all queries require up to the similarity matching step, theoretically the total delay introduced by SQLiDDS is ≃ 127 ms for each page load. Considering that the page load times are generally in the order of several seconds over the Internet, a delay of 127 ms would not affect the end-user experience.

To assess the impact on performance in a production usage scenario, load testing on the web applications was conducted using Pylot⁸ and Siege⁹ with 10 to 100 concurrent user-agents, configured with a request interval of 5 ms and 5 s ramp-up time, covering all public web pages as test cases. Each test was run for 30 min with and without SQLiDDS intercepting queries, and the difference in average response time of the web applications was measured. Fig. 6 shows the delay introduced by SQLiDDS for each application. Clearly, the delay is proportional to the average number of queries issued per page and increases with the number of concurrent users. Overall, the delay introduced by SQLiDDS was found to account for 2.5–3% of the normal average response time of the web server, which is almost imperceptible in an online environment over the Internet.

⁸ Open source website performance testing tool: http://www.pylot.org/.

⁹ An HTTP load testing and benchmarking utility: http://www.joedog.org/siege-home/.

Fig. 6.

Performance overhead of SQLiDDS.

6.5. Comparison with existing methods

Because SQL injection has been a serious threat to web applications for nearly 15 years, a large body of excellent research on its prevention and detection exists in the literature. Researchers have attempted to provide a solution at various locations of the tiered architecture of web applications, such as: (1) inspecting HTTP request packets before they reach the web server, (2) using specific coding methods while developing the web application, (3) retrofitting the code of an existing web application, (4) examining queries issued by the web application before sending them to the database server, (5) during the query compilation or optimization phase at the database server, (6) after the query is executed, by analyzing the query result size, and (7) identifying attacks by mining the web server access log or database query log. Our approach fits at location 4 by sitting in between the web server and the database server. Technically, this location allows us to focus only on the SQL queries issued by the web applications. Another advantage is that the system is independent of the programming language used in the web applications, as well as of the database server used in the back-end. It is also possible to tune the system towards vendor specific SQL dialects.

Though many approaches have been acknowledged as detection or prevention techniques, only some of them have been implemented from the perspective of practicality. Due to the non-availability of any standard data set of SQL injection attacks, most of the approaches have been validated using synthetic ones or sample applications mimicking near real-world scenarios. Since each approach has its benefits as well as shortcomings under specific situations and systems, it is difficult to make a rational comparison based on experimental figures. Therefore, we provide an analytical comparison of our approach with some notable approaches we studied, shown in Table 5, which is based on the following considerations:

Specific coding method: Whether the approach is based on a specific coding methodology or use of a proposed framework to prevent SQL injection attacks.

Source code access: Whether the approach requires access to the source code or modifies it for static analysis, retrofitting, or identifying weakness in input validation.

Platform specific: Whether the approach is applicable or suitable only for a specific programming language and/or database platform.

Normal-use modeling: Whether the approach requires building a model of the SQL queries generated by the web application with benign inputs run in a secured environment.

Multiple websites: If the approach applies to only one web application or can protect multiple web sites hosted on a shared server.

Adaptive/self-learning: Whether the approach is adaptive and self-learning to be able to detect previously unseen attacks.

Time complexity: Whether the general time complexity of the system is high or low.

Practical usability: How well the system is practically usable in a real production environment.

Table 5
Analytical comparison of SQLiDDS with existing techniques

Approach → Consideration ↓	SQLRand [5]	SQL-DOM [36]	AMNESIA [19]	SQLProb [31]	CANDID [4]	Swaddler [9]	SQLiDDS
Specific coding method	Yes	Yes	Yes	No	No	No	No
Source code access	Yes	Yes	Yes	No	Yes	No	No
Platform specific	No	Yes	Yes	No	Yes	Yes	No
Normal-use model	No	No	Yes	Yes	Yes	Yes	No
Multiple websites	No	No	No	No	No	No	Yes
Adaptive/self-learning	No	No	No	No	No	No	Yes
Time complexity	Low	High	Low	High	Medium	High	Low
Practical usability	Low	Low	High	Medium	Medium	Low	High

We also compare our approach with these approaches based on the ability to prevent or detect various types of SQL injection attacks as shown in Table 6.

Table 6

Types of SQL injection attacks detected (∙ : yes, ∘ : partially, ╳: no)

Approach → Type of attack ↓	SQLRand [5]	SQL-DOM [36]	AMNESIA [19]	SQLProb [31]	CANDID [4]	Swaddler [9]	SQLiDDS
Tautological attacks	∙	∙	∙	∙	∙	∘	∘
Logically incorrect queries	╳	∙	∙	∙	∘	∘	∙
UNION based attacks	∙	∙	∙	∙	∙	∘	∙
Piggy-backed queries	∙	∙	∙	∙	∙	∘	∙
Stored procedure attacks	╳	╳	╳	∙	∘	∘	∙
Blind injection attacks	∙	∙	∙	∙	∙	∘	∙
Time-based blind attacks	∙	∙	∙	∙	∙	∘	∙
Alternate encodings	╳	∙	∙	∘	∘	∘	∙

Although SQLiDDS detected almost all of the tautological attack vectors in our tests, we still show it as partial because, theoretically, tautological expressions can be formed in an infinite number of ways.

7. Related work

Apart from document search, text categorization, or information retrieval, similarity measures have been long used by researchers for spam email detection [2], phishing website identification [51], unknown malicious executable detection [41], etc. However, to the best of our knowledge, document similarity measures have not been proposed specifically for detection of injection attacks at the SQL query level in the literature so far. A few studies have used similarity measures on HTTP requests or payloads for classification of different types of web attacks including SQL injection, which may be considered as related to our work to some extent.

Walenstein et al. [48] used similarity measures on N-grams of disassembled code to identify variants of known malware. Small et al. [42] also used a similarity measure, along with string alignment techniques, for detecting malicious HTTP payloads. Around the same time, Gallagher and Eliassi-Rad [16] applied a term-frequency-based similarity measure to HTTP requests in order to identify and classify web attacks. Ulmer and Gokhale [46] used document similarity on HTTP data in a configurable hardware classifier for web attacks. Later, they proposed massively parallel acceleration of the document-similarity classifier on a multi-core processor architecture, focusing mainly on the hardware implementation of the similarity computation [47]. N-gram splitting and classification using a Support Vector Machine (SVM) to detect SQL injection and XSS attacks was proposed by Choi et al. [8]. However, their approach extracts N-grams from the queries while ignoring the symbols and operators, which are important syntactic elements of a query. To the best of our knowledge, the application of document similarity measures to normalized queries for detecting SQL injection attacks is proposed for the first time in this study.
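The limitation noted above for symbol-free N-grams can be illustrated with a small sketch. The Python code below is not the implementation of Choi et al. [8], nor that of SQLiDDS; the two tokenizers and the sample WHERE-clause fragments are assumptions chosen only to show that word-only N-grams can make an injected fragment indistinguishable from a benign one.

```python
import re


def word_ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tokens_without_symbols(fragment):
    # Keep only alphanumeric words; every symbol and operator is discarded.
    return re.findall(r"[A-Za-z0-9_]+", fragment.lower())


def tokens_with_symbols(fragment):
    # Keep words, and also keep each non-space symbol as its own token.
    return re.findall(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]", fragment.lower())


# Hypothetical fragments: the second one comments out the password check.
benign = "user = 'admin' and pass = 'x'"
injected = "user = 'admin' -- ' and pass = 'x'"

same_without = (word_ngrams(tokens_without_symbols(benign))
                == word_ngrams(tokens_without_symbols(injected)))
same_with = (word_ngrams(tokens_with_symbols(benign))
             == word_ngrams(tokens_with_symbols(injected)))
```

In this example `same_without` is true and `same_with` is false: once the quote and comment symbols are dropped, both fragments reduce to the same word sequence, so any classifier built on those N-grams sees identical input.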

8. Conclusions and future work

This paper presented a novel approach for detecting known as well as previously unseen SQL injection attacks in real time by normalizing queries and applying a document similarity measure. The approach was implemented in a tool named SQLiDDS, designed to work as a database firewall. We adopted the strategy of examining only the WHERE clause of run-time queries and ignoring INSERT queries, based on observations made during the course of the research. The experimental results obtained on five sample web applications, developed using PHP and MySQL with no input validation, confirm the effectiveness of our approach. Because the system acts as a database firewall and does not require building a normal-use model of SQL queries, it can protect multiple web applications hosted on a shared server, which is an advantage over existing methods. Although our approach requires an initial set of injected queries, these can be collected once from a few honeypot applications and reused in further deployments. The system evolves by learning from its own detection history and from inputs provided by the DBA, becoming more robust over time. The performance overhead of the system is almost imperceptible over the Internet. The approach uses features available in most modern programming languages and can therefore be ported to other web development platforms without major modifications.
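The overall flow summarized above (skip INSERT queries, take the WHERE clause, normalize it into words, compare it with known attack documents) can be sketched roughly as follows. This is a minimal illustration in Python, not the PHP code of SQLiDDS: the symbol-to-word table, the token names, the use of word bigrams as phrases, the overlap coefficient as a stand-in similarity measure, and the 0.5 threshold are all assumptions made for the example.

```python
import re

# Hypothetical symbol-to-word table; not the normalization scheme of the paper.
SYMBOL_WORDS = {
    "--": "CMT", "#": "CMT", "=": "EQ", "'": "SQUT", '"': "DQUT",
    "(": "LPRN", ")": "RPRN", ",": "CMMA", ";": "SMCLN", "*": "STAR",
}

# Tokens: SQL line comment, integer, identifier/keyword, or any single symbol.
TOKEN_RE = re.compile(r"--|\d+|[A-Za-z_][A-Za-z0-9_]*|\S")


def where_clause(query):
    """Return the text after the first WHERE keyword, or '' if there is none."""
    m = re.search(r"\bwhere\b(.*)", query, flags=re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else ""


def normalize(fragment):
    """Turn a query fragment into a sentence-like list of upper-case words."""
    words = []
    for tok in TOKEN_RE.findall(fragment):
        if tok.isdigit():
            words.append("INT")               # all integers look alike
        elif tok in SYMBOL_WORDS:
            words.append(SYMBOL_WORDS[tok])   # symbol replaced by a word
        else:
            words.append(tok.upper())         # keywords, identifiers, the rest
    return words


def bigrams(words):
    """Set of adjacent word pairs, used here as the 'phrases' of a document."""
    return {tuple(words[i:i + 2]) for i in range(len(words) - 1)}


def overlap(a, b):
    """Overlap coefficient |A & B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def is_injection(query, attack_docs, threshold=0.5):
    """Flag a query whose WHERE clause resembles any known attack document."""
    if re.match(r"\s*insert\b", query, flags=re.IGNORECASE):
        return False                          # INSERT queries are ignored
    clause = where_clause(query)
    if not clause:
        return False
    phrases = bigrams(normalize(clause))
    return any(overlap(phrases, doc) >= threshold for doc in attack_docs)


# Attack documents built once from a (hypothetical) collected attack vector.
attack_docs = [bigrams(normalize("' or 1 = 1 --"))]
```

With these definitions, `is_injection("SELECT * FROM users WHERE name = 'x' or 7 = 7 --' AND pass = ''", attack_docs)` is true, while the same query with `name = 'x' AND pass = 'y'` is not flagged. Because the numbers are normalized to a common word, the tautology `7 = 7` matches the stored `1 = 1` vector even though that exact string was never seen.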

In the future, we will examine the feasibility of improving the normalization scheme by converting a group of symbols (such as arithmetic operators) into one token, keeping in mind that over-normalization may lead to loss of accuracy. Next, we plan to devise a weighting scheme for frequently occurring phrases, which could help reduce false negatives and false positives. We also plan to implement incremental clustering so that the attack document repository can be regenerated as soon as a new attack is detected. Multiple installations of SQLiDDS that communicate with each other to keep their attack vector repositories up to date are another interesting option for future work.
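As a rough illustration of the first idea, the sketch below maps every arithmetic operator to one shared token. It is only an assumption of how such grouping might look: the token name `ARITH`, the tokenizer, and the choice to keep comment markers (`--`, `/*`, `*/`) as separate tokens are hypothetical and are not part of the current SQLiDDS normalization scheme.

```python
import re

ARITH_OPS = {"+", "-", "*", "/", "%"}
ARITH = "ARITH"  # hypothetical shared token for all arithmetic operators

# Comment markers are matched first so that "--" and "/*" are not
# mistaken for sequences of arithmetic operators.
TOKEN_RE = re.compile(r"--|/\*|\*/|\d+|[A-Za-z_][A-Za-z0-9_]*|\S")


def tokenize(fragment):
    """Split a query fragment into comment markers, numbers, words, symbols."""
    return TOKEN_RE.findall(fragment)


def group_arithmetic(tokens):
    """Replace each single arithmetic operator token with the shared token."""
    return [ARITH if tok in ARITH_OPS else tok for tok in tokens]


plus_form = group_arithmetic(tokenize("2 + 3 = 5"))
div_form = group_arithmetic(tokenize("6 / 2 = 3"))
comment_form = group_arithmetic(tokenize("1 = 1 --"))
```

Here `plus_form` and `div_form` share the same operator token, so attack variants that differ only in the arithmetic operator yield more similar phrases. The risk of over-normalization is also visible: a `*` that is not an arithmetic operator (for example in `COUNT(*)` inside a subquery) would be mapped to the same token unless it is handled separately.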

References

1. C.C. Aggarwal and Zhai, A survey of text clustering algorithms, in: Mining Text Data, Springer, 2012, pp. 77–128. doi:10.1007/978-1-4614-3223-4_4.

2. Basavaraju and D.R. Prabhakar, A novel method of spam mail detection using text based clustering approach, International Journal of Computer Applications 5(4) (2010), 15–25. doi:10.5120/906-1283.

3. Benedikt, Freire and Godefroid, VeriWeb: Automatically testing dynamic web sites, in: Proceedings of the 11th International World Wide Web Conference (WWW’2002), 2002.

4. Bisht, Madhusudan and Venkatakrishnan, CANDID: Dynamic candidate evaluations for automatic prevention of SQL injection attacks, ACM Transactions on Information and System Security (TISSEC) 13(2) (2010), 14.

5. Boyd and Keromytis, SQLrand: Preventing SQL injection attacks, in: Applied Cryptography and Network Security, Springer, 2004, pp. 292–302. doi:10.1007/978-3-540-24852-1_21.

6. Buehrer, Weide and Sivilotti, Using parse tree validation to prevent SQL injection attacks, in: Proceedings of the 5th International Workshop on Software Engineering and Middleware, ACM, 2005, pp. 106–113. doi:10.1145/1108473.1108496.

7. W.-H. Chen, S.-H. Hsu and H.-P. Shen, Application of SVM and ANN for intrusion detection, Computers & Operations Research 32(10) (2005), 2617–2634. doi:10.1016/j.cor.2004.03.019.

8. Choi, Kim, Choi and Kim, Efficient malicious code detection using N-gram analysis and SVM, in: 2011 International Conference on Network-Based Information Systems (NBiS), IEEE, 2011, pp. 618–621.

9. Cova, Balzarotti, Felmetsger and Vigna, Swaddler: An approach for the anomaly-based detection of state violations in web applications, in: Recent Advances in Intrusion Detection, Springer, 2007, pp. 63–86. doi:10.1007/978-3-540-74320-0_4.

10. Curtis, Barclays: 97 percent of data breaches still due to SQL injection, 2012, available at: http://news.techworld.com/security/3331283/barclays-97-percent-of-data-breaches-still-due-to-sql-injection/.

11. Dahse, Exploiting hard filtered SQL Injections, 2010, available at: http://websec.wordpress.com/2010/03/19/exploiting-hard-filtered-sql-injections/. Accessed: 2011-06-23.

12. T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation 10(7) (1998), 1895–1923. doi:10.1162/089976698300017197.

13. Donovan, SQL injection attacks: Stop the madness, 2014, available at: http://www.fierceitsecurity.com/story/sql-injection-attacks-stop-madness/2014-03-04. Accessed: 2014-05-05.

14. D.A. Dos Santos and Deutsch, The positive matching index: A new similarity measure with optimal characteristics, Pattern Recognition Letters 31(12) (2010), 1570–1576. doi:10.1016/j.patrec.2010.03.010.

15. Douglen, SQL smuggling, or, the attack that wasn’t there, 2007, available at: https://dl.packetstormsecurity.net/papers/database/SQL_Smuggling.pdf. Accessed: 2015-02-12.

16. Gallagher and Eliassi-Rad, Classification of HTTP attacks: A study on the ECML/PKDD 2007 discovery challenge, in: Center for Advanced Signal and Image Sciences (CASIS) Workshop, 2008.

17. J.J. García Adeva and J.M. Pikatza Atxa, Intrusion detection in web applications using text mining, Engineering Applications of Artificial Intelligence 20(4) (2007), 555–566. doi:10.1016/j.engappai.2006.09.001.

18. C.A. Glasbey, An analysis of histogram-based thresholding algorithms, CVGIP: Graphical Models and Image Processing 55(6) (1993), 532–537.

19. Halfond and Orso, AMNESIA: Analysis and monitoring for NEutralizing SQL-injection attacks, in: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ACM, 2005, pp. 174–183. doi:10.1145/1101908.1101935.

20. Halfond, Viegas and Orso, A classification of SQL-injection attacks and countermeasures, in: International Symposium on Secure Software Engineering (ISSSE), 2006, pp. 12–23.

21. Hayek, Analysis of amphibian biodiversity data, in: Measuring and Monitoring Biological Diversity: Standard Methods for Amphibians, 1994, pp. 207–270.

22. Johns, Beyerlein, Giesecke and Posegga, Secure code generation for web applications, in: Engineering Secure Software and Systems, 2010, pp. 96–113. doi:10.1007/978-3-642-11747-3_8.

23. Kals, Kirda, Kruegel and Jovanovic, Secubat: A web vulnerability scanner, in: Proceedings of the 15th International Conference on World Wide Web (WWW’2006), ACM, 2006, pp. 247–256. doi:10.1145/1135777.1135817.

24. Kar and Panigrahi, Prevention of SQL injection attack using query transformation and hashing, in: Proceedings of the 3rd IEEE International Advance Computing Conference (IACC), IEEE, 2013, pp. 1317–1323.

25. Kar, Panigrahi and Sundararajan, SQLiDDS: SQL injection detection using query transformation and document similarity, in: Distributed Computing and Internet Technology, Springer, 2015, pp. 377–390.

26. S.M. Kerner, How was SQL injection discovered? 2013, available at: http://www.esecurityplanet.com/network-security/how-was-sql-injection-discovered.html. Accessed: 2013-12-12.

27. M.-Y. Kim and D.H. Lee, Data-mining based SQL injection attack detection using internal query trees, Expert Systems with Applications 41(11) (2014), 5416–5430. doi:10.1016/j.eswa.2014.02.041.

28. Kozik and Choraś, Machine learning techniques for cyber attacks detection, in: Image Processing and Communications Challenges 5, Springer, 2014, pp. 391–398. doi:10.1007/978-3-319-01622-1_44.

29. Lee, Jeong, Yeo and Moon, A novel method for SQL injection attack detection based on removing SQL query attribute values, Mathematical and Computer Modelling 55 (2011), 58–68. doi:10.1016/j.mcm.2011.01.050.

30. Liao and V.R. Vemuri, Using text categorization techniques for intrusion detection, in: USENIX Security Symposium, Vol. 12, 2002, pp. 51–59.

31. Liu, Yuan, Wijesekera and Stavrou, SQLProb: A proxy-based architecture towards preventing SQL injection attacks, in: Proceedings of the 2009 ACM Symposium on Applied Computing, ACM, 2009, pp. 2054–2061. doi:10.1145/1529282.1529737.

32. Livshits and Ú. Erlingsson, Using web application construction frameworks to protect against code injection attacks, in: Proceedings of the 2007 Workshop on Programming Languages and Analysis for Security, ACM, 2007, pp. 95–104. doi:10.1145/1255329.1255346.

33. Maciejak and Lovet, Botnet-powered SQL injection attacks: A deeper look within, in: Virus Bulletin Conference, 2009, pp. 286–288.

34. C.D. Manning, Raghavan and Schütze, Introduction to Information Retrieval, Vol. 1, Cambridge University Press, Cambridge, 2008, available at: http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf.

35. Maor and Shulman, SQL injection signatures evasion, White paper, Imperva Inc., 2004, available at: http://www.issa-sac.org/info_resources/ISSA_20050519_iMperva_SQLInjection.pdf.

36. McClure and Kruger, SQL DOM: Compile time checking of dynamic SQL statements, in: Proceedings of the 27th International Conference on Software Engineering (ICSE 2005), IEEE, 2005, pp. 88–96.

37. Nguyen-Tuong, Guarnieri, Greene, Shirley and Evans, Automatically hardening web applications using precise tainting, in: Security and Privacy in the Age of Ubiquitous Computing, 2005, pp. 295–307.

38. OWASP, Top 10 security threats 2013, available at: https://www.owasp.org/index.php/Top_10_2013-A1-Injection. Accessed: 2013-11-15.

39. Quartini and Rondini, Blind SQL injection with regular expressions attack, 2011, available at: http://www.ihteam.net/papers/blind-sqli-regexp-attack.pdf.

40. RFP, NT web technology vulnerabilities, Phrack Magazine 8(54) (1998), 8, available at: http://phrack.org/issues.html?issue=54&id=8#article.

41. Shabtai, Moskovitch, Feher, Dolev and Elovici, Detecting unknown malicious code by applying classification techniques on OpCode patterns, Security Informatics 1(1) (2012), 1–22. doi:10.1186/2190-8532-1-1.

42. Small, Mason, Monrose, Provos and Stubblefield, To catch a predator: A natural language approach for eliciting malicious payloads, in: USENIX Security Symposium, 2008, pp. 171–184.

43. S.-T. Sun and K. Beznosov, SQLPrevent: Effective dynamic detection and prevention of SQL injection attacks without access to the application source code, Technical report LERSSE-TR-2008-01, Laboratory for Education and Research in Secure Systems Engineering, University of British Columbia, 2008.

44. TrustWave, Trustwave 2012 global security report, 2012, available at: https://www.trustwave.com/global-security-report. Accessed: 2013-06-24.

45. R.E. Tulloss, Assessment of similarity indices for undesirable properties and a new tripartite similarity index based on cost functions, in: Mycology in Sustainable Development: Expanding Concepts, Vanishing Borders, 1997, pp. 122–143.

46. Ulmer and Gokhale, A configurable-hardware document-similarity classifier to detect web attacks, in: 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), Vol. 4, IEEE, 2010, pp. 1–8.

47. Ulmer, Gokhale, Gallagher, Top and Eliassi-Rad, Massively parallel acceleration of a document-similarity classifier to detect web attacks, Journal of Parallel and Distributed Computing 71(2) (2011), 225–235. doi:10.1016/j.jpdc.2010.07.005.

48. Walenstein, Venable, Hayes, Thompson and Lakhotia, Exploiting similarity between variants to defeat malware, in: Proceedings of BlackHat 2007 DC Briefings, BlackHat, 2007.

49. Wang and Li, SQL injection detection via program tracing and machine learning, in: Internet and Distributed Computing Systems, Springer, 2012, pp. 264–274. doi:10.1007/978-3-642-34883-9_21.

50. Wassermann, Yu, Chander, Dhurjati, Inamura and Su, Dynamic test input generation for web applications, in: Proceedings of the 2008 International Symposium on Software Testing and Analysis, ACM, 2008, pp. 249–260.

51. Zhang, J.I. Hong and L.F. Cranor, CANTINA: A content-based approach to detecting phishing web sites, in: Proceedings of the 16th International Conference on World Wide Web (WWW’2007), ACM, 2007, pp. 639–648. doi:10.1145/1242572.1242659.