Abstract
We propose a general solution to the problem of efficient substring search over encrypted data. The solution enhances existing “keyword” searchable encryption schemes by allowing searching for any part of encrypted keywords without requiring one to store all possible combinations of substrings from a given dictionary. The proposed technique is based on the idea of letter orthogonalization that allows testing of string membership by performing efficient inner products. We first propose SED-1, the base protocol for substring search. We then identify some attacks on SED-1 that demonstrate the complexity of the substring search problem under different threat scenarios. This leads us to propose our second and main protocol SED-2. The protocol is also efficient in that the search complexity is linear in the size of the keyword dictionary. We run several experiments on a sizeable real world dataset to evaluate the performance of our protocol.
Introduction
Efficiently searching plaintext documents for keywords has been a major research area within the information retrieval community. In its simplest form, the problem involves simple and exact keyword search. Exact keyword search, however, is just a special case of substring search, which allows searching for only a portion of the keyword. Prefix and suffix searches are the most common substring searches. Substring search can also involve finding all words containing a certain string regardless of its position in the word.
For outsourced data, when the server hosting the document corpus is an honest but curious adversary and the confidentiality of the documents needs to be protected, the above problem reduces to searching encrypted data for substrings. The problem of searching encrypted documents for keywords (and not substrings) has been investigated extensively with many constructions proposed in the context of symmetric search [8,10,13,16,19,27] including ones that support boolean expression search involving keywords [9,24,25], and asymmetric search [1,2,4,14] including support for conjunction, range or subset searches [5,26]. In this work, we address substring search over encrypted data in an efficient manner.1
By the time of submission August 2014, we were not aware of any existing dedicated solution for the substring encrypted search problem. Throughout the reviewing process, several constructions with better leakage profiles and efficiency [11,15,23] have been publicly published. We detail these techniques in Section 2.
Exact keyword search over encrypted data can be trivially extended to substring search albeit at the expense of significant storage overhead. Consider a user who associates the following three keywords with a given document: “search”, “symmetric” and “encryption”, and wants to execute the following query: “search the corpus for all keywords containing exactly the substring sear”. By using existing exact keyword symmetric searchable encryption schemes this query can be implemented as follows. The user generates all possible combinations of characters in words, such as, “s”, “se”, “sea”, “ea” … etc., and stores them in an index before performing the query. If we assume that all keywords have the same length m and the dictionary contains z keywords, then the user has to generate
In this work, we investigate the problem of efficient substring search over encrypted data. We assume that the user creates a document corpus,
To solve this problem efficiently we re-formulate the problem to ask the following question: Is there a bivariate function f that takes as input an encrypted substring s and a set of encrypted keywords in the dictionary
We provide an intuition behind our approach. Imagine that two vectors
At a higher level, the basic protocol works as follows. The user extracts a dictionary of keywords from the document corpus. The user then generates the searchable part corresponding to these keywords in a manner discussed above. The user also creates a lookup table that associates keywords with their respective documents. Finally, the user encrypts the documents and uploads the searchable part, the lookup table and the encrypted documents to the server. Note that the lookup table is encrypted using single-keyword symmetric searchable encryption technique, later generalized as structured encryption for inverted index data structure [10]. A user query generation is similar to the generation of any column of the searchable part. During the document retrieve phase, the server will first output all indexes (keywords) that contain the given substring (query), then uses the encrypted lookup table to retrieve the set of matching encrypted documents with a single interaction with the client.
Note that this technique, although appealing, induces many challenges in term of efficiency, correctness and security. First, characters, in ASCII for instance, do not form an orthogonal family in their binary representation. In order to be orthogonal, they should be linearly independent first, which implies that their binary representations should be at least equal to 128 bits. Consequently, a transformation is required to impart the orthogonality feature. This, in turn, brings up certain challenges in terms of the size of the transformed letter, and a decimal representation that involves computational precision. Second, a matrix product for identifying substring membership tends to be computationally inefficient for large dictionaries. To facilitate inner product computation, a compaction of the query (substring searched for) and the searchable part is mandatory which brings up more challenges for protocol correctness. Third, every keyword is defined by its own letters’ ordering that should be taken into consideration. In fact, this adjacency is critical for the correctness of the scheme especially if one wants to reduce the size of the query as well as the searchable part. Fourth, the search scheme should ideally allow wildcard search similar to the one enabled on plaintext search. Last but not least, substring search tends to leak more information than exact keyword search. To understand why, note that in order to perform efficient encrypted search, a user may send a deterministic trapdoor for a given substring. As a result, the server acquires the knowledge of all matching substrings, which will eventually lead to a worse leakage profile when compared to exact keyword SSE one. Based on this leakage, over time the server can build a knowledge of the substring frequencies which may lead to data and query recovery. It is important to point out that the SED protocol employs orthogonalization process for the sole purpose of fast inner product and leveraging both orthonormal and additive properties, the security of SED nevertheless consists of other cryptographic primitives involved in the construction process of the searchable part that we will detail later.
In this work, we address these challenges and present two progressively refined Substring search over Encrypted Data (SED) schemes. The first scheme, SED-1, is the initial attempt at using orthogonalization techniques for encrypted substring search. We discuss that SED-1 is vulnerable to some forms of inference attacks as well as attacks based on the search history. The SED-2 scheme improves upon SED-1 by reducing information leakage based on characters’ positions and pattern’s construction. The SED-2 scheme is efficient in that the communication complexity is linear in the size of the substring in the search query sent to the server, and the search complexity is in
Outline. The rest of the paper is organized as follows. We discuss related work in more details in Section 2. Section 3 introduces the threat model and formally defines the concept of adaptive security and information leakage. We present the orthonormalization pre-construction on which the two SED schemes are based in Section 4. The first scheme, SED-1, is presented in Section 5. It then presents a discussion of the two attacks that can be launched on SED-1. Section 6 and Section 8 respectively present SED-2 and its generalization. Security proof of the SED-2 scheme is discussed in Section 7 and postpone the performances’ study to Section 9. We summarize our contributions and conclude in Section 10.
Related works
The area of searchable encryption has been an active research area for over a decade. Many original constructions for searchable encryption can be found in the literature both in the symmetric search setting [9,10,13,16,27] as well as the asymmetric one [1,2,4,14]. Symmetric searchable encryption was later generalized by Chase and Kamara [10] to show how to encrypt different types of data structures. Symmetric schemes, in general, are more geared towards cloud storage and archival for a single user. Asymmetric schemes, on the other hand, are more suitable for multiple users in a collaborative environment. Early schemes focused only on exact keyword searches. In the asymmetric setting, several schemes have been proposed with support for conjunction, range or subset searches [5,26] that exploit the mathematical properties (in particular, homomorphism or pairing) inherent in this setting. However, asymmetric schemes remain orders of magnitudes less efficient than symmetric ones and are not suitable for practical deployments.
In the symmetric setting, Curtmola et al. showed that it was possible to support multiple users [13]. Recent works have also studied several extensions of symmetric searchable encryption schemes to support conjunctions in sub-linear time [9] with an extension to the multi-user setting [19], as well as a schemes [24,25] to support arbitrary boolean queries (including conjunctions, disjunctions and negations). Finally, dynamic SSE constructions have been also proposed [21,22].
Concurrently to our work, researchers got interested on designing symmetric searchable schemes with better expressiveness. Chase and Shen [11] show how to perform substring queries based on suffix trees as their underlying structure. The construction offers a search complexity in
Preliminaries
Notation
We use the following notation for this discussion. The notation
The precision depends on the variable that we have chosen in our experiments. In our case, we have chosen BigDecimal for our implementation.
If
Let
Structured encryption introduced by Chase and Kamara [10] is a generalization of symmetric searchable encryption. It is a primitive that shows how to encrypt any data structure while preserving its functionality. For instance, structured encryption can show how to encrypt inverted index, graphs and so on. We borrow the definition of structured encryption in [10]:
A structured encryption scheme
Note also that in the definition above, we can modify the
Throughout the paper, structured encryption will be mainly used to encrypt an inverted index. There are many efficient instantiations of inverted index (or lookup table) encryption such as the one by Chase and Kamara [10] or Cash et al. [9]. Moreover, we use the words single-keyword SSE and inverted index structured encryption interchangeably. In the following, we slightly adapt STE definition to substring search over encrypted data for which we encrypt both a structure as well as the documents.
Substring search over Encrypted Data (SED) over a set of documents
In our discussion, we will use the following notion of correctness for substring search over encrypted data. Let s be the substring being searched for and
Threat model
We assume that the server is honest but curious. It executes the protocol correctly and does not deviate from it, but analyses any data sent to it by the client and the corresponding interactions, in order to extract as much additional information as possible about the documents and the keywords. In the rest of the paper, we assume that the server and the adversary are the same entity. We define security for the SED protocols in terms of simulation based experiments: real and ideal experiment [13]. The approach to proving security under this framework is to show that the adversary cannot distinguish between the output of a real experiment and that of the ideal experiment, which is only based on the gathered information (leakage). We will show in the security analysis section that SED is secure under this definition. In the following, we elaborate on the characteristics of the adversary by discussing the types of information that the adversary can gather during the execution of the protocol.
Information leakage
We assume that the adversary has the characteristics of an adaptive adversary as defined by Curtmola et al. [13]. Such an adversary takes into consideration the history of search (that is, queries issued by the user previously as well as the results of those queries) before requesting for a new query. Information that are leaked to an adaptive adversary throughout the process of search are twofold: (1) the search pattern leakage as well as (2) the access pattern leakage. In the case of single-keyword SSE, search pattern leakage implies that the adversary will always know if a query was issued previously by the user. Access pattern leakage embodies a transcript of all accesses made by the user. This transcript consists, of the identifiers of encrypted documents searched for. Acquiring this knowledge, an adversary can build a pattern modeling the user’s accesses, for example, how often the user accesses a given document or what documents were not queried before, and so on. In addition, we are also required to take into consideration the setup leakage consisting of information disclosed by the encrypted data structure itself (before sending any search query). As an instance, for most single-keyword SSE, the encrypted data structures disclose the number and the size of documents stored on the server plus some additional information that we defer the details after the construction instantiation. We refer the reader to [13] for a detailed discussion on search and access pattern leakage. In the following, we denote by
As a side remark, for the adaptive leakage
Note that ORAM can hide the access pattern if and only if the ORAM blocks are padded to a maximum value to hide the number of documents identifiers associated to every keyword. Moreover, the encrypted documents, if stored in the same server, have also to be stored in an ORAM. That is, ORAM solution induces a lot of overhead in general if properly employed.
Both leakages,
In the following, we formally recall adaptive security definition introduced by Curtmola et al. [13] and generalized by Chase and Kamara [10].
(
-adaptive security).
Let us consider the substring search over encrypted data scheme as composed of four algorithms The challenger (user) runs the The adversary outputs
Substring search over encrypted data – pre-construction
We present two increasingly sophisticated substring search protocols. The first scheme, SED-1, is a toy constructions that shows how we can apply the orthogonalization concept. This first scheme suffers from some serious leakages that we handle in the second construction SED-2, in particular in its generalization. The search and storage complexity also decrease for SED-2. Both schemes provide both a general substring search as well as wildcard search. The general query allows one to search for a substring without fixing the position of the substring pattern within the keyword. The wildcard query allows one to search for a fixed substring at a fixed position in the keywords. The user provides the positional information of substring by utilizing a wildcard (do-not-care) symbol. (We use the “∗” symbol for denoting wildcards.) The wildcard symbol can be at the beginning of the substring such as “
Both schemes involve a setup phase that creates a single data structure that allows the substring matching for the entire corpus. This encrypted data structure has two parts – an encrypted lookup table that contains the association between keywords and corresponding documents, and a searchable part that allows identifying keywords that match a given query sent by the user. In addition, there is a preliminary pre-construction phase that transforms each keyword in the dictionary into an orthonormal keyword, and that we are detailing in the subsequent section.
Pre-construction
Let us consider the set of letters
We begin by generating t linearly independent vectors
We finally associate each letter
If
The orthogonal projection p onto a vector line spanned by a non-zero u is defined as:
We now discuss our first strawman scheme – SED-1 for Substring search over Encrypted Data. SED-1 is discussed in terms of three phases – the Setup phase in which the user prepares the encrypted data structure that allows the server to search later on, the Query phase in which the user poses a search query to the server, and the Retrieve phase in which the server retrieves the encrypted documents that satisfy the query. We assume that the user generates the key K and the seed χ by using the
Setup phase
Let
Next, the user creates an encrypted lookup table
The third step constructs the searchable part
The vector
Consider the set of all possible letters
We reiterate the same process for all m vectors. At the end, we obtain the following vector set
Storage overhead. The storage complexity of the SED-1 scheme is
SED-1 offers two different types of queries: Wildcard Query and General Query. We first consider the Wildcard Query setting. This construction takes into account that wildcard characters (represented by “∗” symbol) can be at the beginning of a string or between two strings. Since we are dealing with substring matching, wildcards at the end of the query string are meaningless.
The query is constructed as follows. Let
The last step is to adjust the orthonormal string s by making the location of first wildcard characters meaningful for the server. Indeed, we will see in the next section that the retrieve phase is based on an inner product computation between each orthonormal letter in the query and the corresponding searchable part position. We could have processed the first wildcards much in the same way as the other letters or wildcards; however this processing is not meaningful – the server will easily figure out that these first letters are wildcards since they will match the entire searchable part position. Moreover, in this case it will also cost more because of redundant inner product computations.
In the Wildcard Query setting, the main idea is to specify the number, p, of wildcards in the beginning of the query. We can then send to the server the position of the first letter of the string s in the keyword as
Retrieve phase
The retrieve phase is based on an inner product computation between each element of s and the corresponding searchable part. A keyword
If the server receives the query
A naïve approach to test for this is to compute
This resulting set contains the positions of keywords matching the query. The server sends back this set to the client. The client generates a token for every position in the set using the encrypted searchable encryption employed to construct
Our second case consists of receiving the string s from the user and the server performing a general substring matching. In order to perform this task, the server executes the same operation as if it receives a string with a position information,
The search computation is not dependent on the size of the database (number of files stored). Rather the search is based on the inner product operations that are in
The storage complexity is equal to the size of the encrypted lookup table which is dependent on the distribution of keywords per document, to which we add the searchable part which is in
Problems with SED-1
SED-1 has some vulnerabilities that make it unsuitable for a real-world deployment and which is why we refine SED-1 into SED-2. SED-1 is based on the searchable part
Letter frequency attack with known dictionary
We assume that the adversary knows the dictionary

Letter frequency attack with known dictionary
The algorithm first computes the number of characters stored in the searchable part (lines 3–10). The set
The attack is a direct consequence of the detailed searchable part construction that gives away different information about the frequency of keyword letters. Based on

Length Frequency.

Letter Frequency.
Note that, in this attack, we have made a strong assumption that the adversary has an accurate prior knowledge of the dictionary used by the user. In the next attack, we will assume that the adversary proceeds with the attack without any prior information about the dictionary.
The adversary may not know the exact dictionary but can easily know the distribution of characters in the searchable part
The attacker can try to use certain types of information to refine the data distribution model. These can be, for example, the geographic localization of the user, the field of interest, etc. However, we assume these to be well hidden from the adversary. Moreover, knowing the exact distribution of characters for each keyword in
Substring search over encrypted data #2 – SED-2
We have seen that the first scheme SED-1 is insecure and can leak the keywords in the case of a known dictionary. We should point out that it is a real challenge to perform substring matching and at the same time hide the distribution of letters within the dictionary from the server. In this section, we present the second and main construction, SED-2, that partially6
Refer to Section 7.1 for exact description of SED-2 leakage.
The main challenge for SED-2 is to maintain the order of letters within a given keyword. Indeed, only summing up orthonormal letters does not save any pattern with a size bigger than that of the keyword. SED-2 addresses this issue by creating a sort of chain of letters, intelligible to the user with an a-prior stored state, that allows one to search for any pattern. Add to that, this new feature SED-2 also consists of linking every character to its position. That is, two equal characters at different positions will be treated in the setup phase as two totally different characters. This feature enables to hide a great part of the information leakage in SED-1. We will also see that this position based construction can be further generalized with a pattern construction to even hide more about characters’ frequency, refer to Section 8. SED-2 is more efficient and induces less storage complexity than SED-1. In the following, we give a step-by step explanation of both the setup and the search phases (the key generation is still the same as SED-1). We give an algorithmic description of SED-2 in Figure 3.
In SED-2, the construction of the encrypted lookup table
The size of the vector

Substring search over encrypted data SED-2 for the Wildcard search case.

Example of letters set expansion.
We apply the pre-construction (see Section 4.1) on the entire set of vectors in
At the end of the Setup phase, the user sends to the server the collection
To discuss the Query phase, we again differentiate between the Wildcard Query and the General Query. Let us first consider that the user wants to search for a specific pattern in a fixed position. The user wants to search for the string s which can be any combination of wildcards and letters (no wildcard can be after the last letter). We denote by
The user then computes the following value for the search query and sends the query Q to the server.
Retrieve phase
The Wildcard Query is a special case of the General Query. We first discuss the case for the Wildcard Query. If the server receives one query, it assumes that the user seeks a Wildcard search. The server performs the following steps for each
Based on the encrypted lookup table
If, on the other hand, the server receives multiple sub-queries, it considers that the user has provided a General query. Note that, if there is a keyword that matches a given sub-query, there is no need to compute again its inner product value. This is because it already belongs to the matching positions. So to search efficiently, we first create a set P equal to
Correctness of SED-2
The correctness of the scheme can be easily verified. The query will contain orthonormal letters depending on the position in the string searched for. While the server receives the query, the inner product of the query and the component
Security proof for SED-2
Leakage description of SED-2
SED-2 construction, as any symmetric searchable encryption construction, has some leakage. This leakage is divided in two types. A setup leakage that the server (adversary) can undercover by just looking to the encrypted data structures, and, an adaptive leakage that the adversary can learn during the search phase. SED-2, if compared to single keyword SSE constructions, leaks more at the price of offering better expressiveness. In the following, we details both of these leakages.
Setup leakage
Adaptive leakage
This setup and adaptive leakage described above is the only leakage that SED-2 leaks to the adversary when a search is performed. We will show in the next section that SED-2 is
Security proof of SED-2
The SED-2 protocol is
If
Let Given set based on the simulator The simulator outputs The simulator now simulates the query Q for the Wildcard search as follows. The simulator It remains to show that for all probabilistic polynomial-time adversaries Note that Claim. For all PPT This transition just replaces the orthogonal values in every This game shows that our SED security is not based on the linear transformation but on the prior PRG evaluation.
That is the view of an PPT adversary
Claim. For all PPT
We show that if there exists an adversary
If
After answering all
Claim. For all PPT adversary
Claim. For all PPT adversary
If
After answering all
SED-2 hides information related to characters that exist at different positions while not impacting the scheme’s correctness. However, we have described in Section 7.1 SED-2 leakage, and we have shown that an adversary can recover non-trivial information about the encrypted data structure even in the setup phase. With adequate auxiliary information, an adversary can run many attacks based on characters frequency. This unfortunately can lead to the disclosure of all part of keywords dictionary. Fortunately, we can generalize SED-2 construction that leaks much lesser with a new pattern’s construction. The ultimate goal of the generalized SED-2 construction is to decrease the setup leakage to be as insignificant as possible.
Overview
The setup phase is based on a new parameter, pattern length γ, that determines the length of the pattern that is going to be orthogonalized. Every pattern is now determined by its position but also by its length. For every character in
The next step is to apply the orthogonalization process
The way we construct the searchable part
This generalization introduces a new balance between search expressiveness and leakage. In SED-2, the leakage can be greatly reduced when increasing γ. We notice that starting from
The setup leakage as well as the adaptive leakage is now limited to the following. The encrypted searchable part now leaks only the information that two keywords have the same pattern if they are also at the same position. For example, the keywords arbitrary and binary both contain the pattern
Leakage
As illustrated in the previous section, the generalization offers better security guarantees at the cost of slightly reduced expressiveness. We provide in the following a formal description of setup leakage. The adaptive leakage is similar to the one described for SED-2.
Setup leakage
Performance analysis
We have evaluated the performance of SED-2. For this purpose, we developed a proof-of-concept using Java. We use the Enron document corpus that has over 500,000 emails and is about 1.5 Gbytes in size. Our experiments demonstrate the efficiency of the SED-2 protocol for a moderate size corpus. We discuss the experiments and results in more details below.
Experiment setup
We executed a number of experiments to validate not only the correctness of the theoretical complexities, but also study the deployability of SED-2 in a realistic setting. Our analyses include a quantification in terms of execution time and storage for the complete SED-2 protocol. In addition to the construction of the lookup table and the searchable part, we study the overhead related to the initial stages of the protocol, namely, determining unique keywords for every document as well as creating a dictionary that is associated with this set of documents. Any encrypted search scheme will involve these steps and execution time for this phase is far from negligible.
We chose the Enron corpus [6] for our performance studies. The corpus contains more than 500,000 emails from about 150 users. The total size of the document corpus is nearly equal to 1.5 Gbytes. The high number of unique keywords in the corpus was the main reason behind our choice of the Enron corpus since the search complexity of SED-2 is dependent on the size of the dictionary and not on the number of the outsourced files. We do not consider the overhead due to the encryption of files in our performance analysis.
The SED-2 scheme is implemented in Java using the JSci library [20] for keywords orthogonalization. Since the components of the searchable part are decimal numbers, we use Java’s BigDecimal class for accurate precision. (For example, if the output is equal to one plus or minus
Results
We first show in Fig. 5 the execution time needed to determine for every file, the number of unique keywords and then constructing the entire dictionary of all these files (graph labeled as “keywords indexation”). Based on unique keywords associated with each file, we construct the lookup table that links each keyword to all documents that contain the keyword. These phases can be merged to gain time, however they are still linear in the number of files (multiplied by the number of keywords in the case of textual files, which was our case in the Enron emails).

Time of inverted index encryption.

Number of unique keywords.
SED-2 complexity is dependent on the size of the dictionary, we show in Fig. 6 that the number of unique keywords associated with a number of Enron emails. We notice that for 90,000 emails, the number of unique keywords is greater than the size of the Enable (172,820 keywords) or the British Oxford (127,238 keywords) dictionaries. This is due to the additional scientific or organizational notations, proper names etc., in the ENRON corpus and also due to the fact that the other dictionaries contains only the base words and not their variants like the plurals or so.
The setup phase of SED-2 pre-construction is based on the creation of a set of letters with a size equal to the initial set of letters, times the number of possible positions. The number of possible positions is equal to the size of longest keyword. We have assumed a maximum size equal to 40, and the set of initial letters equal to the ASCII representation. The final size of the orthonormal letters is equal to 9.2 MBytes constructed in 2.4 seconds. Once the orthonormal letters are created, they are stored in the client side for more efficiency during the search phase.
Figure 7 shows the time to construct the searchable part as a function of the number of unique keywords based on the dictionary. The study clearly demonstrates the linearity of the searchable part of the construction. As indicated earlier, the searchable part construction is independent of the number of files in the corpus. In Fig. 8 we show the size of the lookup table and the searchable part that the server should store dependent on the number of unique keywords indexed.

Searchable part construction.

Size of the encrypted data structure.

Search phase time.
During the search phase, the user a-priori loads the orthonormal letters in memory to avoid I/O overhead. The time to generate the search query construction is equal to 78 micro seconds. The test phase takes as an input the searchable part, the lookup table and outputs the corresponding set of documents, Fig. 9 shows the time of search in one thread by process and parallelized multiple threads in one process.
The server does not have to wait until the end of the search phase to send the matching encrypted documents, since the searchable part is linear. Once the server finds a matching it sends the positions to generate the token and therefore directly fetch the corresponding encrypted documents. Based on the graph, the mean search time per keyword is equal to 0.85 micro seconds for parallel search and 1.3 micro seconds for one thread process. Note that the client and server co-exists in the same machine and therefore communication cost is negligible in this case.
We presented in this paper SED, a substring SSE construction that has an overhead sub-linear in the size of the data set when considering that
