Abstract
Cloud computing is gaining ground in the digital and business world. It delivers storage services that users access over the Internet. Despite the numerous benefits of cloud services, migrating to public cloud storage raises security and privacy concerns. Encryption protects data privacy and confidentiality. However, encrypting the data held in cloud storage reduces the flexibility of processing it. Therefore, new techniques for retrieving the top representative documents from encrypted public storage are required. This paper presents a similarity-based keyword search for multi-author encrypted documents. The proposed Authorship Attribute-Based Ranked Keyword Search (AARKS) encrypts documents using user attributes and returns ranked results to authorized users. The scheme assigns weights to index vectors by finding the dominant keywords of each authority's document collection. Search over the proposed index prunes branches and therefore processes fewer nodes. Re-weighting based on relevance feedback also improves user experience. The proposed scheme ensures the privacy and confidentiality of data while supporting cognitive search over encrypted cloud data. Experiments are performed on the Enron dataset using a set of sample queries. The precision obtained for the proposed ranked retrieval is 0.7262. Furthermore, information leakage to the cloud server is prevented, which makes the scheme suitable for public storage.
Introduction
Cognitive computing systems process data from different sources and weigh context to give the best solution for decision making. Cloud computing processes various sources of information faster and delivers different computing services such as data storage, software, networking, servers, and databases through the Internet. Services are subscription-based instead of requiring purchased licenses. The cost savings that the cloud brings to an organization are vital, and hence large organizations, small businesses, nonprofits, government agencies, and individual consumers all use cloud services. Organizations' investment in the cloud is increasing, and the technology continues to evolve to move businesses forward.
Cloud storage [1] services maintain and manage data remotely. Users store files online and access them via the Internet from any location. The cloud provides many benefits such as usability, accessibility, disaster recovery, and cost savings. Besides these benefits, the security and privacy of data become a primary concern. Although storage providers implement the best security standards, storing essential data with third-party service providers always opens up risks, especially when the data is sensitive. Moreover, providing secure access to and flexible search over outsourced data becomes a big challenge. A natural approach to providing data confidentiality is encryption. Users encrypt documents before outsourcing them to the cloud. However, encrypted documents reduce flexibility in the usage of data. To address this, Searchable Encryption (SE) [3] provides search functionality over encrypted data. SE is a secure and effective solution in which users search the outsourced data without leaking any information. The literature presents many SE schemes [4–6, 24–28] supporting various features such as single keyword and multi-keyword search. However, these schemes are not efficient because of their simple functionality. Public key encryption with keyword search (PEKS) [8] combines public key encryption with SE and, in addition to search, satisfies the security requirements of cloud data. The data owner encrypts the data with a public key, and the data user decrypts it using his private key. Though PEKS [9–11, 29] is secure, this asymmetric encryption approach is many-to-one and requires the sender to encrypt for an individual receiver, which leads to an increase in the ciphertext size.
Attribute-Based Encryption (ABE) is suitable for public cloud storage and overcomes the above limitation. There are two ABE variants: Ciphertext-Policy ABE (CPABE) [12–14] and Key-Policy ABE (KPABE) [15]. In CPABE, attributes label a user's authorization features, and the data owner determines the policy for the data user, whereas in KPABE, attributes describe the data and policies are embedded into the user's keys. Motivated by the benefits of ABE, keyword search has been added to enhance its functionality. The attribute-based keyword search scheme (ABKS) [16] allows a user to search for a keyword only when the user's characteristics match the given policy. However, the search time then grows in proportion to the number of attributes used in the search process.
CPABE, a one-to-many encryption approach, is best suited for a data-sharing environment with a considerable number of users in the system. It enables complete access control over outsourced data using the attributes in the user's credentials. Motivated by these properties of CPABE, current state-of-the-art schemes combine CPABE with SE to enhance flexibility in the usage of outsourced data. Keyword search schemes based on CPABE [17–20] return all the documents from multiple owners whenever the attributes in the secret key match the policy. When searching a vast collection of documents, the result list may therefore contain documents with little relevance to the data user.
When information sources contain critical data, providing security and privacy becomes a significant challenge. Also, based on the type of information shared, it is essential to use different classes of cognitive information systems [21]. Cognitive search can add new information with cognitive skills during indexing. It can also filter the search results based on access control permissions so that only authorized results are retrieved.
As cloud computing is a “pay as you use” model, it is critically important for the data user to retrieve the documents that best match the user's requirements. At the same time, it is essential to return only authorized query results. This paper proposes Authorship Attribute-based Ranked Keyword Search (AARKS) over encrypted storage. It focuses on an authorized similarity-based search scheme that returns only relevant and authorized encrypted documents to the data requester without loss of confidentiality. The first contribution is the formation of a weighted index vector, which assigns weights by finding the dominant keywords within each owner's collection. The proposed novel index structure returns only relevant documents by quickly evaluating the score for the user trapdoor. Re-weighting terms based on relevance feedback improves user experience. Finally, a policy embedded over the data ensures that only authorized documents are retrieved.
Similarity-based index tree construction, built on the dominant keywords of each data authority's collection, helps the data user search for relevant documents of multiple data owners more efficiently. A top-n depth-first search algorithm retrieves the relevant documents in sub-linear search time, thereby improving the user's search experience. Fine-grained access control over encrypted documents allows the owner to define a complex policy to restrict data users. Only users with credentials that match the policy are allowed to retrieve the data archived on cloud storage.
Related work
Searchable symmetric encryption was initially introduced by Curtmola et al. [5] by adopting an inverted index structure. SE schemes give flexibility in searching encrypted data stored on cloud storage, thereby increasing the practicability of using cloud computing. Following this work, many systems [21–28] were developed to improve both functionality and efficiency. The existing schemes use the following procedure. The data owner calculates the index vector for each document and groups the vectors to form the index file. Following this, the owner sends the encrypted documents and the document index vectors to the cloud server. On receiving the data user's encrypted query, the server searches the stored index vectors and returns the appropriate result. Finally, the data user decrypts the document using his secret key. This search process ensures both data privacy and keyword privacy. Multi-participant searchable encryption [24, 25] allows users to share data among many people and to search the documents uploaded by data owners. The document index is essential in the search process, and different approaches to building the index for search and retrieval have been reviewed.
Boneh et al. [8] combined public key encryption with searchable encryption to provide searchability. Works based on the asymmetric setting [9–11, 29] focused on enriching the search functionality with multiple keyword search, ranking of documents, and fuzzy search. Schemes [30–32] allow the owner to restrict the users of the data, i.e., only users with matching credentials are allowed to perform the search. Hierarchical predicate encryption [30] for authorized keyword search uses a trusted attribute authority to manage the access policy, whereas in [31] the access policy is defined by the owner. In multi-user private keyword search [32], the server manages keys for legitimate users to realize search authorization.
Searchable encryption integrated with access control keeps data confidential and also preserves the privacy of the user and the data. Scheme [33] uses KPABE for access control, treats keywords extracted from documents as attributes for viewing the documents, and associates the decryption key with the access policy. The work of Sun et al. [16] enables authorization at the file level and uses proxy re-encryption for user revocation. Liang et al. [34] propose a technique that combines attribute-based features with proxy re-encryption. All of these authorized keyword search schemes return only authorized documents to data requesters. However, returning a massive number of documents may include less relevant results and degrades the users' search experience. Hence, retrieving documents related to the given query is essential in a large-scale cloud environment.
To improve search efficiency, Cao et al. [35] proposed a secure and efficient privacy-preserving search based on coordinate matching. However, the search cost of the scheme is linear in the number of documents in the collection, and hence it is not suitable for extensive document collections. Wang et al. [38] proposed an integrated access structure for relevant retrieval over the index tree. However, it uses term frequency for document weight assignment. The scheme of Xia et al. [6] organizes document keyword vectors as a balanced tree and searches the tree using a greedy depth-first search algorithm. However, the index vectors are clustered in the form of a tree. Li et al. [19] propose an ABE-based keyword search scheme that achieves keyword search and also outsources key issuing and decryption to the cloud server. However, it is not efficient in document retrieval for multiple keywords.
Preliminaries
CPABE basic construction
This section gives the formal descriptions of the main processes involved in the basic CPABE [12] scheme.
The proposed model
System model
The proposed architecture comprises four components: data owner, trusted attribute center, cloud server, and data user, as shown in Fig. 1.

Architecture for secure retrieval of documents from cloud storage.
Data owner: In the proposed model, owner Oi first builds an index vector for each document in his collection {di1, di2, ... ,dij} and sends the index vector collection Iij = {Ii1, Ii2, ... ,Iij} to the Trusted Attribute Centre.
Trusted Attribute Centre: The Attribute Centre is responsible for the following processes: 1) generating attribute keys for the users of the system; 2) collecting index vectors from different data owners and calculating the weighted index for the collection; 3) constructing an encrypted index tree of the document collection; 4) uploading the encrypted index to the cloud server; and 5) encrypting the user search query to form a trapdoor and sending the user request to the server.
Cloud server: The semi-trusted server stores the index tree and the ciphertext documents. Upon receiving a trapdoor containing encrypted search keywords from a data user, the server searches the index tree and returns the relevant result.
Data user: The authorized user sends the plaintext index vector query to the attribute center and sends ABE encrypted document to the server. The server returns documents to the attribute center. Finally, the user can view the authorized documents with the attribute key obtained from the attribute center.
An encrypted storage and authorized retrieval scheme that supports multiple data owners and query relevancy should satisfy the following goals.
Information privacy: The cloud server must not obtain private information such as the keywords in the query and the keywords enclosed in the result list. The scheme should also not reveal which trapdoor keywords match the document keywords of the result list.
Data confidentiality: The documents outsourced to the semi-trusted server are protected from both the cloud server and unauthorized users. Only users whose attribute key matches the embedded attribute policy can decrypt a document. The cloud server should not learn any information while searching the encrypted index tree.
Efficiency: The proposed scheme achieves logarithmic search efficiency over an encrypted index tree, and the worst-case efficiency is sub-linear.
Based on the CPABE scheme [12], the proposed work designs a secure authorized ranked search for cloud data that achieves data confidentiality, data privacy, and user privacy. The Asymmetric Scalar-product-Preserving Encryption (ASPE) algorithm [36] is applied to convert both the index tree and the query to ciphertext form.
Setup(1^k):
Firstly, the attribute center chooses a bilinear group G0 with generator g, a bilinear map e: G0 × G0 → GT, and random numbers α, β ∈ Z_p. Then, it computes the public key PK and the master key MSK as given in Equation (1). In addition, to preserve the privacy of search, the attribute center generates a search master key SMK, as given in Equation (2), which consists of two n × n invertible matrices M1 and M2 and a random n-length binary vector f, where n is the number of dictionary keywords.
Given the attribute set S of a specific user, the attribute center generates the user attribute key AK using the generated keys PK and MSK. It chooses a random r ∈ Z_p and a random r_i ∈ Z_p for every attribute i ∈ S, and computes the key as in Equation (3).
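As a hedged illustration only: if the basic scheme [12] is the widely used CP-ABE construction of Bethencourt, Sahai, and Waters (an assumption), the keys described above would take the following general form. The hash function H, which maps an attribute to an element of G0, is not named in this text and is introduced here purely for the sketch; the SMK line restates the description given in the setup step.

```latex
\begin{align*}
PK  &= \bigl(G_0,\; g,\; h = g^{\beta},\; e(g,g)^{\alpha}\bigr) \\
MSK &= \bigl(\beta,\; g^{\alpha}\bigr) \\
SMK &= \bigl(M_1,\; M_2,\; f\bigr) \\
AK  &= \bigl(D = g^{(\alpha + r)/\beta},\;\;
        \forall i \in S:\; D_i = g^{r} \cdot H(i)^{r_i},\;\; D_i' = g^{r_i}\bigr)
\end{align*}
```

Under this assumption, the per-user random value r ties all attribute components of one key together, which is what prevents different users from pooling their attributes.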
The proposed work uses the vector space model to represent each document as an index vector that records the frequency of keywords. The product of the trapdoor vector and the index vector quantifies the relevance score between the query and the document. Firstly, the data owner forms the index vectors for the documents in the collection and sends them to the attribute center. Then the owner encrypts the document M under the access structure T, using the symmetric key, to form the ciphertext CT as given in Equation (4). The owner selects a polynomial q_r for the root of T and sets q_r(0) = s, where s is a random element of Z_p. For every other node y in T, it sets q_y(0) = q_parent(y)(index(y)). Let x range over the leaf nodes of the access structure T; the ciphertext CT is then computed over these leaf nodes.
The attribute center collects the index vectors I of documents from different data owners. It assigns weights so that the documents most relevant to the query are retrieved. This work adopts the Discriminative Feature Selection term weight measure [2] to assign term weights. This method assigns more weight to keywords that have a high average keyword frequency within a specific data owner's document collection and to keywords that occur in most of the documents of the same data owner. The measure also assigns less weight to keywords that occur in most of the documents in the overall collection. The keyword weight is thus measured as given in Equation (5).
After assigning keyword weights specific to the data owner, the document weight is measured using term frequency (TF), as given in Equation (6).
The weighted similarity measure captures the relationship between document and keyword without the need for an external information source. Thus the attribute center, on receiving index vectors from multiple data owners, measures the document weights for the collection.
Table 1 shows the data collected from multiple owners. The trusted attribute center first finds the dominant keywords of each owner's collection and then assigns weights to the documents. The doc_order values shown in the table give the order in which documents are weighted within the specific data owner's collection, calculated using Equations (5) and (6). After weights are assigned to the documents, a similarity tree is formed using Algorithm 1.
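A hedged sketch of the resulting ciphertext, again assuming the standard CP-ABE construction of Bethencourt, Sahai, and Waters: here s = q_r(0) is the root secret, x ranges over the leaf nodes of T, and att(x) (the attribute attached to leaf x) and the hash function H are hypothetical names introduced only for this illustration.

```latex
\begin{align*}
CT = \bigl(T,\;\; \tilde{C} = M \cdot e(g,g)^{\alpha s},\;\; C = h^{s},\;\;
      \forall x:\; C_x = g^{q_x(0)},\;\; C_x' = H(\mathrm{att}(x))^{q_x(0)}\bigr)
\end{align*}
```

In this form the secret s is shared down the access tree through the polynomials q_y, so only a key whose attributes satisfy T can recombine the shares and recover the payload.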
Data collection from multiple owners
On receiving index vectors from multiple data owners, the attribute center constructs a secure index vector tree using Algorithm 1. The proposed index construction uses cosine similarity to pair the nodes of a level, and a novel grouping approach is used for building the tree. If two groups are each other's best match among all the current groups, then no future grouping of the other groups can create a better match for either of them. It is therefore safe to group them immediately, which also increases locality in the searches. After tree construction, the nodes of the tree are encrypted following the approach of [36], using the search master key SMK consisting of two invertible matrices and a random binary vector.
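The mutual-best-match grouping described above can be sketched in plaintext as follows. This is a minimal illustration and not Algorithm 1 itself: the node layout, the function names, and the choice to store the element-wise maximum of the children in each parent are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def leaf(doc_id, vec):
    return {"id": doc_id, "vec": list(vec), "children": []}

def merge(a, b):
    # Assumption: a parent keeps the element-wise maximum of its children.
    vec = [max(x, y) for x, y in zip(a["vec"], b["vec"])]
    return {"id": None, "vec": vec, "children": [a, b]}

def build_tree(leaves):
    """Repeatedly merge groups that are each other's best cosine match."""
    groups = list(leaves)
    while len(groups) > 1:
        # best[i] is the index of the group most similar to group i.
        best = [max((k for k in range(len(groups)) if k != i),
                    key=lambda k: cosine(groups[i]["vec"], groups[k]["vec"]))
                for i in range(len(groups))]
        merged, used = [], set()
        for i, j in enumerate(best):
            if i < j and best[j] == i:      # mutual best match: safe to merge now
                merged.append(merge(groups[i], groups[j]))
                used.update((i, j))
        groups = merged + [g for k, g in enumerate(groups) if k not in used]
    return groups[0]

def leaf_ids(node):
    """Collect the document ids stored below a node."""
    if not node["children"]:
        return [node["id"]]
    return [i for child in node["children"] for i in leaf_ids(child)]

docs = [leaf("d1", [1, 0, 0]), leaf("d2", [0.9, 0.1, 0]),
        leaf("d3", [0, 1, 0]), leaf("d4", [0, 0.9, 0.2])]
root = build_tree(docs)
print([sorted(leaf_ids(child)) for child in root["children"]])
```

Each round is guaranteed to merge at least one pair, because the two groups with the globally highest similarity are always each other's best match, so the loop terminates.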
Secure trapdoor generation(QV,SMK):
A data user who searches for documents containing keywords ki has to submit the query to the attribute center. To protect the plaintext query from the cloud server, the attribute authority converts the query vector into a trapdoor as follows. It splits the query vector QV into two vectors QV′ and QV″ of length m based on the vector f in SMK. If f[l] is zero, then QV′[l] is assigned a random value and QV″[l] is calculated by subtracting QV′[l] from QV[l]; if f[l] is one, then QV′[l] and QV″[l] are both equal to QV[l], where l ∈ {1,2, ... ,m}. The split query vectors are then multiplied with the invertible matrices in SMK as given in Equation (7), and the attribute authority sends the generated trapdoor TR to the cloud server.
The encrypted documents of the data owners and the encrypted index tree built from the index vectors of multiple owners are stored in the cloud. On receiving the trapdoor from the attribute center, the cloud server searches the tree T using Algorithm 2 and retrieves the top-k relevant documents by calculating the relevance score as shown in Equation (8). Table 2 shows the weighted index vectors for a sample set of documents and the scores computed for the sample query Q = [1,1,0] against the weighted index of the documents.
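The splitting step can be illustrated with a small plaintext sketch in the style of ASPE [36]. The matrices, the vector f, and all function names below are illustrative assumptions; in particular, the index vector is assumed to be split in the complementary positions (where f[l] is one) and encrypted with the transposed matrices, while the query uses the inverse matrices, which is the usual ASPE arrangement and may differ in detail from Equation (7). Exact fractions are used so that the score identity can be checked without rounding error.

```python
from fractions import Fraction
import random

def inverse(m):
    """Gauss-Jordan inverse of a small invertible matrix, using exact fractions."""
    n = len(m)
    a = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for c in range(n):
        p = next(r for r in range(c, n) if a[r][c] != 0)   # pivot row
        a[c], a[p] = a[p], a[c]
        pivot = a[c][c]
        a[c] = [x / pivot for x in a[c]]
        for r in range(n):
            if r != c and a[r][c] != 0:
                factor = a[r][c]
                a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    return [row[n:] for row in a]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def split(vec, f, split_on, rng):
    """Split vec into two shares; positions with f[l] == split_on are split randomly."""
    v1, v2 = [], []
    for x, bit in zip(vec, f):
        if bit == split_on:
            r = Fraction(rng.randint(-9, 9))
            v1.append(r)
            v2.append(Fraction(x) - r)       # the two shares sum to x
        else:
            v1.append(Fraction(x))
            v2.append(Fraction(x))           # both shares equal x
    return v1, v2

def encrypt_index(iv, smk, rng):
    m1, m2, f = smk
    i1, i2 = split(iv, f, 1, rng)            # assumption: index split where f[l] == 1
    return matvec(transpose(m1), i1), matvec(transpose(m2), i2)

def trapdoor(qv, smk, rng):
    m1, m2, f = smk
    q1, q2 = split(qv, f, 0, rng)            # query split where f[l] == 0, as in the text
    return matvec(inverse(m1), q1), matvec(inverse(m2), q2)

def score(enc_index, tr):
    return sum(a * b for part in zip(enc_index, tr) for a, b in zip(*part))

rng = random.Random(7)
smk = ([[2, 1, 0], [1, 1, 0], [0, 1, 1]],    # M1 (hypothetical)
       [[1, 2, 0], [0, 1, 3], [1, 0, 1]],    # M2 (hypothetical)
       [1, 0, 1])                            # f  (hypothetical)
index_vec = [Fraction(5, 10), Fraction(6, 10), Fraction(2, 10)]
query_vec = [1, 1, 0]
print(score(encrypt_index(index_vec, smk, rng), trapdoor(query_vec, smk, rng)))
```

The encrypted score equals the plaintext inner product of the index and query vectors, which is why the server can rank documents without seeing either vector.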
Score computation
Figure 2 illustrates indexing and search over the index in plaintext form. Nodes Ti are intermediate nodes formed while grouping documents based on similarity. The edges of the index tree give the score of the node they lead to, and the number within brackets shows the order in which the nodes are processed. For query Q = [1, 1, 0] and k = 2 (top 2), the documents in the result list are {d11, d12}. The greedy search algorithm does not process node T3 because the score of T3 is 0.9, which is less than 1.0, the smallest score in the result list (that of document d12). When k = 3, the algorithm returns {d11,d12,d13} with the score list {1.1,1.0,0.9}.
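The pruning behaviour described above can be reproduced with a short plaintext sketch of a greedy depth-first top-k search. This is not Algorithm 2 itself: the document vectors, the tree shape, and the node labels T1 and T3 are hypothetical values chosen so that the scores match the ones quoted in the text, and each internal node is assumed to store the element-wise maximum of its children so that its score bounds every score below it (for non-negative weights).

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaf(doc_id, vec):
    return {"id": doc_id, "vec": vec, "children": []}

def node(label, *children):
    # Assumption: element-wise maximum of the children gives an upper bound.
    vec = [max(col) for col in zip(*(c["vec"] for c in children))]
    return {"id": label, "vec": vec, "children": list(children)}

def top_k(root, query, k):
    """Return ([(score, doc_id), ...] best first, labels of the nodes processed)."""
    results = []      # min-heap holding the current best k documents
    visited = []

    def visit(n):
        s = dot(n["vec"], query)
        # Prune: this subtree cannot improve a full result list.
        if len(results) == k and s <= results[0][0]:
            return
        visited.append(n["id"])
        if not n["children"]:
            heapq.heappush(results, (s, n["id"]))
            if len(results) > k:
                heapq.heappop(results)
            return
        # Greedy order: explore the most promising child first.
        for child in sorted(n["children"],
                            key=lambda c: dot(c["vec"], query), reverse=True):
            visit(child)

    visit(root)
    return sorted(results, reverse=True), visited

# Hypothetical weighted index vectors.
t1 = node("T1", leaf("d11", [0.5, 0.6, 0.2]), leaf("d12", [0.7, 0.3, 0.1]))
t3 = node("T3", leaf("d13", [0.4, 0.5, 0.3]), leaf("d21", [0.2, 0.1, 0.9]))
root = node("root", t1, t3)

ranked, processed = top_k(root, [1, 1, 0], 2)
print([doc for _, doc in ranked], processed)
```

With k = 2 the subtree under T3 is skipped entirely, while with k = 3 it is opened and only d13 is added, mirroring the behaviour described for Figure 2.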

Greedy search for the query.
For the sample query Q, the response obtained and the relevance feedback given by the user are stored in the attribute center for re-weighting the terms, as shown in Fig. 3. Let the relevance feedback given by the user for the response be {+,-,+}, where ‘+’ stands for relevant and ‘-’ for non-relevant. Then documents d1 and d3 are relevant to the user. When user U1 searches for the same keywords again, the attribute center re-weights the terms by taking into account the relevant results of the previous search. Term re-weighting is done by subtracting the scaled average weights of the terms of the relevant documents from the original query, i.e., {1 - 0.8*[(0.5 + 0.7)/2], 1 - 0.8*[(0.6 + 0.3)/2], 0} = {0.52, 0.64, 0}. This new query is searched over the index tree, and the result obtained is {d11, d13, d21}. The user query is thus re-weighted for subsequent searches to produce the most relevant results, thereby improving the user experience.
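The re-weighting arithmetic above can be checked with a few lines of code. This is a sketch under assumptions: the function name, the parameter name beta for the 0.8 factor, the third component of each document vector, and the rule that a query term with weight zero stays at zero are all choices made for the illustration.

```python
def reweight_query(query, relevant_vectors, beta=0.8):
    """Subtract beta times the average relevant-document weight from each query term."""
    count = len(relevant_vectors)
    new_query = []
    for j, q in enumerate(query):
        if q == 0:
            # Assumption: terms absent from the query remain absent.
            new_query.append(0.0)
        else:
            average = sum(vec[j] for vec in relevant_vectors) / count
            new_query.append(q - beta * average)
    return new_query

# Weights of the two documents marked relevant (third components are hypothetical).
d1 = [0.5, 0.6, 0.2]
d3 = [0.7, 0.3, 0.1]
new_q = reweight_query([1, 1, 0], [d1, d3])
print([round(w, 2) for w in new_q])   # [0.52, 0.64, 0.0]
```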

Schematic diagram for top representative documents in cloud storage.
On receiving the documents, the data user can decrypt a file only if the attributes in the secret key match the policy attributes. Given the ciphertext CT and the attribute key AK, the user runs the recursive CPABE decryption algorithm to obtain the symmetric key used to encrypt the message and finally retrieves the plaintext.
The proposed scheme is applicable to secure ranked retrieval of encrypted cloud data. A university data-sharing scenario is considered in this work. The employees of the organization are the data owners, and they share data within the organization through cloud storage. In this case, users can access only their department's data and have restricted access to the organization's data. For example, a professor teaching cloud computing can share the course lecture notes, assignments, and schedules with only those students who have opted for the cloud computing subject. A student who wants to search for an assignment gives the assignment topic as input. The proposed scheme searches the encrypted storage for the most relevant assignments and returns the top relevant assignments shared. The student can view an assignment only if he/she has opted for the course under which it is shared.
Performance analysis
This section presents the experimental setup and the results obtained for the proposed AARKS. It also compares the theoretical and computational cost of the proposed scheme with existing systems.
Experimental setup
This work measures the actual performance of the scheme described above using the Enron Email dataset containing million records collected from 150 users. Simulation experiments are conducted on Windows 7 with an Intel Core i7 2.3 GHz processor using the Java Pairing-Based Cryptography (JPBC) library. The simulation code is written in Java, and the results presented are the average of 50 runs of the sample set of queries used to test the proposed method. First, the original dataset is preprocessed with the text mining package of the R tool to perform word segmentation, stop word removal, stemming, and duplicate deletion. Finally, the 50 highest-frequency keywords of the preprocessed result are taken as the dictionary keywords. The experimental results cover the generation of weighted index vectors for the document collection, construction of the secure similarity tree, generation of the encrypted trapdoor, and secure search.
Table 3 compares the proposed method functionality with the previous keyword search schemes based on features such as multi-owner, multi-keyword, access policy and term re-weight.
Comparison of functionality
As for the performance analysis of AARKS, this section compares the theoretical and computational analysis of the proposed scheme with schemes [35, 37]. The weighted index vector focuses on finding keywords that have a high probability of representing a specific owner and ignores keywords that occur in most of the documents in the multi-owner document collection. The result is then used for assigning weights to the documents. This approach thus avoids the need to rely on an external source for document quality.
Scheme [35] uses coordinate matching, and scheme [37] uses a grouped balanced binary tree to construct an index tree. The time consumed by the encrypted index has two parts: the time to construct the index tree and the time to encrypt the tree. For t keywords in the dictionary, the time cost to generate the index vector for one document is O(t). The vector grouping procedure given in Algorithm 1 incurs the cost of similarity checking and pairing of vectors. The cost of a similarity check for vectors of length t is O(t). On collecting N documents from multiple data owners, the attribute center builds the index tree, and the time needed to pair the nodes and construct the similarity-based grouping given in the algorithm is O(N^2 t). The cost of encrypting one index vector in the tree is O(t^2), as it involves multiplying an invertible matrix of order t with an index vector of length t. So, for M nodes in the grouped index tree, the time needed to encrypt the index tree using Equation (7) is O(M t^2).
The experimental results for index tree construction and encryption are shown in Figs. 4 and 5. They illustrate that the time to build and encrypt the tree increases as the number of documents in the corpus increases. Next, the user's query is converted to a trapdoor by multiplying the query vector with the invertible matrices. Therefore, the time cost of secure query generation is O(t^2). The time to generate the trapdoor is shown in Fig. 6. The experimental result shows that trapdoor generation is governed by the size of the dictionary. For all the time estimates in the experiments, the value of t is taken as 50.

Time for similarity based index tree construction.

Time for index tree encryption.

Time for trapdoor generation.
Table 4 shows the results of sample queries given by the data user. The proposed index vector weighting method assigns more weight to keywords that have a high average keyword frequency within a specific data owner's document collection. Hence the result obtained differs from that of the Term Frequency Inverse Document Frequency (TFIDF) method. For example, document d42 appears within the top 2 results of the proposed method for all sample queries, whereas under TFIDF the same document is ranked (3,8,4,4) for the given sample set of queries.
Query results for keywords
Precision (P) and Mean Average Precision (MAP) are used to evaluate the retrieval quality of the proposed scheme.
Precision (P) = (Number of relevant documents retrieved / Total number of documents retrieved). Mean average precision (MAP) is calculated by averaging the precision, as shown in Equation (9).
For processing N nodes, the search cost of [35] is O(Nt), as it depends on the number of nodes in the collection and each node has size t. As scheme [37] uses a group balanced binary (GBB) tree and greedy search, the search cost of that approach is O(ηNlogt), where η is the number of retrieved documents. The proposed work uses a locally ordered algorithm for binary tree construction, and its search cost is O(ηNlogt). Since the nodes are grouped based on local similarity, the greedy search algorithm prunes more nodes and hence reaches the tree nodes for a given η faster than the existing GBB-based tree construction. Because the proposed similarity-based tree construction uses local order for grouping the nodes, the cost is closer to sub-linear.
The proposed method reaches the most relevant document quickly, thereby reducing the number of nodes to be processed. The search time to reach the ranked documents is shown in Fig. 7. The result illustrates that the proposed AARKS is more efficient than the existing schemes. Since user attributes are embedded in the retrieved documents, only authorized users can decrypt them.
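For reference, the two measures can be computed as in the sketch below, which uses the standard information-retrieval definitions: precision as the fraction of retrieved documents that are relevant, and MAP as the mean over queries of the average precision at the ranks where relevant documents appear. The function names and the toy result lists are illustrative, and this formulation may differ in detail from Equation (9).

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)

def average_precision(retrieved, relevant):
    """Mean of precision@rank taken at each rank that holds a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked result list, set of relevant documents) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two queries (hypothetical document ids).
runs = [(["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y"], {"y"})]
print(precision(*runs[0]), mean_average_precision(runs))
```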

Time for search.
This paper addresses the efficient retrieval of authorized documents stored in third-party storage. Compared with previous schemes designed for authorized keyword search, this work improves search performance by constructing a similarity-based tree. It also preserves the privacy of keywords and documents during the search process. The CPABE-based access control ensures that only authorized documents are returned to the data requester, without leaking any information to the cloud server. In addition, relevance feedback for term re-weighting improves the user experience. As a future enhancement, we plan to add an update feature to the index construction by applying AI techniques. The transformation of authorized documents at the third-party server can also be incorporated to reduce the computation cost for the data user.
