Abstract
Document retrieval plays an important role in knowledge management because it helps users discover relevant information in existing data. This article proposes a cluster-based inverted indexing algorithm for document retrieval. First, pre-processing removes unnecessary and redundant words from the documents. Then, the documents are indexed by the cluster-based inverted indexing algorithm, which integrates the piecewise fuzzy C-means (piFCM) clustering algorithm with inverted indexing. Once the documents are indexed, user queries are matched against them using the Bhattacharyya distance. Finally, query optimisation is performed using the Pearson correlation coefficient, and the relevant documents are retrieved. The performance of the proposed algorithm is analysed using the WebKB data set and the Twenty Newsgroups data set. The analysis shows that the proposed algorithm offers high performance, with a precision of 1, recall of 0.70 and F-measure of 0.8235. The proposed document retrieval system retrieves the most relevant documents and speeds up the storing and retrieval of information.
1. Introduction
Data are a collection of documents stored in databases, where each document contains key-value pairs. Sub-documents are nested inside the main documents using the key value. Document-oriented storage is highly flexible in schema design, as it allows heterogeneous and complex structured data to be stored and collected together. A document-oriented database is considered a schemaless database, as it does not require a schema to convey the data model [1,2]. Owing to technological growth, most organisations collect data of various types, in different file formats and at different speeds. When the data are used properly, analysing them creates massive potential and enables the system to attain better results. In this growing digital world, transforming data into usable information requires not only a data analysis approach but also a computing environment and generation system that can deal with the enormous volume of data held in the database [3–5]. The database in a documentation system offers a wrapper through which the user can access information based on a given query. Depending on the characteristic features of the data, fields such as image and text require different approaches for creating indexes, accessing records and formulating records. In a traditional database, the data are ordered in a specific way so that the contents can be accessed directly using a specific field. Search queries are formulated based on the index fields, and the records are retrieved using the index relation. Hence, the indexed information is accessed directly from the data [6].
Owing to the enormous growth in digital data, understanding the query plays a key role in obtaining the relevant information based on user needs. A classic information retrieval (IR) system relies on a keyword matching scheme to index the documents in the corpus. In such a system, the documents and the queries are represented using the vector space model [7]. Query reformulation is one of the most widely explored approaches for refining the user's information need. Different techniques have been introduced in the literature to reformulate user queries automatically, of which relevance feedback is one of the most important. In the relevance feedback scheme, the top-ranked documents are usually retrieved based on the user's query, and the user is asked to judge their relevance. Information such as the index terms and the non-relevant documents is extracted from the original documents and combined with the query to expand and re-weight the query terms automatically. One of the essential features of the relevance feedback method is the relevance judgement [8]. The system provides a user interface through which the user specifies the information need in the form of a query. During query processing, stop words are removed and the query is converted into the representation on which the system operates, usually an index. The search process uses the index to identify the documents relevant to the query. While searching the documents, the system assigns a matching score to every document [9]. In the medical field [10], clustering has been proven to be a powerful tool for discovering patterns and structure in labelled and unlabelled data sets.
Based on the user's query, documents are retrieved efficiently using indexing. Indexing is the process of assigning terms to documents for easier retrieval [11]. An indexing technique was introduced based on the document assumption, which is assigned to the index terms for document retrieval based on the queries [12]. In the knowledge base, semantic concepts and indexing are used to find matches in the documents. The system infers information by connecting the queries and the concepts [13]. Various techniques are adopted to enhance query evaluation performance, such as combinatorial or heuristic algorithms, fast implementations of basic operations, semantic transformations and logic-based approaches that generate access plans and select among them. These methods are described within a query evaluation framework that represents queries using relational calculus [14]. Documents in the same class use the same keywords, which is the basis of the surface-based matching method for IR [15]. The keyword-based category association is learned from the documents using the Linear Least Squares Fit (LLSF) method [16]. Query optimisation issues, such as database machine usage, distributed database optimisation and query evaluation, have also been addressed. Query optimisation integrates a variety of techniques to solve these problems, ranging from logical transformations to the optimisation of data storage at the system level [14].
1.1. Motivation
Document retrieval is the problem of finding the stored documents that contain useful information. Various methods have been developed for retrieving relevant documents, but several problems remain unsolved. Pattern matching is computationally expensive and is mainly intended to retrieve exact matches. Hence, it does not handle real-world lexical variations and has poor modelling capability for semi-structured data, which is a major challenge in document retrieval [17]. IR systems often produce incomplete and inaccurate results because of synonymy and polysemy. The word and vocabulary mismatch problem arises in the conceptual document space during query transformation [7]. The latencies involved in query pre- and post-processing degrade the performance of the retrieval system. Reduced quality of the query results and query relaxation, along with the need for user involvement, are the major challenges of the query refinement process [18]. Retrieving relevant documents with short queries is also a challenging task, because irrelevant documents are frequently retrieved when the query contains few keywords. Hence, relevant documents affect the negative term distribution [19]. Data such as video, text and images are characterised as unstructured data and contain information valuable to the business. Extracting and searching these kinds of information is a major challenge because they are self-describing and do not have any predefined model [20]. These challenges motivate the new method for document retrieval proposed in this work. The proposed document retrieval system not only retrieves the most relevant documents but also speeds up the storing and retrieval of information. In addition, the proposed cluster-based indexing algorithm performs the clustering and the indexing based on the relevant information of the documents.
The contribution of this article is the development of the cluster-based inverted indexing algorithm by integrating the piecewise fuzzy C-means (piFCM) clustering algorithm and the inverted indexing to generate the document indexing based on the keyword. The rest of the article is organised as follows: the literature survey of the existing techniques is elaborated in section 2. The proposed algorithm is described in section 3, and the results along with the analysis are elaborated in section 4. Finally, section 5 concludes this article.
2. Literature survey
This section reviews the existing methods. Lopez-Otero et al. [17] developed a Query-by-example Spoken Document Retrieval (QbESDR) approach to identify documents based on a spoken query. This approach recorded the documents in indices, which allowed an efficient and fast search. The search time of the query transcription was reduced, but the approach did not use a clustering strategy for matching the document pairs. Rad et al. [21] modelled a lexical scoring system to determine the semantic relationship between words. It utilised the lexical chain model to retrieve the relevant documents based on the judgement of relevance. Even though the ambiguity was resolved, it also retrieved unrelated documents. Gupta et al. [22] developed a hash-based indexing approach for document retrieval. This approach provided privacy when retrieving documents based on the term features. However, the data were not retrieved quickly and effectively. Hao et al. [7] introduced a coupling relationship model to organise and rank the documents based on the learned concepts. It represented the documents and queries in the concept space to retrieve the semantic information. The threshold selection was effectively balanced, but retrieval using the linguistic model was not achieved. Biswas et al. [23] developed a linear space index model to retrieve the relevant documents based on the query time. A monotonic function was determined to compute the relevance using the linear space. It provided efficient retrieval for most of the relevant documents, but it required too much space for limited search functions. Tekli et al. [18] developed a semantic-aware indexing framework that integrates domain knowledge with textual information to process semantic-aware queries. It generated more semantically relevant results and handled the variation while collecting the multi-attribute data. However, it was not accurate in performing the incremental result fetching to enhance the performance. Hao et al. [19] developed an expectation–maximisation (EM) algorithm to identify the top-ranked documents. The negative and positive feedback were integrated to enhance the retrieval performance. Even though it was robust and provided better precision, the complexity of the algorithm increased, as it failed to evaluate the negative documents. Madaan et al. [24] introduced a Question Answering (QA) system to retrieve documents. QA provided exact answers to the language questions and attained a high accuracy level. It effectively indexed the semantic web and retrieved the relevant documents quickly, but it offered low-quality answers for reasoning questions.
3. Proposed cluster-based inverted indexing algorithm
Indexing and retrieving documents based on the query plays a key role in document processing and retrieval. Figure 1 shows the block diagram of the proposed cluster-based inverted indexing. The proposed algorithm involves four stages: pre-processing, document indexing, complex query matching and query optimisation. The database contains a variety of documents, and these input documents are subjected to the pre-processing stage, where redundant and unnecessary words are removed using stop word removal and stemming. The pre-processed documents are then passed to the document indexing stage, which uses the cluster-based inverted indexing algorithm, a combination of inverted indexing and the piFCM clustering algorithm, to index the documents. The clustered documents are fed into the complex query matching stage, where user queries, such as semantic or multigram queries, are matched using the Bhattacharyya distance, and the relevant documents are those at the minimum distance from the query. Finally, query optimisation is performed using the Pearson correlation coefficient within an interactive query optimisation, which determines an effective way to retrieve the documents.

Figure 1. Schematic diagram of the proposed cluster-based inverted indexing for document indexing and retrieval.
3.1. Pre-processing
The database contains several documents, and each input document is pre-processed using stop word removal and stemming to remove redundant and unnecessary words. These two steps are applied to reduce the seeking time of the user. Initially, the duplicate and redundant words are eliminated from the input documents. Then, the optimal information is retrieved from the pre-processed documents by applying indexing and clustering. Every document must be pre-processed before it is forwarded to the indexing stage, because documents with redundant words are not utilised by the query matching criteria to generate the retrieval results. Moreover, documents with unrelated information may also exist in the database. Pre-processing is therefore required for effective retrieval, and a retrieval mechanism without proper pre-processing suffers in performance. The pre-processed documents are used by the indexing operations, which uniquely identify the matched documents based on the keyword of the query.
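The pre-processing stage can be illustrated with a minimal Python sketch. The stop list, the suffix list and the function names below are illustrative assumptions and are not taken from the proposed system; in particular, the suffix stripper is only a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop list (assumption); a real system would use a larger one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes stripped by the toy stemmer, longest first (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The students are indexing the documents"))
```

The token lists produced this way are the kind of input that a clustering and indexing stage would consume.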
3.2. Document indexing for effective retrieval of documents
The resulting pre-processed documents are passed to the document indexing stage. Document indexing is performed using the proposed cluster-based inverted indexing algorithm, which combines a clustering algorithm with inverted indexing. The proposed algorithm performs the clustering and the indexing based on the relevant information of the documents. The documents are indexed so that data can be retrieved easily based on the keyword of the query. The documents are clustered using the piFCM [25] clustering algorithm based on the information present in them. Clustering groups similar documents so that documents with similar features fall under the same cluster. The number of cluster groups may vary, and each cluster group may contain many documents, but the documents grouped under the same cluster must contain similar information.
3.2.1. piFCM clustering algorithm for indexing the documents
Documents with relevant contents are grouped into clusters. A document may belong to various cluster groups according to its content. Different cluster groups contain different documents, which are retrieved when they match the query. Because the documents are grouped into clusters, retrieving information for a user query becomes easier. Every cluster group may contain any number of data objects with relevant information. The proposed cluster-based inverted indexing uses the piFCM clustering algorithm, which is based on the fuzzy C-means (FCM) clustering approach, to cluster the data objects effectively. piFCM enhances the performance of indexing using the objective function of the membership data.
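As an illustration of the FCM step on which piFCM builds, the following is a minimal sketch of standard fuzzy C-means in plain Python. It is not the piecewise variant of [25]; the function names, the fuzzifier m = 2, the iteration count and the random seed are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Standard FCM: alternate centroid and membership updates.

    Returns (centroids, memberships); memberships[i][j] is the degree to
    which points[i] belongs to cluster j, and each row sums to 1.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, normalised so that each row sums to 1.
    memberships = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        memberships.append([v / total for v in row])

    centroids = []
    for _ in range(n_iter):
        # Centroid update: mean of the points weighted by membership ** m.
        centroids = []
        for j in range(n_clusters):
            weights = [u[j] ** m for u in memberships]
            total = sum(weights)
            centroids.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(dim)]
            )
        # Membership update: u_ij proportional to D_ij ** (-1 / (m - 1)),
        # where D_ij is the squared distance from point i to centroid j.
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centroids]
            if min(dists) == 0.0:
                # The point coincides with a centroid: assign it fully there.
                hit = dists.index(0.0)
                memberships[i] = [1.0 if j == hit else 0.0
                                  for j in range(n_clusters)]
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                memberships[i] = [v / total for v in inv]
    return centroids, memberships
```

Because every object receives a membership degree in every cluster, a document can contribute to more than one cluster group, which is the property the text relies on.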
The piFCM clustering is performed using the utility function of the fuzzy consensus clustering (FCC). A contingency matrix is used in the consensus clustering to make the clustering process more effective. The consensus clustering uses the structure of the soft clusters based on the utility function. FCM uses the iterative process of piFCM to increase the efficiency of the consensus clustering. FCC uses horizontal and vertical segmentation to handle big data within a feasible framework, and on the Spark platform the FCC framework is accelerated using a parallelisation scheme. Data objects may belong to any cluster group with different membership degrees. The difference between the basic partitions and the consensus clustering is measured using the utility function, and the clustering result is obtained by maximising the utility value. The objective function based on mutual information uses k-means clustering to identify the optimal solution in the consensus clustering. The utility function of the k-means consensus clustering (KCC) is applied to the objective function to transform the consensus clustering into a k-means clustering problem. The basic partitioning information is collected and summarised in the co-association matrix, which records the number of times each pair of instances occurs in the same cluster group.
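The co-association matrix mentioned above can be sketched as follows. This is the usual textbook construction for hard base partitions, normalised to a fraction; the function name and the input format (one label list per base partition) are assumptions for the example, and the soft-membership version used in fuzzy consensus clustering may differ.

```python
def co_association(partitions):
    """Fraction of base partitions in which each pair of objects shares a cluster.

    `partitions` is a list of label lists, one per base partition, all of the
    same length n. The result is an n x n symmetric matrix with ones on the
    diagonal.
    """
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    return [[v / len(partitions) for v in row] for row in matrix]


# Three hypothetical base partitions of four objects.
base = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
for row in co_association(base):
    print([round(v, 2) for v in row])
```

Pairs with a high co-association value are grouped together in most base partitions, so a consensus clustering would be expected to keep them in the same cluster.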
FCM uses the objective function rather than using the data centroid to identify similar objects based on the distance measure. The concept of multi-membership data is introduced into the piFCM clustering algorithm. The membership data set is denoted as
where
In the piFCM clustering, the piecewise centroid is defined with the dimensional vector of
where
Pseudocode of piFCM.
The fuzzy clustering defines the membership data with
where
where
where
Hence, the output of the cluster is represented as
where
3.2.2. Cluster group retrieval using inverted indexing
The proposed cluster-based inverted indexing algorithm uses inverted indexing to perform the indexing. Here, the input query keyword is sent to the cluster groups, and the matching documents are retrieved from them. When the keyword is present in more than one document, possibly in different cluster groups, all of those documents are retrieved. The query keyword is forwarded to each cluster group, each of which contains many documents, so that for each query keyword all the documents in the various cluster groups are searched and only the matched documents are retrieved through the index. Using inverted indexing together with the clustering algorithm makes the document indexing efficient. The retrieved documents are then passed to the query matching mechanism.
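The lookup described above can be sketched as an inverted index whose postings are grouped by cluster. The data layout, function names and example documents are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def build_cluster_index(clusters):
    """Map each term to {cluster_id: set of doc_ids containing that term}.

    `clusters` maps a cluster id to {doc_id: list of pre-processed tokens}.
    """
    index = defaultdict(lambda: defaultdict(set))
    for cluster_id, docs in clusters.items():
        for doc_id, tokens in docs.items():
            for term in tokens:
                index[term][cluster_id].add(doc_id)
    return index


def lookup(index, keyword):
    """Return {cluster_id: sorted doc_ids} for every cluster holding the keyword."""
    postings = index.get(keyword, {})
    return {cid: sorted(doc_ids) for cid, doc_ids in postings.items()}


# Hypothetical clustered corpus.
clusters = {
    "c1": {"d1": ["student", "faculty"], "d2": ["course"]},
    "c2": {"d3": ["student", "project"]},
}
index = build_cluster_index(clusters)
print(lookup(index, "student"))
```

A keyword that occurs in several cluster groups returns documents from each of them, while clusters that do not contain the keyword are never touched.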
3.3. Complex query matching using the Bhattacharyya distance
The result of the inverted indexing stage is passed to the query matching stage. Complex query matching is performed for user queries, such as multigram or semantic queries, to generate the results. Semantic queries are contextual and associative. A semantic query retrieves documents implicitly and explicitly depending on the structural, semantic and syntactic information present in the documents, and it processes the relationships between the documents based on the semantics of the unstructured data. The query matching process uses the Bhattacharyya distance to produce better query matching results, treating the documents at the minimum distance as the most similar. The Bhattacharyya distance, which is derived from the Bhattacharyya coefficient, measures the differences between documents in the retrieval process. The Bhattacharyya distance is represented as
where

Figure 2. Schematic diagram of complex query matching.
In the complex query matching stage, the input keyword is sent to the cluster groups. In this context, two cluster groups are considered that are referred to as cluster group-1 and cluster group-2. The cluster group-1 has the documents as,
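As a rough illustration of distance-based matching, the sketch below uses the standard definition of the Bhattacharyya distance between two discrete distributions p and q, D_B = -ln(BC) with BC = sum_i sqrt(p_i q_i). Representing the query and each document as relative term frequencies over a shared vocabulary is an assumption made for the example, as are the function names and the sample data.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary):
    """Relative term frequencies over a fixed vocabulary (a probability vector)."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary)
    if total == 0:
        raise ValueError("no vocabulary terms in tokens")
    return [counts[t] / total for t in vocabulary]


def bhattacharyya_distance(p, q):
    """D_B = -ln(BC), with BC = sum_i sqrt(p_i * q_i)."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0.0:
        return math.inf  # the distributions do not overlap at all
    return -math.log(coefficient)


# Hypothetical query and documents.
vocab = ["student", "faculty", "course"]
query = term_distribution(["student", "faculty"], vocab)
docs = {
    "d1": term_distribution(["student", "faculty", "faculty"], vocab),
    "d2": term_distribution(["course", "course", "student"], vocab),
}
best = min(docs, key=lambda d: bhattacharyya_distance(query, docs[d]))
print(best)
```

The document with the smallest distance is taken as the best match, in line with the minimum-distance criterion described in the text.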
3.4. Query optimisation for document retrieval
The matching result obtained from the complex query matching is further processed by the query optimisation stage. Query optimisation is carried out interactively to determine an efficient way of executing a query among the different possible query plans and thereby retrieve the relevant documents. It uses a similarity measure based on the Pearson correlation coefficient, which measures the correlation between the documents and is expressed as
where
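For reference, the standard Pearson correlation coefficient between two vectors x and y is the sum of (x_i - mean(x))(y_i - mean(y)) divided by the product of the square roots of the sums of squared deviations. The sketch below applies it to rank documents against a query; treating the query and documents as numeric feature vectors, the function names and the sample vectors are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    if denominator == 0.0:
        # A constant vector has no defined correlation; treated as 0 here.
        return 0.0
    return numerator / denominator


def rank_by_correlation(query_vector, doc_vectors):
    """Order document ids by decreasing correlation with the query vector."""
    scores = {doc_id: pearson(query_vector, vec) for doc_id, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical feature vectors.
print(rank_by_correlation([1, 0, 2], {"a": [2, 0, 4], "b": [4, 0, 2]}))
```

Documents whose vectors vary in the same way as the query receive a coefficient close to 1 and are ranked first.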
4. Results and discussion
This section describes the results and discussion of the proposed cluster-based inverted indexing algorithm using the piFCM clustering and inverted indexing.
4.1. Experimental setup
The proposed cluster-based indexing algorithm is implemented in Java, and the experimentation is carried out using the Twenty Newsgroups data set [26] and the Reuters data set [27]. The Twenty Newsgroups data set contains a collection of newsgroup documents and is mainly used in text applications. The Reuters data set is a benchmark data set used for document classification. It has 3019 testing documents, 7769 training documents and 90 classes.
4.1.1. Comparative methods used for the analysis
The performance is evaluated by comparing the proposed algorithm with the existing methods, like the Semantic indexing query framework (SemIndex) [18] and EM algorithm [19].
4.1.2. Evaluation metrics
The performance of the proposed cluster-based indexing approach is analysed and evaluated based on the metrics, namely, precision, recall and F-measure.
4.1.2.1. Precision
Precision is defined as the ratio of the relevant documents retrieved to the total number of documents retrieved and is expressed as
where
4.1.2.2. Recall
Recall is defined as the ratio of the relevant documents retrieved to the total number of relevant documents and is represented as
4.1.2.3. F-measure
F-measure is the harmonic mean of the precision and the recall, and it therefore takes both false positives and false negatives into account. Thus, F-measure is represented using the below equation as
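The three metrics can be computed from the sets of retrieved and relevant documents using their standard definitions, as in the sketch below. The function name and the document counts in the example are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: relevant retrieved documents over all retrieved documents.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: relevant retrieved documents over all relevant documents.
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical case: 7 documents retrieved, all relevant, out of 10 relevant documents.
p, r, f = precision_recall_f(range(7), range(10))
print(round(p, 4), round(r, 4), round(f, 4))
```

With these hypothetical counts the function gives a precision of 1, a recall of 0.70 and an F-measure of about 0.8235, the same combination of values reported in the abstract, which is consistent with the F-measure being the harmonic mean.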
4.2. Comparative analysis
This section describes the experimental results of the proposed cluster-based inverted indexing algorithm made using the metrics like precision, recall and F-measure by varying the cluster size.
4.2.1. Comparative analysis using the Twenty Newsgroups data set
The analysis made using the Twenty Newsgroups data set by varying the cluster size is elaborated in this section.
4.2.1.1. Analysis using multigram query
This section presents the analysis for the multigram query ‘student faculty’ in terms of precision, recall and F-measure with respect to the cluster size. Figure 3(a) shows the precision for varying cluster size. When the cluster size is 17, the existing methods SemIndex and EM attain a precision of 0.12 and 0.333, respectively, whereas the proposed cluster-based inverted indexing attains a precision of 1. When the cluster size is 18, SemIndex and EM attain 0.0888 and 0.36, respectively, whereas the proposed algorithm attains 1. When the cluster size is 19, SemIndex and EM attain 0.24 and 0.36, respectively, whereas the proposed algorithm attains 1.

Figure 3. Experimental analysis using the multigram query with data set-1: (a) precision, (b) recall and (c) F-measure.
Figure 3(b) shows the recall for varying cluster size. When the cluster size is 17, SemIndex and EM attain a recall of 0.25 and 0.5, respectively, whereas the proposed cluster-based inverted indexing attains a recall of 0.625. When the cluster size is 18, SemIndex and EM attain 0.1667 and 0.5, respectively, whereas the proposed algorithm attains 0.75. When the cluster size is 19, SemIndex and EM attain 0.458 and 0.5, respectively, whereas the proposed algorithm attains 0.75. When the cluster size is 20, SemIndex and EM attain 0.5 and 0.583, respectively, whereas the proposed algorithm attains 0.7083.
Figure 3(c) shows the F-measure for varying cluster size. When the cluster size is 17, SemIndex and EM attain an F-measure of 0.1621 and 0.4, respectively, whereas the proposed cluster-based inverted indexing attains an F-measure of 0.7692. When the cluster size is 18, SemIndex and EM attain 0.1159 and 0.4186, respectively, whereas the proposed algorithm attains 0.8571. When the cluster size is 19, SemIndex and EM attain 0.3188 and 0.4186, respectively, whereas the proposed algorithm attains 0.8571. When the cluster size is 20, SemIndex and EM attain 0.3835 and 0.4351, respectively, whereas the proposed algorithm attains 0.829.
4.2.1.2. Analysis using the semantic query
This section presents the analysis for the semantic query ‘Electronics’ in terms of precision, recall and F-measure with respect to the cluster size. Figure 4(a) shows the precision for varying cluster size. When the cluster size is 17, SemIndex and EM attain a precision of 0.3111 and 0.4, respectively, whereas the proposed cluster-based inverted indexing attains a precision of 1. When the cluster size is 18, SemIndex and EM attain 0.0888 and 0.4, respectively, whereas the proposed algorithm attains 1. When the cluster size is 19, SemIndex and EM attain 0.1020 and 0.333, respectively, whereas the proposed algorithm attains 1.

Figure 4. Experimental analysis using the semantic query with data set-1: (a) precision, (b) recall and (c) F-measure.
Figure 4(b) shows the recall for varying cluster size. When the cluster size is 17, SemIndex and EM attain a recall of 0.5 and 0.583, respectively, whereas the proposed cluster-based inverted indexing attains a recall of 0.833. When the cluster size is 18, SemIndex and EM attain 0.1667 and 0.5, respectively, whereas the proposed algorithm attains 0.8333. When the cluster size is 19, SemIndex and EM both attain 0.2083, whereas the proposed algorithm attains 0.5. When the cluster size is 20, SemIndex and EM attain 0.4166 and 0.5, respectively, whereas the proposed algorithm attains 0.625.
Figure 4(c) shows the F-measure for varying cluster size. When the cluster size is 17, SemIndex and EM attain an F-measure of 0.3835 and 0.4745, respectively, whereas the proposed cluster-based inverted indexing attains an F-measure of 0.9090. When the cluster size is 18, SemIndex and EM attain 0.1159 and 0.444, respectively, whereas the proposed algorithm attains 0.9090. When the cluster size is 19, SemIndex and EM attain 0.3169 and 0.144, respectively, whereas the proposed algorithm attains 0.666. When the cluster size is 20, SemIndex and EM attain 0.270 and 0.4, respectively, whereas the proposed algorithm attains 0.7693.
4.2.2. Comparative analysis using the Reuters data set
The analysis made using the Reuters data set by varying the cluster size is elaborated in this section.
4.2.2.1. Analysis using multigram query
This section presents the analysis for the multigram query ‘schedule limit’ in terms of precision, recall and F-measure with respect to the cluster size. Figure 5(a) shows the precision for varying cluster size. When the cluster size is 17, SemIndex and EM attain a precision of 0.3636 and 0.4909, respectively, whereas the proposed cluster-based inverted indexing attains a precision of 0.7. When the cluster size is 18, SemIndex and EM attain 0.2545 and 0.26, respectively, whereas the proposed algorithm attains 1. When the cluster size is 19, SemIndex and EM attain 0.22 and 0.236, respectively, whereas the proposed algorithm attains 0.6181.

Figure 5. Experimental analysis using the multigram query with data set-2: (a) precision, (b) recall and (c) F-measure.
Figure 5(b) shows the recall for varying cluster size. When the cluster size is 17, SemIndex and EM attain a recall of 0.45 and 0.666, respectively, whereas the proposed cluster-based inverted indexing attains a recall of 1. When the cluster size is 18, SemIndex and EM attain 0.3714 and 0.4667, respectively, whereas the proposed algorithm attains 0.9166. When the cluster size is 19, SemIndex and EM attain 0.3142 and 0.4333, respectively, whereas the proposed algorithm attains 0.5666. When the cluster size is 20, SemIndex and EM attain 0.3428 and 0.4666, respectively, whereas the proposed algorithm attains 0.9166.
Figure 5(c) shows the F-measure for varying cluster size. When the cluster size is 17, SemIndex and EM attain an F-measure of 0.40223 and 0.5654, respectively, whereas the proposed cluster-based inverted indexing attains an F-measure of 0.8235. When the cluster size is 18, SemIndex and EM attain 0.3020 and 0.3339, respectively, whereas the proposed algorithm attains 0.9565. When the cluster size is 19, SemIndex and EM attain 0.2588 and 0.3058, respectively, whereas the proposed algorithm attains 0.5913. When the cluster size is 20, SemIndex and EM attain 0.2823 and 0.3294, respectively, whereas the proposed algorithm attains 0.9565.
4.2.2.2. Analysis using semantic query
This section presents the analysis for the semantic query ‘freedom’ in terms of precision, recall and F-measure as the cluster size varies. Figure 6(a) shows the precision as a function of the cluster size. At a cluster size of 17, the existing methods SemIndex and EM attain precision values of 0.2 and 0.24, respectively, while the proposed cluster-based inverted indexing attains a higher precision of 0.545. At a cluster size of 18, SemIndex and EM attain 0.3 and 0.327, whereas the proposed method attains 0.7454. At a cluster size of 19, SemIndex and EM attain 0.24 and 0.4, while the proposed method attains 0.545.

Figure 6. Experimental analysis using the semantic query on data set-2: (a) precision, (b) recall and (c) F-measure.
Figure 6(b) shows the recall as a function of the cluster size. At a cluster size of 17, the existing methods SemIndex and EM attain recall values of 0.3428 and 0.366, respectively, while the proposed cluster-based inverted indexing attains a higher recall of 0.5. At a cluster size of 18, SemIndex and EM attain 0.4285 and 0.6, whereas the proposed method attains 0.683. At a cluster size of 19, SemIndex and EM attain 0.342 and 0.5, while the proposed method attains 0.7333. At a cluster size of 20, SemIndex and EM attain 0.371 and 0.4, while the proposed method attains 0.5166.
Figure 6(c) shows the F-measure as a function of the cluster size. At a cluster size of 17, the existing methods SemIndex and EM attain F-measure values of 0.2526 and 0.290, respectively, while the proposed cluster-based inverted indexing attains a higher F-measure of 0.5217. At a cluster size of 18, SemIndex and EM attain 0.3529 and 0.4235, whereas the proposed method attains 0.713. At a cluster size of 19, SemIndex and EM attain 0.2823 and 0.444, while the proposed method attains 0.6255. At a cluster size of 20, SemIndex and EM attain 0.2748 and 0.315, while the proposed method attains 0.539.
4.3. Comparative discussion
This section compares the proposed cluster-based inverted indexing with the existing methods in terms of the maximal values of precision, recall and F-measure. SemIndex attained maximum precision, recall and F-measure values of 0.0888, 0.4583 and 0.4022, respectively. EM attained maximum values of 0.4909, 0.5 and 0.5654, whereas the proposed cluster-based inverted indexing attained 1, 0.70 and 0.8235, respectively. Table 1 summarises the comparison.
Table 1. Comparative discussion.
EM: expectation–maximisation.
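As a quick consistency check on the reported metrics, the sketch below computes the F-measure from a precision and recall pair. It assumes the F-measure is the usual balanced harmonic mean of precision and recall, which this section does not state explicitly; the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: the article uses the standard F1 definition.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing relevant is retrieved
    return 2 * precision * recall / (precision + recall)


# Precision of 1 and recall of 0.70, as reported for the proposed method.
print(round(f_measure(1.0, 0.70), 4))  # prints 0.8235
```

Under this assumption, a precision of 1 and a recall of 0.70 give an F-measure of about 0.8235, which agrees with the value stated in the abstract and conclusion.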
5. Conclusion
This research proposes a cluster-based inverted indexing algorithm for retrieving relevant documents. The algorithm combines inverted indexing with the piecewise fuzzy C-means clustering algorithm. The documents are first pre-processed using stop word removal and stemming to remove redundant and unnecessary words. Document indexing is then performed by the cluster-based inverted indexing algorithm, which takes the pre-processed documents and generates the index from the keywords of the clustered documents. The resulting documents are further processed by the complex query matching step, in which user queries, such as semantic queries or multigram queries, are matched using the Bhattacharyya distance. The best match is the one with the minimal Bhattacharyya distance. Query optimisation uses the Pearson correlation coefficient within an interactive query optimisation step, after which the relevant documents are retrieved. The proposed cluster-based inverted indexing algorithm achieves precision, recall and F-measure values of 1, 0.70 and 0.8235, respectively. The results show that the proposed document retrieval system retrieves the most relevant documents and speeds up the storing and retrieval of information. A future extension of this work may build on other optimal clustering algorithms for document retrieval. The proposed system will also be extended to image retrieval and video retrieval.
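The two measures named above can be illustrated with a minimal sketch. It assumes that the query and each cluster are represented as term-frequency histograms over a shared vocabulary, which is an assumption made for illustration and not the article's stated representation. The helper names, cluster labels and counts are hypothetical, and the exact way the article applies the Pearson correlation coefficient during query optimisation is not reproduced here.

```python
import math


def normalise(counts):
    """Turn non-negative term counts into a discrete probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions of equal length."""
    # Bhattacharyya coefficient, clamped to 1 to absorb floating-point error.
    bc = min(sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)), 1.0)
    return math.inf if bc == 0 else -math.log(bc)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


# Hypothetical term counts over a four-term vocabulary.
query = normalise([3, 1, 0, 0])
clusters = {
    "cluster_a": normalise([4, 2, 1, 1]),
    "cluster_b": normalise([0, 1, 4, 3]),
}

# Query matching: choose the cluster with the minimal Bhattacharyya distance.
best = min(clusters, key=lambda name: bhattacharyya_distance(query, clusters[name]))
print(best)  # prints cluster_a

# Pearson correlation between the query and the matched cluster histogram.
print(pearson(query, clusters[best]))
```

In this toy example the query shares most of its probability mass with `cluster_a`, so that cluster has the smaller distance and is selected.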
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
