Abstract
A similarity measure is used in information retrieval systems to retrieve and rank the relevant documents. In this paper, a new fuzzy-based approach to develop hybrid similarity measure is proposed and implemented. The proposed approach overcomes the limitations of extensively used similarity measures such as Cosine, Jaccard, Euclidean and Okapi-BM25 along with Genetic Algorithm-based hybrid similarity measures proposed by researchers. This approach uses fuzzy rules to infer the weights of different similarity measures. In this paper, the experiments are performed on CACM and CISI benchmark data collections. The performance of the proposed approach is evaluated in terms of precision, recall and average precision and average recall of retrieved relevant documents. The results are compared with different similarity measures available in literature. The results show the marked improvement in performance of information retrieval systems using the proposed fuzzy logic-based hybrid similarity measure.
Keywords
1. Introduction
The importance of information retrieval (IR) systems increased with the development of libraries in order to obtain desired information from large document archives. IR has been radically changed with the advent of computers and digital technologies. Information retrieval is a process or method whereby a prospective user of information is able to convert her need for information into an actual list of documents in storage containing information useful to her [1, 2].
The main task of an IR system is to provide a list of relevant documents for a user query. This task is generally formulated as a ranking problem: for a given query, the retrieval system should rank the documents so that the more relevant documents to the query appear above the less relevant documents. Although many statistical measures are available in literature [3–5], the performance of IR systems is still an issue. In order to enhance the performance, different approaches to develop similarity measures have been proposed [6–17]. The Genetic Algorithm (GA) has been used to develop new similarity measures for improvement in IR system performance in the recent years. Fuzzy logic is also employed to a limited extent [18, 19] in IR.
A fuzzy logic system [20] is a formal framework well suited to modelling vagueness and uncertainty. In the present paper, a new fuzzy logic-based approach is proposed to develop a hybrid similarity measure that incorporates the features of standard/conventional similarity measures. The fuzzy weights are determined using fuzzy logic on the basis of the individual similarity scores of different similarity measures. This approach ensures that a higher value of weight is assigned in a fuzzified manner to the similarity measure having a high similarity score. The performance of the proposed approach is compared with the GA-based similarity measure proposed in Pathak et al. [17] along with other conventional similarity measures. The Vector Space Model (VSM) is used as an underlying model to develop the proposed method owing to its strengths compared with other models.
The rest of the paper is organized as follows. The literature review of available models, similarity measures for IR and Pathak et al.’s [17] method for GA-based hybrid similarity measure is presented in Section 2. Section 3 deals with the details of the proposed fuzzy logic approach to develop the hybrid similarity measure. The experimental results are presented in Section 4. In Section 5, the analyses of the performance of different approaches are discussed. Finally, the conclusion is drawn in Section 6.
2. Literature review
The Boolean model [14], VSM [21–24] and the Probabilistic model [25, 26] have been proposed to implement IR systems. VSM is considered as the most popular and efficient model owing to its simplicity and fast processing. It can handle weighted terms and produces a ranked list as output, and the indexing process is automated, which means a significantly lighter workload for the administrator of the data collection. The similarity measure is the centre of the VSM, which is used to measure the angle between two vectors. The framework of the VSM employs a ranking algorithm that tries to rank documents according to overlap between the terminology of the query and each document.
Generally, there are two approaches for searching relevant documents [27] in the literature. The first one is the keyword-based approach [28, 29]. The second approach uses the whole document as a vector to compute a similarity measure [30, 31] and the query is also presented as a vector using VSM. On the basis of these approaches, different similarity functions were proposed. The definitions of some widely used similarity measures such as Cosine, Jaccard [3] and Okapi-BM25 [4, 5] aare given below.
Cosine: this calculates the cosine of the angle between the query and document vector, as shown in eqn (1). The numerator represents the dot product of the vectors q and d, while the denominator is the product of their Euclidean lengths:
Jaccard coefficient: the Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the document and query vectors, as shown in eqn (2):
Okapi-BM25: this similarity measure considers not only the frequency of the query terms, but also the average length of the whole collection and the length of the document under evaluation. The mathematical formula for Okapi-BM25 is shown in eqn (3):
where Q is a query that contains the words T; D is the document set; k1, b and k3 are constant parameters; K is
In the past, various similarity measures have been developed for probabilistic and vector space IR models. Lin [6] presented an information theoretic definition of similarity for a probabilistic model. Jahromi and Valizadeh [7] proposed a query-sensitive similarity function for IR. Tuomo et al. [8] used a connection between the Cosine and the Euclidean in association with principal component analysis and grounded searching on the latter then applied the single and complete linkage. Yeh et al. [9] presented a learning to rank method named RankGP. This method employed genetic programming to learn a ranking function by combining various IR parameters like content features and structural features. Xia et al. [10] proposed a listwise approach to implement ranking function by taking individual list as an instance and minimizing a loss function defined on the predicted list and ground-truth list. Gledson and Keane [11] described a simple web-based similarity measure that relies on page counts only, could be utilized to measure the similarity of entire sets of words in addition to word pairs and could use any web-service enabled search engine distributional similarity measure. Chen [12] presented a new similarity measure based on the geometric mean averaging operator to handle the similarity problems of generalized fuzzy numbers. Garcia et al. [13] presented three similarity measures based on the journal ranking scores for academic journals in each subject area. The authors also proposed that the relative performance of each subject area of Elsevier’s Scopus can be evaluated using the overall prestige for the most important journals with ranking score above a given threshold value. Jiyin et al. [14] proposed a result diversification framework based on query-specific clustering and cluster ranking, in which diversification is restricted to documents belonging to clusters that potentially contain a high percentage of relevant documents. Bhatnagar and Pareek [15] proposed a combined matching function using evolutionary approach for improving the efficiency of IR systems. Usharani and Iyakutti [16] proposed a GA-based method for finding the similarity of web documents based on cosine similarity.
Similarity measures such as Cosine, Jaccard, Euclidean and Okapi are extensively used in VSM. Okapi-BM25 is the latest variant of Okapi and enhances the performance of IR systems. Researchers such as Pathak et al. [17] also tried to design a hybrid similarity measure by finding a linear combination of these similarity measures in order to overcome the limitations of different similarity measures. In their approach, the weights for different similarity measures are determined using GA, but it suffers two drawbacks. First, finding the best possible combination of weights through GA is a time-consuming process. Second, it is not able to capture vagueness and uncertainty as the documents and queries are written in natural language.
2.1. Pathak et al.’s method for GA-based similarity measure
Pathak et al. [17] treated an overall matching function as a weighted sum of the scores returned by different matching functions. The weights are determined by using GA. The fitness function [32] used in this approach is given by eqn (4):
where α is a parameter used to express the degree of user preference for precision (P) or recall (R) component. A higher value of α characterizes a user with less preference for recall, while a lower value of α characterizes one with less preference for precision.
For each individual similarity measure, a randomly chosen weight (in the range 0.0–1.0) is assigned. The weights are encoded using the actual real numbers between 0.0 and 1.0 (inclusive). The experiments are conducted for a population size of 50, run for 75 generations; α is equal to 1 and crossover and mutation rates equal 0.6 and 0.1, respectively.
3. Proposed fuzzy logic-based approach for hybrid similarity measure
An appropriate weighted combination of scores produced by different similarity measures is possible in order to achieve better retrieval results as compared with those produced by any single similarity measure. The development of hybrid similarity measure is focussed on finding appropriate weights to be used for each similarity measure. Here, fuzzy logic may be useful for the reasons explained in Section 1 to decide the appropriate weights of different similarity measures. Therefore, fuzzy logic is employed in the proposed approach to develop hybrid similarity measures. The mathematical formulation for a fuzzy logic-based hybrid similarity measure (FBHSM) can be expressed by (5).
where smi (D,Q) represents the similarity measure that is used to calculate the matching score of document D for a given query Q; wti denotes the weight of the ith similarity measure and n is the total number of similarity measures considered. Two different fuzzy hybrid similarity measures are presented in this paper. The first one is a combination of cosine, Euclidean and Jaccard, that is, FBHSM1, and the second one is a combination of cosine, Jaccard and Okapi-BM25, that is, FBHSM2. The mathematical formulations for FBHSM1 and FBHSM2 are expressed by eqns (6) and (7):
The appropriate values of weights are determined using a fuzzy logic controller to maximize retrieval efficiency. A Mamdani-type Fuzzy Inference System [33] is used in the proposed approach with the help of the MATLABTM Fuzzy Logic Toolbox. Figure 1 shows the block diagram of a fuzzy logic controller that incorporates three major blocks. The first is the input block, the second is the main fuzzy logic controller block and the third is the output block. The working of fuzzy logic controller block, which contains three processes – fuzzification, approximate reasoning and defuzzification, is illustrated.

The block diagram of the fuzzy logic controller.
3.1. Fuzzification
In the proposed approach, three input fuzzy variables such as the scores/values of Cosine, Jaccard and Euclidean for each document are used for FBHSM1. Similarly the scores/values of Cosine, Jaccard and Okapi-BM25 for each document are used for FBHSM2. W1, W2 and W3 are the output variables, which represent the weights of Cosine, Jaccard and Euclidean/Okapi-BM25 similarity measures, respectively. The range of all input and output variables is represented by three linguistic terms as low, medium and high as shown in Figure 2. A triangular membership function is used to map input space to a degree of membership of fuzzy set.

Input and output membership functions.
3.2. Approximate reasoning
An approximate reasoning is established for inference in order to deal with uncertainty and vagueness. It includes two sub-processes, that is, fuzzy rule base and fuzzy inference engine as described below.
3.2.1. Fuzzy rule base
The Mamdani-type fuzzy rule toolbox is used to formulate the conditional statements that comprise fuzzy logic. Common knowledge of IR signifies that a higher value of weight should be assigned to the associated similarity measure that has a higher similarity value/score as compared with other similarity measures that have lower similarity values/scores. Therefore, the fuzzy rules are framed on the basis of the above domain knowledge. A total of 27 Mamdani-type fuzzy rules are framed as conditional statements, and are listed in Table 1.
Fuzzy rule base
3.2.2. Fuzzy inference engine
In the proposed approach, the fuzzy implication is modelled by Mamdani’s minimum operator. The sentence connective is also interpreted as ORing (logical OR) the propositions and defined by a MAX operator. The logical AND operator is typically used to combine the membership values for each fired rule to generate the membership values for the fuzzy sets of output variables in the subsequent part of the rule.
A fuzzy inference diagram for FBHSM2 is presented in Figure 3 in order to interpret the entire fuzzy inference process at once. In this figure, the six small plots in each row represent the antecedent and consequent of the particular rule. Hence, each rule is a row of plots, and each column is a variable. The first three columns of plots show the membership functions referenced by the antecedent, or the if-part of each rule. The last three columns of plots show the membership functions referenced by the consequent, or the then-part of each rule. The bottom-most plots in the fourth, fifth and sixth columns represent the aggregate weighted decision values of W1, W2 and W3. These diagrams may be used as a diagnostic for the performance of all 27 fuzzy rules.

Fuzzy Inference diagram for FBHSM2.
Further, the three-dimensional rule surfaces are plotted in Figure 4. It displays the dependency of any one of three outputs (i.e. W1, W2 and W3) on any two of the three inputs (i.e. value of Cosine, Jaccard and Okapi-BM25 for FBHSM2). Similarly the same analysis can be done for FBHSM1 to understand the fuzzy inference process and dependency of variables.

Rule surface diagram for fuzzy inference in FBHSM2.
3.3. Defuzzification
In this approach, the centroid method (centre of sums) is used for defuzzification method as defined by eqn (8). The defuzzified output value of FBHSM represents the weights for values of Cosine, Jaccard and Okapi-BM25:
where the input for the defuzzification process is a fuzzy set
4. Experimental results
The experiments have been performed on CACM and CISI benchmark datasets. CACM and CISI are the datasets on Communications and Information Retrieval systems, respectively. Both datasets are in English. CACM contains 3204 documents and CISI contains 1460 documents. The 25 queries are selected randomly from each dataset. The analysis of results obtained is presented in the form of precision-query curves, recall-query curves, average precision rate and average recall rate. The analysis is done for the top 10, top 25 and top 50 documents. The results are compared with Cosine, Jaccard, Euclidean, Okapi-BM25 and GA-based similarity measures developed by Pathak et al. [17].
In this paper, Pathak et al.’s [17] approach is implemented by running GA 10 times for each query in order to check the consistency and to obtain best possible optimized results corresponding to fitness values throughout these runs. Figure 5(a) represents the number of generations vs best and worst fitness curves and Figure 5(b) shows number of runs vs best fitness value of query 1 for the CACM dataset. Similarly, Figure 6(a and b) represents the same for the CISI dataset.

(a) Number of generations vs best and worst fitness value curves for query 1 (CACM dataset) using Pathak et al.’s method. (b) Best fitness value for each run for query 1 (CACM dataset) using Pathak et al.’s method [17].

(a) Number of generations vs best and worst fitness value curves for query 1 (CISI dataset) using Pathak et al.’s method. (b) Best fitness value for each run for query 1 (CISI dataset) using Pathak et al. method.
The performance of the proposed fuzzy-based approach is evaluated on the basis of precision and recall [22] in this paper. The precision and the recall are defined as follows in eqns (9) and (10):
where |Ra| is the set of relevant documents retrieved, |A| is the set of total documents retrieved and |R| is the set of all relevant documents. Moreover, the performance is also evaluated on the basis of average number of relevant documents retrieved.
Figures 7 and 8 represent curves for precision vs number of queries obtained for CACM and CISI datasets, respectively. Similarly, the curves for recall vs number of queries obtained for CACM and CISI datasets are presented in Figures 9 and 10, respectively. The performances of different similarity measures on the basis of the above curves are analysed in next section.

The precision of the top 10 retrieved documents with respect to 25 queries for different methods for the CACM dataset.

The precision of the top 10 retrieved documents with respect to 25 queries for different methods for the CISI dataset.

The recall of the top 10 retrieved documents with respect to 25 queries for different methods for the CACM dataset.

Recall of the top 10 retrieved documents with respect to 25 queries for different methods for the CISI dataset.
5. Discussion
Figure 7 clearly illustrates that FBHSM2 gives better precision values for 17 queries and lags for only two queries out of 25 as compared with Cosine, Jaccard, Euclidean, Okapi-BM25 and Pathak et al.’s GA-based similarity measure. Figure 7 also shows that FBHSM1 outperforms Cosine, Jaccard, Euclidean, Okapi-BM25 and Pathak et al.’s GA-based similarity measure for most of the queries but lags behind FBHSM2. Figure 8 shows that FBHSM2 gives high precision values of document retrieval for all 25 queries and FBHSM1 also gives better results in comparison to Cosine, Jaccard, Euclidean, Okapi-BM25 and Pathak et al.’s GA-based similarity measure for most of the queries. The results shown in Figures 7 and 8 are for the 10 top-ranked retrieved documents. The same analysis is also done for the top 25 retrieved documents and the top 50 retrieved documents, and the results are in support of FBHSM1 and FBHSM2.
Figure 9 clearly illustrates that FBHSM2 gets better recall values for 17 queries and lags for only six queries out of 25 as compared with Cosine, Jaccard, Euclidean, Okapi-BM25 and Pathak et al.’s GA-based similarity measures. Figure 9 also shows that FBHSM1 outperforms Cosine, Jaccard, Euclidean, Okapi-BM25 and Pathak et al.’s GA-based similarity measure for most of the queries. Figure 10 shows that FBHSM2 gets high recall values of document retrieval for 20 queries and lags only for one query out of 25. FBHSM1 also gives better recall values in comparison to Cosine, Jaccard, Euclidean, Okapi-BM25 and Pathak et al.’s GA-based similarity measure for most of the queries. The results shown in Figures 9 and 10 are for the 10 top-ranked retrieved documents. The same analysis is also done for the top 25 retrieved documents and top 50 retrieved documents, and the results are highly encouraging.
5.1. Comparison of average precision rate and average recall rate
Tables 2 and 3 show a comparison of the average precision and the average recall of the proposed similarity measures, that is, FBHSM1 and FBHSM2 with Pathak et al.’s GA-based similarity measure along with Okapi-BM25 for the CACM and CISI datasets, respectively. It is clear that the proposed methods increase average precision and average recall of the IR system for dealing with document retrieval. The analysis is done for the top 10 retrieved documents, the top 25 retrieved documents and the top 50 retrieved documents.
Comparison of average precision and average recall of the proposed approaches with Pathak et al.’s GA-based method along with Okapi-BM25 for CACM dataset
Comparison of average precision and average recall of the proposed approaches with Pathak et al.’s GA-based method along with Okapi-BM25 for CISI dataset
6. Conclusion
A new fuzzy logic-based approach to develop hybrid similarity measure is presented in this paper for IR. Two hybrid similarity measures named FBHSM1 and FBHSM2 are implemented on the basis of the proposed approach, the novelty of which lies in deciding the weights of different similarity measures using fuzzy logic. In comparison to Pathak et al.’s GA-based similarity measure, the proposed approach has two advantages: robustness and less computation time to determine the weights. The performance of FBHSM1 and FBHSM2 was evaluated and compared with Cosine, Jaccard, Euclidean, Okapi-BM25 and Pathak et al.’s GA-based similarity measures on CACM and CISI benchmark datasets. It is clear from the experiments performed that FBHSM1 and FBHSM2 not only are superior to those similarity measures that are used in formation of both of the proposed similarity measures, but are also better than Pathak et al.’s similarity measure. Finally, it is concluded that FBHSM2 is more promising than FBHSM1 because individually Okapi-BM25 (part of FBHSM2) gives better results over Euclidean (part of FBHSM1).
This paper is a preliminary effort to apply fuzzy logic in developing hybrid similarity measures and the results are promising, although the complexity of the hybrid similarity measure depends on the count of individual similarity measures as the number of fuzzy rules to be formed is also increased. As regards the future scope of the work, a better similarity measure can be developed using fuzzy logic and other statistical parameters like term frequency, inverse document frequency and normalization factors. This may bring substantial improvements in performance in IR systems.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
