Abstract
An information retrieval system (IRS) is used to retrieve documents based on an information need. The IRS makes relevance judgements by attempting to match a query to a document. As IRS capabilities are indexing design dependent, the hybrid indexing method (IRS-H) is introduced. The objectives of this article are to examine IRS-H (as an alternative indexing method that performs exact phrase matching) and IRS-I, regarding retrieval usefulness, identification of relevant documents, and the quality of rejecting irrelevant documents by conducting three experiments and by analysing the related data. Three experiments took place where a collection of 100 research documents and 75 queries were presented to: (1) five participants answering a questionnaire, (2) IRS-I to generate data and (3) IRS-H to generate data. The data generated during the experiments were statistically analysed using the performance measurements of Precision, Recall and Specificity, and one-tailed Student’s t-tests. The results reveal that IRS-H (1) increased the retrieval of relevant documents, (2) reduced incorrect identification of relevant documents and (3) increased the quality of rejecting irrelevant documents. The research found that the hybrid indexing method, using a small closed document collection of a hundred documents, produced the required outputs and that it may be used as an alternative IRS indexing method.
1. Introduction
Vocabulary mismatch is a phenomenon that occurs when searching for phrases in a document collection. In the business world and in academic research, when information retrieval systems (IRSs) are used to search for information, vocabulary mismatch often occurs. The motivation for this research was based on observations of vocabulary mismatch: (1) during a partial divestiture by a mining company and (2) during postgraduate doctoral research.
A few years ago, a US company sold its copper and cobalt mine, situated in the south-east of the Democratic Republic of the Congo, to a mining company from the People’s Republic of China. The project objective was to identify and then separate the mine’s data and text information from computer servers situated in the United States, and once separated, migrate the data and text held in the databases and closed document collections to cloud-based computer servers hosted in Johannesburg, South Africa.
During migration of over 10,000 documents, a data quality initiative was necessary to cleanse historical textual data held within the system. This initiative encompassed the use of specific vocabulary, the search for key phrases and avoiding mismatching vocabulary where possible. Thus, the words in the phrases were required to be in the exact order and within the required proximity of each other, as specified in the search. Word ordinality and word proximity were critical requirements for the search of the unstructured text within the documents (purchase order details, invoice details and material descriptions). The objective was to understand the content of these documents better, thus enabling the creation of metadata that could better describe the contents of these documents and provide an improved search to retrieve exactly those documents that were relevant. The challenge, therefore, was to explore indexing methods and the phrases, keywords and arrangement of words (using ordinality and proximity) that existed within the documents and to identify which of the documents contained the specific phrases, thus attempting to reduce the mismatch in vocabulary.
The problem of mismatching vocabulary and efforts to reduce the impact thereof are well documented in the literature. One group of authors, in an attempt to improve the mismatching of vocabulary, applied a Bayesian optimisation intensified lexical method and a latent vector space method (referred to as a neural vector space model) to an IRS with the latter indicating some form of improvement [1]. Referring to the vocabulary mismatch problem as the semantic gap and to the relevancy of a user’s collection of documents, a second group explained that the semantic gap hinders the matching of a query to a document and that this is the important problem to attempt solving [2]. Further groups confirm that vocabulary mismatch remains a problem and state that it continues to be one of the important challenges when using keyword-based search systems [3,4], while others hint at the requirement for exact phrase matching and a review of indexing design [5,6].
The word ‘indexing’ refers to the way in which data stored in a computer can be retrieved in response to a query. The word ‘indexing’ also refers to the way in which the content of a document is represented. This representation can be performed using terms from a controlled vocabulary, for example, thesaurus or assigned keywords or a classification scheme.
During postgraduate doctoral research, one of the authors of this article experienced the problem of mismatching vocabulary. In preparation for a literature review (LR), the task the researcher was faced with was to search for and then read copious numbers of research articles, including conference papers, journal articles, theses, books and other documents, and to judge whether any of these articles were relevant to the research topic and reject those that were not. To be able to search for the literature, the researcher had to compile a vocabulary of phrases (that best describe the topic) and then express these phrases as queries. Thereafter, the researcher had to present the queries to the IRS, activate the search, and then review the documents retrieved by the IRS.
To summarise this situation, the research problem was stated as: for business people and researchers, challenges exist for IRSs when handling mismatching vocabularies in queries and source documents [7]. These challenges include IRSs retrieving some documents that are irrelevant and missing some that are relevant [8]. This situation increases the time taken to conduct the research by forcing additional perusal of unsatisfactory results [9] rendering IRSs less useful than they could be, thus inhibiting productive research [2,5].
An IRS is used to retrieve documents based on an information need. The IRS makes relevance judgements by attempting to match a query’s phrases to a document. As the usefulness of IRSs is dependent on indexing design, the novel hybrid indexing method (IRS-H) for research-based information retrieval is introduced in an attempt to solve the research problem.
Following the introduction of this research, the inverted indexing method (IRS-I) is discussed, the key concepts for the design of the novel hybrid indexing method (IRS-H) (an indexing method that maintains word ordinality and proximity suitable for phrase use) are explained, and then by experiment, the results generated by IRS-H and the results generated by IRS-I are compared with each other.
The objectives of this article are to examine IRS-H (as an alternative indexing method that performs exact phrase matching) and IRS-I, regarding retrieval usefulness, identification of relevant documents, and the quality of rejecting irrelevant documents by conducting three experiments and by analysing the related data. In this article, the number of documents in the researcher’s closed collection is deemed small (a few hundred documents as the participants need to physically review all documents) rather than the tens of thousands found in large document collections. Once a researcher has collected documents from sources that are of interest to him or her and these documents now reside on the researcher’s computer, the experiments begin. Note that some older work is referenced to highlight the original ideas and concepts of authors during the pre-computer period as well as the computer infancy period.
2. Background
2.1. Background to the inverted indexing method
By design, the inverted index method comprises two artefacts (Figure 1), the IRS and a traditional single index.

The two artefacts of the inverted indexing method.
When a search of the IRS (artefact 1) is triggered, a query is presented to the index (artefact 2), and when complete, a result is returned. The inverted index itself has two components: (1) the postings dictionary – a list of distinct terms sorted alphabetically and (2) the pointers that point to the positions (the document identities of where they exist) in each posting list [10–12]. Working in a similar way to a subject index in a reference book, the inverted index is a particular indexing structure that performs term-based matching [8]. A term (or word) expressed within a query is matched to a term (or word) within a document [13–16]. The drawback is that when more than one word is used in a query, one word may occur on the first page and another on the last page of a document. This generates the bag-of-words phenomenon, which creates the same effect as if the words within a collection are randomly dropped into a bag, without structure and without order [17,18].
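The term-based matching described above can be illustrated with a minimal sketch. It assumes whitespace tokenisation and the 'at least one word' matching rule that this article later attributes to IRS-I; the function names and the three sample documents are hypothetical and are not taken from the experiments.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_any_term(index, query):
    """Return ids of documents that contain at least one term of the query."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits


# Hypothetical three-document collection.
docs = {
    1: "design science research guides artefact construction",
    2: "the science of materials informs bridge design",
    3: "qualitative method in health research",
}
index = build_inverted_index(docs)

# Document 2 is returned although it never contains the phrase
# 'design science': word order and proximity are not stored in the index.
print(search_any_term(index, "design science"))
```

Because the index stores only terms and document identities, document 2 cannot be distinguished from document 1 for this query, which is the bag-of-words effect in miniature.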
2.2. Background to the hybrid indexing method
By design, the hybrid indexing method comprises three artefacts (Figure 2).

The three artefacts of the hybrid indexing method.
These artefacts are the IRS and a pair of indices: a token index (populated with text from documents) and a query index (populated with phrases from queries). When a search of the IRS (artefact 1) is triggered, the query index (artefact 2) interrogates the token index (artefact 3), and when complete, returns a result. The novel part of IRS-H is that it uses a pair of indices rather than a single index, and the one is used to interrogate the other. In a single index IRS, for example, IRS-I, the index is populated with a distinct list of tokens (words or chunks of text) together with the document numbers that the tokens originate from. The problem is that the order of the words and the proximity of these words with each other are lost. The first index of the hybrid index is the token index, which is populated with the tokens, the document numbers they originate from, and the position number where each token exists in each document. The second index, the query index, is populated with the words within the phrases expressed within the queries, with their ordinal positions. When the query index interrogates the token index, the words are matched and the word ordinality is checked using the positions from the token index. When the words and ordinal positions match, the document is returned. This novel indexing method is useful in many industries searching for distinct phrases, for example: ‘design science research’, ‘electronic health records’ and ‘qualitative method’.
What is the hybrid indexing method? The origins of the hybrid indexing method stem from postgraduate research in information systems [19], health informatics [20] and information retrieval [21]. The original problem to solve, using this method, was the phenomenon of mismatching vocabulary [22–25]. Vocabulary mismatch occurs when various synonymic phrases are used to describe the same topic, and these phrases often change over time [6]. This makes it more challenging to find relevant documents pertaining to a topic. In IRSs, vocabulary mismatch can occur when the indices mismatch a query to a document [5]. The hybrid indexing method is the design of an IRS that (1) gathers information from documents, (2) uses a search engine and (3) uses a specific indexing method, that stores the tokens and their positions to maintain word ordinality and proximity [26], to enable matching of phrases expressed in queries to those same phrases that occur in documents.
How can the hybrid indexing method be used? It can be used by anyone: researchers, academics, businesses, communities, scholars, parents and others. Its initial use was to find answers to the root cause and cure for a specific disease [27]. It later became appropriate, and then most useful, in postgraduate research for better retrieval of documents relevant to the research topic [20].
Why use the hybrid indexing method? It matches phrases expressed in a query to those same phrases that occur in a document. It effectively retrieves documents that contain a specific phrase irrespective of the number of words used within the phrase. It may be used in any language using the 26 letter Latin-based alphabet. When writing an LR, it is particularly useful for finding multiple phrases from a researcher’s closed collection of documents specifically for author referencing and page number referencing. When embarking on writing a new section within an LR, a simple search, using multiple phrases, will find those documents relevant to the topic – once found, the researcher, with confidence, can then focus on these documents to write up the new section. It has a high rejection quality of irrelevant documents from a collection. As the rejection quality is high, the number of irrelevant articles to be read by the researcher is reduced, thus saving precious research time.
3. The hybrid indexing method framework
Based on design science research and rigorous pilot testing [28–31], the framework of the hybrid indexing method [21] is illustrated in Figure 3.

The framework of the hybrid indexing method.
The framework comprises two sections: (1) relating to the researcher and (2) relating to the IRS.
3.1. The researcher
During the research process, a researcher will search for, and attempt to find, various research articles from the university library portals and via the World Wide Web:
1. Once found, these documents are typically downloaded onto the researcher’s personal computer (PC) and stored on the PC’s hard disk drive. These documents now form the closed document collection of the researcher, where N denotes the number of documents within the collection [32].
2. During the research process, and particularly during the LR stage of a thesis, the researcher becomes aware of various information needs to enable him or her to gather information content and to describe this content, in order to fulfil the requirements of the LR. The researcher’s information needs [33,34] are expressed as queries and presented to the IRS, which then returns the results – the list of documents found (referred to as system retrieved).
3. Once the documents are found, each is perused by the researcher, who then makes a judgement. These judgements are made for each document as to whether the document is relevant or not (non-relevant) to the information need [16]. The documents judged relevant to an information need for a specific research topic can then be used to compile an LR, or a section of other chapters within the thesis, where previous research needs to be reported on. Using a 2 × 2 contingency table, ‘retrieved relevant’ documents are referred to as true positive (tp) values, while ‘retrieved non-relevant’ documents are referred to as false positive (fp) values [35–37].
3.2. The IRS
The IRS comprises two processes as follows: (1) information gathering and (2) the search engine.
Information gathering is the process by which the IRS gathers information from the text of documents within a collection N and thereafter populates the hybrid token index. The steps are as follows:
4. The IRS acquires the content from the documents [38].
6. Tokenisation takes place [41,42], spaces are replaced with pipe delimiters [43], all special characters including hyphens [44] and inverted commas are removed, and all text is case folded to lowercase [45,46]. Stopping [47,48], stemming [49–51], suffix stripping [52] and other concepts are not used.
7. Transformation then takes place, where the tokens (words, codes or acronyms from the text), separated by the pipe delimiters [43], are extracted from the text, and together with the originating document number, are used to populate the hybrid token index. A unique token identity number is sequentially generated and allocated to each token as it is extracted. The token identity maintains the word ordinality and proximity between words.
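The information gathering steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the regular expression restricts tokens to lowercase Latin letters and digits, the token identity is taken to run sequentially across the whole collection, and the names `tokenise` and `build_token_index` are hypothetical.

```python
import re


def tokenise(text):
    """Case fold, remove special characters and delimit tokens with pipes."""
    folded = text.lower()
    # Hyphens and inverted commas are removed, not replaced with spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", "", folded)
    return "|".join(cleaned.split())


def build_token_index(docs):
    """Return rows of (token_id, doc_id, token) in reading order.

    Identical tokens are not de-duplicated; each occurrence receives its own
    sequential identity, which preserves word order and proximity.
    """
    rows = []
    next_id = 1
    for doc_id, text in docs.items():
        for token in tokenise(text).split("|"):
            if token:  # an empty document yields no tokens
                rows.append((next_id, doc_id, token))
                next_id += 1
    return rows


# Hypothetical two-document collection.
docs = {1: "Design-science 'research' matters.", 2: "Research design"}
token_index = build_token_index(docs)
for row in token_index:
    print(row)
```

No stopping or stemming is applied, in line with the steps above, so 'research' in document 1 and 'research' in document 2 are stored as two separate rows with different identities.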
The search engine: at this stage, the researcher provides the query, which is then used to populate the hybrid query index. The steps are as follows:
8. The phrases representing the information need are received from the researcher.
9. These phrases are then expressed within a query [53].
10. The phrases within the queries are presented to, and stored within, the hybrid query index [20].
11. At this stage, the hybrid query index interrogates the hybrid token index. Here, the attempted hit [35] or vocabulary match [54] is performed by matching the phrase within the query to the same phrase occurring at least once within the text of the document. Note that various measurement techniques suggested in the literature are not used, for example, inverse document frequency (idf), the product of term frequency (tf) and idf (tf*idf) [55–57], cosine similarity [10,58] and others. These techniques were omitted because the results needed to be non-weighted (based on pure ‘hits’, where tf = 1) and uninfluenced by these and other mathematical techniques.
12. If a match is found, the IRS judges the document to be relevant, and thus retrieves the document [59]. Using a 2 × 2 contingency table, IRS retrieved documents are referred to as tpfp: the sum of true positive (tp) and false positive (fp) values [36,53,60].
13. If a match is not found, the IRS judges the document to be non-relevant, and thus does not retrieve the document. IRS not-retrieved documents are referred to as fntn: the sum of false negative (fn) and true negative (tn) values [61].
These retrieved documents are then fed back to Step 3 where they are perused by the researcher, and the ‘retrieved relevant’ (tp) and ‘retrieved non-relevant’ (fp) judgements are made.
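The search engine steps above can be sketched as follows, assuming token index rows of the form (token_id, doc_id, token) with sequential identities, and assuming that an exact phrase match means the words occupy consecutive positions within one document. The helper names are hypothetical, and the expanded-query helper simply takes the union over phrases, as described for OR queries.

```python
def build_query_index(phrase):
    """Store each word of the phrase with its 1-based ordinal position."""
    return list(enumerate(phrase.lower().split(), start=1))


def matching_documents(token_rows, query_index):
    """Return ids of documents where the phrase occurs in order and adjacent."""
    if not query_index:
        return set()
    lookup = {tid: (doc_id, token) for tid, doc_id, token in token_rows}
    first_word = query_index[0][1]
    hits = set()
    for tid, (doc_id, token) in lookup.items():
        if token != first_word:
            continue
        # Every following word must sit at the next identity in the same document.
        if all(lookup.get(tid + ordinal - 1) == (doc_id, word)
               for ordinal, word in query_index):
            hits.add(doc_id)
    return hits


def matching_any_phrase(token_rows, phrases):
    """Expanded (OR) query: retrieve documents that match at least one phrase."""
    hits = set()
    for phrase in phrases:
        hits |= matching_documents(token_rows, build_query_index(phrase))
    return hits


# Hypothetical token index covering three short documents.
token_rows = [
    (1, 1, "design"), (2, 1, "science"), (3, 1, "research"),
    (4, 2, "science"), (5, 2, "of"), (6, 2, "design"),
    (7, 3, "science"),
]
print(matching_documents(token_rows, build_query_index("design science")))
```

Checking the document identity alongside the token identity prevents a false match across a document boundary, for example 'design' closing document 2 followed by 'science' opening document 3.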
4. Research hypotheses
To meet the three research objectives, the three null hypotheses are now presented as follows:
H10. Hybridised indexing does not increase the retrieval of relevant documents.
H20. Hybridised indexing does not reduce the incorrect identification of relevant documents.
H30. Hybridised indexing does not increase the quality of rejecting non-relevant documents.
5. Materials and methods
The research was based on an objectivist and positivist stance using the deductive approach. The research method was an explanatory cross-sectional study while the research strategy followed was experimental [62].
A collection of 100 documents (N = 100), comprising archived published and unpublished material covering numerous disciplines, was non-randomly selected from the Cape Peninsula University of Technology (CPUT). This document collection, made available for experimentation, encompassed journal articles, conference papers, dissertations and theses. A set of 75 phrases (pt) expressed as queries (q), relevant to various research topics, was randomly compiled from the literature for experimental use. As space to list all queries is limited, a few examples are provided for queries 1, 2, 61 and 67:
q01 = ‘design science’.
q02 = ‘design science research’.
q61 = ‘electronic health records’.
q67 = ‘qualitative method’ OR ‘qualitative analysis’ OR ‘qualitative research’ OR ‘qualitative research design’ OR ‘qualitative research method’ OR ‘qualitative research methods’ OR ‘qualitative research methodology’.
The goal of the expanded query q67, for example, is to search for and retrieve those documents that contain at least one of the phrases within the query. To evaluate an IRS, what is needed is a set of results generated by the IRS and a set of results generated by at least one human participant. Once complete, a comparison can be made to see how the IRS results compare with those made by the human. As two IRSs were being evaluated, three sets of results were needed: (1) the human participant result set, (2) the IRS-I result set and (3) the IRS-H result set. Therefore, in order to generate the necessary data, three experiments were performed: (1) participant generated data, (2) IRS-I generated data and (3) IRS-H generated data, illustrated in the experimental framework (Figure 4). The data were used to test the three null hypotheses as follows: H10, H20 and H30.

The experimental framework.
5.1. Experiment 1
To collect the participant data, a 10-page questionnaire was used to gather quantitative data during a 1-day experiment with five participants conducted at CPUT. Each page represented an information need with a set of questions in the form of phrases. Each participant randomly selected 20 of the 100 documents, and then answered ‘true’ or ‘false’ according to whether or not each of the 75 phrases (pt), expressed as queries, existed in any of their selected documents (d). Boolean data captured via the questionnaire were converted to binary, where ‘1’ represented ptd = ‘true’ and ‘0’ represented ptd = ‘false’. Data were initially stored in a phrase-by-document matrix, and after conversion, in a query-by-document matrix. Data generated for analysis purposes were the values of tpfn and fptn.
5.2. Experiment 2
For IRS-I, using the inverted indexing method, all 100 documents were presented to IRS-I and the information gathering process was run. Thereafter, the 75 phrases, expressed as queries, were presented to the search engine of IRS-I and processed. These processes generated quantitative data in the form of a term-by-document matrix using term frequency (tfd) values (the number of times a term occurs in a document). These values were converted to binary, where ‘1’ represented tfd > 0 and ‘0’ represented tfd = 0, and then stored in a query-by-document matrix. The generated outputs for data analysis purposes were the values of tpfp and fntn.
5.3. Experiment 3
For IRS-H, using the hybrid indexing method, the processes as used for Experiment 2 were repeated. All 100 documents were presented to IRS-H and the information gathering process was run. Thereafter, the 75 phrases, expressed as queries, were presented to the search engine and processed. These processes generated quantitative data in the form of a phrase-by-document matrix using phrase frequency (ptfd) values (the number of times a phrase occurs in a document). These values were converted to binary, where ‘1’ represented ptfd > 0 and ‘0’ represented ptfd = 0, and then stored in the query-by-document matrix. Data generated for analysis purposes were the values of tpfp and fntn.
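The conversion from frequency values to a binary query-by-document matrix can be sketched as follows. The article does not state how phrase rows are combined into query rows; the sketch assumes a logical OR over the phrases of an expanded query, and the frequency values and function names are invented for illustration.

```python
def binarise(freq_rows):
    """Convert frequencies to binary: 1 where the frequency is above zero."""
    return {key: [1 if f > 0 else 0 for f in row]
            for key, row in freq_rows.items()}


def collapse_to_queries(phrase_rows, query_phrases):
    """Build query rows by OR-ing the binary rows of each query's phrases."""
    query_rows = {}
    for query, phrases in query_phrases.items():
        columns = zip(*(phrase_rows[p] for p in phrases))
        query_rows[query] = [1 if any(col) else 0 for col in columns]
    return query_rows


# Hypothetical phrase frequencies over a four-document collection.
phrase_freq = {
    "qualitative method": [0, 2, 0, 0],
    "qualitative analysis": [1, 0, 0, 0],
    "design science": [0, 0, 3, 0],
}
query_phrases = {
    "q01": ["design science"],
    "q67": ["qualitative method", "qualitative analysis"],
}
query_rows = collapse_to_queries(binarise(phrase_freq), query_phrases)
print(query_rows)
```

Each resulting row holds one binary judgement per document, which is the form needed for comparison with the participant judgements.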
6. Data analysis
A 2 × 2 contingency table (Figure 5) was used to analyse the data in the format required.

A 2 × 2 contingency table.
Reading from left to right, the top four-block row represents the number of documents IRS-H judged relevant – tpfp (the sum of true positive (tp) and false positive (fp)). The bottom four-block row represents the number of documents IRS-H judged non-relevant – fntn (the sum of false negative (fn) and true negative (tn)). Reading from top to bottom, the first four-block column represents the number of documents a participant judged relevant – tpfn (the sum of tp and fn). The second four-block column represents the number of documents a participant judged non-relevant – fptn (the sum of fp and tn). The values of tp, fp, fn and tn were then derived from tpfp, fntn, tpfn and fptn, while N represents the total number of documents in the collection, N = 100. This process was repeated for IRS-I and its generation of data.
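As an illustration of how the four cells relate to the row and column totals, the following sketch counts tp, fp, fn and tn for a single query from two binary judgement vectors, one from the IRS and one from a participant. The per-document comparison and the example vectors are assumptions made for the sketch; the article reports only that the cells were derived from the totals.

```python
def contingency(system, participant):
    """Count the cells of a 2 x 2 contingency table for one query.

    Both arguments are equal-length lists of 0/1 judgements, one per document.
    """
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for s, p in zip(system, participant):
        if s == 1 and p == 1:
            cells["tp"] += 1   # retrieved and judged relevant
        elif s == 1 and p == 0:
            cells["fp"] += 1   # retrieved but judged non-relevant
        elif s == 0 and p == 1:
            cells["fn"] += 1   # not retrieved but judged relevant
        else:
            cells["tn"] += 1   # not retrieved and judged non-relevant
    return cells


# Hypothetical judgements over a five-document collection.
system = [1, 1, 0, 0, 1]
participant = [1, 0, 1, 0, 1]
cells = contingency(system, participant)

tpfp = cells["tp"] + cells["fp"]   # documents the IRS judged relevant
fntn = cells["fn"] + cells["tn"]   # documents the IRS judged non-relevant
tpfn = cells["tp"] + cells["fn"]   # documents the participant judged relevant
fptn = cells["fp"] + cells["tn"]   # documents the participant judged non-relevant
print(cells, tpfp, fntn, tpfn, fptn)
```

The row totals and the column totals each sum to N, the number of documents in the collection.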
From the literature, a number of IRS measurement formulae may be utilised. The common measures are Precision (the exactness of what is retrieved) and Recall (the extent to which relevant documents are returned). A third measure, rather uncommon in IRS evaluation, is Specificity (the quality of rejecting irrelevant documents). For each of the 75 phrases, each expressed within its own query, these performance measurements were used to calculate the values of Precision, Recall and Specificity.
Precision (P) is the ratio of the number of relevant documents retrieved (tp) to the number of documents retrieved (tp + fp) [10,40] and is presented in equation (1)
$$P = \frac{tp}{tp + fp} \qquad (1)$$
Recall (R) is the ratio of the number of relevant documents retrieved (tp) to the number of relevant documents in the collection (tp + fn) [10,63] and is presented in equation (2)
$$R = \frac{tp}{tp + fn} \qquad (2)$$
Specificity (S) is the ratio of the number of not-retrieved non-relevant documents (tn) to the number of non-relevant documents in the collection (fp + tn) [33,64] and is presented in equation (3)
$$S = \frac{tn}{fp + tn} \qquad (3)$$
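The three measures can be computed directly from the contingency cells, as in the following sketch. Returning 0.0 when a denominator is zero is an assumption made here (the article reports zero values for some queries but does not state its convention), and the cell counts in the example are hypothetical.

```python
def precision(tp, fp):
    """Relevant retrieved documents over all retrieved documents."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Relevant retrieved documents over all relevant documents."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Rejected non-relevant documents over all non-relevant documents."""
    return tn / (fp + tn) if (fp + tn) else 0.0


# Hypothetical cell counts for one query over a 100-document collection.
tp, fp, fn, tn = 6, 2, 4, 88
print(precision(tp, fp), recall(tp, fn), specificity(tn, fp))
```

For these counts, fewer false positives would raise both Precision and Specificity, whereas Recall depends only on tp and fn.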
Statistical analysis was performed to test the three null hypotheses. For the first hypothesis, Precision (P), Ranked Average Precision (AP) and Mean Average Precision (MAP) were used [65]. MAP can be expressed as the sum of the AP values for all queries divided by the number of queries as presented in equation (4) [63]
$$\mathrm{MAP} = \frac{1}{Q}\sum_{q=1}^{Q} AP_q \qquad (4)$$
where Q denotes the number of queries (Q = 75 in this study).
For the second hypothesis, Recall, Ranked Average Recall (AR) and Mean Average Recall (MAR) were used. MAR can be expressed as the sum of the AR values for all queries divided by the number of queries as presented in equation (5) [63]
$$\mathrm{MAR} = \frac{1}{Q}\sum_{q=1}^{Q} AR_q \qquad (5)$$
For the third hypothesis, Specificity (S), Ranked Average Specificity (AS) and Mean Average Specificity (MAS) were used [65]. MAS is defined as the sum of the AS values for all queries divided by the number of queries as presented in equation (6) [64,66]
$$\mathrm{MAS} = \frac{1}{Q}\sum_{q=1}^{Q} AS_q \qquad (6)$$
To test statistical significance, a one-tailed Student’s t-test was performed, and a suggested IRS statistical significance test [67] was performed, for all three hypotheses.
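A one-tailed, two-sample comparison of per-query averages can be sketched as follows. The sketch assumes a pooled-variance (equal-variance) Student's t-test, which is consistent with the degrees of freedom reported in the results (150 − 2 = 148) but is not stated explicitly in the article. The function name and the sample values are hypothetical, and the critical value is supplied by the caller rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Return (t, df) for mean(sample_b) - mean(sample_a), pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_b) - mean(sample_a)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


def exceeds_critical(t, t_critical):
    """Upper-tail decision: reject the null hypothesis when t > t_critical."""
    return t > t_critical


# Hypothetical per-query averages for two systems (not the study's data).
irs_i = [0.10, 0.20, 0.30, 0.25, 0.15]
irs_h = [0.20, 0.35, 0.40, 0.30, 0.25]
t, df = pooled_t(irs_i, irs_h)
print(round(t, 3), df)
```

With 75 queries per system, the same function gives df = 148; the resulting t-value would then be compared with the critical value for a one-tailed test at α = 0.05.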
7. Results
7.1. Hypothesis 1
7.1.1. H10: hybridised indexing does not increase the retrieval of relevant documents
To test the first hypothesis, the data generated by IRS-I and the participants were compared with those results generated by IRS-H and the participants, using the Precision formula. The objective was to test whether IRS-H increases the retrieval of those documents judged relevant by the IRS. As the query-by-document matrix is too large to present in this article (75 columns by 100 rows generating 7500 Precision value data cells), the results for AP are presented per query and per indexing method in Table 1. Note that values highlighted in bold indicate values where AP of IRS-H is greater than AP of IRS-I.
Results of AP measurements per query per indexing method.
AP: Average Precision; IRS: information retrieval system.
Note that for all queries where APIRS-H ≠ 0, APIRS-H was greater than APIRS-I. These results indicate that exact matching of words, which maintains the ordinality and proximity of the words within each phrase, yields a higher AP than IRS-I whenever the AP of IRS-H is not zero. Referring to the Precision formula (equation (1)), to increase P, either tp must increase or fp must decrease. Because of the exact matching capability of IRS-H, fp may decrease when fewer non-relevant documents are returned than would be returned by IRS-I, thus increasing P. For IRS-H, all the words in a phrase must exist in a document for it to be judged relevant, whereas for IRS-I, at least one word of the phrase suffices.
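Using these AP results, the MAP values for each method were calculated. The MAP result for all queries for IRS-I (MAPIRS-I) is
$$\mathrm{MAP}_{\text{IRS-I}} = \frac{1}{75}\sum_{q=1}^{75} AP_{q,\text{IRS-I}} = 0.2179$$
The MAP result for all queries for IRS-H (MAPIRS-H) is
$$\mathrm{MAP}_{\text{IRS-H}} = \frac{1}{75}\sum_{q=1}^{75} AP_{q,\text{IRS-H}} = 0.2780$$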
With the results of MAPIRS-I = 0.2179 and MAPIRS-H = 0.2780, the data suggest that MAPIRS-H is greater than MAPIRS-I, and therefore, IRS-H has a higher MAP compared with that of IRS-I. A Student’s one-tailed t-test was performed using a 95% confidence level with a significance level (α) of 5% to test whether these results could have occurred by chance. As there were two systems, each with 75 queries, the sample size was Nq = 150 and the degrees of freedom (df) was set as df = 150 − 2 = 148. The critical value (tcv) was 1.66 while the t-value (t) result was 1.815. The MAP t-distribution and the results for IRS-I compared with IRS-H are presented in Figure 6.

MAP t-distribution and results.
The results are statistically significant as p < α, where p (Type I error) = 0.0365 and α = 0.05. As t > tcv, where t = 1.815 and tcv = 1.66, the alternative hypothesis H11 is accepted and therefore: Hybridised indexing increases the retrieval of relevant documents.
7.2. Hypothesis 2
7.2.1. H20: hybridised indexing does not reduce the incorrect identification of relevant documents
To test the second hypothesis, the same set of generated data used to test Hypothesis 1 (generated by IRS-I and the participants and by IRS-H and the participants) was used together with the Recall formula. The objective was to test whether IRS-H reduced the incorrect identification of those documents judged relevant by the IRS. As the query-by-document matrix is too large to present in this article, the results for AR are presented per query and per indexing method in Table 2. Note that values highlighted in bold indicate values where AR of IRS-H is greater than AR of IRS-I.
Results of AR measurements per query per indexing method.
AR: Average Recall; IRS: information retrieval system.
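Similar to the AP values, for all queries, where ARIRS-H ≠ 0, ARIRS-H was greater than ARIRS-I. Using these AR results, the MAR values for each method were calculated. The MAR result for all queries for IRS-I (MARIRS-I) is
$$\mathrm{MAR}_{\text{IRS-I}} = \frac{1}{75}\sum_{q=1}^{75} AR_{q,\text{IRS-I}} = 0.5252$$
The MAR result for all queries for IRS-H (MARIRS-H) is
$$\mathrm{MAR}_{\text{IRS-H}} = \frac{1}{75}\sum_{q=1}^{75} AR_{q,\text{IRS-H}} = 0.3572$$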
With the results of MARIRS-I = 0.5252 and MARIRS-H = 0.3572, the data suggest that MARIRS-H is less than MARIRS-I, and therefore, IRS-H has a lower MAR compared with that of IRS-I. A second Student’s one-tailed t-test was performed using a 95% confidence level, a significance level (α) of 5% and df was set as 148. The critical value (tcv) was −1.66 while the t-value (t) result was −3.565. The MAR t-distribution and the results for IRS-I compared with IRS-H are presented in Figure 7.

MAR t-distribution and results.
The results are statistically significant as p < α, where p (Type I error) < 0.001 and α = 0.05. As t < tcv, where t = −3.565 and tcv = −1.66, the alternative hypothesis H21 is accepted, and therefore: Hybridised indexing reduces the incorrect identification of retrieved documents.
7.3. Hypothesis 3
7.3.1. H30: hybridised indexing does not increase the quality of rejecting non-relevant documents
To test the third hypothesis, the same set of generated data used to test Hypotheses 1 and 2 (generated by IRS-I and the participants and by IRS-H and the participants) was used together with the Specificity formula. The objective was to test whether IRS-H increased the quality of rejecting a document judged non-relevant by the IRS. Again, the query-by-document matrix is omitted as it is too large for this article; the results for AS are therefore presented per query and per indexing method in Table 3. Note that values highlighted in bold indicate values where AS of IRS-H is greater than AS of IRS-I.
Results of AS measurements per query per indexing method.
AS: Average Specificity; IRS: information retrieval system.
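For all queries, ASIRS-H was greater than ASIRS-I. Using these AS results, the MAS values for each method were calculated. The MAS result for all queries for IRS-I (MASIRS-I) is
$$\mathrm{MAS}_{\text{IRS-I}} = \frac{1}{75}\sum_{q=1}^{75} AS_{q,\text{IRS-I}} = 0.5827$$
The MAS result for all queries for IRS-H (MASIRS-H) is
$$\mathrm{MAS}_{\text{IRS-H}} = \frac{1}{75}\sum_{q=1}^{75} AS_{q,\text{IRS-H}} = 0.9727$$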
As MASIRS-I = 0.5827 and MASIRS-H = 0.9727, the data suggest that MASIRS-H is greater than MASIRS-I, and therefore, IRS-H has a higher MAS compared with that of IRS-I. A third Student’s one-tailed t-test was performed using a 95% confidence level, a significance level (α) of 5% and df was set as 148. The critical value (tcv) was 1.66 while the t-value (t) result was 20.468. The MAS t-distribution and the results for IRS-I compared with IRS-H are presented in Figure 8.

MAS t-distribution and results.
The results are statistically significant as p < α, where p (Type I error) < 0.001 and α = 0.05. As t > tcv, where t = 20.468 and tcv = 1.66, the alternative hypothesis H31 is accepted, and therefore: Hybridised indexing increases the quality of rejecting non-relevant documents.
At this point of the research, with the results presented and the three hypotheses H1, H2 and H3 having been tested, the three alternative hypotheses were accepted. The three tests indicated, statistically, that the hybrid indexing method worked, and in each test, IRS-H outperformed IRS-I. In summary, these findings were as follows:
IRS-H increased the retrieval of relevant documents compared with IRS-I,
IRS-H reduced the incorrect identification of relevant documents compared with IRS-I and
IRS-H increased the quality of rejecting non-relevant documents compared with IRS-I.
8. Discussion
8.1. The framework
This article briefly introduced the inverted indexing method, introduced and explained the key design concepts of the hybrid indexing method, and compared the results generated by IRS-H with those generated by IRS-I. The results indicate that the hybrid indexing method worked, and all three tests were found to be statistically significant. Referring back to the framework (Figure 3), the key design concepts and attributes of the hybrid indexing method are as follows:
(1) The hybrid indexing method uses a pair of indices comprising a token index and a query index, where the query index interrogates the token index and returns a result (Step 11). This enables phrases, containing one or multiple words, to be matched exactly with those occurring in the documents.
(2) Document text is case folded to lowercase with special characters removed (Step 5). This allows matching of words.
(3) Identical tokens must not be de-duplicated, as each token is allocated a unique sequentially generated token identity (Step 6). Each token has its position in the text populated within the index.
(4) By design, the use of vectors, stopping, stemming, suffix stripping and other concepts is unnecessary (Steps 5, 6, 12 and 13). Because of the nature of exact matching (hits), there is no need to try to influence the results mathematically.
(5) Queries can be expressed using single or multiple word phrases (Step 9). This eliminates the bag-of-words phenomenon.
(6) Although phrases are treated independently, words within phrases are treated as a whole (Step 8). Whole words must be expressed in queries as used within the text.
(7) Word ordinality and word proximity are maintained (Step 11). The token positions, using the unique token identifier in the token index, provide the ordinality and proximity logic.
(8) When a search is triggered, a query attempts to match all words in a phrase exactly to a string of text in a document, thus maintaining word order and proximity (Step 12). If an exact match occurs, the document number is returned, as it is judged relevant by the IRS.
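The design concepts listed above can be illustrated with a minimal Python sketch. It is a simplified approximation under stated assumptions and not the implementation used in the experiments: the query index is reduced to a tokenised phrase, special characters are replaced by spaces rather than deleted, and all function names and sample documents are hypothetical.

```python
import re


def tokenise(text):
    """Case fold to lowercase and strip special characters (assumed: replaced by spaces)."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def build_token_index(documents):
    """Build a token index as a list of (token_id, doc_id, token) entries.

    Tokens are not de-duplicated, stopped or stemmed; each occurrence receives
    its own sequential token identity, which records its position in the text.
    """
    index = []
    token_id = 0
    for doc_id, text in documents.items():
        for token in tokenise(text):
            token_id += 1
            index.append((token_id, doc_id, token))
    return index


def search_phrase(token_index, phrase):
    """Return the documents in which the phrase occurs exactly, in order and adjacent."""
    words = tokenise(phrase)
    hits = []
    if not words:
        return hits
    n = len(words)
    for start in range(len(token_index) - n + 1):
        window = token_index[start:start + n]
        doc_id = window[0][1]
        # Consecutive token ids within one document preserve order and proximity.
        same_document = all(entry[1] == doc_id for entry in window)
        if same_document and [entry[2] for entry in window] == words:
            if doc_id not in hits:
                hits.append(doc_id)
    return hits


# Hypothetical two-document collection.
documents = {
    1: "The Hybrid Indexing method matches exact phrases.",
    2: "Indexing hybrid documents is a different method.",
}
token_index = build_token_index(documents)
phrase_hits = search_phrase(token_index, "hybrid indexing")
word_hits = search_phrase(token_index, "method")
```

In this example both documents contain the words "hybrid" and "indexing", but only document 1 contains them adjacent and in that order, so a bag-of-words match and an exact phrase match give different results.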
8.2. The impact on researchers
The impact on researchers and others using the hybrid indexing method is significant. Researchers will now be able to use phrases containing numerous words and acronyms, rather than keywords, to search for documents. If a document is retrieved by IRS-H, the researcher will be confident, through the increase in retrieval (Precision), shown by the increase in MAP, that the phrase used will exist somewhere in the document. This is particularly useful when searching for documents to write or update an LR with content and for referencing authors and page numbers.
Note that, when searching for documents, a researcher does not want the IRS to return irrelevant documents. Thus, documents must be filtered so that those returned are of some significance. Again, if a document is retrieved by IRS-H, the researcher will be confident, through reduced Recall, shown by the decrease in MAR, that the phrase will exist somewhere in the document.
To support Recall, the IRS rejection quality of irrelevant documents is critical. Therefore, Specificity is an important measurement in IRS evaluation. When searching Web document collections, the number of documents in the collection (N) is often unknown, and therefore, Specificity cannot be calculated, but in closed collections, Specificity can be calculated. As IRS-H increased Specificity (shown by increasing MAS), IRS-H increased the rejection quality of irrelevant documents. This reduced the number of irrelevant documents retrieved, saving the researcher time that would otherwise be spent perusing them. This is a key time-saving attribute of the hybrid indexing method for researchers and LR authors.
8.3. Theoretical, methodological and practical contributions
The hybrid indexing method makes theoretical, methodological and practical contributions to knowledge. The theoretical contributions are the design of the hybrid indexing method and the functionality of the hybrid token index and the hybrid query index. The methodological contributions provide increased retrieval of research documents by increasing MAP, reducing MAR and increasing MAS.
This method has many practical applications in various disciplines, for example: (1) postgraduate research and thesis writing; (2) in healthcare, searching for multiple word disease descriptions and procedural codes; (3) digitising a university library, thus eliminating the use of keywords; (4) in the legal profession, searching for various Acts; (5) in the motor industry, searching for specific makes and models of motor vehicles; and (6) data and text analysis during information systems implementation. Finally, the hybrid indexing method may be used in any language using the 26-letter Latin-based alphabet.
In future research, the hybrid indexing method should be further tested in larger closed document collections, for example, a small community library. The indexing size and performance could then be monitored and the speed of the various searches quantified. Depending on these results, design improvements to the index could then be applied and performance tested.
9. Conclusion
The research problem was addressed by: (1) increasing the retrieval of relevant documents (Precision), (2) reducing incorrect identification of relevant documents (Recall) and (3) increasing the quality of rejecting irrelevant documents (Specificity).
The research found that the hybrid indexing method using a small closed document collection of a hundred documents produced the required outputs and that it may be used as an alternative IRS indexing method to perform exact phrase matching in business and research. Further research is required to test the hybrid indexing method using larger document collections and to compare the results with other indexing methods.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
