Abstract
Semantic analysis is currently used in different fields, such as information retrieval, the biomedical domain, and natural language processing. The primary focus of this research work is on using semantic methods, the cosine similarity algorithm, and fuzzy logic to improve the matching of documents. The algorithms were applied to plain texts, in this case CVs (resumes) and job descriptions. WordNet Synsets were used to enrich the matching through semantic similarity methods such as the Wu-Palmer Similarity (WUP), Leacock-Chodorow similarity (LCH), and path similarity (hypernym/hyponym). Additionally, keyword extraction was used to create a postings list in which the keywords were weighted. This research work addresses the task of recruiting new personnel for companies that publish job descriptions and, reciprocally, of finding a company for workers who publish their resumes. A new gold standard was required to compare the proposed methods, so a web application was designed to match the documents manually. Finally, the results were compared with the new gold standard to check the efficiency of the proposed methods; the comparison confirmed the benefits of enriching the cosine algorithm semantically. The measures used for the analysis were precision, recall, and f-measure, and they indicate that the semantically weighted cosine similarity can be used to obtain better similarity scores.
Introduction
Semantic similarity has been studied in recent decades [14], and it has been used successfully in other fields such as the Internet of Things [20], Cheminformatics [6], and Biomedical Ontologies [23], achieving improvements in each of them.
Semantic similarity is a metric between two objects, statements, concepts, terms, or classes, based on co-occurrences, taxonomy, or correspondence of their meanings [11, 21]. However, it is not always easy to identify the correct semantic similarity score between words, as some words have low scores even if they are in the same domain. For example, “java” and “c” are in the same domain (programming). Nonetheless, their semantic similarity scores range from 0.06 to 0.25 for path similarity and from 0.125 to 0.8 for WUP similarity, where 1 is the highest similarity value and 0 is the lowest. Consequently, this research work proposes a method to obtain the highest similarity score between documents.
This research work focuses on improving the matching between plain texts and on increasing the scores produced by algorithms such as the cosine similarity [28] and fuzzy logic [1] by using semantic methods that add different types of information. The documents were tagged with POS (part of speech) tags and Synsets. The keywords were extracted using the method presented in [10]. Semantic similarity methods such as Wu-Palmer Similarity (WUP) [29], Leacock-Chodorow similarity (LCH) [12], and path similarity (hypernym/hyponym) were used to weight the keywords. Job descriptions and CVs were chosen to measure the efficiency of the algorithms proposed in this work because the rapid growth in the number of job seekers makes the task of selecting the best candidate for a job difficult. Additionally, methods such as TF-IDF [25], an inverted index [30], and lemmatization were used to address these problems. A document (CV) is used as a query to retrieve a set of documents (job descriptions) of the same domain. The result is a list of the documents with the highest similarity scores.
Additionally, semantic similarity values can be used to predict the relationship between two texts. The relationship can be computed with methods that operate on sentences, phrases, paragraphs, or words [4, 21]. Fuzzy logic was implemented as another way to deal with the matching problem faced by recruiters and job seekers.
Fuzzy logic is ideal when classical logic is not appropriate. Moreover, it can help to handle aspects of uncertainty [1]. It was implemented as a matrix (2D list) using the values obtained from the WUP similarity method. Fuzzy logic is useful for making decisions over a corpus using a membership function. The creation of a fuzzy matrix was necessary to make decisions when matching documents and to allow a comparison with the previous methods.
Furthermore, the cosine similarity is a recurring measure in recent research works, including work with neural networks [15]. The reason is that cosine similarity has become a standard measure among researchers because it is practical to apply and gives useful results on the relatedness of two documents [27].
This work was motivated by recent research such as [15], which used neural networks and modified the process with the cosine similarity; [21], which used similarity between words; and [16, 19], which used a fuzzy approach. These research works modified algorithms or applied different perspectives and obtained improvements. Thus, their important characteristics were extracted and applied to the experiments discussed here. Moreover, associating semantic similarity with cosine similarity and fuzzy logic was necessary to create two new strategies. The first strategy modifies the cosine similarity equation by adding the semantic similarity. The second strategy creates a fuzzy matrix to make decisions. Therefore, the key contributions of this research work are two strategies for matching documents of the same domain and a new gold standard. The first strategy uses the cosine similarity and adds a weighting based on the semantic comparison of the keywords. The second strategy uses fuzzy logic and creates a membership function from the semantic similarity results of the keywords. The new gold standard was created to evaluate the efficiency of all the methods used in this research work. The proposed strategies are explained in detail in the next sections.
The organisation of this document is as follows. The following section reviews related work on different methods of semantic similarity, evaluation measures, and applications. Section 3 describes the dataset used for the experiments. Section 4 describes the pre-processing of the data. Section 5 describes the measures used and the proposed strategy with the methods employed in the experiments. Section 6 describes in detail all the steps of the experiments proposed in this research work. Section 7 provides an evaluation and comparison of the results obtained. Finally, the last section concludes this research work and outlines directions for future work.
Related work
Our research aims to obtain high similarity values from plain texts (CVs and job descriptions). To that end, we studied previous research works focusing on cosine similarity, fuzzy logic, semantic similarity, comparison values, and corpus creation.
Cosine similarity
Modifying algorithms with the cosine similarity can be beneficial, as the authors of [15] demonstrated. They used neural networks and convolutional networks and modified the process by replacing the dot product with the cosine similarity. Thus, [15] demonstrated the benefit of applying the cosine similarity in different algorithms. The research work [9] explains the cosine distance, the BLEU score, and discriminant approaches. Moreover, [9] describes the similarity between the cosine distance and the BLEU score and proposes an extension to BLEU, SVD-LSI, and the cosine distance. The cosine distance had a classification accuracy of 80%.
Fuzzy logic
Consensus or Trade-Off (CoTo) [16] used a fuzzy operator for the aggregation of semantic similarity values to deal with the non-stochastic uncertainty inherent in human language. Moreover, CoTo uses the concept of fuzzy set membership, which is a direction followed in our experiments. Another direction for implementing fuzzy logic is taken by the authors of [24], who use temporal pattern trends to predict decisions. The authors of [19], in turn, used a fuzzy logic score to summarise documents and obtained a 95% confidence level, demonstrating its efficiency in extracting the essential meaning of a document.
Semantic similarity programs
There are tools that ease the computation of semantic similarity between two texts, such as SEMILAR, which is written in Java. SEMILAR provides unidirectional and bidirectional algorithms, such as the entailment relation (unidirectional) and the paraphrase relation (bidirectional), as well as lexical overlap, KL distance, QAP (Quadratic Assignment Problem), Koopmans-Beckmann, and the unsupervised method LDA (Latent Dirichlet Allocation); these algorithms were used as a baseline for this research paper. In [26], the best accuracy obtained for paraphrase identification with the SEMILAR semantic similarity toolkit was 77.6%. Thus, the SEMILAR toolkit was applied to the job descriptions and CVs. [7] analysed twelve other semantic similarity tools connected to WordNet and reported features such as the availability of similarity metrics and support for editing.
Previous works with semantic similarity
There are many methods to calculate the semantic similarity between words, sentences, and texts. [21] proposed three methods for this task: word similarity, sentence similarity, and word order similarity, using a lexical database for the English language. Moreover, they explain why it is necessary to tokenise words and eliminate stop words before computing sentence-level semantic similarity, as this enriches the data and provides better results. Others utilise additional information from the skip-gram model (n-grams) [3], which takes subword information into account to enrich the text [2]. These perspectives were taken into account for this research paper. The semantic similarity measures proposed by [18] comprise corpus-based and knowledge-based perspectives; the authors acknowledged the importance of removing stop words, part-of-speech tagging, and longest subsequence matching. Their approach reduced the error rate by 13% with respect to the vector-based similarity metric, some variants of which are explained in [9]. Another important work is [13], as it shows a way of using techniques to enrich text. In that case, ontologies were created, and the method approached human ratings significantly. Moreover, they used WordNet for the experiments, which is another reason to use WordNet and human ratings.
Semantic similarity evaluation
A reasonable number of resources for semantic similarity evaluation are available; the most popular are provided by WordNet and MeSH [31]. A novel approach is information retrieval by semantic similarity, which its authors [8] call the semantic similarity based retrieval model (SSRM); it is used for similar terms that may or may not be lexically similar. One method consists of mapping terms to the ontology and analysing their similarities. Moreover, the authors mention the importance of the human judgement of similarity. Another significant contribution is the comparison between the use of WordNet and MeSH, in which WordNet gives 9% better results.
Dataset
In this research work, we used two different domains and one new gold standard. The first domain consists of job descriptions extracted from jobs.ie (https://www.jobs.ie/), a website for finding employment in Ireland. The website had 46 different categories and 6,917 job descriptions at the time of writing this paper. Each job description file contains information such as skills needed, salaries, and area of work. All the documents were obtained in HTML and JSON format, with the information updated each week, and the HTML tags were cleaned. The category used is IT (Information Technology), which contains 238 job descriptions; the average file size is 3 kilobytes. The IT category was chosen because verifying the efficiency of the results required comparing it with another corpus (the second domain).
The second domain consists of CVs provided by the LKE (Language & Knowledge Engineering) community. There are eighty CV files, and the average file size is 8 kilobytes. Most of them are related to IT: 56 are related to IT, and 24 have some similarity with IT and were included to check the results obtained. Each CV file contains information such as skills, abilities, knowledge, and personal information. CVs were chosen because we have experts in this field who can manually map CVs to job descriptions to validate the results. The mapping is done with the help of a web application developed for this research work. Future work intends to obtain expert opinions from other areas.
The new gold standard was created using the web application developed for this research work, shown in Fig. 1. The web application was created for two purposes. The first purpose is to support users in the task of mapping CVs to job descriptions; the web application makes it possible to map all the combinations between job descriptions and CVs. The second purpose is to help in the identification of keywords by suggesting possible mappings. The suggestion is computed using cosine similarity. Fig. 1 shows the web application, which consists of five essential components marked with the numbers (1), (2), (3), (4), and (5) respectively.

Mapping CVs with Job Descriptions.
Component (1) suggests matches with their cosine similarity score and has two functions. The first is a checkbox; if it is clicked, the suggested file is moved to component (2). The second, in the third column and marked with the number (4), displays the job description. Component (2) lists the mapped documents with their respective cosine similarity scores. The user then validates the similarity between the documents with the help of components (3) and (4); this validation is recorded in component (2).
Fig. 2 shows the process used to obtain the results of semantic mapping. There are six different components marked with the symbols (i), (ii), (iii), (iv), (v), and (vi) respectively. Component (i) refers to the job descriptions that were extracted from the website (jobs.ie). Components (ii) and (iii) refer to the different domains that can be processed, job descriptions and CVs in this case. Component (iv) refers to the keyword extraction. The extracted keywords were stored in a relational database named postings list, which receives and sends this information to the different algorithms implemented. Component (v) refers to the algorithms used, such as Jaccard distance (baseline), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation, word-to-word similarity, cosine similarity weighted semantically, and fuzzy matrix. Finally, component (vi) stores all the results to be evaluated with the new gold standard.

Procedure Diagram.
This section presents the series of steps used for pre-processing.
Section 3 explains the dataset used for this research work. The dataset was in HTML and JSON format, so irrelevant markup such as HTML and JavaScript tags was removed from the corpus. The files were named with the prefix ’job’ and a consecutive number for each job description, i.e. job1, job2,..., jobn. Punctuation signs and symbols such as @, ", ’, *, ?, ˜, among others, which recruiters usually use when writing job descriptions, were removed. All letters were converted to lowercase. NLTK [22] was used to tokenise the whole corpus and tag it with its POS functions. NLTK uses the words before and after each word to differentiate between nouns and verbs; for example, “support” could be a noun or a verb. Possible combinations with punctuation marks such as “.”, “,” and “;” were also removed.
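The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pipeline used in this work: the function name `preprocess` and the regular expressions are assumptions, and the POS tagging step performed with NLTK is deliberately left out.

```python
import re

def preprocess(raw_html):
    """Strip markup, lowercase, drop punctuation and symbols, and split into tokens."""
    # Remove <script>/<style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    # Remove any remaining tags.
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.lower()
    # Replace everything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

tokens = preprocess("<p>Java & SQL, support!</p><script>var x = 1;</script>")
```

Whitespace splitting stands in for a real tokeniser here; a tagger would still be needed to decide whether a token such as "support" is a noun or a verb.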
Measures
The measures used for this research work are the cosine similarity score, inverted index, TF-IDF and semantic similarity of WordNet using NLTK.
The cosine similarity score is a metric for comparing vectors; the query is the first vector and each document from the second domain is the second vector. In this case, the CVs and job descriptions were represented as vectors; equation 1 gives the representation.
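For reference, the standard cosine similarity between a query vector q and a document vector d is their dot product divided by the product of their Euclidean norms, $\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$. A minimal sketch follows, assuming that each document is stored as a dictionary from keyword to weight; the function name and this sparse representation are illustrative choices.

```python
import math

def cosine_similarity(query, document):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * document.get(term, 0.0) for term, weight in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_q * norm_d)

score = cosine_similarity({"java": 1, "sql": 1}, {"java": 1, "python": 1})
```

Two documents with no keyword in common score 0 regardless of how related their keywords are, which is the limitation the semantic weighting is meant to address.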
In this research work, it was essential to decide which job descriptions and which CVs had to be compared, because comparing all CVs with all job descriptions would require a high processing cost. Thus, keywords were extracted, and only the job descriptions and CVs that contained common keywords were compared.
The keywords were obtained with the method presented in [10]. The algorithm intersects collocation measures and then applies a filter of unique POS (part of speech) tags.
The intersection consists of extracting the highest-ranked values of three different collocation measures: PMI, likelihood, and chi-square.
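The intersection step can be illustrated as follows. This is a sketch under stated assumptions rather than the method of [10]: the scores are taken as already computed, the cut-off `k` and the function names are hypothetical, the toy scores are invented for the example, and the POS filter is not reproduced.

```python
def top_k(scores, k):
    """Return the k highest-scoring candidates of one collocation measure."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {candidate for candidate, _ in ranked[:k]}

def intersect_top_collocations(pmi, likelihood, chi_square, k):
    """Keep only candidates ranked in the top k by all three measures."""
    return top_k(pmi, k) & top_k(likelihood, k) & top_k(chi_square, k)

# Toy scores, invented for illustration only.
pmi = {"web developer": 5.0, "sql server": 4.0, "team player": 1.0}
likelihood = {"web developer": 30.0, "sql server": 10.0, "team player": 20.0}
chi_square = {"web developer": 9.0, "sql server": 8.0, "team player": 2.0}
keywords = intersect_top_collocations(pmi, likelihood, chi_square, k=2)
```

Requiring agreement between the three measures trades coverage for reliability: a candidate favoured by only one measure is dropped.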
TF-IDF (Term Frequency – Inverse Document Frequency) is an algorithm that assigns a weight to a word according to its importance in a set of documents (corpus).
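As an illustration, one common TF-IDF variant multiplies the raw term count by log(N / df), where N is the number of documents and df is the number of documents containing the term. The exact variant used in this work is not stated, so the weighting below, like the function name, is an assumption.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: {doc_id: [tokens]}; returns {doc_id: {term: weight}}."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))
    weights = {}
    for doc_id, tokens in corpus.items():
        term_freq = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights

weights = tf_idf({"job1": ["java", "sql"], "job2": ["java", "java"]})
```

In this toy corpus "java" occurs in every document, so its weight is 0, while "sql" keeps a positive weight because it distinguishes job1 from job2.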
A random sample of the postings list is given in Table 2. TF-IDF [25] was used to obtain the postings list and to clean the stop words. The postings list records the occurrences of each word in each file and relates the word to the other files that contain the same word. In other words, the implementation of the cosine similarity score uses a dictionary of keywords extracted from the whole dataset and counts the frequency of each word in each file. As an example, compare the words administration and administrator: the keyword administration appears 61 times between job description 6 and job description 233, and the keyword administrator appears 22 times between job description 10 and job description 198.
An inverted index indexes the information or database by content, mapping each term to the documents that contain it. The keywords were therefore indexed so that the postings list could be searched directly.
The Wu-Palmer measure is explained in [17]; it is defined over two concepts, but for this research work, words were used instead of concepts. The words are denoted as $(w_q, w_d)$, depending on the postings list. The score is in the range (0, 1]. Following the definition in [17], the measure (equation 2) is

$$Sim_{wp}(w_q, w_d) = \frac{2 \cdot dep(lso(w_q, w_d))}{l(w_q, w_d) + 2 \cdot dep(lso(w_q, w_d))}$$

where $l(w_q, w_d)$ refers to the length of the shortest path from word $w_q$ (synset) to word $w_d$ (synset), and $lso(w_q, w_d)$ refers to the lowest common subsumer of $w_q$ and $w_d$. Finally, $dep$ refers to the depth of the path from the global root entity to the result of $lso(w_q, w_d)$ (synset).
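A small worked example may help with the quantities defined above. It assumes the usual Wu-Palmer form, in which the score is twice the depth of the lowest common subsumer divided by the shortest path length plus twice that depth. The function name is hypothetical, and the path length and depth in the usage line are illustrative values, not read from WordNet.

```python
def wu_palmer(path_length, lso_depth):
    """Wu-Palmer score from the shortest path length l and the depth of the lso."""
    if lso_depth <= 0 or path_length < 0:
        raise ValueError("lso_depth must be positive and path_length non-negative")
    return 2 * lso_depth / (path_length + 2 * lso_depth)

# Illustrative values: a path of length 3 under a subsumer of depth 6.
score = wu_palmer(path_length=3, lso_depth=6)
```

With a path length of 0 (identical synsets) the score is 1, and it falls towards 0 as the path grows relative to the depth of the common subsumer, which is consistent with the range (0, 1].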
The proposed strategy consists of enriching the cosine score with an additional technique and testing its efficiency with the new corpus created. The new strategy is expressed in equation 3, where the cosine score uses the TF-IDF weight multiplied by the semantic similarity, taking the documents as vectors and returning their dot product. The term $Sim_{wp}(w_q, w_d)$ is distinctive because it is computed on the extracted keywords.
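One possible reading of this strategy is sketched below. It is not the equation of this work, only an assumption-laden illustration: each query keyword is paired with the document keyword that maximises weight times similarity, the products are summed, and the sum is normalised by the two vector norms. Weights are assumed to be non-negative, and `toy_sim` is a hypothetical stand-in for a WordNet-based measure that hard-codes only the WUP value of 0.8 reported for “java” and “c”.

```python
import math

def semantic_cosine(query, document, sim):
    """query, document: {keyword: TF-IDF weight}; sim(a, b) returns a value in [0, 1]."""
    dot = 0.0
    for q_word, q_weight in query.items():
        # Best semantically weighted partner of q_word among the document keywords.
        best = max(
            (d_weight * sim(q_word, d_word) for d_word, d_weight in document.items()),
            default=0.0,
        )
        dot += q_weight * best
    norm = math.sqrt(sum(w * w for w in query.values())) * math.sqrt(
        sum(w * w for w in document.values())
    )
    return dot / norm if norm else 0.0

def toy_sim(a, b):
    """Hypothetical similarity: 1 for identical words, 0.8 for java/c, else 0."""
    if a == b:
        return 1.0
    return 0.8 if {a, b} == {"java", "c"} else 0.0

score = semantic_cosine({"java": 1.0}, {"c": 1.0}, toy_sim)
```

With an exact-match similarity the function reduces to the ordinary cosine score, so the semantic term only adds credit for related but non-identical keywords. In the usage line the plain cosine would be 0, whereas the weighted score is 0.8.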
Lemmatisation helps to determine whether a word is a noun, but with Synsets (sets of synonyms), a single word can have several different meanings. For example, Fig. 3 shows the possible meanings of each word represented as a vector.

Synsets Vectors of “java” and “c”.
This example has six possible results because, after applying the filters for POS (part of speech) and identical words, two options remain in the array “java” and three options in the array “c”. Fig. 3 shows the deleted options with a strikethrough. The first option in the array “java” is “java.n.01”, which refers to an island; this meaning is not useful for this purpose. The second option is “java.n.03”, which refers to a programming language. In the same way, the options for “c” are “c.n.09”, which refers to a unit of electrical charge; “c.n.10”, which refers to a programming language; and “c.n.11”, which refers to the keynote of the scale of C major. For this example, the highest score is between “java.n.03” and “c.n.10”, with 0.25 for path similarity, 0.8 for WUP similarity, and 2.251 for LCH similarity. Table 1 shows the scores of the words “java” and “c” using algorithms such as path similarity, WUP similarity, and LCH similarity.
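The selection of the best sense pair can be sketched as a maximum over the Cartesian product of the remaining senses. The helper name and the score lookup below are assumptions: only the value 0.8 for (“java.n.03”, “c.n.10”) comes from the text, and the 0.125 assigned to every other pair is a placeholder rather than a measured score.

```python
from itertools import product

def best_sense_pair(senses_a, senses_b, sim):
    """Return the pair of senses with the highest similarity score."""
    return max(product(senses_a, senses_b), key=lambda pair: sim(*pair))

java_senses = ["java.n.01", "java.n.03"]        # senses left after filtering
c_senses = ["c.n.09", "c.n.10", "c.n.11"]

def lookup_wup(sense_a, sense_b):
    """Placeholder scores; only the java.n.03 / c.n.10 value is from the text."""
    return 0.8 if (sense_a, sense_b) == ("java.n.03", "c.n.10") else 0.125

pairs = list(product(java_senses, c_senses))    # the six candidate combinations
best = best_sense_pair(java_senses, c_senses, lookup_wup)
```

Taking the maximum amounts to assuming that both words are used in their closest senses, which is reasonable when both documents come from the same domain.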
Similarity Results for “java” and “c”
Fig. 4 shows an example of the proposed strategy for a company that wants to hire a new employee. When the job description is the query document, the documents retrieved are the CVs, and the result is the top N best candidates. Conversely, when the CV is the query document, the top N are the best companies for the job seeker. The results were compared with the new gold standard. The measures used to verify the efficiency of the proposed strategy are precision, recall, and f-measure. The results section describes in detail the different values of N and the reason for using different values in the charts.

Parameters of the Equation and Results.
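The evaluation of a top N ranking against the gold standard can be sketched as follows. The function name and the document identifiers are illustrative, and the convention of returning 0 when a query has no match in the gold standard is an assumption made here.

```python
def evaluate_top_n(ranked, relevant, n):
    """Precision, recall, and f-measure of the first n ranked documents."""
    retrieved = ranked[:n]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

ranked = ["job3", "job7", "job1", "job9", "job2"]   # best match first
gold = {"job7", "job2", "job5"}                     # manual matches for one CV
precision, recall, f_measure = evaluate_top_n(ranked, gold, n=5)
```

Increasing N can only keep or raise recall, because more of the gold-standard matches may be retrieved, while precision tends to fall as less similar documents enter the ranking.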
This section explains in detail the methods proposed. Fig. 7 shows the requirements and procedures of the methods explained in the following subsections, which also apply to the SEMILAR toolkit. Number (1) represents the corpus of domain X (job descriptions); it is also the dataset from which all the keywords were extracted to create the postings list and the text tagged with Synsets. Number (2) represents the query (CV), the document of domain Y, i.e. the document to match against the corpus in number (1). Number (3) represents the server where the information is processed and the algorithms are executed. Number (4) represents the result matrix (Fig. 6). The algorithms used in this research work produced eight subsets; each subset is a matrix of all job descriptions against all CVs and contains 19,040 results (238 job descriptions × 80 CVs), so the eight subsets contain 152,320 results in total. The subsets were saved as CSV files so that they could be read by columns. Finally, in number (5), the subsets obtained by the algorithms are compared with the gold standard using measures such as precision, recall, and f-measure.
Cosine weighted semantically
This method applies the equations explained in the measures of section 5. The structure consists of six steps, described as follows. First, keyword extraction is essential for this research work; therefore, the intersection of three different measures (likelihood, chi-square, and PMI) was implemented [10]. Fig. 5 represents the keywords extracted from the documents: the left side represents the documents from which the keywords are extracted, and the right side represents the vector or list created.

Keywords Extraction.
Second, TF-IDF requires the number of occurrences of each word in each document. The information retrieval method (inverted index) was used to search for each keyword in each document and count its occurrences. The inverted index used the list in Fig. 5 to search for each keyword in all the documents.
Third, the postings list is created from the keywords. The postings list is the vector or list that contains all the keywords, and each keyword refers to its number of occurrences in each document. For example, the keyword java has one occurrence in job description 30, one occurrence in job description 31, two occurrences in job description 50, and so on. The representation is java [’job30’, 1, ’job31’, 1, ’job50’, 2,... ’job n’, m], where n denotes the n-th job description and m denotes the number of occurrences in that document. A random sample of the postings list is given in Table 2.
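The postings list described in this step can be sketched with a dictionary of dictionaries instead of the flat list shown above. The function names and the helper that selects candidate documents sharing at least one keyword with the query are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_postings(documents, keywords):
    """documents: {doc_id: [tokens]}; returns {keyword: {doc_id: occurrences}}."""
    keywords = set(keywords)
    postings = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for term, count in Counter(tokens).items():
            if term in keywords:
                postings[term][doc_id] = count
    return dict(postings)

def candidates(postings, query_keywords):
    """Documents that share at least one keyword with the query."""
    found = set()
    for keyword in query_keywords:
        found.update(postings.get(keyword, {}))
    return found

documents = {
    "job30": ["java", "sql"],
    "job31": ["java"],
    "job50": ["java", "java", "python"],
}
postings = build_postings(documents, keywords={"java", "sql"})
```

Looking up a query keyword then returns its documents directly, so only job descriptions that share a keyword with the CV need to be scored.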
Postings List
Fourth, the semantically weighted cosine method uses a query to search the postings list. For this research work, the query is the CV. Its keywords were extracted using the same method, creating a vector or list like the one shown in Fig. 5, but only for the query.
Fifth, equation 3 uses a summation in which the TF-IDF is multiplied by the semantic similarity score given by equation 2. To obtain the semantic similarity score, it was necessary to use the keywords from the query and the postings list.
Sixth, all the scores were stored in a matrix from which different rankings were selected. Each ranking (top N) is ordered from the highest to the lowest score. The matrix is represented in Fig. 6, where the columns are the job descriptions, the rows are the CVs, and R represents the resulting score between a job description and a CV.

Structure that Contains Similarity Results.
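Selecting the top N from such a matrix can be sketched as below, assuming the matrix is held as a dictionary of rows (one per CV) rather than read from a CSV file; the function name and the scores are illustrative.

```python
def top_n_per_cv(matrix, n):
    """matrix: {cv_id: {job_id: score}}; returns {cv_id: [job_ids]}, best first."""
    rankings = {}
    for cv_id, row in matrix.items():
        ordered = sorted(row.items(), key=lambda item: item[1], reverse=True)
        rankings[cv_id] = [job_id for job_id, _ in ordered[:n]]
    return rankings

# Invented scores for two CVs against three job descriptions.
matrix = {
    "cv21": {"job1": 0.10, "job2": 0.45, "job3": 0.30},
    "cv22": {"job1": 0.80, "job2": 0.05, "job3": 0.60},
}
rankings = top_n_per_cv(matrix, n=2)
```

Transposing the matrix gives the reverse use case, in which a job description is the query and the CVs are ranked.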
This method applies the fuzzy logic to make decisions using the semantic similarity values. The structure consists of four steps and the description of each one is as follows.
First, keyword extraction was used to create an individual list of keywords for each document; every document has its own keyword list, which is used to compare the CVs and job descriptions.
Second, the WUP similarity was computed for each keyword in the keyword lists of the CVs and job descriptions. The WUP similarity values form another list for each keyword, so the lists together form a 2D list. Then, the average of the values in each WUP list was calculated, with the keywords as the column reference. The resulting values are the weights used for matching documents.
Third, the percentage of matching was obtained from the matching weights using the entire list. Decisions are based on this percentage of matching.
Fourth, the fuzzy set, called the fuzzy matrix in this research work, was created. The purpose of the fuzzy matrix is to use the membership function represented in equation 4, where A is the fuzzy set.
X is the universe of discourse, i.e. the set of matchings.
$\mu_A(x) = 1$ means that x definitely belongs to A.
$\mu_A(x) = 0$ indicates that x does not belong to A.
When the range of a membership function consists of 1 and 0 only, A is a classical set. When the membership degree of an element x is 0, it is often not listed in the fuzzy set A.
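The four steps can be illustrated with a small sketch. It makes several assumptions that the text does not confirm: rows of the matrix correspond to CV keywords and columns to job description keywords, the percentage of matching is the mean of the column averages, and the membership function is piecewise linear between two hypothetical thresholds `low` and `high`.

```python
def column_averages(matrix):
    """Average WUP value per column (one column per keyword)."""
    return [sum(column) / len(column) for column in zip(*matrix)]

def matching_percentage(matrix):
    """Mean of the column averages, expressed as a percentage."""
    averages = column_averages(matrix)
    return 100.0 * sum(averages) / len(averages)

def membership(x, low, high):
    """Degree in [0, 1] to which matching percentage x belongs to the set A."""
    if x <= low:
        return 0.0   # x does not belong to A
    if x >= high:
        return 1.0   # x definitely belongs to A
    return (x - low) / (high - low)

# Invented WUP values for a 2 x 2 keyword comparison.
wup_matrix = [[1.0, 0.5],
              [0.5, 0.0]]
percentage = matching_percentage(wup_matrix)
degree = membership(percentage, low=40.0, high=60.0)
```

When `low` equals `high` the function only returns 0 or 1, which is the classical set case described above.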
This section shows the results of all the methods used. In addition, the analysis of the results for each method is presented.
The SEMILAR library was used to compare its semantic results with those of the strategies proposed in this research work, namely cosine similarity weighted semantically and the fuzzy matrix. The SEMILAR algorithms used are CM Comparer, Greedy Comparer WNLin, and LSA Comparer. Moreover, the Jaccard measure was used as a baseline. Each measure has 19,040 results, each obtained by comparing a job description with a CV using that measure.
Cosine weighted semantically results
The previous section explains how the method created the matrix. Furthermore, this section explains the results when compared to the gold standard.
The charts cover the query interval from CV21 to CV40 because this is the interval for which a user manually matched the CVs with the job descriptions. Section 5 explains the reason for using the top N; in the charts, N takes the values 5, 10, 12, 15, and 20. These values were chosen considering both recall and precision so that both remain high.
Fig. 8 shows the recall values. CV21 and CV23 have 0% because they are the false positives: these CVs belong to truck drivers and contain specific keywords such as warehouse, teamwork, certification, licenses, Microsoft office, powerpoint, and work under pressure, among others. Their recall is 0 because there is no match for them in the new gold standard. The opposite happens with CV22, which has 100% because the gold standard contains a single match for it, as only one job is suitable for a .NET developer. Also, as expected, recall increases as the number of retrieved matches increases, so the top 20 has the highest values.

Structure of the Experiments.

Recall Cosine Weighted Semantically.
Fig. 9 shows the precision values. The average is 60%, and precision decreases as N increases. The top 5 and top 10 keep the highest values. For CV26, the top 5 and top 10 reach 100% because all the matches returned by the algorithm are in the gold standard.

Precision Cosine Weighted Semantically.
Fig. 20 shows the f-measure values. As with recall, CV21 and CV23 have 0%, because their recall is 0 and the f-measure combines recall and precision. The top 15 and top 20 keep the highest values but start decreasing at CV40.
For the comparison with the fuzzy matrix algorithm, the similarity results must first be obtained using the WUP similarity and the keywords. Once the results have been created, the three measures are computed and their charts are produced. After the comparison measures are ready, the membership function can be created.
Fig. 10 shows the recall values. CV21 and CV23 are the false positives, so there is no match for them in the gold standard. Only the top 15 and top 20 retrieve the single match for CV22; for the remaining CVs, the top 20 has the highest values.

Recall Fuzzy Matrix.
Fig. 11 shows the precision values. The top 5 has the highest values. However, the behaviour is curious: across the rankings from 5 to 20, the values form a kind of curve that starts high at the top 5, decreases at the top 10 and top 12, and then increases again. This behaviour repeats 13 times.

Precision Fuzzy Matrix.
Fig. 20 shows the f-measure values, for which the top 20 is the best option. The f-measure follows the same behaviour as recall. This behaviour means that many of the results extracted by the algorithm do not match the gold standard, and the reason is simple: some of the extracted matches are feasible while others are not. With the values represented in the previous charts, the membership function can be created. The intervals of the function are as follows:
where x represents the value of a match from the previous dataset, analysed against the gold standard; A1 (x) is the accepted interval and A2 (x) is the interval of incorrect matches.
The CM Comparer results were compared with those of the proposed methods. Fig. 12 shows the recall values. The behaviour is the same as in the previous results, since the comparison depends on the gold standard, but the average is smaller than before: it is 2.5, with the top 20 giving the highest results.

Recall CM Comparer.
Fig. 13 shows the precision values. The top 5 has the highest values for 9 CVs, but for CV28, CV34, and CV36 its value is 0 because the first 5 values extracted do not match the gold standard. However, the top 10 remains among the highest values.

Precision CM Comparer.
Fig. 20 shows the f-measure. The top 20 has the highest value for 13 CVs; however, on 4 occasions the top 10 is the highest because the job descriptions in the top 10 coincide more closely with the CVs.
The Greedy Comparer WNLin is a similarity method whose results were compared with those of the proposed methods. Fig. 14 shows the recall values. The chart shows unusual behaviour: for CV40, the top 10, 12, 15, and 20 have precisely the same values. Also, the top 5 is 0 for CV36 and CV40 because the values obtained by the Greedy Comparer WNLin algorithm do not match the gold standard.

Recall Greedy Comparer WNLin.
Fig. 15 shows the precision values. One unusual behaviour occurs with CV29: the top 5 is high, the value then decreases by almost half at the top 10 and slightly more at the top 12, and it suddenly increases sharply at the top 15 and top 20. This is unique because in these charts the bars usually increase, decrease, or stay level, but do not form such a curve. The reason for the curve is that the keywords from the CV used in the algorithm do not match the postings list; thus, when the results are compared with the gold standard, there are few coincidences.

Precision Greedy Comparer WNLin.
Fig. 20 shows the f-measure values; the top 20 has the highest value for 14 CVs. Also, the top 10 is the highest value for CV40, after which the values start to decrease, because the algorithm only begins to match after the top 5, which in that case is 0.
The LSA Comparer obtains the relationship using a semantic structure, and its results were compared with those of the proposed methods. Fig. 16 shows the recall values. Unlike the previous algorithms, CV22 is 0 because the gold standard contains a single match, which the algorithm does not extract. Moreover, the top 20 always has the highest value, twice tied with the top 15.

Recall LSA Comparer.
Fig. 17 shows the precision values. The values look similar across the CVs, but the top 5 has the highest value for 10 CVs and is tied for first place twice. CV35 is the exception, because there the top 20 has the highest value: the algorithm's results only start to match the gold standard when more values are considered.

Precision LSA Comparer.
Fig. 20 shows the f-measure; CV22 is 0 because its recall is 0 as well. Also, the top 20 has the highest values 14 times. For this algorithm it would therefore be better to take the top 20, which offers the most possibilities of matching the gold standard. However, the top 15 is the highest for CV40 and remains close to the top 20.
The Jaccard measure is the algorithm used as a baseline against which the previous algorithms are compared. Fig. 18 shows the recall values. The average is 10%, so the values are very low. CV22 is 100% because there is only one match in the gold standard and the Jaccard algorithm extracts it.
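For reference, the Jaccard measure between two documents can be sketched as the ratio of shared to distinct keywords. The sketch assumes that each document is reduced to a set of keywords; the exact preprocessing used for the baseline is not restated here, and the keyword lists below are hypothetical.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard coefficient: |A intersect B| / |A union B| over keyword sets."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical keyword lists for one CV and one job description.
cv_keywords = ["java", "sql", "spring", "testing"]
job_keywords = ["java", "sql", "python", "docker"]

# 2 shared keywords out of 6 distinct keywords.
print(jaccard(cv_keywords, job_keywords))
```

Every keyword counts equally in this measure, and only exact overlaps contribute to the score.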

Recall Jaccard.
Fig. 19 shows the precision values. The values appear to remain almost equal, but in some situations the top 5 has the highest values. The value reaches 100% three times, meaning that in those cases every match extracted by the algorithm agrees with the gold standard.

Precision Jaccard.
Fig. 20 shows the f-measure values and, as expected for a baseline, the values are very low. Even when the precision is high, the recall is poor, which is why the f-measure is low as well. In this research field, both precision and recall have to be high, which is what the f-measure captures. The reason is that an algorithm applied to find matches has to retrieve many of them. This algorithm does not take relevance into account.
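A short worked example shows why a poor recall keeps the f-measure low even when the precision is perfect. It assumes the standard balanced f-measure; the values P = 1.0 and R = 0.1 are illustrative only and are not read from the figures.

```latex
F = \frac{2 \cdot P \cdot R}{P + R}
  = \frac{2 \cdot 1.0 \cdot 0.1}{1.0 + 0.1}
  = \frac{0.2}{1.1}
  \approx 0.18
```

Because the f-measure is a harmonic mean, it stays close to the smaller of the two values.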

F-measures of the Methods Applied.
This research work showed that cosine similarity, with a weighting based on the semantic comparison of the keywords, increases the matching between documents. Moreover, the fuzzy matrix strategy produced a membership function that helps in decision making; in this case, it works properly for matching documents of the same domain. The proposed strategies and the methods used to obtain the results in section 7 were compared to verify the efficiency of the proposed strategies.
The pros and cons of the two strategies proposed in this research work are as follows.
First strategy: the cosine similarity weighted semantically improves the matching of documents compared with CM Comparer, Greedy Comparer WNLin, LSA Comparer, and Jaccard. The improvement consists of enriching the cosine similarity values using keywords weighted with the WUP similarity, and it can be seen in the f-measure percentages shown in Fig. 20, where the cosine weighted semantically has the higher top-ranking results. The cons are that this strategy requires more processing time because it processes the keywords as vectors and uses methods such as WUP similarity and TF-IDF in sequential functions. Moreover, the results could be confusing because the similarity scores between documents are small fractional numbers, such as 0.0000945. The scores obtained lie between 1/100000 (0.00001), the lowest similarity score, and 1/100 (0.01), the highest.
Second strategy: the fuzzy matrix requires less processing time while improving the matching of the job descriptions with the CVs. The cons are that it is specific to these documents; applying it to other corpora will reduce its efficiency. Thus, the strategy should be applied when matching values are required. Future work would be to use the cosine similarity weighted semantically with different corpora. Moreover, the matching results could possibly be improved by scoring all keywords based on their specific domain.
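The idea behind the first strategy can be illustrated with a minimal sketch. It is not the implementation used in this research work: it swaps in the soft cosine formulation, in which each pair of keywords contributes to the dot product in proportion to a word-to-word similarity. The table `SIMILARITY` is a hypothetical stand-in for WUP similarity scores (0.8 is the upper end of the WUP range quoted for "java" and "c" in the introduction), and the keyword weights stand in for TF-IDF values.

```python
import math

# Hypothetical word-to-word scores standing in for WUP similarity.
SIMILARITY = {("java", "c"): 0.8}


def word_similarity(a, b):
    """Return 1.0 for identical words, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def soft_dot(u, v):
    """Dot product in which every keyword pair is weighted by its similarity."""
    return sum(wu * wv * word_similarity(a, b)
               for a, wu in u.items()
               for b, wv in v.items())


def weighted_cosine(u, v):
    """Soft cosine between two keyword-weight dictionaries."""
    denom = math.sqrt(soft_dot(u, u)) * math.sqrt(soft_dot(v, v))
    return soft_dot(u, v) / denom if denom else 0.0


def plain_cosine(u, v):
    """Ordinary cosine: only identical keywords contribute."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical keyword weights (placeholders for TF-IDF values).
cv = {"java": 0.6, "sql": 0.4}
job = {"c": 0.5, "sql": 0.5}

print(plain_cosine(cv, job), weighted_cosine(cv, job))
```

In this toy case the weighted score is higher than the plain cosine because "java" and "c" are treated as related rather than as unrelated terms. The nested loop over all keyword pairs is also one plausible contributor to the longer processing time noted for this strategy.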
