Abstract
User rating information on multiple predefined aspects gathered by hotel recommendation systems generally shows a deviation between the overall rating and detailed criteria ratings. In this study, to address this deviation, we proposed a novel hotel recommendation method that clusters users with different preferences into different groups using the K-means algorithm. Moreover, we allocated weights to different criteria and obtained a comprehensive score. A case study on actual data from Tripadvisor.com showed that compared with three other models, our proposed model demonstrated a more impressive performance. This research can offer advantages to hotel service providers and customers in terms of decision making.
Introduction
In general, recommendation systems (RSs) have two main purposes [1]. On the one hand, RSs are used to motivate users to act, such as to buy a book or watch a movie. On the other hand, RSs can be seen as effective tools for coping with information overload. RSs are employed in numerous domains, such as e-commerce, medicine [2], movies and restaurants [3]. In hospitality and tourism, RSs are important applications with tremendous growth and potential [4]. Studies on the travel industry [5] confirm that customer feedback (online reviews or ratings provided by users), such as electronic word of mouth [6, 7], has considerable impact on hotel sales and tourist decision making for hotels. To assist customer decision making in the hospitality sector based on user feedback, developing an intelligent method is necessary.
Hotel recommendations [8] are not new and have been studied for dozens of years. In the hotel recommendation process, different types of information [9] are utilised. Single user item ratings are typically utilised to analyse user and hotel features. However, the information reflected by single ratings is limited. Different types of users generally have different hotel preferences. For example, some travellers may attach considerable importance to price, because they generally prefer cost-saving hotels. Meanwhile, others may pay attention to room size or service quality. Single ratings cannot satisfy such complicated preferences and may lead to recommendation results with poor quality.
With the development of Internet platforms and scientific technology, multi criteria ratings have become popular and are employed in many domains, especially tourism. For example, on Tripadvisor.com, which is the world’s leading travel website, users can rate hotels they booked not only overall but also on specific details. These ratings make it possible to use multi criteria ratings for hotel recommendations. Multi criteria ratings contain not only the overall rating but also detailed criteria ratings, which reflect substantial and detailed information [10]. Considering multi criteria ratings is helpful for comprehensively understanding users’ preferences and can benefit users’ hotel selection decision making [11].
However, many existing studies consider the overall and detailed criteria ratings separately rather than together. In addition, an overall rating is generally not an objective reflection of the detailed criteria ratings. For example, from Tables 1 and 2, we can observe that User1 is satisfied with all aspects of a hotel except for location but still gave a very low overall rating. Moreover, User2 is satisfied with the hotel’s cleanliness, service and price and unsatisfied with its location, room size and sleep quality but gave a satisfactory overall rating. At the same time, the ratings of User3 show that User3 is not satisfied with any aspect but gave a high overall rating. In summary, the overall rating is not simply the average of the detailed aspects.
Meaning of different rating levels
Simple multi criteria ratings
As mentioned above, a deviation typically exists between the overall rating and detailed criteria ratings, which exerts a considerable effect on the quality of recommendations. Therefore, considering the overall rating and detailed criteria ratings comprehensively is necessary. In this study, we propose a new method for resolving the problems above. Firstly, we cluster users according to their multi criteria ratings, with users with close similarities divided into the same clusters. Secondly, we distribute weights to each criterion in the different user groups and generate a comprehensive score. Finally, based on this score, we provide predictions and recommendations.
Moreover, extensive experiments show that the proposed method can well address the deviation problem between the overall rating and detailed criteria ratings and obtain efficient and accurate recommendation results.
The major contributions and innovations of this work are as follows: (1) we propose a clustering model to describe users’ preferences, considering the overall rating and detailed criteria ratings comprehensively; (2) we introduce a comprehensive score to address the deviation between the overall rating and detailed criteria ratings; and (3) we improve the performance of recommendations with the proposed method.
The remainder of this paper is organised as follows. Section 2 reviews the literature on traditional RSs, single- and multi criteria rating RSs and hotel RSs. Section 3 states the problems presented in this study, and Section 4 discusses the proposed work. Section 5 evaluates the proposed method using a real data set crawled from Tripadvisor.com, and the conclusion is stated in Section 6.
Traditional RSs
RSs [1] have been popular in academia and various industries for decades since they were first studied in the 1990s, perhaps owing to their excellent performance in filtering information and overcoming information overload. In general, three types of RS techniques exist, namely, the content-based (CB) [12, 13], collaborative filtering (CF) [14] and hybrid [15] techniques.

Consumers’ hotel assessments.
The core of the CB technique is to obtain a description of an object and a record of its important characteristics. For example, in a bookstore, features may include genres, authors and themes. Similar to item descriptions, user records must be extracted or ‘learned’ automatically by analysing users’ behaviours and feedback or by directly asking users about their interests and preferences. The main idea of the CF recommendation method is to predict user preferences or interests using the past behaviours or opinions of existing user groups. The basic idea of the CB and CF systems is that if users had the same preferences in the past (e.g. they browsed or bought the same books), then they will have similar preferences in the future. The hybrid technique combines the CB and CF techniques and can generate effective recommendation results. Given that the CB, CF and hybrid techniques each have shortcomings and advantages [3], they are employed in different studies to address different problems. The CF technique is the most popular owing to its flexible and powerful information filtering performance. The RS presented in this study can be classified as a CF RS.
With the development of RSs [44], many types of approaches were proposed, and new technologies emerged. In the classical CF setting, the RS input is a user–item rating matrix. In addition, an RS can be seen as a function that returns a rating prediction (or relevance score) for a given user–item combination. However, the information contained in single ratings is relatively limited. To provide users with personalised items or services based on their preferences and optimise predictions and recommendations sufficiently, an increasing number of Internet platforms, such as Tripadvisor.com and Dianping.com, provide users with opportunities to rate goods and services from different dimensions. The amount of information included in multi criteria ratings is abundant, because they include not only the overall rating but also detailed criteria ratings. For example, a Tripadvisor.com user named Kenneth-05 posted his ratings on a five-star review system, as shown in Fig. 1.
In multi criteria rating settings, users’ item preferences can be reflected from a number of different angles. Thus, multi criteria ratings as RS input have become increasingly popular. Tong et al. [16] leveraged multi criteria ratings and social connections to improve personalised rankings for collaborations. In contrast to traditional CF, Xavier et al. [17] extracted a set of expert neighbours from an independent data set rather than applying the nearest neighbour method to user–item rating data. Alexandros et al. [18] used multi criteria ratings and extended a two-dimensional user–item matrix to a three-dimensional user–item–context matrix by integrating contexts into user-rating data. Based on multi criteria ratings, Zhang et al. [19] classified users into three groups (an optimistic user group, a pessimistic user group and a neutral user group) by clustering their preferences.
Zhang et al. [20] conducted a case study of Tripadvisor.com and proposed a new decision support model that utilises interval-valued neutrosophic numbers, which are a type of fuzzy method, to analyse social information, multi criteria ratings and interrelations among criteria. To eliminate rating deviation from consumers with different scoring characteristics, Zhang et al. [21] presented a personalised restaurant recommendation method by establishing the relationship between user groups and restaurant groups using a K-means clustering technique and probabilistic linguistic term sets. Besides, compared with a single-rating data set, the sparseness of a multi criteria rating data set is higher. To address the problem of sparse data, various types of techniques were proposed [22–24].
The performance of predictions and recommendations can be improved by considering multi criteria ratings rather than single ratings [8]. However, few studies showed the detailed differences between the overall rating and detailed criteria ratings.
Hotel RSs
Hotel recommendations are not a new concept but a well-studied topic in the hospitality domain [25]. The goal of hotel RSs is to help travellers select hotels that cater to their preferences. With the convenience of the Internet, travellers can share their experiences and feelings with others. Online reviews are a way to share experiences and feelings and have gradually become indispensable in travellers’ decision making when selecting hotels. Consequently, online reviews have garnered considerable attention from researchers and become popular in academia and various industries. Dong et al. [26] explored the relationship between online reviews, characteristics and trust. Hu et al. [27] confirmed that the effect of negative reviews is stronger than that of positive reviews and explored the focus of hotel complaints at different levels. Moreover, [28] expressed that the language style of online reviews exerts a considerable influence on consumer attitudes and product sales.
To further examine the subject, Lin et al. [29] proposed a new hotel recommendation method based on unknown tourist preferences drawn from their ratings. Some researchers explored RSs from the perspective of fuzzy theory [45, 46] and fuzzy tools [47]. For example, Chen et al. [30] optimised the weights of different aspects using a fuzzy technique and nonlinear programming approach in the hotel recommendation process. Furthermore, aside from the weight of the criteria, the authors considered interdependencies among the criteria. To discover the hotel selection preferences of Hong Kong inbound tourists, Li et al. [31] used a Choquet integral, which is a fuzzy decision method that considers the weights and relativity of different aspects, to determine users’ preferences. Based on probabilistic linguistic information, Peng et al. [32] established a cloud decision support model for consumers to select hotels.
Moreover, to fully utilise the latent value of online reviews and strengthen the performance of hotel RSs, an increasing number of studies considered online ratings and textual comments [33, 34]. To cater to different consumer goals simultaneously, Mao et al. [35] proposed a multi-objective recommendation model based on hypergraph ranking. Meanwhile, Veloso et al. [36] extracted context such as hotel location and tourists’ emotions from textual comments and extended a two-dimensional user–item matrix to a three-dimensional user–item–context matrix. Ahani et al. [37] used a cluster technique and dimension decomposition method to assess online ratings and textual information.
Most of the above studies are based only on existing rating information. However, user rating information on multiple predefined aspects gathered by hotel RSs generally exhibits a deviation between the overall rating and detailed criteria ratings. To recommend hotels that meet tourists’ preferences, the deviation mentioned above should be considered, and the potential value of online hotel reviews should be examined further.
Problem statement
In the hotel recommendation domain considered in this study, different tourists generally have different preferences. Some may focus on price, because they prefer cost-saving hotels, whereas others may pay substantial attention to room size or service quality. Therefore, traditional single ratings typically cannot express tourists’ complex preferences well. Thus, we use multi criteria ratings to improve the performance of recommendations. Assume that n users, m hotels and o criteria are included in this work. The question that must be answered is how to supply the target users with top-N hotels that match their preferences. We designate the rating of the k-th criterion of the j-th item given by the i-th user as R_{ijk}. If R_{ijk} is null, then no rating exists for the i-th user on the k-th criterion of the j-th item.
However, in this study, the overall rating only has a strong relationship with a few aspects rather than with all six aspects. For example, some customers may give a satisfactory overall rating only because the hotel price meets their cost-saving expectations, whereas others may pay considerable attention to room size or service quality. A deviation exists between the overall rating and six criteria ratings, which can seriously affect the recommendation results.

Hotel recommendation process.
To help customers’ decision making and reduce the deviation mentioned above, we consider the overall rating and six detailed criteria ratings simultaneously. Based on a clustering technique, we classify the massive amounts of ratings into different clusters to reduce data dimensionality and increase density. The similarities of objects in the same clusters are very high, whereas the similarities of objects in different clusters are very low. Thus, we cluster customers in different groups according to their rating preferences on different hotel criteria. Moreover, we determine the weights of the different criteria of the different groups. Next, we generate a new comprehensive score for each user. Finally, according to the comprehensive score, we generate the most similar neighbours of the active users and the top-N recommendation list.
The framework and details of the proposed hotel recommendation method are explained in this section. The general framework proposed in this study is illustrated in Fig. 2. The proposed recommendation method includes the following four components: (1) the data preparation module, (2) the user cluster generation module, (3) the weight determination and comprehensive score module and (4) the prediction and recommendation module. The details of the four modules are revealed in the following subsections.
Data preparation module
Given that the sparsity of the data set we used was more than 99% and that density decreases as the number of items grows, reducing the number of items was crucial for increasing density. We therefore set rules to filter the noisy data. Specifically, we selected the active hotels and users according to the following rules: a hotel must be rated by at least m users, and a user must have consumed or rated at least n hotels. Consequently, finding suitable values of m and n is pivotal to our process.
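The filtering rules above can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the function names `filter_active` and `density` are hypothetical, each record is assumed to be a unique (user, hotel) pair, and repeating the filter until nothing changes is an assumption of the sketch (removing a hotel can push a user below the threshold, and vice versa).

```python
from collections import Counter


def filter_active(records, min_users_per_hotel, min_hotels_per_user):
    """Keep only records whose hotel and user both meet the activity thresholds.

    records: list of (user_id, hotel_id) pairs, one per rating record.
    The filter is repeated until it reaches a fixed point (an assumption of
    this sketch; the text only states the two thresholds).
    """
    current = list(records)
    while True:
        users_per_hotel = Counter(hotel for _, hotel in current)
        hotels_per_user = Counter(user for user, _ in current)
        kept = [
            (user, hotel)
            for user, hotel in current
            if users_per_hotel[hotel] >= min_users_per_hotel
            and hotels_per_user[user] >= min_hotels_per_user
        ]
        if len(kept) == len(current):
            return kept
        current = kept


def density(records):
    """Share of the user-hotel matrix that is filled in."""
    users = {user for user, _ in records}
    hotels = {hotel for _, hotel in records}
    if not users or not hotels:
        return 0.0
    return len(set(records)) / (len(users) * len(hotels))
```

With the thresholds reported later in the paper, the call would be `filter_active(records, 100, 3)`; comparing `density` before and after shows how much the filter compacts the data.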
User cluster generation module
Clustering techniques, as unsupervised machine learning methods, are used for various real-world applications. Clustering techniques can be employed to well address data sparsity by filtering raw data into K different clusters, which is an effective way to classify consumers. These techniques are mainly utilised to uncover groups with analogical needs, habits and preferences and optimise the performance of decision support systems. Identifying customers with similar characteristics is complicated and time consuming; thus, employing a clustering technique to obtain accurate predictions was necessary and effective.
The K-means algorithm is a highly effective clustering technique owing to its strong explanatory power. Thus, we selected this algorithm as our clustering technique. By utilising this algorithm, we could divide customers into several groups according to their preferences. As an efficient clustering algorithm, the K-means algorithm is robust in distinguishing groups based on the multi criteria ratings of hotels listed on Tripadvisor.com. Moreover, it can learn from intricate data sets, handle multidimensional data and generate meaningful clustering results. In view of these properties, taking advantage of this clustering approach for consumer segmentation in the hotel industry (hospitality) was useful. To provide a clear description, the K-means method is presented in Algorithm 1.
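For readers who want a concrete reference point, the standard K-means procedure (Lloyd’s iteration) can be sketched in plain Python as below. This is a generic textbook version, not a reproduction of the paper’s Algorithm 1; the function names, the random initialisation from the data points and the `max_iter` and `seed` parameters are assumptions of the sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two rating vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster rating vectors into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initial centroids: k data points drawn at random (assumed scheme).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids
```

In the setting of this paper, each point would be a seven-dimensional vector holding the overall rating and the six criteria ratings.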
In this study, we utilised a silhouette coefficient [38] to measure and estimate the clustering quality level and determine the number of clusters. A silhouette coefficient is a type of evaluation method for clustering effects that combines cohesion and separation. This coefficient can be used to evaluate the influence of different algorithms or the different running modes of an algorithm on clustering results. The silhouette coefficient of element i is expressed as follows:
$$SC_i = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
where a(i) represents the intra cluster dissimilarity, that is, the average distance between element i and the other elements of its own cluster k (the intra cluster distance), and b(i) represents the inter cluster dissimilarity, that is, the smallest average distance between element i and the elements of any other cluster. The smaller the a(i) value, the better element i fits its cluster.
The overall score for a set of n_k elements (one cluster or the entire clustering) is calculated by taking the average of the silhouette coefficients SC_i of all elements i in the set:
$$SC = \frac{1}{n_k} \sum_{i=1}^{n_k} SC_i$$
The value of SC_i belongs to the interval [–1, 1], and the closer the SC_i value to 1, the more reasonable the clustering of element i. The closer the SC_i value to –1, the more unreasonable the clustering of element i, which implies that element i should be classified into another cluster. If SC_i is approximately 0, then element i is on the boundary of at least two clusters. The general rules of thumb of silhouette coefficients are shown in Table 3.
Interpretation of silhouette coefficient values
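The computation described above can be illustrated with a short sketch. It assumes Euclidean distance and at least two clusters, and it follows the common convention of assigning 0 to elements of single-member clusters; the function names are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(points, labels):
    """Silhouette coefficient of every element (needs at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    values = []
    for point, own in zip(points, labels):
        same = clusters[own]
        if len(same) == 1:
            values.append(0.0)  # convention for single-member clusters
            continue
        # a(i): mean distance to the other members of the element's cluster.
        # The element's distance to itself is 0, so dividing by n - 1 works.
        a = sum(euclidean(point, q) for q in same) / (len(same) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, q) for q in members) / len(members)
            for label, members in clusters.items()
            if label != own
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def mean_silhouette(points, labels):
    """Average silhouette coefficient of the whole clustering."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)
```

Running `mean_silhouette` for each candidate K and comparing the scores is one way to apply the rule of thumb in Table 3.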
Aside from the silhouette coefficient, we used the Calinski-Harabasz (CH) index [39] to measure and estimate the clustering quality level and determine the number of clusters. The CH index measures compactness by calculating the sum of the squared distances between each point and the centre of its cluster, and it measures the separation degree of the data set by calculating the sum of the squared distances between each cluster centre and the centre of the data set. The CH index is obtained as the ratio of the separation degree to the compactness. Thus, the larger the CH index, the more compact each cluster and the more dispersed the clusters are from one another, and the better the clustering result. The CH index equation is as follows:
$$CH(K) = \frac{\mathrm{tr}(B_K) / (K - 1)}{\mathrm{tr}(W_K) / (n - K)}$$
where n is the number of elements, K is the number of clusters, tr(B_K) is the between-cluster dispersion and tr(W_K) is the within-cluster dispersion.
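A minimal sketch of the ratio described above is given below. It uses the usual definition of the index, with the between-cluster term weighted by cluster size and both terms normalised by their degrees of freedom; the function names are hypothetical, and the sketch assumes at least two clusters and fewer clusters than points.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def calinski_harabasz(points, labels):
    """CH index: between-cluster dispersion over within-cluster dispersion."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    n = len(points)
    k = len(clusters)
    overall_centre = mean_point(points)

    between = 0.0  # separation: cluster centres versus the data set centre
    within = 0.0   # compactness: points versus their own cluster centre
    for members in clusters.values():
        centre = mean_point(members)
        between += len(members) * squared_distance(centre, overall_centre)
        within += sum(squared_distance(p, centre) for p in members)

    return (between / (k - 1)) / (within / (n - k))
```

A larger value indicates tighter, better separated clusters, so the index can be compared across candidate values of K.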
Weight determination and comprehensive score module
Lee et al. [40] found that customers may have similar preferences if they give similar ratings to various criteria. Identifying users’ preferences is an important reference in the recommendation process [41]. However, users may attach different levels of importance to different criteria, and determining the weight of different criteria is crucial. In this study, the coefficient of variation method [42] was used to determine the weight of each aspect described above. Next, the weight of each criterion was allocated separately for each customer group. The coefficient of variation of the k-th criterion is calculated as follows:
$$CV_k = \frac{\sigma_k}{\mu_k}$$
where σ_k and μ_k are the standard deviation and mean of the ratings on the k-th criterion within a user group. The equation for calculating the weights is as follows:
$$w_k = \frac{CV_k}{\sum_{l=1}^{o} CV_l} \tag{6}$$
The data of each criterion are standardised with the Z-score standardisation method, which can be calculated as follows:
$$z_{a,i,k} = \frac{r_{a,i,k} - \mu_k}{\sigma_k}$$
where r_{a,i,k} is the rating given by user u_a to the k-th criterion of the i-th item. After the data of each criterion were standardised and multiplied by the weight, the sum was calculated, and the comprehensive score of the i-th item given by u_a was obtained as follows:
$$M_{i,a} = \sum_{k=1}^{o} w_k \, z_{a,i,k} \tag{8}$$
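The weighting and scoring steps can be sketched in a few lines. This is an illustration under stated assumptions, not the authors’ code: the function names are hypothetical, the population standard deviation is used (the text does not say whether the population or sample form was applied), and the statistics are computed over the rows passed in, which are assumed to be the rating records of one user group.

```python
from statistics import mean, pstdev


def cv_weights(ratings):
    """Coefficient-of-variation weights, one per criterion (they sum to 1).

    ratings: list of rows, each row holding the criteria ratings of one
    record from a single user group. Assumes at least one criterion varies.
    """
    columns = list(zip(*ratings))
    cvs = [pstdev(column) / mean(column) for column in columns]
    total = sum(cvs)
    return [cv / total for cv in cvs]


def z_scores(ratings):
    """Standardise every criterion column to zero mean and unit variance."""
    columns = list(zip(*ratings))
    means = [mean(column) for column in columns]
    stdevs = [pstdev(column) for column in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs)]
        for row in ratings
    ]


def comprehensive_scores(ratings):
    """Weighted sum of the standardised criteria ratings for each record."""
    weights = cv_weights(ratings)
    return [
        sum(w * z for w, z in zip(weights, row))
        for row in z_scores(ratings)
    ]
```

A criterion on which a group’s ratings barely vary receives a weight close to zero, so the comprehensive score is driven by the criteria that actually discriminate between hotels for that group.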
Prediction and recommendation module
In this section, we calculated the similarity between each pair of users using amended cosine similarity (ACS). ACS addresses a weakness of plain cosine similarity, which considers only the direction of the rating vectors and ignores differences in scale along each dimension. Therefore, when calculating the similarity, ACS performs a correction operation by subtracting the mean value from each dimension. The ACS equation is as follows:
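A common way to write this mean-centred similarity is sketched below. The sketch reads ‘each dimension’ as each item, so the mean subtracted is the average comprehensive score \bar{M}_i of item I_i; both this reading and the restriction to the set I_{ab} of items scored by both users are assumptions, not details confirmed by the text.

```latex
\mathrm{ACS}_{a,b} =
\frac{\sum_{i \in I_{ab}} \left(M_{i,a} - \bar{M}_i\right)\left(M_{i,b} - \bar{M}_i\right)}
     {\sqrt{\sum_{i \in I_{ab}} \left(M_{i,a} - \bar{M}_i\right)^2}\;
      \sqrt{\sum_{i \in I_{ab}} \left(M_{i,b} - \bar{M}_i\right)^2}}
```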
Assume that I = {I_i | i ∈ {1, ..., n}} is the set of n items, U = {u_a | a ∈ {1, ..., m}} is the set of m users and R = {M_{i,a} | i ∈ {1, ..., n}, a ∈ {1, ..., m}} is the rating matrix, where M_{i,a} and M_{i,b} represent the comprehensive scores of I_i given by u_a and u_b, respectively. ACS_{a,b} is the amended cosine similarity between u_a and u_b, which belongs to the interval [–1, 1]. A negative value indicates that the pair has no correlation rather than opposite behaviours. For example, if the similarity between u_a and u_b is negative, it does not mean that u_a will give a high rating to I_i when u_b gives a low rating. Therefore, we considered only the positive similarities, and the higher the value, the more similar the behaviour between the two users.
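The similarity and neighbour selection just described can be sketched as follows. The sketch assumes that the mean subtracted for each dimension is the mean comprehensive score of the corresponding item and that only items scored by both users are compared; the function names and the dictionary layout are hypothetical.

```python
import math


def amended_cosine(scores_a, scores_b, item_means):
    """Mean-centred cosine similarity between two users.

    scores_a, scores_b: dicts mapping item id -> comprehensive score.
    item_means: dict mapping item id -> mean comprehensive score (assumed
    centring; the text only says the mean of each dimension is subtracted).
    """
    common = [item for item in scores_a if item in scores_b]
    dev_a = [scores_a[item] - item_means[item] for item in common]
    dev_b = [scores_b[item] - item_means[item] for item in common]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    denominator = math.sqrt(sum(x * x for x in dev_a)) * math.sqrt(
        sum(y * y for y in dev_b)
    )
    return numerator / denominator if denominator > 0 else 0.0


def nearest_neighbours(active, all_scores, item_means, k):
    """The k most similar users, keeping positive similarities only."""
    candidates = []
    for user, scores in all_scores.items():
        if user == active:
            continue
        similarity = amended_cosine(all_scores[active], scores, item_means)
        if similarity > 0:  # negative or zero similarities are discarded
            candidates.append((user, similarity))
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:k]
```

In the proposed method, `all_scores` would contain only the users in the same cluster as the active user.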
After the similarities of each pair of users were computed, the k-neighbour set of the active users can be given based on the ranked value of similarity above. Next, we obtained the predictive ratings of every single user for each item, which can be expressed as follows:
Lastly, based on the predicted values, from high to low, we ranked the recommended items and generated a list of top-N recommendations.
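The prediction and ranking steps can be illustrated with the sketch below. Because the prediction equation is not reproduced here, the sketch assumes a plain similarity-weighted average of the neighbours’ comprehensive scores; other variants (for example, a mean-adjusted weighted sum) are equally plausible. The function names and data layout are hypothetical.

```python
def predict_score(item, neighbours, all_scores):
    """Similarity-weighted average of the neighbours' scores for one item.

    neighbours: list of (user id, positive similarity) pairs.
    Returns None when no neighbour has scored the item.
    """
    numerator = 0.0
    denominator = 0.0
    for user, similarity in neighbours:
        score = all_scores[user].get(item)
        if score is not None:
            numerator += similarity * score
            denominator += similarity
    return numerator / denominator if denominator > 0 else None


def top_n(active, neighbours, all_scores, n):
    """Rank the active user's unrated items by predicted score, high to low."""
    already_rated = set(all_scores[active])
    candidates = {
        item for user, _ in neighbours for item in all_scores[user]
    } - already_rated
    predictions = []
    for item in candidates:
        predicted = predict_score(item, neighbours, all_scores)
        if predicted is not None:
            predictions.append((item, predicted))
    predictions.sort(key=lambda pair: (-pair[1], pair[0]))
    return predictions[:n]
```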
We developed a general decision support procedure for a hotel RS based on the above modules.
Step 1. Multi criteria ratings were extracted from the raw data set, and rules were set to filter the noisy data.
Step 2. Users were clustered into different groups by applying the K-means method to the multi-dimensional ratings.
Step 3. Considering the differing preferences of the groups, weights were distributed to the different criteria separately for each group.
Step 4. The group to which each target user belonged was determined, and the comprehensive score was calculated using the weights obtained in Step 3.
Step 5. The similarities between the active users and other users belonging to the same groups as the active users were calculated.
Step 6. The size of the neighbour set for the active users was determined, which is vital for the prediction.
Step 7. The ratings of unrated hotels for every active user were predicted according to the similarities and neighbours obtained in Step 6.
Step 8. The top-N hotel recommendation list was generated for each target user.
Experiment and results analysis
Experimental settings
To assess the recommendation model proposed in this work based on multi criteria ratings, we utilised the information captured on Tripadvisor.com. Tripadvisor.com is the world’s leading travel website, which provides comments and suggestions from travellers around the world and covers a wide range of travel planning and hotel, scenic spot, restaurant and airline reservations. On Tripadvisor.com, users can give an overall rating for hotels they booked and rate hotels from six aspects. The six hotel features are C1 = Cleanliness, C2 = Location, C3 = Rooms, C4 = Service, C5 = Sleep quality and C6 = Value.
In our experiment, the hotel data set was obtained from Tripadvisor.com. The raw data set included 1,521,672 records on 12,773 hotels given by 781,403 travellers. Each record included not only ratings but also users’ opinions in the form of textual comments. However, we focused only on the ratings. Besides, abundant noisy data existed in the raw data. For example, not all users rated every criterion, and as hotels differ from other products (e.g. food, clothing, movies, music and so on), consumption frequency is very low. In our data, most of the users had only one consumption record, which caused the problem of high sparsity. Noisy data can reduce the quality of a data set; thus, identifying and filtering noisy data are essential.
In our data preprocessing, we extracted the hotel ID, user ID and multi criteria ratings from the raw data set and changed them into a matrix. Moreover, we searched for and removed noisy records in which not every aspect was rated. After clean up and reformatting, each record contained an overall rating and six detailed criteria ratings. At the same time, to increase the density of the data set, we filtered it to select the active hotels and users and generated a highly compact data set, which we used as the basis of our proposed method. We set rules to search for and select the active users and hotels. Users with at least three consumption records were selected and screened, and hotels rated by at least 100 users were selected. After the selection of the active users and hotels, we generated a relatively dense data set that included 5,341 users, 100 hotels and 18,525 rating records. The general statistics are presented in Table 4.
Basic census
The hotel ratings by the users were measured on a quantitative scale in the interval [1, 5]. To understand the specific distribution of the overall rating and detailed criteria ratings, we statistically analysed levels 1–5 of the multi criteria ratings. Next, we plotted the distribution accordingly, and the details are presented in Fig. 3.

Multi criteria rating distribution.
Figure 3 shows an interesting phenomenon, that is, the vast majority of the rating values (over 75%) were 3 or above, for both the overall rating and the detailed criteria ratings. At the same time, low-value (2 or below) ratings were few. Detailed information on the overall rating and other criteria ratings is presented in Table 6.
Table 6 shows that more than 75% of the ratings were higher than 3. In the ‘Cleanliness’ and ‘Location’ aspects, about 75% or more of the ratings were 4 and above. Interestingly, the mean values of the overall rating and six detailed criteria ratings differed. The mean value of ‘Overall’ was the second smallest, after that of ‘Rooms’, whereas ‘Location’ had the highest mean value. The mean values of ‘Cleanliness’, ‘Location’ and ‘Sleep quality’ were higher than 4, whereas the mean values of ‘Overall’, ‘Rooms’, ‘Service’ and ‘Value’ were less than 4. A deviation can be observed between the overall rating and six criteria ratings, which can seriously affect the recommendation results.
During our clustering process, we first tried to cluster the users with the self-organising mapping (SOM) method. However, the results of the SOM method proved unsatisfactory, perhaps owing to the data set. The K-means method clustered the data far better; thus, in the experiment stage, we used the K-means method to generate the user cluster model. However, its performance has a strong relationship with the number of clusters. To determine the appropriate number of clusters, we utilised the silhouette coefficient and the CH index, as mentioned in Section 4.2, to measure the effectiveness of the clustering result. In our work, users were clustered into different groups based on the multi criteria ratings preprocessed in Section 5.1. Figures 4 and 5 describe the corresponding silhouette coefficient and CH index scores for different numbers of clusters K, respectively, where K is an integer between 2 and 10. Figure 4 shows that the silhouette coefficient score was more than 0.5 when K was no bigger than 3. Based on Table 3 in Section 4.2, we determined that the clustering structure was reasonable for these values. At the same time, Fig. 5 shows that the CH score took the highest value when K was 4 and the second highest value when K was 3. After comprehensively considering the silhouette coefficient and CH index, we concluded that the best number of clusters K was 3.

Score of silhouette coefficient.

Score of CH index.
Subsequently, the three user clusters generated by the K-means clustering algorithm are presented in Fig. 6. Users who belong to the same clusters typically demonstrate high similarities, whereas the similarities of users from different clusters are generally low. Table 5 shows the centroids of the three clusters. According to the nearest distance to these centroids, the 18,525 rating records of the 5,341 users were automatically divided into three groups. In Table 5, the first row denotes the different dimensions, from the overall rating to the six detailed criteria ratings. Rows 2 to 4 express the centroids of each cluster under the seven dimensions, namely, ‘Cleanliness’, ‘Location’, ‘Rooms’, ‘Service’, ‘Sleep quality’, ‘Value’ and ‘Overall’, and the higher the value of the centroids, the higher the average satisfaction. The ‘Cleanliness’ value in Cluster #1 was higher than those of the other six dimensions. Meanwhile, in Cluster #2 and Cluster #3, the ‘Location’ value was the highest. This finding implied that group preferences over the seven dimensions differed between the clusters.

Cluster of customers.
Centroids of users
Basic description
Multi criteria weights of three user clusters
Although users were classified into different clusters, group preferences were distinct between the seven dimensions under the different clusters. Some users may attach considerable importance to cleanliness, whilst others may prefer location. Determining the correct weight for each criterion is important. Besides, an interesting phenomenon can be observed in our collected rating data, that is, the relevance between the overall rating and six detailed criteria ratings appeared to be incongruous. For example, a user may give a 4-star overall rating but no more than 3-star ratings to the detailed criteria. Thus, a deviation can be observed between the overall rating and detailed criteria ratings, which can significantly affect the prediction and recommendation results.
Considering the factors above, we allocated a weight to each criterion in each cluster. The criterion weights of each user cluster were obtained with Equation (6) and are presented in Table 7. The first line denotes the different dimensions of the six detailed criteria. From the second to the fourth, each line indicates the weight vector of a user cluster under the six detailed dimensions. The bigger the weight value of a certain criterion, the higher the priority of that criterion to a certain user group. For example, for Cluster #1, the weight value of ‘Cleanliness’ was the highest, which meant that the users in Cluster #1 paid more attention to ‘Cleanliness’ than to the other five criteria when engaged in hotel decision making. For Cluster #2 and Cluster #3, the weight value of ‘Location’ was higher than that of the other five criteria, but the two clusters differed slightly in the weights of the remaining dimensions. For example, the weight of ‘Rooms’ was minimal in Cluster #2, whereas in Cluster #3, ‘Service’ had a minimal weight. This finding meant that the users in Cluster #2 and Cluster #3 paid more attention to hotels’ location. However, the users in Cluster #2 rarely valued room size, and the users in Cluster #3 were less focused on service when staying at a hotel.
After the weight of each criterion was determined, we obtained the weighted score of the six criteria and the comprehensive score using Equation (8), which was mentioned in Section 4.3.
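As an illustration of how cluster-specific weights can turn six criteria ratings into one comprehensive score, the following Python sketch computes a normalised weighted sum. It is an assumption rather than a reproduction of Equation (8), which is not restated in this section. The weights are made-up placeholders, not the values in Table 7, and the criteria names ‘Value’ and ‘Sleep Quality’ are assumed for the two dimensions not named in the text.

```python
# Hypothetical weights for one user cluster; NOT the values reported in Table 7.
# 'Value' and 'Sleep Quality' are assumed names for the two unnamed criteria.
HYPOTHETICAL_WEIGHTS = {
    "Cleanliness": 0.25,
    "Location": 0.15,
    "Rooms": 0.10,
    "Service": 0.20,
    "Value": 0.15,
    "Sleep Quality": 0.15,
}


def comprehensive_score(ratings, weights):
    """Weighted average of the criteria ratings of one user-hotel record.

    Assumes a simple weighted sum; the weights of the rated criteria are
    renormalised so that they sum to 1.
    """
    total_weight = sum(weights[c] for c in ratings)
    return sum(weights[c] * r for c, r in ratings.items()) / total_weight


# One record: 5 stars for cleanliness, 3 stars for everything else.
record = {c: 3 for c in HYPOTHETICAL_WEIGHTS}
record["Cleanliness"] = 5
score = comprehensive_score(record, HYPOTHETICAL_WEIGHTS)
```

Under these placeholder weights, a cluster that prioritises cleanliness pulls the score of this record above the plain average of its six ratings.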
Result and discussion
In this study, to decrease data sparsity and improve the effectiveness of the experiment results, we set rules to select active users and hotels that met the requirements of our experiment. We retained users with at least three consumption records and hotels rated by at least 100 users. This screening produced a compact data set, which we named HM-3-100. Next, we employed a five-fold cross validation procedure to eliminate the negative effects of randomness. In each fold, the data were divided into two parts: 80% constituted the training set, and the remaining 20% was the test set. Then, we conducted a series of experiments to estimate the accuracy of the proposed method.
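The screening and splitting steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: ratings are taken to be (user, hotel, rating) tuples, the function names are hypothetical, and the filter is applied in a single pass, because the text does not say whether the screening was repeated until both thresholds held simultaneously.

```python
import random
from collections import Counter


def filter_active(ratings, min_user_records=3, min_hotel_users=100):
    """Keep ratings whose user has at least min_user_records records and
    whose hotel was rated by at least min_hotel_users distinct users."""
    user_counts = Counter(user for user, _, _ in ratings)
    hotel_users = {}
    for user, hotel, _ in ratings:
        hotel_users.setdefault(hotel, set()).add(user)
    return [
        (user, hotel, rating)
        for user, hotel, rating in ratings
        if user_counts[user] >= min_user_records
        and len(hotel_users[hotel]) >= min_hotel_users
    ]


def k_fold_splits(records, k=5, seed=0):
    """Yield k (train, test) pairs; with k=5 each test fold holds about 20%
    of the records and the training set holds the remaining 80%."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
        yield train, folds[i]
```

Averaging a metric over the five (train, test) pairs gives the cross-validated estimate.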
To determine the relationship between the users, hotels and different criteria, we conducted experiments on our preprocessed data set. We compared the proposed method against three other models, namely, the standard CF model, model 1 and model 2 [43].
Standard CF model: This model employs a traditional single-rating-based CF approach that uses the adjusted weighted sum and cosine similarity metric. We used this approach as a baseline to compare a single-rating system with multi criteria recommendation approaches.
Model 1: This model is an aggregation function-based approach using an ordinary least squares regression implemented with the traditional user-based CF approach, which roughly corresponds to the user-based linear regression method. In this regression model, we learned the coefficients of each user.
Model 2: This model is a machine learning technique that uses the traditional user-based CF approach to estimate individual multi criteria ratings, which roughly corresponds to the support vector (SV) regression method that uses the estimates of the user-based SV regression and harmonic mean of two estimates.
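For readers unfamiliar with the baseline, the following Python sketch shows one common form of user-based CF with cosine similarity and an adjusted weighted sum. It is an illustration, not the exact implementation compared in the experiments: the nested-dictionary data layout, the function names and the choice to compute similarity over co-rated hotels only are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity over the hotels both users rated (0.0 if none)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[h] * b[h] for h in common)
    norm_a = math.sqrt(sum(a[h] ** 2 for h in common))
    norm_b = math.sqrt(sum(b[h] ** 2 for h in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(ratings, user, hotel):
    """Adjusted weighted sum: the user's mean rating plus the
    similarity-weighted deviations of neighbours from their own means."""
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or hotel not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        mean_o = sum(other_ratings.values()) / len(other_ratings)
        numerator += sim * (other_ratings[hotel] - mean_o)
        denominator += abs(sim)
    # Fall back to the user's own mean when no neighbour rated the hotel.
    return mean_u if denominator == 0 else mean_u + numerator / denominator
```

Because this baseline sees only one rating per user-hotel pair, it cannot distinguish a user who values location from one who values cleanliness, which is the limitation the multi criteria models aim to address.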
Numerous metrics can be used to evaluate the performance of RSs. In our work, we utilised a representative evaluation protocol and accuracy metrics to estimate the recommendation effect of the proposed method. We evaluated the proposed method from two aspects: (1) prediction quality, measured by the mean squared error (MSE), coefficient of determination (R2), mean absolute error (MAE) and root mean squared error (RMSE), and (2) recommendation performance, measured by precision, recall and the F measure.
Typically, a prediction model can be evaluated using the MSE and R2 measures. The R2 measure has one comparative advantage over the MSE: its value typically lies in [0, 1], which makes it easier to interpret than the scale-dependent MSE. The closer the R2 value is to 1, the more precise the prediction. To assess prediction quality, we also employed the RMSE and MAE, both of which measure the deviation between predicted and actual ratings.
The expressions of the MSE, R2, MAE and RMSE are shown as Equations (11), (12), (13) and (14), respectively.
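The four measures can be computed as in the following Python sketch. It uses the standard textbook definitions, which are assumed to match Equations (11) to (14) up to notation; the function name is hypothetical.

```python
import math


def prediction_metrics(actual, predicted):
    """Return MSE, R2, MAE and RMSE for paired actual and predicted ratings."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        # R2 is undefined when every actual rating is identical.
        "R2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(mse),
    }


metrics = prediction_metrics([1, 2, 3, 4, 5], [1, 2, 3, 4, 3])
```

Lower MSE, MAE and RMSE values and a higher R2 value all indicate a smaller deviation between predicted and actual ratings.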
We randomly generated five groups of test sets for each customer group to predict the users’ rating preferences and ran each of the methods described above on them. Table 8 displays the average MSE, R2, RMSE and MAE of each model. As shown in Table 8 and Fig. 7, the MSE, RMSE and MAE values of our proposed model were lower than those of the three comparative models, and its R2 value was higher. This result may partly depend on the data set used in our experiment, but it emerged mainly because we calculated the comprehensive score of each user based on the clustering results to eliminate the deviation between the overall rating score and the detailed rating scores. The proposed hotel recommendation method therefore performed better than the three other models and generated the smallest prediction deviation.
Fig. 7. Deviation of prediction.
Efficiency of recommendation
Meanwhile, we also utilised precision, recall and the F measure to evaluate the quality of hotel recommendations. Recall indicates the ability of a system to present all relevant items. In reality, retrieving all the relevant items from a collection may be impossible, especially when the collection is large. A system may be able to retrieve a proportion of the total relevant items. Thus, the performance of a system is often measured by the recall ratio, which denotes the percentage of the relevant items retrieved in a given situation. Precision implies the ability of a system to present only the relevant items, which relates to its ability to not retrieve non-relevant items. This factor, that is, how far a system can withhold unwanted items in a given situation, is measured by the precision ratio. The precision and recall measures are expressed by Equations (15) and (16), respectively.
The F measure is defined as the harmonic mean of precision and recall and is also widely used to evaluate the quality of recommendations. We used the F1 metric in our evaluation, as shown in Equation (17).
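The following Python sketch shows how the three measures can be computed for a top-N recommendation list. It uses the standard definitions, which are assumed to agree with Equations (15) to (17); the function name and the choice to divide precision by the number of hotels actually returned are assumptions.

```python
def precision_recall_f1_at_n(recommended, relevant, n):
    """Precision, recall and F1 for the first n recommended hotels.

    recommended: hotel ids ranked from best to worst.
    relevant: set of hotel ids the user actually found relevant.
    """
    top_n = recommended[:n]
    hits = len(set(top_n) & set(relevant))
    # Precision divides by the hotels actually returned (at most n).
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


ranked = ["h1", "h2", "h3", "h4"]
liked = {"h2", "h4", "h9"}
p2, r2, f2 = precision_recall_f1_at_n(ranked, liked, 2)
p4, r4, f4 = precision_recall_f1_at_n(ranked, liked, 4)
```

As N grows, more relevant hotels fall inside the list, so recall rises, while precision tends to fall because the list also admits more non-relevant hotels.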
In our experiments, we computed recall, precision and the F measure for recommendation list sizes of @5, @10, @15, @20, @30, @40 and @50. Possibly because of the data set, recall and precision often took a null value at @5, @10 and @15, whereas at @40 and @50, the recall value was close to 1, and the precision value decreased quickly. This finding meant that the top-N hotels we predicted intersected with the relevant hotels only when N was between @15 and @30. Therefore, we mainly report the results at @20 and @30.
Table 9 and Fig. 8 present the precision, recall and F measure results. From Table 9 and Fig. 8, we can see that the recall@20, recall@30, precision@20, F1@20 and F1@30 values generated by our proposed model were higher than those generated by the three other recommender models, which meant that our proposed model can generate better recommendation results.

Fig. 8. Efficiency of recommendation.
Our experiments revealed that considering multi criteria ratings can produce more accurate prediction and more effective recommendation results than considering only single overall ratings. At the same time, the results of our experiments confirmed that our proposed model demonstrated better performance than the standard CF model, model 1 and model 2. Thus, our proposed model can help users find hotels that match their preferences. As validated in this section, the MSE, RMSE and MAE values of our proposed model were the lowest among the benchmarked models, and its R2 value was the highest. The recall, precision and F measure results likewise confirmed that our proposed model delivered more effective individual recommendations than the other models did. Our experiment results indicated that our proposed model can generate positive hotel recommendations for users.
In summary, this research sheds light on the theoretical development of the use of user feedback in hospitality and hotel recommendation methods. We showed that user feedback on hotels can be used effectively to help online users find the most relevant hotels tailored to their preferences. In addition, from the data analysis of a major e-tourism platform, namely, Tripadvisor.com, we found that multi criteria ratings can benefit RSs in the hospitality sector. This result is also supported by previous research on the use of multi criteria ratings in intelligent recommender agents, which showed that multi criteria ratings can significantly enhance the performance of recommender agents in the tourism context and accordingly improve customer satisfaction in using these systems. Hence, in connection with tourists’ trips, collecting tourist feedback on additional hotel dimensions by including more choices in rating forms would be worthwhile to develop highly interactive RSs. Accordingly, our recommendation scenario can lead to improvements in hotel RS accuracy in community-based sites such as Tripadvisor.com.
In hotel RSs, finding the appropriate hotels based on related hotels in huge user rating sets can generate advantages for hotel service providers and customers in terms of decision making. In addition, motivated by the lack of analysis on deviation between users’ overall rating and detailed criteria ratings in hospitality and tourism studies, we provided a method to address this problem based on a large amount of user feedback data on hotel features and predicted patterns, which is important from theoretical and practical perspectives. One of the advantages of the proposed method is its capability to address the deviation between tourists’ overall rating and criteria ratings for hotels. Therefore, the proposed recommendation method can accurately recommend relevant hotels to tourists and enable hotel managers to effectively formulate sales and marketing strategies to improve service quality and attract tourists.
Based on the above discussion, considering the deviation between the overall rating and detailed ratings can benefit users, hotel managers and Tripadvisor.com. When choosing hotels, users should consult not only the overall rating but also the detailed criteria ratings and weigh them together. Meanwhile, hotel managers can make timely adjustments and improvements based on user feedback and hotel ratings. Finally, Tripadvisor.com should adjust its rating mechanism to address the deviation stated above and thereby make the mechanism more rational.
Although the findings of the current research are interesting, future hotel RSs can add other contextual features from different perspectives, including time, season, weather, location, social relationships and the hotel context. This improvement can help hotels offer an accurate and effective RS and unique and satisfactory experiences to customers. Finally, this research is a step towards tourism development and contributes to the hotel industry and society.
Conclusion and future works
This study proposes a novel hotel recommendation method to improve the quality of hotel recommendations and help tourists find hotels that match their preferences on Tripadvisor.com. In terms of the challenges in multi criteria ratings in hotel RSs, such as the deviation between the overall rating and detailed ratings, we employ the K-means algorithm and obtain a comprehensive score to improve recommendation efficiency without affecting the quality of the data set. The contributions of our work are described as follows. Firstly, a comprehensive score is calculated, and the deviation between the overall rating and detailed criteria ratings is addressed. Secondly, the overall rating and detailed criteria ratings are integrated in our proposed hotel model using the K-means technique to classify users into different clusters. Lastly, a case study on real-world data from Tripadvisor.com indicates that our proposed hotel recommendation model demonstrates better performance than the three other methods.
This study contributes to the hotel industry because our proposed model aims to help users find hotels that match their preferences on Tripadvisor.com and to help hotel managers provide accurate services to users. However, numerous challenges remain. In this study, we did not consider hotel information such as location, social information and detailed context. Nevertheless, the proposed model can be readily combined with other methods (e.g. fuzzy tools and fuzzy theory) to incorporate such information and to solve problems such as diversity and natural noise. In the future, we will further explore this subject.
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 71971221), the Hunan Provincial Natural Science Foundation of China (Nos. 2018JJ3697 and 2020JJ4440) and the Education Department of Hainan Province (Hnky2019-74).
