Abstract
With the extensive use of rating systems on the web, and their significance in users’ decision making, the need for more accurate aggregation methods has emerged. The Naïve aggregation method, which uses the simple mean, is no longer adequate for providing accurate reputation scores for items [6]; hence, several studies have been conducted to provide more accurate alternative aggregation methods. Most current reputation models do not consider the distribution of ratings across the possible rating values. In this paper, we propose a novel reputation model that generates more accurate reputation scores for items by deploying the normal distribution over ratings. Experiments show promising results for our proposed model compared with state-of-the-art ones on sparse and dense datasets.
Introduction
People increasingly depend on online information to decide whether to trust a specific object. Reputation systems are therefore an essential part of any e-commerce or product review website, where they provide methods for collecting and aggregating users’ ratings in order to calculate overall reputation scores for products, users, or services [12]. The reputation scores on these websites help people decide whether to buy a product, use a service, and so on, and thus play a significant role in users’ decision making.
Many existing reputation models focus on sparse datasets, on the assumption that the accuracy of reputation scores suffers when items lack sufficient ratings. Other models focus on the robustness of the reputation score, i.e., the value should not be easily affected by malicious reviews [6]. In general, the majority of recently proposed reputation systems involve other factors besides the ratings, such as the time when the rating was given or the reputation of the user who gave it. Usually, these data are incorporated with the ratings as weights during aggregation, yielding a weighted average. These factors can easily be combined with our proposed methods.
One of the challenges facing any reputation model is its ability to work with different datasets, whether sparse or dense. Within any dataset some items may have rich rating data, while others, especially new ones, have a low number of ratings. Sparse datasets are those that contain a high percentage of items with few ratings, or of users who have rated few items. However, with the increasing popularity of rating systems on the web in particular, sparse datasets become denser over time as ratings accumulate. Current reputation models focus on methods that work well with sparse datasets, assuming that only these require attention. However, the accuracy of these models decreases as dataset density increases. This points to the need for a general reputation model that provides more accurate reputation scores with any dataset, no matter how sparse or dense it is.
On the other hand, most existing reputation models do not consider the distribution of ratings. People usually differ in leniency when rating an item, depending on their preferences and expectations. For example, a lenient user may rate an item 5 stars despite holding a minor negative opinion about it, while a stricter user may rate an item 4 stars because he is more difficult to satisfy. We believe that the reputation system must acknowledge that both ratings are positive ones. Given the previous example, if we use the rating scale [1–5], then the values of
In this paper, we propose to consider the frequency of ratings in the rating aggregation process in order to generate reputation scores. The purpose is to enhance the accuracy of reputation scores for any dataset, whether dense or sparse. The proposed methods are weighted average methods in which the weights are assumed to reflect the distribution of ratings in the overall score. An important contribution of this paper is a method to generate the weights based on the normal distribution of the ratings. We evaluate the accuracy of our results using a rating prediction system, and we compare them with state-of-the-art methods. Our methods show promising results with any dataset, no matter how dense it is.
In the rest of this paper, we will first introduce a couple of existing product reputation models briefly in Section 2, and then we will explain the proposed methods to calculate reputation scores for products in Section 3. We will also provide detailed experiments and results evaluation in Section 4 in order to prove the significance of our proposed method. Finally in Section 5 we conclude the paper.
Related work
Reputation systems have been used with many objects, such as webpages, products, services, users, and also in peer-to-peer networks, where they reflect what is generally said or believed about the target object [9]. Different objects have different factors that may affect their reputation values, although some factors, such as time, are common to all of them. Specifically, an item’s reputation is calculated from the ratings given by many users using a specific aggregation method. Garcin et al. [6] analyzed rating aggregators, including the arithmetic mean, weighted mean, median and mode, against factors such as robustness and informativeness. The authors suggested that using the median or the mode is more robust than using the arithmetic mean or the weighted mean.
Many methods used weighted average as an aggregator for the ratings, where the weight can represent user’s reputation, time when the rating was given, or the distance between the current reputation score and the received rating. Shapiro [15] proved that time is important in calculating reputation scores; hence, the time decay factor has been widely used in reputation systems [4,7,11,16]. For example, Leberknight et al. [11] discussed the volatility of online ratings, where the authors aimed to reflect the current trend of users’ ratings. They used weighted average where old ratings have less weight than current ones. On the other hand, Riggs and Wilensky [13] performed collaborative quality filtering, based on the principle of finding the most reliable users. Their proposed method is a weighted average using user’s reliability as its weight, which is defined as the ability of a user to provide a rating for an item that is close to the average of this item’s ratings, given by all users.
One of the baseline methods we use in this paper is proposed by Lauw et al., which is called the Leniency-Aware Quality (LQ) Model [10]. This model is a weighted average model that uses users’ ratings tendency as weights. Rating tendency is a value that reflects how users tend to give higher ratings than others. The authors classified users into lenient or strict users based on the leniency value calculated according to Eq. (1), and the leniency value is used as a weight for the user’s ratings when they are used to calculate a reputation score of an item using Eq. (2).
where
Another baseline model that we use was introduced by Jøsang and Haller, which is a multinomial Bayesian probability distribution reputation system based on Dirichlet probability distribution [7]. This model is probably the most relevant method to our proposed method because this method also takes into consideration the count of ratings. The model introduced in [7] is a generalization to their previously introduced binomial Beta reputation system [8]. The authors indicated that Bayesian reputation systems provide a statistically sound basis for computing reputation scores. They use a cumulative vector
Where σ represents the overall reputation value,
Fuzzy models are also popular for calculating reputation scores, because fuzzy logic provides rules for reasoning with fuzzy measures, such as trustworthiness, which are usually used to describe reputation. Sabater & Sierra proposed the REGRET reputation system [14], which defines a reputation measure (and its reliability) that takes into account the individual dimension, the social dimension and the ontological dimension. Bharadwaj and Al-Shamri [5] proposed a fuzzy computational model for trust and reputation; their model uses the beta reputation model proposed by Jøsang [8] to calculate the reputation of a user. According to them, the reputation of a user is defined as the accuracy of his predictions of other users’ ratings of different items. The authors also introduced a reliability metric, which represents how reliable the computed score is.
In general, some of the proposed reputation systems compute reputation scores based on the reputation of the user or reviewer, or they normalize the ratings by the behaviour of the reviewer. Other works suggested adding volatility features to ratings. To our knowledge, most of the aggregation methods currently used in reputation systems do not reflect the distribution of ratings towards an object, which is important in determining the reputation of the object [2]. Moreover, there are no general methods that are robust with any dataset and always generate accurate results regardless of whether the dataset is dense or sparse. For example, the LQ model [10] performs well with sparse datasets only, and the Jøsang and Haller model [7] generates more accurate reputation scores for items with few ratings.
Normal distribution based reputation model (NDR)
In this section we introduce a new aggregation method to generate product reputation scores. Before explaining the method in detail, we present some definitions. First, in this paper we use the arithmetic mean as the Naïve method. Second, the term “rating levels” refers to the number of possible rating values that a user can assign to a specific item. For example, considering the well-known five stars rating system with possible rating values of
As mentioned previously, the weighted average is the most frequently used method for ratings aggregation, while the weights usually represent the time when the rating was given, or the reviewer reputation. In the simplest case, where we don’t consider other factors such as time and user credibility, the weight for each rating is
Comparing weights of each rating level between Naïve and NDR methods
Our initial intuition is that rating weights should relate to the frequency of rating levels, because the frequency represents the popularity of users’ opinions towards an item. Another important fact that we take into consideration in deriving the rating weights is the distribution of ratings. Without loss of generality, and as with many “natural” phenomena, we can assume that the ratings follow a normal distribution. Usually a middle rating level, such as 3 on a [1–5] rating scale, is the most frequent rating level (we call these levels “Popular Rating Levels”), and 1 and 5 are the least frequent levels (we call these levels “Rare Rating Levels”). By taking both the rating frequency and the normal distribution into consideration, we propose to ‘award’ more frequent rating levels, especially popular rating levels, and to ‘punish’ less frequent rating levels, especially rare rating levels.
Table 1 shows the difference between the Naïve method and the proposed Normal Distribution based Reputation Model (NDR) which will be discussed in Section 3.1. From the second column in Table 1 (i.e., Weight per rating), we can notice that using the Naive method the weight for each rating is fixed which is
Our method can be described as weighted average where the weights are generated based on both rating distribution and rating frequency. As mentioned before, we use a normal distribution because it represents many “natural” phenomena. In our case, it will provide different weights for ratings, where the more frequent the rating level is, the higher the weight the level will get. In other words, using this weighting method we can assign higher weights to the highly repeated ratings, which we believe will reflect more accurate reputation tendency.
Suppose that we have n ratings for a specific product P, represented as a vector
Equation (6) is used to evenly deploy the values of
The purpose of using such these values for
Figures 1 and 2 show the weights generated for the previous example by the Naïve method and the proposed NDR method, where the left-most region represents the overall weight for rating level 2, and the middle and right-most regions represent rating levels 3 and 5, respectively. We can see that the weights for all ratings are the same in Fig. 1, while in Fig. 2 the ratings whose index is near the middle are given higher weights.
In order to calculate the final reputation score, which is affected by the ratings and the weights, we need to sum the weights of each level separately. To this end, we partition all ratings into groups based on levels,
The final reputation score is calculated as weighted average for each rating level using Eq. (8), where
Equation (9) calculates level weights

Average Method Weights for the 7 ratings example.

NDR Normalized Weights for the 7 ratings example.
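The NDR aggregation described in this section can be illustrated with a minimal Python sketch. It is not the exact formulation of Eqs. (6)-(9): the sketch assumes that the sorted ratings are placed at evenly spaced positions over [1, k], that the normal curve is centred at (k+1)/2, and that σ = (k-1)/4. These settings, the function name `ndr_score`, and the seven sample ratings (chosen only to cover levels 2, 3 and 5) are illustrative assumptions.

```python
import math
from collections import defaultdict


def ndr_score(ratings, levels=5, sigma=None):
    """Sketch of a normal-distribution-weighted reputation score.

    Assumptions (not taken from the paper's equations): positions are spread
    evenly over [1, levels], the curve is centred at (levels + 1) / 2, and
    sigma defaults to (levels - 1) / 4.
    """
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one rating is required")

    ordered = sorted(ratings)            # ratings near the middle index get more weight
    mu = (levels + 1) / 2.0
    if sigma is None:
        sigma = (levels - 1) / 4.0       # assumed spread

    # Evenly deploy one position per rating over [1, levels].
    if n == 1:
        positions = [mu]
    else:
        positions = [1 + (levels - 1) * i / (n - 1) for i in range(n)]

    # Unnormalised normal density; the constant factor cancels after normalisation.
    raw = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for x in positions]
    total = sum(raw)
    weights = [w / total for w in raw]   # normalised weights sum to 1

    # Sum the weights of each rating level separately.
    level_weights = defaultdict(float)
    for rating, weight in zip(ordered, weights):
        level_weights[rating] += weight

    # Weighted average over rating levels.
    score = sum(level * lw for level, lw in level_weights.items())
    return score, dict(level_weights)


ratings = [2, 2, 3, 3, 3, 5, 5]          # hypothetical seven-rating example
score, level_weights = ndr_score(ratings)
print(round(score, 3), level_weights)
```

Under these assumed settings the three ratings at level 3 occupy the middle positions and receive most of the weight, so the resulting score lies between 3 and the arithmetic mean of the same ratings.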
In this section we slightly modify the proposed NDR method by incorporating the uncertainty principle introduced in the Dirichlet method of Jøsang and Haller [7]. This enhancement is important for dealing with sparse datasets because, when the number of ratings is small, the uncertainty is high. The enhanced method, NDRU, is expected to combine the advantages of both reputation models, i.e., the NDR method and the Dirichlet method. Inspired by the Dirichlet method in [7], the NDRU reputation score is calculated using Eq. (10), which takes uncertainty into consideration:
The NDRU method will reduce the effect of complimenting popular rating levels and depreciating rare rating levels process done by the NDR model. We can say that in all cases if the NDR method provides higher reputation scores than the Naïve method, then the NDRU method will also provide higher reputation scores but marginally less than the NDR ones and vice versa. However, as we have mentioned before, in the case of having a small number of ratings per item, the uncertainty will be higher because the base rate b is divided by the number of ratings plus a priori constant
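The effect of the uncertainty term can be sketched in Python under one assumed form of the smoothing: each level weight is blended with a uniform base rate b = 1/k through an a priori constant, in the style of a Dirichlet reputation score. The blending formula, the default constant of 2, the function name `ndru_score` and the sample level weights are assumptions made for illustration and may differ from Eq. (10).

```python
def ndru_score(level_weights, n, levels=5, prior_constant=2.0):
    """Sketch of an uncertainty-smoothed score (assumed form).

    level_weights: dict mapping rating level -> normalised weight (sums to 1)
    n: number of ratings behind those weights
    prior_constant: assumed a priori constant controlling the pull towards
                    the uniform base rate
    """
    base_rate = 1.0 / levels             # uniform base rate b = 1/k (assumption)
    score = 0.0
    for level in range(1, levels + 1):
        lw = level_weights.get(level, 0.0)
        # Blend the observed level weight with the base rate; with few
        # ratings the base rate dominates, with many ratings it vanishes.
        smoothed = (n * lw + prior_constant * base_rate) / (n + prior_constant)
        score += level * smoothed
    return score


weights = {2: 0.2, 3: 0.6, 5: 0.2}       # hypothetical NDR level weights
print(ndru_score(weights, n=7))          # few ratings: noticeable smoothing
print(ndru_score(weights, n=7000))       # many ratings: close to the unsmoothed score
```

In this assumed form the smoothed weights still sum to 1, and the score approaches the unsmoothed weighted average as the number of ratings grows, which matches the behaviour described for small and large numbers of ratings.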
There are no globally acknowledged evaluation methods for appraising the accuracy of reputation models. We therefore choose to assess the proposed model with regard to the accuracy of the generated reputation scores and to how the items are ranked. Hence, we conducted two experiments in this research. The first experiment predicts an item’s rating using the item’s reputation score generated by a reputation model. The hypothesis is that the more accurate the reputation model, the closer the scores it generates are to actual users’ ratings. For one item, we use the same reputation score to predict the item’s rating for different users. The mean absolute error (MAE) metric is used to measure the prediction accuracy.
The second experiment aims to show that the proposed method produces different results from the Naïve method in terms of the final list of items ranked by reputation. If the order of the items in the two ranked lists generated by the Naïve and NDR methods is not the same, we say that our method is significant. In this part, we consider a reputation model that generates a list of items in the same order as the list generated by the Naïve method to be of no use, because it adds no new value. We use the Kendall tau coefficient to measure the association between the two ranked lists.
Datasets
The dataset used in this experiment is the MovieLens dataset obtained from
Used datasets statistics
In the experiments conducted in this research, we select two well known metrics to evaluate the proposed methods.
Mean Absolute Error (MAE)
The mean absolute error (MAE) is a statistical accuracy metric used to measure the accuracy of rating prediction. This metric measures the accuracy by comparing the reputation scores with the actual movie ratings. Equation (11) shows how to calculate the MAE.
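As an illustration of this metric, the following Python sketch computes the MAE as the average absolute difference between predicted and actual ratings, which is the standard definition. The function name `mean_absolute_error` and the sample values are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one rating pair is required")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical example: a movie's reputation score is used as the prediction
# for every test rating of that movie.
reputation_scores = [3.5, 3.5, 4.0]      # predictions (one per test rating)
actual_ratings = [4, 3, 5]               # ratings found in the test set
print(mean_absolute_error(reputation_scores, actual_ratings))
```

A lower value indicates that the reputation scores are closer to the users’ actual ratings.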
Kendall tau coefficient is a statistic used to measure the association between two ranked lists. In other words, it evaluates the similarity of the orderings of the two lists. Equation (12) shows how to calculate Kendall Tau coefficient τ, where it divides the difference between concordant and discordant pairs in the two lists by the total number of pairs
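The description above corresponds to the basic form of the coefficient (often called tau-a), which can be sketched in Python as follows. The sketch assumes that both lists contain exactly the same items with no ties; the function name `kendall_tau` and the sample lists are hypothetical.

```python
from itertools import combinations


def kendall_tau(list_a, list_b):
    """Kendall tau (tau-a) between two rankings of the same items.

    Each list holds the items in ranked order; ties are not handled.
    """
    if set(list_a) != set(list_b) or len(list_a) != len(list_b):
        raise ValueError("both lists must rank the same items")
    n = len(list_a)
    if n < 2:
        raise ValueError("at least two items are required")

    pos_a = {item: i for i, item in enumerate(list_a)}
    pos_b = {item: i for i, item in enumerate(list_b)}

    concordant = discordant = 0
    for x, y in combinations(list_a, 2):
        # A pair is concordant when both lists order x and y the same way.
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


naive_ranking = ["m1", "m2", "m3", "m4"]     # hypothetical movie identifiers
other_ranking = ["m1", "m3", "m2", "m4"]
print(kendall_tau(naive_ranking, other_ranking))
```

Identical orderings give a coefficient of 1 and fully reversed orderings give -1, so lower values indicate that two methods rank the items more differently.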
Ratings prediction
In this experiment we first use the training dataset to calculate a reputation score for every movie. We then use these reputation scores as rating prediction values for all the movies in the testing dataset and compare them with users’ actual ratings in the testing dataset. The underlying assumption is that a reputation value for an item that is closer to the users’ actual ratings of the item is more accurate. The baseline methods we compare with are the Naïve method, the Dirichlet reputation system proposed by Jøsang and Haller [7], and the Leniency-Aware Quality (LQ) model proposed by Lauw et al. [10].
MAE results for the 5 fold rating prediction experiment
The four datasets we use include three sparse datasets (i.e., 4RPM, 6RPM, and 8RPM) and one dense dataset (i.e., ARPM). The three sparse datasets reflect different levels of sparsity. In Table 3, the MAE results for the sparsest dataset, 4RPM, show that the best prediction accuracy was produced by the Dirichlet method. The reason is that, among the five tested methods, the Dirichlet method deals best with the uncertainty problem, which is especially severe for sparse datasets. The proposed enhanced method NDRU achieved the second-best result, which differs only slightly from the Dirichlet result, indicating that NDRU is also good at dealing with uncertainty. By contrast, the proposed NDR method returns the worst result for the sparsest dataset, because with a small number of ratings there are not enough rating frequencies to feed the distribution weighting system.
However, with the less sparse datasets 6RPM and 8RPM, the proposed NDRU method achieved the best results. In more detail, with the 8RPM dataset the NDR method achieves the second-best accuracy, better than all the baseline methods, whereas with the 6RPM dataset it is still worse than the Dirichlet method.
Finally, the last row in Table 3 shows the results of ratings prediction accuracy using the whole MovieLens dataset (ARPM) which is considered a dense dataset. We can see that the proposed method NDR has the best accuracy. Moreover, our enhanced method NDRU achieved the second best result with an extremely small difference of
From the results we can see that the NDR method produces the best results with dense datasets, and that the Dirichlet method is the best with the sparsest dataset. Most importantly, the enhanced NDR method with uncertainty, i.e., the NDRU method, provides good results in every case and can be used as a general reputation model regardless of the sparsity of the dataset. The NDRU method keeps the advantages of both: the advantage of the Dirichlet method when dealing with sparse datasets and the advantage of the NDR method when dealing with dense datasets.
In this experiment, we will compare two lists of items ranked based on their reputation scores generated using the NDR method and the Naïve method. The purpose of this comparison is to show that our method provides relatively different ranking for items from the Naïve method.
The experiment is conducted in 20 rounds, with different percentage of data used every time. In the first round we used a sub-list with only the top
From Fig. 3 we can see that, for all datasets, the more items are taken from the lists, the more similar the order of the items in the lists generated by the two methods. However, users are usually more interested in the top items, so the order of the top items in the lists is more crucial. If we look only at the top 20% of items, we find that the behaviour on the whole dataset ARPM (which is much denser than the other three datasets) differs from that on the three sparse datasets. For the dense dataset, the similarity reaches its minimum when we compare only the top 1% of items, and it increases as we compare larger portions of the dataset. This result indicates that, for the dense dataset, the proposed NDR method ranks the top items differently from the Naïve method.

Kendall similarities between (NDR) and Naïve methods.
On the other hand, with the sparse datasets, the ranking of the top 1% of items shows high similarity between the two lists, which indicates that the top 1% of items are ranked very similarly for the sparse datasets. This can be explained by the sparsity of the dataset: only a limited number of items can be selected for the top 1%, which makes the probability of choosing the same items very high. When we increase the percentage of top items, the similarity decreases sharply. In summary, for all four datasets, the ranking order of the top 20% of items in the lists generated by the Naïve method and by our proposed NDR method is different.
In this work we have proposed a new aggregation method for generating reputation scores for items or products based on customers’ ratings, in which the weights are generated using a normal distribution. The method is further enhanced with an uncertainty component adopted from the work of Jøsang and Haller [7]. The results of our experiments show that our proposed method outperforms the state-of-the-art methods in rating prediction on a well-known dataset. In addition, it ranks items differently from the Naïve method in the list ordered by reputation score. Moreover, our enhanced method proved to generate accurate results with both sparse and dense datasets. In future work, we plan to use this method in different applications such as recommender systems [1,3]. This method can also be combined with other weighted average reputation models that use time or user reputation in order to improve the accuracy of their results.
