Abstract
Service Oriented Architecture (SOA) is built on services, and web service discovery is one of its most widely explored domains. Service reputation plays a vital role in the discovery process when selecting the optimal service from a large pool of functionally equivalent services. A reputation mechanism attempts to forecast the future performance of a service based on its past behavior, generally obtained in the form of feedback ratings submitted by users of the service. These feedback ratings may vary from user to user, either because different users make different subjective judgments or because dishonest users purposely submit unfair ratings. This paper proposes a service reputation measurement approach that accounts for such diversity in rating behavior. To realize this goal, an efficient user credibility assessment methodology has been devised by employing the measure of interrater reliability. Experiments are performed to validate the feasibility and effectiveness of the proposed reputation measurement approach. The experimental results show that the proposed approach can fairly assess service reputations in the presence of various kinds of raters in the system.
Introduction
In Service Oriented Architecture (SOA), with the advent of Web services, the Web has become a platform where applications built from services can be automatically invoked by other Web clients [10]. The W3C defines a Web service as follows: “A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards” [2]. The Web services architecture involves interactions among three entities: service provider, service consumer, and service registry [8].
Due to the popularity of SOA, the number of services on the Web is growing rapidly. This has resulted in many services offering similar functionality [27]. Therefore, selecting the optimal service from a large pool of functionally equivalent services has become a challenging task. Since service providers may not perform according to what they have advertised [11], care must be taken to select a suitable service that satisfies the consumer’s need. A service selection mechanism must not select any service that harms a consumer’s interest. The selection mechanism must explore ‘how well a service will do’ along with ‘what a service can do’. To achieve this, service reputation assessment plays a vital role in the discovery process.
A reputation mechanism attempts to forecast the future performance of a service based on its past behavior [21]. Service reputations indicate the degree of competency of services in the competitive service market. Almost all reputation-oriented service discovery mechanisms evaluate service reputations from the feedback ratings left by consumers after using the services. A feedback rating generally portrays the consumer’s level of satisfaction with an invoked service. In an ideal situation, the feedback ratings of different consumers about a service would agree with each other. However, in a real-world environment, the feedback ratings submitted for a particular service may vary from consumer to consumer for several reasons. Two such reasons are (i) different consumers’ different subjective judgments, and (ii) dishonest consumers’ purposely submitted unfair ratings. Consumers with different mindsets exist in an online open environment, and they may not always carry out the act of rating in an identical fashion. They may exercise their subjective preferences while reporting their feedback ratings about the perceived service quality. On the other hand, the existence of dishonest consumers in such an open environment cannot be overlooked. These consumers submit ratings that do not match the ratings a service deserves. They do so purposely in order to upgrade or downgrade the overall reputation of the service in the business market [1,5]. Given this diversity in rating behavior, there is a strong need for an efficient methodology that counters such situations and helps genuine services maintain high reputation scores in this competitive business market.
Previous approaches [12,13,25] to service reputation measurement dealt only with the presence of dishonest consumers while devising a methodology for user credibility assessment. In contrast, the proposed reputation measurement approach takes into account the presence of both dishonest consumers and consumers with subjective preferences in the system. To this end, an efficient user credibility assessment methodology has been devised by employing the measure of interrater reliability. Interrater reliability is the measure of the degree of similarity between the reported feedback ratings of consumers (raters) of a service. Applying the interrater reliability measure in user credibility assessment ensures that the credibility of each consumer in reporting feedback ratings is appropriately reflected relative to the other consumers of the system. The user credibility is then used to evaluate the reputations of the services. This summarizes the contribution of this paper.
The remaining sections of this paper are organized as follows: Section 2 reviews the related works. Section 3 presents the proposed reputation measurement approach. In Section 4, the experimental results based on simulated scenarios are discussed. Section 5 concludes the paper.
Related work
In this section, prominent service reputation measurement approaches existing in the literature are discussed.
Malik and Bouguettaya [10,12] emphasized shielding reputation-based service discovery systems from unfair or malicious ratings. They proposed a methodology for evaluating the credibility of consumers/raters and thereby assessing the reputation of providers, assigning more weight to honest raters’ ratings and relatively less weight to dishonest raters’ ratings. Their approach is based primarily on the majority rating scheme, in which a particular consumer’s credibility is increased or decreased depending upon the degree of correspondence of the feedback rating provided by the consumer with the majority opinion. They formulated the overall credibility of a consumer as a function of three parameters: the historical behavior of the consumer, the agreement/disagreement of the rating provided by the consumer with the majority rating, and the equivalence of the rating provided by the consumer with the last evaluated reputation value of the provider. In ‘RATEWeb’ [12], the reputation assessment framework also caters for additional reputation evaluation metrics, viz. personalized preferences, temporal sensitivity, and first-hand interaction knowledge.
Hien Trang Nguyen et al. [13] proposed a trust and reputation system for web service selection based on Bayesian network model. The overall reputation of a service is derived based on the weighted integration of three different sources of trust: direct experience opinion of the requester (direct trust), recommendation from other consumers (recommendation trust, subjective view) and QoS monitoring information (conformance trust, objective view). The credibility of a consumer as a rater, in providing recommendations, is calculated as the weighted integration of two factors: usefulness of the rater’s feedbacks (global measure) and similarity of ratings and QoS preferences between the rater and requester (personalized measure).
Yan Wu et al. [25] proposed a two-phase approach to assess the service reputations. To capture the variations in service reputations over time due to change in service quality, a dynamic weight formula (DWF) is proposed in the first phase (when the ratings are believed to be fair) to reflect the current nature of service behavior. To deal with false variations in service reputations because of the effect of unfair ratings submitted by dishonest consumers, the authors in the second phase suggested the use of olfactory response formula (ORF) to suppress the negative impacts of malicious ratings.
In [28], the reputation of a service is assessed as the weighted integration of direct experience based reputation and recommendation based reputation. The recommendation based reputation about a service for a consumer is both similarity (between requester and recommender) and trust (on recommender from requester) weighted average of recommended reputations of the service provided by other consumers.
In [23], the service reputation measurement approach consists of two phases: (i) malicious feedback rating detection to identify unfair ratings purposely submitted by dishonest users by employing cumulative sum (CUSUM) control chart, and (ii) feedback rating adjustment by computing feedback similarity employing Pearson Correlation Coefficient (PCC) to deal with different preferences the users exhibit while reporting their feedbacks.
Talal H. Noor et al. [14] designed CloudArmor (CLOud consUmers creDibility Assessment & tRust manageMent of clOud seRvices), a framework to support reputation-based trust management of services in cloud environments. It includes (i) a Zero-Knowledge Credibility Proof Protocol (ZKC2P) that enables the Trust Management Service (TMS) to prove the credibility of consumer feedback while preserving consumers’ privacy, (ii) a feedback credibility assessment model to shield the services from malicious consumers and to evaluate the trustworthiness of the services, and (iii) an availability model to ensure the availability of the decentralized implementation of the TMS. The trust result (reputation) of a service is evaluated based on all of its attained trust feedbacks weighted by their corresponding credibility scores, the total number of trust feedbacks, and the change rate of trust results of the service during a time period.
The authors of [29] proposed a domain-partition based trust model to minimize trust storage and computing overhead and maximize the ability to detect malicious nodes and their unfaithful trust evaluations in unreliable cloud environments. Domain and cross-domain sliding-windows are designed to store the most recent trust values. Then, an algorithm is developed to compute domain and cross-domain trust values for nodes, and a filter procedure is adopted to remove malicious trust evaluations and malicious nodes from a domain.
Miao Wang et al. [22] presented a High-reliability Multi-faceted Reputation (HMRep) evaluation methodology for online services. In the first phase, they estimated the missing feedback ratings of various services based on the rating behavior of the user and the quality of the service. To efficiently assess the service reputation, their model finds and then eliminates malicious and irresponsible consumers from the system in the second phase. The service reputation is then assessed based on its received feedback ratings and the credibility of the corresponding consumers. In order to properly reflect changes in service quality, their model also makes use of historical information about the services during service reputation calculation.
Okba Tibermacine et al. [18] proposed a three-phase process for evaluating web service reputation. The process applies an adapted HITS (Hypertext Induced Topic Search) algorithm in two rounds to evaluate the user credibility values and service reputation values. In the first phase, the credibility values of the users are evaluated based on the majority voting model. In the second phase, the dispersion of user credibility values is analyzed to identify a threshold that separates honest and malicious users. Then, a refined model excluding malicious users and their feedback ratings is constructed. In the last phase, a second round of the HITS algorithm is applied for the final evaluation of service reputation.
The following two observations are drawn after analyzing the literature:
Many of the reputation measurement approaches have adopted the majority rating scheme while devising a methodology for user credibility assessment. In such a scheme, the user credibility score increases if the consumer’s reported feedback rating matches the majority opinion and decreases otherwise. Consumers of the majority group attain high user credibility scores, while other consumers attain relatively low user credibility scores depending on how far their feedback ratings deviate from the majority opinion. Systems that rely on the majority rating scheme assume that the majority of the consumers are honest and report trustworthy feedback ratings. However, the majority rating scheme may fail to accurately measure the service reputations if malicious consumers form the majority group or the majority raters collude. Under such circumstances, the opinion of the majority group formed by malicious consumers will adversely influence the user credibility scores of different consumers which, in turn, will affect the fair feedback ratings of honest consumers.
The reputation measurement approaches existing in the literature consider two types of consumers/raters: honest and malicious. Honest consumers report feedback ratings on par with the perceived service quality. On the other hand, malicious consumers purposely submit distorted feedback ratings about the perceived quality. However, a third type of consumer also exists in the system: those who provide subjective judgments [19]. This type has not been explored by the research community. Raters of this type are also honest in nature, but exercise their subjective preferences while reporting feedback ratings about the perceived service quality. Therefore, if not treated separately, they will also be considered malicious consumers, because their fair feedback ratings can vary depending on their subjective preferences. Thus, these consumers will be assigned unjustifiably low user credibility scores compared to the honest consumers.
Owing to the above facts, a user credibility assessment methodology has been proposed in this paper which
(i) avoids dependency on majority opinion during the assessment of user credibility scores, and (ii) considers the existence of consumers with subjective judgments along with honest and malicious consumers.
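The weakness of the majority rating scheme noted in the first observation can be illustrated with a small sketch. It assumes a deliberately simple scheme, not the method of any cited work: the majority opinion is taken to be the median rating, and a rater's credibility falls linearly with the distance from it. The function name `majority_credibility` and the scale bounds are illustrative assumptions.

```python
from statistics import median

def majority_credibility(ratings, scale_min=1, scale_max=10):
    """Credibility of each rater under a toy majority scheme (an assumption).

    The majority opinion is the median rating; credibility is 1 minus the
    rater's distance from it, normalized by the width of the rating scale.
    """
    majority = median(ratings)
    width = scale_max - scale_min
    return [1 - abs(r - majority) / width for r in ratings]

# A service that honest raters score 8, while five colluding raters score 2.
ratings = [8, 8, 8, 2, 2, 2, 2, 2]
cred = majority_credibility(ratings)
honest_cred, malicious_cred = cred[0], cred[-1]
print(f"honest: {honest_cred:.3f}, malicious: {malicious_cred:.3f}")
```

Because the colluding raters outnumber the honest ones, the median sits at their rating, so they receive full credibility while the honest raters are penalized.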
Reputation measurement approach
Preliminaries
User credibility assessment
One of the most important factors that must be addressed for the success of reputation-oriented service discovery in feedback-based systems is the efficient assessment of user credibility. In an online open environment, where user activities are mostly vague, user credibility plays a vital role in understanding service reputation. User credibility represents the quality of being trustworthy in providing feedback ratings [11]. It is good practice to give more weight to the feedback ratings of highly credible users and relatively less weight to the feedback ratings of less credible users [7,26].
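The weighting practice described above can be sketched as a credibility-weighted mean. This is a minimal illustration only: the assumption that credibility scores lie in [0, 1], the function name `weighted_reputation`, and the sample values are not taken from the proposed model.

```python
def weighted_reputation(ratings, credibilities):
    """Credibility-weighted mean of the feedback ratings of one service."""
    if len(ratings) != len(credibilities):
        raise ValueError("ratings and credibilities must have equal length")
    total = sum(credibilities)
    if total == 0:
        raise ValueError("at least one rater must have non-zero credibility")
    return sum(r * c for r, c in zip(ratings, credibilities)) / total

ratings = [8, 7, 2]               # feedback ratings on a 10-point scale
credibilities = [0.9, 0.8, 0.1]   # hypothetical credibility scores
rep = weighted_reputation(ratings, credibilities)
plain = sum(ratings) / len(ratings)
print(f"weighted: {rep:.2f}, unweighted: {plain:.2f}")
```

The low-credibility rating of 2 pulls the unweighted mean down far more than it pulls the weighted one.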
Reputation-oriented service discovery relies on feedback-based models, in which the rating scale is the most prominent measuring instrument: a consumer expresses its level of satisfaction with the perceived quality of a service by selecting a point on the scale. This yields sets of feedback ratings for the different services that consumers of the system have actually used. These feedback ratings about a service may vary from consumer to consumer for a variety of reasons. Two such reasons, which the proposed model addresses simultaneously, are: (i) different consumers’ different subjective judgments about the same service, and (ii) dishonest consumers’ malicious ratings.
Though feedback ratings are considered to be the rich sources of data to express satisfaction levels about perceived service quality, they might be subject to various sources of bias and error. As there is a lack of well-guided systematic process for consumers of the system to follow while reporting their feedback ratings, it is obvious that these consumers may not carry out the act of rating in an identical fashion. In online open environment, different consumers of different mindset, residing in distributed locations, can arbitrarily join the system. Therefore, it is not an easy job to bring them all in a common platform to train them on how to use the rating scale and how to perform the act of feedback rating for different qualities of services they might experience while availing the services in real time.
A consumer may appraise the performance of a service as satisfactory. Another consumer may report the same performance of the same service as moderate, while yet another may find it unsatisfactory. Although the consumers have reported their feedback ratings honestly, the disparities in their judgments about the same service may have occurred because of their varying satisfaction levels, which resulted in the selection of different points on the rating scale. It is worth noting that subjective judgments may differ from consumer to consumer and contain a high degree of personal preference. So, the system must accept that consumers in such an open environment are heterogeneous in nature and might exercise their subjective preferences while reporting their feedback ratings about the perceived service quality.
A consumer might not give much effort in choosing the appropriate point on the rating scale in accordance with the perceived service quality. The consumer might arbitrarily select any point on the scale only for the sake of performing the act of feedback rating. Some of the probable reasons could be that the consumer is unaware of the use of the rating scale, or unable to differentiate between the relative points on the rating scale, or does not actually know which point on the rating scale is to be chosen for the perceived service quality. In this way, some consumers introduce some errors in the feedback rating data [15].
In addition, the existence of dishonest consumers in such open environment cannot be overlooked. These consumers submit varying ratings that are not in accordance with the deserving ratings of a service. They do it purposely in order to upgrade or downgrade the overall reputation of the service in the business market [1,5]. Several other unknown, irrelevant factors also influence the judgment process that can distort the validity of feedback ratings leading towards erroneous evaluation of service reputation.
Taking all these facts together and supposing that, for each service, every datum recorded on the rating scale is the result of a subjective judgment of the consumer, an honest rating, a malicious rating, or a rating error, generalizing each service’s set of feedback ratings is important for precisely evaluating the service reputation. To realize this goal, the measure of interrater reliability or interrater agreement is employed to devise a methodology for user credibility assessment.
Interrater reliability
Interrater reliability or agreement is the measure of the degree of similarity between the reported feedback ratings of consumers (raters) of a service. It defines how much consensus or consistency there is in the feedback ratings reported by various consumers as raters. In the proposed model, the interrater consensus and consistency estimates of interrater reliability are used to assess the user credibility of individual consumers/users. Applying these two interrater reliability measures in user credibility assessment ensures that the credibility of each consumer in reporting feedback ratings is appropriately reflected. The interrater consensus estimate represents the extent to which different consumers tend to report exactly the same feedback ratings about the rated service. On the other hand, the interrater consistency estimate represents the degree to which the feedback ratings of different consumers are proportional when expressed as deviations from their means [19]. In practice, the interrater consistency estimate represents the extent to which different consumers tend to report the same relative ordering or ranking of the rated services. In a nutshell, the interrater consensus estimate refers to the absolute correspondence of feedback ratings, while the interrater consistency estimate refers to the relative correspondence of feedback ratings. It is to be noted that low interrater consensus and high interrater consistency (or vice versa) may exist simultaneously. For instance, one consumer’s reported feedback ratings about services may be consistently one or two scale points higher than another consumer’s reported feedback ratings for the same services. In this example, a high estimate of interrater consistency is observed since the relative ordering of the services is much the same for both consumers, although exact agreement between these consumers is never achieved, which results in low interrater consensus.
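The closing example of this subsection can be checked numerically. The sketch below uses two simple proxies rather than the estimates adopted by the proposed model: the share of exactly matching ratings stands in for consensus, and the Pearson correlation of the ratings (agreement of deviations from each rater's mean) stands in for consistency. The rating vectors are hypothetical.

```python
from math import sqrt

def exact_agreement(a, b):
    """Fraction of services on which two raters gave identical ratings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pearson(a, b):
    """Pearson correlation of two rating lists of equal length."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

rater_a = [3, 5, 6, 8, 4]
rater_b = [5, 7, 8, 10, 6]   # always two scale points higher than rater_a

consensus = exact_agreement(rater_a, rater_b)
consistency = pearson(rater_a, rater_b)
print(f"consensus proxy: {consensus:.2f}, consistency proxy: {consistency:.2f}")
```

The two raters never agree exactly, yet their ratings move together perfectly, which is the low-consensus, high-consistency case described above.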
Proposed framework
Model setup
A service provider can offer any number of services; but for simplicity, this model assumes that a service provider offers only one service. The terms service provider and service are used interchangeably throughout this paper. Similarly, the terms service consumer, user and rater are also used interchangeably in this paper. The model assumes that m number of consumers and n number of services reside in the system. The set of consumers is denoted by
Computing consensus estimate
To compute consensus estimate of interrater reliability, the weighted Kappa coefficient of Cohen [4] has been adopted. In estimating the degree of consensus between two consumers/raters, Cohen’s Kappa (κ) [3] is a measure of “the proportion of agreement after chance agreement is removed from consideration” and is mathematically calculated as given in equation (1).
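As a minimal sketch, assuming the standard textbook form of Cohen's Kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e the proportion expected by chance from the raters' marginal distributions. The function name `cohen_kappa` and the sample ratings are illustrative.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Unweighted Cohen's Kappa for two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must rate the same non-empty item list")
    n = len(a)
    # Observed agreement: share of items with identical ratings.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)

rater_a = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
rater_b = [1, 1, 2, 3, 3, 3, 1, 2, 2, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.34, so Kappa is 0.46 / 0.66.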
κ treats all kinds of disagreements (either by 1 scale point or by more) between any two raters equally, i.e. as total disagreement. But, situations may arise when certain disagreements between two raters are to be penalized more (or less) than others. Thus, to cater for partial agreements, Cohen developed weighted Kappa (
Since, all consumers may not avail all services of the system, complete feedback ratings of the services may not always be available for any two consumers. K. L. Gwet [6] recommends a general formulation of Cohen’s weighted Kappa (
Thus, the weighted Kappa coefficient of Cohen (
A value for
Table 1. Interpretations of various Kappa values as suggested by Landis and Koch
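The idea of weighted Kappa with partial credit can be sketched as follows. Several choices here are assumptions rather than the formulation used in the proposed model: linear agreement weights, restriction to the services that both raters have rated (a simplification, not Gwet's general treatment of missing ratings), and the interpretation bands as they are commonly cited from Landis and Koch. All names are illustrative.

```python
def weighted_kappa(ratings_a, ratings_b, scale_min=1, scale_max=10):
    """Linearly weighted Cohen's Kappa over services rated by both raters.

    ratings_a / ratings_b map a service id to an integer rating.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if not common:
        raise ValueError("the two raters share no rated service")
    n = len(common)
    cats = range(scale_min, scale_max + 1)
    span = scale_max - scale_min

    def weight(i, j):
        # Agreement weight: 1 for identical ratings, 0 for the widest gap.
        return 1 - abs(i - j) / span

    # Marginal proportion of each scale point for each rater.
    pa = {k: sum(ratings_a[s] == k for s in common) / n for k in cats}
    pb = {k: sum(ratings_b[s] == k for s in common) / n for k in cats}
    p_o = sum(weight(ratings_a[s], ratings_b[s]) for s in common) / n
    p_e = sum(weight(i, j) * pa[i] * pb[j] for i in cats for j in cats)
    if p_e == 1:
        raise ValueError("weighted Kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

def landis_koch_label(kappa):
    """Strength-of-agreement label, using the commonly cited bands."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

a = {"s1": 2, "s2": 5, "s3": 8, "s4": 9}
b = {"s1": 3, "s2": 5, "s3": 7, "s5": 4}   # shares s1, s2, s3 with rater a
kw = weighted_kappa(a, b)
print(f"weighted kappa = {kw:.3f} ({landis_koch_label(kw)})")
```

The two raters differ by one scale point on two of the three shared services, so the weighted coefficient credits this near agreement instead of treating it as total disagreement.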
The consensus estimate between two consumers/raters,
To compute consistency estimate of interrater reliability, the Spearman’s rank-order correlation coefficient (denoted by ρ (rho) or
Let
Equation (9) is useful in situations when ranks of services as observed by any of the consumers are all different. However in reality, situations of tied ranks may occur when more than one service are assigned the same rank by a consumer. In such situations, services having tied ranks are assigned a common rank which is the average of the ranks of positions they occupy in the ordered service set of an individual consumer. Thus, taking tied ranks into account, the following version of Spearman’s rank-order correlation coefficient, given by equation (10), has been used in the proposed model to compute the interrater consistency estimate,
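The tie handling described above can be sketched in code. The sketch assigns average ranks to tied services and then takes the Pearson correlation of the two rank lists, which is one standard tie-aware form of Spearman's coefficient; whether it matches equation (10) term for term is an assumption. Function names and sample ratings are illustrative.

```python
from math import sqrt

def average_ranks(scores):
    """Rank scores in descending order; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1   # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        raise ValueError("rho is undefined when a rater rates all services equally")
    return cov / sqrt(va * vb)

rater_a = [9, 7, 7, 4, 2]    # second and third services are tied
rater_b = [10, 8, 9, 5, 3]
rho = spearman(rater_a, rater_b)
print(average_ranks(rater_a), f"rho = {rho:.3f}")
```

The two tied services occupy positions 2 and 3 in the ordered set, so each receives the common rank 2.5.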
A value for
Computing consensus and consistency estimates of single rater vs rest of raters
Until now, the model has computed the consensus estimate,
The consensus estimate of single rater vs rest of raters,
Similarly, the consistency estimate of single rater vs rest of raters,
A value for
Computing user credibility
Both
The service reputation calculation
Having calculated the user credibility,
Experiments and results
Experiments are performed to validate the feasibility and effectiveness of the proposed reputation measurement approach and the results, thus obtained, are reported in this section.
Due to the current limited availability of feedback rating data in service-oriented environment, many existing service reputation measurement approaches [12,17,18,23,25] dealt with simulation data for the purpose of performance evaluation of their models. Therefore, to evaluate the performance of the reputation measurement approach proposed in this paper, the experiments are performed on a simulated environment in which various kinds of consumers interact with the services to generate various kinds of feedback ratings.
Experimental setup
The simulated environment consists of 100 services and 200 consumers. Services with different performance levels (
Consumers/raters also report their feedback ratings about services on a 10-point integer rating scale. Three types of raters exist in the simulated environment: RaterType#1, RaterType#2, and RaterType#3.
RaterType#1: Raters of this type are honest in nature. They always submit feedback ratings on par with the perceived service quality. That is, their submitted feedback ratings always conform to the actual performance levels, PerfLev, of the services.
RaterType#2: Raters of this type are also honest in nature, but always exercise their subjective preferences while reporting their feedback ratings about the perceived service quality. The feedback ratings of this type of raters are simulated with a deviation of
RaterType#3: Raters of this type are dishonest or malicious in nature. They always purposely submit distorted feedback ratings about the perceived service quality. This type of raters is simulated to generate random feedback ratings in the range
Three sets of experiments have been conducted.
Experiment#1: This experiment is conducted assuming that only RaterType#1 and RaterType#2 exist in the system. It includes nine sub-experiments with varying density of RaterType#2 from 10 to 90 percent, with a step of 10 percent.
Experiment#2: It considers the presence of only RaterType#1 and RaterType#3 in the system. It also includes nine sub-experiments with varying density of RaterType#3 from 10 to 90 percent, with a step of 10 percent.
Experiment#3: This experiment is carried out in the presence of all the three types of raters. Here, nine sub-experiments have been performed with varying combined density of RaterType#2 and RaterType#3 from 10 to 90 percent, with a step of 10 percent. In each sub-experiment, the densities of both RaterType#2 and RaterType#3 are equal.
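The setup above can be sketched as a small generator of simulated feedback ratings. Several details are assumptions made for illustration only: the size of the subjective deviation (`subjective_dev`), the assignment of performance levels in ten equal groups, malicious ratings drawn uniformly from the whole scale, and every rater rating every service. The helper names are hypothetical.

```python
import random

SCALE_MIN, SCALE_MAX = 1, 10   # 10-point integer rating scale

def clamp(value):
    """Keep a rating inside the rating scale."""
    return max(SCALE_MIN, min(SCALE_MAX, value))

def rate(rater_type, perf_level, rng, subjective_dev=2):
    """Feedback rating one rater gives a service with the given performance level."""
    if rater_type == 1:      # honest: reports the perceived quality as is
        return perf_level
    if rater_type == 2:      # honest but subjective: bounded deviation (assumed size)
        return clamp(perf_level + rng.randint(-subjective_dev, subjective_dev))
    if rater_type == 3:      # malicious: rating unrelated to perceived quality
        return rng.randint(SCALE_MIN, SCALE_MAX)
    raise ValueError(f"unknown rater type: {rater_type}")

def build_population(n_raters, density_type2, density_type3):
    """List of rater types; densities are fractions of the whole population."""
    n2 = round(n_raters * density_type2)
    n3 = round(n_raters * density_type3)
    if n2 + n3 > n_raters:
        raise ValueError("densities must not exceed 100 percent in total")
    return [1] * (n_raters - n2 - n3) + [2] * n2 + [3] * n3

rng = random.Random(42)
# Experiment#3 sub-experiment with 40 percent combined density, split equally.
population = build_population(200, 0.20, 0.20)
perf_levels = [1 + s % 10 for s in range(100)]   # assumed: ten groups of ten services
ratings = [[rate(t, p, rng) for p in perf_levels] for t in population]
print(len(ratings), "raters x", len(ratings[0]), "services")
```

Experiment#1 and Experiment#2 correspond to calling `build_population` with the density of RaterType#3 or RaterType#2, respectively, set to zero.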
Experimental results
For each of the sub-experiments, 10 rounds of simulations have been conducted and their average has been calculated to obtain the final reputations of the services. In all the experiments, the values for
Table 2. RMSEs recorded by varying the density of RaterType
It can be observed from Table 2 that the RMSEs recorded for any density of RaterType in both Experiment#1 and Experiment#3 are within an acceptable range. In these two experiments, although the obtained RMSEs indicate slight variations in the estimated reputations of services, the actual rankings of the services are not disturbed. The same is true for densities of RaterType#3 up to 60% in Experiment#2. For 70%, 80% and 90% density of RaterType#3 in Experiment#2, the obtained RMSEs are high and the actual rankings of the services are disturbed. This is due to the high percentage of malicious raters in the system. In such situations, the unfair feedback ratings submitted by the large number of malicious raters are taken as true and the fair feedback ratings of the honest raters are suppressed. However, Whitby et al. [24] suggest that the presence of a high percentage of malicious raters in real systems is impractical and a much lower rate of malicious raters should be expected. Thus, it can be concluded that the proposed approach can fairly assess service reputations in the presence of various kinds of raters in the system.
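The two criteria used in this discussion, RMSE and undisturbed rankings, can be sketched as follows. The sketch assumes that RMSE is taken between the estimated reputations and the actual performance levels of the services, and that a ranking counts as disturbed when some pair of services is ordered oppositely by the two lists. Function names and values are illustrative.

```python
from math import sqrt

def rmse(estimated, actual):
    """Root mean square error between estimated reputations and performance levels."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(actual))

def ranking_preserved(estimated, actual):
    """True if no pair of services is ordered oppositely by the two score lists."""
    n = len(actual)
    return all(
        (estimated[i] - estimated[j]) * (actual[i] - actual[j]) >= 0
        for i in range(n) for j in range(i + 1, n)
    )

actual = [2, 4, 6, 8]
estimated = [2.5, 4.5, 5.5, 8.5]
error = rmse(estimated, actual)
print(f"RMSE = {error:.2f}, ranking preserved: {ranking_preserved(estimated, actual)}")
```

A non-zero RMSE can therefore coexist with an undisturbed ranking, which is the situation reported for Experiment#1 and Experiment#3.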

Figure 1. Status of the estimated service reputations compared with the service performance levels for Experiment#2.
Figure 1 illustrates the rankings of the services for every density of malicious consumers (RaterType#3) considered in Experiment#2. It depicts the status of the estimated service reputations in comparison with the service performance levels. For each of the nine sub-experiments, the estimated reputations of all the services that hold the same performance level are averaged and compared against the corresponding service performance level as a whole. This is done to present the comparison compactly, instead of comparing each individual service’s estimated reputation with its performance level. Thus, in all the cases of Fig. 1, the ten groups of services, each corresponding to a distinct performance level, are shown.
In this section, the performance of the proposed approach has been compared with an existing reputation measurement approach from the literature. The comparison has been done under the same simulated environment setting mentioned above.
In [18], the authors proposed two service reputation assessment methods: HITSW (without the neutralization of malicious users) and HITSN (after the neutralization of malicious users). It can be observed that HITSN performs better than HITSW for any density of malicious users in the system. The performance of the proposed reputation measurement approach has been compared with HITSN in terms of RMSE.
HITSN [18] is a three-phase process for service reputation evaluation. The first phase deals with the evaluation of user credibility values which is an iterative process. Initially, considering all the users as honest (that is, all user credibility values are initialized with 1 in the first iteration), the service reputation values are evaluated using equation (18).
After that, the user credibility values are evaluated by equation (19).
The iteration continues until the reputation and credibility values are stabilized or the maximum number of iterations is reached.
In the second phase, the credibility values are sorted and a threshold (θ) is calculated to identify honest and malicious users.
In the last phase, phase 1 is repeated with the refined sets of users and services.
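The three-phase structure described above can be sketched as a fixed-point iteration. The update rules below are assumptions standing in for equations (18) and (19): reputation is taken as a credibility-weighted mean, and credibility as one minus the user's mean absolute deviation from the current reputations, normalized by the scale width. The threshold is supplied by the caller rather than derived from the sorted credibility values. All names are hypothetical.

```python
def iterate(ratings, scale_width=9, tol=1e-6, max_iter=100):
    """Phase 1: alternate reputation and credibility updates until stable.

    ratings[u][s] is the rating user u gave service s (complete matrix assumed).
    """
    n_users, n_services = len(ratings), len(ratings[0])
    cred = [1.0] * n_users                      # every user starts as honest
    rep = [0.0] * n_services
    for _ in range(max_iter):
        total = sum(cred)
        if total == 0:
            raise ValueError("all credibility values collapsed to zero")
        new_rep = [sum(cred[u] * ratings[u][s] for u in range(n_users)) / total
                   for s in range(n_services)]
        new_cred = [1 - sum(abs(ratings[u][s] - new_rep[s]) for s in range(n_services))
                    / (n_services * scale_width) for u in range(n_users)]
        change = max(abs(a - b) for a, b in zip(new_rep + new_cred, rep + cred))
        rep, cred = new_rep, new_cred
        if change < tol:
            break
    return rep, cred

def neutralized_reputation(ratings, theta):
    """Phases 2 and 3: drop users below the threshold, then re-run phase 1."""
    _, cred = iterate(ratings)
    kept = [row for row, c in zip(ratings, cred) if c >= theta]
    if not kept:
        raise ValueError("threshold removes every user")
    return iterate(kept)[0]

ratings = [
    [8, 3, 6],    # three honest users
    [8, 3, 6],
    [8, 3, 6],
    [1, 10, 1],   # one malicious user
]
rep, cred = iterate(ratings)
final = neutralized_reputation(ratings, theta=0.5)
print([round(r, 2) for r in rep], [round(r, 2) for r in final])
```

In this toy case the malicious user's credibility settles below the caller-supplied threshold, so the final round is computed from the honest users only.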
Experiment#4: Two types of users are considered in this experiment: honest and malicious. Honest users submit their feedback ratings in the range
Table 3. Comparison between RMSEs of candidate methods
Table 3 shows the comparison between RMSEs obtained by the proposed service reputation measurement approach and HITSN. It can be observed that the proposed model outperforms HITSN, owing to the fact that the proposed model employs the measures of interrater reliability for the user credibility assessment.
Conclusion
Service reputation measurement is important when selecting the optimal service from a pool of functionally equivalent services. Feedback ratings are considered to be the prominent indicator of service reputation. However, because of the simultaneous presence of both malicious consumers and consumers with subjective preferences, the actual reputation of a service may be distorted. Therefore, this paper proposes a service reputation measurement approach that accounts for the diverse rating behavior of the various types of consumers in the system. The measure of interrater reliability is employed to devise an efficient user credibility assessment methodology. Finally, the service reputation is estimated by giving more weight to the feedback ratings of highly credible consumers and relatively less weight to the feedback ratings of less credible consumers. Experiments performed in a simulated environment show that the proposed approach can fairly assess service reputations in the presence of various kinds of raters in the system.
