Abstract
Objectives:
Fleiss' Kappa (FK) has been commonly, but incorrectly, employed as the “standard” for evaluating chance-removed inter-rater agreement with ordinal data. This practice may lead to misleading conclusions in inter-rater agreement research. An example is presented that demonstrates the conditions where FK produces inappropriate results, compared with Gwet's AC2, which is proposed as a more appropriate statistic. A novel format for recording Chinese Medical (CM) diagnoses, called the Diagnostic System of Oriental Medicine (DSOM), was used to record and compare patient diagnostic data. Unlike the contemporary CM diagnostic format, the DSOM allows agreement by chance to be considered when evaluating patient data obtained with unrestricted diagnostic options available to diagnosticians.
Design:
Five CM practitioners diagnosed 42 subjects drawn from an open population. Subjects' diagnoses were recorded using the DSOM format. All the available data were initially used to evaluate agreement. Then, the subjects were sorted into three groups to demonstrate the effects of differing data marginality on the calculated chance-removed agreement.
Outcome measures:
Agreement between the practitioners for each subject was evaluated with linearly weighted simple agreement, FK and Gwet's AC2.
Results and Conclusions:
In all cases, overall agreement was much lower with FK than with Gwet's AC2. Larger differences occurred when the data were more free marginal. Inter-rater agreement determined with FK statistics is unlikely to be correct unless it can be shown that the data from which agreement is determined are, in fact, fixed marginal. It follows that results obtained on agreement between practitioners with FK are probably incorrect. It is shown that inter-rater agreement evaluated with the AC2 statistic is an appropriate measure when fixed marginal data are neither expected nor guaranteed. The AC2 statistic should be used as the standard statistical approach for determining agreement between practitioners.
Introduction
There is no clear consistent approach to measuring inter-rater agreement in Chinese Medicine (CM). In a recent review of diagnostic reliability, O'Brien and Birch 1 found the following utilization of statistical approaches: Forty-six percent of articles used simple agreement or did not define the statistic employed. Thirty-six percent of articles used Fleiss' Kappa (FK) statistics. Sixteen percent of articles used other statistics such as Kendall's coefficient of concordance or Spearman's correlation rank coefficient.
Lack of detail regarding the statistical methods used in almost half of CM diagnostic reliability studies reviewed and a lack of consistency in reporting statistics make it difficult to determine the actual levels of agreement currently attained in CM practice.
No study investigating agreement between practitioners has been found involving an open population, defined here as subjects who are recruited randomly rather than included because they have a particular, prediagnosed condition. Since many choices are available to express the disease state of a patient, and exact phonetic agreement between practitioners is required if there is to be agreement between raters, such agreement is very unlikely. It seems that to overcome this difficulty, strategies of investigating single disease states and generally restricting practitioners to textbook-decreed diagnostic options 1–11 or investigating single diagnostic factors such as tongue or pulse diagnosis 1,12 have been adopted. This approach has the effect of drastically reducing the number of choices given to raters, thereby causing the level of agreement to be inflated compared with that expected with essentially unrestricted diagnostic options.
Popplewell et al. 13 have shown that simple weighted diagnostic agreement is only 19% when subjects are drawn from an open population. The large numbers of diagnostic choices available to practitioners when contemporary CM diagnostic formats are used do not allow chance-removed statistics to be applied. This study appears to be the first investigation of diagnostic agreement in an open population in CM, or possibly in any modality.
Joshua et al. 14 showed that FK also appears to be the popular choice for determining Western medicine diagnostic reliability, leading to the conclusion that reliability in diagnoses is potentially misunderstood in other medical modalities as well.
The problem with the use of FK arises from the fundamental assumption used in its derivation, namely that the data are uniformly distributed between all the available choices. The term “fixed marginal” 15 is used to describe such data. No diagnostic reliability study found in the literature, in which FK had been used, gave any indication of the marginality of the data used, bringing into question the validity of the results. The only way to obtain data that are fixed marginal appears to be for subjects to be objectively categorized before inclusion in a study. This appears to be a severe restriction on the way experiments can be set up and is certainly artificial.
If the probability of the occurrence of each category is not uniform, the data are called “free marginal.” 16 If FK is used with free marginal data, it significantly overestimates agreement by chance, thereby drastically reducing the actual value of agreement. 16–20 Many researchers have reported low values of Kappa, while at the same time indicating high simple agreement, 8,21–23 without realizing that this was most probably caused by the data being free marginal. What is troubling is that none of these researchers seemed to question the large difference between simple agreement and chance-removed agreement derived with FK.
The AC1 statistic, although first published in 2002, 24 is only just being adopted as a statistic for general inter-rater agreement evaluations. 7,8,25 While there are other competing statistics, 26 including the recently developed PABAK 19 (which can only be used for two raters), Gwet's AC1 seems to be the best option, as it apparently addresses the free marginal data issue. It is important to note that the AC1 statistic is actually a form of Kappa statistic, with a superior approach to the estimate of agreement by chance that accommodates the assessment of free marginal data.
It is logical that a diagnosis of a symptom should report not only its presence but also its severity. An ordinal scale can ideally be used as a means of indicating the acuteness of an illness, thereby also allowing for an accurate recording of marginal changes in a patient's ailment after interventions. A later development of the AC1 statistic for use with ordinal data is the AC2 15,24,25 statistic. The FK statistic 27,28 also accommodates ordinal data, but unfortunately suffers from the same difficulties as the original Kappa statistics regarding the marginality of data that can be used. Another advantage of the ordinal data approach is the possibility of using weighted statistics.
Two studies were found that utilized weighted statistics, 29,30 but both used weighted FK. One article was found using weighted Kappa in radiology, 31 indicating that the misuse of Kappa statistics is not limited to CM. No investigation was found that evaluated agreement with the AC2 statistic.
While there are many more alternatives, three weighting approaches, quadratic, linear, or radical, seem to represent the range of appropriate options for weighting purposes in the present application. Each of these approaches is best illustrated in Figure 1.

Figure 1a–c indicates the level of weighting on a scale 0%–100%, attributed to differences in rater scores. As may be seen in Figure 1a, an almost full agreement is ascribed when quadratic weighting is employed with a one-point difference between raters. However, as shown in Figure 1c, when radical weighting is used for the same score variation, only around 50% agreement is assigned, a condition that occurs at a four-point score difference with quadratic weighting. Linear weighting, Figure 1b, falls between the two extremes. The consequence of employing different weightings is that the highest agreement is obtained with quadratic weighting and the lowest with radical weighting.
While it can be argued that a difference in score of 20% may be huge, for example, in marking a mathematics examination paper, in diagnoses a one-point difference on a six-point scale is generally seen as quite a small difference of opinion, though not as small as indicated by quadratic weighting nor as large as suggested by radical weighting. As a result, the linear weighting option is used in the present study to evaluate agreement between raters.
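The three weighting schemes can be sketched numerically. The snippet is only an illustration: it assumes the commonly used forms of the weights (one minus the squared, plain, or square-rooted score difference, each scaled by the full range of the scale), which may not match the curves of Figure 1 exactly. The function name `weight` and the variable `table` are hypothetical.

```python
from math import sqrt

def weight(diff, q, scheme):
    """Agreement weight for an absolute score difference on a q-point scale.

    Assumed (standard) forms, not taken from the article itself.
    """
    x = abs(diff) / (q - 1)  # difference as a fraction of the full scale
    if scheme == "quadratic":
        return 1.0 - x ** 2
    if scheme == "linear":
        return 1.0 - x
    if scheme == "radical":
        return 1.0 - sqrt(x)
    raise ValueError(f"unknown scheme: {scheme}")

Q = 6  # scores 0..5, a six-point scale

# One row per scheme; column d is the weight for a d-point difference.
table = {
    scheme: [round(weight(d, Q, scheme), 2) for d in range(Q)]
    for scheme in ("quadratic", "linear", "radical")
}

for scheme, row in table.items():
    print(f"{scheme:>9}: {row}")
```

Under these assumed definitions, a one-point difference on a six-point scale receives a weight of 0.96 with quadratic weighting, 0.80 with linear weighting, and about 0.55 with radical weighting, which matches the ordering described above.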
Examples will be presented from the present authors' data collection to highlight the deficiencies of FK. These will also be compared with the more appropriate results obtained with the Gwet AC2 statistic.
Materials and Methods
Ethics approval 32 was sought and obtained from the Human Ethics Committee of the University of Technology, Sydney, to collect diagnostic data from five CM practitioners who rated 42 subjects. The patients did not have a particular illness and hence represented an open population. The Diagnostic System of Oriental Medicine (DSOM) diagnostic format, described and validated by Lee et al., 33 was utilized. The DSOM was originally developed to report women's health.
The DSOM provides a summary of the subject's health by scoring 16 CM diagnostic variables that endeavor to represent the essence of the patient's CM constitution. The DSOM diagnostic format is usually populated from scores derived from the DSOM patient questionnaire. In this study, practitioners diagnosed the subjects as they would in their clinics; however, they recorded their diagnoses using the DSOM format. This format is an interesting and potentially important approach for recording and comparing CM diagnoses from different practitioners.
Five practitioners saw each patient. Each patient saw the five practitioners in random order to control for order bias. No time limit was assigned for the consultations. The practitioners described the health of each subject with DSOM diagnostic descriptors, with each descriptor allocated a score between zero and five. The selection of a zero score for a descriptor by practitioners is significant as it indicates that in their opinion there was an absence of any health issue that could be described by this diagnostic factor. Thus, for the first time, it was possible to evaluate inter-rater agreement with the diagnostic data of patients from an open population with chance-removed statistics, an exciting prospect.
The total scores ascribed to each descriptor for a subject by the practitioners were used to evaluate a health index, termed total pathogenic score (TPS), and used as an indicator of the health of the subject. Since one could expect that different levels of practitioner agreement would depend on the complexity of subjects' health status, after determining the overall agreement between practitioners, the data were divided into three wellness groups, Most well, Intermediate, and Least well, on the basis of the subjects' TPS. Each group consisted of 14 subjects.
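The grouping step can be sketched as follows. This is a minimal illustration and not the authors' procedure: it assumes the TPS is the plain sum of all descriptor scores given by all practitioners for a subject, and that the ranked subjects are cut into three equal groups. The function names and the small `demo` data set are hypothetical.

```python
def total_pathogenic_score(subject_scores):
    """Sum every descriptor score given by every practitioner for one subject.

    subject_scores: list of per-practitioner lists of descriptor scores (0..5).
    Assumption: TPS is a simple unweighted sum.
    """
    return sum(sum(practitioner) for practitioner in subject_scores)

def split_into_wellness_groups(scores_by_subject):
    """Rank subjects by TPS and cut the ranking into three equal groups."""
    tps = {sid: total_pathogenic_score(s) for sid, s in scores_by_subject.items()}
    ranked = sorted(tps, key=tps.get)  # lowest TPS (most well) first
    size = len(ranked) // 3
    return {
        "Most well": ranked[:size],
        "Intermediate": ranked[size:2 * size],
        "Least well": ranked[2 * size:],
    }

# Made-up data: 6 subjects, 2 practitioners, 3 descriptors each.
demo = {
    "s1": [[0, 0, 1], [0, 1, 0]],
    "s2": [[3, 2, 4], [2, 3, 5]],
    "s3": [[1, 0, 2], [1, 1, 1]],
    "s4": [[0, 0, 0], [0, 0, 0]],
    "s5": [[5, 4, 4], [4, 5, 3]],
    "s6": [[2, 1, 1], [1, 2, 2]],
}
groups = split_into_wellness_groups(demo)
print(groups)
```

With 42 subjects, the same cut gives three groups of 14.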
Agreement was evaluated with three statistics using the software AgreeStat 2015.6, namely linearly weighted simple agreement, FK, and Gwet's AC2.
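For readers without access to AgreeStat, the three statistics can be sketched in a few lines. The code is not AgreeStat's implementation. It assumes the multi-rater formulas as published by Gwet, in which weighted observed agreement is shared by both coefficients and only the chance-agreement term differs. It also assumes that every item is scored by every rater with integer scores from 0 to q-1. All function and variable names are hypothetical.

```python
def linear_weights(q):
    """Linear agreement weights: 1 on the diagonal, falling to 0 at the extremes."""
    return [[1.0 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]

def agreement_stats(items, q, weights=None):
    """Weighted simple agreement, Fleiss' kappa and Gwet's AC2.

    items: list of rating lists, one list per item, one score per rater.
    """
    w = weights if weights is not None else linear_weights(q)
    n = len(items)
    pa = 0.0
    pi = [0.0] * q  # average share of ratings falling in each category
    for ratings in items:
        r = len(ratings)
        counts = [ratings.count(k) for k in range(q)]
        # Weighted counts: raters in nearby categories count as partial agreement.
        wcounts = [sum(w[k][l] * counts[l] for l in range(q)) for k in range(q)]
        pa += sum(counts[k] * (wcounts[k] - 1.0) for k in range(q)) / (r * (r - 1))
        for k in range(q):
            pi[k] += counts[k] / r
    pa /= n
    pi = [p / n for p in pi]

    # Chance agreement: Fleiss uses the observed category shares directly.
    pe_fleiss = sum(w[k][l] * pi[k] * pi[l] for k in range(q) for l in range(q))
    # Chance agreement: Gwet scales the spread of the shares by the total weight.
    total_w = sum(sum(row) for row in w)
    pe_gwet = total_w / (q * (q - 1)) * sum(p * (1.0 - p) for p in pi)

    return {
        "pa": pa,
        "fleiss_kappa": (pa - pe_fleiss) / (1.0 - pe_fleiss),
        "ac2": (pa - pe_gwet) / (1.0 - pe_gwet),
    }

# Made-up data: 5 raters, 10 items, almost every score is zero.
demo_items = [[0, 0, 0, 0, 0]] * 8 + [[0, 0, 0, 0, 1]] * 2
result = agreement_stats(demo_items, q=6)
print({name: round(value, 3) for name, value in result.items()})
```

In this made-up example the raters disagree by a single point on only two of ten items, so weighted simple agreement and AC2 are both close to 1, while FK is close to zero because nearly all ratings fall in one category.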
Results
The results are presented in Table 1.
Table 1. Average Agreement in Each Group Using the Stated Statistics
TPS, total pathogenic score.
Discussion
As may be seen in Table 1, inter-rater weighted agreement of all subjects was as follows: simple agreement 0.78 ± 0.01, AC2 0.60 ± 0.02, and FK 0.25 ± 0.03. As would be properly expected, there is a reasonable reduction in agreement when the estimated agreement by chance is removed by the AC2 statistic. However, a two-thirds reduction in agreement when the agreement by chance is estimated by the methods proposed by Fleiss seems quite excessive.
In all cases, the standard error shown in Table 1 is very small. Indeed, the consistency of the agreement results indicates that the results are accurate, with the largest error, ±0.03, occurring in the FK value of 0.25 in the Most well group and representing a likely 12% error.
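The proportions quoted in this discussion follow directly from the reported values. As a worked check, using only the figures stated above, the relative reduction from simple agreement to AC2, the relative reduction from simple agreement to FK, and the relative size of the largest standard error are:

```latex
\frac{0.78 - 0.60}{0.78} \approx 0.23, \qquad
\frac{0.78 - 0.25}{0.78} \approx 0.68, \qquad
\frac{0.03}{0.25} = 0.12
```

The second value is the two-thirds reduction attributed to FK, and the third is the 12% error.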
The average TPS of all subjects and in each wellness group is presented in the far right column of Table 1. The range of scores confirms the open nature of the population recruited, with large differences noted between the average TPS for each group. The other results presented in Table 1 are discussed using the Landis scale, 34 outlined in Table 2 for FK agreement.
Table 2. Landis Scale for Describing Kappa Agreement
All subjects
When the average agreement of all subjects is examined, a large difference is observed between FK and the other two statistical methods. Only fair agreement is found when FK is used. This is in strong contrast to the much higher substantial agreement obtained with simple weighted agreement and the AC2 statistic. The difference between the results is so large that the agreement by chance must surely have been overestimated in the FK instance. The difference between the simple agreement and Gwet's AC2 statistical results, which use a different method of estimating agreement by chance, is significantly smaller and seems much more realistic.
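The verbal labels used here can be assigned mechanically. The sketch assumes the bands as originally published by Landis and Koch (below 0 poor, up to 0.20 slight, up to 0.40 fair, up to 0.60 moderate, up to 0.80 substantial, above that almost perfect), since Table 2 is not reproduced in this text. The function name `landis_label` is hypothetical.

```python
# Upper bound of each band (treated as inclusive) and its verbal label.
# Assumption: bands follow the original Landis and Koch publication.
LANDIS_BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]

def landis_label(value):
    """Return the Landis descriptor for an agreement coefficient."""
    if value < 0.0:
        return "poor"
    for upper, label in LANDIS_BANDS:
        if value <= upper:
            return label
    return "almost perfect"  # guards against values marginally above 1.0

for coefficient in (0.25, 0.78):
    print(coefficient, landis_label(coefficient))
```

Values that sit exactly on a band edge, such as 0.60, are sensitive to how the edge is assigned; here each upper bound is treated as inclusive.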
Wellness groups
As may be seen in Table 1, as the average TPS of a wellness group of subjects increases, the level of agreement between the diagnosing practitioners deteriorates with simple agreement and AC2 statistical approaches. This is understandable as it would be expected that diagnoses become more complex with increasingly poor health. However, it is interesting to note that the agreement as measured by FK increases as the health of patients deteriorates.
To further understand the reasons for the observed changes in agreement outcome, the scores allocated in each wellness group need to be examined. This will shed light on the marginality of the data in each group. The allocation of scores within the three wellness groups is summarized in Table 3.
Table 3. Scores Within the Three Wellness Groups
For the data to be fixed marginal, a uniform distribution between all the wellness possibilities available needs to occur; that is, each of the scores would have the same probability of being chosen. It is clear from an inspection of Table 3 that this is not the case in any of the wellness groups, but the data became more fixed marginal as wellness declined, with a reduction in the number of zero scores being allocated and an increase in the other scores. The reason for the large differences between the AC2 values and FK results is that the data are not fixed marginal. Furthermore, the increase in the value of FK is due to the data becoming a little closer to a fixed marginal distribution as the health of patients deteriorates. Clearly therefore, FK is not a reliable tool unless the data can be shown to be uniformly distributed.
The assertion is well illustrated by the results of the Most well group, which contained many subjects with no issue in a large number of DSOM diagnostic descriptors. As may be seen in Table 3, 73% of the scores were allocated to the zero score. As a result, these data are far from fixed marginal, leading to the Most well group having the lowest FK. This group had the highest weighted simple agreement due to the preponderance of zero selections and, as may have been expected, the highest AC2 value. As mentioned above, the fact that this group had the lowest FK does not make sense and shows that FK should be used with great caution.
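The effect of a preponderance of zero scores on the two chance-agreement estimates can be illustrated numerically. The sketch assumes the chance-agreement formulas as published by Gwet for weighted FK and AC2 with linear weights. Because the individual values of Table 3 are not reproduced here, it also assumes, purely for illustration, that the non-zero scores are spread evenly over the remaining five categories. All names are hypothetical.

```python
def linear_weights(q):
    """Linear agreement weights on a q-point scale."""
    return [[1.0 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]

def chance_fleiss(pi, w):
    """Chance agreement as estimated for (weighted) Fleiss' kappa."""
    q = len(pi)
    return sum(w[k][l] * pi[k] * pi[l] for k in range(q) for l in range(q))

def chance_gwet(pi, w):
    """Chance agreement as estimated for Gwet's AC2."""
    q = len(pi)
    total_w = sum(sum(row) for row in w)
    return total_w / (q * (q - 1)) * sum(p * (1.0 - p) for p in pi)

def skewed_distribution(zero_share, q=6):
    """Hypothetical score distribution: zero_share on score 0, the rest spread evenly."""
    return [zero_share] + [(1.0 - zero_share) / (q - 1)] * (q - 1)

W = linear_weights(6)
for zero_share in (0.73, 0.50, 1.0 / 6.0):
    pi = skewed_distribution(zero_share)
    print(f"zero share {zero_share:.2f}: "
          f"FK chance = {chance_fleiss(pi, W):.3f}, "
          f"AC2 chance = {chance_gwet(pi, W):.3f}")
```

Under these assumptions, with 73% of scores at zero the FK estimate of chance agreement is about 0.74 while the AC2 estimate is about 0.33. The two estimates move toward each other as the zero share falls and coincide when all six scores are equally likely.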
To clearly illustrate the difference, AC2 and FK results of the three wellness groups are presented in Figure 2.

Figure 2. Average agreement of the three wellness groups using linearly weighted Gwet's AC2 and Fleiss' Kappa.
Figure 2 shows the steady converging of Gwet's AC2 and FK results as the data become more fixed marginal, represented by the increased scoring that occurred in the DSOM data as the patients were deemed less well. Indeed, if a group of subjects were studied, whose data were completely fixed marginal, there would be no difference between Gwet's and Fleiss' results. Unexpected increases in the FK values as patient health declines are also clearly illustrated in Figure 2, and are in stark contrast with the more realistic decrease in the magnitude of the AC2 value with decreasing wellness.
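The claim that the two statistics coincide for uniformly distributed data can be checked algebraically. The derivation assumes the chance-agreement expressions published by Gwet for weighted FK and for AC2, with $q$ categories, weights $w_{kl}$, total weight $T_w = \sum_{k,l} w_{kl}$, and category shares $\pi_k$; these symbols are introduced here for illustration. Setting $\pi_k = 1/q$ for every category gives:

```latex
p_e^{\mathrm{FK}} = \sum_{k=1}^{q}\sum_{l=1}^{q} w_{kl}\,\pi_k\,\pi_l
                  = \frac{T_w}{q^{2}},
\qquad
p_e^{\mathrm{AC2}} = \frac{T_w}{q(q-1)} \sum_{k=1}^{q} \pi_k\,(1-\pi_k)
                   = \frac{T_w}{q(q-1)} \cdot q \cdot \frac{1}{q}\left(1-\frac{1}{q}\right)
                   = \frac{T_w}{q^{2}}
```

Since both coefficients share the same observed agreement and take the form $(p_a - p_e)/(1 - p_e)$, equal chance-agreement terms imply equal coefficients in this special case.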
Since no inter-rater agreement studies 1,12,35 using FK statistics give any information on the marginality of the data collected, there is no way to determine the reliability of the results obtained. Unless data are absolutely fixed marginal, the level of agreement reported with FK is always underestimated. Inter-rater agreement studies that use Gwet's AC1 25 are only just beginning to appear. No studies have been found in the literature that use weighted AC1 (AC2), and just a few utilize weighted Kappa. 29,30
It appears that the use of incorrect statistics, the reporting of narrow investigations of limited aspects of diagnoses, and an avoidance of the use of open populations, together with a potentially inadequate diagnostic framework, have combined to totally obscure the determination of whether diagnostic agreement is at acceptable levels in clinic and research settings.
Large efforts are currently being made to uncover possible mechanisms of acupuncture. 36–38 Indeed, investigations into the effectiveness of acupuncture in various common conditions are currently underway or being reported. The CM research community is only now assimilating the results of these early studies. MacPherson et al., 39 who conducted the largest-scale meta-analyses performed to date, recently found “There was little evidence that different characteristics of acupuncture or acupuncturists modified the effect of treatment on pain outcomes.” The problems outlined above may have contributed to this finding.
It seems surprising that investigations determining effectiveness of differing treatments, or attempting to understand the mechanisms of acupuncture, have been undertaken without a proper diagnostic benchmark. It seems logical that both must take place on a solid foundation of repeatable CM diagnoses, which, if not considered, may be a confounding variable in the data collected. With the current situation in diagnostic reliability, where not even the elementary basics of reporting with correct statistics is the norm, there is no possibility of a real understanding of the levels of diagnostic agreement currently taking place.
No previous investigation was found of chance-removed inter-rater agreement using open populations, with no restrictions on the health problems of any subject. Since in a clinical setting, patients from an open population are likely to be present, the huge gap in diagnostic understanding in the CM profession is a significant problem. It is not known how reliable practitioners are “in the field”; that is, there is little appreciation of how repeatable diagnoses are. Reliable diagnoses must be the first step since the evaluation of treatment effectiveness is impossible without it.
The DSOM diagnostic categories employed in the present study greatly facilitate the expression of CM diagnoses in a format that seems to capture the essential practitioner assessment better than contemporary CM diagnostic formats. However, the diagnostic descriptors used in the DSOM may not be fully representative of what would be required to adequately define a CM diagnosis. The diagnostic format utilized in the DSOM is nonetheless a promising approach and should be the subject of further investigation.
Conclusion
It has been demonstrated that the FK is an inappropriate and misleading statistic for measuring inter-rater agreement with ordinal data and must not be used in future diagnostic reliability investigations, unless the data are tested to determine if there are equal numbers in each diagnostic category available to the testers, an unrealistic prerequisite.
It has also been shown that the AC2 statistic seems to more correctly determine agreement between multiple raters recording ordinal data than FK, where fixed marginal data are neither expected nor guaranteed, which is the more likely case in CM. Since inter-rater reliability results reported to date are unreliable, due to incorrectly applied or undisclosed statistical methods, it is proposed that linearly weighted AC2 statistics should be used as the standard agreement statistic. Similarly, investigations into treatment effectiveness or mechanisms of CM treatment action should report diagnostic reliability of the subjects' condition using AC1 or AC2 statistics.
The DSOM diagnostic configuration appears to enable the expression of CM diagnoses so that the essence of practitioners' assessments can be compared more effectively than with contemporary CM diagnostic formats. Furthermore, the appropriate chance-removed statistics are readily obtained with this format, which is a crucial advantage. Finally, the diagnostic descriptors used in the DSOM may not be fully representative of the factors required to adequately define an unrestricted CM diagnosis, and a separate investigation should be carried out to study this. The diagnostic format utilized in the DSOM is nonetheless a promising approach and should be the subject of further investigation.
Acknowledgments
The authors thank UTS for the generous provision of funds, staff, and clinic facilities for this research. Special thanks go to Professor Lee for the use of her diagnostic questionnaire and format. They also thank the late Narelle Smith for her contribution to this project.
Author Disclosure Statement
No competing financial interests exist.
