Inter-Rater Agreement in Traditional Chinese Medicine: On the Potential Contribution of Popplewell's Work

Abstract

Editors' Note: This invited commentary takes a close look at innovative approaches to overcoming some of the methodological challenges of establishing inter-rater reliability. It is provided by Dr. Stephen Birch, who has written extensively on the many issues encountered when conducting rigorous research in Traditional East Asian Medicine. We hope you will take the time to read this insightful commentary. We thank Dr. Stephen Birch for his contribution. Rosa Schnyer, DAOM, LAc, IFMCP, and Claudia Citkovitz, MS, PhD, LAc, Guest Editors

This issue includes a series of articles derived from the recent PhD research of Michael Popplewell at the University of Technology Sydney, Australia. In the first article, Popplewell et al. present the results of a study examining the level of diagnostic agreement (DA) in an open population of patients.¹ In the second article, they present arguments for not using the Fleiss Kappa statistic as a test of agreement in Traditional Chinese Medicine (TCM) diagnostic studies and show that the Gwet AC2 statistic is a better test to use.² In the third article, they describe the instrument Popplewell developed to address the problem of low DA in TCM studies, the “Traditional Chinese Medical Diagnostic Descriptor” (TCMDD), detailing its methodology and testing.³

There has generally been a paucity of research on the diagnostic methods and conclusions in the practice of traditional East Asian Medicine (TEAM),⁴ of which TCM is the most commonly found system. There are a number of different systems of practice, each with its own foci in diagnosis and treatment.^5–8 The first task for developing research on diagnosis in TEAM is to define the nature of the practice system that is to be studied. TCM is a complex system of practice that has combined different treatment systems (e.g., herbs and acupuncture) into one with a single diagnosis system.^5,9 In Japan, by contrast, the diagnostic processes and patterns for herbal medicine and acupuncture are different,¹⁰ which is understandable given that they have been practiced separately for a long time.⁵ The types of diagnoses for the different systems are likely to differ because of differences in how the treatments act and in the changes that can be observed.^11,12 This raises the question of whether a single diagnosis system can precisely capture the details of different systems of practice (herbs and acupuncture), and to what extent this may introduce problems in establishing DA for TCM.

Another critical question for TCM relates to the role the symptoms of the patient play in deciding the pattern of diagnosis. This may at first seem to be a strange question, but it is relevant because there are TEAM practice systems where the diagnosis is routinely arrived at without discussion of or use of the symptoms of the patient to decide the pattern of diagnosis.¹⁰ There are also systems that usually include the symptoms of the patient when deciding the pattern of diagnosis¹⁰ and there are systems that seek a diagnosis that explains why the symptom occurs,¹³ so that the pattern is based on a limited subset of patterns that could explain the origin of the symptom.^8,10 To understand the role of symptoms in deciding the pattern in a specific system of practice, literature reviews and surveys of practice need to be conducted. A survey of general practice can identify the frequency of types of practice and distribution of diagnostic patterns, whereas a survey related to a specific symptom might yield a more limited distribution of diagnostic patterns that are related to that symptom.

The work published in these three articles has addressed two primary issues. First, what problems arise in testing agreement in a system like TCM, where there is a potentially large number of diagnoses, and how can they be addressed? Second, is there a problem with the usual statistical test used in DA studies and, if so, what is a better statistical test to use?

The first article describes a rigorously conducted study to assess DA in TCM practice. Having identified potential difficulties with establishing DA, namely that there are over 100 different patterns with many combinations,¹ they used records from a large number of treatments at a busy teaching clinic to establish a smaller set (56) of possible patterns. Having established this smaller set of patterns, they then conducted a study to measure DA among 2–3 diagnosticians examining 35 patients. The practitioners were constrained to answering from among the smaller set of patterns and to using a Likert (1–5) scale to rate each identified pattern. The results of the study were very poor, with only around 20% agreement. Did the fact that either two or three practitioners examined each patient, as opposed to all three practitioners examining all patients, affect the results? With the statistical testing used, it may have been a factor in the poor results. Additionally, how comfortable were the practitioners using a Likert scale? This is not something used in routine clinical practice. Did this affect the results? While the potential roles of these factors are not clear to me and are likely to have been minor, this was otherwise a well-conducted study.

The second article identifies a limitation of the usual statistical test for DA, the Fleiss Kappa: if there are missing variables or unselected variables, the test will not properly compute.² I was aware of this problem from the small pilot study I conducted in the early 1990s, where a number of results would not compute because the less common variables were not used.¹⁴ This limitation made my results unpublishable in a peer-reviewed journal. If the Gwet AC2 statistic bypasses this problem and gives a more precise analysis of DA, this is an important solution and development for studies of DA in TEAM practice systems.
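To make the contrast between the two statistics concrete, the following is a minimal sketch rather than the analysis used by Popplewell et al. It implements Fleiss' kappa and Gwet's AC1, the unweighted special case of AC2; the Likert-type weighting that AC2 permits is not reproduced here. The patient data are invented for illustration, and the function names are hypothetical. The sketch shows three situations in which the textbook Fleiss formula behaves poorly: strongly skewed category use, a single category chosen by every rater, and an unequal number of raters per patient.

```python
from typing import List, Optional

# counts[i][k] = number of raters who assigned patient i to category k.
# All data below are invented for illustration only.

def fleiss_kappa(counts: List[List[int]]) -> Optional[float]:
    """Textbook Fleiss' kappa. Requires the same number of raters
    for every patient. Returns None when the statistic is undefined."""
    n = len(counts)
    m = sum(counts[0])
    if any(sum(row) != m for row in counts):
        raise ValueError("Fleiss' kappa needs equal raters per patient")
    q = len(counts[0])
    # Observed agreement: share of agreeing rater pairs, averaged over patients.
    p_obs = sum(sum(c * (c - 1) for c in row) / (m * (m - 1))
                for row in counts) / n
    # Chance agreement: sum of squared overall category proportions.
    p_cat = [sum(row[k] for row in counts) / (n * m) for k in range(q)]
    p_exp = sum(p * p for p in p_cat)
    if 1.0 - p_exp < 1e-12:  # every rating fell in one category: 0/0
        return None
    return (p_obs - p_exp) / (1.0 - p_exp)

def gwet_ac1(counts: List[List[int]]) -> float:
    """Gwet's AC1 (AC2 with identity weights). Tolerates a different
    number of raters per patient and unused categories."""
    q = len(counts[0])
    rated = [row for row in counts if sum(row) >= 1]
    multi = [row for row in rated if sum(row) >= 2]
    # Observed agreement over patients seen by at least two raters.
    p_obs = sum(sum(c * (c - 1) for c in row) / (sum(row) * (sum(row) - 1))
                for row in multi) / len(multi)
    # Category prevalence, averaged per patient.
    pi = [sum(row[k] / sum(row) for row in rated) / len(rated)
          for k in range(q)]
    # Chance agreement is bounded by 1/q, so the denominator never vanishes.
    p_exp = sum(p * (1.0 - p) for p in pi) / (q - 1)
    return (p_obs - p_exp) / (1.0 - p_exp)

# Ten patients, three raters, three categories; the third is never chosen.
skewed = [[3, 0, 0]] * 9 + [[2, 1, 0]]
print(round(fleiss_kappa(skewed), 3))  # -0.034
print(round(gwet_ac1(skewed), 3))      # 0.931

# Every rater chooses the same category for every patient.
unanimous = [[3, 0, 0]] * 10
print(fleiss_kappa(unanimous))         # None (undefined)
print(gwet_ac1(unanimous))             # 1.0

# Two or three raters per patient: only AC1 can be computed here.
unbalanced = [[2, 0, 0], [3, 0, 0], [1, 1, 0], [2, 1, 0]]
print(round(gwet_ac1(unbalanced), 3))
```

In the skewed example the raters agree on 93% of rater pairs, yet Fleiss' kappa is slightly negative while AC1 stays close to the observed agreement. This is only one of the behaviors that can make kappa-type statistics hard to interpret when many categories are rarely or never selected.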

The third article³ discusses the problems of testing DA when there are many possible options, such as in an open population. The authors have claimed, probably correctly, that with so many possible diagnoses, it becomes impractical to try testing DA.³ As a solution to this, a simplified diagnostic approach was developed, the TCMDD, which checks for the presence of a smaller subset (15) of key diagnostic descriptors that can be combined. These descriptors were derived from an analysis of a similar set of descriptors developed in Korea, which were then tested and modified following rigorous procedures.³ The authors have further claimed that using this approach can potentially improve the teaching and practice of TCM by improving the accuracy of key aspects of diagnosis. They then tested the TCMDD instrument in a well-designed study and found higher levels of agreement³ than in the first study.¹ For further validation, it will be necessary to conduct surveys of practitioners of TCM to see how well this can work for them, and then to try it in clinical practice to see what impact it may have.

A potential weakness of the TCMDD instrument is that it is not clear to what extent TCM practitioners are likely to use diagnoses based on the symptom, and therefore how generalizable the results of a study of open population practice might be. The TCMDD may be applicable for open populations, where testing DA becomes practically impossible without adopting a system like it, but how well will it work in a specified patient population where the symptom-centered diagnosis is likely to be used and a more limited set of diagnostic patterns is chosen from? Examples of the more limited symptom-based pattern differentiation can be seen in low back pain^15,16 and rheumatoid arthritis.^17,18 The answer to this question is not yet clear. There have not been enough surveys of clinical practice to be able to state clearly how practitioners use the diagnosis/treatment systems of TCM. If such a survey shows that practitioners tend not to use the symptom-focused approach, choosing instead to leave open the possible patterns, then the TCMDD could prove to be very important for testing and improving clinical practice. A different approach for testing DA may be needed if surveys find that the majority of practitioners tend to practice the more symptom-centered approach. This may also be the case in studies that examine the effectiveness of TCM for a particular symptom. In the first of the three articles, Popplewell et al. found low DA.¹ In a study of DA in rheumatoid arthritis patients, Zhang et al. also found low DA¹⁷ and then showed that having the practitioners study further together to work on their differences can significantly increase the DA.¹⁸ Zhang et al.'s work seems to raise questions about whether the TCMDD is needed in this circumstance.

The field of TEAM needs to increase research output on DA for diagnostic methods and judgments. As a founding member of the international Pattern Identification Network Group,¹⁹ I have been working with my colleagues to promote and further this research. The work of Popplewell and the other articles in this collection are coincidentally being published around the same time as the collection we are preparing in the European Journal of Integrative Medicine;¹⁰ this is important, as the two collections together should make an increase in the focus on and output of DA research more likely. While it is too early to say what the limits and benefits of the approach Popplewell has developed will be, it should be helpful for future research in the area and could also affect the way students are taught. These three articles present us with an important set of tools and ways of thinking to start addressing some of the problems in researching DA in TCM.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for this article.

References

1. Popplewell, Reizes, Zaslawski. Consensus in Traditional Chinese Medical Diagnosis in open populations. J Altern Complement Med, 2019; 25:1109–1114.

2. Popplewell, Reizes, Zaslawski. Appropriate statistics for determining chance removed inter-practitioner agreement. J Altern Complement Med, 2019; 25:1115–1120.

3. Popplewell, Reizes, Zaslawski. A novel approach to describing traditional Chinese medical patterns: The “Traditional Chinese Medical Diagnostic Descriptor.” J Altern Complement Med, 2019; 25:1121–1129.

4. O'Brien, Birch. A review of the reliability of traditional East Asian medical diagnoses. J Altern Complement Med, 2009; 15:353–366.

5. Birch, Felt. Understanding Acupuncture. London: Churchill Livingstone, 1999.

6. Birch, Lewith. Acupuncture research, the story so far. In: MacPherson, Hammerschlag, Lewith, Schnyer, eds. Acupuncture Research: Strategies for Building an Evidence Base. London: Elsevier, 2007:15–35.

7. Schnyer, Birch, MacPherson. Acupuncture practice as the foundation for clinical evaluation. In: MacPherson, Hammerschlag, Lewith, Schnyer, eds. Acupuncture Research: Strategies for Building an Evidence Base. London: Elsevier, 2007:153–179.

8. Scheid. Patterns, syndromes, types: Who should we be? What should we do? Eur J Orient Med, 2013; 7:10–21.

9. Scheid. Chinese Medicine in Contemporary China. Durham, NC: Duke University Press, 2002.

10. Birch, Bian, Lee, et al. Pattern identification—History, nature and strategies for treating patients: A narrative review. Eur J Integr Med (in review).

11. Birch. Acupuncture: How might the mechanisms of treatment have contributed to the diagnosis of “patterns” and pattern-based treatments—Speculations on the evolution of acupuncture as a therapy. Implications for researchers. J Acupunct Res, 2018; 35:47–51.

12. Birch, Alraek. Traditional East Asian medicine: How to understand and approach diagnostic findings and patterns in a modern scientific framework? Chin J Integr Med, 2014; 20:333–337.

13. Okabe. Introduction to traditional Japanese acupuncture (parts 1 and 2). N Am J Orient Med, 1998; 5:3–13.

14. Birch. An exploration with proposed solutions of the problems and issues in conducting clinical research in acupuncture [PhD thesis]. Exeter, UK: University of Exeter, 1997.

15. Sherman, Cherkin, Hogeboom. The diagnosis and treatment of patients with chronic low-back pain by traditional Chinese medical acupuncturists. J Altern Complement Med, 2001; 7:641–650.

16. Sherman, Hogeboom, Cherkin. How traditional Chinese medicine acupuncturists would diagnose and treat chronic low back pain: Results of a survey of licensed acupuncturists in Washington State. Complement Ther Med, 2001; 9:146–153.

17. Zhang, Lee, Bausell, et al. Variability in the traditional Chinese medicine (TCM) diagnoses and herbal prescriptions provided by three TCM practitioners for 40 patients with rheumatoid arthritis. J Altern Complement Med, 2005; 11:415–421.

18. Zhang, Singh, Lee, et al. Improvement of agreement in TCM diagnosis among TCM practitioners for persons with the conventional diagnosis of rheumatoid arthritis: Effect of training. J Altern Complement Med, 2008; 14:381–386.

19. Lee, Lee, Alraek, et al. Current research and future directions in pattern identification: Results of an international symposium. Chin J Integr Med, 2016; 22:947–955.