Inter-Rater Reliability in Traditional Chinese Medicine: Challenging Paradigmatic Assumptions

Rosa N. Schnyer, DAOM, LAc, IFMCP

Claudia Citkovitz, PhD, MS, LAc

The World Health Organization has announced that its newly updated International Classification of Diseases (ICD-11)¹ will include standardized terminology for Traditional Chinese Medicine (TCM) pattern diagnosis. This is in keeping with its stated goal of providing standardized, consistent data across global regions, and reflects usage by millions of patients in East Asia. The change also reflects increasing prevalence of TCM's use in the United States and other countries throughout the world, stronger evidence of acupuncture's effect on pain over and above placebo,² and increasing integration of TCM into mainstream health care systems.³

Critics questioning this change cite the “prescientific” origin of TCM's patterns, such as qi stagnation, as patent evidence of its unfitness for integration.⁴ A more substantive objection might be the paucity of research validating not only TCM patterns but also the assessments that underlie them, such as palpation of the pulse and inspection of the tongue.

Indeed, and here is the rub, it is not even entirely clear what “validation” would mean in this context. Validity is a statistical concept that denotes the general accuracy of the diagnostic conclusions and represents the best available approximation to the right diagnosis.⁵ In biomedical research, new tests can be compared with existing ones. However, what should be compared with what to verify that a wiry pulse is indeed wiry or that it is positively correlated with other diagnostic findings of qi stagnation? Some researchers have instead chosen to “validate” TCM patterns by examining the prevalence of either the diagnosis or associated symptoms in relation to biomedical disease entities such as menopause⁶ or, as in this issue, poststroke cognitive impairment (Liu et al., this issue) and chronic atrophic gastritis (Zhang et al., this issue).⁸

Underlying any assessment of validity, and at least as problematic for TCM assessment, is inter-rater reliability (IRR). This term denotes the consistency of diagnostic assessment between two or more providers. If the purpose of diagnostic codes such as the ICD-11 is to provide a common language for comparing and sharing data in a consistent and standard way, then it does seem a reasonable expectation that codes will be assigned reliably between users. And yet, IRR has proven remarkably low among TCM practitioners in studies published over the last two decades (Jacobson, this issue), at least in the West. Low IRR in TCM assessment has further widened the cultural chasm between biomedical diagnostic categories and TCM patterns. It is often cited as an obstacle to the integration of TCM constructs into clinical trials (see third article by Popplewell et al., this issue).

It is important to note that establishing IRR has also proven challenging in other areas of medicine, even when using “objective” assessment tools such as MRI⁷ or computed tomography angiography.⁸ It remains to be seen how the recently adopted inclusion of TCM pattern differentiation as “diagnosis” within the ICD-11 will impact clinical practice, increase reliability, or facilitate research.

Inter-Rater Reliability in Traditional Chinese Medicine

As seen in several of the articles in this issue, a variety of reasons have been cited for the poor IRR of TCM assessments, and a number of methodological approaches have been taken to improve it—with results that are modest at best. Possible responses to this state of affairs fall, broadly speaking, into three categories.

Focusing on methodology

One approach is to continue treating the problem as a methodological one. This follows the example of other fields of study such as psychiatry and psychology where researchers, similarly confronted, have continued to identify and address challenges in establishing IRR. Methodologically it can be argued that a diagnostic manual, and further development of a structured interview process⁹ and training (Schnyer, this issue), will eventually yield increased IRR. It is important to note that the results in these other fields have been mixed so far,¹⁰,¹¹ and that establishing IRR will likely be even more challenging with the controversial latest version of the Diagnostic and Statistical Manual of Mental Disorders, DSM-5.¹² Researchers in these fields are cognizant of the problem and are exploring innovative methods.¹³ For TCM, it can also be argued that we just need to apply the right statistics to successfully assess IRR in TCM (see second article by Popplewell et al., this issue), or that reducing the heterogeneity of TCM assessment categories will effectively simplify the process (see third article by Popplewell et al., this issue). However, as argued by Rioux in her commentary, attempts to quantify IRR in a decontextualized “snapshot” may be doomed to failure, given the dynamic nature of the TCM clinical encounter. Rioux suggests that a more appropriate research focus would be developing model-valid methods and ecologically valid concepts (Rioux, this issue).

Valuing heterogeneity

A second approach acknowledges the complexity and heterogeneity inherent in TCM assessment not as an obstacle but as a valuable attribute, characterizing the TCM clinical reasoning process as a Complex Adaptive System (CAS).¹⁴ This approach argues that “the conversation regarding low IRR is fundamentally incongruent with TCM theory and training” (see commentary by Taylor-Swanson et al., this issue) and that complex dynamic systems, such as the TCM pattern assessment and treatment, “can be directed along the same pathways of change by more than one type of input” (Jacobson et al., this issue). If that is true for TCM, then, as stated by Jacobson et al., “individual variations among practitioners in their initial diagnoses and treatments are not an obstacle to reliably good clinical outcomes, and it just requires more sophisticated modeling.” An example of such modeling is the Markov clustering analysis (Liu et al., this issue), and the hierarchical clustering analysis and complex system entropy clustering analysis used by Zhang et al. (this issue). As proposed by Taylor-Swanson et al. in their commentary, applying CAS methods to inform future research may provide a congruent yet rigorous model to investigate the topic of TCM pattern assessment.

Highlighting patient characteristics

A third default approach is simply to consider TCM assessments as a set of patient characteristics, without specific attention either to improving practitioner IRR or to modeling the complexity underlying the assessments. This pragmatic approach is characteristic of observational research on large data sets, which tolerate heterogeneity better than controlled testing environments. It was also taken in Taylor-Swanson et al.'s matrix analysis article in this issue. This approach can be assumed to have been taken in any study that reports TCM diagnoses without also reporting methods taken to address IRR. If the ICD-11 is implemented as planned, then millions of TCM diagnoses, from practice environments as diverse as China, Taiwan, Australia, the United States, and South Africa, will be comparable on a seemingly apples-to-apples basis. It is possible that the large volume of data will allow central tendencies and clinically meaningful insights to emerge from such analysis. It is also possible that the addition of an international diagnostic standard will stimulate increased attention to cultivating IRR in TCM education. The data resulting from large-scale collection of ICD-11 diagnoses may also be sufficiently chaotic that attempts at post hoc validation will be inconclusive despite the large data sets. This eventuality would presumably strengthen the arguments of those opposed to TCM's inclusion in the first place. The one thing that can be stated with certainty is that this is a space to watch.

Introducing This Issue

This JACM Special Focus Issue on Challenges in Inter-Rater Reliability has a grassroots origin. It did not come about via a top-down call for articles. Rather, the issue grew organically out of conversations between the editor, peer reviewers, and editorial board members in relation to articles already submitted. For instance, if an article purports to draw links between TCM diagnoses and biomedical patterns, what can this mean if the former cannot be assigned reliably? What should JACM's policy be on such articles? Accordingly, this collection has a touch of found art in it. The editors made no effort to achieve comprehensiveness or full representativeness of all the associated issues around IRR, individual treatment, and the debatable appropriateness of the term “diagnosis” to describe TCM's prevention- and health-oriented assessment process. The original research presented here reopens a long-standing conversation at a time when complex issues such as personalization, patient centeredness, and shifting to health rather than disease outcome (long apparent within the TCM field) are seen as innovations in biomedicine. We saw the incoming submissions and decided to take another look at the problem, identifying areas of difficulty, approaches taken to illuminate the issues, and areas of remaining darkness. Here are the headlines.

What are the challenges?

As seen in this issue's articles, a number of epistemic and methodological challenges have been proposed to explain the low levels of IRR found in TCM studies to date. Key issues include the following:

Complexity: Pattern assessment versus “differential diagnosis”

TCM does not seek to diagnose and treat diseases in the biomedical sense, but rather to detect and address patterns of imbalance. Balance is an individual experience of well-being, and is contextual and dynamic rather than rigidly categorical. The Chinese medicine clinical encounter, therefore, seeks to detect specific patterns (zheng) of impaired function. These may or may not be differentiated by quantifiable information alone. Pattern identification (bian zheng) considers factors at multiple levels of analysis: within the individual, in the context of the individual's physical, social, and cultural environment, and in consideration for the individual's own experience. It is based on the relationship among variables and the dynamics of that relationship. The Chinese medicine clinical assessment is an emergent process between two people, at a particular point in time. A TCM assessment such as “blood stasis” helps the clinician to understand the dynamics of the patient's imbalance, to place it in the context of previous experience and the historical clinical literature, and to respond to it dynamically over time. Such an assessment is not a discrete fixed entity such as “otitis media” or “depression.” Changes in the patient's experience or presentation between or even within sessions continuously inform the pattern identification process.

All but three of the articles in this issue have taken methodological approaches to improving IRR by training, operationalizing, or simplifying the diagnostic process. The remaining three articles—Zhang et al., Liu et al., and Taylor-Swanson et al.—take an observational approach, aggregating data to investigate emergent patterns among assessments made “in the wild,” under uncontrolled circumstances. Rather than manipulating the assessment process, these articles pragmatically explore the robustness and clinical utility of the assessments themselves. As noted by Mist in his commentary, Zhang and Liu use statistical techniques that help uncover the complex relationships among diagnoses, even though they are not assessing IRR. Taylor-Swanson et al. also aimed at uncovering relationships between different imbalances.

Heterogeneity of patterns/diagnoses

Patients' experiences of disease, distress, and suboptimal function are constantly shifting in relation to multiple internal and external factors. For TCM practitioners and other clinicians alike, the ability to understand, anticipate, and usefully intervene in pathologic processes depends on matching the patient's experience with pre-existing clinical principles. Given the transitory nature of patients' experience and the multiplicity of potentially influential factors, a TCM “diagnosis” may include any number of co-occurring and potentially interacting patterns affecting the chief complaint. Accordingly, TCM patterns identified for a group of patients with the same biomedically defined disease will vary between patients, as well as across time for a single patient. This inherent heterogeneity greatly complicates any quest for homogenous pattern assignment or prioritization between practitioners. Furthermore, it is critical to acknowledge that although this issue focuses explicitly on TCM, these factors are pertinent to other East Asian Medicine (EAM) styles of treatment.

One approach to reducing heterogeneity taken by several articles in this issue is limiting the number of assessment categories to a specified set of descriptors. This reduction may happen through experimental modification of the assessment process itself as modeled in 2004 by Langevin et al.'s Yin and Yang scores (reviewed in Jacobson et al.). Conversely, such reduction may be accomplished by post hoc “binning” where similar but not identical values are grouped to streamline data processing. Taylor-Swanson et al. use this method. Popplewell et al. propose a system of 15 descriptors that can either be used prospectively or applied through post hoc binning.
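Post hoc binning can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the pattern names, their assignment to two coarse bins, and the ratings are invented and do not reproduce the scheme of any article in this issue. Raw percent agreement is used only to keep the sketch short; it is not corrected for chance.

```python
# Hypothetical mapping from fine-grained patterns to coarse bins.
# Pattern names and bin assignments are illustrative assumptions only.
BINS = {
    "spleen qi deficiency": "deficiency",
    "kidney yang deficiency": "deficiency",
    "liver qi stagnation": "excess",
    "blood stasis": "excess",
}


def percent_agreement(rater_a, rater_b):
    """Share of cases on which two raters assign the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def bin_labels(labels, mapping=BINS):
    """Replace each fine-grained label with its coarse bin (KeyError if unmapped)."""
    return [mapping[label] for label in labels]


# Invented ratings of five patients by two practitioners.
rater_a = ["spleen qi deficiency", "liver qi stagnation", "blood stasis",
           "kidney yang deficiency", "liver qi stagnation"]
rater_b = ["kidney yang deficiency", "blood stasis", "blood stasis",
           "kidney yang deficiency", "spleen qi deficiency"]

raw = percent_agreement(rater_a, rater_b)
binned = percent_agreement(bin_labels(rater_a), bin_labels(rater_b))
print(f"raw agreement: {raw:.2f}, binned agreement: {binned:.2f}")
```

In this invented data set the two practitioners agree on two of five patients before binning and on four of five afterward. The gain comes entirely from treating different patterns as the same bin, so the original disagreements are hidden rather than resolved.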

Another approach, initially proposed by Schnyer and Allen¹⁵ and taken by a number of studies since (Jacobson et al., this issue), is to increase the replicability of the TCM assessment process through operationalization of the steps taken by clinicians in making the assessment, such as inquiry, pulse taking, and tongue diagnosis. This operationalization may include the use of a treatment manual,¹ the development of a structured interview⁹ (modeled e.g., on the Structured Clinical Interview for DSM Disorders or SCID¹⁶), in-person practitioner training, or a combination.

Each of these approaches raises potential concerns. Post hoc binning facilitates statistical differentiation of assessment categories but leaves open the possibility of major confounding by differing interpractitioner assessments. Modification of the assessment process, either by category reduction or by operationalization, runs the risk of “squishing the butterfly”—a graphic term for disrupting the essence of a living process by deconstructing and standardizing it.

Furthermore, it must be acknowledged that the cognitive process of pattern recognition, our ability to recognize and predict in Chinese medicine and elsewhere, involves matching new information received with information we already know.¹⁷ For biomedical concepts based on testable structural or biochemical anomalies, diagnostic criteria can be specified; for TCM's more functional assessments, “diagnosis” based on presenting complaints may be irreducibly heterogeneous. For instance, Taylor-Swanson et al. note that “in the patient/practitioner interaction, differing information could emerge based on degree of patient trust or, the practitioner's history or training” (commentary in this issue). It is important to consider that diagnostic heterogeneity is not unique to TCM but presents in other areas of health care where there is considerable overlap of symptoms across categories. Mainstream psychiatry and neurologic rehabilitation are outstanding examples.¹⁸,¹⁹

For all the reasons discussed above, pursuit of distinct, statistically robust assessment categories suitable for inclusion in the IRR may reinforce the misinterpretation of dynamic patterns as static diagnoses. Reducing the number of diagnostic categories may be a useful method of managing heterogeneity in post hoc analysis of unmanipulated assessment. However, if undertaken as a method of simplifying the assessment process, there is a risk of moving away from personalized medicine and moving closer to reductionist care.

Lack of standard terminology

Standardization of terminology is a deep and persistent problem in TCM education and clinical discourse as well as research. TCM as developed in China over the past 70 years itself represents a movement to streamline and standardize the process and terminology for a breathtakingly heterogeneous collection of medical practices transmitted through thousands of years of history along divergent “threads” of family and institutional tradition. In the West, compelling arguments have been made for the clinical and didactic utility of a standard set of terms for translation. However, practitioner and educator consensus on standardization of terms has proven elusive, with a number of alternative term sets already established in common usage, and a number of translators arguing that flexibility of terms used is needed to clearly communicate the meaning and spirit of clinical texts.²⁰

In the Popplewell articles, one key strategy to improve IRR is the consolidation of syndromes into 15 basic designations, which can be combined as needed. For example, a patient complaining of chilliness, joint and muscle pain, and loose stools in winter might be assessed with the designations of “Cold, Damp, Yang Xu, Spleen, Kidney.” An argument could be made for the value of such a streamlined system, particularly for research aiming to explore the relationship between biomedically and EAM-defined syndromes, as in Taylor-Swanson et al.'s matrix analysis article in this issue. However, consolidating terms runs the risk of obscuring clinically meaningful differences.

IRR is only one dimension of clinimetric validation for an assessment instrument such as the Popplewell system. “Content validity” denotes the completeness with which an experimental model represents the phenomenon under study. Acupuncturists will be quick to note that the Yang organs and channels are not represented in the system, meaning, for example, that large intestine-based constipation could find itself either orphaned or grouped rather awkwardly with lung disorders. Equally problematic, the category of “Damp” subsumes diagnoses of both damp and phlegm fluid disorders that can overlap, but in many cases respond to quite different treatment approaches (e.g., diuresis and sweating in particular tend to resolve damp but may worsen phlegm). “Face validity” is the most basic and least specific type of validity, invoked when there are logical lapses or “on the face of it” problems with a proposed model. While Popplewell's modular approach does point in an intriguing and possibly useful direction, care must be taken not to prioritize IRR over face and content validity.

Problematic statistics

Various versions of the Kappa statistic are used to assess IRR. Kappa is generally agreed to be a conservative estimate, with important limitations such as inconsistency across levels of prevalence of the item rated.²¹ For TCM, it is also important to note that Kappa calculations do not weight for degree of disagreement: raters finding blood deficiency and Yin deficiency (which have considerable overlap) would be just as much in “nonagreement” as they would for opposite ratings such as Yin and Yang deficiency. Calculations also assume that the items rated are independent, which they are not: for example, scalloped tongues are more often pale than red. Both of these considerations meaningfully undermine Kappa's appropriateness for TCM assessment. Kappa is calculated by comparing the number of ratings made identically between raters with the number of ratings that would be identical if left to chance. A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance; a negative Kappa would indicate that IRR was worse than chance.
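To make the calculation concrete, the following is a minimal Python sketch of unweighted Cohen's Kappa for two raters. The function name `cohen_kappa`, the pattern labels, and all ratings are assumptions invented for illustration; they are not data from any study in this issue.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two raters who rated the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of cases given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Invented ratings of six patients (hypothetical labels).
reference = ["blood xu", "blood xu", "yin xu", "yin xu", "yang xu", "yang xu"]
# One disagreement between overlapping patterns (yin xu rated as blood xu).
near_miss = ["blood xu", "blood xu", "blood xu", "yin xu", "yang xu", "yang xu"]
# One disagreement between opposite patterns (yin xu rated as yang xu).
opposite = ["blood xu", "blood xu", "yang xu", "yin xu", "yang xu", "yang xu"]

kappa_near = cohen_kappa(reference, near_miss)
kappa_opposite = cohen_kappa(reference, opposite)

# Two raters who disagree on every case: agreement worse than chance.
kappa_negative = cohen_kappa(["yin xu", "yang xu", "yin xu", "yang xu"],
                             ["yang xu", "yin xu", "yang xu", "yin xu"])

print(round(kappa_near, 2), round(kappa_opposite, 2), round(kappa_negative, 2))
```

In this constructed example the near-miss and the opposite disagreement yield the same Kappa, which is the lack of weighting described above. The last pair of raters disagrees on every case and produces a negative value.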

Jacobson et al., in their critical review of TCM IRR studies, discuss these issues in more detail. Popplewell et al. propose a possible alternative to Kappa statistics in the second of three articles by their team in this issue. It is unclear how well this other statistical method can accurately capture the dynamics of the progression of patterns of imbalance in TCM (see Mist commentary in this issue).

Communicating across paradigms and cultures

As seen in the challenges discussed above, and demonstrated clearly in Jacobson et al.'s review, the simple application of standard diagnostic statistics to TCM assessment yields low IRR findings, even when TCM-intrinsic measures are taken to improve them. If experimental validation of TCM assessment is to be advanced, then genuine TCM expertise needs to be combined with statistical and clinimetric insights. The enterprise also needs to be deeply informed by emerging complexity science. As argued by Rioux in her commentary (Rioux, this issue), “reducing an analysis of inter-rater agreement to statistical measures undermines the diagnostic reasoning of TCM and makes a relevant understanding of ecological validity improbable.” Rioux suggests that qualitatively oriented methods of ascertaining reliability may be better suited to assessing internal consistency and replicability of TCM than statistical ones. TCM diagnosis may need to be modeled as a framework that supports identification and de-escalation of patterns emergent from the dyadic patient-practitioner interaction. This would be a marked departure from statistical approaches that model the system as a set of independent diagnoses with fixed clinical markers, such as those in the ICD-11. However, many patients appear to be seeking just such a departure when they choose TCM.²²,²³ MacPherson et al. suggest that acupuncture research has advanced biomedical insights in a number of key areas, including the therapeutic importance of the individual patient encounter.²⁴ In the long run, it may be that biomedicine itself can benefit from complexity-oriented research into the TCM diagnostic process, as long as such research is well informed by experts on TCM practice as well as clinimetrics and biostatistics.

It must now be recognized that validating TCM diagnoses will require a critical mass of all four of these key domains of expertise in one research group. A second challenge will be keeping such a research group in collaborative engagement long enough to develop a language of shared new meanings, rather than simple translation of terms.

The studies published and reviewed in this issue generally originated in the West. The two exceptions (Zhang et al.; Liu et al.) originated in China but have senior authors located in the United States. In decades past, it was assumed that Western studies of TCM were more likely to be methodologically robust than studies conducted in East Asia.²⁵ Times have changed, however. Research has been conducted on many aspects of the TCM diagnostic process, including not only human-based, complexity-oriented approaches²⁶ but also validation of various digital and artificial intelligence-aided systems.²⁷ Once interdisciplinary teams have been established, a first order of business must be to collaborate on understanding TCM-intrinsic and interdisciplinary advances in assessing validity and reliability in China, Korea, and other East Asian countries, as well as advances in reliability and validity in other fields of study, with the goal of forming international teams that can work together to forge the way forward.

Approaches taken: Groups of articles and commentaries

The articles in this issue fall naturally into three groups, in terms of approaches taken to working with IRR. For each group, we invited commentary from experts in related fields to begin what we see as a necessary process of multilateral discussion.

The first set of articles represents, in our view, a “first wave” of work done on the challenge of IRR, including assessments of the methodological challenges and steps taken to address them. It is important to note the overlap of authors across articles in this group. Long-term collaborations and critical conversations exemplify the process of evolution in approaching this subject.

Jacobson et al. present a painstaking systematic review of work done on IRR in TCM since 2001. Twenty-one studies are reviewed, including 14 assessments of TCM diagnosis, 2 studies of IRR on diagnostic indices, and 5 novel rating systems (including Popplewell's). The authors report a mean Kappa finding of 0.34, particularly dismal given that many of the studies reviewed implemented specific measures to improve IRR. They additionally note that one of the approaches associated with higher IRR is reducing the number of diagnostic categories from which raters can choose.

Schnyer et al. identified interpractitioner differences in training and in the process of information gathering as main challenges to IRR, and used study methods specific to addressing those challenges. However, none of the study's interventions significantly impacted IRR. The question is raised: if these methodological factors do not account for low IRR, then what epistemic or other factors are we overlooking, or what incorrect assumptions are we making about what it is we are doing? Or is it simply that more work is necessary to improve reliability, as modeled by other fields of study?²⁸

Taylor-Swanson et al. do not examine IRR per se, but rather conduct a post hoc attempt to aggregate TCM diagnostic data in a clinically meaningful way. Initial and final diagnoses of veterans suffering from Gulf War illness, freely assessed by practitioners in a pragmatic trial, were “binned” into nonexclusive categories of constitutional excess, constitutional deficiency, and channel imbalance. This represents a novel and noteworthy approach to streamlining post hoc analysis of diagnostic data.

Commentary on these articles is provided by Dr. Jennifer Rioux, a medical anthropologist and Ayurvedic medicine practitioner. Dr. Rioux's commentary highlights important insights from whole systems research on the methodological inconsistencies of applying “biomedical positivist research concept methods and principles” to quantify the contextual, process-oriented, and dynamic clinical encounter of TCM and other traditionally based systems of care.

The second group of articles can be seen as a small second wave of work, combining ideas generated in the first wave with a statistical alternative to the commonly used Cohen's or Fleiss Kappa for calculating IRR. All three of these articles stem from Michael Popplewell's work. In the first article, Popplewell et al. evaluated diagnostic agreement between two or three practitioners who diagnosed 35 patients in a TCM clinic at the University of Technology in Sydney, Australia. In the second article, this team proposes an alternate statistical test to Kappa (Gwet's AC2) for assessing IRR, among TCM practitioners who diagnosed 42 subjects drawn from the same open population. In the third and final article, they present a novel approach to recording TCM diagnostic patterns, the Traditional Chinese Medical Diagnostic Descriptor (TCMDD), and provide an example of its use in mapping TCM diagnostic patterns and evaluating IRR in the same 35 patients of the first study.
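The motivation for an alternative to Kappa can be illustrated with a short Python sketch. Gwet's AC2 is a weighted statistic. The sketch below instead uses its simpler unweighted relative, Gwet's AC1, for two raters and nominal categories. This is a substitution made for brevity and is not the computation used by Popplewell et al. The function names, labels, and ratings are all invented for illustration.

```python
from collections import Counter


def observed_agreement(rater_a, rater_b):
    """Share of cases on which two raters assign the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa (chance term from products of marginals)."""
    n = len(rater_a)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed_agreement(rater_a, rater_b) - p_e) / (1 - p_e)


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters and nominal categories."""
    if categories is None:
        cats = sorted(set(rater_a) | set(rater_b))
    else:
        cats = list(categories)
    q = len(cats)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    n = len(rater_a)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # pi_k: mean of the two raters' marginal proportions for category k.
    pi = [(count_a[k] + count_b[k]) / (2 * n) for k in cats]
    # Chance agreement under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (observed_agreement(rater_a, rater_b) - p_e) / (1 - p_e)


# Invented ratings of 20 patients in which one pattern dominates.
rater_a = ["qi stagnation"] * 19 + ["other"]
rater_b = ["qi stagnation"] * 18 + ["other", "qi stagnation"]

p_o = observed_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
ac1 = gwet_ac1(rater_a, rater_b)
print(f"agreement={p_o:.2f} kappa={kappa:.3f} AC1={ac1:.3f}")
```

In this invented data set the raters agree on 18 of 20 patients, yet Kappa falls slightly below zero because one pattern is so prevalent, while AC1 stays high. Which behavior better reflects true reliability is exactly the kind of question the articles and commentaries in this issue debate.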

Commentary on Dr. Popplewell's articles is provided by Dr. Stephen Birch, a long-term acupuncture practitioner and researcher. Dr. Birch has written extensively on the challenges of conducting clinical research in acupuncture. He is a founding member of the international Pattern Identification Network Group (iPING).²⁹ Dr. Birch's commentary looks more closely at the methodological issues addressed in Dr. Popplewell's work.

The third group consists of only two articles, but represents a large and important third wave of work underway in China.³⁰ Recognizing that the empirical knowledge collected during clinical practice provides the fundamental source for TCM research, these articles focus on mining large-scale clinical data sets both to generate hypotheses and to support clinical decisions. Rather than relying on a structured data entry template, they acknowledge that the heterogeneity inherent in the TCM clinical reasoning process is at the core of this empirical knowledge. They instead utilize a huge data capturing framework, plus data processing and analysis, to incorporate the flexible structured narrative of TCM. This innovative methodology may emerge as an important tool in TCM research; however, it appears to de facto bypass IRR altogether and to rest on dual assumptions that (1) big data can potentially validate TCM theory, and (2) practitioners have equivalent or interchangeable knowledge of TCM. An argument can be made, defensible if challenging, that it is necessary to model the underlying understanding of TCM first and then to explore how closely the data fit that model (Scott Mist, personal communication).

A reflective commentary on this group of articles is provided by Scott Mist, PhD, Assistant Professor at Oregon Health & Science University and JACM biostatistics editor, who brings a unique combination of clinical acupuncture practice and training in biostatistics and complexity science. Dr. Mist highlights the challenges presented by the innovative statistical analysis and data mining used by our Chinese colleagues, given that they rely on “unsupervised methods” that assume that the true nature of TCM diagnoses can be found by looking at the structure of the data, with no prior construct validation.

At the present time, none of these questions is settled. If the complexity science view is accurate, then IRR of TCM assessments may remain intractably elusive. However, several new articles in this issue present innovative methods for improving IRR, which should not be discounted without careful consideration. At this juncture, it may be useful to step back and consider what assumptions about the nature and construction of knowledge might underlie the concept of IRR, or affect our understanding of it. In considering the articles, we invite readers to ask the following questions.

Epistemic challenges

For IRR to be an operable concept, TCM patterns must be stable entities with agreed-upon definitions. Does this kind of fixed diagnostic category accurately represent TCM acupuncture and/or herbal medicine practice?

Is TCM diagnosis always the most important determinant of treatment? In the delivery of acupuncture for musculoskeletal pain, for example, qi stagnation is always diagnosed, possibly with other factors such as blood stasis or cold. However, needle placement and clinical success may vary widely between practitioners, much as two dentists might make different choices regarding the type and placement of filling material for the same diagnosis of dental caries.

Would a high level of IRR make individual practitioners interchangeable, and would that be desirable?

Complexity science regards health as an emergent property of complex living systems, and medical interventions as skillful collaborative perturbations of a disordered system that may help to stimulate homeoregulation.

○ If more than one sequence of initial assessment and perturbation may lead to clinical success, is IRR necessary?

○ In this context, what are the utility and validity either of assessing IRR or of using the ICD-11 diagnostic categories?

Research process and priorities

Is IRR necessary for establishing the validity of traditional constructs and ultimately for improving clinical effectiveness? Why?

Guidelines for clinimetric research do exist,³¹ concerned with issues such as the choice of component variables, and the evaluation of consistency, validity, and responsiveness.

○ How important is establishing IRR as a research priority for EAM in general, and specifically for TCM?

○ If it is a priority, how can we collaborate more closely with clinimetric experts to avoid methodological pitfalls?

If the process of EAM assessment does not constitute a “diagnosis,” how can we evaluate IRR in a way that is consistent with the EAM model?

If simplifying TCM assessment and diagnosis is potentially helpful in increasing IRR, what is a methodologically sound simplification process that does not obscure important clinical nuances?

○ How can such a simplification process take account of the multiplicity of other systems of practice within Traditional East Asian Medicine (TEAM)?

○ Would such a process need to be applied differently to acupuncture and herbal medicine systems?

Can big data mining provide an innovative model for TCM pattern validation in the absence of IRR?

○ If it does, what potentially important clinical nuances may be overlooked?

○ If the “binning” approaches used to simplify data for post hoc analysis begin to be applied to the assessment process itself, as in Popplewell's proposed system, what clinical information may be overlooked?

Conclusion

Although this Special Issue does not represent a comprehensive overview of the important work that has been conducted in the area of TCM reliability, the original articles and accompanying commentaries highlight critical questions in this field and give us pause in considering a way forward. Several challenges remain, which will require a dialogue across cultures, disciplines, and continents. Let us not ignore that fundamental work is currently taking place in China, Japan, and Korea, most of which cannot be easily accessed without fluency in these languages. Let us also recognize that the complexity of the human experience in the clinical encounter is not exclusive to TCM or even EAM, and that teams of highly qualified experts are working on applying novel systems biology models to better understand chronic complex disease. These too may provide valuable insights into the development of methodologies to better understand the TCM (and other EAM) clinical reasoning process. Let us remember that establishing IRR and construct validity is a well-developed field, with its own process and insights.

Acupuncture efficacy research was initially seen as a necessary preliminary to large-scale effectiveness research, viewed through the lens of the pharmaceutical research paradigm. Just as expert opinion on that question has shifted through enormous collaborative effort,³² it appears to be time to re-examine our current assumptions regarding validation of TCM diagnostic entities (patterns, tongue/pulse findings, etc.) and their associated IRR.

Our Call to Action

We propose to convene a multidisciplinary panel of experts, including EAM clinicians and clinical researchers, anthropologists, clinimetricians, and statisticians versed in complexity science, to join us in a think tank. This think tank could draft guidelines for assessing the reliability and validity of EAM systems, together with suggestions for analyzing results, based on current knowledge of the potential and limitations of established statistical methods: a sort of STRICTA³³ for conducting reliability studies. Our overarching vision is the development of a divergent model of validation that neither constrains nor ignores the difficult task of cross-fertilization. This is an evolving process. As such, it is iterative and will require time.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. World Health Organization. International Statistical Classification of Diseases and Related Health Problems (11th Revision). 2018. Online document at: https://icd.who.int/browse11/l-m/en, accessed September 11, 2019.

2. Vickers, Vertosick, Lewith, et al.; Acupuncture Trialists' Collaboration. Acupuncture for chronic pain: Update of an individual patient data meta-analysis. J Pain, 2018; 19:455–474.

3. Mann, Burch, Shakeshaft. Attitudes toward acupuncture among pain fellowship directors. Pain Med, 2015; 17:494–500.

4. Gorski. ICD-11: A triumph of the “integration” of quackery with real medicine. Science-Based Medicine. 2018. Online document at: https://sciencebasedmedicine.org/icd-11-a-triumph-of-the-integration-of-quackery-with-real-medicine, accessed August 29, 2019.

5. Cook, Campbell. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago IL: Rand McNally, 1979.

6. Chen, Wong, Cao, Lam. An evidence-based validation of traditional Chinese medicine syndromes. Complement Ther Med, 2010; 18:199–205.

7. […], Buerba, Long 3rd, et al. Interrater and intrarater agreements of magnetic resonance imaging findings in the lumbar spine: Significant variability across degenerative conditions. Spine J, 2014; 14:2442–2448.

8. Maldaner, Stienen, Bijlenga, et al. Interrater agreement in the radiologic characterization of ruptured intracranial aneurysms based on computed tomography angiography. World Neurosurg, 2017; 103:876.e1–882.e1.

9. Schnyer, Conboy, Jacobson, et al. Development of a Chinese medicine assessment measure: An interdisciplinary approach using the Delphi method. J Altern Complement Med, 2005; 11:1005–1013.

10. Lobbestael, Leurgans, Arntz. Inter-rater reliability of the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID I) and Axis II Disorders (SCID II). Clin Psychol Psychother, 2011; 18:75–79.

11. Dreessen, Arntz. Short-interval test-retest interrater reliability of the Structured Clinical Interview for DSM-III-R Personality Disorders (SCID-II) in outpatients. J Pers Disord, 1998; 2:138–148.

12. Wakefield. Diagnostic issues and controversies in DSM-5: Return of the false positives problem. Annu Rev Clin Psychol, 2016; 12:105–132.

13. Madhoo, Levine. Network analysis of the Quick Inventory of Depressive Symptomatology: Reanalysis of the STAR*D clinical trial. Eur Neuropsychopharmacol, 2016; 26:1768–1774.

14. Koithan, Bell, Niemeyer, Pincus. A complex systems science perspective for whole systems of complementary and alternative medicine research. Complement Med Res, 2012; 19(Suppl 1):7–14.

15. Schnyer, Allen. Bridging the gap in complementary and alternative medicine research: Manualization as a means of promoting standardization and flexibility of treatment in clinical trials of acupuncture. J Altern Complement Med, 2002; 8:623–634.

16. Spitzer, Williams, Gibbon, First. The Structured Clinical Interview for DSM-III-R (SCID). I: History, rationale, and description. Arch Gen Psychiatry, 1992; 49:624–629.

17. Shugen. Framework of pattern recognition model based on the cognitive psychology. Geo Spat Inform Sci, 2002; 5:74–78.

18. Allsopp, Read, Corcoran, Kinderman. Heterogeneity in psychiatric diagnostic classification. Psychiatry Res, 2019; 279:15–22.

19. Boes, Prasad, Liu, et al. Network localization of neurological symptoms from focal brain lesions. Brain, 2015; 138:3061–3075.

20. Hui, Pritzker. Terminology standardization in Chinese medicine: The perspective from UCLA Center for East-West medicine. Chin J Integr Med, 2007; 13:64–66.

21. Kottner, Audigé, Brorson, et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. Int J Nurs Stud, 2011; 48:661–671.

22. Cassidy. Chinese medicine users in the United States part II: Preferred aspects of care. J Altern Complement Med, 1998; 4:189–202.

23. Paterson, Britten. The patient's experience of holistic care: Insights from acupuncture research. Chronic Illn, 2008; 4:264–277.

24. MacPherson, Hammerschlag, Coeytaux, et al. Unanticipated insights into biomedicine from the study of acupuncture. J Altern Complement Med, 2016; 22:101–107.

25. Vickers, Goyal, Harland, Rees. Do certain countries produce only positive results? A systematic review of controlled trials. Control Clin Trials, 1998; 19:159–166.

26. Liu, Zhou, Wang, et al. Data processing and analysis in real-world traditional Chinese medicine clinical data: Challenges and approaches. Stat Med, 2012; 31:653–660.

27. Liu, Zhai, Han, et al. The future development of traditional Chinese medicine from the perspective of artificial intelligence with Big Data. In: 2018 IEEE 4th International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS). Omaha, NE: IEEE; 2018, 204–209.

28. Tuijn, Janssens, Robben, van den Bergh. Reducing interrater variability and improving health care: A meta-analytical review. J Eval Clin Pract, 2012; 18:887–895.

29. Lee, Lee, Alraek, et al. Current research and future directions in pattern identification: Results of an international symposium. Chin J Integr Med, 2016; 22:947–955.

30. Zhou, Chen, Liu, et al. Development of traditional Chinese medicine clinical data warehouse for medical knowledge discovery and decision support. Artif Intell Med, 2010; 48:139–152.

31. Fava, Tomba, Sonino. Clinimetrics: The science of clinical measurements. Int J Clin Pract, 2012; 66:11–15.

32. Vickers, Cronin, Maschino, et al.; Acupuncture Trialists' Collaboration. Acupuncture for chronic pain: Individual patient data meta-analysis. Arch Intern Med, 2012; 172:1444–1453.

33. MacPherson, Altman, Hammerschlag, et al. Revised STandards for Reporting Interventions in Clinical Trials of Acupuncture (STRICTA): Extending the CONSORT statement. PLoS Med, 2010; 7:e1000261.