Reliability of Chinese Medicine Diagnostic Variables in the Examination of Patients with Osteoarthritis of the Knee

Abstract

Background:

Chinese medicine (CM) has its own diagnostic indicators that are used as evidence of change in a patient's condition. The majority of studies investigating efficacy of Chinese herbal medicine (CHM) have utilized biomedical diagnostic endpoints. For CM clinical diagnostic variables to be incorporated into clinical trial designs, there would need to be evidence that these diagnostic variables are reliable. Previous studies have indicated that the reliability of CM syndrome diagnosis is variable. Little is known about where the variability stems from: the basic data collection level, the synthesis of diagnostic data, or both. No previous studies have investigated systematically the reliability of all four diagnostic methods used in the CM diagnostic process (Inquiry, Inspection, Auscultation/Olfaction, and Palpation).

Objectives:

The objective of this study was to assess the inter-rater reliability of data collected using the four diagnostic methods of CM in Australian patients with knee osteoarthritis (OA), in order to investigate if CM variables could be used with confidence as diagnostic endpoints in a clinical trial investigating the efficacy of a CHM in treating OA.

Methods:

An inter-rater reliability study was conducted as a substudy of a clinical trial investigating the treatment of knee OA with Chinese herbal medicine. Two (2) experienced CM practitioners conducted a CM examination separately, within 2 hours of each other, in 40 participants. A CM assessment form was utilized to record the diagnostic data. Cohen's κ coefficient was used as a measure of the level of agreement between 2 practitioners.

Results:

There was a relatively good level of agreement for Inquiry and Auscultation variables, and, in general, a low level of agreement for (visual) Inspection and Palpation variables.

Conclusions:

There was variation in the level of agreement between 2 practitioners on clinical information collected using the Four Diagnostic Methods of a CM examination. Some aspects of CM diagnosis appear to be reliable, while others are not. Based on these results, it was inappropriate to use CM diagnostic variables as diagnostic endpoints in the main study, which was an investigation of efficacy of CHM treatment of knee OA.

Introduction

Chinese medicine (CM) differs from conventional medicine (Western medicine) in a number of ways. For example, the concept of “disease” in CM is different from that in Western medicine. In CM, the term disease may correspond with a biomedically defined disease or condition, a disorder, or even just a symptom or a sign. In general, the CM diagnostic approach relies more on the clinician's interpretation of the patient's symptoms and signs than on laboratory tests, tending to be far more conceptual and less technologically driven.¹ The process of CM diagnosis starts with collection of clinical data, by taking a case history; making an overall observation of clinical signs; analysis of these signs according to CM theories (data analysis); and then drawing a conclusion with regard to the disease and CM Syndrome (data summary).^2,3 A CM Syndrome is a subcategory of a disease/disorder/symptom/sign characterized by particular signs and symptoms, and is indicative of the underlying pathogenesis of the condition at that point in time. CM Syndromes may change over time.

CM is characterized by its adoption of an integrated view of the mind and body. There are four diagnostic methods used in CM diagnosis: Inquiry; (visual) Inspection; Auscultation/Olfaction; and Palpation. Based on the information obtained by the four diagnostic methods, by a process known as Syndrome Differentiation, CM practitioners are able to diagnose the diagnostic subcategory or “pattern of disharmony” occurring in the body.

As in Western medicine, the consultation between the doctor and patient is also the fundamental art of CM. Inquiry is important not only to find out about the patient's life and symptoms, but also to determine the perceived causes of the disease/condition and underlying disharmony or Syndrome. In general, Inspection can be divided into three components: (1) inspection of the patient's vitality and complexion; (2) inspection of body parts; and (3) inspection of the tongue (tongue diagnosis). As a traditional medicine, the methods of direct Auscultation or Olfaction (without assistance of modern instrumentation) are still used, although there is less reliance on olfaction. This kind of diagnosis is based on the view that the sound of the voice, breath sounds, coughing, and body odors are related to the various internal organs. The diagnostic methods of Palpation refer to palpation of the pulse (pulse diagnosis), the head and neck, chest and abdomen, muscles, skin, extremities, and acupoints. Pulse diagnosis has been developed to a very sophisticated level in CM, assessing the characteristics of the radial pulse and interpreting these in relation to the functional and physiologic state of the internal organs. Pulse and tongue diagnosis are considered to be the “two pillars” of the four examination methods in traditional practice.

Several studies have investigated the reliability of CM Syndrome diagnosis in a variety of clinical conditions.^4–11 Overall, the majority of studies have indicated that CM Syndrome diagnosis is not particularly reliable. There was no analysis of the reliability of the clinical data collected in these studies; therefore, it is unknown where the inconsistency lies: at the basic level of data collection or in the synthesis of the data to form the Syndrome diagnosis. A few studies have investigated the consistency of data collected using two of the main CM diagnostic methods, tongue diagnosis^12,13 and pulse diagnosis.^13,14 One (1) study¹³ examined comprehensively the reliability of three of the four main methods of collecting clinical information, but there have been no studies that have investigated systematically the reliability of all four diagnostic methods.

CM Syndrome diagnosis is dependent on the individual diagnostic observations (data). If the basic diagnostic data are not reliable, there will be lower confidence that the CM Syndrome diagnosis would be reliable and, therefore, lower confidence that optimal treatment is being received by the patient (because treatment follows Syndrome diagnosis). If CM diagnostic variables are consistent, then they can also be used in clinical trials as outcome indicators to assess the efficacy of CM therapies in addition to biomedical diagnostic endpoints. Thus, a study of the reliability of the CM diagnostic process was conducted as part of a clinical trial investigating the efficacy of a Chinese Herbal Medicine (CHM) in knee osteoarthritis (OA).

Materials and Methods

This study was part of a clinical trial to assess the efficacy of a CHM for treating Australian patients with knee OA. The trial was conducted in the Centre for Clinical Studies of the Nucleus Network, and the Alfred Hospital in Melbourne, Australia. Ethics approval was obtained from the Victoria University ethics committee and the Alfred Hospital ethics committee.

Practitioners and participants

Two (2) CM practitioners participated in this study. Both were registered CM practitioners in the state of Victoria, Australia, with bachelor degree–level training in CM and at least 10 years of clinical experience. One (1) practitioner was trained and was practicing in China, and the other practitioner had completed training and was in clinical practice in Australia.

Forty (40) eligible patients from the CHM study participated in the inter-rater reliability study. The inclusion criteria for the main clinical study included unilateral or bilateral OA of the knee and fulfillment of the criteria provided by the American College of Rheumatology (ACR) 1995,¹⁵ mainly based on knee pain and presence of radiographic osteophytes. Radiographic evidence of OA was based on the Kellgren-Lawrence radiographic system,¹⁶ either grade II or grade III severity primary tibio-femoral OA as a condition of inclusion. The exclusion criteria of the study included secondary OA or rheumatoid inflammatory or any other type of arthritis, accompanying OA of hip of sufficient severity to interfere with the functional assessment of the knee, having received intra-articular treatment of the involved joint or joint lavage in the previous 6 months (e.g., corticosteroids or hyaluronic acid) or knee surgery during the previous 3 months, and any significant systemic illnesses or medical conditions that could lead to difficulty with complying with the protocol.

There was no CM assessment made for any of the participants prior to the study.

Materials

A CM assessment form was utilized to record systematically CM diagnostic information during a CM examination. This form was developed based on standard questions and observations using the four diagnostic methods: Inquiry; Inspection; Auscultation; and Palpation, as set out in the State Administration of Traditional Chinese Medicine's Advanced Textbook on Traditional Chinese Medicine and Pharmacology, vol. 1,¹⁷ a standard textbook used in CM curricula in China. The form was originally developed by consensus of a group of Australian researchers, and applied in a study of the reliability of CM diagnosis in hypercholesterolemic Australians, and, thus, had face validity and content validity.¹³ This assessment form was modified to include specific questions related to the knee joint.

The clinical information in the form recorded clinical data in a manner that allowed it to be readily analyzed. The first section (Inquiry, case history) consisted of a series of questions related to bodily functioning, with extra questions about the knee joint added for this study. The other sections recorded information relating to Inspection, Auscultation, and Palpation. For most questions, information was recorded as categorical variables, and practitioners were required to choose one answer from a limited range of answers.

With respect to pulse diagnosis, the basic characteristics recorded in this study were pulse speed, force, and depth. Pulse speed was measured as the number of beats per breath of the patient. A normal speed is considered to be 4–5 beats per breath. A slow pulse is defined as ≤3 beats per breath, and a fast pulse is defined as ≥6 beats per breath. The CM assessment form used in this study is considered to have face validity, construct validity, and criterion validity in terms of the theoretical structure and the real practice of CM diagnosis. Diagnostic endpoints utilized in the CM assessment form are set out in Table 1.

Table 1.

Chinese Medicine Diagnostic Endpoints

Diagnostic method	Diagnostic variables	Investigated endpoints	Response options
Inquiry	Body temperature	Presence of chills	Yes; No
		Severity of chills	Slight; Moderate; Strong
		Presence of cold	Yes; No
		Severity of cold	Slight; Moderate; Very
		Presence of cold hands and feet	Yes; No
		Severity of cold hands and feet	Slight; Moderate; Very
		Presence of the sensitivity to cold	Yes; No
		Severity of the sensitivity to cold	Slight; Moderate; Very
		Presence of the sensitivity to heat	Yes; No
		Severity of the sensitivity to heat	Slight; Moderate; Very
		Presence of fever	Yes; No
		Timing of fever	Constant; Alternating
		Severity of fever	Slight; Moderate; Strong
	Sweat	Presence of spontaneous sweating	Yes; No
		Severity of spontaneous sweating	Slight; Moderate; Heavy
		Presence of night sweating	Yes; No
		Severity of night sweating	Slight; Moderate; Heavy
	Headache	Presence of headache	Yes; No
		Location	Left; Right; Back; Forehead; Top
		Onset time	Morning; Afternoon; Evening; Anytime
		Type	Throbbing; Distending; Pressure; Moving; Dull
		History	<2 weeks; 2–4 weeks; 1–3 months; 3–6 months; >6 months
	Dizziness	Presence of dizziness	Yes; No
		Onset time	Morning; Afternoon; Evening
		Onset frequency	Daily; every 2–3 days; every 4–6 days; weekly; fortnightly; monthly; 3 monthly; 6 monthly
		History	<2 weeks; 2–4 weeks; 1–3 months; 3–6 months; >6 months
	Body parts	Presence of numbness	Yes; No
		Location of numbness	R fingers; R hand; R arm; R toes; R foot; R leg; L fingers; L hand; L arm; L toes; L foot; L leg; Face; Head; Neck; Upper back; Mid-back; Lower back; Hips
		Presence of body pain	Yes; No
		Location of body pain	R fingers; R hand; R elbow; R arm; R toes; R foot; R knee; R leg; R face; R neck; R upper back; R mid-back; R lower back; R hip; L fingers; L hand; L elbow; L arm; L toes; L foot; L knee; L leg; L face; L neck; L upper back; L mid-back; L lower back; L hip
	Urine	Color	Clear; Light yellow; Dark yellow
		Quantity	Little; Moderate; Large
		Frequency	Normal; Frequent
		Night urination	Yes; No
		Burning Sensation	Yes; No
		Incomplete sensation	Yes; No
		Leakage	Yes; No
	Stools	Frequency of defecation	<1; day; 1–2; day; >2; day
		Quality of stools	Watery; Loose; Soft; Firm; Hard; Alternating loose and hard
		Presence of constipation	Yes; No
	Appetite	Level of appetite	Poor; Good; Excessive
		Taste in the mouth	No; Sweet; Salty; Bitter; Sour; Pungent; Bland; Greasy; Inability to taste
		Sensation after eating	Normal; Nauseous; Bloated; Belching; Tired
	Thirst	Presence of thirst	Yes; No; Yes but no desire to drink
	Chest	Abnormal sensation	Yes; No
		Cough phlegm	Yes; No
	Abdomen	Presence of abdominal pain	Yes; No
		Location	Upper; Centre; Lower; Left side; Right side; Moving
		Quality	Stabbing; Distending; Spasms; Other
		Pain aggravated by heat	Yes; No; Don't know
		Pain aggravated by pressure	Yes; No; Don't know
		Pain relieved by heat	Yes; No; Don't know
		Pain relieved by pressure	Yes; No; Don't know
		Presence of bloating	Yes; No
	Ears and eyes	Hearing	Normal; Slight decrease; Significant decrease
		Presence of tinnitus	Yes; No
		Presence of blurred vision	Yes; No
		Presence of dry eyes	Yes; No
		Presence of watery eyes	Yes; No
		Presence of sensitivity to light	Yes; No
		Presence of sensitivity to wind	Yes; No
	Sleep	Quality	Poor; Good; Very Good
		Dreaming	No; Seldom; Occasional; Frequent
		Difficulty falling asleep	Yes; No
		Waking up at night	Yes; No
	Energy	Energy level	Little; Moderate; Abundant
	Breath	Presence of shortness of breath	Yes; No
	Emotions	Emotions experienced over the past week	Calm; Anxiety; Excessive thinking; Irritability; Sadness; Fearfulness; Excessive happiness
	For women	Status of menopause	Yes; No
	Knee joints	Presence of knee pain	Yes; No
		Affected knee	Right; Left; Both
		Pain duration	Constant; Variable; Moving
		Pain quality	Throbbing; Heavy; Dull; Burning; Grinding
		Weather which affected pain	No; Cold; Damp; Windy; Hot
		Presence of knee stiffness	Yes; No
		Onset time of stiffness	Morning; Afternoon; Evening; Occasional
		Stiffness duration	<30 minutes; 30 minutes to several hours; All the time
		Presence of knee swelling	Yes; No
		Swelling location	Right; Left; Both
		Swelling sensation	No; Hot; Cold; Other
		Swelling history	<2 weeks; 2 weeks to 1 month; 2–6 months; >6 months
		Status of knee strength	Normal; Weaker but doesn't affect activities; Weaker and affects activities; Need assistance
		Knee weakness history	<1 month; 2–6 months; 6–12 months; >1 year
		Presence of muscle atrophy	No; Little; Notable
Inspection	Spirit	Strength	Poor; Moderate; Strong
	Complexion	Colour	Normal; Yellow; Pale; Red; Green; Black
		Lustre	Dry; Moist
	Hair	Amount	Plentiful; Thinning; Receding forehead; Balding
		Appearance	Dry; Lustrous
	Physical build	Body frame	Small; Medium; Large
		Musculature	Muscular; Flaccid
		Body fat	Underweight; Moderate weight; Overweight
	Posture	Sitting posture	Upright; Slumped
		Walking posture	Restricted; Unrestricted
	Tongue body	Size	Small; Moderate; Swollen; Thin
		Teeth marks	Yes; No
		Colour	Pale; Pink; Red/Crimson; Purple
		Red tip	Yes; No
		Constitution	Soft; Firm
		Cracks	Yes; No
		Spots	Yes; No
		Trembling	Yes; No
		Deviation	Yes; No
	Tongue coating	Presence	Yes; No
		Thickness	Absent; Very thin; Thin; Thick
		Quality	Dry; Moist; Sticky; Curdy; Peeled
		Colour	White; Yellow; Grey; Green; Black
	Lips	Colour	Pale; Pink/Red; Bright red; Purple
Auscultation	Voice	Strength of voice	Soft; Moderate; Loud
	Breath sounds	Characteristics of sounds	Normal/Silent; Wheezing; Heavy
Palpation	Pulse	Speed (Left)	Slow; Moderate; Fast
		Location (Left)	Superficial; Mid-level; Deep
		Force (Left)	Weak; Moderate; Forceful
		Speed (Right)	Slow; Moderate; Fast
		Location (Right)	Superficial; Mid-level; Deep
		Force (Right)	Weak; Moderate; Forceful
	Hands	Temperature	Cold; Warm; Hot
		Moisture	Sweaty; Neither sweaty nor dry; Dry

Study procedures

The reliability study was conducted at the first study visit of the CHM clinical trial. Both practitioners conducted the CM assessment separately on the same day, with a maximum of 30 minutes between the two CM assessments (each session lasted approximately 30 minutes). There was no training or discussion between the 2 practitioners prior to or after the CM assessment and the CM assessment forms were kept separate until data analysis. Data entry and analysis were conducted by the first author of this article.

Statistical analysis

The level of agreement (%) for the key diagnostic endpoints was calculated, along with the (Cohen's) κ coefficient. If data were missing from at least one tabulation cell (the responses of an endpoint were concentrated on a particular choice) the κ coefficient could not be calculated. The interpretation of κ values used in this study¹⁸ is shown in Table 2.
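As a rough illustration of the two statistics described above, the following Python sketch computes percentage agreement and Cohen's κ from paired categorical ratings. The function name `cohen_kappa` and the example ratings are hypothetical. The rule used here for "κ could not be calculated" (the two raters did not use the same set of response options, so the cross-tabulation is not square) is an assumption intended to mirror the description above; it is not necessarily the exact rule applied by the software used in this study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Return (percent_agreement, kappa) for two raters' paired categorical ratings.

    kappa is None when the raters did not use the same set of categories
    (assumed stand-in for a missing tabulation cell) or when chance
    agreement equals 1, in which case kappa is undefined.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be paired and non-empty")
    n = len(ratings_a)
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    percent_agreement = 100 * matches / n

    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    if set(count_a) != set(count_b):
        return percent_agreement, None  # cross-tabulation is not square

    observed = matches / n
    # Chance agreement: product of the two raters' marginal proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if expected == 1:
        return percent_agreement, None
    return percent_agreement, (observed - expected) / (1 - expected)


# Hypothetical ratings for 40 participants on a Yes/No endpoint.
rater_1 = ["Yes"] * 10 + ["No"] * 30
rater_2 = ["Yes"] * 8 + ["No"] * 2 + ["Yes"] * 2 + ["No"] * 28
print(cohen_kappa(rater_1, rater_2))
```

Returning `None` rather than a number keeps the "not calculable" endpoints distinct from endpoints where κ is genuinely close to zero.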

Table 2.

Interpretation of κ Values

κ value	Strength of agreement
<0.00	Poor
0.00–0.20	Slight
0.21–0.40	Fair
0.41–0.60	Moderate
0.61–0.80	Substantial
0.81–1.00	Almost perfect
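The bands in Table 2 can be expressed as a small lookup, sketched below in Python. The function name `interpret_kappa` is hypothetical, and rounding κ to two decimal places before classification is an assumption made because the table bands are stated to two decimals (so values such as 0.205 would otherwise fall between bands).

```python
def interpret_kappa(kappa):
    """Map a kappa value to the strength-of-agreement label in Table 2."""
    if kappa is None:
        return "Not calculable"
    if not -1 <= kappa <= 1:
        raise ValueError("kappa must lie between -1 and 1")
    k = round(kappa, 2)  # assumption: classify on the two-decimal value
    if k < 0:
        return "Poor"
    bands = (
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    )
    for upper, label in bands:
        if k <= upper:
            return label


for value in (-0.06, 0.20, 0.40, 0.58, 0.81):
    print(value, interpret_kappa(value))
```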

Results

A total of 40 study participants were recruited: 23 females and 17 males. The mean age of participants was 62.2 years (standard deviation [SD]=11.0) and the age range was 42–92.

Inquiry variables

There were 84 investigated endpoints in the Inquiry section. Insufficient data were collected for 9 variables, and a further 23 variables could not be analyzed using the κ coefficient because there were missing data in at least one of the tabulation cells (mainly because of the consistent recording of one particular response by one of the practitioners). Inter-rater reliability was assessed for the remaining 52 key Inquiry variables and ranged from “poor” to “almost perfect.” Results are shown in Table 3.

Table 3.

Reproducibility of Key Inquiry Variables

Investigated endpoints		Valid cases (n)	Level of agreement (%)	κ of result	κ SE (κ 95% CI)	Interpretation of κ
Presence of chills		40	80.0	0.23	0.19 (−0.13–0.60)	Fair
Presence of cold		31	70.0	0.35	0.29 (−0.22–0.92)	Fair
Presence of cold hands and feet		40	87.5	0.68	0.13 (0.44–0.93)	Substantial
Presence of the sensitivity to cold		38	85.0	0.73	0.12 (0.49–0.97)	Substantial
Presence of the sensitivity to heat		40	82.5	0.63	0.12 (0.39–0.88)	Substantial
Presence of fever		40	97.5	^a	^a	^a
Presence of spontaneous sweating		40	87.5	0.73	0.11 (0.52–0.94)	Substantial
Presence of night sweating		40	90.0	0.76	0.11 (0.55–0.98)	Substantial
Presence of headache		40	97.5	0.94	0.06 (0.83–1.05)	Almost perfect
If yes	Location	12	75.0	^a	^a	^a
	Onset time	12	83.3	^a	^a	^a
	Type	12	58.3	^a	^a	^a
	History	12	75.0	^a	^a	^a
Presence of dizziness		40	95.0	0.86	0.10 (0.66–1.05)	Almost perfect
If yes	Onset time	3	66.7	^a	^a	^a
	Onset frequency	7	42.9	^a	^a	^a
	History	8	37.5	^a	^a	^a
Presence of numbness		40	97.5	0.95	0.05 (0.84–1.05)	Almost perfect
If yes	Location of numbness	14	28.6	^a	^a	^a
Presence of body pain		40	95.0	0.48	0.31 (−0.12–1.08)	Moderate
If yes	Location of body pain	37	43.2	^a	^a	^a
Urine colour		40	85.0	0.72	0.10 (0.52–0.92)	Substantial
Urine quantity		40	92.5	0.68	0.17 (0.35–1.02)	Substantial
Urine frequency		40	75.0	0.40	0.14 (0.13–0.68)	Fair
Night urination		40	92.5	0.82	0.10 (0.62–1.01)	Almost perfect
Urine: burning sensation		40	100.0	1.00	0.00 (1.00–1.00)	Almost perfect
Urine: Sensation of incomplete urination		40	87.5	−0.06	0.03 (−0.12–0.00)	Poor
Urine: Leakage		40	90.0	0.78	0.10 (0.57–0.98)	Substantial
Frequency of defecation		40	87.5	0.72	0.12 (0.49–0.95)	Substantial
Quality of stools		40	70.0	0.58	0.10 (0.39–0.77)	Moderate
Presence of constipation		40	95.0	0.84	0.11 (0.63–1.05)	Almost perfect
Level of appetite		40	95.0	0.48	0.31 (−0.11–1.08)	Moderate
Taste in the mouth		40	97.5	0.87	0.12 (0.64–1.09)	Almost perfect
Sensation after eating		40	90.0	^a	^a	^a
Presence of thirst		40	65.0	0.38	0.13 (0.12–0.65)	Fair
Abnormal chest sensation		40	95.0	0.72	0.18 (0.37–1.08)	Substantial
Cough phlegm		34	85.0	1.00	0.00 (1.00–1.00)	Almost perfect
Presence of abdominal pain		40	95.0	0.84	0.11 (0.63–1.05)	Almost perfect
If yes	Location	7	57.1	^a	^a	^a
	Quality	7	28.6	^a	^a	^a
	Pain aggravated by heat	7	42.9	^a	^a	^a
	Pain aggravated by pressure	7	57.1	0.36	0.27 (−0.16–0.89)	Fair
	Pain relieved by heat	7	57.1	0.32	0.28 (−0.23–0.88)	Fair
	Pain relieved by pressure	7	71.4	0.56	0.27 (0.04–1.08)	Moderate
	Presence of bloating	7	85.7	^a	^a	^a
Hearing		40	95.0	0.91	0.06 (0.80–1.03)	Almost perfect
Presence of tinnitus		40	92.5	0.76	0.13 (0.50–1.01)	Substantial
Presence of blurred vision		40	92.5	0.69	0.19 (0.33–1.05)	Substantial
Presence of dry eyes		39	97.5	0.93	0.07 (0.80–1.06)	Almost perfect
Presence of watery eyes		40	92.5	0.82	0.10 (0.62–1.01)	Almost perfect
Presence of the sensitivity to light		39	85.0	0.74	0.11 (0.53–0.95)	Substantial
Presence of the sensitivity to wind		40	95.0	0.88	0.08 (0.72–1.04)	Almost perfect
Sleep quality		40	77.5	0.81	0.11 (0.60–1.03)	Almost perfect
Sleep dream		40	75.0	0.61	0.10 (0.41–0.81)	Substantial
Difficulty to fall asleep		40	92.5	0.84	0.09 (0.66–1.01)	Almost perfect
Waking up at night		40	90.0	0.66	0.15 (0.36–0.96)	Substantial
Energy level		40	80.0	0.54	0.14 (0.27–0.82)	Moderate
Presence of the shortness of breath		40	90.0	0.66	0.16 (0.35–0.96)	Substantial
Experienced emotions over the past week		39	66.7	^a	^a	^a
Status of menopause		21	100.0	1.00	0.00 (1.00–1.00)	Almost perfect
Presence of knee pain		40	95.0	^a	^a	^a
If yes	Affected knee	38	94.7	0.92	0.06 (0.80–1.03)	Almost perfect
	Pain duration	38	71.1	^a	^a	^a
	Pain quality	38	63.2	^a	^a	^a
	Affected weather	38	68.4	^a	^a	^a
Presence of knee stiffness		40	90.0	0.28	0.26 (−0.23–0.79)	Fair
If yes	Onset time of stiffness	35	68.6	^a	^a	^a
	Stiffness duration	34	73.5	0.51	0.12 (0.26–0.75)	Moderate
Presence of knee swelling		40	95.0	0.90	0.07 (0.76–1.03)	Almost perfect
If yes	Swelling location	21	47.6	0.22	0.17 (−0.10–0.55)	Fair
	Swelling sensation	21	85.7	^a	^a	^a
	Swelling history	21	95.2	^a	^a	^a
Status of knee strength		39	71.8	0.27	0.15 (−0.02–0.57)	Fair
If yes	Weakness history	33	72.7	^a	^a	^a
	Presence of muscle atrophy	33	69.7	0.51	0.13 (0.26–0.76)	Moderate

^a Data missing from at least one tabulation cell, so κ could not be calculated.

SE, standard error; CI, confidence interval.
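Several of the confidence limits in Table 3 exceed 1.00, which is what an untruncated normal-approximation interval (κ ± 1.96 × SE) would produce. The source does not state how the intervals were derived, so the Python sketch below is an assumption offered only as a plausibility check; the function name `kappa_wald_ci` is hypothetical.

```python
def kappa_wald_ci(kappa, se, z=1.96):
    """Normal-approximation (Wald) interval: kappa +/- z * SE, not truncated at 1."""
    half_width = z * se
    return kappa - half_width, kappa + half_width


# Reported values for "Presence of headache": kappa = 0.94, SE = 0.06.
lower, upper = kappa_wald_ci(0.94, 0.06)
print(f"{lower:.2f} to {upper:.2f}")
```

For the headache endpoint this gives limits close to the reported 0.83–1.05. The small difference would be expected if the published intervals were computed from unrounded κ and SE values.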

Inspection variables

A total of 24 Inspection variables were assessed. Nine variables could not be analyzed using the κ coefficient because one rater consistently chose the same response. Agreement for the remaining 15 Inspection diagnostic variables ranged from “poor” to “moderate.” Results are shown in Table 4.

Table 4.

Reproducibility of Inspection, Auscultation, and Palpation Variables

Investigated endpoints	Valid cases (n)	Level of agreement (%)	κ of result	κ SE (κ 95% CI)	Interpretation of κ
Strength of spirit	40	52.5	0.07	0.08 (−0.09–0.24)	Slight
Color of complexion	40	72.5	^a	^a	^a
Skin luster	25	68.0	0.14	0.13 (−0.11–0.39)	Slight
Amount of hair	40	75.0	0.54	0.11 (0.33–0.75)	Moderate
Appearance of hair	37	67.6	^a	^a	^a
Body frame	40	72.5	0.47	0.13 (0.21–0.73)	Moderate
Musculature	40	60.0	0.05	0.11 (−0.18–0.27)	Slight
Body fat	40	67.5	0.39	0.13 (0.14–0.64)	Fair
Sitting posture	40	97.5	^a	^a	^a
Walking posture	40	50.0	0.10	0.09 (−0.07–0.27)	Slight
Tongue size	40	52.5	0.15	0.12 (−0.08–0.38)	Slight
Presence of teeth marks on tongue	37	62.2	0.24	0.16 (−0.07–0.54)	Fair
Tongue colour	39	35.9	0.11	0.10 (−0.09–0.30)	Slight
Presence of red tip on tongue	39	74.4	0.22	0.13 (−0.04–0.48)	Fair
Tongue constitution	40	17.5	^a	^a	^a
Presence of cracks on tongue	40	60.0	0.26	0.12 (0.02–0.50)	Fair
Presence of spots on tongue	40	95.0	^a	^a	^a
Presence of tongue trembling	40	77.5	^a	^a	^a
Presence of tongue deviation	40	85.0	^a	^a	^a
Presence of tongue coating	40	90.0	−0.04	0.03 (−0.10–0.02)	Poor
Thickness of tongue coating	40	47.5	0.22	0.11 (0.00–0.44)	Fair
Quality of tongue coating	36	47.2	^a	^a	^a
Color of tongue coating	37	64.9	0.34	0.12 (0.09–0.58)	Fair
Colour of lips	40	85.0	^a	^a	^a
Strength of voice	40	82.5	^a	^a	^a
Character of breathing sounds	40	97.5	^a	^a	^a
Pulse speed (left)	39	76.9	−0.05	0.03 (−0.12–0.01)	Poor
Pulse location (left)	40	65.0	0.20	0.15 (−0.11–0.50)	Slight
Pulse force (left)	40	42.5	0.08	0.10 (−0.11–0.28)	Slight
Pulse speed (right)	40	77.5	0.11	0.16 (−0.20–0.42)	Slight
Pulse location (right)	40	67.5	0.31	0.14 (0.03–0.59)	Fair
Pulse force (right)	40	42.5	0.13	0.10 (−0.08–0.33)	Slight
Temperature of hands	40	95.0	0.66	0.32 (0.03–1.28)	Substantial
Moisture of hands	40	82.5	^a	^a	^a

^a Data missing from at least one tabulation cell, so κ could not be calculated.

SE, standard error; CI, confidence interval.

Auscultation variables

Agreement between 2 practitioners for the two auscultation variables was 82.5% for voice strength and 97.5% for characteristics of breath sounds. Given that one practitioner consistently chose the same responses for each participant for these variables, the κ coefficient could not be applied. Results are shown in Table 4.

Palpation variables

A total of eight Palpation variables were assessed. One variable (moisture of hands) could not be assessed using the κ coefficient because one rater consistently chose one response. Agreement ranged from “poor” to “substantial” for the palpation endpoints investigated. “Fair” agreement was found for the variable of location of right pulse. “Slight” agreement was found for four variables of pulse diagnosis (location and force of left pulse, force and speed of right pulse). A “poor” level of agreement was found for speed of the left pulse (level of agreement 76.9%). Results are shown in Table 4.

Discussion

In general, this study suggests that there is substantial variation in level of agreement for diagnostic information collected in a CM examination. In some cases, it appears to be quite reliable and in other cases, it appears to be unreliable. The majority of studies of reliability of CM diagnosis for a variety of clinical conditions have yielded relatively low levels of agreement among practitioners.^4–11 Little information is available about where in the diagnostic process the variability exists: at the basic level of data collection, at the stage of synthesizing the data according to various CM theories to arrive at the CM syndrome diagnosis, or both. The current study examined the reliability of the fundamental stage of the diagnostic process, namely data collection.

Reproducibility of Inquiry variables

Given that most of the information elicited from a case history comprises, by nature, subjective symptoms, it is common to find inconsistency in diagnostic assessments among different doctors, even in conventional medicine.^19,20 Zhang and colleagues argued that Inquiry as a diagnostic method is not a valid instrument, because of the low level of agreement on diagnosis found among 3 practitioners working with patients who had rheumatoid arthritis (RA).^4,10 The CM assessment form used in this study was an attempt to standardize the Inquiry process.

In the current study, a relatively good level of agreement for most Inquiry variables was found. For almost 50% of the Inquiry endpoints there was a “substantial” to “almost perfect” level of agreement.

The limitations of the κ statistic need to be considered when interpreting some items. A “fair” level of agreement was found for some symptoms, such as characteristics of abdominal pain; however, there was only a small number of valid cases for the κ analysis and therefore there were insufficient data from which to judge reliability. There were also responses to 23 questions that could not be analyzed. Reasons included missing data for κ tabulation cells, or small numbers of valid cases. For some variables where the κ coefficient indicated only “moderate” to “fair” levels of agreement, the actual percentage agreement may have been quite high. Caution should be exercised in interpreting the κ coefficient when responses are heavily skewed toward one particular response option. This is one of the limitations of the κ coefficient.²¹
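The effect of skewed responses can be seen in a small numerical sketch. The 2 × 2 cell counts below are hypothetical: they were chosen only because they reproduce the 87.5% agreement and κ of −0.06 reported for “sensation of incomplete urination” in Table 3, and the actual cross-tabulation is not given in the source. The function name `kappa_from_2x2` is likewise illustrative.

```python
def kappa_from_2x2(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Return (observed agreement, kappa) from the four cells of a 2 x 2 table."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n
    a_yes = both_yes + a_yes_b_no  # rater A's total "Yes" responses
    b_yes = both_yes + a_no_b_yes  # rater B's total "Yes" responses
    # Chance agreement from the marginal totals of each rater.
    expected = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts: the raters agree on "No" for 35 of 40 participants,
# but never agree on "Yes".
agreement, kappa = kappa_from_2x2(both_yes=0, a_yes_b_no=2, a_no_b_yes=3, both_no=35)
print(f"{agreement:.1%} agreement, kappa = {kappa:.2f}")
# prints "87.5% agreement, kappa = -0.06"
```

Because almost all responses fall in one category, chance agreement is already about 88%, so even a high observed agreement leaves κ near or below zero.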

Given the nature of the response options (fixed, categorical) and the use of a fixed form of words, it is not surprising that there was a reasonably high level of agreement on Inquiry variables. This contrasts with clinical practice in which asking an open-ended question may elicit more varied responses. Other research found the use of objective questionnaires instead of conventional case notes for Inquiry can increase agreement on CM diagnosis among practitioners for RA.²²

Reproducibility of Inspection variables

Inspection in CM is based on the concept of correspondence between the Internal Organs and their external manifestations. A “slight” to “fair” level of agreement was found for the majority of the eight Inspection variables assessed, indicating a relatively low level of agreement overall for Inspection variables in general. The results are similar to those found in other studies that found variable levels of agreement for Inspection endpoints.^12,13 For example, although not directly comparable to the current study because it used 3 practitioners, the study by O'Brien and colleagues found that the level of agreement between 3 practitioners for color and size of tongue body was “slight,” and for thickness of tongue coating it was “fair” (the level of agreement was higher when measured for at least 2 practitioners).¹³ The low level of agreement is indicative of the subjective nature of Inspection, which can require quite subtle observations. Clear definitions of these characteristics in CM are needed. Work is underway in China to develop objective methods for tongue diagnosis, including color detection instruments and computerized image analysis systems based on different color models.^23–26 These models are an attempt to quantify characteristics of the tongue body and tongue coating color. Although studies have found that tongue color identification systems can reflect the characteristics of tongue color and record similar judgments to those of CM practitioners, there are also several factors that could contribute to errors in measurement, such as the shape of tongue, the structure of its curved surface, and the sampling area.^24–26 In addition, studies in Western medicine have also indicated considerable variation among observers in physical examinations.^27–31

Reproducibility of Auscultation variables

The level of agreement between 2 practitioners on the strength of voice and character of breathing sounds was reasonably high in the current study, at 82.5% and 97.5%, respectively (although κ could not be calculated). This was a similar result to that of O'Brien and colleagues, who found a “moderate” to “almost perfect” level of agreement.¹³ However, in CM clinical practice, not much significance is attached to Auscultation because the hearing ability of practitioners varies from person to person. Some research has attempted to quantify and interpret the voice using a spectrogram.^32,33 However, Auscultation in CM also includes many other sounds such as eructation, groaning, and crying, which have not been studied with respect to reliability. More studies are needed to establish the reproducibility of Auscultation variables.

Reproducibility of Palpation variables

Palpation in CM includes palpation of the chest, abdomen, other body parts as indicated, meridians, and acupoints, but the most important aspect in clinical practice is pulse diagnosis. The results of this study indicated “poor” agreement for left pulse speed, “slight” agreement for four aspects (left pulse location and force, and right pulse speed and force), and “fair” for one aspect (right pulse location). This is not dissimilar to O'Brien and colleagues' study that found the level of agreement among 3 practitioners was “slight” for location and “fair” for force for the right pulse.¹³ The results support the notion that pulse diagnosis is the most difficult part of the art of CM diagnosis, which requires extensive clinical experience to master.³
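The verbal labels used for the level of agreement (“poor,” “slight,” “fair,” and so on) appear to follow the benchmarks of Landis and Koch (reference 18). As a rough sketch, assuming the commonly quoted cut points of that scheme, a κ value can be mapped to a label as follows. The function name `landis_koch_label` is illustrative and does not come from the study.

```python
def landis_koch_label(kappa):
    """Map a kappa value to a Landis and Koch agreement label.

    Assumed cut points: <0 poor, 0.00-0.20 slight, 0.21-0.40 fair,
    0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
    """
    if kappa < 0.0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    return "almost perfect"

# Illustrative values only (not the kappa values obtained in the study).
for value in (-0.05, 0.15, 0.35, 0.90):
    print(value, landis_koch_label(value))
```

These labels are conventions rather than statistical tests, so a value close to a cut point should be interpreted with caution.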

Despite its crucial role in the diagnostic process in CM, pulse diagnosis is difficult to objectify and standardize. Some attempts have been made in China to objectify pulse diagnosis through the development of pulse-measuring apparatuses.^34,35 However, it has been argued that those apparatuses are simply pulse-tracing devices based on anatomy and physiology as described in conventional medicine rather than in CM theory.³⁶ Current research indicates that unambiguous definitions of pulse characteristics are critical in pulse research.^37,38 Standardizing the pressure applied (i.e., measurement of finger strength) in pulse detection is the most difficult part of pulse research.³⁶ A review of reliability studies of pulse diagnosis³⁸ showed that the level of agreement varies from “low” to “very good.”

Study limitations

There were a number of factors that may have affected the results of this study. First, the use of the CM assessment form may have interrupted or sidetracked the clinical thinking process by requiring the practitioner to ask about symptoms/signs that were not necessarily relevant. This may be likened to breaking a whole picture into too many (diagnostic) pieces. Some of the knee-related questions were more general, and it emerged that they may not have captured the experience of the condition of knee OA sufficiently. For example, some patients reported that they did not have pain or stiffness, but had “discomfort.”

Second, the difference in CM training and experience between the 2 practitioners may have contributed to the variability in observations and diagnosis. Some signs are open to interpretation, which is likely to be influenced by clinical training and experience.

Third, there was no training prior to the study. Other studies have found that prior training improved the level of agreement between practitioners.²²

Finally, there are limitations of the κ coefficient, which have already been discussed. κ can become unstable or even inappropriate as a statistic.³⁹
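One well-known way in which κ can behave unexpectedly is when nearly all ratings fall into a single category: observed agreement may be high while κ is close to zero. The following Python sketch illustrates this with two hypothetical 2 × 2 tables for 40 participants. The counts are invented for illustration and are not taken from the study, and the function name is an assumption.

```python
def agreement_and_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Return (observed agreement, Cohen's kappa) for a 2 x 2 table of counts.

    Assumes chance agreement is below 1; otherwise kappa is undefined.
    """
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    p_o = (both_yes + both_no) / n
    a_yes = both_yes + a_yes_b_no   # rater A's "yes" total
    b_yes = both_yes + a_no_b_yes   # rater B's "yes" total
    p_e = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / (n * n)
    return p_o, (p_o - p_e) / (1.0 - p_e)

# Hypothetical counts: both tables have 38 of 40 ratings in agreement.
balanced = agreement_and_kappa(19, 1, 1, 19)  # sign present in about half
skewed = agreement_and_kappa(38, 1, 1, 0)     # sign present in almost all

print(balanced)  # observed agreement 0.95, kappa about 0.90
print(skewed)    # observed agreement 0.95, kappa about -0.03
```

Both tables show 95% observed agreement, yet κ falls from about 0.90 to slightly below zero when the ratings are concentrated in one category. For this reason it can be helpful to report raw agreement alongside κ.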

To improve the reliability of CM data collected using the four diagnostic methods, clear definitions for all outcome variables should be established and prior training of the examiners (as some other studies have done) should be incorporated into future study designs.

Conclusions

This study investigated the reliability of the CM diagnostic process in a comprehensive and systematic way, investigating all four diagnostic methods. The level of agreement was not sufficiently high to justify inclusion of CM diagnostic variables in the current authors' main study investigating the efficacy of CHM for treating symptoms of knee OA (manuscript in progress). Nonetheless, the study in this article contributed important information about the reliability of the fundamental stage of reaching a diagnosis of a CM syndrome: the first step of data collection. The reader is cautioned, however, that this is only one study conducted for one condition (knee OA); therefore, no definitive generalisations about the reliability of data collection in CM can be made at this point. Further studies for a range of conditions are needed.

Footnotes

Acknowledgments

This study formed part of the PhD candidacy of the first author and was supported by Victoria University and Nucleus Network (Baker Medical Research Institute), Melbourne, Australia. The support of these institutions is gratefully acknowledged. This study formed part of a clinical trial into the efficacy of a Chinese herbal medicine in the treatment of symptoms of knee OA, registered with the Australian and New Zealand Clinical Trials Registry (ACTRN12608000468325).

Disclosure Statement

The authors state that no competing financial interests exist.

References

1. Zhang, Bausell, Lao, et al. Assessing the consistency of traditional Chinese medical diagnosis: An integrative approach. Alternat Ther Health Med, 2003; 9:66–71.

2. Deng, Guo, Cheng, et al. Diagnostics of Traditional Chinese Medicine [in Chinese]. Shanghai: Shanghai Science & Technology Press, 1984.

3. Maciocia. Diagnosis in Chinese Medicine: A Comprehensive Guide. Edinburgh: Churchill Livingstone, 2004.

4. Zhang, Lee W-L, Lao, et al. The variability of TCM pattern diagnosis and herbal prescription on rheumatoid arthritis patients. Alternat Ther Health Med, 2004; 10:58–63.

5. Sung JJY, Leung, Ching JYL, et al. Agreements among Traditional Chinese Medicine practitioners in the diagnosis and treatment of irritable bowel syndrome. Aliment Pharmacol Ther, 2004; 20:1205–1210.

6. MacPherson, Thorpe, Thomas, et al. Acupuncture for low back pain: Traditional diagnosis and treatment of 148 patients in a clinical trial. Complement Ther Med, 2004; 12:38–44.

7. Hogeboom, Sherman, Cherkin. Variation in diagnosis and treatment of chronic low back pain by Traditional Chinese Medicine acupuncturists. Complement Ther Med, 2001; 9:154–166.

8. Birch, Sherman. Zhong Yi acupuncture and low-back pain: Traditional Chinese medical acupuncture differential diagnoses and treatments for chronic lumbar pain. J Altern Complement Med, 1999; 5:415–425.

9. Coeytaux, Chen, Lindemuth, et al. Variability in the diagnosis and point selection for persons with frequent headache by Traditional Chinese Medicine acupuncturists. J Altern Complement Med, 2006; 12:863–872.

10. Zhang, Lee, Bausell, et al. Variability in the Traditional Chinese Medicine (TCM) diagnoses and herbal prescriptions provided by three TCM practitioners for 40 patients with rheumatoid arthritis. J Altern Complement Med, 2005; 11:415–421.

11. O'Brien, Abbas, Jiansheng, et al. An investigation into the reliability of Chinese medicine diagnosis according to Eight Guiding Principles and Zang-Fu Theory in Australians with hypercholesterolemia. J Altern Complement Med, 2009; 15:259–266.

12. Kim, Cobbin, Zaslawski. Traditional Chinese Medicine tongue inspection: An examination of the inter- and intrapractitioner reliability for specific tongue characteristics. J Altern Complement Med, 2008; 14:527–536.

13. O'Brien, Abbas, Zhang, et al. Understanding the reliability of diagnostic variables in a Chinese medicine examination. J Altern Complement Med, 2009; 15:727–734.

14. King, Cobbin, Walsh, et al. The reliable measurement of radial pulse characteristics. Acupunct Med, 2002; 20:150–159.

15. Hochberg, Altman, Brandt, et al. Guidelines for the medical management of osteoarthritis: Part II. Osteoarthritis of the knee: American College of Rheumatology. Arthritis Rheum, 1995; 38:1541–1546.

16. Kellgren, Lawrence. Radiological assessment of osteoarthrosis. Ann Rheum Dis, 1957; 16:494–502.

17. State Administration of Traditional Chinese Medicine. Advanced Textbook on Traditional Chinese Medicine and Pharmacology, vol. 1. Beijing: New World Press, 1995.

18. Landis, Koch. The measurement of observer agreement for categorical data. Biometrics, 1977; 33:159–174.

19. Moyer, Ahn, Sneed. Accuracy of clinical judgment in neonatal jaundice. Arch Pediatr Adolesc Med, 2000; 154:391–394.

20. Santucci, Biggeri, Feller, et al. Accuracy, concordance, and reproducibility of histologic diagnosis in cutaneous T-cell lymphoma: An EORTC Cutaneous Lymphoma Project Group Study. Arch Dermatol, 2000; 136:497–502.

21. Crewson. Fundamentals of clinical research for radiologists: Reader agreement studies. Am J Roentgenol, 2005; 184:1391–1397.

22. Zhang, Singh, Lee, et al. Improvement of agreement in TCM diagnosis among TCM practitioners for persons with the conventional diagnosis of rheumatoid arthritis: Effect of training. J Altern Complement Med, 2008; 14:381–386.

23. Chen, Zhang. Study on key problems in objective studies of tongue complexion and facial complexion [in Chinese]. Chinese Arch Tradit Chinese Med, 2008; 26:1372–1374.

24. Wang, Yang, Zhou, et al. Image segmentation in tongue characterization [in Chinese]. J Biomed Eng (China), 2005; 22:1128–1133.

25. Wei, Shen, Wang, et al. A digital tongue image analysis instrument for Traditional Chinese Medicine [in Chinese]. Chinese J Med Instrumentation, 2002; 26:164–169.

26. Weng, Huang. Studies on externalization of application of tongue inspection of TCM [in Chinese]. Eng Sci (China), 2001; 3:78–82,93.

27. Joshua, Celermajer, Stockler. Beauty is in the eye of the examiner: Reaching agreement about physical signs and their value. Intern Med J, 2005; 35:178–187.

28. Gjørup, Bugge, Jensen. Interobserver variation in assessment of respiratory signs: Physicians' guesses as to interobserver variation. Acta Med Scand, 1984; 216:61–66.

29. Vogel. Influence of additional information on interrater reliability in the neurologic examination. Neurology, 1992; 42:2076–2081.

30. Hansen, Sindrup, Christensen, et al. Interobserver variation in the evaluation of neurological signs: Observer dependent factors. Acta Neurol Scand, 1994; 90:145–149.

31. Shinar, Gross, Mohr, et al. Interobserver variability in the assessment of neurologic history and examination in the Stroke Data Bank. Arch Neurol, 1985; 42:557–565.

32. […], Zhang. Current research situation and perspective of Traditional Chinese Medicine auscultation [in Chinese]. Chinese J Basic Med Tradit Chinese Med, 1998; 4:54–56.

33. […], Cai, Zhang, et al. Clinical experimental research of objectification of auscultation in Traditional Chinese Medicine [in Chinese]. Chinese J Basic Med Tradit Chinese Med, 1998; 4:37–43.

34. Cai, Zhou, Huang, et al. Current progress of research on measuring sphygmus information [in Chinese]. J Biomed Eng (China), 2007; 24:709–712.

35. Yang, Niu, Wang. Summary of research development of pulse detecting devices and analysis methods [in Chinese]. J Beijing Med University of TCM, 2000; 23:68–69.

36. Tian, Lu. Review of the research principles of pulse diagnosis standardisation [in Chinese]. J Sichuan Tradit Chinese Med, 2008; 26:21–22.

37. King, Walsh, Cobbin. The testing of classical pulse concepts in Chinese medicine: Left- and right-hand pulse strength discrepancy between males and females and its clinical implications. J Altern Complement Med, 2006; 12:445–450.

38. O'Brien, Birch. A review of the reliability of traditional East Asian medicine diagnoses. J Altern Complement Med, 2009; 15:353–366.

39. Haas. Statistical methodology for reliability studies. J Manipulative Physiol Ther, 1991; 14:119–132.