Reliability of Manual Pulse Diagnosis Methods in Traditional East Asian Medicine: A Systematic Narrative Literature Review

Abstract

Background/Objective:

Little evidence shows the reliability of Chinese medicine pulse diagnosis. Regularly used in modern practice, it is believed to gather important diagnostic information. However, in the current evidence-based healthcare system, basing clinical decisions on unproven methods is problematic and obviously questions the relevancy of the procedure. Therefore, the literature on reliability of practitioners implementing the method was reviewed.

Methods:

Major medical databases and reference lists of identified articles were searched. All studies published in English that investigated manual pulse diagnosis applied to the radial artery by human testers were considered.

Results:

Twelve eligible studies were included; three evaluated intra- and inter-rater pulse diagnosis reliability, and nine assessed inter-rater reliability. Acceptable levels of intra- and inter-rater reliability were achieved with operationally defined methods. Poor reliability was related to unclear terminology within the classical definitions and, in standardized systems, to persisting imprecise descriptions that can be interpreted differently. Reliability of pulse qualities was influenced by sensation complexity and the amount of sensory input provided to the testers' fingers by the impulse. Consistent study limitations included small sample sizes; the possibility that testers' prior knowledge confounded the data; and, most notably, the fact that many studies did not consider intra-rater reliability. Assessing the effectiveness of interventions in clinical practice is guided by comparisons of markers to baseline. The absence of intra-rater results may therefore raise methodologic concerns for these types of studies.

Conclusion:

Strategies for future studies include using pulse methods with concrete operational definitions; investigating both intra- and inter-rater reliability for extrapolation to clinical practice; ensuring similar training and experience in the method to control for tester variance; maintaining independence of the data by ensuring testers have no prior knowledge of the participants' pulses; and, for more rigorous testing, considering the number of pulse variables, participants, and testers.

Introduction

Few studies have evaluated the reliability of diagnostic techniques used in Chinese and traditional East Asian medicine (TEAM).^1,2 Pulse diagnosis is commonly used in modern clinical practice³ and contributes significantly to palpation within the “four diagnostic methods.”⁴ According to tradition, radial artery palpation provides important diagnostic information concerning organ pathology, psychological functioning, constitutional potential, physiologic effects of previous illnesses or traumatic events, and the success of treatment methods administered.^4–8 Pulse descriptions therefore appear frequently in the Chinese medicine literature despite little evidence that demonstrates the reliability of the method.

The ongoing debate regarding the subjectivity of diagnostic procedures⁹ questions their relevancy and hinders the development of Chinese medicine and TEAM alongside current evidence-based healthcare practices. Significantly, information gathered by these methods is interpreted according to tradition and empirical assumption and directs clinical decisions regarding patient care. The legitimacy of using untested or unreliable methods, such as pulse diagnosis, is questionable and should be substantiated. Therefore, the literature on the reliability of practitioners applying manual pulse diagnosis to assess patients was reviewed.

Methods of Literature Review

Searches

Electronic databases searched in September 2015 included MEDLINE, PubMed, PsycINFO, ProQuest Central, CINAHL (EBSCO), and Google Scholar; these databases were searched from their inception. The basic search terms “pulse diagnosis” and “reliability” were applied, then further refined with the addition of the individual terms “agreement,” “acupuncture,” “Chinese medicine,” “Traditional East Asian medicine,” and “Oriental medicine.” Additional efforts were made to locate unpublished studies by hand searching references of records returned by electronic sources.

Eligibility criteria

Included studies were published in English (with no date or publication status restrictions) and incorporated manual methods of pulse diagnosis applied at the radial artery by human testers. Studies using practitioner and student testers, and healthy as well as ill participants, were considered. Outcome measures reported included intra- and/or inter-rater reliability of defined pulse characteristics. Studies that measured the radial artery using electronic or mechanical devices were excluded from the review.

Selection of studies

The primary author searched the literature and performed all eligibility assessments. Potential studies were screened initially by title, followed by abstract, and then by full text as each continued to meet the inclusion criteria.

Data collection identified 12 studies for inclusion, the process for which is outlined in Figure 1. The basic search terms returned 123 records, and refining terms narrowed this to 24; the studies removed primarily related to oximetry and measurement of pulse waves by devices and instruments. A first screening of titles removed duplicates, resulting in the exclusion of 9 more records. Examination of remaining abstracts and full texts excluded 7 further studies that were not relevant. Searching included references located another 4 cited studies that satisfied the eligibility criteria. Details of the 12 studies included in this review are presented in Table 1.
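The record flow described above can be tallied in a few lines. This is only an illustrative sketch: it assumes that 24 records remained after the refining terms were applied (the reading under which the counts reconcile with the 12 included studies), and the variable names are hypothetical rather than taken from the review.

```python
# Tally of the study-selection flow (variable names are illustrative).
refined_records = 24       # assumption: records remaining after refining terms
duplicates_removed = 9     # excluded at the title screening
irrelevant_removed = 7     # excluded at abstract/full-text screening
from_reference_lists = 4   # located by searching reference lists

included = (refined_records - duplicates_removed
            - irrelevant_removed + from_reference_lists)
print(included)  # 12
```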

FIG. 1.

Flow diagram of included studies.

Table 1.

Summary of Included Studies Evaluating the Reliability of Manual Methods of Pulse Diagnosis

Study	Participants (n)	Participant characteristics	Pulse testers	Method of pulse palpation used	Study design	Statistical analysis	Outcomes
Cole, 1977 (Unpublished)
Intra-rater A	Retested 5 of 29 participants	Hospital patients	1 acupuncturist (18 years' experience)	6 traditional pulse positions ascribed by Wang Shu He	Repeat practical test/retest; testers blinded to participants; no verbal communication; data recorded on standard pulse map using ++, +, 0, −, −− notation (+, yang; −, yin); intra-rater reliability calculated	Spearman rank correlation	1 participant identical pulse maps; 3 of 5 SRCT > 0.5. Possible for stable information to be obtained from a pulse; however, not achieved on all occasions
Intra-rater B	Retested 4 of 49 participants	Hospital staff and patients	1 acupuncturist (20 years' experience)	As for intra A with addition of 7 pulse qualities	As for intra A	As for intra A	SRCT for the repeated pulse maps 0.56, 0.31, 0.24, and 0.53. Showed varied intra-rater reliability.
Intra-rater C	Retested 2 of 10 participants	Participants of mixed health, age 18–56 yr	1 acupuncturist (18 years' experience), same as A	As for A	As for intra A	As for intra A	SRCT for the repeated pulse maps, 0.93 and 0.96. Indicated pulse findings were reproduced reliably.
Intra-rater D	12 participants	Healthy male medical students	1 acupuncturist (14 years' experience)	As for intra A	As for intra A	Spearman rank correlation; Wilcoxon test	SRCT 0.9, 0.9, 0.6, 0.5, 0.5, −0.1, 0.1, 0.7, 0.9, −0.5, 0, 0.5; Wilcoxon test verified that values were unlikely due to chance (p < 0.01). Showed that significantly similar pulse patterns for the same individuals were reported on 2 occasions.
Intra-rater over time	5 participants	Hospital patients	1 acupuncturist (20 years' experience), same as B	As for intra A	As for intra A; in addition, 2 participants retested 5 times, 3 retested 3 times	KCC	KCCs were 0.7, 0.95, 0.5, 0.3 and 0.4. Showed, to some extent, that stable pulse patterns were reported. 81% of variance in correlation was accounted for by physical and projected stereotyped components.
Inter-rater 1	12 participants	Acupuncture patients	3 testers: 1 acupuncturist (same as B) and 2 of his students	As for intra A; in addition both students were trained by the acupuncturist and presumably using the same technique	Practical test; testers blinded to participants; no verbal communication; data recorded on standard pulse map using ++, +, 0, −, −− notation. Inter-rater reliability calculated.	Spearman rank correlation; KCC.	SRCT −0.80 to 0.9 (with most being low). Indicated poor inter-rater reliability for blinded conditions. KCC of 0.2, 0.3, and 0.4; showed testers not generating stereotyped pulse patterns.
Inter-rater 2	12 participants	Healthy medical students; 9 male, 3 female	2 acupuncturists with 11 years' experience and (same as D) 14 years' experience.	As for intra A	As for inter-rater 1 but open testing; testers not blinded to participants	As for inter-rater 1	SRCT −0.8 to 0.8; showed varied (mostly poor) inter-rater reliability for open conditions. KCC 0.3 and 0.2; testers did not stereotype pulse patterns. Combined inter-rater 1 and 2 showed 1 tester could agree with himself but disagree with another tester; testers perceived their own gestalt in the pulses (recorded what they expected to feel) independent of being blinded to the participants.
Kass, 1990	10 participants	Sex not specified; mean age, 80.5 yr, all with medical conditions	2 TCM practitioners: 1 using manual palpation and 1 using electronic device	Not specified; assessed 3 depths in 3 pulse positions on right and left sides; 31 pulse patterns used	Practical test, testers assessed all participants. Data recorded on forms with 3 sections; general, individual positions and pulse subtype (qualities). Compared manual palpation to electronic assessment.	Normal approximation to binomial; α = 0.05	General and individual positions sections manual and electronic responses matched 79% and 70% (α = 0.0001). Subtype (qualities) had significant matches (α = 0.05) in <50% of groupings. Reliability decreased as levels of distinction became subtler.
Craddock, 1997 (Unpublished)	8 participants	4 men and 4 women; healthy participants	4 TCM practitioners	Not specified	Practical test/retest. Testers assessed 12 participants; 4 had repeat test; rated 7 pulse categories. Data recorded on standard form; testers blinded. Intra- and inter-rater reliability calculated.	Percentage agreement	Across all 7 categories average agreement: inter-rater, 63%; intra-rater, 56.1%. Lowest inter/intra-rater agreement found in category “individual qualities.” Inter/intra-rater reliability decreased as pulse quality complexity increased.
Walsh et al., 2001	18 participants total; collections 1, 2, 3, each for 6 different participants	44% male, 56% female	TCM students. Collection 1: 35; collection 2: 29; collection 3: 20	Not specified; cun, guan, and chi positions assessed	Practical lineal test, 3 data collections (pulse diagnosis class; collection 1, week 1; collection 2, week 14; collection 3, 1 year after); no repeat test on any participants. Rated 12 categories on pulse (total of 72 categories/collection); data recorded on a standard form. Inter-rater reliability calculated.	Chi-square test (α = 0.05)	Collection 2 inter-rater agreement greater than chance in 31 of 72 sets of data (α = 0.05, χ² = 0.046). Collection 1 and collection 3 agreement no different than chance alone. Low inter-rater agreement due to the inadequacies of the pulse literature versus ability of students to learn pulse diagnosis.
King et al., 2002	Collection 1: 66 participants; collection 2: 30 participants	Collection 1: 27 men, 39 women; collection 2: 13 men, 17 women	2 TCM practitioners	Researchers developed operational definitions and a standardized manual palpation method; definitions provided	Practical test/retest method; no apparent repeat test on any participants. Rated 16 categories on pulse; data recorded on a standard form. Inter-rater reliability calculated.	Chi-square test (α = 0.05); percentage agreement	Mean percentage agreement for collections 1 and 2 > 80%. Agreement in collection 1, 13 of 16 categories >70% (10 > 80%); collection 2 > 80% for 11 categories. With operationally defined pulse method acceptable levels of inter-rater reliability achieved.
King et al., 2006	65 participants	27 men, 38 women; healthy participants	2 TCM practitioners, same as in King et al. 2002 study	Cun, guan, and chi positions assessed; same method as in King et al. 2002 study	Practical test. Testers rated all participants' pulses with respect to dominant left–right balance; data recorded on a standard form.	Chi-square test (α = 0.05); percentage agreement	Inter-rater agreement 86% in rating the relative strength of the participants' pulses. Sex-related right–left pulse strength differences not supported.
O'Brien et al., 2009	45 participants	Participants with hypercholesterolemia; age range, 20–75 yr	3 TCM practitioners	Not specified whether testers used a consistent method	Part of larger study to assess the efficacy of Chinese herbal medicine. Included 3 of the 4 methods of diagnosis in Chinese medicine. Rated 3 categories on pulse; data recorded on a standard form. Inter-rater reliability calculated.	κ coefficient; κ interpreted with Landis and Koch values.	Location: all 3 testers slight agreement (κ = 0.15); 2 of 3 testers perfect agreement (κ = 1.0). Force: all 3 testers fair agreement (κ = 0.29); 2 of 3 testers almost perfect agreement (κ = 0.86). Speed: Compared assessment by breath (2 testers) to rate by electronic device (1 tester); κ = 0.72 and 0.86 (substantial and near perfect agreement). Agreement between 2 practitioners greater than agreement between all 3. Suggested redefining diagnostic process.
O'Brien et al., 2009	62 participants	22 men, 40 women; healthy participants	2 teachers from Toyohari Medical Association (Japan) with 10–12 years' experience using TMT	TMT pulse diagnosis	Part of larger study to assess physiologic correlates in the cardiovascular system related to TMT root treatment. Tested pulse, abdominal diagnosis. Rated pulse depth, strength, speed (beats/patient breath) on a 5-tier nominal scale. Data were recorded on a standard form, stored securely.	Percentage agreement; weighted κ. For analysis, pulse scale reduced from 5 to 3 tiers. κ interpreted with Landis and Koch values.	Level of agreement for pulse depth, 57%; speed, 61%; strength, 77%. Weighted κ for 3 scales: pulse categories depth, κ = 0.37; speed, κ = 0.40; strength, κ = 0.38. Concluded reasonable agreement for pulse characteristics with room for improvement.
Bilton et al., 2010; Bilton, 2012	15 participants; 1 did not return for retest, reducing total number of participants to 14	Healthy participants; 11 white, 2 Hispanic, 1 Asian; 3 men, 11 women	6 instructors for Dragon Rises Seminars with 7–15 years' experience using CCPD. 5 testers trained by the 6th, who documented CCPD. All testers attended biannual instructor meetings to further refine skills.	CCPD method operationally defined over 30 yr, documented in text Chinese Pulse Diagnosis: A Contemporary Approach	Real-life design (as per clinical practice), practical test/retest. Standard CCPD pulse-taking procedure used. Data recorded on standard CCPD forms, stored securely. Testers counted rates with a time device (beginning, end, exertion, exertion change). Assessed 11 pulse categories using bilateral or 6-finger palpation, 19 with single finger palpation. Recorded pulse qualities for each. Data management: 30 separate files created for 14 participants. If quality present (recorded by tester) assigned 2, if quality absent (not recorded) assigned 1. For each file, data organized according to tester and day of testing using notation t1d1, t1d2, t2d1, t2d2, t3d1, t3d2, t4d1 and t4d2, where t = tester and d = day. Intra-rater compared test/retest results for each tester (e.g., t1d1 × t1d2). Inter-rater compared results of 2 testers at a time across both days of testing.	κ coefficient. κ measured reliability in terms of pulse quality matches for categories. Intra-rater analysis gave 4 κ values for each of the 30 pulse categories in the 14 participants, totaling 4 × 30 × 14 = 1680 κ calculations. Inter-rater analysis gave 6 tester combinations (t1 × t2, t1 × t3, t1 × t4, t2 × t3, t2 × t4, and t3 × t4), 4 day combinations (d1 × d1, d1 × d2, d2 × d2, and d2 × d1), resulting in 24 κ values for each category in all participants, thus 24 × 30 × 14 = 10,080 κ calculations. For ease of handling and reporting, κ values were averaged for each of the 30 pulse categories, analyzed for trends according to tester, participant, testing day, pulse position and pulse quality. κ interpreted with Jelles et al. values.	Intra-rater: (1680 κ values): 43.2% (726) excellent agreement, κ ≥ 0.75; 42.5% (713) moderate to good agreement, κ = 0.41–0.74; and 14.3% (241) poor agreement, κ ≤ 0.40. Inter-rater: (10,080 κ values): 23.5% (2366) excellent agreement, κ ≥ 0.75; 46% (4642) moderate to good agreement, κ = 0.41–0.74; and 30.5% (3072) poor agreement, κ ≤ 0.40. Intra-rater reliability (67% κ values ≥0.60) greater than inter-rater (44.1%, κ ≥ 0.60). Showed testers tended to agree with their own judgments more often than they did with those of others. Bilateral or 6-finger palpation (72.1% intra-rater, 52.8% inter-rater κ ≥ 0.60) more reliable than single-finger (64.1% intra-rater, 39% inter-rater κ ≥ 0.60). Pulse quality reliability lower for those with more complex descriptions and those representing qi-yang deficiency. Higher incidence of poor agreement (κ ≤ 0.40) in 3 participants, 1 tester and 3 pulse positions. For position, intra-rater agreement greater than inter-rater indicated variance in tester technique for these positions, due to unclear CCPD terminology. Concluded acceptable reliability can be achieved with an operationally defined system of pulse diagnosis. Agreement depended on tester skill; stability of participant's pulse; specific pulse position/quality being assessed. Clarity of terminology helps control for variance relating to subjectivity of tester technique.
Hua et al., 2012	40 participants	Participants with unilateral or bilateral osteoarthritis of the knee meeting criteria of American College of Rheumatology based on pain and presence of radiographic osteophytes	2 practitioners with >10 years' experience	Specific pulse method not specified	Included all 4 methods of Chinese medicine diagnosis. Data recorded on a standard form. Assessed pulse speed (number of beats per breath of patient), location and force on a 3-tier nominal scale for left and right sides. Inter-rater reliability calculated.	Percentage agreement; κ coefficient. κ interpreted with Landis and Koch values.	Inter-rater reliability was fair for right pulse location (κ = 0.31); slight for left location (κ = 0.20), left force (κ = 0.08), right force (κ = 0.13) and speed (κ = 0.11), and poor for left speed (κ = −0.05). The results showed pulse diagnosis to be the most difficult part of Chinese medicine diagnosis, requiring extensive clinical experience to master. Recommended clear definitions be established and prior training of the examiners for future study designs.
Ko et al., 2013	628 participants	Participants admitted to hospital within <30 d after stroke	18 TCM experts with >3 years' experience with stroke patients	Pulse examination parameters extracted from report for the standardization of stroke diagnosis developed by the Korea Institute of Oriental Medicine	2 testers rated each participant. Pulse location, rate, force, shape (string-like, slippery, fine, rough, or surging) graded for severity on 1–3 scale. Also included pattern identification. Data recorded on standard stroke management forms. Data storage methods not specified.	Percentage agreement; κ coefficient; Gwet's AC1. κ interpreted with Jelles et al. values.	Inter-rater reliability results for pulse signs, κ ranged from poor (κ = 0.19) to moderate (κ = 0.49). AC1 measure of agreement was generally high, ranging from 0.65 to 0.93 (exception of slippery pulse, which had an AC1 of 0.38). Where testers concurred on pattern identification, κ ranged from moderate (κ = 0.40) to good (κ = 0.49), with exceptions of rough and sunken pulse (κ = 0.17 and 0.34, respectively). AC1 was generally high, ranging from moderate (AC1 = 0.41) to excellent (AC1 = 0.94). Concluded rater reliability for pulse diagnosis in stroke patients is not particularly high when objectively quantified. Suggested detail-oriented criteria and better training of the clinicians to improve reliability.
Lee et al., 2014	168 participants	Participants admitted to hospital within <30 d after stroke	2 TCM experts	Pulse method from standardization of stroke diagnosis (Korea Institute of Oriental Medicine)	2 testers rated each participant for pulse location (floating or sunken), rate (slow or rapid), force (strong or weak), and shape (slippery, fine, or surging). Each variable graded for severity on 1–3 scale	Percentage agreement; κ coefficient; Gwet's AC1. κ interpreted with Jelles et al. values.	κ ranged from poor (κ = 0.37) to moderate (κ = 0.61) while AC1 measures of agreement for the 2 experts were generally high and ranged from 0.66 to 0.89. Study showed that standardized pulse diagnosis for stroke diagnosis has good agreement. Recommended diagnostic indicators should be standardized to improve agreement among clinicians.

SRCT, Spearman rank correlation test; KCC, Kendall coefficient of concordance; TCM, Traditional Chinese Medicine; TMT, Toyohari Meridian Therapy; CCPD, Contemporary Chinese Pulse Diagnosis.
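The κ totals reported for the CCPD study in Table 1 follow from simple counting. The sketch below is illustrative only: it assumes four testers and two testing days, as implied by the t1d1 to t4d2 notation, and all variable names are hypothetical.

```python
from itertools import combinations, product

testers = ["t1", "t2", "t3", "t4"]   # assumption: four testers analysed
days = ["d1", "d2"]
categories = 30                      # pulse categories per participant
participants = 14

# Intra-rater: one test/retest comparison per tester.
intra_per_category = len(testers)                         # 4

# Inter-rater: unordered tester pairs crossed with ordered day pairs.
tester_pairs = list(combinations(testers, 2))             # 6 pairs
day_pairs = list(product(days, repeat=2))                 # 4 day combinations
inter_per_category = len(tester_pairs) * len(day_pairs)   # 24

intra_total = intra_per_category * categories * participants
inter_total = inter_per_category * categories * participants
print(intra_total, inter_total)  # 1680 10080
```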

The literature search located a previous review reporting the reliability of all TEAM diagnostic and treatment methods.^10* Although that review contributed somewhat to existing knowledge, the extent of its topic was enormous. The resulting analysis of pulse diagnosis studies was thus limited, in some cases overlooked relevant data, and presented the results in table format only. The discussion and conclusion addressed all diagnostic methods collectively, and, in doing so, lacked specificity for findings of the component diagnostic methods. Suggestion of strategies to improve reliability did not surpass that offered by the original studies. By concentrating on pulse diagnosis studies, this review presents a more comprehensive analysis of the current eligible literature.

Extraction of data items for analysis

A form based on the Cochrane Consumers and Communication Review Group's template¹¹ was designed to extract data and included (1) subject numbers, (2) participant characteristics, (3) testers, (4) method of pulse palpation used, (5) study design, (6) statistical analysis, and (7) reporting of agreement or reliability (Table 1). Methods and results were further explored for limiting factors and clues suggesting missing data or selective reporting bias.

Results

Cole's doctoral research^† included 16 smaller studies,^‡ 7 being relevant: 4 intra-rater reliability (intra-RR), 1 intra-rater reliability over time (intra-RROT), and 2 inter-rater reliability (inter-RR) studies. The intra-RR substudies retested participants from a larger sample (5 hospital patients, 4 hospital staff and patients, 2 mixed health [age 18–56 years], and 12 healthy male medical students). Intra-RROT retested five hospital patients. Inter-RR studies recruited 12 participants each (patients from one tester's clinic; healthy medical students: 9 male, 3 female). Pulse testers included acupuncturists with 14–20 years' experience and 2 students.

Intra-RR trials used practical tests/retests with testers blinded, talking forbidden, and a different tester for each substudy. Inter-RR studies compared responses of practitioners and students recorded on standardized pulse maps. Although the exact method of pulse palpation was not described, these maps used Wang Shu He pulse positions¹²; the intra-RR substudies included seven pulse qualities as cited in Chan.¹³ Intra-RR data were analyzed by using Spearman rank correlation test (SRCT) and Wilcoxon test (WT) to assess significance. Intra-RROT and inter-RR studies used Kendall coefficient of concordance (KCC)^§ to assess overlap of measured pulse components (constant physical components versus stereotyped components projected onto the pulse).
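For readers unfamiliar with the two rank statistics named above, the following is a minimal sketch of how they can be computed. It is not Cole's procedure, which is not described in detail here: the numeric coding of the pulse-map notation (++ as 2 down to −− as −2), the example pulse maps, and all function names are assumptions made for illustration, and Kendall's W is shown without a correction for ties.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_w(ratings):
    """Kendall's W for m raters (rows) over n items (columns), no tie correction."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    totals = [sum(r[j] for r in rank_rows) for j in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical pulse maps for six positions, coded ++=2, +=1, 0=0, -=-1, --=-2.
test_map = [2, 1, 0, -1, -2, 1]
retest_map = [2, 0, 1, -1, -2, 1]
print(round(spearman(test_map, retest_map), 2))
```

A correlation near 1 would indicate that the tester ordered the six positions similarly on both occasions; W near 1 would indicate that several testers ranked the positions alike.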

Intra-RR demonstrated high proportions of SRCT greater than 0.5; WT confirmed that chance is unlikely, showing it was possible for testers to reliably record stable information from a pulse. Corroborating these findings, intra-RROT KCC (0.3–0.95) indicated that reliable pulse patterns were recorded throughout the day. Eighty-one percent of variance was due to physical and stereotyped components (not chance). Low inter-rater reliability was demonstrated for both blinded and open conditions (low SRCT values), and it was established that testers did not generate stereotyped pulse patterns (KCC, 0.2–0.4).

Cole concluded that the same tester could reliably record “objective” pulse patterns within the same subject on different occasions; different testers did not detect similar findings. Irrespective of blinding, testers projected “subjective” influences or individual “gestalt” onto subjects' pulses. Although limited by small sample sizes, potential prior knowledge of subjects' pulses, and inadequacy in reporting some methods, the detail and range of procedures used to test manual pulse diagnosis support the basic conclusion that intra-rater was greater than inter-rater reliability.

Kass's doctoral research¹⁵ evaluated the reliability and validity of manual and electronic pulse-taking techniques. Ten subjects (mean age, 80.5 years) recruited from a seniors' health facility^** on the basis of their medical file, and two pulse testers with “extensive experience” (one using manual palpation, the other an electronic pulse-taking device) were included in the study.

In a blinded practical test, both testers examined subjects' pulses and aimed to replicate readings (reliability), then correctly match their medical files on the basis of pulse analysis alone (validity). Information was recorded on a form comprising three sections (general, subtype, and individual pulse). Although the exact method was not stated, 31 pulse patterns¹⁶ and 18 locations (three depths and positions bilaterally) were included. Data were analyzed by using normal approximation to the binomial (p < 0.05)¹⁷ to determine whether the results were better than chance alone.
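As a rough illustration of the normal approximation to the binomial mentioned above, the sketch below tests whether an observed match rate exceeds a chance rate. The chance probability, the counts, and the function name are hypothetical (the study's actual values are not given here), and no continuity correction is applied.

```python
import math

def binomial_z_test(matches, trials, chance):
    """One-sided test that the match rate exceeds `chance` (normal approximation)."""
    mean = trials * chance
    sd = math.sqrt(trials * chance * (1 - chance))
    z = (matches - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability of N(0, 1)
    return z, p_value

# Hypothetical example: 70 matches in 100 comparisons against a 50% chance rate.
z, p = binomial_z_test(matches=70, trials=100, chance=0.5)
print(z, p < 0.05)
```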

Outcomes for general and individual pulse sections (e.g., depth, intensity, amplitude, frequency) exhibited 79% and 70% matches, respectively (p < 0.0001), while pulse subtype (qualities) matches were significant in less than 50% of groupings (p < 0.05). Kass concluded that pulse diagnosis reliability decreased as more subtle levels of distinction were attempted. Although the relatively small sample size and the fundamental design of comparing a manual pulse-taking procedure with an electronic device were problematic, the inference of decreasing reliability as the subtlety of the measured variable increases was reasonable.

Craddock^†† investigated intra- and inter-practitioner reliability of pulse evaluation in a study conducted for undergraduate acupuncture course requirements. Eight subjects (24–56 years; four female, four male; one student, one staff, and six with no school affiliation) and four testers (teachers with 3 years' training and 5 years' experience) were included in the study.

With testers blinded and talking forbidden, seven pulse categories were rated according to a standardized questionnaire in a practical test and retest (of four subjects). There was no specific indication of the pulse model used and no citation for the source of definitions for pulse characteristics assessed by the testers. Li Shi Zhen was mentioned in reference to positioning of the subjects' wrists.

Reported in percentage agreement, outcomes were inter-rater agreement of 63.3% (six categories ≥56.3%) and intra-rater agreement of 56.1% (five categories ≥58.3%) with “individual qualities” as the least reliable category (intra- and inter-rater). Craddock concluded that disagreement resulted from inadequate operational definitions for pulse diagnosis, and intra- and inter-rater reliability decreased as pulse variables became more complex. Conclusions from this study must be considered in view of the small sample size and the possibility that testers' prior knowledge of some subjects' pulses confounded the data.

Different aspects of pulse diagnosis reliability were investigated in several studies from the University of Technology, Sydney (UTS).^9,18,19 Walsh et al.¹⁸ assessed agreement frequency of TCM students identifying specified pulse characteristics (e.g., speed, depth, volume, length, and quality) in 18 subjects (8 male, 10 female). A practical lineal test with testers blinded and talking forbidden was used to assess six subjects in three episodes of testing (week 1 pulse diagnosis classes, conclusion/week 14, then 1 year later). For each collection, standard assessment forms were used and testers rated 12 pulse characteristics (total of 72 pulse characteristics for each). The number of pulse testers (UTS students) for collection 1 was 35, 29 for collection 2, and 20 for collection 3. Cun, guan, and chi positions were palpated; however, the exact method and the source for included pulse descriptions were not provided. Data were analyzed using chi-square (χ²); level of significance was 0.05.

Stated outcomes were inter-rater agreement significantly greater than chance at collection 2 (χ² = 0.046) in 31 of 72 sets of data, with no difference from chance alone at collections 1 and 3. The authors concluded that poor agreement was due to the confusing information existing within the traditional pulse diagnosis literature rather than the ability of the students to learn the skill.

King et al.⁹ investigated whether a pulse diagnosis method with concrete operational definitions could reliably assess pulse parameters. The study involved two testers (UTS lecturers with 5 and 7 years' experience) and healthy persons recruited from UTS students, staff, and the general population (proportions not reported). Data collection 1 included 66 participants (27 men and 39 women), and data collection 2 included 30 participants (13 men and 17 women). Seventy percent of participants were European and 30% were Asian.

The researchers developed operational definitions for a standardized manual palpation method based on traditional pulse definitions and repeated practical test/retest procedures. The study used a practical test and retest (specific conditions, such as blinding, were not described); 16 pulse characteristics were rated in each participant, and inter-rater reliability was measured as percentage agreement. It is unclear whether the retest included participants from the initial test or if they were completely different population samples.

Outcomes reported were mean percentage agreement tested against the chi-square goodness of fit; level of significance = 0.05. Agreement for pulse characteristics across both collection phases was reported at 80% or greater. Data collection 1 showed greater than 70% agreement for 13 of 16 categories (10 were >80%) and data collection 2 showed 80% or greater for 11 categories. The authors concluded that acceptable inter-rater reliability was possible with a standardized pulse-taking procedure and concrete operational definitions.

King et al.¹⁹ investigated differences in right–left pulse strength in relation to sex and also reported inter-rater reliability. Having previously demonstrated acceptable inter-rater reliability,⁹ the researchers recruited 65 healthy participants from staff, students, and the general population (27 men, 38 women). With a practical test design and open conditions, with talking prohibited, testers rated participants' pulses for comparative left–right strength, presumably using the same pulse method as that used in King and colleagues' previous study.⁹ Data were analyzed by using percentage agreement and chi-square test; the level of significance was 0.05. Outcomes reported were inter-rater agreement of 86%; the Chinese medicine assumption of sex-related right–left pulse strength differences was not supported.

Although these studies^9,18,19 generally incorporated sound design, methods, and conclusions, all overlooked testing of intra-rater reliability. It is therefore unknown whether individual testers could replicate the methods reliably in the same participant on successive occasions, as required with patient re-evaluations in clinical practice. In addition, because testers and participants were recruited from the same university, the risk of bias existed. As it was not stated otherwise, testers' prior knowledge of some participants' pulses may have influenced the inter-rater results.

Australian researchers reported pulse diagnosis reliability in three papers,^20–22 all based on the results of larger studies. Each used a similar design and analyzed inter-rater reliability by using percentage agreement, Cohen's κ coefficient, or weighted κ interpreted by Landis and Koch values.²³ One investigated the reliability of TCM diagnostic methods in 45 participants (age 20–75 years) with hypercholesterolemia and no heart disease or serious medical conditions.²⁰ Three practitioners (5–20 years' experience) assessed pulse location, force and speed (one used a timing device, and the others counted by breath). Outcomes showed slight to fair agreement (κ = 0.15–0.29) for all three testers for pulse location and force. When only two testers were compared, agreement was higher for both categories (κ = 0.86). Agreement for speed assessed by a watch was higher (κ = 0.84 and 0.72) than that seen with traditional methods (κ = 0.63).
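The two statistics named above can be illustrated with a minimal sketch. The ratings below are hypothetical and are not taken from any of the reviewed studies; the function names are likewise illustrative. Percentage agreement counts matching ratings, whereas Cohen's κ discounts the agreement that two testers would be expected to reach by chance given how often each uses every category.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Proportion of participants given the same rating by both testers."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both testers must rate the same non-empty set of participants")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two testers: observed agreement corrected for chance."""
    n = len(ratings_a)
    p_observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of the two testers' marginal proportions per category.
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical pulse-force ratings for 10 participants (illustrative only).
tester_a = ["strong", "strong", "weak", "weak", "moderate",
            "strong", "weak", "moderate", "moderate", "strong"]
tester_b = ["strong", "moderate", "weak", "weak", "moderate",
            "strong", "moderate", "moderate", "weak", "strong"]

print(f"percentage agreement: {percent_agreement(tester_a, tester_b):.2f}")
print(f"Cohen's kappa:        {cohen_kappa(tester_a, tester_b):.2f}")
```

In this invented example the κ value is noticeably lower than the raw percentage agreement, which is why the two measures are usually reported together.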

O'Brien et al.²¹ reported on two Toyohari Meridian Therapy (TMT) instructors with 10–12 years' experience who diagnosed 62 healthy persons (age 20–65 years) according to TMT principles. In terms of pulse diagnosis, depth, strength, and speed were rated on a 5-tier nominal scale that was reduced to 3 tiers for analysis. Results showed agreement for pulse depth of 57%, speed of 61%, and strength of 77%. Weighted κ values for the 3-tier pulse categories were as follows: for depth, κ = 0.37; for speed, κ = 0.40; and for strength, κ = 0.38. The authors concluded that agreement for pulse characteristics was reasonable, with room for improvement.

Hua and colleagues²² reported inter-rater reliability of two experienced practitioners (>10 years' experience) using the four diagnostic methods to assess 40 patients with knee osteoarthritis. With respect to pulse, speed (number of beats per breath of patient), location, and force on a 3-tier nominal scale for left and right sides were recorded on a standard form. Outcomes reported fair to poor inter-rater reliability, with κ ranging from 0.30 to −0.05. The authors identified pulse diagnosis as the most problematic part of a Chinese medical examination and recommended clear definitions and prior training of examiners for future study designs.

Although these studies^20–22 showed sound design and methods, the choice of Landis and Koch values²³ for reporting levels of agreement was debatable. This interpretation rates κ < 0 as indicating poor agreement and κ of 0.01–0.40 as indicating slight to fair agreement. Previous studies that assessed reliability of subjective clinical diagnostic procedures suggested that κ ≤ 0.40 represented poor agreement and that such procedures were unacceptable for use in patients.^24,25 This therefore questions some of the conclusions drawn from these studies.
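The effect of the interpretation scale can be made concrete with a short sketch. The first function encodes the commonly cited Landis and Koch bands; the second encodes the stricter reading used for subjective diagnostic tests, with band labels taken from the figures quoted in this review (κ ≤ 0.40 poor, 0.41–0.74 moderate to good, ≥ 0.75 excellent). The function names are illustrative, and the κ values are those reported in the studies discussed above.

```python
def landis_koch_label(kappa):
    """Landis and Koch (1977) descriptive bands for kappa."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def strict_clinical_label(kappa):
    """Stricter bands for subjective diagnostic tests, as quoted in this review."""
    if kappa <= 0.40:
        return "poor"
    if kappa < 0.75:
        return "moderate to good"
    return "excellent"


# Kappa values reported in the reviewed studies (references 20-22).
for kappa in (-0.05, 0.15, 0.30, 0.37, 0.40, 0.63, 0.86):
    print(f"kappa = {kappa:5.2f}  "
          f"Landis-Koch: {landis_koch_label(kappa):<15}  "
          f"strict: {strict_clinical_label(kappa)}")
```

Under the stricter reading, every κ between 0 and 0.40 that Landis and Koch would label slight or fair is classed as poor.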

An extensive unfunded doctoral research project^26,27 investigated the reliability of Contemporary Chinese Pulse Diagnosis (CCPD), a method standardized over 25 years.^5,6 A real-life practical test and retest was used, wherein four testers assessed 34 attributes of the pulse. Four rate categories were counted by using a timing device (beginning, end, exertion, and change with exertion) and 30 pulse categories were palpated, 11 by using bilateral/six fingers and 19 by using one finger. In open testing conditions with talking prohibited, testers recorded pulse qualities that were present for each category on standard CCPD pulse forms, which were collected and stored securely. Retest was completed 28 days later on the same day to replicate for female menstrual cycles and diurnal variations. Participants were excluded from retest if their condition had changed (e.g., acute illness, emotional upset, medication change). Fourteen participants completed retest (11 Caucasian, 2 Hispanic, and 1 Asian; 3 men and 11 women).

Tester responses were transcribed into electronic format with 30 separate files created for each participant. Data were organized according to tester and day of testing. Intra-rater reliability compared test/retest results for each tester, and inter-rater reliability compared results of two testers at a time across both days of testing. Agreement for pulse rates was analyzed by using analysis of variance, and Cohen's κ coefficient²³ was used to measure reliability in terms of pulse quality matches for each of the categories. Intra-rater analysis included 1680 κ calculations (4 testers × 30 categories × 14 participants), and inter-rater analysis included 10,080 κ calculations (24 tester/day combinations × 30 categories × 14 participants). For ease of handling and reporting, κ values were averaged for each of the 30 pulse categories and analyzed for trends according to tester, participants, testing day, pulse position, and pulse quality. Results were not extrapolated to a wider population. κ values were interpreted according to values recommended for subjective diagnostic tests²⁵ (κ ≤ 0.40 represented unacceptable or poor agreement).
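The counts of κ calculations quoted above can be checked with simple arithmetic. The sketch below assumes that the 24 tester/day combinations arise from the 6 possible pairs of the four testers, each compared over the 4 possible pairings of the two testing days; the source states only the total of 24, so this breakdown is an assumption, and the tester labels are placeholders.

```python
from itertools import combinations, product

testers = ["T1", "T2", "T3", "T4"]   # placeholder labels for the four testers
days = [1, 2]                         # test day and retest day
n_categories = 30                     # palpated pulse categories
n_participants = 14                   # participants who completed retest

# Intra-rater: each tester's test result compared with the same tester's retest.
intra_rater_calcs = len(testers) * n_categories * n_participants

# Inter-rater (assumed breakdown): every pair of testers, over every pairing of days.
tester_pairs = list(combinations(testers, 2))
day_pairings = list(product(days, repeat=2))
tester_day_combinations = [(pair, dp) for pair in tester_pairs for dp in day_pairings]
inter_rater_calcs = len(tester_day_combinations) * n_categories * n_participants

print("tester pairs:", len(tester_pairs))
print("tester/day combinations:", len(tester_day_combinations))
print("intra-rater kappa calculations:", intra_rater_calcs)
print("inter-rater kappa calculations:", inter_rater_calcs)
```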

Reported outcomes were 43.2% of intra-rater κ calculations with excellent agreement (κ ≥ 0.75), 42.5% moderate to good agreement (κ 0.41–0.74), and 14.3% poor agreement (κ ≤ 0.40). Inter-rater results showed 23.5% of κ values with excellent agreement, 46% with moderate to good agreement, and 30.5% with poor agreement. Overall, 67% of intra-rater κ values were ≥0.60, and 44.1% of inter-rater κ values were ≥0.60, indicating that testers tended to agree with their own judgments more often than they did with that of others. Bilateral palpation methods demonstrated greater reliability, with 72.1% intra- and 52.8% inter-rater κ values ≥0.60, while single finger exhibited 64.1% of intra- and 39% of inter-rater κ values ≥0.60.

A higher incidence of poor agreement (κ ≤ 0.40) was reported in one tester, three participants, and several complementary pulse positions. The authors noted that these participants showed similar intra- and inter-rater disagreement, indicating that some people may have more variable pulses. For positions, intra-rater reliability remained greater than inter-rater, supporting the possibility of variance in tester technique. With such variance, each tester interprets difficult or unclear terminology slightly differently and therefore develops an individual method.

The study reported reliability of individual pulse qualities as complicated and related this to several factors, including sensation complexity, location, and unilateral versus bilateral palpation methods. Pulse qualities representing qi-yang deficiency and those with multifaceted descriptions were established as less reliable. The authors proposed that the extent of sensory input to testers was an important aspect of pulse diagnosis reliability.

Supporting earlier findings, Bilton et al.^26,27 concluded that acceptable levels of reliability can be achieved when a system of pulse diagnosis is operationally defined or when all users interpret the terminology and replicate the procedure in the same way on all occasions. Clarity of definitions and terminology was essential to control for variance relating to subjectivity of tester technique. Reliability was demonstrated to depend on tester skill or training, stability of a patient's pulse, and specific pulse position and quality being assessed. Authors recommended review of the terminology for the pulse positions within CCPD that recorded unacceptable reliability.

Statistical analysis with κ values proved challenging for this study because the results were affected by bias and prevalence paradoxes.^28–30 The κ value was therefore used as a descriptive statistic to identify trends in the data. Confidence intervals were not stated, so results were not extrapolated to a wider population. Another factor limiting the results of this study was the small population sample; however, the 34 attributes that were measured and remeasured on each participant provided an enormous amount of data that gave some support to the findings.

Funded studies with sound methods, data management, and reporting^31,32 investigated the reliability of traditional Korean diagnostic methods (including pulse), developed by the Korea Institute of Oriental Medicine, for stroke patients. Ko and colleagues³¹ included 18 different assessors (>3 years of experience with stroke patients) and 628 patients admitted to nine Oriental medical university hospitals less than 30 days after stroke. For the pulse portion, 2 testers graded each patient on a 1–3 scale for pulse location, rate, force, and shape (string-like, slippery, fine, rough, or surging). Lee et al.³² used the same patient inclusion criteria and methods and incorporated 168 post-stroke patients from 4 university hospitals. Two experts rated each patient for pulse location (floating or sunken), rate (slow or rapid), force (strong or weak), and shape (slippery, fine, or surging).

Both studies analyzed data by using percentage agreement, κ coefficient (values interpreted by Jelles et al.²⁵) and Gwet's AC1,²⁸ which is not vulnerable to the paradoxes of κ when there is a very high or low incidence of traits within a population.^28–30 In Ko and colleagues' study,³¹ κ results showed that inter-rater reliability ranged from poor (κ = 0.19) to moderate (κ = 0.49), while AC1 measure of agreement ranged from 0.38 to 0.93. The authors concluded that inter-rater reliability for pulse diagnosis in stroke patients was not particularly high when objectively quantified.³¹
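The prevalence paradox, and the reason AC1 is less affected by it, can be shown with a small sketch. The data are invented for illustration and do not come from either Korean study: two raters judge whether a pulse quality is present in 100 patients, and the quality is present in nearly all of them. The AC1 formula used is the standard two-rater form, in which chance agreement is computed from the average of the two raters' category proportions; the function names are illustrative.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of patients given the same rating by both raters."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: chance agreement from the product of each rater's marginals."""
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[c] * cb[c] for c in ca) / (n * n)
    return (observed_agreement(a, b) - p_chance) / (1.0 - p_chance)


def gwet_ac1(a, b):
    """Gwet's AC1 for two raters: chance agreement from averaged category proportions."""
    n = len(a)
    categories = set(a) | set(b)
    if len(categories) < 2:
        raise ValueError("AC1 needs at least two observed categories")
    ca, cb = Counter(a), Counter(b)
    pi = [(ca[c] + cb[c]) / (2 * n) for c in categories]
    p_chance = sum(p * (1.0 - p) for p in pi) / (len(categories) - 1)
    return (observed_agreement(a, b) - p_chance) / (1.0 - p_chance)


# Invented data: both raters say "present" for 90 patients and disagree on 10.
rater_a = ["present"] * 90 + ["present"] * 5 + ["absent"] * 5
rater_b = ["present"] * 90 + ["absent"] * 5 + ["present"] * 5

print(f"observed agreement: {observed_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa:      {cohen_kappa(rater_a, rater_b):.2f}")
print(f"Gwet's AC1:         {gwet_ac1(rater_a, rater_b):.2f}")
```

With these invented data the raters agree on 90% of patients, yet κ falls slightly below zero because chance agreement is already very high, whereas AC1 stays close to the observed agreement.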

Lee et al.³² reported κ results that ranged from poor (κ = 0.37) to moderate (κ = 0.61), while AC1 measures of agreement for the 2 experts were generally high (ranging from 0.66 to 0.89). The authors noted that patients who demonstrated higher agreement of other diagnostic variables also showed more reliable pulse assessments.³² They concluded that there was good reliability for pulse assessment in the Korea Institute of Oriental Medicine stroke diagnosis and recommended standardized diagnostic indicators with detail-oriented criteria and better training of clinicians to improve reliability. Although there were only 2 raters and pulse diagnosis reliability constituted only a small portion of the study, the results and interpretation of data indicated favorable reliability for the method in stroke patients. It is not known, however, whether this pulse method can be applied to a healthy population.

Discussion of Results and Limitations of Literature Analysis

The review of these studies confirmed several realities concerning pulse diagnosis. First, it is possible for the same tester to detect similar pulse patterns on the same patient on different occasions.^{†,††,26,27,32} Testers tended to agree with themselves more often than they did with others when rating the same patient's pulses, demonstrating intra-rater reliability to be higher than inter-rater.^{†,††,26,27} This suggested that a subjective component of pulse diagnosis is independent of other sensory input, demonstrated by similar agreement levels for both blind and open conditions.^†

Poor inter-rater reliability was regularly linked to inconsistent tester interpretations of pulse definitions resulting from confusing terminology used in the classic texts and inadequate operational definitions for modern methods of pulse diagnosis.^{†,††,8,15,18} Inter-rater reliability further decreased as the number of pulse testers increased²⁰ and as the complexity of the measured pulse variable increased.^{††,9,15,26,27} Later studies suggested that unclear definitions are interpreted differently and result in different users developing individual techniques, thereby reducing inter-rater agreement.^{†,††,26,27} When, however, the method of pulse diagnosis was standardized or operationally defined or the terminology was interpreted and the procedure implemented in the same way every time, acceptable inter-rater reliability was possible.^{9,19,26,27,32} Some studies suggested that within specific methods, revising problematic terminology, developing detail-oriented criteria, and better training of the clinicians controlled for variance relating to subjectivity of tester technique.^{22,26,27,31,32}

Training and experience of the clinician directly influenced the reliability of pulse diagnosis.^{22,26,27,31,32} Greater agreement was found between raters who had more skill with the standardized procedure.^26,27,32 Despite experienced testers and operationally defined procedures, several studies found less agreement for pulse assessments with particular participants.^26,27,32 This suggested that some people may have pulses that are more variable than others, which supports the change itself as being what is diagnostically relevant.^5,6

The literature established the reliability of individual pulse qualities as not straightforward.^{††,15,26,27} Inter- and intra-rater reliability of specific qualities decreased as pulse definitions became more complicated or the number of significant descriptive aspects relevant for sensory differentiation increased.^{††,15,26,27} More complex descriptions require assessors to consider multiple defining traits during palpation, thus increasing the difficulty of decisions and reducing reliability. Furthermore, it was suggested that pulse quality reliability varied depending on the location in which the quality was detected.^26,27

Sensory input was suggested as important for pulse diagnosis reliability in several ways. Pulse qualities representing yang deficiency, or those that are weak or difficult to detect, were less reliable than those possessing more power and strength.^26,27 When a sensation is barely perceptible, the lack of tactile information potentially presents clinicians with more uncertainty, resulting in less reliability for pulse qualities that exhibit these characteristics. Greater reliability for bilateral methods of palpation was similarly related to increased sensory input for the rater.^26,27 The larger area of contact for finger pads may have delivered more sensory information to the cognitive processes on which clinicians based their judgements. For individual pulse qualities, some that were unreliable when detected with a single finger were reliable with bilateral palpation.^26,27 This suggests that incorporating bilateral palpation may achieve more reliable pulse assessments.

Three recurrent limitations and/or biases were found across the literature. Relatively small sample sizes restricted the extrapolation of results of most studies to a wider population. The studies that incorporated both testers and participants from the same institution risked the potential for bias or the possibility that testers' prior knowledge of participants' pulses influenced the data. Most significantly, many of the more recent studies demonstrated inter-rater reliability and did not investigate intra-rater reliability.

Demonstration of intra-rater reliability was considered contentious by some authors because tester memory may influence results and pulse characteristics change within hours or days.¹⁰ Both claims, however, appear to be unsupported. Investigations of intra-rater reliability can be validated by reducing tester memory and allowing time to pass between test and retest procedures.³³ The volatility of pulse parameters, on the other hand, contradicts millennia of empirical knowledge on which pulse systems across many medical paradigms are predicated, including the origins of current allopathic cardiovascular measurements.^18,34–36

Intra-rater reliability must also be considered in view of clinical practice where patient baselines are assessed, treatment is implemented, and then the same markers are reassessed to judge treatment outcomes. Some methods, such as TMT, even monitor changes in the radial pulse to determine the length of an acupuncture treatment.³⁷ It is therefore reasonable to conclude that intra-rater reliability is of equal importance. Hence, this may call into question the usefulness of studies that investigated inter-rater reliability and ignored prior or concurrent demonstration of testers repeating reliable assessments on the same participants at different times.

Conclusions of Literature Analysis

For reliable pulse assessments, the literature indicates that the method should be operationally defined with clear, concrete terminology so that all practitioners interpret the definitions and implement the method in the same way every time. For translation of results to the clinical setting, future pulse diagnosis studies should examine both intra- and inter-rater reliability while controlling for pulse tester variance; this can be achieved by ensuring similar training and experience in the same method. To more rigorously investigate inter-rater reliability, the number of testers that evaluate each patient should be greater than two, and consideration should be given to the number of participants and pulse characteristics tested to allow wider extrapolation of results. Finally, independence of data should be guaranteed by ensuring that pulse testers have no prior knowledge of the participants' pulses, and all test data should be stored impartially and securely.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

References

1. Walsh, King, Simpson. Pulse diagnosis: an introductory guide for the experienced practitioner. J Chin Med. 2009; 89:5–12.

2. Birch, Felt. Understanding Acupuncture. London: Harcourt Brace, 1999.

3. Kim, Jeon, Kim, Lee, Kim. Novel diagnostic model for the deficient and excess pulse qualities. Evid Based Complement Altern Med. 2012; 563958.

4. Deng, Ergil. Practical Diagnosis in Traditional Chinese Medicine. Edinburgh: Churchill Livingstone, 1999.

5. Hammer. Chinese Pulse Diagnosis, A Contemporary Approach: Revised Edition. Seattle: Eastland Press, 2005.

6. Hammer, Bilton. Handbook of Contemporary Chinese Pulse Diagnosis. Seattle: Eastland Press, 2012.

7. Huynh, Seifert. Pulse Diagnosis: Li Shi Zhen. Brookline, MA: Paradigm Publications, 1981.

8. Flaws. The Secret of Chinese Pulse Diagnosis. Denver: Blue Poppy Press, 2006.

9. King, Cobbin, Walsh, Ryan. The reliable measurement of radial pulse characteristics. Acupunct Med. 2002; 20:150–159.

10. O'Brien, Birch. A review of the reliability of traditional East Asian medicine diagnoses. J Altern Complement Med. 2009; 15:353–366.

11. Cochrane Consumers and Communication Review Group. Cochrane Consumers and Communication Review Group. 2009; Version 1.3.0, updated August 5, 2009.

12. Yang. The Pulse Classic: A Translation of the Mai Jing. Boulder: Blue Poppy Press, 1997.

13. Chan. The History and Methods of Physical Diagnosis in Classical Chinese Medicine. London: Pilot Press, 1960.

14. Siegel, Castellan Jr. Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill, 1988.

15. Kass. Traditional Chinese Medicine and pulse diagnosis in San Francisco health planning: implications for a Pacific Rim city. Berkeley, CA: University of California, Berkeley, 1990.

16. Porkert. The Essentials of Chinese Diagnostics. Columbia: Chinese Medicine Publications, 1983.

17. Parzen. Modern Probability Theory and Its Applications. New York: Wiley, 1960.

18. Walsh, Cobbin, Bateman, Zaslawski. Feeling the pulse: trial to assess agreement level among TCM students when identifying basic pulse characteristics. Eur J Orient Med. 2001; 3:25–31.

19. King, Walsh, Cobbin. The testing of classical pulse concepts in Chinese medicine: left- and right-hand pulse strength discrepancy between males and females and its clinical implications. J Altern Complement Med. 2006; 15:727–734.

20. O'Brien, Abbas, Zhang, et al. Understanding the reliability of diagnostic variables in a Chinese medicine examination. J Altern Complement Med. 2009; 15:727–734.

21. O'Brien, Abbas, Movsessian, Hook, Komesaroff, Birch. Investigating the reliability of Japanese toyohari meridian therapy diagnosis. J Altern Complement Med. 2009; 15:1099–1105.

22. Hua, Abbas, Hayes, Ryan, Nelson, O'Brien. Reliability of Chinese medicine diagnostic variables in the examination of patients with osteoarthritis of the knee. J Altern Complement Med. 2012; 18:1028–1037.

23. Landis, Koch. The measurement of observer agreement for categorical data. Biometrics. 1977; 33:155–174.

24. Devane, Lalor. Midwives' visual interpretation of intrapartum cardiotocographs: intra- and inter-observer agreement. J Adv Nurs. 2005; 52:133–141.

25. Jelles, Van Bennekom, Lankhorst, Sibbel, Bouter. Inter- and intra-rater agreement of the Rehabilitation Activities Profile. J Clin Epidemiol. 1995; 48:407–416.

26. Bilton. Investigating the reliability of Contemporary Chinese Pulse Diagnosis as a diagnostic tool in Oriental medicine. Sydney: University of Technology, 2012.

27. Bilton, Smith, Walsh, Hammer. Investigating the reliability of contemporary Chinese pulse diagnosis. Aust J Acupunct Chin Med. 2010; 5:3–13.

28. Gwet. Computing inter-rater reliability with the SAS system. Stat Methods Inter-Rater Reliab Assess. 2003; 3:1–16.

29. Cicchetti, Feinstein. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990; 43:551–558.

30. Feinstein, Cicchetti. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990; 43:543–549.

31. Ko, Park, Lee, et al. Interobserver reliability of pulse diagnosis using traditional Korean medicine for stroke patients. J Altern Complement Med. 2013; 19:29–34.

32. Lee, Ko, Kang B-K, et al. Interobserver reliability of four diagnostic methods using traditional Korean medicine for stroke patients. Evid Based Complement Altern Med. 2014; 2014.

33. Sim, Wright. The kappa statistic in reliability studies: use interpretation and sample size requirements. Phys Ther. 2006; 85:257–268.

34. O'Rourke, Kelly, Avolio. The Arterial Pulse. Pennsylvania: Lea & Febiger, 1992.

35. Schmieder, Schobel, Gatzka, et al. Effects of angiotensin converting enzyme inhibitor on renal haemodynamics during mental stress. J Hypertens. 1996; 14:1201–1207.

36. Veerman, Imholz, Wieling, Wesseling, van Montfrans. Circadian profile of systemic hemodynamics. Hypertension. 1995; 26:55–59.

37. O'Brien, Birch, Abbas, Movsessian, Hook, Komesaroff. Traditional East Asian medical pulse diagnosis: a preliminary physiologic investigation. J Altern Complement Med. 2013; 19:793–798.