Abstract
Background/Objective:
Little evidence supports the reliability of Chinese medicine pulse diagnosis. Regularly used in modern practice, it is believed to gather important diagnostic information. However, in the current evidence-based healthcare system, basing clinical decisions on unproven methods is problematic and calls the relevance of the procedure into question. Therefore, the literature on the reliability of practitioners implementing the method was reviewed.
Methods:
Major medical databases and reference lists of identified articles were searched. All studies published in English that investigated manual pulse diagnosis applied to the radial artery by human testers were considered.
Results:
Twelve eligible studies were included; three evaluated intra- and inter-rater pulse diagnosis reliability, and nine assessed inter-rater reliability. Acceptable levels of intra- and inter-rater reliability were achieved with operationally defined methods. Poor reliability was related to unclear definitions and terminology within the classical definitions and, in standardized systems, to persisting imprecise descriptions that can be interpreted differently. Reliability of pulse qualities was influenced by sensation complexity and the amount of sensory input provided to the testers' fingers by the impulse. Consistent study limitations included small sample sizes; the possibility that testers' prior knowledge confounded the data; and, most notably, the fact that many studies did not consider intra-rater reliability. Assessing the effectiveness of interventions in clinical practice is guided by comparisons of markers to baseline. The absence of intra-rater results may therefore raise methodologic concerns for these types of studies.
Conclusion:
Strategies for future studies include using pulse methods with concrete operational definitions; investigating intra- and inter-rater reliability for extrapolation to clinical practice; ensuring similar training and experience in the method to control for tester variance; maintaining independence of the data by ensuring testers have no prior knowledge of the participants' pulses; and, for more rigorous testing, considering the number of pulse variables, participants, and testers.
Introduction
The ongoing debate regarding the subjectivity of diagnostic procedures 9 questions their relevancy and hinders the development of Chinese medicine and TEAM alongside current evidence-based healthcare practices. Significantly, information gathered by these methods is interpreted according to tradition and empirical assumption and directs clinical decisions regarding patient care. The legitimacy of using untested or unreliable methods, such as pulse diagnosis, is questionable and should be substantiated. Therefore, the literature on the reliability of practitioners applying manual pulse diagnosis to assess patients was reviewed.
Methods of Literature Review
Searches
Electronic databases searched in September 2015 included MEDLINE, PubMed, PsychInfo, ProQuest Central, CINAHL (EBSCO), and Google Scholar; these databases were searched from their inception. The basic search terms “pulse diagnosis” and “reliability” were applied, then further refined with the addition of the individual terms “agreement,” “acupuncture,” “Chinese medicine,” “Traditional East Asian medicine,” and “Oriental medicine.” Additional efforts were made to locate unpublished studies by hand searching references of records returned by electronic sources.
Eligibility criteria
Included studies were published in English (with no date or publication status restrictions) and incorporated manual methods of pulse diagnosis applied at the radial artery by human testers. Studies using practitioner and student testers, and healthy as well as ill participants, were considered. Outcome measures reported included intra- and/or inter-rater reliability of defined pulse characteristics. Studies that measured the radial artery using electronic or mechanical devices were excluded from the review.
Selection of studies
The primary author searched the literature and performed all eligibility assessments. Potential studies were screened initially by title, followed by abstract, and then by full text as each continued to meet the inclusion criteria.
Data collection identified 12 studies for inclusion, the process for which is outlined in Figure 1. The basic search terms returned 123 records, and refining terms narrowed this to 24; the studies removed primarily related to oximetry and measurement of pulse waves by devices and instruments. A first screening of titles removed duplicates, resulting in the exclusion of 9 more records. Examination of remaining abstracts and full texts excluded 7 further studies that were not relevant. Searching the reference lists of the included records located another 4 cited studies that satisfied the eligibility criteria. Details of the 12 studies included in this review are presented in Table 1.
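The record counts in this paragraph can be checked with simple arithmetic. The short sketch below assumes that 24 is the number of records remaining after the refining terms were applied, which is the reading under which the counts reconcile with the 12 included studies; the variable names are illustrative only.

```python
# Record counts as reported in the text; variable names are illustrative.
records_after_refinement = 24   # assumed: records left after refining terms
duplicates_removed = 9          # removed at title screening
irrelevant_removed = 7          # removed at abstract/full-text screening
found_in_reference_lists = 4    # added from reference lists

included = (records_after_refinement
            - duplicates_removed
            - irrelevant_removed
            + found_in_reference_lists)
print(included)
```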

Figure 1. Flow diagram of included studies.
Table 1 abbreviations: SRCT, Spearman rank correlation test; KCC, Kendall coefficient of concordance; TCM, Traditional Chinese Medicine; TMT, Toyohari Meridian Therapy; CCPD, Contemporary Chinese Pulse Diagnosis.
The literature search located a previous review reporting the reliability of all TEAM diagnostic and treatment methods.10* Although that review contributed somewhat to existing knowledge, its scope was enormous. The resulting analysis of pulse diagnosis studies was thus limited, in some cases overlooked relevant data, and presented the results in table format only. The discussion and conclusion addressed all diagnostic methods collectively and, in doing so, lacked specificity for the findings of the component diagnostic methods. The strategies suggested to improve reliability did not go beyond those offered by the original studies. By concentrating on pulse diagnosis studies, this review presents a more comprehensive analysis of the current eligible literature.
Extraction of data items for analysis
A form based on the Cochrane Consumers and Communication Review Group's template 11 was designed to extract data and included (1) subject numbers, (2) participants, (3) testers, (4) method of pulse palpation used, (5) study design, (6) statistical analysis, and (7) reporting of agreement or reliability (Table 1). Methods and results were further explored for limiting factors and clues suggesting missing data or selective reporting bias.
Results
Cole's doctoral research † included 16 smaller studies, ‡ of which 7 were relevant: 4 intra-rater reliability (intra-RR), 1 intra-rater reliability over time (intra-RROT), and 2 inter-rater reliability (inter-RR) studies. The intra-RR substudies retested participants from a larger sample (5 hospital patients, 4 hospital staff and patients, 2 mixed health [age 18–56 years], and 12 healthy male medical students). Intra-RROT retested five hospital patients. The inter-RR studies each recruited 12 participants (patients from one tester's clinic; healthy medical students: 9 male, 3 female). Pulse testers included acupuncturists with 14–20 years' experience and 2 students.
Intra-RR trials used practical tests/retests with testers blinded, talking forbidden, and a different tester for each substudy. Inter-RR studies compared responses of practitioners and students recorded on standardized pulse maps. Although the exact method of pulse palpation was not described, these maps used Wang Shu He pulse positions 12 ; the intra-RR substudies included seven pulse qualities as cited in Chan. 13 Intra-RR data were analyzed by using the Spearman rank correlation test (SRCT), with the Wilcoxon test (WT) to assess significance. Intra-RROT and inter-RR studies used the Kendall coefficient of concordance (KCC) § to assess overlap of measured pulse components: constant physical components versus stereotyped components projected onto the pulse.
Intra-RR analyses demonstrated high proportions of SRCT values greater than 0.5, and WT confirmed that these were unlikely to be due to chance, showing that it was possible for testers to reliably record stable information from a pulse. Corroborating these findings, intra-RROT KCC (0.95–0.3) indicated that reliable pulse patterns were recorded throughout the day. Eighty-one percent of variance was due to physical and stereotyped components (not chance). Low inter-rater reliability was demonstrated for both blinded and open conditions (low SRCT values), and it was established that testers did not generate stereotyped pulse patterns (KCC, 0.2–0.4).
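As a rough illustration of the concordance statistic reported above, Kendall's W for m raters who each rank the same n items can be computed as sketched below. The rankings are invented for illustration and are not Cole's data, the function name is hypothetical, and no correction for tied ranks is applied.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance (W), without tie correction.

    rankings: list of m lists, each holding one rater's ranks (1..n)
    for the same n items.
    """
    m = len(rankings)           # number of raters
    n = len(rankings[0])        # number of ranked items
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical example: three raters rank four pulse components.
example_rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
w_example = kendalls_w(example_rankings)
w_perfect = kendalls_w([[1, 2, 3, 4]] * 3)
print(round(w_example, 3), w_perfect)
```

W ranges from 0 (no concordance) to 1 (identical rankings), so values of 0.2–0.4 indicate that raters ordered the items quite differently.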
Cole concluded that the same tester could reliably record “objective” pulse patterns within the same subject on different occasions; different testers did not detect similar findings. Irrespective of blinding, testers projected “subjective” influences or individual “gestalt” onto subjects' pulses. Although limited by small sample sizes, potential prior knowledge of subjects' pulses, and inadequacy in reporting some methods, the detail and range of procedures used to test manual pulse diagnosis support the basic conclusion that intra-rater was greater than inter-rater reliability.
Kass's doctoral research 15 evaluated the reliability and validity of manual and electronic pulse-taking techniques. Ten subjects (mean age, 80.5 years) recruited from a senior's health facility ** on the basis of their medical file, and two pulse testers with “extensive experience” (one using manual palpation, the other an electronic pulse-taking device) were included in the study.
In a blinded practical test, both testers examined subjects' pulses and aimed to replicate readings (reliability), then correctly match their medical files on the basis of pulse analysis alone (validity). Information was recorded on a form comprising three sections (general, subtype, and individual pulse). Although the exact method was not stated, 31 pulse patterns 16 and 18 locations (three depths and positions bilaterally) were included. Data were analyzed by using normal approximation to the binomial (p < 0.05) 17 to determine whether the results were better than chance alone.
Outcomes for general and individual pulse sections (e.g., depth, intensity, amplitude, frequency) exhibited 79% and 70% matches, respectively (p < 0.0001), while pulse subtype (qualities) matches were significant in less than 50% of groupings (p < 0.05). Kass concluded that pulse diagnosis reliability decreased as more subtle levels of distinction were attempted. Although the relatively small sample size and the fundamental design of comparing a manual pulse-taking procedure with an electronic device were problematic, the inference of decreasing reliability as the subtlety of the measured variable increases was reasonable.
Craddock †† investigated intra- and inter-practitioner reliability of pulse evaluation in a study conducted for undergraduate acupuncture course requirements. Eight subjects (24–56 years; four female, four male; one student, one staff, and six with no school affiliation) and four testers (teachers with 3 years' training and 5 years' experience) were included in the study.
With testers blinded and talking forbidden, seven pulse categories were rated according to a standardized questionnaire in a practical test and retest (of four subjects). There was no specific indication of the pulse model used and no citation for the source of definitions for pulse characteristics assessed by the testers. Li Shi Zhen was mentioned in reference to positioning of the subjects' wrists.
Reported in percentage agreement, outcomes were inter-rater agreement of 63.3% (six categories ≥56.3%) and intra-rater agreement of 56.1% (five categories ≥58.3%) with “individual qualities” as the least reliable category (intra- and inter-rater). Craddock concluded that disagreement resulted from inadequate operational definitions for pulse diagnosis, and intra- and inter-rater reliability decreased as pulse variables became more complex. Conclusions from this study must be considered in view of the small sample size and the possibility that testers' prior knowledge of some subjects' pulses confounded the data.
Different aspects of pulse diagnosis reliability were investigated in several studies from the University of Technology, Sydney (UTS). 9,18,19 Walsh et al. 18 assessed agreement frequency of TCM students identifying specified pulse characteristics (e.g., speed, depth, volume, length, and quality) in 18 subjects (8 male, 10 female). A practical lineal test with testers blinded and talking forbidden was used to assess six subjects in three episodes of testing (week 1 pulse diagnosis classes, conclusion/week 14, then 1 year later). For each collection, standard assessment forms were used and testers rated 12 pulse characteristics (total of 72 pulse characteristics for each). The number of pulse testers (UTS students) was 35 for collection 1, 29 for collection 2, and 20 for collection 3. Cun, guan, and chi positions were palpated; however, the exact method and the source for included pulse descriptions were not provided. Data were analyzed using chi-square (χ2); level of significance was 0.05.
Stated outcomes were inter-rater agreement significantly greater than chance at collection 2 (χ2 = 0.046) in 31 of 72 sets of data, with no difference from chance alone at collections 1 and 3. The authors concluded that poor agreement was due to the confusing information existing within the traditional pulse diagnosis literature rather than the ability of the students to learn the skill.
King et al. 9 investigated whether a pulse diagnosis method with concrete operational definitions could reliably assess pulse parameters. The study involved two testers (UTS lecturers with 5 and 7 years' experience) and healthy persons recruited from UTS students, staff, and the general population (proportions not reported). Data collection 1 included 66 participants (27 men and 39 women), and data collection 2 included 30 participants (13 men and 17 women). Seventy percent of participants were European and 30% were Asian.
The researchers developed operational definitions for a standardized manual palpation method based on traditional pulse definitions and repeated practical test/retest procedures. The study used a practical test and retest (specific conditions, such as blinding, were not described); 16 pulse characteristics were rated in each participant, and inter-rater reliability was measured as percentage agreement. It is unclear whether the retest included participants from the initial test or if they were completely different population samples.
Outcomes reported were mean percentage agreement tested against the chi-square goodness of fit; level of significance = 0.05. Agreement for pulse characteristics across both collection phases was reported at 80% or greater. Data collection 1 showed greater than 70% agreement for 13 of 16 categories (10 were >80%) and data collection 2 showed 80% or greater for 11 categories. The authors concluded that acceptable inter-rater reliability was possible with a standardized pulse-taking procedure and concrete operational definitions.
King et al. 19 investigated differences in right–left pulse strength in relation to sex and also reported inter-rater reliability. Having previously demonstrated acceptable inter-rater reliability, 9 the researchers recruited 65 healthy participants from staff, students, and the general population (27 men, 38 women). With a practical test design and open conditions, with talking prohibited, testers rated participants' pulses for comparative left–right strength, presumably using the same pulse method as that used in King and colleagues' previous study. 9 Data were analyzed by using percentage agreement and chi-square test; the level of significance was 0.05. Outcomes reported were inter-rater agreement of 86%; the Chinese medicine assumption of sex-related right–left pulse strength differences was not supported.
Although these studies 9,18,19 generally incorporated sound design, methods, and conclusions, all overlooked testing of intra-rater reliability. It is therefore unknown whether individual testers could replicate the methods reliably in the same participant on successive occasions, as required with patient re-evaluations in clinical practice. In addition, because testers and participants were recruited from the same university, the risk of bias existed. As it was not stated otherwise, testers' prior knowledge of some participants' pulses may have influenced the inter-rater results.
Australian researchers reported pulse diagnosis reliability in three papers, 20–22 all based on the results of larger studies. Each used a similar design and analyzed inter-rater reliability by using percentage agreement, Cohen's κ coefficient, or weighted κ interpreted by Landis and Koch values. 23 One investigated the reliability of TCM diagnostic methods in 45 participants (age 20–75 years) with hypercholesterolemia and no heart disease or serious medical conditions. 20 Three practitioners (5–20 years' experience) assessed pulse location, force, and speed (one used a timing device, and the others counted by breath). Outcomes showed slight to fair agreement (κ = 0.15–0.29) for all three testers for pulse location and force. When only two testers were compared, agreement was higher for both categories (κ = 0.86). Agreement for speed assessed by a watch was higher (κ = 0.84 and 0.72) than that seen with traditional methods (κ = 0.63).
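For readers unfamiliar with the statistic, Cohen's κ corrects the observed proportion of agreement between two raters for the agreement expected by chance from each rater's marginal frequencies. The sketch below is a minimal illustration with invented ratings of pulse location; it is not data from any of the reviewed studies, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when both raters use a single identical category throughout.
    """
    n = len(rater_a)
    # Observed agreement: share of items given the same category.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of marginal proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of pulse location for 10 participants.
a = ["floating"] * 5 + ["sunken"] * 5
b = ["floating"] * 4 + ["sunken"] * 5 + ["floating"]
kappa = cohens_kappa(a, b)
print(round(kappa, 2))
```

In this example the raters agree on 8 of 10 participants (80%), but because half of that agreement is expected by chance, κ is 0.6.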
O'Brien et al. 21 reported on two Toyohari Meridian Therapy (TMT) instructors with 10–12 years' experience who diagnosed 62 healthy persons (age 20–65 years) according to TMT principles. In terms of pulse diagnosis, depth, strength, and speed were rated on a 5-tier nominal scale that was reduced to 3 tiers for analysis. Results showed agreement for pulse depth of 57%, speed of 61%, and strength of 77%. Weighted κ for 3-scale pulse categories were as follows: for depth, κ = 0.37; for speed, κ = 0.40; and for strength, κ = 0.38. The authors concluded reasonable agreement for pulse characteristics with room for improvement.
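Weighted κ gives partial credit when two raters choose adjacent levels on an ordered scale, which is why it suits tiered ratings such as depth, speed, and strength. The sketch below assumes linear disagreement weights (the weighting scheme used by the study is not stated in this review) and uses invented 3-tier strength ratings; the function name is hypothetical.

```python
def weighted_kappa(rater_a, rater_b, levels):
    """Linearly weighted kappa for two raters on an ordered scale.

    levels: the scale's categories in order, e.g. ["weak", "medium", "strong"].
    Linear weights are an assumption made for this illustration.
    """
    index = {level: i for i, level in enumerate(levels)}
    k = len(levels)
    n = len(rater_a)
    # Observed joint proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_obs = 0.0
    weighted_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)      # disagreement weight, 0 on the diagonal
            weighted_obs += w * observed[i][j]
            weighted_exp += w * row[i] * col[j]
    return 1 - weighted_obs / weighted_exp


# Hypothetical 3-tier strength ratings for six participants.
scale = ["weak", "medium", "strong"]
a = ["weak", "medium", "strong", "weak", "medium", "strong"]
b = ["weak", "medium", "strong", "medium", "medium", "medium"]
kw = weighted_kappa(a, b, scale)
print(round(kw, 3))
```

Here the two disagreements are between adjacent tiers, so each counts as half a disagreement rather than a full one.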
Hua and colleagues 22 reported inter-rater reliability of two experienced practitioners (>10 years' experience) using the four diagnostic methods to assess 40 patients with knee osteoarthritis. With respect to pulse, speed (number of beats per breath of patient), location, and force on a 3-tier nominal scale for left and right sides were recorded on a standard form. Outcomes reported fair to poor inter-rater reliability, with κ ranging from 0.30 to −0.05. Authors established pulse diagnosis as the most problematic part of a Chinese medical examination and recommended clear definitions and prior training of examiners for future study designs.
Although these studies 20–22 showed sound design and methods, the choice of Landis and Koch values 23 for reporting levels of agreement was debatable. This interpretation rates κ < 0 as indicating poor agreement and κ of 0.01–0.40 as indicating slight to fair agreement. Previous studies that assessed reliability of subjective clinical diagnostic procedures suggested that κ ≤ 0.40 represented poor agreement and that such procedures were unacceptable for use in patients. 24,25 This therefore questions some of the conclusions drawn from these studies.
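The practical effect of the two interpretation schemes can be seen by labelling the same κ value under each. The sketch below uses the commonly cited Landis and Koch bands and the stricter thresholds described in this review (κ ≤ 0.40 poor, 0.41–0.74 moderate to good, ≥ 0.75 excellent); the function names are hypothetical and the handling of values falling exactly on a boundary is an assumption.

```python
def landis_koch_label(kappa):
    """Label kappa using the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def strict_label(kappa):
    """Label kappa using the stricter scale for subjective diagnostic tests."""
    if kappa <= 0.40:
        return "poor"
    if kappa < 0.75:
        return "moderate to good"
    return "excellent"


# Kappa values reported in the studies discussed above.
for value in (-0.05, 0.15, 0.30, 0.38, 0.86):
    print(value, landis_koch_label(value), "|", strict_label(value))
```

A value such as κ = 0.30 is "fair" under the first scheme but "poor" under the second, which is the source of the disagreement over how these studies' conclusions should be read.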
An extensive unfunded doctoral research project 26,27 investigated the reliability of Contemporary Chinese Pulse Diagnosis (CCPD), a method standardized over 25 years. 5,6 A real-life practical test and retest was used, wherein four testers assessed 34 attributes of the pulse. Four rate categories were counted by using a timing device (beginning, end, exertion, and change with exertion) and 30 pulse categories were palpated, 11 by using bilateral/six fingers and 19 by using one finger. In open testing conditions with talking prohibited, testers recorded pulse qualities that were present for each category on standard CCPD pulse forms, which were collected and stored securely. Retest was completed 28 days later, on the same day, to account for female menstrual cycles and diurnal variations. Participants were excluded from retest if their condition had changed (e.g., acute illness, emotional upset, medication change). Fourteen participants completed retest (11 Caucasian, 2 Hispanic, and 1 Asian; 3 men and 11 women).
Tester responses were transcribed into electronic format with 30 separate files created for each participant. Data were organized according to tester and day of testing. Intra-rater reliability compared test/retest results for each tester, and inter-rater reliability compared results of two testers at a time across both days of testing. Agreement for pulse rates was analyzed by using analysis of variance, and Cohen's κ coefficient 23 was used to measure reliability in terms of pulse quality matches for each of the categories. Intra-rater analysis included 1680 κ calculations (4 testers × 30 categories × 14 participants), and inter-rater analysis included 10,080 κ calculations (24 tester/day combinations × 30 categories × 14 participants). For ease of handling and reporting, κ values were averaged for each of the 30 pulse categories and analyzed for trends according to tester, participants, testing day, pulse position, and pulse quality. Results were not extrapolated to a wider population. κ values were interpreted according to values recommended for subjective diagnostic tests 25 (κ ≤ 0.40 represented unacceptable or poor agreement).
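The reported numbers of κ calculations follow from the study design. The sketch below reproduces them; the decomposition of the 24 tester/day combinations into 6 tester pairs times 4 day pairings is an assumption made for illustration, since the review states only the total.

```python
from itertools import combinations

testers = 4
days = 2
categories = 30
participants = 14

# Intra-rater: each tester's test vs. retest, per category and participant.
intra_calculations = testers * categories * participants

# Inter-rater: assumed to be every pair of testers crossed with every
# pairing of testing days (day 1 or 2 for each tester in the pair).
tester_pairs = len(list(combinations(range(testers), 2)))   # 6
day_pairings = days * days                                   # 4
tester_day_combinations = tester_pairs * day_pairings        # 24
inter_calculations = tester_day_combinations * categories * participants

print(intra_calculations, tester_day_combinations, inter_calculations)
```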
Reported outcomes were 43.2% of intra-rater κ calculations with excellent agreement (κ ≥ 0.75), 42.5% moderate to good agreement (κ 0.41–0.74), and 14.3% poor agreement (κ ≤ 0.40). Inter-rater results showed 23.5% of κ values with excellent agreement, 46% with moderate to good agreement, and 30.5% with poor agreement. Overall, 67% of intra-rater κ values were ≥0.60, and 44.1% of inter-rater κ values were ≥0.60, indicating that testers tended to agree with their own judgments more often than they did with that of others. Bilateral palpation methods demonstrated greater reliability, with 72.1% intra- and 52.8% inter-rater κ values ≥0.60, while single finger exhibited 64.1% of intra- and 39% of inter-rater κ values ≥0.60.
A higher incidence of poor agreement (κ ≤ 0.40) was reported in one tester, three participants, and several complementary pulse positions. The authors noted that these participants showed similar intra- and inter-rater disagreement, indicating that some people may have more variable pulses. For positions, intra-rater reliability remained greater than inter-rater, supporting the possibility of variance in tester technique. With such variance, each tester interprets difficult or unclear terminology slightly differently and therefore develops an individual method.
The study reported reliability of individual pulse qualities as complicated and related this to several factors, including sensation complexity, location, and unilateral versus bilateral palpation methods. Pulse qualities representing qi-yang deficiency and those with multifaceted descriptions were established as less reliable. The authors proposed that the extent of sensory input to testers was an important aspect of pulse diagnosis reliability.
Supporting earlier findings, Bilton et al. 26,27 concluded that acceptable levels of reliability can be achieved when a system of pulse diagnosis is operationally defined or when all users interpret the terminology and replicate the procedure in the same way on all occasions. Clarity of definitions and terminology was essential to control for variance relating to subjectivity of tester technique. Reliability was demonstrated to depend on tester skill or training, stability of a patient's pulse, and specific pulse position and quality being assessed. Authors recommended review of the terminology for the pulse positions within CCPD that recorded unacceptable reliability.
Statistical analysis with κ values proved challenging for this study because the results were affected by bias and prevalence paradoxes. 28–30 The κ value was therefore used as a descriptive statistic to identify trends in the data. Confidence intervals were not stated, so results were not extrapolated to a wider population. Another factor limiting the results of this study was the small population sample; however, the 34 attributes that were measured and remeasured on each participant provided an enormous amount of data that gave some support to the findings.
Funded studies with sound methods, data management, and reporting 31,32 investigated the reliability of traditional Korean diagnostic methods (including pulse), developed by the Korea Institute of Oriental Medicine, for stroke patients. Ko and colleagues 31 included 18 different assessors (>3 years of experience with stroke patients) and 628 patients admitted to nine Oriental medical university hospitals less than 30 days after stroke. For the pulse portion, 2 testers graded each patient on 1–3 scale for pulse location, rate, force, and shape (string-like, slippery, fine, rough, or surging). Lee et al. 32 used the same patient inclusion criteria and methods and incorporated 168 post-stroke patients from 4 university hospitals. Two experts rated each patient for pulse location (floating or sunken), rate (slow or rapid), force (strong or weak), and shape (slippery, fine, or surging).
Both studies analyzed data by using percentage agreement, κ coefficient (values interpreted by Jelles et al. 25 ), and Gwet's AC1, 28 which is not vulnerable to the paradoxes of κ when there is a very high or low incidence of traits within a population. 28–30 In Ko and colleagues' study, 31 κ results showed that inter-rater reliability ranged from poor (κ = 0.19) to moderate (κ = 0.49), while AC1 measure of agreement ranged from 0.38 to 0.93. The authors concluded that inter-rater reliability for pulse diagnosis in stroke patients was not particularly high when objectively quantified. 31
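The prevalence paradox mentioned above can be illustrated with a small invented example in which a pulse trait is rare. Two raters agree on 92 of 100 patients, yet κ is low because chance agreement computed from the skewed marginals is very high, whereas Gwet's AC1 uses a different chance-agreement term and stays close to the observed agreement. The data and function name below are hypothetical and do not come from the Korean studies.

```python
from collections import Counter


def agreement_stats(rater_a, rater_b):
    """Return (percent agreement, Cohen's kappa, Gwet's AC1) for two raters.

    The number of categories is taken from the ratings actually observed,
    so at least two distinct categories must appear.
    """
    n = len(rater_a)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = sorted(set(count_a) | set(count_b))
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    kappa = (p_o - pe_kappa) / (1 - pe_kappa)

    # Gwet's AC1: chance agreement from the average marginal proportions.
    pi = {c: (count_a[c] + count_b[c]) / (2 * n) for c in categories}
    pe_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (len(categories) - 1)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return p_o, kappa, ac1


# Hypothetical ratings of a rare pulse trait in 100 patients.
a = ["present"] * 2 + ["present"] * 4 + ["absent"] * 4 + ["absent"] * 90
b = ["present"] * 2 + ["absent"] * 4 + ["present"] * 4 + ["absent"] * 90
p_o, kappa, ac1 = agreement_stats(a, b)
print(round(p_o, 2), round(kappa, 2), round(ac1, 2))
```

With 92% raw agreement, κ is about 0.29 while AC1 is about 0.91, which mirrors the gap between the κ and AC1 ranges reported by these studies.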
Lee et al. 32 reported κ results that ranged from poor (κ = 0.37) to moderate (κ = 0.61), while AC1 measures of agreement for the 2 experts were generally high (ranging from 0.66 to 0.89). The authors noted that patients who demonstrated higher agreement of other diagnostic variables also showed more reliable pulse assessments. 32 They concluded that there was good reliability for pulse assessment in the Korea Institute of Oriental Medicine stroke diagnosis and recommended standardized diagnostic indicators with detail-oriented criteria and better training of clinicians to improve reliability. Although there were only 2 raters and pulse diagnosis reliability constituted only a small portion of the study, the results and interpretation of data indicated favorable reliability for the method in stroke patients. It is not known, however, whether this pulse method can be applied to a healthy population.
Discussion of Results and Limitations of Literature Analysis
The review of these studies confirmed several realities concerning pulse diagnosis. First, it is possible for the same tester to detect similar pulse patterns on the same patient on different occasions. †,††,26,27,32 Testers tended to agree with themselves more often than they did with others when rating the same patient's pulses, demonstrating intra-rater reliability to be higher than inter-rater. †,††,26,27 This suggested that a subjective component of pulse diagnosis is independent of other sensory input, demonstrated by similar agreement levels for both blind and open conditions. †
Poor inter-rater reliability was regularly linked to inconsistent tester interpretations of pulse definitions resulting from confusing terminology used in the classic texts and inadequate operational definitions for modern methods of pulse diagnosis. †,††,8,15,18 Inter-rater reliability further decreased as the number of pulse testers increased 20 and as the complexity of the measured pulse variable increased. ††,9,15,26,27 Later studies suggested that unclear definitions are interpreted differently and result in different users developing individual techniques, thereby reducing inter-rater agreement. †,††,26,27 When, however, the method of pulse diagnosis was standardized or operationally defined or the terminology was interpreted and the procedure implemented in the same way every time, acceptable inter-rater reliability was possible. 9,19,26,27,32 Some studies suggested that within specific methods, revising problematic terminology, developing detail-oriented criteria, and better training of the clinicians controlled for variance relating to subjectivity of tester technique. 22,26,27,31,32
Training and experience of the clinician directly influenced the reliability of pulse diagnosis. 22,26,27,31,32 Greater agreement was found between raters who had more skill with the standardized procedure. 26,27,32 Despite experienced testers and operationally defined procedures, several studies found less agreement for pulse assessments with particular participants. 26,27,32 This suggested that some people may have pulses that are more variable than others, which supports the change itself as being what is diagnostically relevant. 5,6
The literature established the reliability of individual pulse qualities as not straightforward. ††,15,26,27 Inter- and intra-rater reliability of specific qualities decreased as pulse definitions became more complicated or the number of significant descriptive aspects relevant for sensory differentiation increased. ††,15,26,27 More complex descriptions require assessors to consider multiple defining traits during palpation, thus increasing the difficulty of decisions and reducing reliability. Furthermore, it was suggested that pulse quality reliability varied depending on the location in which the quality was detected. 26,27
Sensory input was suggested as important for pulse diagnosis reliability in several ways. Pulse qualities representing yang deficiency, or those that are weak or difficult to detect, were less reliable than those possessing more power and strength. 26,27 When a sensation is barely perceptible, the lack of tactile information potentially presents clinicians with more uncertainty, resulting in less reliability for pulse qualities that exhibit these characteristics. Greater reliability for bilateral methods of palpation was similarly related to increased sensory input for the rater. 26,27 The larger area of contact for finger pads may have delivered more sensory information to the cognitive processes on which clinicians based their judgements. For individual pulse qualities, some that were unreliable when detected with a single finger were reliable with bilateral palpation. 26,27 This suggests that incorporating bilateral palpation may achieve more reliable pulse assessments.
Three recurrent limitations and/or biases were found across the literature. Relatively small sample sizes restricted the extrapolation of results of most studies to a wider population. The studies that incorporated both testers and participants from the same institution risked the potential for bias or the possibility that testers' prior knowledge of participants' pulses influenced the data. Most significantly, many of the more recent studies demonstrated inter-rater reliability and did not investigate intra-rater reliability.
Demonstration of intra-rater reliability was considered contentious by some authors because tester memory may influence results and pulse characteristics change within hours or days. 10 Both claims, however, appear to be unsupported. Investigations of intra-rater reliability can be validated by reducing tester memory and allowing time to pass between test and retest procedures. 33 The volatility of pulse parameters, on the other hand, contradicts millennia of empirical knowledge on which pulse systems across many medical paradigms are predicated, including the origins of current allopathic cardiovascular measurements. 18,34–36
Intra-rater reliability must also be considered in view of clinical practice where patient baselines are assessed, treatment is implemented, and then the same markers are reassessed to judge treatment outcomes. Some methods, such as TMT, even monitor changes in the radial pulse to determine the length of an acupuncture treatment. 37 It is therefore reasonable to conclude that intra-rater reliability is of equal importance. Hence, this may call into question the usefulness of studies that investigated inter-rater reliability and ignored prior or concurrent demonstration of testers repeating reliable assessments on the same participants at different times.
Conclusions of Literature Analysis
For reliable pulse assessments, the literature indicates that the method should be operationally defined with clear, concrete terminology so that all practitioners interpret the definitions and implement the method in the same way every time. For translation of results to the clinical setting, future pulse diagnosis studies should examine both intra- and inter-rater reliability while controlling for pulse tester variance; this can be achieved by ensuring similar training and experience in the same method. To more rigorously investigate inter-rater reliability, the number of testers that evaluate each patient should be greater than two, and consideration should be given to the number of participants and pulse characteristics tested to allow wider extrapolation of results. Finally, independence of data should be guaranteed by ensuring that pulse testers have no prior knowledge of the participants' pulses, and all test data should be stored impartially and securely.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
