Abstract
We conducted a systematic review of studies of observer agreement for medical specialist diagnosis via videoconferencing. The review was based on searches of electronic databases and a hand search of relevant journals and reference lists between 1966 and June 2010. There were 20 studies comparing videoconferencing diagnosis with a non-telemedicine alternative by reporting a measure of agreement. Half of the studies were in the field of dermatology; these studies provided solid support for the reliability of videoconferencing. The other 10 studies were in psychiatry, geriatrics, minor injuries, neurology and rheumatology. Reliability of diagnosis via videoconferencing was confirmed in all studies. In the studies where physical examination was an element of the diagnostic process, results were reliable but authors recommended greater care during the diagnostic process (e.g. good equipment, onsite support, additional camera angles). Four studies incorporated a second group to measure agreement in paired face-to-face assessments. Although useful evidence for the reliability of diagnosis via videoconferencing was provided by the studies in the review, the range of medical specialties was small. The variation in research methodology and statistical analysis suggests a lack of clarity about which research design is appropriate for measuring observer agreement in relation to diagnostic reliability.
Introduction
Medical specialists provide expert diagnosis or advice regarding complex or challenging health matters. The most common method of providing advice is by face-to-face (FTF) appointment between specialist and patient. Specialist advice is sought for a range of problems, and the depth of information required by the specialist during the decision-making process can differ depending on the matter being addressed. Not all patients require a full FTF consultation.
Specialists are limited in number which means that either the patient or the doctor may have to travel long distances to enable consultations to take place. Videoconferencing may allow a more timely and convenient response. However, videoconferencing is not appropriate in all situations, particularly when expert physical examination is required. The aim of the present paper was to provide a summary of comparative studies of medical specialist diagnostic agreement using videoconferencing.
Methods
An electronic search was carried out of the MEDLINE, CINAHL and PubMed databases using the Medical Subject Headings (MeSH) terms listed in Table 1. The search was completed in June 2010. A hand search of the table of contents of the Journal of Telemedicine and Telecare and Telemedicine and E-health was also carried out to identify relevant papers. Reference lists of telemedicine reviews were hand searched for relevant papers. 1–10 Papers were excluded if the sample size was less than 20, based on the protocol used in the Cochrane review of telemedicine. 11 The inclusion/exclusion criteria are listed in Table 2. An attempt was made to contact the authors if additional information was required.
Table 1. Search terms for MEDLINE search strategy
Table 2. Inclusion and exclusion criteria
Results
A total of 1707 papers was identified from the computerized literature search. Initial screening of these articles reduced the total to 23. An additional nine papers were identified from the hand search. The full-text of 32 papers was read. Nine papers were excluded, six because the sample size was less than 20. In all, there were 22 relevant papers (Figure 1). Papers that discussed the same study were grouped together and counted as one study, which brought the final total to 20 studies. A summary of the levels of agreement for each study is provided in Table 3.

Figure 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram of study selection process 35
Table 3. Studies examining observer agreement for diagnosis via VC
The observer agreement studies in the review included dual assessment by two doctors. These doctors were often specialists, and in limited supply, which makes this type of study difficult to conduct. For this reason many studies were carried out in large teaching hospitals, far removed from the actual population that would normally be serviced by telemedicine. Recognition in the summary below is given to those studies which were able to draw on a rural population. A ‘telemedicine population’ is defined as a sample of patients who would, if a telemedicine service were available, be the usual recipients of such a service because of geographical or other limitations in obtaining clinic-based care. There is now greater recognition of the role of urban telemedicine to support people who are isolated in well populated areas, such as frail older residents in aged care facilities. These people might be described as an urban telemedicine population.
Studies in dermatology
Warren et al. conducted a teledermatology study for patients referred by their general practitioner (GP). 12 The study sample was drawn from a remote (telemedicine) population. All patients were seen via videoconferencing (VC) for their first assessment, followed by a FTF consultation (the time between assessments was 1–3 days). Often the GP was present for the assessment. Two dermatologists shared the VC assessments, but only one travelled to the site to see the patients FTF. They decided that in some instances a physical examination would have been helpful, but that the differential diagnosis was similar despite lack of palpation.
The study by Oakley and colleagues in a dermatology clinic required the treatment group to undergo two assessments, first via VC then FTF. 13 The sample included the usual patient referrals for a dermatology clinic. Usually both assessments were by the same dermatologist (79% of patients). The results indicated that agreement between two different dermatologists was significantly lower than when the same dermatologist saw the patients for both assessments. They found that the clinician's level of confidence in the diagnosis was a good indicator of actual agreement. Diagnosis was easier when historical factors played a key role in understanding the disorder. Specialists were least confident when diagnosing pigmented lesions because of the implications for the patient if a melanoma was misdiagnosed. The authors noted that additional equipment, not available for the study, would have improved the visual quality of physical examination. They concluded that telemedicine was adequate for the majority of consultations, although a proportion of patients would always require FTF assessment for diagnosis.
Phillips and colleagues reported two teledermatology studies. The first involved 60 new referrals to a dermatology clinic. 14 Two dermatologists saw each patient, one via VC and the other FTF. There was non-random allocation to either VC or FTF assessment as the first assessment. The percentage agreement in this study was similar to that in the Oakley study. Phillips et al. saw teledermatology as a feasible option, particularly with improvements in camera quality.
A second study by Phillips et al. specifically addressed the question of screening for skin tumours and the identification of malignancy. The study enrolled 51 patients for a FTF assessment followed by a VC assessment. 15 Two dermatologists were involved and they shared the VC and FTF assessments equally (divided by group session). The population was a telemedicine population. There was less diagnostic agreement than expected, but the level of concern about a lesion was similar between both assessors. The VC specialist was more likely to be unsure about a lesion and order a biopsy. The study sample was not large enough to include sufficient malignant lesions to assess differences in recognition or ordering of biopsies. The authors felt that teledermatology was suitable for identifying suspicious skin lesions.
Two studies introduced a control group (FTF/FTF) as a baseline indication of clinician agreement in usual clinical practice. Lesher and colleagues assessed diagnostic agreement in teledermatology in a telemedicine population. 16 Sixty patients were randomly recruited at the health centre where the teledermatology clinic operated. One dermatologist saw the first 30 patients via VC, while the other did the FTF assessments. For the second group of 30, they reversed the allocation of assessments. The interviews were unstructured and no patient history was provided to the specialist prior to the VC or FTF assessment. The control group patients (n = 36) were assessed individually FTF by the two teledermatology specialists and an independent dermatologist. Agreement via VC was significantly lower than for the control group. There was only one case of complete disagreement and this occurred in the VC group. The majority of non-agreement cases were defined as partial agreement. The authors indicated that improved assistance at the remote site for physical examination might increase the levels of agreement.
In a study conducted by Lowitt and colleagues, 102 patients underwent assessment via VC followed by an in-person assessment by two separate dermatologists. 17 Agreement in dual FTF assessments was also examined (n = 29). As in previous studies, the need to touch the skin as a part of diagnosis in some instances was identified as interfering with agreement of assessment during VC. After diagnosis doctors were asked to rate their level of confidence in the accuracy of the diagnosis. Agreement on diagnosis was higher in the VC/FTF group when specialists had a high level of confidence in the accuracy of their own diagnosis. In most cases where disagreement existed, the VC specialist had identified less confidence in the diagnosis. This outcome supports the findings of the study by Oakley et al. 13
In three studies the patient participated in a VC assessment in the presence of their GP who assisted with physical examination, provided additional information or addressed their own questions to the specialist. Diagnostic agreement and management plans were the focus of the study by Loane and colleagues. 18 Patients were referred by their GP. Consenting patients then returned to their health centre and, accompanied by their GP, were seen via VC by a dermatologist. On the same day the patient attended the nearby outpatient clinic and was seen FTF without the GP present. A total of 351 patients were enrolled in the study, of whom 125 were seen by two different dermatologists, one via VC and the other FTF. The level of agreement on diagnosis was similar to previous studies. This was the first study to consider the effect of diagnostic agreement on management of a skin lesion. The agreement on management of a skin lesion increased if there was agreement on the diagnosis, and was significantly affected by whether both assessments were carried out by the same dermatologist.
In the study by Gilmour and colleagues (one colleague being Loane, from two previous studies 13,18), the age range of the patients was three months to 83 years. 19 Patients and their GP were seen by the specialist via VC first, followed by a FTF assessment. The study involved two sites and five dermatologists. At one site, the patient was seen by the same dermatologist for both assessments on the same day (n = 65). At the second site, two different specialists saw each patient (n = 61). The issue of missing a life-threatening skin lesion was raised again, and better picture quality was proposed as a solution. This study confirmed that specialists using VC were more cautious in making a definitive diagnosis. The authors concluded that effective use of VC includes being able to recognise its limitations.
The study conducted by Nordal et al. evaluated teledermatology in a comparative study of VC and FTF consultations. 20 Each patient received two assessments, by two different specialists with equivalent experience. Generally, the first assessment was via VC with the GP present, and the second assessment was FTF. The study was carried out in a telemedicine population with the dermatologist flying to the local site. Twenty percent of the cases were identified as unsuitable for teledermatology. These were unusual cases or those requiring skin palpation or specialist equipment for diagnosis. These findings supported a previous study by Lowitt et al. 17
The final study in dermatology involved the consecutive enrolment of 228 patients at a dermatology outpatient clinic for three phases of assessment by two specialists. 21 Both specialists reviewed digital photographs and clinical information using a store-and-forward methodology and recorded their diagnosis for each patient. This was followed by each doctor separately interviewing each patient using VC. They recorded their diagnosis for a second time. In the last step, one of the specialists interviewed each patient FTF and recorded a final diagnosis. Agreement between the two specialists improved when a VC component was added to the store-and-forward data.
Studies in mental health
The five studies in mental health were from the fields of psychiatry (n = 3) and geriatrics (n = 2).
Psychiatry
The study by Baigent et al. utilised three interview settings: interviewer and observer in the same room (n = 22); interviewer via VC with observer in the same room (n = 20); and both interviewer and observer via VC (n = 21). 22 Two psychiatrists used a semi-structured interview which followed the format of a standard psychiatric history. The authors concluded that there were some measurable differences in VC, but they were not sufficient to cause errors in interpretation.
Consecutive new referrals to a community mental health service in New Zealand were enrolled in a study to examine diagnostic agreement, patient risk, drug and non-drug interventions. 23 Two psychiatrists interviewed each of the patients (n = 37) either via VC or FTF, in random order. Diagnoses were made using the criteria from the Diagnostic and Statistical Manual of Mental Disorders. 24 The authors concluded that telepsychiatry was a dependable method of assessment for new routine outpatient psychiatric referrals.
Child psychiatry was the focus in the study by Elford et al. 25 Patients with their parents participated in a FTF assessment and a VC assessment. The patients were divided into two groups. The first group had VC assessment then FTF assessment, and order of assessment was reversed for the second group. Each patient was seen by two different psychiatrists, who both concluded that videoconferencing did not interfere with diagnosis.
Geriatrics
Loh et al. assessed patients FTF using one of eight physicians and via VC using one of two physicians. 26,27 The patient group was a telemedicine population. The order of assessment was by alternate allocation. Each assessment involved administering a series of standardised assessments, reviewing laboratory and imaging results and conducting an unstructured interview with the patient. The patients were carefully selected as being sufficiently mobile to travel to the telemedicine site, remain for the duration of the assessment and to have no hearing or vision impairment. There was a high level of agreement in the diagnosis of Alzheimer's Disease via VC.
A study by Martin-Khan and colleagues extended the work by Loh et al. and focused on diagnostic agreement for cognitive assessment in an unselected patient population with complex cognitive impairment issues. 28 Each patient received both a VC and a FTF assessment by alternate allocation, using two different specialists. A second set of patients was seen FTF by two different specialists, providing a baseline indication of clinician agreement in standard practice. Each specialist had access to the results of a series of standardized assessments prepared beforehand by the clinic nurse, as well as laboratory and imaging results. Diagnostic agreement was in a similar range to other studies, but significantly lower than in the study by Loh et al. 26,27 This may be the consequence of more complex cases or less stringent assessment protocols.
Studies in minor injuries
Tachakra and colleagues combined clinical examination and the use of radiology to evaluate diagnostic agreement using 200 patients in a hospital accident and emergency department. 29 Patients were seen via VC with an emergency room nurse relaying the images from the patient's room. The patient was then seen by the same specialist FTF, and a second specialist who only saw the patient FTF. Key aspects of the clinical examination (such as colour change, instability, swelling and decreased movement) could be seen well enough to allow VC diagnosis. Processes to identify the presence of increased tenderness improved levels of agreement.
Current practice for the assessment of minor injuries at a peripheral hospital in the UK, at the time of publication, was a review by a GP. Benger et al. compared the diagnostic agreement of minor injuries in 600 patients using three different scenarios: a telemedicine emergency medicine specialist, an onsite emergency medicine specialist and an onsite GP. 30 Radiographs were available for all assessors, and additional requests could be made as required. Discrepancies in diagnosis were reviewed by an independent panel of 10 specialists who were blinded to the format of the assessments. The authors found that the safety of telemedicine for minor injuries was similar to conventional practice.
Studies in neurology
Craig et al. described an interactive VC to assess patients admitted to a hospital with neurological symptoms. 31 A junior physician (1–3 years' experience) made a diagnosis FTF which was followed by a VC with a neurologist. All patients were seen FTF by a consultant neurologist within four weeks of the VC assessment. This study was conducted in a telemedicine population, with the consultant neurologist travelling to the hospital site for a neurology clinic and ward round. The neurologist was not blinded to the VC outcome (it would have been unethical to withhold the diagnosis or treatment plan for the duration required for a blinded study given this methodology). The authors concluded that a specialist neurological assessment via VC could provide reliable support to a junior physician working in a remote hospital.
A second study by Craig and colleagues, reported in the same year, involved consecutive enrolment of 25 unselected patients who were referred to a neurological outpatient clinic by their GP. 32 All patients were seen first by one specialist via VC and then by a different specialist FTF. A junior doctor was present with the patient during the VC and provided support to the neurologist by carrying out a guided neurological examination and summarising the patient's details and referral letter. The methodology of this study, where the two specialists were blinded to each other's diagnosis, provided confirmation of the findings from the previous Craig study. 31
Studies in rheumatology
A non-randomised prospective study by Leggett et al. assessed diagnostic agreement of telephone conferencing and VC. 33,34 A GP took the history of each patient in the study, followed by a three-way telephone conference between the GP, patient and the rheumatologist. This was followed by a VC between the three participants. Finally the same specialist met the patient for a FTF interview. It was observed that VC was highly reliable for diagnosis.
Discussion
The studies identified in the present review provide reasonable evidence of the reliability of VC for diagnosis. Although there were only two studies of teleneurology, these studies form part of a larger body of work in this field which did not meet the strict inclusion criteria for the review. More robust evidence was confined to a few medical specialities where studies have been carried out with suitably powered samples and in a range of settings. Studies in the field of teledermatology accounted for half of those in the review. In many cases evidence was provided by only a few preliminary studies (e.g. geriatrics, psychiatry, rheumatology).
A ‘telemedicine population’ is a study sample derived from locations where telemedicine would form a potentially useful part of the local health service. These might be remote hospitals, rural health services or high care residential facilities where residents are unable to travel even short distances. Many telemedicine studies, particularly reliability studies where agreement between two doctors is required, are carried out in large metropolitan hospitals where the additional staff are readily available for research projects. Reliability studies in telemedicine populations are challenging because of the extra cost of travel shared between two clinicians at the host and remote sites. Several studies in the present review used patients from a telemedicine population, which is commendable in view of the challenges. 12,15,20,26,27,31 Reliability studies are only one element of a suite of methodologies (economic analysis, cost analysis, satisfaction studies) which provide evidence of the benefit of VC. While not essential for reliability studies, the use of a telemedicine population is important for studies such as feasibility, economic or satisfaction studies because ‘real world settings’ have an impact on implementation.
Assessment of new technologies is often challenging. Preliminary studies which aim to test feasibility or to gather initial data to calculate sample sizes or refine research protocols for larger studies may be restricted as a result of limited funding. These preliminary studies are important because the results provide data for refining the protocols of larger, more expensive and comprehensive studies. Costly mistakes are avoided through this process. While the value of these studies is recognised, their results for the reliability of VC should be interpreted cautiously. This is because pilot studies with careful patient selection and small sample sizes may artificially inflate levels of agreement. Studies of fewer than 20 participants were excluded from the present review.
The predominant study design compared the diagnosis reached following an assessment via VC with the diagnosis from a FTF assessment. This design assumes that the FTF assessment is correct and takes no account of variation between doctors. Bias based on the between-doctor variation in diagnostic technique is reduced by ensuring that the two doctors complete both FTF and VC assessments. Levels of agreement are generally compared with those achieved in other studies. For diseases where diagnosis is complex, such as borderline dementia diagnosis, agreement may be low because of the progressive nature of the disease rather than the use of VC. For this reason, incorporating a second sample of dual FTF assessments provides a better baseline comparison for estimating the extent to which reliability is affected by the VC element of the diagnostic process. In the present review, four studies used a second sample of paired FTF assessments. 16,17,22,30 Two studies used a baseline comparison sample of almost equal size to the VC sample, 22,30 which enabled statistical comparison of outcomes.
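The value of a paired FTF baseline can be illustrated with a small calculation. The sketch below compares the proportion of agreeing diagnoses in a VC/FTF sample with that in a FTF/FTF sample using a two-proportion z-test. It is an illustration only: the counts are hypothetical and are not taken from any study in the review, the function name `compare_agreement` is our own, and the normal approximation is just one of several tests an analyst might reasonably choose.

```python
import math


def compare_agreement(agree_vc, n_vc, agree_ftf, n_ftf):
    """Compare two agreement proportions with a two-proportion z-test.

    agree_vc / n_vc   : agreeing pairs and total pairs in the VC/FTF sample
    agree_ftf / n_ftf : agreeing pairs and total pairs in the FTF/FTF baseline
    Returns (p_vc, p_ftf, z, two_sided_p_value).
    """
    p_vc = agree_vc / n_vc
    p_ftf = agree_ftf / n_ftf

    # Pooled proportion under the null hypothesis of equal agreement.
    pooled = (agree_vc + agree_ftf) / (n_vc + n_ftf)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_vc + 1 / n_ftf))
    if se == 0:
        raise ValueError("agreement is 0% or 100% in both samples; test undefined")

    z = (p_vc - p_ftf) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_vc, p_ftf, z, p_value


# Hypothetical counts: 48 of 60 VC/FTF pairs agree, 27 of 30 FTF/FTF pairs agree.
p_vc, p_ftf, z, p_value = compare_agreement(48, 60, 27, 30)
print(f"VC/FTF agreement:  {p_vc:.2f}")
print(f"FTF/FTF agreement: {p_ftf:.2f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With a baseline sample much smaller than the VC sample, the standard error is dominated by the baseline group, which is one reason a baseline of comparable size makes the comparison more informative.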
A study of reliability not only needs to report measures of agreement but also to report influences on agreement which may positively or negatively affect the outcome. High levels of variation in the clinic model, study design – including the clinical reference standard – and reporting methods were evident in the present review. The variation in statistical analysis of studies of observer agreement is summarised in Table 4. Such variation makes it more difficult to compare outcomes across studies.
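One source of the variation in reporting is the choice between raw percentage agreement and a chance-corrected statistic. As a minimal sketch, the Python code below computes both for a pair of assessors, using Cohen's kappa as the chance-corrected measure. Kappa is chosen here because it is widely used for this purpose, not because any particular study in the review is known to have used it; the diagnoses and function names are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of patients given the same diagnosis by both assessors."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both assessors must rate the same non-empty set of patients")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)

    # Chance agreement from each assessor's marginal diagnosis frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1:
        raise ValueError("kappa is undefined when both assessors use a single category")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical diagnoses for ten patients, one list per mode of assessment.
vc = ["eczema"] * 4 + ["psoriasis"] * 3 + ["naevus"] * 3
ftf = ["eczema", "eczema", "eczema", "psoriasis", "psoriasis",
       "psoriasis", "naevus", "naevus", "naevus", "eczema"]

print(f"percentage agreement: {percent_agreement(vc, ftf):.2f}")
print(f"Cohen's kappa:        {cohens_kappa(vc, ftf):.2f}")
```

In this made-up example the assessors agree on 7 of 10 patients, yet kappa is noticeably lower than 0.70 because some of that agreement would be expected by chance. Studies reporting only one of the two figures are therefore not directly comparable.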
Table 4. Statistical methods reported in the studies reviewed
Conclusion
Communication via videoconference is becoming more commonplace. The consistent good to excellent diagnostic agreement across the specialities suggests that VC is likely to be a reliable tool to communicate with patients for the purpose of making a diagnosis, in situations where the diagnostic process lends itself to this format.
The majority of studies identified speciality-specific recommendations for improving the reliability of diagnosis when using VC. This reinforces the need, despite general confidence in the use of VC, for continued attention to individual diseases or specialities. Dermatology showed good to high levels of agreement consistently across the numerous studies in the review. This supports the general recommendation of reliability for teledermatology. While under-represented in the review, work in teleneurology is also quite advanced. Observer agreement in the remaining studies, across the specialities, was also good. The limited number of studies, or the small sample sizes, suggests that more work is still required before recommendations can be made for these additional specialities.
Footnotes
Acknowledgements
The study was funded by the National Health and Medical Research Council (NHMRC), grant no 456135. MM-K was funded by an NHMRC PhD scholarship. JW was supported by a US Department of Veterans Affairs grant (HSR&D IIR 05-278). The views expressed in this article are those of the authors and do not necessarily represent those of the US Department of Veterans Affairs.
