Abstract
Background
In a large radiological center, the ultrasound (US) quality assurance (QA) program involves several professionals. Although the operator and the parameters utilized can contribute to the results, the selected QA parameters should still reflect the quality of the US scanner, not the measuring process.
Purpose
To evaluate the reproducibility of recommended phantom-based US QA parameters in a realistic environment.
Material and Methods
Six sonographers measured six high-end US scanners with 20 transducers using a general purpose phantom. Every transducer was measured seven times in total, using one frequency per transducer. The QA parameters studied were homogeneity, visualization depth, vertical and horizontal distance measurements, axial and lateral resolution, and the correct visibility of anechoic and high-contrast masses. The evaluation of the homogeneity was based on visual observations. Inter-observer interquartile ranges were computed for the grading of the masses. For the other QA parameters, the mean inter- and intra-observer coefficients of variation (CoV) were calculated. In addition, the symmetry of the reverberations when imaging air with a clean transducer was checked.
Results
The mean inter-observer CoVs were: visualization depth 11 ± 4%, vertical distance 1.7 ± 0.4%, horizontal distance 1.4 ± 0.6%, axial resolution 22 ± 7%, and lateral resolution 16 ± 8%. The mean intra-observer values were about half of these values with similar standard deviations. The visual evaluation of the homogeneity and the symmetry of the reverberations produced false-positive findings in 5% of the cases, but were found useful in detecting a defective transducer. The grading of the masses had mean interquartile ranges of 20–30% of the grading scale.
Conclusion
The inter-observer variability in measuring phantom-based QA parameters can be relatively high. This should be considered when implementing a phantom-based QA protocol and evaluating the results.
Keywords
Most of the methods utilized for B-mode ultrasound (US) quality assurance (QA) are based on assessing image quality using a phantom (1–11). Often, the images are analyzed visually and with manual measurements during the scanning session (1–5). To attain more objectivity, computer programs for automatic image analysis have also been developed (6–10). Opinions differ on whether changes in image quality can be detected in modern US scanners using parameters from phantom measurements (5, 12). Besides phantom measurements, testing systems that examine the functionality of every element of the transducer can also be utilized (13, 14). While the new American College of Radiology (ACR) standard on monitoring the performance of real-time US equipment lists recommended image quality parameters to be checked regularly, it does not always specify what methodology should be utilized in gathering the results (11).
In our imaging center, 19 units perform radiological US examinations. Due to the wide spatial distribution of the units and the amount of equipment, several persons must be involved in the QA process. On the other hand, no extra human resources are available, so all the measurements performed must be carefully considered. In ultrasound imaging, besides the quality of the scanner, the imaging results depend on the operator handling the transducer and on the imaging parameters utilized. Factors affecting the accuracy of phantom measurements include human error, image pixel size and resolution, caliper precision, velocity and distance calibration, and phantom-related errors, e.g. the propagation of ultrasound in the phantom (15). If the expected human error is prominent in the measuring process, the QA parameter has little value in the continuing performance assessment. The purpose of this work was to study how reliably the recommended QA parameters could be reproduced by several sonographers in a realistic setting. A general purpose phantom with manual analysis was utilized, since this is still the most straightforward approach for US QA, with existing standards and recommendations (1, 2). This study was part of a larger US QA project and was also linked with the training of new sonographers.
Material and Methods
Six high-end US scanners with altogether 20 transducers located in three different radiological units were measured by six sonographers. The scanners had been purchased from three different vendors during 2004–2009. The scanners included linear very high-frequency transducers, linear high-frequency transducers, micro-convex or sector transducers and convex low-frequency transducers (Table 1). Every transducer was measured with one frequency, the lowest available (convex low-frequency transducers) or the highest (other types of transducers).
Table 1. Scanners, transducers, and frequencies used in this study
*In some transducers, the user can choose between three frequency ranges: penetration, general, and resolution. The total frequency ranges of these transducers, specified by the manufacturer, are given in parentheses
Res = resolution, Pen = penetration
Every sonographer measured five scanners once and one scanner twice. Every scanner was thus measured seven times, during no more than 10 days to ensure the same condition of the transducer and the scanner in every measurement. The phantom was a CIRS model 040 general purpose phantom with Zerdine™ as the background material and with nylon filament targets (diameter of 0.1 mm) and anechoic and high-contrast masses (16). The attenuation in the measurements was 0.5 dB/(MHz cm).
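The measurement design described above can be sketched in a few lines. This is only an illustration: the text does not state which sonographer repeated which scanner, so the pairing used here (sonographer i repeats scanner i) and the function name `build_schedule` are assumptions.

```python
from collections import Counter


def build_schedule(n_sonographers=6, n_scanners=6):
    """Return a list of (sonographer, scanner) measurement sessions.

    Every sonographer measures each scanner once and one scanner twice.
    """
    schedule = []
    for sonographer in range(n_sonographers):
        for scanner in range(n_scanners):
            schedule.append((sonographer, scanner))
        # Assumed pairing for the repeated session: sonographer i repeats scanner i.
        schedule.append((sonographer, sonographer % n_scanners))
    return schedule


schedule = build_schedule()
sessions_per_scanner = Counter(scanner for _, scanner in schedule)
sessions_per_sonographer = Counter(sonographer for sonographer, _ in schedule)

print(len(schedule), dict(sessions_per_scanner))
```

With this pairing, each scanner receives six single sessions plus one repeat, which matches the seven measurements per scanner stated in the text.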
The QA parameters studied are described in Table 2. The sonographers performed the measurements without any knowledge about the earlier QA results for this equipment. Thus the expected results were not known, except for the vertical and horizontal distance measurements.
Table 2. Measured QA parameters, their detailed descriptions, and analysis methods
The QA protocol for every transducer was implemented and saved in the scanners. For a given transducer, the same protocol was always selected by the different sonographers. Between the different scanner models, it was not always possible to implement exactly the same imaging parameter settings. The main principle was to process the signal as little as possible, meaning that the more sophisticated features, such as harmonic imaging, spatial/frequency compounding, and the manufacturer's proprietary filtering methods, were switched off. For the other parameters, the following guidelines were used: output power was set to the maximum level, time-gain-compensation (TGC) was adjusted to achieve uniform signal across the field of view, and dynamic range was set to 60 dB. Individual gain settings were allowed to obtain the best possible visibility for each measurement. Rejection, edge enhancement, and frame averaging were set to the lowest level possible. A linear gray map was selected. Line density was set to the highest possible value. A single focus at the same depth as the structure studied in each measurement was utilized, and the overall imaging depth was selected to allow the best visibility of the structure. Whenever possible, the abdomen was selected as the body part to be imaged. The TI, MI, and frame rate values were noted down for each measurement. Also, an image from each measurement was saved to a picture archiving and communication system (PACS).
The sonographers had one day of training, including a lecture on quality control and a workshop demonstrating how different scanner settings influence QA with a phantom. In addition, the measurement protocol was reviewed in groups of two sonographers using one of the scanners included in this study. Before the measurements started, the sonographers also had the chance to practice with the phantom and receive feedback on the results.
If one of the sonographers found an abnormal sign in the air image or in the homogeneity image possibly referring to dead elements in the transducer, all the corresponding measurements by the other sonographers were also checked when analyzing the results. The number of cases interpreted as false-positives was counted.
To estimate the human error in measuring the visualization depth, vertical and horizontal distances, and axial and lateral resolution, the coefficient of variation (CoV) (15) was computed for each transducer, including the results from all sonographers (inter-observer CoV) or results from the same sonographer (intra-observer CoV). To obtain a single inter- and intra-observer estimate for every QA parameter, the mean inter- and intra-observer CoVs were computed, including all the transducers.
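As an illustration of the computation described above, the following minimal sketch calculates inter- and intra-observer CoVs for one transducer and averages per-transducer CoVs. The function names and the example values are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, stdev


def cov_percent(values):
    """Coefficient of variation in percent (sample SD divided by mean)."""
    return 100.0 * stdev(values) / mean(values)


def inter_observer_cov(measurements):
    """CoV over all results for one transducer, all sonographers pooled.

    measurements: list of (sonographer, value) tuples.
    """
    return cov_percent([value for _, value in measurements])


def intra_observer_cov(measurements):
    """Mean CoV over sonographers who measured the transducer more than once.

    Returns None when no sonographer has repeated measurements.
    """
    by_sonographer = {}
    for sonographer, value in measurements:
        by_sonographer.setdefault(sonographer, []).append(value)
    repeated = [vals for vals in by_sonographer.values() if len(vals) > 1]
    if not repeated:
        return None
    return mean(cov_percent(vals) for vals in repeated)


def mean_cov(per_transducer_covs):
    """Single summary estimate: mean of the per-transducer CoVs."""
    return mean(per_transducer_covs)


# Hypothetical visualization depths (mm) for one transducer; "A" measured twice.
example = [("A", 60.0), ("A", 62.0), ("B", 55.0), ("C", 70.0),
           ("D", 65.0), ("E", 58.0), ("F", 64.0)]

inter = inter_observer_cov(example)
intra = intra_observer_cov(example)
print(f"inter-observer CoV: {inter:.1f}%, intra-observer CoV: {intra:.1f}%")
```

Because each sonographer repeated only one scanner, the intra-observer estimate for a transducer rests on a single pair of measurements in this design, which is why it is averaged over transducers before being reported.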
For the grades of the anechoic and high-contrast masses (Table 2), the inter-observer interquartile range (17) was computed for every transducer. Also, the mean inter-observer interquartile ranges were determined including all transducers.
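The interquartile range computation can be sketched as follows. The grade values are hypothetical (the grading scale is defined in Table 2), and the quartile convention used here, linear interpolation with the "inclusive" method of Python's `statistics.quantiles`, is an assumption; other conventions give slightly different values for small samples.

```python
from statistics import mean, quantiles


def interquartile_range(grades):
    """Q3 - Q1 of the grades given to one transducer by all sonographers."""
    q1, _, q3 = quantiles(grades, n=4, method="inclusive")
    return q3 - q1


def mean_interquartile_range(grades_per_transducer):
    """Mean of the per-transducer interquartile ranges."""
    return mean(interquartile_range(g) for g in grades_per_transducer)


# Hypothetical grades from seven measurements of two transducers.
transducer_a = [1, 1, 1, 2, 2, 1, 1]
transducer_b = [2, 2, 2, 2, 2, 2, 2]

iqr_a = interquartile_range(transducer_a)
iqr_b = interquartile_range(transducer_b)
mean_iqr = mean_interquartile_range([transducer_a, transducer_b])
print(iqr_a, iqr_b, mean_iqr)
```

The interquartile range is used instead of the CoV because the grades are ordinal, so a mean and standard deviation would not be meaningful summaries.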
The distance and resolution measurements and the evaluation of the masses were performed only within the measured visualization depth of the individual transducer at the chosen frequency, although the high-contrast targets may have been visible deeper. Thus, the very high-frequency linear transducers were excluded from the lateral resolution measurements at the depth of 60 mm. Also, the deepest lateral resolution measurements as well as the deepest horizontal distance measurements were only performed with the low-frequency convex transducers. In the distance and resolution measurements, zooming was allowed, as it should not have a significant effect on the results (3).
Results
Doubtful non-symmetry or inhomogeneity in the air image or in the homogeneity image, interpreted as a false-positive, was reported in 5% of the images. With one linear transducer, in six out of the seven measurements, severe non-symmetry of the reverberations in the air image was noticed. Although not known by the sonographers, the transducer in question had 22 dead elements in one corner of the transducer, detected earlier using a FirstCall Aperio™ transducer measurement system (Sonora Medical Systems Inc., Longmont, CO, USA).
The mean inter- and intra-observer CoVs for the visualization depth, distance, and resolution measurements are presented in Table 3. The mean inter-observer interquartile ranges for the anechoic and high-contrast masses were 0.4 ± 0.3 and 0.3 ± 0.2, respectively.
Table 3. Mean inter- and intra-observer coefficients of variation (CoV) with standard deviations (std) for the visualization depth, vertical and horizontal distance measurements, and axial and lateral resolution measurements
In measuring the visualization depth, one result for a high-frequency linear transducer was excluded from the results. The result in question was only 63% of the mean of the other corresponding results for this transducer. The lower TI and MI values in this measurement suggested that probably the output power was accidentally set too low. The low-frequency convex transducers were excluded from the visualization depth results, since the bottom of the phantom at the depth of 180 mm could be seen with every low-frequency convex transducer using about 2 MHz frequency.
Due to a misunderstanding, the axial and lateral resolution results obtained by measuring the dimensions of a filament were only available from four sonographers. Thus, the number of these measurements in estimating the inter-observer precision was only four or five per transducer, and the intra-observer estimates for the resolution measurements were lacking for the transducers of two scanners (GE Logiq 9 and Toshiba Aplio XG).
Discussion
The purpose of this work was to evaluate the reproducibility of phantom-based QA parameters in a realistic setting. In a large radiological center, the QA must be performed by several professionals – inevitably some more experienced than others. In this work, six sonographers measured typical recommended phantom-based US QA parameters using six scanners with altogether 20 transducers. Every transducer was measured seven times.
The evaluation of the air image and of the homogeneity image produced false-positive findings in 5% of these images altogether. The one known defective transducer in this study (22 consecutive missing elements in one corner) was detected in six of seven measurements in the air image. However, the curved edge on the other side of the homogeneity image of this linear transducer, due to the missing elements, was not noticed by any of the sonographers when performing the measurements.
In general, it is not clear how small a number of missing elements can be detected in a phantom image in the first place; this probably also depends on the transducer type (14), the total number of elements, and the aperture size. This question was outside the scope of this work.
The inter-observer precision for the visualization depth was low (mean CoV of 11%). The images saved to PACS were all very similar for the same transducer, but the interpretation of the depth at which the noise started to dominate the speckle varied. The American Association of Physicists in Medicine (AAPM) and the American Institute of Ultrasound in Medicine (AIUM) recommend that the defect level of the change in the visualization depth be set at 10 mm (1, 2), when compared to the baseline value from the acceptance test. Typical visualization depths for the linear and micro-convex or sector transducers included in this study, with the imaging parameters utilized, varied between 40 and 110 mm. The CoV of 11% thus corresponded to inter-observer standard deviations of about 4–12 mm.
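The arithmetic behind the 4–12 mm range can be reproduced with the short sketch below, which uses only the numbers quoted in the text (mean CoV of 11%, visualization depths of 40–110 mm, defect level of 10 mm). The function names are illustrative, and applying the mean CoV uniformly to every depth is a simplifying assumption.

```python
MEAN_INTER_OBSERVER_COV = 0.11  # mean CoV for visualization depth (from the text)
DEFECT_LEVEL_MM = 10.0          # AAPM/AIUM recommended defect level (from the text)


def expected_sd_mm(depth_mm, cov=MEAN_INTER_OBSERVER_COV):
    """Inter-observer standard deviation implied by a CoV at a given depth."""
    return cov * depth_mm


def depth_where_sd_equals_defect_level(defect_level_mm=DEFECT_LEVEL_MM,
                                       cov=MEAN_INTER_OBSERVER_COV):
    """Visualization depth at which one SD equals the defect level."""
    return defect_level_mm / cov


for depth in (40.0, 110.0):
    print(f"depth {depth:.0f} mm: SD about {expected_sd_mm(depth):.1f} mm")

print(f"SD reaches the defect level at about "
      f"{depth_where_sd_equals_defect_level():.0f} mm")
```

Under this simplification, one inter-observer standard deviation alone reaches the 10 mm defect level for visualization depths of roughly 91 mm and above.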
The vertical and horizontal distance measurements had mean inter-observer CoVs of 1.7% and 1.4%, respectively. The AAPM and AIUM recommend that the vertical distance error should not exceed 2% and the horizontal 3% (1, 2). In this study, the inter-observer measurement precision alone was close to the AIUM and AAPM vertical distance defect level.
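To give a feel for what a CoV close to the defect level implies, the sketch below estimates how often a single reading would fall outside the tolerance through observer scatter alone. This is not part of the study's analysis: it assumes unbiased, normally distributed readings whose relative standard deviation equals the mean inter-observer CoV, and the function names are illustrative.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fraction_outside(tolerance_percent, cov_percent):
    """Two-sided fraction of readings deviating more than the tolerance.

    Assumes normally distributed, unbiased readings with SD = CoV.
    """
    z = tolerance_percent / cov_percent
    return 2.0 * (1.0 - normal_cdf(z))


# Tolerances and mean inter-observer CoVs quoted in the text.
vertical = fraction_outside(tolerance_percent=2.0, cov_percent=1.7)
horizontal = fraction_outside(tolerance_percent=3.0, cov_percent=1.4)

print(f"vertical: {100 * vertical:.0f}% of readings outside 2%")
print(f"horizontal: {100 * horizontal:.0f}% of readings outside 3%")
```

Under these assumptions, a noticeable share of vertical distance readings from a correctly calibrated scanner would exceed the 2% level, whereas the horizontal tolerance of 3% leaves considerably more margin.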
The methods utilized for the resolution measurements produced relatively high inter-observer variations, CoVs of 9–23%. This could partly be due to the small mean values, varying between 0.4 and 2 mm, in computing the CoVs. On the other hand, the results were of the same order of magnitude as the estimated reproducibility of resolution measurements in Dudley et al. (5). With method 1, diverging results were more rarely seen than with method 2, but the differences between the diverging results were bigger. This was obviously due to the discrete results with method 1. With method 2, using a continuous scale, exactly the same results were obtained less frequently, but the differences between the diverging results were smaller. The recommended defect level for the axial resolution by the AAPM (1) and AIUM (2) is 1 mm or 2 mm (frequency < 4 MHz). The recommended defect level for lateral resolution depends on the focal length, frequency and aperture diameter (1).
The evaluation of the anechoic and high-contrast masses did not seem to give very useful information in this study, since the divergence between the results was high when compared to the scale of grading.
Automatic estimation of the QA parameters from the images could result in more repeatable analysis (6–10). For example, Gorny et al. (9) found the standard deviation of measuring the visualization depth to vary between 0.4 and 2 mm with their automatic analysis methods. In our study, a suitable, properly tested analysis program was not available. Consistent scanning of the images by different operators would still be needed even if the images were automatically analyzed. Another possibility would be the use of a less practical approach with a special transducer holder (9).
Different types of transducers (linear very high-frequency transducers, linear high-frequency transducers, micro-convex or sector transducers, and convex low-frequency transducers) were included in the computation of the mean CoVs. The inter- and intra-observer variabilities may also depend on the transducer type. For example, the visualization depth results obtained with the micro-convex or sector transducers were clearly more variable than those obtained with the linear transducers. Also, the inter-observer precision when measuring the lateral resolution (filament separation) near the phantom surface was worse with the convex transducers than with the linear transducers, probably due to the greater freedom in directing a convex transducer. Still, most of the results did not show clear differences that depended on the transducer type. The number of transducers was too small for a more specific type-wise analysis.
There were some aspects in the measurements and in the interpretation of the results which should have been emphasized more in teaching phantom-based QA before the study. The sources of the problems faced were found efficiently thanks to the availability of the images and important measurement parameters afterwards. Valuable information and experience on the teaching and learning process itself and on creating an effective measurement protocol was obtained during the study. In general, working with a phantom was also found to be a valuable learning tool for the sonographers.
In conclusion, the inter-observer variability in measuring phantom-based QA parameters in a large imaging center can be relatively high. In this study with manual analysis of the QA images, the recommended defect levels for some of the QA parameters could be reached due to the inter-observer variability alone. The inter-observer variability should be carefully considered to avoid useless efforts in performing QA and wrong conclusions from the results.
