Abstract
To assess the compliance of “integrated” continuous glucose monitoring (CGM) systems with U.S. Food and Drug Administration requirements, the calculation of confidence intervals (CIs) on agreement rates (ARs), that is, the percentage of CGM measurements lying within a certain deviation of a comparator method, is stipulated. However, despite the existence of numerous approaches that could yield different results, a specific procedure for calculating CIs is not described anywhere. This report, therefore, proposes a suitable statistical procedure to allow transparency and comparability between CGM systems. Three existing methods were applied to six data sets from different CGM performance studies. The results indicate that a bootstrap-based method that accounts for the clustered structure of CGM data is reliable and robust. We thus recommend its use for the estimation of CIs of ARs. A software implementation of the proposed method is freely available (
Introduction
In 2018, the U.S. Food and Drug Administration (FDA) introduced requirements for “integrated” continuous glucose monitoring (CGM) systems. In terms of point accuracy, that is, the agreement between CGM and comparator blood glucose measurements at a given point in time, these requirements are mainly defined for agreement rates (ARs) (Table 1). 1 The AR provides the percentage of CGM measurements lying within certain limits, for example, ±20%, of their paired comparator measurements.
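As a minimal sketch of this definition, the AR for a relative limit such as ±20% could be computed as follows. The function name, argument names, and toy values are illustrative assumptions and are not taken from the report or from the FDA requirements; limits defined in absolute terms are not covered here.

```python
def agreement_rate(cgm, comparator, rel_limit=0.20):
    """Percentage of CGM readings within +/- rel_limit of the paired comparator value.

    Hypothetical helper for illustration; only a relative limit is handled.
    """
    if len(cgm) != len(comparator) or not cgm:
        raise ValueError("need two equally long, non-empty lists of paired values")
    within = sum(
        abs(c - ref) <= rel_limit * ref
        for c, ref in zip(cgm, comparator)
    )
    return 100.0 * within / len(cgm)


# Toy paired readings in mg/dL (made up for illustration).
cgm_values = [100.0, 130.0, 79.0, 210.0]
comparator_values = [110.0, 100.0, 80.0, 200.0]
print(agreement_rate(cgm_values, comparator_values))  # 75.0
```

Three of the four toy pairs lie within ±20% of the comparator value, so the AR is 75%.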
Table 1. Food and Drug Administration Requirements Defined for Point Accuracy
From U.S. Food and Drug Administration. 1 Table notes: lower bound of one-sided 95% confidence interval; AR, agreement rate.
This report deals with one specific aspect of the FDA requirements: the fact that the AR requirements are defined for the lower bound of the corresponding one-sided 95% confidence interval (CI). The use of CIs marks a fundamental shift compared with the acceptance requirements of blood glucose monitoring systems (BGMS) and rightly acknowledges the existence of statistical uncertainty when drawing conclusions about the accuracy of CGM systems in general based on individually planned clinical studies. The lower bound of the CI can be interpreted as a threshold above which we are 95% confident that the true AR lies. An issue with these requirements is, however, that a specific description of the procedure for calculating the CI is neither required by the FDA, nor provided in existing FDA approval requests. 1,2
Furthermore, possible CI calculation procedures have so far not been discussed in the context of CGM performance studies. At the same time, numerous general statistical approaches that potentially lead to different results have been proposed in the literature. 3–6 Standardization of the statistical method is, therefore, essential to allow transparency and comparability of the results between CGM systems. The aim of this brief report is to review and evaluate selected approaches for CI calculation regarding their applicability to CGM data from real clinical performance studies and to provide a statistical foundation for developers, manufacturers, regulators, and researchers.
In the FDA requirements, the ARs and their associated CIs in different glucose ranges are calculated with respect to the measurements of the CGM system. Categorizing by the CGM system rather than by the comparator method, which is assumed to be more accurate, is unusual in the field of measurement method comparison and hinders the comparison of different CGM systems with each other. However, this aspect will not be discussed further, as the approaches for CI calculation are independent of the measurement method used for glucose range determination.
Methods
Selection of statistical approaches for CI calculation
A well-known characteristic of the CI is its dependence on sample size. A larger sample size typically reduces the statistical uncertainty, which is reflected in a narrower CI, that is, a smaller distance between the AR and the associated lower bound. Another relevant factor is the variability of the measurement values. A higher variability widens the CI and can be compensated for by an increase in sample size. In the case of CGM systems, there is an additional important point to consider: measurement values within one sensor are correlated, as they are affected, for example, by physiological factors of the individual subject and by measurement properties of the sensor.
Therefore, the distinction between sensor-to-sensor variability and variability within a sensor is essential. Data from each sensor are thereby considered as a cluster. Standard approaches that disregard the clustered structure of the data tend to underestimate the width of the CI, 4,7 which could lead to a requirement being wrongfully met. Therefore, several methods for calculating the CI of ARs in clustered data have already been proposed in the context of clinical trials. 5,6 In this report three approaches for both clustered and unclustered data were selected and compared.
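One common way to make this underestimation concrete is the design effect known from survey statistics. The expression below is a rough sketch that assumes equally sized clusters; the symbols p (true AR as a proportion), N (total number of paired readings), m (readings per sensor), and ρ (correlation between readings of the same sensor) are introduced here for illustration only and do not appear in the report.

```latex
\operatorname{Var}\bigl(\hat{p}\bigr) \;\approx\; \frac{p\,(1-p)}{N}\,\bigl[\,1 + (m-1)\,\rho\,\bigr]
```

For ρ > 0 and m > 1, the bracketed factor exceeds 1, so a method that assumes independent readings (factor equal to 1) works with too small a variance and, under these assumptions, reports a CI that is too narrow.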
The first approach is the standard Clopper–Pearson (CP) method 8 that neglects the clustered nature of the data. It was selected for comparison with the following two approaches to highlight the importance of using a method that takes the clustered structure of the data into account.
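For orientation, the one-sided CP lower bound can be sketched with the Python standard library alone. The lower bound is the proportion p at which the probability of observing at least k readings within the limit out of n equals α; the sketch finds it by bisection. The function names are hypothetical and this is not the implementation published with the report. With SciPy available, the same quantity should be obtainable as the α-quantile of a Beta(k, n − k + 1) distribution, for example via `scipy.stats.beta.ppf(alpha, k, n - k + 1)`.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return total


def clopper_pearson_lower(k, n, alpha=0.05):
    """Lower bound of the one-sided (1 - alpha) Clopper-Pearson interval.

    k: readings within the limit, n: all readings (treated as independent,
    which is exactly the simplification criticized for clustered CGM data).
    """
    if n < 1 or not 0 <= k <= n:
        raise ValueError("need n >= 1 and 0 <= k <= n")
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(50):  # bisection; the tail probability increases with p
        mid = (lo + hi) / 2.0
        if binom_upper_tail(k, n, mid) < alpha:
            lo = mid  # tail too small, so the bound lies above mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Toy example: 950 of 1000 readings within the limit.
print(100.0 * clopper_pearson_lower(950, 1000))
```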
The second approach considered in this report is the Wilson Score Continuity-Corrected Interval for Clustered Data (WCC). With this method, the CI is calculated by applying defined statistical formulas, which rest on specific assumptions, to the CGM data. This semiparametric method is relatively new and has been shown to outperform other semiparametric methods for small sample sizes under a wide range of scenarios. The WCC approach accounts for the clustered structure of the data by differentiating between sensor-to-sensor variability and variability within a sensor using a standard analysis of variance (ANOVA) approach. 5,6
The third approach uses a technique known as bootstrapping. 9,10 It allows general conclusions about the accuracy of the CGM system without relying on assumptions about the statistical properties of the target parameter and is, therefore, a viable alternative to the semiparametric methods. The bootstrapping method randomly resamples the data with replacement, thereby mimicking the data collection process and allowing the simulation of many “virtual” repetitions of the clinical study. It is important to note that the resampling is performed at the level of the sensors, which preserves the clustered structure of the data. For each repetition, the AR is calculated, and its CI is then estimated from the bootstrap samples of “virtual” clinical studies. A recommended variant of the bootstrapping method for calculating CIs is the bias-corrected and accelerated bootstrapping (BCa) method. 9,11 It allows the correction of possible bias and skewness of the bootstrapping estimator. To ensure reliability and reproducibility of the CI, the number of bootstrap samples was set to 10,000; this value was chosen by repeating the entire procedure multiple times and confirming that the resulting CIs were sufficiently reproducible.
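The resampling scheme described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation published with the report: each sensor is summarized as a hypothetical pair (readings within the limit, total readings), the AR is pooled over all readings of the resampled sensors, the acceleration is estimated with a leave-one-sensor-out jackknife, and the function names, the fixed seed, and the toy data are made up for illustration.

```python
import random
from statistics import NormalDist


def pooled_ar(sensors):
    """AR as a proportion; each sensor is a (n_within_limit, n_total) pair."""
    return sum(k for k, _ in sensors) / sum(n for _, n in sensors)


def bca_lower_bound(sensors, alpha=0.05, n_boot=10_000, seed=0):
    """One-sided BCa lower bound of the AR, resampling whole sensors."""
    if len(sensors) < 2:
        raise ValueError("need at least two sensors")
    rng = random.Random(seed)
    nd = NormalDist()
    m = len(sensors)
    theta_hat = pooled_ar(sensors)

    # Cluster bootstrap: draw m sensors with replacement, keep their readings together.
    boot = sorted(
        pooled_ar([sensors[rng.randrange(m)] for _ in range(m)])
        for _ in range(n_boot)
    )

    # Bias correction from the share of replicates below the observed AR.
    frac_below = sum(t < theta_hat for t in boot) / n_boot
    if not 0.0 < frac_below < 1.0:
        raise ValueError("degenerate bootstrap distribution; BCa is undefined")
    z0 = nd.inv_cdf(frac_below)

    # Acceleration from leave-one-sensor-out jackknife estimates.
    jack = [pooled_ar(sensors[:i] + sensors[i + 1:]) for i in range(m)]
    jack_mean = sum(jack) / m
    num = sum((jack_mean - t) ** 3 for t in jack)
    den = 6.0 * sum((jack_mean - t) ** 2 for t in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    # Adjusted percentile for the lower bound.
    z_alpha = nd.inv_cdf(alpha)
    adj = nd.cdf(z0 + (z0 + z_alpha) / (1.0 - accel * (z0 + z_alpha)))
    index = min(n_boot - 1, max(0, int(adj * n_boot)))
    return boot[index]


# Toy data: eight sensors with 100 paired readings each (made-up counts).
toy = [(95, 100), (88, 100), (99, 100), (70, 100),
       (92, 100), (97, 100), (85, 100), (90, 100)]
print(100.0 * pooled_ar(toy), 100.0 * bca_lower_bound(toy))
```

If every sensor has an AR of 100%, all bootstrap replicates are identical and the sketch raises an error, which mirrors the situation in which the BCa method yields no result.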
The results are expressed in terms of the negative width of the CI, that is, the lower bound of the CI minus the AR, because we are concerned with the lower bound of the CI. This facilitates the comparison of results independently of the value of the AR and indicates how conservative the methods are: a more conservative method leads to a smaller lower bound of the CI and thus to a more negative value for the CI width.
Data description
To evaluate the methods for CI calculation, six data sets corresponding to six CGM systems, obtained in four clinical CGM performance studies, were used. Comparator blood glucose (BG) measurements were collected from capillary samples using commercially available BGMS. The number of subjects with valid data per data set ranged from 23 to 48, and only data from one sensor were used per subject. Details of the data sets used are provided in the Supplementary Data.
Results
The results of calculating the CI widths with the three selected approaches for the different data sets are summarized in Figure 1. Detailed results for each individual data set are provided in Figure 2. Here, the FDA requirements would be fulfilled if the antennae (lower one-sided 95% CI) are fully contained within the green areas for all glucose ranges and both requirements.

Figure 1. CI widths calculated as the difference between the lower bound of the one-sided 95% CIs and the corresponding AR for FDA requirements 1

Figure 2. Detailed results of the ARs (solid markers) and their lower one-sided 95% CIs (antennae) for all data sets
Discussion
The results in Figures 1 and 2 demonstrate substantial differences in the CIs obtained from the three methods, which highlights the need for standardization. In particular, the results show that, except for the glucose range <70 mg/dL, the WCC and BCa approaches yield similar median CI widths, whereas the CP method leads to narrower CIs. This confirms the expectation that the CP method underestimates the width of the CIs when applied to CGM data, because the clustered data structure is neglected. However, the difference between WCC and BCa on the one hand and CP on the other hand is smaller for requirement 2 (Fig. 1B), indicating that the cluster effect is less prominent in this case.
Comparing WCC and BCa, it can be observed that, although similar, the WCC approach yields slightly broader CI estimates in almost all cases. This confirms the results of previous works that used Monte Carlo simulations and concluded that the WCC approach tends to be overly conservative. 6
Inspecting the results for the glucose range <70 mg/dL, especially for requirement 2, larger differences between WCC and BCa approaches become apparent. This is mainly caused by the fact that in this glucose range, individual sensors in the data sets have no or only a single data point (see the details of the used data sets in the Supplementary Data), because reliably inducing multiple glucose values in the hypoglycemic range for every subject can be challenging in practice. In this case, the WCC approach is inadequate as the required ANOVA-based calculation of within- and between-sensor variability is impaired. This can be demonstrated by excluding the respective data of sensors with only one data point per range from analysis (detailed results are provided in Supplementary Fig. S1 in the Supplementary Data). Here, the median CI widths of the CP and BCa methods are only marginally affected, whereas the WCC leads to considerably smaller median CI widths and thus approaches the results of the BCa method.
Our goal is to suggest a universally applicable method that requires no explicit knowledge of data characteristics. Considering this goal, the overly conservative nature of the WCC approach, and its issue with small sample sizes, we propose applying the BCa method to calculate the CIs of ARs in CGM accuracy studies. Furthermore, the BCa method could easily be adapted and evaluated in the context of CI estimation for other CGM accuracy parameters such as the mean absolute relative difference.
In the extreme case of an AR of exactly 100% (all values are within the limit), the variability between sensors can no longer be calculated as every individual sensor has the same AR of 100%. In this case, the BCa method yields no result, whereas the WCC approach only considers the number of sensors and disregards the number of data points per sensor. This provides a far too conservative estimate (Fig. 1B, glucose range >180 mg/dL, data sets 5 and 6) and for this reason, these results were excluded from the median calculation. In this case we suggest applying the CP method instead of the BCa method, as the difference between CP and BCa is no longer pronounced for ARs close to 100%.
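For this special case, the one-sided CP lower bound has a simple closed form: if all n readings are within the limit, the bound is the proportion p for which p^n equals α, that is, α^(1/n). The short sketch below only illustrates this formula; the function name and the example values of n are assumptions made for illustration, and the readings are treated as independent, as the CP method does.

```python
def cp_lower_bound_all_within(n, alpha=0.05):
    """One-sided Clopper-Pearson lower bound when all n readings are within the limit."""
    if n < 1:
        raise ValueError("need at least one reading")
    return alpha ** (1.0 / n)


# The bound approaches 100% as the number of readings grows.
for n in (50, 500, 5000):
    print(n, 100.0 * cp_lower_bound_all_within(n))
```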
Although the sample size of any single data set examined in this report is limited, we argue that by considering six data sets with a total of 191 sensors and ∼26,000 data points, the overall findings regarding the comparability and suitability of the considered methods translate to larger data sets typically used for FDA approval submission (∼150 sensors with ∼20,000 data points). 1,2
Conclusions
This report evaluated different methods for calculating the CI of ARs from data collected in CGM performance studies and found that the bootstrap-based BCa approach, which accounts for the clustered nature of the data, is the most suitable. In the case of an observed AR of 100%, the CP method should be applied instead. We thus encourage researchers and manufacturers to apply this procedure to CGM performance studies in general, so that the reported CIs are transparent and comparable between CGM systems.
In the interest of transparency and to facilitate the use of the proposed method by manufacturers and the scientific community, a software implementation in Python and R is published alongside this brief report (
Footnotes
Acknowledgments
The authors thank the Diabetes Center Berne for their financial support.
Authors' Contributions
Conceptualization, formal analysis, methodology, software, visualization, and writing—original draft by P.S. Conceptualization, data curation, formal analysis, methodology, software, visualization, and writing—review and editing by M.E. Conceptualization, methodology, visualization, and writing—review and editing by D.W. Methodology, validation, and writing—review and editing by S.P. and M.R. Writing—review and editing by C.H. Conceptualization and writing—review and editing by G.F.
Author Disclosure Statement
G.F. is general manager and medical director of the IfDT (Institut für Diabetes-Technologie Forschungs- und Entwicklungsgesellschaft mbH an der Universität Ulm, Ulm, Germany), which carries out clinical studies on the evaluation of BG meters, with CGM systems and medical devices for diabetes therapy on its own initiative and on behalf of various companies. G.F./IfDT have received speakers' honoraria or consulting fees from Abbott, Ascensia, Berlin Chemie, Beurer, BOYDsense, CRF Health, Dexcom, i-SENS, Lilly, Metronom, MySugr, Novo Nordisk, Pharmasens, Roche, Sanofi, Sensile, Terumo and Ypsomed. M.E., D.W., S.P., and C.H. are employees of the IfDT. P.S. is an advisor to the IfDT. M.R. is an employee of Diabetes Center Berne.
Funding Information
This study was supported by Diabetes Center Berne, Switzerland.
Supplementary Material
Supplementary Data
Supplementary Figure S1
Supplementary Table S1
Supplementary Table S2
References
