Abstract
Purpose:
The P value has been used as a statistical tool in randomized controlled trials (RCTs) to establish significance but does not provide information on the robustness of a study when used alone. The fragility index (FI) provides a supplemental approach for demonstrating robustness in RCTs that report dichotomous outcomes. This study aims to determine the statistical fragility of RCTs that compare minimally invasive techniques with open techniques in managing benign and malignant colorectal diseases.
Methods:
Dichotomous outcomes of minimally invasive surgery versus open surgery in RCTs from 2000 to 2023 were assessed. The overall FI and fragility quotient (FQ) of each study were calculated.
Results:
Of the 1377 screened studies, 50 met the inclusion criteria. In total, 820 outcomes were recorded, with 747 reported as not significant (P ≥ .05) and 73 as significant (P < .05). The overall FI for all studies including all outcomes was 5 (interquartile range [IQR] 4–7) with an FQ of 0.031 (IQR 0.014–0.062). Of the 50 RCTs, 6 (12%) reported a loss to follow-up that was greater than the overall FI of 5.
Conclusions:
As RCTs are increasingly judged on more than the P value, practicing colorectal surgeons will benefit from using and interpreting the FI, FQ, and P value of studies, both in analyzing future RCTs and in determining whether a reported finding is a true discovery that warrants a change in their clinical practice.
Introduction
The management of diseases involving the colon and rectum has evolved over the years, and many of these diseases are definitively treated with surgery. To address this category of colorectal diseases, methods of management are moving toward minimally invasive techniques, including laparoscopy and robotic surgery. Alongside the growing use of these modalities, several competing techniques have been described in the surgical literature. For example, side-to-end versus end-to-end techniques are used for colorectal anastomosis, 1 as are isoperistaltic versus antiperistaltic techniques for ileocolic anastomosis. 2 The decision to choose one surgical approach over another is often unclear. To investigate which surgical approach produces superior outcomes when managing surgical diseases, randomized controlled trials (RCTs) are considered the gold standard of research evidence. Replicability and reproducibility are what physicians count on when they cite RCT evidence and statistics to their patients. 3 However, it has recently come to attention that attempts to replicate RCTs can produce contradictory evidence. 4 This raises the question of how applicable the results of RCTs are. The disparate evidence has been attributed to several causes, such as biases in the reporting of data, problems in study design, and reliance on a P value to determine statistical significance. 5 The last of these has been considered a major contributor to irreproducibility.
The P value has been used as a statistical tool in RCTs to establish significance. Although the value can vary, the significance threshold (alpha) is often set at .05, meaning that a result is deemed significant when a difference at least as large as the one observed would occur with a probability of 5% or less if the null hypothesis were true. This threshold of .05 is arbitrarily used to reject or accept the null hypothesis. However, this is misleading because the P value does not provide information on the strength or robustness of a study when used alone. 6 To address this shortcoming, the fragility index (FI) was proposed by researcher and epidemiologist Alvan Feinstein in 1990 and was first implemented by Walsh et al. in 2014. 7,8 The FI provides a supplemental approach to demonstrating the fragility or, conversely, the stability of a P value in an RCT reporting dichotomous outcomes. In other words, the FI is the minimum number of patients whose outcome would need to switch from an event to a nonevent for the P value to no longer be significant. The fragility quotient (FQ) divides the FI by the sample size, standardizing fragility so that it can be compared across studies of different sizes. A large FI indicates that the study is robust: it can withstand many changes in patient outcomes and remain significant. Conversely, a small FI indicates a fragile study, in which a few changes in outcomes can result in a loss of statistical significance. For instance, if a study has an FI of 2, two patients from the experimental group would need to change their outcome from an event to a nonevent for the outcome to no longer be significant. The same holds for the FQ: the larger the value, the more robust the study, and the smaller the value, the less robust the study.
There is currently no cutoff value for the FI or the FQ that definitively defines a study as robust or fragile. Rather, both lie on a continuum: the higher the FI or FQ, the more robust the result.
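To make the relationship between the two measures concrete, the short Python sketch below computes an FQ from an FI and a total sample size. The function name and the two trials are hypothetical and serve only as an illustration; they are not data from the included studies.

```python
def fragility_quotient(fragility_index: int, sample_size: int) -> float:
    """Return the FI divided by the total sample size (the FQ)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return fragility_index / sample_size


# Two hypothetical trials with the same FI but different sample sizes.
small_trial_fq = fragility_quotient(5, 100)
large_trial_fq = fragility_quotient(5, 1000)

# Expressing the FQ per 100 patients makes the two trials comparable.
print(f"small trial: {small_trial_fq * 100:.1f} reversals per 100 patients")
print(f"large trial: {large_trial_fq * 100:.1f} reversals per 100 patients")
```

Under these assumptions the same FI of 5 corresponds to a tenfold smaller FQ in the larger trial, which is why the FQ, rather than the FI alone, is used when comparing studies of different sizes.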
With the increased use of minimally invasive techniques in the management of colorectal cancers and disease, it is vital to understand dichotomous outcomes in the literature before deciding to implement a new technique. To be specific, minimally invasive techniques include the use of laparoscopic or robotic approaches in the management of both benign and malignant colorectal disease processes. The purpose of this study is to determine the statistical fragility of RCTs that compare minimally invasive techniques versus open techniques in the management of colorectal cancers and diseases.
Methods
This review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Fig. 1). The PubMed database was queried from January 1, 2000 to March 15, 2023, for all RCTs relating to minimally invasive surgeries for colorectal benign and malignant diseases. To be included in the analysis, RCTs needed to report dichotomous outcomes with associated P values. The included studies were from a variety of journals listed in Table 1. Each study was manually evaluated to determine whether it met the inclusion criteria for studies comparing minimally invasive surgical techniques with open techniques for both benign and malignant diseases of the colon and rectum. Studies were excluded if they were non-RCT studies, did not involve a surgical intervention, were post hoc analyses of RCTs, were animal model or cadaveric studies, used anything other than 1:1 randomization, or reported non-dichotomous outcomes. For each included study, the following data points were extracted: journal name, study design, authors, publication year, PubMed Identifier, loss to follow-up (LTF), dichotomous outcomes (specified as primary or secondary), and the associated P value for each outcome, if provided.

Fig. 1. PRISMA diagram of included studies. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.
Table 1. Overall Fragility Results. IQR, interquartile range; LTF, loss to follow-up.
The obtained outcomes were labeled as significant (P < .05) or nonsignificant (P ≥ .05). The FI for each outcome was calculated from a two-by-two contingency table of the dichotomous outcomes from each trial, as previously reported by Walsh et al. 8 Outcome events were changed one at a time until the reversal of significance was achieved. The number of changes required to move the P value from significant to nonsignificant, or vice versa, was taken as the FI for that outcome. An example of this process can be found in Table 2. The median FI and interquartile range for all outcomes within a study were reported as the overall FI for that study. To standardize the FI, it was divided by the associated total sample size to give the FQ for each study. In addition, the FI and the FQ, along with their interquartile ranges, were stratified by outcome type (primary versus secondary), reported significance (P < .05 vs P ≥ .05), and year (Table 1). Finally, the overall FI and FQ were determined with the incorporation of all outcomes.
Table 2. Demonstration of Reversal of Significance with a Fragility Index (FI) = 1
A Cochrane risk of bias assessment was also performed for each of the individual studies (Table 3). Seven items were used to assess bias risk: random sequence generation (selection bias), allocation concealment (selection bias), blinding of participants and personnel (performance bias), blinding of outcome assessment (detection bias), incomplete outcome data (attrition bias), selective reporting (reporting bias), and other bias. A series of Cochrane signaling questions was applied to each article, and a score was assigned via the Cochrane algorithm, with each category rated as having a low, high, or unclear risk of bias.
Table 3. Cochrane Risk of Bias Assessment
Results
Of the 1377 screened studies, 50 met the inclusion criteria (Fig. 1). Characteristics of the included trials are listed in Table 1. In total, 820 outcomes were recorded, with 747 reported as not significant (P ≥ .05) and 73 as significant (P < .05). For the 747 nonsignificant outcomes, the median FI was 5 (interquartile range [IQR] 4–7) with an FQ of 0.034 (IQR 0.015–0.065). For the 73 significant outcomes, the median FI was 2 (IQR 1–5) with an FQ of 0.020 (IQR 0.007–0.030). Of the 820 total outcomes, 278 (33.9%) were primary and 542 (66.1%) were secondary. The median FI was 5 (IQR 4–7) for both primary and secondary outcomes, with associated FQs of 0.035 (IQR 0.020–0.065) and 0.029 (IQR 0.013–0.061), respectively. Of the 50 RCTs, 6 (12%) had more patients LTF than the overall FI of 5 (Fig. 2).

Fig. 2. Distribution of the number of patients lost to follow-up.
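The comparison between LTF and the FI amounts to counting the trials whose LTF exceeds the FI. The Python sketch below illustrates that count on a hypothetical list of ten trials. The LTF values are invented for illustration and are not the data behind Fig. 2. A value of None stands for a trial that did not report LTF, and the sketch assumes such trials are counted in the denominator but cannot exceed the FI.

```python
from typing import Optional, Sequence, Tuple


def trials_with_ltf_above_fi(
    ltf_counts: Sequence[Optional[int]], fragility_index: int
) -> Tuple[int, float]:
    """Return the number and proportion of trials with LTF greater than the FI."""
    exceeding = sum(
        1 for ltf in ltf_counts if ltf is not None and ltf > fragility_index
    )
    return exceeding, exceeding / len(ltf_counts)


# Hypothetical LTF counts for ten trials (None = LTF not reported).
example_ltf = [0, 2, None, 7, 5, 12, 1, None, 3, 0]

count, proportion = trials_with_ltf_above_fi(example_ltf, fragility_index=5)
print(f"{count} of {len(example_ltf)} trials ({proportion:.0%}) have LTF > FI")
```

The comparison is strict, so a trial with an LTF equal to the FI is not counted as exceeding it.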
The overall FI for all studies including all outcomes was 5 (IQR 4–7) with an FQ of 0.031 (IQR 0.014–0.062), indicating that reversing the outcomes of about 3 of every 100 patients may change the significance of the included RCTs. Stratified by year of publication, the FI was 5 (IQR 4–7) from 2000 to 2008, 5 (IQR 4–7) from 2009 to 2016, and 5 (IQR 4–8) from 2017 to 2023, demonstrating consistent fragility over the 23-year period (Table 1).
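A stratified summary of this kind can be sketched in Python as shown below. The records are hypothetical (publication year, FI) pairs, not the study data. The quartile method is an assumption, because the method used to compute the IQR is not stated here, and other methods can give slightly different quartiles on small samples.

```python
from statistics import median, quantiles

# Publication periods used for stratification in the text above.
PERIODS = [(2000, 2008), (2009, 2016), (2017, 2023)]


def summarize(values):
    """Return (median, first quartile, third quartile) for a list of values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


def summarize_by_period(records):
    """Group (year, FI) pairs by period and summarize each group."""
    summary = {}
    for start, end in PERIODS:
        values = [fi for year, fi in records if start <= year <= end]
        if len(values) >= 2:  # quartiles need at least two data points
            summary[(start, end)] = summarize(values)
    return summary


# Hypothetical (publication year, FI) pairs for individual outcomes.
example_records = [(2001, 3), (2005, 5), (2007, 7), (2010, 4), (2015, 6), (2020, 8)]

for period, (med, q1, q3) in summarize_by_period(example_records).items():
    print(f"{period[0]}-{period[1]}: median FI {med} (IQR {q1}-{q3})")
```

Periods with fewer than two outcomes are omitted from the summary rather than reported with an undefined IQR.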
Discussion
This study finds that the overall median FI was 5, with an associated FQ of 0.031, for RCTs evaluating minimally invasive techniques in colorectal surgery over the past 23 years. An FI of 5 means that reversing the outcomes of five patients would be enough to change the significance of an outcome. Standardized for sample size, the FQ of 0.031 means that about 3 of every 100 patients would need to have their outcomes reversed to alter the significance of a study’s findings. Of the included RCTs, 6 (12%) reported an LTF greater than the overall FI. This suggests that the unknown outcomes of the patients LTF may have been enough to reverse the significance of the study, had those outcomes favored a reversal. The remaining studies either had five or fewer patients LTF or reported no LTF. A low FI and FQ, together with an LTF higher than the overall FI, suggest that the data reported in RCTs of minimally invasive techniques in colorectal surgery may be fragile and less robust than previously assumed.
These results are consistent with previous studies analyzing the FI in other specialties. 57–63 A study by Nelms et al. in 2021 examined all colorectal surgical RCTs from 2016 to 2018 and calculated a median FI of 3, with 57% of trials having an LTF greater than the FI. 59 The median FI in our study of minimally invasive techniques over the past 23 years is slightly higher at 5 but still low overall. Among published studies, median FIs as high as 8 have been reported, in the first application of the FI analysis by Walsh et al., 8 and even as high as 12, in a study examining clinical practice guidelines for acute coronary syndrome. 61 Our results add to a growing body of literature examining the quality of reported data in RCTs, especially as it relates to the delivery of clinical care. Clear and informative objective data from RCTs give physicians the tools required to make an informed decision about whether to use results from published studies in their day-to-day practice. Objective data that provide only a P value do not give information about the robustness of a study or whether it is replicable. The FI and FQ give more information than a P value alone in that they are correlated with statistical power and sample size. 57 A high FI can mean that the P value of a trial is far from .05 and/or that the trial has high power; conversely, a low FI can mean that the P value is near .05 or that the sample size is small. The latter leads to a high likelihood that the study’s findings are not replicable and likely not a true discovery. Therefore, a study with a low FI, in which only a few individuals are required to change the significance of the outcome, may not be reliable enough to change clinical practice.
As statistical power refers to the likelihood of detecting a statistically significant effect if it indeed exists, a low power signifies a heightened risk of false negatives, implying that significant effects might remain undetected. By juxtaposing the FI with statistical power, researchers and clinicians can gauge whether statistically significant results stem from adequately powered studies or if they are precarious and susceptible to data variations. Furthermore, replication and reproducibility play pivotal roles in corroborating the credibility of research findings in clinical contexts. Fragile findings pinpointed through low FI values may necessitate replication to establish robustness. Moreover, considering factors such as heterogeneity among studies and the presence of similar studies aids in assessing the necessity for replication, highlighting the importance of a standardized approach like an FQ. The interplay between FI and statistical power, coupled with other statistical parameters, enables researchers to evaluate the resilience, validity, and reproducibility of clinical research findings. The identification of fragile findings and comprehension of study power are critical strides toward ensuring the reliability of evidence-based medicine.
At this time, there are still no accepted FI or FQ thresholds that define whether a study is robust or fragile. Therefore, it remains uncertain which FI and FQ values are acceptable for identifying the RCTs best suited to guide clinical practice. Studies currently evaluating FI/FQ values do so using a composite of the published literature; however, FI and FQ values should also be reported in future individual RCTs as part of their analysis to contribute to the growing body of literature. The goal is for this study, alongside the growing literature on the FI, to help determine what an acceptable cutoff will be. Whether that cutoff should be a generalized value or stratified by specialty will also need to be determined. Nonetheless, fragility analysis comes with its own limitations. First, it can only be applied to RCTs that have dichotomous outcomes. Consequently, studies with continuous variables and outcomes cannot be assessed for robustness or fragility in this way and are excluded from further inquiry. In addition, this study does not adjudicate the quality of the colorectal surgery literature but rather brings to the forefront the need to critically analyze objective data before adjusting clinical practice, and it highlights the issue of replicability and how understanding and using statistics is vital in clinical practice. Over time, as more fragility analyses are conducted on published data, and hopefully incorporated into future RCT data reporting, practicing physicians can gain a more well-rounded understanding of the data and make a more informed decision before changing clinical practice.
Conclusion
There are several minimally invasive techniques for treating diseases within colorectal surgery. Some have been adopted, and others are still contested, owing to published evidence in RCTs providing P values of significance. However, several studies have shown that the P value may not be enough to change clinical significance. This study reports fragility in RCTs of minimally invasive colorectal surgery, showing that a median of five patients is all that would be needed to change the significance of the currently reported data. Patients LTF have the potential to alter significance in up to 12% of the RCTs examining minimally invasive colorectal surgery. The need for more than a P value is increasingly being recognized across specialties in the literature, including colorectal surgery. As RCTs are increasingly judged on more than the P value, practicing colorectal surgeons will benefit from using and interpreting the FI, FQ, and P value of studies, both in analyzing future RCTs and in determining whether a reported finding is a true discovery that warrants a change in their clinical practice.
Footnotes
Acknowledgments
The author expresses gratitude to Michael Megafu and Emmanuel Megafu for their insights into statistical analysis for fragility studies.
Authors’ Contributions
The author confirms sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and article preparation.
Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.
