Emergent Applications of Machine Learning for Diagnosing and Managing Appendicitis: A State-of-the-Art Review

Abstract

Background:

Appendicitis is an inflammatory condition that requires timely and effective intervention. Despite being one of the most common surgically treated diseases, the condition is difficult to diagnose because of atypical presentations. Ultrasound and computed tomography (CT) imaging improve the sensitivity and specificity of diagnoses, yet these tools bear the drawbacks of high operator dependency and radiation exposure, respectively. However, new artificial intelligence tools (such as machine learning) may be able to address these shortcomings.

Methods:

We conducted a state-of-the-art review to delineate the various use cases of emerging machine learning algorithms for diagnosing and managing appendicitis in recent literature. The query (“Appendectomy” OR “Appendicitis”) AND (“Machine Learning” OR “Artificial Intelligence”) was searched across three databases for publications ranging from 2012 to 2022. Upon filtering for duplicates and based on our predefined inclusion criteria, 39 relevant studies were identified.

Results:

The algorithms used in these studies performed with an average accuracy of 86% (18/39), a sensitivity of 81% (16/39), a specificity of 75% (16/39), and area under the receiver operating characteristic curves (AUROCs) of 0.82 (15/39) where reported. Based on accuracy alone, the optimal model was logistic regression in 18% of studies, an artificial neural network in 15%, a random forest in 13%, and a support vector machine in 10%.

Conclusions:

The identified studies suggest that machine learning may provide a novel solution for diagnosing appendicitis and preparing for patient-specific post-operative complications. However, further studies are warranted to assess the feasibility and advisability of implementing machine learning-based tools in clinical practice.

Appendicitis is one of the most common surgically treated diseases, with more than 300,000 appendectomies being performed in the United States annually.¹ Despite the prevalence of appendicitis in children and adults, appendicitis (especially acute complicated appendicitis) is difficult to diagnose. Diagnoses of appendicitis are missed in 3.8% to 15.0% of children and in 5.9% to 23.5% of adults during emergency department visits; the incidence of negative appendectomy is also notably 15% to 39% in the United States.²,³ Although white blood cell, C-reactive protein, and bilirubin levels are informative, no specific biomarker has been identified for acute appendicitis. Diagnosis of appendicitis thus currently relies on clinical scoring systems such as the Alvarado Score to stratify patients by risk of perforation.⁴

However, atypical presentations and the poor predictive value of laboratory tests complicate diagnoses and decisions for surgical intervention. Using ultrasound and computed tomography (CT) scans does enhance the accuracy of appendicitis diagnosis, but each imaging method bears its own drawbacks. Ultrasound is highly operator-dependent in its implementation and radiologic interpretation, while also being less sensitive in its predictions. Past consensus studies estimate ultrasound sensitivity at 55%: because of its low sensitivity, ultrasound can yield false negatives and cannot rule out equivocal or negative cases of appendicitis.⁵ On the other hand, CT provides better sensitivity and specificity, but it is a high-cost approach that involves radiation exposure. It would thus be prudent to develop a framework for selective use of CT scans, especially for equivocal cases of appendicitis. Furthermore, low-resource settings such as lower middle-income countries face profound challenges with securing universal access to imaging.⁶ The resulting health disparities are further exacerbated by challenges with post-surgical decision-making; systematic tools for predicting potential complications are also limited. The need to address these concerns of healthcare quality and equity warrants the development of new diagnostic and prognostic tools for appendicitis.

Machine learning, an emergent computational approach in healthcare, may have the potential to improve diagnostic sensitivity and specificity for appendicitis beyond current clinical tools. This artificial intelligence approach uses historical data to train a model that captures existing patterns in data to make informed predictions. Contrary to traditional statistical methods, machine learning (ML) models are scalable and equipped to analyze large, complex datasets in a high-throughput fashion.⁷ Machine learning models have already been applied in other surgical specialties such as vascular surgery, neurosurgery, plastic surgery, and orthopedic surgery.⁸ The systematic review by Senders et al.⁹ of ML applications for neurosurgical diagnosis, pre-surgical planning, and outcome prediction especially claimed that artificial intelligence methods can outcompete “natural intelligence” (clinical expertise) in most performance metrics. Of note, beyond solely contributing to diagnostic classifications in the aforementioned surgical disciplines, ML algorithms have also been trained to predict various post-operative variables such as surgical site infections, patient survival, and even pain/satisfaction ratings. Machine learning is thus starting to be used at various stages of peri-operative management for surgically treated diseases. The aim of our review is to catalog the recent use of such novel machine learning algorithms in the context of appendicitis diagnosis and management.

Methods

A state-of-the-art review was conducted based upon systematic assessment of relevant articles found in PubMed, Web of Science, and Embase published from January 1, 2012, to January 1, 2022. State-of-the-art reviews are narrative reviews that seek to catalog the current state of a field as well as identify present challenges that warrant future investigation; this approach seemed most suitable given the rapidly evolving nature of our topic.¹⁰ The search query was as follows: (“Appendectomy” OR “Appendicitis”) AND (“Machine Learning” OR “Artificial Intelligence”). Boolean operators were used to connect related keywords appropriately. Only studies including an application of at least one machine learning algorithm implemented on a separable appendicitis-specific dataset were considered. All studies attempting to predict appendicitis diagnoses were required to use pathology as a gold standard. Studies with pediatric or adult cases were accepted without distinction. Our review protocol was registered with Open Science Framework (osf.io/4u7gq).

Results

We initially identified 150 articles through our search of PubMed (n = 37), Web of Science (n = 43), and Embase (n = 70). There were 84 articles remaining after duplicates were removed. Furthermore, we excluded 26 studies that did not implement an ML algorithm and 19 studies that did not analyze an independent appendicitis/appendectomy dataset. For example, several results in our query contained data concatenated for different emergency general surgery procedures. No studies were excluded for failing to use pathology as the gold standard for diagnosis; all diagnostic prediction studies satisfied this criterion. The final state-of-the-art review included 39 studies (Fig. 1).

FIG. 1.

Preferred Reporting Items for Systematic reviews and Meta-Analyses for Protocols (PRISMA) flow diagram for our state-of-the-art review. After screening based on the inclusion criteria listed in Methods, 39 studies remained.

Study characteristics

The 39 identified studies used an average sample size of 16,426 patients in their data: 24 studies included <1,000 patients, six studies included 1,000 to 10,000 patients, and nine studies included 10,000+ patients. With the exception of three prospective single-center studies automating post-operative pain assessments, all studies were retrospective. Of the 36 retrospective studies, 26 included single-center data and 10 included multicenter data.

Algorithm specifications and applications

The most common use case of ML algorithms overall was for predicting appendicitis diagnosis (n = 28; Table 1). There were 27 diagnostic studies that used supervised learning methods (three of which applied deep learning to supervised learning tasks) and one study that used unsupervised learning. Other common applications of ML included predicting various post-operative outcomes (n = 11; Table 2) such as pain rating, development of sepsis, length of hospital stay, and 30-day mortality. For post-operative studies, there were 10 studies that used supervised learning methods (two of which applied deep learning to supervised learning tasks) and one study that used unsupervised learning.

Table 1.

Characteristics of Diagnostic Prediction Studies (n = 28)

Authors & year	Diagnostic classifications	Features included	Sample size	ML type	Algorithms tested	Optimal model ^a	Performance metrics
Xia et al., 2022¹⁶	Complicated vs. uncomplicated	CRP, heart rate, body temperature, and neutrophils	298	SL	RF, SVM	SVM	83.56% accuracy 81.71% sensitivity 85.33% specificity
Su et al., 2022¹⁸	Complicated vs. uncomplicated	Miscellaneous variables extracted from EHR text data	40,441	SL	LR, RF	LR	96% accuracy 73% sensitivity 68% specificity 0.78 AUC
Mijwil et al., 2022²⁹	Appendicitis vs. no appendicitis	Hemoglobin, neutrophils, lymphocytes, mean corpuscular volume, mean platelet volume, hematocrit, deep vein thrombosis, platelets, CRP, WBC count	625	SL	LR, Naïve Bayes, GLM, Decision Tree, SVM, Gradient Boosted Tree, RF	RF	83.75% accuracy 81.08% sensitivity 81.01% specificity
Ghareeb et al., 2022³⁰	Complicated vs. uncomplicated	Patient demographics, chronic illnesses, history of similar pain, body temperature, anorexia, nausea, vomiting, site and duration of pain, serum hemoglobin, leukocyte count, radiologic findings	319	SL	Decision tree, LR, NB, SVM, Nearest Neighbor, Ensemble	Ensemble technique	91.1% accuracy 0.82 AUC
Kim et al., 2022¹²	Appendicitis vs. no appendicitis	Ultrasound images	100	uSL	Fuzzy c-means pixel clustering	Fuzzy c-means pixel clustering	84.82% accuracy 83.78% precision 86.04% recall
Xia et al., 2021³¹	Appendicitis vs. no appendicitis	Patient demographics, vitals, WBC count, lymphocytes, neutrophils, monocytes, eosinophils, hemoglobin, red blood cells, platelets, urea, nitrogen, blood glucose, creatinine, bilirubin, and CRP	Not indicated	SL	KNN, CART, BP, SVM, WOA-KELM, GWO-KELM, MFO-KELM, SSA-KELM, GCMFO-KELM	GCMFO-KELM	76.84% accuracy 77.96% sensitivity 77.13% specificity
Reismann et al., 2021¹⁴	Complicated vs. uncomplicated	Genomic data	29	SL	Not indicated	Not indicated	0.84 AUC
Noguchi et al., 2021¹³	Appendicitis vs. no appendicitis	CT images	20,690	SL (Deep)	CNN	CNN	Not indicated
Marcinkevics et al., 2021³²	Appendicitis vs. no appendicitis	Patient demographics, Alvarado Score, Pediatrics Appendicitis Score, physical exam results (e.g., rebound tenderness), laboratory parameters, ultrasound findings	430	SL	LR, RF, GBM	GBM	93% sensitivity 86% specificity 0.94 AUC
Kang et al., 2021³³	Complicated vs. uncomplicated	Nausea, vomiting, abdominal pain time, neutrophils, CD4⁺ T cell, helper T cell, B lymphocyte, NK counts, highest body temperature, procalcitonin, and C-reactive protein	136	SL	LR	LR	90.6% accuracy 0.926 AUC
Hayashi et al., 2021³⁴	Appendicitis vs. no appendicitis	Ultrasound images	70	SL (Deep)	CNN	CNN	Not indicated
Gunasingha et al., 2021³⁵	Appendicitis vs. no appendicitis	Patient's history, laboratory values, CT readings, and pathology	200	SL	RF, SVM, and Bayesian Network Classifiers	SVM	90% sensitivity 90% specificity 0.95 AUC
Al Masud et al., 2021³⁶	Appendicitis vs. no appendicitis	Fever, rate of nausea, abdominal pain, pain rating, migrating pain, vomiting, and appetite	200	SL	Unknown algorithm	Unknown algorithm	Not indicated
Zhao et al., 2020¹⁵	Complicated vs. uncomplicated	Urinary proteomic data	134	SL	RF, SVM, Naive Bayes	RF	83.6% accuracy 81.2% sensitivity 84.4% specificity 0.75 AUC
Stiel et al., 2020³⁷	Appendicitis vs. no appendicitis	Rebound tenderness, cough/hopping tenderness, ultrasound and laboratory results	463	SL	Unspecified AI-based scoring system	Not indicated	98% sensitivity 17.5% specificity 0.71 AUC
Aydin et al., 2020³⁸	Complicated vs. uncomplicated	Patient demographics and pre-operative blood analysis (excluded CRP)	7,244	SL	NB, KNN, SVM, GLM, RF, decision tree	decision trees	70.83% accuracy 66.81% sensitivity 81.88% specificity 0.7947 AUC
Akmese et al., 2020³⁹	Complicated vs. uncomplicated	Hemoglobin, neutrophil, lymphocytes, mean corpuscular volume, mean platelet volume, hematocrit, thrombosis, CRP, and WBC count	595	SL	Neural network, k-NN, Logistic Regression, SVM, RF, gradient boosting tree	Gradient boosting tree	95.31% accuracy
Reismann et al., 2019⁴⁰	Complicated vs. uncomplicated	Blood cell counts, CRP, appendiceal diameter from US	590	SL	Not indicated	Not indicated	51% accuracy 95% sensitivity 33% specificity 0.80 AUC
Majeed Alneamy et al., 2019⁴¹	Appendicitis vs. no appendicitis	Not indicated	106	SL (Deep)	Fuzzy Wavelet NN	Fuzzy Wavelet NN	93.51% accuracy
Koren et al., 2019⁴²	Appendicitis vs. no appendicitis	Miscellaneous variables from unstructured EHR text data	130	SL	simple Bayesian network, LR, RF/XG Boost, NN	Not specified	85% accuracy
Donnelly et al., 2019⁴³	Appendicitis vs. no appendicitis	Ultrasound reports	28,615	SL	Recurrent NN	Recurrent NN	Not indicated
Shahmoradi et al., 2018⁴⁴	Complicated vs. uncomplicated	Patient demographics, right iliac fossa pain, migrating pain, anorexia, nausea/vomiting, fever, rebound tenderness, Rovsing's sign, leukocytosis, neutrophil left shift, CRP concentration, negative urine analysis	181	SL	ANN and LR	ANN	92.9% accuracy 80% sensitivity 97.5% specificity
Norman et al., 2017⁴⁵	Appendicitis vs. no appendicitis	Miscellaneous variables from unstructured EHR text data	15,074	SL	LR	LR	F1 score: 0.8391
Mitroulias et al., 2013⁴⁶	Appendicitis vs. no appendicitis	Not indicated	516	SL	ANN, RF, SVM	RF	91.67% accuracy
Lee et al., 2013⁴⁷	Complicated vs. uncomplicated	Patient demographics, temperature, CRP, WBC count, migration of abdominal pain, anorexia, nausea/vomiting, right lower quadrant pain, and rebound tenderness.	716	SL	Ensemble technique (SVM + other classifiers)	Ensemble	57.3% sensitivity 66.7% specificity 0.619 AUC
Iliou et al., 2013¹⁷	Appendicitis vs. no appendicitis	Duration of pain, tenderness, leucocytosis, neutrophilia, urinalysis	516	SL	SVM	SVM	Not indicated
Deleger et al., 2013⁴⁸	Appendicitis vs. no appendicitis	Miscellaneous variables from unstructured EHR text data	2,100	SL	Conditional random field	Conditional random field	F-measure of 0.867
Malley et al., 2012⁴⁹	Appendicitis vs. no appendicitis	Not indicated	106	SL	RF, LR	RF	0.976 AUC 0.061 Brier score

ML = machine learning; SL = supervised learning; uSL = unsupervised learning; SVM = support vector machine; LR = logistic regression; RF = random forest; GLM = generalized linear model; GBM = generalized boosted regression model; NB = naïve Bayes; NN = neural network; ANN = artificial neural network; CNN = convolutional neural network; KNN = k-nearest neighbors; CRP = C-reactive protein; WBC = white blood cell; EHR = electronic health record; AUC = area under the curve.

^a Optimal models listed only on the basis of accuracy (or AUC, if accuracy is not reported).

Table 2.

Characteristics of Post-Operative Outcome Prediction Studies (n = 11)

Authors & year	Purpose	Features included	Sample Size	ML type	Algorithms tested	Optimal model ^a	Performance metrics
Susam et al., 2022⁵⁰	Predict post-operative pain ratings after laparoscopic appendectomy	Electrodermal data and facial expression video data	58	SL	SVM	SVM	90.91% accuracy 100% sensitivity 81.82% specificity
Eickhoff et al., 2022⁵¹	Predict Clavien-Dindo post-surgical complications, need for ICU stay >24 h, prolonged hospitalization	Basic demographic data, comorbidities, peri-operative data (e.g. operative time)	163	SL	RF	RF	For extended ICU stay: 88% accuracy 88% sensitivity 88% specificity
Syed et al., 2021²⁶	Predict total hospitalization costs	Hospital ID, length of stay, zip code, patient disposition, race	8,293	SL	XGBoost regression	XGBoost	R² of 0.60, RMSE = 33051
Sartori et al., 2021⁵²	Predict occurrence of post-operative complications	Age, AIR and AAST scores, gender, year of treatment, surgery timing, surgical technique, and conversion to laparotomy	1,337	SL (Deep)	GLM, XGBoost, Distributed RF, deep NN, and NB (Naïve Bayes classifier), two Ensemble models	GLM	0.724 AUC
Kim et al., 2021⁵³	Find separable surgical patient clusters with longer length of stay	Patient and hospital characteristics	231,241	uSL	k-prototype clustering	k-prototype clustering	Cluster with significantly longer LOS found
Guedj et al., 2021⁵⁴	Screen clinical notes relevant to analyzing race/ethnicity differences in analgesic and opioid administration	Race, other patient demographics, pain score	4,780	SL	LR	LR	Not indicated
Bunn et al., 2021¹⁹	Predict the occurrence of post-operative sepsis	Patient demographics, comorbidities, laboratory parameters	223,214	SL	LR, SVM, RFDT, XGBoost, Ensemble	Ensemble	64.64% sensitivity 64.54% specificity 0.7074 AUC
Kempa-Liehr et al., 2020⁵⁵	Predict patient recovery time	Patient demographics, operative time, admission time	Not indicated	SL	GLM	GLM	Not indicated
Al Khatib et al., 2019⁵⁶	Predict intra-abdominal abscess risk	Patient demographics, operative time, post-operative temperature, WBC counts, antibiotic agent delivery, symptom duration	1,574	SL (Deep)	ANN	ANN	89.84% accuracy 70% sensitivity 96.31% specificity
Sikka et al., 2015⁵⁷	Predict post-operative pain ratings after laparoscopic appendectomy	Facial expression video data	50	SL	LR	LR	0.84-0.94 AUC
Huang et al., 2014⁵⁸	Predict post-operative pain ratings after laparoscopic appendectomy	Facial expression video data	NA	Unknown	Not indicated	Not indicated	Not indicated

ML = machine learning; ICU = intensive care unit; LOS = length of stay; WBC = white blood cell; RMSE = root mean squared error; AUC = area under the curve; SL = supervised learning; uSL = unsupervised learning; SVM = support vector machine; LR = logistic regression; RF = random forest; GLM = generalized linear model; NN = neural network; ANN = artificial neural network; CNN = convolutional neural network; KNN = k-nearest neighbors.

^a Optimal models listed only on the basis of accuracy (or AUC, if accuracy is not reported).

Of the 28 diagnostic studies, 17 developed an algorithm for differentiating individuals with appendicitis from those without and 11 developed an algorithm for differentiating between complicated and uncomplicated appendicitis. Studies for diagnostic prediction mostly used laboratory characteristics such as C-reactive protein and white blood cell (WBC) counts as well as imaging findings such as >6 mm appendiceal diameter as features for training their ML models. Studies primarily used combinations of laboratory data (n = 14), imaging findings (n = 9), and symptoms/clinical features uncovered by physical examinations (n = 5). One study used genomic data exclusively to identify genes that were differentially expressed between complicated and uncomplicated appendicitis cases for subsequent diagnostic predictions. Another study analogously examined proteomic signatures in urine alone in their algorithm to predict acute appendicitis. Other unique features used for predictive modeling of diagnosis across the various studies included demographic data, vitals, text extracts from clinical records, and raw imaging data for automatic processing.

The 11 studies leveraging post-operative data assessed several types of surgical outcomes: length of stay (n = 3), post-operative pain (n = 3), sepsis development (n = 2), 30-day mortality (n = 1), financial cost (n = 1), and racial/ethnic disparities (n = 1). Studies assessing post-operative outcomes primarily used patient demographics (n = 8), facial expression data (n = 3), and hospital characteristics (n = 2) as parameters for machine learning models. Facial expression data were used only in the three studies that attempted to predict pain ratings following laparoscopic appendectomy.

Model performances

On average, the algorithms across all chosen studies yielded an accuracy of 85.5% (range, 51%–96%), a sensitivity of 81.1% (range, 57.3%–100%), a specificity of 74.9% (range, 17.5%–97.5%), and an area under the receiver operating characteristic curve (AUROC) of 0.815 (range, 0.619–0.976) where reported. Diagnostic study algorithms performed with an average accuracy of 84.7% (range, 51%–96%), a sensitivity of 81.3% (range, 57.3%–98%), a specificity of 72.4% (range, 17.5%–97.5%), and an area under the curve (AUC) of 0.825 (range, 0.619–0.976). Post-operative study algorithms performed with an average accuracy of 89.6% (range, 88%–90.9%), a sensitivity of 80.7% (range, 64.7%–100%), a specificity of 82.7% (range, 64.5%–96.3%), and an AUC of 0.774 (range, 0.707–0.89). Accuracy was reported in 18 of 39 studies, sensitivity in 16 of 39 studies, specificity in 16 of 39 studies, and AUROC in 15 of 39 studies. Certain studies reported alternative performance metrics such as an F-score (n = 2), precision/recall (n = 1), or a Brier score (n = 1). Each of the three studies that compared their highest-performing algorithm to the Alvarado Score reported that their ML-based method demonstrated greater accuracy than the Alvarado scoring system.
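As an illustration of how the "where reported" averages can be derived, the short Python sketch below recomputes the mean accuracy and specificity from the values listed in Tables 1 and 2. It is a minimal check rather than the authors' own analysis script: the variable and function names are illustrative, and studies that did not report a metric are simply omitted from that metric's list.

```python
from statistics import mean

# Accuracy values (%) transcribed from Table 1 (diagnostic studies)
# and Table 2 (post-operative studies); unreported values are omitted.
diagnostic_accuracy = [83.56, 96, 83.75, 91.1, 84.82, 76.84, 90.6, 83.6,
                       70.83, 95.31, 51, 93.51, 85, 92.9, 91.67]
postoperative_accuracy = [90.91, 88, 89.84]

# Specificity values (%) transcribed from the same tables.
diagnostic_specificity = [85.33, 68, 81.01, 77.13, 86, 90, 84.4, 17.5,
                          81.88, 33, 97.5, 66.7]
postoperative_specificity = [81.82, 88, 64.54, 96.31]


def summarize(values):
    """Return the count, mean (one decimal place), and range of reported values."""
    return {
        "n": len(values),
        "mean": round(mean(values), 1),
        "min": min(values),
        "max": max(values),
    }


accuracy_all = summarize(diagnostic_accuracy + postoperative_accuracy)
accuracy_diagnostic = summarize(diagnostic_accuracy)
accuracy_postoperative = summarize(postoperative_accuracy)
specificity_all = summarize(diagnostic_specificity + postoperative_specificity)

print("Accuracy, all studies:", accuracy_all)
print("Accuracy, diagnostic:", accuracy_diagnostic)
print("Accuracy, post-operative:", accuracy_postoperative)
print("Specificity, all studies:", specificity_all)
```

Each average is taken over a different subset of studies (18 for accuracy, 16 for specificity), so the metrics are not directly comparable study by study.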

Based on accuracy alone, across all use cases, logistic regression was the optimal model in seven studies, an artificial neural network in six studies, a random forest in five studies, and a support vector machine in four studies. The highest performing models in the remaining studies used various ensemble algorithms or otherwise unique techniques.

Three of the 39 overall studies used varying methods of learning optimization, including parameter optimization (n = 2) to find the best training weights for each input feature and hyperparameter tuning (n = 4) to optimize certain variables inherent to each algorithm type. All four studies with hyperparameter tuning used k-fold cross-validation followed by grid search as a part of their optimization strategy. One study notably used both grasshopper optimization to find the best training parameters and opposition-based learning to search for the best algorithm hyperparameters.
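To make the combination of k-fold cross-validation and grid search concrete, the sketch below tunes a single hyperparameter (the number of neighbors in a k-nearest-neighbors classifier) by scoring every candidate value with 5-fold cross-validation. It is a minimal, standard-library-only illustration under stated assumptions: the data are synthetic (one hypothetical laboratory-style feature drawn from two Gaussian distributions), all function names are invented for this example, and none of it reproduces the procedure of any reviewed study.

```python
import random
from statistics import mean


def knn_predict(train_x, train_y, x, n_neighbors):
    """Classify x by majority vote among its n_neighbors closest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > n_neighbors else 0


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        yield [i for i in indices if i not in held], held_out


def cross_val_accuracy(xs, ys, n_neighbors, k=5):
    """Mean held-out accuracy across k folds for one hyperparameter value."""
    fold_scores = []
    for train, test in k_fold_splits(len(xs), k):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct = sum(knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i] for i in test)
        fold_scores.append(correct / len(test))
    return mean(fold_scores)


def grid_search(xs, ys, grid, k=5):
    """Score every candidate in the grid and return the best one with all scores."""
    scores = {n: cross_val_accuracy(xs, ys, n, k) for n in grid}
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic data (assumption): label 1 has a higher feature mean than label 0.
rng = random.Random(42)
ys = [i % 2 for i in range(100)]
xs = [rng.gauss(14.0 if y else 9.0, 2.5) for y in ys]

best_k, cv_scores = grid_search(xs, ys, grid=[1, 3, 5, 7, 9])
print("Cross-validated accuracy per candidate:", cv_scores)
print("Selected number of neighbors:", best_k)
```

Because the hyperparameter is chosen on the same data used for cross-validation, the winning score tends to be optimistic; an untouched test set or external dataset would be needed to estimate how the tuned model generalizes.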

Discussion

To analyze the potential applicability of machine learning methods for appendicitis, we have reviewed the current literature to better understand the successes and challenges faced in developing such algorithms among the studies that met criteria for inclusion.

Feature selection

Feature selection is a central challenge in developing predictive models. Choosing too few input features generally results in suboptimal model performance (lower accuracy), whereas the inclusion of too many input features may result in overfitting to the training dataset. The prominent inclusion of CRP levels and WBC counts as diagnostic algorithm features mirrors the contributions of leukocytosis and left shift of WBC count (neutrophilia) to the Alvarado Score. Use of the >6 mm maximal outer diameter rule in imaging findings to conclude acute inflammation also recapitulates established diagnostic guidelines.¹¹ However, several studies opted to also include additional demographic variables (age, gender, etc.), vital signs, and unique data modalities to achieve notable model performance.

For example, to address the high operator dependency and potential for interpretive error inherent to ultrasound, Kim et al.¹² created a tool for automated segmentation to extract appendiceal features from raw ultrasound images instead of relying on clinical records of imaging findings. Noguchi et al.¹³ developed an analogous tool for CT scan results to assess appendiceal diameter, wall enhancement, and peri-appendiceal fat stranding among other hallmarks of appendicitis. The appreciable model performance of the two studies, Reismann et al.¹⁴ and Zhao et al.,¹⁵ that exclusively used genomic or urinary proteomic data suggests that incorporating additional data types into algorithms may further optimize accuracy. Whereas input features seemed to be generally chosen based on the consensus of clinical experts or prior literature, some studies such as Xia et al.¹⁶ and Iliou et al.¹⁷ used computational methods such as random forests to select the most informative features and reduce redundancy. A combination of both manual and computational vetting may be a useful strategy for isolating the most important features for future algorithms that could integrate diverse data types.

Compared with the features used in diagnostic algorithms, algorithms for predicting post-operative outcomes drew on more diverse modalities of data across studies. This may in part be because of the additional data available in the given time window (both pre-operative and intra-operative variables for post-operative predictions). Although the studies for predicting diagnosis and post-operative outcomes all have similarly well-performing models with accuracies greater than 75%, a reasonable next step would be to standardize input features. Feature standardization may be especially critical given the previously mentioned issue of overfitting. Most of the reviewed studies trained and tested their models on internal datasets within respective hospitals, and thus insights from their algorithms may not be generalizable to distinct datasets. Only the studies by Su et al.¹⁸ and Bunn et al.¹⁹ used nationally validated databases: the National Hospital Ambulatory Medical Care Survey (NHAMCS) and the National Surgical Quality Improvement Program (NSQIP), respectively. However, ambitions for standardizing input features may be limited by access to relevant data. For example, NSQIP, as a clinical database, contains more clinically relevant and longitudinal data whereas NHAMCS, as an administrative database, exclusively provides data about inpatient comorbidities and complications.²⁰ Equity-based concerns related to data availability also exist: access to imaging is a key limitation in low-resource settings.²¹

The utility of demographic data (used in two studies found in this review) in appendicitis diagnosis is also worth further interrogating; current literature seems to find no link between ethnicity and likelihood of diagnosis, but other demographic factors, including male gender and patient age, have been identified as independent predictors of positive histology for appendicitis.²² On the other hand, ethnicity has been associated with post-operative outcomes including hospital length of stay. The literature surrounding the impact of age on appendicitis diagnosis is also mixed. Some studies cite that unusual presentations in pediatric appendicitis lead to more frequent misdiagnoses, whereas other epidemiologic analyses suggest no age-related difference in presentation or perforation rates.²³,²⁴ Of our 39 selected studies, 11 studies explicitly analyzed pediatric patient cohorts. Several other studies notably either combined pediatric and adult cases in their analysis or did not explicitly specify the age range of the patients included in the study.

Patient socioeconomic status as an input feature is also contentious. Some global studies of appendicitis diagnosis have established that patients from low-income populations bear higher risks for appendicitis as well as higher hospital costs, but the one study we found using socioeconomic status as an input feature concluded that variability between hospital policies was a stronger predictor of inpatient expenditure.²⁵,²⁶ Overall, discrepancies between facilities were found to be a more important factor for post-operative outcome prediction than for appendicitis diagnosis prediction.

Summary of applications

Despite current challenges in systematizing feature selection, the retrieved studies (28 for pre-operative predictions, 11 for post-operative predictions) showcase the diverse applications machine learning algorithms may offer at several stages of appendicitis management (Fig. 2). Pre-operative studies for predicting diagnosis and the need for surgical intervention drew on several modalities of data from genomic, proteomic, laboratory, and radiomic sources. No study has presently integrated all of these modalities to assess the utility of a composite predictive algorithm. Studies leveraging imaging data notably varied in their use of raw data versus radiological interpretations. For example, Kim et al.¹² and Noguchi et al.¹³ developed segmentation algorithms to quantify appendiceal diameters and detect whether they exceeded the pathological threshold, thereby bypassing challenges of ultrasound operator subjectivity. All other diagnostic studies relied on conclusions drawn from radiologist notes. Reporting of algorithmic input features was less detailed for post-operative studies, but these studies more frequently used baseline demographic characteristics when compared to pre-operative studies.

FIG. 2.

Depiction of use cases for machine learning algorithms in retrieved literature. Machine learning (ML) can use pre-operative data to predict diagnosis and guide surgical decision-making, while also being capable of using pre-operative and intra-operative data to predict post-operative outcomes. (Source: Figure adapted from “Risk Factors of Dementia” by BioRender.com (2022) and retrieved from https://app.biorender.com/biorender-templates).

Performance metrics

Although ML is clearly applicable to appendicitis and surgery at large, the approach's advisability in specific clinical contexts is still to be determined. Fewer than half of the 39 reviewed studies reported standard model performance metrics such as accuracy, sensitivity, specificity, and AUROC. This low level of reporting adherence may be expected given the recent rise of such studies in the last decade, but it also shows the need for standardizing the reporting of machine learning algorithm performance metrics for appendicitis management and beyond.

Studies most commonly reported accuracy and evidently featured high-performing models, with an average accuracy of 85.5%. However, a focus on accuracy as the primary performance metric seemed to come with tradeoffs in sensitivity and specificity. At the extremes, sensitivity was as low as 57.3% and specificity as low as 17.5%. Overall averages of sensitivity (81.1%) and specificity (74.9%) were also notably lower than the average accuracy attained across all the studies; this finding held even within the individual groupings of diagnostic and post-operative studies. The AUROC was also the least commonly reported metric (15/39 studies), despite being a better indicator of predictive model performance across decision thresholds. Furthermore, a limited number of articles compared their algorithms to standard scoring systems such as the Alvarado Score or the Appendicitis Inflammatory Response Score. All three studies that did conduct a comparison to the Alvarado Score, however, outperformed the system in accuracy, sensitivity, and specificity. Nevertheless, the performance metrics cataloged in this study may serve as additional benchmarks for future algorithm developers to improve upon in this nascent field, alongside further comparisons to existing clinical benchmarks like the Alvarado Score (see Table 3 for a full list of performance metrics used by the selected studies).
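The relationship between accuracy, sensitivity, and specificity can be made explicit with a short derivation. The notation is introduced here for illustration: TP and TN are the true positive and true negative counts, P is the number of truly positive patients, N is the total number of patients, and π = P/N is the prevalence of the positive class in the evaluated dataset.

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{N}
  = \frac{P}{N}\cdot\frac{TP}{P} + \frac{N - P}{N}\cdot\frac{TN}{N - P} \\
  &= \pi \cdot \text{Sensitivity} + (1 - \pi)\cdot \text{Specificity}
\end{aligned}
```

For a single model on a single dataset, accuracy is therefore a prevalence-weighted average of sensitivity and specificity and must lie between them. As a worked example, a model with 98% sensitivity and 17.5% specificity (the values reported for Stiel et al. in Table 1) evaluated at a hypothetical prevalence of π = 0.5 would have an accuracy of 0.5 × 0.98 + 0.5 × 0.175 = 0.5775, or about 58%. The pooled averages in this review need not obey this bound, because each metric was averaged over a different subset of studies.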

Table 3.

Common Performance Metrics

Metric	Description
Accuracy	Fraction of total predictions that model predicted correctly
AUC (AUROC)	Measures “separability,” how well a model can distinguish between different classes (e.g., diagnoses or outcomes); calculated as area under a plot of true positive rate (sensitivity) vs. false positive rate (1-specificity)
Sensitivity	Probability of a true positive, given that the patient is positive for the disease/outcome. Also sometimes referred to as recall.
Specificity	Probability of a true negative, given that the patient is negative for the disease/outcome.
Precision	Probability of a true positive, given that the model predicts the patient to be positive for the disease/outcome. Also commonly referred to as positive predictive value.
Brier score	Mean of squared differences between a probabilistic prediction and the outcome (assigned as 0 or 1 for binary classifications)
F1 score	Composite metric for accuracy that is calculated by taking the harmonic mean of precision and recall/sensitivity

AUC = area under the curve; AUROC = area under the receiver operating characteristic curves
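As a rough illustration of how the metrics in Table 3 relate to one another, the following Python sketch computes each of them for a small made-up set of labels and predicted probabilities. The function names, the 0.5 decision threshold, and the toy data are assumptions introduced here for illustration; none of them come from the reviewed studies. The AUROC is computed with the pairwise-comparison (rank) formulation, which should agree with the area under the plotted curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold-dependent metrics plus the Brier score.

    Assumes both classes are present and at least one positive
    prediction is made, so that no denominator is zero.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        # Harmonic mean of precision and sensitivity
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        # Mean squared difference between probability and 0/1 outcome
        "brier": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true),
    }


def auroc(y_true, y_prob):
    """Probability that a random positive case is scored above a random
    negative case (ties count as one half)."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical toy cohort: 6 patients with the outcome, 4 without.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.7, 0.3, 0.2, 0.1]

for name, value in summary_metrics(y_true, y_prob).items():
    print(f"{name}: {value:.3f}")
print(f"auroc: {auroc(y_true, y_prob):.3f}")
```

Unlike the other metrics, the AUROC and the Brier score are computed from the predicted probabilities rather than from thresholded predictions, which is one reason they can complement accuracy when models are compared.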

Limitations

There are certain limitations to the insights gleaned from this study regarding the applicability of ML for appendicitis. First, publication bias likely favors projects with the most accurate algorithms, so the published literature may overstate typical performance. Second, because parameter and hyperparameter optimization was conducted within each study's specific dataset, it is difficult to predict how each study's model would generalize to new datasets. The aforementioned problem of feature standardization, further complicated by difficulty in determining the clinical utility of certain patient features such as patient age, also hinders generalizability. Furthermore, data on real-time implementation of ML are limited. Except for the three prospective studies involving the prediction of post-operative pain assessments, no clinical trials or studies implementing ML at the bedside were found.
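The concern about optimizing within a single dataset can be illustrated with a deliberately artificial Python sketch. It assumes purely synthetic data in which the features are random noise unrelated to the outcome, and it treats "tuning" as picking the best of many simple candidate rules on a small development cohort. All names, cohort sizes, and the rule family are hypothetical and do not correspond to any reviewed study.

```python
import random


def make_dataset(rng, n_patients, n_features):
    """Synthetic cohort: labels are coin flips and features are Gaussian
    noise, so no rule can truly predict the outcome."""
    labels = [rng.randint(0, 1) for _ in range(n_patients)]
    features = [
        [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for _ in range(n_patients)
    ]
    return features, labels


def rule_accuracy(features, labels, feature_index, sign):
    """Accuracy of the rule 'predict 1 when sign * feature > 0'."""
    correct = 0
    for row, label in zip(features, labels):
        prediction = 1 if sign * row[feature_index] > 0 else 0
        correct += int(prediction == label)
    return correct / len(labels)


def select_best_rule(features, labels):
    """Stand-in for tuning: keep whichever candidate rule scores best
    on the same data that is used to evaluate it."""
    candidates = [
        (index, sign)
        for index in range(len(features[0]))
        for sign in (1, -1)
    ]
    return max(
        candidates,
        key=lambda rule: rule_accuracy(features, labels, rule[0], rule[1]),
    )


rng = random.Random(0)
N_FEATURES = 200

# Small development cohort used for selection, larger unseen cohort after.
dev_x, dev_y = make_dataset(rng, 40, N_FEATURES)
new_x, new_y = make_dataset(rng, 2000, N_FEATURES)

best_index, best_sign = select_best_rule(dev_x, dev_y)
dev_accuracy = rule_accuracy(dev_x, dev_y, best_index, best_sign)
new_accuracy = rule_accuracy(new_x, new_y, best_index, best_sign)

print(f"accuracy on development cohort: {dev_accuracy:.2f}")
print(f"accuracy on unseen cohort:      {new_accuracy:.2f}")
```

Because the selected rule was chosen for its score on the development cohort, that score is optimistic by construction, while accuracy on the unseen cohort should sit near chance. Evaluation on held-out or external data is the usual safeguard against this kind of optimism.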

Future considerations

This review has illustrated the need for standardized computational and ML performance reporting frameworks to enable better model comparisons. Given the lack of real-time implementation of ML appendicitis models in clinical settings, however, it would also be advisable to investigate how patients would tolerate the incorporation of artificial intelligence (AI) tools into critical healthcare diagnoses such as appendicitis. Past studies of patient perception of human-AI interactions in health care cite anxieties about communication barriers, regulatory standards, and health privacy.²⁷ The accuracy and scalability of ML models thus do not undercut the importance of patient-physician interactions; in fact, ML predictions trained on limited datasets may bias physicians and lead to undesirable heuristic shortcuts.²⁸ At the same time, such drawbacks would have to be weighed against the cost-effectiveness of AI-based approaches to diagnosis in low-resource settings.

Conclusions

The identified studies suggest that ML may perform similarly to or better than current clinical predictive tools relevant to the diagnosis and post-operative management of appendicitis. However, despite the broad emergence of ML studies in the field of appendicitis treatment, translation to real-world practice remains limited. No formal clinical trials using ML for appendectomy were found. Further studies will be needed to elucidate the performance of such approaches relative to the Alvarado Score and to assess the feasibility and advisability of implementing ML-based tools in clinical practice.

Footnotes

Acknowledgments

Presented at the Connecticut Chapter of the American College of Surgeons Annual Meeting, Trumbull, Connecticut, October 2022 and the Society of American Gastrointestinal and Endoscopic Surgeons Annual Meeting, Montreal, Canada, March 2023.

Authors' Contributions

Each author has contributed significantly to the review and has contributed in one or more aspects of the study as noted below.

Conceptualization: Bhandarkar, Schneider, Ahuja. Methodology: Bhandarkar, Schneider, Brackett, Ahuja. Formal analysis: Bhandarkar. Investigation: Bhandarkar, Tsutsumi. Resources: Schneider, Brackett, Ahuja. Data curation: Bhandarkar, Brackett. Writing—original draft: Bhandarkar. Writing—review and editing: Bhandarkar, Tsutsumi, Schneider, Ong, Paredes, Brackett, Ahuja. Visualization: Bhandarkar. Supervision: Tsutsumi, Schneider, Ong, Ahuja. Project administration: Schneider, Ahuja.

Funding Information

There were no sources of funding for this work.

Author Disclosure Statement

The authors declare no conflicts of interest.

References

1. Symer, Abelson, Sedrakyan, et al. Early operative management of complicated appendicitis is associated with improved surgical outcomes in adults. Am J Surg, 2018; 216(3):431–437; doi: 10.1016/j.amjsurg.2018.04.010

2. Mahajan, Basu, Pai C-W, et al. Factors associated with potentially missed diagnosis of appendicitis in the emergency department. JAMA Netw Open, 2020; 3(3):e200612; doi: 10.1001/jamanetworkopen.2020.0612

3. Noureldin, Hatim Ali, Issa, et al. Negative appendicectomy rate: Incidence and predictors. Cureus, 2022; 14(1):e21489; doi: 10.7759/cureus.21489

4. Ohle, O'Reilly, O'Brien, et al. The Alvarado score for predicting acute appendicitis: a systematic review. BMC Med, 2011; 9:139; doi: 10.1186/1741-7015-9-139

5. Wonski, Ranzenberger, Carter. Appendix Imaging. StatPearls Publishing; 2022. Available from: https://www.ncbi.nlm.nih.gov/books/NBK549903/ [Last accessed: June 20, 2023].

6. DeStigter, Pool K-L, Leslie, et al. Optimizing integrated imaging service delivery by tier in low-resource health systems. Insights Imaging, 2021; 12(1):129; doi: 10.1186/s13244-021-01073-8

7. Bzdok, Altman, Krzywinski. Statistics versus machine learning. Nat Methods, 2018; 15(4):233–234; doi: 10.1038/nmeth.4642

8. , Feridooni, Cuen-Ojeda, et al. Machine learning in vascular surgery: A systematic review and critical appraisal. NPJ Digit Med, 2022; 5(1):7; doi: 10.1038/s41746-021-00552-y

9. Senders, Arnaout, Karhade, et al. Natural and artificial intelligence in neurosurgery: A systematic review. Neurosurgery, 2018; 83(2):181–192; doi: 10.1093/neuros/nyx384

10. LibGuides: Systematic Reviews: Types of Reviews. 2021. Available from: https://guides.mclibrary.duke.edu/sysreview/types [Last accessed: June 20, 2023].

11. Mostbeck, Adam, Nielsen, et al. How to diagnose acute appendicitis: Ultrasound first. Insights Imaging, 2016; 7(2):255–263; doi: 10.1007/s13244-016-0469-6

12. Kim, Song, Park. Robust automatic segmentation of inflamed appendix from ultrasonography with double-layered outlier rejection fuzzy C-means clustering. Appl Sci, 2022; 12(11):5753; doi: 10.3390/app12115753

13. Noguchi, Matsushita, Kawata, et al. A fundamental study assessing the generalized fitting method in conjunction with every possible coalition of N-combinations (G-EPOC) using the appendicitis detection task of computed tomography. Pol J Radiol, 2021; 86:e532–e541; doi: 10.5114/pjr.2021.110309

14. Reismann, Kiss, Reismann. The application of artificial intelligence methods to gene expression data for differentiation of uncomplicated and complicated appendicitis in children and adolescents: A proof of concept study. BMC Pediatr, 2021; 21(1):268; doi: 10.1186/s12887-021-02735-8

15. Zhao, Yang, Sun, et al. Discovery of urinary proteomic signature for differential diagnosis of acute appendicitis. Biomed Res Int, 2020; 2020:3896263; doi: 10.1155/2020/3896263

16. Xia, Wang, Yang, et al. Performance optimization of support vector machine with oppositional grasshopper optimization for acute appendicitis diagnosis. Comput Biol Med, 2022; 143:105206; doi: 10.1016/j.compbiomed.2021.105206

17. Iliou, Anagnostopoulos C-N, Stephanakis, et al. Combined classification of risk factors for appendicitis prediction in childhood. In: Engineering Applications of Neural Networks. Springer Berlin Heidelberg; 2013; pp. 203–211; doi: 10.1007/978-3-642-41016-1_22

18. , Li, Zhang, et al. Prediction of acute appendicitis among patients with undifferentiated abdominal pain at emergency department. BMC Med Res Methodol, 2022; 22(1):18; doi: 10.1186/s12874-021-01490-9

19. Bunn, Kulshrestha, Boyda, et al. Application of machine learning to the prediction of postoperative sepsis after appendectomy. Surgery, 2021; 169(3):671–677; doi: 10.1016/j.surg.2020.07.045

20. Alluri, Leland, Heckmann. Surgical research using national databases. Ann Transl Med, 2016; 4(20):393; doi: 10.21037/atm.2016.10.49

21. Frija, Blažić, Frush, et al. How to improve access to medical imaging in low- and middle-income countries? EClinicalMedicine, 2021; 38:101034; doi: 10.1016/j.eclinm.2021.101034

22. Bhanderi, Ain, Siddique, et al. Demographic factors associated with length of stay in hospital and histological diagnosis in adults undergoing appendicectomy. Turk J Surg, 2022; 38(1):36–45; doi: 10.47717/turkjsurg.2022.5406

23. Lee, Ho. Acute appendicitis: Is there a difference between children and adults? Am Surg, 2006; 72(5):409–413; doi: 10.1177/000313480607200509

24. Armağan, Duman, Cesur Ö, et al. Comparative analysis of epidemiological and clinical characteristics of appendicitis among children and adults. Ulus Travma Acil Cerrahi Derg, 2021; 27(5):526–533; doi: 10.14744/tjtes.2020.47880

25. Lin K-B, Lai, Yang N-P, et al. Epidemiology and socioeconomic features of appendicitis in Taiwan: A 12-year population-based study. World J Emerg Surg, 2015; 10:42; doi: 10.1186/s13017-015-0036-3

26. Syed, Kamal, Haq, et al. S1303 predicting the cost of major gastrointestinal infection admissions with machine learning. Am J Gastroenterol, 2021; 116(1):S599; doi: 10.14309/01.ajg.0000778744.85629.d3

27. Esmaeilzadeh, Mirzaei, Dharanikota. Patients' perceptions toward human-artificial intelligence interaction in health care: Experimental study. J Med Internet Res, 2021; 23(11):e25856; doi: 10.2196/25856

28. Ong, Burattini, Schena. Editorial: Artificial intelligence in human physiology. Front Physiol, 2022; 13:1075819; doi: 10.3389/fphys.2022.1075819

29. Mijwil, Aggarwal. A diagnostic testing for people with appendicitis using machine learning techniques. Multimed Tools Appl, 2022; 81(5):7011–7023; doi: 10.1007/s11042-022-11939-8

30. Ghareeb, Emile, Elshobaky. Artificial intelligence compared to Alvarado scoring system alone or combined with ultrasound criteria in the diagnosis of acute appendicitis. J Gastrointest Surg, 2022; 26(3):655–658; doi: 10.1007/s11605-021-05147-2

31. Xia, Zhang, Li, et al. Generalized oppositional moth flame optimization with crossover strategy: An approach for medical diagnosis. J Bionic Eng, 2021; 18(4); doi: 10.1007/s42235-021-0068-1

32. Marcinkevics, Reis Wolfertstetter, Wellmann, et al. Using machine learning to predict the diagnosis, management and severity of pediatric appendicitis. Front Pediatr, 2021; 9:662183; doi: 10.3389/fped.2021.662183

33. Kang C-B, Li X-W, Hou S-Y, et al. Preoperatively predicting the pathological types of acute appendicitis using machine learning based on peripheral blood biomarkers and clinical features: A retrospective study. Ann Transl Med, 2021; 9(10):835; doi: 10.21037/atm-20-7883

34. Hayashi, Ishimaru, Lee, et al. Identification of appendicitis using ultrasound with the aid of machine learning. J Laparoendosc Adv Surg Tech A, 2021; 31(12):1412–1419; doi: 10.1089/lap.2021.0318

35. Gunasingha RMKD, Grey, Munoz, et al. To scan or not to scan: Development of a clinical decision support tool to determine if imaging would aid in the diagnosis of appendicitis. World J Surg, 2021; 45(10):3056–3064; doi: 10.1007/s00268-021-06246-6

36. Al Masud, Royel MRI, Sajal MMH, et al. Smart risk prediction tools of appendicitis patients: A machine learning approach. Biointerface Res Appl Chem, 2020; 11(1,2020):7804–7813; doi: 10.33263/BRIAC111.78047813

37. Stiel, Elrod, Klinke, et al. The modified Heidelberg and the AI appendicitis score are superior to current scores in predicting appendicitis in children: A two-center cohort study. Front Pediatr, 2020; 8:592892; doi: 10.3389/fped.2020.592892

38. Aydin, Türkmen İU, Namli, et al. A novel and simple machine learning algorithm for preoperative diagnosis of acute appendicitis in children. Pediatr Surg Int, 2020; 36(6):735–742; doi: 10.1007/s00383-020-04655-7

39. Akmese, Dogan, Kor, et al. The use of machine learning approaches for the diagnosis of acute appendicitis. Emerg Med Int, 2020; 2020:7306435; doi: 10.1155/2020/7306435

40. Reismann, Romualdi, Kiss, et al. Diagnosis and classification of pediatric acute appendicitis by artificial intelligence methods: An investigator-independent approach. PLoS One, 2019; 14(9):e0222030; doi: 10.1371/journal.pone.0222030

41. Majeed Alneamy JS, A Hameed Alnaish, Mohd Hashim, et al. Utilizing hybrid functional fuzzy wavelet neural networks with a teaching learning-based optimization algorithm for medical disease diagnosis. Comput Biol Med, 2019; 112:103348; doi: 10.1016/j.compbiomed.2019.103348

42. Koren, Souroujon, Shaul, et al. “A patient like me”—An algorithm-based program to inform patients on the likely conditions people with symptoms like theirs have. Medicine, 2019; 98(42):e17596; doi: 10.1097/MD.0000000000017596

43. Donnelly, Grzeszczuk, Guimaraes, et al. Using a natural language processing and machine learning algorithm program to analyze inter-radiologist report style variation and compare variation between radiologists when using highly structured versus more free text reporting. Curr Probl Diagn Radiol, 2019; 48(6):524–530; doi: 10.1067/j.cpradiol.2018.09.005

44. Shahmoradi, Safdari, Mirhosseini, et al. Predicting risk of acute appendicitis: A comparison of artificial neural network and logistic regression models. Acta Med Iran, 2019; 56 No. 12(2018):85. Available from: https://acta.tums.ac.ir/index.php/acta/article/view/7363 [Last accessed: June 20, 2023].

45. Norman, Davis, Quinn, et al. Automated identification of pediatric appendicitis score in emergency department notes using natural language processing. In: 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI). IEEE; 2017; doi: 10.1109/bhi.2017.7897310

46. Mitroulias, Theofilatos, Likothanassis, et al. AppendicitisScan tool: A new tool for the efficient classification of childhood abdominal pain clinical. In: International Conference on Engineering Applications of Neural Networks. Springer Verlag; 2013; pp. 110–118; doi: 10.1007/978-3-642-41016-1_12

47. Lee Y-H, Hu PJ-H, Cheng T-H, et al. A preclustering-based ensemble learning technique for acute appendicitis diagnoses. Artif Intell Med, 2013; 58(2):115–124; doi: 10.1016/j.artmed.2013.03.007

48. Deleger, Brodzinski, Zhai, et al. Developing and evaluating an automated appendicitis risk stratification algorithm for pediatric patients in the emergency department. J Am Med Inform Assoc, 2013; 20(e2):e212-20; doi: 10.1136/amiajnl-2013-001962

49. Malley, Kruppa, Dasgupta, et al. Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med, 2012; 51(1):74–81; doi: 10.3414/ME00-01-0052

50. Susam, Riek, Akcakaya, et al. Automated pain assessment in children using electrodermal activity and video data fusion via machine learning. IEEE Trans Biomed Eng, 2022; 69(1):422–431; doi: 10.1109/TBME.2021.3096137

51. Eickhoff, Bulla, Eickhoff, et al. Machine learning prediction model for postoperative outcome after perforated appendicitis. Langenbecks Arch Surg, 2022; 407(2):789–795; doi: 10.1007/s00423-022-02456-1

52. Sartori, Podda, Botteri, et al. Appendectomy during the COVID-19 pandemic in Italy: A multicenter ambispective cohort study by the Italian Society of Endoscopic Surgery and new technologies (the CRAC study). Updates Surg, 2021; 73(6):2205–2213; doi: 10.1007/s13304-021-01126-z

53. Kim, Lodaya, Marinaro, et al. PGI28 investigating length of stay in gastrointestinal patient surgical clusters in the national inpatient sample with machine learning. Value Health, 2021; 24:S99; doi: 10.1016/j.jval.2021.04.517

54. Guedj, Marini, Kossowsky, et al. Racial and ethnic disparities in pain management of children with limb fractures or suspected appendicitis: a retrospective cross-sectional study. Front Pediatr, 2021; 9:652854; doi: 10.3389/fped.2021.652854

55. Kempa-Liehr, Lin CY-C, Britten, et al. Healthcare pathway discovery and probabilistic machine learning. Int J Med Inform, 2020; 137:104087; doi: 10.1016/j.ijmedinf.2020.104087

56. Al Khatib, Alramadhan, Murphy, et al. 2438. Using artificial neural networks to predict intra-abdominal abscess risk post-appendectomy. Open Forum Infectious Diseases, 2019; 6(Suppl 2):S842; doi: 10.1093/ofid/ofz360.2116

57. Sikka, Ahmed, Diaz, et al. Automated assessment of children's postoperative pain using computer vision. Pediatrics, 2015; 136(1):e124-31; doi: 10.1542/peds.2015-0029

58. Huang, Craig, Diaz, et al. (108) Automated facial expression analysis can detect clinical pain in youth in the post-operative setting. J Pain, 2014; 15(4 Suppl):S3; doi: 10.1016/j.jpain.2014.01.014