Abstract
The pupillary light reflex (PLR) is an important biomarker for the detection and management of traumatic brain injury (TBI). We investigated the performance of PupilScreen, a smartphone-based pupillometry app, in classifying healthy control subjects and subjects with severe TBI in comparison to the current gold standard NeurOptics pupillometer (NPi-200 model with proprietary Neurological Pupil Index [NPi] TBI severity score). A total of 230 PLR video recordings taken using both the PupilScreen smartphone pupillometer and NeurOptics handheld device (NPi-200) pupillometer were collected from 33 subjects with severe TBI (sTBI) and 132 subjects who were healthy without self-reported neurological disease. Severe TBI status was determined by Glasgow Coma Scale (GCS) at the time of recording. The proprietary NPi score was collected from the NPi-200 pupillometer for each subject. Seven PLR curve morphological parameters were collected from the PupilScreen app for each subject. A comparison via t-test and via binary classification algorithm performance using NPi scores from the NPi-200 and PLR parameter data from the PupilScreen app was completed. This was used to determine how the frequently used NPi-200 proprietary NPi TBI severity score compares to the PupilScreen app in ability to distinguish between healthy and sTBI subjects. Binary classification models for this task were trained for the diagnosis of healthy or severe TBI using logistic regression, k-nearest neighbors, support vector machine, and random forest machine learning classification models. Overall classification accuracy, sensitivity, specificity, area under the curve, and F1 score values were calculated. Median GCS was 15 for the healthy cohort and 6 (interquartile range 2) for the severe TBI cohort. 
Smartphone app PLR parameters as well as NPi from the digital infrared pupillometer were significantly different between healthy and severe TBI cohorts; 33% of the study cohort had dark eye colors defined as brown eyes of varying shades. Across all classification models, the top performing PLR parameter combination for classifying subjects as healthy or sTBI for PupilScreen was maximum diameter, constriction velocity, maximum constriction velocity, and dilation velocity with accuracy, sensitivity, specificity, area under the curve (AUC), and F1 score of 87%, 85.9%, 88%, 0.869, and 0.85, respectively, in a random forest model. The proprietary NPi TBI severity score demonstrated greatest AUC value, F1 score, and sensitivity of 0.648, 0.567, and 50.9% respectively using a random forest classifier and greatest overall accuracy and specificity of 67.4% and 92.4% using a logistic regression model in the same classification task on the same dataset. The PupilScreen smartphone pupillometry app demonstrated binary healthy versus severe TBI classification ability greater than that of the NPi-200 proprietary NPi TBI severity score. These results may indicate the potential benefit of future study of this PupilScreen smartphone pupillometry application in comparison to the NPi-200 digital infrared pupillometer across the broader TBI spectrum, as well as in other neurological diseases.
Introduction
In the United States, traumatic brain injury (TBI) is the leading cause of mortality in people under age 45. 1,2 The pupillary light reflex (PLR) is a well-validated biomarker for assessing neurological disease in pre-hospital and in-hospital settings, mainly used in the management of TBI. 3 The most widely used technique for assessing the PLR outside of the neurological intensive care unit (ICU) setting is manual pupillometry using a penlight 4; however, this method suffers from poor inter-rater reliability. 5 We have developed a smartphone application 6 that assesses the PLR binocularly while being held a standardized distance away from the subject's face using an augmented-reality viewfinder built into the technology. Other smartphone pupillometry technologies have been studied with varying hardware and software features and requirements. 7–10 The PupilScreen smartphone pupillometry app studied here requires only a standard iPhone and a cloud-based neural network computer vision algorithm with no external hardware required. This technology measures the PLR with better inter-rater reliability than the manual penlight (Randolph's kappa of 0.75 for PupilScreen and 0.47 for the manual penlight). 11
The digital infrared pupillometer is the current gold-standard pupillometry device, but its use is limited primarily to the neurological ICU setting 12 given its cost, required disposables, and learning curve. It is used on one eye at a time and is positioned up against the subject's face to measure a proprietary aggregate variable called the Neurological Pupil Index (NPi) statistic, which was designed on a scale of 0 to 5, with scores below 3 indicating pupillary abnormality in the setting of TBI. 13 The primary objective of this study is to compare the PLR parameters of the PupilScreen smartphone-based pupillometer app with the proprietary NPi metric of the NeurOptics (NO) device in the task of differentiating healthy subjects from subjects with severe TBI.
Methods
Study setting, population, and data collection
This study was approved by the University of Washington Institutional Review Board (Study #8009). The analysis focused on the ability of the smartphone pupillometer with PLR parameters and the NPi statistic to produce an indication of whether the pupil is reacting normally (healthy cohort with absence of reported neurological disease) or in a reduced manner (severe TBI cohort) based on a recording of the PLR. 11 The PupilScreen machine learning-enhanced smartphone pupillometry application (Apertur, Inc., Seattle, WA) and the NPi-200 digital infrared pupillometer (NeurOptics, Inc., Irvine, CA) were used sequentially to record the PLR of healthy subjects and patients hospitalized in a neurological ICU for treatment of sTBI. Demographic data as well as Glasgow Coma Scale 14 (GCS) values and TBI lesion types were collected. Recordings were conducted within a five-minute separation time to ensure a lack of interference from previous stimulation of the PLR with either device and to reduce environmental and physiological variations.
Images of the NPi-200 output of PLR parameters after monocular recording of the PLR were collected for each subject and stored in a secure, Health Insurance Portability and Accountability Act (HIPAA)–compliant database. A binocular video recording of the PLR was obtained within the PupilScreen smartphone app via the smartphone camera (native iPhone model 7 and above camera flash used as light stimulus) and uploaded to an encrypted HIPAA-compliant database. The seven morphological PLR curve parameters reported by the PupilScreen smartphone pupillometer were collected for each recording (Table 1), as was the proprietary NPi TBI severity score from the NPi-200 digital infrared pupillometer.
Definitions of Pupillary Light Reflex Parameters
Statistical analysis
We first investigated the differences in the seven PLR parameters (Table 1) between healthy and severe TBI (sTBI) cohorts with the PupilScreen smartphone pupillometer using a one-tailed Student's t-test for independent means to aid in study dataset characterization. The same test was also performed on the NPi score, which is produced by the NPi-200 pupillometer and frequently used at our institution for sTBI detection and monitoring. Subsequently, to compare the ability of the smartphone pupillometer to classify the study cohorts and to understand the most important parameters for each classifier, all combinations of averaged right and left eye PLR parameters for the smartphone pupillometer were used as features in logistic regression (LR), k-nearest neighbors (KNN), support-vector machine (SVM), and random forest (RF) classification models. The NPi statistic was used separately as a single feature to train the same types of classification models to determine its performance on the same classification task of differentiating between healthy and sTBI PLR responses with the same set of subjects. All experiments were conducted using scikit-learn in Python (version 3.10), and default hyperparameters were used without tuning.
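The exhaustive feature search described above can be sketched as follows. This is a minimal illustration and not the study code: the parameter abbreviations follow the table footnotes, the seventh name (`P7`) is a placeholder because Table 1 is not reproduced in this text, and the helper `average_eyes` with its dictionary layout is an assumption.

```python
from itertools import combinations

# Six abbreviations appear in the table footnotes; "P7" is a placeholder
# for the seventh PupilScreen parameter, which is not named in this text.
PLR_PARAMETERS = ["Max", "Min", "Lat", "CV", "MCV", "DV", "P7"]


def average_eyes(left, right):
    """Average left- and right-eye values per parameter (assumed dict layout)."""
    return {name: (left[name] + right[name]) / 2.0 for name in left}


def all_feature_subsets(names):
    """Return every non-empty combination of parameter names."""
    subsets = []
    for size in range(1, len(names) + 1):
        subsets.extend(combinations(names, size))
    return subsets


feature_subsets = all_feature_subsets(PLR_PARAMETERS)
print(len(feature_subsets))  # 127 = 2**7 - 1 candidate feature sets per classifier
```

Each subset would then select the columns passed to a classifier built with default arguments, for example scikit-learn's `LogisticRegression`, `KNeighborsClassifier`, `SVC`, and `RandomForestClassifier`. With four model families this gives 4 × 127 = 508 configurations to evaluate.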
Receiver operating characteristic curves were produced along with mean classification accuracy, area under the curve (AUC), F1 score, sensitivity, and specificity values using the k-fold cross-validation method 15 (k = 10) for generating repeated training and testing samples from our dataset. This was completed separately for the binary classification performance (on the healthy vs. sTBI classification task) of PupilScreen's PLR parameters (Table 1) and the binary classification performance (on the same classification task) of the proprietary NPi score that is produced by the NPi-200 pupillometer and used to aid in sTBI diagnosis and monitoring. The cross-validation technique uses many different train and test groups to obtain mean accuracy, sensitivity, specificity, AUC, and F1 score metrics. Combining the model outputs across multiple train and test groups in this manner aims to reduce the effect of any individual train and test group selection on model performance and to give a more representative estimate of model classification ability on the dataset. Shapley additive explanation (SHAP) 16 bee-swarm violin plots were generated to determine how different PLR parameters contributed to the best-performing models. The best-performing PLR parameter combinations for the smartphone pupillometer were determined by greatest overall accuracy and greatest AUC value separately. F1 score was used as a more robust indicator of model performance than overall accuracy in the setting of class imbalance. A p value of <0.05 was considered statistically significant.
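As a rough sketch of how 10-fold cross-validation partitions the recordings and averages a metric across folds, the example below uses only the standard library. The labels are built solely from the reported recording counts (98 sTBI, 132 healthy), and the majority-class baseline is an illustrative stand-in, not one of the study's models; the function names are hypothetical.

```python
import random
from statistics import mean


def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle indices once, then return k (train, test) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    splits = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, test))
    return splits


def majority_baseline_accuracy(labels, train, test):
    """Predict the most common training label for every test recording."""
    train_labels = [labels[j] for j in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[j] == majority for j in test) / len(test)


# Labels built only from the reported recording counts: 98 sTBI (1), 132 healthy (0).
labels = [1] * 98 + [0] * 132
splits = kfold_indices(len(labels), k=10)
fold_scores = [majority_baseline_accuracy(labels, tr, te) for tr, te in splits]
mean_accuracy = mean(fold_scores)
print(round(mean_accuracy, 3))  # 0.574
```

A classifier that always predicts "healthy" reaches about 57% mean accuracy here while detecting no sTBI recordings at all, which illustrates why sensitivity and F1 score are reported alongside accuracy when classes are imbalanced.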
Results
Cohort characteristics
A total of n = 33 unique sTBI (defined by GCS 3–8) 17 and n = 132 healthy subjects were enrolled in this study, with n = 98 and n = 132 unique recordings of the PLR, respectively, across both devices. Cohort demographics and GCS values are reported (Table 2). All sTBI subjects had one or more of the following TBI lesions in one or both cerebral hemispheres at the time of PLR recording: contusion, subdural hematoma, epidural hematoma, traumatic subarachnoid hemorrhage, and intraventricular hemorrhage. Eye colors were 14% green, 25% mixed, 28% blue, and 33% brown. “Mixed” was defined as including multiple colors or a color combination other than brown, blue, or green (e.g., hazel).
Study Cohort Characteristics
p < 0.05
GCS, Glasgow Coma Scale; NPi, Neurological Pupil Index; TBI, traumatic brain injury; IQR, interquartile range.
Best performing PLR parameters
Between-cohort PLR parameter differences for the smartphone pupillometer are reported (Table 3). All PLR parameters demonstrated significant differences between healthy and TBI cohorts for the smartphone pupillometer, and the NPi was significantly different between cohorts as well. The three best-performing PLR parameter combinations for the smartphone pupillometer as determined by AUC value were the same as those determined by overall accuracy value and are presented (Table 4). The top performing combination for the smartphone pupillometer was maximum pupillary diameter, mean constriction velocity, maximum constriction velocity, and mean dilation velocity, with accuracy, sensitivity, specificity, area under the curve (AUC), and F1 score of 87%, 85.9%, 88%, 0.869, and 0.85, respectively, in a random forest model. The NPi statistic demonstrated its greatest AUC value, F1 score, and sensitivity of 0.648, 0.567, and 50.9%, respectively, using a random forest classifier and its greatest overall accuracy and specificity of 67.4% and 92.4% using a logistic regression model (Table 5). SHAP diagrams for the best-performing combinations for PupilScreen based on AUC value (Fig. 1) and the NPi statistic based on AUC value (Fig. 2) are shown to aid in understanding how the models used the PLR parameters for prediction and to compare smartphone pupillometer and NPi statistic performance in the study task. Confusion matrices for PupilScreen (smartphone pupillometer; Table 6) and NPi-200 (NPi score; Table 7) are also provided for a thorough understanding of the data and results of this study.
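The reported metrics for the best PupilScreen model can be checked for internal consistency with a short calculation. The sketch below assumes sTBI is the positive class and pools the reported sensitivity and specificity over the recording counts (98 sTBI, 132 healthy); fold-averaged metrics need not match pooled values exactly, so this is a plausibility check rather than a reconstruction of the confusion matrix in Table 6.

```python
def counts_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Convert rates into expected (possibly fractional) confusion-matrix counts."""
    tp = sensitivity * n_positive
    fn = n_positive - tp
    tn = specificity * n_negative
    fp = n_negative - tn
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all recordings classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and sensitivity, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Reported random forest results for PupilScreen and recording counts per cohort.
tp, fp, tn, fn = counts_from_rates(0.859, 0.88, n_positive=98, n_negative=132)
implied_accuracy = accuracy(tp, fp, tn, fn)
implied_f1 = f1_score(tp, fp, fn)
print(round(implied_accuracy, 3), round(implied_f1, 3))  # 0.871 0.85
```

Under these assumptions the implied accuracy (about 87%) and F1 score (about 0.85) agree with the reported values.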

Shapley additive explanation (SHAP) 16 bee-swarm violin plots were generated to determine how different pupillary light reflex (PLR) parameters contributed to the best-performing models. Figure 1 demonstrates SHAP diagrams for the smartphone pupillometer in order of performance (

Shapley additive explanation (SHAP) 16 bee-swarm violin plots were generated to determine how different pupillary light reflex (PLR) parameters contributed to the best-performing models. Figure 2 demonstrates the SHAP diagram for Neurological Pupil Index (NPi) value in the best-performing classifier based on area under the curve value (random forest). SHAP value indicates whether the parameter positively or negatively predicts severe traumatic brain injury (TBI). Feature (parameter) value indicates whether having a lower or higher value of that parameter positively or negatively predicts TBI in the PLR recording. For example, the darker color combined with a slightly positive SHAP value for NPi in this figure indicates that having a lower NPi has a slightly positive impact on prediction of the presence of severe TBI in this random forest model.
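To make the reading of SHAP values concrete, the sketch below computes exact Shapley values for a toy two-feature linear risk score by averaging marginal contributions over all feature orderings. The weights, feature values, and baseline are invented for illustration and are not taken from the study, and practical SHAP implementations typically use faster model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample.

    Features are revealed one at a time in every possible order; a feature
    that has not yet been revealed keeps its baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous_output = model(current)
        for j in order:
            current[j] = x[j]
            output = model(current)
            phi[j] += output - previous_output
            previous_output = output
    return [total / len(orders) for total in phi]


def toy_model(v):
    """Hypothetical risk score in which lower feature values raise the score."""
    return 1.0 - 0.2 * v[0] - 0.1 * v[1]


x = [1.0, 2.0]         # hypothetical sample with low feature values
baseline = [4.0, 3.0]  # hypothetical reference (e.g., cohort average) values
phi = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phi])  # [0.6, 0.1]
```

Both features sit below their baseline values and both receive positive Shapley values, mirroring the pattern described above in which a lower parameter value pushes the prediction toward severe TBI. The values also sum to the difference between the model output for the sample and for the baseline.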
PupilScreen (PS) Between-Cohort PLR Parameter Differences (Means) a
All means were significantly different (p < 0.05) between healthy and TBI cohorts.
PLR, pupillary light reflex; Max, maximum pupillary diameter; Min, minimum pupillary diameter; Lat, latency; CV, mean constriction velocity; MCV, maximum constriction velocity; DV, mean dilation velocity; TBI, traumatic brain injury.
Best-Performing PLR Parameter Combinations, Smartphone Pupillometer
PLR, pupillary light reflex; AUC, area under the curve; RF, random forest; Max, maximum pupillary diameter; Min, minimum pupillary diameter; Lat, latency; CV, mean constriction velocity; MCV, maximum constriction velocity; DV, mean dilation velocity.
Neurological Pupil Index Performance Across Classification Models
AUC, area under the curve; LR, logistic regression; SVM, support-vector machine; KNN, k-nearest neighbors; RF, random forest.
Confusion Matrix, Best-Performing Model (Random Forest—Max, CV, MCV, DV), Smartphone Pupillometer
Max, maximum pupillary diameter; CV, mean constriction velocity; MCV, maximum constriction velocity; DV, mean dilation velocity; sTBI, severe traumatic brain injury.
Confusion Matrix, Best-Performing Model (Random Forest), NPi
NPi, Neurological Pupil Index; sTBI, severe traumatic brain injury.
Discussion
An abnormal PLR may consist of either very little to no constriction in cases of sTBI 6 or reactive but abnormal levels of constriction in cases of concussion or mild TBI. 18 Current use of pupillometry is centered around in-hospital management of sTBI, 19 as compression or irritation of the 2nd or 3rd nerves caused by traumatic bleeding or increases in intracranial pressure may affect the PLR. 20 This is the likely mechanism by which the differences in PLR parameters between healthy subjects and subjects with sTBI are derived in this study.
We have demonstrated the ability of a smartphone pupillometer to perform quantitative digital pupillometry with classification accuracy (87%), sensitivity (85.9%), AUC (0.869), and F1 score (0.85) surpassing those of the NPi score (67.4%, 50.9%, 0.648, and 0.567, respectively) produced by the NPi-200 device that is frequently used in intensive care management of patients with severe TBI. 21 We report a diverse set of study subject demographics and eye colors, which is important to support the validity of these results in a wide range of potential patient populations. The accuracy of this smartphone pupillometer's neural network in tracking pupillary size changes has previously been demonstrated to be greater than that of manual penlight pupillometry and near-equivalent to that of digital infrared pupillometry. 6 These new results demonstrate diagnostic superiority in terms of accuracy, sensitivity, and AUC when compared with the NPi score produced by the digital infrared pupillometer, although the NPi score achieved a slightly higher specificity than the smartphone pupillometry PLR parameters (92.4% vs. 88%).
The PupilScreen app also demonstrated a lower sensitivity (85.9%) than specificity (88%), which may indicate a tendency of the technology toward a higher false negative rate than false positive rate and would support its use in helping to confirm the presence of sTBI. Previous results 11 obtained with the PupilScreen application in a much smaller sample of healthy and severe TBI patients, using healthcare provider interpretation of PLR curves, demonstrated superior accuracy (93% vs. 87%), sensitivity (94% vs. 85.9%), and specificity (92% vs. 88%) to the results in this paper. This suggests that adapting machine learning techniques for direct curve interpretation in both smartphone and digital infrared pupillometry may be fruitful in this binary classification task.
One limitation of this study is that direct PLR parameter comparison was not feasible, because mechanistic differences inherent in the form factor of the two technologies result in different units of measurement for the PLR parameters. Alternative technologies such as light detection and ranging (LIDAR) 22 may provide future ability to conduct such analyses if they are widely adopted in smartphone platforms. Another limitation is the class imbalance in our binary classification data with n = 98 sTBI PLR recordings on n = 33 subjects and n = 132 healthy recordings on n = 132 subjects. Although the number of recordings for each cohort is similar, we were unable to enroll the same volume of sTBI subjects due to a lower incidence and availability of sTBI meeting the required severity on GCS criteria in the dedicated neurological intensive care unit compared to the availability of healthy staff to enroll in this study. Despite this slight imbalance, our F1 score remains high at 0.85 (Table 4), indicating the robust nature of the results. Another limitation of this study is the potential for optic or oculomotor nerve injury due to craniofacial trauma such as an orbital fracture during the sTBI that may have confounded the PLR in these patients. A subgroup analysis for this factor was not performed and may be worthwhile in future studies.
The final limitation of this study is the lack of reporting of medication use at the time of PLR recordings conducted on sTBI subjects. The use of sedating and pain medications in this patient population is frequent, and the effect on PLR has been studied, with parameters such as maximum diameter, 23–25 percent change, 25 and mean constriction velocity 23 being observed to decrease in patients under the influence of continuous sedation and analgesia. These effects are dependent on the type of anesthetic combination used 25 and have varying effects on the NPi-200 proprietary NPi score, ranging from reports of no effect on the NPi score 26,27 to slight but significant reduction depending on dosage and combination of medications. 23,25 The PLR, although diminished, has been noted as robust and quantifiable despite analgesic administration 23,26–28 or even unaffected during analgesic administration. 29 These findings in the literature represent a limitation to the study of sTBI patients using pupillometry; however, this study seeks to compare the NPi score to the PupilScreen PLR parameters in terms of ability to classify subjects as healthy or having sTBI across the same cohort of subjects with the same likely use of analgesic and sedative medications in a neurological intensive care unit. It must also be noted that quantitative pupillometry, such as use of the NPi score and NPi-200 (or quantitative smartphone pupillometers such as PupilScreen in the future), is still routinely used for neurological assessment during analgesic or sedative administration, as it is considered better than no assessment when dealing with critical illness. Future study should elucidate the effects of specific analgesic and sedative medications on the ability of each of these technologies alone to detect sTBI, which is outside the scope of this study.
As the results of this study suggest, the PupilScreen app (Fig. 3) provides useful differentiation of pupillary status in sTBI patients and is positioned for future study in the detection of neuro-worsening in the TBI population. In the current clinical landscape, using the PLR to directly discriminate between healthy and severe TBI patients as in this study is not always clinically relevant; however, our results are important in that we demonstrate comparability between the PupilScreen app and the NPi-200 digital infrared pupillometer (and superiority of the smartphone pupillometry app in the healthy versus sTBI classification task). Because the NPi measure is considered a standard numerical indicator of pupillary dysfunction in the intensive care unit setting, benchmarking an app-based pupillometer against this established technology in different patient types (severe TBI and healthy controls) lends credibility to the application of this new technology in pupillometry. The use of the PLR to differentiate mild, 30–38 moderate, 39 and severe 28,39 TBI is currently of research interest in the field of pupillometry, with results indicating differences in various PLR parameters across the TBI spectrum. This study represents an initial step in validating the smartphone app against the current standard of the proprietary NPi score in the hospital setting with the long-term goal of establishing transparent thresholds of PLR parameters for TBI severity differentiation, clinical decision making, and triage assistance. Our findings represent a novel approach to the use of the PLR as a biomarker in the care of sTBI patients beyond the typical use in detecting herniation.

Screenshot of PupilScreen smartphone application design. App design is not complete; images are of test screens with artificial data. Panel
Conclusion
This study demonstrates the superiority in terms of discriminatory ability (AUC value) of PLR parameters produced by a smartphone pupillometer when compared with the NPi score produced by a digital infrared pupillometer in the classification of healthy and severe TBI subjects. These results guide future study of the utility of smartphone pupillometry and the NPi score in additional populations with neurological disease across the TBI spectrum and beyond, including clinician feedback and app design studies.
Footnotes
Authors' Contributions
AJM was responsible for data collection, manuscript writing and revisions, and data analysis. BG was responsible for data collection, data interpretation, and manuscript revisions. CL was responsible for data processing and manuscript revisions. DL was responsible for data collection and manuscript revisions. AM was responsible for data analysis and manuscript revisions. LBM and MRL were responsible for data analysis and interpretation, and manuscript revisions. All authors contributed to the article and approved the submitted version.
Funding Information
This work was not supported by any funding source.
Author Disclosure Statement
MRL: Consultant for Apertur, Medtronic, Aeaean Advisers, Metis Innovative; Equity interest in Proprio, Cerebrotech, Synchron, Hyperion Surgical, Fluid Biomed; Editorial board of Journal of NeuroInterventional Surgery and Frontiers in Surgery. AJM: Equity interest in Apertur. LBM: Co-founder with equity interest in Apertur.
For the other authors, no competing financial interests exist.
