Abstract
This paper describes a methodology for developing a new confidence metric to improve power grid operator reliance on ML event classifiers. Unlike traditional confidence scores that are generated by the ML, this confidence metric is generated by humans who have spent time studying the performance boundaries of the ML classifier. We refer to this metric as an Expert Derived Confidence (EDC) score. As an initial test of our methodology, four participants (3 Subject Matter Experts, 1 Novice) learned the boundaries of an ML’s performance by studying a subset of events in the ML’s training data. Next, the participants rated their confidence in the ML’s ability to classify similar events. The researchers found that all participants’ EDC scores were correlated with the ML’s own uncertainty quantification score and that, on average, EDC scores showed greater confidence in the ML’s ability to correctly classify events than the ML’s own confidence scores did. In addition, the average of EDC scores across all participants was the strongest predictor of model performance and predicted performance even after controlling for the ML’s own confidence.
Introduction
ML classifiers may help support power grid operators by identifying important events on the grid. However, even the best performing ML will occasionally misclassify events (Zhang, Liao & Bellamy, 2020). For example, imagery data with small distortions due to variability in format and image compression may cause a classifier to misclassify these images (Zheng, Song, Leung & Goodfellow, 2016). Power grid operators understand the classifier’s potential for error but may be less clear about when these errors are likely to occur and conversely when the tool’s classification can be trusted with an accurate recommendation.
To provide some guidance, developers can display a confidence score associated with each classification to communicate the tool’s confidence in its decision. Although confidence scores may serve as a useful guide for some classifications, for other events, the tool’s confidence may not be a good indicator of its own performance. For some classifiers, confidence scores can ‘deviate substantially from their true outcome probabilities’ (Zhang et al., 2020, p. 298). In addition, these confidence scores suffer from the same lack of transparency as the underlying classifications. The operator does not understand the underlying analytical processes that led to both the classifications and associated confidence scores. This lack of understanding may lead to automation bias if the operator replaces their own assessment of the situation with an automated system’s incorrect recommendation (Mosier, Skitka, Burdick & Heers, 1996). Wickens, Clegg, Vieane and Sebok (2015) found that automation bias experienced during a process control simulation degraded participants’ ability to diagnose faults in the automation and increased operator workload. On the other hand, lack of understanding may lead to algorithm aversion, a tendency to avoid recommendations offered by algorithms in favor of human judgment despite the potential performance benefits of relying on such technology (Dietvorst, Simmons & Massey, 2015).
The purpose of this research is to take steps toward developing an Expert Derived Confidence (EDC) score to accompany an ML classifier’s classification decision. This score is based on domain expert confidence in the ML’s performance for a particular classification. The EDC would provide a way for operators to receive guidance from a domain expert who also has a clear mental model of the error boundaries of the ML classifier. This EDC could help an end user calibrate their reliance on the classifier (e.g., accept correct classifications, reject incorrect classifications). Confidence scores derived from experts may be both more accurate at predicting model performance and better aligned with operators’ mental model of the system’s performance since the EDC scores were generated by humans and not an ML algorithm. This paper describes the work completed to develop an initial set of EDC scores, our comparison of the EDC scores to an ML’s own confidence scores and analysis of the EDC’s ability to predict model performance. We hypothesize that our EDC scores will significantly predict model performance over and above the model’s own confidence scores.
Method
Our initial research effort involves a Learning Phase for domain experts to learn the error boundaries of an ML classifier followed by a Scoring Phase for experts to rate their confidence in the model’s ability to accurately classify events. These ratings served as our initial set of EDC scores. We are developing this method using real-world power systems event data from the Eastern Interconnection Situational Awareness and Monitoring System (ESAMS) (Follum et al., 2020) to improve the generalizability of our method outside the laboratory.
Participants
The researchers collected EDC scores from four participants. Three of the four participants are experts in electric power systems. The Subject Matter Experts (SMEs) have PhDs in Electrical Engineering and Systems Engineering. All the SMEs have extensive experience working with and developing tools to support electric power systems and are very familiar with the phasor measurement unit (PMU) data used in this study. The fourth participant is a novice in the domain and unfamiliar with the data. This participant was included to compare expert-novice differences.
Machine Learning Classifier
The participants were asked to rate their confidence in a machine learning classifier’s ability to classify events into two categories: generator trip events and other frequency events. The classifier was a support-vector machine (SVM), a conventional machine learning algorithm that is successful in many applications (Cristianini, 2008). An SVM forms hyperplanes that separate training data into different categories and classifies new data according to those separations in the feature space. SVMs usually handle high-dimensional data well when using a radial basis function kernel, which is the kernel used in this study. During model training, repeated cross-validation was applied to fine-tune the SVM’s hyper-parameters and avoid over-fitting. This SVM classifier can also provide the ML’s own confidence scores, which were compared to the EDCs. These scores were based on the uncertainty quantifications provided by the softmax equation in the caret package in R (Kuhn, 2008).
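For reference, the textbook form of a radial basis function kernel and the resulting SVM decision function is given here. This is the generic formulation, not a description of the specific fitted model in this study; the symbols (the kernel width \gamma, the support-vector weights \alpha_i, the labels y_i in {-1, +1}, and the offset b) are standard notation introduced for illustration.

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right),
\qquad
f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b,
\qquad
\hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
```

The kernel width and the regularization strength are the kind of hyper-parameters that cross-validation is typically used to tune.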
Data
The machine learning model was trained on real-world PMU data of power system events from the Eastern Interconnection in the US. The events were labeled by a power system utility into two categories, generator tripping events and other frequency events, and these labels served as ground truth for this research. Among the common events on the power grid, generator trips usually warrant more attention from system operators than other events. The data set contains 42 generator tripping events and 165 other frequency events.
We selected a subset of 124 events from the training data for the Learning and Scoring Phases of our methodology. The performance of the ML was evaluated by exploring 5 different outcomes. We refer to these different outcomes as Event Types.
False Positives (FP) – The ML classified these events as Generator Trips but they were actually Other Frequency events.
Misses – The ML classified these events as Other Frequency events but they were actually Generator Trip events.
Near Neighbor FP – Each FP had two near neighbor events (one Generator Trip and one Other Frequency event). Since Near Neighbors may have different visual characteristics depending on their category, we thought it would be beneficial for participants to see both Near Neighbor events. The SVM categorizes events within a multi-dimensional latent space. The Near Neighbor FPs were closest in this latent space to the FP.
Near Neighbor Miss – Each Miss had two near neighbor events (one Generator Trip and one Other Frequency event). These events were closest in latent space to the Miss.
Exemplar – These events included only correctly classified events that were not near neighbors. Exemplars were also selected to be dissimilar from each other, based on their distance in latent space and visual inspection.
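The two misclassification outcomes defined above can be expressed as a small labeling function. This is a minimal sketch that treats Generator Trip as the positive class, as the definitions imply; the function name and the string labels are hypothetical and chosen only for illustration.

```python
GEN_TRIP = "generator_trip"
OTHER = "other_frequency"


def outcome(predicted: str, actual: str) -> str:
    """Label one classification, treating Generator Trip as the positive class."""
    if predicted not in (GEN_TRIP, OTHER) or actual not in (GEN_TRIP, OTHER):
        raise ValueError("unknown class label")
    if predicted == actual:
        return "correct"
    if predicted == GEN_TRIP:
        # Classified as a Generator Trip but actually an Other Frequency event.
        return "false_positive"
    # Classified as an Other Frequency event but actually a Generator Trip.
    return "miss"
```

Near Neighbor and Exemplar events are then drawn from the events labeled "correct".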
Our ML classifier was 85% reliable, which left far fewer misclassified events than correctly classified events to choose from. Of the 28 FPs in the training data, we selected 20. Although this was not the full set of FP events, it represented a large majority (71%), and we believe it reflected the diversity of FPs in the training data. The ML missed only 8 generator trip events. This small number made it manageable for the participants to work with the entire population of missed events, so we selected all 8 from the training data. Each misclassified event had two Near Neighbor events (one Generator Trip and one Other Frequency event) selected for our method. Near Neighbor events were identified using the dynamic time warping method of calculating similarities between time series data (Keogh, 2005).
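To illustrate how dynamic time warping can identify a near neighbor, the following is a minimal sketch of the classic dynamic-programming formulation with an absolute-difference local cost. The exact variant, windowing, and preprocessing used in this study are not specified, so the function names and choices here are assumptions.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats a point
                cost[i][j - 1],      # b advances, a repeats a point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def nearest_neighbor(target, candidates):
    """Return the key of the candidate series closest to target under DTW."""
    return min(candidates, key=lambda k: dtw_distance(target, candidates[k]))
```

Because warping lets one series repeat a point, two events with the same shape but slightly different timing receive a small distance, which is why the measure suits event time series.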
All correctly classified events that were not identified as Near Neighbors were eligible for selection as Exemplar events. The research team selected 10 Exemplar events for each class (Generator Trip Events, Other Frequency Events). To guide selection, we calculated the similarity of the Exemplar events using dynamic time warping and selected events with the highest degree of dissimilarity based on both dynamic time warping results and visual inspection. Dissimilarity was prioritized to provide the participants with a visually diverse sample of Exemplar events. All event types were divided equally into two groups and assigned to either the Learning or Scoring Phases (see Table 1).
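One way to automate the dissimilarity-based part of this selection is a greedy farthest-point heuristic over a precomputed pairwise distance matrix. This is a sketch of one possible heuristic, not the procedure used in this study, which also relied on visual inspection; the function name and the matrix input format are assumptions.

```python
def select_diverse(dist, k):
    """Greedily pick k indices so each new pick is far from those already picked.

    dist is a symmetric matrix (list of lists) of pairwise distances,
    for example dynamic time warping distances between candidate events.
    """
    n = len(dist)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of events")
    # Seed with the event that is farthest, in total, from all the others.
    selected = [max(range(n), key=lambda i: sum(dist[i]))]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Choose the event whose closest already-selected event is farthest away.
        nxt = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(nxt)
    return selected
```

A human reviewer could then inspect the selected events and swap out any that look redundant.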
Number of Events for the Learning and Scoring Phases by Class and Event Type
Note. NN = Near Neighbor, Gen refers to the Generator Trip Event Class, Other refers to the Other Frequency Event Class.
Procedure
Learning Phase
The research team used the software platform MURAL to conduct the Learning Phase with the participants remotely. MURAL is a digital white board that can be accessed by remote collaborators for synchronous and asynchronous collaboration and allows teammates to import and manipulate images. In the Learning Phase all five types of events were visually displayed on MURAL for a total of 62 events. Each event was labeled on MURAL according to its class (generator trip, other frequency event) and Type and was assigned a unique ID number within its Type. The ID numbers of the misclassified events matched their near neighbor event ID numbers so that the SME could track these associations on MURAL.
Using both the visual profile of the time series data and their knowledge of model performance for each event (i.e., correctly classified or misclassified), the SMEs organized the events into categories that were meaningful to them and provided a label for each category (see Figure 1). The sorting task performed during the Learning Phase is similar to card sort tasks commonly employed as techniques for understanding humans’ mental models (Smith-Jentsch, Campbell, Milanovich & Reynolds, 2001; Wright et al., 2020). This sorting task was intended to stimulate critical thinking and learning of the model’s performance boundaries by providing visual clues into why the model might be misclassifying particular events. The goal of this task was to develop a rich mental model that allows the participant to successfully predict when misclassification is likely to occur.

Portion of the MURAL board that shows the SME’s categorization of FP events into a Noise/Distortion group and a Spikes group (both highlighted in red).
The Learning Phase allowed the participant to see subtle distinctions between events that are typically misclassified and those that are similar, but correctly classified by the model (i.e., Near Neighbor events). This phase was largely self-guided because the researchers did not want to constrain the participants’ mental model development. On average participants spent 3 hours completing the Learning Phase.
Scoring Phase
The research team developed a questionnaire to capture participant EDC scores for each event in the Scoring Phase data set (see Table 1). EDCs were recorded on a questionnaire posted to the project’s Microsoft Teams page where it was accessed and completed by the participants. Each page of the questionnaire displayed a different event from the 62 Scoring Phase events and, unlike events in the Learning Phase, these events were not labeled to indicate their class or type. For each event, participants were asked to identify the event as either a Generator Trip or Other Frequency Event. Participants were also asked to provide a likelihood rating from 0 to 1 that the ML will correctly classify the event. This likelihood rating served as the EDC score for that event.
Results
Correlations
The researchers computed bivariate correlations to explore associations between participants as well as associations with the ML’s own uncertainty quantification score. Table 2 reveals EDC scores for individual participants were all significantly positively correlated with ML confidence scores. Correlations also revealed a fair amount of variability across participants. Only SME 2’s EDCs were significantly correlated with the other participants’ EDCs.
Correlations Between Confidence Scores.
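For readers who want to reproduce this kind of analysis, a bivariate (Pearson) correlation between two sets of confidence scores can be computed as follows. This is a minimal sketch; the function name is hypothetical, and it does not compute the significance test reported in Table 2.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / sqrt(sxx * syy)
```

Here x would hold one participant’s EDC scores and y either another participant’s EDC scores or the ML confidence scores for the same events.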
EDC Logistic Regressions
Logistic Regressions were computed with ML performance (1 = correctly classified, 0 = incorrectly classified) as the binary dependent variable. These regressions were computed to test EDC scores’ ability to predict model performance. When EDC scores for each participant were computed separately as individual predictors, all EDCs significantly predicted ML performance (see Table 3).
The p values for individual predictors and averages across participants.
Note. SME 1 & 2 Average = Mean EDC scores between SMEs 1 and 2, SME Average = Mean EDC scores across all SMEs, Human Average = Mean EDC scores across all participants.
The researchers were interested in combining participants’ EDCs into an average EDC to see if the average might cancel some idiosyncratic noise to produce a stronger predictor (i.e., wisdom of the crowd). Table 3 displays the p values for the logistic regressions. P values decreased with the addition of each participant, demonstrating increased predictive power. P values are reported instead of odds ratios in Table 3 because the researchers found the odds ratios for our logistic regression results difficult to interpret. Since the sample size was equivalent across all analyses, the researchers believe the p values were an acceptable and more interpretable statistic in this study.
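The averaging step and a single-predictor logistic regression can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the fit uses plain gradient ascent rather than the statistical software used in this study, and it estimates coefficients only, without the p values reported in Table 3.

```python
from math import exp


def mean_edc(scores_by_rater):
    """Average EDC scores event by event across raters (one list per rater)."""
    return [sum(vals) / len(vals) for vals in zip(*scores_by_rater)]


def fit_logistic(x, y, lr=0.5, steps=5000):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by gradient ascent.

    x holds confidence scores in [0, 1]; y holds 1 (correctly classified)
    or 0 (misclassified). Returns the intercept b0 and the slope b1.
    """
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * xi)))
            g0 += yi - p          # gradient of the log-likelihood w.r.t. b0
            g1 += (yi - p) * xi   # gradient of the log-likelihood w.r.t. b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1
```

A positive slope b1 means that higher average EDC scores go with a higher estimated probability that the ML classified the event correctly.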
Mean Comparisons
The researchers were interested in comparing the average human EDC score to the ML confidence score across different Event Types. A 2 x 3 Mixed ANOVA was computed to test the effects of Event Type (Exemplar, Near Neighbor, Misclassified) and Agent (Human, ML) on confidence score. We had unequal sample sizes across the three Event Types (n = 20, 28, and 14, respectively), but Levene’s test revealed homogeneity of variance. Results revealed a significant main effect for Event Type, F(2, 59) = 40.25, p < .001, ηp² = .58. Fisher’s Least Significant Difference multiple comparisons revealed that confidence scores for Exemplar events (M = .73, SD = .17; p < .001) and Near Neighbor events (M = .65, SD = .15; p < .001) were significantly higher than for Misclassified events (M = .43, SD = .20; see Figure 2). In addition, confidence in the ML’s ability to classify Exemplar events was significantly higher compared to Near Neighbor events (p < .012).

Mean Confidence Scores by Event Type.
The ANOVA also revealed a significant main effect for Agent, F(1, 59) = 108.88, p < .001, ηp² = .65. The average EDC scores were significantly higher (M = .71, SD = .13) than the ML confidence scores (M = .49, SD = .14; see Figure 2). The interaction was not statistically significant, F(2, 59) = .992, p = .377, ηp² = .03.
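As a consistency check, partial eta squared can be recovered from an F statistic and its degrees of freedom through the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies that identity to the three F tests reported above; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values taken from the reported ANOVA results.
event_type = partial_eta_squared(40.25, 2, 59)
agent = partial_eta_squared(108.88, 1, 59)
interaction = partial_eta_squared(0.992, 2, 59)
```

Rounded to two decimals, the three values match the reported effect sizes of .58, .65, and .03.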
Logistic Regression Results
To test our hypothesis, EDC scores and the ML’s uncertainty quantification score were included as predictors in the regressions. A hierarchical regression supported our hypothesis. The Human Average EDC score significantly predicted model performance even after controlling for variance associated with the uncertainty quantification score (see Table 4).
Logistic Regression with the Human Average EDC score as a predictor of Model Performance after controlling for ML Confidence.
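The hierarchical structure of this analysis can be written out as two nested models. The notation is introduced here for illustration: Y_i = 1 if the ML classified event i correctly, ML_i is the ML confidence score, and EDC_i is the Human Average EDC score. The likelihood-ratio statistic in the last line is one common way to test the added predictor; which test was actually used (likelihood-ratio or Wald) is not stated in this paper, so that line is an assumption.

```latex
\text{Step 1:}\quad \operatorname{logit} P(Y_i = 1) = \beta_0 + \beta_1\,\mathrm{ML}_i \\
\text{Step 2:}\quad \operatorname{logit} P(Y_i = 1) = \beta_0 + \beta_1\,\mathrm{ML}_i + \beta_2\,\mathrm{EDC}_i \\
\chi^2 = -2\left(\ell_{\text{Step 1}} - \ell_{\text{Step 2}}\right), \qquad df = 1
```

Here \ell denotes the maximized log-likelihood of each model, and the hypothesis corresponds to \beta_2 differing from zero with ML confidence already in the model.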
Discussion
Our hypothesis was supported. The average of the individual participant EDC scores was the strongest predictor of ML performance and significantly predicted model performance even after controlling for the ML’s uncertainty quantification score. Our results also revealed participant EDC scores were not all significantly correlated with each other but were all significantly associated with the ML’s uncertainty quantification score. Also, the average EDC scores were significantly higher than the ML’s uncertainty quantification scores. Humans were more confident in the ML’s performance than the ML was in its own performance.
Support for our hypothesis suggests that at least in some domains and for some classifiers humans can learn the performance boundaries of an ML classifier by spending a few hours studying the model’s training data. In addition, such studying (i.e., Learning Phase) may allow humans to outperform an ML’s own uncertainty quantification score. This finding is interesting in part because it suggests humans are identifying important variability in model performance that the ML’s own score is not capturing. These results are the first step toward demonstrating the utility of EDC scores as a guide for reliance on the ML.
Another interesting finding is that averaging EDC scores across multiple people appears to improve the predictive power of the EDC, although the benefit seems to come primarily from moving from one to two participants. This finding suggests that when generating EDC scores, it may be beneficial to collect data from more than one person. Interestingly, our novice participant’s EDC scores were able to significantly predict model performance. Perhaps domain expertise is not critical for learning the performance boundaries of an ML classifier, at least for some model-domain combinations. Although this finding is interesting, it is important to note that scores from more than a single novice are needed to make confident claims about expert/novice differences.
We also found that the EDC scores were on average .22 higher than the ML’s confidence scores on the 0 to 1 scale (M = .71 versus M = .49). However, both scores were sensitive to event type. Both the EDC and ML’s confidence dropped when faced with an event that the model misclassified.
Future Research
For our method to be feasible, it cannot require humans to score every event in the training data set. Instead, we are working to develop a technique to apply the scores from our Scoring Phase to similar unscored events. To determine which events in the dataset are similar to each scored event, we will compute the similarities between events. The research team will explore a few methods for calculating similarity/distance between two time series and choose the similarity calculation that most closely aligns with human perception.
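One simple form such a technique could take is nearest-neighbor propagation, where each unscored event inherits the EDC score of its most similar scored event. The sketch below is purely illustrative and is not the technique under development; the function names, the dictionary layout, and the placeholder distance are all assumptions, and any time series distance (such as dynamic time warping) could be passed in instead.

```python
def mean_abs_distance(a, b):
    """Placeholder distance: mean absolute pointwise difference over the overlap."""
    length = min(len(a), len(b))
    return sum(abs(x - y) for x, y in zip(a, b)) / length


def propagate_scores(scored, unscored, distance=mean_abs_distance):
    """Give each unscored event the EDC score of its most similar scored event.

    scored maps event id -> (time series, EDC score);
    unscored maps event id -> time series.
    """
    result = {}
    for event_id, series in unscored.items():
        nearest = min(scored, key=lambda sid: distance(series, scored[sid][0]))
        result[event_id] = scored[nearest][1]
    return result
```

Keeping the distance function as a parameter makes it straightforward to compare several similarity calculations against human judgments of similarity.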
Once we can successfully generalize EDC scores from the Scoring Phase to unscored events, we must examine the impact of EDC scores on operator reliance. The expectation is that if EDC scores are a strong predictor of model performance, displaying these scores to operators will improve operator decision making. User studies must be executed to provide empirical support for this hypothesis.
Limitations
This work is the first attempt at a methodology for developing EDC scores to support operator reliance decisions. There may be characteristics of both our ML model and the data we are using that allow humans to learn the model’s performance boundaries in only a few hours. It is likely that more complex models and/or data may not lend themselves as easily to this type of learning. In addition, more participants are needed to strengthen our conclusions. In particular, more novice participants are needed to determine the importance of domain expertise in developing EDC scores.
