Abstract
Objective
This paper aimed to investigate the robustness of driver cognitive workload detection based on electrocardiogram (ECG) when considering temporal variation and individual differences in cognitive workload.
Background
Cognitive workload is a critical component to be monitored for error prevention in human–machine systems. It may fluctuate instantaneously over time even in the same tasks and differ across individuals.
Method
A driving simulation study was conducted to classify driver cognitive workload underlying four experimental conditions (baseline, N-back, texting, and N-back + texting distraction) in two repeated 1-hr blocks. Heart rate (HR) and heart rate variability (HRV) were compared among the experimental conditions and between the blocks. Random forests were built on HR and HRV to classify cognitive workload in different blocks and for different individuals.
Results
HR and HRV were significantly different between repeated blocks in the study, demonstrating the time-induced variation in cognitive workload. The performance of cognitive workload classification across blocks and across individuals was significantly improved after normalizing HR and HRV in each block by the corresponding baseline.
Conclusion
The temporal variation and individual differences in cognitive workload affect ECG-based cognitive workload detection. However, normalization approaches that rely on appropriate baselines help compensate for the effects of temporal variation and individual differences.
Application
The findings provide insight into the value and limitations of ECG-based driver cognitive workload monitoring during prolonged driving for individual drivers.
Keywords
Introduction
Cognitive workload management is intrinsic to human error prevention in human–machine systems in various domains, such as surface transportation (Brookhuis & de Waard, 2010) and aviation (Kantowitz & Casper, 1988). However, it is challenging to precisely detect cognitive workload in the real world because of its multidimensional construct (Hancock & Caird, 1993; Wickens, 2008) and individual differences (Matthews et al., 2015). In a framework of cognitive workload measures proposed by Xie and Salvendy (2000), “instantaneous workload” was used to represent the continuous and dynamic aspects of workload and illustrated by a continuous index along the timeline. Liang et al. (2007) predicted the instantaneous workload of drivers across different tasks using a decision variable generated by support vector machine (SVM) models. However, studies addressing real-time cognitive workload detection remain limited.
Compared to subjective and performance measures, physiological measures are more promising to indicate instantaneous cognitive workload because of their sensitivity to workload, continuity, and unobtrusiveness (Cannon et al., 2012; de Waard, 1996; Kramer, 1991). Among the physiological measures, electrocardiogram (ECG) could be a promising measure according to a meta-analysis of cardiac measures of cognitive workload (Hughes et al., 2019). ECG can be recorded for driver state monitoring via sensors embedded in vehicle components, such as the steering wheel or driver seat (Koenig et al., 2015; Pinto et al., 2017). Studies in the human factors and signal processing communities have demonstrated that heart rate (HR) and heart rate variability (HRV) derived from ECG are sensitive to task loads in driving (e.g., Lohani et al., 2019; Mehler, Reimer, Wang, 2011), and have been used alone or combined with other physiological measures to detect driver cognitive workload (e.g., Miyaji et al., 2009; Tjolleng et al., 2017).
ECG or other physiological signals are usually collected in relatively short driving studies (less than 60 min) for cognitive workload detection (e.g., Solovey et al., 2014; Tjolleng et al., 2017). However, the short experimental setup cannot reveal the impacts of prolonged monotonous driving on driver states, such as declined vigilance (Schmidt et al., 2009), mind wandering (Baldwin et al., 2017), and fatigue (Thiffault & Bergeron, 2003). These impacts may influence the amount of attentional resources distributed to the same task over time, leading to dynamic fluctuations of cognitive workload in the same driving context. The change in cognitive workload over time is referred to as “temporal variation” in this paper and should be differentiated from “task-induced variation” (i.e., cognitive workload changed by tasks). Under prolonged driving, the temporal variation of cognitive workload may manifest through changes in physiological measures such as decreased HR (Schmidt et al., 2009). Thus, cognitive workload detection may be less reliable in prolonged driving if relevant algorithms are trained only with early stage driving data. This temporal aspect of cognitive workload has rarely been explored in previous studies.
In addition to temporal variation, cognitive workload is difficult to measure because of individual differences (Matthews et al., 2015), such as differences in expertise (Sarkar et al., 2019), stress (Conway et al., 2013), and cognitive efficiency (Yang & Ferris, 2018). Due to individual differences, drivers may perform differently in the same cognitive task, thereby impacting the generalization of driver cognitive workload classification to different individuals. The effects of these differences have been observed in the related field of driver sleepiness classification research where, for instance, sensitivity dropped significantly from 86.5% to 41.4% when the random forest classifier was trained and tested with the physiological data from different participants (Mårtensson et al., 2018). However, Solovey et al. (2014) assumed that the physiological response to workload is somewhat consistent across all drivers and proposed a more robust binary classification of cognitive workload among different drivers. It is necessary to understand the impacts of individual differences on cognitive workload detection.
It is important to detect cognitive workload reliably at different stages of driving for different drivers. Ideally, driver cognitive workload monitoring ought to accommodate the temporal variation and individual differences in cognitive workload. Driven by this goal, we first explored the measures of temporal variation of cognitive workload. Second, we investigated the impacts of temporal variation and individual differences on cognitive workload detection by evaluating the performance of cognitive workload classification at different stages of driving and for different drivers. Third, we explored methods to improve cognitive workload detection under the impacts of temporal variation and individual differences.
To achieve these goals, we conducted a driving simulation study that consisted of an alert driving session and a drowsy driving session; however, this paper is based only on the alert session to exclude drowsy components. The Karolinska Sleepiness Scale (KSS; Shahid et al., 2011) was used to verify the drivers' drowsiness state. Participants were exposed to the same experimental conditions repeated in the first and second hours of driving as a means of observing temporal variation in driver cognitive workload, which was measured using physiological responses and a subjective measure (NASA Task Load Index [NASA-TLX]). Raw ECG was recorded during the study and further processed into HR and HRV measures for statistical analysis and classification algorithm training and testing. The study evaluated the performance of ECG-based classification of cognitive workload involving temporal variation and individual differences. The findings will inform the value and limitations of ECG for developing a robust and generalized cognitive workload detector for driver safety.
Method
Participants
Seventy-five participants were recruited from the Monash University Accident Research Center (MUARC) simulator driver database to participate in the study. Full secondary task and ECG data were available for 57 participants (age: mean = 31, SD = 11.6; M = 38, F = 19; years of driving experience: mean = 2.9, SD = 10.1).
The study was approved by the Monash University Human Research Ethics Committee. The inclusion criteria included: (1) holding a current valid Australian driver license; (2) being in good health (no history of motion sickness, epilepsy, sleep problems/disorders, serious mental health conditions, illicit drug use or prescription drug abuse, smoking, or excess alcohol consumption); (3) not taking medications (prescription or otherwise) that have a sedative effect; (4) not working night shift or having traveled across time zones greater than 3 hr within the last 2 weeks; (5) not consuming more than five caffeinated beverages on average per day; and (6) not being pregnant. All participants provided informed consent.
Apparatus
The experiment was conducted using the Advanced Driving Simulator at MUARC, comprising a Holden Calais cab on a three degree-of-freedom motion platform in a controlled room (Figure 1; Tomasevic et al., 2019). The simulated driving scenario was projected onto a 180° cylinder forward screen and a flat rear screen.

Figure 1. The setup of the Advanced Driving Simulator at MUARC. Note. MUARC = Monash University Accident Research Center.
ECG data were sampled at 500 Hz through a “dry pad electrode” (http://cognionics.com/index.php/59-products/sensors) that was attached slightly right of the sternum (above the breast bone) by surgical tape underneath participants’ clothing. Extra tape was used on outer clothing to help keep this dry pad sensor in place. The ECG signal collected from the dry pad electrode was transferred through the Cognionics Quick-20 EEG headset to computer storage for off-line analysis. Data from other sensors, including Seeing Machines’ Driver Monitoring System (DMS) and additional camera-based sensors, are not reported here.
Secondary Task
The secondary tasks included an N-back task, a texting task, and a combined N-back + texting task. The N-back task required participants to listen to a prerecorded audio sequence of letters (1-s letters with a 1-s interval between) and press a finger-mounted button when a given letter had been presented two letters earlier. The texting task was self-paced and required participants to read pregenerated text messages that contained phrases with one or multiple words missing, and type in the missing word(s) or “pass” to complete the phrase through a tablet mounted on the console (in Block 1) or right-side door (in Block 2). The texting task demanded drivers’ cognitive resources along with engaging their visual attention and manual control. The combined task required participants to do the N-back and texting tasks simultaneously. The N-back task has been shown to be an effective manipulation of cognitive workload (Mehler, Reimer, Dusek, 2011), whereas the texting task represents driver distraction in real-world driving.
Procedure
The experiment comprised a 2-hr simulated drive on simple country roads with light traffic. During the drive, participants were required to complete a 30-min baseline drive without distraction tasks (the baseline condition), followed by the 10-min blocks of N-back, texting, and N-back + texting tasks in Blocks 1 and 2 (Figure 2).

Figure 2. The procedure of the driving simulation study. N, T, and N + T represent N-back, texting, and N-back + texting task conditions, respectively.
The baseline condition always came first in each block to ensure that reliable physiological metrics could be established during low workload driving prior to measuring responses to increased workload. This design also provided participants some respite after a series of tasks in Block 1, so a baseline-to-baseline comparison between blocks became feasible. The order of the three task conditions in both blocks was balanced among the participants. The KSS was administered at the beginning of the study (KSS1), the end of the two baseline conditions (KSS2 and KSS4), and the end of both blocks (KSS3 and KSS5; see Figure 2). NASA-TLX for each task condition was collected at the end of Block 2 but not reported in this paper due to our focus on ECG-based measures.
Data Analysis
ECG data processing
ECG was processed into RR intervals (i.e., intervals between successive heartbeats) using the Python module biosppy.signals.ecg. Next, the abnormal beats (mostly caused by noise in the ECG) in the RR intervals were removed using the Python function hrvanalysis.remove_ectopic_beats. The remaining RR intervals without abnormal beats are called NN (Normal-to-Normal) intervals (Shaffer & Ginsberg, 2017). If the NN intervals in an experimental task condition were less than 80% of the original RR intervals, all NN intervals in this experimental condition were excluded from analysis (Peltola, 2012). As a result, after data processing, the filtered NN intervals became imbalanced (Table 1). To utilize most available data, we analyzed the NN intervals from 45 (out of total 57) participants whose NN intervals were available in the baseline and at least one task condition in either Block 1 or 2. Among these 45 participants, 41 provided data in Block 1 and 39 provided data in Block 2 (41 and 39 are the total number of participants in the experimental condition sets indicated by ✓ in Table 1).
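The 80% retention rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: removed (ectopic or noisy) beats are assumed to be marked as `None`, the retained fraction is assumed to be computed by beat count, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence

RETENTION_THRESHOLD = 0.80  # conditions keeping < 80% of beats are excluded


def retained_fraction(rr_flagged: Sequence[Optional[float]]) -> float:
    """Fraction of the original RR intervals that survive as NN intervals."""
    if not rr_flagged:
        return 0.0
    kept = sum(1 for rr in rr_flagged if rr is not None)
    return kept / len(rr_flagged)


def nn_intervals_or_none(
    rr_flagged: Sequence[Optional[float]],
) -> Optional[List[float]]:
    """Return the NN intervals (ms), or None if the condition is excluded."""
    if retained_fraction(rr_flagged) < RETENTION_THRESHOLD:
        return None  # too many abnormal beats: drop the whole condition
    return [rr for rr in rr_flagged if rr is not None]


# Hypothetical RR series in milliseconds; None marks a removed beat.
mostly_clean = [800, 810, None, 795, 805, 790, 800, 805, 810, 800]
too_noisy = [800, None, None, 795, None, 790, 800, None, 810, 800]
print(nn_intervals_or_none(mostly_clean) is not None)  # condition kept
print(nn_intervals_or_none(too_noisy) is None)         # condition excluded
```

Applying the rule per participant, block, and condition is what produces the unequal numbers of usable conditions summarized in Table 1.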
Table 1. The Number of Participants With Available NN Intervals in Different Sets of Experimental Conditions
Note. NN = Normal-to-Normal. Experimental conditions include one baseline condition and three task conditions. ✓ indicates the sets of experimental conditions in which the NN intervals were chosen for analysis.
The Python module hrvanalysis was used to generate a list of HR and HRV measures that quantify autonomic dysregulation (abnormal coordination between the sympathetic and parasympathetic nervous systems) to indicate declines in attention and cognitive performance (Hughes et al., 2019). For statistical analysis, HR and HRV were calculated from the NN intervals in each experimental condition (30 min for baseline and 10 min for each task condition) in each block for each participant. The HRV measures were based on the time-domain, frequency-domain, and nonlinear-domain analyses (Table 2).
Table 2. The Definition, Unit, and Minimal ECG Period of the Selected HR and HRV Measures
Note. SNS represents sympathetic nervous system and PNS represents parasympathetic nervous system. ECG = electrocardiogram; HR = heart rate; HRV = heart rate variability; NN = Normal-to-Normal.
Source. Shaffer and Ginsberg (2017).
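To make the time-domain measures concrete, the sketch below computes a few of them from a list of NN intervals using only the Python standard library. It is an illustration rather than the hrvanalysis implementation used in the study; the function name is hypothetical, mean HR is assumed to be the average of the instantaneous rates (60,000 / NN in ms), and SDNN is assumed to use the sample standard deviation.

```python
import math
import statistics
from typing import Dict, Sequence


def time_domain_hrv(nn_ms: Sequence[float]) -> Dict[str, float]:
    """Compute selected HR and time-domain HRV measures from NN intervals (ms)."""
    if len(nn_ms) < 3:
        raise ValueError("need at least three NN intervals")
    # Successive differences between adjacent NN intervals.
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]
    mean_nn = statistics.fmean(nn_ms)
    sdnn = statistics.stdev(nn_ms)  # sample standard deviation (assumption)
    rmssd = math.sqrt(statistics.fmean([d * d for d in diffs]))
    pnn50 = 100.0 * sum(1 for d in diffs if abs(d) > 50) / len(diffs)
    return {
        "mean_hr": statistics.fmean([60000.0 / nn for nn in nn_ms]),  # bpm
        "sdnn": sdnn,             # ms
        "rmssd": rmssd,           # ms
        "pnn50": pnn50,           # %
        "cvnn": sdnn / mean_nn,   # unitless
    }


# Hypothetical NN series alternating between 800 and 860 ms.
measures = time_domain_hrv([800, 860, 800, 860])
print(measures)
```

The frequency-domain and nonlinear measures in Table 2 require spectral estimation and Poincaré-plot geometry, respectively, and are not covered by this sketch.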
Statistical analysis
The imbalanced HR and HRV were analyzed using Type III ANOVAs based on the linear mixed-effect models (Bagiella et al., 2000) in R package lmerTest (Kuznetsova et al., 2015), in which experimental condition (four experimental conditions) and block (two blocks) were the fixed factors and participant was the random factor. The main and interaction effects on HR and HRV were analyzed using Tukey post hoc tests in R package emmeans (Lenth et al., 2018). The partial eta squared for the main and interaction effects was calculated using R package effectsize (Ben-Shachar et al., 2020). The Wilcoxon signed rank tests were used to compare all KSS scores to the threshold of drowsiness (KSS = 7; Mårtensson et al., 2018). All statistical analyses were conducted in R 4.0.2. Statistical significance was set at p < .05.
Classification
Data augmentation
To perform the classification, HR and HRV were calculated over each sample of NN intervals. The NN intervals in each experimental condition were sampled by a fixed time window sliding from the beginning to the end of the experimental condition. The overlap between time windows was 95% (i.e., 95% of the data were shared by consecutive time windows) given that 95% overlap was found to yield the best-performing driver cognitive distraction classification (Liang et al., 2007). Also, the high overlap corresponds to a high sampling rate, which is beneficial for real-time monitoring. Sliding time windows of five lengths (1, 2, 3, 4, and 5 min) were applied for data augmentation because the minimal ECG period required for HRV measures varied from 1 to 5 min. The five window lengths generated five datasets. Table 3 shows the total number of augmented samples of NN intervals in each dataset corresponding to the time window. These numbers were imbalanced due to longer baseline durations and data filtering.
Table 3. The Total Number of NN Interval Samples by 95% Overlapping Time Windows (1–5 Min) in Each Experimental Condition
Note. B, N, T, and N + T represent the baseline, N-back, texting, and N-back + texting conditions, respectively.
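The sliding-window sampling can be sketched as below. This is a minimal illustration, assuming that each NN interval is assigned to a window by the time at which the interval ends and that only complete windows are kept; the function name and these boundary conventions are assumptions, not details reported for the study.

```python
from itertools import accumulate
from typing import List, Sequence


def sliding_windows(
    nn_ms: Sequence[float], window_s: float, overlap: float = 0.95
) -> List[List[float]]:
    """Split an NN series (ms) into overlapping fixed-length time windows."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    window_ms = window_s * 1000.0
    # Step between window starts; rounded to avoid floating-point drift.
    step_ms = round(window_ms * (1.0 - overlap), 6)
    beat_times = list(accumulate(nn_ms))  # end time of each NN interval
    total_ms = beat_times[-1] if beat_times else 0.0

    windows = []
    k = 0
    while k * step_ms + window_ms <= total_ms:
        start = k * step_ms
        end = start + window_ms
        windows.append(
            [nn for t, nn in zip(beat_times, nn_ms) if start < t <= end]
        )
        k += 1
    return windows


# Hypothetical 2-min record at a constant 60 bpm (NN = 1000 ms).
samples = sliding_windows([1000.0] * 120, window_s=60)
print(len(samples), len(samples[0]))
```

With a 1-min window and 95% overlap, consecutive windows start 3 s apart, so the 2-min record above yields 21 windows of 60 beats each. HR and HRV features would then be computed separately on each window.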
Training and testing
Random forests proposed by Breiman (2001) have been used for driver cognitive workload classification. A random forest is a classifier consisting of a collection of tree-structured classifiers. Each tree is independently built using a bootstrap sample of the training data and casts a unit vote for the most popular class. We chose random forests because they are resistant to overfitting (Breiman, 2001) and achieved the best performance for driver distraction classification in algorithm comparisons (McDonald et al., 2020).
Since human data sampled by overlapping time windows are not i.i.d. (i.e., independent and identically distributed), classifiers built on such datasets should not be validated within subjects in the same conditions to avoid overfitting (Dehghani et al., 2019). Thus, the dataset at each time window was split into the training and test sets in two ways: (1) cross-block (training with 100% of the data in Block 1, testing on 100% of the data in Block 2), and (2) cross-subject (training with data of 37 participants, testing on data of eight other participants). The impacts of temporal variation and individual differences on multiclass classification (four experimental conditions) were revealed by the classification performance in the cross-block and cross-subject tests, respectively. Balanced class weights were used in random forests to handle the imbalanced sample size (Table 3).
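The two splitting schemes can be illustrated with the sketch below. The `Sample` record and function names are hypothetical, and the random selection of the eight held-out participants is an assumption, since the selection procedure is not described here.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple


class Sample(NamedTuple):
    participant: str            # participant identifier
    block: int                  # 1 or 2
    condition: str              # "B", "N", "T", or "N+T"
    features: Tuple[float, ...]  # HR/HRV values for one time window


def cross_block_split(
    samples: Sequence[Sample],
) -> Tuple[List[Sample], List[Sample]]:
    """Train on every Block 1 sample, test on every Block 2 sample."""
    train = [s for s in samples if s.block == 1]
    test = [s for s in samples if s.block == 2]
    return train, test


def cross_subject_split(
    samples: Sequence[Sample], n_test_participants: int = 8, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Hold out whole participants so no person appears in both sets."""
    participants = sorted({s.participant for s in samples})
    held_out = set(random.Random(seed).sample(participants, n_test_participants))
    train = [s for s in samples if s.participant not in held_out]
    test = [s for s in samples if s.participant in held_out]
    return train, test


# Hypothetical dataset: 45 participants, two blocks, one sample each.
data = [
    Sample(f"p{i:02d}", block, "B", (70.0, 40.0))
    for i in range(45)
    for block in (1, 2)
]
block_train, block_test = cross_block_split(data)
subj_train, subj_test = cross_subject_split(data)
print(len({s.participant for s in subj_train}),
      len({s.participant for s in subj_test}))
```

Splitting by block or by participant, rather than by individual window, keeps highly overlapping windows from the same recording out of both the training and test sets.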
We also converted the multiclass classification into the binary classification of baseline versus distraction (combining all distracting conditions) for evaluation. Random search with the five-fold cross-validation was used to determine the optimal estimates of the number of trees (between 10 and 200) and the maximum depth of trees (between 1 and 10) of a random forest built on each training set.
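The binary relabeling and the class weighting can be sketched as follows. The label strings and function names are hypothetical. The weighting shown is the common heuristic n_samples / (n_classes × class count), which is what scikit-learn's `class_weight="balanced"` option computes; that this exact option was used in the study is an assumption.

```python
from collections import Counter
from typing import Dict, List, Sequence

DISTRACTION_CONDITIONS = {"N", "T", "N+T"}  # hypothetical label strings


def to_binary(labels: Sequence[str]) -> List[str]:
    """Merge the three task conditions into a single distraction class."""
    return [
        "distraction" if label in DISTRACTION_CONDITIONS else "baseline"
        for label in labels
    ]


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical window labels: the 30-min baseline yields three times as
# many samples as each 10-min task condition.
multiclass = ["B"] * 3 + ["N", "T", "N+T"]
print(balanced_class_weights(multiclass))
print(balanced_class_weights(to_binary(multiclass)))
```

In this toy example the baseline class is down-weighted in the four-class problem, whereas the merged binary problem is already balanced, which mirrors why the binary dataset is the more balanced of the two.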
Classification performance evaluation
The AUC-ROC score is the area under the receiver operating characteristic (ROC) curve (i.e., a plot of the true positive rate against the false positive rate at various threshold settings). When the AUC-ROC score is close to 1, the classifier has stronger diagnostic capacity, that is, a higher true positive rate and a lower false positive rate. The AUC-ROC scores of random forests in the cross-block and cross-subject tests were first compared with a random guess (AUC-ROC score = .5) and then compared among different time windows by the Friedman test. Next, the precision and recall with confidence interval for each classifier were estimated based on the bootstrap method (i.e., resampling a dataset with replacement). The random forests in the multiclass and binary classifications were built and evaluated using the Python scikit-learn package (Pedregosa et al., 2011). The performance between classifiers at the 5-min time window was compared using McNemar’s test in R (Dietterich, 1998).
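The bootstrap estimate of precision and recall can be sketched as below. This is an illustration under assumptions: a percentile interval is used, test samples are resampled individually, and the number of resamples (1,000) and the function names are hypothetical choices rather than reported settings.

```python
import random
from typing import List, Sequence, Tuple


def precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[float, float]:
    """One-vs-rest precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def percentile_interval(values: List[float], alpha: float) -> Tuple[float, float]:
    """Percentile confidence interval from a list of bootstrap estimates."""
    ordered = sorted(values)
    lo = ordered[int((alpha / 2) * len(ordered))]
    hi = ordered[min(len(ordered) - 1, int((1 - alpha / 2) * len(ordered)))]
    return lo, hi


def bootstrap_precision_recall(
    y_true, y_pred, positive, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0
):
    """Resample test predictions with replacement and return two intervals."""
    rng = random.Random(seed)
    n = len(y_true)
    precisions, recalls = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        p, r = precision_recall(
            [y_true[i] for i in idx], [y_pred[i] for i in idx], positive
        )
        precisions.append(p)
        recalls.append(r)
    return percentile_interval(precisions, alpha), percentile_interval(recalls, alpha)


# Hypothetical test-set labels and predictions.
truth = ["B", "B", "N", "N"] * 10
preds = ["B", "N", "N", "N"] * 10
point = precision_recall(truth, preds, positive="N")
precision_ci, recall_ci = bootstrap_precision_recall(truth, preds, positive="N")
print(point, precision_ci, recall_ci)
```

Because overlapping windows are correlated, resampling individual windows may understate the uncertainty; resampling whole participants would be a more conservative alternative.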
Data Normalization to Enhance Classification
In previous studies, HR and HRV have been normalized to adjust for individual differences in heart response for driver cognitive workload classification (e.g., Tjolleng et al., 2017). To handle both temporal variation and individual differences, HR and HRV were also normalized in this study using HR and HRV averaged over the baseline conditions within each participant. We adopted two types of normalization: (1) “traditional baseline” normalization to represent the normalization strategy adopted by most studies, in which HR and HRV were divided by the baseline collected at the beginning of the study (Baseline 1); and (2) “multistage” normalization, in which HR and HRV in each block were divided by the corresponding baseline in the same block (i.e., HR and HRV in Blocks 1 and 2 were normalized by Baselines 1 and 2, respectively). The data normalization was completed in Python 3.7.0.
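The difference between the two normalization strategies can be illustrated with the following sketch. The data layout (a baseline table keyed by participant and block), the feature names, and the function name are hypothetical; only the division by Baseline 1 versus the same-block baseline follows the description above.

```python
from typing import Dict, Tuple

Features = Dict[str, float]
BaselineTable = Dict[Tuple[str, int], Features]  # (participant, block) -> means


def normalize_features(
    features: Features,
    participant: str,
    block: int,
    baselines: BaselineTable,
    strategy: str = "multistage",
) -> Features:
    """Divide each measure by the participant's baseline average."""
    if strategy == "traditional":
        reference_block = 1      # always Baseline 1
    elif strategy == "multistage":
        reference_block = block  # baseline from the same block
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    reference = baselines[(participant, reference_block)]
    return {name: value / reference[name] for name, value in features.items()}


# Hypothetical baseline averages for one participant: HR drops and
# SDNN rises from Block 1 to Block 2.
baselines = {
    ("p01", 1): {"mean_hr": 80.0, "sdnn": 40.0},
    ("p01", 2): {"mean_hr": 75.0, "sdnn": 50.0},
}
window = {"mean_hr": 78.0, "sdnn": 45.0}  # one N-back window in Block 2

traditional = normalize_features(window, "p01", 2, baselines, "traditional")
multistage = normalize_features(window, "p01", 2, baselines, "multistage")
print(traditional, multistage)
```

In this toy case, the Block 2 window looks like lower HR and higher HRV than baseline under traditional normalization, but higher HR and lower HRV than baseline under multistage normalization, because the reference itself has drifted between blocks.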
Results
HR/HRV Statistical Analysis
KSS
In the KSS, seven is used as the threshold of drowsiness (Mårtensson et al., 2018). The five KSS scores (mean KSS over 45 participants with available NN interval data: KSS1 = 2.56, KSS2 = 4.22, KSS3 = 3.78, KSS4 = 4.82, KSS5 = 3.95) were significantly lower than seven (indicating sleepy but no effort to keep awake; Shahid et al., 2011) according to the Wilcoxon signed rank tests (p < .001 for all comparisons).
Task effect
ANOVAs showed that the HRV was significantly different among the experimental conditions (Table 4), whereas mean HR was not affected by the experimental conditions. Three time-domain analyses (SDNN, CVNN, and triangular index), four frequency-domain analyses (VLF power, LF power, HF power, and total power), and three nonlinear domain analyses (CSI, modified CSI, and CVI) showed that HRV was significantly higher in the baseline condition than some or all task conditions (for details, see Table 4). In addition, HRV was higher in the N-back condition than the N-back + texting condition using the SDNN, CVNN, VLF power, total power, CSI, and modified CSI metrics.
Table 4. The Measures of HR and HRV in Each Experimental Condition
Note. HR = heart rate; HRV = heart rate variability; RMSSD = root mean square of successive differences between normal heartbeats; SDNN = standard deviation of NN intervals; SDSD = standard deviation of successive NN interval differences; pNN50 = percentage of successive NN intervals that differ by more than 50 ms; pNN20 = percentage of successive NN intervals that differ by more than 20 ms; CVSD = coefficient of variation of successive NN interval differences; CVNN = coefficient of variation of successive NN intervals; VLF Power = absolute power of very low frequency band; LF Power = absolute power of low frequency band; HF Power = absolute power of high frequency band; LF/HF = ratio of LF-to-HF power; CSI = cardiac sympathetic index; CVI = cardiac vagal index. The numbers inside the parentheses are the standard error of mean for significantly different variables. B, N, T, and N + T represent the baseline, N-back, texting, and N-back + texting conditions, respectively.
Temporal effect
ANOVAs showed the significant temporal effect on HR and HRV. Mean HR was significantly higher in Block 1 than Block 2, whereas all HRV measures except LF/HF and CSI were significantly lower in Block 1 than Block 2 (Table 5).
Table 5. The Measures of HR and HRV in Blocks 1 and 2
Note. HR = heart rate; HRV = heart rate variability; RMSSD = root mean square of successive differences between normal heartbeats; SDNN = standard deviation of NN intervals; SDSD = standard deviation of successive NN interval differences; pNN50 = percentage of successive NN intervals that differ by more than 50 ms; pNN20 = percentage of successive NN intervals that differ by more than 20 ms; CVSD = coefficient of variation of successive NN interval differences; CVNN = coefficient of variation of successive NN intervals; VLF Power = absolute power of very low frequency band; LF Power = absolute power of low frequency band; HF Power = absolute power of high frequency band; LF/HF = ratio of LF-to-HF power; CSI = cardiac sympathetic index; CVI = cardiac vagal index. The numbers inside the parentheses are the standard error of mean for significantly different variables.
Interaction effect
The interaction effect between experimental condition and block was significant on VLF power, LF power, total power, and triangular index (Table 6). Tukey post hoc analysis showed that the frequency-domain HRV measures, such as VLF power, LF power, and total power, were not significantly different among the experimental conditions in Block 1, but in Block 2 they were much higher in the baseline condition than the task conditions. Moreover, VLF and total power were significantly higher in the N-back condition than the combined task condition in Block 2.
Table 6. The Measures of HRV in Each Experimental Condition Between Blocks 1 and 2
Note. HRV = heart rate variability; VLF Power = absolute power of very low frequency band; LF Power = absolute power of low frequency band. The numbers inside the parentheses are the standard error of mean for significantly different variables. BL represents block. B, N, T, and N + T represent the baseline, N-back, texting, and N-back + texting conditions, respectively.
Classification Performance
This section evaluates the performance of random forest classifiers built on the HR and HRV measures (analyzed in the section above) in the cross-block test and cross-subject test.
Cross-block test (100% data in Block 1 for training, 100% data in Block 2 for testing)
All AUC-ROC scores in the cross-block test were above 0.5, indicating that the random forests performed better than the random guess (Table 7). The Friedman tests showed that AUC-ROC scores were significantly different among the time windows regardless of the normalization type. The AUC-ROC scores were higher when the time window was 4 or 5 min.
Table 7. The AUC-ROC Score of Multiclass Classification in the Cross-Block Test at Each Time Window With Different Types of Normalization
Note. AUC-ROC = area under the receiver operating characteristic curve. B1 represents Baseline 1 and B2 represents Baseline 2.
Given that random forests achieved the best performance with longer time windows, we analyzed the classification of each class at the 5-min time window using precision (i.e., the fraction of detected driver states that are relevant) and recall (i.e., the fraction of relevant driver states that are detected). Precision and recall are more appropriate than accuracy for evaluating classification when the data are highly imbalanced. In the cross-block test, McNemar’s test showed that random forests’ performance was significantly different when the two normalization strategies were applied (p < .001). Figure 3 shows that the normalization by Baselines 1 and 2 significantly increased the precision in the baseline and combined conditions and led to higher recall in all experimental conditions.

Figure 3. The precision and recall for each class in the cross-block test (at 5-min time window) with different types of normalization. The error bars represent the bootstrap 95% confidence interval.
Cross-subject test (37 participants for training, eight participants for testing)
Table 7 shows that all AUC-ROC scores in the cross-subject test were above 0.5, indicating that the random forests performed better than the random guess. The Friedman tests showed that AUC-ROC scores were significantly different among the time windows regardless of the normalization type. The highest AUC-ROC score was 0.70 when the time window was 5 min with the B1 and B2 normalization.
In the cross-subject test, McNemar’s tests showed that random forests performed significantly differently when the two normalization strategies were applied (p < .001). Figure 4 shows that the normalization by Baseline 1 increased the precision and recall in the baseline condition but did not improve the measures in other conditions. The multistage normalization by Baselines 1 and 2 increased the precision in the baseline condition and recall in the baseline and combined conditions.

Figure 4. The precision and recall for each class in the cross-subject test (at 5-min time window) with different types of normalization. The error bars represent the bootstrap 95% confidence interval.
Binary classification
The multiclass classification was converted into the binary classification by relabeling the N-back, texting, and N-back + texting conditions as the distracting condition. In this more balanced dataset, the binary classification accuracy was 65.3%, 66.7%, and 75.9% with no normalization, Baseline 1 normalization, and multistage normalization, respectively. McNemar’s test showed that only the multistage normalization strategy significantly improved the binary classification performance (p < .001), whereas the Baseline 1 normalization did not (p = .927). Moreover, the multistage normalization led to better classification performance than the Baseline 1 normalization (p = .001). According to Figure 5, when the multistage normalization was applied, the maximal recall and precision were increased to 0.82 in the baseline condition and 0.64 in the distracting condition.

Figure 5. The precision and recall for the binary class in the cross-subject test (at 5-min time window) with different types of normalization. The error bars represent the bootstrap 95% confidence interval.
Discussion
Cognitive workload is an important component to be monitored for operational safety in human–machine systems. The continuous and unobtrusive characteristics of physiological measures make them a potentially practical approach to measure cognitive workload in real time. Physiological measures are sensitive to driver cognitive workload; however, a precise measure of cognitive workload is still challenging due to the change over time under the same task load and the susceptibility to individual differences (varying between different people). It is necessary to determine how well physiological-based cognitive workload classifiers perform over time and between individuals. This study evaluated the performance of ECG-based HR and HRV parameters as surrogates of cognitive workload in driving scenarios.
Ten out of the 16 HRV measures significantly decreased by the increased cognitive demand from external task processing. These sensitive measures came from all categories of analysis: time-domain (SDNN, CVNN, and triangular index), frequency-domain (VLF, LF, HF, and total powers), and nonlinear-domain (CSI, modified CSI, and CVI; see Table 4). The changes in most HRV measures indicated reduced activities in the parasympathetic nervous system (according to Table 2) as the result of imposing N-back, texting, and N-back + texting task loads. In contrast, mean HR was not significantly different among task conditions. Mehler, Reimer, Wang (2011), however, found that basic HR was more sensitive and robust than HRV to driver cognitive workload generated by the secondary N-back task in on-road driving. This difference highlights the challenge of generalizing the sensitivity of physiological measures to cognitive workload in different task conditions, by various workload manipulations, and across different individuals (Matthews et al., 2015; Mehler, Reimer, Wang, 2011). Hughes et al. (2019) concluded that there is no one “best” cardiac measure of cognitive workload and recommend practitioners using multiple measures for a comprehensive understanding of cognitive workload.
The temporal variation of cognitive workload and possible mental fatigue in more than 1 h driving were revealed by the changes in HR and HRV under the same task loads over time. Since participants were alert in the study according to KSS, drowsiness was excluded from the factors that changed HR and HRV in Block 2. In the second hour, decreasing HR reflected reduced activities in the sympathetic nervous system and 14 increasing (out of 16) HRV measures indicated stronger activities of the parasympathetic nervous system, suggesting that the tasks in the second hour may be less challenging and require less mental workload than the first hour due to higher familiarity with the driving simulator control and secondary tasks (Yanko & Spalek, 2013). Thus, decreasing HR and increasing HRV in Block 2 was perhaps associated with the decline of cognitive workload under the same task conditions. It is worth noting that mental fatigue may exist even in the alert drivers in Block 2 as a result of the confounding “time-on-task” effect, and may affect the autonomic nervous system and influence HR and HRV (Borghini et al., 2014).
As a result of temporal variation, random forests trained with HR and HRV in Block 1 did not perform well in the cross-block test (Figure 3). The HR and HRV in Block 2 changed significantly (Table 5), and some HRV measures were more sensitive to cognitive workload in Block 2 (Table 6). Therefore, HR and HRV in Block 2 did not fit the decision boundary learned by random forests from Block 1. Also, the performance of random forests was not much improved after normalization using Baseline 1 only, a technique used in previous studies (e.g., Healey & Picard, 2005; Vicente et al., 2011). These findings show the limitations of a driver cognitive workload classifier built on ECG collected during a short drive and raise a challenging question: how can the robustness of physiological measures of cognitive workload be improved in prolonged driving?
After applying the multistage normalization, random forests’ precision and recall increased significantly, especially in the baseline and N-back + texting conditions. This “adaptable” baseline normalization may enhance the robustness of classifiers so that they can handle the impact of temporal variation on cognitive distraction detection. Further research is needed to understand the rationale of the adaptable baseline strategy and refine it into practical solutions, for example, a capability for a real-time driver monitoring system to record and calibrate the physiological baseline every hour when appropriate.
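The contrast between single-baseline and multistage baseline normalization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: normalization is assumed here to be subtraction of the mean baseline value of each feature, the record layout (`block`, `condition`, `features`) and the function names are hypothetical, and the HR values are made up to mimic a downward drift in Block 2.

```python
from collections import defaultdict
from statistics import mean


def baseline_means(samples):
    """Mean of each feature over baseline-condition samples, keyed by block."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s in samples:
        if s["condition"] == "baseline":
            for name, value in s["features"].items():
                grouped[s["block"]][name].append(value)
    return {block: {name: mean(vals) for name, vals in feats.items()}
            for block, feats in grouped.items()}


def normalize(samples, multistage=True, reference_block=1):
    """Subtract a baseline mean from every feature (assumed normalization).

    multistage=True  -> use the baseline recorded in the sample's own block.
    multistage=False -> use the baseline of reference_block for all samples.
    """
    base = baseline_means(samples)
    normalized = []
    for s in samples:
        ref = base[s["block"] if multistage else reference_block]
        feats = {name: value - ref[name] for name, value in s["features"].items()}
        normalized.append({**s, "features": feats})
    return normalized


# Made-up values: HR drifts down by 5 bpm in Block 2 under the same tasks.
samples = [
    {"block": 1, "condition": "baseline", "features": {"hr": 70.0}},
    {"block": 1, "condition": "nback", "features": {"hr": 75.0}},
    {"block": 2, "condition": "baseline", "features": {"hr": 65.0}},
    {"block": 2, "condition": "nback", "features": {"hr": 70.0}},
]
multi = [s["features"]["hr"] for s in normalize(samples, multistage=True)]
single = [s["features"]["hr"] for s in normalize(samples, multistage=False)]
```

In this toy case the N-back samples sit 5 bpm above baseline in both blocks after multistage normalization, whereas with the Block 1 baseline alone the Block 2 N-back sample becomes indistinguishable from the Block 1 baseline. In practice the baseline would presumably also be computed per participant.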
The strategy of multistage baseline normalization also improved cognitive workload detection for different individuals. Without normalization, the random forest (tested on eight “new” participants) had lower precision in classifying the baseline and N-back conditions and was less sensitive (lower recall) to the baseline, N-back, and N-back + texting conditions. Using the multistage baseline normalization increased the random forest’s precision in detecting the baseline condition and increased its sensitivity (higher recall) to the baseline and combined conditions. After transforming the multiclass classification into binary classification, we found that the traditional normalization using the early-stage baseline did not improve classification performance because of the temporal variation in the dataset. However, the multistage baseline normalization led to higher precision and recall in classifying the baseline and distracting conditions. The overall binary classification accuracy increased from 65.3% to 75.9%, which approached the ~80% accuracy of classifying cognitive load across individuals solely based on HR features in Solovey et al. (2014). The strategy of adaptable baseline normalization may play a critical role in generalizing cognitive workload classifiers to different drivers in long-duration driving.
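The step from multiclass to binary evaluation, and the per-class precision and recall discussed above, can be illustrated with a short Python sketch. The label strings, helper names, and the toy predictions are assumptions made for illustration and do not reproduce the study's data or results.

```python
def to_binary(label):
    """Collapse the three task conditions into one 'distracted' class."""
    return "baseline" if label == "baseline" else "distracted"


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: several task conditions are confused with one another.
y_true = ["baseline", "nback", "texting", "nback+texting", "baseline", "nback"]
y_pred = ["baseline", "texting", "texting", "nback", "nback", "nback"]

multiclass_acc = accuracy(y_true, y_pred)
bin_true = [to_binary(y) for y in y_true]
bin_pred = [to_binary(y) for y in y_pred]
binary_acc = accuracy(bin_true, bin_pred)
baseline_pr = precision_recall(bin_true, bin_pred, "baseline")
distracted_pr = precision_recall(bin_true, bin_pred, "distracted")
```

Confusions among the N-back, texting, and N-back + texting labels no longer count as errors once the labels are collapsed, so binary accuracy is never lower than the multiclass accuracy computed on the same predictions.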
However, even with multistage baseline normalization, it was still difficult for random forests to distinguish cognitive workload among the N-back, texting, and N-back + texting conditions solely based on cardiac measures. This was indicated by low recall and precision among the task conditions, especially in the N-back and texting tasks (Figures 3 and 4). The challenge of distinguishing different cognitive workloads was also found by Yang, Kuo, et al. (2018): the glance metric PRC (percent road center) did not differentiate the cognitive loads in the N-back, conversation, comprehension, and route-planning tasks. However, Matthews et al. (2015) found that single-versus-dual tasks or low-versus-high workload conditions could be differentiated in simulated unmanned ground vehicle operation, depending on the metrics applied. This finding suggests that classifying various task-dependent cognitive workloads may require measures that are more effective than cardiac measures, or a combination of diverse measures (e.g., cardiac and glance metrics), for high-resolution detection.
Limitations and Future Research
The detection of cognitive workload across the task conditions still had low precision and recall. More diverse metrics (e.g., glance-based, performance-based, and neurological) plus adaptable baseline normalization should be explored in future research to improve the reliability and accuracy of detecting different cognitive workloads in prolonged driving and across individuals. Moreover, the HR and HRV measures were affected by data integrity and noise issues, which need to be addressed in future research that investigates ECG signal reliability. Additionally, further research is needed to establish the validity of the approach when the baseline and workload conditions are fully counterbalanced.
Conclusion
The study demonstrated the temporal variation of driver cognitive workload in prolonged driving using ECG-based measures. The random forests trained with HR and HRV in the first hour of driving did not perform well in the second hour of driving due to temporal variation. However, the strategy of multistage baseline normalization significantly improved the classification accuracy of cognitive workload under temporal variation. Moreover, this strategy also enhanced classification performance for different individuals. The findings on temporal variation and individual differences in cognitive workload provide realistic expectations for physiology-based cognitive workload monitoring technology in the transportation domain.
Key Points
Driver cognitive workload changes over time in the same task context.
Temporal variation and individual differences in cognitive workload affect cognitive workload detection based on HR and HRV.
Multistage baseline normalization improves cognitive workload classification.
Footnotes
Acknowledgments
We acknowledge the funding received from the Commonwealth of Australia through the Cooperative Research Centre - Projects scheme and the cash and/or in-kind support provided by Seeing Machines, Monash University, Ron Finemore Transport, and Volvo Group Australia. We also thank the extended team supporting this project: Fivaz Buys, Shafiul Azam, David Moorhouse, Jeremy da Cruz, Tim Edwards, and Rod Stewart at Seeing Machines; Nebojsa Tomasevic, Brendan Lawrence, and Raphaela Schnittker at Monash University.
Author Biographies
Shiyan Yang is a research scientist (Human Factors) at Seeing Machines Ltd. He obtained his PhD from the Department of Industrial and Systems Engineering at Texas A&M University in 2016.
Jonny Kuo is a senior research scientist (Human Factors) at Seeing Machines Ltd. He completed his PhD at the Monash University Accident Research Centre in 2016.
Michael G. Lenné is the senior vice-president (Fleet & Human Factors) at Seeing Machines Ltd and adjunct professor (research) at the Monash University Accident Research Centre. He earned his PhD in experimental psychology at Monash University in 1997.
Michael Fitzharris is the associate director of Regulation and In-depth Crash Investigations at the Accident Research Centre. He completed his PhD in psychology at Monash University in 2006.
Timothy Horberry is an adjunct professor of Human Factors at Monash University Accident Research Centre. He completed his PhD in human factors psychology and traffic safety at Derby University, England, in 1998.
Kyle Blay is a staff project manager at Seeing Machines. He obtained his master’s in engineering science from the University of New South Wales in 2007 and his MBA from the University of Sydney in 2018.
Darren Wood is the general manager at Ron Finemore Transport Service. He obtained his Bachelor of Business from Charles Sturt University in 1996.
Christine Mulvihill is a research fellow in the Human Factors Team at the Monash University Accident Research Centre.
Carine Truche is the Uptime Support Manager at Volvo Group Australia. She obtained her master’s in marketing from ESCP Europe in 2003.
