Abstract
Objective:
We aimed to (a) describe the development and application of an automated approach for processing in-vehicle speech data from a naturalistic driving study (NDS), (b) examine the influence of child passenger presence on driving performance, and (c) model this relationship using in-vehicle speech data.
Background:
Parent drivers frequently engage in child-related secondary behaviors, but the impact on driving performance is unknown. Applying automated speech-processing techniques to NDS audio data would facilitate the analysis of in-vehicle driver–child interactions and their influence on driving performance.
Method:
Speech activity detection and speaker diarization algorithms were applied to audio data from a Melbourne-based NDS involving 42 families. Multilevel models were developed to evaluate the effect of speech activity and the presence of child passengers on driving performance.
Results:
Speech activity was significantly associated with velocity and steering angle variability. Child passenger presence alone was not associated with changes in driving performance. However, speech activity in the presence of two child passengers was associated with the most variability in driving performance.
Conclusion:
The effects of in-vehicle speech on driving performance in the presence of child passengers appear to be heterogeneous, and multiple factors may need to be considered in evaluating their impact. This goal can potentially be achieved within large-scale NDS through the automated processing of observational data, including speech.
Application:
Speech-processing algorithms enable new perspectives on driving performance to be gained from existing NDS data, and variables that were once labor-intensive to process can be readily utilized in future research.
Introduction
Passenger-related distractions are among the most prevalent secondary behaviors engaged in by drivers during the driving task and are a leading cause of distraction-related crashes (Ghazizadeh & Boyle, 2009; McEvoy, Stevenson, & Woodward, 2007; Sullman, 2012; Young, Rudin-Brown, & Lenne, 2010). Distraction occurs when insufficient attention is directed to the driving task, instead being allocated to behaviors such as conversing with passengers, resulting in impaired driving performance in the form of increased steering and speed variability and an overall increase in the number of driving errors (Horberry, Anderson, Regan, Triggs, & Brown, 2006; Young & Salmon, 2012).
Through the use of vehicles instrumented with video cameras, kinematic sensors, and other recording equipment, naturalistic driving studies (NDSs) offer researchers an in-depth and ecologically valid perspective into real-world driving (Eenink, Barnard, Baumann, Augros, & Utesch, 2014; Hanowski, Perez, & Dingus, 2005; Stutts et al., 2005; Van Schagen et al., 2011). Analyzing NDS data for the occurrence of specific in-vehicle behaviors is typically a labor-intensive process involving manual review of video or audio data (Koppel, Charlton, Kopinathan, & Taranto, 2011). As data sets grow, manual review becomes increasingly infeasible (Kuo, Koppel, Charlton, & Rudin-Brown, 2014). The automated processing of NDS video data for observable behaviors, such as eye glances, blinking, yawning, and hands off wheel, has received considerable attention in the research literature, with several validated tools actively in use in transport safety research (Kuo et al., 2014; Medina, Lee, Wierwille, & Hanowski, 2004; Tan, Borgstrom, & Alwan, 2010). In contrast, little attention has been directed toward automation of audio data processing in NDSs, possibly because in some jurisdictions, the data cannot be recorded or analyzed due to privacy issues.
Audio processing, or more specifically, speech processing, comprises multiple research areas, including voice activity detection, speaker diarization (identifying unique speakers in a signal), speaker verification (verifying the identity of speakers), and speech recognition (identifying the contents of speech). Within applications such as telephony or broadcast television transcription, where the signal-to-noise ratio is high, significant performance benchmarks for speech-processing measures have been achieved. However, audio recordings made under naturalistic conditions typically feature intermittent speech activity and high levels of background noise and may involve an unknown number of speakers, conditions that greatly increase the difficulty of accurate processing (Ziaiei, Lakshmish, Sangwan, Hansen, & Oard, 2014). Recently, substantial improvements in voice activity detection and diarization performance have been achieved under naturalistic conditions (Sell & Garcia-Romero, 2014; Ziaiei et al., 2014). The application of these methods to NDS audio would facilitate the use of a greater proportion of collected data and increase the replicability and efficiency with which audio data is analyzed.
In-vehicle audio has been used extensively within human factors research for exploring cognitive load, passenger distraction, and mobile phone use (Drews, Pasupathi, & Strayer, 2008; Reimer & Mehler, 2011; Young, Salmon, & Cornelissen, 2013). In experiments in which the level of conversation intensity has been manipulated either by increasing the cognitive load required in forming a response or by increasing the emotional intensity of the conversation topic, researchers have observed an overall decline in driving performance and safety, with increases in reaction times and critical incident involvement (Chen & Chiuhsiang, 2011; Lansdown & Stephens, 2013).
In contrast, findings on the impact of child-related secondary behaviors on driving performance within naturalistic settings have been reported with less consistency. Vehicle crashes are one of the leading causes of death for children in the developed world, accounting for over 5,000 fatalities and 85,000 incapacitating injuries for children under 8 years of age between 1999 and 2008 in the United States alone (Hanna, 2010; UNICEF, 2001). In one of the first NDSs addressing the issue of child-related distraction, Stutts et al. (2005) did not observe an association between child-related secondary tasks and driving impairment indicators. Similarly, no relationship was found between child passenger presence and crash risk in the 100-Car study (Klauer, Dingus, Neale, Sudweeks, & Ramsey, 2006). Although child-related trips were examined by both groups, they were not the major focus of the studies. Rather, Stutts et al. examined the prevalence of a range of secondary behaviors, and a key outcome measure of the work by Klauer et al. (2006) included odds ratios for crash and near-crash risk. In particular, Klauer et al. (2006) identified twofold greater odds of crashes and near crashes when drivers look away from the forward roadway for more than 2 s.
Although important insights have been gained from these early studies, they offer limited information on child-related secondary behaviors. Additionally, the application of such broad classes of distractors—“child-related distractions” or “child passenger presence”—is unlikely to adequately account for the wide variety of child-related behaviors and their potential effects on driving. In the same way that cell phone conversations can be distinguished by hands free and handheld (Amado & Ulupinar, 2005), and in-vehicle audio may comprise, for example, radio talk shows and children’s stories (Hatfield & Chamberlain, 2008), child-related secondary behaviors encompass a wide range of distinct behaviors.
In a recent observational study examining child occupant behavior and driver distraction (Koppel et al., 2011), drivers were observed to engage in potentially distracting behaviors on 98% of observed trips, with interactions between drivers and rear-seat child passengers representing the second most frequent secondary behavior. In 10% of the child-related secondary behavior epochs, drivers looked away from the forward traffic scene for more than 2 s, a behavioral marker of doubled crash and near-crash risk (Klauer et al., 2006). Notwithstanding the detailed analysis of child passenger behaviors undertaken by Koppel et al. (2011), to date no studies have specifically utilized audio data to examine the impact of these interactions on quantitative measures of driving performance.
Following this body of existing research, the rationale for extending in-vehicle audio analysis to NDS data is clear. The aims of the current study were (a) to describe the development and application of an automated approach for processing in-vehicle speech data from an NDS, (b) to examine the influence of child passenger presence on driving performance, and (c) to model this relationship using in-vehicle speech data.
Method
Data Set
We analyzed the Children in Cars (CIC) data set, which has been previously described by Charlton et al. (2013). The CIC data set comprises 690 hr of naturalistic driving data from 42 participant families residing in Melbourne, Australia. Participant families were selected on the basis of regularly transporting at least one child aged between 1 and 8 years who traveled in a forward-facing child restraint system (CRS) or a booster seat. Each participating family drove one of two instrumented study vehicles for a period of 2 weeks. The mean age of participating drivers was 38.43 years (SD = 4.41 years). Although either spouse was permitted to drive the study vehicle, 66% of all trips were undertaken by female drivers. The study vehicles were luxury-model sedans with automatic transmission, instrumented with eight video cameras, an interior microphone (omnidirectional microphone insert embedded in the interior roof light panel, 50 Hz to 15 kHz), MobilEye (www.mobileye.com) for measuring headway, and VBOX systems (www.vboxmotorsport.co.uk) for recording Controller Area Network (CAN) bus and GPS data.
Outcome Variables
Vehicle performance measures were recorded at 10 Hz. For the current analysis, standard deviation of steering angle and standard deviation of velocity were used as outcome measures. Although microadjustments to steering angle are representative of alert driving, the overall standard deviation of steering angle has been observed to increase with increased distraction (Chan & Singhal, 2013; Engström, Johansson, & Östlund, 2005). Similarly, speed variability has been shown to increase as drivers engage in concurrent tasks (Horberry et al., 2006). For the purpose of this study, epochs where the vehicle was stopped, was traveling slower than 50 km/h, or was engaged in a turning maneuver were excluded from the analysis. This definition provided a clearly defined, straight-driving context for validation purposes, removing the potentially confounding effects that might be associated with stop-start driving in high-volume traffic and turning, where variability in the two dependent measures of interest (steering angle and velocity) would be expected. Segmentation of the driving data into nonturning, >50-km/h epochs was achieved using GPS and CAN bus data.
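The epoch selection and outcome computation described above can be sketched roughly as follows. This is only an illustration: the paper states the 10-Hz sampling rate and the 50-km/h threshold, but it does not say how turning maneuvers were detected from the GPS and CAN bus data, so the steering-angle cutoff (`MAX_STEER_DEG`), the minimum epoch length, the function names, and the sample data are all hypothetical.

```python
from statistics import stdev

SAMPLE_HZ = 10          # sampling rate stated in the text
MIN_SPEED_KMH = 50.0    # speed threshold stated in the text
MAX_STEER_DEG = 15.0    # hypothetical turning cutoff (not from the paper)


def segment_epochs(samples, min_len=SAMPLE_HZ):
    """Split (speed_kmh, steering_deg) samples into contiguous runs of
    nonturning driving above the speed threshold.

    Runs shorter than min_len samples are dropped (a hypothetical choice).
    """
    epochs, current = [], []
    for speed, steer in samples:
        if speed > MIN_SPEED_KMH and abs(steer) < MAX_STEER_DEG:
            current.append((speed, steer))
        else:
            if len(current) >= min_len:
                epochs.append(current)
            current = []
    if len(current) >= min_len:
        epochs.append(current)
    return epochs


def epoch_outcomes(epoch):
    """Return the two outcome measures for one epoch."""
    speeds = [speed for speed, _ in epoch]
    steers = [steer for _, steer in epoch]
    return {
        "duration_s": len(epoch) / SAMPLE_HZ,
        "sd_velocity": stdev(speeds),
        "sd_steering": stdev(steers),
    }


# Hypothetical 10-Hz trace: cruising, slowing below 50 km/h, cruising
# again, a turn, then a run too short to keep.
samples = (
    [(60.0, 1.0)] * 10
    + [(40.0, 0.0)] * 5
    + [(70.0, 2.0), (72.0, -2.0)] * 10
    + [(65.0, 30.0)] * 3
    + [(55.0, 0.5)] * 4
)
epochs = segment_epochs(samples)
outcomes = [epoch_outcomes(epoch) for epoch in epochs]
```

In this toy trace two epochs survive; the slow segment, the turn, and the final four-sample run are discarded.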
Predictor Variables
Audio data were recorded from a microphone embedded within the interior roof light panel. In-vehicle speech activity within the audio data was identified and extracted using the harmonic frequency likelihood ratio test method (Tan et al., 2010). Subsequently, for each vehicle trip, speaker diarization was performed in order to cluster the speech-segmented audio data based on who was speaking and to exclude instances of nonpassenger speech (i.e., speech activity from radio, GPS, DVD, electronic handheld devices). This diarization was achieved by (a) deriving i-vectors for each second of speech audio, (b) performing principal components analysis (PCA) on the derived i-vectors, and (c) performing k-nearest-neighbor classification on the first three PCA factors. An overview of these methods can be found in Sell and Garcia-Romero (2014) and Shum, Dehak, Chuangsuwanich, Reynolds, and Glass (2011). Based on this workflow, a grouping variable of speech/nonspeech was created (with the speech condition including all instances of driver/passenger speech regardless of speaker).
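Step (c) of the diarization procedure can be illustrated with a small k-nearest-neighbor classifier. The sketch assumes that i-vector extraction and the PCA projection (steps a and b) have already been done, so each one-second speech segment is represented by three PCA scores. The value of k, the label names ("occupant" and "device"), and all coordinates are hypothetical; the paper does not report them.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify a 3-D point by majority vote among its k nearest
    labeled neighbors (Euclidean distance)."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical manually labeled segments: (first three PCA scores, label).
train = [
    ((0.9, 0.1, 0.0), "occupant"),
    ((1.1, -0.2, 0.1), "occupant"),
    ((1.0, 0.0, -0.1), "occupant"),
    ((-1.0, 0.8, 0.2), "device"),
    ((-0.8, 1.1, 0.0), "device"),
    ((-1.2, 0.9, -0.1), "device"),
]

# Hypothetical unlabeled one-second segments from a trip.
segments = [(0.95, 0.05, 0.0), (-0.9, 1.0, 0.1)]
labels = [knn_predict(train, segment) for segment in segments]

# Keep only segments attributed to vehicle occupants, mirroring the
# exclusion of radio, GPS, and other device speech.
kept = [seg for seg, label in zip(segments, labels) if label == "occupant"]
```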
In addition to the speech-based grouping variable, epochs were also grouped by the number of child passengers present (zero, one, two, and three). Child passenger count for each trip was determined by manual review of the video data.
Analysis
Due to the nested repeated measures structure of the data set, multilevel modeling (MLM) was selected for the analysis. MLM is an extension of the general linear model and is commonly used in transport safety research, where the hierarchical nature of data would otherwise violate the assumptions of independence and normality required for a general linear model. A general example of these violations would be crash data, whereby participants are grouped within vehicles, which are in turn grouped within specific road segments where the crashes occurred (Jones & Jorgensen, 2003). In the present study, individual epochs were nested within trips, per family, per study vehicle. Not accounting for potential correlations among these measures would result in a less powerful model, leading to inaccurate estimates of parameter effects.
In the current study, two MLMs were specified to test the effects of in-vehicle speech activity type and number of child passengers on the outcome measures of steering angle and velocity variability. An autoregressive correlation structure was specified for each trip, per family, per vehicle, to account for the hierarchical structure of the data set. This model was implemented in SAS via the MIXED procedure.
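The paper does not give the model equation, so the following is only one plausible reading of the specification described above, written for epoch i nested in trip j of family k. The symbols, the dummy coding of the predictors, and the first-order form of the autoregressive structure are assumptions made for illustration; the text states only that an autoregressive correlation structure was specified.

```latex
\begin{aligned}
y_{ijk} &= \beta_0 + \beta_1 S_{ijk}
          + \sum_{c=1}^{2} \beta_{2c}\, C^{(c)}_{jk}
          + \sum_{c=1}^{2} \beta_{3c}\, S_{ijk}\, C^{(c)}_{jk}
          + \varepsilon_{ijk}, \\
\operatorname{Cov}\!\left(\varepsilon_{ijk}, \varepsilon_{i'jk}\right)
        &= \sigma^{2} \rho^{\,|i - i'|}
\end{aligned}
```

Here y is the standard deviation of steering angle or of velocity for the epoch, S equals 1 when speech is present and 0 otherwise, and C^{(c)} equals 1 when the trip carried c child passengers (zero children serving as the reference level). In this sketch, residuals from different trips are taken to be uncorrelated.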
Results
Segmentation of Driving Epochs
Epochs when the vehicle was traveling above 50 km/h and not engaged in a turning maneuver were extracted from the data set to minimize potential confounds for the outcome variables. Three participant families were excluded from analysis due to incomplete data. A total of 6,778 epochs comprising 131.6 hr of driving time were extracted from the initial 690-hr CIC data set, representing 19.1% of all recorded driving time. Mean and standard deviation of epoch duration were 69.9 s and 73.1 s, respectively. The extracted epochs totaled 8,661 km of driving, representing 67.6% of the total distance driven (12,808 km). Mean and standard deviation of epoch distance were 1.3 km and 1.6 km, respectively.
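The proportions and the mean epoch distance reported above follow directly from the stated totals, as this short check shows. The variable names are illustrative; every number is taken from the text.

```python
# Totals stated in the text.
total_hours = 690.0      # full CIC data set
epoch_hours = 131.6      # extracted epochs
total_km = 12808.0       # full CIC data set
epoch_km = 8661.0        # extracted epochs
n_epochs = 6778

time_share_pct = 100 * epoch_hours / total_hours   # share of driving time
dist_share_pct = 100 * epoch_km / total_km         # share of distance
mean_epoch_km = epoch_km / n_epochs                # mean distance per epoch

print(round(time_share_pct, 1), round(dist_share_pct, 1), round(mean_epoch_km, 1))
```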
Automated Processing of NDS Audio Data
To evaluate speech activity detection performance in the current data set, three epochs were randomly selected from each of 100 randomly selected trips for manual review of incorrect speech detections. Based on this process, a false-positive rate of 10% was achieved. This manually annotated sample (excluding false positives) was subsequently used as training data for the k-nearest-neighbor classifier. Summary statistics for the child passenger and speech activity grouping variables are presented in Table 1. Due to sample size disparity, the three-child condition was excluded from subsequent analyses.
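The review design described above (three epochs from each of 100 randomly selected trips, 300 epochs in total) can be sketched as a two-stage random sample followed by a simple proportion. The function names, the trip and epoch identifiers, the random seed, and the annotations are hypothetical; only the sampling counts come from the text.

```python
import random


def draw_review_sample(epochs_by_trip, n_trips=100, per_trip=3, seed=0):
    """Pick n_trips trips at random, then per_trip epochs within each."""
    rng = random.Random(seed)
    eligible = sorted(
        trip for trip, eps in epochs_by_trip.items() if len(eps) >= per_trip
    )
    trips = rng.sample(eligible, n_trips)
    return [
        (trip, epoch)
        for trip in trips
        for epoch in rng.sample(epochs_by_trip[trip], per_trip)
    ]


def false_positive_rate(annotations):
    """Share of reviewed detections that the reviewer marked as incorrect.

    annotations maps each reviewed item to True when speech was confirmed.
    """
    wrong = sum(1 for confirmed in annotations.values() if not confirmed)
    return wrong / len(annotations)


# Hypothetical data: 250 trips with five detected-speech epochs each.
epochs_by_trip = {trip: list(range(5)) for trip in range(250)}
sample = draw_review_sample(epochs_by_trip)

# Hypothetical review outcome: every tenth item judged an incorrect detection.
annotations = {item: (i % 10 != 0) for i, item in enumerate(sample)}
rate = false_positive_rate(annotations)
```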
Table 1. Summary Statistics for Child Passenger and Speech Activity Grouping Variables
Distributions of epochs and speech activity per participating family are presented in Figure 1, sorted in descending order of epochs. The distribution of epochs per family exhibited negative kurtosis (–0.86, SE = 0.74) and positive skew (0.58, SE = 0.38) and correlated significantly with the distribution of speech activity at α = .01 (r = .68, p < .001).
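The reported standard errors are consistent with the usual formulas for the standard error of sample skewness and excess kurtosis when n = 39, that is, the 42 families minus the 3 excluded from analysis. That these particular formulas were used is an assumption; the sketch only shows that they reproduce the reported values.

```python
from math import sqrt


def se_skewness(n):
    """Standard error of sample skewness for a sample of size n."""
    return sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis for a sample of size n."""
    return 2.0 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


n_families = 42 - 3  # families retained after exclusions
print(round(se_skewness(n_families), 2), round(se_kurtosis(n_families), 2))
```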

Figure 1. Distribution of epochs and speech activity per participant family, sorted by descending order of epochs.
Effect of Child Passenger Presence on Driving Performance
To examine the predictive effect of child passenger presence on driving performance, the number of child passengers per trip was tested as a main effect in both models. Least square means for steering angle standard deviation and velocity standard deviation are presented in Figures 2 and 3, respectively (least square means derive from a given linear model and are adjusted for the hierarchical structure specified in the MLM).

Figure 2. Least square means and standard deviation for steering angle standard deviation per number of child passengers in the vehicle.

Figure 3. Least square means and standard deviation for velocity standard deviation per number of child passengers in the vehicle.
No significant main effect of child passenger presence was observed for steering angle or velocity variability. Individual least square means also did not differ significantly.
Interactions Between Child Passenger Presence and Speech Activity
A speech/nonspeech grouping variable extracted via automated audio processing was included as an interaction effect to model the impact of speech activity on steering angle variability. A significant main effect of speech presence was observed at α = .05, F(1, 91) = 4.29, p = .041. A plot of least square means contrasts is presented in Figure 4.

Figure 4. Least square means and standard deviation for number of child passengers and speech activity interaction effects on steering angle variability (standard deviation).
The two-child/speech condition was associated with the most steering angle variability, statistically significant at α = .05, when compared with all other combinations of passenger presence and speech activity. Full statistical output (fixed-effect solutions and contrasts) for both the steering and velocity models is presented in the appendix.
For velocity variability, a significant main effect was observed for speech presence, F(1, 91) = 7.82, p = .006. A plot of least square means contrasts is presented in Figure 5.

Figure 5. Least square means and standard deviation for number of child passengers and speech presence interaction effects on velocity standard deviation.
Similar to the contrasts for steering angle variability, the two-child/speech group was associated with the most velocity variability, differing significantly from the two-child/nonspeech group, t(91) = 2.15, p = .034. The two-child/speech group was also associated with more velocity variability than the control (nonspeech) condition, t(91) = 2.44, p = .017. The contrast between the one-child/speech group and the control condition did not reach significance at α = .05, t(91) = 1.97, p = .052.
Discussion
There is a growing need for automation in analyzing increasingly large NDS data sets. Working within the problem space of driver distraction and passenger interactions, we aimed to (a) describe the development and application of an automated approach for processing in-vehicle speech data from an NDS, (b) examine the influence of child passenger presence on driving performance, and (c) model this relationship using in-vehicle speech data.
Through the application of state-of-the-art speech-processing methods, audio data from an existing NDS data set were segmented to identify epochs that included vehicle occupant (driver and passenger) speech. This was an automated process, achieving a false-positive rate of 10% compared with a manually validated subset of data. A full evaluation of the system, including missed speech and receiver operating characteristic, was outside the scope of the current study—the harmonic frequency likelihood ratio test method and i-vector-based diarization performance on a variety of evaluation sets having been previously reported (Sadjadi & Hansen, 2013; Ziaiei et al., 2014). In previous evaluations on similar data sets, a 10% false-positive rate has corresponded with speech detection rates of greater than 95% (Sadjadi & Hansen, 2013; Ziaiei et al., 2014). The application of these methods to NDS audio would facilitate more effective use of data sets and increase the replicability and efficiency with which audio data are processed. These technologies also have broader practical applications in advanced driver assistance systems. The speech-processing protocol used in the current study, for instance, could potentially be applied to the analysis of in-vehicle infotainment usage, cell phone use, or the monitoring of passenger carriage (e.g., for enforcing passenger carriage restrictions under graduated licensing), facilitating human–machine interaction by taking into consideration driver (or passenger) state.
To examine the predictive effects of speech and child passenger presence on driving performance, multilevel models for steering angle and velocity variability were specified. Consistent with previous literature, child passenger presence alone was not a significant predictor. However, incorporating speech into the model revealed that the presence of passenger speech significantly predicted both performance measures. These findings contribute to the distraction and NDS literature by linking child passenger presence and child passenger behaviors to objective measures of driving performance.
Overall, our results indicate heterogeneity within the effects of passenger behaviors on driving performance, with driving performance being variably affected by a combination of the number of child passengers per trip and the presence of speech activity. Examination of least square means showed that engaging in in-vehicle speech activity was generally associated with increased variability in steering angle and velocity, with the effect most pronounced when driving with two child passengers.
One potential explanation for increased variability when speech activity occurs with two child passengers present may be an increase in distraction exposure as a result of the additional passenger: the presence of additional passengers is likely to increase the opportunities for the driver to engage in passenger interactions. However, it is likely that in addition to heterogeneity within child-related secondary behaviors, surrogate measures of driving performance may also be affected differentially. For instance, in the specific context of the nonturning driving epochs sampled, drivers may prioritize lateral control over velocity variability when engaged in potentially distracting behaviors. There may also be temporal effects associated with the duration of epochs or speech behaviors that are not accounted for in the present models. In essence, it is difficult to conclude that impaired driving performance is attributable to passenger speech behavior alone. Rather, the patterns identified in the current study provide a basis on which subsequent experimental research may be conducted.
Large, nonexperimental data sets characteristic of NDSs present a number of challenges to interpretation. First, this research is limited in the extent to which conclusions about causality can be drawn due to the observational nature of NDSs. This limitation is compounded by small effect sizes. Our results are presented instead in the context of an exploratory analysis for which the ecological validity of an NDS design is highly suited. Second, the absence of vehicle crashes in the current data set limited our models to using measures of driving performance. Changes to steering and velocity variance are a valid measure of driver distraction and are a mechanism through which secondary behaviors affect driving performance, increasing the number of driving errors (Young & Salmon, 2012). However, it is not known whether these variables themselves are directly correlated with crashes, and due to the absence of actual crashes in the data, the validity with which inferences about injury risk could be made was constrained. Last, the current data set was limited to self-selected families in Melbourne. Based on the occurrence of non–socially desirable behavior, the observer effect was not likely to have been a significant factor in the data. The ability to generalize our findings outside of this population, however, is inherently limited.
As the focus of the current study was the exploratory use of algorithm-processed NDS speech data in modeling driving performance, a number of other data sources and variables that may have additionally contributed to predicting driving performance were excluded on the basis of preserving clarity. Semantic content within speech, for instance, has been previously used to explore internal factors, such as sentiment and emotional state, which in turn have been shown to affect driving performance (Briggs, Hole, & Land, 2011; Chan & Singhal, 2013; Grimm & Kroschel, 2005; Lansdown & Stephens, 2013). Prior to semantic analysis, audio data must first be transcribed, the time frame for which was outside the scope of the current study. However, given the efficacy with which speech activity detection was achieved, automated transcription processes may be utilized in subsequent studies.
Driver gaze data were not examined in the current study. Secondary behaviors in practice typically involve multiple attentional processes (e.g., passing an object to a child passenger involves visually searching as well as physically handling the item), and the inclusion of driver gaze data would likely assist in further distinguishing instances of in-vehicle speech activity in which the driver was actively involved versus passively listening. The analysis of driver gaze data is the subject of ongoing research.
Additionally, it was unclear whether the novelty of the study vehicle affected the outcome variables of steering and velocity variability. Although the impact of these effects may be expected to diminish over time, the multilevel models used in the current study do not explicitly take into account the participants’ familiarity (or changes in familiarity over time) with driving the study vehicles.
External to the vehicle, roadway video and headway data were also present in the data set. In the exploratory analysis of trip metadata, there may have been differences in the nature of the trips driven with one child passenger versus the other conditions beyond the factors captured in the metadata, such as different traffic conditions or the types of roads traveled. Changing road conditions have also been postulated as a potential moderating factor in the degree of cognitive load required by drivers who are responding to passenger-initiated conversation tasks (Drews et al., 2008). Whether these effects apply to child passengers remains untested.
In summary, we demonstrated a novel application of state-of-the-art speech-processing algorithms for the automated processing of NDS audio data. To our knowledge, this study is the first application of automated speech processing to the study of in-vehicle speech in an NDS. Using segmented speech data, the predictive effect of child passenger presence on steering angle and velocity variability was modeled. Consistent with previous research, passenger presence alone was not a significant predictor of driving performance. However, significant differences were observed between the number-of-child-passengers and speech-presence grouping variables, supporting the notion that not all child passenger behaviors affect performance equally. In-vehicle audio data are seldom analyzed at scale—through the interdisciplinary application of automated techniques, new perspectives can be gained from existing data, and variables that were once laborious to process can be readily utilized in future research.
Key Points
Automated speech-processing algorithms were applied to audio data from a Melbourne-based naturalistic driving study (NDS).
The predictive effect of speech activity and child passenger presence on driving performance was modeled.
Child passenger presence alone did not predict performance, but significant differences were observed among combinations of number of child passengers and speech presence.
Multiple factors need to be considered in evaluating the impact of child passenger presence on driving performance.
Within a large-scale NDS, this goal can be achieved through the automated processing of observational data.
Appendix A
Acknowledgements
The project is supported by the Australian Research Council Linkage Grant Scheme (LP110200334) and is a multidisciplinary international partnership between Monash University, Autoliv Development AB, Britax Childcare Pty Ltd, Chalmers University of Technology, General Motors-Holden, Pro Quip International, RACV, the Children’s Hospital of Philadelphia Research Institute, Transport Accident Commission (TAC), University of Michigan Transportation Research Institute, and VicRoads.
Jonny Kuo is a PhD student at the Monash University Accident Research Centre, where he is currently completing a project investigating driver distraction in the context of child passenger behaviors using naturalistic methods. He completed a bachelor of psychological science and postgraduate diploma of psychology at Monash University. His current research interest is in the application of data-analytic and computer-vision techniques for solving behavioral science problems.
Judith L. Charlton is an associate director of Behavioural Science for Transport Safety at the Monash University Accident Research Centre. She is a registered psychologist and holds a PhD in movement science. She has published extensively on road safety issues with a focus on the safe mobility of vulnerable road users. Her research team has expertise in driving simulation, instrumented vehicles, and naturalistic methods to study driver and passenger behavior in real-world settings and is recognized as the leading research group in Australia on the safety of older and impaired drivers, child passengers, and pedestrians.
Sjaan Koppel holds a PhD in psychophysiology and is a senior research fellow at the Monash Injury Research Institute’s Accident Research Centre. During her 10 years at the Monash University Accident Research Centre, she has been involved in a range of road safety research projects involving vulnerable road users, such as child road users (e.g., child vehicle occupants, pedestrians, and cyclists), older drivers, and drivers with psychological and/or medical conditions or functional impairments. She has published widely in the area of vulnerable road users and has presented at many national and international conferences.
Christina M. Rudin-Brown is a human factors specialist at Human Factors North Inc., a Canadian human factors consulting firm. She is a Certified Canadian Professional Ergonomist (CCPE) and holds a doctorate in experimental psychology. She has over 15 years’ experience in transport- and vehicle-related human factors research in government and academia, including road user behavior, driver distraction and impairment, occupant protection, in-vehicle intelligent transportation systems, and advanced driver assistance systems. She is also an accident investigator.
Suzanne Cross is currently completing a PhD at the Monash University Accident Research Centre, where she is investigating the role of behavior on child occupant protection using naturalistic methods. She received a bachelor of health science, majoring in psychology, disability studies, and health promotion in 2010 and received a postgraduate diploma of psychology in 2011 at Deakin University, Melbourne, Australia. Her current research interest is in the exploration of how child and parent behavior can affect child occupant protection.
