Abstract
Distracted driving is a dangerous behavior that causes numerous accidents on US roads each year, so identifying distracted drivers is critical to preventing such accidents. Previous studies attempted to detect distracted driving using heuristics and machine learning; however, none of these methods could capture the problem's spatiotemporal features. This study therefore used a 3D convolutional neural network (CNN), which can capture both spatial and temporal information, to classify distracted drivers based on facial features and behavioral cues. To this end, we used the Database to Enable Facial Analysis for Driving Studies (DEFADS), an open-source dataset containing 77 human subjects performing scripted driving-related activities. The model was trained with the PyTorchVideo library. The 3D CNN achieved an overall recall and precision of 97.6% and 98.1%, respectively, indicating its potential for detecting distracted drivers in the real world.
1.0 Introduction
According to the National Highway Traffic Safety Administration (NHTSA), distracted driving is responsible for approximately 10% of fatal accidents and 15% of injury accidents in the United States (NHTSA, 2020). The NHTSA estimates that distracted driving was responsible for 3,142 deaths in 2019 (NHTSA, 2020). Distracted driving, which includes activities such as using a cell phone, eating, or adjusting the radio, is a leading cause of human error in crashes. Research has shown that driving while distracted can make it harder for a driver to react to sudden events on the road and can slow reaction time (Aboah et al., 2023).
Studies indicate that even in cases of highly automated driving, the driver’s behavior, such as interacting with the infotainment system, may hinder the driver’s readiness to regain control of the vehicle (Walch, 2017). Accurate Advanced Driving Assistance Systems (ADAS) could assist in resolving this issue by identifying these distractions and predicting the likelihood of a traffic collision before it occurs (Rommerskirchen, 2014).
The National Highway Traffic Safety Administration (NHTSA) and the Centers for Disease Control and Prevention (CDC) classify driver distractions into three categories, each requiring a different set of mental and motor skills (CDC, 2019): 1)
To address this problem, researchers have explored the use of heuristics and machine learning techniques to classify driver distraction based on data collected from sensors and cameras inside vehicles, with the latter approach being used more recently. The machine learning approaches employed in past studies include, but are not limited to, support vector machines (SVMs), decision trees, random forests, and 2D convolutional neural networks (CNNs) (Aboah et al., 2023; Chilukuri et al., 2022; Gyimah et al., 2021; Gyimah et al., 2023; Keshinro, 2022).
Support vector machines have been widely used for driver distraction classification due to their ability to effectively classify high-dimensional data (Liang et al., 2007). However, SVMs are limited in their ability to capture the temporal and spatial features of driver behavior data, which can be important for accurately detecting driver distraction. Decision trees are another commonly used approach; they are relatively simple and easy to interpret, but they are limited in their ability to handle complex data and can suffer from overfitting (Lan et al., 2020). Random forests, an extension of decision trees, can help address some of these limitations, but are not as effective as other approaches at capturing temporal and spatial features of driver behavior data (Zahid et al., 2020).
Two-dimensional convolutional neural networks (2D CNNs) have shown promise for driver distraction classification due to their ability to effectively capture spatial features of the input data (Keshinro et al., 2022). However, 2D CNNs are limited in their ability to capture temporal features of driver behavior data, which can be important for accurately detecting driver distraction. Three-dimensional convolutional neural networks (3D CNNs) are specifically designed to capture both temporal and spatial features of the input data and have been shown to be particularly effective for other spatio-temporal classification problems. However, despite the potential benefits of using 3D CNNs, this approach has not been widely explored for driver distraction classification.
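The difference between the two layer types can be illustrated with a short sketch. The clip size and channel counts below are illustrative assumptions, not the configuration used in this study. A 2D convolution processes each frame independently, whereas a 3D convolution also spans neighboring frames.

```python
import torch
import torch.nn as nn

# Hypothetical clip: batch of 2, 3 color channels, 8 frames, 64x64 pixels.
clip = torch.randn(2, 3, 8, 64, 64)  # (N, C, T, H, W)

# A 2D convolution sees one frame at a time, so frames are folded into the batch.
conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
frames = clip.permute(0, 2, 1, 3, 4).reshape(-1, 3, 64, 64)  # (N*T, C, H, W)
out2d = conv2d(frames)

# A 3D convolution slides over time as well, mixing 3 neighboring frames.
conv3d = nn.Conv3d(3, 16, kernel_size=3, padding=1)
out3d = conv3d(clip)

print(out2d.shape)  # torch.Size([16, 16, 64, 64])
print(out3d.shape)  # torch.Size([2, 16, 8, 64, 64])
```

The 3D kernel carries an extra temporal dimension (its weight has shape 16 x 3 x 3 x 3 x 3), which is what lets it respond to motion across frames rather than to appearance alone.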
Considering the limitations of previous approaches to classifying distracted driver behavior, this study aims to develop a more robust distracted-driver classification model capable of simultaneously capturing spatial and temporal features. The research uses a publicly accessible dataset (DEFADS) containing information on 77 drivers engaged in various forms of distracted driving. To improve the classification accuracy of these activities, this study employs a 3D CNN framework that takes the spatio-temporal nature of the problem into account. The one-shot method used in this study is a novel approach that has not been widely applied in previous research. The experimental findings indicate that the proposed method is efficient, accurate, and effective for recognizing and classifying distracted driver activities, and they suggest that the 3D CNN framework may be a viable strategy for addressing the challenges of classifying distracted driving behaviors in natural environments.
In summary, this study contributes to the field by demonstrating the potential of 3D CNN to improve the accuracy of distracted driving classification in naturalistic settings. The findings of this study can contribute to the development of more effective driver monitoring systems and, ultimately, to the goal of improving road safety.
The remainder of the paper is structured as follows. The second section is a review of relevant literature. Section three contains the data and methodology used for this study. Section four is a discussion of the model development results. Section five concludes with a summary of the research, conclusions drawn from the findings, and recommendations for future research.
1.1 Objective
The main objective of this work is to classify driver distractions based on facial characteristics and behavioral cues using a 3D convolutional neural network built on top of PyTorch libraries.
2.0 Related Works
The classification of distracted driving behavior is an important area of research in the field of human factors and transportation safety. Accurate identification of distracted driving activities can help develop effective driver monitoring systems and ultimately improve road safety. In recent years, machine learning and deep learning models have gained significant attention in the field of driver distraction detection. These models are designed to classify different types of driver distraction based on data collected from various sensors and sources.
Many studies have been conducted in recent years to develop driver distraction recognition systems that use machine learning methods such as SVM, decision trees (DT), and random forest (RF) (McDonald et al., 2020). These algorithms train a model on a large annotated dataset to recognize patterns in the data. Data collection modalities employed in the investigations included facial expressions, eye movements, and physiological markers. The results reveal that the SVM model recognized distraction with a high degree of accuracy.
Similarly, Das et al. (2020) proposed a machine learning-based approach for driver distraction detection using an SVM classifier. The model was trained on a dataset of driver behavior collected from a driving simulator. The results showed that the proposed approach achieved high accuracy in detecting driver distraction and outperformed traditional rule-based methods.
A number of deep learning techniques, including CNNs, recurrent neural networks (RNNs), and long short-term memory networks (LSTMs), have also been applied to driver distraction recognition (Omerustaoglu et al., 2020). These methods are based on artificial neural networks and are designed to automatically extract features from the data and perform pattern recognition tasks. Wang et al. (2021) developed a CNN-based deep learning model to recognize driver distractions from facial expressions and eye movements. The model achieved an accuracy of 96.7%, demonstrating the potential of deep learning for driver distraction recognition. Sukhavasi et al. (2020) used a combination of LSTMs and SVM to analyze physiological signals for distraction recognition. The authors found that this combination achieved an accuracy of 93.4%, outperforming traditional machine learning methods.
3.0 Method
3.0.1 Dataset and Description
The dataset used in this analysis was collected at Oak Ridge National Laboratory. It consists of 77 human subjects performing scripted driving-related activities. The original goal of the DEFADS study was to create a dataset using a Second Strategic Highway Research Program (SHRP2) Data Acquisition System (DAS), with many participants (100 was the goal) and a focus on facial features and behavior cues (Karnowski et al., 2022). Three camera systems were used for the collection: two high-resolution webcams and a third system from an actual Naturalistic Driving Study (NDS), the SHRP2 DAS, as shown in Figure 1 below.

Figure 1. Specific cameras and mounts: (1) FrontWeb camera; (2) DAS camera; (3) RVWeb camera mounted under the DAS; (4) magnetic mount.
The "FrontWeb" camera, a Logitech BRIO 4K webcam with a default field of view of 90 degrees, was mounted directly in front of the driver on the dashboard. The "RVWeb" camera, a PEGATISAN mini webcam with a 110-degree field of view, was attached to the bottom of the DAS unit. Data collection began with a portrait-style image of each subject. Two in-vehicle sequences were then recorded for each participant: an “Action” collection and a “HeadPose” collection. In the Action collection, the subject was asked to perform a series of actions that emulated driving behavior, as shown in Table 1. In the HeadPose collection, the subject was asked to look at particular landmarks in the vehicle, such as the rearview mirror or the console. To avoid confusion, these positions were marked by taping a large sign, bearing a letter from A to N, at each position of interest.
Table 1. Sequence activities.
3.0.2 PyTorchVideo Library
PyTorchVideo is an open-source deep learning library for video understanding tasks, such as classification, detection, and segmentation (Mehta et al., 2022). It is built on top of the PyTorch framework and provides several pre-trained models, including SlowFast and X3D, that achieve state-of-the-art performance on various benchmarks.
Compared to other video analysis frameworks like OpenCV and Caffe, PyTorchVideo provides more advanced video processing features such as temporal subsampling, data augmentation, and efficient video loading and pre-processing. It also integrates seamlessly with other PyTorch libraries, making it easy to incorporate into existing deep learning workflows.
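Temporal subsampling, one of the features mentioned above, reduces a clip to a fixed number of evenly spaced frames before it is passed to the network. The sketch below illustrates the idea in plain Python. The helper name `uniform_temporal_indices` and the frame counts are hypothetical. PyTorchVideo provides a comparable transform (`UniformTemporalSubsample`), whose exact rounding behavior may differ from this sketch.

```python
def uniform_temporal_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over a clip of num_frames."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    # Integer arithmetic keeps the first and last frames and avoids float rounding.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


# Hypothetical 30-frame clip reduced to 4 frames.
indices = uniform_temporal_indices(num_frames=30, num_samples=4)
print(indices)  # [0, 9, 19, 29]
```

When a clip has fewer frames than requested, some indices repeat, so the output length stays fixed, which is convenient for batching clips of different durations.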
3.0.3 X3D Model Architecture
The X3D architecture is designed to efficiently process video data using a combination of 2D and 3D convolutions. The key idea behind X3D is to expand the network in both the spatial and temporal dimensions, allowing for greater expressiveness while maintaining efficiency. The X3D network consists of three components: the entry flow, the middle flow, and the exit flow. The entry flow processes the input video frames and consists of a series of 2D convolutional layers followed by a 3D convolutional layer to incorporate temporal information. The middle flow further expands the network by adding more 3D convolutional layers. Finally, the exit flow reduces the spatial dimensions of the output using a combination of 2D and 3D convolutions and prepares it for classification as shown in Figure 2 below.

Figure 2. X3D model architecture.
X3D networks expand a 2D network incrementally along the following axes: temporal duration γt, frame rate γτ, spatial resolution γs, width γw, bottleneck width γb, and depth γd.
One of the key innovations in X3D is the use of channel-wise spatiotemporal convolutions, which enable efficient processing of video data. In a channel-wise (depthwise) convolution, each filter operates on a single channel of the feature map instead of mixing all input channels, which greatly reduces the number of parameters and the computational cost. Another important aspect of the X3D architecture is the use of a "factorized" design, in which a 3D convolution is decomposed into a 2D spatial convolution followed by a 1D temporal convolution. This reduces the computational cost of 3D convolutions while maintaining their effectiveness in capturing temporal information. Table 2 below shows the summary of the model parameters.
Table 2. Model parameters.
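The savings from channel-wise and factorized convolutions can be checked by counting kernel weights. The sketch below is an illustration under stated assumptions: the channel count of 64 and the 3 x 3 x 3 kernel are hypothetical, "channel-wise" is taken to mean a depthwise convolution (the `groups` argument of `nn.Conv3d` set to the channel count), and bias terms are ignored. The helper `conv3d_weight_count` is not part of any library.

```python
def conv3d_weight_count(c_in: int, c_out: int, kernel: tuple, groups: int = 1) -> int:
    """Number of kernel weights in a 3D convolution (bias excluded).

    Matches the weight tensor shape (c_out, c_in // groups, kT, kH, kW).
    """
    k_t, k_h, k_w = kernel
    return c_out * (c_in // groups) * k_t * k_h * k_w


channels = 64  # hypothetical channel count

# Standard 3D convolution: every filter mixes all input channels.
dense = conv3d_weight_count(channels, channels, (3, 3, 3))

# Channel-wise (depthwise) 3D convolution: one filter per channel.
channel_wise = conv3d_weight_count(channels, channels, (3, 3, 3), groups=channels)

# Factorized convolution: 2D spatial kernel followed by 1D temporal kernel.
factorized = (conv3d_weight_count(channels, channels, (1, 3, 3))
              + conv3d_weight_count(channels, channels, (3, 1, 1)))

print(dense, channel_wise, factorized)  # 110592 1728 49152
```

Under these assumptions the channel-wise layer uses 64 times fewer weights than the dense layer, and the factorized pair uses fewer than half.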
4.0 Results and Discussions
The 3D CNN model for driver distraction classification was trained for 50 epochs, and the training and validation losses converged as shown in Figure 3. The training loss started at a high value and decreased rapidly in the initial epochs, indicating that the model was quickly learning features from the training data. After about 30 epochs, the decrease slowed and the training loss began to converge. The validation loss, which indicates how well the model generalizes to new, unseen data, also decreased rapidly in the initial epochs and began to converge after about 35 epochs, decreasing more slowly than the training loss. The convergence of both losses is an important indication of the model's performance: it suggests that the model generalizes well to unseen data and has not overfit the training data.

Figure 3. Training and validation losses.
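One simple way to make "the loss started to converge" concrete is to look for the first epoch at which the loss has improved by less than a small relative tolerance over a fixed window. The sketch below is only an illustration. The helper `first_plateau_epoch`, the window, the tolerance, and the loss values are all hypothetical and are not taken from the training run reported here.

```python
from typing import Optional, Sequence


def first_plateau_epoch(losses: Sequence[float], window: int = 5,
                        tol: float = 0.01) -> Optional[int]:
    """Return the first 1-based epoch whose loss improved by less than `tol`
    (relative) compared with the loss `window` epochs earlier, or None."""
    for end in range(window, len(losses)):
        start_loss = losses[end - window]
        if start_loss <= 0:
            continue  # relative improvement is undefined for non-positive losses
        relative_gain = (start_loss - losses[end]) / start_loss
        if relative_gain < tol:
            return end + 1
    return None


# Hypothetical loss history, one value per epoch.
toy_losses = [1.0, 0.5, 0.25, 0.2, 0.2, 0.2, 0.2]
print(first_plateau_epoch(toy_losses, window=2))  # 6
```

Applying the same check to both the training and validation curves gives a reproducible way to report when each one levels off.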
Our model achieved an overall validation accuracy of 97.9% and an MAE of 0.023. The precision, recall, and F1-score for each class are presented in Table 3 below:
Table 3. Summary of results for the 8 classes.
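Per-class precision and recall, together with overall accuracy, are typically derived from a confusion matrix. The sketch below shows that computation on a small made-up 3-class matrix. The numbers and the helper names are hypothetical and do not reproduce the results of this study.

```python
def precision_recall_per_class(confusion):
    """confusion[i][j] = number of clips of true class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        true_pos = confusion[k][k]
        false_neg = sum(confusion[k]) - true_pos                       # missed clips of class k
        false_pos = sum(confusion[i][k] for i in range(n)) - true_pos  # wrongly labeled as k
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        metrics.append((precision, recall))
    return metrics


def overall_accuracy(confusion):
    """Fraction of all clips that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


# Hypothetical confusion matrix for three classes (rows: true, columns: predicted).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
toy_metrics = precision_recall_per_class(toy_confusion)
toy_accuracy = overall_accuracy(toy_confusion)
print(toy_metrics, toy_accuracy)
```

For the first class in this toy example, 8 of the 9 clips predicted as that class are correct (precision 8/9), and 8 of its 10 true clips are found (recall 0.8).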
Table 3 provides a breakdown of the precision, recall, and F1-score metrics for different distracted driving behaviors. Overall, the model performed well with high precision and recall scores for most of the classes. The activity with the highest score was TouchPassengerSeat, with a precision of 0.99, recall of 0.95, and F1-score of 0.97. This indicates that the classifier performed very well in identifying this behavior.
On the other hand, the activity with the lowest F1-score was ScaryReaction, with a precision of 0.97, recall of 0.92, and F1-score of 0.94. This suggests that the classifier found this behavior comparatively harder to identify. Because recall is lower than precision, the errors are mostly false negatives (missed ScaryReaction clips) rather than false positives.
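The reported F1-scores can be checked from the precision and recall values, and the same two numbers also indicate which kind of error dominates. The sketch below uses only the values quoted above. The normalization to 100 true clips per class is an assumption made for illustration, since the per-class clip counts are not given here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def errors_per_100_true(precision: float, recall: float):
    """Expected false negatives and false positives per 100 true clips of a class."""
    true_pos = 100 * recall
    false_neg = 100 - true_pos
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_pos = true_pos * (1 - precision) / precision
    return false_neg, false_pos


# (precision, recall) as reported in the text for the two classes discussed.
reported = {
    "TouchPassengerSeat": (0.99, 0.95),
    "ScaryReaction": (0.97, 0.92),
}
for name, (p, r) in reported.items():
    fn, fp = errors_per_100_true(p, r)
    print(f"{name}: F1={f1_score(p, r):.2f}, FN={fn:.1f}, FP={fp:.1f}")
```

Both F1-scores round to the reported 0.97 and 0.94. For ScaryReaction, this works out to roughly 8 missed clips against roughly 3 false alarms per 100 true clips, so false negatives account for most of the errors.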
There could be several reasons for the differences in accuracy between these behaviors. For example, the TouchPassengerSeat behavior may have distinctive visual features that are easier for the classifier to identify. On the other hand, the ScaryReaction behavior may be more challenging to detect due to its subtle or nuanced visual cues.
Overall, understanding the strengths and limitations of the classifier for different behaviors can help inform further improvements to the model and ultimately contribute to better road safety by reducing distracted driving behaviors.
Figure 4 below shows sample results from the model.

Figure 4. Classified distracted-driver activities.
5.0 Conclusion
In summary, the present study successfully employed a 3D convolutional neural network to classify distracted driving behaviors based on facial characteristics and behavioral cues. Our results demonstrate the efficacy of this approach, achieving overall recall and precision values of 97.6% and 98.1%, respectively. These findings offer valuable insights into the detection of driver distraction and highlight the potential of advanced machine learning techniques to improve road safety.
Acknowledgements
We are grateful to the creators of the Database to Enable Facial Analysis for Driving Studies (DEFADS) for making this valuable dataset publicly available. This work has been funded by the National Science Foundation (ERC HAMMER, Award 2133630). This work was also supported by the Department of Energy Minority Serving Institution Partnership Program (MSIPP) managed by the Savannah River National Laboratory under BSRA contract 0000602156. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of these organizations.
