Drivers’ Visual Distraction Detection Using Facial Landmarks and Head Pose

Abstract

Drivers’ distraction has been widely studied in the field of naturalistic driving studies. However, it is difficult to use traditional variables, such as speed, acceleration, and yaw rate, to detect drivers’ distraction in real time. Emerging technologies extract features from human faces, such as eye gaze, to detect drivers’ visual distraction. However, eye gaze is hard to detect in naturalistic driving situations because of low-resolution cameras, drivers wearing sunglasses, and so forth. Head pose, by contrast, is easier to detect and is correlated with eye gaze direction. In this study, city-wide videos are collected using onboard cameras from 289 drivers representing 423 events. Head pose angles (pitch, yaw, and roll) are derived and fed into a convolutional neural network to detect drivers’ distraction. The experiment results show that the proposed model achieves a recall of 0.938 and an area under the receiver operating characteristic curve of 0.931, with variables from five time slices (1.25 s) used as input. The study shows that head pose can be used to detect drivers’ distraction, offers insights for detecting drivers’ distraction, and can be used for the development of advanced driver assistance systems.

Keywords

Drivers’ visual distraction, Naturalistic Driving Study (NDS), head pose, convolutional neural network (CNN)

Globally, traffic crashes cause more than 1.3 million deaths every year. Among these road crashes, impaired driving activities, such as fatigue or drivers’ distraction, result in around 25% of road crashes ( 1 ). Numerous studies have investigated the impact of drivers’ distraction on crash occurrence, or the detection of drivers’ distraction.

To better investigate drivers’ states, naturalistic driving studies (NDS) are widely conducted. The most widely used NDS data set is the Strategic Highway Research Program 2 (SHRP2) NDS. Wood and Zhang found that drivers in crash events had longer perception-reaction time and lower deceleration rates than drivers in near-crash events ( 2 ). They also found some types of distraction could influence drivers’ perception time, thus influencing driving safety.

Three types of distraction were identified by the National Highway Traffic Safety Administration (NHTSA), which were: visual distraction (the driver looked away from the roadway), manual distraction (the driver took hand off the steering wheel), and cognitive distraction (the driver had mental workload associated with a task other than driving) ( 3 ). It was well acknowledged that involvement in a secondary task while driving could reduce a driver’s performance. The secondary tasks during driving included answering a mobile phone, talking, eating, adjusting mirrors, and so forth ( 1 ). Some studies investigated factors that are correlated with drivers’ distraction to propose performance measures. Wang et al. found that the standard deviations of speed, distance headway, and lane offset in 1,244 phone distraction events were lower than those values in the base cases ( 4 ). Li et al. developed a distraction detection algorithm using kinematic signals and in-vehicle units ( 5 ). It was found that steering entropy, time headway, and speed variation were correlated with drivers’ distraction. For visual distraction, eye gaze was a direct measure; that is, eye-off-road could be regarded as visual distraction. If the driver had eyes-off-road for 2 s, the crash risk doubled ( 6 ). Most existing studies used certain devices to detect eye gaze and further detected drivers’ distraction ( 7 ). Shi et al. used multiple machine learning models to detect eye gaze directions and then further detected drivers’ distraction ( 8 ). Videos were collected from 30 participants in a driving simulator, and the proposed model was tested both in a driving simulator and a natural driving environment. The literature on detecting drivers’ distraction, the related performance measures, and so forth, are summarized in Table 1.

Table 1.

Literature Review on Drivers’ Distraction with Naturalistic Driving Studies (NDS) Data Sets

Study	Topic	Data set	Data type	Model	Output
Seshadri et al. ( 9 )	Phone usage detection	SHRP2 NDS	Images	AdaBoost, support vector machine (SVM), random forest	Phone distraction
Li et al. ( 5 )	Drivers’ distraction detection	Own data set	Images	Nonlinear autoregressive exogenous (NARX) model	Distracted/attentive
Ishak et al. ( 10 )	Performance measures of drivers’ distraction	SHRP2 NDS	Global positioning system (GPS) trajectories	Discriminant analysis, logistic regression	Drivers’ crash risk, distraction (calling/texting/passenger interaction)
Seacrist et al. ( 11 )	Crash/near-crash events in different age groups	SHRP2 NDS	GPS trajectories	Pearson’s chi-square test	Crash/near-crash
Papazikou et al. ( 12 )	Vehicle kinematics in crash/near-crash events	SHRP2 NDS	GPS trajectories	Multilevel mixed effects linear regression model	Crash/near-crash
Eraqi et al. ( 13 )	Drivers’ distraction detection	Own data set	Images	Convolutional neural network (CNN)	No distraction/drinking/texting, etc.
Wang et al. ( 4 )	Phone distraction	SH-NDS	Images	Analysis of variance (ANOVA), classification and regression tree (CART)	Distractions and secondary tasks
Arvin and Khattak ( 14 )	Distraction and involved secondary tasks	SHRP2 NDS	Images	Random parameter logistic regression, etc.	Distractions and secondary tasks
Wood and Zhang ( 2 )	Factors influencing drivers’ perception time	SHRP2 NDS	GPS trajectories	Quantile regression	Drivers’ reaction time, deceleration rate

However, it is hard to detect drivers’ eye gaze with a low-resolution camera. Also, in areas with strong sunlight, such as Florida, drivers tend to wear sunglasses while driving. Thus, eye gaze detection is conducted more often in driving simulators than in natural driving environments. Previous studies found correlations between a human’s head pose and eye gaze direction (15, 16). Ahn et al. trained a multi-task deep neural network to detect head pose (17). Four data sets were used, including Biwi Kinect Head Pose (BKHP), RCVFace, Annotated Facial Landmarks in the Wild (AFLW), and SHRP2 NDS (18, 19). Paone et al. established a benchmark data set based on SHRP2 NDS, which contained drivers’ faces, heads, and head poses from videos (20). Three algorithms were compared on the accuracy of their output, that is, the pitch, yaw, and roll angles (in degrees), to validate the head poses. Kashevnik et al. and Johnson and Cuijpers detected human head poses from images (15, 21). Jha and Busso estimated head pose using a commercial headband device (16). Zhao et al. used head pose to detect drivers’ distraction with two data sets: one was State Farm Distracted Driver Detection (SF3D), which contained 26 participants, and the other was collected in China and contained 90 participants (22). In summary, there is potential to use head pose to detect drivers’ distraction, especially when eye gaze is hard to detect.

Convolutional neural networks (CNNs) are widely used for image classification, sequential data prediction (such as travel time prediction), and so forth (23–25). Unlike a conventional neural network, a CNN has convolutional layers that can better learn complex data structures. A few studies have used CNNs to address transportation problems. For example, Du et al. and Abdelraouf et al. used CNN to predict traffic speed or travel time on freeways, and Li et al. used CNN to predict crash risk on urban arterials (24, 26, 27).

Based on the above discussion, this study is aimed at detecting drivers’ visual distraction using head pose. Cabin-view videos from around 289 drivers were collected. There were different types of drivers’ distraction, including food/drink distraction, phone distraction, and so forth. Depending on the eye states, the video frames were labeled into two classes, (visual) distraction or no (visual) distraction. A CNN model was used to detect the video frames with drivers’ distraction using head poses (pitch, yaw, and roll rates) derived from videos. Different sliding windows were tested to achieve the best result. After tuning the hyperparameters, the experiment results showed that the model with a sliding window of 1.25 s could achieve the best result, with recall value of 0.938 and area under the receiver operating characteristic (ROC) curve (AUC) value of 0.931.

The remainder of the paper is organized as follows: the data collection is illustrated in the next section. The CNN model, training, and evaluation procedures are illustrated in the section after that. The conclusion, discussion, and limitations of this study are illustrated in the final section. The study has two main highlights:

Videos are collected from 289 drivers (with cameras installed at any random position). The trips cover the metropolitan Orlando, FL, (city-wide) area. This is in contrast with most of the existing studies, which use videos collected from limited participants that drive on several specific routes.

This study derives drivers’ head pose as the input variables to detect drivers’ distraction. This is in contrast with existing studies, which mostly use eye gaze. Compared with existing work, this study does not have high requirements for the camera, standard camera installation position, and illumination level. It can be potentially used for real-time implementation.

Data Collection

Background

For monitoring risky driving behaviors, Lytx® offers the DriveCam® device to help with fleet management (28, 29). The device has two camera views: cabin view (driver’s face) and forward-facing view. Other information, such as speed, lateral acceleration (LAT), and forward acceleration (FWD), is also collected. Based on large-scale data collected with installed event recorders, Lytx® sets the threshold for both accelerations, LAT and FWD, at an absolute value of 0.5 g. Whenever the vehicle exceeds the lateral or forward threshold, which is regarded as an event, the device saves a 20 s video clip (i.e., 10 s before the threshold is met or surpassed and 10 s after). Common reasons for recording videos include hard cornering, hard braking, hard acceleration, crash, rough/uneven road surface, and so forth. The Lytx® management system also provides diagnostic results and labels any distraction behaviors, which are not appropriate on business trips.

Definitions of Head Pose

Open-source programming packages such as OpenCV and Dlib, together with a pre-trained facial landmark detection model, are used to detect faces and facial landmarks (30–32). Sixty-eight landmarks can be detected on one face, covering the jawline, mouth, eyebrows, nose, eyes, and so forth, as shown in Figure 1a. The head pose can be denoted by three angles about three mutually perpendicular axes, as shown in Figure 1b. The Euler angles generated between the head and the three axes (X-axis, Y-axis, and Z-axis) are defined as pitch ($θ_{x}$), yaw ($θ_{y}$), and roll ($θ_{z}$).

Figure 1.

Generating head pose: (a) facial landmarks, and (b) pitch, yaw, and roll.

When a point $(x, y, z)$ in the 3D coordinate system rotates by angle $θ_{x}$ around the X-axis, its coordinates in the 3D space become $(x_{X}, y_{Y}, z_{Z})$ with:

\begin{aligned} (x_{X}, y_{Y}, z_{Z})^{T} &= R_{X} \cdot (x, y, z)^{T}, \text{ and} \\ R_{X} &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos θ_{x} & -\sin θ_{x} \\ 0 & \sin θ_{x} & \cos θ_{x} \end{bmatrix} \end{aligned}

(1)

Similarly, if the rotation angles around the Y-axis and Z-axis are defined as $θ_{y}$ and $θ_{z}$, applying the X-, Y-, and Z-axis rotations in turn gives the resulting coordinate:

\begin{aligned} (x_{XYZ}, y_{XYZ}, z_{XYZ})^{T} &= R_{Z} R_{Y} R_{X} \cdot (x, y, z)^{T} = R \cdot (x, y, z)^{T}, \\ R &= \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} = \begin{bmatrix} \cos θ_{z} \cos θ_{y} & \cos θ_{z} \sin θ_{y} \sin θ_{x} - \sin θ_{z} \cos θ_{x} & \cos θ_{z} \sin θ_{y} \cos θ_{x} + \sin θ_{z} \sin θ_{x} \\ \sin θ_{z} \cos θ_{y} & \sin θ_{z} \sin θ_{y} \sin θ_{x} + \cos θ_{z} \cos θ_{x} & \sin θ_{z} \sin θ_{y} \cos θ_{x} - \cos θ_{z} \sin θ_{x} \\ -\sin θ_{y} & \cos θ_{y} \sin θ_{x} & \cos θ_{y} \cos θ_{x} \end{bmatrix} \end{aligned}

(2)

The pitch, yaw, and roll angles can be derived as:

\begin{cases} θ_{x} = \tan^{-1} \dfrac{r_{32}}{r_{33}} \\ θ_{y} = -\tan^{-1} \dfrac{r_{31}}{\sqrt{r_{32}^{2} + r_{33}^{2}}} \\ θ_{z} = \tan^{-1} \dfrac{r_{21}}{r_{11}} \end{cases}

(3)
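As a minimal sketch of Equations 2 and 3, the Python functions below build the rotation matrix from the entries printed in Equation 2 (which correspond to applying the X-, then Y-, then Z-axis rotation) and recover the three angles from it. The function names and the sample angles are hypothetical. The sketch uses `math.atan2` in place of the plain inverse tangent, which is an assumption on our part; the two agree whenever $\cos θ_{y} > 0$, as is the case for the head pose ranges considered here.

```python
import math

def rotation_matrix(theta_x, theta_y, theta_z):
    """Build R from pitch, yaw, and roll (radians), entry by entry as in Equation 2."""
    cx, sx = math.cos(theta_x), math.sin(theta_x)
    cy, sy = math.cos(theta_y), math.sin(theta_y)
    cz, sz = math.cos(theta_z), math.sin(theta_z)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def euler_angles(R):
    """Recover (pitch, yaw, roll) in radians from R, following Equation 3."""
    theta_x = math.atan2(R[2][1], R[2][2])
    theta_y = -math.atan2(R[2][0], math.hypot(R[2][1], R[2][2]))
    theta_z = math.atan2(R[1][0], R[0][0])
    return theta_x, theta_y, theta_z

# Hypothetical head pose in degrees: pitch -5, yaw 20, roll 3
pose = tuple(math.radians(a) for a in (-5.0, 20.0, 3.0))
pitch, yaw, roll = (math.degrees(a) for a in euler_angles(rotation_matrix(*pose)))
```

Building the matrix and then decomposing it should return the original angles, which gives a quick consistency check between the two equations.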

The matrix $R$ can be derived using the perspective-n-transformation method. According to the pinhole camera model, the projection from a 3D coordinate system to a 2D image plane can be put as:

s \begin{bmatrix} u_{i} \\ v_{i} \\ 1 \end{bmatrix} = K [R \mid t] \begin{bmatrix} x_{XYZ} \\ y_{XYZ} \\ z_{XYZ} \\ 1 \end{bmatrix} = \begin{bmatrix} f_{x} & γ & u_{0} \\ 0 & f_{y} & v_{0} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_{1} \\ r_{21} & r_{22} & r_{23} & t_{2} \\ r_{31} & r_{32} & r_{33} & t_{3} \end{bmatrix} \begin{bmatrix} x_{XYZ} \\ y_{XYZ} \\ z_{XYZ} \\ 1 \end{bmatrix}

(4)

For an ideal camera system without distortion, $γ = 0$ and $f_{x} = f_{y}$; $t$ is the translation vector. The scale factor $s$ does not influence the $R$ matrix. To derive $R$, a 3D morphing method with spherical method is used. The ground truth 3D coordinates are obtained from a mean 3D model of the human face (33). The image coordinates $(u_{i}, v_{i})$ are the facial landmarks detected from videos. Among them, facial landmarks such as eye corners, the nose tip, the mouth corners, and the chin are used. Using Equations 1 to 4, the pitch, yaw, and roll angles are derived.
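To make the projection in Equation 4 concrete, the sketch below maps one 3D model point to pixel coordinates for a given rotation, translation, and set of intrinsics. All numeric values and the function name are hypothetical. Recovering $R$ and $t$ from matched 3D model points and 2D landmarks is the inverse of this step; it is commonly handled by a perspective-n-point solver such as OpenCV's `cv2.solvePnP`, although the text does not state which routine was actually used.

```python
def project_point(point_3d, R, t, fx, fy, u0, v0, gamma=0.0):
    """Project a 3D point to pixel coordinates (u, v) following Equation 4."""
    x, y, z = point_3d
    # Rotate and translate the point: [R | t] [x, y, z, 1]^T
    xc = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    yc = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zc = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    if zc <= 0:
        raise ValueError("point is not in front of the camera")
    # Multiply by K; the last row of K makes the scale factor s equal to zc
    u = (fx * xc + gamma * yc + u0 * zc) / zc
    v = (fy * yc + v0 * zc) / zc
    return u, v

# Hypothetical setup: no rotation, model point 1000 units in front of the camera
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((100.0, 0.0, 0.0), IDENTITY, (0.0, 0.0, 1000.0),
                     fx=500.0, fy=500.0, u0=320.0, v0=240.0)
```

Because the scale factor $s$ is divided out in the last step, it has no effect on the recovered rotation, in line with the remark above.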

Drivers’ Distraction

The videos are collected from a fleet, with vehicle types such as van, sedan, and truck. In this study, the videos are collected from 289 drivers representing 423 events happening on 133 different days during the daytime from April 11, 2020, to May 14, 2021. The trips cover the Orlando, FL, area. The locations where the videos are collected are shown in Figure 2. Most drivers are males in middle age. Concerning the roadway types, 39% of the videos are collected from urban arterials, 30% from intersections, 12% from freeways, and so forth. The details of the collected videos are listed in Table 2.

Table 2.

Video Data Description

Roadway type (%)	Driver gender (%)	Driver age (%)
Arterial: 39	Male: 82	Young (20–30): 31
Intersection: 30	Female: 18	Middle age (30–50): 54
Freeway: 12		Old (>50): 15
Rural road: 8
Other: 11

Figure 2.

Video collection locations.

Lytx® provides labels (drivers’ activities) for videos. For most of these distraction activities (texting/calling on the phone, having food/drink, etc.), the drivers’ hands are involved, which means at least one hand is off the wheel manipulating something. In none of the videos do the drivers have both hands off the wheel. To better label the drivers’ distraction, every frame of the video is further labeled. It should be noted that video frames are only taken from situations in which the drivers are driving under normal conditions. Invalid video frames are removed for the reasons listed below:

In the middle of the video, the driver may get alerted from the onboard device that at least one acceleration threshold is violated. In this case video frames after this timestamp are removed.

The driver is not driving on the road. Instead, the vehicle is parked or just starts from a parking lot.

The video frame has poor illumination conditions, or the camera lens is covered.

The video frame rate is four frames per second (fps). For each frame, if the distraction involves eye activity (the driver has eyes off the road), the frame is labeled as distraction; if not, it is labeled as no distraction. In addition, the frames from normal driving videos are labeled as no distraction. The labeling process is shown in Figure 3. An example of the labeled frames is shown in Figure 4. All the frames are manually checked to ensure accuracy.

Figure 3.

Labeling the video frames.

Figure 4.

Examples of labeled video frames: (a) frame with (visual) distraction, and (b) frame with no (visual) distraction.

After labeling the video frames, the detailed information of the two classes (distraction/no distraction) is shown in Table 3.

Table 3.

Data Overview

Label	Total frame number	Number of videos	Video label (phone, food, other)	Average number of frames per video	Maximum number of frames per video	Minimum number of frames per video
Distraction	1,091	55	36, 1, 18	20	69	7
No distraction	9,949	368	54, 15, 17^a	28	196	7

^a Another 282 videos are normal driving videos (drivers’ distraction is not observed).

The descriptive statistics of the collected variables are shown in Table 4. Besides pitch, yaw, and roll angles, the differences of these angles between two consecutive frames are also used as input variables.

Table 4.

Variable Descriptive Statistics^a

Variable	(Minimum, maximum)	Mean	Standard deviation
Pitch	(–21.72, 20.20)	–0.36	4.66
Yaw	(–27.16, 36.36)	–3.84	8.18
Roll	(–23.08, 22.23)	2.02	5.24
Pitch difference	(–32.45, 25.81)	–0.013	2.96
Yaw difference	(–51.28, 46.42)	–0.002	4.37
Roll difference	(–23.46, 24.30)	–0.015	3.42

^a Unit: degree.
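The six input variables listed in Table 4 can be assembled per frame roughly as follows. This is a sketch under two assumptions that the text does not confirm: the difference is taken as the current frame minus the previous one, and the first frame of a video (which has no predecessor) is dropped. The function name is hypothetical.

```python
def build_features(angles):
    """Turn per-frame (pitch, yaw, roll) tuples in degrees into 6-value rows.

    Each row is (pitch, yaw, roll, pitch diff, yaw diff, roll diff), where the
    difference is current frame minus previous frame. The first frame has no
    previous frame, so it yields no row.
    """
    rows = []
    for prev, cur in zip(angles, angles[1:]):
        diffs = tuple(c - p for c, p in zip(cur, prev))
        rows.append(tuple(cur) + diffs)
    return rows

# Hypothetical head pose angles for three consecutive frames (0.25 s apart)
frames = [(0.0, 0.0, 0.0), (1.0, -2.0, 0.5), (3.0, -2.0, 0.0)]
features = build_features(frames)
```

Differences should be computed within a single video only, so that the last frame of one clip is never paired with the first frame of the next.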

Experiment and Result

CNN Model

The CNN is widely used in studies involving sequential data. The proposed model contains two convolutional layers, each followed by a dropout layer to reduce overfitting, as well as one max pooling layer and one fully connected (FC) layer. The overall architecture of the model is shown in Figure 5.

Figure 5.

Model architecture.
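The layer sequence described above can be illustrated with a small forward pass in plain Python. This is only a shape-level sketch with untrained random weights: the use of 1D convolutions along the time axis, the kernel size, the number of filters, the ReLU activations, the pooling over the whole time axis, and the sigmoid output are all assumptions, since those details are given in Figure 5 rather than in the text. Dropout acts as the identity at inference time, so it does not appear in the code.

```python
import math
import random

def conv1d_relu(x, kernels, biases):
    """Valid 1D convolution along time, then ReLU.

    x: T x C_in, kernels: C_out x K x C_in, returns (T - K + 1) x C_out.
    """
    k = len(kernels[0])
    out = []
    for t in range(len(x) - k + 1):
        row = []
        for w, b in zip(kernels, biases):
            s = b + sum(w[i][c] * x[t + i][c]
                        for i in range(k) for c in range(len(x[0])))
            row.append(max(0.0, s))
        out.append(row)
    return out

def forward(window, params):
    """Return a distraction score in (0, 1) for one window (inference mode)."""
    h = conv1d_relu(window, params["k1"], params["b1"])  # convolutional layer 1
    h = conv1d_relu(h, params["k2"], params["b2"])       # convolutional layer 2
    pooled = [max(col) for col in zip(*h)]               # max pooling over time
    z = params["b_fc"] + sum(w * p for w, p in zip(params["w_fc"], pooled))
    return 1.0 / (1.0 + math.exp(-z))                    # FC layer with sigmoid

def random_params(c_in=6, c1=8, c2=8, k=2, seed=0):
    """Untrained placeholder weights; every size here is hypothetical."""
    rng = random.Random(seed)
    def kernels(c_out, c_prev):
        return [[[rng.uniform(-0.1, 0.1) for _ in range(c_prev)]
                 for _ in range(k)] for _ in range(c_out)]
    return {
        "k1": kernels(c1, c_in), "b1": [0.0] * c1,
        "k2": kernels(c2, c1), "b2": [0.0] * c2,
        "w_fc": [rng.uniform(-0.1, 0.1) for _ in range(c2)], "b_fc": 0.0,
    }

# Hypothetical window: five time slices, six variables each (degrees)
window = [[-0.4, -3.8, 2.0, 0.0, 0.0, 0.0] for _ in range(5)]
score = forward(window, random_params())
```

With a kernel size of 2, a five-slice window shrinks to four and then three time steps before pooling, so the same sketch also accepts the three- and four-slice windows tested later.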

The commonly used hyperparameters, such as optimization algorithm, batch size, and learning rate, are tuned. The tuning ranges and selected values are shown in Table 5. The optimization algorithm is selected from Adam, stochastic gradient descent (SGD), and RMSprop. The final configuration uses Adam with a learning rate of 0.005 and a batch size of 50. The model is trained for 250 epochs.

Table 5.

Hyperparameter Tuning (Ranges and Selected Value)

Parameter/algorithm	Tuning range	Selected value
Optimization	Adam, stochastic gradient descent (SGD), RMSprop	Adam
Batch size	300, 100, 50, 30	50
Learning rate	0.01, 0.005, 0.001	0.005
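The search space in Table 5 can be enumerated as shown below. The text does not say whether every combination was tried or whether the parameters were tuned one at a time, so the exhaustive grid here is an assumption made only to show the size of the space; the variable names are hypothetical.

```python
from itertools import product

# Tuning ranges taken from Table 5
optimizers = ["Adam", "SGD", "RMSprop"]
batch_sizes = [300, 100, 50, 30]
learning_rates = [0.01, 0.005, 0.001]

# Every combination of the three hyperparameters (assumes a full grid search)
grid = [
    {"optimizer": o, "batch_size": b, "learning_rate": lr}
    for o, b, lr in product(optimizers, batch_sizes, learning_rates)
]

# Configuration reported as selected in Table 5
selected = {"optimizer": "Adam", "batch_size": 50, "learning_rate": 0.005}
```

A full grid over these ranges contains 3 × 4 × 3 = 36 configurations, each of which would be trained and compared on a validation metric.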

Sliding Window

The frame rate of collected videos is 4 fps, which means one video frame stands for 0.25 s. For sequential data, the sliding window method is usually used to learn historical information. A sliding window of 0.75 s is first used, as shown in Figure 6. The variables from the last three frames (samples) are used to classify the dependent variable for the current frame.

Figure 6.

Sliding window.
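The sliding window construction can be sketched as follows. The sketch assumes that the window ends at, and includes, the frame being classified, and that windows are built within one video at a time; the text leaves both points open. The function name is hypothetical.

```python
FRAME_SECONDS = 0.25  # one frame at 4 fps

def sliding_windows(features, labels, n_slices=3):
    """Pair each frame's label with the features of the n_slices frames ending at it."""
    samples = []
    for end in range(n_slices - 1, len(features)):
        window = features[end - n_slices + 1 : end + 1]
        samples.append((window, labels[end]))
    return samples

# Hypothetical video with six frames; each frame has a one-value feature row here
features = [[float(i)] for i in range(6)]
labels = [0, 0, 0, 1, 1, 0]
samples = sliding_windows(features, labels, n_slices=3)
window_seconds = 3 * FRAME_SECONDS
```

A video with T frames yields T − n + 1 samples for a window of n slices, so larger windows slightly reduce the number of usable samples from short clips.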

Data Set Splitting, Oversampling, and Experiment

The diagram for metrics calculation is shown in Table 6. True negative (TN) is the number of actual negative samples (no distraction) that are correctly classified. False positive (FP) is the number of actual negative samples (no distraction) that are wrongly classified. False negative (FN) is the number of actual positive samples that are wrongly classified. True positive (TP) is the number of samples in the distraction class that are correctly classified.

Table 6.

Confusion Matrix of Binary Classification Problem

	Classified label
Actual label	No distraction	Distraction
No distraction	True negative (TN)	False positive (FP)
Distraction	False negative (FN)	True positive (TP)

Using these four numbers, the following metrics are calculated: recall, false alarm rate (FAR), accuracy, and AUC.

Recall (or sensitivity): the proportion of correctly classified samples among actual positive samples, as shown in Equation 5.

FAR: the proportion of the falsely classified samples among the actual negative samples, as shown in Equation 6.

Accuracy: the proportion of correctly classified samples among all the samples, as shown in Equation 7.

AUC (area under the ROC curve): the ROC curve is used as a comprehensive metric to evaluate the model’s performance (34). This curve plots recall against FAR at different classification thresholds. The AUC value is the area under the ROC curve; it ranges from 0 to 1, with 0.5 corresponding to a random classifier and 1 to a perfect one. For imbalanced data sets, the AUC value is more reliable than accuracy.

Recall = \frac{TP}{TP + FN}

(5)

False alarm rate = \frac{FP}{FP + TN}

(6)

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

(7)
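The definitions in Equations 5 to 7 and the AUC can be written out directly, as in the sketch below. The labels, predictions, and scores are hypothetical and serve only to exercise the formulas. The AUC is computed here through its pairwise-ranking interpretation (the share of positive and negative pairs ordered correctly), which is equivalent to the area under the ROC curve; library routines such as `roc_auc_score` in scikit-learn provide the same quantity.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) with 1 = distraction and 0 = no distraction."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

def recall(tp, fn):
    return tp / (tp + fn)                    # Equation 5

def false_alarm_rate(fp, tn):
    return fp / (fp + tn)                    # Equation 6

def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tp + fp + fn + tn)   # Equation 7

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels, thresholded predictions, and model scores
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
scores = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
```

Recall, FAR, and accuracy depend on the chosen classification threshold, whereas the AUC is computed from the raw scores and summarizes performance over all thresholds.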

In the experiment, 70% of the data set is used as the training data set and 30% as the test data set. The synthetic minority oversampling technique (SMOTE) is used to increase the samples in the minority class (distraction class) to balance the training data set (35). SMOTE is widely used in the transportation safety field because of the rareness of critical events such as crashes or conflicts (36, 37). With three time slices (each containing variables from 0.25 s), the proposed CNN model achieves a recall value of 0.811 (it can identify 81.1% of the samples in the distraction class) and an AUC value of 0.804. As the sliding window gets larger, the model’s performance improves. With five time slices, the model achieves a recall value of 0.938 and an AUC value of 0.931. Taking into consideration the computational cost for real-time implementation, the authors do not further increase the size of the sliding window. The evaluation metrics are calculated using the sklearn.metrics package. Table 7 shows the above-mentioned metrics, with the best model marked in bold. Figure 7 shows ROC curves and AUC values from the three models.

Table 7.

Experiment Results (Test Data Set)

Sliding window size (slices)	Recall	False alarm rate (FAR)	Accuracy	Area under the receiver operating characteristic (ROC) curve (AUC)
0.75 s (three)	0.811	0.189	0.811	0.804
1 s (four)	0.882	0.118	0.880	0.882
1.25 s (five)	0.938	0.061	0.939	0.931

Figure 7.

Receiver operating characteristic (ROC) curves of convolutional neural network (CNN) models (test data set).
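The oversampling step applied to the training data can be illustrated with a simplified SMOTE-style routine. The sketch below picks a random minority sample, selects one of its k nearest minority neighbors, and interpolates between the two; the original algorithm iterates over every minority sample in turn, so this is a simplification. The value of k, the random seed, the flattened-vector representation of each sample, and the function name are assumptions. A maintained implementation is available as `SMOTE` in the imbalanced-learn package (`imblearn.over_sampling`).

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample, excluding the base itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two features each
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=10, k=2)
```

Oversampling is applied to the training split only, as stated in the text, so that the test metrics reflect the original class imbalance.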

The confusion matrix is usually a good method to show a model’s performance on all classes. Figure 8 shows the confusion matrix on the test data set, using the best model (the model with five time slices). It is found that the model can classify the samples in both classes successfully.

Figure 8.

Confusion matrix of convolutional neural network (CNN) model with five time slices (test data set).

For comparison, the study also uses two machine learning models: support vector machine (SVM) and extreme gradient boosting (XGBT) (38, 39). SVM is a widely used supervised learning algorithm. Given a data set $D$ in the form of $\{x_{i}, y_{i}\}_{i = 1}^{N}$, where $x_{i} \in \mathbb{R}^{d}$ are the samples and $y_{i}$ are the labels, SVM treats each feature vector $x_{i}$ as a point in a $d$-dimensional space, with $d$ as the number of features of the samples. For a binary classification problem, SVM finds the hyperplane (decision boundary) that maximizes the margin between the two classes of samples by minimizing a loss function. XGBT is a decision-tree-based model that combines many trees to generate results. To reduce the computational cost, XGBT estimates the distributions of features across all samples in a leaf to reduce the search space for building new trees. The subsequent trees give extra weight to the samples that are incorrectly classified by the prior trees. Weighted voting is used to generate the final classification results based on all the trees. XGBT is an efficient and effective model. The experiment results from these two models are listed in Table 8. The best models are marked in bold.

Table 8.

Experiment Results (Comparison with Other Models)

Model	Sliding window size (slices)	Recall	False alarm rate (FAR)	Accuracy	Area under the receiver operating characteristic (ROC) curve (AUC)
Support vector machine (SVM)	0.75 s (three)	0.686	0.314	0.685	0.677
	1 s (four)	0.706	0.290	0.707	0.700
	1.25 s (five)	0.727	0.250	0.725	0.725
Extreme gradient boosting (XGBT)	0.75 s (three)	0.683	0.315	0.684	0.684
	1 s (four)	0.721	0.281	0.719	0.716
	1.25 s (five)	0.729	0.270	0.730	0.729
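For reference, the loss function that the SVM baseline minimizes is not spelled out in the text. A common choice, shown here as an assumption rather than as the exact formulation used in the experiments, is the soft-margin objective with hinge loss, where $w$ and $b$ define the hyperplane, the labels are coded as $\pm 1$, and $C$ is a regularization parameter:

```latex
\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^{2}
  + C \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_{i} (w^{T} x_{i} + b)\bigr),
\qquad y_{i} \in \{-1, +1\}
```

The first term widens the margin between the two classes, and the second penalizes samples that fall inside the margin or on the wrong side of the decision boundary.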

Conclusion and Discussion

This paper uses head pose to detect drivers’ distraction using onboard videos. The head pose is derived from the drivers’ facial landmarks. A 3D morphing human head model is used to obtain the ground truth 3D points. Through the perspective-n-transformation method, the head pose (three angles: pitch, yaw, and roll) is generated. Based on these angles, six variables are used as input to a CNN. The experiment results show that the model can detect 93.8% of drivers’ distraction frames, with an AUC value of 0.931, when the sliding window is five time slices (1.25 s). The machine learning models, XGBT and SVM, are used for comparison. It should be noted that the videos in this study are collected city-wide with different vehicles, which results in different camera installation positions. This is in contrast with existing studies, which usually collect videos from driving simulators or from limited participants driving on several specific routes. These kinds of settings, with varying camera resolutions, illuminations, faces, and camera positions, are regarded as “in-the-wild” conditions. Increasingly, research interests are focused on these kinds of head pose data sets, such as Annotated Faces in the Wild (AFW), AFLW, and Labeled Face Parts in the Wild (LFPW) (19). These studies can be further applied to NDS and implemented in advanced driver assistance systems (ADAS).

For this study, the errors of the generated head pose are mainly from two perspectives: the misdetection of facial landmarks, and the errors from perspective-n-transformation algorithm. The authors eliminate the misdetections by removing video frames that have bad detections. However, for solving the second problem, some studies use more advanced techniques, such as neural networks, to estimate head pose. Future work can be extended to use better models for better head pose detection. Also, in this study, all the videos are collected during the daytime to ensure the accuracy of the facial landmark detection model. With new types of cameras (such as infrared camera) becoming popular on the market, more onboard videos with different illumination levels can be collected to test the performance of the proposed model.

Footnotes

Acknowledgements

The authors would like to acknowledge Lytx® and Orange County for providing the videos.

Author Contributions

The authors confirm contributions to the paper as follows: study conception and design: S. Zhang, M. Abdel-Aty; data collection: S. Zhang, M. Abdel-Aty; analysis and interpretation of results: S. Zhang, M. Abdel-Aty; draft manuscript preparation: S. Zhang, M. Abdel-Aty. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Shile Zhang

Mohamed Abdel-Aty

All results and opinions are those of the authors.

References

1. Ranney T. A., Garrott W. R., Goodman M. J. NHTSA Driver Distraction Research: Past, Present, and Future. SAE Technical Paper. Warrendale, PA, 2001.

2. Wood J. S., Zhang. Evaluating Relationships Between Perception-reaction Times, Emergency Deceleration Rates, and Crash Outcomes Using Naturalistic Driving Data. Transportation Research Record: Journal of the Transportation Research Board, Vol. 2675, 2021, pp. 213–223.

3. National Highway Traffic Safety Administration. Overview of the National Highway Traffic Safety Administration’s Driver Distraction Program. Report No. DOT HS 811 299. National Highway Traffic Safety Administration, Washington, D.C., 2010.

4. Wang, Asmelash, Xing, Lee. Characteristics of Driver Cell Phone Use and Their Influence on Driving Performance: A Naturalistic Driving Study. Accident Analysis & Prevention, Vol. 148, 2020, p. 105845.

5. Bao, Kolmanovsky I. V., Yin. Visual-manual Distraction Detection Using Driving Performance Indicators With Naturalistic Driving Data. IEEE Transactions on Intelligent Transportation Systems, Vol. 19, No. 8, 2017, pp. 2528–2535.

6. Klauer S. G., Dingus T. A., Neale V. L., Sudweeks J. D., Ramsey D. J. The Impact of Driver Inattention on Near-Crash/Crash Risk: An Analysis Using the 100-Car Naturalistic Driving Study Data. Report No. DOT HS 810 594. National Highway Traffic Safety Administration, Washington, D.C., 2006.

7. A. S., Suzuki, Aoki. Evaluating Driver Cognitive Distraction by Eye Tracking: From Simulator to Driving. Transportation Research Interdisciplinary Perspectives, Vol. 4, 2020, p. 100087.

8. Shi, Chen, Wang. A Nonintrusive and Real-Time Classification Method for Driver’s Gaze Region Using an RGB Camera. Sustainability, Vol. 14, No. 1, 2022, p. 508.

9. Seshadri, Juefei-Xu, Pal D. K., Savvides, Thor C. P. Driver Cell Phone Usage Detection on Strategic Highway Research Program (SHRP2) Face View Videos. Proc., IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, 2015, pp. 35–43.

10. Ishak S. S., Osman O. A., Codjoe, Jenkins, Karbalaieali, Theriot, Bakhit. Exploring Naturalistic Driving Data for Distracted Driving Measures. Louisiana Transportation Research Center, Baton Rouge, LA, 2017.

11. Seacrist, Douglas E. C., Huang, Megariotis, Prabahar, Kashem, Elzarka, Haber, MacKinney, Loeb. Analysis of Near Crashes Among Teen, Young Adult, and Experienced Adult Drivers Using the SHRP2 Naturalistic Driving Study. Traffic Injury Prevention, Vol. 19, Supplement, 2018, pp. S89–S96.

12. Papazikou, Quddus, Thomas, Kidd. What Came Before the Crash? An Investigation Through SHRP2 NDS Data. Safety Science, Vol. 119, 2019, pp. 150–161.

13. Eraqi H. M., Abouelnaga, Saad M. H., Moustafa M. N. Driver Distraction Identification With an Ensemble of Convolutional Neural Networks. Journal of Advanced Transportation, Vol. 2019, 2019, p. 4125865.

14. Arvin, Khattak A. J. Driving Impairments and Duration of Distractions: Assessing Crash Risk by Harnessing Microscopic Naturalistic Driving Data. Accident Analysis & Prevention, Vol. 146, 2020, p. 105733.

15. Johnson D. O., Cuijpers R. H. Predicting Gaze Direction from Head Pose Yaw and Pitch. Proc., IPCV'13: The 2013 International Conference on Image Processing, Computer Vision, & Pattern Recognition (H. R. Arabnia, L. Deligiannidis, J. Lu, F. G. Tinetti, and J. You, eds.), Las Vegas, NV, 2013, pp. 662–668.

16. Jha, Busso. Analyzing the Relationship Between Head Pose and Gaze to Model Driver Visual Attention. Proc., 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 2016, pp. 2157–2162.

17. Ahn, Choi D.-G., Park, Kweon I. S. Real-time Head Pose Estimation Using Multi-task Deep Neural Network. Robotics and Autonomous Systems, Vol. 103, 2018, pp. 1–12.

18. Fanelli, Dantone, Gall, Fossati, Van Gool. Random Forests for Real Time 3D Face Analysis. International Journal of Computer Vision, Vol. 101, No. 3, 2013, pp. 437–458.

19. Koestinger, Wohlhart, Roth P. M., Bischof. Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization. Proc., 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, IEEE, Washington, D.C., 2011, pp. 2144–2151.

20. Paone, Bolme, Ferrell, Aykac, Karnowski. Baseline Face Detection, Head Pose Estimation, and Coarse Direction Detection for Facial Data in the SHRP2 Naturalistic Driving Study. Proc., 2015 IEEE Intelligent Vehicles Symposium (IV), Seoul, South Korea, 2015, pp. 174–179.

21. Kashevnik, Ali, Lashkov, Zubok. Human Head Angle Detection Based on Image Analysis. Proc., Future Technologies Conference (FTC) 2020, Vol. 1. San Francisco, CA, Springer International Publishing, Cham, 2021, pp. 233–242.

22. Zhao, Xia, Zhang, Yan, Zhang. Driver Distraction Detection Method Based on Continuous Head Pose Estimation. Computational Intelligence and Neuroscience, Vol. 2020, 2020, p. 9606908.

23. Sharma, Jain, Mishra. An Analysis of Convolutional Neural Networks For Image Classification. Procedia Computer Science, Vol. 132, 2018, pp. 377–384.

24. Sun, Kuai, Xie, Sun. Highway Travel Time Prediction of Segments Based on ANPR Data Considering Traffic Diversion. Journal of Advanced Transportation, Vol. 2021, 2021, p. 9512501.

25. Zhang, Abdel-Aty, Cai, Ugan. Prediction of Pedestrian-vehicle Conflicts at Signalized Intersections Based on Long Short-term Memory Neural Network. Accident Analysis and Prevention, Vol. 148, 2020, p. 105799.

26. Abdelraouf, Abdel-Aty, Yuan. Utilizing Attention-Based Multi-Encoder-Decoder Neural Networks for Freeway Traffic Speed Prediction. IEEE Transactions on Intelligent Transportation Systems, 2021, pp. 1–10. https://doi.org/10.1109/TITS.2021.3108939.

27. Abdel-Aty, Yuan. Real-time Crash Risk Prediction on Arterials Based on LSTM-CNN. Accident Analysis and Prevention, Vol. 135, 2020, p. 105371.

28. Soccolich S. A., Hickman J. S. Potential Reduction in Large Truck and Bus Traffic Fatalities and Injuries Using Lytx’s Drivecam Program. Virginia Tech Transportation Institute, Blacksburg, VA, 2014.

29. Lytx: Video Telematics and Fleet Management Solutions. https://www.lytx.com/en-us/. Accessed December 28, 2020.

30. Bradski. OpenCV (Open Source Computer Vision Library). https://opencv.org/. Accessed July 15, 2020.

31. José. Facial Landmarks Recognition. https://github.com/italojs/facial-landmarks-recognition/blob/master/shape_predictor_68_face_landmarks.dat.

32. Dlib C++ Library. http://dlib.net/. Accessed July 26, 2021.

33. Storer, Urschler, Bischof. 3D-MAM: 3D Morphable Appearance Model for Efficient Fine Head Pose Estimation From Still Images. Proc., 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, Kyoto, Japan, 2009, pp. 192–199.

34. Zweig M. H., Campbell. Receiver-operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine. Clinical Chemistry, Vol. 39, No. 4, 1993, pp. 561–577.

35. Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, No. 1, 2002, pp. 321–357.

36. Zhang, Abdel-Aty, Zheng. Pedestrian Crossing Intention Prediction at Red-Light Using Pose Estimation. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 3, 2021, pp. 2331–2339.

37. Zhang, Abdel-Aty, Zheng. Modeling Pedestrians’ Near-accident Events at Signalized Intersections Using Gated Recurrent Unit (GRU). Accident Analysis and Prevention, Vol. 148, 2020, p. 105844.

38. Boser B. E., Guyon I. M., Vapnik V. N. A Training Algorithm for Optimal Margin Classifiers. Proc., 5th Annual Workshop on Computational Learning Theory, Pittsburgh, PA, 1992, pp. 144–152.

39. Chen, Guestrin. Xgboost: A Scalable Tree Boosting System. Proc., 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.