Abstract
This study utilized a recently released crash dataset of Level 3 automated vehicles (AVs) made publicly available by the National Highway Traffic Safety Administration (NHTSA). The primary objective was to investigate various crash types and identify factors that influence crash severity. To achieve this, we employed a lightweight Natural Language Processing (NLP) pipeline to automatically extract relevant information from crash narratives and categorized the crashes into 15 distinct types. By analyzing the dependency triples derived from the crash narrative using the Stanford CoreNLP library, we determined the similarity between each narrative and the predefined categories. Our findings highlight safety-critical crash scenarios based on real-world data encompassing diverse operational design domains (ODDs), revealing a statistically significant impact of lighting conditions on crash severity. These results contribute to a better understanding of AV crashes and provide valuable insights to enhance the safe testing, integration, and development of AVs in real-world environments.
Introduction
Automated vehicles (AVs) are equipped with advanced sensors to perceive and respond to the dynamic complexities of road conditions. While there has been significant discussion regarding the benefits of AVs, such as reducing traffic congestion and improving fuel efficiency (Martínez-Díaz and Soriguera, 2018), ensuring safety remains a major challenge. To address this challenge, researchers have started investigating AV crashes by utilizing available AV crash datasets.
Previous studies have analyzed crashes using the California Department of Motor Vehicles (CA DMV) dataset, which covers Automated Driving System (ADS) and Advanced Driver Assistance System (ADAS) crashes in California. For instance, Wang and Li (2019) classified crashes into four categories (Rear end, Sideswipe, Angled collision, and Run off the road) and used five injury levels to measure severity. They utilized a Classification And Regression Tree (CART) model to investigate the impact of environmental factors on crash severity when ADAS was deployed. Their study indicated that crash severity was higher when the vehicle was in autonomous mode and deemed at fault for the crash. Building upon this work, Zhu and Meng (2022) conducted a similar analysis and determined that crash severity was influenced by factors such as manufacturer, movement preceding collision, collision type, light condition, and year. However, the statistical significance of their results is questionable, as the findings were based on probabilities.
Leilabadi and Schmidt (2019) performed an in-depth analysis of crashes in the CA DMV dataset from 2014 to 2019. They classified crashes based on the type of impact and used the Chi-square Automatic Interaction Detector (CHAID) algorithm for grouping. Their exploratory data analysis revealed relationships between crash type, lighting, weather, and crash severity. Similarly, Sinha et al. (2021) investigated the CA DMV data from 2014 to 2019 and identified relative speed and vehicle damage as crucial variables affecting crash severity. Favarò et al. (2017) conducted an extensive exploratory data analysis of the CA DMV crash dataset from 2014 to 2017, concluding that rear-end collisions were the most common type of AV crash and that conventional vehicles drove 10 times more miles per crash than AVs.
Boggs et al. (2020) conducted a study focusing on rear-end collisions involving AVs. Their research also explored the impact of land use on crash occurrence by geocoding the crash location and investigating the correlation with nearby building types. To analyze these relationships, they employed WordStat to extract word and phrase frequencies, which were subsequently used to develop a Bayesian model for studying correlations. The study revealed a higher likelihood of rear-end collisions when the autonomous mode was engaged, particularly in mixed land-use settings. However, a limitation of their study was the reliance on word frequency to determine crash types. While the authors suggested that 61% of the crashes were rear-end collisions, the increased frequency of the term “rear bumper” did not necessarily indicate a rear-end collision. It is important to note that there were instances where the AV was rear-ended without any fault of its own.
The use of the CA DMV dataset for crash analysis presents several limitations. First, the limited number of crash reports (fewer than 150 crashes, all in California) and the lack of proper regulations for reporting crashes during that period hinder a comprehensive understanding of crash categorization and the identification of statistically significant factors influencing crash severity. Second, the CA DMV reporting regulations did not require information about the level of automation to be submitted. This made it impossible to analyze ADAS-equipped and ADS-equipped vehicles separately.
To address these limitations, our study uses the recent crash dataset released by the National Highway Traffic Safety Administration (NHTSA) and an enhanced Natural Language Processing (NLP) pipeline. Our goal is not only to complement key findings from previous research but also to explore unexamined or previously unattainable aspects. Specifically, we aim to investigate various crash types and the factors influencing them, providing valuable insights for governmental and private stakeholders to facilitate safe testing, integration, development, and education of driving automation technology.
Methodology
Dataset
In June 2021, the NHTSA issued a General Order to assess compliance of manufacturers and operators of vehicles equipped with ADS and ADAS in meeting their obligations. The goal was to ensure vehicle and equipment safety without unreasonable risks. The dataset used contains comprehensive crash information including damage extent, location, weather, lighting, road conditions, vehicle movements, and other factors.
Before the NHTSA General Order, timely crash notifications varied across manufacturers. The California Department of Motor Vehicles (CA DMV) maintained a public crash database for ADS and ADAS vehicles, which was previously used for research. However, the NHTSA dataset has advantages over the CA DMV dataset.
Firstly, the NHTSA dataset segregates crashes based on the level of automation employed in the vehicles at the time of the crash. In this study, we specifically focus on crashes involving vehicles with SAE Automation Level 3 and higher, which are experimental vehicles not yet deployed for commercial or public use. This allows us to gain insights into the safety considerations and challenges associated with the advancement of automated driving technology.
Secondly, the NHTSA dataset covers crashes in diverse operational design domains (ODDs). As of December 2022, data from 23 states, including California, and 46 cities have been reported. Thirdly, the NHTSA General Order requires a crash to be reported only if the ADS was in use within 30 seconds of the crash, which excludes crashes where the autonomous system was not engaged. In contrast, the CA DMV dataset includes such crashes.
Lastly, the NHTSA dataset is well-formatted in a database structure, enabling easier analysis, unlike the PDF forms in the CA DMV data that require OCR algorithms for usability.
Data preprocessing
The General Order requires an entity to submit a crash report within a certain time frame. This report is required regardless of whether the reporting entity has all the information or whether it agrees with or has verified the information. Thus, the initial version of the report may contain incomplete or partial information. If a reporting entity receives new or additional information after the initial report is submitted, it is required to submit an updated report. Regulations also require crashes to be reported by both the operating entity and the manufacturer/developer of the AV. As a result, multiple reporting entities may submit duplicate versions of the same crash report, potentially affecting the results of previous studies.
To address this, we implemented a procedure to eliminate duplicates by grouping all reports related to the same incident and selecting the newest version of the report as the representative entry. This approach ensures that each crash is represented by a single unique report in our analysis, reducing any potential bias or duplication in the dataset.
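The grouping step described above can be illustrated with a minimal sketch. It assumes each report carries an incident identifier, a version number, and a submission date; the field names (`incident_id`, `version`, `submitted`) and the records themselves are hypothetical and are not the column names or contents of the NHTSA dataset.

```python
from datetime import date

# Hypothetical report records; the field names are illustrative only and
# are not the column names of the NHTSA dataset.
reports = [
    {"incident_id": "A1", "entity": "operator", "version": 1,
     "submitted": date(2022, 3, 1)},
    {"incident_id": "A1", "entity": "manufacturer", "version": 2,
     "submitted": date(2022, 3, 9)},
    {"incident_id": "B7", "entity": "operator", "version": 1,
     "submitted": date(2022, 5, 2)},
]


def latest_report_per_incident(records):
    """Keep one report per incident: the most recently submitted version."""
    latest = {}
    for rec in records:
        key = rec["incident_id"]
        current = latest.get(key)
        # Compare by submission date first, then by version number.
        if current is None or (
            (rec["submitted"], rec["version"])
            > (current["submitted"], current["version"])
        ):
            latest[key] = rec
    return list(latest.values())


unique_reports = latest_report_per_incident(reports)
```

In this toy input, the two reports for incident `A1` collapse into the later one, so `unique_reports` holds one entry per incident.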
Furthermore, we encountered missing values in certain reports. To mitigate this issue, we extracted the relevant information from the textual descriptions in the crash narratives and used it to fill in fields where data was absent, thus minimizing the impact of missing values on our analysis.
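As an illustration only, one simple way to recover a missing field from a narrative is keyword matching. The sketch below is not the procedure used in this study: the keyword patterns, the field names (`lighting`, `narrative`), and the example narrative are all assumptions made for the example.

```python
import re

# Hypothetical keyword patterns; a real mapping would have to be built and
# validated against the narratives themselves.
LIGHTING_PATTERNS = {
    "Dark": re.compile(r"\b(at night|nighttime|dark|after sunset)\b",
                       re.IGNORECASE),
    "Daylight": re.compile(r"\b(daylight|daytime|during the day)\b",
                           re.IGNORECASE),
}


def fill_lighting(report):
    """Return a copy of the report; infer lighting only when it is missing."""
    filled = dict(report)
    if filled.get("lighting"):
        return filled  # a reported value is never overwritten
    narrative = report.get("narrative", "")
    for label, pattern in LIGHTING_PATTERNS.items():
        if pattern.search(narrative):
            filled["lighting"] = label
            break
    return filled


example = {
    "lighting": None,
    "narrative": "The AV was stopped at a red light at night "
                 "when it was struck from behind.",
}
filled_example = fill_lighting(example)
```

The function leaves reported values untouched and leaves the field empty when no keyword matches, so inferred values can be reviewed separately from reported ones.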
Crash scenarios and categorizing
To categorize crash scenarios, we built upon the work of Kibalama et al. (2022), who defined 12 categories of crashes based on highway conditions. However, since our research encompassed crashes across various operating environments, we made certain modifications to their categories. We added 5 categories, some of which absorbed and expanded upon their original categories. As a result, we established a total of 15 crash categories, as outlined in Table 1. The categorization process was conducted through a combination of manual review of crash reports and collaborative brainstorming among researchers. It is important to note that these crash categories were not mutually exclusive, meaning that a single crash could be classified into multiple categories, capturing the diverse nature of crash scenarios encountered in our dataset.
Crash categories and explanations (the bottom five were added in this study).
Our pipeline for categorizing crash scenarios involved three main steps. First, we employed the Stanford CoreNLP Natural Language Processing Toolkit (StanfordCoreNLP) (Manning et al., 2014) to extract semantic dependencies from the narrative reports. This toolkit greatly simplified the process of generating dependency trees, which allowed us to extract relevant relation triples specifically pertaining to the crash. Second, we utilized a GloVe model for word embedding to determine the semantic textual similarity between the crash reports. By leveraging term similarity index and Term Frequency-Inverse Document Frequency (tf-idf), we generated a sparse term similarity matrix. Through experimentation with various threshold values, we determined that a threshold of 0.7 achieved the highest classification accuracy. Finally, soft cosine similarity was employed to classify the reports by comparing query representations for each category with the similarity matrix.
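The final step, soft cosine similarity, can be sketched in a few lines. For two term vectors x and y and a term similarity matrix S, the score is xᵀSy / (√(xᵀSx) · √(yᵀSy)). The sketch below is a toy under stated assumptions: the vocabulary and term similarities are invented rather than derived from GloVe, plain word counts stand in for tf-idf weights, and the 0.7 threshold is applied to term similarities, which is one possible reading of the description above.

```python
import math

# Toy vocabulary and hand-picked term similarities. In the pipeline described
# above these would come from GloVe embeddings; the values here are invented.
VOCAB = ["rear", "struck", "behind", "hit", "turn", "left"]
TERM_SIM = {("struck", "hit"): 0.8, ("rear", "behind"): 0.75,
            ("turn", "left"): 0.5}
THRESHOLD = 0.7  # term similarities below this are dropped (sparse matrix)


def similarity_matrix(vocab, term_sim, threshold):
    """Build a symmetric term similarity matrix with ones on the diagonal."""
    n = len(vocab)
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (a, b), value in term_sim.items():
        if value >= threshold:
            matrix[index[a]][index[b]] = value
            matrix[index[b]][index[a]] = value
    return matrix


def bag_of_words(tokens, vocab):
    """Count vocabulary terms; a stand-in for tf-idf weighting."""
    return [float(tokens.count(term)) for term in vocab]


def bilinear(x, matrix, y):
    """Compute x^T S y."""
    return sum(x[i] * matrix[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def soft_cosine(x, y, matrix):
    xx, yy = bilinear(x, matrix, x), bilinear(y, matrix, y)
    if xx <= 0 or yy <= 0:
        return 0.0
    return bilinear(x, matrix, y) / (math.sqrt(xx) * math.sqrt(yy))


S = similarity_matrix(VOCAB, TERM_SIM, THRESHOLD)
narrative = bag_of_words("the av was hit from behind".split(), VOCAB)
queries = {
    "Rear-end struck": bag_of_words("struck rear".split(), VOCAB),
    "Turning": bag_of_words("turn left".split(), VOCAB),
}
scores = {name: soft_cosine(narrative, q, S) for name, q in queries.items()}
best = max(scores, key=scores.get)
```

The narrative shares no word with either query, so ordinary cosine similarity would be zero for both. Soft cosine still favors the rear-end query because "hit" and "behind" are close to "struck" and "rear" in the similarity matrix. Since the categories are not mutually exclusive, a multi-label variant would keep every category whose score exceeds a cutoff instead of taking only the maximum.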
To ensure the accuracy of the classification, we performed manual quality control sampling to verify the correct placement of reports into their respective categories. This step allowed us to validate the results and ensure the reliability of the categorization process.
Statistical analysis
We divided the influencing factors into roadway type, weather, and lighting conditions, and investigated their effects on crash severity, the number of crashes, and the types of crashes. We used two indicators of crash severity: whether people were injured in the crash and whether the vehicle was towed after the crash. We performed Pearson’s Chi-squared test to examine the relationships between crash severity and environmental factors.
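For a 2x2 table such as lighting (daylight, dark) against towing (towed, not towed), Pearson's statistic is the sum over cells of (observed − expected)² / expected, where each expected count is the row total times the column total divided by the grand total. The sketch below computes the statistic without a continuity correction; the counts are hypothetical and are not taken from the dataset.

```python
import math


def chi_squared_2x2(table):
    """Pearson's chi-squared statistic (no continuity correction) and
    p-value for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, NOT taken from the dataset:
# rows = lighting (daylight, dark), columns = (towed, not towed)
observed = [[20, 80], [30, 30]]
stat, p = chi_squared_2x2(observed)
```

In practice a library routine such as `scipy.stats.chi2_contingency` would normally be used. It also handles larger tables, for example roadway type with more than two levels, where the degrees of freedom exceed one and the shortcut for the p-value above no longer applies.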
Results
Collision summary
At the time of writing this article, the NHTSA database included 210 unique crashes from 12 states and 46 cities. The reports were submitted by 31 unique reporting entities using vehicles from 22 different makers. Of the 210 crashes, a commercial operator was present in the driver’s seat in 190 cases (90%), while a customer was in the driver’s seat in only 2 crashes. There was a remote operator in 12 cases and no operator in 6 cases. Among the crashes, 143 occurred within the operational design domain (ODD), while 4 took place outside the designated ODD. Information regarding the ODD was redacted by the operators for the remaining crashes. The distribution of crashes was relatively even throughout the year and during the day.
Table 1 shows different crash categories and Fig. 1 shows the proportion of all crashes. It is important to note that a crash can be classified into one or two categories. For example, 8 crashes were classified as both rear-end struck collision and Turning. Among the crash categories, rear-end struck collisions accounted for the highest proportion at 31.7%, followed by Other vehicle mistake (22.9%), Cut-in maneuver (8.3%), Turning (5.4%), and Object (5.4%). Further analysis of the pre-crash movement revealed that 41.4% of crashes occurred when the automated vehicle was stopping or slowing down, 39.5% during straight driving, and 16% during a turn.

Crash categories.
Roadway type and lighting conditions
In terms of roadway type, Fig. 1 provides an overview of the number of crashes occurring on different types of roads. Intersections accounted for 47% of crashes, regular streets for 39.5%, and highways for only 7%. It is worth noting that highways typically have better road markings and less variation in vehicle trajectories compared to intersections and regular streets. When examining the relationship between roadway type and crash categories, we found that the leading crash categories at intersections and on streets were rear-end struck collisions (30%) and Other vehicle mistake (26.5%). Intersections naturally had a higher rate of crashes in the Turning category. Vehicles on streets and highways had a substantial number of crashes in the Object (9%) and Cut-in maneuver (12%) categories.
Lighting conditions were also associated with the type of crash. While we did not know how evenly testing was distributed between day and night, crashes in the dark made up roughly 30% of all crashes in each category. The proportion of crashes in the dark was higher in the Object, Software issues, and Turning categories. However, because the data are unbalanced and each category contains few data points, it was not possible to test statistically whether roadway type and lighting affect the type of crash.
As shown in Fig. 2, 90% of crashes took place in clear weather conditions, and 95% of crashes happened on a dry road surface. Therefore, it is difficult to describe the relationships between weather, road surface, and crash categories.

Number of crashes in each weather type.
Factors that influence crash severity
Figs. 3 and 4 show the number of vehicles towed and the number of crashes with injuries during daylight and dark. Of the reported crashes, 174 caused no injuries to passengers in either vehicle. The vehicle was unsafe to drive and had to be towed away in 51 cases. We found that lighting conditions significantly influenced whether a vehicle was towed (χ²(1) = 13.882; p = .0004), and marginally significantly influenced whether people were injured in the crash (χ²(1) = 3.42; p = .1). The marginal significance is likely due to the relatively small number of crashes with injury (i.e., low statistical power). Roadway type did not have a statistically significant impact on the severity of crashes (χ²(1) = 1.541; p = .701).

Number of crashes towed in daylight and dark.

Number of crashes with injury in daylight and dark.
The current state of data is very unbalanced when it comes to weather and road surface conditions. This makes it impossible to perform any meaningful analysis of the effects of these parameters on crash severity.
Discussion
In this study, we developed an NLP pipeline to classify AV crashes into 15 categories, allowing for the analysis of crashes in various operational design domains (ODDs). We explored the relationships between crash categories and severity, and environmental factors such as lighting, roadway type, roadway surface, and weather.
Primary crash categories included rear-end struck collisions, Other vehicle mistakes, and Turning. Software issues, Target lane change, and rear-end strikes categories indicated limitations in AV software.
AVs approach intersections with an abundance of caution (Smith, 2012) and are prone to frequent stops to yield to pedestrians and other vehicles. Possibly because drivers of conventional vehicles drive more aggressively and are prone to breaking rules and making mistakes, we see many rear-end struck collisions and Other vehicle mistake crashes when the vehicles were driving in a convoy. This finding is consistent with Deluka Tibljaš et al. (2018) and Papadoulis et al. (2019), who used a traffic simulation software package and predicted an increase in the share of rear-end struck collisions with the introduction of AVs.
The overly cautious behavior of AVs may also contribute to crashes in the rear-end struck and Turning categories. Many crashes in the NHTSA database took place because the AV stopped to yield to pedestrians or other vehicles while making a turn. More research is needed from the manufacturers to calibrate the driving style and assertiveness of AVs and their interaction with conventional drivers.
Intersections (46.6%) and streets (39.5%) were the primary locations where crashes occurred, likely due to the complexity of movements and interactions between vehicles in these areas. This highlights the need for manufacturers to continuously improve their driving algorithms for AVs to navigate these challenging scenarios more effectively.
The results of this study are important for validating the performance of Level 3 ADS vehicles prior to their commercialization on public streets. Our study covered a very diverse set of ODDs, which could be leveraged in simulation tools or other virtual platforms to test ADS algorithms. It could also help manufacturers and policymakers understand the effect of different factors on AV performance and make changes to improve safety.
Limited data availability was the main limitation of the study. Weather and road surface effects could not be fully analyzed due to a lack of crashes in adverse conditions. As AV technology continues to advance and ODDs expand, we anticipate that more publicly available data will support future studies. We observed that AVs had a greater proportion of crashes in the dark in the Object, Software issues, and Turning categories. This could be because of the limitations of the algorithms or camera hardware; however, we can only verify this once more data is available. Additional data on these factors could provide valuable insights into their influence on AV safety and contribute to a more comprehensive understanding of the subject.
Conclusion
As more manufacturers increase testing of Level 3 ADS on the road, the evaluation of AV crashes and safety needs to be updated. Through the analysis of 210 crashes, this research identified 15 categories into which AV crashes can be classified. Compared to previous studies that used the CA DMV dataset and learning algorithms, we used the NHTSA dataset and the GloVe model to categorize AV crashes in a more comprehensive way. These categories can be used by authorities, manufacturers, and researchers to identify and evaluate the safety of AVs.
The findings of this study, which focused on AV crashes, indicated that lighting conditions were associated with whether a vehicle involved in an AV crash required towing and the occurrence of injuries in the crash. However, the type of roadway did not show a statistically significant correlation with the severity of the AV crashes observed.
We strongly recommend that transportation agencies at both the federal and state levels mandate operating entities to collect and share more comprehensive data on AV crashes. This increased availability of detailed data would facilitate more informed research and analysis of safety by researchers, consequently enabling better testing and evaluation of AVs. By promoting a greater understanding of AV crash patterns and contributing to the development of enhanced safety measures, this data-driven approach can further advance the adoption and effectiveness of AV technology.
