Ethical and moral decision-making for self-driving cars based on deep reinforcement learning

Abstract

Self-driving cars are expected to replace human drivers shortly, bringing significant benefits to society. However, they have faced opposition from various organizations that argue it is challenging to respond to instances involving unavoidable personal injury. In situations involving deadly collisions, self-driving cars must make decisions that balance life and death. This paper investigates the ethical and moral decision-making challenges for self-driving cars from an algorithmic perspective. To address this issue, we introduce the accident-prioritized replay mechanism to the Deep Q-Networks (DQN) algorithm based on early humanities research. The mechanism quantifies a reward function that takes priority into account. RGB (red, green, blue) images obtained by the camera installed in front of the self-driving cars are fed into the Xception network for training. To evaluate our approach, we compare it to the conventional DQN algorithm. The simulation results indicate that the Rawlsian DQN algorithm has superior stability and interpretability in decision-making. Furthermore, the majority of respondents to our survey accept the final decision made by our algorithm. Our experiment demonstrates that it is possible to incorporate ethical considerations into self-driving car decision-making, providing a solution for rational decision-making in emergency and dilemma circumstances.

Keywords

Rawlsian maximin principle; Carla simulator; depth-wise separable convolution; deep Q-network; ethical decision-making

1 Introduction

With the rapid development of automatic control, artificial intelligence (AI), computer vision, wireless communication, and many other technologies, self-driving cars have become feasible from an engineering perspective [1]. The potential benefits of self-driving cars are immense, including improved road efficiency, reduced traffic accidents, increased productivity, and minimized environmental impact. Given these advantages, various studies have estimated that the market demand for self-driving cars will grow at a compound annual rate of 53.6% from 2022 to 2030 [2]. This projected growth is driving companies to expand the production of self-driving cars by utilizing technological improvements, such as adaptive algorithms, sensor processing, high-definition mapping, and enhanced infrastructure.

Current research primarily focuses on advanced decision-making and perception systems technology. Hsu and Chen proposed a dynamic vehicle prediction system that comprises a sensor fusion system and a vehicle recognition system [3]. Goodall presented a three-stage algorithm to enhance the collision algorithm [4]. Meanwhile, Liu et al. suggested a perception system for self-driving cars based on a 3D projection model [5]. Additionally, Lefebvre and Ambellouis introduced a primitive joint obstacle detection and tracking method that uses a mean shift algorithm and a semi-dense disparity map [6].

Self-driving technology has brought about significant progress, yet there are still dilemmas associated with its development. These dilemmas primarily include cognitive and ethical concerns. A survey conducted by Kyriakidis revealed that users express heightened apprehension regarding safety, economy, and legal issues concerning self-driving cars [7]. In the present scenario, only about 40% of users are willing to invest in self-driving cars [8]. Hence, it is essential to resolve conflicts when self-driving cars face different dilemma scenarios. In that case, the moral decision-making problem has to be considered for the further prosperity of the self-driving industry.

One solution to the ethical dilemma in self-driving cars is to use machine learning for intelligent decision-making and driving behavior planning. Machine learning has a wide range of applications in various fields, including image recognition [9], path-tracking [10], talent growth [11], disease treatment [12], and drug development [13]. The ethical decision-making schemes for self-driving cars must comply with popular perceptions to encourage widespread acceptance of these cars [14]. However, most existing studies only propose ethical decision theories and give less consideration to the implementation of decision schemes. Furthermore, many of the decision theories have inherent flaws. In this paper, we choose the Rawlsian maximin principle for designing reward functions and combine it with the theory of prioritized experience replay. Moreover, we develop the Rawlsian Deep Q-Network (DQN) algorithm for simulation and compare the simulated decision outcomes with publicly acceptable results to validate the feasibility of the proposed scheme. The main study framework of this paper is shown in Fig. 1.

Fig. 1

The study idea and technical framework of this paper.

The primary contributions of this paper are summarized as follows:

We propose a unique combination of deep reinforcement learning with moral and ethical decision-making. This approach combines the Rawlsian maximin principle with the DQN algorithm to avoid the moral issue of deontology, which tends to prioritize strict adherence to specific rules over the complexity of the situation. Furthermore, it overcomes the problem of unfairness that utilitarianism creates.

We assess the validity of a questionnaire survey and cross-validate its results against the decisions produced by the self-driving car under deep reinforcement learning. This cross-validation supports the credibility of our ethical decision-making solution for self-driving cars.

The paper is organized as follows: Section 2 presents a review of the relevant literature, while Section 3 provides a detailed problem description. Section 4 states the details of our Rawlsian DQN algorithm. Section 5 validates the performance of the proposed algorithm through computational experiments. Section 6 details a real-world questionnaire test for cross-validating our algorithm. Finally, Section 7 summarizes our findings and outlines potential future research directions.

2 Literature review

With the development of society and self-driving car technology, the safety of self-driving technology has now surpassed that of manual driving. However, it still faces ethical decision problems [15], such as the decision choice in the face of the trolley dilemma. Making self-driving cars the subject of decision-making is more acceptable to the public [16]. The moral decision problem can be solved by using scenario planning, which helps developers build moral decision mechanisms for self-driving cars. The mainstream approach is to integrate philosophy and technology to study the moral decision problem of self-driving cars [17].

Table 1 compares the existing decision-making approaches with machine learning. There are three main academic theories of ethical decision-making: deontology, utilitarianism, and the Rawlsian maximin principle. From a deontological perspective, self-driving cars follow previously established “specific rules” to confront decision problems under moral dilemmas, which emphasizes the rule of reason. The “specific rules” method is technically easy to implement, but it cannot capture the complexity of real moral dilemmas, which can result in irrational behavior. From a utilitarian perspective, the decision-making of self-driving cars is aligned with the direction of the optimal outcome. Although this theory ensures the maximization of group interests, it may also lead to behaviors that sacrifice fairness [4]. The Rawlsian maximin principle makes decisions by maximizing the worst possible outcome [18], which helps both parties reach the Pareto optimal state. From a practical perspective, it considers the complexity of the environment without sacrificing fairness. However, this theory requires evaluating the payoffs under various mechanisms, which makes it more difficult to implement [19].

Table 1
Comparison of the existing methods related to ethical decision-making

1. Theoretical analysis:
Method∖Features	Regularity	Interest bias	Fairness	Addressing environmental complexity
(1) Deontology	√	√
(2) Utilitarianism	√	√
(3) Rawlsian maximin principle			√	√

2. Algorithm comparison:
Method∖Features	Road condition recognition	Lane change decision	Global planning	Ethical decision making	Steering forecast	Model training
(1) Deep learning networks based on security constraints	√	√
(2) Deep convolutional neural networks combined with event cameras	√
(3) Hierarchical Reinforcement Learning Network		√			√
(4) Reinforcement learning networks combined with path planning	√	√	√
(5) Rawlsian Deep Q-learning Network	√	√		√		√

Deontology adopts an ethical knob with two sides representing “egoism” and “altruism”, respectively, allowing the individual to decide on the decision option [20]. Utilitarianism uses dynamics to evaluate the loss of multiple options for the self-driving car in the current situation to generate a decision [21], represented by the “dynamics algorithm”. Nevertheless, the complexity of the decision scenarios, the incompatibility of egoism and altruism, and the differences in ethical concepts make it impossible to obtain a uniform ethical decision solution.

The Rawlsian maximin principle stands out among current ethical decision-making theories due to its approach of maximizing the worst gain, which takes into account the complexity of moral-ethical dilemmas and promotes both survival rate and fairness to reach the Pareto optimal state. However, it is important to clarify how different actions are evaluated when applying this principle [22]. Additionally, Chandak et al. proposed a two-stage algorithm to address ethical and moral issues in self-driving cars, highlighting the potential use of deep reinforcement learning as a technical means to tackle the ethical decision problem [23]. Despite these efforts, the complexity of decision scenarios, the incompatibility of egoism and altruism, and differences in ethical concepts still make it challenging to find a uniform ethical decision solution.

Previously, deep reinforcement learning-based self-driving car research mainly focused on designing reward values based on rational judgments, without integrating ethical decision-making issues with the technology. However, in recent years, there has been a growing interest in the humanities of self-driving cars, and researchers have started to consider the rationality of the technology’s actions during training [17]. On the one hand, researchers have been working on newer and faster neural network algorithms to improve the safety and comfort of self-driving cars. For instance, Du et al. proposed a vehicle control algorithm based on safety constraints to make decision-making more stable by applying reasonable constraints during training [24]. Cervera-Uribe used encoder-decoder networks in deep learning to improve in-vehicle systems’ reaction to the driving environment and possible collisions between the driver and obstacles, increasing the speed of vehicle recognition of road conditions [25]. On the other hand, researchers have been exploring better road condition capture technology to enable self-driving cars to analyze road conditions as humans do. Additionally, they have been delving deeper into the connection between manual driving and self-driving. Duan et al. used a hierarchical reinforcement learning approach to study lane change and approach strategies for self-driving cars, demonstrating that deep reinforcement learning can achieve smooth and safe decision-making [26]. Maqueda et al. combined event cameras with deep reinforcement learning to achieve smooth steering prediction of self-driving cars, showing that deep convolutional neural networks can enable the prediction and control of vehicle motion parameters [27]. Kim and Canny introduced attention models into convolutional neural networks to produce a more concise visual interpretation and more accurate exposure of the network’s behavior, providing causal cues between neural networks and human driving for various features [28].

Currently, researchers are exploring models that combine global optimization of the road with local vehicle decision-making. In this regard, Chen et al. proposed a self-driving obstacle avoidance strategy that combines path planning and reinforcement learning [29]. The strategy involves first planning the global optimal path using global information and then combining the globally optimal path and vehicle information as input into a reinforcement learning neural network. The output of the network is a vehicle control signal that follows the optimal path while avoiding obstacles. This approach shows promise in achieving both safe and efficient self-driving.

However, the guidelines for training in these works are often based solely on the rational judgment of the researcher and lack integration with socio-ethical knowledge. As a result, there is a need to develop reasonable criteria for training ethical decision-making in self-driving cars from a social human ethics perspective and to subject the results to testing by the public. This paper proposes the use of the Rawlsian maximin principle as a decision theory for driverless cars. The feasibility of this principle is verified by combining it with the kinetic principle to design the agent’s reward function in deep reinforcement learning. To train the agent, an accident priority replay mechanism is used, and the decision results of self-driving cars in different ethical and moral environments are obtained.

3 Problem description

This paper considers how to make a reasonable decision in the face of a dilemma that a self-driving car may encounter when it is out of control. From an ethical perspective, we explore the crisis decision of whether to crash into a passerby and save the owner or take other actions based on deep reinforcement learning. To study its decisions, we constructed a local environment, as shown in Fig. 2.

Fig. 2

Local environment for self-driving car.

In the local environment previously mentioned, a simulated “junction” setting is used to test self-driving cars’ decision-making abilities. This scenario requires the car to choose between hitting a person or a car while traveling a certain distance. To make an informed decision, self-driving cars must consider both external environmental factors, such as speed and distance, and internal moral factors. Balancing the importance of these factors is crucial to prevent moral dilemmas, such as sacrificing an innocent person to save a group of criminals. The paper aims to determine how self-driving cars can make decisions that promote both fairness and survival rates, achieving the Pareto optimal state. Ultimately, the study aims to determine whether a self-driving car with ethical considerations can produce a “satisfactory solution” accepted by society. The final decision solution is compared with social statistics, and its applicability to other environments is evaluated.

4 Methodology

4.1 Markov decision-making process

Here, we abstract the out-of-control car as an agent and study its decision-making process in the face of ethical dilemmas. The decision of the car can be regarded as a Markov decision process, where the quintuple <S, A, P, R, γ> represents the state space, action space, transition probabilities, reward function, and discount factor, respectively. These elements are explained in detail in Table 2.

Table 2
The elements of the Markov process

Factor	Meaning
Set of states S	The state of the car is a crucial factor for the vehicle to make the next decision, which can be idealized as the road condition in front of the car.
Set of actions A	Car actions can be broadly categorized into two main types: speed control and direction control. Speed control refers to the car’s ability to regulate its speed, while direction control is its ability to steer and navigate through the environment. In an out-of-control state, the car defaults to moving at full throttle, which means it accelerates at maximum speed. However, the car’s direction is still subject to continuous adjustments to prevent collisions or accidents.
Probability of state transition P	The state of the out-of-control car is continuously changing as it remains in motion, and the likelihood of maintaining its original state is zero. Given the unpredictability of its movement, it can be presumed that the probability of the car reaching any particular state is equal.
Reward function R	The objective of this paper is to incorporate ethical considerations into the decision-making process of autonomous vehicles and devise a reward function that prioritizes them.
Factor of discount γ	Given the intricate nature of the vehicle state in this problem and the substantial computational requirements for continuous control, it is important to avoid excessive convergence of the reward function. Therefore, this paper has adopted a higher value of 0.99 to address this issue.

The Markov decision model employed in this study is depicted in Fig. 3, which outlines the potential state transitions of a self-driving car as it navigates its motion space. Under normal conditions, the car’s state can take on nearly infinite possible values, represented by the blue circle in Fig. 3. However, in the event of a traffic accident, the model designates it as an absorbing state, requiring the self-driving car to learn from the collision with another object. By utilizing this Markov decision model, the self-driving car can make informed decisions based on its current state and the potential future states it may transition to, enabling it to navigate complex environments and handle unexpected scenarios.

Fig. 3

Markov decision model for self-driving cars.

Unlike conventional deep reinforcement learning methods, the control of self-driving cars involves continuous and real-time decision-making. To achieve this in our study, we make use of the “throttle” and “steer” functions within the Carla environment, ensuring that the model can make decisions in real time and avoiding instances where it defers decision-making, which could negatively impact convergence. To represent the car’s actions, we encode them as 0, 1, or 2, corresponding to steering left, driving straight, and steering right, respectively, each at full throttle. This approach allows for precise adjustments to the car’s direction, enabling it to navigate complex environments. We represent the Markov decision action space of the out-of-control vehicle using Equation (1).

$\text{action space} = \begin{cases} 0: & \text{throttle} = 1.0,\ \text{steer} = -1 \\ 1: & \text{throttle} = 1.0,\ \text{steer} = 0 \\ 2: & \text{throttle} = 1.0,\ \text{steer} = 1 \end{cases}$ (1)

The parameter “steer” controls the driving direction within the environment, allowing for the continuous control of the self-driving car’s movements in a single training cycle.
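As a minimal sketch of Equation (1), the discrete action index can be mapped to a throttle/steer pair as follows. The names `ACTION_SPACE` and `decode_action` are illustrative assumptions and are not taken from the paper's source code.

```python
# Illustrative mapping for Equation (1); all names here are hypothetical.
ACTION_SPACE = {
    0: {"throttle": 1.0, "steer": -1.0},  # full throttle, steer = -1
    1: {"throttle": 1.0, "steer": 0.0},   # full throttle, steer = 0
    2: {"throttle": 1.0, "steer": 1.0},   # full throttle, steer = 1
}


def decode_action(index):
    """Return the (throttle, steer) pair for a discrete action index."""
    if index not in ACTION_SPACE:
        raise ValueError(f"unknown action index: {index}")
    command = ACTION_SPACE[index]
    return command["throttle"], command["steer"]
```

Keeping the throttle fixed at 1.0 reflects the out-of-control assumption, so the agent only chooses among three steering values.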

4.2 Reinforcement learning-based approach

Reinforcement learning is centered on discovering the optimal solution without receiving explicit feedback on the correctness of each action until the end. The objective of an agent in reinforcement learning is to maximize the cumulative reward in each epoch through continuous exploration. In our study, we implement reinforcement learning for the self-driving car, as depicted in Fig. 4. This enables the car to learn from its actions and modify its behavior accordingly. By maximizing the cumulative reward over time, the car can make informed decisions that strike a balance between safety and efficiency.

Fig. 4

Self-driving car reinforcement learning architecture.

The Q-learning algorithm in reinforcement learning takes into account all possible behavioral pathways. In the self-driving car, the “speed” and “direction” features are used to define the decision-making model for the out-of-control vehicle in critical moments. The speed state is kept constant, and the action can be obtained directly through DQN. The updated form of the DQN algorithm used in our study is single-step Q-learning, which is defined as Equation (2).

$Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r + γ \max_{a'} Q (s', a') - Q (s_{t}, a_{t})]$ (2)
where α is the learning rate; γ is the discount factor measuring the cumulative return of the future state to the current state; r is the reward value for adopting the current behavior in the current state; and s' is the next state reached after taking action a_t in state s_t.
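The single-step update in Equation (2) can be illustrated with a small tabular sketch. This is not the paper's implementation (which approximates Q with a neural network); the function name, the dictionary-based table, the learning rate of 0.1, and the `done` flag for terminal states are assumptions made for illustration, while γ = 0.99 follows Table 2.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one single-step Q-learning update in place; return the new value."""
    # A terminal (absorbing) state contributes no future value.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: two abstract states and three actions.
q_table = defaultdict(float)
q_table[("s1", 2)] = 1.0
new_value = q_learning_update(q_table, "s0", 1, -0.5, "s1", n_actions=3)
```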

4.3 Xception model

In our experiment, we utilized Xception for detecting traffic conditions ahead of the vehicle, ensuring that the model meets both accuracy and speed requirements. The main concept behind Xception is depth-wise separable convolution: each feature channel of an RGB image is first convolved individually with a 3×3 convolution kernel and then convolved across channels with a 1×1 convolution kernel [30]. It works as shown in Fig. 5 below.

Fig. 5

Xception network structure.
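To make the benefit of depth-wise separable convolution concrete, the sketch below counts the weights of a standard convolution and of a depth-wise convolution followed by a 1×1 point-wise convolution. The function names and the example channel sizes are assumptions chosen for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3):
    """Weights in a depth-wise k x k convolution plus a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in   # one k x k kernel per input channel
    pointwise = c_in * c_out   # 1 x 1 kernels that mix the channels
    return depthwise + pointwise


# Hypothetical layer with 128 input channels and 256 output channels.
standard = standard_conv_params(128, 256)
separable = separable_conv_params(128, 256)
```

For this hypothetical layer the separable form needs 33,920 weights instead of 294,912, which is one reason such networks can be comparatively lightweight.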

Table 3 shows that Xception, a lightweight neural network structure [31], achieves the highest prediction accuracy on the ImageNet dataset compared with ResNet-152 [32], VGG-16 [33], and Inception V3 [34]. This performance makes it a suitable choice for our experiment: it meets our requirements for training accuracy and speed, allowing us to efficiently detect traffic conditions ahead of the self-driving car.

Table 3

Classification performance comparison on ImageNet

	Top-1 accuracy	Top-5 accuracy
VGG-16	0.712	0.901
ResNet-152	0.771	0.932
Inception V3	0.783	0.940
Xception	0.789	0.946

4.4 Rawlsian reward function

The reward function is a critical factor in guiding the decision-making of self-driving cars when faced with ethical dilemmas. This paper uses the Rawlsian maximin principle as the design criterion for the rewards, taking into account the complexity of the decision-making environment and ensuring fairness. The paper focuses on the collision selection problem in car ethical dilemmas and uses the kinetic theory in car collision theory to determine the reward function in reinforcement learning. This approach allows for the measurement of the accident subject and survival rate under different scenarios.

The simulation environment used in this study is based in an urban setting, with the intention of emulating real-world scenarios. Therefore, the self-driving car’s speed is limited to no more than 60 km/h (50 km/h) when driving on city roads. However, it is also taken into account that, in real-world driving situations, the vehicle’s speed should not be less than 20 km/h. It is important to note that driving at excessive speeds can result in greater harm than driving at a slower pace. The reward for the speed aspect is determined using Equation (3).

$\text{reward}_{\text{speed}} = \begin{cases} -0.1, & v > 50\ \text{km/h} \\ 0, & 50\ \text{km/h} > v > 20\ \text{km/h} \\ -0.05, & 20\ \text{km/h} > v \end{cases}$ (3)

In real-life situations, car collisions are typically inelastic. Collisions are commonly classified as elastic, partially inelastic, or fully inelastic. For the purposes of this study, the fully inelastic type is considered, in keeping with the Rawlsian maximin principle, because it results in the highest energy loss and therefore represents the worst case. The mass and velocity of the objects involved in the collision can vary in real-world scenarios, but for the purposes of the reward design in this paper, a high level of precision is not required. The specific values of the parameters used in the study are presented in Table 4.

Table 4

Parameter description of the collision object

Object	Mass /kg	Speed /(km/h)
Car	1500	40
Person	70	0
Barrier	40	0
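As an illustration of the fully inelastic assumption, the kinetic energy dissipated in a one-dimensional fully inelastic collision is ½·μ·(v₁ − v₂)², where μ = m₁m₂/(m₁ + m₂) is the reduced mass. The sketch below evaluates this textbook formula with the values in Table 4. It is not the paper's derivation: how such quantities are mapped to reward values is not specified in this section, and the function names are hypothetical.

```python
def kmh_to_ms(v_kmh):
    """Convert a speed from km/h to m/s."""
    return v_kmh / 3.6


def inelastic_energy_loss(m1, v1_kmh, m2, v2_kmh):
    """Kinetic energy (J) dissipated in a 1-D fully inelastic collision."""
    reduced_mass = m1 * m2 / (m1 + m2)
    relative_speed = kmh_to_ms(v1_kmh) - kmh_to_ms(v2_kmh)
    return 0.5 * reduced_mass * relative_speed ** 2


# Values from Table 4: a 1500 kg car at 40 km/h hitting a stationary object.
loss_person = inelastic_energy_loss(1500, 40, 70, 0)
loss_barrier = inelastic_energy_loss(1500, 40, 40, 0)
```

Under these assumptions the car-person collision dissipates roughly 4.1 kJ and the car-barrier collision roughly 2.4 kJ.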

The reward function for collisions is derived from collision dynamics and the Rawlsian maximin principle, and is represented by Equation (4), which can be found in the appendix.

$\text{reward}_{\text{crash}} = \begin{cases} -1, & \text{car-person} \\ -0.5, & \text{car-barrier} \\ -0.2, & \text{car-car} \end{cases}$ (4)

It is evident that the speed aspect and collision aspect are relatively independent and have little influence on each other. Thus, the total reward is the sum of the rewards from the speed and collision aspects, following the summation principle, and can be computed using Equation (5). ${reward}_{total} = {reward}_{speed} + {reward}_{crash}$ (5)

Equation (6) presents the obtained reward function for various behaviors in different states.

$r (o, v) = \begin{cases} -1.1, & o = \text{car-person},\ v > 50\ \text{km/h} \\ -1, & o = \text{car-person},\ 50\ \text{km/h} > v > 20\ \text{km/h} \\ -1.05, & o = \text{car-person},\ 20\ \text{km/h} > v \\ -0.6, & o = \text{car-barrier},\ v > 50\ \text{km/h} \\ -0.5, & o = \text{car-barrier},\ 50\ \text{km/h} > v > 20\ \text{km/h} \\ -0.55, & o = \text{car-barrier},\ 20\ \text{km/h} > v \\ -0.3, & o = \text{car-car},\ v > 50\ \text{km/h} \\ -0.2, & o = \text{car-car},\ 50\ \text{km/h} > v > 20\ \text{km/h} \\ -0.25, & o = \text{car-car},\ 20\ \text{km/h} > v \end{cases}$ (6)
where o is the event and v is the “state” of the vehicle at the time of the event, i.e., the speed of the car. Briefly, the reward function can be expressed as shown in Equation (7).

$\text{Reward Function} = ax + by + cz + \dots$ (7)
where a, b, c are coefficients that are set according to priority and x, y, z are the numbers of collided objects.
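The additive structure of Equations (3) to (5) can be sketched in a few lines. The names `CRASH_REWARD`, `speed_reward`, and `total_reward` are hypothetical, and treating the boundary speeds of exactly 20 km/h and 50 km/h as unpenalized is an assumption, since Equation (3) uses strict inequalities.

```python
# Collision rewards from Equation (4); the dictionary name is hypothetical.
CRASH_REWARD = {"person": -1.0, "barrier": -0.5, "car": -0.2}


def speed_reward(v_kmh):
    """Speed reward from Equation (3); boundary speeds are assumed unpenalized."""
    if v_kmh > 50:
        return -0.1
    if v_kmh < 20:
        return -0.05
    return 0.0


def total_reward(collided_with, v_kmh):
    """Sum of the speed and collision rewards, as in Equation (5)."""
    crash = 0.0 if collided_with is None else CRASH_REWARD[collided_with]
    return speed_reward(v_kmh) + crash
```

For example, `total_reward("person", 60)` combines the -1 collision reward with the -0.1 speed reward, matching the first case of Equation (6).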

4.5 Rawlsian DQN algorithm

The complexity of self-driving cars’ behavior and the diversity of states they encounter present challenges in adopting Q-learning. The storage space required to store reward values can increase significantly, leading to memory constraints. To address this issue, this paper introduces the convolutional neural network (CNN) as an approximation method for estimating Q-values. The CNN-based Q-function approximator reduces storage requirements and improves the learning efficiency of the Q-learning algorithm. To speed up training, a mini-batch semi-gradient descent strategy is adopted for Rawlsian DQN. The decision of the self-driving car is evaluated using the mean squared error (MSE) [35] as a loss function, as shown in Equation (8).

$\begin{aligned} Loss (w) &= E_{π_{w}} [(r + γ \max_{a'} Q (s', a', w) - Q (s, a, w))^{2}] \\ \nabla Loss (w) &= \frac{1}{N} \sum_{i = 1}^{N} (r + γ \max_{a'} Q (s', a', w) - Q (s, a, w)) \nabla Q (s, a, w) \end{aligned}$ (8)
where $r + γ \max_{a'} Q (s', a', w)$ represents the target Q-value, Q (s, a, w) represents the predicted Q-value, and γ is the discount factor. In addition, the Adam optimizer is used for the gradient update, which is carried out as shown in Equation (9).

$w \leftarrow w - α \nabla Loss (w)$ (9)
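The target and loss in Equation (8) can be illustrated on a tiny hand-made batch. This is a plain-Python sketch rather than the paper's TensorFlow code; the function names and the sample numbers are assumptions, and the handling of terminal transitions follows the end-of-episode rule used in Algorithm 1.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Target Q-values: r for terminal transitions, r + gamma * max Q(s', a') otherwise."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        targets.append(r if done else r + gamma * max(q_next))
    return targets


def mse_loss(targets, predicted):
    """Mean squared error between target and predicted Q-values."""
    return sum((y - q) ** 2 for y, q in zip(targets, predicted)) / len(targets)


# Hypothetical batch of two transitions; the first one ends in a collision.
targets = td_targets([-1.0, 0.0],
                     [[0.2, 0.5, 0.1], [0.0, 1.0, -1.0]],
                     [True, False])
loss = mse_loss(targets, [-0.5, 0.49])
```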

Algorithm 1 outlines the procedure for implementing the DQN algorithm used to train the self-driving car in our experiment.

Algorithm 1: Rawlsian DQN algorithm for self-driving cars
Initialization:
1. The maximum capacity of the experience pool D is 5000, and every 4 frames are taken as training data.
2. The initialized prediction network weights w = None.
3. Parameters of the target network w’ = w.
4. for e = 1 to Training Step do:
5. Initialize the state S₁ = {x₁}, convert the RGB images to grayscale during pre-processing, and get φ₁ = φ (S₁).
6. Input φ₁ into the Xception network to obtain the Q-value function corresponding to each action, Q (φ (S₁), a, w).
7. Repeat
8. Select an action in the prediction network using an ɛ_k-greedy strategy:
$A_{t} = \arg \max_{a} Q (φ (S_{t}), a, w)$ with a probability of 1 - ɛ_k, and a random action otherwise.
9. Execute action A_t to get the reward R_t+1 and the next frame image sequence x_t+1.
10. Update the state S_t+1 = (S_t, A_t, x_t+1). After pre-processing, the input sequence for the next moment is obtained: φ_t+1 = φ (S_t+1).
11. Store the experience transition sample (φ_t, A_t, R_t+1, φ_t+1, end) in a queue in the experience pool D, where end is a boolean variable indicating that a collision has occurred and that the training episode has ended.
If a collision occurred: end = True
12. Select n experience transition samples from them at random:
[(φ₁, A₁, R₂, φ₂, V₁, end₁) . . . (φ_j, A_j, R_j+1, φ_j+1, V_j, end_j)
. . . (φ_n, A_n, R_n+1, φ_n+1, V_n, end_n)].
13. If R_t ⩽ -1:
14. Put the sample in D₁.
15. Else
16. Put the sample in D₂.
17. End if
18. If the number of samples in D is much more than n, then:
19. for j = 0 to n - 1 do:
20. Use p = 1/(V + 1) to calculate the priority of samples in D₁ and D₂.
21. If random() < ρ then
22. Use the probability distribution $P (j) = p_{j}^{α} / \sum_{i = 0}^{size (D_{1})} p_{i}^{α}$ to extract (φ_j, A_j, R_j+1, φ_j+1, V_j, end_j) from D₁.
23. Else
Use the probability distribution $P (j) = p_{j}^{α} / \sum_{i = 0}^{size (D_{2})} p_{i}^{α}$ to extract (φ_j, A_j, R_j+1, φ_j+1, V_j, end_j) from D₂.
24. End if
25. Update the access times in the experience pool: V_j = V_j + 1.
26. If end_j = True then
27. y_j = R_j+1.
28. Else
29. $y_{j} = R_{j + 1} + γ \max_{a} Q (φ_{j + 1}, a, w')$.
30. Calculate the loss function: $Loss (w) = E_{π_{w}} [(r + γ \max_{a'} Q (s', a', w) - Q (s, a, w))^{2}]$.
31. Adopt the Adam method to correct the prediction network parameters w.
32. until t reaches the end state
33. Update target network parameters every ten steps.
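The two-pool sampling in steps 13 to 25 of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the authors' code: the function names, the dictionary layout of each stored sample, and the default values of ρ and α are assumptions, and falling back to the other pool when one is empty is a choice made here for robustness.

```python
import random


def store(accident_pool, normal_pool, transition, reward):
    """Route a transition to D1 (reward <= -1) or D2 with a zero visit count."""
    pool = accident_pool if reward <= -1 else normal_pool
    pool.append({"transition": transition, "visits": 0})


def sample_index(visits, alpha, rng):
    """Pick an index with probability proportional to (1 / (V + 1)) ** alpha."""
    weights = [(1.0 / (v + 1)) ** alpha for v in visits]
    return rng.choices(range(len(visits)), weights=weights, k=1)[0]


def sample_batch(accident_pool, normal_pool, n, rho=0.5, alpha=0.6, seed=None):
    """Draw n transitions; each comes from the accident pool with probability rho."""
    if not accident_pool and not normal_pool:
        raise ValueError("both experience pools are empty")
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        use_accident = bool(accident_pool) and (rng.random() < rho or not normal_pool)
        pool = accident_pool if use_accident else normal_pool
        idx = sample_index([item["visits"] for item in pool], alpha, rng)
        pool[idx]["visits"] += 1  # update the access count, as in step 25
        batch.append(pool[idx]["transition"])
    return batch
```

Because the priority 1/(V + 1) shrinks each time a sample is drawn, rarely replayed accident samples are favored over ones that have already been visited many times.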

In contrast to the conventional collision recognition Q-function, collision recognition is incorporated into the training process, and distinct reward values are assigned to different collision objects based on their priority. This paper presents a departure from previous qualitative humanistic research in favor of quantitative calculations, resulting in the development of a reward function that takes priority into account. This approach enables the agent to develop a greater awareness of the incidents it causes during training, which facilitates more effective learning. Additionally, Prioritized Accident Replay is introduced, which differs from the conventional preferential replay mechanism that utilizes high-quality samples to increase training speed. In this study, we propose prioritizing the replay of low-quality samples in order to reduce the number of incidents that occur during subsequent training. This approach seeks to improve the robustness of the self-driving system by emphasizing the importance of unfavorable learning experiences.

5 Simulation experiments

To evaluate the efficacy of the proposed method, we used the Python API to control the Carla simulator in order to construct an environment for reinforcement learning. This simulation-based evaluation framework permits us to examine the agent’s performance in a variety of complex scenarios and simulate real-world driving situations.

5.1 Hardware and software configuration

Carla is an open-source self-driving simulator that enables the construction of self-driving simulation environments and the algorithmic incorporation of deep reinforcement learning via Python API [36]. This simulation software primarily supports the construction of the learning environment for deep reinforcement learning, the acquisition of image data used for training, agent control methods, the discrimination of the nature of collision events, and diverse dynamical programming for the experiments conducted in this paper.

A relatively even and well-coordinated town (Town02_Opt) was chosen as our experimental scene to facilitate the experimentation process and reduce collision conflicts during environment construction. We generated a training vehicle at a random location within the town and positioned 15 non-player character (NPC) vehicles and 25 NPC pedestrians within the environment. Figure 6 depicts the resulting environment.

Fig. 6

Experimental environment.

The environment and system configuration for the experiments are as follows: the operating system is Windows 10, the CPU is an Intel(R) Xeon(R) Gold 5218R @ 2.10 GHz, and the graphics card is an NVIDIA GeForce RTX 3090 with 24 GB of memory. The Python version is 3.7, the TensorFlow version is 1.13.1, the Keras version is 2.2.4, and the Carla simulator version is 0.9.13. The data and source code are available at https://github.com/newsigema/Ethical-and-moral-decision-making-for-self-driving-cars-based-on-deep-reinforcement-learning.

The experiments are accelerated via GPU, and the RGB images captured by the front camera of the self-driving car are used as training data. The real-time images are fed into the Xception network model, and TensorBoard is used to output the curves. In terms of experimental design, we compare the performance of Rawlsian DQN and DQN. The main difference between the two is that Rawlsian DQN uses a reward function based on the dynamic change of the vehicle state that takes priority into account, and it introduces an accident priority replay mechanism to avoid the worst solution. The experimental group adopts Rawlsian DQN, while the control group adopts DQN with “experience random replay” and a “0-1 reward function”.

5.2 Hyper-parameter tuning

Before beginning the formal experiment, we conducted 100 pre-training iterations to fine-tune the relevant parameters in order to improve the experiment’s outcomes. Given that the RGB images used for training are acquired in real time, we use the mini-batch technique to train the model with batch sizes of 4, 8, and 12, as shown in Fig. 7.

Fig. 7

Comparison of different batch sizes.

Regarding the learning rate, the initial learning rate is set to 1 to avoid overly fast convergence. The pre-training results with learning rate decay values of 0.975, 0.9975, and 0.99975 are compared in Fig. 8.
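To give a sense of how differently the three decay values behave, the following sketch computes the learning rate after a number of updates. It assumes a multiplicative decay applied once per update, starting from the initial rate of 1; the paper does not state the decay schedule, so this is an assumption, and the function name is hypothetical.

```python
def decayed_learning_rate(initial_lr: float, decay: float, num_updates: int) -> float:
    """Learning rate after num_updates multiplicative decay steps (assumed schedule)."""
    return initial_lr * decay ** num_updates


# Compare the three candidate decay values after 100 and 1000 updates.
for decay in (0.975, 0.9975, 0.99975):
    after_100 = decayed_learning_rate(1.0, decay, 100)
    after_1000 = decayed_learning_rate(1.0, decay, 1000)
    print(f"decay={decay}: lr after 100 updates={after_100:.4f}, after 1000={after_1000:.6f}")
```

Under this assumption, the smallest decay value shrinks the learning rate fastest, while the largest keeps it close to the initial value for much longer.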

Fig. 8

Comparison of different learning rate decay values.

Because the Carla simulator consumes a large amount of computer memory in this experiment, this paper chooses the Adam optimizer for its high computational efficiency and low memory requirements. In summary, the related parameters are set as shown in Table 5.

Table 5

Related parameter settings

Parameters	Setting
Network structure/optimizer	Xception/Adam
Loss function	Mean squared error (MSE)
Training period	100,000 steps
Learning rate decay	0.975
Batch size	4

5.3 Markov model setting

The observation input for the self-driving car is the road conditions ahead of it. The training image dataset employed in this study was obtained from the front-facing RGB camera, which captures 640×480-pixel three-channel images. The specific camera structure is depicted in Fig. 9.

Fig. 9

Camera position setting.

We also added an observation camera at the rear of the car to make the presentation more intuitive. Figure 10 shows the images captured by the two cameras: the training perspective and the observation perspective.

Fig. 10

Images captured by the two cameras.

In this study, a training cycle for the self-driving car is defined as the period from its inception until a collision occurs. Within the Python API of the Carla simulator, the collision object’s type (e.g., “vehicle,” “walker,” “barrier”) can be determined using Python’s “isinstance” function.
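A minimal sketch of this type check is shown below. So that it runs without a simulator, it uses stand-in classes that mirror `carla.Actor`, `carla.Vehicle`, and `carla.Walker`; in a real Carla script the struck actor would come from the collision event's `other_actor` attribute. Treating every other actor as a barrier is an assumption made for illustration, and the function name is hypothetical.

```python
class Actor:
    """Stand-in for carla.Actor (stub so the sketch runs without Carla)."""


class Vehicle(Actor):
    """Stand-in for carla.Vehicle."""


class Walker(Actor):
    """Stand-in for carla.Walker (a pedestrian)."""


def collision_category(other_actor: Actor) -> str:
    """Classify the object the ego vehicle collided with using isinstance."""
    if isinstance(other_actor, Walker):
        return "walker"
    if isinstance(other_actor, Vehicle):
        return "vehicle"
    # Assumption: anything that is neither a pedestrian nor a vehicle is a static barrier.
    return "barrier"


print(collision_category(Walker()))   # walker
print(collision_category(Vehicle()))  # vehicle
print(collision_category(Actor()))    # barrier
```

The resulting category can then be used to end the training cycle and to look up the collision-specific reward.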

5.4 Training results analysis

Figure 11 depicts the reward curve for the environment depicted in Fig. 2, obtained by training with the DQN algorithm that integrates ethical prioritization. Since the ultimate goal of self-driving car research is to minimize collisions with pedestrians under the most adverse conditions, this study focuses predominantly on the average and minimum rewards received during training.

Fig. 11

Reward curve of Rawlsian DQN and DQN.

In the training outcomes of the experimental group, two significant decreases in reward occurred between 15,000 and 20,000 steps (as depicted in Fig. 11(b)). Further investigation revealed that these decreases were caused by collisions with pedestrians, which incurred a large penalty that negatively impacted network performance (as illustrated in Fig. 12). This finding was reproduced when the script was rerun.

Fig. 12

Footage captured by the camera before colliding with passers-by.

The minimum reward’s convergence was more notable than that of the average reward. Priority allowed the minimum reward to converge to approximately -0.65 between 20,000 and 45,000 steps, with only minimal fluctuations attributable to changes in speed. The control group’s training curves indicate that although the vehicle attempted to avoid collisions, it was unable to evaluate the ethical and moral implications of the various individuals involved in the scenario. As a consequence, the direction of the training process was unclear, and the responses exhibited persistent jumps in the curve, making it impossible to draw meaningful conclusions. This had a substantial effect on the convergence of the results.

Due to variations in environmental parameters during training and the number of training sessions, training outcomes may vary depending on the training situation. We conducted 10 training sessions and recorded the data of the DQN algorithm and the Rawlsian DQN algorithm for the minimal reward curve in order to quantify this uncertainty. Calculating the mean value and standard deviation of the minimum reward for 10 repeated experiments yielded the results shown in Table 6.

Table 6

Reward statistics

Approach	Mean of minimum reward	Std.
Rawlsian DQN	-0.6428	0.0203
DQN	-1.0310	0.0179

The results demonstrated that the standard deviation of the Rawlsian DQN algorithm was greater than that of the DQN algorithm. This is because the Rawlsian DQN algorithm takes into account realistically complex factors such as collision objects and speed. Nevertheless, the standard deviation of both algorithms is extremely low (on the order of $10^{-2}$), so the algorithm’s stability is acceptable.
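The summary statistics over repeated runs can be computed as in the sketch below. It assumes one minimum-reward value per training session and uses the sample standard deviation; the paper does not say whether the sample or population form was used. The input values are synthetic placeholders, not experimental data, and the function name is hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def summarize_min_rewards(min_rewards: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run minimum rewards."""
    return statistics.mean(min_rewards), statistics.stdev(min_rewards)


# Synthetic placeholder values, chosen only to make the arithmetic easy to follow.
mean, std = summarize_min_rewards([-1.0, -2.0, -3.0])
print(mean, std)  # -2.0 1.0
```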

5.5 Uncertainty analysis

Based on the calculated standard deviation, it is evident that the proposed model is stable for the locally determined environmental model. After approximately 50,000 training steps, however, the experimental group’s reward trajectory began to fluctuate once more. Further analysis of the model file revealed that the fluctuations in Fig. 13’s reward curve were the result of a successful departure from the local decision environment and subsequent training in the new environment. The standard deviation results also indicate that when the local environment alters, the optimal decision-making model requires additional training time. Therefore, environmental ambiguity has a significant impact on the algorithm’s performance, which represents its limitation.

Fig. 13

The fluctuation of a curve after leaving the given environment.

6 Questionnaire test

Finally, we performed a questionnaire to determine whether the Rawlsian DQN algorithm’s ethical decision-making in this scenario was acceptable to the general public. The questionnaire was conducted online, and seventy valid responses were collected. It was a two-tiered survey, with the first tier collecting basic information and the second tier concentrating on intentions regarding ethical decision-making. The two-tiered approach helped participants differentiate the focus of the survey sections.

In the basic information survey, there was a comparatively balanced gender ratio, with 51.43% of respondents being male and 48.57% being female. The ratio of drivers with licenses to those without licenses was close to 1:1, at 47.14% to 52.86%. This allowed for a more objective evaluation of the decision-making intentions of self-driving cars. 71.43% of respondents answered in the affirmative to the question of whether the public is willing to own a self-driving car, indicating a greater willingness to embrace self-driving cars. The second level of questions was also predicated on the assumption that the general public was more willing to own self-driving cars. The questionnaire for the scenarios depicted in Fig. 2 yielded the data shown in Fig. 14.

Fig. 14

Questionnaire results.

Even at the risk of their own lives, more individuals continue to choose to collide with the barrier. The convergence of the reinforcement learning algorithm’s reward curve to -0.65 results in “striking the barrier at a slower speed.” The fact that the data is consistent with the research findings and is even superior to the optimal solution envisioned by the team demonstrates that the Rawlsian DQN algorithm is capable of making rational decisions in the given environment.

Furthermore, this paper expands the scenario, and separate questionnaires were administered for the scenarios in Table 7.

Table 7

Decision-making scenarios

Scenarios	Research results
1	The self-driving car loses control at high speed with two pedestrians on the road: a man over 80 years old and a young man.
2	The self-driving car loses control at high speed and can only choose to hit one pedestrian on the right front of the car or five pedestrians on the left front.
3	The self-driving car loses control at high speed while a pedestrian who ran a red light is on the road. If it swerves, it will hit and kill pedestrians, but not swerving may cause a chain of crashes with unknown results.

According to the training approach adopted in this paper, the deep reinforcement learning training schemes for the above scenarios are described in Table 8.

Table 8

Training program changes

Scenarios	Training program changes
1	This requires a choice between “people” as opposed to the original question, so a new ethical decision-making study needs to be introduced to prioritize the different people. The results of the study showed that the subjects were more inclined to protect those with higher social values.
2	A lower limit on the number of collisions during a training session should be set so that a vehicle has multiple collisions before ending a training step, and the amount of experience to be played back needs to be increased.
3	Due to the possibility of a chain of crashes, the crash log should record information on all crashes in the area and calculate the Q-value for the whole area.

7 Conclusion

This paper proposes an ethical decision-making scheme for self-driving cars based on deep reinforcement learning. The Rawlsian maximin principle is utilized to develop the reward function and to incorporate insights from humanities research. In order to prevent accidents and other undesirable outcomes during the training process, a prioritized accident replay mechanism is also implemented. The experiment demonstrates that this approach is more effective for addressing the ethical decision-making challenges posed by self-driving cars. In addition, the paper presents the results of a survey of seventy participants, which corroborate that the subjects’ decision-making tendencies align with the simulation outcomes, thereby validating the enhanced efficacy and rationality of the Rawlsian maximin principle.

Nevertheless, our long-term investigation reveals that despite this algorithm’s success in local decision-making, its performance fluctuates once it leaves the established decision-making environment. Therefore, the algorithm’s present application is relatively limited. We propose two future research directions. Inspired by DDQN [11], we will first investigate the combination of Rawlsian decision theory with more advanced reinforcement learning algorithms and further evaluate its performance. s, e, s’;>can also be used to expand the decision-making scope in future work. Then, DDQN can be used to simulate the autonomous simulation process with greater precision. Second, we intend to investigate how to implement the knowledge gained from local decision-making to automatic driving in the real world. To accomplish this, we should consider how to speed up training in a simulation environment and reduce computational complexity.

Acknowledgments

The research of this paper is supported by the Sichuan Science and Technology Program (No. 2022NSFSC0459) and the Fundamental Research Funds for the Central Universities (No. 202210613020).

Appendix

In an elastic collision, kinetic energy is conserved in addition to momentum, which gives Equation (1).

(1) $\left\{\begin{matrix} m \vec{v}_{before} + m_{it} \vec{v}_{itbefore} = m \vec{v}_{after} + m_{it} \vec{v}_{itafter} \\ \frac{1}{2} m v_{before}^{2} + \frac{1}{2} m_{it} v_{itbefore}^{2} = \frac{1}{2} m v_{after}^{2} + \frac{1}{2} m_{it} v_{itafter}^{2} \end{matrix}\right.$

In a perfectly inelastic collision, the loss of kinetic energy is greatest, i.e., the two objects move together with a common velocity after the collision, so Equation (1) can be changed to Equation (2).

(2) $m \vec{v}_{before} + m_{it} \vec{v}_{itbefore} = (m + m_{it}) \vec{v}_{after}$

In practice, collisions involving cars are often very complex, so this article assumes a perfectly inelastic collision to simplify the calculation.

Since the velocity is a two-dimensional vector, it can be expressed by Equation (3).

(3) $\vec{v} = (v_{x}, v_{y})$

This gives Equation (4).

(4) $\left\{\begin{matrix} m v_{x(before)} + m_{it} v_{itx(before)} = (m + m_{it}) v_{x(after)} \\ m v_{y(before)} + m_{it} v_{ity(before)} = (m + m_{it}) v_{y(after)} \end{matrix}\right.$

The change in momentum is calculated by Equation (5).

(5) $Δmv = m v_{after} - m v_{before}$
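The component-wise momentum balance for a perfectly inelastic collision and the resulting momentum change of the car can be checked numerically with a short sketch. The masses and speeds used here are hypothetical round numbers chosen for easy arithmetic, not the values used in the paper, and the function names are illustrative.

```python
from typing import Tuple

Vec2 = Tuple[float, float]


def velocity_after_inelastic(m: float, v: Vec2, m_it: float, v_it: Vec2) -> Vec2:
    """Common velocity after a perfectly inelastic collision, per component."""
    total_mass = m + m_it
    return ((m * v[0] + m_it * v_it[0]) / total_mass,
            (m * v[1] + m_it * v_it[1]) / total_mass)


def momentum_change(m: float, v_before: float, v_after: float) -> float:
    """Change in the car's momentum along one axis: m * v_after - m * v_before."""
    return m * v_after - m * v_before


# Hypothetical case: a 1000 kg car at 10 m/s hits a stationary 1000 kg object.
v_after = velocity_after_inelastic(1000.0, (10.0, 0.0), 1000.0, (0.0, 0.0))
print(v_after)                                      # (5.0, 0.0)
print(momentum_change(1000.0, 10.0, v_after[0]))    # -5000.0
```

The heavier the struck object relative to the car, the larger the magnitude of the car's momentum change at a given speed.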

However, the change in momentum along a single direction (the x-direction or the y-direction) can qualitatively measure the change in the combined momentum. Thus, when only the momentum in the x-direction is considered, the change in momentum when the car hits different objects is obtained as shown in Table 9.

This gives Equation (6).

(6) $Δmv = \left\{\begin{matrix} 208.33 v - 8333.33, & \text{car-car} \\ 18.58 v, & \text{car-person} \\ 49.02 v, & \text{car-barrier} \end{matrix}\right.$

The collision times for the different collision types, taken from published data, are shown in Table 10.

These are combined with the momentum theorem, expressed by Equation (7), where $F$ is the average collision force and $Δt$ is the collision time.

(7) $Δmv = F Δt$

Therefore, the average force generated by the collision is given by Equation (8).

(8) $F = \left\{\begin{matrix} \frac{208.33 v - 8333.33}{0.2}, & \text{car-car} \\ \frac{18.58 v}{0.04}, & \text{car-person} \\ \frac{49.02 v}{0.12}, & \text{car-barrier} \end{matrix}\right.$

Considering the buffering capacity of different objects, Equation (9) is then obtained.

(9) $buffering\ capacity = \left\{\begin{matrix} \frac{208.33 v - 8333.33}{0.2} \times 0.45, & \text{car-car} \\ \frac{18.58 v}{0.04} \times 1, & \text{car-person} \\ \frac{49.02 v}{0.15} \times 0.75, & \text{car-barrier} \end{matrix}\right.$
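To see how the three branches of Equation (9) compare at a given speed, the sketch below evaluates them with the coefficients copied as printed in Equation (9). Clamping the car-car branch to zero where its numerator is negative is an assumption made here so that low speeds do not yield a negative value; the function name and the test speed are hypothetical.

```python
from typing import Dict


def buffered_impact(v: float) -> Dict[str, float]:
    """Evaluate the three branches of Equation (9) at speed v."""
    return {
        # Assumption: clamp to zero where 208.33 * v - 8333.33 is negative.
        "car-car": max(0.0, (208.33 * v - 8333.33) / 0.2 * 0.45),
        "car-person": 18.58 * v / 0.04 * 1.0,
        "car-barrier": 49.02 * v / 0.15 * 0.75,
    }


values = buffered_impact(60.0)  # illustrative speed, in the paper's units
for collision_type, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{collision_type}: {value:.1f}")
```

At this illustrative speed the car-person branch is the largest and the car-car branch the smallest, which matches the ordering of the crash penalties used for the reward.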

We then apply the Rawlsian maximin principle, i.e., finding the optimal object in the set of minimum values under different measures.

As Fig. 15 shows, the car-car collision occurs only when v > 40. Moreover, the priority in the driving process is determined according to the magnitude of the values of the different curves, which gives Equation (10).

(10) $car - person < car - barrier < car - car$

Therefore, according to the worst results under the different crash types and the Rawlsian maximin principle, the rewards are set as in Equation (11).

(11) $reward_{crash} = \left\{\begin{matrix} -1, & \text{car-person} \\ -0.5, & \text{car-barrier} \\ -0.2, & \text{car-car} \end{matrix}\right.$
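The crash rewards and the maximin idea behind them can be illustrated with a short sketch. The reward values are those of Equation (11); the action names and their possible outcomes are hypothetical, and the explicit maximin selection shown here only illustrates the principle, whereas in the paper the agent learns its behavior from the rewards through DQN training.

```python
from typing import Dict, List

# Crash rewards from Equation (11).
CRASH_REWARD: Dict[str, float] = {
    "car-person": -1.0,
    "car-barrier": -0.5,
    "car-car": -0.2,
}


def crash_reward(collision_type: str) -> float:
    """Look up the penalty assigned to a collision type."""
    return CRASH_REWARD[collision_type]


def maximin_action(outcomes_by_action: Dict[str, List[str]]) -> str:
    """Pick the action whose worst possible outcome has the highest reward."""
    return max(
        outcomes_by_action,
        key=lambda action: min(crash_reward(o) for o in outcomes_by_action[action]),
    )


# Hypothetical dilemma: each action may end in one of the listed collision types.
options = {
    "swerve_left": ["car-person", "car-barrier"],
    "brake_straight": ["car-barrier", "car-car"],
}
print(maximin_action(options))  # brake_straight
```

Here the worst case of swerving is a car-person collision (-1), while the worst case of braking is a car-barrier collision (-0.5), so the maximin rule prefers braking.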

References

1. Zhao, Liang and Chen, The key technology toward the self-driving car, International Journal of Intelligent Unmanned Systems 6(1) (2018), 2–20. doi: 10.1108/IJIUS-08-2017-0008.
2. Grand View Research, Autonomous Vehicles Market Size & Share Report, https://www.grandviewresearch.com/industry-analysis/autonomous-vehicles-market (accessed Nov. 20, 2022).
3. Hsu L.-Y. and Chen T.-L., Vehicle Dynamic Prediction Systems with On-Line Identification of Vehicle Parameters and Road Conditions, Sensors 12(11) (2012), Art. no. 11. doi: 10.3390/s121115778.
4. Goodall N.J., Ethical Decision Making during Automated Vehicle Crashes, Transportation Research Record 2424(1) (2014), 58–65. doi: 10.3141/2424-07.
5. Liu, Wei and Li, Occluded Street Objects Perception Algorithm of Intelligent Vehicles Based on 3D Projection Model, Journal of Advanced Transportation 2018 (2018), e1547276. doi: 10.1155/2018/1547276.
6. Lefebvre and Ambellouis, Vehicle detection and tracking using Mean Shift segmentation on semi-dense disparity maps, in 2012 IEEE Intelligent Vehicles Symposium, Jun. 2012, pp. 855–860. doi: 10.1109/IVS.2012.6232280.
7. Kyriakidis, Happee and de Winter J.C.F., Public opinion on automated driving: Results of an international questionnaire among respondents, Transportation Research Part F: Traffic Psychology and Behaviour 32 (2015), 127–140. doi: 10.1016/j.trf.2015.04.014.
8. Rezaei and Caulfield, Examining public acceptance of autonomous mobility, Travel Behaviour and Society 21 (2020), 235–246. doi: 10.1016/j.tbs.2020.07.002.
9. Said and Barr, Human emotion recognition based on facial expressions via deep learning on high-resolution images, Multimedia Tools and Applications 80(16) (2525), 1–3. doi: 10.1007/s11042-021-10918-9.
10. Chen I.-M. and Chan C.-Y., Deep reinforcement learning based path tracking controller for autonomous vehicle, Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering 235(2–3) (2021), 541–551. doi: 10.1177/0954407020954591.
11. Guarino, Malandrino, Marzullo, Torre and Zaccagnino, Adaptive talent journey: Optimization of talents’ growth path within a company via Deep Q-Learning, Expert Systems with Applications 209 (2022), 118302. doi: 10.1016/j.eswa.2022.118302.
12. Hung T.N.K. et al., An AI-based Prediction Model for Drug-drug Interactions in Osteoporosis and Paget’s Diseases from SMILES, Molecular Informatics 41(6) (2100), e264. doi: 10.1002/minf.202100264.
13. T.H., Nguyen N.T.K., Kha Q.H. and Le N.Q.K., On the road to explainable AI in drug-drug interactions prediction: A systematic review, Computational and Structural Biotechnology Journal 20 (2022), 112–2123. doi: 10.1016/j.csbj.2022.04.021.
14. Rahwan et al., Machine Behaviour, in Machine Learning and the City, John Wiley & Sons, Ltd (2022), pp. 143–166. doi: 10.1002/9781119815075.ch10.
15. Siegel and Pappas, Morals, ethics and the technology capabilities and limitations of automated and self-driving vehicles, AI & Soc, Sep. 2021. doi: 10.1007/s00146-021-01277-y.
16. Gill, Blame It on the Self-Driving Car: How Autonomous Vehicles Can Alter Consumer Morality, Journal of Consumer Research 47(2) (2020), 272–291. doi: 10.1093/jcr/ucaa018.
17. Ryan, The Future of Transportation: Ethical, Legal, Social and Economic Impacts of Self-driving Vehicles in the Year, Sci Eng Ethics 26(3) (2020), 1185–1208. doi: 10.1007/s11948-019-00130-2.
18. Kameda et al., Rawlsian maximin rule operates as a common cognitive anchor in distributive justice and risky decisions, Proc. Natl. Acad. Sci. U.S.A. 113(42) (2016), 11817–11822. doi: 10.1073/pnas.1602641113.
19. Leben, A Rawlsian algorithm for autonomous vehicles, Ethics and Information Technology 2(19) (2017), 107–115. doi: 10.1007/s10676-017-9419-3.
20. Contissa, Lagioia and Sartor, The Ethical Knob: ethically-customisable automated vehicles and the law, Artif Intell Law 25(3) (2017), 365–378. doi: 10.1007/s10506-017-9211-z.
21. Davnall, Solving the Single-Vehicle Self-Driving Car Trolley Problem Using Risk Theory and Vehicle Dynamics, Sci Eng Ethics 26(1) (2020), 431–449. doi: 10.1007/s11948-019-00102-6.
22. Schäffner, Between Real World and Thought Experiment: Framing Moral Decision-Making in Self-Driving Car Dilemmas, Humanist Manag J 6(2) (2021), 249–272. doi: 10.1007/s41463-020-00101-x.
23. Chandak, Aote, Menghal, Negi, Nemani and Jha, Two-stage approach to solve ethical morality problem in self-driving cars, AI & Soc, Jun. 2022. doi: 10.1007/s00146-022-01517-9.
24. Lin, Zhang, Dong and Zhang, Safe deep reinforcement learning-based adaptive control for USV interception mission, 77, Ocean Engineering 246 (2022), 110477. doi: 10.1016/j.oceaneng.2021.110477.
25. Cervera-Uribe A.A., U19-Net: A Deep Learning Approach for Obstacle Detection in Self-Driving Cars, p. 2.
26. Duan, Li, Guan, Sun and Cheng, Hierarchical Reinforcement Learning for Self-Driving Decision-Making without Reliance on Labeled Driving Data, IET Intelligent Transport Systems 14 (2020). doi: 10.1049/iet-its.2019.0317.
27. Maqueda A.I., Loquercio, Gallego, García and Scaramuzza, Event-Based Vision Meets Deep Learning on Steering Prediction for Self-Driving Cars, presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5419–5427. Accessed: Nov. 13, 2022. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/html/
28. Kim and Canny, Interpretable Learning for Self-Driving Cars by Visualizing Causal Attention, presented at the Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2942–2950. Accessed: Nov. 13, 2022. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/Kim_Interpretable_Learning_for_ICCV_2017_paper.html
29. Chen, Han, Zhu, Liu and Zhao, Reinforce Learning-Based Collision Avoidance in Network Assisted Automated Driving, In Review, preprint, Jan. 2022. doi: 10.21203/rs.3.rs-1226853/v1.
30. Chollet, Xception: Deep Learning With Depthwise Separable Convolutions, presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258. Accessed: Nov. 13, 2022. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Chollet_Xception_Deep_Learning_CVPR_2017_paper.html
31. Bianco, Cadene, Celona and Napoletano, Benchmark Analysis of Representative Deep Neural Network Architectures, IEEE Access 6 (2018), 64270–64277. doi: 10.1109/ACCESS.2018.2877890.
32. Zhang, Ren and Sun, Deep Residual Learning for Image Recognition. arXiv, Dec. 10, 2015. doi: 10.48550/arXiv.1512.03385.
33. Simonyan and Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv, Apr. 10, 2015. doi: 10.48550/arXiv.1409.1556.
34. Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, Rethinking the Inception Architecture for Computer Vision. arXiv, Dec. 11, 2015. doi: 10.48550/arXiv.1512.00567.
35. Hester et al., Deep Q-learning From Demonstrations, Proceedings of the AAAI Conference on Artificial Intelligence 32(1) (2018), Art. no. 1. doi: 10.1609/aaai.v32i1.11757.
36. Dosovitskiy, Ros, Codevilla, Lopez and Koltun, CARLA: An Open Urban Driving Simulator, in Proceedings of the 1st Annual Conference on Robot Learning, PMLR, Oct. 2017, pp. 1–16. Accessed: Nov. 26, 2022. [Online]. Available: https://proceedings.mlr.press/v78/dosovitskiy17a.html

2. Algorithm comparison:
Method∖Features	Road condition recognition	Lane change decision	Global planning	Ethical decision making	Steering forecast	Model training
(1) Deep learning networks based on security constraints	√	√
(2) Deep convolutional neural networks combined with event cameras	√
(3) Hierarchical Reinforcement Learning Network		√			√
(4) Reinforcement learning networks combined with path planning	√	√	√
(5) Rawlsian Deep Q-learning Network	√	√		√		√