Abstract
To address mainline congestion and ramp queue spill-back on urban expressways, an optimized on-ramp control method based on reinforcement learning is proposed. An online reinforcement learning algorithm is used to optimize on-ramp control by taking the metering rate as the action; the ramp queue length, the throughput and the occupancy rate of the weaving area as the state; and the volume of the road network as the reward function. By iterating the value function with the actions actually taken, the proposed method avoids the need for an accurate traffic model and the reliance on prior knowledge. Meanwhile, the real-time update of the Q-value function compensates for control hysteresis. Compared with the classical ALINEA method in simulation scenarios of the Nanjing Kazimen Expressway, the proposed method reduces the average delay by 16.83%, the total delay by 15.83% and the total travel time by 5.22%, and increases the average speed by 6.80%; the average queue length of the on-ramp decreases by 89%; the average occupancy rate of the weaving area decreases by 2.42% at rush hours, and the average traffic volume increases by 109 veh/h.
Introduction
The traffic demand on urban expressways continues to grow steadily, resulting in frequent traffic congestion and vehicle delays, and thus causing traffic safety problems [1]. Compared with highways, the expressway is characterized by many nodes, a large surface range, many entrance and exit ramps, and short distances between ramps. When the merging area is close to the diverging area, or when the entrance ramp passes through an auxiliary lane that connects with the off-ramp, a weaving area is formed [2]. Heavy weaving in this area reduces the traffic capacity of the mainline, makes it difficult for mainline vehicles to exit via the off-ramp, easily causes queues to propagate upstream and jam the mainline, and can even spread the jam to the on-ramp and form ramp spill-back [3].
There are many ways to relieve congestion on the mainline and in the weaving area of an expressway, such as mainline control, variable speed limit control, channel control and ramp control [4]. Highway control experience shows that ramp control is the most direct and effective way to relieve congestion on highways and expressways [5]. Here, ramp control mainly refers to on-ramp control. Its basic goal is to regulate the traffic demand on the highway or expressway, that is, to limit the number of vehicles entering the mainline from the on-ramp at rush hours so that the mainline traffic flow runs at its optimal state.
In addition to timing control, the existing ramp control methods are mainly divided into local induction control and coordinated induction control [6]. The local induction control uses the real-time detection data of a single on-ramp and its adjacent sections as the basis of control decisions, including Demand-Capacity, OCC, ALINEA, etc. The coordinated induction control is a control method that coordinates multiple on-ramps. Its control strategies mainly include optimal control, state regulator control, heuristic control and intelligent control. The ramp control methods based on the classical local induction control ALINEA and its extension have been applied in the European region, and the heuristic coordinated induction control methods such as Zone [7], Bottleneck [8], Swarm [9] and Hero [10] are also practically applied in some regions. The application results show that the control system based on accurate mathematical model and experience can desirably solve the frequent congestion problems on the expressway mainline, but for sporadic congestion problems, due to the model’s inaccurate prediction of the traffic state of the incident, it is easy to cause control disorder and time lag [11]. The heuristic control is substantially impacted by model parameter settings and an improper model parameter setting easily leads to poor control results, or even over-control [12]. The research shows that the intelligent control method in ramp control can make up for the deficiency of traditional heuristic coordinated induction control to a certain extent. The intelligent control is mainly oriented to systems featured by nonlinearity, complexity, time variability, uncertainty and incompleteness, while highway or expressway is precisely provided with the above characteristics. However, the existing mature intelligent control methods, including fuzzy logic control and neural network, have the problem of too slow learning speed and it is difficult to apply them in practice [13]. 
In summary, the control effect of the classical local control methods depends heavily on the calibration of control parameters, and these methods cannot guarantee the control effect over a wide road network. The existing coordinated control methods cannot achieve system optimization and need accurate original or measured data; a poor control effect in one cycle may have a negative impact on the control in subsequent cycles. In addition, they respond to sporadic congestion with a time lag. The existing intelligent control methods, such as fuzzy logic control and neural networks, require long training times and suffer from local or slow convergence, which makes them difficult to apply in practice. To deal with these shortcomings, this paper uses a reinforcement learning algorithm to carry out intelligent ramp control [14].
Predecessors have conducted studies on the application of reinforcement learning in the ramp control field [15]. [16] advanced the adoption of Q-learning to solve the on-ramp control problem of expressway and achieved the optimal control of local on-ramps in the context of traffic demand changes. [17] took METANET model as the traffic model and based on Q-learning method put forward the local on-ramp control method which took into account the ramp queue length. [18] advanced the adoption of reinforcement learning to solve the single on-ramp control in the context of unknown traffic model and conducted the traffic simulation experiment on the basis of the actual data collected in Toronto. [19] applied Q-learning to solve the on-ramp control problem under congestion conditions at rush hours and realized the simulation experiment by using C# to pass through the Vissim COM interface. [20] conducted pertinent control over the mainline speed and the total travel time of road network while using Q-learning to carry out local on-ramp control. [21] analysed the selection of parameters in the reinforcement learning ramp control and recommended the settings of the parameters.
Q-learning is used in many studies of the existing reinforcement learning local ramp control. Q-learning is essentially an offline algorithm and each learning must obtain the control scenario at various traffic states. However, in the process of actual traffic operation, not all traffic states will emerge. For complex traffic state space, it may be difficult to converge in a short cycle, while online learning can effectively avoid such problems.
Based on the above analysis, this paper constructs a ramp control method suited to the characteristics of the weaving area for expressway on-ramp control, using an online reinforcement learning algorithm. Section 2 introduces the theory of reinforcement learning and analyses the advantages and disadvantages of different learning methods; Section 3 constructs the local intelligent control model of the expressway ramp based on the online learning method; Section 4 compares the local expressway control method constructed in this paper with the classical ALINEA method; Section 5 summarizes the proposed method and its control effect.
Modeling method
Reinforcement learning [22] developed from the dynamic programming equation, trial-and-error learning, parameter disturbance adaptive control, temporal difference and other theories. In reinforcement learning, the learning system obtains the mapping between state and action through interaction with the external environment, and the learning objective is to optimize the reward function value. With an autonomous agent that can perceive the environment as its basic unit, reinforcement learning completes a process of continuous trial and error through the interaction of the agent with the environment. Its basic principle is as follows: if an action strategy of the reinforcement learning system earns a reward from the environment, the tendency to produce this action strategy is strengthened [23]. The basic model is shown in Fig. 1.
The specific interaction process of agent in reinforcement learning with the environment can be described as follows [24]:
The agent perceives the current environmental state s;
According to the current state s and reinforcement signal r, the agent chooses an action a to act on the environment;
The environmental state is changed to the new state s’, and a reinforcement signal r is given;
The new state feeds back the reinforcement signal to the agent.

Basic model of reinforcement learning.
The reinforcement learning algorithm can be divided into offline and online learning modes pursuant to the application process. The offline learning uses the algorithm to obtain the optimal action offline corresponding to each state, that is, the optimal state-action pair, and then performs the optimal action in the face of different states. The online learning is to obtain learning experience in the real-time interaction process of the learning algorithm with the environment, and produce optimal action by using value function experience combined with action selection strategies. In the environmental operation process, learning gradually converges and ultimately gets the optimal state-action pair at various states.
The Q-learning algorithm is a model-independent reinforcement learning algorithm and is recognized as a milestone in the development of reinforcement learning [20]. Because the Q-value function considers each action of the agent in each iteration cycle, Q-learning in essence does not require a specific exploration strategy. Under certain conditions, a simple greedy strategy is sufficient to ensure the convergence of the algorithm. Therefore, Q-learning is one of the most effective model-independent reinforcement learning algorithms.
In Q-learning, the action strategy is generally denoted by a function π : S × A → (0, 1), where π(s, a) is the probability of choosing action a in state s. The Q-value function indicates how good or bad it is to implement an action under a certain strategy, by linking together the strategy π, the expected reward r and the state-action pair.
According to the optimal Bellman formula, the Q-value function update can be deduced so that the Q-value function is always updated in the optimal direction. In the environment state $s_t$ observed at the given time $t$, an action $a_t$ is selected and implemented. At the next time $t + 1$, the state $s_t$ is transferred to $s_{t+1}$ and the intelligent system receives the reward $r_{t+1}$; thus an empirical sample $(s_t, a_t, r_{t+1}, s_{t+1})$ is obtained. Then a real-time update is made to the Q-value function of the state-action pair $(s_t, a_t)$. The rule is as follows:
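For reference, the standard single-step Q-learning update, written with the symbols used in this section, can be stated as follows. It is given here as the textbook form of the rule, not as a copy of the original typeset equation.

```latex
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a \in A} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right]
```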
Where $s_t$ is the state; $a_t$ is the action; $r_t$ is the reward value; α is the learning rate, α ∈ (0, 1), used to control the speed of learning; γ is the discount factor, γ ∈ (0, 1); A is the set of all alternative actions; $Q_{t+1}(s_t, a_t)$ is the updated Q-function value of the action $a_t$ selected by the Q-learning model in state $s_t$. For all s and a, as the number of updates k increases, the Q-value function converges to the optimal Q-value function.
The learning task of Q-learning algorithm is as follows: find an action strategy to make the Q-value function of all the state-action pairs of this strategy get the maximum value and record this strategy as π*. MDP (Markov Decision Process) generally has at least one π* and if different π* have the same Q-value function, then the Q-value function is the optimal Q-value function Q*:
Usually a table is used to store all the state-action pairs of the optimal Q-value function. The Q-value function that stores all the state-behaviour pairs is expressed into a matrix of |S| × |A| in size. π* is derived through the proofing and back-calculation of the Q value. The expression is as follows:
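In standard notation, the optimal Q-value function referred to above and the strategy read back from it are usually written as follows (a textbook formulation, assuming the maximization convention):

```latex
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a),
\qquad
\pi^{*}(s) = \arg\max_{a \in A} Q^{*}(s, a)
```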
π* is called deterministic optimal strategy of MDP and optimal strategy for short.
A typical process of single-step Q-learning algorithm is shown in Table 1.
Process of Q-learning algorithm
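As an illustration of the single-step procedure, the following minimal Python sketch implements tabular Q-learning with an ε-greedy behaviour strategy. The two-state environment `toy_step`, the function names and the default parameter values are hypothetical and serve only to make the example runnable; they are not taken from the paper.

```python
import random

def q_learning_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.9):
    """Apply one single-step Q-learning update to the table Q[state][action]."""
    best_next = max(Q[s_next])  # value of the assumed (greedy) next action
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

def run_q_learning(step, n_states, n_actions, episodes=200, horizon=20,
                   epsilon=0.1, seed=0):
    """Learn a Q table by interacting with step(s, a) -> (s_next, reward)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = epsilon_greedy(Q, s, epsilon, rng)
            s_next, r = step(s, a)
            q_learning_update(Q, s, a, r, s_next)
            s = s_next
    return Q

def toy_step(s, a):
    """Hypothetical 2-state environment: action 1 earns reward 1, action 0 earns 0."""
    return (s + a) % 2, float(a)

if __name__ == "__main__":
    print(run_q_learning(toy_step, n_states=2, n_actions=2))
```

The update uses the maximum Q value of the next state regardless of which action is actually executed next, which is the property the text contrasts with SARSA.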
The optimization of the Q-value update function is divided into maximization and minimization operations. Among them, the maximization operation is used more commonly, that is, the action corresponding to the maximum Q value at the same state of Q-value function is the optimal action.
SARSA learning is an improved Q-learning algorithm [25]. Q-learning assumes the next action and uses the maximum value of the Q-value function for iteration, so it belongs to the category of offline algorithms. By contrast, SARSA learning uses the Q value of the action actually taken for iteration and updates the Q-value matrix according to the experience obtained from implementing the actual strategy, so it is an online algorithm. In Q-learning, the value function iteration and the action selection strategy iteration are independent of each other, while in the SARSA algorithm the actual action is used to conduct the value function iteration, so the value function iteration is consistent with the action selection strategy iteration. Its Q-value update rule is as follows:
The specific steps of SARSA are shown in Table 2.
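For reference, the standard SARSA update replaces the maximum over next actions with the Q value of the action $a_{t+1}$ that is actually selected in $s_{t+1}$ (textbook form, using the same symbols as the Q-learning rule):

```latex
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right]
```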
Process of SARSA learning algorithm
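A minimal Python sketch of the SARSA procedure is given below. It differs from Q-learning only in that the next action is chosen first and its Q value is used in the update. The environment `toy_step`, the ε-greedy selection and all default values are hypothetical illustrations, not the settings of this paper.

```python
import random

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.3, gamma=0.9):
    """One SARSA update: bootstraps from the action actually chosen next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
    return Q[s][a]

def choose(Q, s, epsilon, rng):
    """Epsilon-greedy selection (an assumed, simple selection mechanism)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

def run_sarsa(step, n_states, n_actions, steps=2000, epsilon=0.1, seed=0):
    """Learn online: every executed action immediately updates the table."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    a = choose(Q, s, epsilon, rng)
    for _ in range(steps):
        s_next, r = step(s, a)
        a_next = choose(Q, s_next, epsilon, rng)  # action that will really be taken
        sarsa_update(Q, s, a, r, s_next, a_next)
        s, a = s_next, a_next
    return Q

def toy_step(s, a):
    """Hypothetical 2-state environment: action 1 earns reward 1, action 0 earns 0."""
    return (s + a) % 2, float(a)

if __name__ == "__main__":
    print(run_sarsa(toy_step, n_states=2, n_actions=2))
```

Because the update target depends on `a_next`, the learned values reflect the strategy that is actually being executed, including its exploratory actions.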
The Q value converges more rapidly as the learning rate α increases, but a large α easily leads to premature convergence. Therefore, the learning rate normally takes a value between 0.2 and 0.5. If the discount factor γ is small, the algorithm depends on immediate rewards; otherwise, it depends on long-term rewards. Therefore, the discount factor is set to approximately 0.9.
As SARSA is based on Q-value iteration, a simple greedy strategy can be used to ensure convergence to the optimal strategy, but the choice of the action selection mechanism plays a pivotal role in the convergence of the algorithm.
Offline method always adopts optimal actions in the actual operation process and will not produce actions with unfavorable effects on the environment. Its main disadvantage is that the establishment of an accurate environmental model is required for simulation and prediction, and it obtains the optimal action after the convergence of learning algorithm, while online learning does not need the establishment of an environmental model and it completes convergence just by reading the response of the environment to the actual action. Moreover, it is unnecessary to wait for learning convergence before the actual operation. Learning is synchronized with operation. It is difficult to establish an accurate traffic model due to the complexity, randomness and uncertainty of expressway. The offline learning needs to wait for learning convergence before the next simulation, so each learning must obtain the control scenario at various traffic states of Q-learning matrix. However, in the process of actual traffic operation, not all traffic conditions will emerge. Therefore, the offline learning may have the problem of too slow convergence or even convergence difficulty in the face of a sudden change in the traffic states, and cannot implement ramp control in time. In contrast, the online learning does not need an accurate model, and the value function update is synchronized with the operation. It conducts learning according to the real-time traffic states and updates its corresponding Q value. At the early stage of traffic operation, the online learning possibly cannot obtain an optimal scenario. However, through the appropriate action selection mechanism and the increase of traffic operation time, the Q-value matrix will gradually converge, and the optimal action corresponding to the actual traffic states will be drawn in succession. Therefore, SARSA is chosen as the basis of the study on expressway ramp control.
The road segment is divided into multiple ramp agents. Each agent contains only one on-ramp and several off-ramps. The core of the ramp control method based on reinforcement learning lies in the model construction of ramp agent, including algorithm selection, state space, action space, reward function, Q-value update rule, action selection strategy, and finally the SARSA-based local ramp control method (SARSA Ramp Metering, SRM) is built.
Establishment of action space
The on-ramp metering rate is chosen as the action and is implemented by the signal lamp. The green time is generally set to 2 s, and the control effect is achieved by adjusting the red time [26]. The ramp metering space, namely the action space, is:
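The relation between red time and metering rate can be illustrated with a short sketch. It assumes, beyond what the text states, that exactly one vehicle is released per green phase; the function names and the `veh_per_green` parameter are hypothetical.

```python
GREEN_S = 2.0  # fixed green time in seconds, as stated in the text

def metering_rate_veh_per_h(red_s, green_s=GREEN_S, veh_per_green=1):
    """Metering rate for a given red time.

    Assumption: `veh_per_green` vehicles pass during each green phase.
    """
    cycle_s = green_s + red_s
    return 3600.0 * veh_per_green / cycle_s

def red_time_for_rate(rate_veh_per_h, green_s=GREEN_S, veh_per_green=1):
    """Inverse mapping: red time (s) needed to realize a target metering rate."""
    return 3600.0 * veh_per_green / rate_veh_per_h - green_s

if __name__ == "__main__":
    for red in (2.0, 4.0, 8.0):
        print(red, metering_rate_veh_per_h(red))
```

Under this assumption a longer red time gives a lower metering rate, so a discrete set of red times corresponds one-to-one to a discrete action space of metering rates.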
Establishment of state space
The commonly used approach is to discretize the continuous state into a finite set of state points and then form a state-action table with the action space. Many variables can be used to describe the traffic state of the urban expressway entrance ramp system, such as the upstream mainline traffic volume, the traffic volume of the weaving area, the on-ramp traffic volume, the on-ramp queue length and the downstream mainline density. In previous studies, density was mostly used as the measure of the mainline traffic state [27]. However, in actual projects density cannot be obtained directly and the detector can only measure the occupancy rate, so this paper takes the occupancy rate o of the weaving area as a state variable, with its state space denoted by O. The traffic state of the entrance ramp is described by the queue length w of vehicles on the entrance ramp, with its state space denoted by W. Because of the complicated traffic conditions in the weaving area of the expressway, the occupancy rate alone cannot fully describe how smoothly traffic flows in the weaving area downstream of the entrance ramp. The traffic volume v is also an important measure, and its state space is denoted by V. Therefore, in this paper, the state set S is defined as:
Where the downstream mainline occupancy rate of the entrance ramp ranges from 0 to 1 and is evenly discretized into 11 points, that is, O = {0, 0.1, 0.2, ..., 1.0}.
The maximum queue length of the entrance ramp is determined by the ramp length of the selected simulation scenario and other geometrical factors. According to the field investigation, the queue length generally does not exceed 100 veh. It is evenly discretized into 11 points, so the W set is W = {0, 10, 20, ..., 100}.
The Kazimen Expressway chosen here is a two-way six-lane road, that is, three lanes in each direction; with the auxiliary road in the weaving area, there are four lanes in one direction in total. The ideal traffic capacity is 8000 veh/h. The V set is evenly discretized into 11 points, V = {0, 800, 1600, ..., 8000}.
The number of states in the entire state space is 1331 (1331 = 11×11×11).
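The mapping from measured values to one of the 1331 table rows can be sketched as follows. The upper bounds come from the text; snapping each measurement to the nearest of the 11 points and the row-major index order are assumptions made for illustration, and the function names are hypothetical.

```python
N_POINTS = 11          # discretization points per state variable
OCC_MAX = 1.0          # occupancy rate upper bound
QUEUE_MAX = 100.0      # on-ramp queue length upper bound (veh)
VOL_MAX = 8000.0       # weaving area traffic volume upper bound (veh/h)

def to_level(value, upper, n_points=N_POINTS):
    """Snap a measurement to the nearest of n_points evenly spaced levels."""
    clipped = min(max(value, 0.0), upper)
    return int(round(clipped / upper * (n_points - 1)))

def state_index(occupancy, queue_veh, volume_veh_h):
    """Row of the Q-value table for the state (o, w, v), in 0..1330."""
    o = to_level(occupancy, OCC_MAX)
    w = to_level(queue_veh, QUEUE_MAX)
    v = to_level(volume_veh_h, VOL_MAX)
    return (o * N_POINTS + w) * N_POINTS + v

if __name__ == "__main__":
    print(state_index(0.41, 37, 5000))
```

Values outside the stated ranges are clipped to the nearest bound, so every measurement maps to a valid row.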
The overall goal of the expressway control system is to minimize the total travel time (TTT) of vehicles in the road network. It is defined as follows:
Where $T_c$ is the control cycle; K is the number of control cycles; N(t) is the total number of vehicles in the road network in the $t$-th control cycle. Suppose there is no vehicle initially in the road network and it is deduced that N(t) is:
Where d(t') and s(t') are the traffic volumes entering and leaving the road network in the control cycle. Substitute Formula (7) into (6) to obtain:
Where
Where d (t) is the sum of the traffic volumes entering the entrance ramp and entering the weaving area from the upstream mainline in the control cycle; s (t) is the sum of the traffic volumes leaving the exit ramp from the mainline and leaving the weaving area from downstream mainline in the control cycle.
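A sketch of this chain of formulas, reconstructed from the definitions above under the assumptions that d(t) and s(t) are vehicle counts per control cycle and that the network is initially empty, reads:

```latex
\mathrm{TTT} = T_c \sum_{t=1}^{K} N(t),
\qquad
N(t) = \sum_{t'=1}^{t} \left[ d(t') - s(t') \right]
\;\;\Longrightarrow\;\;
\mathrm{TTT} = T_c \sum_{t=1}^{K} \sum_{t'=1}^{t} \left[ d(t') - s(t') \right]
```

Since the inflow d is determined by the uncontrollable demand, reducing TTT amounts to increasing the outflow s as early as possible.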
If the decrease in the number of vehicles in the road network is taken as the reward function, the long-term goal of learning becomes reducing the number of vehicles in the road network to the greatest extent. Since the traffic demand is uncontrollable, the traffic state of the mainline weaving area is improved by controlling the number of vehicles entering the mainline from the on-ramp, so that more vehicles can leave the mainline via the off-ramp and the downstream end of the weaving area.
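The bookkeeping behind this reward can be illustrated with a short sketch. It assumes that inflow and outflow are given as vehicle counts per control cycle and that the per-cycle reward is the outflow minus the inflow; the function names and the sample numbers are hypothetical.

```python
from itertools import accumulate

def vehicles_in_network(inflow, outflow):
    """N(t) for each cycle, starting from an empty network."""
    return list(accumulate(d - s for d, s in zip(inflow, outflow)))

def total_travel_time(inflow, outflow, cycle_s):
    """TTT in vehicle-seconds: cycle length times the sum of N(t)."""
    return cycle_s * sum(vehicles_in_network(inflow, outflow))

def reward(d_t, s_t):
    """Decrease in the number of vehicles in the network during one cycle."""
    return s_t - d_t

if __name__ == "__main__":
    inflow = [10, 12, 8]    # hypothetical vehicles entering per cycle
    outflow = [6, 10, 12]   # hypothetical vehicles leaving per cycle
    print(vehicles_in_network(inflow, outflow))
    print(total_travel_time(inflow, outflow, cycle_s=60))
```

A positive reward in a cycle means more vehicles left the network than entered it, which lowers N(t) for all later cycles and therefore lowers TTT.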
The ε-greedy mechanism is the most commonly used action selection mechanism. However, online learning interacts in real time with the actual traffic data, and before the value function converges, actions are generated from the value function experience through the strategy. If the ε-greedy mechanism is still used after the value function has gradually converged, a non-optimal action is still likely to be chosen, which can easily cause traffic jams. Therefore, as the number of iterations increases, the probability that the agent chooses the optimal action should gradually increase. The action update mechanism of the Pursuit function meets this requirement.
Similar to ɛ - greedy mechanism, Pursuit function always pursues the evaluation of optimal action at the current state, that is,
The probability iteration formula of choosing other actions
Where $\pi_t(a)$ is the probability of choosing the action a in the $t$-th cycle;
Where, the probability of choosing the ramp metering
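The Pursuit update described above can be sketched as follows, using the standard form in which the probability of the currently greedy action moves toward 1 by a fraction β and all other probabilities move toward 0 by the same fraction. The function names and the value of β are illustrative assumptions.

```python
import random

def pursuit_update(probs, q_values, beta=0.1):
    """Move selection probabilities toward the currently greedy action.

    probs:    current probability of each action in this state
    q_values: current Q values of the same actions
    """
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [p + beta * ((1.0 if a == best else 0.0) - p)
            for a, p in enumerate(probs)]

def sample_action(probs, rng):
    """Draw an action index according to the selection probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.25, 0.25, 0.25, 0.25]
    for _ in range(20):
        probs = pursuit_update(probs, q_values=[0.0, 2.0, 1.0, 0.0], beta=0.2)
    print(probs, sample_action(probs, rng))
```

The probabilities always sum to 1, and repeated updates with an unchanged greedy action drive its probability toward 1, which gives the gradual shift from exploration to exploitation described in the text.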
The Q-value update rule of the new local ramp control method is determined once the action space, state space and reward function have been determined. As mentioned earlier, the SARSA algorithm has been selected, so the Q-value update rule is as follows:
Where $s_t$ is the real-time traffic state, which is determined via Formula (5) from the measured traffic data; $a_t$ is the ramp metering action, which is chosen via Formula (12); $r_t$ represents the number of vehicles reduced in the road network, which is obtained via Formula (9); and α and γ are the learning rate and discount factor respectively.
As shown in Formula (13), the Q-value update is iterated by the ramp metering $a_{t+1}$ actually implemented and the corresponding traffic state $s_{t+1}$. Compared with the offline algorithm, it does not need to traverse the traffic states that will not appear, thus improving the learning efficiency.
The reinforcement learning system acquires the traffic volume, queue length and occupancy data actually collected by the road section, obtains the traffic state, calculates the immediate reward and updates the Q-value matrix. The control action is selected through the current state, action selection mechanism and Q-value experience and it is sent to the ramp signal lamp to achieve ramp control. Therefore, the intelligent ramp control process based on reinforcement learning of measured data is as follows:
Step 1. Initialize the Q-value matrix and the action selection matrix;
Step 2. Obtain the current traffic state, composed of the occupancy, volume and ramp queue length, from the detectors as the initial state;
Step 3. Using the Q-value experience, select a ramp metering a according to the action selection strategy in the set of feasible on-ramp metering actions corresponding to the traffic states;
Step 4. Perform the ramp metering in the expressway on-ramp control system, observe the reward and new traffic state indicated by the traffic volume, and select the metering in the next control cycle;
Step 5. Update Q value: $Q_{t+1}(s, a) \leftarrow Q_t(s, a) + \alpha \left[ r + \gamma Q_t(s', a') - Q_t(s, a) \right]$;
Step 6. Assign the new state s' to s and the new metering a' to a;
Step 7. Repeat Steps 3 through 6 until the end of control.
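The steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the implementation used in the paper: states and actions are integer indices, `read_state` and `apply_metering` are hypothetical callables standing in for the detector and signal interfaces, and the parameter defaults are placeholders.

```python
import random

class SrmAgent:
    """Sketch of a SARSA agent with Pursuit action selection."""

    def __init__(self, n_states, n_actions, alpha=0.3, gamma=0.9, beta=0.05, seed=0):
        # Step 1: initialize the Q-value matrix and the action selection matrix.
        self.Q = [[0.0] * n_actions for _ in range(n_states)]
        self.P = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
        self.alpha, self.gamma, self.beta = alpha, gamma, beta
        self.rng = random.Random(seed)

    def select(self, s):
        """Pursuit step for state s, then sample an action from the probabilities."""
        q = self.Q[s]
        best = max(range(len(q)), key=q.__getitem__)
        self.P[s] = [p + self.beta * ((a == best) - p) for a, p in enumerate(self.P[s])]
        return self.rng.choices(range(len(q)), weights=self.P[s], k=1)[0]

    def learn(self, s, a, r, s_next, a_next):
        """Step 5: SARSA update with the action actually selected next."""
        td_error = r + self.gamma * self.Q[s_next][a_next] - self.Q[s][a]
        self.Q[s][a] += self.alpha * td_error

def run_control(agent, read_state, apply_metering, n_cycles):
    """read_state() -> state index; apply_metering(a) -> reward after one cycle."""
    s = read_state()                   # Step 2
    a = agent.select(s)                # Step 3
    for _ in range(n_cycles):          # Step 7
        r = apply_metering(a)          # Step 4: run one control cycle
        s_next = read_state()
        a_next = agent.select(s_next)
        agent.learn(s, a, r, s_next, a_next)
        s, a = s_next, a_next          # Step 6
    return agent

if __name__ == "__main__":
    # Hypothetical one-state stub in which action 0 is always rewarded.
    agent = run_control(SrmAgent(1, 2), lambda: 0,
                        lambda a: 1.0 if a == 0 else 0.0, n_cycles=300)
    print(agent.Q, agent.P)
```

In a real deployment the two callables would wrap the detector data processing and the signal controller; here they are reduced to a stub so that the loop can be executed on its own.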
Because the traffic flow is cyclical, the traffic states at the same time on each working day are similar. As the number of training days increases, the Q-value matrix of the model gradually converges over the different traffic states. Ultimately, the final ramp metering rates can be determined by the optimal action selection strategy.
Simulation experiment
Matlab and Vissim are chosen as the simulation software. The compilation of three control scenarios (no control, ALINEA and SRM) is achieved by using Matlab language. Matlab calls Vissim simulation real-time data through the COM interface and feeds back the control information to Vissim after processing to complete control.
Kazimen Expressway in Nanjing City is taken as the experimental section. It contains a weaving area, an entrance ramp and an exit ramp, as shown in Fig. 2. The experimental section is a one-way three-lane section from west to east. The Kazimen Street entrance ramp of the first weaving area is defined as the entrance ramp. This weaving area is 366m long. According to field research, traffic congestion is easily caused. The entrance ramp is 346m long and its schematic diagram is shown in Fig. 3.

Satellite map of Kazimen Expressway.

Diagram of Kazimen simulation section.
Based on the actual satellite map, the road network is built at a ratio of 1 : 1, as shown in Fig. 4. Following the layout of the actual road network, coil detectors are set on the upstream and downstream sections, the middle section of the weaving area, the entrance ramp and the exit ramp [28].

Simulation road network of Kazimen Expressway.
This paper uses the simulation software to reproduce the actual traffic state of the expressway and its data detection. The total simulation time is 2 h and the mainline traffic demand is shown in Fig. 5. Rush hours start at about 8 : 15 am and end at about 9 : 00 am. The on-ramp traffic demand is shown in Fig. 6. The off-ramp outflow rate is 8%. According to the field survey, the entrance ramp gradually becomes congested from 8 : 20 am; at 9 : 10 am, the ramp traffic demand gradually decreases and the congestion comes to an end.

Mainline traffic demand.

On-ramp traffic demand.
The control method of ramp proposed in this paper and its improvements are compared with the classical method ALINEA. So, the methods finally in need of simulation comparison are shown in Table 3.
Table of models for simulation comparison
ALINEA is the most classical ramp control method [6] and has been widely applied in practice around the world. A no-control experiment is run for the ramp weaving area under different traffic demands to obtain the traffic volume - occupancy rate diagram shown in Fig. 7, from which the critical occupancy rate is calibrated. The critical occupancy rate of the weaving area is 41%.

Critical occupancy rate of the weaving area.
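For orientation, the ALINEA feedback law adjusts the previous metering rate in proportion to the gap between the critical occupancy and the measured downstream occupancy. The sketch below uses the 41% critical occupancy calibrated above; the regulator gain of 70 veh/h per percent and the rate bounds are assumed placeholder values, not parameters reported in this paper.

```python
def alinea_rate(prev_rate, occupancy_pct, critical_pct=41.0,
                gain=70.0, r_min=200.0, r_max=1800.0):
    """One ALINEA step: r(k) = r(k-1) + K_R * (o_critical - o_measured).

    prev_rate:     metering rate of the previous cycle (veh/h)
    occupancy_pct: measured downstream occupancy (%)
    gain, r_min, r_max: assumed values for illustration only
    """
    rate = prev_rate + gain * (critical_pct - occupancy_pct)
    return min(max(rate, r_min), r_max)

if __name__ == "__main__":
    rate = 900.0
    for occ in (38.0, 41.0, 45.0, 47.0):
        rate = alinea_rate(rate, occ)
        print(occ, rate)
```

When the measured occupancy exceeds the critical value the rate is reduced, and when it falls below the critical value the rate is increased, which is why the calibration of the critical occupancy matters for the comparison.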
There are three parameters to be calibrated in SRM: the learning rate α, the discount factor γ and the parameter β of the Pursuit action selection mechanism. Following [29], the values of the three learning parameters used in this paper are shown in Table 4.
Determination of the learning parameters
The learning rate α and the model parameter β together strike a balance between exploration at the early stage and exploitation at the late stage, which ensures that the local ramp control method can eventually achieve optimal control. The discount factor γ keeps each ramp metering step from being limited to its short-term control effect, so that the method aims at improving the long-term operating efficiency of the expressway.
As SRM is an online learning method, the concrete simulation process is realized by Vissim-Matlab simulation platform. The optimization and simulation of the signal control scenario are carried out synchronously, so the Q-value matrix stored in Matlab is only updated once in each simulation cycle. A complete learning process can be described as follows:
Step 1. Start the Vissim-Matlab simulation platform, initialize the Q-value matrix in Matlab, and define the action selection strategy (ramp metering);
Step 2. Select the initial action, first run a cycle of Vissim and get the traffic state;
Step 3. Choose the current ramp metering action based on the state, Q-value matrix and action selection strategy;
Step 4. Send back to Vissim the signal lamp control scenario corresponding to action, extract the traffic state next_state after a cycle of a single-step running and choose the ramp metering in the next cycle according to the action selection strategy;
Step 5. Matlab gets the reward next_reward after data processing;
Step 6. Update the Q-value matrix by using the Q-value update rule, state = next_state, action = next_action;
Step 7. Determine whether the termination condition is satisfied, if not, go to step 4, otherwise it will be terminated.
The learning period is 2 hours in simulation, while in practical engineering it is one day. The corresponding optimal ramp metering at different traffic states is obtained through repeated learning. Moreover, as the traffic demand changes, the model can also update the Q-value matrix in real time to meet the control requirements of actual projects.
Based on the expressway simulation research experience, this paper analyzes the overall performance indexes of road network in the three scenarios, then evaluates the control effects of no-control method, ALINEA and SRM respectively from the perspective of on-ramp queue length, weaving area occupancy rate and weaving area traffic volume.
Evaluation of the performance of road network
VISSIM provides various road network performance indexes. In this paper, the average delay (Ave.delay), total delay (Total.delay), average speed (Ave.speed) and total travel time are chosen as the evaluation indexes. The simulation results of the scenarios are shown in Table 5. With regard to the average delay (Ave.delay), the average vehicle delay in the road network is the largest in Scenario 1 (no-control) and decreases successively in Scenario 2 (ALINEA) and Scenario 3 (SRM), down by 32.33% and 43.73% respectively compared with the no-control case; the Ave.delay in Scenario 3 is 16.83% lower than that in Scenario 2.
Overall simulation evaluation indexes of the road network
With regard to the total delay (Total.delay), the total vehicle delay in the road network is the largest in Scenario 1 and decreases successively in Scenario 2 and Scenario 3, down by 31.42% and 42.28% respectively compared with the no-control case; the Total.delay in Scenario 3 is 15.83% lower than that in Scenario 2.
With regard to the average speed (Ave.speed), the average speed in the road network is the lowest in Scenario 1 and increases successively in Scenario 2 and Scenario 3, up by 18.01% and 26.03% respectively; the Ave.speed in Scenario 3 is 6.80% higher than that in Scenario 2.
The total travel time is a measure of the effectiveness of the control method. The total travel time is the largest in Scenario 1 and decreases successively in Scenario 2 and Scenario 3, down by 14.06% and 18.55% respectively. The total travel time of Scenario 3 is 5.22% lower than that of Scenario 2.
In summary, as to the road network performance, in the order of no-control scenario, ALINEA scenario and SRM scenario, the Ave.delay, Total.delay and total travel time are basically reduced in sequence, while the Ave.speed is increased gradually. Therefore, compared with the ALINEA method, the SRM scenario has a relatively good control effect on the overall performance of road network.
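The SRM-versus-ALINEA percentages follow arithmetically from the two changes reported against the no-control baseline. The short check below recomputes them from the figures quoted in the text; the function name is hypothetical, and small differences in the second decimal place are expected because the inputs are already rounded.

```python
def relative_change(pct_a, pct_b):
    """Percent change of scenario B relative to scenario A.

    pct_a and pct_b are the percent changes of A and B against the same
    baseline (negative for a reduction, positive for an increase).
    """
    return ((100.0 + pct_b) / (100.0 + pct_a) - 1.0) * 100.0

# Percent changes versus no-control as reported in the text: (ALINEA, SRM).
reported = {
    "Ave.delay": (-32.33, -43.73),
    "Total.delay": (-31.42, -42.28),
    "Ave.speed": (18.01, 26.03),
    "Total travel time": (-14.06, -18.55),
}

if __name__ == "__main__":
    for name, (alinea, srm) in reported.items():
        print(f"{name}: SRM vs ALINEA = {relative_change(alinea, srm):+.2f}%")
```

The recomputed values agree with the reported 16.83%, 15.83%, 6.80% and 5.22% to within rounding.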
The on-ramp queue lengths in different control scenarios are compared, as shown in Fig. 8.

Comparison chart of on-ramp queue lengths in different scenarios.
In Scenario 1, the on-ramp vehicles are not controlled, so they all enter the mainline weaving area and no queue forms on the entrance ramp. In Scenario 2, excessive control traps many vehicles on the entrance ramp, resulting in a long on-ramp queue. Compared with Scenario 2, Scenario 3 increases the on-ramp metering rate at peak hours, so its queue length is clearly smaller than that in Scenario 2 at peak hours. The average on-ramp queue lengths in the different control scenarios during the 60 to 110 min rush-hour period are shown in Table 6.
The average on-ramp queue lengths in different control scenarios at rush hours
As shown in Table 6, the average on-ramp queue length during peak hours is 0 in Scenario 1, and the average queue length in Scenario 3 is 89% lower than that in Scenario 2. Scenario 3 thus greatly reduces the average on-ramp queue length during peak hours and improves the smoothness of the flow, while its overall network performance is also better than that of Scenario 2.
Figure 9 shows that the occupancy rates in the three scenarios are basically the same at the beginning of rush hours, that the weaving area occupancy rate in Scenario 1 is the highest in the later period of rush hours, and that the occupancy rate in Scenario 3 is lower than that in Scenario 2.

Comparison chart of weaving area occupancy rates in different scenarios.
The average weaving area occupancy rates in the different control scenarios over the 60 to 110 min rush-hour period are shown in Table 7.
Table 7 shows that the average weaving area occupancy rate in Scenario 3 is 5.72% and 2.42% lower than that in Scenario 1 and Scenario 2 respectively.
The average weaving area occupancy rate in different control scenarios
The weaving area traffic volumes in different scenarios are compared, as shown in Fig. 10.

Comparison of ramp 1 weaving area traffic volumes in different scenarios.
Figure 10 shows that the traffic volumes of ramp 2 in the three scenarios are basically the same at the beginning of rush hours. Later in the rush hours, the traffic volume in Scenario 1 is low due to congestion, and the traffic volume in Scenario 2 is low because excessive control holds back many vehicles. In Scenario 3, the control effort is adjusted so that the total weaving area traffic volume is higher than that in Scenario 2.
The average weaving area traffic volumes in the different control scenarios over the 60 to 110 min rush-hour period are shown in Table 8.
The average weaving area traffic volumes in different control scenarios at rush hours
As indicated by Table 8, the average traffic volume of the weaving area in Scenario 3 is 214 veh/h and 109 veh/h higher than that in Scenario 1 and Scenario 2 respectively.
To sum up, the comparison of on-ramp queue lengths, weaving area occupancy rates and traffic volumes shows that, compared with ALINEA, the SRM control method proposed in this paper greatly reduces the on-ramp queue length, decreases the weaving area occupancy rate and increases the weaving area traffic volume at rush hours. Thus, it has a better control effect.
To address congestion on the mainline weaving area of the urban expressway and on its entrance ramp, a local ramp agent model is established by applying the online SARSA reinforcement learning method, forming the ramp control method (SRM).
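The on-policy character of SARSA can be illustrated with a minimal tabular sketch. It is not the paper's implementation: the candidate metering rates in `ACTIONS`, the hyperparameters `ALPHA`, `GAMMA` and `EPSILON`, and the encoding of the state as a tuple of discretised (queue length, throughput, occupancy) bins are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

# Assumed values for illustration only; the paper's settings may differ.
ACTIONS = [300, 600, 900, 1200]         # candidate metering rates, veh/h
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration

# Q[(state, action)] -> estimated value; unseen pairs default to 0.0.
Q = defaultdict(float)


def choose_action(state, rng=random):
    """Epsilon-greedy selection of a metering rate for the given state."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def sarsa_update(state, action, reward, next_state, next_action):
    """SARSA update: the target uses the action actually taken next."""
    target = reward + GAMMA * Q[(next_state, next_action)]
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])


# One illustrative step with made-up bins and a made-up reward.
state, next_state = (2, 1, 3), (1, 2, 2)
sarsa_update(state, 600, reward=10.0, next_state=next_state, next_action=900)
```

Because the target is built from the next action the agent actually executes rather than from the greedy maximum, the value function is iterated with actual behavior, which is the property the method relies on.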
Compared with no-control and the classical ALINEA method, the proposed method learns and optimizes its control policy, and it improves the control effect on road network performance, ramp queue length, weaving area occupancy rate and traffic volume.
In the simulation experiment of the local ramp control of Kazimen Expressway in Nanjing, compared with the classical ALINEA, the new method (SRM) reduces the average delay by 16.83% and the total delay by 15.83%, raises the average speed by 6.80%, and decreases the total travel time by 5.22%. Furthermore, through the control effort adjustment, the queue length of the entrance ramp is also greatly reduced; the average occupancy rate of the weaving area of SRM is 2.42% lower than that of ALINEA at rush hours; the average traffic volume of the weaving area is 109 veh/h higher than that of ALINEA at rush hours.
This study applies reinforcement learning to urban expressway entrance ramp control and realizes intelligent ramp control, which provides a new idea for research on ramp control. The study will later be extended to coordinated ramp control, to coordinate multiple entrance ramps of the urban expressway and enhance the overall smoothness of the road network. At present, we are considering how to design algorithms for regional traffic problems. The main challenge is that the large amount of data makes the Q-value table too large to store and compute easily. Therefore, this kind of problem is being studied in combination with deep learning methods.
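The storage problem can be made concrete with a rough count of Q-table entries. The sketch below assumes a single joint table over all ramps, three state variables per ramp (queue length, throughput and occupancy), and hypothetical discretisations of 10 bins per variable and 4 metering rates per ramp; none of these sizes are taken from the paper.

```python
def q_table_entries(n_ramps: int,
                    bins_per_variable: int = 10,
                    vars_per_ramp: int = 3,
                    actions_per_ramp: int = 4) -> int:
    """Number of entries in a joint tabular Q function for n_ramps ramps."""
    n_states = bins_per_variable ** (vars_per_ramp * n_ramps)
    n_actions = actions_per_ramp ** n_ramps
    return n_states * n_actions


for n in (1, 2, 3):
    print(n, q_table_entries(n))
```

Under these assumptions the table grows exponentially with the number of coordinated ramps, which is why a function approximator such as a deep network is a natural replacement for the table.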
Acknowledgments
This work is funded by the National Natural Science Foundation of China (Grant 61573106).
