Abstract
Reinforcement learning is an efficient, widely used machine learning technique that performs well when the state and action spaces have a reasonable size. This is rarely the case in control-related problems such as traffic signal control, where the state space can be very large. In order to deal with the curse of dimensionality, a rough discretization of such a space can be employed. However, this is effective only up to a certain point. A way to mitigate this is to use techniques that generalize the state space, such as function approximation. In this paper, a linear function approximation is used. Specifically,
Introduction
Traffic signal control is a challenging real-world problem. Current solutions to this problem, such as adaptive systems like SCOOT [18], are often centralized or at least partially centralized if each controller is in charge of a portion of the urban network. Alternatives are manual interventions from traffic operators or the use of fixed-time signal plans. However, in the era of big data and advanced computing power, other paradigms are becoming more and more prominent. Among these, we find those derived from machine learning in general and reinforcement learning (RL) in particular. The reader is referred to some surveys in the area (see Section 2). In RL, traffic signal controllers located at intersections can be seen as autonomous agents that learn while interacting with the environment.
The use of RL is associated with challenging issues: the environment is dynamic (and thus agents must be highly adaptive), and agents must react to changes in the environment at an individual level while collectively causing an unpredictable pattern, as they act in a coupled environment. Therefore, traffic signal control poses many challenges for standard techniques of multiagent learning.
To understand these challenges, let us first discuss the single agent case, where one agent performs an action once in a given state, and learns by getting a signal (reward) from the environment. To put it simply, RL techniques are based on estimates of values for state-action pairs (the so-called Q-values). These values may be represented as a table with one entry for each state-action pair. This works well in single agent problems and/or when the number of states and actions is small. However, in [28] Sutton and Barto discuss two drawbacks of this approach: First, a lot of memory is necessary to keep large tables when the number of state-action pairs is huge, which tends to be the case in real-world applications. Second, a long exploration time is required to fill such tables accurately. Those authors then suggest that generalization techniques may help in addressing this so-called curse of dimensionality.
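The tabular representation discussed above can be illustrated with a minimal sketch. It uses the standard Q-learning update as a representative tabular method; the action names, learning rate and discount factor are assumptions chosen for illustration only.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # Value of the best action in the next state under the current estimates.
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# One table entry per (state, action) pair; unseen pairs default to 0.0.
q_table = defaultdict(float)
actions = ["keep", "switch"]  # hypothetical action names
new_value = q_learning_update(q_table, state=(2, 0), action="keep",
                              reward=-4.0, next_state=(1, 0), actions=actions)
```

Every distinct state-action pair needs its own entry and its own visits, which is the source of both drawbacks mentioned above.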
An efficient representation of the states is a key factor that may limit the use of the standard RL algorithms in problems that involve several agents. Moreover, in scenarios in which the states are represented as continuous values, estimation of the state value by means of tabular Q-values may not be feasible. To deal with this problem, in this paper a true online
The RL-based adaptive signal control algorithm was implemented in the open-source agent-based transport simulation MATSim [17]. In MATSim, it is possible to investigate the impact of the RL-based adaptive signal control algorithm and compare it to other fixed-time or adaptive signal control methods. For comparison, we run our approach against a rule-based adaptive signal control algorithm based on Lämmer and Helbing [22], which was implemented in MATSim in a previous study [20,30]. The results show that the RL-based approach is able to outperform these approaches in a single intersection scenario. This is especially notable, as these approaches were designed specifically for dealing with the control of signals, whereas the RL-based approach needs no domain knowledge. To the best of the authors’ knowledge, virtually no other work in the literature (especially work stemming from the RL area) includes this kind of comparison. More often than not, RL approaches are compared only to a fixed-time scheme.
The remainder of this paper is organized as follows. The next section discusses background and related work; this includes the rule-based adaptive signal control algorithm that is used as comparison in this study. The RL-based approach developed in this study is described in Section 3. Experiments and results are presented in Section 4, whereas Section 5 contains a discussion of the results and future work.
Background and related work
This section introduces some concepts on traffic signal control (Section 2.1) and gives more details about one method in particular, which is used as comparison (Section 2.2); then we discuss related work that is based on RL; the last subsection presents the simulation environment MATSim.
Traffic signal control
In contrast to fixed-time signals that cyclically repeat a given signal plan, traffic-responsive signals react to current traffic by adjusting signal states based on sensor data (e.g., from upstream induction loops or cameras). They can, therefore, react to changes in demand and reduce emissions and waiting times more efficiently.
A variety of traffic-responsive signal control algorithms have been developed. An overview is given, e.g., by Friedrich [8]. Different levels of adjustment are distinguished: actuated signals use a fixed-time base plan and adjust parameters like green split, cycle time or offset. (Fully) adaptive signals decide about the signal states on the fly. They can modify phase orders or even combine signals into different phases over time. With this, the flexibility of the signal optimization is augmented, which increases the possible improvement, but makes the optimization problem more complex. In order to reduce complexity and communication effort between sensors and a central computation unit (which controls signal states system-wide), decentralized (also called self-controlled) methods decide locally about signal states without complete knowledge of the system. Usually, every signalized intersection has its own processing unit that accounts for upstream (and sometimes downstream) sensor data of all approaches. A challenge of decentralized systems is to still ensure system-wide stability, especially when dealing with oversaturated conditions. A number of methods were developed that tackle these challenges.
Examples of traffic-responsive approaches from various generations and technological bases are: SCOOT [18]; SCATS [23]; Prodyn [16]; OPAC [9]; UTOPIA [6]; TUC (Traffic-responsive Urban Traffic Control) [7]; and TUC combined with predictive control [5]. Some can be considered rule-based (for example, Lämmer and Helbing [22]), while others use techniques from RL and model signals as learning agents (see Section 2.3).
Lämmer’s rule-based adaptive traffic signal control algorithm
The idea of the self-controlled signals proposed by Lämmer and Helbing [22] is to minimize waiting times and queue lengths at decentralized intersections while also granting stability through minimal service intervals. The algorithm combines two strategies. The
An enclosing
An
Two
Reinforcement learning
In RL, an agent’s goal is to learn an optimal control policy
Since the agent’s objective is to maximize the cumulative reward, if it learned the optimal Q-values
For convergence guarantees, in the case of QL, please see [35].
RL methods can be divided into two categories: Model-based methods assume that the transition function T and the reward function R are available, or instead try to learn them. Model-free methods, on the other hand, do not require that the agents have access to information about how the environment works.
There are many studies that use RL to improve traffic signal performance. For details, we refer the reader to some survey papers, which cover different aspects and perspectives: [4,24,37,38].
Using RL for traffic signal control is especially promising, as one does not need a lot of domain knowledge (as opposed to, e.g., rule-based approaches); rather, the controller learns a policy by itself. However, issues may arise with the aforementioned curse of dimensionality. In fact, depending on the specific formulation (e.g., how state and action spaces are defined), the search space can be very large. For instance, consider an intersection with four incoming approaches with three lanes per approach. If we define the state as the queue length on each lane discretized in 10 levels, it results in
In [24,26,27], RL is used by traffic signals in order to learn a policy that maps states (normally queues at junctions) to actions (normally keeping/changing the current split of green times among the lights of each phase). In [27] the approach is centralized (a single entity holds the MDP for all traffic signals); a central authority receives information about the length of the queues and elapsed time from various lanes to make a decision about timings at each signal. On the other hand, the approaches in [24] and [26] are decentralized. Each junction learns independently (normally using QL).
Since most of these works use QL, and thus approximate the Q-function as a table, they may fall prey to the curse of dimensionality. This arises when one deals with realistic scenarios, as, e.g., those beyond 2-phase intersections that are common in the literature.
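To make the scale of the problem concrete, the following back-of-the-envelope sketch counts the entries of a Q-table for the example intersection with four approaches, three lanes per approach and ten queue-length levels per lane. The number of actions and the storage size per entry are assumptions made only for this illustration.

```python
lanes = 4 * 3          # four incoming approaches, three lanes each
levels = 10            # queue-length levels per lane
num_states = levels ** lanes

num_actions = 2        # assumption: keep or change the current phase
bytes_per_entry = 8    # assumption: one 64-bit float per Q-value

table_entries = num_states * num_actions
table_size_tb = table_entries * bytes_per_entry / 1e12

print(f"{num_states:.0e} states, {table_entries:.0e} entries, "
      f"about {table_size_tb:.0f} TB")
```

Under these assumptions the table alone would occupy several terabytes, before any exploration time is considered.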
In order to address this, a few works used function approximation. For instance, [1] uses tile coding in function approximation. However, the definition of states only considers queue length.
Recently, many studies have achieved impressive results using deep neural networks to approximate the Q-function (e.g., DQN [25,32,39]). However, linear function approximation has guaranteed convergence and error bounds, whereas non-linear function approximation is known to diverge in multiple cases [3,29]. Moreover, linear function approximation is less computation-intensive, as it relies on significantly fewer parameters. Thus, if the Q-function can be linearly approximated with sufficient precision, linear function approximation methods are preferable.
As deployment, operations, and maintenance costs of traffic-responsive signals in general are high, transport simulation tools provide a perfect environment to systematically test and evaluate new signal control methods before applying them in the field.
The agent-based transport simulation MATSim [17], which is used in this study, is especially suitable in this regard, as it is able to run large-scale real-world simulations in reasonable time. Simulations can be built based on open data (see, e.g., the open Berlin scenario [40]) such that the impact of new signal control approaches can be easily analyzed for arbitrary scenarios.
An example on how to start a MATSim simulation using the RL signal control presented in this paper can be found at
In MATSim traffic is modeled by agents (i.e., persons) that follow a daily plan of activities and trips. Traffic flow is modeled mesoscopically by spatial first-in-first-out (FIFO) queues. Vehicles at the head of a queue can leave a link when the following criteria are fulfilled: (1) The link’s free-flow travel time has passed, (2) the flow capacity of the link is not exceeded in the given time step, and (3) there is enough space on the next link. Despite this simplistic modeling approach, congestion, as well as spillback, can be modeled.
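The three criteria for leaving a link can be sketched as follows. MATSim itself is written in Java; the Python class and field names below are hypothetical and only mirror the description above, with the flow capacity simplified to a whole number of vehicles per time step.

```python
from dataclasses import dataclass


@dataclass
class Link:
    free_flow_time: float     # seconds needed to traverse the link at free speed
    flow_capacity: int        # vehicles allowed to leave per time step (simplified)
    storage_capacity: int     # vehicles the link can hold
    vehicles_on_link: int = 0
    left_this_step: int = 0


def may_leave(link, next_link, entered_at, now):
    """Check whether the vehicle at the head of the queue may leave the link."""
    travel_time_passed = now - entered_at >= link.free_flow_time          # (1)
    capacity_free = link.left_this_step < link.flow_capacity              # (2)
    space_downstream = next_link.vehicles_on_link < next_link.storage_capacity  # (3)
    return travel_time_passed and capacity_free and space_downstream


upstream = Link(free_flow_time=10.0, flow_capacity=1, storage_capacity=5)
free_link = Link(free_flow_time=10.0, flow_capacity=1, storage_capacity=2,
                 vehicles_on_link=1)
full_link = Link(free_flow_time=10.0, flow_capacity=1, storage_capacity=2,
                 vehicles_on_link=2)

print(may_leave(upstream, free_link, entered_at=0.0, now=10.0))
print(may_leave(upstream, full_link, entered_at=0.0, now=10.0))
```

The third criterion is what produces spillback: a full downstream link blocks the head of the upstream queue.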
The traffic signal control module was developed by Grether as an extension to MATSim [13]. If a signal exists on a link, leaving the link is not possible while it shows red. First studies focused on fixed-time signals, but approaches for traffic-responsive signal control have also been implemented [11,20,30]. Kühnel et al. [20] and Thunig et al. [30] present the implementation and application of the rule-based signals from Section 2.2 in MATSim. This implementation is also used in the present study as comparison for the RL signal control.

Links with multiple lanes in MATSim. Each lane is represented by its own FIFO queue. Traffic signal control for different turning moves is captured. Vehicles on different lanes can pass each other, unless the queue spills over. Source: [12].
Separated waiting queues for different turning directions at intersections can be modeled in MATSim by lanes, which are a substructure of links (see Fig. 1). They are especially useful to model protected left turns at signalized intersections. Also, the spatial interaction of different waiting queues on a link can be captured correctly by lanes, as Fig. 1(b) depicts. Each lane can be signalized separately. Signals and lanes in MATSim are more extensively described by Grether and Thunig [12].
Events of vehicles entering or leaving links and lanes are thrown at a second-by-second time resolution in the simulation. Sensors on links or lanes that detect single vehicles can be easily modeled by listening to these events. As in reality, the maximum forecast period of such sensors is limited – vehicles can only be detected when they have entered the link. If a link is short, forecasts might not be accurate. In the simulation, responsive signals use these sensor data to react dynamically to approaching vehicles. For every signalized intersection, the control unit is called every second to decide about current signal states. RL-based signal control approaches can therefore also be easily integrated into the simulation framework.
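The idea of modeling a sensor by listening to enter and leave events can be sketched as below. This is a hypothetical Python analogue, not MATSim's event-handling interface; the class and method names are assumptions.

```python
class LinkSensor:
    """Counts the vehicles currently on one link from enter/leave events."""

    def __init__(self, link_id):
        self.link_id = link_id
        self.vehicles = set()

    def on_enter(self, vehicle_id, link_id):
        # A vehicle becomes visible to the sensor only once it enters the link.
        if link_id == self.link_id:
            self.vehicles.add(vehicle_id)

    def on_leave(self, vehicle_id, link_id):
        if link_id == self.link_id:
            self.vehicles.discard(vehicle_id)

    def count(self):
        return len(self.vehicles)


sensor = LinkSensor("link_1")
sensor.on_enter("veh_a", "link_1")
sensor.on_enter("veh_b", "link_1")
sensor.on_enter("veh_c", "link_2")   # other link, ignored by this sensor
sensor.on_leave("veh_a", "link_1")
print(sensor.count())
```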
In general, MATSim can model user reaction as route, mode or departure time changes. But for this paper, only the traffic flow simulation of MATSim is used. Readers interested in the evolutionary part of MATSim – i.e., how agents adapt their plans and how long-term effects can be analyzed – are referred to [17].
In this section, we first discuss the method used for function approximation, then give details about the formulation of state and action space, as well as rewards, for the specific domain of signal control.
Fourier basis linear function approximation with the true online
The proposed RL traffic signal controller implements the true online
When linear approximation is used, the Q-values
The Fourier series is one of the most commonly used continuous function approximation methods and has solid theoretical foundations. In [19], it was empirically shown that the Fourier basis outperforms other commonly used approximation methods, such as polynomial and radial basis functions, in continuous RL domains.
When applying Fourier series to the RL setting, it is possible to drop the sine terms of the series. For a detailed explanation, please see [19].
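As an illustration of the cosine-only Fourier basis, the sketch below enumerates all integer coefficient vectors up to a given order and evaluates the corresponding features for a state whose components are assumed to be normalized to [0, 1]. The function names and the example state are hypothetical, and the full (unconstrained) set of coefficient vectors is used here.

```python
import itertools
import math


def fourier_coefficients(num_dims, order):
    """All coefficient vectors with integer entries in 0..order."""
    return list(itertools.product(range(order + 1), repeat=num_dims))


def fourier_features(state, coefficients):
    """Evaluate cos(pi * c . s) for every coefficient vector c."""
    return [
        math.cos(math.pi * sum(c_j * s_j for c_j, s_j in zip(c, state)))
        for c in coefficients
    ]


coeffs = fourier_coefficients(num_dims=2, order=3)
features = fourier_features([0.25, 0.5], coeffs)
print(len(coeffs), features[0])
```

With the full set, the number of basis functions is (order + 1) raised to the number of state dimensions, so it grows quickly with the dimensionality of the state.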
The set of basis functions
After the execution of action
The objective of this update rule is to minimize the temporal difference error δ, which denotes the error in the current estimates of the Q-values. We refer the reader to [33] for details on its derivation.
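For intuition, the temporal difference error under a linear approximation can be sketched in its generic on-policy form: the Q-value is the dot product of a weight vector and a feature vector, and δ compares the current estimate with the one-step bootstrapped target. This is an assumption-laden illustration only; the eligibility-trace-based weight update of the method itself is not reproduced, and all names and numbers are hypothetical.

```python
def q_value(weights, features):
    """Linear Q-value: dot product of weights and state-action features."""
    return sum(w * f for w, f in zip(weights, features))


def td_error(weights, features, reward, next_features, gamma):
    """One-step TD error for the transition (s, a) -> (s', a')."""
    target = reward + gamma * q_value(weights, next_features)
    return target - q_value(weights, features)


weights = [0.5, -1.0]          # hypothetical weight vector
features = [1.0, 0.5]          # features of the current state-action pair
next_features = [1.0, 0.0]     # features of the next state-action pair
delta = td_error(weights, features, reward=-2.0,
                 next_features=next_features, gamma=0.9)
print(delta)
```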
The eligibility traces vector
Given the base learning rate α, each weight
In order to address the exploration–exploitation dilemma, the ε-greedy exploration strategy is used to choose actions: the action with the highest Q-value is selected with a probability of
Next, we give the formulations that are specific to the domain of signal control.
In RL problems, the definition of state space strongly influences the agents’ behavior and performance. In traffic signal control, for instance, information related to the level of congestion in the approaching lanes is fundamental in order to appropriately choose the next active signal phase.
In the present setting, the agent observes a vector
As is common in the literature, the proposed RL signal control is only called every three seconds. This means that one time step for the traffic signal agent corresponds to three seconds of simulation. This reduces the complexity and the size of the state space, without significantly reducing the performance.
Action space
At each time step t (every three seconds), the traffic signal controller chooses a discrete action
Reward
After taking action
In its turn, the cumulative vehicle delay D, for any time t, is computed as in Eq. (10), where
Experiments and results
Scenario
This study focuses on a single intersection scenario with four different set-ups. The set-ups vary in demand and/or number of lanes that are usable. The RL control is compared to a fixed-time signal control and rule-based traffic-responsive signal control based on [22] (as introduced in Section 2.2).
Nevertheless, the proposed RL method for traffic signal control is also applicable to real-world scenarios. To do so, every signalized intersection can be modeled as an individual learning agent, only working with local sensor information. This way, green waves are not specifically tackled or pre-defined. We note, however, that they may be considered if a different reward function is defined, one designed to reward offsets that are in line with the emergence of a green wave.
Further, to address more complex scenarios, a setting that considers a network of signals is being investigated, where we show that the RL signal control proposed here is able to keep up with – and in some situations is even able to outperform – Lämmer’s algorithm in a real-world scenario with multiple signalized intersections (see [2]).

Single intersection scenario.
The single intersection featured here (see Fig. 2) has four incoming approaches. In the horizontal direction, there is a dedicated left-turn lane in each traffic approach, as well as three lanes for straight traffic. In the vertical direction, there are two lanes for straight traffic.
Traffic signals are grouped into three non-conflicting signal phases: Straight traffic in horizontal direction; left turning traffic in horizontal direction; vertical direction. While switching between two signal phases, there is an all red period of one second. The minimum green time for a signal phase is five seconds.
The fixed-time control that is used for comparison purposes is optimized by Webster’s method [36]. It has a cycle time of 40 seconds and distributes green times according to average flow rates. The traffic-responsive signal approaches do not have a fixed cycle time: For Lämmer’s control algorithm, a desired and a maximal cycle time can be defined (for this scenario 40 and 60 seconds are used, respectively). For the RL control a maximal green time of 30 seconds per signal phase is used. As mentioned in Section 3.1, the RL control is only called every three seconds to decide about new signal states.
All these parameter settings (such as all red time, minimum green time, update time etc.) can of course be adjusted when applying the RL signal control to other scenarios.
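The timing parameters of this scenario can be collected in a small configuration object, as sketched below. The values are those stated above; how the controller enforces them is not spelled out here, so the rule in `allowed_phases` (keep the phase before the minimum green time, force a switch at the maximum green time) is an assumption, as are all names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalTiming:
    all_red: int = 1            # seconds of all-red between two phases
    min_green: int = 5          # minimum green time per phase, in seconds
    max_green: int = 30         # maximal green time per phase (RL control)
    decision_interval: int = 3  # the RL control is called every three seconds


def allowed_phases(current_phase, green_elapsed, num_phases, timing):
    """Phases the controller may select at a decision point (assumed rule)."""
    others = [p for p in range(num_phases) if p != current_phase]
    if green_elapsed < timing.min_green:
        return [current_phase]          # too early to switch
    if green_elapsed >= timing.max_green:
        return others                   # the current phase must end
    return [current_phase] + others


timing = SignalTiming()
print(allowed_phases(0, green_elapsed=3, num_phases=3, timing=timing))
print(allowed_phases(0, green_elapsed=6, num_phases=3, timing=timing))
print(allowed_phases(0, green_elapsed=30, num_phases=3, timing=timing))
```

Applying the control to another scenario would then only require a different `SignalTiming` instance.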
Demand
Four different demand set-ups are modeled. In all set-ups, arrival rates are stochastic: vehicles are inserted as platoons, with a platoon size that is exponentially distributed around an expected value of five. Also the time gap between vehicle platoons is exponentially distributed: its expected value is the platoon size divided by the average flow value. To average out the fluctuations in the results depending on the specific platoon structure of approaching vehicles, each set-up was simulated with 20 different random seeds.
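The stochastic arrival process can be sketched as follows. Two points are interpretations rather than statements from the text: the exponentially distributed platoon size is rounded to an integer of at least one vehicle, and the gap after a platoon is drawn with an expected value equal to that platoon's size divided by the average flow. All names are hypothetical.

```python
import random


def generate_arrivals(flow_per_hour, duration_s, mean_platoon_size=5.0, seed=0):
    """Return a list of (arrival_time_s, platoon_size) tuples."""
    rng = random.Random(seed)
    flow_per_second = flow_per_hour / 3600.0
    arrivals = []
    t = 0.0
    while True:
        # Platoon size: exponential with the given mean, rounded (assumption).
        size = max(1, round(rng.expovariate(1.0 / mean_platoon_size)))
        # Gap: exponential with mean size / flow, i.e. rate flow / size.
        t += rng.expovariate(flow_per_second / size)
        if t >= duration_s:
            return arrivals
        arrivals.append((t, size))


arrivals = generate_arrivals(flow_per_hour=1800, duration_s=86_400, seed=1)
total_vehicles = sum(size for _, size in arrivals)
print(len(arrivals), total_vehicles)
```

Changing `seed` changes the platoon structure while the long-run average flow stays the same, which is why results are averaged over several seeds.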
Constant demand. In a first set-up, there is traffic going straight in the horizontal direction, with 1800 vehicles approaching on average per hour, in each of the two approaches. In the vertical direction, there are 600 vehicles on average per hour from each side – all going straight. Additionally, there are 180 vehicles on average per hour from both sides in horizontal direction that want to turn left at the intersection. A period of 86,400 seconds (i.e., one day) is simulated.
Peaks with doubled demand. In a second set-up, the demand is doubled during five time periods over the day of 2,000 seconds length each, in order to analyze the effect of fluctuating demand on the performance of the RL controller. To be more precise, in the time intervals [0, 2,000), [20,000, 22,000), [40,000, 42,000), [60,000, 62,000), and [80,000, 82,000) the average flow rates in horizontal direction are 3600 vehicles per hour going straight and 360 vehicles per hour going left per approach, whereas in vertical direction the average flow rate per approach is 1200 vehicles per hour. During the rest of the simulation, the average flow rates are the same as for the first scenario set-up.
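The peak schedule of this set-up can be written down directly. The sketch below returns the average flow rates at a given simulation second; the dictionary keys and function names are hypothetical.

```python
PEAK_STARTS = (0, 20_000, 40_000, 60_000, 80_000)  # seconds
PEAK_LENGTH = 2_000                                 # seconds

# Average flow rates per approach of the first set-up, in vehicles per hour.
BASE_FLOWS = {
    "horizontal_straight": 1800,
    "horizontal_left": 180,
    "vertical_straight": 600,
}


def demand_factor(t):
    """Return 2.0 inside a peak interval [start, start + 2000), else 1.0."""
    in_peak = any(start <= t < start + PEAK_LENGTH for start in PEAK_STARTS)
    return 2.0 if in_peak else 1.0


def flows_at(t):
    """Average flow rates per approach at simulation second t."""
    return {name: flow * demand_factor(t) for name, flow in BASE_FLOWS.items()}


print(flows_at(1_999))
print(flows_at(2_000))
```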
With this demand set-up, it can be analyzed how the control algorithms behave with short periods of overload. Because these periods periodically repeat, the RL control is able to learn from peak to peak while the other controllers behave similarly in all peaks.
Asymmetric periodic demand. In this third set-up, an artificial morning and evening peak are simulated around a daily demand level that corresponds to the first set-up. To model the peaks, a sine curve modifies the average flow values. In the morning peak, this sine curve has its maximum at 8 am with twice the daily demand level for the horizontal direction and 1.5 times the daily demand for the vertical direction. In the evening, these factors are swapped (1.5 for horizontal direction; 2 for vertical direction) with a maximum at 6 pm. During the day (between 10 am and 4 pm), a constant demand level similar to the first demand set-up is used; during the night one half of this constant demand is used. Figure 3 shows the number of departures per second per direction resulting from this set-up. The x-axis is trimmed to the interesting part of the day. The shadowed area depicts the standard deviation. The lines are smoothed with a moving average window of 300 seconds (i.e., 5 minutes) for better clarity.

Average number of departures per second per direction for the third demand set-up with asymmetric periodic demand.
Having this kind of asymmetric periodic demand makes the situation more difficult for the RL control because a wider spectrum of the state space (i.e., of different vehicle patterns) has to be observed and explored. On the other hand, for Lämmer’s algorithm, the evening peak in this set-up is especially challenging, as the main traffic approaches from the secondary road. This is due to the way the algorithm prioritizes between approaches with different flow capacities.
Constant demand with lane closure. For the fourth set-up, a constant demand is used, with 1100 vehicles on average per hour in each of the two approaches of the horizontal direction and, in each case, additionally 110 vehicles on average per hour that want to turn left. Vertical traffic corresponds to the first demand set-up (600 vehicles per hour).
Between 6 am and 6 pm a lane closure (e.g., due to roadworks) is simulated eastbound in horizontal direction, which results in a reduction of flow capacity to one third (two lanes are closed). This is interesting to examine because Lämmer’s adaptive algorithm is not capable of dealing with such a spontaneous capacity change and still assumes the old flow capacity values, while RL is able to learn from the new situation without any domain knowledge.
The proposed method of the true online
Due to the stochastic arrival rates, results presented here are averaged over 20 runs with different random seeds, whereby the random seed influences the platoon structure of approaching vehicles (the average flow rate stays the same).
The shadowed area in the plots depicts the standard deviation regarding average delay or total queue length, accordingly. The lines are smoothed with a moving average window of 300 seconds (i.e., 5 minutes) for better clarity.
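The post-processing described above can be sketched with the standard library only. Two details are assumptions: the standard deviation is computed as a population standard deviation across seeds, and the moving average is a trailing window (the plots may equally use a centered one). Function names are hypothetical.

```python
from statistics import mean, pstdev


def aggregate_runs(runs):
    """Per-second mean and standard deviation over runs with different seeds.

    `runs` is a list of equally long per-second series, one per random seed.
    """
    per_second = list(zip(*runs))
    return [mean(v) for v in per_second], [pstdev(v) for v in per_second]


def moving_average(series, window=300):
    """Trailing moving average; early values use the samples seen so far."""
    smoothed, running_sum = [], 0.0
    for i, value in enumerate(series):
        running_sum += value
        if i >= window:
            running_sum -= series[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


means, stds = aggregate_runs([[1.0, 2.0], [3.0, 4.0]])
print(means, stds)
print(moving_average([1, 2, 3, 4], window=2))
```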
Comparison with other RL-based signal control methods
Here we compare the proposed method with the traditional tabular
Tabular vs. linear
In order to allow a fair comparison, the same discount factor, value of λ and exploration rate were used for both methods. The discount factor was set to
As the state space in this case is large, and the number of Fourier basis functions grows exponentially in the number of dimensions of the state space, we placed constraints on the coefficient vectors
In Fig. 4 the average delay per vehicle at each second of the simulation is depicted for true online

Average delay for tabular and linear function approximation RL implementations.
With 10 bins, the learning is very slow, as the number of discretization bins exponentially increases the size of the state space. Reducing the number of bins to 8 significantly speeds up learning and reduces the delay. However, by reducing the number of bins, different states (in which different actions are optimal) are perceived as the same, thus leading to a sub-optimal performance in the long run.
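The trade-off between the number of bins and state aliasing can be illustrated with a minimal sketch. It assumes features normalized to [0, 1] and uniform bins; the number of features and actions used in the size comparison is hypothetical and not taken from the experiments.

```python
def discretize(value, num_bins):
    """Map a value in [0, 1] to a bin index in 0..num_bins-1 (uniform bins)."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)


def table_size(num_features, num_bins, num_actions):
    """Number of Q-table entries for a fully discretized state."""
    return (num_bins ** num_features) * num_actions


# Two different feature values that 10 bins separate but 8 bins merge.
low, high = 0.52, 0.61
print(discretize(low, 10), discretize(high, 10))
print(discretize(low, 8), discretize(high, 8))

# Hypothetical sizes: 12 state features and 3 actions.
print(table_size(12, 10, 3) / table_size(12, 8, 3))
```

Fewer bins shrink the table by a large factor, at the price of treating states such as `low` and `high` as identical.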
The usage of function approximation not only avoids the curse of dimensionality, but introduces generalization, i.e., when updating the Q-function after taking an action in a given state, similar states are also affected and have their Q-values changed. With that, the true online
Order of the Fourier basis approximation.

Impact of different values for the order n of the Fourier basis approximation.
Figure 5 shows the impact that the value of the Fourier approximation order n has on the agent’s performance. As expected, the higher the value of n, the more accurate is the approximation of the Q-function. Changing the order from
State definition. Although the q (flow) features provide the traffic signal control agent with queue information on each lane, the Δ (density) features are also important, as they inform how many vehicles (that may be queued in the following seconds) there are on each link. Figure 6 shows that, by removing the Δ features from the state definition, the average delay increases. This difference might be even higher in scenarios with very high demand, where a high number of vehicles are moving and approaching the queues.
Reward definition. The definition of the reward function has a high impact on the performance of the RL-based controller [15]: In Fig. 7 the reward function defined in Section 3.1 is compared to another reward function found in the literature [24], defined as the change in total queue length between successive actions. The traffic signal controller using change in cumulative delay as reward not only converges to better performance, but produces a learning curve that decreases orders of magnitude faster. This result shows that the choice of which reward function to use is one of the most critical implementation decisions when designing a reinforcement learning controller.
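The two reward definitions compared here can be sketched side by side. The sign convention (a positive reward when the quantity decreases between successive actions) is an assumption of this sketch, and the function and argument names are hypothetical.

```python
def delay_change_reward(prev_cumulative_delay, cumulative_delay):
    """Reward based on the change in cumulative vehicle delay."""
    return prev_cumulative_delay - cumulative_delay


def queue_change_reward(prev_total_queue, total_queue):
    """Reward based on the change in total queue length."""
    return prev_total_queue - total_queue


# Delay dropped from 120 s to 95 s between two successive actions.
print(delay_change_reward(120.0, 95.0))
# The total queue grew from 10 to 12 vehicles.
print(queue_change_reward(10, 12))
```

Both functions have the same form, yet they score the same transition differently, since queue length does not capture how long the queued vehicles have already been waiting.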

Impact of state definition.

Impact of reward definition.
In this section, the true online
First set-up – constant demand. Figure 8 shows the performance regarding average delay and total queue length for the first set-up (constant average flow rates). It can be seen that for this somewhat homogeneous set-up, both the RL-based and Lämmer approaches perform much better than the Webster fixed-time control in terms of average delay and queue length. They also produce less variation in these measures, demonstrating robustness against traffic fluctuations. Note that for constant average flow rates, the fixed-time control used here (optimized by Webster’s method) is already quite good. RL is able to outperform the fixed scheme because it seems to be more stable regarding platoon variations. This can be seen in both plots in Fig. 8, with the standard deviation (shown as the shadowed area in the plots) being lower for the RL-based control.

Single intersection scenario with constant average flow rates (first set up of Section 4.1.2).
Second set-up – peaks with doubled demand. Figure 9 depicts how the different signal controllers are able to handle short phases of overload. Flow rates are doubled during five time periods over the day (see description in Section 4.1.2). For this less homogeneous demand, both adaptive approaches clearly outperform the fixed-time control, which is not even able to resolve the queues of one peak before the next peak begins. The RL controller improves its performance from the second peak onwards, as in the first peak it was experiencing an overload situation for the first time. In this more difficult set-up, the difference in performance between RL and Lämmer becomes less visible, with both presenting the same length of queues when there is low demand. Interestingly, RL decreases the queue lengths faster than Lämmer after the peaks, which indicates that RL better adapts to changes in flow pattern.

Single intersection scenario with periodically repeating time periods where the average flow rates are doubled (second set up of Section 4.1.2).
Third set-up – asymmetric periodic demand. This set-up presents the effects of more heterogeneous demand, where in the morning more traffic is approaching in the horizontal direction while in the evening more traffic is approaching on the minor vertical road. With this, a wider spectrum of the state space needs to be explored by the RL-based controller because a lot of different vehicle patterns occur. This is why RL is able to improve further when it is run for multiple iterations (i.e., days) in the simulation – in contrast to the first demand set-up. A comparison of average delays and total queue length between the first and the fifth iteration is given in Fig. 10. The x-axis is trimmed to only show the relevant part of the day. Especially in the crowded morning peak, learning over the days helps to narrow and flatten the curves.

First vs. fourth iteration (i.e., days) of RL in the single intersection scenario with asymmetric morning and evening peak (third set up of Section 4.1.2).
Compared to Lämmer’s rule-based control, it can be seen that RL behaves very well in the evening peak, when the main traffic is approaching on the minor road (see Fig. 11). This is probably due to the priority calculation of Lämmer’s algorithm, where approaches with lower flow capacity values result in lower priorities for the same demand pressure. Because a lot of traffic is also approaching on the major road with its high flow capacity, the minor road does not get the main priority. During the morning peak Lämmer and RL behave similarly, with Lämmer resulting in lower maximal queue length, but RL resolving the peak faster. As in the first demand set-up, Lämmer is slightly better with low constant demand values during the day.

Single intersection scenario with asymmetric morning and evening peak (third set up of Section 4.1.2).
Fourth set-up – constant demand with lane closure. Recall that, in contrast to rule-based approaches, the RL-based control does not require domain knowledge. With this, it has advantages when unexpected events occur that change the underlying situation (e.g., a change in flow capacities, storage capacities, free speed etc.). To verify this, the fourth set-up simulates a lane closure event, where two of the three lanes in horizontal direction eastbound are closed for some time (see description in Section 4.1.2). This results in a flow capacity drop by two thirds. As Lämmer’s rule-based control still calculates priorities based on the original flow capacity values, it results in quite high delays and queue length, as seen in Fig. 12. To better see the difference between RL and Lämmer, total delay and queue length for the fixed-time control are not shown in that figure, as they are even higher. The x-axis is again trimmed to the relevant part of the day.
Interestingly, RL is worse than Lämmer at the beginning of the lane closure, where it is still learning to handle this new situation. However, it quickly overtakes Lämmer and stays more or less stable, while Lämmer’s delay and queue length values increase further. When the lane closure ends, queue lengths and delays drop faster with RL, whereas Lämmer’s control quickly takes over (similar to the first demand set-up) as its domain knowledge is again beneficial.

Single intersection scenario with constant demand where two lanes are closed eastbound on horizontal direction between 6 am and 6 pm (fourth set up of Section 4.1.2).
As discussed in the first two sections of the present paper, traffic signal control is a challenging domain for RL techniques, being the subject of an increasing body of research. In the present paper, it was shown that specific techniques from RL can help to improve the performance of traffic signal control, and even outperform state-of-the-art rule-based adaptive signal control methods, especially in scenarios with varying traffic demands or unexpected events. It was argued that tabular RL methods may not be feasible due to the curse of dimensionality. When it is possible to employ them, it is often the case that they need long learning times before convergence in the case of realistic intersections with more than two signal phases and when a more complex definition of state is used. Recall that the results presented here show that including more features (i.e., not only queue but also density) played a significant role in the performance.
To address the curse of dimensionality, we used Fourier basis linear function approximation alongside the true online
As a next step, the signal control based on true online
As future avenues for research, we envision the following. First, it remains to be investigated whether the RL signal control can be further improved by designing the learning task using other state and action formulations. Additionally, since the choice of reward scheme seems to be a key issue, a possible extension of this work could consider using the methods proposed in [14,15] for designing a reward function that fits this domain best.
A further study will analyze the effect of self-controlled signals by RL on the long-term decisions of travelers, e.g. regarding route or mode choice. With this, the problem becomes bi-level: Signal agents react to sensor data and traveler agents react to experienced travel times that are, in turn, affected by the signal control. As a consequence, delays and queue length might increase again, as intersections that are efficiently controlled attract more traffic. For rule-based adaptive traffic signals this effect was already verified in the simulation [31].
Acknowledgements
The authors thank Prof. Kai Nagel for his support and supervision during the internship of Lucas N. Alegre at TU Berlin. Thanks also to Dr. Ricardo Grunitzki for a previous version of the tabular Q-learning code. The authors are grateful to the Deutsche Forschungsgemeinschaft (DFG) for funding the project Optimization and network wide analysis of traffic signal control, as well as to the Brazilian Research Council (grant no. 307215/2017-2 for Ana Bazzan).
