DECISION-MAKING OF AN AUTONOMOUS VEHICLE IN THE PRESENCE OF EMERGENCY VEHICLE USING DEEP REINFORCEMENT LEARNING

by

Hamid Shoaraee

M.Sc. Industrial Engineering, University of Tehran, 2019

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

November 2021

© Hamid Shoaraee, 2021

Abstract

Autonomous Vehicles are the future of road transportation, where they can increase safety, efficiency, and productivity. In this thesis, we address a new edge case in autonomous driving in which an autonomous vehicle is approached by an emergency vehicle and needs to make the best decision. To achieve the desired behavior and learn the sequential decision process, we trained our autonomous vehicle with Deep Reinforcement Learning algorithms and compared the results with rule-based algorithms. The driving environment for this study was developed using Simulation Urban Mobility, an open-source traffic simulator. The proposed solution based on Deep Reinforcement Learning performs better than the rule-based baseline, both in normal driving situations and when an emergency vehicle is approaching.

Table of Contents

Abstract ..... II
List of Figures ..... V
List of Tables ..... VII
List of Abbreviations ..... VIII
Publication From This Research ..... IX
Acknowledgment ..... X
Chapter 1: Introduction ..... 1
  1.1 Automobile Automation ..... 1
  1.2 Artificial Intelligence, Machine Learning, Deep Learning ..... 3
  1.3 Autonomous Driving ..... 5
  1.4 Research Objectives, Contributions, and Structure of This Research ..... 8
Chapter 2: Background ..... 11
  2.1 Reinforcement Learning ..... 11
  2.2 Artificial Neural Network and Deep Learning ..... 18
  2.3 Decision-Making Task for Autonomous Vehicles ..... 20
    2.3.1 Rule-based Controllers ..... 21
    2.3.2 RL-based Controllers ..... 23
Chapter 3: Methodology ..... 27
  3.1 Problem Description ..... 27
  3.2 RL Framework ..... 29
    3.2.1 Agent ..... 29
    3.2.2 Environment ..... 30
    3.2.3 Actions ..... 34
    3.2.4 Reward ..... 35
  3.3 Deep Reinforcement Learning for Autonomous Driving ..... 36
    3.3.1 Neural Networks Architecture ..... 40
    3.3.2 Dueling Deep Q-Network ..... 43
Chapter 4: Computational Result ..... 45
  4.1 Implementation ..... 45
    4.1.1 Parameters and Hyperparameters ..... 49
  4.2 Computational Result ..... 51
    4.2.1 Rule-Based vs. DRL-Based ..... 52
Chapter 5: Conclusion and Future Research ..... 66
  5.1 Limitation of This Research ..... 68
  5.2 Future Research ..... 69
References ..... 71

List of Figures

Figure 1.1 Level of Automation ..... 2
Figure 1.2 AI, ML, DL ..... 5
Figure 1.3 AD Components ..... 7
Figure 2.1 Stakeholders in Markov Decision Process (Richard S. Sutton, 2018) ..... 12
Figure 2.2 Exploration vs. Exploitation ..... 16
Figure 2.3 Perceptron, the Smallest Component of ANNs ..... 18
Figure 2.4 Deep Neural Network ..... 19
Figure 2.5 Autonomous Vehicle (ego) and Leader Vehicle Follow IDM ..... 21
Figure 3.1 High-Level Workflow ..... 28
Figure 3.2 Input Data Representation ..... 29
Figure 3.3 Highway Environment ..... 32
Figure 3.4 DRL and SUMO Connection Through Traci ..... 33
Figure 3.5 Workflow in a Single Step of Simulation and Training ..... 37
Figure 3.6 ReLU Activation Function ..... 41
Figure 3.7 DQN Structure ..... 43
Figure 3.8 DDQN Structure ..... 43
Figure 4.1 Upper-Level View of Implementation ..... 46
Figure 4.2 Number of Accidents ..... 53
Figure 4.3 Number of Out of Road ..... 55
Figure 4.4 Number of Times ego Has the Same Lane with Emergency Vehicle ..... 56
Figure 4.5 Comparison Between IDM and DDQN on Same Lane with Emergency Vehicle ..... 58
Figure 4.6 Number of Speed Violations ..... 59
Figure 4.7 Speed Violation Comparison ..... 60
Figure 4.8 Emergency Vehicle Reaches the End of the Route Before ego ..... 61
Figure 4.9 Penalty Comparison Between DDQN and IDM + MOBIL ..... 62
Figure 4.10 Penalty Comparison of Three DRL-Based Solutions ..... 63

List of Tables

Table 2.1 IDM and MOBIL Parameters ..... 22
Table 2.2 Summary of Related Work ..... 25
Table 3.1 Vehicle Types ..... 31
Table 3.2 Termination Conditions ..... 34
Table 3.3 Actions ..... 34
Table 3.4 Penalties ..... 35
Table 3.5 DQN-FNN Structure ..... 41
Table 3.6 DQN-CNN Structure ..... 42
Table 4.1 APIs and Python Libraries ..... 45
Table 4.2 Simulation Parameters ..... 50
Table 4.3 RL Framework Parameters ..... 50
Table 4.4 Mean and Standard Deviation Comparison ..... 57
List of Abbreviations

AV, ego  Autonomous Vehicle
emg      Emergency Vehicle
IDM      Intelligent Driver Model
MOBIL    Minimize Overall Braking Induced by Lane Changes
RL       Reinforcement Learning
DRL      Deep Reinforcement Learning
SUMO     Simulation Urban Mobility
DQN      Deep Q-Network
DDQN     Dueling Deep Q-Network

Publication From This Research

Hamid Shoaraee, Liang Chen, Fan Terry Jiang. "Decision-Making of an Autonomous Vehicle when Approached by an Emergency Vehicle using Deep Reinforcement Learning". The 19th IEEE International Conference on Pervasive Intelligence and Computing (PICom 2021).

Acknowledgment

First and foremost, I would like to express my gratitude to my supervisor, Dr. Liang Chen, who has had an invaluable role in this research. Your knowledge, passion, and support encouraged me to improve my thinking and raise the quality of this work. I want to thank my committee members, Dr. Jueyi Sui and Dr. Fan Jiang, for their constructive comments and insights that brought the quality of this work to a higher level. My biggest gratitude goes to my family for all their love, support, and kindness. My parents always provided me with opportunities to become a better person and inspired me to take risks in life and pursue all my dreams; LOVE you Baba and Maman. I have to thank my brother Saeed, my hero, and his wife and son, Setareh and Ryan, who are my best friends. Finally, thanks to all my friends in Prince George, especially Reza and Matthias, for their help and support throughout my education.

Chapter 1: Introduction

Autonomous driving (AD), or driverless vehicles, has gained a lot of attention both in industry and academia. Researchers, engineers, scientists, and automobile manufacturers are working to reach an effective, efficient, and safe solution for fully autonomous driving. There have been great achievements in the sensing, perception, planning, and control tasks of autonomous driving. However, some challenges related to the decision-making task still exist. In this chapter, an introduction to the levels of automation and how the automobile industry has progressed from level 0 of automation to level 4, fully autonomous vehicles, will be presented. After that, we provide some information about Artificial Intelligence, Machine Learning, and Deep Learning and how they can play a great role in solving complicated problems. Next, we provide information about the modular and end-to-end pipelines and the different components of autonomous driving.

1.1 Automobile Automation

Without any doubt, building the first automobile is one of the greatest human achievements. Over the past century, automobiles have served as a showcase of human accomplishment in fields such as electronics, mechanical engineering, design, manufacturing, and art. Automobiles give people accessibility, freedom, and leisure, and they changed the way people think about transportation. Since Karl Benz built the first automobile in 1885, many inventors and engineers have followed his lead to make automobiles a better product. Automation in the automobile industry is categorized by the US National Highway Traffic Safety Administration [1] into 5 levels, as shown in Figure 1.1. The first level of automation is level 0, where the driver has exclusive control of all controllers (steering, acceleration, and brake). Automation starts at level 1, where the car is equipped with Advanced Driver-Assistance Systems (ADAS).
The driver has overall control of the vehicle, but he or she can choose to relinquish limited control over a primary task. Adaptive cruise control, emergency braking, and lane assistance are some of the ADAS features at this level. When two primary controls are handled by the vehicle instead of the human driver, the vehicle reaches level 2 of automation. For example, the combination of adaptive cruise control and lane centering is considered the second level of automation. Level 3 is the beginning of the era of self-driving cars: this level is called semi-autonomous, where the driver is not required to monitor the road continuously. However, in some situations, the driver is expected to take control of the vehicle when it is difficult for the self-driving car to control itself.

Figure 1.1 Level of Automation (a timeline of automation levels, from level 0 in the Ford Model T era of 1885-1950, through ABS, cruise control, and ADAS, to level 4 with no steering wheel or pedals, projected for 2020-2025)

For example, in complex situations like unprotected intersections, unmarked roads, and construction zones with a narrow pass, autonomous driving will be deactivated, and the vehicle gives the human driver enough time to take over control. In level 4, all driving tasks are observed and controlled by the vehicle without any help from the human driver. At this level, there is no controller such as a steering wheel or gas and brake pedals. Based on our knowledge, none of the car manufacturers and research teams have reached level 4, fully self-driving, at this time. However, based on the trend and development in this field, we can expect fully self-driving capability in the future. We note that, based on different categorizations of automation levels, some people consider level 5 to be fully self-driving automation. There are some studies (e.g., [2][3][4]) that describe the effects of AVs on cities, and how they can lower the cost of transportation and improve accessibility and safety. These studies also describe the challenges of reaching level 4, fully autonomous driving.

1.2 Artificial Intelligence, Machine Learning, Deep Learning

Before introducing all the procedures that need to be considered for Autonomous Driving (AD), we want to provide some information about Artificial Intelligence (AI) and Machine Learning (ML) and explain why classic and heuristic programming does not work for building an AD system. Marvin Minsky in [5] describes the inefficiency of programming for all possible situations as follows: “A computer can do, in a sense, only what it is told to do. But even when we do not know exactly how to solve a certain problem, we may program a machine to Search through some large space of solution attempts. Unfortunately, when we write a straightforward program for such a search, we usually find the resulting process to be enormously inefficient.” Letting a machine learn by searching through a large space of solutions is the same thing humans do for learning. Therefore, in AI we want to somehow simulate human intelligence in machines so that they learn like a human brain.
The science and art of gaining knowledge from raw data is called Machine Learning, defined by Tom Mitchell as follows: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” In general, there are three main types of Machine Learning, known as supervised learning, unsupervised learning, and reinforcement learning. In this research, we describe reinforcement learning in detail, as we want to build a solution for the decision-making task of AD based on reinforcement learning. In supervised learning, we already know both the data (features) and the labels, and we want to find the relationship between input and output. Supervised learning is most often categorized into the two problems of regression and classification. In unsupervised learning, the training data is unlabeled, and the machine tries to find patterns in the input data without a supervisor. There is also semi-supervised learning, where we have a combination of unlabeled and labeled data and want to learn from the partially labeled data. For a comprehensive study of machine learning and deep learning, we refer readers to two great books by Ian Goodfellow [6] and Aurelien Geron [7]. The relationship between Artificial Intelligence, Machine Learning, and Deep Learning is presented in Figure 1.2.

Figure 1.2 AI, ML, DL (Deep Learning as a subset of Machine Learning, which is a subset of Artificial Intelligence)

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton defined deep learning in the paper they published in [8] as: “Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech, and audio whereas recurrent nets have shone a light on sequential data such as text and speech.” Learning features that are not human-designed and engineered, through deep neural networks, is the power of deep learning, which can pave the way for solving complicated problems in Artificial Intelligence like autonomous driving. We will describe how the proposed deep neural networks (Deep Learning) can help train the autonomous vehicle to handle decision-making tasks with the help of Deep Reinforcement Learning (DRL) without explicit programming (Machine Learning), enabling the AD system to behave like a human driver (Artificial Intelligence).

1.3 Autonomous Driving

Big names like Mercedes-Benz, Toyota, Tesla, Waymo, etc. have tried for many years to reach fully autonomous driving with various technologies, from radio control approaches to autopilot technologies. Using Graphics Processing Units and Tensor Processing Units with more computational power, developing more complicated deep neural networks that can be trained for different tasks, and collecting and labeling data in an efficient way open more doors for reaching fully autonomous driving. There are two common pipelines for solving the autonomous driving problem: 1) the modular pipeline and 2) the end-to-end pipeline. The input data of both pipelines is the same and can come from different sensors: cameras, Light Detection and Ranging (LiDAR), Radio Detection and Ranging (RADAR), and ultrasonic data.
In the modular pipeline, each of the components works independently, and after the integration of these components, the vehicle can perform autonomous driving. The second pipeline is end-to-end, where the entire driving task is treated as one Machine Learning problem and a neural network handles all the work. Both pipelines have their own pros and cons. For example, with the modular pipeline it is easier to track which component caused a failure. However, engineering the input and output data between different modules is a very complicated job that is prone to errors, and the result of the modules may not be an optimal decision in different driving scenarios. In the end-to-end pipeline, there is no feature engineering and the DNN should learn the optimal representation. However, comprehending and tracking the reasons for a failure is not as easy as with the modular pipeline. The driving task has a level of complexity comparable to other challenges in which Machine Learning has succeeded, such as beating professional game players in Atari 2600 games [9], astonishing performance on complicated games like Go [10] and StarCraft [11], finding protein structures as a significant discovery in genomics [12], and improving computer vision [8]; this shows the possibility of solving the end-to-end decision-making of autonomous vehicles with the help of Machine Learning.

Figure 1.3 shows the main components of the AD system. The sensing component, as the first one, provides the input data for the AD system. Vision-based AD systems use a simple dash-cam or many cameras around the vehicle. For example, Tesla's autopilot (https://www.tesla.com/autopilot) uses cameras for 360-degree visibility around the vehicle, up to 250 meters in range. Studies like [13][14][15] consider a vision-based system as the input data in their models for sensing surrounding objects. The second well-known source of input data is LiDAR [16], a laser technology that estimates the time it takes for reflected light to return to the source. Other technologies like RADAR and ultrasonics are also used in AD systems for collecting input data. In this research, we consider the vehicle state as a high-level measurement technology for collecting input data. All the essential data is extracted from the simulation at each step. We will provide comprehensive information about this method and the input data in Chapter 3, where we introduce the environment and Simulation Urban Mobility (SUMO) as the simulator for this research.

Figure 1.3 AD Components (1. Sense, 2. Perception, 3. Scene Representation, 4. Planning, 5. Control)

After collecting the input data, the next component is perception. Creating an intermediate representation from the input data is the goal of the perception task. Object detection, bird's-eye-view maps, lane detection, semantic representation, and high-definition maps are the technologies that give enough perception and localization to the autonomous vehicle. Scene representation is a higher level of environment understanding. At this level, the information from all the sources needs to be combined to create a higher level of knowledge. This combination happens through different levels of fusion, and in the end the AV has comprehensive knowledge about the world around the vehicle. The planning component helps the AV with routing and trajectory planning based on input information such as the Global Positioning System or high-definition maps. By this level, the AV has all the information needed for autonomous driving. The last task in the AD pipeline is controlling the vehicle.
This task is the output of all the previous tasks; based on all the stacked information, the AV makes decisions and controls steering, acceleration, and deceleration. We refer readers to [17] for a comprehensive study of end-to-end autonomous driving.

1.4 Research Objectives, Contributions, and Structure of This Research

The main objective of this research is to improve the decision-making of autonomous driving systems. We know that there are still some challenges to reaching fully autonomous driving. One of these challenges is how the autonomous vehicle can respond to edge cases. In this research, we want to solve one of these edge cases: when an autonomous vehicle is approached by an emergency vehicle. Therefore, the main goal of this thesis is to improve the decision-making of AVs by considering the presence of an emergency vehicle. Based on our knowledge, this is the first research that considers this edge case. As a second contribution of this research, we develop both rule-based and DRL-based solutions for the decision-making of AVs. The rule-based solution is created by combining the Intelligent Driver Model (IDM) and Minimize Overall Braking Induced by Lane Changes (MOBIL). We design three DRL-based solutions with two different neural network architectures, and for the DRL-based solutions we create a highway environment and all the stakeholders of the simulation with the help of Simulation Urban Mobility (SUMO) as an open-source simulator. Finally, this research shows that the performance of the proposed solutions is robust by conducting different experiments and considering different criteria. We carry out a comprehensive analysis to find which of the proposed solutions has the better performance.

The rest of this thesis is organized as follows. In Chapter 2, we review the literature and recent work on Reinforcement Learning, Artificial Neural Networks, and decision-making tasks for autonomous driving. In Chapter 3, we start with a problem description and then describe both the proposed DRL-based and rule-based solutions. We describe how to develop the environment, actions, agent, reward, and all the stakeholders needed for a DRL-based solution. In Chapter 4, we conduct different experiments and show that our proposed solution is robust based on different criteria. Computational results and comparisons between the different methods are also provided in that chapter. Finally, in Chapter 5, we conclude this thesis by summarizing our work and propose some potential future research.

Chapter 2: Background

This chapter focuses on previous studies related to this research. The chapter starts with a review of the literature on Reinforcement Learning and how value-based algorithms can help us solve the decision-making of autonomous vehicles. Then the background knowledge of Artificial Neural Networks and how they evolved from the simple unit of the perceptron is described, and the decision-making tasks and edge cases of autonomous driving considered by previous researchers are explored as well.

2.1 Reinforcement Learning

The third type of Machine Learning is Reinforcement Learning (RL), in which an agent is trained through a trial-and-error process to learn the optimal policy. “Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.
The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them” (Richard S. Sutton, 2018) [18]. In other words, an agent learns what to do (take actions) in different situations (states) based on the rewards or penalties gained from the environment after taking actions. Interacting with the environment and learning the optimal policy π*(a|s), which works as a map that tells the agent the best action (a ∈ A) to take in each state (s ∈ S) so as to maximize the cumulative discounted reward, is the goal of reinforcement learning. There are two types of RL algorithms: value-based and policy-based. This research uses value-based RL algorithms to solve the decision-making problem of AD.

Definitions of some terminology can help in understanding the RL concept. The terms will be introduced through a story that many people have experienced in their daily life. Imagine you have a dog; we need a good name for her. We call her Ava! You take Ava out for a walk and want to teach her that each time you throw a ball, she should follow the ball and bring it back. Each time Ava brings the ball back, you give her a treat for her good job, and if she loses the ball or does not bring it back, no more treats! The concept of RL is the same: there is an agent (Ava), and the agent can perform actions, in Ava's case following the ball or not, and each action creates a new situation (state) for the agent. Based on the action that the agent (Ava) chooses, there is a reward or penalty (treat), and after a considerable number of experiences we can suppose that Ava learns to follow the ball and bring it back. All these experiences happen in an environment, in Ava's example open areas like parks. In the RL concept, everything outside of the agent is considered the environment.

Figure 2.1 shows all the stakeholders of RL in a sequential decision process, where the action of the agent not only has an immediate reward but also changes the long-term reward of the agent. This sequential decision process follows a finite Markov Decision Process, and it has the Markov property when each state carries all the relevant information from the history. In other words, in a Markov Decision Process (MDP), the next state depends only on the current state and action, and not on the history of previous states [19]. Therefore, RL based on an MDP can be defined as a tuple (S, A, T, R, γ), where an agent in state S takes action A with state transition probability T and receives reward R, which is discounted with γ to balance the long-term and short-term effects of rewards.

Figure 2.1 Stakeholders in Markov Decision Process (Richard S. Sutton, 2018)

The sequence of the agent's experience based on the MDP is as follows:

S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots    (2.1)

It means that at each timestep an agent, with knowledge S_i from the environment, takes an action A_i, and in the next step it receives the consequence of that action as the reward R_{i+1}. The policy defines how the agent chooses one of the actions, causing the transition to the next state and reward. The sequence of states, actions, and rewards is built by following the policy:

\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}, \quad a \in \mathcal{A}(s)    (2.2)
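To make the agent-environment interaction of Figure 2.1 and the experience sequence in (2.1) concrete, the following minimal Python sketch runs one episode of a toy agent-environment loop. The corridor environment and the random policy are illustrative placeholders only, not part of the thesis implementation.

```python
# A toy agent-environment loop producing the sequence S0, A0, R1, S1, A1, R2, ...
# of equation (2.1). Environment and policy are illustrative, not the thesis code.
import random

class CorridorEnv:
    """A 1-D corridor: the state is a position 0..4 and the goal is position 4."""
    def reset(self):
        self.pos = 0
        return self.pos                              # S_0

    def step(self, action):                          # action: 0 = left, 1 = right
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        reward = 1.0 if self.pos == 4 else -0.1      # R_{t+1}
        done = self.pos == 4                         # termination condition
        return self.pos, reward, done                # S_{t+1}, R_{t+1}, done

def random_policy(state):
    return random.choice([0, 1])                     # pi(a|s): uniform over actions

env = CorridorEnv()
state, done, trajectory = env.reset(), False, []
while not done:
    action = random_policy(state)                    # A_t ~ pi(.|S_t)
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward))       # (S_t, A_t, R_{t+1})
    state = next_state
print(trajectory)                                    # the experience sequence of (2.1)
```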
The transition probability in equation (2.3) gives the probability that an agent in state s, taking action a, reaches state s' and receives reward r:

p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}    (2.3)

Based on probability theory, the sum of these probabilities over all outcomes for each state-action pair is equal to one, as in (2.4):

\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1    (2.4)

The agent's goal is to maximize the cumulative long-term reward that it can receive from the environment. The sum of all the rewards received after time step t is calculated as in (2.5), where T is the final time step:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T    (2.5)

For the reward calculation, we need to consider the trade-off between future rewards and short-term rewards. This is achieved by discounting the reward with γ as a discount parameter in (2.6):

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}    (2.6)

The discount factor satisfies 0 < γ < 1: a value close to 0 acts like a short memory, in which only immediate rewards matter, while a value close to 1 acts like a long memory, in which future rewards count almost fully. Mostly the discount factor is selected in the range 0.95 < γ < 0.98 to combine both the short-term and long-term effects of rewards. The goodness of each state under a policy is calculated by the state-value function, as described in (2.7):

v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]    (2.7)

The expected value of each state under a policy π in the stochastic case, where the agent can reach different states s' by taking action A, is calculated by the Bellman equation in (2.8):

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\left[r + \gamma v_\pi(s')\right]    (2.8)

If a policy π attains the optimal value of each state, v*(s), then this policy also attains the optimal value of the state-action function. The value of the state-action function is called the Q-value, and we want to find Q* as the optimal value through the update in (2.9):

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right]    (2.9)

This equation is known as Q-learning and was defined by (Watkins, 1992) [20] as a value-based RL algorithm. Monte Carlo and Temporal Difference are two approaches that can improve the modeling of model-free RL algorithms. In the Monte Carlo update, we consider the difference between the value of each state and the expected return, as in (2.10):

V(S_t) \leftarrow V(S_t) + \alpha\,\left[G_t - V(S_t)\right]    (2.10)

The other method that is mostly used to update model-free RL algorithms is the Temporal Difference (TD) update. In the TD update, we use the bootstrapping method based on the Bellman equation: by using the value of the successor state, we can estimate the value of the current state, as in (2.11):

V(S_t) \leftarrow V(S_t) + \alpha\,\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]    (2.11)

The difference between the estimated value and the real value of the state is known as the TD error, denoted by δ_t, as in (2.12):

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)    (2.12)

Because we are working with a model-free RL algorithm and do not know the state transition probabilities, we are interested in the value of each (state, action) pair, Q(s, a), instead of the value of each state V(s). There are two main model-free algorithms: SARSA (state, action, reward, state, action) as an on-policy algorithm and Q-learning as an off-policy algorithm. In SARSA, the algorithm chooses the next action based on the policy derived from Q, whereas Q-learning is an off-policy algorithm. The following algorithm shows how Q-learning, as an off-policy, value-based algorithm, works. We will use Q-learning in the next chapters to solve the decision-making problem in autonomous driving. We note that all of the preceding information and mathematics were obtained from the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto [18].

Algorithm 1. Q-learning
Preconditions: γ: discount factor, α: learning rate, ε: decay factor
Initialization: Q(s, a) = 0
for each episode of all iterations do:
    while S is not a termination state do:
        Choose action a ∈ A based on the policy (e.g., based on the ε-greedy algorithm)
        Execute action a and get the next state S' and reward r
        Q(S, a) ← Q(S, a) + α [r + γ max_a' Q(S', a') − Q(S, a)]
        S ← S'
    end while
end for
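As an illustration of Algorithm 1, the following Python sketch runs tabular Q-learning with ε-greedy action selection on a toy five-state corridor. The environment, variable names, and hyperparameter values are illustrative assumptions, not the setup used in this thesis.

```python
# Tabular Q-learning (Algorithm 1) on a toy corridor MDP; illustrative only.
import random
from collections import defaultdict

N_STATES, GOAL, ACTIONS = 5, 4, [0, 1]               # actions: 0 = left, 1 = right
gamma, alpha, epsilon = 0.95, 0.1, 0.1               # example hyperparameters

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else -0.1), nxt == GOAL

Q = defaultdict(float)                               # Q(s, a) initialised to 0

for episode in range(500):
    state, done = 0, False
    while not done:
        if random.random() < epsilon:                # epsilon-greedy: explore
            action = random.choice(ACTIONS)
        else:                                        # exploit the best known action
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Q-learning update of equation (2.9)
        target = reward + (0.0 if done else gamma * max(Q[(nxt, a)] for a in ACTIONS))
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = nxt

# Greedy action learned for each state (1 = move right toward the goal).
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```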
As mentioned, there is a dilemma between exploration and exploitation. The trade-off between exploration and exploitation is addressed by the ε-greedy algorithm. Before we introduce ε-greedy, we need to define the concepts of exploration and exploitation. Imagine you are at the top of a mountain, as in Figure 2.2, and want to reach the lowest part of the mountain.

Figure 2.2 Exploration vs. Exploitation (starting points A, B, C and the minima A*, B*, C* they lead to)

As we can see in Figure 2.2, we can start from any point on the mountain, and the goal is to reach the minimum point. For simplicity, we consider three candidate points (A, B, C); by starting from A or B we reach local minima (A* and B*) that are not the optimal points. However, if we start moving downhill from point C, we reach the optimal point (C*) as the global optimum. The task of choosing the starting point in the solution space is called exploration, and moving in small steps from the initial point toward the optimal point is called exploitation. The agent first needs to do some exploration and after that exploit the solution space to reach the optimal solution. The trade-off between exploration and exploitation is handled by the epsilon-greedy algorithm [21]. Generally, in ε-greedy we generate a random number; if the random number is less than ε, the agent explores by taking a random action. Otherwise, the agent exploits, which means it chooses the best known action in that state.
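A minimal sketch of ε-greedy action selection with a decaying ε is shown below; the decay schedule and the numbers are example choices, not the values used later in this thesis.

```python
# Epsilon-greedy action selection with a decaying epsilon; illustrative values.
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: a list with Q(s, a) for every action available in the current state."""
    if random.random() < epsilon:                                  # explore
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)     # exploit

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, calling epsilon_greedy(q_values, epsilon) at each step ...
    epsilon = max(eps_min, epsilon * eps_decay)   # explore a lot early, exploit more later

print(epsilon_greedy([0.1, 0.5, -0.2], epsilon))  # usually returns 1 once epsilon is small
```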
At this point we have reviewed the essential knowledge about reinforcement learning and Q-learning as an off-policy algorithm. Before moving to the next section and describing artificial neural networks and deep neural networks, we want to review on-policy algorithms as another type of RL algorithm. On-policy algorithms directly optimize the policy performance, and they do not reuse old data. Off-policy algorithms like Q-learning, however, reuse all the previous data, based on the Bellman equation, to learn the optimal policy. The idea of an on-policy algorithm like Vanilla Policy Gradient (VPG) is that it optimizes the policy so that the agent takes the actions that lead to more reward with greater probability and takes the actions with less reward with lower probability. The policy is updated with standard gradient descent, stochastic gradient descent, or Adam as the optimizer. For more information about on-policy algorithms we refer readers to VPG (Sutton et al., 2000) [21], Proximal Policy Optimization (PPO) (Schulman et al., 2017) [22], Deterministic Policy Gradients (DPG) [23], and Soft Actor-Critic (SAC) (Haarnoja et al., 2018) [24].

The problem with Q-learning and using a Q-table to store all the values is that as the number of states and actions increases, the computational cost increases dramatically. Most of the time, in the real world, we have a huge number of state-action combinations. Therefore, storing and querying all the information in a tabular format is impossible. The second problem with the tabular version of Q-learning is that the relationships between different states that share similarities cannot be captured. Capturing this knowledge across different states becomes possible when we use a Deep Neural Network as a powerful function approximator. The great success of the Deep Neural Network (DNN) in various tasks like computer vision, natural language processing, and speech recognition led to the idea of combining these two concepts to build Deep Reinforcement Learning (DRL) as a combination of the Deep Neural Network (DNN) and Reinforcement Learning (RL).

2.2 Artificial Neural Network and Deep Learning

The aim of creating a solution based on brain activity, transferring data into knowledge with the help of brain cells (neurons), led to the creation of the Artificial Neural Network (ANN). A simple mathematical representation of the neuron can be found in early AI works like [25]. Since then, researchers have tried different combinations of neurons and different ways of stacking them to make ANNs more capable of solving complicated problems, like a human brain. What makes ANNs so special compared to other solution approaches is their capability of capturing non-linearity, as most phenomena in this world have some degree of non-linearity in their behavior. The smallest unit in the ANN structure is called the perceptron, as shown in Figure 2.3. The perceptron (neuron) has inputs, an activation function, and an output as subcomponents.

Figure 2.3 Perceptron, the Smallest Component of ANNs (inputs x1 ... xn with weights w1 ... wn and bias b feeding an activation function that produces the output)

An artificial neural network, with its chain-like structures of neurons, has a great capability of approximating non-linear behaviors. Finding the best combination of all the weights and biases of all the neurons in the neural network is called learning. There are different configurations of neural networks, and the simplest neural network is called the multilayer perceptron (MLP) [26]. In MLPs, the first layer is called the input layer, where all the preprocessed data from outside the network is fed into the network. After the input layer, neural networks have one or more hidden layers, in which all the processing is performed. Finally, the network is terminated with the output layer, whose output neurons, depending on the task, can predict, describe, or discover relationships from the input data. Figure 2.4 shows an MLP network with more than one hidden layer, which is usually considered a Deep Neural Network, or Deep Learning. Since all the neurons are connected to the neurons in the next layer, these types of networks are considered fully connected networks.

Figure 2.4 Deep Neural Network

We will propose a deep reinforcement learning solution with the help of both a Fully connected Neural Network (FNN) and a Convolutional Neural Network (CNN), as they can learn the complex relationships in the input data.
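As a concrete illustration of a fully connected network, the sketch below builds a small MLP. PyTorch is assumed here purely for illustration (the thesis does not prescribe a framework at this point), and the layer sizes are examples, chosen to match the 31-element state and 5 actions defined later in Chapter 3.

```python
# A minimal fully connected (MLP) network; framework and sizes are assumptions.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=31, hidden=64, out_dim=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),    # input layer -> first hidden layer
            nn.ReLU(),                    # non-linear activation
            nn.Linear(hidden, hidden),    # second hidden layer
            nn.ReLU(),
            nn.Linear(hidden, out_dim),   # output layer, one value per action
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
state = torch.randn(1, 31)                # a dummy 31-element input vector
print(model(state).shape)                 # torch.Size([1, 5])
```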
2.3 Decision-Making Task for Autonomous Vehicles

The two main controllers in Autonomous Vehicles (AVs) are longitudinal control and lateral control. Longitudinal control mainly means controlling the speed of the vehicle. Keeping an adequate speed that maintains a safe distance to the leading vehicle, applying the brake in emergency situations, and holding a constant speed through control of the brake and throttle pedals are part of the longitudinal task. Lateral control means how the autonomous vehicle steers in different driving situations. Changing lanes safely without causing trouble for the neighboring vehicles in the target lane, keeping to the center of the lane, and turning are part of lateral control. For a full review of car-following and lane-changing models, we refer readers to the comprehensive studies [27][28][29]. We consider two famous models that were mostly used in previous studies of the decision-making of AVs. These models are considered rule-based solutions for the decision-making of AVs: they are based on mathematical formulations, and they give excellent results in certain situations but not in all situations. However, by creating a solution based on Machine Learning, the generalization problem of rule-based solutions can be solved.

2.3.1 Rule-based Controllers

One of the rule-based models that can control longitudinal trajectories very well is the Intelligent Driver Model (IDM) [29]. The IDM, as an accident-free model, sets the speed of the autonomous vehicle by considering the minimum gap and the speed difference between two vehicles, as in equations (2.13) and (2.14):

\dot{v} = a \left[ 1 - \left( \frac{v}{v_0} \right)^{\delta} - \left( \frac{s^*(v, \Delta v)}{s} \right)^{2} \right]    (2.13)

s^*(v, \Delta v) = s_0 + vT + \frac{v\,\Delta v}{2\sqrt{a b}}    (2.14)

where v_0 is the desired velocity, δ the acceleration exponent, s the gap between the two vehicles, Δv the speed difference (approach rate), s_0 the minimum gap distance, a the maximum acceleration, b the desired deceleration, and T the safe time headway. Figure 2.5 shows the AV and the leader vehicle that we need to consider in IDM. We will use the terms autonomous vehicle (AV) and ego interchangeably from now to the end of this thesis.

Figure 2.5 Autonomous Vehicle (ego) and Leader Vehicle Following IDM (the ego follows the leader at gap d with time headway T)

The IDM can play the role of a longitudinal controller when the ego needs to set and control the speed of the vehicle and avoid rear-end collisions. The second model helps the ego change lanes as a lateral controller. This lateral model, called Minimize Overall Braking Induced by Lane Changes (MOBIL), was introduced in [30]. It minimizes the effect of a lane change on the positive and negative accelerations of the neighboring vehicles. Consideration of the neighboring vehicles before changing lanes in the MOBIL model is expressed by (2.15):

\tilde{a}_c - a_c + p\left[(\tilde{a}_n - a_n) + (\tilde{a}_o - a_o)\right] > \Delta a_{th}    (2.15)

Safety constraints in this model check the feasibility of changing lanes by considering the traffic effect in the target lane: the deceleration of the successor vehicle after a lane change into the target lane must satisfy \tilde{a}_n \geq -b_{safe}, where \tilde{a}_n is the acceleration of the new follower in the target lane and b_{safe} is a safety acceleration parameter. Here a_c, a_n, and a_o are the current accelerations of the ego, the new follower, and the old follower, respectively, and \tilde{a}_c, \tilde{a}_n, and \tilde{a}_o are the corresponding accelerations if the ego changes lanes. Also, p is the politeness factor, and the lane is changed if the sum of the weighted acceleration gains is greater than the changing threshold on the right-hand side of equation (2.15).

Table 2.1 IDM and MOBIL Parameters

Index   | Definition                       | Value
δ       | Acceleration exponent            | 4
s_0     | Minimum gap between two vehicles | 2 (m)
T       | Safe time headway                | 1.6 (s)
a       | Maximum acceleration             | 1.8 (m/s²)
b       | Desired deceleration             | 2 (m/s²)
b_safe  | Safety acceleration parameter    | 4 (m/s²)
p       | Politeness factor                | 1
Δa_th   | Changing threshold               | 0.1 (m/s²)

All the parameters of both IDM and MOBIL are presented in Table 2.1. We will use a combination of IDM and MOBIL as the rule-based solution.
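The following Python sketch implements the IDM acceleration of equations (2.13)-(2.14) and the MOBIL incentive test of equation (2.15), using the parameter values of Table 2.1. The function and variable names are mine and the desired velocity v0 is an assumed example, so this is an illustration rather than the thesis code.

```python
# Illustrative IDM (eqs. 2.13-2.14) and MOBIL incentive test (eq. 2.15).
import math

def idm_acceleration(v, v_leader, gap,
                     v0=30.0,     # desired velocity (m/s): assumed road speed limit
                     delta=4,     # acceleration exponent
                     s0=2.0,      # minimum gap (m)
                     T=1.6,       # safe time headway (s)
                     a_max=1.8,   # maximum acceleration (m/s^2)
                     b=2.0):      # desired deceleration (m/s^2)
    dv = v - v_leader                                              # approach rate
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))      # eq. (2.14)
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)   # eq. (2.13)

def mobil_incentive(gain_ego, gain_new, gain_old, p=1.0, a_thr=0.1):
    """gain_* are the acceleration gains (a_tilde - a) of the ego, the new
    follower, and the old follower; returns True if the lane change pays off."""
    return gain_ego + p * (gain_new + gain_old) > a_thr            # eq. (2.15)

# Ego at 25 m/s closing on a leader at 20 m/s with a 60 m gap:
print(idm_acceleration(25.0, 20.0, 60.0))   # about -1.9 m/s^2, i.e. gentle braking
```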
2.3.2 RL-based Controllers

The reason researchers started considering RL-based solutions instead of rule-based solutions is that rule-based solutions only perform well in certain situations. To address the generalization problem of rule-based algorithms, the researchers in [31][32] tried to optimize the performance of rule-based algorithms with the help of the genetic algorithm as a metaheuristic. However, rule-based approaches remain vulnerable to unforeseen circumstances and need their features modified for new situations. Therefore, recent studies have proposed a new approach, Deep Reinforcement Learning (DRL), for the decision-making of AVs as a combination of Reinforcement Learning (RL) and Deep Neural Networks (DNN). For example, studies like [33] trained the ego to control steering, throttle, and brake in a simulated world created in CARLA [34], one of the open-source simulators. Traffic fluidity is considered by [35], where the ego needs to control its speed and avoid collisions with other vehicles.

Recent studies use DRL to solve complicated edge cases of autonomous driving. For example, the researchers in [36] proposed a DRL model to train the ego to perform the on-ramp merge task, which involves accelerating, decelerating, and steering. In this research, they use a Long Short-Term Memory (LSTM) architecture to learn the relationship between the ego and the surrounding vehicles, and the internal state from the LSTM cell is fed into the DRL architecture. Wang and Chen in [37] proposed a new DRL-based model with continuous state and action spaces for merging tasks in a highway environment. The reward function in this study is designed to perform a highway merge like a human driver, where smoothness, safeness, and promptness are the key attributes of the reward function. More studies aim to solve merging tasks with DRL solutions, like [38], which introduced Deep Merging as a DRL-based solution with image input data, and [39], which proposed a DRL-based model for on-ramp merging with traffic light input data.

Another edge case that many researchers have tried to solve with DRL is the overtaking task. In the overtaking task, the ego wants to pass the next vehicle safely. For overtaking, the ego needs to control both speed and lane-change decisions. For example, the researchers in [40] introduced two DRL-based solutions and found that if the agent controls both the lateral and the longitudinal task with the DRL solution, it provides better results compared to lateral control by DRL and longitudinal control by IDM. The researchers in [41] proposed two DRL-based models to control speed and lane-change decisions for overtaking in a highway driving environment. The researchers in [42] considered overtaking tasks in two different highway traffic scenarios, with and without oncoming traffic, and proposed DRL-based solutions for both.

Researchers at the University of California, Berkeley developed FLOW [43][44] (https://github.com/flow-project/flow), which provides benchmarks for reinforcement learning in mixed-autonomy traffic. These benchmarks include different driving scenarios: circle, figure-eight, grid, merge, and bottleneck. They aim to develop solutions for a mixed-autonomy traffic system, where autonomous vehicles controlled by RL interact with human drivers. FLOW, as a Python library, interfaces with other Python RL libraries like RLlib [45] and rllab [46], and with Simulation Urban Mobility (SUMO) [47] as an open-source traffic simulator.

There are some studies that consider uncertainty; these uncertainties can come from noisy sensor data or from interaction with other road users.
For example, the researchers in [48] used a combination of Monte Carlo tree search and deep reinforcement learning for continuous highway driving and highway exit cases. The researchers in [49] proposed two solutions based on game theory for the decision-making of AVs and compared the results with human-like driver decisions. They tested the proposed solutions on lane-changing, merging, and overtaking scenarios. For more clarification, related previous work on the decision-making task of AVs is summarized in Table 2.2.

Table 2.2 Summary of Related Work

Work | DRL/Rule-based | Solving method                                    | Task                                          | Simulator
[29] | Rule           | IDM                                               | Longitudinal control                          | N/A
[30] | Rule           | MOBIL                                             | Lateral control                               | N/A
[31] | Rule           | (IDM + MOBIL) optimized by evolutionary algorithm | Lane change and speed control                 | N/A
[32] | Rule           | (IDM + MOBIL) optimized by GA                     | Lane assignment                               | N/A
[36] | DRL            | LSTM                                              | On-ramp merge                                 | N/A
[37] | DRL            | Q-function approximation                          | Merging with continuous action/state          | N/A
[38] | DRL            | Deep Merging / picture input data                 | On-ramp merging                               | N/A
[39] | DRL            | DQN + graph CNN                                   | On-ramp merging                               | SUMO
[41] | DRL            | DQN                                               | Speed and lane control                        | OpenAI / Python
[43] | Both           | IDM, MOBIL, and DQN                               | Speed and lane control                        | FLOW / SUMO
[33] | DRL            | Actor-Critic                                      | Normal driving                                | CARLA
[35] | DRL            | Proximal Policy Optimization                      | Normal driving, interaction with human driver | SUMO

For a full review of DRL-based solutions for autonomous driving, we refer readers to [50] and [51] as comprehensive studies related to intelligent transportation systems and autonomous driving. For the different edge cases and tasks, previous researchers proposed DRL-based or rule-based solutions in the studies above. Based on our knowledge, there is no study that addresses the decision-making of AVs when approached by an emergency vehicle. To reach fully autonomous driving at level 4, autonomous vehicles need to handle all driving edge cases. Also, optimal decisions by AVs in emergency situations can minimize the emergency response time. Therefore, in this study, we introduce and solve a new edge case for the decision-making of AVs: when the ego is approached by an emergency vehicle.

Chapter 3: Methodology

In this chapter, we start with a problem description. After that, we describe all the components needed for the RL concept and introduce our solution based on the Deep Q-Network (DQN). We propose two types of neural networks, and at the end we propose our solution based on the Dueling Deep Q-Network (DDQN) as an improvement over DQN.

3.1 Problem Description

The primary purpose of this research is to create a solution that lets the autonomous vehicle drive like a human driver when an emergency vehicle approaches. When human drivers notice the presence of an emergency vehicle, they reduce their speed and pull over to the side of the road. In the presence of emergency vehicles, the ego should use the same techniques as a human driver. To that end, the ego must first learn how to drive normally, without any collision with other vehicles, and travel within the road boundaries. Following that, the ego will learn how to respond to the emergency vehicle appropriately.
The objective function that needs to be minimized is shown in (3.1), where N^{lane}_t is the number of times the ego changes lanes or travels out of the road, N^{coll}_t is the number of times the ego crashes into other vehicles, t_{emg} denotes the exact time at which the ego recognizes the emg in its observation range, N^{same}_t denotes the number of times the ego has recognized the emg and they are in the same lane, and N^{speed}_t is the number of times the ego has recognized the emg while the speed of the ego is greater than the speed of the emg:

\min \; \sum_{t=0}^{T} \left( N^{lane}_t + N^{coll}_t \right) + \sum_{t=t_{emg}}^{T} \left( N^{same}_t + N^{speed}_t \right)    (3.1)

This equation captures all the objectives of this research: the first part requires the ego to learn the normal driving task, with no collisions with other vehicles and no travel outside the road boundaries, while the second part ensures a decent response after the ego senses the emergency vehicle. To minimize the above equation, the ego needs to control both its longitudinal and lateral trajectories. Before that, we need to build an environment that lets the ego experience all the different driving scenarios, especially when the ego is approached by the emg. For creating this environment, there are some open-source libraries that can help researchers create different traffic simulations. Some research trains models based on real-world data. For example, the researchers in [52] proposed Virtual Image Synthesis and Transformation for Autonomy (VISTA) to train the ego on real-world data. Training based on captured real-world data is susceptible to unforeseen driving situations and can cause trouble in some cases. To that end, other researchers use simulation to generate a world around the ego and extract data from that environment for training. We will create a simulation environment with the help of SUMO, and over different episodes we will train the ego both for normal driving and for responding to the emergency vehicle. Figure 3.1 shows the high-level workflow for solving this problem.

Figure 3.1 High-Level Workflow

3.2 RL Framework

The Reinforcement Learning framework has five main components: agent, environment, state, action, and reward, which we introduce below. We defined these components based on the proposed problem.

3.2.1 Agent

In the RL framework, an agent needs to sense the state of the environment and take an action that changes that state to a new one. In this research, the agent is the autonomous vehicle (AV), which we call the ego; it wants to learn the best longitudinal and lateral decisions when approached by an emergency vehicle in a highway environment. When the simulation starts, we consider a box of length (obs_range) around the ego, with a limit (nb_obsrv) on the number of observed vehicles. The state array for the ego consists of the positions (x, y) and speeds (v) of all the vehicles in the observation range. The agent (ego) senses the environment at each step of the simulation and, based on this data, takes the best action and transitions to the new state. The ego can observe the environment with a limit of 10 vehicles. This observation includes the position (x, y) and the speed (v) of the surrounding vehicles. So, we consider an array of 31 elements that forms the state of the ego at each step: 20 elements for the positions and 10 elements for the speeds of the surrounding vehicles, plus a binary variable that indicates the presence of the emergency vehicle, as shown in Figure 3.2.

Figure 3.2 Input Data Representation: [1/0, x1, y1, v1, ..., x10, y10, v10]
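A sketch of how the 31-element state of Figure 3.2 could be assembled from the simulation through Traci is given below. The traci calls are part of SUMO's Python API, but the helper name, the observation-range value, and the padding scheme are illustrative assumptions rather than the thesis implementation.

```python
# Assemble the 31-element state [emg flag, x1, y1, v1, ..., x10, y10, v10].
import traci

NB_OBSRV = 10          # at most 10 surrounding vehicles are observed
OBS_RANGE = 100.0      # (m) half-length of the observation box (example value)

def build_state(ego_id="ego", emg_id="emg"):
    ego_x, ego_y = traci.vehicle.getPosition(ego_id)
    emg_seen, neighbours = 0.0, []
    for vid in traci.vehicle.getIDList():
        if vid == ego_id:
            continue
        x, y = traci.vehicle.getPosition(vid)
        if abs(x - ego_x) <= OBS_RANGE:                  # inside the observation box
            neighbours.append((x, y, traci.vehicle.getSpeed(vid)))
            if vid == emg_id:
                emg_seen = 1.0                           # emergency vehicle in range
    neighbours = neighbours[:NB_OBSRV]
    neighbours += [(0.0, 0.0, 0.0)] * (NB_OBSRV - len(neighbours))  # pad to fixed size
    state = [emg_seen]
    for x, y, v in neighbours:
        state.extend([x, y, v])
    return state                                         # 1 + 3 * 10 = 31 elements
```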
3.2.2 Environment

As we mentioned in Chapter 2, the environment is one of the most important components of the RL framework. The agent can experience different states and take different actions in an environment. One of the famous toolkits for creating different RL environments is OpenAI Gym (https://gym.openai.com/). For autonomous driving, we need to build a traffic simulation environment, and there are candidates like Simulation Urban Mobility (SUMO), CARLA, and the Udacity Self-Driving Car Simulator (https://github.com/udacity/self-driving-car-sim). The benefit of using open-source libraries like SUMO is that we can trust the accuracy of the information extracted from these traffic simulations. The downside of using traffic simulators is that, because they are built for general use, it can be difficult to build an environment and all its stakeholders on top of them. Based on all our needs, in this research we prefer to use SUMO as an accurate and compatible traffic simulator. SUMO, a 2D simulator, in combination with the Traffic Control Interface (Traci) [53], can handle different traffic scenarios. It is worth mentioning that both CARLA and the Udacity Self-Driving Car Simulator work as 3D simulators; however, they come with significantly higher computational costs as they rely on Graphics Processing Units.

For building the environment, we started by creating three .xml files. The first .xml file is simulation.node.xml, which defines all the nodes of the simulation. We consider two nodes, A and B, as the beginning and the end of a route with a 2000-meter length. The first node A has coordinates (x = 0, y = 0) and the second node B has coordinates (x = 2000, y = 0). The second .xml file, simulation.edg.xml, determines the edge between the two nodes. The third .xml file is simulation.rou.xml, which determines the types of vehicles that travel on the route. We consider three types of vehicles in this study: the first type is the emergency vehicle (emg), the second type is the passenger vehicle (veh), and the third type is the autonomous vehicle (ego). The speed factor determines the desired speed, which multiplies the road speed limit; an emergency vehicle with a speed factor of 1.5 can travel 50 percent above the road speed limit. Table 3.1 shows all the parameters of these three vehicle types.

Table 3.1 Vehicle Types

Type | Length (m) | Width (m) | maxSpeed (m/s) | minSpeed (m/s) | speedFactor
emg  | 16.0       | 2.55      | 50             | 20             | 1.5
ego  | 4.8        | 1.8       | 30             | 15             | 1
veh  | 4.8        | 1.8       | 30             | 20             | 1

The last essential .xml file is simulation.net.xml, which takes the node file and the edge file as input and generates the network file as output. The simulation.net.xml file creates all the lanes and connects all the nodes with edges. For more information about how to define nodes, edges, routes, and networks, we refer readers to the SUMO documentation at https://sumo.dlr.de/docs/Tutorials/Hello_World.html. The simulation configuration file is created by combining all the previous .xml files, and we obtain a highway with a length of (highway_length) and three lanes (l0, l1, l2) from rightmost to leftmost, where both left and right lane changes are allowed.
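The listing below sketches what minimal node and edge definitions could look like and how netconvert combines them into the network file. The XML attributes and netconvert options shown are standard SUMO usage, but the exact contents of the thesis files are not reproduced there, so treat this as an assumed, simplified example.

```python
# Write minimal node/edge files and build simulation.net.xml with netconvert.
import subprocess

nodes = """<nodes>
    <node id="A" x="0" y="0"/>
    <node id="B" x="2000" y="0"/>
</nodes>"""

edges = """<edges>
    <edge id="AB" from="A" to="B" numLanes="3" speed="30"/>
</edges>"""

with open("simulation.node.xml", "w") as f:
    f.write(nodes)
with open("simulation.edg.xml", "w") as f:
    f.write(edges)

# netconvert merges the node and edge definitions into the network file.
subprocess.run(["netconvert",
                "--node-files=simulation.node.xml",
                "--edge-files=simulation.edg.xml",
                "--output-file=simulation.net.xml"], check=True)
```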
At the beginning of each experiment, vehicles are spawned into the environment with parameters that are partly deterministic and partly random. The emergency vehicle (emg) always starts its travel from (x = 0) in the emergency lane and has a random departure speed in [20, 50] (m/s). The ego vehicle departs from a random position within the first 200 meters of the highway, in front of the emg vehicle, in one of the three lanes; the ego can travel in all the lanes. Other vehicles of type (veh) are responsible for making the longitudinal and lateral control harder for the ego; they spawn in the two non-emergency lanes with random speeds in [15, 30] (m/s) and use IDM as their longitudinal controller. The highway environment built with the help of SUMO is shown in Figure 3.3. The green vehicle is the ego, the blue vehicles are passenger-type vehicles, and the vehicle with the emergency sign is the emergency vehicle.

Figure 3.3 Highway Environment

All the configurations were implemented with the help of SUMO, which works as a server. We need a command-response exchange between the client and SUMO. This connection between our Python code, which is responsible for training the agent and deploying the actions, and SUMO as a server is handled by Traci. Figure 3.4 shows the connection between the Python module (DRL-code) and SUMO as the environment.

Figure 3.4 DRL and SUMO Connection Through Traci

With the help of Traci, it is possible to retrieve the values of all the objects in the simulation and change their behavior online. For more information about Traci, we refer readers to the Traci documentation at https://sumo.dlr.de/docs/TraCI.html.

Some termination conditions cause an episode to finish, after which the environment is reset for a new one. Table 3.2 shows all the termination conditions.

Table 3.2 Termination Conditions

Reason         Definition
Out of road    The ego is in the rightmost lane and chooses action a2 (change to the right lane), or is in the leftmost lane and chooses action a1 (change to the left lane).
Accident       An accident of the ego with another vehicle.
Out of range   The emergency vehicle passes the ego and leaves the observation range after having been recognized.
Endpoint       The ego or the emergency vehicle reaches the endpoint of the route.

3.2.3 Actions

Based on different states, the ego needs to take different actions. We consider 5 actions that the ego can take to control its lateral and longitudinal movements. Table 3.3 shows all the actions considered in this research.

Table 3.3 Actions

Action   Definition
a0       Stay in the current lane, keep speed
a1       Change to the left lane
a2       Change to the right lane
a3       Stay in the current lane, increase speed by 3 m/s
a4       Stay in the current lane, decrease speed by 5 m/s

These are the actions that the ego can take in each step of the simulation. By taking one of these actions, the agent (ego) changes the current state to a new state, which can bring a reward for the agent.
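As an illustration, the following is a minimal sketch, not the thesis action_func() implementation, of how the five discrete actions of Table 3.3 could be executed on the ego through Traci. The vehicle id "ego" and the one-second lane-change duration are assumptions; changeLane, setSpeed, getLaneIndex, and getSpeed are standard TraCI vehicle calls, and in SUMO lane index 0 is the rightmost lane, so a left change increases the index.

import traci

def execute_action(action, veh_id="ego", duration=1.0):
    lane = traci.vehicle.getLaneIndex(veh_id)
    speed = traci.vehicle.getSpeed(veh_id)
    if action == 0:                       # a0: stay in lane, keep speed
        traci.vehicle.setSpeed(veh_id, speed)
    elif action == 1:                     # a1: change to the left lane
        traci.vehicle.changeLane(veh_id, lane + 1, duration)
    elif action == 2:                     # a2: change to the right lane
        traci.vehicle.changeLane(veh_id, lane - 1, duration)
    elif action == 3:                     # a3: increase speed by 3 m/s
        traci.vehicle.setSpeed(veh_id, speed + 3.0)
    elif action == 4:                     # a4: decrease speed by 5 m/s
        traci.vehicle.setSpeed(veh_id, max(speed - 5.0, 0.0))
    # lane choices that leave the road are not clamped here; they trigger the
    # "Out of road" termination condition of Table 3.2 in the step() logic.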
3.2.4 Reward

An agent is trained to take the best actions based on the reward it gains from the environment. In the Ava example, this reward was a small biscuit after bringing the ball back. In this research, we consider negative rewards as penalties. Therefore, if the ego makes a bad decision, it incurs a penalty, and the goal is to minimize the penalty at each step of the simulation. Table 3.4 shows all the penalties and the definition of each.

Table 3.4 Penalties

Definition                                                                          Penalty
The ego travels out of the road boundaries                                          -200
The ego has a collision with other vehicles                                         -150
Each time the ego changes a lane                                                    -1
Each step that the ego and the emg have the same lane                               -5
Each step that the ego recognizes the emg and has a speed greater than the emg      -1

If the ego travels out of the road or has an accident with another vehicle, the episode is terminated; for the other penalties, the ego can continue the episode. The reason we assign a small penalty to changing lanes is that we do not want the ego to change lanes continuously. We also include a penalty for speed violations to check that the ego learns to decrease its speed after recognizing the emg. For each step of an episode, the sum of all the above penalties is the penalty that the agent receives from the environment.

3.3 Deep Reinforcement Learning for Autonomous Driving

Google DeepMind researchers developed the deep Q-network (DQN) [9] and showed that DQN, as a combination of neural networks and reinforcement learning, could achieve appropriate results on the Atari 2600 games, comparable to professional human gamers. Figure 3.5 shows the workflow in a single step of one episode based on DRL. We want to approximate Q*(s, a) at each step in order to take the best action. Approximating Q(s, a) with a neural network is challenging: it needs a lot of data and, more importantly, the high correlation between consecutive states and actions can cause oscillation of the network weights and divergence.

Figure 3.5 Workflow in a Single Step of Simulation and Training

To solve the divergence problem, researchers use two concepts: experience replay memory and a second network as a target network. Each experience, in the form of a tuple (s, a, s', r), is stored in a memory with capacity max_cap; when the memory is full, the oldest experiences are removed. This memory is called the experience replay memory. Instead of feeding the network one state at a time, the input is a random minibatch of size minibatch_size, which reduces the correlation between the input data.
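A minimal sketch of such a replay memory, assuming the max_cap of 500 and minibatch_size of 32 listed later in Table 4.3, could look as follows; this is an illustration, not the thesis Memory class.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, max_cap=500):
        # bounded buffer: once full, the oldest experiences drop out automatically
        self.buffer = deque(maxlen=max_cap)

    def push(self, state, action, next_state, reward, done):
        self.buffer.append((state, action, next_state, reward, done))

    def can_provide_sample(self, minibatch_size=32):
        return len(self.buffer) >= minibatch_size

    def sample(self, minibatch_size=32):
        # random minibatch to break the correlation between consecutive experiences
        return random.sample(self.buffer, minibatch_size)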
The DQN algorithm includes two networks, a policy network and a target network. The loss is calculated between the Q-value of the action in the experience tuple and the target value of that action. At first we do not know the target Q-value, and we want to approximate it. To calculate the target value, we make a second pass with the next state s', provided that the next state is not a terminal state. We then take the maximum Q-value among the possible actions of that next state and use it in the Bellman equation to compute the target Q-value of the action. To reach the optimum network, the loss function in (3.2) needs to be minimized, and after some steps the target network is updated from the policy network. The target network is updated every target_update_counter steps.

L(\theta) = \mathbb{E}_{(s,a,s',r)} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^{2} \Big]        (3.2)

where \max_{a'} Q(s', a'; \theta^{-}) is the optimum Q-value from the target network and Q(s, a; \theta) is the Q-value from the policy network. This loss function is minimized with the help of gradient descent, following (3.3).

\nabla_{\theta} L(\theta) = \mathbb{E}_{(s,a,s',r)} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big) \, \nabla_{\theta} Q(s, a; \theta) \Big]        (3.3)

The reason we use two networks, rather than the policy network for the second pass as well, is that when the policy network is updated to move closer to the target Q-values, the target Q-values would also move in the same direction, because the same architecture and the same weights would be used to calculate both. To solve this problem, we use a second network as the target network and update its weights only after a number of steps denoted by the hyperparameter target_update_counter. To summarize the procedure, the following algorithm presents our DQN-based solution to the decision-making problem of autonomous driving.

Algorithm 2. Proposed solution based on DQN

Initialization:
    Initialize the replay memory with maximum capacity max_cap.
    Initialize the policy network with random weights.
    Clone the policy network as the target network.
For i in range number_of_simulation:
    Start the simulation with Traci and reset the environment.
    Add the vehicles and set all the controllers.
    Extract the state from one step of the simulation and preprocess the input data.
    Select an action based on the ε-greedy algorithm (exploration vs. exploitation).
    Execute the action in the simulation.
    Calculate the new state and the reward.
    Store the new experience in the experience replay memory.
    Sample a random batch from the experience replay memory and preprocess its states.
    Pass the processed batch to the policy network.
    Calculate the loss between the target Q-values and the output Q-values as
        L(\theta) = \mathbb{E}_{(s,a,s',r)} [ ( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) )^{2} ]
    Apply gradient descent to update the weights of the policy network and minimize the loss:
        \nabla_{\theta} L(\theta) = \mathbb{E}_{(s,a,s',r)} [ ( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) ) \nabla_{\theta} Q(s, a; \theta) ]
    After reaching target_update_counter, update the weights of the target network with the weights of the policy network.

With this algorithm, after a considerable number of training steps we end up with a trained neural network that can choose the best actions based on the optimum Q-values.

Before describing the neural networks in more detail, we need to normalize the input values extracted from the simulation at each step. To that end, we use the normalizations in (3.4), (3.5), and (3.6) for the positions and speeds observed within (obs_range) at each step of the simulation.

x_{norm} = (x - x_{ego}) / obs\_range        (3.4)
y_{norm} = (y - y_{ego}) / obs\_range        (3.5)
v_{norm} = (v - v_{min}) / (v_{max} - v_{min})        (3.6)

3.3.1 Neural Networks Architecture

In this research, we consider two types of neural network structures. The first network is a fully connected feed-forward network, giving the DQNFNN solution. This network is a multi-layer perceptron with 31 input neurons: as elaborated in the agent section, at each time step the agent considers a maximum of 31 input variables. The output layer consists of five neurons, one for each of the five actions introduced in Table 3.3. The network has 3 dense layers with 128 neurons each, with every neuron connected to the next layer to create a fully connected network. We use the Rectified Linear Unit (ReLU) [54] as the activation function. ReLU outputs values greater than zero exactly as they are and maps all other values to zero. It is a popular activation function that helps gradients backpropagate without vanishing. Figure 3.6 shows how the ReLU activation function works.

Figure 3.6 ReLU Activation Function
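For illustration, a minimal Keras sketch of this DQNFNN architecture could look as follows. It matches the layer summary given later in Table 3.5; the Adam optimizer and the learning rate of 0.001 come from Tables 4.1 and 4.3, and the mean-squared-error loss corresponds to (3.2). This is a reconstruction, not the exact thesis code.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_dqn_fnn(state_size=31, n_actions=5):
    model = models.Sequential([
        layers.Input(shape=(state_size,)),               # 31-element state vector
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_actions, activation="linear"),    # one Q-value per action
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse")
    return model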
The second network is designed with convolutional layers and gives the DQNCNN solution. Convolutional layers can learn complex and useful patterns from the input data. The first layer includes 32 filters with a kernel size of 3 and ReLU as the activation function. The second convolutional layer has 64 filters, a kernel size of 1, and ReLU as the activation function. The third layer includes 128 filters, a kernel size of 1, and ReLU as the activation function. A max-pooling layer is used as an aggregation layer after each convolutional layer. The summaries of both DQNFNN and DQNCNN are presented in Table 3.5 and Table 3.6.

Table 3.5 DQNFNN Structure

Layer (type)            Output Shape     Param #
input_1 (InputLayer)    (None, 31)       0
dense_1 (Dense)         (None, 128)      4096
dense_2 (Dense)         (None, 128)      16512
dense_3 (Dense)         (None, 128)      16512
dense_4 (Dense)         (None, 5)        645

Table 3.6 DQNCNN Structure

Layer (type)              Output Shape      Param #
all_input (InputLayer)    (None, 31, 1)     0
conv1d (Conv1D)           (None, 31, 32)    128
max_pooling1d             (None, 8, 32)     0
conv1d_1 (Conv1D)         (None, 8, 64)     2112
max_pooling1d_1           (None, 6, 64)     0
conv1d_2 (Conv1D)         (None, 6, 128)    8320
max_pooling1d_2           (None, 1, 128)    0
dense (Dense)             (None, 1, 128)    16512
dense_1 (Dense)           (None, 1, 128)    16512
dense_2 (Dense)           (None, 1, 128)    16512
dense_3 (Dense)           (None, 1, 5)      645
reshape (Reshape)         (None, 5, 1)      0

For the implementation and all the computation behind the neural network structures, we used TensorFlow (https://www.tensorflow.org/) as the machine learning platform and, on top of TensorFlow, Keras (https://keras.io/) as a deep learning API written in Python.
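For illustration, a minimal Keras sketch of the DQNCNN architecture could look as follows. The filter counts and kernel sizes come from the description above; the pooling sizes and padding settings are assumptions chosen so that the output shapes match Table 3.6. This is a reconstruction, not the exact thesis code.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_dqn_cnn(state_size=31, n_actions=5):
    model = models.Sequential([
        layers.Input(shape=(state_size, 1)),                        # state as a 1D sequence
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=4, padding="same"),           # (31, 32) -> (8, 32)
        layers.Conv1D(64, kernel_size=1, activation="relu"),
        layers.MaxPooling1D(pool_size=3, strides=1),                # (8, 64) -> (6, 64)
        layers.Conv1D(128, kernel_size=1, activation="relu"),
        layers.MaxPooling1D(pool_size=6),                           # (6, 128) -> (1, 128)
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_actions, activation="linear"),
        layers.Reshape((n_actions, 1)),                             # matches Table 3.6
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse")
    return model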
3.3.1 Dueling Deep Q-Network

The researchers in [55] improved the performance of DQN by introducing a dueling version of the algorithm. The dueling version is an optimized DQN with a modification of the output layer. They introduced the Dueling Deep Q-Network (DDQN) with two streams. Figure 3.7 shows the DQN structure and Figure 3.8 shows the DDQN structure.

Figure 3.7 DQN Structure

Figure 3.8 DDQN Structure

The dueling network has two streams: the first is the state value (the expected reward of state s) as a scalar, and the second determines the advantage of each action. By combining these two streams, we obtain Q-values as in a normal DQN. The critical point behind DDQN that makes it better than DQN is that it is unnecessary to calculate the value of each action Q(s, a) at every step, because not all states have the same learning capability. For example, knowing how to change lane or speed is very important in a situation where the ego recognizes both (veh) and (emg) vehicles in (obs_range), compared to the case where there is no vehicle around the ego in the observation range. The Q-value in the DDQN structure is calculated by (3.7).

Q(s, a) = V(s) + A(s, a)        (3.7)

This equation combines the value of each state with the advantage of each action. The researchers in [55] proposed the Q-value of the DDQN as in (3.8).

Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \Big)        (3.8)

The convolutional layer parameters are denoted by \theta, and the parameters of the two fully connected streams are denoted by \alpha and \beta. All other components of the DQN and DDQN structures are the same except for the last layer, where the final dense layer splits into two streams that are then combined to create the Q-values. This strategy in the dueling architecture brings better convergence and results for DDQN compared to the DQN structure.
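For illustration, a minimal Keras sketch of such a dueling output head, implementing the mean-subtracted aggregation of (3.8), is given below. In the thesis the dueling head sits on top of the convolutional network (DDQNCNN in Chapter 4); using a plain 128-unit dense trunk here is a simplification made for this sketch.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_dueling_dqn(state_size=31, n_actions=5):
    inputs = layers.Input(shape=(state_size,))
    x = layers.Dense(128, activation="relu")(inputs)
    x = layers.Dense(128, activation="relu")(x)
    value = layers.Dense(1)(x)                 # V(s): scalar state value
    advantage = layers.Dense(n_actions)(x)     # A(s, a): advantage of each action
    # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'), as in (3.8)
    q_values = layers.Lambda(
        lambda va: va[0] + va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True)
    )([value, advantage])
    model = models.Model(inputs=inputs, outputs=q_values)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse")
    return model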
In this chapter, we focused on the problem definition and on solutions based on Deep Reinforcement Learning. We introduced DQNFNN, DQNCNN, and DDQN as solutions for the proposed problem. In the next chapter, we describe the implementation process and the computational results in more detail.

Chapter 4: Computational Result

In this chapter, we review the implementation process and the results of both the rule-based and the RL-based solutions. After that, we compare the performance of the different solutions presented in Chapter 3 by considering various criteria.

4.1 Implementation

For the implementation part of this research, we use Python as the programming language. The Python APIs and libraries listed in Table 4.1 are also used.

Table 4.1 APIs and Python Libraries

API / Library                  Task Description
Sumolib                        Helps to generate the simulation components
Traci                          Real-time control of the simulation
NumPy                          Mathematical functions, matrix operations, random number generation
Matplotlib                     Visualization
Math                           Mathematical functions
TensorFlow                     The main platform for ML and DL
tensorflow.keras               Keras on top of TensorFlow as a DL API
tensorflow.keras.optimizers    Adam as the optimizer
tensorflow.keras.models        Create a sequential model
tensorflow.keras.layers        Create different types of neural network layers

Figure 4.1 shows the main classes, methods, and Python files used for the implementation.

Figure 4.1 Upper-Level View of Implementation

The first Python file is simulation_params.py, which includes all the essential parameters: those related to the simulation, the agent, and the training process. The second Python file is simulation.py, which has five methods for generating the nodes, edges, route, settings, and configuration. The next class is vehicles, which includes removing, adding, control, and subscribing methods. At the beginning of the simulation, all vehicles are first removed from the simulation to make sure no vehicle remains from the previous episode. Then new vehicles are added to the simulation with the add() method. The control() method sets all the controllers that the vehicles need during the simulation; it determines which vehicles use the rule-based controller and makes sure the ego has no default controller.

As mentioned before, we need to connect to SUMO as a server and subscribe to it; we therefore subscribe to all the useful information with the subs() method. This subscription returns the position, speed, and type of all vehicles in the observation range of the ego. The last class of simulation.py does the major part of the simulation. The observation() method returns the state of the ego by concatenating the vehicle type, positions, and speeds of the vehicles. The next method is action_func(), which executes the action; this action can be a random action or an action predicted by the neural network. After the execution of the action, the step() method determines the consequences of the action the agent took. This method includes queries about collisions of the ego with other vehicles, travel out of the road boundaries, reaching the end of the route, termination conditions, the relative speed and position of the ego with respect to the emg, lane changes, the out-of-range condition, the number of times the ego and the emg share the same lane, and speed violations. The step() method is responsible for all the queries and calculations in each step of an episode. The reward_cal() method calculates the reward of the agent at each step of an episode based on the penalties presented in Table 3.4. The last method of simulation.py is run(), which starts the simulation and initializes all the previous methods for the given number of episodes.

The DQNAgent.py file is responsible for creating everything the agent needs. The first class is Memory(object), which creates the experience replay memory. The push() method fills the experience replay memory with new experiences, collected as tuples (s, a, s', r). The sample() method extracts a batch of size minibatch_size from the experience replay memory, and can_provide_sample() makes sure there are enough experiences in the memory. The second class is EpsilonGreedy(object), whose exploration_rate() method determines the value of ε for the ε-greedy algorithm at each iteration; this value starts from one and decays with each iteration. The last class of DQNAgent.py is responsible for creating the neural network, prediction, and training. The create_model() method creates the fully connected layers of DQNFNN or the convolutional layers of DQNCNN and returns the model. The Q-value prediction for a single transition is computed by the get_qs() method, and the Q-value prediction for a batch of transitions of size batch_size is computed by get_batch_qs(). The last method of DQNAgent.py is train(), which is the main part of the agent and takes termination_state and a minibatch as input. The main part of train() is presented in the following pseudocode, which carries the primary role in training the agent.

For each sample (current_state, rand_action, new_current_state, reward, done) in the minibatch:
    If done is False:
        Calculate the maximum Q-value predicted by the target network.
        new_q_value = reward + discounted maximum Q-value
    Else:
        new_q_value = reward
Create the list of current states as input data.
Create the list of new_q_values for each action as target data.
Fit the model with the input data and the target data.

All the previous Python files are imported into the last Python file, Train.py, which connects all the essential modules defined in the previous files.
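For illustration, a minimal sketch of this update is given below, assuming the DQNFNN input shape and the discount factor of 0.99 from Table 4.3. The helper name train_step is hypothetical; this is not the thesis train() method.

import numpy as np

GAMMA = 0.99          # discount factor from Table 4.3

def train_step(policy_model, target_model, minibatch):
    states = np.array([s for (s, a, s_next, r, done) in minibatch])
    next_states = np.array([s_next for (s, a, s_next, r, done) in minibatch])

    current_qs = policy_model.predict(states, verbose=0)       # Q-values to adjust
    future_qs = target_model.predict(next_states, verbose=0)   # bootstrap from target network

    for i, (s, action, s_next, reward, done) in enumerate(minibatch):
        if done:
            new_q = reward                                      # terminal: no bootstrap
        else:
            new_q = reward + GAMMA * np.max(future_qs[i])
        current_qs[i][action] = new_q                           # target only for the taken action

    # one gradient-descent fit of the policy network toward the target Q-values
    policy_model.fit(states, current_qs, verbose=0)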
With this configuration the agent starts training, and because of the nature of exploration and exploitation it makes bad decisions and accumulates penalties during the first episodes. By going through more and more episodes and experiencing different situations, the ego starts learning how to make the best decisions in different situations.

4.1.1 Parameters and Hyperparameters

One of the essential Python files is simulation_params.py, which provides all the parameters and hyperparameters. Table 4.2 shows the parameters related to the simulation, and Table 4.3 shows the parameters related to the agent and the RL framework. There are also some further parameters related to the rule-based algorithm and the vehicle types, which were introduced in Table 2.1 and Table 3.1.

Table 4.2 Simulation Parameters

Parameter / Simulation              Value
Number of episodes, episode_num     5000
Number of vehicles, N               20
Minimum road speed, Vmin            15 (m/s)
Maximum road speed, Vmax            60 (m/s)
Highway length, dhighway            2000 (m)
Width of lane, w                    3.2 (m)
Number of lanes, n                  3

Table 4.3 RL Framework Parameters

Parameter / RL framework                          Value
Observation range, obs_range                      100 (m)
Maximum vehicle sense, Scap                       10
Initial epsilon, εstart                           1
Epsilon decay, εdecay                             0.0001
Learning rate, η                                  0.001
Discount factor, γ                                0.99
Maximum memory size, Mmaxsize                     500
Target network update, target_update_counter     5
Minibatch size, Mminibatch                        32
Minimum memory size, Mminsize                     100

4.2 Computational Result

The goal of this research is to have an autonomous vehicle that can handle both normal highway driving and the situation in which it is approached by an emergency vehicle. To solve this problem, two approaches were proposed: a rule-based solution as a baseline, combining IDM and MOBIL, and a DRL-based approach. We proposed DQNFNN, DQNCNN, and DDQNCNN as the DRL-based solutions. To evaluate the performance of the ego over the episodes, we consider six criteria, each defined as follows:

1. Number of accidents. The number of times that the ego causes an accident. If the ego causes an accident, the episode is terminated.

2. Number of times out of road. The number of times that the ego travels out of the road boundaries. If the agent is in the rightmost lane and chooses action a2, or is in the leftmost lane and chooses action a1, it goes out of the road and the episode is terminated.

3. Number of times the ego has the same lane as the emergency vehicle. The number of times that both vehicles are in the same lane, counted at each step of an episode; sharing a lane with the emg in one step is not a reason for terminating the episode.

4. Speed violations. The number of times that the ego observes the emergency vehicle while having a speed greater than the speed of the emg.

5. Number of times the emergency vehicle reaches the end before the ego. The number of times the emergency vehicle reaches the end of the route before the ego. The emg departs from x = 0 and the ego departs from a random position within the first 200 meters of the route. If the ego performs well on both longitudinal and lateral actions, changing lanes and decreasing its speed appropriately, then the emg reaches the end of the route before the ego, which is evidence of good performance.

6. Reward function. In the RL framework, the reward that the agent receives from the environment for the actions it takes in different states is an indicator for performance evaluation.
Usually in the RL framework we want to maximize the reward that the agent gets from the environment. However, in this research, since all the consequences of the actions are defined as penalties, the agent wants to minimize the penalty it receives from the environment.

4.2.1 Rule-Based vs. DRL-Based

In this section, the performance of the DRL-based and rule-based solutions is compared based on the six criteria introduced in the previous section. The ego experiences different situations in 5000 simulation episodes, each with a different number of steps depending on the performance of the ego and the termination conditions. Some parameters in these episodes are generated randomly; however, because these random numbers are generated from a fixed random seed, the results of the different solutions can be compared with each other.

Figure 4.2 Number of Accidents

Figure 4.2 shows the number of accidents in 5000 episodes for the four solutions. "Accident IDM" represents the rule-based solution, the combination of IDM and MOBIL. As we can see, there are no accidents in 5000 episodes with IDM; the reason is that MOBIL changes lanes conservatively and IDM sets the speed in a way that causes no accidents. The other three curves represent the DRL-based solutions, and DDQN has the best performance with the fewest accidents. All three approaches have more accidents in the first 1000 episodes because of the larger exploration rate: the agents start with more exploration and decay the exploration rate to do more exploitation. The performance of DQNCNN is better than that of DQNFNN. Over all 5000 episodes, DQNFNN has 71 accidents, DQNCNN has 46, and DDQN has 35.

The second important criterion is the number of times that the ego travels out of the road, shown in Figure 4.3, which indicates whether the ego learns the road boundaries and changes lanes appropriately when traveling in the rightmost or leftmost lane. As we can see, the ego equipped with IDM and MOBIL never goes out of the road boundaries, because the rule-based solution knows the boundaries of the route and, when in the rightmost or leftmost lane, is prevented from taking the wrong lateral action. The performances of DQNFNN and DQNCNN are close in the number of out-of-road events, and DDQN performs better than the other two. The DDQN agent learns after about 500 episodes, whereas DQNFNN and DQNCNN need about 1000 episodes to learn the road boundaries. Over 5000 episodes, DQNFNN goes out of the road 815 times, DQNCNN 776 times, and DDQN 537 times, mostly in the episodes in which the agent explores more than it exploits.

Figure 4.3 Number of Out of Road

The third evaluation is the number of times the ego has the same lane as the emergency vehicle within the observation range. The emergency vehicle starts its travel from (x = 0) in the emergency lane. Each time the ego observes the emergency vehicle in the observation range while they are in the same lane is counted and shown in Figure 4.4. In these plots, a smaller number of steps sharing a lane with the emg means better performance.

Figure 4.4 Number of Times the Autonomous Vehicle (ego) Has the Same Lane as the Emergency Vehicle

The three proposed DRL-based solutions (red, green, and blue) show some similarities in their performance.
For example, in the first 1000 episodes the number of steps sharing a lane with the emg is mostly bounded between 0 and 10, because in those episodes the ego explores and, after a few steps, is terminated by an accident or by going out of the road. After 1000 episodes the agent starts to learn to reduce the number of times it shares a lane with the emg. For a fair comparison between the four solutions, we consider the mean and standard deviation shown in Table 4.4. Table 4.4 shows that, over 3000 episodes, DDQN has a better mean and standard deviation for the number of times the ego has the same lane as the emergency vehicle than DQNCNN and DQNFNN.

Table 4.4 Mean and Standard Deviation Comparison

Method     Mean     Std
DDQN       6.859    7.65
DQNFNN     12.63    12.63
DQNCNN     7.97     8.28
IDM        5.18     8.75

Both DDQN and IDM show similar performance, and it is hard to decide which one performs better on the number of times the ego shares a lane with the emg: the means are almost the same, and DDQN has a better standard deviation than IDM. Therefore, we use another measure to determine which one performs better on this criterion. Again, we consider the 3000 episodes between episode 2000 and episode 5000. The reason for not considering the first 2000 episodes is that, as mentioned, in those episodes DDQN has accidents or goes out of the road due to exploration and consequently shares a lane with the emg fewer times. To complete the comparison between egoDDQN and egoIDM, Figure 4.5 is used. This comparison works like a zero-sum game in which one of the two players wins: from episode 2000 to episode 5000, the method with the fewer steps sharing a lane with the emg is the winner of that episode.

Figure 4.5 Comparison Between DDQN and IDM on Same Lane with Emergency Vehicle

Figure 4.5 shows that DDQN outperforms IDM on the number of times the ego shares a lane with the emg from episode 2000 to 5000; egoDDQN shares a lane with the emg fewer times than egoIDM in the same episodes.

The next evaluation is speed violations, the number of times the ego senses the emg in the observation range while having a greater speed than the emg. Figure 4.6 shows the number of speed violations for DQNFNN, DQNCNN, DDQN, and IDM as the rule-based solution. As mentioned before, to have a more accurate comparison we do not consider the first 2000 episodes, in which the ego mostly has an accident or goes out of the road even before recognizing the presence of the emg in the observation range. Figure 4.6 shows that DDQN, with a mean and standard deviation of (10.48, 9.02), performs better than DQNFNN (15.28, 6.96), DQNCNN (14.89, 6.98), and IDM (16.87, 17.44) as the rule-based solution. The reason the combination of IDM and MOBIL performs poorly on speed violations is that with IDM the ego mostly tries to reach a certain speed and does not change that speed when it observes the emergency vehicle.

Figure 4.6 Number of Speed Violations

Figure 4.7 shows that DDQN (red) and IDM (blue) at first have a similar number of speed violations, between [0, 5], for almost half of the episodes between (2000, 5000).

Figure 4.7 Speed Violation Comparison

DDQN has the better performance: in the remaining episodes it has fewer than 25 speed violations, whereas IDM has 25 or more speed violations in those episodes. For DDQNCNN, the number of speed violations lies between (10, 15) for most episodes.
The last evaluation criterion is the number of times the emergency vehicle reaches the end of the route before the ego, which evaluates both the lane changing and the speed reduction of the ego. The ego starts its travel in front of the emg, so by changing lanes and decreasing its speed appropriately, the emg should reach the end point before the ego. Figure 4.8 shows the number of times that the emg reaches the end of the road before the ego.

Figure 4.8 Emergency Vehicle Reaches the End of the Route Before the ego

Figure 4.8 shows that all the DRL-based solutions perform better than the rule-based solution. The reason for the poor performance of the rule-based solution is that IDM never decreases the speed because of the presence of the emergency vehicle; the only reason IDM changes the speed is to avoid accidents. For the 3000 episodes between [2000, 5000], the number of times the emg finishes the road before the ego is 979 for DQNFNN, 1001 for DQNCNN, and 1477 for DDQN. Therefore, we can conclude that DDQN performs better than the other solutions on the number of times the emg reaches the end before the ego.

Figure 4.9 Penalty Comparison Between DDQN and IDM + MOBIL

Figure 4.9 shows the penalties incurred by the ego equipped with DDQN and compares them with the penalties the agent receives using the combination of IDM and MOBIL. Figure 4.9 shows that egoDDQN outperforms egoIDM, as we can see from the lower penalties of egoDDQN (red dots) compared to egoIDM (blue dots). As expected, DDQN also performs better than DQNFNN and DQNCNN on the normalized average reward, shown in Figure 4.10 as the mean normalized penalty over 10 consecutive episodes.

Figure 4.10 Penalty Comparison of the Three DRL-Based Solutions

Based on the previous experiments, we notice that for some criteria, such as the number of accidents and the number of times the ego travels out of the road, the rule-based solution combining IDM and MOBIL performs better. However, the proposed DRL-based solutions, after some training steps, reach the same level of performance as the rule-based solution, with no accidents and no out-of-road events. Another important feature is the number of steps sharing a lane with the emergency vehicle, where we found that our proposed DRL-based solution performs better than the rule-based solution. The ego also needs to control its speed and keep it below the speed of the emergency vehicle after observing it. To that end, we compared the DRL-based solutions with the rule-based solution on the number of speed violations and found that the proposed DDQN performs better than the combination of IDM and MOBIL. The last feature we consider is the number of times the emergency vehicle reaches the end of the road before the ego, which reflects the performance of the ego on both reducing its speed and changing lanes. We designed the simulation so that the ego starts its travel in front of the emergency vehicle; if the ego reduces its speed or, in some cases, clears the emergency lane for the emergency vehicle, the emergency vehicle can finish before the ego. The proposed DRL-based solutions perform better than the rule-based solution on the number of times the emergency vehicle reaches the end of the route before the ego.
By considering all the previous criteria, we can conclude that the agent trained with the proposed DRL-based algorithm performs better than the rule-based solution used as a baseline. The proposed DRL-based solutions are capable of solving the generalization problem of the rule-based solutions, both in normal highway driving and when an emergency vehicle is approaching. In this chapter, we started with a description of the implementation process; after that, the essential parameters and hyperparameters were presented; finally, we investigated the performance of the proposed solutions and presented comprehensive computational results.

Chapter 5: Conclusion and Future Research

Automobiles, as the main component of road transportation, have been evolving for years. These days we can think about autonomous vehicles that can drive almost like a human driver. Autonomous vehicles, which can operate without a human driver, can play a major role in improving transportation systems, reducing traffic congestion, reducing accident rates, increasing safety, and generally increasing travelers' satisfaction. However, to achieve fully autonomous driving, some challenges still need to be solved, for example when the autonomous vehicle must make decisions in difficult situations or in scenarios that do not happen very often but are essential for reaching full autonomy. This research aimed to address one of those complicated incidents, a new edge case in autonomous driving in which an autonomous vehicle is approached by an emergency vehicle and therefore needs to control both its longitudinal and lateral trajectories. As the result of this research, we want an autonomous vehicle that can make the best decisions when approached by an emergency vehicle, just as a human driver would. The way all types of vehicles respond to the presence of an emergency vehicle can mean life or death for a person; therefore, we defined this problem as a new edge case for autonomous driving.

In this research, we proposed two solutions for the decision-making of autonomous vehicles. The first, rule-based, solution is the combination of two models: the Intelligent Driver Model (IDM) and Minimizing Overall Braking Induced by Lane Changes (MOBIL). These approaches are called rule-based because they are built on mathematical formulations as exact algorithms. The problem with the rule-based solution combining IDM and MOBIL is that it does not provide an accurate solution for all driving situations; in other words, it does not generalize to all driving scenarios. This lack of generalization motivates the use of neural networks as powerful function approximators. The strength of neural networks is that they can act as complex approximation functions that perform better than engineered functions such as the combination of IDM and MOBIL used as the rule-based solution. We used this power of neural networks together with the concept of reinforcement learning to train an agent, the autonomous vehicle, to take the best actions in different states. By combining neural networks and reinforcement learning, the DRL-based solutions are created. We started with a simple fully connected neural network, DQNFNN, and then made the network more complex by adding convolutional layers, creating DQNCNN.
To obtain a better result, we tried the dueling version of the DQN, DDQN, in which we consider the difference between the state value and the advantage of each action. We compared the results of the proposed solutions based on criteria designed to capture what we consider essential for autonomous driving and for responding to the presence of an emergency vehicle. We carried out a comprehensive quantitative analysis to find which of the proposed solutions performs best. Based on these experiments, we found that the DRL-based solutions, DDQN and DQNCNN, can perform better than the rule-based solution. For all these experiments, and to create the environment for training the agent, we built a simulation with the help of Simulation of Urban Mobility (SUMO) as an open-source traffic simulator and the Traffic Control Interface (Traci) to connect the different components of the simulation and the autonomous vehicle.

5.1 Limitations of This Research

We made some assumptions in this research, and these assumptions bring some limitations to the results and the performance of this work, especially if we want to transfer the results to the real world. The data generated in the simulation environment are not real-world traffic data; however, we tried to choose all parameters and assumptions to resemble what happens on real-world highways. The reason we use simulation is that collecting a considerable amount of traffic data in a highway environment, especially when an emergency vehicle is traveling toward and approaching the autonomous vehicle, is almost impossible. These limitations of the simulation environment need to be considered before implementing the results of this research in the real world. For example, we assume that all vehicles always travel in the middle of their lanes, which is not always true in the real world: we consider a lane width of 3.2 meters, vehicles are always located at the center of a lane, and a lane change moves a vehicle from the center of one lane to the center of the adjacent lane.

Another limitation of this research, which could be addressed in future work to make it more compatible with real life, is the number of actions we considered. We considered the 5 actions presented in Table 3.3; these are the actions the agent can take in each step of an episode. In the real world, a human driver often combines two of the actions in Table 3.3 in a single step. For example, consider action a2, changing to the right lane, and action a3, increasing the speed: in real life, a human driver sometimes changes both the lane and the speed at the same time. By considering combinations of actions we would have 9 actions instead of 5, which would be more realistic. However, 9 actions instead of 5 means calculating 9 Q-values and using 9 output neurons, which brings complexity and requires more computational resources. In addition, all the actions in this research have discrete values; for example, with action a3 the agent increases the speed of the ego by 3 m/s in one step. These values could be treated as continuous, which would be more compatible with the real world; with continuous actions, policy-based solutions might be more helpful than solutions based on Q-values.
In the next section, we present some directions that can improve the quality of decision-making in autonomous driving and can be considered as potential future research.

5.2 Future Research

There are several open challenges and extensions that can be considered as future studies following this research. The first potential direction is how to transfer the results, and the agent trained in a simulated environment such as SUMO or CARLA, to the real world; in general, transferring knowledge from simulation to the real world is a recognized problem in autonomous driving. The main goal of this research is for the ego to learn how to make decisions when approached by an emergency vehicle, and we used numeric data extracted from the simulation. In the real world, the shape, siren, and special color of emergency vehicles make them distinctive from other vehicles. If various types of data were available, for example from the rear-view cameras of the autonomous vehicle, it could be useful for future research to address the longitudinal and lateral decisions with the help of imagery data. Another potential direction is the ability of autonomous vehicles to share one lane with other vehicles in order to open the way for an emergency vehicle when there is traffic congestion.

In this chapter, we summarized this research and elaborated on its goal. We also discussed the limitations of this thesis and potential directions for further studies.

References

[1] National Highway Traffic Safety Administration. "Preliminary statement of policy concerning automated vehicles." Washington, DC 1 (2013): 14.
[2] Daily, Mike, et al. "Self-driving cars." Computer 50.12 (2017): 18-23.
[3] Rao, Qing, and Jelena Frtunikj. "Deep learning for self-driving cars: Chances and challenges." Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems. (2018).
[4] Zakharenko, Roman. "Self-driving cars will change cities." Regional Science and Urban Economics 61 (2016): 26-37.
[5] Minsky, Marvin. "Steps toward artificial intelligence." Proceedings of the IRE 49.1 (1961).
[6] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, (2016).
[7] Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, (2019).
[8] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
[9] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[10] Silver, David, et al. "Mastering the game of go without human knowledge." Nature 550.7676 (2017): 354-359.
[11] Vinyals, Oriol, et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning." Nature 575.7782 (2019): 350-354.
[12] Senior, Andrew W., et al. "Improved protein structure prediction using potentials from deep learning." Nature 577.7792 (2020): 706-710.
[13] Chowdhuri, Sauhaarda, Tushar Pankaj, and Karl Zipser. "Multinet: Multi-modal multi-task learning for autonomous driving." 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, (2019).
[14] Mehta, Ashish, Adithya Subramanian, and Anbumani Subramanian. "Learning end-to-end autonomous driving using guided auxiliary supervision." Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing. (2018).
[15] Kendall, Alex, et al. "Learning to drive in a day." 2019 International Conference on Robotics and Automation (ICRA). IEEE, (2019).
[16] Li, You, and Javier Ibanez-Guzman. "Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems." IEEE Signal Processing Magazine 37.4 (2020): 50-61.
[17] Tampuu, Ardi, et al. "A survey of end-to-end driving: Architectures and training methods." IEEE Transactions on Neural Networks and Learning Systems (2020).
[18] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, (2018).
[19] Even-Dar, Eyal, Sham M. Kakade, and Yishay Mansour. "Experts in a Markov decision process." Advances in Neural Information Processing Systems 17 (2005): 401-408.
[20] Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine Learning 8.3-4 (1992): 279-292.
[21] Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in Neural Information Processing Systems. (2000).
[22] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[23] Silver, David, et al. "Deterministic policy gradient algorithms." International Conference on Machine Learning. PMLR, (2014).
[24] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." International Conference on Machine Learning. PMLR, (2018).
[25] McCulloch, Warren S., and Walter Pitts. "A logical calculus of the ideas immanent in nervous activity." The Bulletin of Mathematical Biophysics 5.4 (1943): 115-133.
[26] Gardner, Matt W., and S. R. Dorling. "Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences." Atmospheric Environment 32.14-15 (1998): 2627-2636.
[27] Khodayari, Alireza, et al. "A historical review on lateral and longitudinal control of autonomous vehicle motions." 2010 International Conference on Mechanical and Electrical Technology. IEEE, (2010).
[28] Dixit, Shilp, et al. "Trajectory planning and tracking for autonomous overtaking: State-of-the-art and future prospects." Annual Reviews in Control 45 (2018): 76-86.
[29] Treiber, Martin, Ansgar Hennecke, and Dirk Helbing. "Congested traffic states in empirical observations and microscopic simulations." Physical Review E 62.2 (2000): 1805.
[30] Kesting, Arne, Martin Treiber, and Dirk Helbing. "General lane-changing model MOBIL for car-following models." Transportation Research Record 1999.1 (2007): 86-94.
[31] Hoel, Carl-Johan, Mattias Wahde, and Krister Wolff. "An evolutionary approach to general-purpose automated speed and lane change behavior." 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, (2017).
[32] Kim, K., J. V. Medanić, and D-I. Cho. "Lane assignment problem using a genetic algorithm in the Automated Highway Systems." International Journal of Automotive Technology 9.3, (2008).
[33] Jaafra, Yesmina, et al. "Robust reinforcement learning for autonomous driving." (2019).
[34] Dosovitskiy, Alexey, et al. "CARLA: An open urban driving simulator." Conference on Robot Learning. PMLR, (2017).
[35] Wei, Haoran, et al. "Mixed-autonomy traffic control with proximal policy optimization." 2019 IEEE Vehicular Networking Conference (VNC). IEEE, (2019).
[36] Wang, Pin, and Ching-Yao Chan. "Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge." 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, (2017).
"Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge." 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, (2017). [37] Wang, Pin, and Ching-Yao Chan. "Autonomous ramp merge maneuver based on reinforcement learning with continuous action space." arXiv preprint arXiv:1803.09203 (2018). [38] Nishitani, Ippei, et al. "Deep merging: Vehicle merging controller based on deep reinforcement learning with embedding network." 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, (2020). [39] Wu, Yuankai, et al. "ES-CTC: A Deep Neuroevolution Model for Cooperative Intelligent Freeway Traffic Control." arXiv preprint arXiv:1905.04083 (2019). [40] Hoel, Carl-Johan, Krister Wolff, and Leo Laine. "Automated speed and lane change decision making using deep reinforcement learning." 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, (2018). 74 [41] Liao, Jiangdong, et al. "Decision-making Strategy on Highway for Autonomous Vehicles using Deep Reinforcement Learning." IEEE Access 8, (2020). [42] Ronecker, Max Peter, and Yuan Zhu. "Deep Q-network based decision making for autonomous driving." 2019 3rd International Conference on Robotics and Automation Sciences (ICRAS). IEEE, (2019). [43] Vinitsky, Eugene, et al. "Benchmarks for reinforcement learning in mixed-autonomy traffic." Conference on robot learning. PMLR, (2018). [44] Kheterpal, Nishant, et al. "Flow: Deep reinforcement learning for control in sumo." EPiC Series in Engineering 2 (2018): 134-151. [45] Liang, Eric, et al. "Ray rllib: A composable and scalable reinforcement learning library." arXiv preprint arXiv:1712.09381 (2017): 85. [46] Duan, Yan, et al. "Benchmarking deep reinforcement learning for continuous control." International conference on machine learning. PMLR, (2016). [47] Behrisch, Michael, et al. "SUMO–simulation of urban mobility: an overview." Proceedings of SIMUL 2011, The Third International Conference on Advances in System Simulation. ThinkMind, (2011). [48] Hoel, Carl-Johan, et al. "Combining planning and deep reinforcement learning in tactical decision making for autonomous driving." IEEE transactions on intelligent vehicles 5.2, (2019). [49] Garzón, Mario, and Anne Spalanzani. "Game theoretic decision making for autonomous vehicles’ merge manoeuvre in high traffic scenarios." 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, (2019). [50] Tampuu, Ardi, et al. "A survey of end-to-end driving: Architectures and training methods." IEEE Transactions on Neural Networks and Learning Systems, (2020). 75 [51] Kiran, B. Ravi, et al. "Deep reinforcement learning for autonomous driving: A survey." IEEE Transactions on Intelligent Transportation Systems, (2021). [52] Amini, Alexander, et al. "Learning robust control policies for end-to-end autonomous driving from data-driven simulation." IEEE Robotics and Automation Letters 5.2 (2020): 1143-1150. [53] Wegener, Axel, et al. "TraCI: an interface for coupling road traffic and network simulators." Proceedings of the 11th communications and networking simulation symposium. (2008). [54] Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted boltzmann machines." Icml. (2010). [55] Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement learning." International conference on machine learning. PMLR, (2016). 76