Utilizing Machine Learning to Forecast the Charging Patterns of Electric Vehicles
by
Saeedeh Goodarzvand Chegini
M.Sc., Islamic Azad University, 2019
B.Sc., Islamic Azad University, 2016

PROJECT SUBMITTED IN PARTIAL FULFILMENT OF
THE REQUIREMENT FOR THE DEGREE OF
MASTER OF SCIENCE
IN
COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA
October 2024
© Saeedeh Goodarzvand Chegini, 2024

Abstract
This project involves the application of advanced machine learning techniques to
forecast the charging behaviors of electric vehicles (EVs), addressing the growing demand for
a robust and efficient charging infrastructure as EV adoption accelerates. Utilizing historical
data from NL Hydro’s public EV charging network, this research aims to develop predictive
models that can optimize charging schedules, reduce peak demand on the power grid, and
enhance overall charging efficiency. This study applies a variety of machine learning
algorithms, including Isolation Forest for anomaly detection, Support Vector Regression for
precise regression tasks, Random Forest for robust predictive modeling, XGBoost for
high-efficiency gradient boosting, and ensemble methods such as Stacking Regressor to improve
predictive accuracy by combining multiple models.
These algorithms help analyze key factors such as the starting state of charge (SOC),
energy consumption during charging sessions, and the duration of charging events. The
models are designed to predict charging behavior patterns, providing insights into how EV
users interact with charging infrastructure. The findings reveal that EV users mainly engage
in short, frequent charging sessions, typically beginning when the SOC is at a medium level
and concluding when it reaches a high level. This pattern suggests a strategic approach to
optimizing driving range while reducing concerns about running out of battery.
The project contributes to the advancement of intelligent transportation systems by
offering data-driven insights that can guide policymakers, utility companies, and the car
industry. By optimizing EV charging infrastructure, the study supports the broader goal of
sustainable mobility, facilitating the transition to electric transportation while achieving
long-term environmental and economic benefits.


Contents
Abstract
List of Tables
List of Figures
Acknowledgement and Dedication
Chapter 1: Introduction
    1.1 The Rise of Electric Vehicles
    1.2 Environmental and Economic Implications
    1.3 Challenges in EV Charging Infrastructure
    1.4 Importance of Coordinated Charging
    1.5 Technological Solutions
    1.6 Research and Development
    1.7 Challenges of the Research
    1.8 Research Objective
Chapter 2: Background
    2.1 Forecasting EV Charging Loads for Sustainable Transportation
    2.2 Optimizing EV Charging with Predictive Algorithms for Grid Stability
    2.3 Smart Home Energy Management with Predictive EV Charging
    2.4 Analyzing Rapid Charging Patterns of BEVs for Improved Infrastructure
    2.5 Predicting EV Charging Loads with Enhanced Random Forest Algorithm
    2.6 Predicting EV Arrival and Departure Times Using Support Vector Machines (SVM) for Grid Management
    2.7 Intelligent Charging and Load Management for EV Integration
    2.8 California Blackouts
        2.8.1 California to Ban the Sale of New Gasoline Cars
        2.8.2 Flex Alert:
Chapter 3: Methodology
    3.1 Data Description
        3.1.1 Inputs and Output
    3.2 Purposes and Background
    3.3 Data preprocessing
        3.3.1 Conversion of Time
        3.3.2 Selection of features
    3.4 Exploratory Data Analysis
        3.4.1 Descriptive Statistics
        3.4.2 Data Visualization
        3.4.2.2 Histograms
    3.5 Modeling
        3.5.1 What does Machine Learning mean?
        3.5.2 What does Data Mining mean?
        3.5.3 Isolation Forest for Anomaly Detection
    3.6 Neural Network
    3.7 Random Forest Description
        3.7.1 Basic Principles of Random Forest
        3.7.2 Construction of Random Forest
        3.7.3 Handling of Feature Importance
        3.7.4 Majority Voting Mechanism
    3.8 Support Vector Regression (SVR)
        3.8.1 Fundamental Concept
        3.8.2 Loss Function
        3.8.3 Optimization Problem
        3.8.4 Kernel Trick
        3.8.5 Advantages and Challenges
        3.8.6 Practical Implementation
    3.9 XGBoost
        3.9.1 Dealing with Missing Values
        3.9.2 Advantages of XGBoost
    3.10 Stacking Regressor
        3.10.1 Structure of the model
        3.10.2 Process of Training
        3.10.3 Primary models
        3.10.4 Meta-Regressor
        3.10.5 Benefits
    3.11 Voting Regressor
    3.12 Structure of the model
    3.13 Categories of Averaging
    3.14 Process of Training
    3.15 Benefits
    3.16 Model Evaluation
        3.16.1 Mean Absolute Error (MAE)
        3.16.2 Root Mean Squared Error (RMSE)
        3.16.3 Symmetric Mean Absolute Percentage Error (SMAPE)
        3.16.4 Challenges and Implementation Strategies in Model Development:
Chapter 4: Experiment Design
    4.1 Data Pre-Processing
        4.1.1 Handling Missing Data
        4.1.2 Data Normalization
        4.1.3 Time Data Conversion
        4.1.4 Feature and Target Variable Preparation
    4.2 Energy Consumption
    4.3 Start SOC (State of Charge)
    4.4 End SOC
    4.5 Charging Duration
    4.6 Insights
    4.7 Box Plot
    4.8 Energy Consumption Patterns
    4.9 Starting State of Charge
    4.10 Ending State of Charge
    4.11 Charging Duration
    4.12 Insights
    4.13 Isolation Forest Outlier Detection
        4.13.1 Isolation Forest Analysis of Outlier Detection
        4.13.2 Comprehension of the Plot
        4.13.3 Major Observations
        4.13.4 Insights:
        4.13.5 Conclusion:
    4.14 Analysis of Neural Network Model
    4.15 Analysis of the model's framework:
        4.15.1 Metrics for evaluating performance
        4.15.2 Analysis and explanation of findings
    4.16 Analysis of Random Forest Model
    4.17 Analysis of Support Vector Regression Model
    4.18 Analysis of XGBoost Model
    4.19 Analysis of Voting Regressor Model
        4.19.1 Outcome:
        4.19.2 Example Predictions:
    4.20 Analysis of Stacking Regressor Model
    4.21 Result Comparison:
        4.21.1 Explanation of Performance Metrics:
        4.21.2 Detailed Comparison:
Chapter 5: Conclusion
    5.1 The Journey of Discovery
    5.2 Achievements in Predictive Modeling
    5.3 Real-World Applications and Implications
    5.4 Future Directions and Opportunities
    5.5 Broader Impact and Societal Contributions
    5.6 Final Reflections and the Road Ahead
References

List of Tables
Table 4.1: Real and predicted values by Voting Regressor
Table 4.2: Evaluation Metrics
Table 4.3: Stacking Regressor and Voting Regressor Comparison

List of Figures
Figure 4.1: Energy and Duration Histogram
Figure 4.2: Energy and Duration Boxplots
Figure 4.3: Isolation forest scatter plot

Acknowledgement and Dedication
I would like to express my deepest gratitude to my supervisor, Dr. Fan Jiang, for his
invaluable guidance, support, and encouragement throughout the course of this project. His
expertise and patience have been instrumental in shaping both my research and academic
journey. I am truly fortunate to have had the opportunity to learn from him.
I would also like to extend my heartfelt thanks to my parents for their unwavering love and
support. Their belief in me and constant encouragement have been my greatest source of
strength. I am forever grateful for the sacrifices they have made and for always standing by
my side.


Chapter 1: Introduction
With the increasing popularity of EVs, the transportation sector is undergoing
significant transformation. Reducing greenhouse gas emissions and limiting transportation's
environmental impact depend on this transformation. Electric vehicles have potential as a
climate solution because they produce lower emissions than traditional internal combustion
engine (ICE) cars. Studies suggest that EVs can reduce carbon emissions by as much as 45%
compared to ICE vehicles [1]. However, the rapid adoption of
electric vehicles brings significant challenges, particularly in developing a reliable and
efficient charging infrastructure, which is essential for widespread use [2].
1.1 The Rise of Electric Vehicles
Over the last 10 years, the electric vehicle market has grown notably because
of better battery technology, greater concern for the environment, and government
support. Initially, limited driving range and battery reliability were major obstacles to
electric vehicle adoption, but technological advancements have largely addressed these
issues, making EVs more appealing to a broader range of consumers [3]. This shift has led to
a notable increase in the market share of electric vehicles, allowing them to
challenge the dominance of internal combustion engine (ICE) vehicles [3].
Although progress has been made, important challenges remain for EVs, such as
grid infrastructure. Because EV charging has high power requirements, high electric vehicle
uptake can overload the current distribution networks of power grids. To
prevent grid overload or other system failures, charging stations must be
managed efficiently, which leads to a more stable electricity supply [4].


1.2 Environmental and Economic Implications
The migration from ICE vehicles to electric power is more than a technological
change; it is an environmental issue too. Emissions from
transportation are largely due to road vehicles across the world. Traditional ICE
vehicles contribute to global environmental problems such as air pollution and
rising average world temperatures.
Electric vehicles, by contrast, run on electric power rather than fossil fuels such as oil
or coal, so they avoid the tailpipe CO2 emissions caused by fuel combustion [5].
Additionally, adopting EVs has substantial economic consequences.
Promoting electric mobility can stimulate new businesses and create employment
opportunities in fields such as battery production, renewable energy, and smart grid
technologies. However, these economic benefits will only be realized if reliable
support systems, particularly charging stations, are put in place. The relevant issues include
the availability of charging stations, the speed at which they charge, and how renewable
energy is integrated into electricity grids [5].
1.3 Challenges in EV Charging Infrastructure
Achieving an effective charging infrastructure for electric vehicles remains a
considerable challenge. The current charging network is underdeveloped and has not
kept pace with increasing demand, in some cases resulting in long queues
at public charging stations and an uneven distribution of charging facilities. This is not only
inconvenient for the users of electric vehicles but also poses a potential danger to the
stability of the power distribution system [5].


A key solution to these problems is improving scheduling algorithms for public EV
charging stations. Schedules should be efficient
enough to use all available charging resources while reducing waiting times for EV users
and decreasing the burden on the electricity grid. Uncoordinated
charging patterns, by contrast, can cause demand peaks that affect grid stability and cause
occasional blackouts. A well-designed scheduling system could help achieve a balanced
distribution of charging load over time, supporting stable power generation and delivery [2].
1.4 Importance of Coordinated Charging
Coordinated charging behavior is highly important. When multiple EVs are
charged at the same time, the grid can be overloaded, leading to high operational costs
because the grid must be reinforced; charging sessions should therefore be spread over
different times. This implies that managing charging station schedules goes beyond mere
convenience; it plays a key role in ensuring that the power system remains efficient and
reliable. Advanced scheduling algorithms supported by predictive models of
charging behavior have the potential to greatly enhance the operation of charging networks [6].
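The effect of coordination on peak demand can be illustrated with a small sketch. The scenario below is hypothetical (six EVs, 50 kW chargers, two-hour sessions) and is not taken from the NL Hydro data; the function name `peak_load_kw` is introduced here only for illustration.

```python
def peak_load_kw(start_hours, duration_h, charger_kw, horizon_h=24):
    """Return the highest hourly aggregate load for a set of charging sessions.

    Every session is assumed to draw a constant charger_kw for duration_h hours.
    """
    load = [0.0] * horizon_h
    for start in start_hours:
        for hour in range(start, start + duration_h):
            load[hour % horizon_h] += charger_kw  # wrap around midnight
    return max(load)


# Hypothetical scenario: 6 EVs, each needing 2 hours on a 50 kW charger.
simultaneous = [18] * 6              # every driver plugs in at 18:00
staggered = [18, 20, 22, 0, 2, 4]    # the same sessions spread over the night

peak_uncoordinated = peak_load_kw(simultaneous, duration_h=2, charger_kw=50.0)
peak_coordinated = peak_load_kw(staggered, duration_h=2, charger_kw=50.0)

print(f"Uncoordinated peak: {peak_uncoordinated:.0f} kW")
print(f"Coordinated peak:   {peak_coordinated:.0f} kW")
```

Both schedules deliver the same total energy; only the timing differs, which is why predicted session durations are useful inputs to a scheduler.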
1.5 Technological Solutions
Among the several technical options under research to optimize the management of
EV charging infrastructure, machine learning algorithms are being developed to forecast EV
charging behavior and hence optimize charging schedules. Deep reinforcement learning
(DRL) has been applied, for example, to control multiple EV chargers at the distribution
level, demonstrating the potential of learning and adaptation in highly variable
environments [7]. Furthermore, machine learning methods such as random forests and ensemble
learning have been used to forecast home or community charging needs, improving the
effectiveness of such systems [8].
In addition, intelligent charging systems that consider the different
charging profiles of different electric vehicles help make the most of the available charging
resources. Data-driven models that forecast specific charge profiles without
relying on unknown internal vehicle characteristics can produce more precise schedules while
enhancing overall charging network performance. Incorporating these sophisticated
predictive models into the operation of public EV charging stations may thus
considerably enhance user experience and power system stability [9].
1.6 Research and Development
The main aim of this project is to develop prediction models that closely
predict the charging behavior of electric vehicles, thereby contributing to the
advancement of intelligent transportation systems. By analyzing historical data obtained from
NL Hydro-owned charging stations and applying advanced machine learning
methods, the project focuses on achieving optimized schedules for public EV
charging networks. Primarily, we aim to predict the charging time of a single session,
estimate the energy consumed during that session, and improve the algorithms that support
efficient use and user-friendliness of the charging infrastructure.
A comprehensive methodology that includes data preprocessing, feature engineering,
model selection, and evaluation is employed in this project. The aim is to provide practical
insights, based on historical data and current machine learning technologies, for
improving EV charging management; the results are generalized so that they can support
increased use of these vehicles across the globe. Ultimately, through its
various research objectives, the project should help drive the shift towards
environmentally friendly and affordable means of transportation, thereby contributing to
global efforts against climate change.
The electric vehicle market has experienced rapid growth in recent years, which
provides many opportunities as well as several challenges. On the one hand, electric
vehicles offer hope for reducing carbon emissions through reduced tailpipe pollution;
on the other hand, a reliable fast-charging network is greatly needed for their wide acceptance.
By solving the coordination problem among chargers, optimizing their location (e.g.,
shared parking lots with offices) and the capacity utilization of public stations (to avoid long
queues), and providing integrated technical solutions or regulatory support, it is possible to
make EVs competitive with internal combustion engine cars and
promote the switch to sustainable transport modes. This research aspires to support
the development of intelligent transportation systems and broader sustainability by
designing models that predict electric vehicle charging accurately.
1.7 Challenges of the Research
At various points during this project, I came across a variety of challenges
that tested my technical skills and my ability to manage complexity. The first was working
with numerous machine learning algorithms, which was hard because each required
different tuning and optimization procedures. Getting these models to perform well across
diverse data was not a small task; striking this balance took a lot of trial and error.
Another big issue was making sure the models are ready for the
future. This means I had to make sure the models I built would still work regardless of the
changes in battery technology and user behavior that may occur. This was difficult
since I had to consider how predictions would be affected by developments in energy storage
technology and customer behavior.

Managing time was always a problem area. Data preprocessing alone, which involves
cleaning, organizing, and preparing raw data for analysis, takes a lot of time. In addition,
the different algorithms had to be implemented and tested within limited timelines,
which required strong organization and a high level
of discipline.
Interpreting the results from the models and integrating them into a coherent framework
that could be used by stakeholders such as policymakers, utility companies, and industries
where electricity is widely consumed was another complex task. The project called for
results that were not only accurate but also actionable and easy to comprehend, which added
a layer of complexity to the study.
Managing a large amount of data was another challenge. Advanced machine learning
models require significant computational power when applied to large datasets such as the
one in this study, and ensuring that processing is efficient and effective remained an
ongoing challenge.


It also wasn't easy staying current with the latest developments in machine learning
and EV technology. The field is always moving forward, so I kept myself updated with
new techniques and trends throughout my study.
Nevertheless, these challenges were important for my growth as a professional who is both
technically skilled and well grounded. They also taught me
how important it is to be flexible, persistent, and precise while still considering
practicality.
1.8 Research Objective
The primary objective of this research is to develop and implement advanced machine
learning models to predict the charging behaviors of EVs based on historical data. By using
data from NL Hydro’s public EV charging network, this study aims to create predictive
models that optimize charging schedules, reduce peak demand on the power grid, and
enhance the overall efficiency of EV charging infrastructure. The research focuses on
addressing the variability in charging patterns across different users and locations, ensuring
that the models developed are both accurate and adaptable to future changes in EV
technology and user behavior. Ultimately, the goal is to provide actionable insights that can
support the development of a sustainable and efficient EV charging network, benefiting
policymakers, utility companies, and stakeholders in the automotive industry.


Chapter 2: Background
2.1 Forecasting EV Charging Loads for Sustainable Transportation
Electric vehicle and driving towards sustainability: Comparison between EV, HEV,
PHEV, and ICE vehicles to achieve net zero emissions by 2050 from EV [10].
The rise of electric vehicles in modern transport systems has created a need for new
methods for forecasting their power consumption in general, and their charging load in
particular. In this research, the random forest algorithm is used to develop an EV
charging load forecasting model. The growing number of EVs has increased charging
demand, making it necessary to investigate how this increase may
overload existing power grids, especially during peak periods. An accurate and reliable
model that can forecast the charging load of individual stations is essential for load
balancing and capacity planning. In this regard, the study applies the Classification and
Regression Tree (CART) algorithm to make short-term predictions for single stations [1].
The authors also designed an algorithm that predicts daily charge capacity for stations of
different sizes or locations. The approach uses both regression and classification models
over a large historical set of charging data, from which it learns effectively since it can
perform tasks other than classification alone. The data was partitioned at a level of detail
that allows robust models to be built and variations in charging current to be predicted [10].
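To make the CART idea concrete, the following minimal sketch shows the core step of a regression tree: choosing the split threshold that minimizes the total squared error of the two resulting groups. A random forest averages many such trees, each grown on a bootstrap sample. The hourly load values and the helper names (`sse`, `best_split`) are hypothetical and are not taken from [1] or [10].

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing total squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


# Hypothetical station data: hour of day versus charging load in kW.
hour = [1, 3, 5, 9, 13, 17, 19, 21]
load = [10, 12, 11, 30, 32, 31, 29, 33]

threshold, night_mean, day_mean = best_split(hour, load)
print(f"Split at hour {threshold}: {night_mean} kW before, {day_mean} kW after")
```

A full tree would apply the same search recursively to each side of the split and across several features, such as day of the week or station location.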
An analysis of charging station data for Shenzhen was carried out with an emphasis
on time and place. The temporal analysis concerns the patterns of energy
consumption over the year and within each day; it indicates
that demand increases greatly from winter to summer, while holidays show
further differences. The spatial analysis describes the charging loads of different
stations, taking into account factors such as where each station is located and what
features surround it [10].
The suggested Random Forest model combines several models (decision trees) to
improve prediction accuracy. It considers factors that drive variation in electric power
demand, such as time of day and day of the week, as well as station-specific
information [10].
[11] is primarily concerned with designing data-driven strategies for smartly charging
diverse electric vehicle fleets. It notes that EVs charge with nonlinear profiles, such as
the constant-current, constant-voltage (CCCV) mode, which leads to less efficient use of
the charging infrastructure if not well managed [11]. Consequently, smart charging seeks to
optimize the use of such infrastructure by creating effective charging schedules for each EV
based on charging patterns [11].
In [11] the authors suggest using machine learning to forecast power consumption
during charging of EVs. The study relies on a dataset obtained from 2016 to 2018 consisting
of 10,595 charging events from 1,001 EVs of 18 different models. The preprocessed dataset
contains 1.2 million data points. Several machine learning models such as linear regression,
neural networks and XGBoost were trained in order to predict charging profiles. XGBoost
demonstrated the highest performance, achieving an MAE of 126 W and a
relative MAE of 0.06 [11].
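As a minimal illustration of the two error metrics quoted above, the following Python
sketch computes the mean absolute error (MAE) and a relative MAE. The exact definition of
relative MAE used in [11] is not given here, so dividing the MAE by the mean absolute
measured power is an assumption, and the sample values are invented for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between measured and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_mae(actual, predicted):
    """MAE divided by the mean absolute measured value (assumed normalization)."""
    mean_actual = sum(abs(a) for a in actual) / len(actual)
    return mean_absolute_error(actual, predicted) / mean_actual


# Hypothetical charging power samples in watts (not data from [11]).
measured_w = [3600.0, 3500.0, 2000.0, 900.0]
predicted_w = [3500.0, 3600.0, 2200.0, 800.0]

mae = mean_absolute_error(measured_w, predicted_w)
rel = relative_mae(measured_w, predicted_w)
print(f"MAE = {mae:.1f} W, relative MAE = {rel:.3f}")
```

A relative form of the metric makes errors comparable across vehicles that charge at very
different power levels, which is presumably why both values are reported.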
According to simulations, integrating the XGBoost model into a smart charging
algorithm makes the charging infrastructure more efficient, by up to 21% in terms of
energy charged. The study illustrates that developers must pay attention to the actual
behavior of batteries during charging when designing smart charging algorithms. When they
do, smart charging becomes more efficient through data-driven predictions, and charging
resources are distributed fairly. Workplace charging, where several EVs have to be charged
at once, stands to benefit from this approach. Incorporating machine learning models
improves the overall performance of smart charging algorithms, leading to better battery
utilization and higher SOC for EVs [11].
The study examines real-world data from public AC charging stations in the
Netherlands to analyze how different types of EVs charge. As the research emphasizes, the
wide adoption of EVs could strain the power system, particularly during peak hours. It also
highlights the importance of delayed charging, which optimizes the system and relieves the
pressure that electricity usage puts on the grid. Unlike many other studies, which treat EV
charging load as a constant, this research observes actual behavior during each session,
with the aim of making sessions more convenient and improving the efficiency of smart
charging algorithms [11].
2.2 Optimizing EV Charging with Predictive Algorithms for Grid Stability
Numerous academic researchers have started studying how to ensure the smooth
integration of battery electric vehicle charging with the grid following the wide adoption
of these vehicles. Unlike conventional vehicles, EVs offer an environmentally friendly means
of transport and hence help decrease the rate at which greenhouse gases are emitted into
the atmosphere. However, the rise in EV usage escalates electricity needs, which challenges
the stability and efficiency of the power grid. Effective management of EV
charging is essential to reduce these issues, ensuring that the benefits of EV adoption are
maximized while minimizing negative impacts on the power system [6].
User behavior prediction is essential in EV charging management. This study
employs various machine learning algorithms to forecast EV users' stay duration and energy
consumption based on historical charging data. The algorithms discussed in this report are
Multiple Linear Regression, Support Vector Regression, Decision Tree (DT) Regression,
Random Forest (RF) Regression, and K-Nearest Neighbor (KNN) Regression. Each of these
algorithms has strengths and weaknesses that depend on the specific nature of the data and
user behavior patterns [6].
Incorporating accurate behavioral predictions into EV charging scheduling considerably
improves the efficiency of power distribution. Predicting when and for how long EVs will
be charged helps grid operators allocate resources more effectively, balance load, and
prevent peaks that may destabilize the power grid. This approach also reduces the waiting
time experienced by EV users and makes the charging process more user friendly. Once the
desired goal has been defined, several steps must be taken before an optimal predictive
model can be developed. First, the data is cleaned and discrepancies are eliminated; this
also involves aligning timestamps with weather data from an external source. Additional
attributes are then engineered to capture factors essential to charging patterns, for
example time-related characteristics of individual charging sessions and location
information about the charging environment. Preliminary results on the applicability of
the proposed algorithm to EV charging prediction are then examined. The study evaluates
the performance of various machine learning techniques to determine which are most
effective for different types of charging patterns, using several algorithms to improve
the accuracy and consistency of the predictions [6].
The results also show that the proposed algorithm can predict EV users’ behavior
accurately, leading to better scheduling and distribution of resources. Nevertheless, these
predictions may vary with the type of data and the algorithms employed. The study also
highlights the importance of continuous data collection and model refinement to adapt to
changing patterns in EV usage. It concludes that accurate prediction of EV user behavior is
needed if charging schedules are to be optimized and the load on the power grid managed.
Integrating machine learning models makes these estimates more robust and enhances the
efficiency and reliability of EV charging infrastructure. The study thus provides insights
into how grid operators and policymakers can facilitate the transition towards sustainable
and efficient transportation systems [6].
Researchers are studying electric vehicle charging behavior to optimize EV charging
schedules using machine learning algorithms [12]. They studied actual charging data for 252
users and discovered patterns in charging behavior such as stay duration and power
consumption. The prediction accuracy for these behaviors is influenced by the entropy and
sparsity of the data, which motivates a ratio (R) that combines entropy and sparsity and
serves as a primary indicator for algorithm selection [12].
The three main algorithms discussed in [12] are:
Support Vector Regression: It has high precision in predicting stay duration when the
entropy/sparsity ratio (R) is low [12].
Random Forest Regression: It is more accurate for predicting energy consumption under low
R conditions [12].
Diffusion-based Kernel Density Estimator (DKDE): It performs best among all methods at
high values of R for both stay duration and energy consumption prediction [12].
The authors of this study developed the Ensemble Predicting Algorithm (EPA) by
integrating the above algorithms, which greatly enhances forecasting consistency. In this
study, EPA demonstrated a reduction in prediction errors of 11% for stay duration and 22%
for energy consumption [12].
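The selection behavior summarized above can be sketched as a simple dispatch rule. This is
only an illustration of the reported regimes: the threshold that separates "low" from
"high" R is not stated in this review, so `r_threshold` is a hypothetical parameter, and
the way the EPA actually combines the algorithms in [12] may differ.

```python
def select_algorithm(target, r_value, r_threshold):
    """Return the algorithm reported as most accurate for the given regime.

    target: "stay_duration" or "energy_consumption".
    r_value: entropy/sparsity ratio R of a user's charging history.
    r_threshold: hypothetical cut-off between "low" and "high" R.
    """
    if target not in ("stay_duration", "energy_consumption"):
        raise ValueError(f"unknown target: {target!r}")
    if r_value >= r_threshold:
        return "DKDE"  # reported best for both targets at high R
    if target == "stay_duration":
        return "SVR"   # reported best for stay duration at low R
    return "RF"        # reported best for energy consumption at low R


# Illustrative values only; neither the R values nor the threshold come from [12].
for target in ("stay_duration", "energy_consumption"):
    for r in (0.2, 1.5):
        print(target, r, select_algorithm(target, r, r_threshold=1.0))
```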
To apply EPA at any scale of charging station, it is assumed that records of electric
vehicles are kept. For optimal scheduling across a distribution grid, integration with Open
Charge Point Protocol (OCPP) and the use of real-time data for better predictions is proposed
[12].
Additionally, the study highlights the practical implications of this work for both
power suppliers and EV users, suggesting that better load predictions can lead to more
efficient energy management and cost savings. The results validate the effectiveness of the
EPA in managing EV charging loads and optimizing the overall performance of the charging
infrastructure [12].
2.3 Smart Home Energy Management with Predictive EV Charging
The opportunity that the global rise in electric vehicle usage offers for developing
micro-grids and smart communities within the energy internet is presented in [13]. Those
micro-grids and smart communities can serve locally concentrated energy demands. Electric
vehicle charging management through smart devices is a promising solution, but it depends
heavily on predicting the actual charging demand precisely [13]. Despite its significance,
household-based EV charging demand prediction has not been extensively investigated. Most
current research concentrates on charging station solutions, and general models ignore
individual charging behavior [13].
[13] fills the gap by using various well-known machine learning algorithms to
forecast whether household EV charging will occur on the next day and, if so, when. These
algorithms include Random Forest, Gradient Boosting, Adaptive Boosting, Naive Bayes,
K-Nearest Neighbors, and Artificial Neural Networks. The authors developed a two-layer
hybrid stacking ensemble method that combines different types of algorithms to achieve
better prediction results. Furthermore, this work suggests that accurate forecasting of
household vehicle-to-grid (V2G) reception is an essential component of efficient Home
Energy Management Systems (HEMS) design, because a HEMS should be able to schedule energy
use efficiently. Because the individual algorithms are complementary, combining them in an
ensemble model outperforms using them independently. To validate the model, empirical data
were collected from a house with a typical private charger attached; the results revealed
that the model could predict both the time of charging events and the occurrence of
no-charge events with increased accuracy. EV charging prediction should be implemented as
part of larger smart home solutions so that it improves the flexibility and efficiency of
household energy management. [13] proposes a method in which ensemble learning is used to
better forecast EV charging demand, which is vital for creating intelligent EV charging
systems that support future energy internet infrastructures [13].
2.4 Analyzing Rapid Charging Patterns of BEVs for Improved Infrastructure
[14] investigates the rapid charging patterns of privately owned battery electric vehicles
(BEVs) and analyzes the elements that affect these patterns. Drivers frequently
encounter limited cruising ranges and extended charging periods as a result of the
restricted performance of BEV batteries. Furthermore, the uneven distribution of charging
infrastructure leads to local congestion and waiting at charging stations, further
complicating the charging procedure. To tackle these obstacles, the research examines
the patterns of rapid charging using data from 130 privately owned BEVs in Beijing,
gathered over a period of seven months. The dataset consists of 15,752 trajectories, of
which 2,161 involve rapid charging. Several crucial factors that affect fast charging
behavior are discussed in [14]: the SOC at which charging is initiated, departure time,
trip duration, distance travelled, speed, weather conditions (for example rain,
snowstorms, or fog, which reduce visibility and negatively affect behavior on the road),
and past charging records. According to the study, the lower the SOC at the beginning of
the trip, the more likely fast charging becomes. In addition, longer trips and higher
speeds require fast charging more frequently. Weather conditions significantly affect fast
charging behavior. Both high and low temperatures increase the demand for fast charging,
as additional energy is used for heating or cooling the vehicle. Wind power is another
factor, with higher wind speeds leading to increased energy consumption due to higher
resistance, thereby necessitating more frequent fast charging [14].
There are findings on day-of-week effects too: charging behavior barely changes
between weekdays and weekends, which suggests similar travel habits and charging
demands throughout the week. A binary logistic regression model is employed to forecast
whether fast charging will occur based on these factors. The model includes parameters for
start SOC, time of origin, travel duration, driving distance, driving speed, wind power,
temperature, and the last fast-charging event. The regression results reveal that lower
starting SOC as well as lower speed contribute negatively towards the increased likelihood
of fast charging, while travel duration, distance, and extreme temperatures increase the
chances of fast charging [14].
The model’s predictive accuracy is validated using 30% of the data, with a
prediction rate of 89.36%. This high prediction rate shows how effective the model can be in
projecting outcomes because the dependent variable can assume only two values.
Furthermore, the study compares the logistic regression model with other models, such as
univariate linear regression (ULR) and multivariate linear regression, using receiver
operating characteristic (ROC) curves and the area under the curve (AUC). The logistic
regression model outperforms the other models, indicating its superior predictive
performance [14].
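To make the evaluation idea concrete, the sketch below scores a few trips with a logistic
model and computes the AUC as the fraction of positive/negative pairs that the model ranks
correctly. The coefficients, the two features, and the trips are hypothetical values chosen
for illustration; they are not the fitted parameters or data from [14].

```python
import math


def predict_probability(features, weights, intercept):
    """Logistic model: estimated probability that a trip involves fast charging."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical coefficients for [start SOC as a fraction, travel duration in hours].
weights = [-4.0, 1.5]
intercept = 0.5

# Hypothetical trips and whether fast charging occurred (1) or not (0).
trips = [[0.9, 0.5], [0.2, 2.0], [0.6, 1.0], [0.5, 0.4]]
labels = [0, 1, 0, 1]

scores = [predict_probability(t, weights, intercept) for t in trips]
print(f"AUC = {roc_auc(labels, scores):.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it
is a convenient single number for comparing models on a binary outcome.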
Finally, if the major determinants of fast charging are known, the utilization of fast
charging stations can be improved, and this information can be useful for BEV users
seeking to optimize their charging decisions. Proper positioning and operation of fast
charging stations can reduce queuing and idle times, thus enhancing user experience and
helping spread the adoption of BEVs. Policymakers, researchers, and industry players are
likely to benefit from the outcomes of this study, which can guide innovations aimed at
optimizing BEV infrastructure [14].
The data used in [15] comes from the Municipality of Amsterdam and the energy
providers EVNET and NUON. The dataset is composed of detailed charging session records
and meter values collected every 15 minutes, making it a comprehensive database
for analysis. Several primary variables alter charging profiles. They include
environmental factors such as temperature and peak times, charge point characteristics
such as whether both sockets are in use, and EV-specific parameters such as battery
degradation and voltage levels [15].
The main results are:
1. Environmental Effects: Contrary to expectations, charging is quicker at peak hours
(17:00–21:00), implying that the power grid is robust and has a large capacity
[15]. However, daytime charging takes longer due to high power loss and voltage
fluctuations. Temperature is positively related to charging rate, meaning that
warmer conditions promote faster charging, particularly among 230V EVs [15].
2. Charge Point Characteristics: Charging rate is influenced by the presence of another EV
at the same charging point. For example, a 230V EV will be charged faster if there is a
400V EV at the other socket. However, if both sockets are plugged in, the speed of charging
decreases because of increased power loss and voltage drop [15].
3. EV Characteristics: Charging speed is significantly affected by the voltage system:
it is roughly three times faster for 400V EVs than for 230V EVs. Battery degradation also
reduces charging speed over time, with a clear association observed after
multiple charging cycles [15].
[15] employs multiple linear regression models to determine how these variables
affect charging rate. The model indicates that dynamic charging profiles, rather than
static loads, should be considered in order to optimize EV charging infrastructure
efficiently. This understanding distinguishes the various factors that affect EV charging
behavior and therefore contributes to the development of smarter charging systems, which
will reduce stress on the power grid while improving the overall user experience. These
results could also benefit policymakers and power system experts who are interested in
supporting environmentally friendly vehicles [15].
2.5 Predicting EV Charging Loads with Enhanced Random Forest Algorithm
An improved Random Forest algorithm for predicting EV charging load is explored in
[16]. As the adoption of EVs grows, the power system faces the difficult task of balancing
loads, planning capacity, and maintaining power quality. Forecasting EV charging load
accurately is therefore critical to efficient management and future planning [16].
Traditional methods based on user behavior are limited by the randomness inherent in EV
charging loads. This paper seeks to enhance the accuracy and reliability of EV charging
load predictions using machine learning, specifically the Random Forest (RF) algorithm. RF
was chosen for its high efficiency and its multi-tree structure, which reduces correlation
effects in the dataset. The study employs data from various charging stations in Shenzhen,
combining single-station predictions with station-group predictions [16]. The data
contains charging records from 2016 to 2018, thus covering both temporal and spatial
distribution characteristics [16].
The methodology includes:
Temporal and Spatial Analysis: The temporal distribution of charging is analyzed,
showing that the load is higher in summer than in winter and varies during
holidays. The spatial analysis shows that economically developed areas, including the
Nanshan, Futian, Longgang, and Baoan districts, have high charging loads [16].
RF Algorithm Application: The RF algorithm is applied to predict the short-term charging
load for both single stations and groups of stations. Characteristic data such as
date, time, location, and previous charging amounts act as inputs. The algorithm is assessed
using metrics such as Mean Absolute Percentage Error (MAPE) and Root Mean Square Error
(RMSE) [16].
Model Training and Validation: The RF model is trained on 90% of the data and tested on
the remaining 10%. The model is compared against Support Vector
Regression and decision tree algorithms. The RF model demonstrates superior accuracy
for predicting charging load at both individual stations and station groups [16].
Feature Importance: Among the predictors of charging load, the most crucial are the
previous day’s charge, an activity indicator, and time indicators. This information helps in
tuning the model and thereby improving its accuracy when making predictions [16].
Results have shown that the RF algorithm is highly accurate in predicting charging
load. For single stations, the RF model has an average MAPE of 9.76% and RMSE of 2.27
while for station groups it performs well with an MAPE of 10.83% and RMSE of 39.59. The
study concludes by noting that the RF-based prediction method is practical and reliable as it
can help manage EV charging infrastructure more effectively, which gives power suppliers
and policy makers some valuable insights [16].
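The two metrics reported above and the 90/10 split can be sketched in a few lines of
Python. The load values are invented for illustration, and splitting the records in
chronological order is an assumption, since the review does not state how [16] partitioned
the data.

```python
import math


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be nonzero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(actual, predicted):
    """Root Mean Square Error, in the same unit as the load itself."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def split_train_test(records, train_percent=90):
    """Keep the first train_percent of records for training, the rest for testing."""
    cut = len(records) * train_percent // 100
    return records[:cut], records[cut:]


# Hypothetical daily charging loads (kWh) and model predictions for a test period.
actual_load = [100.0, 200.0, 400.0, 500.0]
predicted_load = [110.0, 180.0, 400.0, 450.0]

print(f"MAPE = {mape(actual_load, predicted_load):.2f}%")
print(f"RMSE = {rmse(actual_load, predicted_load):.2f} kWh")

train, test = split_train_test(list(range(20)))
print(len(train), len(test))
```

MAPE is scale-free, while RMSE grows with the size of the load, which is consistent with
the much larger RMSE reported for station groups than for single stations.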
2.6 Predicting EV Arrival and Departure Times Using Support Vector Machines (SVM) for
Grid Management
The main focus of [17] is on predicting the times at which electric cars enter and leave
the University of California, San Diego (UCSD) campus using Support Vector
Machines. Accurate electric load forecasting is essential for effective grid management and
power distribution, especially with the increasing number of electric vehicles,
particularly in California. To better allocate electric power on the grid, the study aims
to predict the arrival and departure times of EVs using historical data collected between
2012 and 2014 [17].
The study uses the support vector machine (SVM), a machine learning algorithm
recognized for its performance in classification and regression tasks, with EV data
showing connection and disconnection times as well as cumulative energy consumed. During
preprocessing, overnight parking events and very short or negligible charging sessions
were eliminated, and only sessions between 6:00 am and 9:00 pm were considered, in order to
concentrate on the day-to-day travel patterns of EV owners [17].
The data was segmented by week, and arrival and departure times were categorized by
hour within this range. Each year (2012, 2013, …) has its own training, validation, and
testing sets. Specifically, 45 weeks of data are used for training, whereas 5 weeks
each are set aside as testing and validation sets. The features of the SVM
model include week number, day of the week, hour, arrival time, previous arrival time,
departure time, and previous departure time, which allow the model to learn the trends
in when a car can be expected in the area [17].
To assess how well the model could make predictions, the study used Mean Absolute
Percentage Error (MAPE), Root Mean Square Error (RMSE), and Mean Bias Error (MBE). These
metrics indicate how accurate the forecast of EV arrival and departure times within the
specified area is likely to be. The distribution patterns changed significantly over the
years because many more electric vehicles were entering and leaving the UCSD campus between
2012 and 2014. The SVM model is trained on data collected over 50 weeks
and tested on 5 weeks, while models were also run for separate weeks representing
various seasons in order to obtain broad data coverage [17].
Results from the study show that the SVM model has relatively high prediction
accuracy with low error rates. The research compares the performance of the SVM model
against a reference forecast based on persistence where the latter uses data from the previous
week to forecast the following week. In comparison with others, the SVM model exhibits
fewer errors, which means it is more precise and reliable. The findings also suggest that
increasing the size of the training dataset leads to better forecasts given the declining values
of MAPE and RMSE as seen over time [17].
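The persistence reference forecast described above is simple enough to sketch directly:
the values observed in the previous week are reused as the forecast for the following
week. The arrival counts below are hypothetical, and the sign convention for the Mean Bias
Error (predicted minus actual) is an assumption, since conventions differ between sources.

```python
def persistence_forecast(previous_week):
    """Persistence baseline: next week's values are assumed equal to last week's."""
    return list(previous_week)


def mean_bias_error(actual, predicted):
    """Average signed error (predicted minus actual, an assumed sign convention)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical EV arrival counts for the same three hours in two consecutive weeks.
last_week = [4, 10, 6]
this_week = [5, 12, 4]

baseline = persistence_forecast(last_week)
mbe = mean_bias_error(this_week, baseline)
print(f"persistence forecast = {baseline}, MBE = {mbe:.3f}")
```

A learned model is only worth deploying if its errors are lower than those of such a
baseline, which is the comparison reported for the SVM model.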
Accurate short-term load prediction for EVs is thus vital to achieving
optimal power distribution within the grid. Predicting arrival and departure times with
the SVM model also makes it possible to manage the risk of grid instability due to
overloading by EVs more efficiently. The method applied here could be used in future for
any university campus or city where many electric vehicles draw a large amount of power
[17].
2.7 Intelligent Charging and Load Management for EV Integration
[18] discusses how the integration of electric vehicles affects power systems, as well as
the role played by intelligent charging schemes. The paper recognizes the environmental
advantages of EVs; however, EVs put extra load on electrical grids,
especially during peak hours. The proposed solution is the use of delayed charging
options, which optimize the charging system and relieve pressure on power systems [18].
Using actual data from charging sessions performed in the Netherlands for
individual EVs, this study examines the factors that influence charging profiles. It argues
that mainstream approaches view the charging profile as a static load, neglecting dynamic
behavior during the charging process. To address this issue, the study examines
how different external factors affect the progression from an empty to a fully charged
battery. This includes assessing the influence of other parameters on charging time as well
as on the overall charging profile [18].
The methodology requires the collection and analysis of charging information containing
variables such as the time of day when charging started, the duration of each session, and
the initial State of Charge. The main objective of this examination is to improve the
timing of EV charging, enabling benefits such as lower consumer electricity bills and
peak demand shaving by shifting charging to the night-time off-peak period [18].
The major findings indicate that smart charging management is necessary,
since it helps reduce the negative impacts of EVs on the power system.
As a result, the study suggests that more effective electric vehicle
charging schemes can be devised, which are likely to improve charging infrastructure
efficiency while keeping power grids stable. The research ultimately aims to provide viable
methods for managing the increase in electrical load caused by the growth in EVs [18].
[19] focuses on the effect of the widespread adoption of electric vehicles on power systems
and shows the relevance of intelligent load management for EV charging. A growing number of
EVs benefits the environment, but EVs overload the power grid during peak
hours. Therefore, there is a need for delayed charging strategies that make the most of the
charging systems and relieve pressure on grids [19].
To investigate this, the researchers used real-world data from public AC charging points
in the Netherlands [19]. The study aimed to show how charging
proceeds from an empty battery to a full one and how an individual EV’s charging
profile differs over time based on various external factors. This allows for
optimization of EV smart charging schemes [19].
The data included detailed session-level records, with start and end timestamps
and electricity supplied, as well as meter readings taken every quarter of an hour
during each session [19]. The time of day, particularly in daytime hours, was an
important factor, along with the type of charge point; EV-specific variables
such as battery degradation and voltage levels were also considered
significant [19].
According to the research results:
Environmental Effects: At peak hours (17:00–21:00), charging speed increases, which
indicates a strong power grid with high capacity,
whereas during daytime hours charging rates tend to be slower due to increased power losses
and deviations in voltage levels. Also, during hot seasons, charging is faster, especially
for 230V EVs [19].
Charge Point Characteristics: Charging speed is affected by the presence of another
electric vehicle at the same charging point. For instance, if a 400V EV is plugged into one
socket and a 230V EV into the other, the latter charges faster than when there are no other
cars at the point. However, simultaneous charging on both sockets reduces speeds because
more power is lost and there are voltage drops [19].
A multiple linear regression model was employed in the study to determine the effects of
these variables on charging rate. The model indicated that, rather than static loads,
dynamic charging profiles should be considered in order to optimize EV charging
infrastructure effectively. The authors noted that this revealed how different factors
control EVs’ charging behavior, thereby contributing to the creation of better smart
charging systems for electric vehicles so that power system constraints are reduced [19].
In summary, the research affirms that intelligent use of energy in EV charging is
necessary to protect power systems. By understanding when and how much
energy a given electric car requires, strategies can be developed that improve
charging infrastructure efficiency while ensuring that electricity grids remain stable
amid growing loads from such cars. These findings are therefore essential for supporting
sustainable growth in electric car uptake and offering workable solutions to power system
load management challenges [19]. [20]
2.8 California Blackouts
During the West-wide excessive heat wave on August 14 and 15, 2020, the California
Independent System Operator Corporation (CAISO) had to implement rotating electrical
outages in California. After the emergency occurrences, Governor Gavin Newsom asked the
CAISO, CPUC, and CEC to investigate and report on the underlying reasons for the August
outages, once they had taken steps to prevent any future outages. The Final Root Cause
Analysis (Final Analysis) includes supplementary data analyses that were previously
unavailable during the publication of the Preliminary Analysis. However, it does not
significantly alter previous findings and affirms that the three primary causal factors behind
the August outages were extreme weather conditions, resource adequacy and planning
processes, and market practices [20].
To summarize, the factors were as follows:
1. The significant heat wave caused by climate change in the western United States
led to a higher demand for power than what was available and planned for.
2. The resource planning objectives have not kept up with the need for dependable,
clean, and inexpensive resources that can match the demand during the early evening hours.
This caused the difficulty of matching demand and supply within the intense heat wave.
3. Certain behaviors in the day-ahead energy market worsened the supply difficulties
during extremely strained circumstances [20].
2.8.1 California to Ban the Sale of New Gasoline Cars
California authorities have approved a comprehensive plan to impose restrictions and
eventually prohibit the sale of automobiles fueled by gasoline, according to state officials.
The governor of California has characterized this decision as the initial step towards phasing
out the internal combustion engine [21].
The California Air Resources Board has approved a regulation mandating that all new
automobiles sold in the state by 2035 must be devoid of greenhouse gas emissions, such as
carbon dioxide. The regulation also establishes intermediate benchmarks, mandating that by
2026, 35 percent of newly marketed passenger vehicles must be capable of emitting zero
pollutants. The requirement increases to 68 percent by 2030 [21].
According to state officials, the new regulation in California would reduce greenhouse gas
emissions from passenger vehicles by almost 50 percent in 2040 compared to the projected
levels without the program. Liane Randolph, head of the California Air Resources Board,
stated that this would result in the elimination of 395 million metric tons of greenhouse gas
emissions, which is comparable to burning 915 million barrels of oil [21].
2.8.2 Flex Alert:
A Flex Alert is a request for customers to voluntarily reduce their electricity usage when
there is an expected shortfall of energy supply, particularly if the California Independent
System Operator (ISO) has to tap into reserves to ensure the stability of the power system.
Californians may avert more severe emergency measures, such as rotating power outages, by
reducing electricity usage during a Flex Alert [22].
The following actions should be taken in response to a Flex Alert:
Minimize electricity consumption during the period from 4 p.m. to 9 p.m.:
This period corresponds to the peak of energy consumption, during which the availability of
renewable energy sources such as solar power is relatively low [22].
Adjust the thermostats to a temperature of 78 degrees Fahrenheit or above:
Raising the thermostat temperature decreases the burden on air conditioning systems, which
are a substantial contributor to energy usage during periods of extreme heat [22].
Refrain from utilizing large household appliances:
Delay the utilization of household appliances such as dishwashers, washing machines, and
dryers until after 9 p.m. in order to reduce the strain on the electricity grid during periods of
high demand [22].
Minimize the use of superfluous lighting:
Diminishing illumination aids in reducing the total consumption of power, particularly during
periods of high demand [22].
Restrict the charging of electric vehicles during periods of high demand:
Electric vehicle owners are requested to charge their vehicles outside the time frame of 4 p.m.
to 9 p.m. in order to prevent additional strain on the power system during these crucial hours
[22].

Chapter 3: Methodology
The objective of this work is to comprehensively analyze and predict the charging
behaviors of electric vehicles using modern machine learning techniques. The data
preparation converts charging times into durations and normalizes the dataset. The modeling
uses Isolation Forest for anomaly detection, together with Neural Networks, Support Vector
Regression, Random Forest, and Stacking Regressor ensemble methods. Analyzing key
factors such as the starting state of charge (SOC), the energy consumed during a charging
session, and the charging duration reveals differences in charging behavior. This enables
accurate forecasts of charging time and an evaluation of the performance of the various
predictive models.
This study is significant because it improves the understanding of how individuals charge
their electric vehicles, which is essential for optimizing the use of electric vehicle
infrastructure. Identifying patterns and anomalies in the charging data can reveal more
efficient ways of using charging stations, resulting in shorter waiting times, less congestion,
and fewer dissatisfied customers. Furthermore, accurate time forecasts could improve the
scheduling of resource distribution in EV networks, boosting their overall sustainability.
Finally, this project assesses several machine learning algorithms that are practical in
real-world applications by examining their strengths and drawbacks in a human-centered
manner. The results are anticipated to provide valuable guidance to policymakers and other
stakeholders in the automotive sector for the creation of data-driven plans to expand the EV
infrastructure. Ultimately, this will enhance the acceptance of electric vehicles by effectively
tackling significant issues related to consumer satisfaction and the overall charging
infrastructure.

3.1 Data Description

The dataset used in this project was provided by NL Hydro and contains detailed records
from the public electric vehicle charging network for the entire year of 2022. The dataset
includes charging session information from various public charging stations, all equipped
with both 62.5 kW direct current fast chargers (DCFC) and 7 kW Level 2 chargers. These
charging stations support both CCS and CHAdeMO connector types, ensuring compatibility
with a broad range of EV models. The chargers are integrated into the ChargePoint network,
allowing for precise station location tracking via ChargePoint’s driver map and open-source
platforms like PlugShare. The data offers a valuable snapshot of charging behavior across
multiple stations, capturing patterns over time, across locations, and across different user
types.
Dataset Overview
Source: NL Hydro
Year of Data: 2022
Types of Chargers:
62.5 kW DCFC Chargers (with CCS and CHAdeMO connections): Primarily used for fast
charging, these stations offer high power output to quickly replenish EV batteries.
7 kW Level 2 Chargers: Slower, more common chargers typically used for longer charging
sessions or when fast charging is not available.
3.1.1 Inputs and Output
Start SOC (State of Charge): Input - This feature indicates the initial charge level (as a
percentage) of the EV battery when the charging session begins. It provides insights into
when users typically plug in their vehicles for a recharge, highlighting their driving habits or
charging strategies.
End SOC: Input - This feature records the battery charge level (as a percentage) at the
conclusion of the charging session, indicating how much charge the user prefers to
accumulate before unplugging.
Charging Time: Input - The total duration of the charging session, recorded in hours, minutes,
and seconds. This feature is vital for understanding how long users typically spend at
charging stations and helps in identifying fast versus slow charging patterns.
Energy (kWh): Output - This records the total amount of energy transferred to the vehicle
during the charging session. Energy consumption is one of the most critical features for
understanding demand on the charging network, as well as for analyzing the load EVs place
on the power grid.
3.2 Purposes and Background
NL Hydro's objective is to effectively handle the heightened demand on the electrical
grid resulting from electrification initiatives, such as electric vehicle charging, household hot
water heating, and space heating. Handling these additional demands efficiently is essential
for utilities and customers, because it prevents wasteful investments in the electrical
infrastructure that could lead to increased power costs for consumers. NL Hydro is preparing
to initiate a residential EV smart charging trial to assess the possibility of moving EV
charging demand away from peak hours. This pilot project aims to investigate two distinct
control methods: direct communication with smart chargers, and telematics, which entails
direct communication with the electric vehicle's internal charging management logic.

The dataset is crucial for this research as it offers factual data essential for analyzing
and forecasting EV charging behaviors. The research intends to utilize this data in order to
enhance the efficiency of EV charging infrastructure and assist NL Hydro's goals of effective
grid management and cost-efficient electrification.
3.3 Data preprocessing
Data preparation is an essential step to guarantee the quality and uniformity of the
dataset prior to any analysis. The EV charging data was preprocessed in code, and the
following procedures were followed:
3.3.1 Conversion of Time
The Charging Time (hh:mm:ss) column was transformed from the format of hours,
minutes, and seconds into a cumulative duration measured in seconds. This conversion
standardized the time format, making it more convenient for manipulation and analysis. The
conversion function splits the time string into hours, minutes, and seconds, and then
combines these elements into a total count of seconds.
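The conversion described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name hms_to_seconds and the sample values are hypothetical, and the sketch assumes every entry is a well-formed hh:mm:ss string.

```python
def hms_to_seconds(value: str) -> int:
    """Convert an 'hh:mm:ss' string into a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Hypothetical sample values, not taken from the NL Hydro dataset.
sample_times = ["00:35:10", "01:02:03", "12:00:00"]
durations = [hms_to_seconds(t) for t in sample_times]
print(durations)  # [2110, 3723, 43200]
```

Working in seconds keeps the duration as a single integer feature, which simplifies the later scaling and plotting steps.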
3.3.2 Selection of features
The features most relevant to the analysis were then selected. The selected
main features consisted of Start SOC, Energy (kWh), End SOC, and the newly
generated duration (converted from Charging Time (hh:mm:ss)). These features were
chosen because they relate directly to the research aims, particularly
understanding and forecasting EV charging trends.
Normalization:
The data was normalized by scaling numerical features to a standard range, typically
between 0 and 1 (min-max scaling). This was done by subtracting the minimum value of each
feature and dividing the result by the range (maximum value minus minimum value).
Normalization prevents features with large numeric ranges from dominating the analysis and
is crucial for models sensitive to the scale of input data.
Dealing with Missing Data:
The dataset was examined for any missing values, especially in crucial attributes such
as Start SOC, Energy (kWh), End SOC, and Charging Time (hh:mm:ss). Records containing
missing values in these crucial aspects were examined and eliminated if they were considered
incomplete for the purpose of analysis. This measure guaranteed that the dataset utilized for
analysis was comprehensive and dependable, hence reducing the possibility of distorted
outcomes.
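The normalization and missing-data steps can be illustrated with a short pandas sketch. The column names follow the features named in the text, but the rows are invented for illustration, and the order of the two steps (dropping incomplete rows first, then scaling) is an assumption rather than a statement of how the project code was organized.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the text.
df = pd.DataFrame({
    "Start SOC": [20.0, 55.0, None, 40.0],
    "End SOC": [80.0, 90.0, 85.0, 100.0],
    "Energy (kWh)": [30.0, 12.5, 20.0, 25.0],
    "duration": [3600, 1200, 2400, 3000],
})

# Remove sessions that are missing any of the key attributes.
key_columns = ["Start SOC", "End SOC", "Energy (kWh)", "duration"]
clean = df.dropna(subset=key_columns)

# Min-max scaling, column by column: (x - min) / (max - min).
normalized = (clean - clean.min()) / (clean.max() - clean.min())
print(normalized)
```

After scaling, every column has a minimum of 0 and a maximum of 1, so no feature dominates purely because of its units.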
3.4 Exploratory Data Analysis
Exploratory Data Analysis (EDA) involves summarizing the main characteristics of the
dataset and visualizing data distributions and relationships. The following steps were
undertaken to perform EDA on the EV charging data:
3.4.1 Descriptive Statistics
Key statistics of the dataset were summarized to provide an overview of the data. This
included measures such as mean, median, standard deviation, and range for numerical
features like Start SOC, Energy (kWh), End SOC, and duration. These statistics helped in
understanding the central tendency, dispersion, and overall distribution of the data.
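As an illustration of the statistics listed above, the following sketch computes them for a small, invented Energy (kWh) sample. The values are hypothetical and the standard deviation shown is the sample standard deviation, which is the pandas default.

```python
import pandas as pd

# Hypothetical energy values in kWh, for illustration only.
energy = pd.Series([30.0, 12.5, 20.0, 25.0, 17.5], name="Energy (kWh)")

summary = {
    "mean": energy.mean(),
    "median": energy.median(),
    "std": energy.std(),                    # sample standard deviation (ddof=1)
    "range": energy.max() - energy.min(),
}
print(summary)
```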
3.4.2 Data Visualization
Visualization libraries such as Seaborn and Matplotlib were used to create various
plots that depict the distribution and relationships of the data.
3.4.2.1 Box Plots
Box plots were created to visualize the distribution of the duration and Energy (kWh)
features [23]. Box plots help in identifying outliers and in understanding the
spread and skewness of the data [24].
A box plot, also known as a box-and-whisker plot, displays the distribution of
quantitative data so that comparisons can be made among variables or across the levels of
a categorical variable [23] [24]. The box spans the quartiles of the dataset, while the
whiskers extend to show the rest of the distribution, except for points classified as
"outliers" by a rule based on the inter-quartile range [23] [24].
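The inter-quartile range rule behind the whiskers can be made concrete with a small numerical sketch. The factor 1.5 is the common plotting default and is an assumption here, since the text does not state which factor was used; the duration values are hypothetical.

```python
import numpy as np

# Hypothetical charging durations in seconds; the last one is unusually long.
durations = np.array([600, 900, 1200, 1500, 1800, 2100, 2400, 14400])

q1, q3 = np.percentile(durations, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = durations[(durations < lower) | (durations > upper)]
print(q1, q3, outliers)
```

In this sample only the 14400-second session falls outside the whisker limits.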
3.4.2.2 Histograms
To visualize the distribution of each numerical feature, a histogram was drawn for
each one. A histogram depicts the frequency distribution of the data, which makes it
possible to recognize patterns such as a normal distribution, skewness, or bimodality.
The data points are grouped into contiguous ranges, known as bins, and each bar shows
the number of data points that fall inside the corresponding bin. The histogram therefore
reveals the shape of the data distribution and helps identify outliers [25].
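The binning step that underlies a histogram can be sketched without any plotting. The Start SOC values and the choice of five equal-width bins are hypothetical and serve only to show how the counts are formed.

```python
import numpy as np

# Hypothetical Start SOC values in percent.
start_soc = np.array([12, 25, 31, 38, 44, 47, 52, 58, 63, 90])

# Five equal-width bins covering 0 to 100 percent.
counts, edges = np.histogram(start_soc, bins=5, range=(0, 100))
print(counts)  # [1 3 4 1 1]
print(edges)   # [  0.  20.  40.  60.  80. 100.]
```

Each count corresponds to the height of one bar, so the sample above would show a peak in the 40 to 60 percent bin.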
3.4.2.3 Scatter Plots
Scatter plots were used to visualize the relationships between pairs of numerical
features. A scatter plot is a type of graph that uses dots to represent values for two
distinct numeric variables. The position of each dot on the horizontal and vertical axes
indicates the values for a single data point. This type of figure is helpful for identifying
relationships, correlations, and possible outliers between variables [26].
For example, plotting duration against Energy (kWh) helps to observe any potential
correlations or trends between the amount of energy consumed and the duration of the
charging session.
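A scatter plot of duration against energy is often read alongside a correlation coefficient. The sketch below computes the Pearson correlation for invented values; using Pearson correlation here is an illustrative choice, not a step reported in the text.

```python
import numpy as np

# Hypothetical paired observations: duration in seconds, energy in kWh.
duration = np.array([600, 1200, 1800, 2400, 3000])
energy = np.array([5.0, 11.0, 14.0, 21.0, 24.0])

# Pearson correlation coefficient between the two features.
r = np.corrcoef(duration, energy)[0, 1]
print(round(r, 3))
```

A value close to 1 indicates that longer sessions tend to deliver more energy in this sample.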
The utilization of these visualization tools enhanced the comprehension of the dataset,
uncovering patterns, trends, and possible anomalies that influenced further analysis and
modeling. The integration of descriptive statistics and visualizations offered a thorough and
all-encompassing depiction of the data, establishing a foundation for further in-depth research
and predictive modeling.
3.5 Modeling
Machine learning and data mining are changing many parts of our lives by helping us
find useful information in the huge amounts of data that are created every day. These tools are
very important for making sense of data and using it to solve problems and make smart
choices [27].
3.5.1 What does Machine Learning mean?
Machine learning, which is closely related to data mining, aims to improve performance
and predict outcomes by learning from past data [27]. It means teaching computers to
learn from examples and to make decisions or predictions without being explicitly
programmed. Two main types of machine learning are supervised learning and unsupervised
learning. Supervised learning uses labeled data to help the system learn, while unsupervised
learning lets the system find patterns on its own [27].

3.5.2 What does Data Mining mean?
The goal of data mining is to find patterns and useful information in big sets of data.
For instance, companies use data mining to figure out how their customers act, guess what
trends will happen in the future, and make important business decisions. It involves going
through huge amounts of data to find trends that make sense and can help you make decisions
[27].
Data Collection: Data is gathered from various sources, such as clients' past purchases or
sensor data from instruments [27].
Pattern Detection: Data mining techniques uncover patterns and relationships in the
information [27]. For example, it may be found that people who usually purchase bread
also buy margarine [27].
Learning from Data: Machine learning enables decision making based on trends. If
somebody purchases bread, the computer system may suggest buying some butter
during the next visit [27].
These technologies are significant because they allow us to comprehend and exploit
the large volumes of data created daily. They transform raw data into meaningful information
that guides firms in making better decisions, helps doctors detect diseases earlier, and
enables engineers to develop improved products [27].
Machine learning, in turn, uses feature recognition, supervised learning, and
unsupervised learning. In the same way that human beings learn, machine learning
algorithms concentrate on the factors that are considered important when drawing
conclusions [28].

3.5.3 Isolation Forest for Anomaly Detection
The Isolation Forest (iForest) is an anomaly detection method that relies on ensemble
learning [29]. The approach builds a collection of isolation trees (iTrees) by repeatedly
dividing the data space using randomly chosen features and split values [30].
3.5.3.1 Isolation Mechanism
iForest generates iTrees by iteratively picking a feature and a split value at random,
repeating this process until each data point is individually separated [29]. Anomalies
are isolated with fewer partitions, leading to shorter paths in the trees [29] [30].
Anomaly score:
Anomaly scores are calculated by averaging the path length from the root to the leaf
node across the trees. Shorter average path lengths indicate a greater probability that a
point is an anomaly [29]. Anomalies are isolated quickly because of their distinct and
exceptional attributes [30].
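The isolation mechanism and scoring described above can be tried with scikit-learn's IsolationForest. This is a sketch under stated assumptions: the synthetic sessions, the two injected anomalies, and the parameter values (n_estimators, contamination) are illustrative and are not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical sessions: [duration in seconds, energy in kWh].
normal = np.column_stack([rng.normal(1800, 300, 200), rng.normal(20, 4, 200)])
odd = np.array([[40000.0, 1.0], [60.0, 90.0]])  # two implausible sessions
X = np.vstack([normal, odd])

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = forest.fit_predict(X)    # -1 marks an anomaly, 1 marks a normal point
scores = forest.score_samples(X)  # lower score = shorter average path = more anomalous

print(labels[-2:], scores[-2:])
```

The two injected sessions receive much lower scores than typical sessions, because random splits separate them from the rest of the data after only a few partitions.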
3.5.3.2 Enhancements
The enhanced method described in [30] uses k-means clustering to automatically establish
anomaly thresholds, eliminating the need for manual threshold configuration and improving
the objectivity and consistency of anomaly detection [30].
Top-K Anomaly Detection refers to the process of identifying the K most significant
anomalies in a given dataset [30].
The k-nearest neighbor distance is used to compute anomaly scores, guaranteeing a
consistent ranking of anomalies across repeated runs [29] [30]. This aids in accurately
recognizing the top-K anomalies, which is essential for the interpretation of the
hydrological data studied in [30].
3.6 Neural Network
Artificial neural networks (ANNs) are a category of information processing systems
that draw inspiration from the structure and functioning of the human brain. They are
used for the simulation and control of nonlinear systems across many domains, including
medicine, biology, mathematics, physics, philosophy, computer science, and information
science [28] [29].
Artificial neural networks are composed of a multitude of interconnected processing
units known as neurons. Every individual neuron receives input signals and generates an
output depending on these received inputs. The synapses between neurons possess weights
that may be modified throughout the process of learning in order to enhance the performance
of the neural network [29].
An Artificial Neural Network is generally composed of an input layer, one or more
hidden layers, and an output layer. Neurons at the input layer receive signals from the
external world, which are further processed through the hidden layers before being sent to the
output layer [28]. The presence of hidden layers in the network enables it to effectively
capture complex patterns and correlations present in the data [29].
The learning process in artificial neural networks entails modifying the connection
weights between neurons by considering the discrepancy between the network's expected
output and the true output [28]. This technique is commonly accomplished through the
utilization of algorithms such as backpropagation [28] [29]. Backpropagation analyzes the
gradient of the error in relation to each weight and subsequently adjusts the weights to
minimize the error [28] [29].
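The weight update performed by backpropagation can be shown with a small NumPy network. This is a didactic sketch, not the network trained in this project: the single hidden layer of 8 tanh units, the learning rate, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized inputs (3 features) and a synthetic target.
X = rng.uniform(0, 1, size=(64, 3))
y = (X @ np.array([0.5, -0.2, 0.8]))[:, None]

# One hidden layer with 8 tanh units and a linear output unit.
W1 = rng.normal(0, 0.5, size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, size=(8, 1))
b2 = np.zeros(1)
learning_rate = 0.1
losses = []

for _ in range(500):
    # Forward pass: inputs -> hidden layer -> output.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))

    # Backward pass: chain rule from the output error back to each weight.
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T
    grad_z = grad_h * (1 - h ** 2)  # derivative of tanh
    grad_W1 = X.T @ grad_z
    grad_b1 = grad_z.sum(axis=0)

    # Gradient descent step: move each weight against its gradient.
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

print(losses[0], losses[-1])
```

The mean squared error recorded at the end of training is lower than at the start, which is the effect of repeatedly adjusting the weights along the negative gradient.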
Artificial Neural Networks are renowned for their capacity to accurately represent
complex nonlinear relationships and to learn from input data, rendering them well-suited
for tasks such as identifying patterns, categorizing information, and controlling
systems [29]. They can effectively manage and analyze substantial volumes of data,
enabling them to identify significant patterns. This ability is particularly important for
applications like image recognition, voice processing, and autonomous systems [28].
To summarize, Artificial Neural Networks are very effective instruments that replicate
the architecture and capabilities of the human brain to analyze data, acquire knowledge, and
generate informed choices [29]. Artificial neural networks, often referred to as ANNs, are
extensively used across different fields to address complex issues by utilizing their adaptive
learning skills and capacity to represent nonlinear interactions [28] [29].
We used this model following the steps described in [31]. First, the input
features, including Start SOC, End SOC, and Charging Time, were selected to provide
relevant data about the charging behavior of electric vehicles. The output feature, Energy
(kWh), was chosen as the target variable for predicting the total energy consumption during
each session. This approach allowed us to model and analyze the relationships between these
inputs and energy consumption, providing valuable insights into charging habits and patterns.
We employed various machine learning techniques, trained the models, and evaluated their
performance using established metrics to ensure the effectiveness of our predictions.
3.7 Random Forest Description
Random Forest (RF) is an ensemble learning method used for classification and
regression tasks. The algorithm creates multiple decision trees during training and outputs the
class that is the mode of the classes (classification) or the mean prediction (regression) of the
individual trees [32].

For the Random Forest model, we selected Start SOC, End SOC, and Charging Time
as the input features, with Energy (kWh) as the output variable to predict. In [33], Random
Forest’s ensemble approach, which builds multiple decision trees, was particularly useful for
capturing the diverse interactions between the input variables. By averaging the predictions
of multiple trees, Random Forest improved the model’s robustness and reduced overfitting,
leading to more reliable predictions of energy consumption during EV charging sessions. Its
ability to handle feature importance and noisy data further strengthened the accuracy of the
results.
3.7.1 Basic Principles of Random Forest
The Random Forest algorithm uses the bootstrap resampling method to draw multiple
sample sets from the training data and builds a decision tree model for each sample set.
During modeling, each decision tree considers a random subset of features when splitting
its internal nodes, thus forming a random forest. The final output is obtained by combining
the outputs of all decision trees. For regression tasks, the final prediction result is the
average of the outputs from all decision trees [32].
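The averaging step can be checked directly with scikit-learn's RandomForestRegressor. The data below is synthetic and the parameter values are illustrative assumptions; the point of the sketch is only that the forest's regression output equals the mean of its trees' outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features: Start SOC (%), End SOC (%), duration (s).
start = rng.uniform(10, 60, 300)
end = np.minimum(start + rng.uniform(10, 40, 300), 100)
duration = rng.uniform(600, 3600, 300)
X = np.column_stack([start, end, duration])

# Synthetic Energy (kWh) target, invented for illustration.
y = 0.6 * (end - start) + 0.002 * duration + rng.normal(0, 1, 300)

forest = RandomForestRegressor(
    n_estimators=50,
    bootstrap=True,        # each tree is fitted on a bootstrap sample
    max_features="sqrt",   # random feature subset considered at each split
    random_state=0,
)
forest.fit(X, y)

# The forest prediction is the average of the individual tree predictions.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
matches = np.allclose(per_tree.mean(axis=0), forest.predict(X[:5]))
print(matches)
```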
3.7.2 Construction of Random Forest
Random Forest constructs different training sets to increase the variation between
classification models, enhancing the extrapolation prediction ability of combined
classification models. Through k rounds of training, a classification model sequence is
obtained and used to form a multi-classification model system. The final classification result
of the system is determined using a simple majority voting method [34].
3.7.3 Handling of Feature Importance
In Random Forest prediction, noise can be added to a feature to judge its importance.
The importance is determined by whether the prediction accuracy decreases
significantly when the feature's data is perturbed. This capability allows the algorithm to
calculate the importance of feature variables while maintaining accuracy even with
outliers and noise [32].
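One widely available way to perturb a feature and measure the drop in accuracy is permutation importance, which shuffles a feature's values rather than adding noise to them. The sketch below uses scikit-learn's permutation_importance as a stand-in for the perturbation idea described above; the three synthetic features and their coefficients are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)

# Hypothetical features: column 0 is informative, column 1 is weak, column 2 is pure noise.
X = rng.uniform(0, 1, size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle one feature at a time and record how much the score drops.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```

The feature whose perturbation degrades the score the most is ranked first, which in this synthetic example is the informative column 0.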
3.7.4 Majority Voting Mechanism
The final classification decision in Random Forest is made using a majority voting
mechanism. Each tree in the forest gives a classification, and the class with the most votes
becomes the model's prediction. This mechanism helps in reducing overfitting and improves
the robustness of the model [34].
3.8 Support Vector Regression (SVR)
Support Vector Regression is an extension of the Support Vector Machine tailored for
regression tasks. The primary goal of SVR is to predict continuous output values based on
input features [35].
For the Support Vector Regression (SVR) model, we applied this method by using
Start SOC, End SOC, and Charging Time as input features, with Energy (kWh) as the output
to predict. In [33], SVR was chosen for its capacity to handle complex relationships by
finding the optimal hyperplane that minimizes prediction errors. This model enabled us to
predict energy consumption accurately, especially in cases with non-linear relationships
between inputs and the target variable. The use of the kernel trick further allowed us to model
these non-linearities effectively, making SVR a valuable tool in understanding charging
behavior patterns.
3.8.1 Fundamental Concept
SVR constructs a hyperplane in a high-dimensional space to predict continuous
values. The method is an adaptation of the Support Vector Machine (SVM), which is
typically used for classification tasks [35].
3.8.2 Loss Function
SVR employs an ε-insensitive loss function [36]. This loss function helps measure the
quality of the estimation by ignoring errors within a certain margin (ε) from the true value.
Deviations beyond this margin are penalized. The ε-insensitive loss function allows SVR to
find a balance between the complexity of the model and the precision of predictions [35]
[36].
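The ε-insensitive loss can be written in a few lines. The function name and the numbers below are hypothetical; the sketch only illustrates that errors inside the ε margin cost nothing and that larger errors are penalized by the amount they exceed the margin.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Return max(0, |y_true - y_pred| - epsilon) for each prediction."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)


# Hypothetical energy values in kWh: true value 10, three different predictions.
losses = epsilon_insensitive_loss([10.0, 10.0, 10.0], [10.3, 9.0, 12.0], epsilon=0.5)
print(losses)  # [0.  0.5 1.5]
```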
3.8.3 Optimization Problem
The objective of SVR is to minimize a function that balances model complexity
(measured by the norm of the weight vector) and the sum of the errors exceeding the ε
margin. This balance is achieved by introducing slack variables to handle deviations outside
the ε-insensitive zone. The optimization problem is formulated to minimize the norm of the
weight vector while also penalizing the slack variables that account for these errors [35].
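For reference, the standard primal form of this optimization problem is given below. The symbols are assumptions introduced for this illustration and do not appear in the original text: w is the weight vector, b the bias, ξᵢ and ξᵢ* the slack variables for deviations above and below the ε margin, C the regularization parameter, and n the number of training samples.

```latex
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right)
\quad \text{subject to} \quad
\begin{cases}
y_i - \left( w^{\top} x_i + b \right) \le \varepsilon + \xi_i, \\
\left( w^{\top} x_i + b \right) - y_i \le \varepsilon + \xi_i^{*}, \\
\xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1, \dots, n.
\end{cases}
```

A larger C penalizes deviations outside the margin more heavily, while a smaller C favors a flatter, simpler function.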
3.8.4 Kernel Trick
Similar to SVM, SVR can handle non-linear relationships by employing the kernel
trick [18]. This technique involves mapping the input features into a high-dimensional space
using a kernel function [18]. Common kernel functions include the Radial Basis Function
(RBF) and polynomial kernels. The kernel trick allows SVR to model complex, non-linear
relationships effectively [35].
3.8.5 Advantages and Challenges
SVR is capable of producing robust regression models that can handle high-dimensional
data and complex relationships. However, the performance of SVR depends
heavily on the choice of kernel and the tuning of hyperparameters such as the regularization
parameter (C) and the ε margin. Proper tuning of these parameters is crucial for the
effectiveness of the SVR model [35].
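The role of the kernel, C, and ε can be seen in a short scikit-learn sketch. The data is synthetic and the hyperparameter values (C=10.0, epsilon=0.05) are illustrative assumptions, not the tuned values from this project; the scaling step is included because SVR is sensitive to feature scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical inputs with a non-linear relationship to the target.
X = rng.uniform(0, 1, size=(200, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

# Scale the inputs, then fit an RBF-kernel SVR.
model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X, y)

r2 = model.score(X, y)  # coefficient of determination on the training data
print(round(r2, 3))
```

In practice these hyperparameters would be chosen by a search over candidate values with cross-validation rather than fixed by hand.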
3.8.6 Practical Implementation
In practical applications, such as facial expression recognition, SVR is used to predict
continuous values representing different expressions. The effectiveness of SVR in these
applications demonstrates its versatility and robustness in handling regression tasks with
high-dimensional and complex data. SVR's ability to model complex relationships makes it
suitable for various real-world applications where continuous output prediction is required
[35].
3.9 XGBoost
XGBoost, also known as Extreme Gradient Boosting, is a powerful machine
learning method used for supervised learning problems [37].
XGBoost is an optimized implementation of the gradient boosting framework. It is
designed for fast execution and high efficiency, enabling it to process large
volumes of data effectively [37].
The model structure of XGBoost consists of a tree ensemble, where many decision
trees are utilized to create predictions. The ultimate forecast is determined by aggregating the
results of each individual tree in the ensemble. This methodology facilitates the identification
and analysis of complex patterns within the data [37] [38].
The objective function in XGBoost comprises a loss function and a regularization term
[37]. The loss function measures how closely the model's predictions match the true
target values, while the regularization term controls the model's complexity to avoid
overfitting [37] [38].
Greedy method: XGBoost employs a greedy method to identify the optimal split
points for the decision trees. This method systematically assesses every candidate split for
each feature and chooses the one that reduces the loss function to the greatest extent. To
enhance efficiency, XGBoost pre-sorts the data and accesses it sequentially
while searching for splits [37].
Regularization: XGBoost incorporates regularization directly into the objective
function [38]. Regularization mitigates overfitting by imposing a penalty on
the complexity of the model [37]. This is accomplished by including terms that depend on
the number of leaf nodes and the magnitude of the weights assigned to each leaf node
[37] [38].
The model can be represented by:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) = f_1(x_i) + f_2(x_i) + \dots + f_K(x_i) \tag{3.1}$$
where $f_k$ represents the k-th decision tree and $K$ is the number of trees in the ensemble [37].
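The loss and regularization terms described earlier in this section can be written out in the form commonly given for XGBoost. The symbols beyond Equation (3.1) are assumptions introduced for this illustration: l is a differentiable loss function, n the number of training samples, K the number of trees, T the number of leaves in a tree, w_j the weight of leaf j, and γ and λ the regularization coefficients.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left( y_i, \hat{y}_i \right) + \sum_{k=1}^{K} \Omega\left( f_k \right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

The first term of Ω penalizes trees with many leaves, and the second penalizes large leaf weights, which matches the two regularization components mentioned above.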
For the XGBoost model, we followed a systematic approach by using Start SOC, End
SOC, and Charging Time as input features, while Energy (kWh) was the output target. In [18]
XGBoost's ability to handle large datasets and capture complex relationships made it an ideal
choice for predicting energy consumption. We employed this gradient boosting technique to
iteratively improve model accuracy by minimizing prediction errors at each step. XGBoost’s
regularization and efficiency in handling missing values further enhanced our ability to model
the intricacies of EV charging behavior and deliver precise energy consumption predictions.
3.9.1 Dealing with Missing Values
An advantage of XGBoost is its capability to manage missing values effectively.
During training, XGBoost learns how missing values should be handled at each split,
enabling it to provide accurate predictions even when certain data
points are lacking [38].

Parallel Processing: XGBoost is specifically engineered to exploit the benefits of
parallel processing. XGBoost achieves a notable reduction in training time compared to
classic gradient boosting algorithms by dividing data and executing calculations
simultaneously [39].
3.9.2 Advantages of XGBoost
Efficiency: XGBoost is highly efficient and can manage very large datasets and
high-dimensional data [40].
Accuracy: XGBoost frequently achieves high predictive accuracy due to its
use of techniques such as the second-order Taylor expansion of the loss function and
regularization [39] [40].
Flexibility: The algorithm exposes a diverse set of
hyperparameters, enabling thorough customization and fine-tuning to enhance
performance for particular tasks [38].
3.10 Stacking Regressor
The Stacking Regressor is an ensemble learning method that enhances forecast
accuracy by combining multiple base regression models [41] [42].
For the Stacking Regressor model, we followed a similar process by using Start SOC,
End SOC, and Charging Time as inputs and Energy (kWh) as the target output. The Stacking
Regressor combines the predictions of multiple base models by using a meta-regressor to
make the final prediction. In [31], this ensemble approach was applied to leverage the diverse
strengths of different regression models, enhancing the predictive accuracy of energy
consumption. This method allowed us to gain deeper insights by effectively combining
information from various models and minimizing errors in forecasting.
3.10.1 Structure of the model
The stacking regressor consists of several base regressors and a meta-regressor
[42]. The base regressors are trained independently, and their predictions are
used as input features for the meta-regressor [41].
In [42], for example, the stacking regressor is applied to forest height estimation, where it
enhances the accuracy of forecasts by combining the predictions of many base models.
3.10.2 Process of Training
A stacking regressor undergoes a training procedure that consists of two distinct steps
[24]. During the first step, every base regressor is trained using the original training data.
During the second step, the meta-regressor is trained using the predictions made by the
base regressors as input features [41].
3.10.3 Base models
The base regressors can encompass a range of regression techniques, including linear
regression, random forest, adaptive boosting, support vector regression, and ridge regression
[41] [42]. Each of these models captures distinct facets of the data, hence enhancing the
overall prediction's resilience [41].
3.10.4 Meta-Regressor
The meta-regressor is often a simple model that learns how to
combine the predictions made by the base regressors. Ridge regression is commonly
employed as the meta-regressor because of its capacity to address multicollinearity among the
input features [41].
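The two-level structure described in this section can be sketched with scikit-learn's StackingRegressor, using a ridge meta-regressor as mentioned above. The choice of base models, their parameters, and the synthetic data are illustrative assumptions and do not reproduce the configuration used in this project. Note that scikit-learn trains the meta-regressor on cross-validated predictions of the base models (the cv argument) rather than on predictions for data the base models have already seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical normalized inputs and a synthetic target.
X = rng.uniform(0, 1, size=(200, 3))
y = 20 * X[:, 1] - 15 * X[:, 0] + 5 * X[:, 2] ** 2 + rng.normal(0, 0.5, 200)

stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("svr", SVR(kernel="rbf", C=10.0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-regressor
    cv=5,                              # folds used to build the meta-regressor's inputs
)
stack.fit(X, y)

# Each column holds one base model's prediction: the meta-regressor's input features.
meta_inputs = stack.transform(X[:3])
print(meta_inputs.shape, round(stack.score(X, y), 3))
```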
3.10.5 Benefits
Stacking regressors can enhance prediction accuracy by capitalizing on the
capabilities of numerous regression algorithms, surpassing the performance of individual
base models. The utilization of this ensemble strategy mitigates the likelihood of overfitting
and improves the model's capacity for generalization [41] [42].
3.11 Voting Regressor
The Voting Regressor is an ensemble learning method designed specifically for
regression problems [43]. It combines the predictions of several independent regression
models to make a final prediction, which enhances overall performance [43]. The method
relies on the collective knowledge of its component models, either by averaging their
predictions or by assigning weights to them [44].
For the Voting Regressor model, we selected Start SOC, End SOC, and Charging Time as
input features and Energy (kWh) as the output. By combining predictions from multiple
models, the Voting Regressor draws on the strengths of each individual model to improve the
accuracy of energy consumption forecasts. We applied this technique [31] by training several
base models and averaging their predictions, allowing us to capture different aspects of the
charging behavior and deliver more reliable and robust results.

3.12 Structure of the model
The Voting Regressor is composed of several base regressors. Every individual base
model is trained separately on the same training data, and their predictions are combined to
obtain the final prediction. The aggregation can be a simple average or a weighted average in
which each model's forecast receives its own weight according to its anticipated performance
[44].
3.13 Categories of Averaging
Simple Averaging: Each model is given the same weight, and the final forecast is
calculated by taking the average of the predictions from all models [43].
Weighted Averaging: Each model's forecast is assigned a different weight based on the
model's performance [43].
3.14 Process of Training
Each base regressor is trained separately using the training data. During the
prediction phase, each model generates its own forecast. These predictions are then merged
using either simple or weighted averaging, according to the selected aggregation technique
[44].
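The difference between simple and weighted averaging can be shown with a few lines of NumPy. The prediction values and the weights below are hypothetical numbers chosen for illustration; they are not outputs of the models trained in this project.

```python
import numpy as np

# Hypothetical predictions (kWh) from three base models for four sessions.
predictions = np.array([
    [20.0, 35.0, 5.0, 50.0],  # model A
    [22.0, 33.0, 6.0, 47.0],  # model B
    [18.0, 37.0, 4.0, 53.0],  # model C
])

# Simple averaging: every model contributes with the same weight.
simple_average = predictions.mean(axis=0)

# Weighted averaging: hypothetical weights that favour model A.
weights = np.array([0.5, 0.3, 0.2])
weighted_average = np.average(predictions, axis=0, weights=weights)
```

With these numbers the simple average is 20, 35, 5, and 50 kWh, while the weighted average moves each value towards model A's forecast.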
3.15 Benefits
Improved Accuracy: By combining the predictions of several models, the Voting
Regressor can achieve better performance than its individual component models [43].
Reduced Overfitting: Averaging predictions from different models makes the ensemble
less prone to capturing irrelevant details from the training data [44].
Versatility: The Voting Regressor is compatible with any regression model, which allows
a broad spectrum of combinations and freedom in the choice of models [44].
3.16 Model Evaluation
3.16.1 Mean Absolute Error (MAE)
MAE measures the average magnitude of errors in a set of predictions, without
considering their direction. It is calculated as the mean of the absolute differences between
predicted and actual values, making it easily interpretable as it is on the same scale as the data
being predicted. MAE is particularly useful because it gives a straightforward measure of
prediction accuracy [45].
3.16.2 Root Mean Squared Error (RMSE)
RMSE is a quadratic scoring rule that also measures the average magnitude of the
error. It is the square root of the average of squared differences between predicted and actual
values. RMSE gives a higher weight to larger errors, making it more sensitive to outliers
compared to MAE [46].
R-squared (R²)
R-squared is a statistical measure that represents the proportion of the variance of the
dependent variable that is explained by the independent variable or variables in a regression
model. It indicates the goodness of fit of the model, and higher values indicate better model
performance [47].
3.16.3 Symmetric Mean Absolute Percentage Error (SMAPE)
SMAPE is an accuracy measure based on percentage errors. It is calculated as the mean of
the absolute errors divided by the average of the absolute actual and absolute predicted
values, expressed as a percentage. Because it is independent of scale, it is useful for
comparing model performance across different datasets [48].
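The four metrics described in this section can be written out directly in NumPy, which makes their definitions explicit. This is a sketch rather than the evaluation code used in the project. In particular, SMAPE has several published variants, and the version below (dividing by the mean of the absolute actual and predicted values, giving a range of 0 to 200%) is an assumption. The example values are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """One minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def smape(y_true, y_pred):
    """Symmetric MAPE in percent (assumed variant, range 0 to 200).

    Assumes no pair has both actual and predicted value equal to zero.
    """
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / denom))

# Hypothetical actual and predicted energy values in kWh.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2": r_squared(y_true, y_pred),
    "SMAPE": smape(y_true, y_pred),
}
```

For this example the MAE is 2.5 kWh while the RMSE is slightly larger (about 2.55 kWh), because squaring gives the two 3 kWh errors more weight than the two 2 kWh errors.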
3.16.4 Challenges and Implementation Strategies in Model Development
During the implementation of the models, several obstacles emerged that required careful
navigation to ensure the accuracy and reliability of the results.
The first issue encountered was the presence of missing data in crucial columns like Start
SOC, End SOC, and Energy (kWh). These gaps in the data could have significantly
undermined the models' ability to learn effectively. To maintain the quality of the dataset, it
was decided to remove rows with missing values in these key columns. While this approach
reduced the overall size of the dataset, it allowed the models to be trained on complete and
consistent data, ultimately enhancing their predictive performance.
Another aspect that demanded attention was the handling of the Charging Time (hh:mm:ss)
feature. Initially, this data was in a time format, which posed a challenge for direct use in
most machine learning models. The solution was to convert this time data into a numerical
format, specifically seconds, which allowed it to be used as a continuous variable in the
models. This transformation enabled the models to better interpret and utilize the time-related
information.
Overfitting was a concern throughout the process, particularly given the smaller dataset after
the removal of incomplete records. Overfitting occurs when a model performs exceptionally
well on training data but fails to generalize to unseen data. To address this, various strategies
were employed. In the case of the Random Forest and XGBoost models, careful tuning of
hyperparameters, such as the number of estimators and the depth of trees, was implemented
to prevent the models from becoming too complex. The Random Forest model, in particular,
was monitored using the Out-of-Bag (OOB) score, which provided an additional measure of
performance on unseen data during training.
To further reduce the risk of overfitting, ensemble techniques such as stacking and voting
regressors were employed. By combining multiple models, these techniques used the
strengths of each individual model, resulting in more robust predictions and reducing the
likelihood that any single model would dominate and potentially overfit the data.
Computational resources also played a significant role in the model implementation. Training
models like neural networks, Random Forest, and XGBoost can be demanding, especially
when fine-tuning hyperparameters or using ensemble methods. Given these constraints,
training was conducted with careful attention to the balance between computational
efficiency and model performance. This included optimizing the number of estimators and
epochs, and incorporating more computationally efficient models, like Support Vector
Regression, into the ensemble to lighten the overall load.
Finally, the choice of evaluation metrics was crucial in ensuring a comprehensive assessment
of model performance. Multiple metrics were used, including Mean Absolute Error, R-squared
(R²) score, Symmetric Mean Absolute Percentage Error, and Root Mean Squared Error. This
multi-faceted approach provided a well-rounded evaluation of the models, capturing not only
their accuracy but also their potential biases and variances.

Chapter 4: Experiment Design
The dataset of EV charging sessions provides a rich picture of user behaviors and
charging patterns. By examining the histograms of the key numerical features (Energy (kWh),
Start SOC, End SOC, and Duration), we can uncover important trends and insights.
4.1 Data Pre-Processing

4.1.1 Handling Missing Data
Handling missing data is a vital part of data preprocessing to ensure the integrity and
accuracy of the model's predictions. The following approach was taken:
• Column Selection: Initially, columns that were not essential for the analysis were
removed from the dataset. This was done to focus on the key variables: Start SOC,
End SOC, Charging Time (hh:mm:ss), and Energy (kWh). Removing irrelevant
columns reduces the dimensionality of the data and simplifies the subsequent analysis.
• Row Filtering: The dataset was then processed to identify and handle missing values.
Specifically, rows that contained missing values (NaN) in any of the critical columns
(Start SOC, End SOC, Energy (kWh)) were excluded from the final dataset. This
approach, known as listwise deletion, was chosen to ensure that only complete cases
were used for model training, thereby avoiding the potential biases introduced by
imputation techniques.
By removing rows with missing data, the dataset was cleaned and prepared for
accurate modeling, with the understanding that sufficient data remained to maintain the
robustness of the analysis.
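The column selection and listwise deletion steps can be sketched with pandas as follows. The small data frame is a hypothetical stand-in for the raw export, and the "Station" column is an invented example of a non-essential column. Only the four key column names are taken from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical raw export; "Station" stands in for a non-essential column.
raw = pd.DataFrame({
    "Station": ["A", "B", "C", "D"],
    "Start SOC": [35.0, np.nan, 20.0, 50.0],
    "End SOC": [80.0, 75.0, np.nan, 90.0],
    "Charging Time (hh:mm:ss)": ["00:26:24", "00:11:07", "00:44:58", "01:05:00"],
    "Energy (kWh)": [16.58, 7.20, 21.40, np.nan],
})

key_columns = ["Start SOC", "End SOC", "Charging Time (hh:mm:ss)", "Energy (kWh)"]
critical_columns = ["Start SOC", "End SOC", "Energy (kWh)"]

# Column selection followed by listwise deletion on the critical columns.
sessions = (
    raw[key_columns]
    .dropna(subset=critical_columns)
    .reset_index(drop=True)
)
```

In this toy example three of the four rows contain a missing value in a critical column, so only one complete session remains.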
4.1.2 Data Normalization
Normalization is a standard preprocessing step, especially when dealing with features
of varying scales. In this analysis, a normalization function was implemented to scale data
values to a range between 0 and 1. Although this function was not utilized in the final model,
its inclusion highlights the importance of normalization in machine learning, as it ensures that
no single feature disproportionately influences the model due to its scale.
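A min-max normalization function of the kind described here could look like the following sketch. The function name, the handling of constant inputs, and the sample values are assumptions for illustration. The project's own function is not reproduced in the text.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lowest, highest = values.min(), values.max()
    if highest == lowest:
        # A constant column carries no scale information; map it to zeros.
        return np.zeros_like(values)
    return (values - lowest) / (highest - lowest)

# Hypothetical energy values in kWh.
energy_kwh = np.array([5.0, 16.58, 30.0, 100.0])
normalized_energy = min_max_normalize(energy_kwh)
```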
4.1.3 Time Data Conversion
The dataset included a time-related feature, Charging Time (hh:mm:ss), which needed
to be converted into a numerical format suitable for modeling. This was achieved by
transforming the time into seconds, allowing the model to process and learn from the time
data effectively.
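One way to carry out this conversion is with pandas timedeltas, as sketched below. The three sample durations are hypothetical inputs chosen for illustration, and the project may have implemented the conversion differently.

```python
import pandas as pd

# Hypothetical charging times in hh:mm:ss format.
charging_time = pd.Series(
    ["00:11:07", "00:26:24", "12:30:00"],
    name="Charging Time (hh:mm:ss)",
)

# Parse each string as a duration and express it in seconds.
charging_seconds = pd.to_timedelta(charging_time).dt.total_seconds()
```

The resulting values are 667, 1584, and 45000 seconds, a continuous variable that regression models can use directly.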
4.1.4 Feature and Target Variable Preparation
The preprocessed dataset was then divided into features (X) and the target variable
(y). The features included Start SOC, End SOC, and the Charging Time converted to seconds,
while the target variable was Energy (kWh). This clear separation ensured that the model
could be trained specifically to predict the target variable based on the provided features.
In the experimental setup, the dataset was first split into training and testing sets using
the train_test_split function from the sklearn.model_selection module. Specifically, 33% of
the data was allocated to the testing set, while the remaining 67% was used for training. The
split was made with a fixed random_state to ensure reproducibility.
No cross-validation was applied in this setup. However, the fixed random_state used for
the train-test split made the results repeatable across different models. If cross-validation
had been used, the training data would have been divided into multiple folds to validate the
model's performance across various subsets, providing a more thorough evaluation.
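The split described above, together with the cross-validation alternative that was not used, can be sketched as follows. The synthetic data, the random_state value of 42, the linear model, and the five folds are all assumptions for illustration. Only the 33% test share comes from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the three features and the energy target.
rng = np.random.default_rng(0)
n_sessions = 300
start_soc = rng.uniform(5.0, 60.0, n_sessions)
end_soc = np.minimum(start_soc + rng.uniform(10.0, 40.0, n_sessions), 100.0)
charging_seconds = rng.uniform(300.0, 3600.0, n_sessions)
X = np.column_stack([start_soc, end_soc, charging_seconds])
y = 0.6 * (end_soc - start_soc) + rng.normal(0.0, 1.0, n_sessions)

# 67% / 33% split; the random_state value itself is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Optional k-fold cross-validation on the training portion only.
cv_scores = cross_val_score(
    LinearRegression(), X_train, y_train, cv=5, scoring="r2"
)
```

Cross-validating on the training portion leaves the test set untouched, so it can still serve as a final check on unseen data.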
4.2 Energy Consumption
Examining the distribution of energy consumption reveals a right skew, with a skewness
value of 0.78. This indicates that most charging sessions are characterized by
lower energy usage, with a smaller number of sessions requiring higher energy inputs. The
histogram depicts a large concentration of sessions at the lower end of the energy spectrum,
gradually tapering off towards higher values. This pattern suggests that users often engage in
short charging sessions, potentially opting for partial charges rather than full ones. Such
behavior might be influenced by the widespread availability of fast-charging stations,
allowing drivers to quickly top up their batteries as needed.
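Skewness values such as the one quoted above can be computed with pandas. The sketch below uses two small hypothetical samples, one with a long right tail and one with a long left tail, to show how the sign of the statistic relates to the shape of the histogram. It assumes pandas' bias-corrected sample skewness, which may differ slightly from the estimator used in the project.

```python
import pandas as pd

# Hypothetical samples, not values from the project's dataset.
energy_kwh = pd.Series([3.1, 4.8, 6.0, 7.5, 9.2, 12.4, 16.6, 21.0, 30.5, 62.0])
end_soc = pd.Series([35, 62, 70, 75, 78, 80, 82, 85, 88, 90])

# Positive skewness: most values are low, with a tail towards high values.
skew_energy = energy_kwh.skew()

# Negative skewness: most values are high, with a tail towards low values.
skew_end_soc = end_soc.skew()
```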
4.3 Start SOC (State of Charge)
The Start SOC histogram displays a somewhat uniform distribution with a slight increase
towards the mid-range, followed by a decline. With a skewness of 0.43, this pattern
indicates that many charging sessions begin with the battery at a mid-level state of charge. It
appears that drivers typically start their charging sessions when their battery is neither fully
depleted nor overly charged, hovering around a comfortable mid-range. This could be a
deliberate strategy to maintain battery health, as frequent deep discharges and charges can
reduce battery longevity. By starting charges at a mid-range SOC, users might be aiming to
prolong their battery's lifespan while ensuring they have enough charge for their next journey.
4.4 End SOC
In contrast to the Start SOC, the End SOC histogram is left-skewed, with a skewness of
-1.22. This indicates that most charging sessions conclude with a high SOC, often nearing
full charge. The histogram shows a significant concentration of sessions at the upper end of
the SOC spectrum, suggesting that users prefer to charge their vehicles to a high level before
concluding the session. This behavior is likely driven by the desire to maximize driving range
and reduce the frequency of charging stops. By ending their sessions with a high SOC,
drivers ensure they have sufficient battery capacity for their upcoming trips, reducing range
anxiety and enhancing convenience.
4.5 Charging Duration
The distribution of charging durations is strongly right-skewed, with a skewness of
7.07. The histogram reveals a large number of sessions with very short durations,
accompanied by a long tail extending towards longer charging times. This extreme skewness
suggests that the majority of charging events are quick top-ups, likely facilitated by the
presence of fast-charging infrastructure. Users seem to prefer short, frequent charging
sessions to maintain their SOC, capitalizing on the speed and convenience of modern
charging stations. However, the long tail also indicates occasional longer sessions, which
might be necessary for deeper charges or when slower charging speeds are utilized.
4.6 Insights
The patterns observed in the dataset tell a consistent story about the charging habits of
EV users. The tendency towards lower energy consumption in most sessions
highlights a preference for partial charges, likely driven by the convenience of fast-charging
stations. Starting charges at a mid-range SOC and ending them with a high SOC reflects a
strategic approach to battery management, aimed at balancing longevity with readiness for
the next trip. The predominance of short charging durations further underscores the impact of
fast-charging technology, allowing users to quickly replenish their batteries and get back on
the road.
Together, these insights paint a vivid picture of the evolving landscape of EV
charging. Users are increasingly leveraging advanced charging infrastructure to optimize their
charging routines, maintaining flexibility and minimizing downtime. As the adoption of EVs
continues to grow, understanding these behaviors will be crucial for enhancing charging
infrastructure, improving user experience, and supporting the broader transition to sustainable
transportation.

Figure 4.1: Energy and Duration Histogram
4.7 Box Plot
The box plots generated from the dataset provide an overview of how users interact
with EV charging infrastructure, revealing patterns in energy consumption, state of charge
(SOC), and charging duration.
4.8 Energy Consumption Patterns
The box plot for Energy (kWh) shows how much energy users typically consume during
their charging sessions. The interquartile range (IQR), represented
by the green box, captures the middle 50% of the data, showing that most charging sessions
consume between approximately 5 kWh and 30 kWh. The median energy consumption,
which is around 16.58 kWh, indicates a common tendency towards moderate energy use.

However, the presence of outliers—data points beyond the whiskers—tells another
part of the story. These outliers, extending up to 100 kWh, suggest that while most sessions
are moderate, there are instances of significantly higher energy consumption. This could be
attributed to longer charging sessions or charging sessions for larger battery capacities,
highlighting the diverse needs of EV users.
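The quantities that a box plot displays can be computed directly, as in the sketch below. It assumes the common convention that whiskers extend at most 1.5 times the IQR beyond the box (the default in many plotting libraries), and the sample values are hypothetical rather than taken from the dataset.

```python
import numpy as np

# Hypothetical energy values in kWh.
energy_kwh = np.array([2.0, 5.0, 9.0, 14.0, 16.5, 19.0, 24.0, 30.0, 41.0, 100.0])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(energy_kwh, [25, 50, 75])
iqr = q3 - q1

# Assumed whisker convention: 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers.
outliers = energy_kwh[(energy_kwh < lower_fence) | (energy_kwh > upper_fence)]
```

In this example only the 100 kWh session lies beyond the upper fence and would be plotted as an outlier.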
4.9 Starting State of Charge
The box plot for Start SOC reveals how charged the vehicles are when they begin a
session. Most sessions start with the SOC in the range of 20% to 50%, as indicated by the
green box's position. The median start SOC is around 35%, suggesting that drivers often
begin charging when their battery is somewhat depleted, likely to avoid running too low on
charge.
Interestingly, the whiskers and outliers show a few sessions starting with a very low
SOC, close to 0%, implying that some users charge only when their battery is nearly empty.
This behavior might be influenced by the availability of charging stations or individual
charging habits. The upper whisker extends to around 90%, indicating that some users start
charging even when their battery is relatively full, perhaps to maintain a high state of
readiness.
4.10 Ending State of Charge
The End SOC box plot shifts the focus to how charged the vehicles are by the end of
the session. Here, the green box lies predominantly in the higher range of SOC, between 55%
and 85%, with the median at around 75%. This indicates a strong preference among users to
charge their vehicles to a high level, ensuring ample driving range for subsequent journeys.
The few outliers on the lower end suggest that occasionally, charging sessions are cut
short, possibly due to time constraints or the immediate need for the vehicle. Nevertheless,
the general trend towards a high end SOC reflects a cautious approach, aiming to reduce
range anxiety. The lower whisker reaches down to 13%, so only sessions that end below this
level appear as outliers.
4.11 Charging Duration
Finally, the duration box plot, now represented in seconds for granularity, reveals the
length of typical charging sessions. The IQR shows that most sessions last between
approximately 667 seconds (0.185 hours) and 2698 seconds (0.75 hours). The median
duration, at around 1584 seconds (0.44 hours), confirms that quick top-ups are common
practice.
However, the long tail of outliers extending far beyond the whiskers indicates the
presence of longer charging sessions, some exceeding 45,000 seconds (about 12.5 hours).
These could be due to slower charging rates or situations where a full charge is necessary.
This variety in session lengths underscores the flexible nature of EV charging,
accommodating both quick stops and longer charging needs.
4.12 Insights
Users generally prefer moderate energy consumption, starting charges when their
SOC is between 20% and 50%, and aiming for an end SOC of around 75%, all while
capitalizing on the convenience of short charging durations with a median of approximately
26 minutes. The outliers in each plot remind us of the diversity in charging needs and
behaviors, painting a comprehensive picture of the dynamic world of EV charging.

Understanding these patterns not only helps in optimizing charging infrastructure but
also in designing policies and strategies that align with user behavior, ultimately supporting
the broader adoption of electric vehicles.

Figure 4.2: Energy and Duration Boxplots

4.13 Isolation Forest Outlier Detection
4.13.1 Isolation Forest Analysis of Outlier Detection
The following scatter plot shows the result of outlier detection with an Isolation Forest
algorithm applied to the EV charging data. This section examines what the plot reveals about
the data and the behavior of the charging sessions.
4.13.2 Comprehension of the Plot
This plot presents normalized values for two main variables:
Normalized Energy (kWh): The x-axis shows the amount of energy consumed by each
charging session, scaled to a range between 0 and 1 to allow for comparison.
Normalized Duration (hours): The y-axis shows how long each charging session took,
also scaled between 0 and 1.
The data points in the plot are color-coded based on the predictions of the Isolation
Forest model:
Blue points: Indicate normal charging sessions.
Red points: Indicate anomalous charging sessions (outliers).
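The labelling behind such a plot can be sketched with scikit-learn's IsolationForest, which returns 1 for points it considers normal and -1 for anomalies. The synthetic points, the contamination value of 0.05, and the random seed are assumptions for illustration. The settings used to produce the project's figure are not stated in the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical normalized (energy, duration) pairs: a dense cluster of short,
# low-energy sessions plus three long, high-energy sessions.
typical = rng.uniform(0.0, 0.3, size=(200, 2))
unusual = np.array([[0.90, 0.95], [1.00, 0.80], [0.85, 1.00]])
points = np.vstack([typical, unusual])

# contamination is the assumed share of anomalies in the data.
model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(points)  # 1 = normal, -1 = anomaly

normal_points = points[labels == 1]      # would be drawn in blue
anomalous_points = points[labels == -1]  # would be drawn in red
```

Points far from the dense cluster are separated by very few random splits, which is why the three long, high-energy sessions are flagged as anomalies.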
4.13.3 Major Observations
4.13.3.1 Cluster of Normal Points:
There is a significant group of blue points located at the bottom-left part of the
diagram. This indicates that most charging sessions consume relatively low amounts of
energy and have short durations, and the Isolation Forest model treats these sessions as
normal. The concentration shows that most users complete their charging sessions quickly
and draw little energy.
Scattered anomalies:
Red points denoting anomalies are spread throughout the plot but become more
frequent as both energy consumption and duration increase, indicating that unusually long or
high-energy charging sessions are flagged as outliers.

4.13.3.2 Behavior Patterns:
Most users have short, low-energy charging sessions, although some occasionally have
long sessions that draw more energy.
These outlier sessions might be the result of slower charging rates, the need for a
complete charge, or other user-specific circumstances.
4.13.4 Insights:
By clearly separating typical from abnormal sessions, the scatter plot reveals both the
outliers in the dataset and its charging patterns. Normal user behavior, usually driven by
convenience and efficiency, consists of brief, low-energy charges. The anomalies, however,
highlight rare situations in which users engage in longer or more energy-intensive
sessions. This might be the result of long trips or other particular demands requiring a
complete charge.
4.13.5 Conclusion:
Isolation Forest effectively separates typical charging behavior from anomalies. A
better understanding of these trends can help optimize the charging infrastructure and tailor
services to customers' needs. For instance, knowing that most sessions are short can inform
the placement and type of charging stations, while recognizing the need for occasional long
charges can ensure that facilities are available for those situations.

Figure 4.3: Isolation forest scatter plot

4.14 Analysis of Neural Network Model
We used a neural network model built with TensorFlow and Keras to forecast the energy
consumption during these charging sessions. The neural network is composed of three hidden
layers that use ReLU (Rectified Linear Unit) activation functions, along with an output layer
containing a single neuron. The model was compiled with the Adam optimizer and trained to
minimize the mean squared error (MSE).
4.15 Analysis of the model's framework:
The input layer receives three features: Start SOC, End SOC, and Charging Time.
The neural network has three hidden layers containing 64, 32, and 16 neurons,
respectively. All of these neurons use the Rectified Linear Unit (ReLU) activation function.
The output layer consists of a single neuron that generates the predicted energy
consumption.
The training process consisted of 15 epochs.
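The architecture described above corresponds to a small Keras model along the following lines. The layer sizes, activation, optimizer, loss, and number of epochs come from the text. The synthetic data, the batch size, and the remaining default settings are assumptions for illustration and may differ from the project's training script.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the three input features and the energy target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(256, 3)).astype("float32")
y = (20.0 * (X[:, 1] - X[:, 0]) + 5.0 * X[:, 2]).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),                    # Start SOC, End SOC, Charging Time
    tf.keras.layers.Dense(64, activation="relu"),  # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer 2
    tf.keras.layers.Dense(16, activation="relu"),  # hidden layer 3
    tf.keras.layers.Dense(1),                      # predicted energy
])

# Adam optimizer with mean squared error as the training objective.
model.compile(optimizer="adam", loss="mse")

# batch_size is an assumed value; 15 epochs as stated in the text.
history = model.fit(X, y, epochs=15, batch_size=32, verbose=0)
predictions = model.predict(X[:5], verbose=0)
```

The output layer has no activation function, so the network can produce any real-valued energy estimate.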
4.15.1 Metrics for evaluating performance
In order to assess the efficacy of the model, we employed multiple metrics:
The Mean Absolute Error is 5.87. On average, the model's forecasts deviate from the
actual values by 5.87 kWh, which gives a direct measure of the average magnitude of the
error. For example, if the actual energy consumption of a session were 40 kWh, a prediction
with this typical error would be about 34.13 kWh or 45.87 kWh.
The R-squared score is 0.76. The R-squared score quantifies the proportion of the
variability in energy usage that the model is able to account for. A score of 0.76 indicates that
the model accounts for 76% of the variation in the data. This suggests that the model
captures most of the significant patterns and trends present in the data.
The Symmetric Mean Absolute Percentage Error is 26.66. SMAPE expresses the
absolute difference between predicted and actual values relative to the average of their
magnitudes. A SMAPE of 26.66% indicates that, on average, the prediction error is roughly a
quarter of the size of the values being predicted. Although the model has satisfactory
performance, there is still room for improvement.
The Root Mean Squared Error is reported as 60.93. RMSE places greater importance on
larger errors: it is the square root of the average of the squared differences between
predicted and actual values. A value of 60.93 is about ten times the MAE and is hard to
reconcile with an R-squared score of 0.76. It is more consistent with a mean squared error
expressed in kWh², whose square root is approximately 7.81 kWh. In either reading, the gap
between this value and the MAE points to occasional substantial errors.

4.15.2 Analysis and explanation of findings
The neural network model shows a reasonable capacity to forecast energy consumption
during electric vehicle charging sessions. The MAE of 5.87 kWh suggests that the average
forecast error is modest. The R-squared score of 0.76 indicates that the model accounts for a
significant portion of the variation in the data and captures its main patterns.
Nevertheless, the SMAPE of 26.66% shows that the prediction errors are still
noticeable, amounting on average to roughly one-fourth of the values being predicted, so
there is room for improvement. The reported root mean square error of 60.93 is far larger
than the MAE. Whether it is read as an RMSE in kWh or as a mean squared error in kWh²
(with a square root of about 7.81 kWh), it implies that although the model is often accurate,
it occasionally produces large errors.
To summarize, the neural network model offers a usable forecast of energy use
during electric vehicle charging sessions. It explains most of the variability in the data and
maintains a relatively low average error. There remains potential for improvement,
particularly in reducing large errors. The findings indicate that further tuning and the
incorporation of additional features could improve the model's accuracy and reliability.
4.16 Analysis of Random Forest Model
The Random Forest algorithm is a machine learning technique that constructs several
decision trees and combines them to achieve a more precise and robust prediction.
For this study, we applied the Random Forest algorithm with 10 trees to forecast the
energy consumption during electric vehicle charging sessions. The results indicate an
out-of-bag (OOB) score of 0.82, which estimates the performance of the model on unseen
data. An OOB score of 0.82 indicates that the model is dependable on data that it has not
been trained on, accounting for 82% of the variation in energy usage.
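The out-of-bag estimate can be obtained in scikit-learn by setting oob_score=True: each tree is evaluated on the training samples that were left out of its bootstrap sample, and for a regressor the resulting score is an R-squared value. The sketch below uses 10 trees as stated in the text. The synthetic data and the random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the three features and the energy target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = 40.0 * (X[:, 1] - X[:, 0]) + 10.0 * X[:, 2] + rng.normal(0.0, 1.0, 500)

# With only 10 trees, scikit-learn may warn that a few samples never fall
# out of bag; the score is still computed.
forest = RandomForestRegressor(n_estimators=10, oob_score=True, random_state=0)
forest.fit(X, y)

# R-squared measured on out-of-bag samples only.
oob_r2 = forest.oob_score_
```

Because the OOB score needs no separate validation set, it is a convenient check on generalization when data are limited.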
The Mean Absolute Error is 3.74. MAE measures the average size of the errors in the
predictions, without considering whether the predictions are too high or too low. Here, an
MAE of 3.74 means that, on average, the model’s predictions are off by 3.74 kWh. This gives
us a straightforward understanding of the prediction accuracy.
The R-squared score tells us how well the model explains the variation in the data. An
R² score of 0.89 means the model can explain 89% of the variability in energy usage. This is
a high score, indicating that the model fits the data very well.
The Symmetric Mean Absolute Percentage Error is 17.56. SMAPE measures the size of
the prediction errors relative to the magnitude of the actual and predicted values. A SMAPE
of 17.56% means that the average error in the model's predictions is about 17.56% of the
energy values. This percentage helps us understand the error size in a way that is relative to
the size of the values being predicted.
The Root Mean Squared Error is 5.44. RMSE is similar to MAE but places greater
emphasis on larger errors, because it is the square root of the mean squared difference
between the predicted and actual values. An RMSE of 5.44 kWh, compared with the MAE of
3.74 kWh, indicates that some predictions miss by more than the average error. This
measure is valuable when large errors are especially undesirable.
In summary, the Random Forest model offers a robust and precise forecast of energy
consumption during electric vehicle charging sessions, exhibiting low average errors and a
high capacity to explain the variation in the data.
4.17 Analysis of Support Vector Regression Model
Support Vector Regression is a type of Support Vector Machine (SVM) used for
regression problems.
The Mean Absolute Error is 4.71. MAE quantifies the average magnitude of the
prediction errors, regardless of their direction. An MAE of 4.71 indicates that, on average,
the model's forecasts deviate by 4.71 kWh.
The R-squared score is 0.81. The R-squared score quantifies the degree to which the
model accounts for the variability in the data. An R² value of 0.81 indicates that the model
accounts for 81% of the variance in energy usage, demonstrating solid performance and
capturing most of the significant patterns in the data.
The Symmetric Mean Absolute Percentage Error is 22.24. SMAPE quantifies the size of
the forecast errors relative to the magnitude of the actual and predicted values. Because it is
based on absolute errors, it does not reflect their direction. A SMAPE of 22.24% indicates
that the average error in predictions is approximately 22.24% of the energy values. This
suggests a moderate degree of accuracy: the model performs satisfactorily, but there is still
potential for improvement.
The Root Mean Squared Error is 7.08. RMSE emphasizes larger errors by taking the
square root of the average of the squared differences between predicted and actual values.
An RMSE of 7.08 kWh, compared with the MAE of 4.71 kWh, indicates that some
predictions miss by considerably more than the average error.
In summary, the Support Vector Regression model offers a reasonable forecast of
energy consumption during electric vehicle charging sessions, balancing the average error
magnitude against the ability to account for variation in the data. The model demonstrates
satisfactory performance, but it can still be improved to reduce prediction errors and
increase overall accuracy.
4.18 Analysis of XGBoost Model
XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm that
is used for regression and classification tasks.
The Mean Absolute Error is 3.75. MAE is the average magnitude of the prediction
errors, regardless of their direction. A mean absolute error of 3.75 indicates that, on average,
the model's predictions differ from the actual values by 3.75 kilowatt-hours (kWh). This
metric provides a concise and precise measure of the accuracy of predictions.
The R-squared score is 0.89. A score of 0.89 indicates that the model accounts for
89% of the variability in energy usage, suggesting a strong fit.
The Symmetric Mean Absolute Percentage Error is 18.81. It indicates that the average
error in predictions is roughly 18.81% of the energy values.
The Root Mean Squared Error is 5.49. RMSE is the square root of the average squared
difference between predicted and actual values.
The XGBoost model, in general, is able to reliably predict the amount of energy that
will be consumed during charging sessions for electric vehicles. It achieves an acceptable
balance between the average error magnitude and the ability to explain variability. Although
the model performs satisfactorily, further improvement is needed to reduce errors and
increase overall accuracy.
4.19 Analysis of Voting Regressor Model
We predicted the amount of energy utilized during electric vehicle charging sessions
using a Voting Regressor model. To increase overall accuracy, the Voting Regressor integrates
predictions from multiple models. In particular, we used Random Forest, XGBoost, Support
Vector Regressor, and Linear Regression. Our goal in employing this ensemble of models
was to produce more dependable forecasts than could be produced by a single model.
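A Voting Regressor of this kind can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's code: the data are synthetic, the hyperparameters are defaults, and scikit-learn's GradientBoostingRegressor stands in for XGBoost so that the example depends on a single library.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for session features and energy consumed (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 40.0 * X[:, 0] + 10.0 * X[:, 1] ** 2 + rng.normal(0.0, 2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

voting = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        # Stand-in for XGBoost (assumption made to avoid an extra dependency).
        ("gb", GradientBoostingRegressor(random_state=0)),
        ("svr", SVR()),
        ("lr", LinearRegression()),
    ]
)
voting.fit(X_train, y_train)
ensemble_pred = voting.predict(X_test)

# With default (uniform) weights, the ensemble output is the plain average
# of the fitted base models' predictions.
base_preds = np.column_stack(
    [estimator.predict(X_test) for estimator in voting.estimators_]
)
print(np.allclose(ensemble_pred, base_preds.mean(axis=1)))
```

Averaging tends to cancel errors that the base models do not share, which is why the ensemble can be steadier than its members even when it is not the most accurate on every metric.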
4.19.1 Outcome:
After completing the training of the model, we proceeded to assess its performance
using the test data. The following are the essential measurements:
The Mean Absolute Error is 3.87. This is the average absolute difference, in kWh,
between the predicted and actual energy consumption.
The R-squared score is 0.89 and shows that the model is capable of accounting for
89% of the fluctuations in energy consumption. The high score suggests that the model is a
good fit for the data.
A SMAPE of 19.15% indicates that our estimates, on average, differ by 19.15% from
the actual data. This indicates that the model possesses a satisfactory level of accuracy.
The Root Mean Squared Error is 5.38. This aids in comprehending the overall
precision of the model, particularly in instances where there are significant discrepancies.

4.19.2 Example Predictions:
Below are several instances illustrating how closely the model's forecasts align
with the real values:

Actual Value    Predicted Value
39.975          43.27
30.154          26.82
1.755           1.55
15.632          12.90
65.439          55.20

Table 4.1: Real and predicted values by Voting Regressor

These examples demonstrate that the model typically produces predictions that are
near the actual values, although larger deviations occur, as in the last row (65.439 actual
versus 55.20 predicted).
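The size of these deviations can be made concrete by computing the error for each row of Table 4.1. The sketch below uses only the five pairs shown in the table; the variable names are illustrative, and five rows are far too few to reproduce the test-set metrics reported above.

```python
# (actual, predicted) pairs copied from Table 4.1.
rows = [
    (39.975, 43.27),
    (30.154, 26.82),
    (1.755, 1.55),
    (15.632, 12.90),
    (65.439, 55.20),
]

abs_errors = [abs(actual - predicted) for actual, predicted in rows]
pct_errors = [100.0 * abs(actual - predicted) / actual for actual, predicted in rows]

for (actual, predicted), err, pct in zip(rows, abs_errors, pct_errors):
    print(f"actual={actual:7.3f}  predicted={predicted:6.2f}  "
          f"abs_error={err:6.3f}  pct_of_actual={pct:5.1f}%")

# Mean absolute error over these five rows only.
sample_mae = sum(abs_errors) / len(abs_errors)
largest_error_row = abs_errors.index(max(abs_errors))
print(f"sample MAE = {sample_mae:.3f}, largest error in row {largest_error_row + 1}")
```

The last row alone accounts for more than half of the summed absolute error across the five examples, which shows how a single long session can dominate the error metrics.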
The findings suggest that our Voting Regressor model is proficient at forecasting
energy consumption during electric vehicle charging sessions. By combining the predictions
of several models, it attained an R-squared score of 0.89 and an RMSE of 5.38. This
methodology showcases the efficacy of employing ensemble methods to augment the
dependability and precision of predictions in practical scenarios.

4.20 Analysis of Stacking Regressor Model
The Stacking Regressor is an ensemble learning technique that combines multiple
regression models to improve prediction accuracy.
The technique trains several base regression models, here Linear Regression,
Support Vector Regressor, XGBoost, and Random Forest.
A meta-model, also known as a final model, is then trained on the predictions
generated by the base models. The meta-model learns how to weight the base models'
predictions, with the aim of producing a final prediction that is more accurate than that of
any single base model.
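The two-level structure described above can be sketched with scikit-learn's StackingRegressor. This is an illustration under stated assumptions rather than the project's implementation: the data are synthetic, GradientBoostingRegressor stands in for XGBoost, and a Linear Regression meta-model is assumed because the text does not name the final model.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for session features and energy consumed (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 40.0 * X[:, 0] + 10.0 * X[:, 1] ** 2 + rng.normal(0.0, 2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

base_models = [
    ("lr", LinearRegression()),
    ("svr", SVR()),
    # Stand-in for XGBoost (assumption made to avoid an extra dependency).
    ("gb", GradientBoostingRegressor(random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
]

stack = StackingRegressor(
    estimators=base_models,
    final_estimator=LinearRegression(),  # assumed meta-model
    cv=5,  # meta-model is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)
stacked_pred = stack.predict(X_test)

# The meta-model's coefficients are the learned weights on each base model.
meta_weights = dict(
    zip([name for name, _ in base_models], stack.final_estimator_.coef_)
)
print(meta_weights)
```

Training the meta-model on out-of-fold predictions (the `cv` argument) keeps it from simply rewarding whichever base model overfits the training data most.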
The Stacking Regressor model showed excellent performance, demonstrating high
accuracy and dependability in forecasting energy use during electric vehicle charging
sessions. Through the integration of diverse models, we successfully captured a broad
spectrum of patterns present in the data.
The R2 value of 0.90 suggests that the model is capable of explaining a significant
portion of the variability observed in the data.
The Mean Absolute Error of 3.63 kWh indicates that the average discrepancy between
our estimates and the actual values is rather small.
The SMAPE (18.20%) suggests that our forecasts are rather accurate in relation to the
actual values.
The root mean square error of 5.14 kWh, noticeably larger than the MAE of 3.63 kWh,
indicates that although the majority of errors are small, there are a few large discrepancies.
The Stacking Regressor model, which integrates Linear Regression, Support Vector
Regression, XGBoost, and Random Forest, yielded precise forecasts of energy use during
electric vehicle charging sessions. By employing this strategy, we can take advantage of the
positive aspects of several models, which results in enhanced overall performance.
4.21 Result Comparison:
The rapid increase in the use of electric vehicles has drawn significant attention to
the efficient operation and optimization of electric vehicle charging infrastructure. To ensure
a charging network that is both dependable and effective, it is essential for utilities,
legislators, and owners of electric vehicles to estimate the amount of energy that is consumed
during charging sessions. In this project, we investigated a number of machine learning
algorithms with the goal of accurately predicting that energy consumption. The models
compared in this section are the Neural Network, Random Forest, Support Vector Regression,
XGBoost, Voting Regressor, and Stacking Regressor; Linear Regression serves as a base
model within the two ensembles.
4.21.1 Explanation of Performance Metrics:
Mean Absolute Error: Lower values indicate more accurate predictions. The Stacking
Regressor has the lowest MAE, indicating it provides the most accurate predictions on
average.
R-squared Score: Higher values indicate better model fit. The Stacking Regressor has
the highest R2 score, suggesting it explains the most variance in the data.
Symmetric Mean Absolute Percentage Error: Lower values indicate better relative
prediction accuracy. Random Forest has the lowest SMAPE (17.56%), with the Stacking
Regressor second (18.20%).
Root Mean Squared Error: Lower values indicate smaller errors overall, with large
errors penalized most heavily. The Stacking Regressor has the lowest RMSE, indicating it is
the least affected by large errors.
4.21.2 Detailed Comparison:

Model                        MAE     R2      SMAPE     RMSE
Neural Network               5.87    0.76    26.66%    60.93
Random Forest                3.74    0.89    17.56%    5.44
Support Vector Regression    4.71    0.81    22.24%    7.08
XGBoost                      3.75    0.89    18.81%    5.49
Voting Regressor             3.87    0.89    19.15%    5.38
Stacking Regressor           3.63    0.90    18.20%    5.14

Table 4.2: Evaluation Metrics

Mean Absolute Error:
Measures the average magnitude of the errors in a set of predictions, without
considering their direction. It’s a straightforward measure of accuracy.
Lowest MAE: Stacking Regressor (3.63) – indicates the most accurate model on
average.
Highest MAE: Neural Network (5.87) – indicates the least accurate model on average.
R-squared:
Indicates how well the model explains the variance in the dependent variable. Values
closer to 1 signify better explanatory power.
Highest R2: Stacking Regressor (0.90) – explains 90% of the variance, indicating a
strong model.
Lowest R2: Neural Network (0.76) – explains only 76% of the variance, indicating
weaker performance.
Symmetric Mean Absolute Percentage Error:
Measures the accuracy based on relative errors. It is especially useful for comparing
the performance of models on different scales.
Lowest SMAPE: Random Forest (17.56%) – indicates the smallest relative error.
Highest SMAPE: Neural Network (26.66%) – indicates the largest relative error.
Root Mean Squared Error:
Measures the square root of the average squared differences between predicted and
actual values. It gives higher weight to larger errors, thus highlighting models with larger
discrepancies.
Lowest RMSE: Stacking Regressor (5.14) – indicates the least overall error
magnitude.
Highest RMSE: Neural Network (60.93) – indicates the largest overall error
magnitude.
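The best and worst model for each metric listed above can be recomputed directly from Table 4.2. The sketch below copies the table values as reported; the dictionary layout and function name are illustrative choices.

```python
# Values copied from Table 4.2 (SMAPE in percent).
TABLE_4_2 = {
    "Neural Network":            {"MAE": 5.87, "R2": 0.76, "SMAPE": 26.66, "RMSE": 60.93},
    "Random Forest":             {"MAE": 3.74, "R2": 0.89, "SMAPE": 17.56, "RMSE": 5.44},
    "Support Vector Regression": {"MAE": 4.71, "R2": 0.81, "SMAPE": 22.24, "RMSE": 7.08},
    "XGBoost":                   {"MAE": 3.75, "R2": 0.89, "SMAPE": 18.81, "RMSE": 5.49},
    "Voting Regressor":          {"MAE": 3.87, "R2": 0.89, "SMAPE": 19.15, "RMSE": 5.38},
    "Stacking Regressor":        {"MAE": 3.63, "R2": 0.90, "SMAPE": 18.20, "RMSE": 5.14},
}

# R2 is the only metric here where a larger value is better.
HIGHER_IS_BETTER = {"R2"}


def best_and_worst(metric):
    """Return (best_model, worst_model) for the given metric name."""
    ranked = sorted(
        TABLE_4_2,
        key=lambda name: TABLE_4_2[name][metric],
        reverse=metric in HIGHER_IS_BETTER,
    )
    return ranked[0], ranked[-1]


summary = {metric: best_and_worst(metric) for metric in ("MAE", "R2", "SMAPE", "RMSE")}
for metric, (best, worst) in summary.items():
    print(f"{metric}: best = {best}, worst = {worst}")
```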
Best Overall Model: Stacking Regressor
Strengths: Lowest MAE and RMSE, highest R2, and relatively low SMAPE. This
model balances accuracy and explanatory power, making it the best performer across multiple
metrics.
Second Best Model: Random Forest
Strengths: Second lowest MAE, third lowest RMSE, high R2, and the lowest SMAPE.
This model is highly accurate and reliable.
Model to Improve: Neural Network
Weaknesses: Highest MAE and RMSE, lowest R2, and highest SMAPE. This model
has significant room for improvement in all metrics.
The Stacking Regressor emerged as the most effective model for predicting energy
usage during EV charging sessions. It achieved the best MAE, R2, and RMSE, and the
second-best SMAPE after Random Forest, demonstrating its ability to leverage the strengths
of multiple models and reduce their weaknesses. The success of the Stacking Regressor
highlights the importance of ensemble methods in achieving higher accuracy and reliability
in predictive modeling tasks.

Metric    Stacking Regressor    Voting Regressor    Better Model
MAE       3.63                  3.87                Stacking Regressor
R2        0.90                  0.89                Stacking Regressor
SMAPE     18.20%                19.15%              Stacking Regressor
RMSE      5.14                  5.38                Stacking Regressor

Table 4.3: Stacking Regressor and Voting Regressor Comparison

Chapter 5: Conclusion
This research represents a significant step forward in the field of electric vehicle
infrastructure management, offering a comprehensive analysis of EV charging behaviors and
the application of advanced machine learning techniques to predict and optimize these
behaviors. By analyzing the complex patterns of how and when people charge their EVs, we
have gained valuable insights that not only enhance our understanding of current usage trends
but also pave the way for future improvements in EV infrastructure and energy management.
5.1 The Journey of Discovery
The journey of this research began with a deep dive into data preprocessing, where
careful attention was given to transforming raw data into a form that could be effectively
analyzed. This involved converting charging times into durations, normalizing the data to
ensure consistency across different variables, and carefully handling missing values to
maintain the quality of the dataset. These steps were crucial, as they laid the groundwork for
the subsequent analysis and modeling efforts, ensuring that the models built upon this data
were as accurate and reliable as possible.
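The three preprocessing steps named above can be sketched with pandas as follows. The column names, the sample rows, the choice of median imputation, and the min-max scaling are all assumptions made for illustration; this chapter does not specify the dataset schema or which imputation and normalization methods were used.

```python
import pandas as pd

# Hypothetical charging sessions; column names and values are illustrative.
sessions = pd.DataFrame(
    {
        "start_time": ["2023-01-05 08:00", "2023-01-05 09:30", "2023-01-06 17:15"],
        "end_time": ["2023-01-05 08:25", "2023-01-05 10:40", "2023-01-06 17:50"],
        "start_soc": [45.0, None, 60.0],
        "energy_kwh": [12.4, 30.1, 18.9],
    }
)

# 1. Convert start and end timestamps into a session duration in minutes.
sessions["start_time"] = pd.to_datetime(sessions["start_time"])
sessions["end_time"] = pd.to_datetime(sessions["end_time"])
sessions["duration_min"] = (
    sessions["end_time"] - sessions["start_time"]
).dt.total_seconds() / 60.0

# 2. Handle missing values (assumed strategy: fill with the column median).
sessions["start_soc"] = sessions["start_soc"].fillna(sessions["start_soc"].median())

# 3. Normalize features to the [0, 1] range (assumed strategy: min-max scaling).
for column in ["duration_min", "start_soc"]:
    low, high = sessions[column].min(), sessions[column].max()
    sessions[column + "_norm"] = (sessions[column] - low) / (high - low)

print(sessions[["duration_min", "start_soc", "duration_min_norm", "start_soc_norm"]])
```

In a real pipeline the scaling bounds and the imputation value would be computed on the training split only and then reused on the test split, so that no information leaks from the test data.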
As we moved into the analysis phase, the data began to reveal fascinating stories
about how people interact with EV charging stations. We discovered that many users prefer
short, frequent charging sessions—likely a reflection of the convenience offered by the
increasing availability of fast-charging stations. This behavior suggests a shift in how drivers
approach charging, opting for quick top-ups rather than waiting for a full charge, which
aligns with the fast-paced nature of modern life.
Furthermore, the data showed that users tend to start charging when their battery's
state of charge (SOC) is at a mid-level, avoiding both deep discharges and starting with a
nearly full battery. This finding indicates a strategic approach to battery management, where
drivers are mindful of maintaining battery health while ensuring they have enough charge for
their next journey. By ending their sessions with a high SOC, drivers also minimize range
anxiety, ensuring they have ample battery capacity for upcoming trips.
5.2 Achievements in Predictive Modeling
One of the most significant contributions of this research lies in the development and
evaluation of various machine learning models designed to predict energy consumption
during EV charging sessions. Isolation Forest was used for anomaly detection, while Neural
Networks, Support Vector Regression, Random Forest, XGBoost, and ensemble methods such
as the Stacking and Voting Regressors allowed us to explore different approaches to
predictive modeling, each offering unique strengths and insights.
The Stacking Regressor emerged as the most effective model, achieving the best
Mean Absolute Error, R-squared (R²) score, and Root Mean Squared Error of all models, with
a Symmetric Mean Absolute Percentage Error (SMAPE) second only to Random Forest.
This model’s success highlights the power of ensemble learning, where combining the
strengths of multiple models leads to more accurate and robust predictions. The Stacking
Regressor’s ability to leverage diverse inputs and methodologies makes it particularly
well-suited for the complex task of predicting energy usage in EV charging scenarios.
The predictive power of these models offers benefits for both EV drivers and the
broader infrastructure that supports them. For drivers, accurate predictions mean less time
spent waiting at charging stations, as well as greater confidence in the availability of charging
options when needed. For utilities and infrastructure managers, these models provide a
valuable tool for optimizing the distribution of energy resources, helping to prevent grid
overloads and ensure that charging stations are used efficiently.
5.3 Real-World Applications and Implications
The practical implications of this research are far-reaching. As electric vehicles
continue to gain popularity, the demand for efficient and reliable charging infrastructure will
only grow. This research equips stakeholders with the tools they need to manage this growing
demand effectively. By understanding charging behaviors and predicting energy
consumption, utilities can better manage the load on the power grid, ensuring that energy is
available where and when it is needed most.
For policymakers, the insights gained from this research can inform the development
of data-driven strategies to expand EV infrastructure in a way that aligns with user behavior.
For instance, the preference for short, frequent charging sessions suggests that more
fast-charging stations should be placed in high-traffic areas, while also ensuring that slower, Level
2 chargers are available for those who need longer, deeper charges. These strategies can help
ease the transition to electric vehicles, making it smoother and more appealing for the general
public.
Moreover, the ability to detect anomalies in charging behavior using models like
Isolation Forest can be used to identify potential issues with charging infrastructure or to
tailor services to meet the specific needs of different user groups. For example, detecting
unusual patterns in charging behavior could indicate the need for maintenance or the addition
of new charging stations in underserved areas.
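Anomaly detection of this kind can be sketched with scikit-learn's IsolationForest. The example below is a minimal illustration on synthetic data: the two features (session duration and energy delivered), the injected unusual sessions, and the `contamination` value are assumptions, not settings taken from this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic sessions: columns are duration (minutes) and energy (kWh).
typical = np.column_stack(
    [rng.normal(30.0, 8.0, size=200), rng.normal(20.0, 5.0, size=200)]
)
# Two deliberately unusual sessions: very long with almost no energy delivered,
# and very short with implausibly high energy.
unusual = np.array([[300.0, 1.0], [2.0, 90.0]])
sessions = np.vstack([typical, unusual])

# contamination is the assumed share of anomalous sessions in the data.
detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit_predict(sessions)  # -1 = anomaly, 1 = normal

flagged = sessions[labels == -1]
print(f"{len(flagged)} of {len(sessions)} sessions flagged for review")
print(flagged)
```

A flagged session is only a candidate for inspection; whether it reflects a faulty charger, a data-logging error, or genuine but rare user behavior has to be decided from station records.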
5.4 Future Directions and Opportunities
While this research has made significant strides, it also opens up several exciting
avenues for future work. One promising direction is the incorporation of additional features
into the predictive models. Factors such as weather conditions, time of day, geographical
location, and individual driving habits could be integrated to enhance the accuracy and
relevance of the predictions. These additional variables could help refine the models, making
them more sensitive to the details of real-world charging behavior.
Another critical area for future exploration is the development of models capable of
real-time predictions. As the adoption of EVs continues to rise, the ability to provide
instantaneous feedback based on live data from charging stations could prove invaluable.
Real-time models could be integrated with smart grids and other advanced infrastructure to
dynamically manage charging loads, prevent grid overloads, and optimize the distribution of
resources in real time.
The personalization of predictions also represents an exciting frontier. By leveraging
user-specific data, such as individual charging habits and preferences, future models could
offer personalized recommendations and forecasts. This approach would not only improve
the accuracy of predictions but also enhance the user experience by providing tailored
insights that align with each driver's unique needs and circumstances.
The integration of these models with smart technology, such as IoT devices and smart
meters, is yet another promising area for exploration. Such integration would allow for
seamless communication between vehicles, chargers, and the power grid, facilitating more
efficient energy management and enabling the development of more sophisticated charging
strategies. For instance, smart meters could provide real-time data on household energy
usage, which could be factored into the timing and intensity of EV charging, thereby reducing
strain on the grid during peak hours.
Long-term testing and validation of these models are also critical. While the models
developed in this study have shown strong performance, their reliability over extended
periods and across different contexts needs to be tested. Long-term studies that track the
performance of these models over time would provide valuable insights into their robustness
and adaptability. Such studies could also help identify any potential degradation in model
performance and guide the development of strategies to reduce these effects.
5.5 Broader Impact and Societal Contributions
The broader impact of this research extends beyond the immediate improvements in
EV charging infrastructure. As electric vehicles become increasingly central to global efforts
to reduce carbon emissions and combat climate change, the ability to efficiently manage
charging infrastructure will be critical. The insights and models developed in this study can
contribute to the broader goal of creating a more sustainable and resilient energy system.
Policymakers can leverage the findings of this research to craft regulations that
support the expansion of EV infrastructure in a way that maximizes efficiency and user
satisfaction. For example, policies that encourage the development of fast-charging networks
in strategic locations, or that promote the integration of smart grid technologies, could be
informed by the charging behaviors and patterns identified in this study. Additionally, the
ability to predict and manage charging demand could help stabilize energy prices and reduce
the need for expensive upgrades to the power grid, ultimately benefiting consumers.
Furthermore, as the market for electric vehicles continues to grow, the demand for
accurate and reliable predictive models will only increase. This research positions itself at the
forefront of this emerging field, providing a foundation upon which future innovations can be
built. The development of models that can adapt to new technologies, such as wireless
charging or vehicle-to-grid systems, will be essential as the EV landscape evolves.
In addition, this research has the potential to influence broader societal shifts towards
more sustainable transportation solutions. As governments and industries worldwide seek to
reduce their carbon footprints, the findings of this study can inform policies and practices that
support the transition to electric vehicles. By improving the efficiency and accessibility of EV
charging infrastructure, we can encourage more people to make the switch to electric
vehicles, contributing to global efforts to mitigate climate change.
5.6 Final Reflections and the Road Ahead
In reflecting on the journey of this research, it is clear that we have made significant
progress in understanding and predicting EV charging behavior. The models and insights
developed here offer practical tools for improving the efficiency of EV infrastructure,
enhancing user experience, and supporting the broader transition to electric mobility.
However, the road ahead is still full of opportunities for further exploration and innovation.
As we look to the future, the potential for continued advancements in this field is
great. By continuing to refine our models, exploring new technologies, and fostering
collaboration across industries and disciplines, we can create a more sustainable, efficient,
and user-friendly EV infrastructure. This isn't just about making things work better—it's
about creating a future where clean, electric transportation is accessible and convenient for
everyone.
Ultimately, this research is about more than just technology; it's about making a
positive impact on the world. By improving the way we manage and use energy, we can help
build a more sustainable future for generations to come. The insights and tools developed in
this study are a step in that direction, and with continued effort and innovation, we can
continue to make progress toward a cleaner, greener world.

References

[1]

International Energy Agency, "Key World Energy Statistics 2020," IEA, Paris,France, 2020.

[2]

CEDAMIA (Climate Emergency Declaration and Mobilisation in Action), "Climate Emergency
Declaration and Mobilisation in Action," CEDAMIA , [Online]. Available:
https://www.cedamia.org/global/. [Accessed 23 08 2024].

[3]

United Nations Department of Economic and Social Affairs (UN DESA), "68% of the World
Population Projected to Live in Urban Areas by 2050, says UN," 2018 May 2018. [Online].
Available: https://www.un.org/development/desa/en/news/population/2018-revision-of-worldurbanization-prospects.html. [Accessed 23 August 2024].

[4]

X.Gong, X.Zhang, F.Gao and Z.Wang, "Comparison of Climate Change Impact Between Power
System of Electric Vehicles and Internal Combustion Engine Vehicles," in Advances in Energy
and Environmental Materials, 2018, pp. 739-747.

[5]

International Energy Agency (IEA), "Global EV Outlook 2019: Scaling-up the transition to
electric mobility," OECD Publishing, Paris, France, 2019.

[6]

K. Yeongmin, S. Sanghoon and K. Jang, "User satisfaction with battery electric vehicles in South
Korea,," Transportation Research Part D: Transport and Environment, vol. 82, 2020.

[7]

M. Qasem and J.Jung, "A Comprehensive State-of-the-Art Review of Wired/Wireless Charging
Technologies for Battery Electric Vehicles: Classification/Common Topologies/Future Research
Issues," IEEE Access, vol. 9, pp. 19572-19585, 2021.

[8]

J.Wamburu, S.Lee, P.Shenoy and D.Irwin, "Analyzing Distribution Transformers at City Scale and
the Impact of EVs and Storage," in Association for Computing Machinery, New York, USA, 2018.

[9]

J. García-Álvarez, M. Á. González and C. Vela, "Metaheuristics for solving a real-world electric
vehicle charging scheduling problem}," Elsevier Science Publishers B. V., vol. 65, p. 292–306,
2018.

[10] I. Veza, M. Z. Asy'ari, M. Idris, Vorathin.Epin, I. R. Fattah and M. Spraggon, "Electric vehicle
(EV) and driving towards sustainability: Comparison between EV, HEV, PHEV, and ICE vehicles
to achieve net zero emissions by 2050 from EV," Elsevier, vol. 82, pp. 459-467, 2023.
[11] O. Frendo, J. Graf, N. Gaertner and H. Stuckenschmidt, "Data-driven smart charging for
heterogeneous electric vehicle fleets," Energy and AI, vol. 1, 2020.
[12] D.Ronanki, A.Kelkar and S.S.Williamson, "Extreme Fast Charging Technology—Prospects to
Enhance Sustainable Electric Transportation," Energies, vol. 12, no. 19, 2019.

81

[13] S. Ai, A. Chakravorty and R. Chunming, "Household Power Demand Prediction Using
Evolutionary Ensemble Neural Network Pool with Multiple Network Structures," Sensors (Basel),
vol. 19, no. 3, 2019.
[14] Y.Yang, Z.Tan and Y.Ren, "Research on Factors That Influence the Fast Charging Behavior of
Private Battery Electric Vehicles," Sustainability, vol. 12, no. 8, 2020.
[15] J.Mies, J.Helmus and R. d. Hoed, "Estimating the Charging Profile of Individual Charge Sessions
of Electric Vehicles in The Netherlands," World Electric Vehicle , vol. 9, no. 2, 2018.
[16] Y.Lu, Y.Li, D.Xie, E.Wei, H.Bao, H.Chen and X.Zhong, "The Application of Improved Random
Forest Algorithm on the Prediction of Electric Vehicle Charging Load," Energies , vol. 11, no. 11,
2018.
[17] Z. Xu, "Forecasting Electric Vehicle Arrival &amp; Departure Time On UCSD Campus using
Support," San Diego, 2014.
[18] O. Frendo, N. Gaertner and H. Stuckenschmidt, "Improving Smart Charging Prioritization by
Predicting Electric Vehicle Departure Time," IEEE Transactions on Intelligent Transportation
Systems, vol. 22, no. 10, p. 6646–6653, 2021.
[19] M. Shariatzadeh, C. H. Antunes and M. A. Lopes, "Charging scheduling in a workplace parking
lot: Bi-objective optimization approaches through predictive analytics of electric vehicle users'
charging behavior," Sustainable Energy, Grids and Networks, vol. 39, 2024.
[20] L. Gill, M. Kootstra, E. Huber, C. McLean and B. Fooks, "Midterm Reliability Analysis,"
California Energy Commission, 2021.
[21] C. Davenport, L. Davenport and B. Plumer, "California to Ban the Sale of New Gasoline Cars,"
The New York Times, 24 Aug 2022.
[22] California Independent System Operator, "What is a Flex Alert?," California ISO, 2024. [Online].
Available: https://flexalert.org/what-is-flex-alert. [Accessed July 2024].
[23] M.Waskom, "seaborn.boxplot," Seaborn, 2012 - 2024. [Online]. Available:
https://seaborn.pydata.org/generated/seaborn.boxplot.html. [Accessed July 2024].
[24] M.Yi, "A complete guide to box plots," Atlassian, 2024. [Online]. Available:
https://www.atlassian.com/data/charts/box-plot-completeguide#:~:text=A%20box%20plot%20(aka%20box,line%20marking%20the%20median%20value..
[Accessed July 2024].
[25] J.Frot, "Using Histograms to Understand Your Data," [Online]. Available:
https://statisticsbyjim.com/basics/histograms/. [Accessed July 2024].
[26] M.Yi, "A complete guide to scatter plots," Atlassian, 2024. [Online]. Available:
https://www.atlassian.com/data/charts/what-is-a-scatter-plot. [Accessed July 2024].
82

[27] I. Witten and E. Frank, Data Mining Practical Machine Learning Tools and Techniques, San
Francisco, California: Diane Cerra, 2005.
[28] A. Faul, A Concise Introduction to Machine Learning, CRC Press, 2019.
[29] D. ElMenshawy, W. Helmy and N. El-Tazi, "A Novel Approach for Collective Anomaly Detection
in Internet of Things," in Association for Computing Machinery, 2020.
[30] Y. Qin and Y. Lou, "Hydrological Time Series Anomaly Pattern Detection based on Isolation
Forest," in Information Technology, Networking, Electronic and Automation Control Conference
(ITNEC), Chengdu, China, 2019.
[31] S. Sakib, AL-Ali, A. Osman, S. Dhou and M. Nijim, "Prediction of EV Charging Behavior Using
Machine Learning," IEEE Access, vol. 9, 2021.
[32] L.Wei, "Genetic Algorithm Optimization of Concrete Frame Structure Based on Improved
Random Forest," in 2023 International Conference on Electronics and Devices, Computational
Science (ICEDCS), 2023.
[33] Y.-W. Chung, B. Khaki and C.-C. Chu, "Ensemble machine learning-based algorithm for electric
vehicle user behavior prediction," Applied Energy, vol. 254, no. 2, 2019.
[34] D. Yuan, J. Huang, X. Yang and J. Cui, "Improved random forest classification approach based on
hybrid clustering selection," Chinese Automation Congress (CAC), pp. 1559-1563, 2020.
[35] G. Gupta and N. Rathee, "Performance comparison of Support Vector Regression and Relevance
Vector Regression for facial expression recognition," International Conference on Soft Computing
Techniques and Implementations (ICSCTI), pp. 1-6, 2015.
[36] Kavitha S, Varuna S and Ramya R, "A comparative analysis on linear regression and support
vector regression," Online International Conference on Green Engineering and Technologies (ICGET), pp. 1-5, 2016.
[37] Y. Zhou, X. Song and M. Zhou, "Supply Chain Fraud Prediction Based On XGBoost Method," in
IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things
Engineering (ICBAIE), 2021.
[38] C. Sheng and H. Yu, "An optimized prediction algorithm based on XGBoost," in International
Conference on Networking and Network Applications (NaNA), 2022.
[39] A. H. Syed and T. Khan, "A Supervised Multi-tree XGBoost Model for an Earlier COVID-19
Diagnosis Based on Clinical Symptoms," in 7th International Conference on Data Science and
Machine Learning Applications (CDMA), 2022.
[40] H. Chen, H. Ai, Z. Yang, W. Yang, Z. Ye and D. Dong, "An Improved XGBoost Model Based on
Spark for Credit Card Fraud Prediction," in h IEEE International Symposium on Smart and

83

Wireless Systems within the International Conferences on Intelligent Data Acquisition and
Advanced Computing Systems, Dortmund, Germany, 2020.
[41] Y. Li, Z. He, Y. Zhang, W. Zhang, L. Guo and C. Du, "Downlink Channel Parameter Prediction
Based on Stacking Regressor in FDD Massive MIMO Systems," in 7th International Conference
on Computer and Communication Systems (ICCCS), 2022.
[42] J. Pereira-pires, J. Silva, A. Mora and J. Fonseca, "Using Sentinel-2 and Stacking Regressors for
Forest Height Estimation," in IEEE International Geoscience and Remote Sensing Symposium,
2023.
[43] R. Herbrich and T. Graepel, Ensemble Methods Foundations and Algorithms, Cambridge:
Microsoft Research Ltd, 2012.
[44] GeeksforGeeks, "Voting Regressor," Sanchhaya Education Private Limited, 25 Oct 2023.
[Online]. Available: https://www.geeksforgeeks.org/voting-regressor/. [Accessed July 2024].
[45] K. Matsuura and C. J. Willmott, "Advantages of the mean absolute error (MAE) over," Climate
Research, vol. 30, pp. 79-82, 2005.
[46] R. J.Hyndman and A.B.Koehler, "Another Look at Measures of Forecast Accuracy," International
Journal of Forecasting, vol. 22, pp. 679-688, 2006.
[47] J. L. Devore, Probability and Statistics for Engineering and the Sciences, 8th ed., M.Julet, Ed.,
Boston: Cengage Learning, 2011.
[48] B. E. Flores, "A Progmatic View of Accuracy Measurement inForecasting," Omega, vol. 14, no. 2,
pp. 93-98, 1986.

84