DATA VISUALIZATION AND PREDICTIVE MODELING FOR IDENTIFYING
COMORBIDITIES IN DIABETIC PATIENTS

by

Giridhar Krishnan
Bachelor of Engineering (B.E.), MNM Jain Engineering College
Anna University, Chennai, 2009

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA
November 2020

© Giridhar Krishnan, 2020

i

Abstract
Diabetes is one of the most common chronic diseases in the world. Diabetic patients are also
more susceptible to develop additional comorbidities over time even causing death. This makes
it essential to identify the risk of developing comorbidities as early as possible for effective
diabetes management and to reduce the burden on healthcare system. Large volumes of clinical
data which has been collected over the years has potential to be translated into meaningful
information to enable healthcare professionals gain insights into diabetic patient comorbidities.
This research has two key contributions. First, an interactive diabetes dashboard is developed
in which the data is integrated and shown in the form of visually appealing charts, graphs and
tables. The dashboard displays aggregated results with drilldown capabilities to allow
navigation at finer granularities of various metrics. Second, predictive models are built to
forecast the likelihood of one of the three common comorbidities for diabetic patients – Benign
Hypertension, Congestive Heart Failure, and Acute Renal Failure. The models use advanced
data mining algorithms such as Logistic Regression, Neural Network, CHAID, Bayesian
Network, Random Forest and Ensemble. Results from these models are also incorporated into
an interactive assessment tool that has the ability to take user input and predict the likelihood
of one of these comorbidities. Northern Health (NH) dataset consisting exclusively of diabetic
patients is used for this research.

ii

Acknowledgement

There were a lot of people who have supported me in this journey at University of
Northern British Columbia and I would like to express my appreciation and sincere
gratitude to all of them.

Firstly, I would like to acknowledge the traditional and unceded territory of Lheidli
Tenneh on which UNBC stands. I am thankful to have had the opportunity to pursue
my research here.

I would like to express my sincere gratitude to my supervisor Dr. Waqar Haque as
without his guidance, effort, motivation and unconditional support, this research would
not have been possible. Also, I would like to thank my committee members Dr. Fan
Jiang and Dr. Pranesh Kumar who have been very kind with their support and
guidance for my research. I would also like to thank every faculty member who has
helped me in my time as a graduate student.

Then I would like to thank my family and friends in Canada and India for supporting
me through the time as a graduate student. I would like to give a special mention to
every team member of BIRG lab I have worked with during this time and I am thankful
for their motivation and support which meant a lot to me. Also, a special mention to
all my family and friends in Prince George, Vancouver, Ontario and India who have all
been very kind and supportive during my time as a graduate student.

iii

Additionally, I would also like to thank Northern Health for providing me access to the
data for my research.

Giridhar Krishnan

iv

Table of Contents
Abstract....................................................................................................................................... ii
Acknowledgement...................................................................................................................... iii
List of Figures............................................................................................................................. vii
List of Tables..............................................................................................................................viii
Chapter 1 ............................................................................................................................................ 1
Introduction.................................................................................................................................... 1
1.1

Knowledge Discovery in Databases (KDD) and Diabetes ................................................ 6

1.2

Data Visualization............................................................................................................ 9

1.3

Current State & Motivation .......................................................................................... 11

1.4

Problem Statement.......................................................................................................13

1.4.1How to enhance diabetes management using intuitive visualization techniques? .....13
1.4.2

What are the vital risk factors for diabetes comorbidities? .................................14

1.4.3

What is the likelihood of a patient to be diagnosed with other comorbidities? ..15

1.4.4

Methodology.........................................................................................................16

1.5

Contributions ................................................................................................................17

Chapter 2 ..........................................................................................................................................19
Related Work................................................................................................................................ 19
2.1

Diabetes and Data mining............................................................................................. 20

2.2

Diabetes Calculator .......................................................................................................28

2.3

Data Visualization and Diabetes ...................................................................................32

2.4

Summary .......................................................................................................................35

Chapter 3 ..........................................................................................................................................38
Methodology ................................................................................................................................ 38
3.1

Proposed Model............................................................................................................38

3.2

Data Source ...................................................................................................................41

3.3

Data Preprocessing .......................................................................................................42

3.4

Inclusion and exclusion .................................................................................................46

3.5

Predictive Modeling Inclusions/Exclusions:..................................................................46

3.6

Predictive Modeling ......................................................................................................52

3.7

IBM SPSS Modeler.........................................................................................................59

3.8

Challenges .....................................................................................................................60

3.9

Data Visualization..........................................................................................................65

3.9.1

Dashboard.............................................................................................................65

3.10

SSRS............................................................................................................................... 66

3.11

Summary .......................................................................................................................67
v

Chapter 4 ..........................................................................................................................................69
Experiments and Results..............................................................................................................69
4.1

Diabetes Dashboard......................................................................................................70

4.1.1

Diabetes Types and Comorbidities .......................................................................78

4.1.2

HSDA Comparison .................................................................................................86

4.1.3

Summary ...............................................................................................................89

4.2

Predictive Modeling ......................................................................................................90

4.2.1

Training Models ....................................................................................................90

4.2.2

Testing Models......................................................................................................96

4.2.3

Ensemble.............................................................................................................100

4.2.4

Analysis of Results............................................................................................... 101

4.2.5

Analysis of Variables ........................................................................................... 106

4.2.6

Diabetes Comorbidities Assessment Tool........................................................... 109

4.2.7

Summary .............................................................................................................112

Chapter 5 ........................................................................................................................................114
Conclusion and Future Work .....................................................................................................114
5.1

Future Work ................................................................................................................119

References .......................................................................................................................................... 121

vi

List of Figures
Figure 1 Impact of Diabetes on Human Body [2] ............................................................................ 3
Figure 2 CCHS 2017 Diabetes Chart [2] .......................................................................................... 4
Figure 3 Canadian Diabetes Association Infographic [7] ............................................................... 5
Figure 4 KDD Steps [9] ....................................................................................................................... 7
Figure 5 Data Mining Process [11] .................................................................................................... 8
Figure 6 Screening of T1D patients [14].........................................................................................10
Figure 7 CDSS Diabetic Patients Complications ..........................................................................34
Figure 8 Components for Predictive Modeling and Data Visualization......................................39
Figure 9 Tasks in Data Preprocessing [33] ....................................................................................43
Figure 10 Feature Selection Model .................................................................................................47
Figure 11 FS Model Results .............................................................................................................48
Figure 12 Neural Network Mapping.................................................................................................54
Figure 13 Logistic Regression Predictor Importance....................................................................56
Figure 14 Data Mining Process for Entire Sample Data [36].......................................................58
Figure 15 Data Mining Process for Partitioned Sample Data [36] ..............................................58
Figure 16 Data Inconsistency Example .........................................................................................61
Figure 17 Data Inconsistency I100 (hypertension) .......................................................................62
Figure 18 Diabetic Patient with Kidney Disease............................................................................64
Figure 19 COVID Prevalence in the World [29].............................................................................67
Figure 20 Diabetes Dashboard ........................................................................................................70
Figure 21 Diabetes Dashboard Overall Statistics .........................................................................71
Figure 22 Diabetes Dashboard - Patients/Admissions Drilldown ...............................................72
Figure 23 Diabetes Dashboard - Patients/Admissions by Year..................................................73
Figure 24 Diabetes Dashboard – Patients by Diabetes Type (Yearly) ......................................74
Figure 25 Patients with Comorbidities ............................................................................................ 75
Figure 26 Diabetes Dashboard – Prominent LHAs with Diabetic Patients ............................... 76
Figure 27 Diabetes Dashboard - Prevalence of Diabetes by LHAs ...........................................77
Figure 28 Diabetes Types/Comorbidities Dashboard...................................................................79
Figure 29 Diabetes Types/Comorbidities Dashboard Statistics..................................................79
Figure 30 Diabetes Types/Comorbidities Dashboard - Diagnosis Codes/ Diabetes Types ...80
Figure 31 Diabetes Types/Comorbidities Dashboard - Diagnosis Codes/ Diabetes Types ...81
Figure 32 Diabetes Comorbidities Dashboard- T2D Comorbidities ...........................................82
Figure 33 Diabetes Comorbidities Dashboard- T1D/Other Diabetes Comorbidities ...............83
Figure 34 Comorbidities Dashboard- Diabetes Specific Diagnosis Codes ............................... 84
Figure 35 Diabetes Types/Comorbidities Dashboard- Diabetes Diagnosis Codes Drilldown 85
Figure 36 Diabetes HSDA Dashboard ............................................................................................ 86
Figure 37 HSDA Dashboard - Patients/Visits Drilldown............................................................... 88
Figure 38 Predictive Modeling Training .......................................................................................... 92
Figure 39 Predictive Modeling Training - Type Node ...................................................................94
Figure 40 Predictive Modeling Training - Analysis Node ............................................................. 95
Figure 41 Predictive Modeling Testing............................................................................................ 96
Figure 42 Predictive Modeling Testing - Type Node ....................................................................98
Figure 43 Predictive Modeling Ensemble Training/Testing .........................................................99
Figure 44 Predictive Modeling - I100 Results ..............................................................................102
Figure 45 Predictive Modeling - I500 Results ..............................................................................103
vii

Figure 46 Predictive Modeling - N179 Results ............................................................................104
Figure 47 Predictive Modeling Accuracy for Patients with N179 ..............................................105
Figure 48 Predictive Modeling using Feature Selection (I100, I500, N179) ........................... 106
Figure 49 Feature Selection Results (I100, I500, N179)............................................................ 107
Figure 50 Diabetes Comorbidities Tool - User Input ..................................................................110
Figure 51 Diabetes Comorbidities Tool - Output for I100 .......................................................... 111
Figure 52 Comorbidities for Hospitalized Diabetic Patients in Canada [52] ........................... 115

List of Tables
Table 1 Estimated prevalence and cost of Diabetes [8]................................................................. 6
Table 2 Comparison of Data Mining Models..................................................................................24
Table 3 Diabetes Risk Calculator Results for the United States [28].........................................31
Table 4 Top Twenty Diagnostic Codes by Count..........................................................................44
Table 5 Diagnosis/Patient Distribution ............................................................................................ 60
Table 6 Training/Testing Datasets...................................................................................................91
Table 7 Top Seven Diagnosis Codes............................................................................................ 109

viii

Chapter 1
Introduction
Diabetes, or Diabetes Mellitus, is a chronic disease in which the body cannot either
produce or utilize insulin. Insulin is a hormone which controls the amount of glucose
(sugar) in blood. Elevated blood sugar levels may lead to damage of vital organs and can
be fatal. There are three main types of Diabetes Mellitus [1]:
1. Type 1 diabetes (T1D) - occurs when body does not produce enough insulin (the
cause is unknown)
2. Type 2 diabetes (T2D) - starts with insulin resistance and can progress to a lack
of insulin (primary causes are lack of physical activity and obesity)
3. Gestational diabetes – occurs in pregnant women with no history of diabetes
According to the Public Health Agency of Canada (PHAC), 5 to 10% of diabetes patients
have T1D and the remainder have T2D. Four percent of all pregnant women are affected
by gestational diabetes which puts both the baby and mother at risk [2]. The cause for
T1D and gestational diabetes has not yet been discovered by scientists. However, the list
of risk factors for T2D are known to include [3]:
·

Being overweight or obese

·

Prediabetes (a condition that may occur before developing T2D)

·

Advanced age

·

Physical inactivity

1

·

Having high blood pressure and/or high cholesterol

·

Having a family history of diabetes

·

Belonging to certain high-risk ethnic populations (e.g. Aboriginal, African, Hispanic,
Asian)

·

Having a history of gestational diabetes

·

Having other conditions which may include vascular disease, polycystic ovary
syndrome, and schizophrenia

In prediabetes, the blood sugar levels are higher than normal but lower than the threshold
which defines T2D. Prediabetes and T2D can be prevented by maintaining a healthy
lifestyle, eating a balanced diet, and ensuring regular physical activity [4]. Undiagnosed
T2D in Canadian adults was found to be 1.13% contributing to 20% of total T2D patients
[5]. Diabetes also leads to other comorbidities and puts a great burden on patients as well
as the healthcare system. This disease can impact the entire human body from head to
toe, causing blindness, stroke, heart attack, kidney failure and even non-traumatic
amputations (Figure 1). Early detection of prediabetes can help prevent diabetes, and
early diagnosis of T2D can help physicians recommend guidelines to ensure a healthy
post-diabetes lifestyle to lessen the chances of developing related comorbidities.

2

Figure 1 Impact of Diabetes on Human Body [2]

Diabetes is a disease that needs to be monitored constantly. Even the slightest changes
in the health of diabetic patients can have adverse effects on their wellbeing and in some
cases even lead to death. Diabetes is often considered a modern society disease which
can lead to other complications listed earlier. The Canadian Community Health Survey
(CCHS) has been collecting information related to health status, healthcare utilization and
health determinants for the Canadian population (Figure 2). It produces an annual micro
data file which can be used to extract information related to diabetes as well as other
health related data [6].

3

Figure 2 CCHS 2017 Diabetes Chart [2]

According to Diabetes Canada, 29% of Canadians are affected by diabetes. One million
Canadians have diabetes but are yet to be diagnosed, and 3.9 million Canadians have
been diagnosed with diabetes. Statistics for prediabetes are also a great concern with
an alarming number of 5.7 million Canadians. Cumulatively, out of 37 million, more than
10 million people have diabetes or prediabetes (Figure 3). This number is expected to
reach 33% by 2025 [7].

4

Figure 3 Canadian Diabetes Association Infographic [7]

While lack of physical activity, overweight and obesity makes one more vulnerable to
diabetes, an additional observation was that clinically depressed people have 40%-60%
increased risk of being diagnosed with T2D. The Diabetes Canada Backgrounder
published in February 2020 has observed that 45.4% adults and 44.5% youth are
physically inactive, 23.7% youth are either overweight or obese and 26.8% adults are
living with obesity. The estimated mortality rate for Canadians with diabetes was twice in
comparison with those without diabetes. Diabetes also has a significant cost impact with
majority of patients in Canada paying more than 3% of their income for the treatments
(Table 1). This cost is estimated to grow to $4.9 billion in 2030 [8].

5

Table 1 Estimated prevalence and cost of Diabetes [8]

1.1

Knowledge Discovery in Databases (KDD) and Diabetes

Healthcare is one of the fields where foreseeing future outcomes and possibilities can be
utilized effectively. Diabetes is one such disease, where early detection and management
is vital to address the related health concerns. Foreseeing the possibility of a patient
having diabetes and related comorbidities would be highly beneficial and this can be
accomplished using predictive modeling which analyzes patterns and correlations in
historical data. The entire process, methods, theories and techniques involved to make
sense of available data is called Knowledge Discovery in Databases (KDD). Figure 4
illustrates the basic steps involved in KDD [9].

6

Figure 4 KDD Steps [9]

Data mining is a vital step of the knowledge discovery process and involves cleansing,
integrating,

mining,

selecting,

modeling,

pattern

recognition

and

knowledge

representation of massive amounts of data (Figure 5). This process discovers unknown
patterns that provide useful results and plays a valuable role in healthcare research to
improve quality of life of patients diagnosed with health conditions [10]. Data mining can
also be interfaced with statistics, machine learning, neural networks and inductive logic
programming to play an important and decisive role in diabetes research [11]. Machine
Learning is the process in which machines learn and adapt from experience by repeating
a task for n number of times which in turn improves the performance. It is imperative to
note that machine learning and data mining are two terms that are closely related with the
latter being more generic. Thus, in literature, machine learning methods are also
sometimes referred to as data-mining methods [9].
In healthcare, interfacing data mining with data warehousing and using Online Analytical
Processing (OLAP) can enable efficient decision making. Data warehouses contain
consolidated data which facilitates complex analyses and visualization through OLAP

7

[12]. OLAP has the capability to perform various operations such as rollup, drill-down,
slice and dice, and pivot on the data warehouse. Rollup and drilldown increases and
decreases the level of aggregation, respectively; slice and dice is used to select specific
dimensions, and pivot re-orients the multidimensional view of the data warehouse.
Machine learning and data mining can be utilized to extract knowledge from huge
volumes of diabetes–related data. Data mining algorithms are used to identify correlations
between different variables in the data source and build predictive models. These models
have insightful information of diabetic patients, comorbidities and other demographics for
the purposes of clinical administration, diagnosis as well as management of diabetes.

Figure 5 Data Mining Process [11]

8

1.2

Data Visualization

Interpreting data can be a complicated and tedious task. Complex statistics and equations
may not be understood by all, but when visualized it can be made more relevant to the
end user. Interactive data visualizations can help users to quickly identify patterns and
trends which can enable effective decisions. An effective way to represent data is through
dashboards which can translate key performance of organizations into visual displays.
Dashboards allow visualization of huge amounts of data in an intuitive manner using
charts, graphs, gauges and more. Interactive dashboards with color-coded visualizations
are more appealing to the end user and enhances their experience.
In healthcare, time is vital. Professionals have to make decisions rapidly to ensure optimal
care for the well-being of patients and manage resources efficiently. Research shows that
dashboards significantly reduce time when compared with the conventional approach of
using electronic health records (EHR) for analysis and management [13]. For instance, a
study compared the time for ten physicians to access ten common variables for two
diabetic patients with similar volumes of clinical data using conventional EHR and a
diabetes dashboard [13]. The mean time taken to access the ten variables for two diabetic
patients was 1.9 minutes using the dashboard and 6.3 minutes with the conventional
approach, showing that dashboards can significantly help reduce time spent by
physicians and help optimize patient management. The research further established that
usability analysis tools like dashboards can be an insightful asset for health care
information technology [13].
In 2017, an electronic diabetes dashboard, iScreen, designed by the Canadian Diabetes
Association was introduced [14]. For this research, T1D patients between 14-18 years
9

were assessed for other comorbidities. Fifty charts were used for review, 25 using iScreen
and 25 without iScreen. The results showed an increase in appropriate initial screening
and decrease in under- as well as over-screening of patients for nephropathy and
retinopathy after using iScreen electronic diabetes dashboard (Figure 6). This study
concluded that dashboards have potential to impact clinical outcomes and healthcare
costs.

Figure 6 Screening of T1D patients [14]

10

In sum, an interactive diabetes dashboard listing the risk factors and comorbidities
alongside different demographics has a potential to save time as well as enable effective
decision making to optimize clinical outcomes and costs of treatment.

1.3

Current State & Motivation

Diabetes is one of the major chronic diseases worldwide and efforts are being made on
a global scale to deal with it in the most efficient manner. Since early detection of diabetes
is one of the major factors to prevent it, there has been considerable research done on
developing diabetes risk calculators. For instance, in 2009, a simple tool [15] for detecting
undiagnosed diabetes and prediabetes was proposed and data was used from the Third
National Health and Nutrition Examination Survey [NHANES]. The models were built
using two methods – classification tree analysis and logistic regression. In 2011, the
Public Health Agency of Canada (PHAC) developed a non-laboratory based screening
questionnaire to identify diabetes and prediabetes among middle-aged adults. The
questionnaire was built on the basis of Finnish Diabetes Risk Score (FINDRISC) which
led to the development of The Canadian Diabetes Risk Questionnaire (CANRISK) [15]
which is a diabetes screening tool for Canadians aged over 40. There has also been
significant research done on using data mining algorithms in this area [16] [17]. In 2013,
three data mining models, namely, artificial neural networks (ANN), C5.0 decision tree
and logistic regression were compared to predict diabetes and prediabetes [16]. The C5.0
decision tree model demonstrated the highest accuracy for the dataset used in this
research; this dataset was based on information collected from a questionnaire [16].

11

Machine learning algorithms for detecting undiagnosed diabetic patients have also been
compared and used to create best models using ANN and logistic regression [9]. A more
comprehensive literature review is provided in Chapter 2.

Existing studies have focused more on building diabetes calculators (FINDRISC,
CANRISK) but there is a lack of research when it comes to tools developed for identifying
diabetes comorbidities [15] [18] [19]. Another limitation prevalent in existing work [16] [17]
is that the risk factors identified for predictive modeling in diabetes uses survey data and
interviews. While such databases can be useful to get an overall picture of the disease,
the authenticity of the underlying data is highly questionable. Specifically, self-reported
data has a high possibility of containing unreliable information [20]. Predictive modeling
uses historical data as its base to build models and if this data is inaccurate, the model’s
accuracy becomes questionable. One way to eliminate this issue is to use clinical data
recorded by healthcare professionals. Clinical data is authentic as patient diagnosis has
been confirmed by qualified physicians. However, obtaining clinical data for research can
be quite challenging as it is seldom available in the public domain due to privacy concerns.
Considering these factors, this research proposes building predictive models for
diagnosed diabetic patients using a clinical dataset obtained from Northern Health. These
models will enable users to identify hidden patterns in historical data and predict the
likelihood of comorbidities resulting in effective diabetes management. The models have
been integrated with an interactive diabetes dashboard for visual analytics.

12

1.4

Problem Statement

Predictive models that accurately forecast the likelihood of various diabetes comorbidities
could be an efficient tool for healthcare providers as well as patients. This could facilitate
early diagnosis and interventions with the possibility to prevent other comorbidities as well
as effective management of diabetic patients reducing the cost quotient on the healthcare
system. The results from these predictive models should be user friendly and beneficial
for end users; this leads to the first research question.
1.4.1 How to enhance diabetes management using intuitive visualization
techniques?

Chronic conditions such as diabetes require frequent follow-ups and monitoring of
patients for effective management. This can be a tedious task considering the large
number of patients to track, and the required tests for related comorbidities. Summarized
information of individual patients can enable healthcare professionals to take useful and
timely decisions. Aggregated data visualization of multiple patient records can give an
overview of the entire dataset while drill-downs can let users navigate to finer granularities
focusing on individual patients. This can be accomplished with an interactive diabetes
dashboard which identifies patients with their associated clinical visits and treatment
plans.
A previous study described earlier has shown that conventional approach using EHR took
6.3 minutes to identify all the associated variables for diabetic patients compared to 1.9

13

minutes using a diabetes dashboard. The mean number of mouse clicks were 60 for EHR
and significantly reduced to 3 using the dashboard [13].
Taking these factors into account, an interactive diabetes dashboard is built with the
following features:
·

Color coded visualization of existing and predicted data in the form of charts and
graphs with available demographics (i.e. patient Local Health Area (LHA), patient
community, age and comorbidities)

·

Results of models represented in comparative charts

·

Aggregated data with drill down capability to view information at finer granularity

This interactive diabetes dashboard will help decision makers to identify patterns and
understand the relationship of different variables specific to comorbidities and patients
which is complicated and time consuming using EHRs. These relationships also help
identify associated risk factors using historical and predicted data. This leads to the next
research question:
1.4.2

What are the vital risk factors for diabetes comorbidities?

Diabetes is associated with a number of comorbidities affecting the entire human body.
Analyzing and applying data mining algorithms to existing clinical patient data can help
identify the risk factors specific to comorbidities. This research focuses on three common
comorbidities which are acknowledged by Diabetes Canada - hypertension, congestive
heart failure and renal failure [3]. The risk factors associated with each of these
comorbidities and the prominent common risk factors are presented using a diabetes

14

dashboard. Another interesting observation would be to associate the comorbidities
themselves with the help of a predictive model; this leads to the final research question:
1.4.3 What is the likelihood of a patient to be diagnosed with other
comorbidities?

Diabetes Canada has observed that diabetic patients lifespan can reduce by five to 15
years; further, diabetes has been attributed as the reason for death of one in every ten
Canadian adults in 2008-2009 [8]. These patients are also more likely to be hospitalized
with cardiovascular disease and twelve times more susceptible to end-stage renal
disease compared to the general population. These observations emphasize the
importance of being able to predict the likelihood of comorbidities so that early detection
and efficient diabetes management can be achieved. Taking these factors into account,
predictive models to find the likelihood of diabetic patients with the following three
comorbidities are proposed:
·

Hypertension

·

Congestive Heart Failure

·

Acute Renal Failure

The results from these models can act as a useful guideline to identify patients vulnerable
to specific comorbidities and recommend appropriate management to prevent escalation
to further comorbidities. Effective diabetes management would ensure optimal patient
care as well as reduced costs on the healthcare system.

15

1.4.4

Methodology

This research focuses on building predictive models using existing diabetes related
clinical data. The model predicts the probability of diabetic patients to develop related
comorbidities. To make the results of the model easily accessible, a simple user-friendly
assessment tool is developed which predicts the probability of the three comorbidities to
the users.

In addition, an interactive dashboard is designed to visualize insightful

information representing several years of diabetes data including the identified risk factors
related to various comorbidities.
The raw clinical data is integrated into a database using SQL Server Management Studio
(SSMS) 15.0 [21]. The integrated database is used to build the predictive models with
IBM SPSS modeler 18.2 [22] which predicts the likelihood of the three comorbidities
(congestive heart failure, hypertension and renal failure). The models help to identify the
prominent risk factors for these comorbidities as well as the significance of variables for
the predictions. The results produced by each model and the identified risk factors are
analyzed in detail.
An interactive diabetes dashboard is built using SQL Server Reporting Services 15.0
(SSRS) which is a component of Microsoft Business Intelligence tool stack [23]. SSRS is
an effective reporting platform which includes various data visualization tools such as
charts, graphs, and gauges to represent data; capability to integrate maps and embed
images is also included. The aggregated results are presented in a visually pleasing
format with the option of drilling down to reports at finer granularities.

16

Finally, a user friendly tool is developed using IBM SPSS modeler. This tool allows
diabetic patients and healthcare professionals to view results generated by the models
predicting likelihood of one of the three comorbidities.
The study methodology is described in detail in Chapter 3.

1.5

Contributions

This research has two major contributions. Firstly, predictive models for Canadian
diabetic patients which forecasts the likelihood of related comorbidities have been
developed. These models could be used by healthcare professionals as a guideline to
identify patients who are at higher risk of developing predicted comorbidities and ensure
effective management of diabetes. Clinical data of diabetic patients who accessed
Northern Health (NH) facilities between April 1 2012 and March 31 2018 has been used
to train and test the models for accuracy.
The second contribution is the design and development of an interactive diabetes
dashboard. Data visualization in the form of charts and graphs can enable healthcare
professionals to have a better and deeper understanding of the variables associated with
the disease. The dashboard provides individual as well as aggregated data at facility and
community levels with drill down and drill through reporting. This information can be useful
to identify the gaps in healthcare and enhance related services by making informed
decisions in a timely manner. These contributions are described in detail below:
Identifying diabetic-patient comorbidities: Predictive models built specific to diabetic
patients predicting the likelihood of comorbidities - benign hypertension, congestive heart

17

failure and acute renal failure. These results can be used as a guideline by healthcare
professionals to identify and treat patients who are at a higher risk for developing these
comorbidities.
Enhanced patient-care: Early detection of patients who are at high risk for developing
one or more comorbidities could help prevent further complications to their health and
timely treatment ensuring overall well-being of patients.
Reduce healthcare costs: Identified high risk patients who receive timely care are less
susceptible to other complications; this in turn benefits the patients as well as the
healthcare system with elimination of complex treatments reducing the burden of cost.
Holistic healthcare approach: Interactive diabetes dashboard provides an overview of
the current state of diabetes and diabetic patients in the selected dataset. The embedded
drill downs allow filtering of results by various demographics such as Health Service
Delivery Area (HSDA), LHA and comorbidities. The historical information is presented in
a way which facilitates analysis and assists decision makers in identifying gaps in
provision of healthcare. The stakeholders can use this information to develop plans for
improving related services.
Diabetes comorbidity prediction tool: The results from the model are incorporated into
a user-friendly web form which predicts the possibility of one of the three comorbidities
for diabetic patients. The interactive web form asks for a user to enter input for the
selected variables and predicts the value of the target variable using the underlying
models. This tool could be used by the healthcare professionals to enhance treatment
and management of diabetes.

18

Chapter 2
Related Work
Diabetes being a worldwide chronic illness has no shortage of research especially with
focus on early diagnosis and detection of the disease. One reason for such extensive
research in this direction is the cost and toll of diabetes management on healthcare
systems. The research community has explored various data mining techniques to detect
and diagnose diabetes. The research has encompassed not only diabetes, but all
associated comorbidities including hypertension, renal failure and cardiovascular
diseases. Diabetes research thus branches into various fields including but not limited to
healthcare, data mining and data visualization. The literature review presented in current
chapter was done to align with the focus of this research, that is, building predictive
models for diabetes comorbidities and designing an interactive diabetes dashboard.

This chapter is divided in three main sections. Firstly, representative studies on diabetes
and data mining are presented. These include analysis of various data mining algorithms
and techniques used in the study of diabetes and related comorbidities. A review of the
application of research work to develop user-friendly tools such as diabetes calculators
19

is then presented. Finally, the use of data visualization for enhanced and cost effective
healthcare is explored. Limitations of existing literature are provided to identify research
gaps that are addressed by the work presented in this thesis.

2.1

Diabetes and Data mining

Data mining plays a huge role in the healthcare sector with algorithms that have the ability
to analyze, detect and predict the presence of diseases in patients. Early detection of
diseases can help in timely and efficient decision making by healthcare professionals.
Diabetes is one such disease where data mining can be a vital part of developing tools
that can facilitate enhanced healthcare service. In this section, research with respect to
data mining and diabetes is explored and techniques for implementing predictive
modeling specific to diabetes are analyzed.

In 2012, a study was done to predict T2D using data mining. The aim of the research was
to apply artificial metaplasticity on multilayer perceptron (AMMLP) as a data mining (DM)
technique for diabetes and compare results with the decision tree, Bayesian classifier and
other algorithms [24] .The comparisons were done using classification accuracy, analysis
of sensitivity and specificity and confusion matrix. The results showed an accuracy of
89.5% for AMMLP which was superior to decision tree and Bayesian classifier algorithms.
The dataset used for this research was obtained from the National Institute of Diabetes
and Digestive and Kidney Diseases. This dataset comprised of a specific group of Pima
Indian women tested for diabetes. The sample used eight variables:

20

·

number of times pregnant

·

plasma glucose concentration

·

glucose tolerance test

·

diastolic blood pressure

·

triceps skin fold thickness

·

serum insulin

·

body mass index

·

diabetes onset within five years

This research [24] had a sample size of 768 which was further reduced to 763 after
elimination of records with missing data. It is also to be noted that out of 768 patients only
268 had diabetes. T2D is common among both men and women but this dataset did not
include men and it was also focused on a specific group making the relevancy of results
for other groups questionable. It was concluded that AMMLP performed with better
accuracy but it was compared only with two other classifier algorithms.

In 2013, a research project proposed automated detection of diabetes mellitus using
neural networks without patients undergoing clinical tests [17]. The neural network had
a total of 27 nodes (13 input, 13 hidden and 1 output) [17]. The input nodes are the
variables with shared historical data, the hidden nodes are where the computation occurs
and the output node is the result for the given input. The neural network was built using
the backpropagation1 algorithm and out of 20 datasets tested, 18 produced accurate

1

“backward propagation of errors” calculating gradient of error function
21

results with an overall accuracy of 92.8%. For this research, a survey was done using
100 datasets. Each data set included people of various ages, genders and lifestyles to
get an unbiased result. Eighty datasets were used for training and twenty were used for
testing the system. The following variables were used for building the model:
·

Age

·

Gender

·

Weight

·

Height

·

Weight loss

·

Thirst increase

·

Hunger increase

·

Appetite increase

·

Nausea

·

Fatigue

·

Vomiting

·

Bladder, skin infections

Considering that the data was collected using surveys and was self-reported, the
authenticity of the diagnostics becomes questionable. This research [17] also mentions
another study using ANN and feature extraction which achieved an accuracy of 94.6%
for classifying patients as diabetic and non-diabetic.

22

Another study [16] was conducted to compare three data mining models (ANN, decision
tree and logistic regression) to predict diabetes or prediabetes by various risk factors [16].
A questionnaire to obtain information on demographics, family diabetes history,
anthropometric measurements and lifestyle risk factors was given to 1,457 participants,
735 of whom had diabetes. The following twelve input variables were used:
·

Age

·

Family history of diabetes

·

Marital status

·

Education level

·

Work stress

·

Duration of sleep

·

Physical activity

·

Preference for salty food

·

Gender

·

Eating fish

·

Drinking coffee

·

Body mass index

The output variable was a flag variable with possible values of 0 and 1, where 1 indicated
if the person had diabetes or prediabetes. Results from the three predictive models are
shown in Table 2.

23

Table 2 Comparison of Data Mining Models

ANN

Logistic regression

Decision tree

Sensitivity

79.40%

79.40%

78.11

Specificity

73.54%

65.47%

75.78%

Accuracy

76.54%

72.59%

76.97%

In conclusion, the decision tree had the highest classification accuracy, followed by
logistic regression and ANN. Classification accuracy is the percentage of correct
predictions in a model and is considered to be a vital performance indicator. A limitation
of this study [16] was that the sample population chosen was only from two communities
in Guangzhou, China and cannot be considered an appropriate representation of the
entire population. Also, some of the individuals who participated in the study provided
self-reported data which make the results less reliable.

In a recent study published in 2020, data was collected from over 230,000 participants
during the years 2006-2017 to develop a T2D risk prediction model using machine
learning algorithms [25]. This research excluded all diabetic participants as well as any
participants taking medication for diabetes. The collected medical, behavioral,
demographic and incidence data was used to predict T2D in participants at 3, 5, 7 and 10
years. The participants selected in the research were followed up for the entire time period
thus making it a longitudinal dataset. Three machine learning algorithms, random forest,
multilayer feedforward artificial neural network implementing a deep-learning approach,
and a gradient boosting machine approach, were compared with conventional logistic
regression model. The AUC (Area under Curve) in machine learning models was higher

24

than the conventional regression model. AUC is a statistical performance measurement
which is used to validate the model. A higher AUC implies better prediction capabilities of
the model. The highest accuracy was recorded by gradient boosting algorithm with an
AUC of 79% in 3-year prediction and 75% in 10-year prediction. The machine learning
models also predicted BMI as the vital risk factor contributing to T2D. It was also noted
that diabetes incidence was recorded higher among men than women over the ten-year
period. Limitations of this research were that it used self-reported data and the exclusion
of participants was done by use of diabetes related medication instead of a clinical
diagnosis.

The studies presented above focused on predicting diabetes [16] [17] [25]. The next few
research works [24] [26] explore the use of data mining techniques to improve
management and treatment plans for diagnosed diabetic patients. Management of
diabetes is a critical challenge for healthcare professionals as well as for the patients
themselves. Diabetic patients have higher risk of being diagnosed with multiple
comorbidities which, in turn, increases the complexity of treatment and care. Hence, it
would be ideal to predict comorbidities using data mining techniques. Existing literature
shows a few studies that have focused on this topic.

A comorbidity study [24] done on diabetic patients identified that hypertension plays a
critical role in its association with other comorbidities. Hypertension was identified as a
critical factor for T2D patients having stroke as well as dyslipidemia [27]. For this research,
20,314 patients with T2D were chosen from Keimyung University Dongsan Medical
Center. Apriori algorithm was used to find the association between T2D and various
25

comorbidities. Hypertension had the highest association followed by gastritis and senile
cataract. Apriori algorithm was implemented through a proprietary tool, Dx Analyze,
which aided in the process of data cleansing and construction of data marts as well. A
limitation of the study was that the data represented only one medical facility and the Dx
Analyze tool needs to be applied on data from multiple facilities to check for relevancy of
the results. The authors also acknowledge the limitations of Apriori to determine causality
of disease and recommend further research considering chronology of diseases
in patients.

In another interesting research, mortality of diabetic patients in ICU was predicted
[26]. The MIMIC-III database which records ICU admissions was used for this study.
There were a total of 10,318 diabetic patients in this database; this number came down
to 4,111 after exclusion of missing values for blood glucose. Existing algorithms to predict
mortality were used for the models - Charlson Comorbidity Index (CCI), Elixhauser
Comorbidity Index and Diabetes Complications Severity Index (DCSI). CCI and
Elixhauser calculate risk-scores based on ICD-9 diagnosis codes for each patient while
the DCSI is an alternative risk score designed specific to diabetic patients. The results
showed AUC values to be 0.694, 0.682 and 0.656 for DCSI, Elixhauser and CCI,
respectively. The AUC improved to 0.785 when all three metrics were combined using
logistic regression. In addition, the random forest model achieved an accuracy of 0.787.
This research focused on five variables:
·

Age

·

Gender

26

·

Ethnicity

·

Insurance

·

Admission Type

A limitation of this research was that it used random sampling of 70/30 for analysis which
resulted in an imbalance of less than 10% of positive cases. Also, it did not consider
patients directly admitted for diabetes related care because it was complicated to identify
with the different diagnostic codes recorded for a patient. Length of stay is a variable
which was not analyzed and is recommended to be explored by the authors. This
research also recommends exploring other machine learning algorithms such as random
forest and ANN for better predictions.

Data mining algorithms can be effectively used and adopted in healthcare to build
predictive models with patient-specific information to predict diseases such as diabetes.
Predictive models for T2D comorbidities could contribute to associating the relation
between risk factors and identify onset of specific comorbidities [28]. The models can also
be used to develop tools to aid in informed decision making for optimized treatment of
diabetic patients. User-friendly electronic tools to identify patients with diabetes can be
highly beneficial for efficient treatment and management of diabetes. The next section
covers literature focusing on existing calculators for diabetes.

27

2.2

Diabetes Calculator

Early identification of diabetes is ideal for well-being and treatment of patients and
diabetes calculators are an effective tool to accomplish this. These calculators can act as
a guideline for patients to analyze their risk of being diagnosed with diabetes; higher the
potential risk, more advisable and essential to contact a physician. Over the years, there
has been a lot of research done globally on diabetes calculators, some of which is
presented in this section.

In 2003, a tool to predict T2D (Diabetes Risk Score) was developed to identify individuals
at risk without undergoing laboratory tests [18]. The risk factors taken into account were:
·

Age

·

Body Mass Index (BMI)

·

Waist circumference

·

History of antihypertensive drug treatment and high blood glucose

·

Physical activity

·

Daily consumption of fruits, berries, or vegetables

For this study [18], a random population sample between ages 35-64 was selected and
followed for 10 years. Each category was assigned a score using multivariate logistic
regression model coefficients. The cumulative sum of all scores was calculated as the
Diabetes Risk Score (DRS). The research identified 182 cases of diabetes incidence in
4,435 subjects. DRS has been implemented in Finland as one of the tools in their diabetes
prevention program. The SAS (version 8.2; SAS Institute, Cary, NC) software was used

28

for analysis. This research has several limitations. First, the risk factors do not include
family history of diabetes which is an important factor contributing to an increased risk of
acquiring the disease [29]. The researchers recommend addition of this factor in future
work. The individuals with high glucose levels were not excluded at the baseline under
the assumption that no biochemical tests were performed at that stage. In addition, the
data used to build the model was obtained from surveys and the national population
register.

In 2005, Indian Diabetes Risk Score (IDRS) was proposed for screening undiagnosed
diabetic patients [19]. Indian Diabetes Risk Score used four risk factors: age, abdominal
obesity, family history of diabetes and physical activity. Multiple logistic regression
analysis was applied using undiagnosed diabetes as the dependent variable. When risk
score was greater than or equal to 60, the IDRS had an accuracy of 61.3% with a positive
predictive value of 17.0% and a negative predictive value of 95.1%. Receiver Operating
Characteristic (ROC) curves showed that area under ROC curve was 0.698 with a
confidence interval of 95%. Indian Diabetes Risk Score, which categorizes risk factors
based on their severity, can be a cost effective tool for mass screening in developing
countries like India where a large number of cases are undiagnosed. The risk score for
this research was derived from Chennai Urban Rural Epidemiology Study (CURES). The
response rate for this study was 90.4% and the results were subject to internal
validation. The sample size for this research was 2,350 patients. This research [19] did
not take dietary consumption into account which is one of the recommended risk factors
by the American Diabetes Association. In addition, anti-hypertensive medication was

29

excluded as one of the variables considering that a lot of people do not take medication.
The other shortcoming of this research is that it is a cross-sectional study and the authors
recommend validating this study with prospective studies. A cross-sectional study collects
data from various sects of the population at a given time opposed to collecting data over
time. For medical research involving predictions, prospective studies are preferred as
they are longitudinal and the results obtained can have a better relevance.

A simple tool for detecting undiagnosed diabetes and prediabetes was proposed using
data from the Third National Health and Nutrition Examination Survey [NHANES] [30].
The models were built using two methods – classification tree analysis and logistic
regression. The diabetic risk calculator tool used the following risk factors:
·

Age

·

Waist circumference

·

Gestational diabetes

·

Height

·

Race/Ethnicity

·

Hypertension

·

Family History

·

Exercise

The classification tree model was used based on its ease of use and the results obtained
are shown in Table 3:

30

Table 3 Diabetes Risk Calculator Results for the United States [30]

Sensitivity
Specificity
Positive Predictive value
Negative Predictive value
ROC

Undiagnosed Diabetes
88%
75%
14%
99.3%
85%

Prediabetes
75%
65%
49%
85%
75%

ROC area under the curve for undiagnosed diabetes was 0.85 and for prediabetes was
0.75. ROC is used to evaluate the performance of models where the true positive rate is
represented by sensitivity and false positive is represented by specificity. With ROC
analysis, optimal models for predictions can be evaluated. This research [30] eliminated
the variables for body mass index (BMI) in favour of height and weight, and the cholesterol
variables were eliminated due to missing fields and low predictor value. Another important
variable eliminated was diabetes in any blood relative. There were 18 variables chosen
but not all of them were used in the final model. Finally, the tool is yet to be developed
into a patient friendly electronic version for broader use.

In 2011, the Public Health Agency of Canada (PHAC) came up with a strategy for
preventive intervention [15]. Before such an intervention can be applied in Canada, it is
important to have an early detection strategy to be successfully implemented. The PHAC
developed a non-laboratory-based screening questionnaire to identify diabetes and
prediabetes among middle-aged adults. The questionnaire was built on the basis of
Finnish Diabetes Risk Score (FINDRISC) which led to the development of The Canadian
Diabetes Risk Questionnaire (CANRISK) [15].

31

CANRISK asks 13 questions that categorizes people as low risk, moderate risk and high
risk. The low risk has a score of less than 21, and the high risk has a maximum score of
86. The moderate risk scores can vary from 21 to 32. The 13 questions focus on the
various risk factors such as age, gender, height & weight (to calculate BMI), blood
pressure and blood pressure during pregnancy (gestational diabetes). CANRISK also has
questions related to family history of diabetes as well as ethnicity and education. Each of
these variables contribute to the total diabetes risk score. In case of moderate risk,
CANRISK recommends consulting a healthcare practitioner whereas in the case of high
risk blood sugar test is recommended. CANRISK has been implemented and translated
into different languages [15]. Limitations of this work are that some ethnic groups are
under-represented in the sample and CANRISK is yet to be evaluated as a screening tool
for high risk patients.

2.3

Data Visualization and Diabetes

Healthcare is one of the areas where abundant data is stored in various disparate formats.
Integrating and organizing such data is an ongoing challenge faced by healthcare
providers. Critical data can be challenging to be retrieved from electronic health records.
Data visualization can help solve this challenge and lead to enhanced patient care and
optimized diabetes management. Research has shown that management of diabetes
improves when patients are provided with information and knowledge about their health
condition.

32

In a study, patients were assessed by a diabetologist and given access to a web portal
which had information regarding diabetes, their personal health status as well as the
ability to contact the diabetologist [31] . The primary goal of this research was to monitor
the blood glucose levels (A1C) and to observe differences between users who had access
to the web portal and those who did not. This study observed that the web portal users
had lower levels of A1C compared to the non-users. Further, it confirmed the usefulness
of a web based tool to enhance patient management and cut costs in the long term. This
research used only 8% of the original patients (157/1957) for the final analysis as only
157 patients had covariate data and did a follow-up visit. This study also did not explore
the demographic factors that would influence the usage of the web portal, and did not
distinguish between patients with T1D and T2D.

There has been work done towards building dashboard for diabetes. As mentioned in
Chapter 1, the iScreen electronic diabetes dashboard [14] observed that evaluation of
decision support tools facilitate complicated screening for diabetes care. However,
iScreen included only T1D patients and had a small sample size of only fifty patients.

Mosaic is a project funded by the European Union (EU), specifically to explore predictive
models and decision support system for T2D care and management; a clinical decision
support system (CDSS) dashboard was built for this project [32]. The dashboard explored
diabetic patient data and risk of complications; it consisted of three sections consolidating
metabolic control, frequent temporal patterns and drug purchase patterns. An outcome
assessment and research support system (ORSS) was designed for clinicians

33

[32]. Figure 7 shows an example of CDSS where patients are grouped by complication
categories and details.

Figure 7 CDSS Diabetic Patients Complications

Upon evaluating the CDSS for nine clinicians, it was observed that T2D patients who had
access to CDSS recorded shorter durations with their clinical visits and screening for
complications increased in the visits indicating optimized patient care. The researchers
observed that the dashboard can be improved by implementing a more detailed humancomputer interaction study. The dashboard was evaluated for patient management but
not for any clinical outcomes. There were a limited number of clinicians involved and they
all were from the same facility.
Nevertheless, chronic diseases such as diabetes can be aided with the help of data
visualization tools such as dashboards to support clinical decision-making, including
diagnosis, treatment plans, and effective management if used in a coordinated fashion to
improve the overall well-being of patients complementing the healthcare system.

34

2.4

Summary

Research in data mining and diabetes has identified risk factors which have a strong
relationship to diabetes, such as age, BMI and dietary consumption. Predictive models
play an important role in forecasting future health outcome of patients. Unfortunately,
majority of existing research work in prediction focuses on comparing different data
mining algorithms to determine the most efficient algorithm to build the model.
Undoubtedly, there is a lack of research to identify risk factors leading to comorbidities
such as cardiovascular disease, renal failure, and hypertension for diagnosed diabetic
patients. Several tools have been developed for early diagnosis and management of the
disease. One of the major issues with the current risk calculators is that the majority of
them are paper-based questionnaires as opposed to online tools [33]. The identified
research gaps discovered in current literature are summarized below:
Self-reported data: Majority of the published work is based on survey data and
questionnaires which makes the data quality highly questionable. This, in turn, impacts
the reliability of the predictive models which are built on top of this data.
Low Count of Diabetic Patients: Many datasets had a low count of diabetic patients,
and there is a lack of research using datasets which exclusively represent diagnosed
diabetic patients.
Domain-Specific Datasets: The datasets used in majority of the researches were
specific to an ethnic group or to a particular facility which makes results applicable and
relevant only to the group associated with the dataset.

35

Location Demographics: The datasets used lack information on demographics such as
the community and facilities accessed by the patients. Location demographics can
contain useful information specific to a community or facility which can represent insightful
information. Also, there is limited research for predictive models based on data for
diabetes patients in Canada.
Adding New Input Variables:
Majority of the studies have included age, gender, ethnicity and BMI as input variables
for building the models. Variables such as length of stay, discharge date, and availability
of family physician have not been explored to evaluate their impact on the models.

Diabetes Comorbidity Assessment Tool:
Several calculators for diagnosing diabetes have been developed over the years,
including CANRISK, but tools available for identifying comorbidity or multi-morbidities in
diabetic patients are scarce.

To overcome these limitations, predictive models for diabetes comorbidities have been
built using NH clinical data. Clinical data eliminates the issue of self-reporting as only
diagnosed patients are part of the dataset. This dataset has exclusive information of
diabetic patients who are diagnosed with either T1D or T2D, also this dataset is specific
to Canada and has information of all patients who have accessed NH facilities from 2012
to 2018. This includes patients from different communities accessing various facilities
which makes it a generic dataset rather than specific to a domain. The models included
the lesser explored variables such as length of stay, access to family physicians and

36

facilities. The importance of each of these variables with respect to the different
comorbidities was analyzed for building the predictive models. The results of the diabetes
comorbidity models were integrated with a user-friendly tool to predict the risk of
hypertension, cardiovascular disease and renal failure in diabetic patients. In addition to
this, a dashboard has been developed for visualization of existing clinical diabetes data
to give the users insightful and useful information about diabetes. This enables users to
interact and analyze anonymized patient data for effective decision making and improved
healthcare outcomes. The existing research has served as a guideline to choose relevant
variables and algorithms for developing the models.

37

Chapter 3
Methodology
This research has three interrelated components. First, a model for predicting diabetes
comorbidities is proposed. Second, an interactive dashboard has been developed to
provide insights about diabetes using visual analytics. In the process, hidden data
patterns are uncovered and the newly discovered knowledge is imparted via this
interface. Finally, a user-friendly assessment tool allows users to benefit from the model
results for their specific cases. These components are explained in detail in this chapter.

3.1

Proposed Model

The key components of the model are shown in Figure 8.This model was studied for three
representative comorbidities using several data mining algorithms and the NH clinical
dataset of diagnosed diabetic patients. IBM SPSS modeler was used to identify and apply
the most efficient data-mining algorithms for prediction of these comorbidities.
Relationships between different comorbidities and demographics were also identified.
The visual analytics dashboard was built using SSRS as the underlying platform. The
front end for the assessment tool consists of a simple web form wherein the user enters
information such as age, diagnosis code and health service delivery area. This
information is processed based on a Microsoft SQL server database back end and the
38

predictive models to determine the possibility of diabetes comorbidities in future. To
recap, after importing the Excel csv data file into the SQL Server database, the entire
process can be grouped into three distinct phases, namely, predictive modeling,
dashboard design and assessment tool. The steps within each phase are listed below.

Figure 8 Components for Predictive Modeling and Data Visualization

39

Phase 1: Data preprocessing and modeling
·

Data was cleansed and prepared for the model.

·

After data preprocessing, testing and training tables for three diabetes
comorbidities were created.

·

Relationships were established within the database to associate demographic and
diagnostic data from different tables for data visualization.

·

Three predictor variables were chosen from the top twenty diagnostic codes. A
separate model was built for each of these variables. The remainder diagnostic
codes together with demographic data then became the input variables.

·

Various data mining algorithms such as logistic regression, decision tree and
artificial neural network together with their ensembles were compared for
relevance and accuracy.

·

To evaluate the model, the data was divided in two sets, one for training and the
other for testing. A larger dataset was used for training which improved accuracy
of the algorithm. The test data, which was a smaller dataset, was then used to
evaluate performance.

Phase 2: Dashboard
·

A dashboard was designed to allow users to analyze/compare various key
performance indicators (KPIs) and filter results by the selected parameters. The
drilldown capabilities of the dashboard allow filtering by specific demographics and
understand KPIs at a finer granularity.

40

·

The performance of various data mining algorithms is also shown in the
dashboard.

Phase 3: Diabetes Comorbidity Assessment tool
The results from predictive models were integrated with a web-based, user-friendly
assessment tool to predict likelihood of comorbidities for individual patients. This tool
displays the risk score for diabetes comorbidities.

3.2

Data Source

A clinical dataset obtained from Northern Health (NH) has been used for this research.
This dataset consists of patients who have accessed one of the eighteen NH facilities in
three Health Service Delivery Areas (Northeast, Northern Interior, Northwest). The NH
dataset exclusively consists of diagnosed diabetic patients who were admitted to these
facilities for either acute care or day surgery. All patients were diagnosed with at least
one of 4,592 unique diagnostic codes. The dataset used for this research consists of a
total of 141,900 records representing 34,824 unique admissions for the period from April
1, 2012 to March 31, 2018. It is to be noted that these timelines were specified in fiscal
years (2012/13-2017/18) and there were no cases of gestational diabetes. The variables
included in this dataset are:
Ø Patient Code
Ø Stay Code (this code is unique to a particular acute/daycare stay (visit))
Ø Diagnosis Code Order of Entry (identifies the order in which the diagnosis codes
were abstracted)

41

Ø Health Service Delivery Area
Ø Facility Name
Ø Diagnosis Code (ICD-10-CA code that describes the diagnoses, conditions,
problems, or circumstances of the patient during the length of stay in the health
care facility)
Ø Diagnosis Code Long Description
Ø Age Units
Ø Average Total Length of Stay (the summation of both the acute care length of stay
and the ALC length of stay)
Ø Physician Code (has family doctor or not)
This dataset is a reliable source for building predictive models and analyzing data
because it only consists of diagnosed diabetic patients with associated diagnostic
codes.
In addition, it is also to be noted that this dataset has been anonymized by Northern
Health to protect the privacy of patients. No personal information of patients was
included in the dataset.

3.3

Data Preprocessing

Data preprocessing is the technique of cleaning and processing the data to ensure
efficient and accurate adaptation by different data mining algorithms. The performance of
predictive models not only depends on the data mining/machine learning techniques, but

42

is also highly dependent on the data quality. Hence, it is imperative to ensure that negative
factors such as noise, missing values and inconsistencies are addressed through data
preprocessing methods [34]. Figure 9 shows the common data preparation and data
reduction techniques involved in data preprocessing [35].

Figure 9 Tasks in Data Preprocessing [35]

For this research, data preprocessing has been done to obtain a specific set of relevant
variables through extensive analysis and filtering.
Dataset Preprocessing: The dataset provided by NH was an Excel sheet in a csv format.
This sheet was converted into a database. There were a total of 141,900 records with

43

stay codes repeating multiple times. To resolve this, pivot queries were used to get a
table with the 34,824 unique admissions. This was verified with the original Excel sheet.
This pivoting required that diagnosis codes for comorbidities be added as columns in the
table for each unique admission. This was deemed to be an unnecessary overhead
because 73% of these codes occurred in less than ten cases in the original dataset. The
process implemented for exclusion of comorbidities is explained in detail in the next
section. Considering all these factors, the top twenty diagnosis codes with maximum
counts were chosen to build the model. These codes together with their descriptions and
counts are listed in Table 4 below:
Table 4 Top Twenty Diagnostic Codes by Count

Diagnosis Code
1

E119

2

E1152

3

I100

4

E149

5

E1164

6

I500

7

E1123

8
9
10

N179
N390
N0839

11

E1138

12

H251

13

E1128

14
15

Z22300
J189

Diagnosis Description
Type 2 diabetes mellitus without (mention of)
complications
Type 2 diabetes mellitus with certain circulatory
complications
Benign hypertension
Unspecified diabetes mellitus without (mention of)
complication
Type 2 diabetes mellitus with poor control, so
described
Congestive heart failure
Type 2 diabetes mellitus with established or
advanced kidney disease
Acute renal failure, unspecified
Urinary tract infection, site not specified
Unspecified glomerular disorders in diabetes mellitus
Type 2 diabetes mellitus with other specified
ophthalmic complication not elsewhere classified
Senile nuclear cataract
Type 2 diabetes mellitus with other specified kidney
complication not elsewhere classified
Carrier of drug-resistant staphylococcus
Pneumonia, unspecified

44

Count
13,268
6,516
3,598
2,526
2,452
2,303
2,262
1,714
1,673
1,429
1,337
1,297
1,293
1,255
1,102

16

E109

17
18
19

Z22302
U980
Z515

20

E1178

Type 1 diabetes mellitus without (mention of)
complication
Carrier of drug-resistant enterococcus
Place of occurrence, home
Palliative care
Type 2 diabetes mellitus with multiple other
complications

1,058
996
967
958
939

Based on literature and Table 4, three diagnosis codes were selected as predictor or
target variables:
1. I500 (Congestive Heart Failure)
2. I100 (Benign Hypertension)
3. N179 (Acute Renal Failure)
For each of these diagnosis codes, training and testing datasets were initially created with
a ratio of 70:30. This ratio was later adjusted to study the efficiency of the models.
It is to be noted that the final dataset used to build the predictive models aggregated
patient admissions which resulted in each patient to have only one record to be consistent
with the total number of unique patients (14,016) after exclusions. This is due to the
reason that some patients had recorded a comorbidity in one of their admissions but in
the subsequent admissions, these comorbidities were missing which lead to data
inconsistencies. To handle this particular issue, it was assumed that if a patient had been
diagnosed with one of the twenty comorbidities in any one of their admissions, then they
were recorded with that particular comorbidity. This particular issue has also been
explained in the challenges section of this chapter.

45

3.4

Inclusion and exclusion

For relevance of data, inclusion/exclusion was done at two stages. First, NH ensured that
the data consisted of only diabetic patients. Second, irrelevant /redundant data was
excluded and only relevant variables based on literature were retained.
NH Inclusion/Exclusions: Discharges were included if at least one of the following
diagnosis codes was found on the record: E11*, E12* E13*, E14* and/or E232.
·

Type 1 Diabetes codes begin with E10

·

Type 2 Diabetes codes begin with E11

·

Type other codes begin with E13 and include Diabetes Insipidus E232 Diabetes

·

Type unspecified codes begin with E14

3.5

Predictive Modeling Inclusions/Exclusions:

46

Large datasets face curse of the dimensionality problem which impedes operations of
data mining algorithms raising the computational costs [36]. One solution to handle this
issue is to use the Feature Selection (FS) algorithm. FS eliminates irrelevant and
redundant variables. The NH dataset includes demographic and diagnostic data for each
patient with every admission. The variables including the twenty diagnostic codes
identified in data preprocessing stage were evaluated for importance using the FS
algorithm [37].

Figure 10 Feature Selection Model

47

Figure 11 FS Model Results

Figure 10 shows the FS model used on the NH clinical dataset. In this Figure, the SQL
data source represents the NH database, and Type identifies the data types of variables;
Type is also used to select the target predictor variable I500 (Congestive Heart Failure).
The golden model nugget contains the results of the FS model. Figure 11 shows the
variables evaluated and ranked by order of importance by the FS model. The Field column
provides the names of input variables which were described earlier in the data
preprocessing section. The Measurement column describes the variable type as identified
by the SPSS modeler either as continuous, nominal, ordinal or flag. Continuous is used
for integers, real numbers and date/time; Nominal can be used for numeric/string/date or
time; Flag is used for two distinct values such as true/false or binary values 0 or 1. It is

48

to be noted that out of the twenty diagnosis codes included, only seven were identified as
important by FS. This is due to the reason that remainder of the diagnosis codes had a
majority of values which were ‘0’ which prevented FS from ranking them; instead, these
codes were categorized as “Single category too large”. Since these variables were not
evaluated as unimportant by FS, they were included in the final dataset as input. The
finalized list of variables including diagnostic and demographic information was explored
further for data analysis and visualization. The data mining algorithms which were used
include Logistic Regression, CHAID, Neural Network, Random Forest, Bayesian Network
and Ensemble. Some of these algorithms are described later in this chapter.
The following variables were also excluded:
·

Stay Code - This code is unique to a particular acute/daycare stay (visit)

·

Fiscal Year - A fiscal year ranges from April 1 of the current year to March 31 of
the following year

·

Fiscal Period - Periods within the fiscal year (The days in period 1 and 13 will
vary, the remaining periods 2-12 will always be 28 days)

·

Discharge Date - The date the patient was discharged from the hospital

·

Institution Type - Identifies whether the hospital stay was an acute care stay or a
daycare visit

·

Age Code - Age code is either Year (Y), Months (M), or Days (D)

·

Acute Length of Stay - The length of stay in days associated with the acute care
portion of the stay

·

ALC Length of Stay - ALC length of stay is the number of days a patient is
classified as alternate level of care
49

·

Patient Community - Community of patient residence

·

Patient LHA - Local Health Area of patient residence

·

Patient Province - Province of patient residence

Stay code, Fiscal Year, Fiscal Period, Discharge Date were excluded because patient
data from different admissions was aggregated into one record for each patient. Age
Code specified the units in which the age was recorded (year, months or days). This only
resulted in removal of five patients. ‘Acute Length of Stay’ and ‘ALC Length of Stay’ were
excluded as the variable Total Length of Stay captured this information by default. Since
multiple admissions for patients and their diagnoses were aggregated, the average of
Total Length of Stay was calculated for all patients. Patient Community and LHA were
excluded due to patient migration across communities resulting in data inconsistency.
Patient Province was also excluded as majority of patients were from British Columbia
and this information was redundant. These eleven exclusions reduced the dataset to
twenty-six variables. Out of these variables, Patient Code (unique identifier) and the target
variable are not considered to be input variables thereby leaving twenty-four variables
which are listed below:
1. E119 - T2D without complications

2. E1152 - T2D with certain circulatory complications
3. E149 - Unspecified diabetes mellitus without (mention of) complication
4. E1164 - T2D with poor control
5. I100 - Benign hypertension

50

6. E1123 - T2D with established or advanced kidney disease
7. N179 - Acute Renal Failure
8. N390 - Urinary tract infection
9. N0839 - Unspecified glomerular disorders in diabetes mellitus
10. E1138 - T2D with other specified ophthalmic complication
11. H251 - Senile nuclear cataract
12. E1128 - T2D with other specified kidney complication
13. Z22300 - Carrier of drug-resistant staphylococcus
14. J189 - Pneumonia, Unspecified
15. E109 - T1D without complication
16. Z22302 - Carrier of drug-resistant enterococcus
17. U980 - Place of occurrence, home
18. Z515 - Palliative care
19. E1178 - T2D with multiple other complications
20. Facility HSDA - Facility Health Service Delivery Area
21. Facility Name - Specifies facility in which patient is admitted
22. Age - Specifies age of patient in years
23. Average Length of Stay - Average of total length of stay
24. Physician Code - Specifies if a patient has family physician or not
It is to be noted that, the above input variables are for predicting I500 which is the reason
for it not to be included as an input. An analogous process was followed for predicting
I100 and N179.

51

Out of 14,021 patients, there were only five patients who had their age units recorded
either as month or days; these were excluded from the study thus reducing the dataset
to 14,016 records.

3.6

Predictive Modeling

Predictive modeling is a statistical data mining technique normally used to predict future
behaviour. Predictive models analyze historical and current data to predict future
outcomes. Data mining algorithms, such as logistic regression, decision tree and neural
networks, have been used for building predictive models for early detection of diabetes
[11]. There are two types of learning used by data mining algorithms - supervised learning
and unsupervised learning [43]. Supervised learning uses labeled data whereas
unsupervised learning uses unlabeled data. Labeled data refers to data accompanied
with metadata, while unlabeled data lacks this information. Unsupervised learning has
been primarily used to solve association, clustering and anomaly detection problems. In
contrast, supervised learning are more suited to solve classification problems. With the
exclusive use of labeled data, this research makes use of supervised learning methods
in order to generate predictive models for classification. Some of the data mining
algorithms which use classification are briefly explained below:
Artificial Neural Networks

52

Artificial Neural Networks (ANNs) are biologically inspired models which have recently
found their applications in the field of healthcare. ANNs are based on the brain structure
and can be used to model extremely complex nonlinear functions. ANNs can be used in
sophisticated predictive applications such as multilayer perceptron (MLP) and radial basis
function (RBF) networks [38].

53

Figure 12 Neural Network Mapping

54

As seen in Chapter 2, ANNs have been effectively used for building predictive models for
diabetes. An example of neural network mapping shows how input data is matched to
the predictor variable which in this case was I100 (Hypertension) (Figure 12). Each
neuron output is calculated by the sum of inputs and activation functions.
Some advantages of ANNs are given below [39]:
·

Information is stored on a network which ensures that it can function even with
missing values; models can be trained to produce results even with incomplete
data.

·

ANNs lend well to parallel processing.

·

ANNs provide better fault tolerance; if one or more cells is corrupted, it still
generates results.

A notable disadvantage of ANNs is that solutions probed are unexplained in some cases
which reduces trust on the network.
Logistic Regression
Logistic regression has been used typically in the analysis of binary outcomes. It is a
statistical method for prediction of probability of occurrence of an event which is
represented by 1 and a non-event by 0. Predictor variable in logistic regression can be
either qualitative or quantitative [38].
Figure 13 shows the importance of the different input variables identified by the logistic
regression model for predicting I100 (Hypertension).

55

Figure 13 Logistic Regression Predictor Importance

Decision Tree
A decision tree assigns probability to each of the possible choices based on the context
of the decision and acts as a decision-making device. Decision tree has attribute nodes
that can be linked to two or more subtrees. Brieman’s classification and regression tree
(CART) is one of the popular decision tree algorithms [16].
Random Forest
Random Forest is an algorithm where subsets of a given dataset are chosen and multiple
classification trees are generated. A forest is then created from the ensemble of these
trees. Random Forest has been used in diabetes research and is recommended as an
efficient algorithm for building predictive models [40].

56

Bayesian Network
Bayesian Network algorithm is based on probabilistic theory and represents a set of
variables and their dependencies in the form of an acyclic graph [41]. This could be used
to identify relationships between a disease and its associated symptoms. Bayesian
Network has been used in the past to predict T2D patients [42].
CHAID
CHAID (Chi Square Automatic Interaction Detection) is a decision tree technique [43].
CHAID can be used to find the relationship between input and target variables. It is to be
noted that CHAID can have limitations due to the sample size of predictor variable. CHAID
has also been used to build predictive models for diabetic patients [44].
The performance of a predictive model depends on various factors such as data quality,
structure of data and variable selection [36]. In 2013, a study was done comparing the
performance of logistic regression and artificial neural network (ANN) models for
identifying risk factors for diabetes mellitus using IBM SPSS Modeler [38]. Figure 14
shows the overall data modeling process, and Figure 15 shows the process specifically
for partitioned data. The dataset consisted of 229 diabetes patients, 69.9% of whom had
uncontrolled blood glucose level. Results revealed that ANN model had a higher
classification accuracy of 72.5% in comparison to logistic regression which had an
accuracy of 69.9%. Similar results were recorded for partitioned data with ANN model
having an accuracy of 72.5% while logistic regression had an accuracy of 71.35%.

57

Figure 14 Data Mining Process for Entire Sample Data [38]

Figure 15 Data Mining Process for Partitioned Sample Data [38]

58

3.7

IBM SPSS Modeler

A number of predictive modeling tools such as R, Weka, Orange, Rapid Miner, GraphLab
Create, Octave and IBM SPSS are available. The IBM SPSS Modeler is a graphical data
science and predictive analytics platform which assists in providing insights to improve
decision making. For this research, IBM SPSS Modeler v18.1 [22] was chosen due to its
features listed below:
·

Advanced statistical analysis

·

Easy to use and flexible

·

Supports multiple data sources including Microsoft SQL database

·

Graduate licensing packages for students.

·

Supports all phases of data mining including model development, deployment and
refreshing

·

Ability to merge data from multiple sources

·

Options to choose advanced data mining algorithms to build predictive models

·

FS algorithm which helps identify important variables

·

Supports use of smaller datasets

·

Scalable platform makes it accessible to users of different skill levels

·

Automated data preparation and modeling

·

Visual analytics

·

Multiple deployment methods

59

3.8

Challenges

Data

The major challenge was the selection of appropriate variables as there was a lot of
diagnostic data with minimal records. Data inconsistency was another issue as there were
patients who had missing diagnosis codes on readmission. Both these challenges are
elaborated in detail below. Each patient had been recorded with at least one or more
diagnosis codes for every admission. The Table 5 below shows the distribution of
diagnosis for patients.

Table 5 Diagnosis/Patient Distribution

No of Diagnosis

No of Patients

2-5

7,140

6-10

3,382

11-15

1,490

16-20

756

>20

1,044

To filter and select the diagnosis codes which were relevant to this research was a tedious
task. This has been elaborated in the data preprocessing section. Even after filtering and
selecting prominent twenty diagnostic codes, there were inconsistencies found with the
data. For instance, there were patients who had been recorded with T2D the first time

60

they came in and on readmission, the same patient was recorded with Type 1 diabetes.
An example of this case scenario is shown for a patient in Figure 16.

Figure 16 Data Inconsistency Example

The data inconsistency issue raises the question that what happened to the patient’s
previous diagnosis of T2D on the first admission.
Another challenge was that not all of the recorded diagnosis codes were repeating on
readmission. For instance, diagnosis code I100 (hypertension) was recorded for a patient
on the initial admission but this diagnosis code was not recorded upon readmission
(Figure 17). This data inconsistency made it extremely challenging to build the predictive
models with longitudinal data. To resolve this issue, the dataset used for building the
predictive models retained all diagnosis codes for a patient if it was recorded in any of
their admissions.

61

Figure 17 Data Inconsistency I100 (hypertension)

To illustrate this, the diagnosis code I100 (hypertension) was retained for the patient even
though it was not recorded in the recurring admissions (Figure 17). For patient 110 (Figure
16) diagnosis codes for both T1D and T2D were retained. This ensured that no diagnosis
code was missed for building the predictive models. The corresponding dashboard is
consistent with this logic as well.

62

As mentioned earlier, all patients in this dataset were diabetic with diagnosis codes
recorded for T1D, T2D, other types of diabetes, diabetes insipidus and unspecified
diabetes. In total, there were 100 diagnosis codes for different types of diabetes with zero
or more associated comorbidities. These 100 diagnosis codes included diabetes
comorbidities such as E1123 (Type 2 diabetes mellitus with established or advanced
kidney disease) and E1128 (Type 2 diabetes mellitus with other specified kidney
complication not elsewhere classified). For kidney/renal comorbidities, there were a total
of 84 diagnosis codes. The word ‘kidney’ was found in 45 diagnosis codes and 39
diagnosis codes contained the word ‘renal’. Some of the patients who were recorded with
diagnosis codes such as E1123 were also recorded with other diagnosis codes such as
N179 (Acute renal failure) (Figure 18). The American Urological Association quotes renal
as a synonym for kidney [45], this was an interesting observation which could lead to the
possibility of combining diagnosis codes with the word renal/kidney under one umbrella.
However, there were 84 diagnosis codes for words kidney/renal alone and they included
different types of diabetes as well as other comorbidities. To combine these meaningfully,
a considerable medical background would be required. Thus, it was not explored further
in this research. All of these challenges made it tedious to finalize the predictor variables
filtering the different diagnosis codes. Thus diagnosis codes with more prominence were
chosen as the target variables.
In Figure 18, it can also be seen that a patient was recorded with T2D on their first visit,
but the same patient was recorded with unspecified diabetes in a subsequent admission.
Considering these anomalies in the dataset, the predictive models were built for all
diabetic patients instead of a specific diabetic type.

63

Figure 18 Diabetic Patient with Kidney Disease

Please note that the stay code is unique for each patient visit and no patient information
is included in Figure 16, Figure 17 and Figure 18 to ensure patient privacy.

Integration
Software from different vendors such as Microsoft SQL [46] for database, Microsoft SSRS
[47] for dashboard and IBM SPSS Modeler [22] for building predictive models required
integration packages to be built. For instance, to connect the Microsoft SQL data source
[46] with IBM modeler [22], an ODBC (Open Database Connectivity) data source had to
be created. Similarly, the results from SPSS modeler had to be exported to MS SQL
database for analysis using SSRS [47].

64

3.9

Data Visualization

3.9.1 Dashboard

Dashboards are an effective tool which allow visualization of large amounts of data in an
intuitive manner. The key performance indicators are integrated into visual displays which
can be further drilled down for finer granularity. These do not only provide insight into
data, but can be effective for quickly finding information. For instance, it was
demonstrated that the mean time to find all data elements was 6.3 minutes using
conventional approach compared to 1.9 minutes using a diabetes dashboard. The
research further established that analysis tools like dashboards can be an insightful asset
for healthcare information technology [27]. As seen in Chapter 2, a similar research [28]
by the Canadian Diabetes Association established that health professionals were more
effective in treating diabetes patients when using a dashboard which provided knowledge
of other risk factors and associated guidelines [28]. Similarly, the patients who are
presented with a dashboard listing the risk factors tend to benefit from the knowledge
contained therein.
To build a dashboard, it is essential to choose an appropriate visualization tool. A number
of tools are now available for this purpose. These include Tableau [48], QlikView [49],
Datawrapper [50], Fusioncharts [51] and SSRS [47]. In this research, SSRS was selected

65

due to its simplicity, capability to produce interactive visualizations and ability to adjust
with fast changing datasets.

3.10 SSRS

SSRS is a business intelligence module which enables users to create visually appealing
reports via charts, maps and dashboards [47]. SSRS provides features including:
·

Compatibility with different data sources ranging from simple Excel sheets to
databases

·

Interactive sorting capabilities

·

Drilldown/Drillthrough reporting

·

Security via access controls

·

Intuitive Visualization

·

Export features – reports can be exported to various formats including Word,
Excel, PowerPoint, pdf, TIFF, MHTML, CSV, and XML

SSRS requires a backend database and a wrapper for rendering reports in a browser.
For this research, the Microsoft SQL server database was used.
SSRS reporting has been effectively used in healthcare to maximize profits, minimize
risks, reduce costs and enhance patient experience. In addition, it has also been used for
presenting data in a user-friendly form (Figure 19).

66

Figure 19 COVID Prevalence in the World [29]

3.11 Summary

The proposed models are based on several data mining algorithms and produce datadriven results which can assist physicians in developing effective treatment plans. The
dashboard presents insightful information such as obesity rates based on geographical
locations, food habits and the overall trend for diabetes over the years. Such information
can educate the users about diabetes and its impact on health. The assessment tool,
which uses results from predictive models, would facilitate users to be better informed
about diabetes comorbidities. The existing diabetes risk score calculators are mostly

67

based on limited paper based questionnaires and not easily accessible. In summary, the
study methodology consists of the following steps:
·

NH dataset was imported into SQL database

·

Data preprocessing techniques were used to eliminate redundant information.

·

Data analysis allowed selection of the input and three predictor variables for
diabetes comorbidities.

·

The database served as a data source for the IBM SPSS Modeler and was used
to evaluate relative performance of various data mining algorithms.

·

The accuracy of each algorithm was determined using training and testing
datasets.

·

The results from the models were displayed using SSRS via an interactive
dashboard.

·

A user-friendly tool was developed to calculate the risk of developing comorbidities
for individual patients.

·

The dashboard was integrated with the diabetes assessment tool.

The information provided by the predictive model will be both helpful and insightful for
diabetes patients as well as non-diabetic users to have a better understanding of their
health and act as an effective indicator to further discuss with a healthcare practitioner.

68

Chapter 4
Experiments and Results

In this chapter, the experiments and results of this research are discussed. This chapter
is split into two parts. First, the diabetes dashboard is explained, and then results from
the predictive modeling for three target variables are presented. For both these parts, the
NH clinical dataset was used which included only diabetic patients who accessed one or
more of the NH facilities between the period 2012-2018.
The Diabetes Dashboard consists of three main reports with drilldown capabilities. The
first report shows overall aggregated statistics for the NH diabetes dataset; the second
report is the Diabetes Types and Comorbidities dashboard which shows the prominent
comorbidities for patients with different types of diabetes and also includes the prominent
Primary Diagnosis Codes. The third report is the HSDA comparison which shows
aggregated patients and admission statistics across the three Health Service Delivery
Areas - Northwest, Northeast and Northern Interior.
Predictive Modeling is done using five base classification algorithms together with their
ensemble for three comorbidities (Hypertension, Congestive Heart Failure and Acute
Renal Failure) using IBM SPSS Modeler. The results for each target variable is shown
with the corresponding explanation and analysis. The relationships found between the
input and target variables using the FS algorithm are also explained followed by a
summary of the analysis.
69

4.1

Diabetes Dashboard

Figure 20 Diabetes Dashboard

Figure 20 shows the main diabetes dashboard which displays the clinical data sliced
along various dimensions including population, diagnosis codes, diabetes types,
admissions and comorbidities for patients admitted in NH facilities over the years. The
dashboard also allows navigation to reports at a finer granularity via drilldowns. Each of
the charts/tables included in this dashboard is further explained below. The image on the
top-right was obtained from Diabetes Canada [8].

70

Figure 21 Diabetes Dashboard Overall Statistics

Figure 21 shows an overview of aggregated statistics obtained from the NH clinical
dataset. The top row shows that there were a total of 14,021 patients with 34,824
admissions which averages out to approximately three admissions per patient. Out of
these patients, 12% with T1D, 80% were diagnosed with T2D, and remaining 8% with
other types of diabetes (includes diabetes insipidus, other and unspecified types). The
Diagnosis/Patient Statistics table shows that the average age of all patients over the years
was 63 and an average of 2,337 patients were admitted each year. Another observation
was that the admitted patients recorded an average of four diagnoses from the possible
4,592 diagnosis codes. The maximum number of diagnosis codes recorded for a patient
was 89, there were four patients who recorded more than 80 comorbidities and fifty
patients who recorded between 50-80 comorbidities. A detailed breakdown of
comorbidities is shown in Figure 25. The number of diabetic patients and admissions by
province is also shown in Figure 21. Drilldown from this chart shows these numbers for
each LHA specific to British Columbia (Figure 22). It should be noted that the higher
number of patients in the drilldown is due to the patient migration which records the patient

71

more than once. However, this anomaly does not impact the model. LHAs with fewer than
ten patients and those recorded as ‘Unknown‘ were grouped in a single category labeled
as ‘Other’.

Figure 22 Diabetes Dashboard - Patients/Admissions Drilldown

72

Figure 23 Diabetes Dashboard - Patients/Admissions by Year

Figure 23 shows the number of patients and admissions for each year from 2012/13 to
2017/18. The maximum number of patients were recorded for the year 2012/13 (3,763)
and the minimum was in 2016/17 (1,672), also the patients were consistently decreasing
till 2017/18 followed by a slight increase in 2017/18 (1,848). This trend is consistent with
Statistics Canada numbers. In 2012/13, 5.7% of British Columbia residents were
diagnosed with diabetes which was lower than national average of 6.5%. In the following
years (2013/14, 2014/15) this number dropped to 5.5% compared with the national
average of 6.6% (2013/14) and 6.5% (2014/15). In 2017/18, the national average went
up to 7.3% and British Columbia recorded a corresponding increase to 5.9%. On the other
hand, the admissions trend is not consistent with the trend observed for number of
patients. In 2015/16, the number of admissions (5,443) was lowest and 2017/18 recorded

73

highest number of admissions (6,483) even though the number of patients was almost
identical. It should be noted that the admissions include patients from previous years and
also readmissions of the same patient. For instance, the year 2017/18 recorded 6,483
admissions for the cumulative number of patients (14,021) and not the new patients
(1,848) only.

Figure 24 Diabetes Dashboard – Patients by Diabetes Type (Yearly)

Figure 24 shows the number of patients by diabetes types (T1D, T2D and Other). The
number of patients consistently decrease for T2D until 2016/17 (1,320) from 2012/13
(3,021) and then increases slightly in 2017/18 (1,413). A similar pattern was observed for
T1D patients. These observations are consistent with Figure 23 which showed

an

increase of patients in 2017/18. For other types of diabetic patients, a different trend was
observed which recorded the lowest number of patients in 2017/18 (130) and the highest
number in 2013/14 (308). Since this group represents only 8% of the total number of
patients, the impact on the overall trend is relatively insignificant.

74

Figure 25 Patients with Comorbidities

Figure 25 shows the comorbidities per patient. The average number of diagnosis for a
patient was observed as four with the majority of patients (7,140) having two to five
comorbidities. There were fewer patients with higher number of comorbidities. The lowest
number of patients (756) was recorded for 16-20 comorbidities and then an increase was
observed for 20+ comorbidities. On further breakdown for patients with greater than
twenty comorbidities, it was observed that 990 patients recorded 20-40 comorbidities,
thirty-three patients recorded 50-70 comorbidities, seventeen patients had 60-80
comorbidities and only four patients recorded over 80 comorbidities. As mentioned earlier,
all patients in the NH dataset had at least one type of diabetes (T1D, T2D or other). The
209 patients shown in Figure 25 are those who had only one diabetes diagnosis code
recorded.

75

Figure 26 Diabetes Dashboard – Prominent LHAs with Diabetic Patients

Figure 26 shows the prominent communities which had the highest number of diabetic
patients. The dataset consisted of 305 communities and seventy LHAs of which Prince
George recorded the maximum number of patients consistently over the years. The
University Hospital of Northern British Columbia in Prince George accounted for 53% of
the total patients and 47% of overall admissions. The GR Baker Memorial Hospital in
Quesnel accounted for 7% of the total patients and 13% of the overall admissions. It was
also observed that all communities showed an increase of patients in the year 2017/18
from the previous year making it consistent with the trends noted earlier (Figure 23 and
Figure 24). Since Prince George and Quesnel are categorized as both LHA as well as
communities, these names will refer to one or the other depending on the context. Peace
River South and Peace River North consists of thirteen and sixteen communities,
respectively. Quesnel has three communities and Prince George has fifteen communities

76

including itself. It is also to be noted that the LHAs are specific to the patient and not to
the facilities. For instance, a patient can have their community recorded as Quesnel and
still be admitted to a facility in Prince George. The top ten LHAs with the maximum number
of patients were Prince George, Peace River South, Quesnel, Peace River North,
Terrace, Nechako, Prince Rupert, Kitimat, Smithers and Burns Lake.

Figure 27 Diabetes Dashboard - Prevalence of Diabetes by LHAs

Figure 27 shows prevalence of diabetes per thousand of the population. The population
figures were obtained from Census Canada (Peace River North and Peace River South
- 2001; Prince George and Quesnel - 2016).
Similar to
Figure 26 , Prince George recorded the maximum prevalence per thousand residents over

the study period. An interesting observation is that while Prince George and Quesnel did

77

not show any change between 2016/17 and 2017/18, both Peace River North and Peace
River South showed a slight increase over the same period. This is consistent with
Figure 26 where a spike in the number of patients was observed for both Peace River
North (32%) and Peace River (11%) South during this period.

4.1.1 Diabetes Types and Comorbidities

78

Figure 28 Diabetes Types/Comorbidities Dashboard

Figure 28 shows the overall aggregated statistics broken down by diagnosis codes
specific to the types of diabetes and comorbidities. Using charts and tables, the clinical
data from NH has been sliced along various patient groups (T1D, T2D and other types of
diabetes) and diagnosis. Each of these charts is explained below along with the
drilldowns, where applicable. The image on the bottom right has been taken from
Diabetes Canada [8].

Figure 29 Diabetes Types/Comorbidities Dashboard Statistics

Figure 29 shows vital statistics related to comorbidities. Out of 14,021 patients, it was
observed that one in five had hypertension and one in ten had heart/renal failure. These
three comorbidities accounted for 39% of the total patients and 22% of the total
admissions. This observation also became the basis of selection of the three target
variables identified in chapter 3.

79

Figure 30 Diabetes Types/Comorbidities Dashboard - Diagnosis Codes/ Diabetes Types

The diagnosis codes table in Figure 30 are grouped by different types of diabetes and
other comorbidities. It was observed that comorbidities accounted for 98% of the total
diagnosis codes.
Figure 30 also shows the number of patients with different types of diabetes. ‘Other’ type
of diabetes includes diabetes insipidus and unspecified diabetes types. It is observed that
80% of the total patients had T2D and the remaining 20% had T1D or Other types of
diabetes. The comorbidities specific to diabetes types are shown in Figure 32 and Figure
33.

80

Figure 31 Diabetes Types/Comorbidities Dashboard - Diagnosis Codes/ Diabetes Types

Upon admission, multiple diagnosis codes are normally entered, one of which becomes
the primary ‘most responsible’ code. Figure 31 shows the top five primary diagnosis codes
which account for 23% of total patients and 15% of total admissions. In this figure, while
H251 is showing the maximum number of patients’ primary diagnosis, it is not the case
when all diagnosis types are included. For example, H251 accounted only for 4% of the
total admissions and 7% of the total patients. Thus, it was not identified as a target
variable when building the model. It is observed that when H251 and I500 were included
in the diagnosis set for the patient, they were recorded as primary diagnosis in 98% and
48% of the cases, respectively. The description for the diagnosis codes is shown in the
table in Figure 31.

81

Figure 32 Diabetes Comorbidities Dashboard- T2D Comorbidities

Figure 32 shows the top five comorbidities for patients with T2D together with their
corresponding description. It was observed that 65% of the patients with T2D were
diagnosed with one or more of these comorbidities, 48% were diagnosed with one or
more of the top three comorbidities (I100, I500, N179). These three comorbidities were
selected as target variables for the predictive model.

82

Figure 33 Diabetes Comorbidities Dashboard- T1D/Other Diabetes Comorbidities

Figure 33 shows the top five comorbidities diagnosed for patients with T1D or any other
types of diabetes excluding T2D. These comorbidities represented 95% of the total
patients in this group. The three target variables (I100, I500, N179) selected for building
the models accounted for 63% of patients. The top two comorbidities (hypertension and
congestive heart failure) are the same in both sets (Figure 32 and Figure 33). However,
the third and fourth comorbidities (N390 and N179) are reversed in the two sets. N179
was selected as the target variable because of its high cumulative impact. Figure 34

83

shows the top five diagnosis codes embedded with different types of diabetes. Four of
these codes (starting with ‘E11’) represent T2D patients which can be attributed to the
fact that majority of the patients in this dataset have been diagnosed with T2D.

Figure 34 Comorbidities Dashboard- Diabetes Specific Diagnosis Codes

84

Figure 35 Diabetes Types/Comorbidities Dashboard- Diabetes Diagnosis Codes Drilldown

Cumulatively, the total number of patients represented by these codes exceed 14,021
patients because the same patient can be diagnosed with multiple codes. This issue was
not obvious in earlier charts because the patients were either filtered by diabetes types
85

or by primary admissions. Figure 35 shows the drilldown report which lists the top twenty
diabetes specific diagnosis for all patients.
4.1.2 HSDA Comparison

Figure 36 Diabetes HSDA Dashboard

Figure 36 shows a comparison of aggregated statistics for each of the three HSDAs –
Northwest (NW), Northern Interior (NI), Northeast (NE) - which recorded 24%, 57%, and
19% of the total patients, respectively. An interesting observation was that 6% of the

86

patients migrated to other communities and were thus counted more than once. This,
however, does not impact the number of visits because those are recorded independent
of the patient’s community. On average, approximately two admissions per patient were
recorded across all HSDAs, including patients from outside of BC. Even though NI
recorded majority of the patients as well as admissions, the average length of stay (LOS)
was very similar across all HSDAs. A similar pattern was also observed for patients who
had family physicians. In BC, there were a total of nineteen communities which recorded
over 100 patients for the years 2012/13 to 2017/18. Among these, Fort Nelson had 85%
of patients without a family doctor followed by Fort St. James, Houston, Queen Charlotte
and Burns Lake (73%, 42%, 40%, 33%). The five communities with the highest number
of patients (Prince George, Quesnel, Fort St. John, Terrace and Dawson Creek) had 14%,
27%, 18%,14% and 11% patients with no family doctors, respectively. NE had the highest
number of patients visiting from outside of BC. The facilities visited most by these patients
were Dawson Creek District Hospital (69 patients), Fort St. John General Hospital (37
patients) and University Hospital of Northern British Columbia (28 patients). All of these
patients were from Alberta.
The number of patients who were only recorded with only one diagnosis code was less
than 2% in each of the HSDAs. Patients with two to five comorbidities represented 53%,
50% and 59% of the total number of patients in NW, NI and NE, respectively. Patients
with six or more comorbidities were 45%, 49% and 39% for the same HSDAs,
respectively.
While 85% of total patients had a diagnosis code related to T2D, the three HSDAs had a
variation ranging from 72% (NE) to 90% (NI); NW was closer to the overall average (82%).

87

Of the eighteen facilities across all HSDAs, UHNBC admitted 50% of the total patients
followed by Mills Memorial (11%) Hospital and Fort St. John General Hospital (10%). The
lowest number of patients was admitted by McBride & District Hospital (0.5%).
Figure 37 shows the annual breakdown of cumulative visits and number of patients across

all HSDAs. This drilldown is obtained by clicking on one of the HSDA maps in Figure 36.

Figure 37 HSDA Dashboard - Patients/Visits Drilldown

88

4.1.3 Summary

An interactive diabetes dashboard was developed using dataset consisting of diabetic
patients who had accessed Northern Health facilities from the years 2012 to 2018. The
dashboard consisted of three main reports: 1) the diabetes dashboard which contained
overall aggregated statistic of this dataset, 2) the diabetes types and comorbidities
dashboard where the data was grouped by different types of diabetes and comorbidities
of the patients, and 3) the HSDA dashboard which grouped the data by three HSDAs –
NE, NW and NI.
The following are a few observations which were made from these reports:
·

80% patients were diagnosed with T2D

·

Average age of patients was found to be sixty-three

·

Average Number of diagnosis per patient was four

·

Number of new patients were consistently decreasing till 2016/17 with a slight
increase in 2017/18

·

HSDA Northern Interior recorded 57% of the total number of patients where LHA
Prince George had maximum number of Patients and Admissions.

·

All three target variables (I100, I500, N179) were recorded as one of the top five
comorbidities for T2D patients (excluding the diabetes diagnosis codes)

This dashboard also had drilldown capabilities to view reports at finer granularity by
various parameters such as HSDA, LHA and patient comorbidities.

89

4.2

Predictive Modeling

Predictive modeling is the process of applying data mining algorithms on historical data
to predict the likelihood of future outcomes. For the three target variables (I100, I500,
N179), predictive models were built using six data mining algorithms. The corresponding
results are explained in this section.
The six data mining algorithms chosen for building the predictive models were:
1. Bayesian Network
2. Neural Network
3. Random Forest
4. Logistic Regression
5. CHAID
6. Ensemble

4.2.1 Training Models

For each of the three target variables, the dataset was split such that the training
component contained 70% of the patients diagnosed with the corresponding target
variable. The remaining 30% was then used for testing. This resulted in a patient
distribution as shown in Table 6. For instance, N179 had a total of 1,303 patients who
were split into training (913) and testing (390) datasets, respectively. This number (1,303)
represents 9.3% of the total number of patients. In order to maintain this 70:30 ratio, the
desired number of patients in the dataset was then determined which in this case was
90

8,274 (59% of the total patient population). The remainder was used for testing. This
method was consistently applied to all target variables. A higher percentage of records in
the training dataset allowed the models to learn the underlying patterns better, which
helped in making better predictions.
Table 6 Training/Testing Datasets

Target Variable
I100
I500
N179

Training
66%
65%
59%

Testing
34%
35%
41%

Total Patients
18.9% (2,656)
9.9% (1,385)
9.3% (1,303)

Figure 38 shows the training model for prediction of I100 (hypertension) using five data
mining algorithms with different nodes. Each of these nodes is explained below:

91

Figure 38 Predictive Modeling Training

92

SQL Access: This is the data source node which establishes a connection to diabetes
database and extracts the dataset consisting of the finalized twenty-six variables for
building the predictive models. Twenty-four of these variables were the input variables
and the remaining two variables were excluded because they were either a unique
identifier (Patient Code) or the target variable (I100).
Type: The Type node is used to specify the data type of the selected variables as either
nominal, categorical, continuous, flag or ordinal. This node allows to specify whether a
variable is input or target. Additionally, it also gives an option to specify one variable as
the unique identifier (Patient Code).
The twenty diagnosis codes and Physician Code were all assigned as a flag including the
target variable. The flag datatype is used for variables which have binary values, such as
0 or 1. Patient Code, Age, Average Length of Stay were assigned as continuous which is
used to describe numeric values including decimals. Facility Health Service Delivery Area
and Facility Name were assigned as nominal which is used for storing string values. For
this predictive model, I100 was set as the target variable and the remaining variables
were the input. Figure 39 shows the twenty-six variables with this information, where the
Measurement column shows the data type, the Values column shows the sample values,
the Missing column shows missing values in the dataset, the Check column specifies if a
variable needs to be excluded, and the Role column specifies the variable as input, target
or unique identifier.

93

Data Mining Model Node: The Type node is connected to the data mining model nodes
each of which represent one of the five (Bayes Network, Neural Network, Random Forest,
Logistic Regression, CHAID) algorithms. The Ensemble algorithm is not shown as it is
explained later in this chapter. Executing these nodes generates the model nugget which
contains the results of the trained model for the selected algorithm.

Figure 39 Predictive Modeling Training - Type Node

Analysis Node: The results from the model nugget are connected to the analysis node
which analyzes the prediction accuracy of the model. An example of analysis node for
predicting I100 using neural network is shown in Figure 40 where approximately 81% of
the total predictions were correct and 19% were wrong.

94

Figure 40 Predictive Modeling Training - Analysis Node

95

4.2.2 Testing Models

Figure 41 Predictive Modeling Testing

96

Figure 41 shows the testing model used for predicting one of the target variables (I100).
The nodes shown are explained below:
SQL Access: The data sources represent the testing dataset with 30% of patients with
I100. A major difference between the testing and training data source is that the former
does not contain information of the corresponding target variable.
Type: The data type node used in testing is identical to the one used in training with the
exception of target variable. It is necessary for the training and testing to have identical
input variables with the specified data types for successful execution. An example of the
type node used for testing is shown in Figure 42 where there is no target variable (I100)
information being sent to the trained model nugget.
Trained Model Nuggets: These nuggets possess the required information to predict the
target variable. Executing these trained models generate the results of one of the five
corresponding data mining algorithms (Bayesian Network, Neural Network, Random
Forest, Logistic Regression, CHAID). These results include the predicted values of the
target variable (I100) which is pushed to an output table.
Output Table: This table contains the results of the executed training model nugget along
with the other input variables. The predicted values of the five algorithms were evaluated
for accuracy and are explained in the data analysis section.

97

Figure 42 Predictive Modeling Testing - Type Node

98

Figure 43 Predictive Modeling Ensemble Training/Testing

Figure 43 shows the training and testing models for the Ensemble algorithm. The SQL
access node and the Type node are identical to the ones used in training (Figure 38) and
testing (Figure 41), respectively. Ensemble training and testing is explained below:

99

4.2.3 Ensemble

Ensemble Model Training
The Ensemble node combines results of predictions for the target variable (I100) from
the five trained models (Bayesian Network, Neural Network, Random Forest, Logistic
Regression, CHAID) and generates a field containing the aggregated results. The
Ensemble training results were observed by connecting the Ensemble node to the
analysis node. It can be seen that the Type node is connected to only one model nugget
(Bayesian Network). This is because the data types of the variables are fetched from the
first model nugget (Bayesian Network) and then passed to the other four model nuggets
followed by the Ensemble node. In Figure 38, the Type node was connected individually
to the five data mining model nuggets, as each model fetched the data types of variables
independently.
Ensemble Model Testing
The Ensemble model testing is very similar to that for the other five algorithms (Figure
41). The only difference is that instead of connecting individual model nuggets to the
output table, the five model nuggets are connected to each other and then to the
Ensemble node. This node is then connected to the Table node which generates the
aggregated results. These results are evaluated for accuracy by comparing with existing
data.
The process described above is also implemented for the other two target variables (I500,
N179).

100

4.2.4 Analysis of Results

The results generated for the five base algorithms and Ensemble were evaluated for
accuracy using the process described below:
The results from the output table for all testing models (Figure 41 and Figure 43) were
pushed into the diabetes database. Since this table did not contain the target variable, it
was added using a SQL query. The predicted column and the existing target variable
information was compared for each row and the statistical accuracy of predictions was
computed as follows:
=
For instance, the number of accurate predictions for I100 (testing dataset) using
Bayesian Network was 3,976. The total number of values in the testing dataset was 4,765
which gave an accuracy of 83.4%. Similarly, the accuracy was calculated for the
remaining algorithms for all target variables (I100, I500, N179).
It was also observed that the accuracy of predictions for all algorithms was consistently
better for true negative cases compared to true positives.

101

Figure 44 Predictive Modeling - I100 Results

Figure 44 shows the accuracy of the trained models for target variable I100 with a total of
4,765 patients. Ensemble and Logistic Regression had the highest accuracy for predicting
patients with or without I100 (hypertension). Both these algorithms recorded identical
accuracies of 84.15%. Bayesian Network, CHAID, Neural Network and Logistic
regression made accurate predictions for 3,976, 3,979 and 4,003 patients, respectively,
giving an accuracy as shown in Figure 44. The low accuracy (81.7%) of Random Forest
can be attributed to overfitting problem which is one of the drawbacks of this algorithm.
This dataset has 83% patients without hypertension and 17% (796) patients who were
diagnosed with I100 (hypertension). For patients without hypertension, the six algorithms
have an average accuracy of 97.2%. However, the average accuracy for those with

102

hypertension is only 15%. The reason for this low accuracy is the small number of patients
in this group for the training dataset.

Specifically, there were 2,656 patients with

hypertension which is only 18.9% of the total patients (14,016). This is the reason for
70:30 split of the testing and training datasets of the total (2,656) patients diagnosed with
target variable I100. A smaller number of patients in the training dataset would have
resulted in an even lower accuracy. The chosen distribution also ensured that both
training and testing datasets contained patients in proportion to the entire database.

Figure 45 Predictive Modeling - I500 Results

Figure 45 shows the accuracy of the trained models for target variable I500 (Congestive
Heart Failure) for a total of 4,959 patients. Ensemble recorded 92.61% accuracy followed
by Neural Network with 92.57%. Logistic Regression, Bayesian Network and CHAID had

103

accuracies of 92.5%, 92.5%, and 92.3%, respectively. There were 415 (70%) and 970
(30%) patients diagnosed with I500 in the training and testing datasets, respectively.
These datasets were used for all six algorithms. The average accuracy to predict patients
with and without congestive heart failure was 29.1% and 98.7%, respectively. As
explained earlier, the smaller number of patients in the training dataset for this group
(patients diagnosed with I500) contributed to the low accuracy. It is to be noted that
diagnosed I500 patients were only 9.8% of the total number of patients (14,016) in the
entire database.

Figure 46 Predictive Modeling - N179 Results

Figure 46 shows the overall accuracy of all algorithms for target variable N179 (Acute
Renal Failure). CHAID and Ensemble had accuracies of 96.77% and 96.74%,
respectively followed by Logistic Regression with 96.55%. Random Forest and Neural
Network recorded an identical accuracy of 96.37% and Bayesian Network had an
accuracy of 96%. Within the database, there were a total of 1,303 patients who were

104

diagnosed with N179; these patients were split into training and testing datasets in the
ratio of 70:30. The average accuracy for predicting patients with and without N179 is
63.7% (Figure 47) and 98.8%, respectively. It was observed that CHAID had the highest
accuracy of 67.7% followed by Ensemble with 66.4%. Bayesian Network, Logistic
Regression and Neural Network had accuracies of 63.8%, 63.6% and 61%, respectively.
For reasons mentioned earlier, Random Forest had the lowest accuracy (59.7%) for
predicting patients diagnosed with N179.

Figure 47 Predictive Modeling Accuracy for Patients with N179

It was observed that as the percentage of patients for a target variable decreased, there
was an increase in accuracy of predicting true positives across all algorithms. Since, N179
had the lowest percentage of diagnosed patients, it had higher accuracy for predicting
true positive cases.
105

4.2.5 Analysis of Variables

Figure 48 Predictive Modeling using Feature Selection (I100, I500, N179)

Figure 48 shows the predictive models using Feature Selection (FS) algorithm for the
three target variables (I100, I500, N179) and their ranking is shown in Figure 49.

106

Figure 49 Feature Selection Results (I100, I500, N179)

The dataset used was identical with the exception of target variables. It can be observed
that the diagnosis codes E1152, E1164, E119 and E149 were identified among the top
ten variables consistently for all three target variables. This can be attributed to the large
number of patients diagnosed with these codes (Table 7). It can be seen that I100 is also
included as one of the top three important variables for predicting I500 as well as N179.
107

Codes E119 and E149 consistently rank outside the top five variables. These two codes
specify diabetes patients without mention of complications which indicates that the
probability of these patients to be diagnosed with other comorbidities is relatively low.
45% of patients diagnosed with E119 and 38% of patients diagnosed with E149 had no
other comorbidities recorded in this dataset. In contrast, 16% of patients diagnosed with
E1152 and 14% of patients diagnosed with E1164 had no other comorbidities recorded.
On further analysis, it was observed that 75% of E119 patients and 78% of E149 patients
did not have either of I100, I500 and N179. Similarly, 49% of E1152 and 39% of E1164
patients were diagnosed with at least one or more of the three comorbidities (target
variables) resulting in both of these codes to be ranked in the top five. An additional
observation was that only two percent of the patients included all three target variables in
their diagnosis. This is the reason N179 or I500 does not rank as important variables
while predicting the others. The other variables such as Facility Name and Facility Health
Service Delivery Area also show up as important because a majority of patients in this
dataset were admitted to University Hospital of Northern British Columbia in HSDA
‘Northern Interior’.
The average age of patients in the dataset was 63 years and the average Total Length of
Stay was seven days. Both of these variables were ranked as important. Though
Physician Code was listed as one of the top ten variables, there was no substantial
relationship found by FS with any of the target variables. The other diagnosis codes were
not listed as important as they all had a lower percentage of diagnosed patients.

108

Table 7 Top Seven Diagnosis Codes

Code
E119
E1152
I100
E149
E1164
I500
N179

Diagnosis Description
Type 2 diabetes mellitus without (mention of) complications
Type 2 diabetes mellitus with certain circulatory complications
Benign hypertension
Unspecified diabetes mellitus without (mention of) complication
Type 2 diabetes mellitus with poor control, so described
Congestive heart failure
Acute renal failure

Patients
7,956
3,763
2,656
2,105
1,674
1,385
1,303

4.2.6 Diabetes Comorbidities Assessment Tool

A physician-friendly, interactive web form has been built to predict the likelihood of a
patient to be diagnosed with one of the three comorbidities (I100, I500, N179) in future.
An example for predicting I100 (hypertension) using this tool is shown in
Figure 50 (input) and Figure 51 (output). The user input is given for all input variables
excluding I100 which is the target variable. The Field column lists the input variables, the
Storage column shows the data type, and Values column is where the user enters the
input. It is to be noted that all string values need to be entered in double quotes and
storage type is different from the data type of the variables which was explained earlier.
Executing this web form runs the model in the background and generates output shown
in Figure 51. The Ensemble algorithm is used in this case because it had the highest

109

accuracy for predicting I100 among all six algorithms. The web form can be connected to
any of the six algorithms.

Figure 50 Diabetes Comorbidities Tool - User Input

110

Figure 51 Diabetes Comorbidities Tool - Output for I100

Figure 51 shows the output which contains the I100 diagnosis and prediction probability
for a patient with specified history. For example, a predicted value of 1 indicates that the
patient will have hypertension in future, and there is a probability of 64% for this to
happen. This patient was diagnosed with multiple comorbidities, which included the other
two target variables (I500, N179). This can be the reason for a high probability of 64%.

111

The prediction was in conformance with the actual data of this patient (11219) who was
in fact diagnosed with hypertension. Similar assessment tools were built for other two
target variables (I500, N179) using the Ensemble algorithm.

4.2.7 Summary

In the above experiments, an interesting observation was that a decrease in percentage
of diagnosed patients for a target variable leads to an increase in predicted values for the
corresponding patient groups. For example, I100, I500 and N179 had 18.9%, 9.9% and
9.3% of the total patients, respectively with average prediction accuracies of 83.5%,
92.4% and 96.5%, respectively. The reason is that the algorithms are able to train the
models better when there is a lower number of patients. I100 and N179 had the highest
(2,656) and lowest number of patients (1,303) with average corresponding accuracies of
15% and 63.7% when predicting their respective diagnosis.
It is observed that all algorithms perform relatively similar for each of the three target
variables due to the following reasons:
·

Auto Classifier node was used to identify the data mining algorithms with high
accuracies for all three target variables.

·

As mentioned in Chapter 3, only the important variables identified by FS algorithm
were selected as input variables and passed to the models.

·

All twenty diagnosis codes had binary data (0,1) which included the target variable
that helped the five classification algorithms to make efficient predictions.

112

It is also to be noted that, Random Forest occasionally suffered from overfitting problem
which trained models to learn the noise thereby leading to negative impact on accuracy
[52].

113

Chapter 5

Conclusion and Future Work

Diabetes is a chronic disease whose prevalence is growing at a rapid rate throughout the
world. It has also been called the biggest epidemic of the twenty-first century. The number
of people with diabetes rose from 108 million in 1980 to 422 million in 2014 [53]. The
global prevalence of diabetes among adults over 18 years of age rose from 4.7% in 1980
to 8.5% in 2014 [1] . In Canada, one person is diagnosed with diabetes every three
minutes, and one in ten deaths are attributed to this disease. Due to this prevalence, it
has received global attention and vast amounts of data has been collected. Unfortunately,
this data exists in disparate repositories and has not been harnessed to its full potential.
However, it is now a well-known fact that diabetic patients must monitor their health
constantly because of a higher risk of developing additional comorbidities over time.
Hypertension and Acute Renal Failure have been found to be among the top four
comorbidities (Figure 52) [54]. Diabetes Canada [55] also reported that in almost every
clinical trial one third of the patients with Heart Failure also had diabetes. Additionally,
Heart Failure occurs in diabetic patients at an earlier age at a rate which is two to fourfold higher in comparison with non-diabetic patients [55]. An early intervention and
effective management is desirable to identify patients during the early stages of the
disease. To this end, there have been efforts towards developing predictive models and
114

assessment tools for improving quality of life and reducing burden on the healthcare
system. However, the existing solutions have several drawbacks as explained in Chapter
2.

Figure 52 Comorbidities for Hospitalized Diabetic Patients in Canada [52]

One of the key shortcomings of existing research is the use of non-clinical data which is
collected using surveys and self-administered questionnaires. The dataset used for this
research was obtained from Northern Health which exclusively comprised of diabetic
patients with either T1D, T2D or any other types of diabetes between the years 20122018. While this data contained only clinical records, it existed in the form of spreadsheets
which made it difficult to analyze across a variety of parameters. In order to make data
valuable for physicians and other stakeholders, several Key Performance Indicators
(KPIs) were identified which provided insight into historical trends and patterns for using
visual analytics. These metrics were presented in a visually appealing dashboard and

115

data was mined for predictive analysis. The developed models were then incorporated
into an interactive assessment tool.
This research had two major contributions. First, Predictive models were developed to
find the likelihood of one or more of three comorbidities - Benign Hypertension,
Congestive Heart Failure or Acute Renal Failure using six algorithms. The model results
were incorporated into a physician friendly assessment tool which is flexible to be
connected to one of the six algorithms to predict diagnosis and likelihood of one of the
three comorbidities. Results from the assessment tool can act as an effective guideline
for healthcare professionals to identify high-risk diabetic patients thereby ensuring
effective diabetes management to reduce costs on the healthcare system. Second, an
interactive diabetes dashboard was developed to show an overview of the current state
of diabetes for the years 2012-2018 in Northern Health. This dashboard was built with
drill down capabilities to view aggregated results at finer granularities for various
demographics (HSDA, LHA and Diagnoses). This research used a dataset specific to
Northern Health facilities with majority of patients from Northern BC. However, both the
dashboard as well as the predictive models have the capability to be extended to other
regions and provinces which would reflect on the assessment tool as well.

Diabetes Dashboard
The dashboard was built using the Microsoft BI tool stack with provisions for integrating
with diabetes dataset. The dashboard consists of three top-level reports. First, the main
dashboard displays overall statistics for 14,021 patients who recorded 34,824 admissions

116

in various Northern Health facilities. Second, a diabetes comorbidities dashboard
identified the prominent comorbidities for these patients. Third, a HSDA comparison
dashboard provides overall statistics for the three HSDAs – NE, NI and NW. These toplevel reports have drilldown capabilities to view reports at finer granularities. Several
observations were made from these reports. For instance, it was interesting to note that
51% of the patients had been diagnosed with between two to five comorbidities in addition
to diabetes. The three selected target variables (I100, I500, N179) were among the top
ten most prominent diagnosis codes recorded in the NH dataset. There was a consistent
decrease of new diabetic patients from 2012 to 2017 with a slight increase observed in
2018. In addition, it was noted that the average age of patients was found to be sixtythree.

Predictive Modeling

Patients diagnosed with diabetes can develop several other diseases over time. In this
research, the focus was on identifying diabetic patients who are at a higher risk of being
diagnosed with one or more of the following common comorbidities:
·

I100 (Benign Hypertension)

·

I500 (Congestive Heart Failure)

·

N179 (Acute Renal Failure)

The reason for choosing these comorbidities was the large number of patients in the
dataset who were diagnosed with at least one of these codes. For instance, there were
approximately 19% of patients diagnosed with I100 (Hypertension). The other two target

117

variables I500 (Congestive Heart Failure) and N179 (Acute Renal Failure) also ranked
among the top seven diagnosis codes with highest number of patients. Thus, these codes
provide a good representation and also demonstrate how other comorbidities can be
added to the study. Similarly, there are a number of data mining algorithms which are
available in SPSS modeler. The following six representative algorithms were chosen to
build our models:
·

Bayesian Network

·

Neural Network

·

Random Forest

·

Logistic Regression

·

CHAID

·

Ensemble

These six algorithms were evaluated for accuracy for the three target variables and
analyzed. The important input variables for each target variable was determined by a
built-in Feature Selection (FS) algorithm. It was observed that a decrease in the number
of patients for target variables resulted in an increase in the accuracy of all algorithms.
Another interesting observation was that Random Forest had a lower accuracy due to
overfitting. Overall, an accuracy of 83.5%, 92.4% and 96.5% was observed for I100, I500
and N179, respectively.
Finally, a Diabetes Comorbidities Assessment Tool was built which took input from the
user via an interactive web form and predicted the likelihood of one of the three target
variables. This tool is flexible and can be connected to any one of the six algorithms to

118

predicts the probability of a patient to be diagnosed with one of the three comorbidities in
future.

5.1

Future Work

The work presented in this thesis has demonstrated the importance of visual and
predictive analytics using clinical data. However, during the process several challenges
were encountered and a wish list for further work evolved. One of the characteristics of
the NH diabetes dataset was that the diabetes diagnosis codes were combined with other
comorbidities. For example, diagnosis code E1123 represents patients having “Type 2
diabetes mellitus with established or advanced kidney disease”. It would be more
desirable to have an exclusive code for recording the type of diabetes (T1D, T2D, etc)
and separate the comorbidities diagnosis of the patients. This can make it easier to
segregate patients with different type of diabetes and find out specific comorbidities of
patients as well. Since majority of patients in the dataset are diagnosed with T2D, it would
be interesting to create dataset with only T2D patients and run the existing models for all
three target variables. These results have the potential to reveal interesting correlations
which are specific to T2D patients and can help healthcare professionals as well as
patients to have a better understanding of their specific comorbidities.
The three selected target variables (I100, I500 and N179) had relatively fewer number of
diagnosed patients in the dataset which lead to reduced accuracy in predictive modeling
for those group of patients. It would be helpful to combine different diagnosis codes with

119

help of a Physician to increase the number of patients in these groups. For instance, the
word ‘Heart’ is there in thirty-five diagnosis codes and ‘Hypertension’ was found to be in
seventeen diagnosis codes. If all or at least, some of these codes can be combined, it
would increase the number of diagnosed patients for the corresponding target variables.
This increase of patients would reflect in the training dataset which can help enhance the
accuracy of the models.
Another recommendation would be to use the six algorithms for predicting the three target
variables and only choose the important variables identified by FS as shown in Figure 49.
This would eliminate some of the diagnosis codes which were included earlier. This could
potentially produce interesting comparative results on the performance of predictive
models.
It would be interesting to capture patient migration between communities and connect it
with admissions and number of patients in the corresponding facilities over the years.
Finally, adding time dimension to the metrics could allow a longitudinal study which could
also predict the timelines when a comorbidity is likely to occur.

120

References
[1]

Diabetes Canada, "What is Diabetes?," [Online]. Available: http://www.diabetes.ca/aboutdiabetes/types-of-diabetes. [Accessed 11 June 2020].

[2]

Government of Canada, "Types of Diabetes - Canada.Ca," 20 10 2016. [Online]. Available:
https://www.canada.ca/en/public-health/services/chronic-diseases/diabetes/typesdiabetes.html. [Accessed 11 June 2020].

[3]

Diabetes Canada, "Assess your risk of developing diabetes," 2020. [Online]. Available:
https://www.diabetes.ca/en-CA/type-2-risks/risk-factors---assessments. [Accessed 11
June 2020].

[4]

Government of Canada, "How to Prevent Type 2 Diabetes," 14 11 2008. [Online]. Available:
https://www.canada.ca/en/public-health/services/chronic-diseases/diabetes/prevent-type2-diabetes.html. [Accessed 11 June 2020].

[5]

L. C. Rosella, M. Lebenbaum, T. Fitzpatrick, A. Zuk and G. L. Booth, "Prevalence of
Prediabetes and Undiagnosed Diabetes in Canada (2007–2011) According to Fasting
Plasma Glucose and HbA1c Screening Criteria," Diabetes Care, vol. 38, no. 7, pp. 12991305, July 2015.

[6]

Canadian Community Health Survey, "Statistics Canada - Canadian Community Health
Survey - Annual Component," [Online]. [Accessed 11 June 2020].

[7]

Diabetes Canada, "WHY FEDERAL LEADERSHIP IS ESSENTIAL CONCERNING
DIABETES,"
[Online].
Available:
https://www.diabetes.ca/how-you-canhelp/advocate/why-federal-leadership-is-essential. [Accessed 20 Dec 2017].

[8]

Diabetes Canada, "Diabetes in Canada," February 2020. [Online]. Available:
https://www.diabetes.ca/DiabetesCanadaWebsite/media/Advocacy-andPolicy/Backgrounder/2020_Backgrounder_Canada_English_FINAL.pdf. [Accessed 11
June 2020].

[9]

I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas and I. Chouvarda,
"Machine Learning and Data Mining Methods in Diabetes Research," Computational and
Structural Biotechnology Journal, vol. 15, pp. 104-116, 2017.

121

[10] M. Marinov, A. S. M. Mosa and I. Yoo, "Data-mining Technologies for Diabetes: A
Systematic Review," Journal of diabetes science and technology, vol. 5, no. 6, pp. 15491556, 2011.
[11] S. Sankaranarayanan and P. T. Perumal, "A Predictive Approach for Diabetes Mellitus
Disease through Data Mining Technologies," in 2014 World Congress on Computing and
Communication Technologies, Trichirappalli, India, 2014.
[12] S. Chaudhuri and U. Dayal, "An overview of data warehousing and OLAP technology,"
ACM SIGMOD, vol. 26, no. 1, 1997.
[13] R. J. Koopman, K. M. Kochendorfer, J. L. Moore, D. R. Mehr, D. S. Wakefield, B.
Yadamsuren, J. S. Coberly, R. L. Kruse, B. J. Wakefield and J. L. Belden, "A Diabetes
Dashboard and Physician Efficiency and Accuracy in Accessing Data Needed for HighQuality Diabetes Care," Annals of Family Medicine, vol. 9, no. 5, pp. 385-405, 2011.
[14] S. Zahanova, A. Tsouka, M. R. Palmert and F. H. Mahmud, "The iSCREEN Electronic
Diabetes Dashboard: A Tool to Improve Knowledge and Implementation of Pediatric
Clinical Practice Guidelines," Canadian Journal of Diabetes, vol. 41, no. 6, pp. 603-612,
2017.
[15] Government of Canada, "The Canadian diabetes risk questionnaire," 29 03 2017. [Online].
Available:
https://healthycanadians.gc.ca/en/canrisk?utm_source=VanityURL&utm_medium=URL&u
tm_campaign=publichealth.gc.ca/canrisk.. [Accessed 11 June 2020].
[16] X.-H. Meng, Y.-X. Huang, D.-P. Rao, Q. Zhang and Q. Liu, "Comparison of three data
mining models for predicting diabetes or prediabetes by risk factors," The Kaohsiung
Journal of Medical Sciences, vol. 29, no. 2, pp. 93-99, February 2013.
[17] S. Kumari and A. Singh, "A data mining approach for the diagnosis of diabetes mellitus," in
2013 7th International Conference on Intelligent Systems and Control (ISCO), Coimbatore,
India, 2013.
[18] J. Lindström and J. Tuomilehto, "The Diabetes Risk Score: A Practical Tool to Predict Type
2 Diabetes Risk," Diabetes Care, vol. 26, no. 3, pp. 725-731, 2003.
[19] V. Mohan, R. Deepa, M. Deepa, S. Somannavar and M. Datta, "A Simplified Indian
Diabetes Risk Score for Screening for Undiagnosed Diabetic Subjects," The Journal of the
Associations of Physicans of India, vol. 53, pp. 759-763, 2005.
[20] A. Althubaiti, "Information bias in health research: definition, pitfalls, and adjustment
methods," Journal of Multidisciplinary Healthcare, no. 9, pp. 211-217, 2016.
[21] Microsoft, "What is SQL Server Management Studio (SSMS)?," 11 September 2019.
[Online]. Available: https://docs.microsoft.com/en-us/sql/ssms/sql-server-managementstudio-ssms?view=sql-server-ver15. [Accessed 1 November 2020].

122

[22] IBM, "SPSS Modeler - Overview," 2020. [Online]. Available: https://www.ibm.com/caen/products/spss-modeler. [Accessed 22 September 2020].
[23] "Visual
Studio
2019,"
Microsoft,
2019.
https://visualstudio.microsoft.com/vs/. [Accessed 15 July 2020].

[Online].

Available:

[24] A. Marcano-Cedeno and D. Andina, "Data mining for the diagnosis of type 2 diabetes," in
World Automation Congress 2012, Puerto Vallarta, Mexico, 2012.
[25] L. Zhang, X. Shang, S. Sreedharan, X. Yan, J. Liu, S. Keel, J. Wu, W. Peng and M. He,
"Predicting the Development of Type 2 Diabetes in a Large Australian Cohort Using
Machine-Learning Techniques: Longitudinal Survey Study," JMIR MEDICAL
INFORMATICS, vol. 8, no. 7, 2020.
[26] R. S. Anand, P. Stey, S. Jain, D. R. Biron, H. Bhatt, K. Monteiro, E. Feller, M. L. Ranney, I.
N. Sarkar and E. S. Chen, "Predicting Mortality in Diabetic ICU Patients Using Machine
Learning and Severity Indices," AMIA Joint Summits on Translation Science Proceedings,
vol. 2018, no. 1, pp. 310-319, 2018.
[27] H. S. Kim, A. M. Shin, M. K. Kim and Y. N. Kim, "Comorbidity Study on Type 2 Diabetes
Mellitus Using Data Mining," Korean J Intern Med., vol. 27, no. 2, pp. 197-202, 2012.
[28] A. Dagliati, S. Marini, L. Sacchi, G. Cogni, M. Teliti, V. Tibollo, P. D. Cata, L. Chiovato and
R. Bellazzi, "Machine Learning Methods to Predict Diabetes Complications," Journal of
Diabetes Science and Technology, vol. 12, no. 2, pp. 295-302, 2018.
[29] P. Kekäläinen, H. Sarlund, K. Pyörälä and M. Laakso, "Hyperinsulinemia cluster predicts
the development of type 2 diabetes independently of family history of diabetes," Diabetes
Care, vol. 22, no. 1, pp. 86-92, 1999.
[30] K. E. Heikes, D. M. Eddy, B. Arondekar and L. Schlessinger, "Diabetes Risk Calculator: A
Simple Tool for Detecting Undiagnosed Diabetes and Pre-Diabetes," Diabetes Care, vol.
5, pp. 1040-1045, 2008.
[31] M. Lau, H. Campbell, T. Tang, D. J S Thompson and T. Elliott, "Impact of Patient Use of an
Online Patient Portal on Diabetes Outcomes," Canadian Journal of Diabetes, vol. 38, no.
1, pp. 17-21, 2014.
[32] A. Dagliati, L. Sacchi, V. Tibollo, G. Cogni, M. Teliti, A. Martinez-Millana, V. Traver, D.
Segagni, J. Posada, M. Ottaviano, G. Fico, M. T. Arredondo, P. D. Cata, L. Chiovato and
R. Be, "A dashboard-based system for supporting diabetes care," Journal of the American
Medical Informatics Association, vol. 25, no. 5, pp. 538-547, 2018.
[33] G. Stiglic and M. Pajnkihar, "Evaluation of Major Online Diabetes Risk Calculators and
Computerized Predictive Models," PLoS One, vol. 10, no. 11, 2015.
[34] D. Pyle, Data preparation for data mining, San Francisco, California: Morgan Kaufmann
Publishers, Inc., 1999.

123

[35] S. García, S. Ramírez-Gallego, J. Luengo, J. . M. Benítez and F. Herrera, "Big data
preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, p. 9, 2016.
[36] R. E. Bellman, Adaptive control processes: a guided tour, Princeton, New Jersey: Princeton
Legacy Library, 2015.
[37] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Computers &
Electrical Engineering, vol. 40, no. 1, pp. 16-28, 2014.
[38] N. Suhaimi and A. Ismail, "Comparing the Performance of Logistic Regression and Artificial
Neural Networks Models: An Application to Type 2 Diabetes Mellitus," 2012. [Online].
Available:
https://www.academia.edu/7511501/Comparing_the_Performance_of_Logistic_Regressio
n_and_Artificial_Neural_Networks_Models_An_Application_to_Type_2_Diabetes_Mellitu
s. [Accessed 1 November 2020].
[39] M. M. Mijwel, "Artificial Neural Networks Advantages and Disadvantages," 2018 January.
[Online].
Available:
https://www.researchgate.net/profile/Maad_Mijwil/publication/323665827_Artificial_Neural
_Networks_Advantages_and_Disadvantages/links/5aa2c01faca272d448b5a23d/ArtificialNeural-Networks-Advantages-and-Disadvantages.pdf. [Accessed 27 September 2020].
[40] H. Esmaily, M. Tayefi, H. Doosti, D. Ghayour-Mobarhan, H. Nezami and A.
Amirabadizadeh, "A Comparison between Decision Tree and Random Forest in
Determining the Risk Factors Associated with Type 2 Diabetes," Journal of Research in
Health Sciences, vol. 18, no. 2, p. 412, 2018.
[41] S. B. Kotsiantis, I. Zaharakis and P. Pintelas, Supervised Machine Learning: A Review of
Classification Techniques, Greece: IOS Press, 2007.
[42] Y. Guo, G. Bai, Y. Hu, "Using bayes network for prediction of type-2 diabetes," in Internet
Technology And Secured Transactions, London, 2012.
[43] F. M. Díaz-Pérez, "CHAID algorithm as an appropriate analytical method for tourism market
segmentation," Journal of Destination Marketing and Management, vol. 5, no. 3, pp. 275282, 2016.
[44] G.Reachad,Michault, H.Bihan, C.Paulino, R.Cohena, H.Le Clésiau, "Patients’ impatience
is an independent determinant of poor diabetes control," Diabetes & Metabolism, vol. 37,
no. 6, pp. 497-504, 2011.
[45] Urology Care Foundation, "Kidney Failure: Symptoms, Causes & Diagnosis - Urology Care
Foundation,"
2020.
[Online].
Available:
https://www.urologyhealth.org/urologicconditions/kidney-(renal)failure#:~:text=What%20is%20Kidney%20(Renal)%20Failure,kidney%20(or%20renal)%2
0failure.. [Accessed 08 10 2020].

124

[46] Microsoft,
"Microsoft
SQL
documentation,"
https://docs.microsoft.com/en-us/sql/?view=sql-server-ver15.
2020].

[Online].
[Accessed 1

Available:
November

[47] Microsoft, "What is SQL Server Reporting Services (SSRS)?," 05 June 2019. [Online].
Available:
https://docs.microsoft.com/en-us/sql/reporting-services/create-deploy-andmanage-mobile-and-paginated-reports?view=sql-server-ver15. [Accessed 27 September
2020].
[48] Tableau, "Tableau," [Online]. Available: https://www.tableau.com/. [Accessed 08 10 2020].
[49] Qlik, "Qlik," [Online]. Available: https://www.qlik.com/. [Accessed 08 10 2020].
[50] Datawrapper, "Datawrapper," [Online]. Available: https://www.datawrapper.de/. [Accessed
08 10 2020].
[51] FusionCharts, "FusionCharts,"
[Accessed 08 10 2020].

[Online].

Available:

https://www.fusioncharts.com/.

[52] T. Hastie, R. Tibshirani and J. Friedman, "Random Forests," in The Elements of Statistical
Learning, New York, Springer, 2008, pp. 587-603.
[53] World Health Organization, "Diabetes," 8 June 2020. [Online]. Available:
https://www.who.int/news-room/fact-sheets/detail/diabetes. [Accessed 1 November 2020].
[54] A. Wielgosz, S. Dai, P. Walsh, J. McCrea-Logie and E. Celebican, "Comorbid Conditions
in Canadians Hospitalized Because of Diabetes," Canadian Journal of Diabetes, vol. 42,
pp. 106-111, 2018.
[55] K. A. Connelly, R. E. Gilbert and P. Liu, "Treatment of Diabetes in People With Heart
Failure," 2018. [Online]. Available: https://guidelines.diabetes.ca/cpg/chapter28. [Accessed
1 November 2020].
[56] Microsoft, "ASP.NET Overview | Microsoft Docs," 08 October 2019. [Online]. Available:
https://docs.microsoft.com/en-us/aspnet/overview. [Accessed 27 September 2020].
[57] "Supervised
and
Unsupervised
Learning,"
2011.
[Online].
Available:
https://sites.astro.caltech.edu/~george/aybi199/Donalek_classif1.pdf. [Accessed 2 Nov
2020].
[58] M. R. DEVI, "Analysis of Various Data Mining Techniques to Predict Diabetes Mellitus,"
International Journal of Applied Engineering Research, vol. 11, no. 1, pp. 727-730, 2016.
[59] G. Shmueli, "To Explain or to Predict?," Statistical Science, vol. 25, no. 3, pp. 289-310,
2010.

125