ADVANCED ANALYTICS TO PREDICT SURVIVABILITY OF BREAST CANCER PATIENTS
by
Sonal Bajaj
B.Tech, Uttar Pradesh Technical University, UP, India, 2014
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE
UNIVERSITY OF NORTHERN BRITISH COLUMBIA
November 2018
©Sonal Bajaj, 2018

Abstract
Cancer is a significant burden of disease worldwide. Amongst women, breast cancer is the most common cancer and, after heart disease, one of the leading causes of death. With increasing breast cancer cases and technological improvements, large volumes of data related to breast cancer are collected every year around the globe. This historical data is a vast source of knowledge which, once extracted, can be used to inform future decisions. Descriptive analytics uncovers hidden patterns and trends and provides insights into the past to answer “What has happened?”. Predictive analytics uses different modeling techniques on historical data to predict future medical outcomes and answer “What could happen?”. Cancer care institutions and registries have collected large volumes of cancer data in various formats. Unfortunately, these repositories are not easily accessible, and the stored formats are difficult to analyze. The National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program is a premier source for cancer statistics in the United States. Although the data is accessible, it lacks consistency; generating reports from such data is a labour-intensive process. An end-to-end process is proposed through which such data can be cleansed, integrated and presented in the form of interactive dashboards with drill-down and drill-through reporting capabilities. This provides a comprehensive view of over forty years of data consisting of over one million records, with provisions to slice this data along several dimensions. The underlying patterns and trends could be utilized in improving treatment plans, data-driven resource allocation, and better patient care. The dashboard is designed to be extensible and scalable, and to update in real time as new data arrives. Additionally, a breast cancer predictive model that predicts survival months for diagnosed patients is proposed. The cleansed and pre-processed data from the analysis is used to create data subsets which, in turn, are used to train the predictive model. The experiments conducted for this research compare the outcomes of different modeling techniques and assess the impact of retraining the predictive model.

Acknowledgement
I would first like to thank my supervisor Dr. Waqar Haque for his constant support, guidance and motivation. The door to Dr. Haque’s office was always open whenever I ran into trouble or had a question about my research. It would never have been possible for me to take this work to completion without his constant support and generosity. Next, I would like to thank my supervisory committee, Dr. Alex Aravind and Dr. Pranesh Kumar, for their support and direction in my research. I would also like to thank Dr. Robert Olson for his suggestions towards this research, and my colleagues at Northern Health for supporting and motivating me at all times. Lastly, I would like to express my gratitude to my family and friends for providing me with unfailing support and continuous encouragement throughout my years of study.
A huge shout out to my elder brother Sameer Bajaj for being my pillar of strength and source of inspiration throughout this journey. This accomplishment would not have been possible without them. Thank you. Sonal Bajaj iii Contents Abstract.. ................................................................................................................................... ii Acknowledgement ................................................................................................................... iii Contents. ...................................................................................................................................iv List of Figures ......................................................................................................................... vii List of Tables ............................................................................................................................ix List of Equations ........................................................................................................................ x Acronyms ..................................................................................................................................xi 1. Introduction ............................................................................................................................ 1 1.1 Background ...................................................................................................................... 3 1.1.1 Common Breast Cancer Research Methods ............................................................. 3 1.1.2 Medical Prognosis and Survival Analysis ................................................................ 4 1.1.3 Knowledge Data Discovery and Data Mining .......................................................... 5 1.1.4 Data Visualization .................................................................................................... 6 1.1.5 Past Research ............................................................................................................ 7 1.2 Problem Statement ........................................................................................................... 8 1.3 Motivation of the Research .............................................................................................. 9 1.4 Research Methodology .................................................................................................. 11 1.5 Contributions ................................................................................................................. 11 1.6 Organization of Thesis ................................................................................................... 13 2. Literature Review ................................................................................................................ 14 2.1 Cancer Data Analysis and Visualization ....................................................................... 14 2.1.1 Global Burden of Disease Compare ....................................................................... 15 2.1.2 U.S. Cancer Statistics Data Visualizations Tool .................................................... 16 2.1.3 Global Cancer Observatory .................................................................................... 17 2.1.4 Genomic Data Commons DAVE Tools ................................................................. 19 2.1.5 Summary of Cancer Data Analysis and Visualization ........................................... 
20 iv 2.2 Predicting Breast Cancer Survival using Data Modeling Techniques........................... 21 2.2.1 Comparison of different Modeling Techniques ...................................................... 21 2.2.2 Hybrid modeling for Predicting Breast Cancer Survival ........................................ 25 2.2.3 Ensemble Modeling Technique for Predicting Cancer Survival ............................ 27 2.2.4 Summary of Breast Cancer Prediction using Data Modeling Techniques ............. 31 2.3 Summary ........................................................................................................................ 32 3. Methodology ........................................................................................................................ 33 3.1 Proposed Approach ........................................................................................................ 33 3.1 Data Extraction .......................................................................................................... 33 3.2 Data Analysis ............................................................................................................. 35 3.3 Data Visualization ..................................................................................................... 36 3.4 Predictive Modeling................................................................................................... 37 3.5 Evaluation .................................................................................................................. 37 3.2 Breast Cancer Analysis and Visualization ..................................................................... 38 3.2.1 Tableau ................................................................................................................... 38 3.2.2 Accessing SEER Database and SEER*Stat ............................................................ 40 3.2.3 Data Understanding, Preparation and Extraction ................................................... 40 3.2.4 Database.................................................................................................................. 45 3.2.5 Visualization Dashboard......................................................................................... 45 3.3 Breast Cancer Predictive Model .................................................................................... 47 3.3.1 SPSS Modeler ......................................................................................................... 48 3.3.2 Data Preparation ..................................................................................................... 50 3.3.3 Determining Input Variables .................................................................................. 50 3.3.4 Selecting Modeling Techniques ............................................................................. 52 3.3.5 Training Predictive Model ...................................................................................... 59 3.3.6 Testing and Validation of Predictive Model ........................................................... 62 3.4 Summary ........................................................................................................................ 65 4. Experiments and Results ...................................................................................................... 66 4.1 Data Analysis and Visualization.................................................................................... 
66 4.1.1 Breast Cancer Survivability Dashboard ................................................................. 67 v 4.1.2 Breast Cancer Metastasis ........................................................................................ 68 4.1.3 Breast Cancer TNM System ................................................................................... 70 4.1.4 Geographic Distribution of Breast Cancer Cases by Race ..................................... 72 4.1.5 Breast Cancer Cases by Race and Age Range ........................................................ 73 4.1.6 Breast Cancer Anatomy Dashboard........................................................................ 74 4.1.7 Lymph Node Involvement Dashboard ................................................................... 75 4.1.8 Geographic Distribution by Incidence/Mortality cases of Breast Cancer Cases .... 77 4.1.9 Breast Cancer Survival/Mortality rate by Age Range ............................................ 79 4.1.10 Summary of Data Analysis and Visualization Results ......................................... 83 4.2 Predictive Modeling....................................................................................................... 84 4.2.1 Comparison of Average Survival Months (Actual vs Measured) .......................... 86 4.2.2 Comparison of Accuracy of Modeling Techniques ................................................ 96 4.2.3 Impact of Retraining Breast Cancer Predictive Model ......................................... 104 4.2.4 Summary of Predictive Modeling ......................................................................... 109 4.3 Summary ...................................................................................................................... 110 5. Conclusion and Future Work ............................................................................................. 111 5.1 Future Work ................................................................................................................. 113 Bibliography .......................................................................................................................... 114 Appendix A… ........................................................................................................................ 121 Appendix B ............................................................................................................................ 122 vi List of Figures Figure 1. Screenshot of GBD Compare Breast Cancer Incidence Visualization .................... 15 Figure 2. Screenshot of CDC Visualization tool ..................................................................... 17 Figure 3. Screenshot of GCO Visualization ............................................................................ 18 Figure 4. Screenshot of GDC DAVE Tool Visualization........................................................ 19 Figure 5. Survivability calculation by Bellaachia and Guven ................................................. 23 Figure 6. Lung Cancer Outcome Calculator screenshot .......................................................... 28 Figure 7. BOSOM Calculator - Table for Predicted Survival ................................................. 29 Figure 8. Methodology ............................................................................................................ 34 Figure 9. Selecting Database in SEER*Stat ............................................................................ 41 Figure 10. 
Filters used in SEER*Stat ...................................................................................... 42 Figure 11. Selecting Variables in SEER*Stat.......................................................................... 44 Figure 12. Feature Selection Model Snapshot ......................................................................... 51 Figure 13. Structure of Neural Network .................................................................................. 54 Figure 14. Predictive Model Training snapshot ...................................................................... 59 Figure 15. Ensemble Node Setting .......................................................................................... 61 Figure 16. Predictive Model Testing Snapshot........................................................................ 62 Figure 17. Execution of Excel Output Node ........................................................................... 63 Figure 18. Predictive Model as calculator for individual case ................................................ 64 Figure 19. User Input Node Snapshot...................................................................................... 64 Figure 20. Breast Cancer Dashboard Story Point Panel .......................................................... 67 Figure 21. Breast Cancer Survivability Dashboard ................................................................. 67 Figure 22. Breast Cancer Metastasis ....................................................................................... 69 Figure 23. Breast Cancer TNM System ................................................................................... 71 Figure 24. Geographic Distribution of Breast Cancer Cases by Race ..................................... 72 Figure 25. Breast Cancer Cases by Race and Age Range ....................................................... 73 Figure 26. Breast Cancer Anatomy ......................................................................................... 74 Figure 27. Lymph Node Involvement in Breast Cancer Cases ............................................... 76 Figure 28. Geographic Distribution by Breast Cancer Incidence Cases ................................. 77 Figure 29. Selection panel - Breast Cancer Incidence/Mortality Cases .................................. 77 vii Figure 30. Geographic Distribution by Breast Cancer Mortality Cases .................................. 79 Figure 31. Box and Whisker Visual ........................................................................................ 80 Figure 32. Breast Cancer Survival Rate by Age Range........................................................... 80 Figure 33. Selection panel - Breast Cancer Survival/Mortality Rate ...................................... 81 Figure 34. Breast Cancer Mortality Rate by Age Range ......................................................... 82 Figure 35. Vital Status comparison ......................................................................................... 85 Figure 36. Average Survival Months ...................................................................................... 87 Figure 37. Average Survival Months (excluding TNM variables) .......................................... 88 Figure 38. Average Survival Months by Age Range............................................................... 89 Figure 39. Average Survival Months by Marital Status .......................................................... 90 Figure 40. 
Average Survival Months by Positive to Examined Regional Nodes Ratio .......... 91 Figure 41. Average Survival Months by Positive to Radiation and Sequence Surgery .......... 93 Figure 42. Average Survival Months by Estrogen Receptor Status ........................................ 94 Figure 43. Average Survival Months by Progesterone Receptor Status ................................. 95 Figure 44. Average Survival Months by Behavior Code ICD-O-3 ......................................... 95 Figure 45. Accuracy of different modeling techniques ........................................................... 96 Figure 46. Accuracy by Age Range ......................................................................................... 97 Figure 47. Accuracy by Marital Status .................................................................................... 98 Figure 48. Accuracy by Positive to Examined Ratio ............................................................... 99 Figure 49. Accuracy by Radiation Sequence Surgery ........................................................... 100 Figure 50. Accuracy by Estrogen Receptor Status ................................................................ 101 Figure 51. Accuracy by Progesterone Receptor Status ......................................................... 102 Figure 52. Accuracy by Behavior ICD-O-3 .......................................................................... 103 Figure 53. Vital Status comparison ....................................................................................... 104 Figure 54. Average Survival Months Actual vs Measured ................................................... 106 Figure 55. Average Survival Months Actual vs Measured (Retrained model) ..................... 106 Figure 56. Accuracy of Predictive Model.............................................................................. 107 Figure 57. Accuracy of Retrained Predictive Model ............................................................. 108 viii List of Tables Table 1. Breast Cancer Survival Statistics............................................................................... 10 Table 2. Accuracy Comparison of SVM and SVM Ensembles............................................... 30 Table 3. Data Modeling Techniques used in Cancer Survivability Prediction........................ 32 Table 4. List of short-listed Variables ..................................................................................... 43 Table 5. Grouping of Variables for Data Preprocessing.......................................................... 47 Table 6. List of Variables (Input & Target) for Predictive Modeling ..................................... 52 Table 7. Modeling Technique Classes ..................................................................................... 53 Table 8. Training Accuracy ..................................................................................................... 61 Table 9. Accuracy of Modeling Techniques............................................................................ 88 Table 10. Impact of Retraining .............................................................................................. 108 ix List of Equations Equation 1. Ensemble Prediction Equation ............................................................................. 58 Equation 2. Average Survival Months .................................................................................... 86 Equation 3. 
Accuracy ............................................................................................................... 96

Acronyms
ADP – Automated Data Preparation
ANN – Artificial Neural Network
AUC – Area Under Curve
BI – Business Intelligence
BN – Bayesian Network
BOSOM – Breast Cancer Outcome-Survival Online Measurement
C&RT – Classification and Regression Tree
CDC – Centers for Disease Control and Prevention
CHAID – Chi-squared Automatic Interaction Detection
CI5 – Cancer Incidence in Five Continents
COD – Cause of Death
COPE – Corporate Oncology Program for Employees
CSU – Section of Cancer Surveillance
DAVE – Data Analysis, Visualization, and Exploration
DCCPS – Division of Cancer Control and Population Sciences
EMR – Electronic Medical Records
GBD – Global Burden of Disease
GCO – Global Cancer Observatory
GDC – Genomic Data Commons
IARC – International Agency for Research on Cancer
IHME – Institute of Health Metrics and Evaluation
IICC – International Incidence of Childhood Cancer
KDD – Knowledge Data Discovery
LCOC – Lung Cancer Outcome Calculator
MLP – Multilayer Perceptron
NCI – National Cancer Institute
NPCR – National Program of Cancer Registries
QUEST – Quick, Unbiased, Efficient Statistical Tree
RBF – Radial Basis Function
SEER – Surveillance, Epidemiology, and End Results
SPR – Surveillance Research Program
STR – Survival Time Recode
SVM – Support Vector Machine
VSR – Vital Status Recode
WDBC – Wisconsin Diagnostic Breast Cancer Database
WHO – World Health Organization

1. Introduction
Cancer is generally referred to as a large group of diseases that can affect any part of the human body. It is the uncontrolled growth of cells, which can invade one site or spread to many sites within the body [1]. A constant increase in the number of cancer cases has been reported globally over the past decade. According to the American Cancer Society, over 1.6 million new cases and over half a million cancer-related deaths were estimated to have been reported in the United States in 2016 alone [2]. Hence, it is no surprise that cancer is the second leading cause of death in the United States. Cancer is a leading cause of death in Canada as well, accounting for 30% of all deaths according to the Canadian Cancer Society [3]. There exist more than 100 types of cancer with different symptoms and treatments. Breast cancer is one of the most common cancers among women. However, due to technological advancements and increasing cancer-related research, many new early detection methods and treatments have been developed which have helped to decrease cancer-related deaths [4]. Nevertheless, cancer in general, and breast cancer in particular, remains a significant cause of concern [5]. Breast cancer accounted for about 25% of all new cancer cases among women in Canada in 2017, with an estimated total of over 26,000 cases for the year [3]. The numbers are on the rise globally, for both incidence and mortality due to breast cancer. “Breast cancer starts when cells in the breast begin to grow out of control. These cells usually form a tumor that can often be seen on an x-ray or felt as a lump” [6]. Breast cancer is most common in women, but it can also occur in men. A tumor is considered malignant if the cancer cells multiply and start affecting the surrounding cells or metastasize to other parts of the body such as the liver, lung, bone or brain. A tumor is benign, or non-malignant, when the growths are abnormal but do not invade cells outside of the breast.
Although non-malignant tumors are not considered life-threatening, some benign breast tumors can increase a woman's risk of developing a malignant breast tumor [6].
Even though breast cancer research is predominantly clinical or biological [5], the outcomes of data-driven research can be valuable and represent a significant step forward in cancer treatment. Dr. Robert Stein, a UCL breast cancer consultant, states that a one-size-fits-all approach is common practice in all areas of cancer treatment [7]. However, tools like IBM Watson for Oncology [8] and SAP’s Corporate Oncology Program for Employees (COPE) [9] help physicians provide better cancer care. SAP’s COPE is a corporate oncology program run by SAP for its employees in Germany, the U.S. and Canada. The program covers the cost of tumor analysis and helps physicians find safe and effective treatment for each patient. IBM Watson for Oncology [8] is trained to design cancer treatments. Over time, Watson has performed very well at recommending treatment plans for different types of cancer, and clinicians continue to add to its cancer assessment repository [10]. By selecting treatment plans based on genetic changes and on evidence derived from the underlying data points, Watson, COPE and other tools have brought a paradigm shift to cancer care. With such new tools and research aids, breast cancer care can be improved further. Predicting the survival of a breast cancer patient can help physicians develop a treatment plan specific to that patient. Different modeling techniques can be used to develop such a predictive model. One hypothesis here is that an ensemble modeling technique can be more accurate than other modeling techniques when designing the breast cancer predictive model. Ensemble modeling is the process of combining two or more modeling techniques and scoring the combined results using voting or averaging.
1.1 Background
One of the questions most frequently asked by cancer patients post-diagnosis is how much time they have left. How long a cancer patient will live is a tough question for oncologists to answer. Oncologists' answers to such questions are based on past records of cancer patients with a similar prognosis, or on consultations with other physicians and researchers working on comparable cases. Although a careful prognosis is vital, it is difficult to determine an accurate survival time because survivability depends on many factors [11]. Also, these predictions may not be absolute, as past records are not entirely reliable and the prognoses from different oncologists are generally inconsistent [12].
1.1.1 Common Breast Cancer Research Methods
Common breast cancer research methods include experimental studies, observational studies and clinical trials. The experimental method involves a new medication, treatment plan or treatment aid being introduced to a set of people. The results are then compared with those from another set of people (the control group) who are not exposed to the new intervention being tested. Members of both the control group and the intervention group are selected randomly by the researchers. Experimental research methods help in testing new techniques and in learning how cancer starts or metastasizes. The second common research method is observational, which involves observing a set of people in a natural environment to determine factors associated with a specific outcome [13]. As a result, observational studies can establish the association of variables with the outcome.
Another common breast cancer research method is the clinical trial. Clinical trials are medical experiments performed on humans; new procedures and drugs are usually developed based on clinical trials [4, 14]. In this research, an observational research methodology is used. The historical data records of cancer patients are used to understand patterns and trends (data analysis) and the factors involved in disease outcomes (data mining), and to predict the outcome for new patients (predictive analytics). The purpose of this research is twofold: to develop a breast cancer dashboard and to build a predictive model by utilizing the relationship between a set of independent variables and the survival months.
1.1.2 Medical Prognosis and Survival Analysis
Predicting the outcome of a disease is one of the most challenging tasks for researchers [5]. In cancer treatment, survival is considered the most important outcome [15]. “Survival analysis is a study of time between entry into observation and a subsequent event” [16]. In today’s world, scientists use survival analysis not just for the onset time of a disease but also for the time before a stock market crash, the time until weather changes or equipment failure, and the time before natural calamities such as floods and earthquakes [16]. For cancer, the essential event of interest is “death”; other events of interest include relapse of, or recovery from, the disease. Survival analysis tools are also used for leukemia patient readmission times, the time for an average person to develop heart disease, the time before death for the elderly population, and so on [17]. In medical prognosis, “survival analysis” refers to a field which applies different methods and techniques to collected historical data in order to predict a patient’s survival from a disease over a specific period of time [5]. With the development of Electronic Medical Records (EMR) systems, storing a patient’s history, test results, diagnoses and other relevant facts has become easy and manageable. Moreover, such openly available data sources act as a tremendous resource for researchers who want to develop survivability prediction models. Data analysis and knowledge discovery research techniques are used by researchers to predict the outcome of disease by identifying the patterns and relationships between different variables in historical data [5].
1.1.3 Knowledge Data Discovery and Data Mining
Historical data from cancer patients’ medical records are a powerful source of information, helping oncologists and researchers establish inter-relationships between present and historical cases [18]. The use of historical data to predict outcomes in breast cancer can be dated back to 1992, when neural network analysis was used to predict the recurrence of breast cancer [19]. However, with no global standard for recording patient data, vast inconsistency is often observed across the data available globally. Despite this inconsistency, these records remain an invaluable part of the medical literature. Knowledge Data Discovery (KDD) is a significant process of extracting knowledge from raw data. KDD is defined as a step-by-step process of understanding the domain and preparing the data, followed by the collection and formulation of knowledge from extracted patterns. “Post-processing of the knowledge” can then be applied to capture the knowledge contained in a large amount of recorded data [20]. In KDD, data mining is the step of collecting and formulating knowledge from data using different pattern extraction methods.
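The KDD steps outlined above can be pictured with a small, hedged sketch. The snippet below is only an illustration in Python under assumed inputs (a hypothetical file cases.csv with made-up column names); it is not the pipeline built later in this thesis, which uses SEER*Stat, Tableau and SPSS Modeler.

```python
# A minimal sketch of the KDD stages described above, under assumptions:
# "cases.csv" and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data preparation: load the raw extract and drop incomplete records.
records = pd.read_csv("cases.csv")                                  # hypothetical file
records = records.dropna(subset=["age", "tumor_size", "survived_5yr"])

# 2. Pattern extraction (the data mining step): learn a model of the target
#    from the prepared attributes.
features = records[["age", "tumor_size"]]                           # hypothetical attributes
target = records["survived_5yr"]                                    # hypothetical outcome flag
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3)
model = DecisionTreeClassifier(max_depth=4).fit(x_train, y_train)

# 3. Post-processing of the knowledge: evaluate the extracted patterns
#    before they are interpreted or applied.
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(x_test)))
```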
The knowledge discovery process is an essential part of medical data mining [21]. Data mining has influenced many fields such as medicine, media, astronomy, business, marketing, investment, manufacturing, and telecommunications. In today’s digital world, large volumes of data are produced every day, and manual data analysis is impractical; automated data analysis is therefore essential. Data mining is an attempt to address a problem of the digital era, i.e. data overload [22]. Data mining uses different algorithms to extract useful patterns from data [23] and then uses those patterns to build predictive models. Data mining tasks fall into two major groups: descriptive data mining and predictive data mining. Descriptive data mining includes association, clustering and summarization tasks, whereas predictive data mining includes classification, prediction, and time-series analysis tasks [24]. Descriptive tasks describe the significant properties of the data at hand, while predictive tasks aim to make predictions based on existing data [25]. In this research, both descriptive and predictive data mining tasks are used.
1.1.4 Data Visualization
A powerful way to make complex data usable and relevant is through data visualization. The goal of interactive data visualization is to display data visually so that users can understand it quickly, identify areas of improvement and make accurate decisions based on historical events or data [26]. On a broader scale, data visualization serves two goals – explanatory, i.e. presenting data to answer a particular question, and exploratory, i.e. exploring large sets of data to enhance understanding and find crucial missing information. Converting structured data into meaningful charts and graphical depictions enables users to gain insights into all the captured data [27]. As the healthcare sector moves rapidly towards data analytics and increasingly relies on digital information to improve care and reduce costs, data analysis and visualization have become core components. Many health organizations also make online visualization tools available for the public to explore trends in incidence, mortality, demographics and other statistics. The Institute of Health Metrics and Evaluation’s “GBD (Global Burden of Disease) Compare” is an interactive analytics and visualization tool available online [28]. The tool allows users to visualize and compare disease causes and risks in the form of treemaps, maps, diagrams and charts within a region (inter-country or intra-country) or worldwide [28]. Some online visualization tools and dashboards are also available specifically for cancer data. The National Program of Cancer Registries (NPCR) offers the United States Cancer Data Visualization online tool [29], which draws on data from the National Cancer Institute (NCI) and the Centers for Disease Control and Prevention (CDC) for all cancer types for the years 2010-2014. The tool represents U.S. cancer demographics graphically by cancer incidence rate, mortality rate and type of cancer. The World Health Organization’s Global Cancer Observatory (GCO) is a web-based platform which provides global cancer statistics [30].
1.1.5 Past Research
Many former researchers have established the capacity of data mining through its application to medical records [18].
For instance, Agrawal et al.’s Lung Cancer Outcome Calculator [5] is a survival prediction model for lung cancer patients using different data mining techniques and SEER data [31]. It can predict the survival of lung cancer patients at 6 months, 9 months, 1 year, 2 years and 5 years from diagnosis [5]. Zorluoglu et al. used the Wisconsin Diagnostic Breast Cancer Database (WDBC) for similar work, predicting whether a breast cancer tumor is malignant or benign [32]. Several other prognostic applications using different tools and algorithms also exist that can predict breast cancer survivability, for example Adjuvant [33] and PREDICT [34].
1.2 Problem Statement
Cancer care institutions and registries have collected large volumes of cancer data in various formats. Unfortunately, these repositories are not easily accessible, and the stored formats are difficult to analyze. The National Cancer Institute’s SEER Program is one of the organizations that maintain cancer statistics for the US population. Though the data is accessible, it lacks consistency, and generating reports from such data is a labour-intensive process. An end-to-end process is proposed through which this data is cleansed, integrated and presented in the form of interactive dashboards with drill-down and drill-through reporting capabilities. This provides a coherent view of the data and allows users to observe hidden patterns and trends which could be utilized towards improving treatment plans, data-driven resource allocation and better patient care. The top-level dashboard presents the main KPIs and is supported by a sequence of visualizations to convey information which can be sliced and diced along several dimensions. This real-time dashboard can be updated as soon as new data is uploaded to the database. Further, using the pre-processed and cleansed data, a breast cancer predictive model is proposed that predicts the survival months of a breast cancer patient from the time of diagnosis. The predictive model is trained, tested and validated with different subsets. The predictors are selected from a total of 134 variables available in SEER, based on the main KPIs identified by the analysis along with an expert’s opinion. Different data mining techniques are used, and the predictive model is designed based on the technique which performs best amongst them, along with an ensemble of the selected techniques.
1.3 Motivation of the Research
“Time is shortening. But every day that I challenge this cancer and survive is victory for me” – Ingrid Bergman (Cancer patient)
Traditional cancer survival prediction methods, such as studying past experiences using spreadsheets, require an unacceptably large amount of time and effort. The classic ways of identifying the survival time of a cancer patient include comparing the patient’s health situation and symptoms with previously recorded patients’ medical records, statistically computing survival rates based on historical records, or consulting another breast cancer expert. “Cancer prognosis is the doctor's best estimate of how cancer will affect person” [3]. Many factors can affect a person's prognosis. Survival statistics is one methodology that physicians use to develop a prognosis for a person with cancer. Researchers, when developing a prognosis, often look at studies that measure survival for one specific cancer type, stage or risk group. The survival rate is the percentage of people with cancer who are alive at some point in time (e.g.
1, 3, 5 or 10 years) after their diagnosis [3]. Survival prediction is the process of estimating the time a patient has left to live. It is generally associated with diseases that have a high mortality rate, such as cancer. Survival prediction is part of a physician’s prognostic investigation, where the result takes the form of a numerical percentage of survival over a period and depends on factors such as tumor size, time after diagnosis and stage of cancer [18]. Table 1 shows global survivability rates of breast cancer for the year 2016.

Time since Diagnosis    Survival Rate
5-year                  89%
10-year                 83%
15-year                 78%
Table 1. Breast Cancer Survival Statistics [2]

All cancer types have high mortality rates [35], and cancer patients are understandably anxious to know how much time they have left. Despite the many treatment options available, there is no assurance that a patient will be cured after treatment; each patient responds to treatment differently. Physicians estimate the prognosis of cancer by using statistics collected by researchers over many years. Various statistics are used to estimate cancer prognosis [36], most commonly cancer-specific survival, relative survival, overall survival and disease-free survival. These statistical survival prediction methodologies are time-consuming and lack accuracy. By applying data mining techniques to breast cancer data, a breast cancer survival prediction model is built. Data mining techniques help rank and link cancer attributes to the survival outcome [5]. The outcome of the breast cancer survival predictive model will help both physicians and patients determine survivability more accurately, serving as a reference for patients and providing them with a second opinion [18]. It can also assist physicians in deciding the best treatment for a breast cancer patient. The modeling technique can predict the outcome based on patient-specific attributes instead of relying on personal experience or time-consuming statistical evaluation. The effects of breast cancer across different age ranges, the promising results and benefits of data-driven research in healthcare, and the desire to contribute towards improving healthcare and breast cancer treatment have together motivated this research.
1.4 Research Methodology
The main steps involved in this research are:
• Literature review
• Determining the relevant data source
• Consultation with medical personnel/oncologists to shortlist the relevant variables
• Designing the data analysis dashboard
• Determining relevant tools and modeling techniques
• Developing a predictive model
• Training, testing and validation of the predictive model
1.5 Contributions
This research has two primary contributions. The first contribution focuses on data visualization for over 40 years of breast cancer data. This is achieved by developing a dashboard built from breast cancer patients’ data to uncover hidden patterns as well as provide easy-to-understand metrics for users of all backgrounds. The dashboard includes breast cancer data for patient population demographics, patient volumes, diagnosis and treatment. The second contribution focuses on building a predictive model to predict breast cancer patient survivability. This model is built from the preprocessed data extracted from the SEER database. The model could be used by doctors to determine the predicted survival time, in months, for their patients diagnosed with breast cancer.
The model is trained with the existing data and processes the given breast cancer-related attributes and predict survival months [18]. This research will contribute towards the healthcare field in following ways: 11 1. Increase accuracy of the diagnoses: Predictive algorithms help physicians input the patient’s clinical symptoms and get more accurate diagnosis thereby assisting their judgements [37]. The treatment plan of patients can then be enriched with predictive analytics results such as survival months. 2. Provide physicians with answers they are seeking for individual patients: There are possibilities where a treatment plan works best for a set of patients, but may or may not work for another individual patient. Predictive analytics can help physicians plan the treatment specific to the patient. The breast cancer treatment plan can include a combination of different treatments such as surgery, radiation therapy, chemotherapy, hormonal therapy and targeted therapies. An ideal treatment plan should work against all things inside the cells that caused cancer to develop, grow, and possibly spread to other parts of the body [38]. 3. Increase the patients’ understanding and participation: The outcomes of predictive analytics like survivability and possible health risk indicators can help the patients’ educate themselves about their disease. An educated patient can take more responsibility of their treatment plan. The patients can then equally participate with their physicians to help make decisions on their treatment plan. 4. Analyze massive healthcare data: In current scenarios, the massive amount of data generated by the healthcare industry is digitized for ease. It is a tedious work to make sense from such massive digital data. The analysis of healthcare data will help by providing actionable insights to both physicians and healthcare industries regarding their 12 planning, administration and assessment [37]. It thereby augments the decision-making ability of the administration, by evaluating the critical opportunities, such as quality care improvement or patient injury prevention, and allocating resources to the fundamental processes [39]. 1.6 Organization of Thesis This thesis consists of five chapters. In this chapter, the background, problem statement, the motivation and contributions of this research has been provided. In Chapter 2, the related work, challenges and different research approaches are discussed. Chapter 3, extensively discusses the implementation steps involved in developing the breast cancer dashboard and the predictive model. Also, data mining and modeling techniques are formally introduced in this chapter. Chapter 4, provides experimentation and analysis of results. Chapter 5 provides the conclusion and the future directions to extend the work done in this research. 13 2. Literature Review To better understand the research questions raised and addressed in the problem statement, an extensive literature review is conducted. For this purposes, traditional/narrative literature review methodology is used. It involves critiquing and reviewing existing work and deduce a conclusion about the research questions raised. This review type comes handy in collecting a volume of related work in a specific research area and then summarize it by highlighting the current techniques and approaches used. By finding gaps or disparity in the literature, researchers can determine or define a new research approach or hypotheses. 
This process is utilized to understand the nuances of the approaches used in the literature specifically concerning two areas – data analysis and visualization of existing cancer data, and various predictive modeling techniques to determine the survivability of cancer patients. In this chapter, the observations in the two corresponding sections are presented. 2.1 Cancer Data Analysis and Visualization Over the past few years, several studies have been conducted focusing on the analysis of data related to multiple types of diseases including breast cancer [28, 30, 40, 41]. More recently, several health organizations have developed online analytics and visualization tools for cancer and other diseases from their data repositories. Some of these tools available are discussed in the following sections: 14 2.1.1 Global Burden of Disease Compare Figure 1. Screenshot of GBD Compare Breast Cancer Incidence Visualization [28] The Institute of Health Metrics and Evaluation (IHME) [42] at the University of Washington measures, compares and evaluates strategies for various health issues, diseases, injuries and risk factors, around the globe. IHME’s tool, GBD [28] evaluates global health challenges and risk factors so that health systems can be aligned with the disease trends. The tool analyzes data from 1990-2016 and provides a comparison of the effects of different diseases on a set of population. The policymakers can thus make more informed decisions with respect to the 15 allocation of resources for better health care. The data related to premature deaths, disabilities, and injury is collected from over 130 countries and can be visualized along several dimensions including demographics, mortality, disease causes and risk factors [28]. The visualization is available in different formats such as map, treemap, line chart, patterns bar chart, pyramid chart, arrow chart and heat map. The dashboard can be drilled down to specific countries and states. The tool has three main tabs- single (single chart type), explore (map and one additional chart) and compare (two of the same chart by year, age, sex, cause of disease, risk and location). Figure 1 is a screenshot of line graph visualization from ‘compare by cause’ tab where the cause is selected as breast cancer. The colour-coded lines show breast cancer rate and trend for selected countries. The same website also provides links to other visualization projects such as Mortality, Cause of Death (COD), Epidemiological (Epi), and Financing Global Health. [43]. 2.1.2 U.S. Cancer Statistics Data Visualizations Tool The Centers for Disease Control and Prevention (CDC) [44] and the National Cancer Institute (NCI) [45] collects cancer data from hospitals, physicians, clinics and health labs all over the U.S. and have made this data available through a visualization tool [40]. CDC recommends professionals like planners, policymakers, health advisors, researchers, and journalists to use this information to view and report cancer statistics [44]. The dashboard has multiple tabs and dropdowns for cancer types, historical trends, incidence, mortality rate, gender, age and demographics. While the data includes cases registered in the year 2010-2014, nation-wide changes in rates are available for the period 2006-2014. 16 Figure 2. Screenshot of CDC Visualization tool [40] Additional functionality includes geographical distribution displayed on an interactive map together with comparative numbers for all states [40]. 
Figure 2 is a screenshot of CDC visualization from demographics tab showing the rate of new cancer by sex, age group, race/ethnicity for all types of cancer in the United States (2015). 2.1.3 Global Cancer Observatory World Health Organization’s (WHO) [46] GCO [30] is a web-based visualization tool for global cancer statistics. The data presented is gathered from different projects of International Agency for Research on Cancer (IARC) Section of Cancer Surveillance (CSU) [47] including GLOBOCAN [48], Cancer Incidence in Five Continents (CI5) [49], International 17 Incidence of Childhood Cancer (IICC) [50], and Cancer Survival in Africa, Asia, the Caribbean and Central America [51]. The dashboard has four main tabs – Cancer Today, Cancer over Time, Cancer Tomorrow and Cancer Causes. The ‘Cancer Today’ tab presents incidence, mortality, and types of cancer estimates for 184 countries, broken down by age group and gender. The ‘Cancer over Time’ tab shows trends of cancer incidence and mortality for the last 50 years for 40 countries. The ‘Cancer Tomorrow’ tab provides visualization of cancer prediction up to the year 2035 by country and cancer type. Finally, ‘Cancer Causes’ tab highlights the causes of cancer, and the vital contributing risk factors [30]. Figure 3 is a screenshot of GCO visualization from ‘Cancer Today’ tab displaying estimated top 10 cancer incidence (cases) in the year 2018. Figure 3. Screenshot of GCO Visualization [30] 18 2.1.4 Genomic Data Commons DAVE Tools The National Cancer Institute’s [45] Genomic Data Commons (GDC) [52] provides access to standardized clinical and genomic data. GDC also includes data from, The Cancer Genome Atlas (TCGA) [53] and Therapeutically Applicable Research to Generate Effective Therapies (TARGET) [54]. The GDC Data Analysis, Visualization, and Exploration (GDC DAVE) Tools [41] provides cancer supporting gene and variant level analysis of GDC data. These tools provide researchers’ ability to visualize gene data with high impact mutations, most frequently mutated genes, survival analysis of different cases, and graphical visualization of cancer gene mutations. The data from each analysis can be visualized in bar charts, graph plots, trend lines and tabular format, along with download functionality. Figure 4 is a screenshot from GDC DAVE Tool visualization, showing a distribution of most frequently mutated genes. This tool can be accessed by using the GDC Data Portal. Figure 4. Screenshot of GDC DAVE Tool Visualization [41] 19 GDC provides data sharing, data submission across different cancer genomic studies and research thereby supports the development of precision medicine for cancer. A secure GDC API is also developed to provide batch data submissions [52]. 2.1.5 Summary of Cancer Data Analysis and Visualization The tools discussed above are new and recently available platforms to analyze and visualize cancer data. While each of them is simple and easy to use, they are all tied to the databases which are not publically available. With no scope of adding a custom database at the backend, it leaves these tools as standalone projects that cannot be integrated into other projects or tools. More importantly, these tools haven’t been available for public use for long which diminishes the scope for a comprehensive evaluation or comparison among them. Since all these tools are deployed over the web, there are no distributions available to use them offline or on desktop modes. 
Nevertheless, the wide variety of options these tools offer to visualize data proves handy for understanding trends better and inspires the design of a similar tool. It serves as motivation to design a flexible, scalable, easy-to-integrate platform with visualization features comparable to these tools that can accommodate varying databases.
2.2 Predicting Breast Cancer Survival using Data Modeling Techniques
Predicting the outcome of a disease is one of the most challenging tasks for researchers and medical personnel. In cancer treatment, survival is considered the most critical outcome [15]. Cancer survival can be predicted by applying data mining to historical records. Existing predictive models have used data mining techniques such as artificial neural networks, decision trees and statistical methods to predict cancer survival. The different approaches used for cancer survival prediction are grouped into three categories – (i) comparison of different modeling techniques [11, 55, 57, 59, 60, 61, 62] to identify the most accurate prediction model, (ii) hybrid prediction models [65, 66] and (iii) ensembles of different modeling techniques [5, 18, 32, 68].
2.2.1 Comparison of different Modeling Techniques
In 2005, Delen et al. [55] developed a breast cancer prediction model and compared different modeling techniques. Two data mining techniques, artificial neural networks and decision trees (C5), and one statistical technique, logistic regression, were compared. The study used the SEER public-use database [31] for the years 1973-2000. The software packages used for exploring the data were the MS Access database, the SPSS statistical analysis tool, the STATISTICA data miner and the Clementine data mining toolkit. The data source was pre-processed, and the final dataset consisted of 202,932 records. Records were modified and removed so that survivability could be predicted exclusively for breast cancer. The data cleansing and preparation strategies followed are described below:
• The records in which the patient did not survive for sixty months post-diagnosis were removed.
VSR marks whether the patient is dead or alive as of study cut-off date and Cause of Death provides the reason of cause of death of the patient. These two variables have been shown as important variables for cancer survival prediction [57] and other related studies. 22 Several spin-offs of their work followed through the years, and the SEER public-use data was observed to be used as the primary data source for these studies. In 2006, Bellaachia and Guven [57] implemented data mining techniques on breast cancer data to enhance Delen et al.’s study. The study included two more variables in addition to the 17 variables selected by [55], i.e. Cause of Death and Vital Status Recode. A new dependent variable Survivability was derived using Survival Time Recode (STR) and VSR. For a 60-month threshold, the ‘Survivability’ variable was calculated using the logic shown in Figure 5. Figure 5. Survivability calculation by Bellaachia and Guven [57] SEER public-use database [31] was used for the period 1973-2002. The study compares three data mining techniques: Naïve Bayes, back-propagated neural networks, and C4.5 decision tree algorithm. WEKA [58] toolkit software package was used for developing the prediction model. Accuracy, precision, and recall performance measures were used to evaluate the data mining techniques. The experimentation ranked Naïve Bayes technique as best with 84.5% accuracy, followed by artificial neural networks and C4.5 algorithms with 86.5% and 86.7% accuracy, respectively. With respect to Delen et al.’s study, the variation in the accuracy of the two studies is due to different SEER datasets, pre-processing and data mining techniques 23 [57]. One limitation of this study, as stated by the authors, is the exclusion of records with missing data (Extent of Disease and Site Specific Surgery). Endo et al. [59] compared seven algorithms to predict breast cancer survival using the SEER public-use database [31] from the year 1992 to 1997. Logistic Regression model, Artificial Neural Network, Naïve Bayes, Bayes Net, Decision Trees with Naïve Bayes, Decision Trees (ID3), Decision Trees (J48) were used to develop the prediction models. Among these methods, the Logistic Regression model showed the highest accuracy with 85±0.2%, Decision tree (J48) showed the highest sensitivity and ANN displayed the highest specificity. The study used accuracy to evaluate the model performance, but the authors also state that sensitivity is a comparatively better parameter for survival based prediction models. A study by Wang et al. [11] predicts 5-year breast cancer patient survivability by using two data mining techniques, i.e. Logistic Regression model and a Decision Tree model. The study is performed on the the SEER public-use database [31] for the year 2010. The study concludes that the Logistic Regression model is better than the Decision Tree model [11]. The study uses the same data preparation method as used by [55]. The dataset used in both the studies are different. The incidence and mortality trends in the datasets used by both these studies are significantly different. This study concludes with higher accuracy than [55]. The former study shows 91.19% and 91.34% accuracy while the latter shows 85.8% and 86.0% accuracy of decision trees and logistic regression respectively. Few studies [60, 61, 62] as discussed next, have developed a breast cancer detection model which predicts whether the cancer is present or not. 
A few studies [60, 61, 62], as discussed next, have developed breast cancer detection models which predict whether cancer is present or not. These studies have also used data mining techniques and performed a comparison of these techniques.

Chaurasia and Pal [60] developed a diagnosis system for breast cancer detection. The model uses the RepTree, RBF Network and Simple Logistic modeling techniques. The study uses the University Medical Centre Institute of Oncology, Ljubljana, Yugoslavia database. The extracted breast cancer data had 286 rows and 10 variables for each row. The results of the study state that the Simple Logistic modeling technique has higher accuracy (74.47%) as compared to RepTree (71.32%) and RBF Network (73.77%).

Senturk and Kara [61] performed a breast cancer diagnosis study using data mining on the UCI Machine Learning database from the University of Wisconsin Hospitals, Madison. The study aimed to analyze the performance of seven different algorithms. The RapidMiner 5.0 [63] tool was used for data mining and prediction. The study concluded that the Support Vector Machines algorithm is best for breast cancer diagnosis prediction, with an accuracy of 96.0% [61]. The study predicts whether the tumor is benign or malignant, and the prediction model is trained with only 699 cases (records with missing information were removed).

Chaurasia and Pal [62] developed a breast cancer detection model using the WEKA [58] software for data mining. This study also used the UCI Machine Learning database from the University of Wisconsin Hospitals, Madison. The study aims to compare the performance of three classification techniques: Sequential Minimal Optimization (SMO), IBK and BF Tree. The study concludes that the SMO classification technique has the highest prediction accuracy (96.2%) amongst the three techniques. This study differs from their earlier work [60], in which the Simple Logistic, RepTree and RBF Network modeling techniques were used; the databases used in the two studies are also from different demographics.

2.2.2 Hybrid modeling for Predicting Breast Cancer Survival
Hybrid modeling is an approach in which two or more modeling techniques are combined, for example, clustering combined with classification, or clustering used together with association modeling techniques [64].

In 2008, Khan et al. [65] investigated a hybrid scheme based on fuzzy decision trees as an alternative for breast cancer prognosis. The data source used for the study was the SEER public-use database [31] for the period 1973-2003. An essential aspect of the research was the use of a hybrid modeling technique based on fuzzy decision trees. The data pre-processing removed the records with missing data and included only records with a Cause of Death (COD). The final dataset of 162,500 records, with 16 variables and a binary target variable (0 denoted 'did not survive' and 1 denoted 'survived'), was used for experimentation. The performance evaluation showed that the hybrid fuzzy decision tree classification technique (accuracy 85%) is more powerful and fair than the independently applied decision tree classification technique (accuracy 82%) [65]. The study is also based on an assumption similar to Delen et al. [55], namely that all patients died due to breast cancer only, which is not always the case. A 2012 survey states that external causes, heart failure, suicide and gastrointestinal diseases are other reasons for breast cancer patient death [56].

According to Choi et al. [66], a hybrid Bayesian model for predicting breast cancer prognosis can outperform other models.
Three different models for cancer prognosis were examined: a Bayesian Network (BN) model, an Artificial Neural Network (ANN) model and a hybrid BN model. The hybrid model developed was a combination of the ANN model and the BN model. The SEER public-use database [31] for the period 1973-2003 was used to build the model, with 294,275 records and 9 input variables. For a threshold of 60 months, the proposed hybrid BN model performed better than the Bayesian network [66]. The study states that the proposed hybrid BN model and the ANN model outperformed the BN model. The goal of this study was to demonstrate the power of BN models over ANN models; however, the results showed that the hybrid BN model's performance was mainly due to the ANN component rather than the BN component. The difference between the Area Under Curve (AUC) of the ANN (0.930) and the hybrid BN (0.935) is minimal (0.005) when compared to the BN (0.813), and the authors stated that the better performance of the hybrid BN originated from the ANN rather than the BN.

2.2.3 Ensemble Modeling Technique for Predicting Cancer Survival
Ensemble modeling techniques are used to improve the performance of individual classification techniques such as decision trees, regression, neural networks, support vector machines and Bayesian networks. Ensembles combine the predictions of multiple classification techniques to achieve better prediction accuracy [67]. Common ensemble techniques are bagging, boosting, voting and stacking. Ensemble modeling techniques only combine classification techniques, unlike hybrid modeling techniques, which can combine classification and clustering, or clustering (e.g. K-Means and two-step clustering models) and association techniques (e.g. Apriori and Carma models).

In 2010, Agrawal et al. [5] developed an online lung cancer outcome calculator using data mining and predictive modeling. The research aimed at developing an accurate survival prediction model by using the SEER public-use database [31]. The study used 1998-2001 data for five-year prediction; records earlier than 1998 were eliminated because some variables were added only after 1998. A few variables were modified and merged to form new variables for the study, and only records with "Cause of Death" as lung cancer were used. The WEKA [58] software tool was used to evaluate the data mining techniques. An ensemble of the data mining algorithms J48 Decision Tree, Alternating Decision Tree, LogitBoost, Random Subspace and Random Forest was used in the study. The predictive model was built with 64 variables, and the online calculator was built by using 13 of the 64 variables.

Figure 6. Lung Cancer Outcome Calculator screenshot [5]

The 13 variables were selected based on their predictive power (the predictive power of an attribute here refers to the ranking or inter-relatability ability of that attribute with other attributes to form patterns [101]) by using a feature selection method (feature selection is used to identify the fields that are most important for a given analysis). Some of the key variables used were: age, birthplace, cancer grade, farthest extension of tumor, lymph node involvement and total regional lymph nodes examined. Overall, the ensemble voting classification technique performed best with the highest prediction accuracy (91.4%) and AUC (94%) [5]. Figure 6 shows a snapshot of the online lung cancer calculator result window.

In 2014, GilTroy Paular Meren [18] developed a Breast Cancer Outcome-Survival Online Measurement (BOSOM) calculator.
This online survival measurement calculator applies data mining and predictive modeling to the SEER public-use database [31] (1973-2010). The study uses the same framework and data mining techniques as Agrawal et al.'s Online Lung Cancer Outcome Calculator to establish the prediction calculator. The time intervals used for prediction are 2, 4, 6, 8 and 10 years. The study reported average accuracies of 88.27% for the calculator and 91.71% for the completed dataset. Figure 7 is a snapshot of the BOSOM calculator output screen. The classifiers used in this study are the same as those used in [5]; the calculator is a replica of the LCOC [5], applying the same methodology to breast cancer survivability.

Figure 7. BOSOM Calculator - Table for Predicted Survival [19]

Gokhan and Mustafa [32] used an ensemble of three data mining techniques to diagnose breast cancer. Clementine software was used for data mining on the Wisconsin Diagnostic Breast Cancer Database. Amongst Decision Trees, Support Vector Machines, Artificial Neural Network and an ensemble of all three, the ensemble model proved to be better than the individual models. The dataset used in this study has 569 instances or records; this limited data was one of the major shortcomings of the study. The study only determines whether a case is malignant or benign.

Lastly, in 2017, Huang et al. [68] compared the Support Vector Machine (SVM) modeling technique with SVM ensemble techniques for breast cancer prediction. The datasets used in the study are the Wisconsin Diagnostic Breast Cancer Database, the UCI machine learning database and ACM SIGKDD Cup 2008. The WEKA [58] data mining software is used to construct the SVM classifiers. It is concluded that for smaller datasets, SVM ensembles performed better than the individual SVM classification technique, while for large datasets, SVM ensembles using the boosting method perform better than the other classification techniques. The results are presented in Table 2 [68].

Dataset             | Modeling Technique                                 | Accuracy (%)
Small-scale dataset | Individual SVM - GA + linear SVM                   | 96.57
Small-scale dataset | SVM Ensemble - GA + RBF SVM (boosting)             | 98.28
Small-scale dataset | SVM Ensemble - GA + linear SVM (boosting/bagging)  | 96.57
Large-scale dataset | SVM Ensemble - Poly SVM (boosting)                 | 99.51
Large-scale dataset | SVM Ensemble - GA + poly SVM (bagging)             | 99.50
Large-scale dataset | SVM Ensemble - RBF SVM (boosting)                  | 99.52

Table 2. Accuracy Comparison of SVM and SVM Ensembles [68]

2.2.4 Summary of Breast Cancer Prediction using Data Modeling Techniques
Table 3 summarizes the data modeling techniques used for cancer survivability prediction in the literature.
Type | Author | Year | Dataset | Technique
Comparison of different modeling techniques | Delen et al. | 2005 | SEER public-use database (1973-2000) | Decision Tree (C5), Artificial Neural Network, Logistic Regression
 | Bellaachia and Guven | 2006 | SEER public-use database (1973-2000) | Naïve Bayes, Back-propagated Neural Network, C4.5 Decision Tree
 | Endo et al. | 2008 | SEER public-use database (1992-1997) | Logistic Regression, Artificial Neural Network, Naïve Bayes, Bayes Net, Decision Trees with Naïve Bayes, Decision Trees (ID3), Decision Trees (J48)
 | Wang et al. | 2013 | SEER public-use database (1973-2007) | Logistic Regression, Decision Tree (J48)
 | Senturk and Kara | 2014 | UCI Machine Learning Repository | Artificial Neural Network, Decision Trees, Logistic Regression, Support Vector Machines, Naïve Bayes, K-Nearest Neighborhood
 | Chaurasia and Pal | 2017 | University Medical Centre Institute of Oncology, Ljubljana, Yugoslavia database | RepTree, RBF Network, Simple Logistic
 | Chaurasia and Pal | 2017 | UCI Machine Learning Repository | Sequential Minimal Optimization, IBK, BF Tree
Hybrid | Khan et al. | 2008 | SEER public-use database (1973-2003) | Decision Trees, Fuzzy Decision Trees
 | Choi et al. | 2009 | SEER public-use database (1973-2003) | Artificial Neural Network, Logistic Regression, Bayesian Network
Ensemble | Agrawal et al. | 2010 | SEER public-use database (1988-2001) | J48 Decision Tree, Random Forest, LogitBoost, Random Subspace, Alternating Decision Tree
 | GilTroy Paular Meren | 2014 | SEER public-use database (1973-2010) | ZeroR, Random Forest, LogitBoost, Random Subspace, J48 Decision Tree, Alternating Decision Tree
 | Gokhan and Mustafa | 2015 | Wisconsin Diagnostic Breast Cancer Database | Decision Trees, Support Vector Machines, Artificial Neural Network
 | Huang et al. | 2017 | Wisconsin Diagnostic Breast Cancer Database, UCI machine learning database, ACM SIGKDD Cup 2008 | SVM, SVM Ensemble

Table 3. Data Modeling Techniques used in Cancer Survivability Prediction

2.3 Summary
In this chapter, an overview of research done on breast cancer survivability has been provided. The models presented in the literature are based on several data mining techniques and different datasets. Most of these works focus on comparing data mining techniques for building predictive models, and only a few have developed specific tools to predict the outcome (survival) based on patient-specific input. Predicting the survivability of breast cancer patients can greatly assist physicians in developing a treatment plan specific to each patient. In the next chapter, the methodology used for this research, along with the implementation steps, is discussed.

3. Methodology
As mentioned in Chapter 1, the two main contributions of this thesis are the prediction of breast cancer survivability and the analysis of breast cancer data. This chapter breaks down these overarching contributions into sets of smaller tasks and explains each task, along with the rationale behind the routes adopted and the choices made to accomplish them. More precisely, the chapter presents the various approaches that could be used for predicting survivability and analyzing breast cancer data, the predictive model along with the implementation details of the different modeling techniques used to predict survival months, and the dashboard used to visualize 40 years of historical breast cancer data.

3.1 Proposed Approach
Figure 8 (Methodology) gives an overview of the method used for this research. It is primarily divided into five tasks: data extraction (from raw data), data analysis (including preprocessing), data visualization, predictive modeling, and evaluation of the predictive model.

3.1.1 Data Extraction
The SEER database maintains cancer statistics for the US and monitors the annual incidence progression of various types of cancer. The breast cancer data occurring in different population subgroups is available for the period 1973-2013.
Among other variables, the data includes patient records, race/ethnicity, primary site, the first course of treatment, and follow-up vital status [69]. The "SEER limited-use" data is defined by demographics, treatment (e.g. surgery, radiation therapy), diagnosis (e.g. primary site, tumor size), and an outcome characteristic (e.g. survival time, cause of death), which makes SEER an excellent source for outcome analysis and prediction-based studies.

Figure 8. Methodology (Icons copyright: SEER*Stat, Microsoft, Tableau and IBM)

The SEER dataset used for this research is a collection of data from 18 registries. The SEER*Stat statistical software [70] is used to extract raw data from the SEER database. This software allows viewing of patient records and production of different sessions such as Frequency, Rate, Survival, and Case Listing. After consultation with a radiation oncologist, 30 variables were selected (from a total of 134 variables available in SEER) to prepare the relevant dataset. The dataset is filtered to only include cases which died due to cancer, i.e. 'Dead' and 'N/A not first tumor' are selected for the SEER cause-specific death classification (the detailed definitions of the SEER variables are presented in Appendix B).

3.1.2 Data Analysis
Data analysis is the process of transforming raw data into a usable form. Once the raw records are extracted, data preprocessing is performed to produce a relevant subset. The pre-processed data is imported into the SQL Server database [71], followed by analysis leading to building the dashboard and reports using Tableau.

a) Data preprocessing
The data preprocessing is done at two levels:
• SEER-related preprocessing: Normalization of data, such as converting text values to numeric representation, is performed as part of preprocessing. The SEER*Stat software is used to accomplish this task. The derived data is cleansed to eliminate redundant content, and male breast cancer cases are also eliminated as part of data cleansing.
• Problem-specific preprocessing: This includes selecting data records for a distinct time of significance and eliminating attributes which do not hold any considerable predictive power. One of the steps is the removal of records which represent deaths due to a reason other than breast cancer.

b) Data Modeling
The pre-processed data is imported into MS SQL Server to create a database consisting of relevant dimensions and measures. Tableau [72] is connected to the database, and the tables are joined to create a view, extending horizontally by adding columns of data as needed. The data is further cleansed (such as by changing data types, renaming and resetting fields) and prepared for analysis. Calculated fields, formulas, groupings and sets are added via SQL queries. The data is sliced and diced by using filters and parameters, and the dissected data is then visualized using workbooks, dashboards, and stories.

3.1.3 Data Visualization
The dashboard and the dynamic reports contained in it use the views and tables created in the SQL Server database, helping convert massive amounts of data into meaningful and actionable information. This is accomplished by building dashboards which contain visually appealing and interactive components including charts, graphs, tooltips, and drill-down/drill-through reports. Dashboards provide interactive access to informative data and help make sense of the enormous data generated by every cancer incidence. In addition to tracking KPIs, the reports also allow discovery of hidden patterns in the data.
All of this, in turn, can improve the quality of cancer care. Policymakers, health professionals, advisors, and planners could use this data to view and report breast cancer statistics, providing a better understanding of incidence and mortality trends. Physicians can identify treatment options, wellness programs, and patient engagement opportunities. It can also empower patients to choose the right care through interactive visualization of treatment cost, quality and effectiveness.

3.1.4 Predictive Modeling
Using the pre-processed and cleansed data, a breast cancer predictive model has been designed to predict the survival months of a breast cancer patient from the year of diagnosis. The predictive model has been trained, tested and validated with SEER data. The shortlisting of predictors is based on the main KPIs identified by the analysis; seeking an expert's opinion in choosing the relevant predictors from the total of 134 variables available in SEER is an essential part of this task. Using various data modeling techniques, the predictive model has been developed with the modeling techniques of highest accuracy, along with their ensemble. There are three crucial steps in this stage: selection of predictors, selection of training and testing datasets, and developing the predictive model using the training dataset.

3.1.5 Evaluation
The predictive model has been evaluated on the testing dataset:
a) by comparing the actual average survival months with the predicted survival months, and
b) by calculating performance metrics such as accuracy.

3.2 Breast Cancer Analysis and Visualization
With increasing cases of cancer, the amount of data associated with cancer has also increased proportionally, and analyzing such huge datasets is difficult. Dashboards provide a visual mechanism to track KPIs and other metrics relevant to specific processes [73]. The purpose of the dashboard is to capture, process, and distribute information in an intelligible format to enable users to understand the data better [74]. This, in turn, can improve the quality of cancer care. For this research, a combined dataset of over 40 years of historical breast cancer data has been used. The dataset, which holds data in raw form, is analyzed and then prepared for visualization by means of interactive reports and a dashboard. The dashboard helps uncover hidden patterns as well as provide easy-to-understand metrics for users of all backgrounds. It includes breast cancer data for patient population demographics, patient volumes, diagnosis and treatment. There are several data visualization tools available, such as SSRS, Tableau, Power View and Qlik View, which can be used to build dashboards and visualize data. In this research, Tableau is chosen because of its versatility, flexible interface and other capabilities described in the following section.

3.2.1 Tableau
Tableau [75] is a commercially available tool which allows building dashboards by transforming data into visually appealing and interactive visualizations. It can be connected to a variety of data sources such as Access, Excel, data warehouses or web-based data [76]. It is an easy-to-use tool for data analysis and for building dashboards together with drill-down and drill-through reports. Such visualization can help users (physicians, researchers and patients) find the right path forward. For example, physicians can identify treatment options, wellness programs, and patient engagement.
It can even empower patients to choose the right care through interactive visualizations of treatment cost, quality and effectiveness [72]. Tableau is a useful tool for organizations which have massive amounts of data (e.g. the healthcare industry) to be converted into meaningful and actionable information. Also, the reports generated by Tableau can be published to a shareable URL [77]. Tableau can extract useful information out of large data volumes, which are otherwise difficult to examine manually. The advantages of using Tableau as a Business Intelligence (BI) tool in this research are:
• Tableau can discover hidden patterns in the data.
• Tableau is an easy-to-navigate tool with simple drag-and-drop features for creating dashboards.
• Tableau can analyze millions of rows of data in seconds.
• Gartner's BI Magic Quadrant [78] report has ranked Tableau as a leader for four consecutive years [79].
• Tableau can link multiple data sources (such as databases, flat files, web services) for quick and accurate analysis.
• Tableau can provide real-time dashboards.
• Tableau provides the option of extracting data from the data source or maintaining a live connection with the data source, which suits environments such as healthcare.

3.2.2 Accessing SEER Database and SEER*Stat
The SEER Program [80] is a database of cancer statistics in the United States. SEER is supported by the Surveillance Research Program (SRP) [81] in the National Cancer Institute's Division of Cancer Control and Population Sciences (DCCPS) [82]. The SEER database monitors the annual cancer incidence progression of each type of cancer. The SEER public-use data is available from the SEER web site on submitting a SEER limited-use data agreement form [31]; the data agreement form is available in Appendix A. The data can be accessed in two different ways: through SEER*Stat's client-server mode or by downloading compressed files. The first method requires the download and installation of the SEER*Stat software [70] on the local machine and an internet connection to extract data from the SEER database; the variables selected by the user are transferred from the user's machine to the SEER*Stat server to retrieve data as per the ad-hoc requests submitted. The second method requires the user to download compressed files of the data in two formats, i.e. binary and ASCII versions of the data. These files can then be accessed using the SEER*Stat software.

3.2.3 Data Understanding, Preparation and Extraction
The SEER*Stat statistical software version 8.3.5 is used to obtain the breast cancer data. The SEER*Stat software is associated with the SEER research data, available either directly from SEER's server or from a local file. For this research, the first method is used, i.e. data is extracted using the SEER*Stat software and the SEER*Stat server. SEER*Stat provides different types of sessions designed to calculate specific statistics. The software helps to view a given record of a cancer patient and can produce different sessions such as frequency, rate, survival, and case listing sessions. In this research, a case listing session is used to obtain data at the individual case or patient level. The case listing session also allows accessing the data in ASCII text format and generates a dictionary for each variable selected.

Figure 9. Selecting Database in SEER*Stat

The database selection is made as shown in Figure 9. For this research, the database used is "Incidence-SEER 18 Regs Research Data + Hurricane Katrina Impacted Louisiana Cases, Nov 2015 Sub (1973-2013 varying)".
This dataset is a collection of data from 18 different SEER registries, i.e. Atlanta, Connecticut, Detroit, Hawaii, Iowa, New Mexico, San Francisco-Oakland, Seattle-Puget Sound, Utah, Los Angeles, San Jose-Monterey, Rural Georgia, Alaska Native Tumor Registry, Greater California, Greater Georgia, Kentucky, Louisiana and New Jersey [45]. The selected dataset has data for all types of cancer and 134 variables to use. The 'Selection' tab allows filtering the database by site and morphology, year of diagnosis and other factors to customize the data extraction with respect to the research. The filters used on the selected database are shown in Figure 10. The following filters are applied:
• Only known-age cases in the research database are included
• Site and Morphology is selected as 'Breast'
• Year of diagnosis is selected as 1973-2013
• Only 'Dead' and 'N/A not first tumor' are selected for the SEER cause-specific death classification

Figure 10. Filters used in SEER*Stat

List of short-listed variables:
1. CS mets at Diagnosis-bone (mets - metastasis)
2. CS mets at Diagnosis-lung
3. CS mets at Diagnosis-liver
4. CS mets at Diagnosis-brain
5. Breast Subtype
6. Vital status recode (study cut-off used)
7. Age recode
8. Radiation
9. Radiation sequence with surgery
10. T value (size of the original tumor)
11. N value (degree of nearby lymph nodes involved)
12. M value (presence of distant metastasis)
13. Regional nodes positive
14. CS lymph nodes
15. CS mets at Diagnosis
16. CS extension
17. CS tumor size
18. Marital status at diagnosis
19. Regional nodes examined
20. Estrogen Receptor Status
21. Progesterone Receptor Status
22. Survival months
23. Laterality
24. Histologic Type ICD-O-3
25. Race/ethnicity
26. Year/Month of Diagnosis
27. Behavior code ICD-O-3
28. Surgery of Primary Site
29. Reason no cancer-directed surgery
30. SEER cause-specific death classification

Table 4. List of short-listed Variables

Out of the 134 variables available in SEER, it is crucial to use variables relevant to the prediction of breast cancer survivability, and seeking an expert's opinion for shortlisting the variables is essential. Hence, Dr. Robert Olson, a Radiation Oncologist at the BC Cancer Agency Centre for the North [83] and Regional Director of Faculty Development, Affiliate Assistant Professor, Northern Medical Program, UNBC [84], was consulted. Dr. Olson suggested a list of 30 relevant variables out of the 134 available (Table 4). These variables, along with their definitions, are listed in Appendix B and were selected in the case listing session window (Figure 11).

Figure 11. Selecting Variables in SEER*Stat

The 'Output' tab allows naming the dataset, and the session is executed to produce an output table or matrix. The resulting SEER*Stat matrix window can be exported in CSV format. Before extraction, the data is converted into ASCII text format. A dictionary file is auto-generated along with the CSV file, providing the codes for the variables formatted in the matrix. The SEER*Stat case listing session can be saved and reused to run and extract data.

3.2.4 Database
The extracted CSV file is imported into SQL Server. The database is created based on a star schema. The dictionary file is used to create the dimension tables, so each variable now has primary and foreign keys. The master CSV file serves as the fact table, using the foreign keys of each variable (dimension tables).
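Since this star schema is central to how the dashboard queries the data, the short sketch below illustrates the idea: coded values from the exported case-listing CSV act as foreign keys into dimension tables built from the dictionary file, and a reporting view joins them back together. The column names, codes and the use of sqlite3 (as a self-contained stand-in for SQL Server) are illustrative assumptions, not the actual schema used in this research.

# Hedged sketch: build star-schema dimension and fact tables from an exported
# case-listing CSV. Column names and codes are illustrative; sqlite3 stands in
# for SQL Server purely so the example is self-contained and runnable.
import sqlite3
import pandas as pd

# A tiny stand-in for the SEER*Stat case-listing export (coded values).
cases = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "race_code": [1, 2, 1],
    "marital_code": [2, 1, 3],
    "survival_months": [84, 23, 120],
})

# A tiny stand-in for the auto-generated dictionary file mapping codes to labels.
race_dim = pd.DataFrame({"race_code": [1, 2], "race": ["White", "Black"]})
marital_dim = pd.DataFrame({"marital_code": [1, 2, 3],
                            "marital_status": ["Single", "Married", "Widowed"]})

con = sqlite3.connect(":memory:")
race_dim.to_sql("dim_race", con, index=False)
marital_dim.to_sql("dim_marital", con, index=False)
cases.to_sql("fact_cases", con, index=False)   # the fact table keeps the foreign keys

# A reporting view joins the fact table to its dimensions, as the dashboard does.
query = """
SELECT f.patient_id, r.race, m.marital_status, f.survival_months
FROM fact_cases f
JOIN dim_race r ON r.race_code = f.race_code
JOIN dim_marital m ON m.marital_code = f.marital_code
"""
print(pd.read_sql_query(query, con))

The same join structure, expressed as a SQL Server view, is what Tableau connects to in the next section.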
3.2.5 Visualization Dashboard
There are several data visualization tools available, such as SSRS, Tableau, Power View and Qlik View, which can be used to build dashboards and visualize data. As stated earlier, Tableau is used primarily because of its versatility and flexible interface. Tableau is connected to the SQL Server breast cancer database. The tables are joined to create a view, extending horizontally by adding columns of data as needed. Tableau is further used to cleanse the data, such as by changing data types and renaming and resetting fields suitable for analysis. Calculated fields, formulas, groupings and sets are added via SQL queries. The grouping is done based on the level at which analysis is to be performed. Some of the groupings are shown in Table 5.

Grouping: Age-Range (group)
- Dimension used: Age-Range
- Level: 10-year level except 80-84 and 85+ (01-09, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-84, 85+)
- Data elements used: Ages 01-04, Ages 05-09, Ages 10-14, Ages 15-19, Ages 20-24, Ages 25-29, Ages 30-34, Ages 35-39, Ages 40-44, Ages 45-49, Ages 50-54, Ages 55-59, Ages 60-64, Ages 65-69, Ages 70-74, Ages 75-79, Ages 80-84, Ages 85+

Grouping: Race/Ethnicity (group)
- Dimension used: Race/Ethnicity
- Level: White, Black, Others, Unknown
- Data elements used: Others include American Indian, Aleutian, Alaskan Native or Eskimo (includes all indigenous populations of the Western hemisphere), Chinese, Japanese, Filipino, Hawaiian, Korean, Vietnamese, Laotian, Hmong, Kampuchean (including Khmer and Cambodian), Thai, Asian Indian or Pakistani, Asian Indian, Pakistani, Micronesian, Chamorran, Guamanian, Polynesian, Tahitian, Samoan, Tongan, Melanesian, Fiji Islander, New Guinean, Other Asian (including Asian), Pacific Islander, Other

Grouping: Regional nodes examined (group)
- Dimension used: Regional nodes examined
- Level: Exact number (01-89) of nodes examined; 90 or more nodes were examined; No nodes were examined; No regional nodes were removed; Regional node removal documented as dissection; Regional lymph node removal as sampling; Regional nodes were surgically removed; Unknown
- Data elements used: Exact number (01-89) of nodes examined: Exact 1 nodes examined to Exact 89 nodes examined. Unknown: Unknown, Unknown whether nodes were examined, Not Applicable or negative, Not stated in patient record

Grouping: Regional nodes positive (group)
- Dimension used: Regional nodes positive
- Level: Exact number (01-89) of nodes positive; 90 or more nodes were examined; All nodes examined are negative; No nodes were examined; Positive aspiration of lymph node(s) was performed; Positive nodes are documented, but not specified; Unknown
- Data elements used: Exact number (01-89) of nodes examined: Exact 1 nodes examined and tagged positive to Exact 89 nodes examined and tagged positive. Unknown: Unknown, Unknown whether nodes were examined, Not Applicable or negative, Not stated in patient record

Table 5. Grouping of Variables for Data Preprocessing

The data can be extracted into a Tableau workbook extract, which is a compressed snapshot of the actual data; it makes the data engine work faster and provides faster analytical and query performance. The extracted data can be further sliced and diced by using filters and parameters such as year of diagnosis, race/ethnicity and state. The dissected data is then visualized using workbooks, dashboards, and stories. Screenshots of the dashboard are presented in Section 4.1.

3.3 Breast Cancer Predictive Model
Identifying patterns from historical data and using them to make predictions forms the basis of predictive analysis [86].
Predictive analysis deals with developing models using a wide variety of data modeling techniques [87]. Decision trees (C&RT, QUEST, CHAID), Neural Networks, Linear Regression, and Support Vector Machines are some of the popular data modeling techniques.

The breast cancer predictive model is developed from the preprocessed data extracted from the SEER database. This model is trained with the existing data of breast cancer patients and could be used by doctors to determine a patient's survival time in months. The step-by-step process of developing the predictive model is discussed in the following subsections. However, before developing the predictive model, it is essential to finalize the technology used to create the model. For this research, IBM SPSS Modeler 18.1 was used; the underlying rationale for selecting IBM SPSS Modeler is discussed next.

3.3.1 SPSS Modeler
Many of the studies discussed in the literature [5, 18] have used WEKA [58] as the underlying software to design their predictive models. However, with the growing popularity of IBM's SPSS Modeler [88] and SAS Enterprise Miner [89], an inclination towards commercially available tools over the open-source ones used earlier was natural. Of the two, IBM's SPSS Modeler was selected for this research. The rationale for this decision was two-fold:
• to base this research on a different, reliable platform, increasing the likelihood that any differences observed could be attributed to the techniques used to design the predictive models, and
• the reliability of the IBM brand name and the assurance of adequate technical support, along with detailed, publicly available documentation, which helped address many of the primary concerns.

IBM SPSS Modeler is a software package used for building predictive models using advanced algorithms and data mining techniques, such as decision trees (C&RT, QUEST, CHAID), neural networks, linear regression, and support vector machines. SPSS Modeler 18.1 is used for this research and has the following features:
• It has a highly interactive and user-friendly interface.
• SPSS Modeler can do data preparation for the user and has the capability of automated preparation of raw data via the 'Automated Data Preparation' (ADP) node.
• Modeling nodes such as 'Auto Classifier', 'Auto Numeric' and 'Auto Cluster' are powerful techniques that can compare several modeling methods and rank them in order of effectiveness.
• It helps extract value from a variety of data, including structured and unstructured data, from sources such as a database, variable file, statistics file, IBM Cognos BI [90] or SAS file.

Once IBM SPSS Modeler was finalized as the platform on which to create the predictive model, the next task was to put it to use and commence designing the predictive model. Although the IBM documentation provides guidance on creating custom models, the content was ambiguous on a few occasions and the documentation, in general, was verbose. Broadly, the following essential tasks were identified to accomplish the predictive model:
• preparing the data for modeling
• determining the input and target variables of the breast cancer predictive model
• selecting the modeling techniques for developing the predictive model
• training the predictive model
• testing the predictive model

These tasks are described in the following sections.

3.3.2 Data Preparation
The SEER data is available for 1973-2013 and was divided into two datasets, one for training and the other for validating/testing.
The 1988-2003 dataset is selected for training the model for the following reasons:
• The model needs to be trained on one particular dataset and then tested and validated on different datasets, i.e. data outside the training dataset.
• The training dataset should have data for all shortlisted variables.
• The range of the number of years over which to predict survivability in this research is arbitrarily set at 10 years. Since the follow-up cut-off date for the selected SEER data is December 31, 2013, only cases registered in 2003 or before are considered.

Thus, the dataset from 1988-2003 is selected for training the predictive model, and the dataset for the year 2004 is used for testing and determining the predictive model's accuracy.

3.3.3 Determining Input Variables
After data preparation, the subsequent task involved shortlisting relevant variables which have predictive power. The relationship of the input variables with the dependent (target) variable determines the power of the predictive model. The first screening process of narrowing the 134 total variables down to 30 relevant variables played an instrumental role in this process. The shortlisted variables were considered relevant if they were the best fit or had predictive power. The target or outcome variable is 'survival months', which is the dependent variable. The remaining 29 variables are independent variables and are checked for a relationship with the dependent variable, i.e. 'survival months'. Another important consideration is to select input variables which are available for the selected period (1988-2003).

The feature selection modeling technique is used for shortlisting relevant input variables. This technique is used during the preliminary stages of analysis to locate variables that are most likely to be of interest. Feature selection consists of three steps: screening, ranking and selecting. The 'Feature Selection' node (Figure 12) is configured to find the rankings of all input variables, i.e. important, marginal and unimportant; a small illustrative sketch of this ranking idea follows Table 6. Variables not available for the training period are removed from the list of input variables because there is no data for those variables with which to train the model.

Figure 12. Feature Selection Model Snapshot

The 'Feature Selection' model nugget filters out nine of the 30 variables as unimportant. The remaining 21 variables are ranked by their importance. By cutting down the number of fields in the model, scoring time and the amount of data collected in future iterations can be reduced [88]. The list of variables according to importance ranking is shown in Table 6. The detailed definitions and coding of these variables are listed in Appendix B.

Input variables:
1. Marital Status
2. Race/ethnicity
3. Age recode
4. Laterality
5. Histologic Type ICD-O-3
6. Behavior code ICD-O-3
7. Regional nodes positive
8. Regional nodes examined
9. Reason no cancer-directed surgery
10. Radiation
11. Radiation sequence with surgery
12. Surgery of Primary Site
13. Vital Status recode
14. Estrogen Receptor Status
15. Progesterone Receptor Status
16. T value
17. N value
18. M value
19. Year/Month of diagnosis

Target variable:
20. Survival months

Record ID (unique identifier):
21. Patient ID

Table 6. List of Variables (Input & Target) for Predictive Modeling
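As a rough illustration of this screen/rank/select idea, the sketch below ranks a handful of candidate predictors against 'survival months' using a univariate importance score. It is not the SPSS Modeler Feature Selection algorithm; the scoring function (mutual information from scikit-learn), the field names and the synthetic data are all assumptions made for the example.

# Hedged sketch: rank candidate predictors against 'survival months', loosely
# analogous to a screen/rank/select pass. Field names and data are illustrative;
# this is not the SPSS Modeler Feature Selection algorithm.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    "age_recode": rng.integers(25, 90, n),
    "tumor_size": rng.integers(1, 120, n),
    "regional_nodes_positive": rng.integers(0, 30, n),
    "marital_status_code": rng.integers(1, 6, n),
    "random_noise": rng.normal(size=n),          # should rank near the bottom
})
# Synthetic target loosely driven by two of the fields.
df["survival_months"] = (
    120 - 0.8 * df["tumor_size"] - 2.0 * df["regional_nodes_positive"]
    + rng.normal(0, 10, n)
).clip(lower=0)

X = df.drop(columns="survival_months")
scores = mutual_info_regression(X, df["survival_months"], random_state=0)

ranking = (
    pd.Series(scores, index=X.columns, name="importance")
    .sort_values(ascending=False)
)
print(ranking)  # keep the top-ranked fields; drop those scoring near zero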
3.3.4 Selecting Modeling Techniques
The predictive model is developed using modeling technique(s) which are based on the use of algorithms. There are three modeling technique classes in SPSS Modeler, namely Classification, Association and Segmentation. Examples of modeling techniques in these classes are listed in Table 7. Classification models take one or more input fields and can predict one or more target variables. Association models find patterns in the data where one or more entities are associated with one or more other entities; these models allow a variable to act as both input and target. Segmentation models, on the other hand, divide the data into clusters that have similar patterns of input variables.

The goal of this research is to predict the survival months of patients. Survival months is of continuous (numeric) data type, thus the selection of modeling techniques is based on the models which allow a continuous numeric range target. The classification techniques which support a continuous numeric range target include Neural Networks, C&R Tree, CHAID, Linear Regression, Generalized Linear Regression and Support Vector Machines.

Classification             | Decision Trees: C&R Tree, Quest, CHAID, C5.0; Regression: Linear, Logistic, Generalized linear, Cox regression; Neural Networks; Support Vector Machines; Bayesian Networks
Association                | Apriori model; Carma model; Sequential detection model
Segmentation (clustering)  | Kohonen Networks; K-Means clustering; Two-step clustering; Anomaly detection

Table 7. Modeling Technique Classes

The predictive model is built using the modeling technique classes described above. The modeling techniques which complete execution in a reasonable time and show a high correlation of variables were selected. The top three classification techniques selected are Neural Network, CHAID and C&R Tree. These techniques are discussed in detail in the following sections.

3.3.4.1 Neural Network
Neural Networks work by finding unknown and intricate patterns in the data. They resemble the human brain in that they gain knowledge through a learning process. The basic units, called neurons, are organized in layers. The neurons are connected with different weights, and the network learns from training. There are three types of layers in a neural network, namely, the input layer, the hidden layers and the output layer (Figure 13). The input variables are presented to the input layer, the values are propagated through each unit in the hidden layers, and the predicted outcome is delivered from the output layer. The network learns by examining each record, predicting the target and adjusting the weights if the prediction is incorrect. This is a recursive process which stops only when a stopping criterion defined before training is met. Neural Network models are recommended when interpretability is not a priority.

Figure 13. Structure of Neural Network

It is not easy to understand the underlying process by which the network creates a relationship between the target and input variables. There are two types of Neural Network models available in IBM SPSS Modeler: Multilayer Perceptron (MLP) and Radial Basis Function (RBF). MLP is made up of two or more hidden layers. It is a feed-forward, supervised learning network. It is a function of multiple input variables that minimizes the prediction error of one or more targets [87]. MLP has higher training and scoring time compared to RBF; conversely, RBF has lower predictive power than MLP. Training the Neural Network model is a critical step, as the model's accuracy depends on the training process. Finding suitable model settings for training the Neural Network models is an iterative process.
After data preparation, the next step is identifying factors such as the type of Neural Network model (MLP or RBF), the structure (number of hidden layers and number of neurons in each layer), stopping rules, training time, training cycles, and ensemble voting if applicable. After training the model, the results are validated and the process is repeated if required. The field requirements are simple, as there must only be at least one target field and one input field [88]. Neural Networks deal with missing values with these two options:
• Records with missing values are excluded.
• Missing values are imputed - for continuous fields, the average of the minimum and maximum observed values is imputed.

3.3.4.2 Chi-squared Automatic Interaction Detection
Chi-squared Automatic Interaction Detection (CHAID) is a classification scheme for building a decision tree by using chi-square statistics. Decision tree models predict future outcomes based on a set of decision rules. Decision trees work best with categorical data elements such as surgery versus non-surgery treatment, married versus unmarried patients, and types of nodes involved. Decision trees can also relate predictions to the values of continuous variables. CHAID uses statistical tests as the criteria for evaluating the predictors (input variables). The unique feature of CHAID is that it groups variables that are statistically similar with respect to the target variable, while maintaining a group of variables that are statistically dissimilar. Taking the similar ones into consideration, CHAID finds the best predictor to create the first branch of the tree. This tree has child nodes which also fall under the statistically similar variables group, and the tree is completed by continuing this process. The statistical test for grouping similar variables uses the F-test for a continuous target and the chi-squared test for a categorical target. Since CHAID is not a binary tree method, it produces two or more groupings at each level of the tree; CHAID decision trees are therefore wider than those of binary tree methods. The advantages of using CHAID are:
• It works for all types of variables.
• It accepts both frequency and case-weighted variables.
• It can handle missing values by treating them as a single category.

Predictions in CHAID are made by following the tree to a terminal node, which has a specific predicted value associated with it. For a numeric target, the terminal node's predicted category is calculated as the weighted mean of the target values for the records in the node. CHAID works for all types of inputs, and both target and inputs can be continuous or categorical [88].

3.3.4.3 Classification and Regression Tree
The Classification and Regression Tree (C&RT) model is used when there are multiple inputs and one target variable. C&RT groups the data into two subsets and repeats the process until homogeneous records are grouped together. The subsets split again, and the process repeats until a stopping rule is applied or complete homogeneity is achieved. C&RT may use the same predictor field at different tree levels. All the splits are binary; thus C&RT is strictly a binary tree. It provides the option to first grow the tree and then prune it based on cost-complexity criteria. This technique requires one or more input variables and exactly one target variable. The advantages of using the C&RT modeling node are:
• It does not require a long training time.
• It is quite adaptive if the data has missing values or a large number of fields.
The predictions in C&RT are also made by following the tree splits to a terminal node of the tree. The terminal node has a specific predicted value associated with it. For a numeric target, the terminal node's predicted category is calculated as the weighted mean of the target values for the records in the node. Both target and input variables can be continuous or categorical [88].

3.3.4.4 Ensemble
The ensemble modeling technique combines multiple individual models to provide better prediction accuracy. In the literature, it is observed that researchers use an ensemble of different modeling techniques and predict better outcomes as compared to individual models [5, 18]. By combining the predictions from multiple models, limitations of individual models can be avoided, thereby resulting in higher accuracy overall. Models combined in this manner perform at least as well as the best of the individual models, and often better [88]. One of the difficulties faced when developing an ensemble model is determining the combination of models to be used; the individual models selected for the ensemble should have high prediction accuracy and should not overfit [64]. There are different rules for combining the predicted values from individual models to compute the ensemble score. For categorical targets, the available combining rules are voting, highest probability and highest mean probability. For continuous targets, averaging is the only combining rule. Since the target variable is continuous, the ensemble score is computed by averaging the values of the individual models. The ensemble prediction is calculated using the following formula (Equation 1):

\hat{y}_{i,M} = \frac{1}{M} \sum_{m=1}^{M} \hat{y}_{i,m}

Equation 1. Ensemble Prediction Equation

where \hat{y}_{i,M} is the final predicted value for case i, \hat{y}_{i,m} is the m-th base model's predicted value for case i, and M is the number of base models.

A hypothesis is proposed that the Ensemble model outperforms the individual models, i.e. the Neural Network, C&RT and CHAID models. After selecting the modeling techniques, the predictive model is built. The following sections present the three phases of building the prediction model in detail.

3.3.5 Training Predictive Model
Figure 14 is a snapshot of the training phase of the predictive model built in IBM SPSS Modeler [88]. The model description is as follows:

Figure 14. Predictive Model Training snapshot

• Data (1988-2003) node: this is the Excel Source node. The Excel Source node allows importing any Excel workbook available on the local machine. The import can be customized by selecting a specific range of cells defined in the Excel worksheet or by selecting a specific worksheet from the entire workbook. The variable names can be changed in the Source node settings. The Excel worksheet imported for training the model contains the SEER-extracted breast cancer data from 1988-2003.
• Type node: The Source node is connected to a Type node which defines the measurement level for each variable, such as Nominal, Ordinal, Continuous, Categorical, Flag or Typeless. The Type node also defines the role of each field, such as Input, Target, Both (Input & Target), None, Partition, Split, Frequency, or Record ID. Input fields are the predictors, and the Target is the field that the model has to predict. In this model there are a total of 21 fields, of which 19 are predictors (input), 1 is the Target (Survival months) and 1 is the Record ID (Patient ID).
• Modeling nodes: The Type node is connected to three modeling nodes - Neural Network, CHAID and C&R Tree.
The selected modeling nodes are classification models which use one or more predictors to predict the target. Each modeling node has a field option where variables are specified as input and target; however, since a Type node is used, there is no need to address the field specification in the modeling nodes.
• When the model completes its execution, the resulting nuggets are added to the stream for each modeling node (Figure 14). The nuggets contain the complete information of the model (the rules and equations developed) and the accuracy of the independent model formulated by IBM SPSS Modeler. The model summary can be browsed by double-clicking the generated nuggets.
• An Ensemble node is added to the stream to create an ensemble of these techniques. For this purpose, the modeling nuggets are connected to the Ensemble node as shown in Figure 14. The Ensemble node provides minimal options, such as selecting a target field for the ensemble, filtering out the fields generated by the ensemble models and calculating the standard error. Figure 15 shows the Ensemble node settings: 'Survival Months' is selected as the target, and the option 'Filter out fields generated by ensemble models' is unselected to obtain the individual models' predictions along with the ensemble's. The option for calculating the standard error was also chosen.

Figure 15. Ensemble Node Setting

On completing execution, each modeling technique node creates a nugget (diamond-shaped), except for the Ensemble, which is a combining rule model. Each nugget has a model summary which displays the number of variables used, predictor importance, stopping rules, and number of layers. The nugget summary also shows the training accuracy of each modeling technique. The following training accuracies were observed for the selected modeling techniques of the predictive model:

Modeling Technique | Training Accuracy
Neural Network     | 82.9%
CHAID              | 82.1%
C&RT               | 82.0%

Table 8. Training Accuracy

In the Analysis node, the training and actual outcomes are analyzed for the individual models as well as the Ensemble model. The statistical measures used for comparison are the mean, minimum, maximum, mean absolute error and standard deviation. Once the model has been trained, it is ready to be tested and validated. The next section discusses the testing and validation methods used in this research.

3.3.6 Testing and Validation of Predictive Model
Figure 16 is a snapshot of the testing phase of the developed predictive model. The model is now tested on cases outside its training range, that is, cases registered in 2004. The source node now represents the testing dataset, i.e. Data (2004). The output of the individual models and the ensemble model can be captured in different ways:
• Table - a static matrix window generated in IBM SPSS Modeler
• Analysis node - this node allows testing the measured (predicted) values against the known result
• Excel node - it generates an Excel sheet with predicted values
• Additional nodes used here are the Transpose node, to transpose columns into rows, and a Type node, to extract data into the Excel sheet

Figure 16. Predictive Model Testing Snapshot

Figure 17. Execution of Excel Output Node

For testing the model with all cases diagnosed in 2004 together, the Excel output node is executed (as shown in Figure 17), which generates an Excel sheet with the predicted outcomes for each record. This newly generated Excel sheet is then used to compare the measured values with the actual values (survival months) to validate the accuracy of the predictive model. The results of the validation of the model are discussed in the next chapter.
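As a compact, hedged illustration of the training, ensembling and testing workflow just described, the sketch below fits three regressors on a training portion, averages their predictions per Equation 1, and reports the mean absolute error on a held-out portion. scikit-learn stand-ins are used (MLPRegressor for the neural network, and two CART-style trees in place of CHAID and C&R Tree, neither of which scikit-learn implements); the data, field count and settings are illustrative only.

# Hedged sketch of the train / ensemble-average / test workflow described above.
# MLPRegressor stands in for the neural network; two CART-style trees stand in
# for CHAID and C&R Tree. Data, field names and settings are illustrative only.
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 5))                                   # stand-ins for the predictors
y = 60 + 20 * X[:, 0] - 15 * X[:, 1] + rng.normal(0, 8, n)    # survival months (synthetic)
y = np.clip(y, 0, None)

# A "1988-2003" style training portion and a held-out "2004" style testing portion.
X_train, X_test = X[:3000], X[3000:]
y_train, y_test = y[:3000], y[3000:]

models = {
    "neural network": MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                                   random_state=0),
    "tree (CHAID stand-in)": DecisionTreeRegressor(max_depth=6, random_state=0),
    "tree (C&RT stand-in)": DecisionTreeRegressor(max_depth=8, min_samples_leaf=20,
                                                  random_state=0),
}

predictions = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    predictions.append(pred)
    print(f"{name}: MAE = {mean_absolute_error(y_test, pred):.2f} months")

# Equation 1: the ensemble prediction is the average of the base models' outputs.
ensemble_pred = np.mean(predictions, axis=0)
print(f"ensemble (average): MAE = {mean_absolute_error(y_test, ensemble_pred):.2f} months")

With real extracts, the training portion would correspond to the 1988-2003 cases and the held-out portion to the cases diagnosed in 2004, mirroring the split described in Section 3.3.2.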
The method presented above works well for testing a large number of cases together. However, if the user wants to predict survival months for one individual case, they could use the User Input to enter values of all variables and use the same predictive model as a calculator. Figure 18 is a snapshot of using the predictive model as a calculator with the User Input node instead of the source node. Figure 19 shows the User Input node window. The definitions and coding of each variable are attached in Appendix B. 63 Figure 18. Predictive Model as calculator for individual case Figure 19. User Input Node Snapshot 64 3.4 Summary In this chapter, a detailed description of the five main tasks, namely, data extraction, data analysis, data visualization, predictive modeling, and evaluation of predictive model has been presented. These tasks collectively contribute to accomplishing the implementation of the predictive model along with visualization dashboard. A comprehensive understanding of several predictive modeling techniques along with the rationale behind choosing the right tools, technologies, and approaches to accomplish the primary contributions have also been presented. The experiments and analysis of results of both dashboard and predictive model are presented in the next chapter. 65 4. Experiments and Results In this chapter, the experimental results of data analysis and predictive techniques are presented. The first section demonstrates the data analysis and visualization reports and dashboard representing forty years of breast cancer data. The second section demonstrates the accuracy of the predictive model and comparison study of predictive model’s accuracy on different datasets. 4.1 Data Analysis and Visualization For visual analytics, the pre-processed SEER data is first imported to SQL Server to create ‘Cancer’ database in a format suitable for analysis. Tableau is connected to the Cancer database, and the tables are joined to create a virtual table, extending horizontally by adding columns of data needed. The data is further cleaned up (by changing data types, renaming & resetting fields, wherever needed) to prepare data for analysis. Data is analyzed by adding new calculated fields, formulas, grouping and sets by writing SQL queries. The data is sliced and diced by using filters on measures and dimensions and creating parameters. The data is then visualized using workbook, dashboards and stories. The dashboard and dynamic reports use the views and tables created in the SQL Server database. The subsequent sections present each of the reports generated using Tableau [75]. Tableau story [91] and story points are used for visual analytics of breast cancer data. Tableau story is a sequence of visualizations that work together to convey the findings. Each sub-report is also called a story-point. The top-level dashboard provides an overview of the KPIs and includes navigation controls including a tab panel (Figure 20) which allows switching between breast cancer metastasis, TNM system, cases by geo-mapping, geo-mapping by race/ethnicity, 66 cases by race and age range, lymph node involvement, incidence/mortality, survival/mortality and anatomy dashboards. Figure 20. Breast Cancer Dashboard Story Point Panel 4.1.1 Breast Cancer Survivability Dashboard Figure 21. Breast Cancer Survivability Dashboard 67 The main dashboard (Figure 21) contains five sub-reports/story points which can be filtered by year of diagnosis (1973-2013). 
The first table shows the total number of cases by vital stats (i.e. Alive or Dead). Out of the total of 1,037,457 cases registered in SEER over the forty-year period, 241,677 cases died due to breast cancer and the rest were tagged Alive at the study cut-off date (i.e. December 31, 2013). Amongst all the causes of death in females (including cancer and non-cancer related deaths), breast cancer is the third highest cause of death after lung-bronchus cancer and diseases of the heart. For registered breast cancer cases, a similar categorization is represented by marital status at diagnosis. The analysis shows that the majority of women (597,909 cases) were married at the time of breast cancer diagnosis, followed by widowed, single (never married), divorced, separated and unmarried, in descending order. There are about 44,700 cases whose marital status was unknown. All the cases are further categorized by age-range and survival years. For this table, only cases tagged Alive as of the cut-off date are used. Of the total 'Alive' tagged cases, the majority of patients who survived more than 10 years are in the 50-59 age range, followed by 40-49 and 60-69 years. Brandt et al. [92] also concluded that women aged under 40 and above 80 years at the time of diagnosis have a poor survival rate independent of other factors. This is consistent with the observations made from the dashboard.

4.1.2 Breast Cancer Metastasis
The Breast Cancer Metastasis dashboard (Figure 22) consists of five sub-reports which can be filtered by year of diagnosis and by alive cases or cases who died. The data for metastasis is only available from 2010 onwards. The vital stats table shows that out of a total of 258,125 cases registered with metastasis information, 13,499 patients had died due to breast cancer and the rest were alive at the study cut-off date. The four pie-charts display information regarding metastasis to bone, brain, liver and lung. Cases are categorized based on whether the breast cancer metastasized, did not metastasize, or is unknown. The number of unknown cases is low and hence is not included in the pie-charts. It is noted that the majority of the cases who died had their breast cancer metastasize to the bone, followed by lung, liver and brain, respectively. Similarly, 'Alive' cases can be selected to observe the impact of metastasis.

Figure 22. Breast Cancer Metastasis (Data 2010+)

4.1.3 Breast Cancer TNM System
The Breast Cancer TNM System dashboard (Figure 23) consists of three sub-reports categorized by T, N and M values for cases registered from 1988 to 2003. The bar graph views can be switched between age-range, marital status and year of diagnosis by using the radio buttons on the top-right of the dashboard. A massive jump in the number of cases registered is observed in 1992 and 2000; this is because new SEER registries were added to the SEER database in these two years [93]. Figure 23 consists of three sub-reports which present cases by T value (size of the original tumor), N value (degree of nearby lymph nodes involved) and M value (presence of distant metastasis). Each value type has different categories, such as T0, T1, TX, N0, N1, NX, M0, and M1. Each point in the charts has a tooltip to display more information and can be clicked to open relevant web pages. The dashboard shows that for all cases diagnosed in 1988-2003, the maximum number of cases have the T1 value, followed by the T2 value (the T categories denote increasing size of the original tumor).
4.1.3 Breast Cancer TNM System

The Breast Cancer TNM System dashboard (Figure 23) consists of three sub-reports categorized by T, N and M values for cases registered from 1988 to 2003. The bar graph views can be switched between age range, marital status and year of diagnosis by using the radio buttons on the top-right of the dashboard. A massive jump in the number of cases registered is observed in 1992 and 2000; this is because new SEER registries were added to the SEER database in these two years [93]. Figure 23 consists of three sub-reports which present cases by T value (size of the original tumour), N value (degree of nearby lymph node involvement) and M value (presence of distant metastasis). Each value type has different categories such as T0, T1, TX, N0, N1, NX, M0 and M1. Each point in the charts has a tooltip to display more information and can be clicked to open relevant web pages. The dashboard shows that for all cases diagnosed in 1988-2003, the maximum number of cases have a T1 value (tumour 2 cm or less in size) followed by the T2 value (tumour larger than 2 cm but not larger than 5 cm). Likewise, the maximum number of cases did not have any nearby lymph nodes containing cancer (N0 value), and no distant metastasis was found (M0 value). When the view is switched and cases are categorized by age range, the dashboard shows that for the 50-59 age range, which has the maximum number of cases, 58.28% of cases have a T1 value, followed by T2 (24.04%), TX, Txa, T3, T4d, T4b, T4a, T0, Tis, and T4c. The N value charts show that for the 50-59 age range, 60.46% of cases have an N0 value followed by N1x (16.20%), N1b, NX, N1a and N2. The M value chart shows that 93.29% of cases in the 50-59 age range have an M0 value, i.e. no distant metastasis. Similar trends are observed when the view is switched to marital status.

Figure 23. Breast Cancer TNM System

4.1.4 Geographic Distribution of Breast Cancer Cases by Race

Figure 24. Geographic Distribution of Breast Cancer Cases by Race

The Geo-mapping by Race/Ethnicity dashboard (Figure 24) is a dynamic map showing the case concentration and its changes over the years. The pie-charts categorize the cases by race/ethnicity: White, Black, and Others, the latter being a clickable option which drills down to display the distribution of cases across all races in the data. A "play/pause" feature allows a dynamic display of the changes over the years (1973-2013). The State parameter can be used to select one or more states and observe the trend for the selected regions.

4.1.5 Breast Cancer Cases by Race and Age Range

Figure 25. Breast Cancer Cases by Race and Age Range

Breast cancer cases by race and age range are shown in Figure 25. The bubble chart shows the cases belonging to races other than White and Black, which are filtered from the Geo-mapping dashboard. A tabular breakdown of cases by age range shows that the 50-54 age range has the highest number of cases, followed by 45-49 and 55-59. This dashboard can be further filtered by the year of diagnosis, race and state parameters.

4.1.6 Breast Cancer Anatomy Dashboard

Figure 26. Breast Cancer Anatomy

The Breast Cancer Anatomy dashboard (Figure 26) consists of four sub-reports/story points. The data used is for the period 1973-2013, except for breast subtype, for which data is only available from 2010 onwards. The tables present breast cancer cases categorized by laterality, examined and positive regional nodes, and breast subtype. The highlighted cells in each table mark the highest number of cases in that category. For example, in most cases the tumour originated on the left side of the body/organ. The regional nodes are the lymph nodes present in the armpits [94]. For the majority of cases, regional nodes were examined (1-89 in number) to test for cancer cell involvement; 50% of these were negative (absence of cancer cells). The breast subtype is further categorized by ER (estrogen receptor) status and PR (progesterone receptor) status. The analysis shows that the maximum number of cases have positive ER and PR status and the breast subtype Her2-/HR+ (human epidermal growth factor receptor 2 negative/hormone receptor positive).

4.1.7 Lymph Node Involvement Dashboard

The lymph node involvement in breast cancer is shown in Figure 27. The presence of cancer cells in a lymph node under the arm acts as an indicator of increased risk of the cancer spreading in the body [94]. The higher the number of lymph nodes containing cancer cells, the more serious the cancer. Physicians often use the count of lymph nodes involved to design treatment plans.
The data for this dashboard was only available from 2004 onwards; the cases registered before 2004 are categorized as Unknown. Axillary/regional lymph node involvement is observed in the largest number of cases. Axillary lymph node dissection is performed on axillary nodes which are suspected to have cancer cells in them. Since the axillary nodes receive around 75% of the lymphatic drainage from the breast, it is not surprising that, of the over 88 thousand breast cancer cases reported, axillary/regional lymph node involvement accounted for almost one third, making it the top category by number of cases. No distant metastasis was found in the majority of cases, i.e. the cancer had not spread to other parts of the body for such cases. The prognostic factors are displayed in the prognostic indicators table; Positive/elevated topped the list of prognostic indicators.

Figure 27. Lymph Node Involvement in Breast Cancer Cases

4.1.8 Geographic Distribution of Breast Cancer Cases by Incidence/Mortality

Figure 28. Geographic Distribution by Breast Cancer Incidence Cases

Figure 29. Selection panel - Breast Cancer Incidence/Mortality Cases

Geographic distribution by incidence/mortality provides incidence and mortality case mapping by state and county as two separate dashboards (Figure 28 and Figure 30), either of which can be accessed by switching the views (Figure 29). Both dashboards can be filtered by the year of diagnosis and the state. Figure 28 and Figure 30 display breast cancer incidence and mortality cases for the year 2013 and show that California had the highest number of breast cancer cases, followed by Washington, Michigan, Kentucky, and Connecticut. The pink plots denote the number of cases only and do not take the population of the states into consideration while determining the rankings. The dashboard helps identify that the incidence count of breast cancer is not directly proportional to population; this is evident from the fact that the population of Michigan in 2013 was higher than that of Washington, yet Washington had the higher incidence count. The mortality trends (i.e. cases who died due to breast cancer) for the year 2013 are not coherent with the incidence trends observed for the same year. The state of California had the highest mortality count, followed by Kentucky, Michigan and Connecticut; Alaska had the lowest mortality count. The mortality counts, however, show a similar lack of dependency on the total population of the states for that year. This is evident from the fact that Kentucky, like Washington, had a total population lower than that of Michigan, yet had a higher mortality count than Michigan.

Figure 30. Geographic Distribution by Breast Cancer Mortality Cases

4.1.9 Breast Cancer Survival/Mortality Rate by Age Range

A box and whisker visual, as shown in Figure 31, is a chart type which displays the data distribution by quartiles. The box represents the values between the first and third quartiles. The whiskers (A – lower whisker and B – upper whisker) represent the distance from the lowest value to the first quartile and from the third quartile to the highest value. At the median (E) the box colour changes and becomes lighter, showing the upper and lower quartiles (D and C, respectively). The lower hinge (C) and upper hinge (D) are the medians of the lower and upper halves of the data [95].
Each box and whisker corresponds to a specific age range and shows the distribution of cases by survival or mortality rate.

Figure 31. Box and Whisker Visual

Figure 32. Breast Cancer Survival Rate by Age Range

Figure 32 shows a box-whisker chart of the survival rate by age range for California (1973-2013). The survival rate is the ratio of cases tagged Alive to the total number of cases registered. The survivability (by years) is plotted against age range and survival rate. The median survival rate is highest for the 60-69 age range (82.2%), followed by the 50-59, 40-49, and 70-79 age ranges. One exception is the 10-19 age range, which has a 100% survival rate because of the low number of cases (12).

Figure 33. Selection panel - Breast Cancer Survival/Mortality Rate

A similar dashboard is shown for the mortality rate (Figure 34) for the same geographical area. The survival rate and mortality rate views are two separate dashboards, either of which can be accessed by switching the views (Figure 33). The mortality rate is the ratio of cases tagged 'Dead' to the total number of cases registered. The median mortality rate is highest for the 85+ (28.6%) and 30-39 (28.2%) age ranges; the 60-69 age range has the lowest mortality rate of 17.7%. Each plot of the box-whisker chart shows the total number of cases, the Alive/Dead cases, and the survival/mortality rate for the selected survivability period, which ranges from 1 to >10 years. On the survival rate dashboard (Figure 32), the bottom point of each box-plot corresponds to the lowest survivability period, with periods increasing from bottom to top. The pattern reverses on the mortality rate dashboard (Figure 34), where the top point of each box-plot corresponds to the lowest survivability period, with periods increasing from top to bottom.

Figure 34. Breast Cancer Mortality Rate by Age Range

4.1.10 Summary of Data Analysis and Visualization Results

An interactive, end-to-end process to cleanse, integrate, analyze, and visualize an enormous amount of data has been developed. The purpose is to provide healthcare professionals, patients and policymakers with a better understanding of the hidden patterns in the data, which in turn could be used to collectively improve the quality of healthcare. Forty years of breast cancer data (over one million records) extracted from the SEER database was used to demonstrate the analytical power of data visualization. The research approach involved applying several data preprocessing techniques to the raw data, followed by a selection of the 30 relevant variables, out of a total of 134 variables, before feeding the processed data into SQL Server for analysis. Tableau was used for understanding and interpreting the data along several dimensions. Dynamic reports generated using the drill-down capabilities of the dashboard provide insights at a finer granularity. The dashboard shows incidence and mortality trends and highlights the underlying patterns observed in breast cancer patients, which could be used to support clinical decisions made by physicians in formulating treatment plans. The dashboard is scalable and capable of integrating new data in real-time. Although SEER data from the US currently powers the dashboard, it can be configured to use data from other sources as well.
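As a complement to the dashboard summary, the short sketch below illustrates, under hypothetical column names and toy data, the survival-rate and mortality-rate definitions used in Section 4.1.9 (cases tagged Alive or Dead divided by the total registered cases) and the quartile statistics from which a box-and-whisker view is drawn.

```python
# A minimal sketch of the rate definitions from Section 4.1.9. Hypothetical data.
import pandas as pd

cases = pd.DataFrame({
    "age_range":    ["50-59", "50-59", "50-59", "60-69", "60-69", "60-69"],
    "vital_status": ["Alive", "Alive", "Dead",  "Alive", "Dead",  "Alive"],
})

summary = (
    cases.assign(alive=cases["vital_status"] == "Alive")
         .groupby("age_range")
         .agg(total=("vital_status", "size"), alive=("alive", "sum"))
)
summary["survival_rate"]  = summary["alive"] / summary["total"]   # Alive / total registered
summary["mortality_rate"] = 1 - summary["survival_rate"]          # Dead / total registered
print(summary)

# Quartile summary (Q1, median, Q3) of a set of rates: the statistics a
# box-and-whisker view is built from.
print(summary["survival_rate"].quantile([0.25, 0.5, 0.75]))
```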
4.2 Predictive Modeling

The predictive model is trained with the breast cancer dataset consisting of 400,000 patients registered from 1988-2003. The dataset is obtained by pre-processing and transforming the SEER dataset. The variables used to train the model are selected by using the feature selection algorithm; twenty-one variables are selected out of the 134 variables available from SEER. The model uses the CHAID, C&RT, Neural Network, and Ensemble modeling techniques. Equal weights are assigned to the selected models, and the Ensemble score is generated by averaging. The outcome variable 'Survivability' refers to the survival time (in months) of each patient. The performance metrics used are the average survival months and the accuracy; the metrics and graphs are computed using Tableau.

In the following sections, graphs showing a comparison of actual and measured (predicted) average survival months are plotted by age range, marital status, positive to examined regional nodes ratio (percentage), radiation and surgery sequence, ER status, PR status, and Behavior code ICD-O-3. Similarly, a comparison of the accuracy of the individual modeling techniques and the ensemble modeling technique is plotted for the same variables. The results are presented for cases tagged as 'Dead' on the cut-off date. According to the SEER variable description, any patient who died after the follow-up cut-off date, i.e. December 31, 2013, is recoded as 'Alive' as of that date. The noted survival months of 'Alive' cases are therefore not their actual survival months, as these patients were not followed up after the cut-off. Thus, only cases tagged as 'Dead' at the cut-off date are used to validate the predictive model's accuracy.

The developed predictive model is tested on cases registered in the year 2004, which is outside the training period (1988-2003) of the model. This dataset is selected for testing and validating the predictive model for the following reasons:

• The cut-off date of the study is 2013, and cases diagnosed in 2004 are followed up until 2013, which gives a survivability range from 1 to >10 years.

• The majority of cases fall under the 10 and >10 year survivability periods, as shown in the visualization dashboard (Figure 21); thus, the dataset for the year 2004 has the maximum number of cases of any calendar year outside the training period.

Additionally, the same dataset yields all survivability ranges, which makes it apt for testing and experimentation purposes. There are a total of 55,268 cases registered in 2004 (only cases with the exact month of diagnosis are included). A total of 46,365 cases are tagged 'Alive' at the cut-off date, and 8,903 cases are marked 'Dead' at the cut-off date.

Figure 35. Vital Status comparison (2004)

The vital status statistics of the actual outcomes and the predictive model's output (i.e. measured, where measured survival months are the survival months predicted by the predictive model) are compared in Figure 35. For cases diagnosed in 2004, 83.39% of cases are tagged 'Alive' and 16.61% are tagged 'Dead' at the cut-off date. The developed predictive model predicts 82.78% of cases as 'Alive' and 17.22% as 'Dead' as of the cut-off date (December 31, 2013), thus demonstrating accuracies of 99.26% and 96.45%, respectively.
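The modeling itself is carried out in IBM SPSS Modeler; as a rough, open-source analogue of the setup described at the beginning of this section, the sketch below fits two decision-tree regressors and a small neural network (standing in for the CHAID, C&RT and Neural Network nodes) on synthetic data and combines their scores with an equal-weight average. All data, features and settings shown are illustrative assumptions, not the thesis configuration.

```python
# Rough analogue of the equal-weight ensemble of base models. Synthetic data only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                       # stand-in for the 21 selected predictors
y = 40 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=5, size=500)  # "survival months"

models = {
    "tree_a": DecisionTreeRegressor(max_depth=4, random_state=0),
    "tree_b": DecisionTreeRegressor(max_depth=6, min_samples_leaf=10, random_state=0),
    "neural": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}
for m in models.values():
    m.fit(X, y)

# Equal-weight ensemble: simply average the per-case predictions of the base models.
preds = np.column_stack([m.predict(X) for m in models.values()])
ensemble = preds.mean(axis=1)
print("first five ensemble scores:", np.round(ensemble[:5], 1))
```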
4.2.1 Comparison of Average Survival Months (Actual vs Measured)

The average survival months for the following experiments is calculated by using the formula (Equation 2):

\[ \text{Average Survival Months} = \frac{\text{Sum of Survival Months}}{\text{Total number of cases}} \]

Equation 2. Average Survival Months

Figure 36 shows the measured (predicted) survival months of each selected modeling technique and their ensemble. The actual average survival months of cases registered in the year 2004 and tagged 'Dead' at the cut-off date, shown in the yellow bar, is 42 months. Both Ensemble and CHAID measure the average survival months as 45 months, which is closest to the actual value. C&RT and Neural Network predict average survival months of 33 and 56, respectively. Overall, Ensemble performs best along with CHAID, while the C&RT and Neural Network predictions deviate from the actual average survival months. In the following sections, graphs of actual vs measured survival months are plotted as trend lines. Each line in the graphs shows the trend of the average predicted survival months of each model (C&RT, CHAID and Neural Network) and their Ensemble, along with the actual survival months (denoted by Survival Months). The bars in each graph display the number of cases falling under the specific category, such as age range, marital status, and lymph node involvement. Next, the performance of the individual modeling techniques is observed.

Figure 36. Average Survival Months (2004): actual (Survival Months) 42, Ensemble 45, CHAID 45, C&RT 33, Neural Network 56

From Figure 36, it is evident that the Neural Network does not demonstrate high accuracy (i.e. 66%) in comparison to the other modeling techniques used. This is likely due to the following: the training dataset (from the years 1988-2003) had values for the T, N, and M variables, whereas for the testing dataset (2004 and onwards) the values of these variables are missing. Neural Networks tend to learn more and extract better knowledge by observing trends in the datasets [88], and the Neural Network's relative predictor importance is not uniform in its training ("Predictor importance is determined by computing the reduction in variance of the target attributable to each predictor, via a sensitivity analysis" [88]). Hence, inconsistent data across the training and testing datasets resulted in lower accuracy. The other modeling techniques perform better even with the missing values of the T, N and M variables because they give equal preference to all the variables (i.e. equal predictor importance to all predictors).

To validate the above hypothesis, a new experiment was performed by eliminating the T, N and M variables from the training dataset for all the modeling techniques. The results from this experiment (Figure 37) confirm the hypothesis: the Neural Network produced higher accuracy in comparison to the other modeling techniques (Table 9).

Figure 37. Average Survival Months (2004 - excluding TNM variables): actual (Survival Months) 42, Ensemble 50, CHAID 53, C&RT 53, Neural Network 44

Accuracy (%): Ensemble 81 | CHAID 74 | C&RT 74 | Neural Network 95

Table 9. Accuracy of Modeling Techniques
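Before turning to the per-category comparisons, the following minimal sketch shows how Equation 2 can be applied within each category (here, age range) for the actual values and for each technique's predictions, which is the form of comparison plotted in the subsections below. The column names and values are hypothetical.

```python
# Equation 2 applied within each group: sum of survival months / number of cases,
# i.e. the per-group mean, for the actual values and every technique. Hypothetical data.
import pandas as pd

results = pd.DataFrame({
    "age_range": ["40-49", "40-49", "50-59", "50-59", "50-59"],
    "actual":    [50, 38, 44, 30, 47],       # actual survival months ('Dead' cases)
    "chaid":     [48, 40, 45, 33, 44],       # measured (predicted) survival months
    "crt":       [39, 30, 36, 25, 38],
    "neural":    [60, 55, 58, 50, 57],
})
results["ensemble"] = results[["chaid", "crt", "neural"]].mean(axis=1)  # equal weights

comparison = results.groupby("age_range")[["actual", "chaid", "crt", "neural", "ensemble"]].mean()
print(comparison.round(1))
```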
4.2.1.1 Average Survival Months by Age Range

Figure 38. Average Survival Months by Age Range (2004)

Figure 38 compares the actual and measured average survival months by age range (10 to 85+) for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The bars in the chart display the number of cases for each age range. Approximately 43% of cases tagged 'Dead' at the cut-off date fall under the 45-64 age range. The actual survival months vary over the different age ranges; cases aged 85 and above have the lowest survival months, i.e. 28 months. The graph shows both Ensemble and CHAID performing closest to the actual survival months, overlapping at a few data points. However, Ensemble performs better than C&RT for ages 70 and onwards. C&RT, on the other hand, performs best for the lowest and highest age range categories. The Neural Network predicts higher survival months than actual across all age ranges.

4.2.1.2 Average Survival Months by Marital Status of Patient

Figure 39. Average Survival Months by Marital Status (2004)

Figure 39 compares the actual and measured average survival months by marital status (married, single, widowed, divorced, separated and unknown) for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The bars in the chart display the number of cases for each marital status. About 45% of cases tagged 'Dead' at the cut-off date were married at diagnosis. The actual survival months vary by marital status; widowed cases have the lowest survival months, i.e. 35 months. The graph shows that both Ensemble and CHAID predict closest to the actual survival months, and the trend lines overlap for divorced cases. C&RT predicts lower survival months compared to the other techniques, while the Neural Network predicts in the range of 45-60 months when the actual survival months range from 35-46 months. Overall, CHAID and Ensemble perform closely.

4.2.1.3 Average Survival Months by Positive to Examined Nodes Ratio

Figure 40. Average Survival Months by Positive to Examined Regional Nodes Ratio (2004)

Figure 40 compares the actual and measured average survival months by the ratio of positive to examined regional nodes for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The bars in the chart display the number of cases by positive to examined regional nodes ratio. Amongst the cases with examined nodes, 33% of cases have no positive regional nodes. The actual survival months are lowest for cases with an unknown number of nodes examined, i.e. 30 months. The graph shows that the Ensemble performs better than the other models for cases having a 70% or lower positive to examined nodes ratio and for cases having no positive nodes at all. CHAID performs better for cases having an 81-90% or unknown positive to examined nodes ratio. The Neural Network predicts survival months in the range of 49-61 months when the actual survival months range from 30-53 months.

4.2.1.4 Average Survival Months by Radiation and Surgery Sequence

Figure 41 shows the case counts and the distribution of survival months by radiation and surgery sequence for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The bars show the number of cases which had radiation and surgery performed, categorized as intraoperative radiation therapy, intraoperative radiation with other radiation given before or after, sequence unknown yet both surgery and radiation given, radiation both before and after surgery, radiation before surgery, radiation after surgery, and no radiation and/or surgery.
The majority (67%) of cases tagged 'Dead' did not have radiation and/or surgery performed, followed by 32% of cases who received radiation after surgery. The actual survival months are lowest for cases who died without radiation or surgery, i.e. 37 months; CHAID and Neural Network predict 42 and 43 months for such cases, respectively. Cases which had radiation after surgery have 51 survival months, the highest amongst all categories; Ensemble and CHAID predict 50 and 52 survival months, respectively. The C&RT and Neural Network predicted survival months are off the actual range, ranging between 32-37 and 54-60 months, respectively.

Figure 41. Average Survival Months by Radiation and Surgery Sequence (2004)

4.2.1.5 Average Survival Months by Estrogen and Progesterone Receptor Status

In Figure 42, the actual and measured average survival months are plotted by ER status for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The status is recorded as positive, negative, borderline or unknown. 52% of cases tagged 'Dead' at the cut-off date have a positive ER status and 30% of cases have a negative ER status. The actual survival months recorded vary from 32-50 months. Cases with unknown and negative ER status have the lowest survival months, i.e. 31 and 33 months, respectively. Cases with a positive ER status have the highest survival months, i.e. 50 months; CHAID (47 months) performs better than Ensemble (46 months) and the other models for such cases. For cases with a negative ER status, C&RT predicts 35 months compared to 33 actual survival months. The Neural Network predicts survival months ranging from 49-56 months over all available ER statuses.

Figure 42. Average Survival Months by Estrogen Receptor Status (2004)

Next, Figure 43 compares the actual and measured average survival months by PR status, which is recorded as positive, negative, borderline or unknown. 42% of 'Dead' cases have a negative PR status, and 38% of cases have a positive PR status. The actual survival months recorded vary from 33-52 months. Cases with unknown and negative PR status have the lowest survival months, i.e. 32 and 37 months, respectively. The graph shows that for the largest group of cases (negative PR status), C&RT performs best with 34 survival months compared to 37 actual survival months. For the cases with the highest recorded survival months (52 months), CHAID and Neural Network come closest with 47 and 57 survival months, respectively.

Figure 43. Average Survival Months by Progesterone Receptor Status (2004)

4.2.1.6 Average Survivability by Behavior Code ICD-O-3

Figure 44. Average Survival Months by Behavior Code ICD-O-3 (2004)

In Figure 44, the actual and measured average survival months are compared by behavior type for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The cases are categorized as malignant (invasive) and carcinoma in situ (non-invasive). 97% of cases tagged 'Dead' at the cut-off date have a malignant tumor, i.e. invasive primary breast cancer. The Ensemble and CHAID models perform best with 45 and 46 average predicted survival months, respectively, compared to 41 actual average survival months. The Neural Network model performs best for cases which have the carcinoma in situ behavior type.
4.2.2 Comparison of Accuracy of Modeling Techniques

The accuracy of the experiments is calculated by using the formula (Equation 3):

\[ \text{Accuracy} = \left(1 - \left|\frac{\text{Actual Survival Months} - \text{Measured Survival Months}}{\text{Actual Survival Months}}\right|\right) \times 100 \]

Equation 3. Accuracy

Figure 45. Accuracy of different modeling techniques (2004): Ensemble 93%, CHAID 92%, C&RT 80%, Neural Network 66%

Figure 45 compares the accuracies of the CHAID, C&RT, Neural Network and Ensemble modeling techniques. Ensemble has the highest accuracy when compared at the aggregated year level, i.e. 2004. This is followed by CHAID and C&RT with 92% and 80%, respectively. The Neural Network has the lowest accuracy, i.e. 66%. In the following sections, the accuracy of the predictive model is plotted as trend lines, where each line shows the accuracy of an individual modeling technique (C&RT, CHAID and Neural Network) or their Ensemble. The bars in each graph display the number of cases falling under the specific category, such as age range, marital status, etc.

4.2.2.1 Accuracy by Age Range

Figure 46. Accuracy by Age Range (2004)

Figure 46 shows a comparison of the accuracy of the predictive models with respect to age for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The age ranges from 10 to 85+. The bars in the chart display the number of cases for each age range. 15% of the total cases tagged 'Dead' fall in the 80-84 and 85+ age ranges. The graph shows interesting trends: the CHAID and Ensemble models have the highest accuracy overall, but it drops for ages 75 and onwards. The C&RT model, on the other hand, has the highest accuracy of 95%, 91% and 100% for the 10-24, 80-84 and 85+ age ranges, respectively. The Neural Network's accuracy ranges from 29-79%, which is the lowest compared to the other models. Ensemble outperforms the CHAID model for some age ranges.

4.2.2.2 Accuracy by Marital Status

Figure 47. Accuracy by Marital Status (2004)

Figure 47 shows a comparison of the accuracy of the predictive models by marital status (married, single, widowed, divorced, separated and unknown) for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The bars in the chart display the number of cases for each marital status. The graph shows that 45% of the cases who died were married, 22% were widowed, and 11% were divorced. Ensemble and CHAID have the highest accuracy for married cases. CHAID has the highest prediction accuracy for divorced cases, i.e. 93%, and C&RT has the highest prediction accuracy for widowed cases. The Neural Network tends to have low prediction accuracy, ranging from 47-82%.

4.2.2.3 Accuracy by Positive to Examined Nodes Ratio

Figure 48. Accuracy by Positive to Examined Ratio (2004)

Figure 48 shows a comparison of the accuracy of the predictive models by the ratio of positive to examined regional nodes for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The bars in the chart display the number of cases by positive to examined regional nodes ratio. Amongst the cases with examined nodes, 67% of cases have at least one positive regional node (ratios ranging from 1-100%). The Ensemble has the highest accuracy for cases having a 70% or lower positive to examined nodes ratio and for cases having no positive nodes. CHAID and C&RT have the highest accuracy for cases having an 81-90% or unknown positive to examined nodes ratio. The Neural Network's prediction accuracy ranges from 34-84%.
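The following small sketch shows one plausible way of computing the accuracies reported in this section: Equation 3 is applied per case and the per-case accuracies are then averaged per technique. Whether the thesis averages per-case accuracies or applies the formula to aggregate averages is not stated explicitly, so treat this as an illustrative reading; the values are hypothetical.

```python
# Equation 3 applied per case, then averaged per technique. Hypothetical data.
import pandas as pd

def accuracy(actual: float, measured: float) -> float:
    """Equation 3: (1 - |actual - measured| / actual) * 100."""
    return (1 - abs(actual - measured) / actual) * 100

cases = pd.DataFrame({
    "actual": [42, 30, 55, 24],   # actual survival months of 'Dead' cases
    "chaid":  [45, 28, 50, 27],   # measured survival months per technique
    "neural": [56, 45, 60, 40],
})
for technique in ["chaid", "neural"]:
    per_case = [accuracy(a, m) for a, m in zip(cases["actual"], cases[technique])]
    print(f"{technique}: mean accuracy {sum(per_case) / len(per_case):.1f}%")
```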
4.2.2.4 Accuracy by Radiation and Surgery Sequence

Figure 49. Accuracy by Radiation and Surgery Sequence (2004)

Figure 49 shows the case counts and the accuracy of the predictive models by the radiation and surgery sequence of cases registered in 2004 which are tagged 'Dead' at the cut-off date. The bars show the number of cases which had radiation and surgery performed, categorized as intraoperative radiation therapy, intraoperative radiation with other radiation given before or after, sequence unknown yet both surgery and radiation given, radiation both before and after surgery, radiation before surgery, radiation after surgery, and no radiation and/or surgery. The graph shows that 66% of cases tagged 'Dead' did not have radiation and/or surgery performed, followed by 32% of cases who received radiation after surgery. Ensemble and CHAID have the highest accuracies, ranging from 81-98% and 83-99%, respectively. The Neural Network has the lowest accuracy overall, except for cases categorized as radiation after surgery; for those cases, C&RT has the lowest accuracy.

4.2.2.5 Accuracy by Estrogen and Progesterone Receptor Status

Figure 50. Accuracy by Estrogen Receptor Status (2004)

Figure 50 compares the accuracy of each modeling technique by ER status for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The status is recorded as positive, negative, borderline or unknown. 52% of cases tagged 'Dead' at the cut-off date have a positive ER status and 30% of cases have a negative ER status. The largest number of cases have a positive ER status; CHAID has the highest accuracy for such cases, i.e. 93%, followed by Ensemble with 91% accuracy, while C&RT has the lowest accuracy for such cases. However, C&RT has the highest accuracy for cases having a negative ER status. The Neural Network overall has low accuracy compared to the other modeling techniques.

Figure 51. Accuracy by Progesterone Receptor Status (2004)

Figure 51 compares the accuracy of each modeling technique by PR status for cases registered in 2004. The PR status is recorded as positive, negative, borderline or unknown. 42% of 'Dead' cases have a negative PR status, and 38% of cases have a positive PR status. The graph shows interesting trends: C&RT has the highest accuracy for cases with negative and unknown PR status, whereas CHAID has the highest accuracy for cases with a positive PR status. The Neural Network and Ensemble have the second highest accuracy, i.e. 89%, for cases with a positive PR status, and the Ensemble has the highest accuracy for cases with a borderline PR status.

4.2.2.6 Accuracy by Behavior Code ICD-O-3

Figure 52. Accuracy by Behavior ICD-O-3 (2004)

In Figure 52, the accuracy of the modeling techniques is compared by behavior type for cases registered in 2004 which are tagged 'Dead' at the cut-off date. The cases are categorized as malignant (invasive) and carcinoma in situ (non-invasive). For malignant cases, the Ensemble has the highest accuracy of 91%, followed by CHAID and C&RT, respectively, and the Neural Network has the lowest accuracy. On the other hand, for cases having the carcinoma in situ behavior type, the Neural Network has the highest accuracy (97%), followed by Ensemble, CHAID and C&RT, respectively.

4.2.3 Impact of Retraining Breast Cancer Predictive Model

Predictive model deployment is an iterative process. The data with which the prediction model is trained should have the same data distribution as the data on which it is tested [96].
With new technological improvements and early detection and screening techniques, the breast cancer data distribution has changed over time [97, 98]. To get more accurate results, retraining the existing model with new data is a good practice. Retraining the predictive model is possible in IBM SPSS Modeler: the existing model is trained with new data, reproducing/refreshing the modeling nuggets. The impact of retraining the model is presented in the next subsections.

4.2.3.1 Case Study of Cases Diagnosed in 2008

In the year 2008, a total of 67,017 breast cancer cases were registered in the SEER database. The predictive model is tested on cases registered in 2008 (which is outside the training range), and the vital status of the actual outcomes and the predictive model's outcomes (i.e. measured) are compared.

Figure 53. Vital Status comparison (2008)

As shown in Figure 53, out of the total cases diagnosed in 2008, 89.85% of cases are tagged 'Alive' and 10.15% are tagged 'Dead' at the cut-off date. The predictive model also predicts 89.85% of cases as 'Alive' and 10.15% as 'Dead' as of the cut-off date (December 31, 2013). This shows that the predictive model predicts the exact number of mortalities for the year 2008. To check the predictive model's accuracy in predicting survival months, the measured survival months are compared with the actual survival months; the results are presented by average and accuracy. Further, the existing model is retrained with more recent data, i.e. cases diagnosed from 2004 to 2007. The same testing dataset (cases diagnosed in 2008) is used to test the measured average and accuracy of the retrained model, and the results from both experiments are compared by plotting actual vs measured survival months. As long as the model is trained with the most recent data possible, it produces a more accurate outcome.

4.2.3.1.1 Comparing Outcomes of Predictive Model with Retrained Model by Average Survival Months

Figure 54 compares the actual and measured average survival months. A total of 6,774 cases are tagged 'Dead' at the cut-off date, and the average survival months of these 'Dead' cases is 28.37 months. CHAID predicts 32 survival months, followed by C&RT (39 months), Ensemble (42 months) and Neural Network (53 months). This graph shows that the Neural Network, C&RT and Ensemble are not performing very well.

Figure 54. Average Survival Months Actual vs Measured (2008): actual 28.37, Ensemble 42, CHAID 32, C&RT 39, Neural Network 53

Figure 55. Average Survival Months Actual vs Measured (2008, retrained model): actual 28.37, Ensemble 28.51, CHAID 28.11, C&RT 29.33, Neural Network 28.09

On the other hand, when the predictive model is retrained with 2004-2007 data and tested on the same dataset (2008), the results change drastically. The measured survival months of the retrained model range from 28.09 to 29.33 months when the actual average survival months is 28.37 (Figure 55). This shows the impact of retraining the model, with the prediction outcomes improving exceptionally.
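The retraining step itself is performed by refreshing the SPSS Modeler nuggets; the sketch below is an illustrative analogue in which the same model specification is refit on a synthetic "recent" cohort whose outcome distribution has drifted, and both the original and the retrained models are then scored on a held-out cohort. The data generator, features and model are assumptions made for the example only.

```python
# Illustrative retraining analogue: refit the same specification on a more recent
# window and re-score the held-out test cohort. Synthetic stand-in data only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def make_cohort(n, shift):
    """Synthetic cohort whose outcome distribution drifts with `shift`."""
    X = rng.normal(size=(n, 4))
    y = 40 + shift + 8 * X[:, 0] + rng.normal(scale=4, size=n)
    return X, y

X_old, y_old = make_cohort(2000, shift=0)      # stand-in for the 1988-2003 training data
X_new, y_new = make_cohort(1000, shift=-12)    # stand-in for 2004-2007 (shifted distribution)
X_test, y_test = make_cohort(500, shift=-12)   # stand-in for the 2008 test cases

old_model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_old, y_old)
new_model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_new, y_new)

print("actual test-cohort mean :", round(float(y_test.mean()), 1))
print("trained on old window   :", round(float(old_model.predict(X_test).mean()), 1))
print("retrained on new window :", round(float(new_model.predict(X_test).mean()), 1))
```

Under this toy drift, the retrained model's average prediction tracks the test cohort far more closely than the model fitted on the older window, mirroring the pattern reported above.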
4.2.3.1.2 Comparing Outcomes of Predictive Model with Retrained Model by Accuracy

Figure 56 compares the accuracies of the predictive model when trained with 1988-2003 data. Only cases tagged 'Dead' at the cut-off date are taken into consideration. The graph shows that CHAID has the highest accuracy, i.e. 86%, while the accuracies of the C&RT and Ensemble techniques are comparatively lower, i.e. 61% and 53%, respectively. The Neural Network has the lowest accuracy, i.e. 14%. When the model is retrained with the 2004-2007 historical data, the accuracy improves drastically.

Figure 56. Accuracy of Predictive Model (2008): Ensemble 53%, CHAID 86%, C&RT 61%, Neural Network 14%

Figure 57. Accuracy of Retrained Predictive Model (2008): Ensemble 100%, CHAID 97%, C&RT 99%, Neural Network 99%

Figure 57 shows that the Ensemble has 100% accuracy, and C&RT and the Neural Network have 99% accuracy, followed by CHAID with 97%. The summary of the impact of retraining the model is presented in Table 10. The 'Trained' columns represent results generated by the predictive model trained on 1988-2003 data, and the 'Retrained' columns represent the results generated by the retrained model (2004-2007). The actual average survival months is 28.37.

                  Average Survival Months      Accuracy (%)
                  Trained      Retrained       Trained    Retrained
  Ensemble        42           28.51           53         100
  CHAID           32           28.11           86         97
  C&RT            39           29.33           61         99
  Neural Network  53           28.09           14         99

Table 10. Impact of Retraining

Apart from the 2004 and 2008 test datasets shown in the above subsections, similar experiments were conducted for other datasets outside the training range, and similar results were achieved.

4.2.4 Summary of Predictive Modeling

The developed predictive model is tested with different datasets outside the training data. The predicted average survival months and the accuracy of each modeling technique are observed for cases diagnosed in 2004 and 2008. The test results are validated by comparison against the actual survival months of cases tagged 'Dead' on the cut-off date. The following key observations are made:

• The Ensemble of the selected modeling techniques performs better than the other models, with 93% accuracy.

• CHAID, with 92% accuracy, ranks second following the Ensemble technique.

• The C&RT and Neural Network modeling techniques have 80% and 66% accuracy, respectively.

• Results are consistent across various attributes (age, marital status, regional nodes).

• Ensemble and CHAID perform best amongst all the modeling techniques used in this study.

• The Neural Network performs poorly, and consistently so, across all the experiments; some exceptions were observed when the number of cases was low.

• Testing the predictive model with the 2008 dataset gives low accuracies for all the models.

• On retraining the predictive model with more recent data, the accuracy of the predictive model improved significantly, with the Ensemble performing best at 100% accuracy.

• The accuracy of the Neural Network improves upon retraining. The inconsistency between the training and testing data, i.e. the missing values of the T, N and M variables, no longer exists: the 2004-2007 retraining dataset (over 200,000 cases) has these variables missing just as the test data does, and the retrained Neural Network thus predicts with very high accuracy.

All experimental results presented are for cases tagged 'Dead' at the study cut-off date due to the uncertainty of survival months beyond this date. The noted survival months of 'Alive' cases are not their true survival months, as patients could have died one month, one year or five years beyond the study cut-off date. It is not possible to validate the results of 'Alive' cases and hence they are not included in the results.
However, if both 'Alive' and 'Dead' cases are taken into consideration, the Neural Network also performs well, with higher accuracy across all the experiments. On testing with various other datasets outside the training ranges, similar patterns of results are observed. From the analysis of these various datasets, it is concluded that the developed predictive model is a powerful application when retrained periodically.

4.3 Summary

In this chapter, the experimental results and the analysis of the predictive model's outcomes have been presented. The results are presented by comparing actual and measured survival months and validated by computing the accuracy of each modeling technique. The visualization dashboard snapshots are also presented along with an analysis of each report. In the next chapter, the conclusion of the thesis is presented along with possible future work.

5. Conclusion and Future Work

Breast cancer is the most common cancer in women across the globe. With an increasing number of breast cancer incidences, early detection and treatment is the ideal way to decrease breast cancer mortalities. It is also important to note that not all breast cancer patients die due to breast cancer itself; many deaths happen due to other diseases which are a consequence of breast cancer and metastasized cancers. Breast cancer remains the second most common cause of death in women after heart diseases. Despite technological advancements and early cancer detection techniques, one-size-fits-all remains a common practice for developing a cancer treatment. Data-driven research outcomes are steps towards advancements in cancer treatment; these outcomes include cancer prognosis, survival outcome and side effects of therapies. This research focuses on the survival outcome of breast cancer patients and uses historical data to develop a visualization dashboard and a predictive model. The survival outcome of the patient not only helps physicians in designing a custom treatment plan for each patient, but also helps keep the patient informed and involved in the process of cancer treatment.

In this research, an interactive, end-to-end process to cleanse, integrate, analyze, and visualize an enormous amount of data has been presented. The purpose is to provide healthcare professionals, patients and policymakers with a better understanding of the hidden patterns in the data, which in turn could be used to collectively improve the quality of healthcare. Forty years of breast cancer data (over one million records) extracted from the SEER database was used to demonstrate the analytical power of data visualization. The research approach involved applying several data preprocessing techniques to the raw data, followed by a selection of the relevant variables, before feeding the processed data into SQL Server for analysis. Tableau was used for understanding and interpreting the data along several dimensions. Dynamic reports generated using the drill-down capabilities of the dashboard provide insights at a finer granularity. The dashboard shows incidence and mortality trends and highlights the underlying patterns observed in breast cancer patients, which could be used to support clinical decisions made by physicians in formulating treatment plans. The dashboard is scalable and capable of integrating new data in real-time. Although SEER data from the US currently powers the dashboard, it can be configured to use data from other sources as well.
A better understanding of the incidence and mortality trends could potentially guide data-driven resource allocation. Physicians could also use this information to educate patients and create more awareness about the disease. The pre-processed data is used to develop a predictive model to predict survival months of individual breast cancer patient. For predictions, Neural Network, CHAID, C&RT modeling techniques along with their Ensemble are used, which predict survival months of each patient. The predictive model calculator allows user to enter values of variables such as Marital Status; Race/ethnicity; Age; Laterality; Histologic Type; Behavior Code; Regional nodes positive and examined; Cancer-directed surgery; Radiation; Surgery of Primary Site; ER and PR status; T, N, M value; Year of diagnosis. The predictive model is then run and produces survival months predicted by each modeling technique and the Ensemble modeling technique. The predictive model is developed in SPSS Modeler 18.1 [88] (explained in Chapter 3). 112 5.1 Future Work This research can be extended in several ways as described below:  The predictive model can be deployed on cloud and re-trained and tested online.  The calculator can be developed with a web-interface and hosted on available web services such as Amazon Web Services, IBM Bluemix.  A similar predictive model can be developed for other cancer types and diseases.  The developed predictive model can be trained and tested with different demographics data.  Automating the process of re-training the developed predictive model with the most recent data.  The Visualization dashboard can be extended to other diseases and cancer types. 113 Bibliography [1] WHO, "Cancer," [Online]. Available: http://www.who.int/cancer/en/. [2] American Cancer Society, "Cancer Facts & Figures 2016," [Online]. Available: http://www.cancer.org/acs/groups/content/@research/documents/document/acspc047079.pdf. [3] Cancer Canadian Society, "Cancer statistics at a glance," Cancer Canadian Society, 2015. [Online]. Available: http://www.cancer.ca/en/cancer-information/cancer101/cancer-statistics-at-a-glance/?region=on#. [4] Dursun Delen, "Predicting breast cancer survivability: a comparison of three data mining methods," Artifiical Intelligence in Medicine, vol. 34, no. 2, pp. 113-127, 2005. [5] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi and A. Choudhary, "Lung cancer survival prediction using ensemble data mining on SEER data," Scientific Programming - Biological Knowledge Discovery and Data Mining, vol. 20, no. 1, pp. 29-42, August 2012. [6] American Cancer Society, "About Breast Cancer," [Online]. Available: https://www.cancer.org/cancer/breast-cancer/about/what-is-breast-cancer.html. [7] SAP Voice, "Four-Time Cancer Survivor on the Importance of Data-Driven Health Decisions," BrandVoice - Forbes, 2016. [8] IBM, "IBM Watson Health," IBM, [Online]. https://www.ibm.com/watson/health/oncology-and-genomics/oncology/. [9] SAP, "The SAP Corporate Oncology Program," SAP, [Online]. Available: https://www.sap.com/canada/about/careers/joining/benefits.html. Available: [10] L. Mallory, "IBM's Watson is really good at creating cancer treatment plans," Engadget, 2017. [Online]. Available: https://www.engadget.com/2017/06/01/ibm-watson-cancertreatment-plans/. [11] K.-M. Wang, M. Bunjira, W.-L. Wu and Y. Lin, "Optimal Data Mining Method or Predicting Breast Cancer Survivability," International Journal of Innovative Management, Information and Production (ISME International), vol. 3, no. 2, pp. 
2833, 2013. [12] N. A. Christakis, J. L. Smith, C. M. Parkes and E. B. Lamont, "Extent and determinants of error in doctors' prognoses in terminally ill patients: prospective cohort studyCommentary: Why do doctors overestimate? Commentary: Prognoses should be based on proved indices not intuition," British Medical Journal, vol. 320, no. 7233, pp. 469-472, 2000. [13] ASCO Cancer.Net, "Understanding Cancer Research Study design and How to Evaluate Results," Cancer.Net, [Online]. Available: https://www.cancer.net/researchand-advocacy/introduction-cancer-research/understanding-cancer-research-studydesign-and-how-evaluate-results. 114 [14] Susan G. Komen, "Komen Perspectives - The Importance of Clinical trials in Breast Cancer Treatment," [Online]. Available: https://ww5.komen.org/KomenPerspectives/Komen-Perspectives---The-Importanceof-Clinical-Trials-in-Breast-Cancer-Treatment-(July-2012).html. [15] American Society of Clinical Oncology, "Outcomes of cancer treatment for technology assessment and cancer treatment guidelines," Journal of Clinical Oncology, vol. 14, no. 2, pp. 671-679, 1996. [16] T. Smith and B. Smith, "Survival Analysis And The Application Of Cox's Proportional Hazards modeling using SAS," in Proceedings of the twenty-sixth annual SAS user's group international conference, 2001, pp. 224-246. [17] D. G. Kleinbaum and M. Klein, "Survival Analysis," in Survival Analysis, Springer, 2012, pp. 1-54. [18] G. P. Meren, "BOSOM Calculator: A Breast Cancer Outcome- Survival Online Measurement Calculator using Data Mining and Predictive Modeling on SEER data," University of the Philippines, 2014. [19] P. M. Ravdin, G. M. Clark, S. G. Hilsenbeck, M. A. Owens, P. Vendely, M. R. Pandian and W. L. McGuire, "A demonstration that breast cancer recurrence can be predicted by Neural Network analysis," Breast Cancer Research and Treatment, vol. 21, no. 1, pp. 44-53, 1992. [20] N. Lavrac, "Selected techniques for data mining in medicine," Artificial Intelligence in Medicine, vol. 16, no. 1, pp. 3-23, 1999. [21] K. J. Cios and G. W. Moore, "Uniqueness of medical data mining," Artificial Intelligence in Medicine, vol. 26, no. 1-2, pp. 1-24, 2002. [22] U. Fayyad, G. Piatestsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," AI Magazine, vol. 17, no. 3, p. 37, 1996. [23] A. G. Eapen, "Application of Data mining in Medical Applications," University of Waterloo, 2004. [24] WIDESKILLS, "Introduction to Data Mining Tasks," [Online]. Available: http://www.wideskills.com/data-mining-tutorial/05-data-mining-tasks. [25] S. Gupta, D. Kumar and A. Sharma, "Data Mining Classification Techniques Applied for Breast Cancer Diagnosis and Prognosis," Indian Journal of Computer Science and Engineering (IJCSE), vol. 2, no. 2, pp. 188-195, 2011. [26] J. G. Stadler, D. Kipp, J. D. Siewert, F. Tessa and N. E. Lewis, "Improving the Efficiency and Ease of Healthcare Analysis Through Use of Data Visualization Dashboards," Big Data, vol. 4, no. 2, pp. 129-135, 2016. [27] Einforchips, "White Paper - Revolutionizing The Healthcare Industry with Big Data, Analytics and Visualization". [28] Institute of Health Metrics and Evaluation (IHME), "GBD Compare Data Visualization," IHME, University of Washington, 2017. [Online]. Available: https://vizhub.healthdata.org/gbd-compare/. 115 [29] Centres for Disease Control and Prevention, "United States Cancer Statistics: Data Visualizations," Centres for Disease Control and Prevention, 2017. [Online]. Available: https://nccd.cdc.gov/USCSDataViz/rdPage.aspx. 
[30] WHO, "Global Cancer Observatory," World Health Organization, [Online]. Available: https://gco.iarc.fr/. [31] SEER, "Data & Software for Researchers," National Cancer Institute, [Online]. Available: seer.cancer.gov/resources/. [32] G. Zorluoglu and M. Agaoglu, "Diagnosis of Breast Cancer Using Ensemble of Data Mining Classification Methods," International Journal of Bioinformatics and Biomedical Engineering, vol. 1, no. 3, pp. 318-322, 2015. [33] P. M. Ravdin, L. A. Siminoff, G. J. Davis, M. B. Mercer, J. Hewlett, N. Gerson and H. L. Parker, "Computer Program to Assist in Making Decisions about Adjuvant Therapy for Women With Early Breast Cancer," Journal of Clinical Oncology, vol. 19, no. 4, pp. 980-991, 2001. [34] G. C. Wishart, E. M. Azzato, D. C. Greenberg, J. Rashbass, O. Kearins, G. Lawrence, C. Caldas and P. D. Pharoah, "PREDICT: a new UK prognostic model that predicts survival following surgery for invasive breast cancer," Breast Cancer Research, vol. 12, no. 1, p. 401, 2010. [35] Statistics Canada, "The 10 leading causes of death, 2011," [Online]. Available: http://www.statcan.gc.ca/pub/82-625-x/2014001/article/11896-eng.htm. [36] National Cancer Institute, "Diagnosis and Staging," www.cancer.gov/about-cancer/diagnosis-staging/prognosis. [Online]. Available: [37] L. A. Winters-Miner, "Seven ways predictive analytics can improve healthcare," Elsevier’s Daily stories for the science, Technology and health communities, 2014. [38] Breast Cancer Organisation, "Why So Many Types of Breast Cancer Treatment?," [Online]. Available: http://www.breastcancer.org/treatment/planning/types_treatment. [39] Health Catalyst, "Predictive Analytics Solutions," https://www.healthcatalyst.com/predictive-analytics. [Online]. Available: [40] U.S. Cancer Statistics Working Group., "U.S. Cancer Statistics Data Visualizations Tool," .S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute, June 2018. [Online]. Available: www.cdc.gov/cancer/dataviz. [41] Genomic Data Commons, "GDC DAVE Tools," National Cancer Institute, [Online]. Available: https://portal.gdc.cancer.gov/. [42] IHME, "About IHME," IHME, [Online]. Available: http://www.healthdata.org/about. [43] Institute of Health Metrics and Evaluation (IHME), "Data Visualizations," Institute of Health Metrics and Evaluation, 2017. [Online]. Available: http://www.healthdata.org/results/data-visualizations. [44] CDC, "Centers for Disease Control and Prevention," Centers for Disease Control and Prevention, 2017. [Online]. Available: https://www.cdc.gov/. 116 [45] National Cancer Institute (NCI), "About NCI," National Cancer Institute, [Online]. Available: https://www.cancer.gov/about-nci. [46] WHO, "World Health Organization Home," World Health Organization , [Online]. Available: http://www.who.int/en/. [47] WHO, "Section of Cancer Surveillance," World Health Organization, [Online]. Available: http://www.iarc.fr/en/research-groups/sec1/index.php. [48] GLOBOCAN, "Estimated Cancer Incidence, Mortality and Prevalence Worldwide in 2012," World Health Organization, 2012. [Online]. Available: http://globocan.iarc.fr/Pages/fact_sheets_population.aspx. [49] WHO, "CI5 Cancer Incidence in Five Continents," World Health Organization, [Online]. Available: http://ci5.iarc.fr/. [50] WHO, "International Incidence of Childhood Cancer 3," World Health Organization, [Online]. Available: http://iicc.iarc.fr/. 
[51] WHO, "Cancer survival in Africa, Asia, the Caribbean and Central America," World Health Organization, [Online]. Available: http://survcan.iarc.fr/. [52] Genomic Data Commons, "About the GDC," National Cancer Institute, [Online]. Available: https://gdc.cancer.gov/about-gdc. [53] National Cancer Institute (NCI), "About TCGA," National Human Genome Research Institute, [Online]. Available: https://cancergenome.nih.gov/abouttcga. [54] National Cancer Institute (NCI), "TARGET: Therapeutically Applicable Research To Generate Effective Treatments," National Cancer Institute Office of Cancer Genomics, [Online]. Available: https://ocg.cancer.gov/programs/target. [55] D. Delen, G. Walker and A. Kadam, "Predicting breast cancer survivability: a comparison of three data mining methods," Artificial Intelligence in Medicine, vol. 34, no. 2, pp. 113-127, 2005. [56] M. Riihimäki, H. Thomsen, A. Brandt, J. Sundquist and K. Hemminki, "Death causes in breast cancer patients," Annals of Oncology, vol. 23, no. 3, pp. 604-610, 2012. [57] A. Bellaachia and E. Guven, "Predicting Breast Cancer Survivability Using Data Mining Techniques," Age, vol. 58, no. 13, pp. 10-110, 2006. [58] WEKA, "Home," The University https://weka.wikispaces.com/. of Waikato, [Online]. Available: [59] A. Endo, S. Takeo and H. Tanaka, "Comparison of Seven Algorithms to Predict Breast Cancer Survival," Biomedical Soft Computing and Human Sciences, vol. 13, no. 2, pp. 11-16, 2008. [60] V. Chaurasia and S. Pal, "Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability," International Journal of Computer Science and Mobile Computing, vol. 3, no. 1, pp. 10-22, 2017. [61] Z. K. Senturk and R. Kara, "Breast Cancer Diagnosis via Data Mining: Performance Analysis of Seven Different Algorithms," Computer Science & Engineering: An International Journal (CSEIJ), vol. 4, no. 1, p. 35, 2014. 117 [62] V. Chaurasia and S. Pal, "A Novel Approach for Breast Cancer Detection Using Data Mining Techniques," International Journal of Innovative Research in Computer and Communication Engineering, vol. 2, no. 1, 2017. [63] RapidMiner, "Data Science Behind Every Decision," [Online]. Available: https://rapidminer.com/. [64] B. V. Sumana and T. Santhanam, "An empirical comparison of ensemble and hybrid classification," Proc Processing and VLSI, pp. 463-470, 2014. [65] M. U. Khan, J. P. Choi, H. Shin and M. Kim, "Predicting Breast Cancer Survivability Using Fuzzy Decision Trees for Personalized Healthcare," in 30th Annual International IEEE EMBS Conference, Vancouver, British Columbia, 2008. [66] J. P. Choi, T. H. Han and R. W. Park, "A Hybrid Bayesian Network Model for Predicting Breast Cancer Prognosis," Journal of Korean Society of Medical Informatics, vol. 15, no. 1, pp. 49-57, 2009. [67] E. Alpaydin, Introduction to Machine Learning Second Edition, MIT Press, 2004. [68] M.-W. Huang, C.-W. Chen, W.-C. Lin, S.-W. Ke and C.-F. Tsai, "SVM and SVM Ensembles in Breast Cancer Prediction," PloS one, vol. 12, no. 1, p. e0161501, 2017. [69] B. F. Hankey, L. A. Ries and B. K. Edwards, "The surveillance, epidemiology, and end results program: A National Resource," Cancer Epidemiology & Prevention Biomarkers, vol. 8, no. 12, p. 1117, 1999. [70] NCI Surveillance, Epidermiology, and End Results Program (SEER), "SEER*Stat Software," National Cancer Institute, [Online]. Available: https://seer.cancer.gov/seerstat/. [71] Microsoft, "SQL Server 2016," Microsoft, [Online]. https://www.microsoft.com/en-ca/sql-server/sql-server-2016. 
Available: [72] Tableau, "Healthcare," Tableau, https://www.tableau.com/stories/topic/healthcare. Available: [Online]. [73] Klipfolio, "Operational dashboards vs Analytical Dashboards," [Online]. Available: http://www.klipfolio.com/resources/articles/operational-analytical-bi-dashboards. [74] X. Zhang, K. Gallagher and S. Goh, "BI Application: Dashboards for Healthcare.," in AMCIS, Detroit, 2011. [75] Tableau, "Tableau Home," [Online]. Available: http://www.tableau.com/. [76] Interworks, "Why Tableau," Interworks, [Online]. https://interworks.co.uk/business-intelligence/why-tableau/. Available: [77] D. Howlett, "Why Tableau is crushing it for 21st century analysis," 2014. [Online]. Available: http://diginomica.com/2014/09/11/tableau-crushing/. [78] Gartner, "Magic Quadrant for Business Intelligence and Analytics Platforms," Gartner, [Online]. Available: https://www.gartner.com/doc/reprints?ct=160204&id=12XXET8P. [79] A. Chandramouly., "For fourth year, Gartner names Tableau a ‘leader’ in Magic Quadrant," February 2016. [Online]. Available: 118 https://www.tableau.com/about/blog/2016/2/fourth-year-gartner-names-tableauleader-magic-quadrant-49719. [80] SEER, "Overview of SEER program," Surveillance Epidemiology and End Results, [Online]. Available: http://www.seer.cancer.gov/about/overview.html. [81] National Cancer Institute, "Surveillance Research Program," National cancer Institute, [Online]. Available: https://surveillance.cancer.gov/. [82] National Cancer Institute, "Divison of Cancer Control and Population Sciences," National Cancer Institute, [Online]. Available: https://cancercontrol.cancer.gov/. [83] NCI's Surveillance, Epidermiology, and End Results Program (SEER), "SEER Registry Grouping for Analyses," National Cancer Institure Surveillance, Epidermiology, and End Results Program, [Online]. Available: http://seer.cancer.gov/registries/terms.html. [84] BC Cancer Agency, "Research at BC Cancer Agency- Centre for the North," [Online]. Available: http://www.bccancer.bc.ca/our-services/centres-clinics/centre-for-thenorth/research. [85] University of Northern British Columbia, "Dr Robert Olson," [Online]. Available: http://www.unbc.ca/robert-olson. [86] C. Nyce, "Predictive Analytics White Paper," American Institute for CPCU. Insurance Institute of America, pp. 9-10, 2007. [87] S. Finlay, "Predictive Analytics," in Predictive Analytics, Data Mining and Big Data: Myths, Misconceptions and Methods, Springer, p. 237. [88] IBM, "IBM SPSS Modeler," [Online]. 01.ibm.com/support/docview.wss?uid=swg27050406. Available: http://www- [89] R. Mikut and M. Reischl, "Data mining tools," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 5, pp. 431-443, 2011. [90] IBM, "IBM Cognos Analytics," IBM, [Online]. Available: http://www.ibm.com/analytics/us/en/technology/cognos-software/#what-is-cognos. [91] Tableau, "Stories," Tableau, [Online]. https://onlinehelp.tableau.com/current/pro/desktop/en-us/stories.html. Available: [92] J. Brandt, J. P. Garne, I. Tengrup and J. Manjer, "Age at diagnosis in relation to survival following breast cancer: a cohort study," World Journal of Surgical Oncology, vol. 13, no. 1, p. 33, 2015. [93] SEER, "SEER Registeries," Surveillance, Epidemiology, and End Results (SEER) Program, [Online]. Available: https://seer.cancer.gov/registries/. [94] BreastCancer.Org, "Lymph Node Involvement," BreastCancer.Org, 2017. [Online]. Available: http://www.breastcancer.org/treatment/surgery/lymph_node_removal/axillary_dissect ion. 
[95] Interworks, "Tableau Essesntials: Chart Types - Box-and-Whisker Plot," [Online]. Available: https://interworks.com/blog/ccapitula/2014/12/09/tableau-essentials-charttypes-box-and-whisker-plot/. 119 [96] Amazon Machine Learning, "Retraining Models on New Data," AWS, [Online]. Available: https://docs.aws.amazon.com/machine-learning/latest/dg/retrainingmodels-on-new-data.html. [97] M. L. Lousdal, I. S. Kristiansen, B. Moller and H. Stovring, "Trends in breast cancer stage distribution before, during and after introduction of a screening programme in Norway," The European Journal of Public Health, vol. 24, no. 6, pp. 1017-1022, 2014. [98] A. Aljarrah and W. Miller, "Trends in the distribution of breast cancer over time in the southeast of Scotland and review of the literature," Ecancermedicalscience, vol. 8, p. 427, 2014. [99] "11 Reasons to start a database instead of Excel and Word," The IT Training Surgery, [Online]. Available: https://theittrainingsurgery.com/11-reasons-start-using-databaseinstead-excel-word/. [100] "SAS Enterprise Miner," SAS, [Online]. https://www.sas.com/en_ca/software/enterprise-miner.html. Available: [101] SAP, "Power Users' Guide," SAP Documentation, [Online]. Available: https://help.sap.com/saphelp_nw70ehp2/helpdata/en/1e/013f420e09b26be10000000a 155106/frameset.htm. 120 Appendix A Forms 121 Appendix B The list of all the variables selected with their definition9and coding are listed below: 1) CS mets at DX-bone (2010+) “Identifies the presence of distant metastatic involvement of bone at time of diagnosis. The presence of metastatic bone disease at diagnosis is an independent prognostic indicator, and it is used by Collaborative Staging to derive TNM-M codes and SEER Summary Stage codes for some sites. This field should be coded for all solid tumors, Kaposi sarcoma, Unknown Primary Site, and Other and Ill-Defined Sites. Only available for 2010+ diagnosis. This includes only the bone, not the marrow.” Codes 0: No 1: Yes 8: N/A 9: Unknown 14: Blank(s) 2) CS mets at DX-lung (2010+) “Identifies the presence of distant metastatic involvement of the lung at the time of diagnosis. The presence of metastatic lung disease at diagnosis is an independent prognostic indicator, and it is used by collaborative Staging to derive TNM-M codes and SEER Summary Stage codes for some sites. Only available for 2010+ diagnosis. Note: This includes only the lung, not pleura or pleural fluid.” Codes 0: No 1: Yes 8: N/A 9: Unknown 14: Blank(s) 9 All the definitions are from SEER’s documentation for text files [49]. 122 3) CS mets at DX-liver (2010+) “Identifies the presence of distant metastatic involvement of the liver at time of diagnosis. The presence of metastatic liver disease at diagnosis is an independent prognostic indicator, and it is used by Collaborative Staging derive TNM-M codes and SEER Summary Stage codes for some sites. Only available for 2010+ diagnosis.” Codes 0: No 1: Yes 8: N/A 9: Unknown 14: Blank(s) 4) CS mets at DX-brain (2010+) “Identifies the presence of distant metastatic involvement of the brain at the time of diagnosis. The presence of metastatic brain disease at diagnosis is an independent prognostic indicator, and it is used by Collaborative Staging to derive TNM-M codes and SEER Summary Stage codes for some sites. This field should be coded for all solid tumors, Kaposi sarcoma, Unknown Primary Site, and Other and Ill-Defined Primary Sites. 
Only available for 2010+ diagnosis."
Codes
0: No
1: Yes
8: N/A
9: Unknown
14: Blank(s)

5) Breast Subtype (2010+)
"Created with combined information from ER Status Recode Breast Cancer (1990+), PR Status Recode Breast Cancer (1990+), and Derived HER2 Recode (2010+)."
Codes
1: Her2+/HR+
2: Her2+/HR-
3: Her2-/HR+
4: Triple Negative
5: Unknown
9: Not 2010+ Breast

6) Vital status recode (study cutoff used)
"Any patient that dies after the follow-up cut-off date is recoded to alive as of the cut-off date."
Codes
1: Alive
4: Dead

7) Age recode with < 1 year old
"The age recode variable is based on Age at Diagnosis (single-year ages). The groupings used in the age recode variable are determined by the age groupings in the population data. This recode has 19 age groups in the age recode variable (< 1 year, 1-4 years, 5-9 years, ..., 85+ years)."
Codes
00: Age 00
01: Age 01-04
02: Age 05-09
03: Age 10-14
04: Age 15-19
05: Age 20-24
06: Age 25-29
07: Age 30-34
08: Age 35-39
09: Age 40-44
10: Age 45-49
11: Age 50-54
12: Age 55-59
13: Age 60-64
14: Age 65-69
15: Age 70-74
16: Age 75-79
17: Age 80-84
18: Age 85+
99: Unknown Age

8) Radiation
"The method of radiation therapy performed as part of the first course of treatment."
Codes
0: None; diagnosed at autopsy
1: Beam radiation
2: Radioactive implants
3: Radioisotopes
4: Combination of 1 with 2 or 3
5: Radiation, NOS – method or source not specified
6: Other radiation (1973-1987 cases only)
7: Patient or patient's guardian refused radiation therapy
8: Radiation recommended, unknown if administered
9: Unknown if radiation administered

9) Radiation sequence with surgery
"The order in which surgery and radiation therapies were administered for those patients who had both surgery and radiation."
Codes
0: No radiation and/or surgery as defined above
2: Radiation before surgery
3: Radiation after surgery
4: Radiation both before and after surgery
5: Intraoperative radiation therapy
6: Intraoperative radiation therapy with other radiation given before or after surgery
9: Sequence unknown, but both surgery and radiation were given

10) N value-based on AJCC 3rd (1998-2003)
"Derived by algorithm from extent of disease (EOD). N value denotes the degree of nearby lymph nodes involved."
Codes
00: N0
10: N1
11: N1a
12: N1b
19: N1x
20: N2
21: N2a
22: N2b
23: N2c
30: N3
70: NXr
80: Nxu
90: NX

11) M value-based on AJCC 3rd (1998-2003)
"Derived by algorithm from extent of disease (EOD). M value denotes the presence of distant metastasis."
Codes
00: M0
10: M1
11: M1a
12: M1b
99: MX

12) T value-based on AJCC 3rd (1998-2003)
"Derived by algorithm from extent of disease (EOD).
T value denotes the size of the original (primary) tumor."
Codes
00: Tis
01: Ta
10: T1
11: T1a
12: T1b
13: T1c
16: T1a1
17: T1a2
19: T1x
20: T2
21: T2a
22: T2b
23: T2c
29: T2x
30: T3
31: T3a
32: T3b
33: T3c
39: T3x
40: T4
41: T4a
42: T4b
43: T4c
44: T4d
49: T4x
70: T0
71: T0a
72: T0b
81: Txa
82: TXb
83: TXc
84: TXd
99: TX

13) Regional Nodes Positive
"Records the exact number of regional lymph nodes examined by the pathologist that were found to contain metastases."
Codes
0: All nodes examined are negative
01-89: Exact number of nodes positive
90: 90 or more nodes are positive
95: Positive aspiration of lymph node(s) was performed
97: Positive nodes are documented, but number is unspecified
98: No nodes were examined
99: Unknown whether nodes are positive; not applicable; not stated in patient record

14) Regional Nodes Examined
"Records the total number of regional lymph nodes that were removed and examined by the pathologist."
Codes
0: No nodes were examined
01-89: Exact number of nodes examined
90: 90 or more nodes were examined
95: No regional nodes were removed, but aspiration of regional nodes was performed
96: Regional lymph node removal was documented as a sampling, and the number of nodes is unknown/not stated
97: Regional lymph node removal was documented as a dissection, and the number of nodes is unknown/not stated
98: Regional lymph nodes were surgically removed, but the number of lymph nodes is unknown/not stated and not documented as a sampling or dissection; nodes were examined, but the number is unknown
99: Unknown whether nodes are positive; not applicable; not stated in patient record

15) CS mets at dx (2004+)
"Information on distant metastasis. Available for 2004+. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS."
Codes
0: No distant metastasis
5: No clinical or radiographic evidence of distant metastasis, but deposits of molecularly or microscopically detected tumor cells in circulating blood, bone marrow or other nonregional nodal tissue that are 0.2 millimeters (mm) or less in a patient without symptoms or signs of metastasis
7: Stated as M0(i+) with no other information on distant metastasis
10: Distant lymph node(s): Cervical, NOS; Contralateral/bilateral axillary and/or internal mammary; Other than above; Distant lymph node(s), NOS
40: Distant metastasis except distant lymph node(s) (code 10). Carcinomatosis
42: Further contiguous extension: Skin over: Axilla, Contralateral (opposite) breast, Sternum, Upper abdomen
44: Metastasis: Adrenal (suprarenal) gland; Bone, other than adjacent rib; Contralateral (opposite) breast - if stated as metastatic; Lung; Ovary; Satellite nodule(s) in skin other than primary breast
50: (40-44) + 10
60: Distant metastasis, NOS. Stated as M1 with no other information on distant metastasis
99: Unknown; distant metastasis not stated. Distant metastasis cannot be assessed. Not documented in patient record
126: Unknown

16) CS tumor Size
"Information on tumor size. Available for 2004+.
Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS."
Codes
0: Indicates no mass or no tumor found; for example, when a tumor of a stated primary site is not found, but the tumor has metastasized
1-989: 1-989 millimeters
990: Microscopic focus or foci only; no size of focus is given
991: Described as less than 1 cm
992: Described as less than 2 cm
993: Described as less than 3 cm
994: Described as less than 4 cm
995: Described as less than 5 cm
996: Site-specific codes where needed
997: Site-specific codes where needed
998: Site-specific codes where needed
999: Unknown; size not stated; not stated in patient record

17) CS Extension
"Information on extension of the tumor. Available for 2004+. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS."
Codes
0: In situ: noninfiltrating; intraepithelial. Intraductal WITHOUT infiltration. Lobular neoplasia.
50: Paget disease of nipple WITHOUT underlying tumor
70: Paget disease of nipple WITHOUT underlying invasive carcinoma pathologically
100: Confined to breast tissue and fat including nipple and/or areola. Localized, NOS
110: Stated as T1mi with no other information on extension
120: Stated as T1a with no other information on extension
130: Stated as T1b with no other information on extension
140: Stated as T1c with no other information on extension
170: Stated as T1 (NOS) with no other information on extension or size
180: Stated as T2 with no other information on extension or size
190: Stated as T3 with no other information on extension or size
200: Invasion of subcutaneous tissue. Local infiltration of dermal lymphatics adjacent to primary tumor involving skin by direct extension. Skin infiltration of primary breast including skin of nipple and/or areola.
300: Attachment or fixation to pectoral muscle(s) or underlying tissue. Deep fixation. Invasion of (or fixation to) pectoral fascia or muscle
380: OBSOLETE DATA CONVERTED V0203. See code 790. Stated as T4 (NOS) with no other information on extension
390: OBSOLETE DATA CONVERTED V0203. See code 410. Stated as T4a with no other information on extension.
400: Invasion of (or fixation to): Chest wall, Intercostal or serratus anterior muscle(s), Rib(s). See codes 610 (obsolete), 612-615, and 620 (obsolete) for combinations with this code.
410: Stated as T4a with no other information on extension
510: OBSOLETE DATA RETAINED V0200. Extensive skin involvement, including: Satellite nodule(s) in skin of primary breast, Ulceration of skin of breast. Any of the following conditions described as involving not more than 50% of the breast, or amount or percent
512: Extensive skin involvement, including: Satellite nodule(s) in skin of primary breast, Ulceration of skin of breast.
514: Any of the following conditions described as involving less than one-third (33%) of the breast WITHOUT a stated diagnosis of inflammatory carcinoma. WITH or WITHOUT dermal lymphatic infiltration: Edema of skin, En cuirasse, Erythema, Inflammation of skin,
516: 514 + 512
518: Any of the following conditions described as involving one-third (33%) or more but less than or equal to half (50%) of the breast WITHOUT a stated diagnosis of inflammatory carcinoma. WITH or WITHOUT dermal lymphatic infiltration: Edema of skin, En cuira
519: 518 + 512
520: Any of the following conditions described as involving more than 50% of the breast WITHOUT a stated diagnosis of inflammatory carcinoma.
WITH or WITHOUT dermal lymphatic infiltration: Edema of skin, En cuirasse, Erythema, Inflammation of skin, Peau d'ora
575: 520 + 512
580: Any of the following conditions with amount or percent of breast involvement not stated and WITHOUT a stated diagnosis of inflammatory carcinoma. WITH or WITHOUT dermal lymphatic infiltration: Edema of skin, En cuirasse, Erythema, Inflammation of skin, P
585: 580 + 512
590: OBSOLETE DATA CONVERTED V0203. See code 605. Stated as T4b with no other information on extension.
600: Diagnosis of inflammatory carcinoma WITH a clinical description of inflammation, erythema, edema, peau d'orange, etc., involving less than one-third (33%) of the skin of the breast, WITH or WITHOUT dermal lymphatic infiltration.
605: Stated as T4b with no other information on extension
610: OBSOLETE DATA RETAINED V0200. (400) + (510)
612: Any of (512-516) + 400
613: Any of (518-519) + 400
615: Any of (520-585) + 400
620: OBSOLETE DATA RETAINED V0200. (400) + (520)
680: Stated as T4c with no other information on extension
710: OBSOLETE DATA RETAINED V0200. Diagnosis of inflammatory carcinoma WITH a clinical description of inflammation, erythema, edema, peau d'orange, etc., involving not more than 50% of the skin of the breast, WITH or WITHOUT dermal lymphatic infiltration. Infl
715: OBSOLETE DATA RETAINED V0202. Diagnosis of inflammatory carcinoma WITH a clinical description of inflammation, erythema, edema, peau d'orange, etc., involving not more than one-third (33%) of the skin of the breast, WITH or WITHOUT dermal lymphatic infilt
720: OBSOLETE DATA CONVERTED V0102. Description: Diagnosis of inflammatory carcinoma WITH a clinical diagnosis of inflammation, erythema, edema, peau d'orange, etc., of not more than 50% of the breast, WITH or WITHOUT dermal lymphatic infiltration. Inflammator
725: Diagnosis of inflammatory carcinoma WITH a clinical description of inflammation, erythema, edema, peau d'orange, etc., involving one-third (33%) or more but less than or equal to one-half (50%) of the skin of the breast, WITH or WITHOUT dermal lymphatic i
730: Diagnosis of inflammatory carcinoma WITH a clinical description of inflammation, erythema, edema, peau d'orange, etc., involving more than one-half (50%) of the skin of the breast, WITH or WITHOUT dermal lymphatic infiltration.
750: Diagnosis of inflammatory carcinoma WITH a clinical description of inflammation, erythema, edema, peau d'orange, etc., but percent of involvement not stated, WITH or WITHOUT dermal lymphatic infiltration. Note: If percentage is known, code to 600, 725
780: Stated as T4d with no other information on extension
790: Stated as T4 (NOS) with no other information on extension
950: No evidence of primary tumor
999: Unknown; extension not stated. Primary tumor cannot be assessed. Not documented in patient record.
1022: Unknown

18) CS Lymph Nodes
"Information on involvement of lymph nodes. Available for 2004+. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS."
Codes
0: No regional lymph node involvement OR isolated tumor cells (ITCs) detected by immunohistochemistry/immunohistochemical (IHC) methods or molecular methods ONLY.
50: Evaluated pathologically: None; no regional lymph node involvement BUT ITCs detected on routine hematoxylin and eosin (H and E) stains.
130: Evaluated pathologically: Axillary lymph node(s), ipsilateral, micrometastasis ONLY detected by IHC ONLY (At least one micrometastasis greater than 0.2 mm or more than 200 cells AND all micrometastases less than or equal to 2 mm)
150: Evaluated pathologically: Axillary lymph node(s), ipsilateral, micrometastasis ONLY detected or verified on H&E (At least one micrometastasis greater than 0.2 mm or more than 200 cells AND all micrometastases less than or equal to 2 mm). Micrometastasis, NOS
155: Evaluated pathologically: Stated as N1mi with no other information on regional lymph nodes
250: Evaluated pathologically: Movable axillary lymph node(s), ipsilateral, positive with more than micrometastasis (At least one metastasis greater than 2 mm)
255: Evaluated pathologically: Movable axillary lymph node(s), ipsilateral, positive with more than micrometastasis (At least one metastasis greater than 2 mm)
257: Evaluated clinically: Clinically stated only as N1 (Clinical assessment because of neoadjuvant therapy or no pathology)
258: Evaluated pathologically: Pathologically stated only as N1 [NOS], no information on which nodes were involved
260: Stated as N1 [NOS] with no other information on regional lymph nodes
280: OBSOLETE DATA RETAINED V0104. Stated as N2, NOS
290: OBSOLETE DATA CONVERTED V0203. See code 610. Clinically stated only as N2, NOS (clinical assessment because of neoadjuvant therapy or no pathology)
300: OBSOLETE DATA CONVERTED V0203. See code 620. Pathologically stated only as N2 NOS; no information on which nodes were involved
500: OBSOLETE DATA RETAINED V0104. Fixed/matted ipsilateral axillary nodes, positive with more than micrometastasis (i.e., at least one metastasis greater than 2 mm). Fixed/matted ipsilateral axillary nodes, NOS
510: Evaluated clinically: Fixed/matted ipsilateral axillary nodes clinically (Clinical assessment because of neoadjuvant therapy or no pathology).
Stated clinically as N2a (Clinical assessment because of neoadjuvant therapy or no pathology)
520: Evaluated pathologically: Fixed/matted ipsilateral axillary nodes clinically with pathologic involvement of lymph nodes WITH at least one metastasis greater than 2 mm
600: Axillary/regional lymph node(s), NOS. Lymph nodes, NOS
610: Evaluated clinically: Clinically stated only as N2 [NOS] (Clinical assessment because of neoadjuvant therapy or no pathology)
620: Evaluated pathologically: Pathologically stated only as N2 [NOS]; no information on which nodes were involved
630: Stated as N2 [NOS] with no other information on regional lymph nodes
710: Evaluated pathologically: Internal mammary node(s), ipsilateral, positive on sentinel nodes but not clinically apparent (No positive imaging or clinical exam) WITHOUT axillary lymph node(s), ipsilateral
720: Evaluated pathologically: Internal mammary node(s), ipsilateral, positive on sentinel nodes but not clinically apparent (No positive imaging or clinical exam) WITH axillary lymph node(s), ipsilateral
730: Evaluated pathologically: Internal mammary node(s), ipsilateral, positive on sentinel nodes but not clinically apparent (No positive imaging or clinical exam) UNKNOWN if positive axillary lymph node(s), ipsilateral
735: Evaluated clinically: Internal mammary node(s), ipsilateral, positive on sentinel nodes but primary not resected WITHOUT axillary lymph node(s), ipsilateral OR UNKNOWN if positive axillary lymph node(s)
740: Internal mammary node(s), ipsilateral, clinically apparent (On imaging or clinical exam) WITHOUT axillary lymph node(s), ipsilateral
745: Internal mammary node(s), ipsilateral, clinically apparent (On imaging or clinical exam) UNKNOWN if positive axillary lymph node(s), ipsilateral
748: Stated as N2b with no other information on regional lymph nodes
750: Infraclavicular lymph node(s) (subclavicular) (level III axillary nodes) (apical), ipsilateral WITH or WITHOUT axillary node(s) WITHOUT internal mammary node(s)
755: Stated as N3a with no other information on regional lymph nodes
760: OBSOLETE DATA RETAINED AND REVIEWED V0203. See codes 763 and 765. Internal mammary node(s), ipsilateral, clinically apparent (on imaging or clinical exam) WITH axillary lymph node(s), ipsilateral, codes 150 to 600 WITH or WITHOUT infraclavicular (level III axillary nodes) (apical) lymph nodes.
763: Internal mammary node(s), ipsilateral, clinically apparent (On imaging or clinical exam) WITH axillary lymph node(s), ipsilateral, codes 150 to 600 WITHOUT infraclavicular (level III axillary nodes) (apical) lymph nodes or unknown if infraclavicular (level III axillary nodes) (apical) lymph nodes involved
764: Internal mammary node(s), ipsilateral, clinically apparent (On imaging or clinical exam) WITHOUT axillary lymph node(s), ipsilateral WITH infraclavicular (level III axillary nodes) (apical) lymph nodes involved
765: Internal mammary node(s), ipsilateral, clinically apparent (On imaging or clinical exam) WITH axillary lymph node(s), ipsilateral WITH infraclavicular (level III axillary nodes) (apical) lymph nodes involved
768: Stated as N3b with no other information on regional lymph nodes
770: OBSOLETE DATA RETAINED V0200. Internal mammary node(s), ipsilateral, clinically apparent (on imaging or clinical exam). UNKNOWN if positive axillary lymph node(s), ipsilateral
780: OBSOLETE DATA RETAINED V0200. (750) + (770)
790: OBSOLETE DATA CONVERTED V0203. See code 820.
Stated as N3, NOS
800: Supraclavicular node(s), ipsilateral
805: Stated as N3c with no other information on regional lymph nodes
810: Evaluated clinically: Clinically stated only as N3 [NOS] (Clinical assessment because of neoadjuvant therapy or no pathology)
815: Evaluated pathologically: Pathologically stated only as N3 [NOS]; no information on which nodes were involved
820: Stated as N3, NOS with no other information on regional lymph nodes
999: Unknown; regional lymph nodes not stated. Regional lymph node(s) cannot be assessed. Not documented in patient record
1022: Unknown

19) Marital Status at diagnosis
"This variable identifies the patient's marital status at the time of diagnosis for the reportable tumor."
Codes
1: Single (never married)
2: Married (including common law)
3: Separated
4: Divorced
5: Widowed
6: Unmarried or domestic partner (same sex or opposite sex or unregistered)
9: Unknown

20) ER Status Recode Breast Cancer (1990+)
"Created by combining information from Tumor marker 1 (1990-2003), with information from CS site-specific factor 1 (2004+)."
Codes
1: Positive
2: Negative
3: Borderline
4: Unknown
9: Not 1990+ Breast

21) PR Status Recode Breast Cancer (1990+)
"Created by combining information from Tumor marker 2 (1990-2003), with information from CS site-specific factor 2 (2004+). This field is blank for non-breast cases and cases diagnosed before 1990."
Codes
1: Positive
2: Negative
3: Borderline
4: Unknown
9: Not 1990+ Breast

22) SEER cause-specific death classification
"This variable designates that the person died of their cancer for cause-specific survival."
Codes
0: Alive or dead of other cause (filtered out from the extracted dataset used in this research)
1: Dead
9: N/A not first tumor

23) Survival Months
"Created using complete dates, including days, therefore may differ from survival time calculated from year and month only."

24) Laterality
"Laterality describes the side of a paired organ or side of the body on which the reportable tumor originated. Starting with cases diagnosed January 1, 2004 and later, laterality is coded for select invasive, benign, and borderline primary intracranial and CNS tumors."
Codes
0: Not a paired site
1: Right: origin of primary
2: Left: origin of primary
3: Only one side involved, right or left origin unspecified
4: Bilateral involvement, lateral origin unknown; stated to be single primary - Both ovaries involved simultaneously, single histology; Bilateral retinoblastomas; Bilateral Wilms' tumors
5: Paired site: midline tumor
9: Paired site, but no information concerning laterality; midline tumor

25) Histologic Type ICD-O-3
"Histologic Type describes the microscopic composition of cells and/or tissue for a specific primary. The tumor type or histology is a basis for staging and determination of treatment options. It affects the prognosis and course of the disease. The International Classification of Diseases for Oncology, Third Edition (ICD-O-3) is the standard reference for coding the histology for tumors diagnosed in 2001 and later.
All ICD-O-2 histologies for 1973-2000 were converted to ICD-O-3."
Codes
0: Benign (Reportable for intracranial and CNS sites only)
1: Uncertain whether benign or malignant, borderline malignancy, low malignant potential, and uncertain malignant potential (Reportable for intracranial and CNS sites only)
2: Carcinoma in situ; intraepithelial; noninfiltrating; noninvasive
3: Malignant, primary site (invasive)

26) Race/ethnicity
"Recode which gives priority to non-white races for persons of mixed races."
Codes
01: White
02: Black
03: American Indian, Aleutian, Alaskan Native or Eskimo (includes all indigenous populations of the Western hemisphere)
04: Chinese
05: Japanese
06: Filipino
07: Hawaiian
08: Korean (Effective with 1/1/1988 dx)
10: Vietnamese (Effective with 1/1/1988 dx)
11: Laotian (Effective with 1/1/1988 dx)
12: Hmong (Effective with 1/1/1988 dx)
13: Kampuchean (including Khmer and Cambodian) (Effective with 1/1/1988 dx)
14: Thai (Effective with 1/1/1994 dx)
15: Asian Indian or Pakistani, NOS (Effective with 1/1/1988 dx)
16: Asian Indian (Effective with 1/1/2010 dx)
17: Pakistani (Effective with 1/1/2010 dx)
20: Micronesian, NOS (Effective with 1/1/1991 dx)
21: Chamorran (Effective with 1/1/1991 dx)
22: Guamanian, NOS (Effective with 1/1/1991 dx)
25: Polynesian, NOS (Effective with 1/1/1991 dx)
26: Tahitian (Effective with 1/1/1991 dx)
27: Samoan (Effective with 1/1/1991 dx)
28: Tongan (Effective with 1/1/1991 dx)
30: Melanesian, NOS (Effective with 1/1/1991 dx)
31: Fiji Islander (Effective with 1/1/1991 dx)
32: New Guinean (Effective with 1/1/1991 dx)
96: Other Asian, including Asian, NOS and Oriental, NOS (Effective with 1/1/1991 dx)
97: Pacific Islander, NOS (Effective with 1/1/1991 dx)
98: Other
99: Unknown

27) Year of Diagnosis
"The year of diagnosis is the year the tumor was first diagnosed by a recognized medical practitioner, whether clinically or microscopically confirmed."

28) Behavior Code ICD-O-3
"SEER requires registries to collect malignancies with in situ /2 and malignant /3 behavior codes as described in ICD-O-3. SEER requires registries to collect benign /0 and borderline /1 intracranial and CNS tumors for cases diagnosed on or after 1/1/2004. Behavior is the fifth digit of the morphology code, after the slash (/)."
Codes
0: Benign (Reportable for intracranial and CNS sites only)
1: Uncertain whether benign or malignant, borderline malignancy, low malignant potential, and uncertain malignant potential (Reportable for intracranial and CNS sites only)
2: Carcinoma in situ; intraepithelial; noninfiltrating; noninvasive
3: Malignant, primary site (invasive)

29) Surgery of Primary Site
"Surgery of Primary Site describes a surgical procedure that removes and/or destroys tissue of the primary site performed as part of the initial work-up or first course of therapy."
Codes
00: None; no surgical procedure of primary site; diagnosed at autopsy only
10-19: Site-specific codes. Tumor destruction; no pathologic specimen or unknown whether there is a pathologic specimen
20-80: Site-specific codes. Resection; pathologic specimen
90: Surgery, NOS. A surgical procedure to the primary site was done, but no information on the type of surgical procedure is provided.
98: Special codes for hematopoietic, reticuloendothelial, immunoproliferative, myeloproliferative diseases; ill-defined sites; and unknown primaries, except death certificate only
99: Unknown if surgery performed; death certificate only

30) Reason no cancer-directed surgery
"This variable documents the reason that surgery was not performed on the primary site."
Codes
0: Surgery performed
1*: Surgery not recommended
2*: Contraindicated due to other conditions (1973-2002)
5: Patient died before recommended surgery
6: Unknown reason for no surgery
7*: Patient or patient's guardian refused
8: Recommended, unknown if done
9: Unknown if surgery performed; Death Certificate Only case; Autopsy only case (2003+)
*Codes not used prior to 1988. Code '2' used only for Autopsy only cases prior to 1988.
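The coded variables listed above are consumed as categorical inputs during data preparation. As a minimal illustration only, the following Python sketch decodes two of the recoded fields using the mappings documented in this appendix and applies the exclusion noted under variable 22, where code 0 records were filtered out of the extracted dataset. The pandas-based workflow and the column names are assumptions made for this example; they are not the pre-processing implementation used in this research.

# Illustrative sketch only: decode selected SEER codes documented in Appendix B
# and drop records per the filter noted under variable 22.
# Column names are assumed for illustration; actual extract headers may differ.
import pandas as pd

# Code-to-label mappings taken from the definitions in this appendix.
VITAL_STATUS = {1: "Alive", 4: "Dead"}
BREAST_SUBTYPE = {
    1: "Her2+/HR+",
    2: "Her2+/HR-",
    3: "Her2-/HR+",
    4: "Triple Negative",
    5: "Unknown",
    9: "Not 2010+ Breast",
}

def preprocess(records: pd.DataFrame) -> pd.DataFrame:
    """Decode coded fields and drop code-0 records per variable 22."""
    df = records.copy()
    df["Vital status recode"] = df["Vital status recode"].map(VITAL_STATUS)
    df["Breast Subtype"] = df["Breast Subtype"].map(BREAST_SUBTYPE)
    # Variable 22: code 0 (alive or dead of other cause) was filtered out
    # of the extracted dataset.
    return df[df["SEER cause-specific death classification"] != 0]

if __name__ == "__main__":
    # Tiny fabricated example rows, used only to show the decoding and filtering.
    sample = pd.DataFrame({
        "Vital status recode": [1, 4, 4],
        "Breast Subtype": [1, 4, 2],
        "SEER cause-specific death classification": [0, 1, 1],
        "Survival Months": [120, 36, 58],
    })
    print(preprocess(sample))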