EXPLORING HOW GRAPHLET ANALYSIS CAN BE USED TO IDENTIFY
HIGHLY-CONNECTED CANCER DRIVING GENES
by
Alexander Spyridon Douglas
B.Sc. H., University of Northern British Columbia, 2017

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
INTERDISCIPLINARY STUDIES

UNIVERSITY OF NORTHERN BRITISH COLUMBIA
December 2022
© Alexander Spyridon Douglas, 2022

In Appreciation
to
My Family and Friends

Acknowledgement
Firstly, I would like to express my deepest gratitude and respect to Dr. Alex
Aravind. Without his help I would have not been able to pursue a graduate
degree that I found worthwhile. I would like to express my sincerest gratitude
to Dr. David Casperson, Dr. Chow Lee, and Dr. Liang Chen for assisting
me achieve my masters degree and for fostering a healthy pursuit of scientiﬁc
knowledge and academic integrity. I am extremely indebted to the Chairperson, the Department of Interdisciplinary Studies, the University of Northern
British Columbia (UNBC), and Prince George, for providing all the necessary
resources to complete my degree.
Secondly, I am indebted to my labmates Conan Veitch, Dylan Fossl, Colton
Jensen, Hunter Graeme, Furkhana Khan, and Courtney Lawrence for their support, discussion, encouragement, and pleasant company. I would also like to
extend a very heartfelt thanks to the Geoffrey R. Weller Library and UNBC
Security staff members for their support and friendship throughout all the years
I’ve worked at UNBC.
Thirdly, it is with great pleasure to acknowledge the good company of Father
Theodore Gove and Paul English, both of whom were left positive inﬂuences
towards my life.
And ﬁnally, I want to express my deepest gratitude to my family and friends
for their love, support and encouragement throughout the course of my time in
graduate school. I would like to give a very special thanks to my mother, Dr.
Vasiliki Kravariotis-Douglas, and my sister, Ms. Penny Hance, for their combined, immeasurable academic support and mastery to help shape this thesis
and for instilling me with the courage to strive for greater things.
Prince George, BC
December 9th, 2022

(Alexander Douglas)

iii

Abstract
One of the driving force that pushes a normal cell towards a cancerous state is led by unregulated control of its cellular division through the suppression of tumour suppressor genes and
over expression of oncogenes. These cancer-driver genes have been associated in the advancement of different cancer types, though how they are represented and with which genes they
are associated within the cancer’s gene expression proﬁles is a boon towards understanding
of said cancer’s development. An novel way of modelling these gene expression proﬁles is
to use graphlet-based network analysis, a data mining technique that allows for identiﬁcation,
understanding, and prediction of their functionality, emergent properties, and potential controllability. The thesis aims to identify patterns in gene co-expression networks of shared cancer
census genes found in breast cancer and lung cancer using cancer gene census data provided
by COSMIC, the Catalogue of Somatic Mutations in Cancer

iv

TABLE OF CONTENTS

Acknowledgement

iii

Abstract

iv

Table of Contents

v

List of Tables

vii

List of Figures

viii

1

Introduction

1

2

Graphlet Analysis Metrics and Algorithm
6
2.1
Graph Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Isomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Automorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Orbits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4 Subgraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2
Graphlets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3
Algorithm Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4
Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5
Algorithm Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3

COSMIC Database
3.1
COSMIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 Cancer Gene Census . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.2 Comparative Gene Expression . . . . . . . . . . . . . . . . . . . . . .

18
18
18
21

4

Graph Network Creation
4.1
Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.2 Proposed Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . .
4.2
Gene Expression Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Gene Expression Matrices . . . . . . . . . . . . . . . . . . . . . . . .

24
24
24
27
29
29

v

5

Cancer Network Results
5.1
Cancer Graph Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1.1 Structural Network Observations . . . . . . . . . . . . . . . . . . . . .
5.1.2 Potential Genetic Implications . . . . . . . . . . . . . . . . . . . . . .

31
31
31
34

6

Discussion and Overview

49

Bibliography

53

A Appendix
57
A.1 Figures and Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

vi

LIST OF TABLES

2.1

Orbit checksums of each orbit found in 3-, 4-, and 5-node graphlets. . . . . . 13

5.1

Orbit Clusters Found in Identiﬁed Age Groups of Breast and Lung Cancer
Gene Co-expression Networks. . . . . . . . . . . . . . . . . . . . . . . . . . 32
Graphlet Correlation Distance between in Identiﬁed Age Groups of Breast
and Lung Cancer Gene Co-expression Networks. . . . . . . . . . . . . . . . . 34
Gene Distributions Identiﬁed Age Groups of Breast and Lung Cancer Gene
Co-expression Networks. Gene Count versus Age Group . . . . . . . . . . . 34
Identiﬁed Census Genes in Structurally Shared Orbits in Breast Cancer Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Identiﬁed Census Genes in Structurally Shared Orbits in Lung Cancer Networks. 45

5.2
5.3
5.4
5.5

vii

LIST OF FIGURES

2.1

2.3

Visual representation of graph isomorphism between two graphs representing
graphlet G7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Visual representation of graph automorphism between two graphs representing graphlet G7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Automorphism orbits for the twenty-nine 3-, 4-, and 5-node graphlets . . . . . 12

3.1
3.2
3.3

Tier 1 and Tier 2 of the Cancer Gene Census . . . . . . . . . . . . . . . . . . 20
The Cancer Gene Census curation steps undertaken by COSMIC. . . . . . . . 22
Overview of the software’s operational steps . . . . . . . . . . . . . . . . . . 23

4.1

Steps to create cancer gene expression proﬁle (CGEP) graphs . . . . . . . . . 28

5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8

Breast 20–30 age group and lung 30–40 age group carcinoma graphs . . . . .
Breast 30–40 age group and lung 40–50 age group carcinoma graphs . . . . .
Breast 90–100 age group and lung 80–90 age group carcinoma graphs . . . .
Breast 20–30 age group and lung 30–40 age group orbit clustermap . . . . . .
Breast 30–40 age group and lung 40–50 age group orbit clustermap . . . . . .
Breast 90–100 age group and lung 80–90 age group orbit clustermap . . . . .
Identiﬁed breast cancer gene distribution per given orbits. . . . . . . . . . . .
Identiﬁed lung cancer gene distribution per given orbits. . . . . . . . . . . . .

36
37
38
39
40
41
42
43

A.1
A.2
A.3
A.4
A.5
A.6
A.7
A.8
A.9
A.10
A.11
A.12

Full-sized graph of 20–30 age group breast carcinoma (see Figure 5.1). . . . .
Full-sized graph of 30–40 age group lung carcinoma(see Figure 5.1). . . . . .
Full-sized graph of 30–40 age group breast carcinoma (see Figure 5.2). . . . .
Full-sized graph of 40–50 age group lung carcinoma (see Figure 5.2). . . . . .
Full-sized graph of 90–100 age group breast carcinoma (see Figure 5.3). . . .
Full-sized graph of 80–90 age group lung carcinoma (see Figure 5.3). . . . . .
Full-sized cluster map for the breast cancer 20–30 age group (see Figure 5.4).
Full-sized cluster map for the lung cancer 30–40 age group (see Figure 5.4). .
Full-sized cluster map for the breast cancer 30–40 age group (see Figure 5.5).
Full-sized cluster map for the lung cancer 40–50 age group (see Figure 5.5). .
Full-sized cluster map for the breast cancer 90–100 age group (see Figure 5.6).
Full-sized cluster map for the lung cancer 80–90 age group (see Figure 5.6). .

58
59
60
61
62
63
64
65
66
67
68
69

2.2

viii

Chapter 1
Introduction
Cancer is one of the most prevalent, deadly, and challenging diseases worldwide. It is an umbrella term for many diseases whose pathological manifestations are a result of abnormal cell
growth invading the rest of the human body. To properly understand how these manifestations
are realized, a very effective approach is to collect an overwhelming amount of gene expression (GE) data from clinical samples and risk groups that have identiﬁed cancer-driving genes.
Genes are generally considered as such when their involvement is observed in the development
of speciﬁc cancer types [28], however there is no single gene attributed in tumour growth and
proliferation, but rather a broad network of co-dependent genetic abnormalities [21, 27, 28].
Gene expression (GE) signatures are reliable indicators that can be used to draw meaningful
analysis of cancerous genomic proﬁling and genetic abnormalities, which can then be used
to serve as surrogates for cancer phenotypes such as resisting cell death, inducing angiogenesis, enabling replicative immortality, activating invasion and metastasis, evading growth suppressors, and sustaining proliferative signaling [13, 21, 27]. Therefore, it is reasonable to hypothesize that computational analysis of rich cancer GE datasets would reveal common graph
signatures across cancer types.
The technique of identifying patterns of mRNA expression has been used to determine genetic occurences of investigated biological traits. This is true for both normal and cancerous
cellular functionality wherein biological pathways operate in tandem to ensure cellular operations are functional. In the case of a cancerous cell, genomic alterations can lead to abnormal
GE levels, signatures that can be used to serve as surrogates of cancer phenotypes [21, 27]. As
such, identifying key census genes associations patterns from cancer transcriptomic signatures
could help with deﬁning and understanding them. These signatures have been used in clinical
settings where they serve as an assessment standard to facilitate the interpretation of cancer
laboratory test results, simplifying cancer laboratory protocols, and reducing associated medical operation costs [13, 21, 27, 32]. In a research setting, signatures have shown to help in the
observation of biological mechanisms and in identifying possible drug targets for future cancer

1

treatments [13, 21, 27, 28, 32].
The ongoing mission to accumulate GE signatures into data sets from many different cancer types provides an excellent opportunity for thorough computational data mining in search
of such common patterns. A method to reveal meaningful signatures, structural workings, and
genetic relationships from cancer census GE data is to create a graph wherein genes are interconnected through comparable levels of expression. A method of analysis to extract such
properties is to analyze its local heuristics, or topology, by using graphlet analysis. This method
of analysis, as the name implies, utilizes graphlets, small, induced, and connected graphs developed for the sole purpose to effectively characterize the interconnectivity of nodes of larger
and more complex graphs [22, 24]. As graphs composed entirely of nodes interconnected with
edges, graphlets inherently capture these node into topological positions, or automorphic orbits, as per their relationship to every other node with an undirected or directed edge. Graphlets
are developed to compare appropriately large and highly connected biological graphs, as they
provide local heuristic relationships and highlight structural patterns and are computationaly
feasible [1, 5–7, 22, 24]. Graphlet analysis and metrics have been widely used on biological
graphs for comparative analysis [9], to uncover protein-protein interactions [9], and to relate
genes with their functional organization [1, 5–9, 20, 22].
The use of GE data is a very common tool to identify potential co-dependent gene pathways that a cell line, mutated or healthy, needs to control regulation for survival and function.
Collecting and analyzing GE data to identify bio-markers for clinical applications such as diagnosis, prognosis, and/or prediction of suitable treatment has been widely practiced. Such
methods provide impressive amounts of insight towards understanding biochemical and molecular mechanisms in human diseases. The most common method is to use Weighted Gene
Co-expression Network Analysis (WGCNA), a popular R package to assists researchers and
oncologists to identify clusters of co-expressed genes in GE data [1, 5–7, 16, 23, 35]. Its data
mining capabilities has been an effective method in detecting co-expressed modules and hub
genes [16, 23, 35]. In one study, Uddin et al. wanted to characterize the such pathways of ageassociated spatial learning impairment (ASLI), a human disease wherein many suffer from a
decrease in normal brain functions, including spatial learning impairments, as they age [23].
In their study, they used a mathematical modeling approach similar to this study where large
scale mRNA (level 3) GE datasets were created through the use of WGCNA graphs using
mathematical modelling. The use of WGCNA allowed them to conduct differential analysis
between young (learning unimpaired) and aged (predominantly learning impaired) rat brains
in the context of ASLI [23]. Uddin et al.’s analysis have identiﬁed several reproducible and
signiﬁcant graph modules with genes functioning in speciﬁc biological functional categories
including a “learning and memory” speciﬁc module containing many potential key ASLI hub
genes, some of which were also identiﬁed (but not prioritized) in the meta-analysis [23]. Many

2

of these co-expressed candidate hub genes not only show differential co-expression between
young and aged graphs, but are also reproducible in independent datasets. Uddin et al. are
conﬁdent that in grouping three distinctive categories that candidate ASLI hub genes should be
prioritized intoF follow-up research will help understand their potential molecular mechanisms
underlying complex behavioral traits (such as cognitive impairments including ASLI), memory
formation and synaptic plasticity [23]:
1. Hub genes whose role in learning (including spatial learning) is more transparent than
others (CAMK1G, DLG3, MAPK1, PPP2R2C, PRKACB)
2. Hub genes where there is not enough information in the literature to support which direction their expression pattern contributes to the ASLI phenotype (CDK5R1, CNTN1,
SCN2B, STXBP1, EIF5, GABRG1)
3. Hub genes where information is emerging indicating their direct or indirect role in learning and memory (PTEN, KCNAB2, MAPRE1, NDFIP1, RASGRP1, DPP6)
For Yin et al., their use of WGCNA took a novel approach: understanding the mechanism
underlying the progression of Hepatocellular carcinoma (HCC) through the use of gene cluster
monitoring [35]. Yin et al. were able to provide insight into HCC’s CGEP dynamic changes
by conducting a stepwise carcinogenic process, starting from pre-neoplastic lesions to HCC including normal, cirrhosis without HCC, cirrhosis, low-grade dysplastic, high-grade dysplastic,
very early HCC, early HCC, advanced HCC and very advanced HCC stages [35]. Through the
use of WGCNA for constructing gene co-expression graphs, they were able to narrow down
ﬁve distinct genes related to different HCC stages with the help of Ingenuity Pathway Analysis
(IPA) [35]. From those modules, hub genes are identiﬁed and tiered in stages related to HCC
progression:
1. Normal to Cirrhosis: Energy metabolism changed signiﬁcantly, leading to decreased
oxidative metabolism (MUT, AZGP1, HBB, HBA1, HBA2, HBD, SUCLA2, ACADM,
UQCRC2) .
2. Cirrhosis to Early Stage: Development of hepatic ﬁbrosis and hepatic stellate cell activition (KBP1A, ARPC4, HSP90AB1, ENO1).
3. Early to Very Advanced Stage: destablize chromosomal/cell cycle control and DNA
damage occurs, worsening with HCC progression (GINS1, NEK2, BUB1B, KIF11,
TOP2A)
Recent work done by Das et al. was done to prioritize performance and accuracy of attaining meaningful information from GE data using signed graphlet analysis with their newly

3

developed triangle counting algorithm [1]. Their study is aimed in providing additional graph
information that could not be attained using WGCNA, wanting to capture a graphs’ structural
interaction and topology with the use of graphlets and their ensuing orbits [1,5,6]. To ensure the
correctness of their proposed computational technique, Das et al. were able to reduce publically
accessible gastric cancer GE data into graphs for Graphlet Correlation Matrices (GCMs) [1].
The generated GCMs were then converted to simple heatmaps that observed patterns and differences found in the GE data, most notably [1, 5–7]:
1. Correlation intensity between investigated genes in a healthy cell line is much higher than
the cancerous cell lines, suggesting higher amounts of gene regulation in healthy cell
lines than in cancerous ones (signed graphs shows differences in GE regulatory control
between these cell lines consistent in gene regulation of healthy and cancerous cells
found in literature).
2. Orbit clustering identiﬁed in graphlet analyis suggests genomic lineage from gastric cell
lines is preserved in derived healthy and cancerous cell lines (Orbits Orbit 0, Orbit 2,
Orbit 3, Orbit 4, Orbit 9 and Orbits Orbit 1, Orbit 5, Orbit 6, Orbit 7, Orbit 8, Orbit 12,
Orbit 13)
3. Graphlet and orbit patterns between graphs of 22,283 raw-data gene probes and 612
identiﬁed dysregulated gene probes appear identical (suggests that GE graph structure
remains the same regardless of GE gene sample size)
For the purpose of this study, all cancer co-expression graphs are generated based on probabilistic, unidirected interactions between two genes based on their correlative strength in
mRNA expression levels. Each edge is assigned a weight, or correlative threshold, signifying the strength of potential interaction between associated genes. The frequency of certain
graphlets and their automorphic orbits within each weighted graph provides an overview of
descriptive and structural properties. These properties can then be identiﬁed and used to determine the complexity in node interactions. Three such properties, Graphlet Orbit Matrix
(GOM), Graphlet Node Distribution (GND), and Graphlet Correlation Matrix (GCM), are
used to compare and contrast generated cancer census gene graphs [1, 5–7, 22]. Both GOM
and GND contain the frequency of each node participating in differing orbits and differing
graphlets, respectively. Arguably the most important of the three properties, GCM identiﬁes
the dependencies of each orbit by compounding orbit participation of nodes found in GOM and
GND. This provides a topological richness of nodes association and engagement in a graph’
structure and dynamics, and as such can be used for comparison with other graphs [1, 5–7].
To reiterate, graphlet analysis is used for exposing local graph topologies to highlight node
interactions and functionality within a graph. Counting frequencies of graphlets and their re-

4

spective orbits maps out node connectivity, often manifesting signiﬁcant structural information [1, 5, 6, 9]. Undirected graphs are widely used due to their simplicity but are limited by
not representing edge direction. Having directionality provides critical information for understanding the power and inﬂuence of a single node in the graph. Networks, or graphs with
directed edges, would be better used to represent the complexity of transcriptional regulation
genetic components within organisms. Genetic Regulatory Networks (GRN) are such examples, highlighting genes promoting or inhibiting gene expression, which demands a particular
architecture of governing circuitry shaped through generations of evolution [3]. Reconstructing
GRNs for cancer genomes is a more tedious and complex affair compared to reconstruction of
stable evolutionary genomes [3]. Therefore, it would be fair to assume that cancer genomics
should be ﬁrst understood through the use of graphs in order to determine the presence of
census gene groupings before attempting to understand how each census gene inﬂuences the
other’s expression.
In this thesis, gene co-expression graphs are created from multiple cancers using rich data
sets from the Catalogue of Somatic Mutations in Cancer (COSMIC) and applying graphlet
analysis to observe potential gene clusterings and graph similarities.The cancer types, Lung
Carcinoma (LC) and Breast Carcinoma (BC), are explored together. The purpose of this
study is to explore observing gene co-expression patterns across multi-cancers, identifying
commonly related structural graph clustering, and identifying census genes partaking in those
clusters. An initial design and implementation of a new graphlet analysis alogirthm written
in Python is proposed, followed by an overview of COSMIC, the collecting, ﬁltering, and
transforming of its archived LC and BC into generate respective graphs. From there, graphlet
metrics are collected using the proposed algorithm from which frequently associated cancer
census genes are highlighted.

5

Chapter 2
Graphlet Analysis Metrics and Algorithm
Das et al. were able to develop a novel triangle counting algorithm for 3-node graphlet analysis
of node orbits [1, 5–7]. They were not, however, interested in using higher-level graphlets such
as 4- or 5-node graphlets. Das et al.’s pursuit in their endeavour generated a lot of interest in
expanding the ﬁeld of graphlet analysis. In particular, the use of 4- and 5-node graphlets to
conduct analysis of highly-connected node-to-node relationships would prove useful in identifying gene clusters of similar GE [22]. As such, there is a gap for this thesis to address
through achieving graph complexity measurements of the higher-level graphlets by developing
an algorithm that would be able to do so.
It should be noted that the development of this algorithm is not the main research goal.
That would be the analysis of census gene expression proﬁles (CGEPs) from two of the most
researched cancers in the world in an attempt to identify the most prevalent census genes that
appear in their respective proﬁles. It should be made clear that the algorithm developed for
this thesis is correct, conﬁrming the same exact graphlet metrics found by Das et al. by using
their sample graph for analysis [1, 5, 6]. It was the hope of this thesis to expand up to 5node graphlet analysis, allowing for increased sensitivity and detailed composition of large and
complex graphs, however we were limited due to computation resources to complete 5-node
graphlet analysis.

2.1

Graph Terminology

In graph theory terminology, a graph G is composed of a set of points termed nodes N, or
vertices, from which a set of links, termed edges E, extend to connect other nodes [4]. Each
edge of G connects two distinct nodes u and v. A representation of this deﬁnition is reﬂected
in (2.1).
In graph theory terminology, a simple graph G is composed of points termed nodes, or
vertices, from which edges E extend to connect other nodes. Mathematically speaking, G is

6

composed of two sets of elements; one is a set N whose members are nodes, and the other is a
set E of two-node subsets of N whose members are edge [17]. In short, G = (N, E ) is a graph
such that:

E ⊆ N × N,

(u, v) ∈ E ⇒ u = v and (v, u) ∈ E

(2.1)

As discussed previously, the intention of constructing a new algorithm was to tackle 4- and
5-node undirected graphlet metrics from large graphs. In order to do this, we must ﬁrst be able
develop an algorithm to identify orbits regardless of the node size of a graphlet. As such, the
main objective for this algorithm is to prioritize correctness over performance, ensuring that
it can not only identify orbits from x-node graphlets but to also identify automorphic orbits of
subgraphs isomorphic to x-node graphlets.

Figure 2.1: Visual representation of graph isomorphism between two graphs representing
graphlet G7

7

2.1.1

Isomorphism

Isomorphism preserves the connective properties between nodes in compared graphs despite
the nodes being labelled differently (See Figure 2.1). More precisely, graph X and graph Y are
isomorphic to each other if there is a bijective function
f : NX → NY

(2.2)

between their nodes such that for every pair of nodes in NX :

(u, v) ∈ EX if and only if ( f (u), f (v)) ∈ EY .

(2.3)

Figure 2.2: Visual representation of graph automorphism between two graphs representing
graphlet G7.

8

2.1.2 Automorphism
An automorphism is an isomorphism between a graph and itself (See Figure 2.2). More
speciﬁcally, if G = (N, E ), a bijective function f : N → N is an automorphism when, for all
(u, v) ∈ N × N,

(u, v) ∈ E if and only if ( f (u), f (v)) ∈ E.

(2.4)

2.1.3 Orbits
For a graph G, let Aut G be the collection of all automorphisms of G. For x a node of G, the
collection { f (x) | f ∈ Aut G} is called the orbit of x. Every node of G belongs to exactly one
orbit, that is the orbits partition the nodes of G.

2.1.4 Subgraphs
A subgraph H of a graph G consists of a subset of the nodes and edges of G that themselves
form a graph. That is, if H = (NH , EH ) and G = (NG , EG ) then NH ⊆ NG and EH ⊆ EG .
Note that since a subgraph is itself a graph, the endpoints of any edge in a subgraph must
also be in the subgraph [17]. Equationally, EH ⊆ NH × NH . As we also know EH ⊆ EG , we
have
EH ⊆ NH × NH ∩ EG .
(2.5)
When EH = (NH × NH ) ∩ EG , we say that H is an induced subgraph of G.

2.2

Graphlets

Graphlets are small, connected, and induced subgraphs that can be counted to ﬁnd local structures, heuristic subcomponents or “building blocks”, of larger graphs [4]. They make perfect
candidates in studying relationships between graph structure and paired node relationships,
aided by a focus on the frequency of different graphlets and of nodes at particular locations in
the graphlets [4]. Their use in the statistical analysis of large graphs directly resulted from metrics collected through the capture of unique graphlets identiﬁed to then determine the topological positions of their nodes participating in the investigated graph [1,4,5,22]. These discovered
topological positions are grouped as automorphic orbits, or orbits, as seen in Figure 2.3, there
are a total of 73 orbits and a total of 29 graphlets types from 3-, 4-, and 5-node graphlets.
Through the use of identiﬁed graphlets and their nodes’ orbits, graphlet metrics are compiled to provide a structural understanding of the graphs they are measured in [1]. The fre-

9

quency of graphlet nodes and their corresponding orbits is what is counted to generate metrics.
Below is a list of such metrics which have been modiﬁed from Das et al. [1, 5, 6] . Throughout
this description N is the set of nodes, Os the set of orbits, and G is the collection of 3-, 4- and
5-node graphlets (see Figure 2.3).
• Graphlet Orbit Matrix (GOM): A graphlet metric represented as a V × O matrix where
the (i, j ) entry is the number nodes ni identiﬁed in orbit o j . This represents the occurences of a node (gene) within a given orbit:
o1
GOM =

n1
n2
n3
..
.
ni

o2

o3

...

k0,1 k0,2
k1,1 k1,2
k2,1 k2,2
..
..
.
.
ki,1 ki,2

k0,3
k1,3
k2,3
..
.

. . . k0, j
. . . k1, j
. . . k2, j
.
..
. ..
. . . ki, j

ki,3

oj

• Graphlet Node Distribution (GND)1 : A graphlet metric represented as a V × H matrix
where the (i, h) entry is the number of nodes ni identiﬁed in graphlets gh . This represents
the occurences of a node (gene) within a given graphlet, which is derived from the orbits
identiﬁed from GOM:
g1
GND =

n1
n2
n3
..
.
ni

g2

g3

...

l0,1 l0,1
l1,1 l1,1
l2,1 l2,1
..
..
.
.
li,1 li,1

l0,2
l1,2
l2,2
..
.

. . . l0,h
. . . l1,h
. . . l2,h
.
..
. ..
. . . li,h

li,2

gh

• Graphlet Correlation Matrix (GCM): A graphlet metric represented by an O × O matrix
where the (i, i) entry is the Spearman correlation calculated between the ith and jth orbit
column of the GOM:
1
Also referred to as Graphlet Degree Distribution (GDD), and has been renamed as such to further identify its
functionality [22].

10

o1
GCM =

o1
o2
o3
..
.
oi

o2

o3

...

m1,1 m1,2
m2,1 m2,2
m3,1 m3,2
..
..
.
.
mi,1 mi,2

m1,3
m2,3
m3,3
..
.

. . . m0,i
. . . m1,i
. . . m2,i
..
...
.
. . . mi,i

mi,3

oi

In the context of this thesis, Cancer Gene Expression Proﬁles (CGEP) are the investigated
graphs G in which graphlets g are found. Individual nodes in G are represented as a census
cancer gene and are associated to other nodes based on the correlation of their mRNA expression levels, which is then represented as an undirected edge. Isomorphic g are identiﬁed from
G based on the subset of node combinations to study correlational GE relationships of genes
within the investigated CGEP.

2.3

Algorithm Proposal

In order to study this relationship of genes within G for a given g through software analysis,
the proposed algorithm must use the same method for identifying an orbit of a gene no matter
the graphlets being used for comparison. As such, there must be measurable variables that are
common throughout these graphlet analytics that are vital components to graphs and graphlets
alike, to which there are two very obvious ones: node and edge. This algorithm is able to count
the number of neighbouring node and connected edge from a given node. Since orbits are simply unique positions of node based on their topological positioning relative to neighbouring and
connected node, it stands to reason that each and every node can have their orbits identiﬁed via
a calculated orbit checksum OC. This is done by tallying the amount of edges eu found between
a given node m to its immediate neighbouring nodes nu , the amount of immediate neighbouring
nodes found, the amount of edges ev found between immediate neighbouring nodes and their
adjacent neighbouring nodes nv , and ﬁnally the amount of the adjacent neighbouring nodes
found. Equation 2.6 is the resulting formula that captures this process:
OCn = |E1 | + |E2 | + |N1 | + |N2 |

(2.6)

where:
E1 = {(n, u) : (n, u) ∈ E}

edges from n to N1

N2 = {v : u ∈ N1 , (u, v) ∈ E, v = n}

has a length two path from n

N1 = {u | (n, u) ∈ E}

distance one neighours

11

Figure 2.3: Automorphism orbits 1, 2, . . . , 72 for the twenty-nine 3-, 4-, and 5-node graphlets
G1, G2, . . . , G29 [22]. In each graphlet, nodes belonging to the same orbit are of the same
shade.
E2 = {(u, v) | u ∈ N1 , v ∈ N2 , (u, v) ∈ E}

edges from N1 to N2

This “orbit checksum” (OC) is calculated for each and every orbit in each x-node graphlet.
This in turn will allow for nodes of Gsubgraph isomorphic to Ggraphlet to be identiﬁed with
a given orbit that matches the checksum created from the corresponding node of Gsubgraph .
The end result creates an automorphism between Gsubgraph and Ggraphlet . Table 2.1 provides

12

the results of the orbit checksums of each orbit found in each x-node graphlet. Note that
orbit checksums are not necessarily unique to each graphlet; there will be orbits whose orbit
checksums are identical. This is a result of the node topology within those seperate graphlets
being identical. However, the argument to be made here is to identify unique orbits within
graphlets, not necessarily across all graphlets. The intention is to determine the automorphism
between the subgraph and graphlets being isomorphic to each other.

Table 2.1: Orbit checksums of each orbit found in 3-, 4-, and 5-node graphlets.
G

O

Σ

G1

1
2

3
4

G2

3

8

G3

4
5

3
6

G4

6
7

5
7

G5

8

8

G6

9
5
10 10
11 11

G7 12 12
13 15
G8 14 19
G9 15
16
17

3
6
8

G10

3
5
8
9

18
19
20
21

G

O

Σ

G

G12 24 5
25 12
26 13

G20

49 12
50 13

G21

51 10
52 12
53 15

G22

54 16
55 22

G23

56 7
57 21
58 22

G24

59 14
60 19
61 22

G25

G17 39 7
40 14
41 17
42 18

62 12
63 17
64 19

G26

G18 43 12
44 18

65 16
66 23
67 26

G27

68 21
69 26

G13 27 3
28
8
29 10
30 13
G14 31 7
32 12
33 14
G15 34

8

G16 35 5
36
8
37 10
38 11

O

Σ

G19 45 5
G28 70 25
46 12
G11 22 7
71 30
47 15
23 10
G29 72 34
48 17
G is the graphlet number; 0 is the orbit number; and Σ is the corresponding
checksum.

13

2.4

Software Implementation

In this section, pseudocode is provided to show how OCs can be created, attributed, and compared. The software application designed for this thesis is aimed to be a simple and readable
Python framework using previously existing network analysis packages for generating graphs
and subgraphs for the graphlet algorithm introduced in the following section. No matter a
user’s data or requirements, this application utilizies parallel computing for its graphlet algorithm to achieve better performance. The algorithm has been developed with parallelization in
mind. The parallelization occurs at the graphlet level where a child process is spawned for each
graphlet, and given the list of all known induced subgraphs in order to identifying isomorphic
subgraphs and their node orbits. Future algorithms will continue to implement parallelization
whenever possible for future graph and network analysis.
The intent is to provide open-source software that can be run on any platform so users can
feed in massive amounts of graph data on remote machinery operated by any private and public
institutions. Since the project is open source, anyone with appropriate tools can use it to run
directly on their systems.

2.5

Algorithm Implementation

Dictionaries are Python’s implementation of a data structure that is more generally known as an
associative array. A dictionary consists of a collection of key-value pairs, and each key-value
pair maps the key to its associated value. A dictionary is deﬁned by enclosing a commaseparated list of key-value pairs in curly braces “ ” where a colon “ ” separates each key
from its associated value, as seen in the code block below:
dict ={

Key1 : Value
Key2 : Value
...

(2.7)

Keyi : Value
}

All graphs, subgraphs, and graphlets are NetworkX Graph Class objects [12]. These Graph
objects are dictionary-of-dictionary-of-dictionary data structures where the outer dictionary
(the node dictionary) holds adjacency information keyed by node. The next dictionary, the
adjacency list dictionary, represents the node adjacency information and holds edge data keyed
by neighbor. Finally, the inner dictionary, the edge attribute dictionary, represents the edge data

14

and holds edge attribute values keyed by attribute names, as seen below:
Gdict ={

Node1 : {
}

}

Node2 : {Value}

(2.8)

All variable references made in this section are found in Algorithm 1. All 3-, 4-, and 5-node
graphlets are represented by the variable Ggraphlet . An undirected graph that consists of a set of
nodes n and edges e from which subgraphs will be identiﬁed and compared isomorphically with
Ggraphlet is represented by the variable G. Isomorphic subgraphs to Ggraphlet are represented by
the variable Gsubgraph . The algorithm begins by initializing the GOM as a Python
data structure, on line 3 of Algorithm 1. GOM is structured in such a way that each mapping
entry contains a key hashed to node n and the total number of orbits it has been found in:
GOMn ={

orbit1 : orbitcount
(2.9)

...
orbitn : orbitcount
}

A census cancer gene expression graph G and a set of graphlets are the inputs of this
algorithm to compare and identify subgraphs. The GOM is the output created from each G
containing all found nodes and orbits based on a graphlet of a x-node size. The algorithm ﬁrst
has a single graphlet Ggraphlet be generated and sent to the function
to generate OCn , represented in the code as
, speciﬁc to each node ni in their
respective orbits 2.6.
Then, a list of all found induced subgraphs of G are generated. Each subgraph Gsubgraph
isomorphic to Ggraphlet will have its OCn , its
, calculated using
. Each node of Gsubgraph is then compared with every other node of Ggraphlet
based on their
. Once a match is made, ni is attributed the orbit that ni is
situated in. Finally, all ni and their identiﬁed orbits are stored in the GOM, from which the
GCM and the GND will be calculated.

15

ni : Node of a subgraph
n j : Neighbouring node(s) to ni
nk : Neighbouring node(s) to n j
nl : Node of a graphlet
(ni ,n j ) : edge connecting ni and its neighbour n j
E : Number of edges between neighbours n j to ni
NN : Number of neighbours ni to n j
NNE : Number of edges neighbours nk to n j
NNN : Number of neighbours nk to n j
G : Graph Network
Ggraphlet : Graphlet of G
Gsubgraph : Subgraph of G
GOM : Graphlet Orbit Matrix
orbitchecksum: Orbit Checksum
nx: NetworkX package object
// Using a single graphlet
INPUT: G = (V , E ), dict(Ggraphlet )
2 OUTPUT: GOM for G
3 Shared Array: GOM;
4 for each Ggraphlet ∈ dict(Ggraphlet ) do
GenerateOrbitChecksum(Ggraphlet )
5
IdentifyGraphlets(Ggraphlet , G, nx)
end
1







in parallel

Algorithm 1: Graphlet and orbit counting algorithm: Part I. Syntax and nomenclature
follow that from Das et al. [1].

16

Procedure IdentifyGraphlets(Ggraphlet , G, nx)
subgraphList = [ ]
2
for Gsubgraph ∈ G do
3
if nx.connected (Gsubgraph ) AND nx.isomorphic(Gsubgraph , Ggraphlet ) then
4
ValidateOrbitChecksums(Gsubgraph , Ggraphlet )
5
end
end
end

1

Algorithm 2: Graphlet and orbit counting algorithm: Part II
Procedure ValidateOrbitChecksums(Gsubgraph , Ggraphlet )
for ni ∈ Gsubgraph do
2
E ← len(subgraph.edges(node))
3
4
NN ← len(subgraph[node])
NNE ← 0
5
NNN ← 0
6
for n j ∈ Gsubgraph do
7
8
NNE ← NNE + g.edges(n j )
NNN ← NNN + |g.ngbr (nk )|
9
end
10
g[ni ][orbitchecksum] ← E + NN + NNE + NNN
for n ∈ Ggraphlet do
11
if ni [orbitchecksum] == n [orbitchecksum] then
12
[ni ][orbit ] ← n [orbit ]
13
end
end
end
end
Algorithm 3: Implemented graphlet and orbit counting algorithm.
1

17

Chapter 3
COSMIC Database
This chapter outlines the signiﬁcance of
(COSMIC) in the pursuit of combating cancer by providing an expertly reﬁned and updated
database of identiﬁed cancer genes. In addition, we explore a way to identify
(GE) proﬁles to better understand the global picture of cancer census genes involved in breast
and lung cancers.

3.1

COSMIC

COSMIC is the most detailed and comprehensive resource to date for scientiﬁc exploration
into the effects of somatic mutations in human cancer [28]. It allows researchers to utilize
its high-resolution resources for exploring target genetic markers and trends in GE proﬁles
of human cancers [10]. COSMIC is built and designed to allow for expert manual curation
that encapsulates very high detail across mutation positions, disease descriptions, and other
patient and population data [33]. This curation process provides improved quality control over
systematic approaches where experienced curators can identify inconsistencies or errors in
publications, allowing for the rejection of untrustworthy, incomplete or unspeciﬁc data sources
[33].
It should be noted that this curation process is done by COSMIC’s team of postdoctoral
scientist curators that manually interpret peer reviewed publications, as stated by COSMIC
on their Expert Curation of Genes documentation [33], and the validity and integrity of these
claims should be analyzed and discussed openly.

3.1.1 Cancer Gene Census
The
(CGC) data are the primary data that COSMIC derives directly
from the scientiﬁc literature through a broad-range manual curation process. The CGC is one

18

of COSMIC’s many maintained and continually-updated projects which undergoes a curation
process that cultivates speciﬁc and detailed data via major cancer portals, such as
(TCGA) and the
(ICGC), and
from census gene supplementary data associated with their curated papers [2,10,14,26,28,33].
It’s intended purpose is to address the challenges posed by genetic mutations and the wide range
of consequences brought about by them. Driver mutations, which are mutations that determine
the development of a cancerous state, will cause dysfunction (protein structural alterations),
dysregulation (GE level changes), or complete abrogation (complete genetic deletion) of a
gene [10, 26, 28, 33]. To better help encapsulate these effects on cancer progression, the CGC
objective is to catalogue all genes that are causally implicated in cancer, or have extensive
literary evidence, for having affected functionalities resulting in the involvement causing a
cancer tumour’s development [10, 26, 28, 33]. As a result, the CGC categorizes cancer census
genes in two tiers [10, 26, 28, 33]:
• Tier 1 — genes that are characterized by the presence of mutational patterns that strongly
support their involvement in causing a cancer’s development, progression, and proliferation. A gene’s qualiﬁcation to be designated in Tier 1 must meet the following criteria:
– The investigated gene must have at least two publications from two independent
groups that describe its somatic mutations involved in at least one cancer type.
– The investigated gene must have at least two independent publications that provides
evidence of its functional involvement in the cancer type’s biological functionality.
• Tier 2 — genes that don’t have enough research evidence to suggest the presence of
mutational patterns to cause a cancer’s development, progression, and proliferation. A
gene’s qualiﬁcations to be designated in Tier 2 is decided by at least two postdoctoral
scientists following review of sufﬁcient literature.
To be classiﬁed into Tier 1, a gene must possess a documented and reproducible activity
relevant to cancer, alongside evidence of somatic mutations that changes its product’s activity
towards oncogenics [10,26,28,33]. Although only somatic mutations from cancer samples and
cell lines are catalogued in COSMIC, the CGC additionally uses information about germline
mutations and their consequences. COSMIC also acknowledges somatic mutation patterns
across cancer samples gathered. Tumour suppressor genes have often been shown to become
inactive or under-expressed as a result of a broad range of mutation types, whilst oncogenes
usually become unchecked in their ex- pression levels as a result of well-deﬁned missense or
inframe indel mutations [31]. With both mutations patterns and functional evidence, a gene
will be included into the Tier 1 list, indicative of the extensive evidence of their role in the
investigated cancer (Figure 3.1).

19

To be classiﬁed into Tier 2, a gene with more recently and strongly identiﬁed roles in
cancer but with less strong mechanistic or functional evidence are the most common candidates [10, 26, 28, 33]. These include genes with somatic mutational patterns atypical for either
oncogenes or tumour suppressor genes but which won’t have enough functional evidence in
the scientiﬁc literature to suggest their cancer role. Similarly, genes with strong published
evidence for a function in cancer but unclear mutation patterns or known to be dysregulated,
for instance solely by epigenetic means (for example by changes to promoter methylation),
are also included [10, 26, 28, 33]. With neither supportive evidence of mutations patterns nor
functionalities, a gene will only be listed as Tier 2, highlighting the need for more research to
deﬁnitively uncover further evidence as to their role in the investigated cancer (Figure 3.1).

Figure 3.1: (Adapted from Figure 2 of Sondka et al. [26]) Genes are classiﬁed into either Tier
1 or Tier 2 of the Cancer Gene Census based on: (a) the evidence of functional involvement in
oncogenesis, and (b) the somatic mutations patterns in cancer samples indicative of a gene’s
functionality investigated in the literature being curated [10, 26, 28, 33]. Only strong evidence
from both analyses will allow a gene to be classiﬁed as Tier 1. If a investigated gene has been
shown to have either cancer-driving somatic mutational patterns or descriptive biological
pathways highlighting their involvement in cancer progression, then they are attributed as Tier
2.
Genes that are targeted for manual curation are generally genes that meet the Tier 1 criteria of the CGC genes [28]. When a new cancer gene is updated into the CGC as a Tier 1
gene, COSMIC performs a literature review in PubMed to identity any publications reporting

20

the gene’s somatic mutations in cancer types. At the point at which a new gene is selected for
comprehensive curation, COSMIC inspects for mutation data, GE, clinical details, and cancer
phenotype information. The data are curated from both targeted gene studies and from whole
genome screens, so that the full range of reported somatic mutations are represented in COSMIC [10, 26, 28, 33]. Each quarterly database release includes data for new CGC cancer genes,
and it should be noted that the CGC is not static and is updated when new evidence comes to
light (Figure 3.2).
As a result of COSMIC’s dedication to its CGC project, all data are standardized and attributed to a given census gene and its mutation(s) [10, 26, 28, 33]. The CGC provides a wealth
of information pertaining to a single census gene and its role in the development and progression of the cancer it is involved with. COSMIC also provides, from its database, associative
variables such as GE and expression levels, cancer type, and cancer role for each individual
census gene identiﬁed. With these variables the aim is to create an approximation of a given
cancer’s in situ GE proﬁle using graphlet network reconstruction and analysis, as well as identifying common census genes and their neighbouring associations.

3.1.2 Comparative Gene Expression
In order to compare and contrast census genes within a network, a network association reconstruction based on individual GE values, cancer role type, and cancer association was proposed.
These associations are represented as statistically signiﬁcant over, under, or normal genetic expression levels, assigned as z-scores, referenced by individual sample mRNA expression microarray data per curated studies for the given census gene. Under-expressed genes have a
z-score value lower than −2, over-expressed genes have a z-score value greater than 2, and
normally-expressed genes have a z-score and between 2 and −2. These measures are useful to
determine whether a gene is up-, down-, or regulated relative to the normal samples or all other
tumour samples. A z-score simply represents the number of standard deviations away from the
average GE mean of the reference population used in the given study The z-score is provided
by TCGA, which COSMIC uses for its curation purposes. mRNA expression data is simply
the computed relative expression of an individual gene and tumour to the gene’s expression
distribution in a reference population [2, 33]. That reference population is either all tumours
that are diploid for the gene in question, or, when available, normal adjacent tissue [2].
For research purposes, we want to include both Tier 1 and Tier 2 when conducting our
investigations in understanding a cancer’s GE proﬁle. Determining the placement of Tier 2
genes in association with Tier 1 genes could create preemptive discoveries of signiﬁcant GE
proﬁles of said genes with low cancer research evidence and investigations. Figure 3.3 provides
an overview of the data ﬂow and processing operations for graphlet analysis.

21

Figure 3.2: The Cancer Gene Census curation steps undertaken by COSMIC. Once the
information about the gene is made public, COSMIC curators will explore for any presence of
patterns of somatic mutations in COSMIC’s resources prior to going forward with the
literature review. If COSMIC resources do not contain any information whatsoever for this
new gene, then curators will perform a literature review to identify roles the gene has
associated with a given cancer. Once there is minimum evidence from at least two
publications across different research groups showing experimental evidence of functional
involvement in the associated cancer, two COSMIC postdoc-scientist curators will then
discuss if the gene in question needs more evidence to be designated a census gene. Once
decided, the gene will be deﬁned either as a Tier 1 or a Tier 2 census gene with mutational and
functional information stored in the Cancer Gene Census database. With more research for
that given gene, COSMIC will continually update it with additional information.

22

Figure 3.3: A generalized overview of the software’s operational steps and functionality to
process raw cancer data, generate graphlets and cancer graphs, enumerate identiﬁed graphlets,
and collect graphlets metrics.

23

Chapter 4
Graph Network Creation
Since the introduction of DNA and RNA sequencing technologies, the collection and analysis
of Cancer Gene Expression Proﬁle (CGEP) data has been instrumental in identifying cancer
bio-markers for clinical applications such as cancer diagnosis, prognosis, and/or prediction for
suitable treatments for afﬂicted individuals. Accordingly, with more CGEP data being collected
comes the demand for hosting and sharing that has led to the development of research databases
that are accessible, cheap to maintain, fast to query, and straightforward to operate. In turn, the
need for effective analysis techniques to model and analyze from CGEP data is unavoidable in
order to draw meaningful conclusions. A very popular analysis technique is to model the data
as connected graphs in order to effectively represent cancer genes and their interactions with
each other in order to unveil hidden links and patterns. Using GE data accumulated in CGC,
the need to determine, identify, and understand the underlying molecular biology and genetics
of cancers would provide meaningful investigative directions for oncological research. It is
with the use of graphlet analysis that we aim to identify the topological signiﬁcance of cancer
gene interactions within and in comparison of different CGEPs.

4.1

Data Analysis

4.1.1 Data Collection
All data preparation steps including census GE graphs analysis and statistical analyses were
performed in Python using appropriate modules: Pandas and Dask for managing large data
ﬁles as dataframe datastructures to enable data analysis, NetworkX for creating graphs from
the data analysis, MatPlotLib and Seaborn to visualize results of the data analysis as heatmaps
and graphs. The use of a correlation matrix allows us to determine co-expression relationships
between genes based on their z-score GE values. All data ﬁles were downloaded from the
following COSMIC website:

24

as
ﬁles, an archive ﬁle compressed by the standard GNU zip (gzip) compression algorithm
[11]. The data download version is the v92 release of August 27th, 2020. The data ﬁles used
are
•

(8.6 GBs in size),

•

(2.21 GBs in size), and

•

(16.6 GBs in size).

All ﬁles were uncompressed into tab separated ﬁles to be used for data analysis. COSMIC data
ﬁles are available for download with the creation of a free account.
1. Complete Gene Expression (CGE) — Listed as
from COSMIC. This contains the census
genes associated with their identiﬁed z-score per the study associated. COSMIC labels
each of these as separate variables within the aforementioned data table as follows:
: the gene name for which the data has been curated into COSMIC. In most
cases this is the accepted Human Genome Organization Gene Nomenclature Committee (HGNC) identiﬁer (ex: AKT1 is short-form for v-akt murine thymoma viral
oncogene homolog 1).
: the GE score taken by a study for a particular census gene.
: A sample is an instance of a portion of a tumour being examined for mutations. A sample id is used to identify a sample within the COSMIC database.
2. Cancer Gene Census (CGC) — Listed as
from COSMIC. This contains the list of cancer census
genes curated by COSMIC. From this data table we are able to determine the associated
cancer and the cancer role for each gene. Each of these are separate variables are listed
by COSMIC as follows:
: the gene name for which the data has been curated into COSMIC. In
most cases this is the accepted Human Genome Organization Gene Nomenclature
Committee (HGNC) identiﬁer. This unique identiﬁer is what allows cross referencing to retrieve additional COSMIC information from other projects and data tables.
: the somatic cancer that a census gene has been identiﬁed as involved
in its progression and development. A census gene can have more than one tumour
type associated with it (ex: AKT1 is associated with breast, colorectal, ovarian, and
non-small cell lung cancer).

25

: the census gene functionality whose mutations have been determined to
be cancerous. It can be deﬁned as an oncogene, a tumour suppressor gene, or a
fusion/hybrid gene, or in any combination of the three.
: A sample is an instance of a portion of a tumour being examined for mutations. A sample id is used to identify a sample within the COSMIC database.
3. Census Gene Mutation Data (CGMD) — Listed as
from COSMIC. This contains all coding mutations in
census genes listed in the CGC. It contains the Functional Analysis through Hidden
Markov Models (FATHMM) descriptors and the somatic mutation status that allows us to
to determine the pathogenic nature of a census gene. These variables used are described
and listed by COSMIC as follows, among variables:
: The gene name for which the data has been curated in COSMIC. In most
cases this is the accepted HGNC identiﬁer.
: A calculated p-value score ranging from 0.0 to 1.0 determining
the cancerous affect of a given gene. Scores greater than 0.5 are considered deleterious, and scores greater than 0.7 are considered pathogenic. Mutations are classed
as neutral if the score less than or equal to 0.5.
: The age of the patient identiﬁed in the curated studies for the identiﬁed census gene
and associated sample(s).
: Information on whether the sample was reported to be
Conﬁrmed Somatic, Previously Reported or Variant of Unknown Origin.
: The primary tissue/cancer from which the sample originated.
: A sample is an instance of a portion of a tumour being examined for mutations. A sample id is used to identify a sample within the COSMIC database.
It should be carefully noted that the aforementioned variables from each data source are
not the complete list of variables found. Those that are listed are only sourced for the intent
of this thesis, which is to determine if graphlet analysis can provide a clearer understanding of
frequently associated genes of an identiﬁed CGEP. As such, the variables excluded were not
deemed necessary for this endeavour due to redundancy of representing qualititative descriptors of quantitative components (i.e. in CGMD, the variable FATHHM scores retains numeric
scores from 0.0 to 1.0 where values greater than 0.7 are considered pathogenic, otherwise
neutral, and reported in FATHMM prediction), or are completely unneeded in the creation of
CGEP graph (i.e. in CGMD, histological sub-classiﬁcation of sample tissues and of the variable
.

26

4.1.2 Proposed Data Analysis
It is important to note that the selected GE graphs variables were not selected without comparisons in current literature for differential expression analysis. To reiterate, the need for
understanding molecular pathways and genetics is fundamental to improve targeted cancer
treatments. In order to provide a reasonable estimate for a clinical investigation, we propose to
use the variables listed in Section 4.1.1 from the data ﬁles used to recreate CGEPs for graphlet
analysis following the data analysis steps in Figure 4.1.
The Complete Gene Expression data table consists entirely of quantiﬁed mRNA GEs of
individual census genes from studies that use standardized and tested RNA sequencing research
tools and methods. The platform codes currently used to produce the COSMIC GE values are
the following in order of preference:
,
,
, and
[33]; Illumina is simply the developer
and manufacturer of the RNA sequencing machinery and protocols used for large-scale analysis
of genetic variation and function. Additional COSMIC data is needed to identify co-expression
relationships between census genes with associative cancer. The CosmicMutantExportCensus
and cancer_gene_census data ﬁles provide ample information to create meaningful graphs that
can lead potential CGEP patterns, and the data analysis process is outlined step by step in
Figure 4.1.
First and foremost, only mutation samples from CGMD identiﬁed as pathogenic by Functional Analysis through Markov Models (FATHMM) predictions are carried over for graph
creations. The FATHMM variable is a machine learning approach (better known as FATHMMMKL) that integrates functional annotations from
(ENCODE) consortium with nucleotide-based sequence conservation measures that can predict
the functional consequences of non-coding variants [25]. If the mutation data is marked as
pathogenic then it is considered to be cancerous. From the pathogenic mutation data, mutation
samples are then processed for being conﬁrmed as somatic mutations. The experiments that the
mutation samples were taken from have had comparative genomic sequencing done between
normal samples from the same patient. This ensures that the data collected is conﬁrmed to
contain not only pathogenic samples but is reassuringly a mutation in comparison to a normal
human genome sample.
The resulting mutation data that is then further reﬁned can then only contain samples from
the investigated cancer types (breast and lung). Once completed, the data is further distilled
to primarily carcinoma histological samples. We acknowledge that there exist many cancer
histologies, but for the sake of this study it will be easier to compare graphlet analysis results
between the investigated cancers. Before extracting gene expression data based on census
genes, the mutation data is compounded based on age ranges of 10 years. This is an attempt
to further identify potential CGEP patterns related to age as it is a strong correlative factor in a

27

Figure 4.1: Data analysis steps to create meaningful CGEP graphs. COSMIC data are pruned
by keeping sample mutations that are conﬁrmed ﬁrst by being pathogenic (conﬁrmed through
FATHMM predictions) and then being somatic (conﬁrmed by sequencing both the tumour and
a matched normal tissue sample from the same patient). Following this, tumour samples are
separated based on cancer type and histology (breast and lung carcinomas). From the retained
sample mutations, only those identiﬁed with GEs are used for the creation of graphs. The
resulting graphs are then subjected to graphlet analysis, leading to the creation of comparative
data (GOM, GCM, etc.) that will allow for CGEP co-expression comparisons.
higher risk of developing cancer [34].
Finally, the last stages of the data analysis process is to collect gene expression data based
on identiﬁed census genes from the reﬁned mutation samples. The gene expression data stems
from the CGE data ﬁle wherein the gene names and study IDs are matched with those from
the collected mutation samples derived from the CGMD data ﬁle. A gene expression matrix is
created based on the gene name and study IDs from both data ﬁles, which will then undergo
Spearman correlation matrix calculations as described in detail in Section 4.2.1. After correlation matrices are completed, they are then used to generate graphs based on the strength in
correlative expression levels between census genes. A threshold is applied to only retain meaningful relationships between these genes, and in Chapter 5 levels of thresholds are selected to
provide a better comprehension of each cancer’s genetic co-expression proﬁles. To further enrich the CGEPs, metadata from the CGC data ﬁle is merged with the collected mutation sample
dataset to identify the type of roles each census gene plays; genes that either became oncogenic

28

or lost their tumour suppressor ability. This merging of additional information is done using
the sample IDs and gene names from both the CGC and mutation sample dataset. Generated
graphs represent each of these genes accordingly.

4.2

Gene Expression Rationale

As mentioned previously in Chapter 3, census genes are given unique identiﬁers attributed
by COSMIC that allows us to not only match them with associated cancer histology but also
determine whether or not their somatic mutations are deemed pathogenic, distinctive cancer
roles, determined age brackets, and their gene expression levels. The census genes are curated
as a result of having a mutation that would cause themselves or others to develop cancerdriving activities through uncontrolled expression. It is with these available data that graphs
can be created from GE levels following correlation matrix calculations from groups of census
genes for a given cancer.

4.2.1 Gene Expression Matrices
The use of correlation matrices is necessary to determine the co-expression graphs of genes for
their respective CGEPs. This allows us to determine the strength of GE correlations between
pairs of speciﬁc cancer genes; a weighted correlation between two genes represents connection strength between them in a CGEP. We use Spearman’s correlation coefﬁcient to calculate
the pairwise GE, which results in an adjacency matrix deﬁning the strength in correlative coexpression between each gene. For each dataset, a Weighted Gene Expression Correlation
Matrix (WGECM) was created by taking the weight of correlative co-expression (w) between
each pair of genes. These values are found in a range between −1.0 (strongly dissimilar correlative signiﬁcance) and 1.0 (strongly similar correlative signﬁcance). The self-correlated genes
of each matrix were not carried over into the graph creation: a gene will always be correlated
with itself does not provide meaningful information. For the purposes of our study, census
genes with w greater than designated threshold values of 0.3 to 0.9 for ten year age groups
starting from 0 to 100 are retained to generate CGEP graphs (i.e., only genes with above average median connectivity were kept for graph analysis). This will provide us comparative
analysis of CGEPs increasing w to compare between those age categories.
Following this ﬁltering step, adjacency matrices containing census genes with w greater
than designated thresholds are used to create co-expression graphs. In a co-expression graphs,
an edge between two genes (nodes) represents the co-expression relationship w. All generated
graphs are created with the use of NetworkX, a graph generation library for Python. In reference to Chapter 2, the steps for graphlet analysis was conducted on these graphs to create

29

GOMs, GCMs, and GNDs. Heatmaps of WGECMs, GOMs, GCMs, and GNDs were created
using Python’s Seaborn, MatPlotLib, Seaborn, and SciPy libraries and used for visualization
and analysis. These analyses and results are shown in full in Chapter 6.

30

Chapter 5
Cancer Network Results
Graphlet analysis metrics were conducted following the generation of BCG and LCG from
census genes provided by COSMIC. The following section provides a summary of results and
observations found from each GE graph. The result summaries were measured at different
correlation co-expression thresholds and age groups to further help with the investigation of
GE graph size and connections. This investigation used a minimum correlation co-expression
threshold of 0.3 and each collected GE data from recorded samples are compiled into 10-year
age groupings, starting with the years 0–10 and ending with the years 90–100.

5.1

Cancer Graph Networks

This section provides an overview as to the size and degree of connectivity in each investigated
cancer at varying threshold levels and age groups. The graphs below are constructed from GE
data with probabilistic thresholds of census gene co-expression greater than or equal to 0.3.
For both LCG and BCG, it is very interesting to observe from the 10 year age groupings some
highly occuring genes.

5.1.1 Structural Network Observations
In each of the graphs generated, cancer census gene as nodes have been distinguished in individual colours and labels. As mentioned on Section 3.1.1, there are two types of COSMIC
census genes used for generating graphs: genes with strong support in cancer development
involvement (Tier 1), and genes that might suggest some form of inﬂuence on cancer development (Tier 2). As seen in the corresponding graphs in Figures 5.1–5.3, a Tier 1 gene is
attributed with a single asterix to its gene name (i.e. “ATR*”) while a Tier 2 gene is attributed
with a double asterix (i.e “SEPT6**”). Secondly, genes are also coloured based on their cancer
role upon a mutation that would render them cancer-driving. As mentioned in Section 4.1.1 of

31

Chapter 4, each gene has been identiﬁed to have an either oncogenic, tumour suppressor, or a
mutated fusion of both. In Figures 5.1–5.3, oncogenes are labelled green, tumour suppressor
genes are labelled red, and fusions are labelled light-blue.
It was brieﬂy noted in Section 5.1, the constructed 10-year age ranged graphs represented
in Figures 5.1–5.3 were the only graphs generated from the 0.3 GE probabilistic threshold. No
other graphs were found as there were no edges found at correlation levels lower than 0.3. If the
age ranges were shortened to 5-, 2-, or down to 1-year age differences we could see stronger and
more positive GE correlations. An issue arose when trying to compute more complex, highly
connected node-to-node relationships using 5-node graphlets, namely the graphlet algorithm’s
consumption exceeding the allocated RAM resources provided for this research work. A hard
limit of 46 gigabytes of RAM was allocated for this research work and as a result no 5-node
graphlet analysis data was attained, however both 3- and 4-node graphlet analysis data were
collected.
Figures 5.4–5.6 showcase the GCMs calculated from the GOMs of the BCGs and LCGs.
These are generated using Python’s Pandas’ Spearman correlation matrix and Seaborn’s Clustermap tools. Hierarchical groupings, or clusters, of certain orbits found between Orbit 1 – Orbit 14 are observed from the aforementioned ﬁgures. An orbit cluster is a collection of orbits
such that any two orbits with a correlation coefﬁcient value greater than or equal to 0.8. This
collection can be as large as possible and is seen as the lightest coloured squares in Figures 5.4–
5.6. These orbit clusters are tallied and identiﬁed further in Table 5.1.
Table 5.1: Orbit Clusters Found in Identiﬁed Age Groups of Breast and Lung Cancer Gene
Co-expression Networks.
B REAST C ANCER
Age Group
Orbit Clusters
20 to 30

30 to 40

90 to 100

5, 7, 2, 11, 12
10, 14, 3, 13
6, 4, 9
1, 8
11, 3, 10
6, 1, 4
5, 2, 7

Age Group

L UNG C ANCER
Orbit Clusters

30 to 40

10, 14, 3, 13, 5, 7, 2, 11
12, 1, 8

40 to 50

12, 13, 3, 10
6, 9, 8, 14
2, 7, 5, 11

80 to 90

13, 11, 3, 10
2, 7, 5, 1, 4
6, 9

12, 2, 11, 5, 7
14, 10, 3, 13
1, 8
6, 9

There is a clear visible overall distinction of node connectivity in the observed generated
graphs as seen in Figures 5.1–5.3. In breast carcinoma graphs (BCG), Table 5.1 shows common

32

clusters of orbits (Orbit 2, Orbit 5, Orbit 7, Orbit 11, Orbit 12), orbits (Orbit 1, Orbit 8), orbits
(Orbit 3, Orbit 10, Orbit 13, Orbit 14), and orbits (Orbit 6, Orbit 9) between 20–30 and 90–100
graphs. This is interesting to note as it implies that there are some conserved graph structure
for GE between starkly different age groups. However, the graph structure shrinks in the 30-40
graph from the 20-30 graph. A smaller set of orbit clusters (Orbit 2, Orbit 5, Orbit 7) and (Orbit
3, Orbit 10) are shared across 20–30, 30–40, and 90–100 graphs, indicating again similarities
in graph structure, but there is a transition back to identical orbit composition from 20–30 to
90–100 graphs. It would appear that GE levels vary differently in the 30–40 age groups in
comparison to both the 20–30 and 90–100 age groups.
In contrast, the lung carcinoma graphs (LCG) shown in Figures 5.4–5.6 exhibits truncation
of orbit clusters with increased age. Table 5.1 shows common clusters of orbits (Orbit 2, Orbit
7, Orbit 5), orbits (Orbit 6, Orbit 9), and orbits (Orbit 3, Orbit 10, Orbit 13) between 40–50
and 80–90 graphs. However, the 30–40 graph does not share separate orbit clusters following
40–50 and 80–90 graphs. A single cluster orbit of (Orbit 10, Orbit 14, Orbit 3, Orbit 13, Orbit
5, Orbit 7, Orbit 2, Orbit 11) is observed instead. This doesn’t detract from the fact that all
LCG graphs share similar orbits, it simply indicates that there is a loss of node connectivity
from 30–40 onwards.
Between LCG and BCG, it is interesting to note that combinations of orbits (Orbit 4, Orbit
6, Orbit 9), (Orbit 5, Orbit 7, Orbit 2, Orbit 11), and (Orbit 10, Orbit 3, Orbit 13) exist across all
age groups graphs. This is quite an interesting result as it indicates that all these graphs share
some common structural similarities, potentially indicating some sort of preserved genetic expression.
To properly compare how similar/disimilar investigated graphs are to each other, a graphlet
correlation distance (GCD) matrix was calculated from the GCM of each graph. This method
of quantitative analysis measures the dependence of correlation of a set of vectors between two
variables; in this case, a list of calculated GCD mean values is compared from two graphs to
determine if they are more similar (a GCD value closer zero) or disimilar (a GCD value further
from zero) they are by measuring the Euclidean distance. The list of GCD mean values is
created by ﬁrst calculating the mean count of individual orbits found in the GCM of each graph,
then a correlation distance calculation is conducted between graphs by using the Python’s Scipy
Spatial Distance Correlation function. As seen in Table 5.2, the bolded GCD values represent
some interesting within- and between-group graph structure similarities. This can be seen in
orbit cluster heatmaps in Figures 5.4–5.6 where there are preserved structural similarities.

33

Table 5.2: Graphlet Correlation Distance between in Identiﬁed Age Groups of Breast and
Lung Cancer Gene Co-expression Networks.

0.3157
0.0043
0.0086
0.1614
0.2185

5.1.2

0.0043
0.2606

0.3157
0.2606
0.3981
0.0631
0.0252

0.0239
0.1187
0.1687

0.0086
0.3981
0.0239

0.1614
0.0631
0.1187
0.2348

0.2348
0.2968

0.0104

0.2185
0.0252
0.1687
0.2968
0.0104

Potential Genetic Implications

Observed in the graphlet analysis are speciﬁc orbit clusters that are preserved across the generated breast and lung cancer graphs. Understanding what these clusters represent in terms of
preserved oncological properties would be very interesting to explore in future studies. Now, in
order to verify that there is some sort of preserved genetic expression across age groups within
and between the investigated cancers, the frequency of census genes identiﬁed in those orbit
clusters must be investigated. Table 5.3 shows the number of census genes found in identiﬁed
BCG and LCG age groups.
Table 5.3: Gene Distributions Identiﬁed Age Groups of Breast and Lung Cancer Gene
Co-expression Networks. Gene Count versus Age Group
B REAST C ANCER

L UNG C ANCER

Age

Count

Age

Count

20–30
30–40
90–100

151
363
228

30–40
40–50
80–90

119
367
317

Following the common cluster groups identiﬁed between breast and lung age groups, the
distribution of the common orbit clusters (Orbit 10, Orbit 3) and (Orbit 5, Orbit 7, Orbit 2) are
subjected to a new metric that outlines which genes are observed to be predominantly present in
those orbits. For each given orbit a distribution of census genes have been found. Each of these
genes have associated with them a count or frequency in which they appear in a given orbit.
With this information, a search can be conducted to identify highly frequent genes in any given
orbit found to be integral to a graph’s structure. In reference to Figures 5.7 and 5.8, there is great
conﬁdence in knowing that situated in the 95% percentile orbit-gene frequency distribution are
census genes with the highest order of appearances. As seen in each orbit-gene distribution
graph (Figures 5.7 and 5.8), each common orbit from each age group, Kernel Density Estimate

34

(KDE) plots were generated to properly visualize and interpret the distribution of associated
census gene frequencies and the amount of said distribution that contains census genes that are
highly present in those orbits. Each orbit-gene distribution (OGD) graph (Figures 5.7 and 5.8)
contains the 95% percentile that acts a threshold to single out the highly-occuring genes. As
result of using a KDE, some of the x-axis representing the orbit frequencies show the curve
extending from negative x-values. There is no negative presence of any census gene in any
orbit. This is a result of the smoothing distortion when converting a data set into a distribution
curve, which is more visually understandable than using a conventional histogram. Using a
KDE does not destroy the data; it simply generates a continuous curve representing the data
set’s distribution. All KDE graphs were created using Python’s Seaborn kdeplot function.

35

Figure 5.1: BCG 20–30 age group (top) and LCG 30–40 age group (bottom) carcinoma
graphs using gene co-expression data at probabilistic thresholds equal to and greater than 0.3.
Each graph is represented from data gathered in ten year age groupings, all collected from the
data provided by COSMIC.
36

Figure 5.2: BCG 30–40 age group (top) and LCG 40–50 age group (bottom) carcinoma
graphs using gene co-expression data at probabilistic thresholds equal to and greater than 0.3.
Each graph is represented from data gathered in ten year age groupings, all collected from the
data provided by COSMIC.
37

Figure 5.3: BCG 90–100 age group (top) and LCG 80–90 age group (bottom) carcinoma
graphs using gene co-expression data at probabilistic thresholds equal to and greater than 0.3.
Each graph is represented from data gathered in ten year age groupings, all collected from the
data provided by COSMIC.
38

Figure 5.4: BCG 20–30 age group (top) and LCG 30–40 age group (bottom) orbit clustering
of corresponding graphlet correlation matrices (GCM) of both generated BCG (top) and lung
(bottom) graphs, as seen in Figure 5.1.

39

Figure 5.5: BCG 30–40 age (top) and LCG 40–50 age group (bottom) orbit clustering of
corresponding graphlet correlation matrices (GCM) of both generated BCG (top) and LCG
(bottom) graphs, as seen in Figure 5.2.

40

Figure 5.6: BCG 90–100 age group (top) and LCG 80–90 age group (bottom) orbit clustering
of corresponding graphlet correlation matrices (GCM) of both generated BCG (top) and LCG
(bottom) graphs, as seen in Figure 5.3.

41

Figure 5.7: Identiﬁed gene distribution per given orbits from census genes found
between-groups common orbit clusters between generated breast age group graphs. Breast
cancer age group graphs from left to right: 20–30, 30–40, and 90–100. Horizontal red marker
and arrow provides visual conﬁrmation of high-frequency gene found in 95% percentile of
respective kde orbit distributions.
42

Figure 5.8: Identiﬁed gene distribution per given orbits from census genes found
between-groups common orbit clusters between generated lung age group graphs. Lung
cancer age group graphs from left to right: 30–40, 40–50, and 80–90. Horizontal red marker
and arrow provides visual conﬁrmation of high-frequency gene found in 95% percentile of
respective kde orbit distributions.
43

Table 5.4: Identiﬁed Census Genes in Structurally Shared Orbits in Breast Cancer Networks.
Orbit 10
Genes
Counts
KMT2D*
19704
NOTCH2*
18843
PCM1*
18743
CREB1*
18691
PRDM2**
18634
CDH10**
18290
ZFHX3*
18046
CCR4**
18010

Orbit 10
Genes
Counts
AR*
10
EGFR*
9
IKBKB*
8
PPFIBP1*
8
TLX1*
8
HOXC11*
7
KIT*
7
PICALM*
8
SLC34A2*
7
ARID2*
6
BCOR*
6
CBLC*
6
CNTNAP2**
6
ERC1*
6
KMT2C*
6

Orbit 10
Genes
Counts
NSD1*
48985
ATRX*
47918
ATM*
47131
MTCP1*
46931
PBRM1*
46042
MED12*
45888
NF1*
45523
PRDM2**
45316
SPEN*
45307
USP8*
45272
KMT2A*
45150
ARID2*
44868

Breast Cancer 20 to 30 Age Group
Orbit 3
Orbit 5
Orbit 7
Genes
Counts
Genes
Counts
Genes
Counts
KMT2D*
1223 IL7R*
30035 KMT2D*
6083
NSD1*
1085 KMT2D*
29273 IL7R*
4249
CREB1*
1063 MAP2K4*
28496 CCR4**
4157
ZFHX3*
1013 CCR4**
28288 MAP2K4*
4140
NOTCH2*
1012 ATM*
28241 ATM*
4121
ARID2*
995 CDH10**
27150 PRDM2**
4084
PRDM2**
976 PRDM2**
26446 CDH10**
3522
KLF4*
940 PCM1*
26187 RNF43*
3516

Breast Cancer 30 to 40 Age Group
Orbit 3
Orbit 5
Orbit 7
Genes
Counts
Genes
Counts
Genes
Counts
AR*
2 MITF*
150 MITF*
56
HOXC11*
2 PAFAH1B2*
142 PAFAH1B2*
50
IKBKB*
2 NFKBIE*
97 NFKBIE*
30
MYH9*
85 GRIN2A*
20
EIF3E*
66 MAP3K13*
20
MAP3K13*
65 MYH9*
20
PICALM*
65 SGK1**
20
PRDM16*
62 WT1*
20
ERC1*
58 DDX6*
16
DDX3X*
56 POU5F1*
16
GRIN2A*
55 PRDM16*
16
WT1*
55 RSPO2*
16
DDX6*
53 HOXC11*
12
FBXW7*
52
MECOM*
52
PPFIBP1*
52
ERG*
51
SGK1**
50

Breast Cancer 90 to 100 Age Group
Orbit 3
Orbit 5
Orbit 7
Genes
Counts
Genes
Counts
Genes
Counts
ATRX*
1612 KMT2A*
89156 ACSL6**
12957
NSD1*
1566 MTCP1*
87707 USP8*
12050
ATM*
1544 ACSL6**
86446 MTCP1*
11616
PBRM1*
1532 USP8*
84033 KMT2A*
11330
SPEN*
1502 STAT3*
80351 STAT3*
10320
USP8*
1454 NSD1*
79379 NSD1*
10168
ARID2*
1453 LARP4B**
78582 ATM*
9820
PRDM2**
1441 MED12*
75942 MED12*
9553
MED12*
1439 MLLT10*
75510 MLLT10*
9340
SF3B1*
1426 ATM*
74065 LARP4B**
9261
MTCP1*
1420 NF1*
72151 LSM14A**
9076
LSM14A** 1403 CACNA1D* 70767 CACNA1D*
8702

Orbit 2
Genes
Counts
KMT2D*
1192
PRDM2**
915
ATM*
889
IL7R*
863
RNF43*
862
CCR4**
859
CREB1*
828
MAP2K4*
822

Orbit 2
Genes
Counts
MITF*
28
PAFAH1B2*
27
NFKBIE*
20
GRIN2A*
15
MAP3K13*
15
MYH9*
15
SGK1**
15
WT1*
15
DDX6*
14
POU5F1*
14
PRDM16*
14
RSPO2*
14
HOXC11*
13

Orbit 2
Genes
Counts
USP8*
1706
ACSL6**
1679
KMT2A*
1666
MTCP1*
1661
NSD1*
1594
STAT3*
1580
ATM*
1537
MED12*
1487
ATRX*
1469
LARP4B**
1455
CACNA1D*
1450
LSM14A**
1447

It should also be noted that each OGD graph in Figures 5.7 and 5.8 can be cross-referenced
with their respective cancer graphs in Figures 5.1–5.3. For example, both BCG 30–40 and LCG
80–90 have the smallest amount of node-edge connections, indicating low amount of census
gene co-expression similarities despite having some of the highest amount of genes, 363 and
317 genes respectively (Table 5.3). This is reﬂected in their respective OGD graphs wherein
the distributions are in a smaller range (x < 150 and x < 1200, respectively) in comparison to
other age groups (x < 40000). This is quite apparent when comparing higher-frequency genes

44

Table 5.5: Identiﬁed Census Genes in Structurally Shared Orbits in Lung Cancer Networks.
Orbit 10
Genes
Counts
TSC2*
12991
KCNJ5*
12695
SND1*
12649
CUX1*
12644
FAM131B**
12510
ARHGEF10** 12385

Lung Cancer 30 to 40 Age Group
Orbit 3
Orbit 5
Orbit 7
Genes
Counts
Genes
Counts
Genes
Counts
CHD2**
927 CHD2**
2120 MYH11*
16726
ARHGEF10**
911 TRAF7*
2055 PTPRT*
16726
TRAF7
831 AKAP9**
2044 PTPN13*
16637
AKAP9**
825 SETBP1*
2025 KCNJ5*
16489
KCNJ5*
779 ARHGEF10** 1974 SETBP1*
16334
PTPN13*
764
TRAF7*
15886

Orbit 10
Genes
Counts
MYO5A*
3898
PDGFRA*
3882
COL3A1**
3499
DDR2*
3090
ZEB1**
3088
BTK*
3024
NIN*
3011
FLI1*
2828
TCF12*
2591
IL7R*
2495
SFRP4*
2457
IKZF1*
2403
LCP1**
2306
POU2AF1*
2082
PPFIBP1*
2055
PTPRC*
1998
FBLN2**
1973
PRRX1*
1955
NOTCH2*
1757

Lung Cancer 40 to 50 Age Group
Orbit 3
Orbit 5
Orbit 7
Genes
Counts
Genes
Counts
Genes
Counts
MYO5A*
154 MYO5A*
19706 MYO5A*
7883
PDGFRA*
149 ZEB1**
14506 ZEB1**
5095
ZEB1**
132 NIN*
13917 NIN*
4504
COL3A1**
126 PDGFRA*
13434 PDGFRA*
3727
DDR2*
115 FLI1*
11944 IKZF1*
3386
NIN*
115 IKZF1*
10869 FLI1*
3190
BTK*
111 IL7R*
10720 IL7R*
2724
FLI1*
108 DDR2*
10690 DDR2*
2688
TCF12*
100 BTK*
10387 BTK*
2335
IKZF1*
98 COL3A1**
9924 TCF12*
2196
IL7R*
93 TCF12*
9665 COL3A1** 2138
SFRP4*
81 PRRX1*
9085 PRRX1*
2033
LCP1**
78 PPFIBP1*
8046 VAV1**
1798
POU2AF1*
78 PRDM1*
8027 PTPRC*
1776
FBLN2**
74 PTPRC*
7888 MAP3K1*
1738
PPFIBP1*
74 LCP1**
7852 PPFIBP1*
1722
PRRX1*
74 MAP3K1*
7498 PRDM1*
1655
PTPRC*
72 NOTCH2*
7496 CBFA2T3*
1439
VAV1**
69 VAV1**
7272 LCP1**
1429

Orbit 10
Genes
Counts
FANCA*
135
NFE2L2*
116
FAM131B**
110
CUL3**
109
APC*
103
POLQ*
103
RANBP2*
95
KMT2D*
89
RFWD3**
87
ELK4*
84
RNF213*
83
ZFHX3*
81
PTPRK*
74
HOXA13*
73
KMT2C*
71

Lung Cancer 80 to 90 Age Group
Orbit 3
Orbit 5
Orbit 7
Genes
Counts
Genes
Counts
Genes
Counts
CUL3**
10 ZFHX3*
284 ZFHX3*
1062
FANCA*
10 FOXO1*
192 FANCA*
933
RANBP2*
10 FANCA*
190 RANBP2*
775
FAM131B**
9 RANBP2*
188 FOXO1*
710
NFE2L2*
9 MLLT10*
183 MLLT10*
706
APC*
8 AFF4*
182 IDH1*
704
ELK4*
8 COL3A1**
174 TET2*
688
HOXA13*
8 HSP90AB1*
163 RFWD3**
661
POLQ*
8 IDH1*
148 CDC73*
654
EP300*
7 TET2*
147 ETNK1*
621
KMT2D*
7 CDC73*
141 CUL3**
615
RFWD3**
7 CUL3**
135 CBLB*
583
RNF213*
7 CBLB*
130 AFF4*
579
SMAD3*
7 POLE*
130 APC*
577
ZFHX3*
7 ETNK1*
115 POLQ*
569
TPR*
567

Orbit 2
Genes
Counts
CHD2**
726
ARHGEF10**
685
AKAP9**
660
TRAF7*
654
SETBP1*
636

Orbit 2
Genes
Counts
MYO5A*
792
ZEB1**
609
NIN*
551
PDGFRA*
517
IKZF1*
463
FLI1*
453
DDR2*
413
IL7R*
403
BTK*
385
COL3A1**
370
TCF12*
365
PRRX1*
332
VAV1**
309
PTPRC*
306
PPFIBP1*
304
MAP3K1*
294
PRDM1*
288
LCP1**
273
CBFA2T3*
265

Orbit 2
Genes
Counts
ZFHX3*
84
FANCA*
68
RANBP2*
68
FOXO1*
63
AFF4*
62
MLLT10*
62
COL3A1**
61
HSP90AB1*
60
CUL3**
56
IDH1*
53
TET2*
53
CDC73*
52
CBLB*
51
POLE*
51
ETNK1*
49

in Tables 5.4 and 5.5 between low-connectivity and high-connectivity graphs. For example,
BCG 30–40’s MITF* has a count of 150 in Orbit 5, which the highest appearing census gene in
the graph. In comparison, BCG 20–30’s MAP2K4* has a count of 822 in Orbit 2, which is the
lowest appearing census gene in the graph. Likewise, LCG 80–90’s ZFHX3* has a count of
1062 in Orbit 7, which the highest appearing census gene in the graph, but while LCG 40–50’s
VAV1** has a count of 69 in Orbit 3, which is the lowest appearing census gene, its highest

45

appearing gene is MYO5A* with 19,706 counts in Orbit 5 in the graph.
Following the investigation of the common orbits for high-frequency genes, all highfrequency genes have been identiﬁed in Tables 5.4 and 5.5 in order of their orbit appearances.
There are some observations noted from that table, namely that there appears to be certain
genes that are present in across all orbits and age groups of individual cancers:
• BCG 20–30:, KMT2D* and PRDM2** appears in orbits (Orbit 10, Orbit 3, Orbit 5,
Orbit 7, Orbit 2), CCR4** appears in only orbits (Orbit 10, Orbit 5, Orbit 7, Orbit 2).
Furthermore, there is more genetic variation across its orbits in comparison to other
graphs. As such, these census genes are all strongly presented in their respective orbits.
This is indicative of the high amounts of correlative GE between census genes for this
age group.
• BCG 30–40 age group, only HOXC11* appears in most orbits, orbits (Orbit 10, Orbit
3, Orbit 7, Orbit 2). For the remainder, they show weakly presented in their respective
orbits. There is also a lot of variation in census genes across all common orbits. This
shows that there are fewer amounts of similar correlative GE.
• For BCG 90–100 age group, NSD1*, MTCP1*, MED12*, and USP8* appear across all
orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 7, Orbit 2), while KMT2A* appears in only orbits
(Orbit 10, Orbit 5, Orbit 7, Orbit 2). This is an interesting ﬁnding as it shows that these
could be commonly associated genes for this graph. In terms of genetic variation, there
appears to be very little across some orbits. Again, another strong showing for the rest of
the census genes being present in their respective orbits with high amounts of correlative
GE between them.
• For LCG 30–40 age group, ARGHGEF10** and TRAF7* appear in only orbits (Orbit
10, Orbit 3, Orbit 5, Orbit 2) and (Orbit 3, Orbit 5, Orbit 7, Orbit 2), respectively. Regardles of just these two common genes, there isn’t a lot of genetic variation either, namely
due to a smaller amount of genes present in the graph. Again, another strong showing
for the rest of the census genes being present in their respective orbits with high amounts
of correlative gene expression between them.
• For LCG 40–50 age group, MYO5A*, PDGFRA*, COL3A1*, DDR2*, ZEB1*, BTK*,
NIN*, FLI1*, TCF12*, IL7R*, IKZF1*, LCP1**, PPFIBPI*, PRRX1*, BTK* and PTPRC* all appear across orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 7, Orbit 2). This is
another interesting ﬁnding that could very well showcase a common set of genes unique
for this graph. In addition, there is not a lot of genetic variation across orbits here.
• For LCG 80–90 age group, FANCA*, CUL3**, RANPB2*, and ZFHX3* appear across
all orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 7, Orbit 2). Very much like BCG 20–30, the

46

amount of genetic variation here is more than all other LCG graphs, showing that there
is a low amount of correlative GE across the graph. Despite this, there still is a strong
notion that a unique group of census genes has been associated with this graph.
Looking closer at the results from each age group, there are commonly occuring genes that
seem highly associated with other genes. Notable examples can be found in both BCG and
LCG graphs, namely KMT2D* found in 20–30 and MYO5A* found in 40–50, respectively:
• KTM2D: The KMT2D gene provides instructions for making an enzyme called lysine
(K)-speciﬁc methyltransferase 2D that functions as a histone methyltransferase [29]. Histone methyltransferases are complex enzymes that modify histones: structural proteins
that bind to DNA giving form to coiled structures called nucleosomes [19, 29]. Histones
are needed to both compact and protect DNA, preventing unnecessary damage and entanglement to the genetic material. Being coiled prevents DNA from gene expression
to occur as DNA transcription protein mechanism cannot bind to it. The histones must
release a section of DNA exposed in order for gene expression to occurs. KMT2D can
do so by methylating the H3 histone [19]; that is to say, adding a methyl group to the
H3 histone of the nucleosome complex [29]. It is thought that KMT2D has this ability
to regulate gene expression of certain genes as a result of its abilities, more speciﬁcally
in the role of a tumour suppressor gene assisting in the prevention of uncontrolled cellullar growth and proliferation. With this in mind, it can be hypothesized from the results
showcasing KMT2D with the highest degree of gene expression correlative association
that the regulation of this gene can affect the regulation of other genes.
• MYO5A: The MYO5A gene provides instructions for making an enzyme called myosin
Va that functions as an intracellular motor transport protein [15, 30]. Motor transport
proteins are by and large responsible for transporting/anchoring intracellular components
such as vesicles, organelles, and mRNA to various locations within the cell along actin
ﬁlaments, a major component of the cellular cytoskeletal structure [15, 18, 30]. Positioning and transportation of such components gives credence to a cell’s ability to not only
move crucial materials between organelles, but there is increased evidence to support
that it is involved in restructuring of a cell’s shape through the reorganizing of actin ﬁlaments [15, 18]. Being identiﬁed as a key oncogene in many cancers, including non-small
cell lung cancer and laryngeal squamous cell carcinoma, overexpression of MYO5A has
allowed for cellular proliferation and metastatis of cancerous cells [15, 18, 30]. With this
in mind, it can be suggested from the results showcasing MYO5A with the highest degree of gene expression correlative association that it could have an effect on regulating
expression of other genes to ensure metastasis occurs.

47

Overall, each and every age group has its own unique set of genes. This is especially
apparent in graphs LCG 40–50, LCG 80–90, and BCG 90–100. In future studies, more work
should be done to determine alternative methods to draw more meaningful metrics. Regardless,
this is a rather interesting investigation that could potentially be used to further unravel how
these genes are actually inﬂuencing each other and the cancer. Furthermore, these unique genes
have been found in the common orbits shared across all age groups. In terms of graph structure,
there must be some sort of similiarities in gene regulatory control.
Moving forward, there is promise in assessing the composition of census genes that are
present in higher abundance in common orbit clusters of differing graphs. This can be used
to further determine and isolate core groups of inﬂuential census genes each cancer age group
utilizes for growth and survival.

48

Chapter 6
Discussion and Overview
It is the ambition of this thesis to highlight the potential use of graphlet analysis to identify
and characterize cancer graphs in outlining involved genes and in comparing cancer genomic
characteristics. Providing a general overview, as was shown in this study, of a cancer’s composition of unique group of census genes could be beneﬁcial in furthering oncological research
focusing more on gene clustering based on their expression levels and control. It should be
made clear that this thesis project is very much exploratory. COSMIC has been an invaluable
resource for this project, allowing the comparison of not only different cancer types (breast
and lung) but with different age groups between them. However, there is a lot more to cancer
research than knowing a patient’s age. Following this thesis project, it would be beneﬁcial to
evaluate the observed census genes in terms of their hallmarks, if conﬁrmed and identiﬁed.
COSMIC provides the
(CGCHOC) data set
wherein curated census genes are updated with known cancer hallmarks. Investigating the cancer hallmarks of census genes in those common orbit clusters will provide an awareness of how
the generated cancer graphs are behaving in those particular age groups. COSMIC also provides additional histological context to cancer types, allowing to focus more on speciﬁc tissues
within larger organs and the differing stages of cancer development. This study did not fully
inspect the composition of census genes in orbits speciﬁc to graphs as the main objective was
to compare common internal structures between cancer graphs. It would be advantageous to
fully explore the unique orbit clusters of separate graphs to unearth potentialy genetic implications that could very well be individualistic to said graphs. Lastly, the co-expression graphs
generated from census gene expression data does not yield any information regarding how each
gene inﬂuences or distrupts the expression of another. This is easily represented with the use of
directed graphlet analysis where edges are weighted vectors to and from given nodes. Representing these vectors would be guided by oncological literature where ﬁndings of a census gene
is conﬁrmed to affect the expression levels of other census genes, mutated or not. Such a analysis would further enrich hidden structural dependencies by illustrating how that dependency

49

is initiated towards a cancerous state.
Another contribution this study provides is a preliminary graphlet algorithm that could see
future use in the NetworkX [12]. Currently, this algorithm is parallelized and could be easily
implemented and operated in computationally distributed environments to improve processing power for massive graphs. As such, this algorithm could potentially be integrated into
NetworkX’s growing list of graph features, metrics, and analysis. Future improvements of the
algorithm will require a revisit in data collection, processing, and deployed in a server database.
Currently, all used data sets need to be separately downloaded from COSMIC and store initial
graphlet analysis as Comma Delimited Values (CSV) ﬁles. Having the data be stored initialy
into a relationanl database management system such as PostgreSQL on a server would allow
for data streaming and serverless computing.
In this study, Dask dataframes were used to simply store all the raw, unﬁltered CSV ﬁles
as a Pandas’ DataFrame could not store over 40 gigabytes of data; this is due to RAM being
limited which a Pandas DataFrame uses, while a Dask DataFrame may live on any disk space
for larger-than-memory computing on a single machine. Filtered data from Dask DataFrame is
stored in a Pandas DataFrame, which improves further analysis. However from there leads to
a weak and core component of the software itself; it does not fully utilize data structures with
improved search performance such as Pandas’ DataFrame and relies mainly Python’s standard
dictionary data structure. To leverage the abilities of Dask and Pandas, Apache Spark would
be an appropriate replacement. Apache Spark, or Spark for short, is an open-source parallel
processing, multi-language framework that supports in-memory procesing to boost the data
analysis performance. Just like Pandas, Spark is able to achieve similar performance levels by
using RAM to store immediate data rather than on-disk memory. It is also able to run data
processing steps in parallelized batches across a cluster of computers, streaming data to and
from each machine. This would handle massive datasets with issue. Spark also has a graph
network tool called GraphX which is used to create, study, and modify graphs the same way
that NetworkX does. It would be worthwhile to explore the implementation of the proposed
algorithm with GraphX and its inherinted ability to handle massive amounts of data, resulting
in a step towards graph network analysis at an industry level.
There was no attempt to conduct false discovery experiments for this thesis. As mentioned
in the beginning of Chapter 2, the correctness of the graphlet Algorithm 1 was veriﬁed by
checking that it had similar results to Das et al.’s algorithm by computing their sample graph [1,
5]. Das et al. themselves conducted false discovery experiments to conﬁrm that their algorithm
did not generate clusters from nonsensical graph data. We were conﬁdent from our matching
results that Algorithm 1 would be able to provide biological signiﬁcance from analyzing the
BCGs and LCGs generated. However, future studies should investigate how well Algorithm 1
can conduct graphlet analysis with large and small graphs.

50

From identifying census gene, to matching ten’s of thousands of mRNA expression data,
and then generating co-expression graphs, dictionaries have been used almost exclusively due
to personal familiarity. Algorithmic performance did beneﬁt from the use of multicore/parallel processing, however it is not fully extended for certain aspects. Currently, parallelism is
extended to each graphlet, however this implementation did not take into account for massive
subgraph lists to be stored in memory. There have been cases when a list of subgraphs for
a particular graphlet analysis would be so large that its processing runtime would exceed the
runtime of other subgraphs lists. Further reﬁning the multiprocessing capabilities to accomodate maximizing available core resources would further reduce the software’s overall runtime.
Implementing the ability to have a single subgraph list to be shared amongst all graphlet processes running in parallel would help not only reduce the time needed for NetworkX to generate
a new list for every process, but also reduce the memory footprint by simply having a single
list stored in RAM. Lastly, another computational issue arose when exploring up to 5-node
graphlets. This is why graphlets G9–G29 have been omitted from this research venture as the
analysis of potential subgraphs resulted in extended periods of server runtime that lead to the
server seizing up. This could very well be a limitation to the power of graphlet analysis, hence
why future research lead by Pržulj et al. [22] and Das et al. will deal with minimizing orbit
redundancies and using directed graphlet analysis algorithms for improving the ﬁeld of graph
analysis [1, 5].

51

ni : Node of a subgraph
n j : Neighbouring node(s) to ni
nk : Neighbouring node(s) to n j
nl : Node of a graphlet
(ni ,n j ) : edge connecting ni and its neighbour n j
E : Number of edges between neighbours n j to ni
NN : Number of neighbours ni to n j
NNE : Number of edges neighbours nk to n j
NNN : Number of neighbours nk to n j
Ggraphlet : Graphlet of G
Gsubgraph : Subgraph of G
GOM : Graphlet Orbit Matrix
orbitchecksum: Orbit Checksum
nx: NetworkX package object
INPUT: list(Gsubgraph ), dict(Ggraphlet )
2 OUTPUT: GOM for G
3 Shared Array: GOM;
4 for each Ggraphlet ∈ dict(Ggraphlet ) do
IdentifyGraphlets(Ggraphlet , list(Gsubgraph ), nx)
in parallel
5
end
Algorithm 4: Revised graphlet and orbit counting algorithm: Part I.
1

52

Procedure IdentifyGraphlets(Ggraphlet , list(Gsubgraph ), nx)
for Gsubgraph ∈ list(Gsubgraph ) do
2
if nx.connected (Gsubgraph ) AND nx.isomorphic(Gsubgraph , Ggraphlet ) then
3
ValidateOrbitChecksums(Gsubgraph , Ggraphlet )
4
end
end
end

1

Algorithm 5: Revised graphlet and orbit counting algorithm, Part II [1].
Procedure ValidateOrbitChecksums(Gsubgraph , Ggraphlet )
for ni ∈ Gsubgraph do
E ← len(subgraph.edges(node)))
3
NN ← len(subgraph[node])
4
5
NNE ← 0
NNN ← 0
6
for n j ∈ Gsubgraph do
7
NNE ← NNE + |g.edges(n j )|
8
end
g[ni ][orbitchecksum] ← E + (NNE − 1)
9
for n ∈ Ggraphlet do
10
11
if ni [orbitchecksum] == n [orbitchecksum] then
[ni ][orbit ] ← n [orbit ]
12
end
end
end
end
1

2

Algorithm 6: Revised graphlet and orbit counting algorithm: Part II.

53

Bibliography
[1] Das A., Efﬁcient enumeration of small graphlets and orbits., Master’s thesis, University
of Northern British Columbia, Prince George, British Columbia, Canada, 2020.
[2] F. Collins and A. Barker, Mapping the cancer genome. pinpointing the genes involved in
cancer will help chart a new course across the complex landscape of human malignancies,
Scientiﬁc American 296 (2007), 50–57.
[3] A.I. Compos and J.A. Freye-Gonzalez, Evolutionary constraints on the complexity of
genetic regulatory networks allow predictions of the total number of genetic interactions.,
Sci Rep 9 (2019), 3618.
[4] M.R.T. Dale, Applying graph theory in ecological research, Cambridge University Press,
2017.
[5] A. Das, A. Aravind, and M. Dale, Algorithm and application for signed graphlets, (2019),
613–620.
[6] A. Das, A. Aravind, and G.S. Deo, Signed graphlets based gene expression analysis,
(2019), 2055–2062.
[7] A. Das, M. Drakos, A. Aravind, and D. Horning, Water governance network analysis
using graphlet mining, (2020), 633–640.
[8] D. Davis et al., Topology-function conservation in protein-protein interaction networks.,
Bioinformatics 31 (2015), 1632–1639.
[9] S. Doria-Belenguer, M.K. Youssef, N. Böttcher R., Malod-Dognin, and N. Pržulj, Probabilistic graphlets capture biological function in probabilistic molecular networks, Bioinformatics 36 (2020), no. Supplement 2, i804–i812.
[10] S.A. Forbes, D. Beare, H. Boutselakis, S. Bamford, N. Bindal, J. Tate, C.G. Cole, S. Ward,
L. Dawson E, Ponting, R. Stefancsik, B. Harsha, C.Y. Kok, M. Jia, H. Jubb, Z. Sondka,
S. Thompson, T. De, and P.J. Campbell, COSMIC: somatic cancer genetics at highresolution, Nucleic Acids Research 45 (2016), no. D1, D777–D783.
[11] J-L. Gailly, J. Meyering, and P. Eggert, Gnu operating system: Gnu gzip., 2004.
[12] A. Hagberg, D. Schult, and P. Swart, Networkx: Network analysis in python., 2004.

54

[13] D. Hanahan and R.A. Weinberg, Hallmarks of cancer: the next generation., 2011 144
(2011), 646–674.
[14] T.J. Hudson, W. Anderson, z A. Arte, A.D. Barker, C. Bell, R.R. Bernabe, M.K. Bhan,
F. Calvo, I. Eerola, et al., International network of cancer genome projects., Nature 464
(2010), 993–998.
[15] L. Lan, H. Han, H. Zuo, Z. Chen, Y. Du, W. Zhao, J. Gu, and Z. Zhang, Upregulation of
myosin va by snail is involved in cancer cellmigration and metastasis., Int. J. Cancer 126
(2010).
[16] P. Langﬁelder and S. Horvath, Wgcna: an r package for weighted correlation network
analysis., BMC Bioinformatics 559 (2008).
[17] T. Leighton and M. van Dijk, Massachusetts institute of technology opencouseware
6.042j/18.062j: Mathematics for computer science, https://ocw.mit.edu/, 2010, License:
Creative Commons BY-NC-SA.
[18] X. Liang, Z. Wu, S. Shen, Y. Niu, Y. Guo, J. Liang, and W. Guo, Linc01980 facilitates
esophageal squamous cell carcinoma progression via regulation of mir-190a-5p/myo5a
pathway, Archives of Biochemistry and Biophysics 686 (2020).
[19] L. Mariño-Ramírez, M.G Kann, B.A. Shoemaker, and D. Landsman, Histone structure
and nucleosome stability., Expert Rev Proteomics 5 (2005), 719–729.
[20] T. Milenkovic et al., Uncovering biological network function via graphlet degree signatures., Cancer Informatics 8 (2008), 257–273.
[21] M. Mina, F. Raynaud, D. Tavernari, E. Battistello, S. Sungalee, S. Saghaﬁnia, T. Laessle,
F. Sanchez-Vega, N. Schultz, E. Oricchio, and G. Ciriello, Conditional selection of genomic alterations dictates cancer evolution and oncogenic dependencies, Cancer Cell 32
(2017), no. 2, 155–168.e6.
[22] N. Pržulj, Biological network comparison using graphlet degree distribution, Bioinformatics 23 (2007), no. 2, e177–e183.
[23] Uddin R. and S.M. Singh, Gene network construction from microarray data identiﬁes a
key network module and several candidate hub genes in age-associated spatial learning
impairment., Front. Syst. Neurosci. 75 (2017).
[24] N. Shervashidze, S.V.N. Vishwanathan, T.H. Petri, K. Mehllorn, and K.M. Borgwardt,
Efﬁcient graphlet kernels for large graph comparison., Journal of Machine Learning Research 5 (2009), 488–495.
[25] H.A. Shihab, M.F. Rogers, J. Gough, M. Mort, D.N. Cooper, I.N.M. Day, T.R. Gaunt, and
C. Campbell, An integrative approach to predicting the functional effects of non-coding
and coding sequence variation, Bioinformatics 31 (2015), no. 10, 1536–1543.

55

[26] Z. Sondka, S. Bamford, C.G. Cole, S.A. Ward, I. Dunham, and S.A. Forbes, The cosmic
cancer gene census: describing genetic dysfunction across all human cancers., Nat. Rev.
Cancer 18 (2018), 696–705.
[27] Y. Sun, X. Liang, and K. Loparo, A common gene expression signature analysis method
for multiple types of cancer, 2019, pp. 200–211.
[28] J.G. Tate, S. Bamford, H.C. Jubb, Z. Sondka, D.M. Beare, l N. Binda, H. Boutselakis,
and C.G. Cole, Cosmic: the catalogue of somatic mutations in cancer, Nucleic Acids
Research 47 (2018), no. D1, D941–D947.
[29] E. Tonska, H.U. Osmanbeyoglu, P. Castel, C. Chan, Hendrickson R.C., M. Elkabets,
M.N. Dickler, M. Scaltriti, C.S. Leslie, S.A. Armstrong, and J. Baselga, Pi3k pathway
regulates er-dependent transcription in breast cancer through the epigenetic regulator
kmt2d, Science 355 (2017), 1324–1330.
[30] H. Velvarska and D. Niessing, Structural insights into the globular tails of the human type
v myosins myo5a, myo5b, and myo5c., PLoS One. 8 (2013).
[31] B. Vogelstein, N. Papadopoulos, V.E. Velculescu, S. Zhou, L.A. Diaz, and K.W. Kinzler,
Cancer genome landscapes, Science 339 (2013), no. 6127, 1546–1558.
[32] B.A. Weir, M.S. Woo, G. Getz, Perner S., R. Beroukhim, W.M. Lin, M.A. Province,
A. Kraja, et al., Characterizing the cancer genome in lung adenocarcinoma., Nature 450
(2007), 941–947.
[33] Wellcome Sanger Institute, The catalogue of somatic mutations in cancer.
[34] M.C. White, D.M. Holman, J.E. Boehm, L.A. Peipins, M. Grossman, and S.J. Henley,
Age and cancer risk: a potentially modiﬁable relationship., Am. J. Prev. Med. 46 (2014).
[35] L. Yin, Z. Cai, B. Zhu, and C. Xu, Identiﬁcation of key pathways and genes in the dynamic
progression of hcc based on wgcna, Genes 9 (2018), no. 2.

56

Appendix A
Appendix
A.1

Figures and Diagrams

Some of the ﬁgures found in Chapters 5 are sized in such a way that does not make the data
easily legible. They are resized and placed individually in the appendix to better read the data
they are presenting.

57

Figure A.1: Full-sized graph of 20–30 age group breast carcinoma (see Figure 5.1).

58

Figure A.2: Full-sized graph of 30–40 age group lung carcinoma(see Figure 5.1).

59

Figure A.3: Full-sized graph of 30–40 age group breast carcinoma (see Figure 5.2).

60

Figure A.4: Full-sized graph of 40–50 age group lung carcinoma (see Figure 5.2).

61

Figure A.5: Full-sized graph of 90–100 age group breast carcinoma (see Figure 5.3).

62

Figure A.6: Full-sized graph of 80–90 age group lung carcinoma (see Figure 5.3).

63

Figure A.7: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 20–30
age group breast cancer data (see Figure 5.4).

64

Figure A.8: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 30–40
age group lung cancer data (see Figure 5.4).

65

Figure A.9: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 30–40
age group breast cancer data (see Figure 5.5).

66

Figure A.10: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 40–50
age group lung cancer data (see Figure 5.5).

67

Figure A.11: Full-sized cluster map of the graphlet correlation matrices (GCM) for the
90–100 age group breast cancer data (see Figure 5.6).

68

Figure A.12: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 80–90
age group lung cancer data (see Figure 5.6).

69