EXPLORING HOW GRAPHLET ANALYSIS CAN BE USED TO IDENTIFY HIGHLY-CONNECTED CANCER DRIVING GENES by Alexander Spyridon Douglas B.Sc. H., University of Northern British Columbia, 2017 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN INTERDISCIPLINARY STUDIES UNIVERSITY OF NORTHERN BRITISH COLUMBIA December 2022 © Alexander Spyridon Douglas, 2022 In Appreciation to My Family and Friends Acknowledgement Firstly, I would like to express my deepest gratitude and respect to Dr. Alex Aravind. Without his help I would have not been able to pursue a graduate degree that I found worthwhile. I would like to express my sincerest gratitude to Dr. David Casperson, Dr. Chow Lee, and Dr. Liang Chen for assisting me achieve my masters degree and for fostering a healthy pursuit of scientific knowledge and academic integrity. I am extremely indebted to the Chairperson, the Department of Interdisciplinary Studies, the University of Northern British Columbia (UNBC), and Prince George, for providing all the necessary resources to complete my degree. Secondly, I am indebted to my labmates Conan Veitch, Dylan Fossl, Colton Jensen, Hunter Graeme, Furkhana Khan, and Courtney Lawrence for their support, discussion, encouragement, and pleasant company. I would also like to extend a very heartfelt thanks to the Geoffrey R. Weller Library and UNBC Security staff members for their support and friendship throughout all the years I’ve worked at UNBC. Thirdly, it is with great pleasure to acknowledge the good company of Father Theodore Gove and Paul English, both of whom were left positive influences towards my life. And finally, I want to express my deepest gratitude to my family and friends for their love, support and encouragement throughout the course of my time in graduate school. I would like to give a very special thanks to my mother, Dr. Vasiliki Kravariotis-Douglas, and my sister, Ms. Penny Hance, for their combined, immeasurable academic support and mastery to help shape this thesis and for instilling me with the courage to strive for greater things. Prince George, BC December 9th, 2022 (Alexander Douglas) iii Abstract One of the driving force that pushes a normal cell towards a cancerous state is led by unregulated control of its cellular division through the suppression of tumour suppressor genes and over expression of oncogenes. These cancer-driver genes have been associated in the advancement of different cancer types, though how they are represented and with which genes they are associated within the cancer’s gene expression profiles is a boon towards understanding of said cancer’s development. An novel way of modelling these gene expression profiles is to use graphlet-based network analysis, a data mining technique that allows for identification, understanding, and prediction of their functionality, emergent properties, and potential controllability. The thesis aims to identify patterns in gene co-expression networks of shared cancer census genes found in breast cancer and lung cancer using cancer gene census data provided by COSMIC, the Catalogue of Somatic Mutations in Cancer iv TABLE OF CONTENTS Acknowledgement iii Abstract iv Table of Contents v List of Tables vii List of Figures viii 1 Introduction 1 2 Graphlet Analysis Metrics and Algorithm 6 2.1 Graph Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Isomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.2 Automorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.3 Orbits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.4 Subgraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Graphlets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Algorithm Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5 Algorithm Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 COSMIC Database 3.1 COSMIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Cancer Gene Census . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Comparative Gene Expression . . . . . . . . . . . . . . . . . . . . . . 18 18 18 21 4 Graph Network Creation 4.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.2 Proposed Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Gene Expression Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Gene Expression Matrices . . . . . . . . . . . . . . . . . . . . . . . . 24 24 24 27 29 29 v 5 Cancer Network Results 5.1 Cancer Graph Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.1 Structural Network Observations . . . . . . . . . . . . . . . . . . . . . 5.1.2 Potential Genetic Implications . . . . . . . . . . . . . . . . . . . . . . 31 31 31 34 6 Discussion and Overview 49 Bibliography 53 A Appendix 57 A.1 Figures and Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 vi LIST OF TABLES 2.1 Orbit checksums of each orbit found in 3-, 4-, and 5-node graphlets. . . . . . 13 5.1 Orbit Clusters Found in Identified Age Groups of Breast and Lung Cancer Gene Co-expression Networks. . . . . . . . . . . . . . . . . . . . . . . . . . 32 Graphlet Correlation Distance between in Identified Age Groups of Breast and Lung Cancer Gene Co-expression Networks. . . . . . . . . . . . . . . . . 34 Gene Distributions Identified Age Groups of Breast and Lung Cancer Gene Co-expression Networks. Gene Count versus Age Group . . . . . . . . . . . 34 Identified Census Genes in Structurally Shared Orbits in Breast Cancer Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Identified Census Genes in Structurally Shared Orbits in Lung Cancer Networks. 45 5.2 5.3 5.4 5.5 vii LIST OF FIGURES 2.1 2.3 Visual representation of graph isomorphism between two graphs representing graphlet G7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Visual representation of graph automorphism between two graphs representing graphlet G7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Automorphism orbits for the twenty-nine 3-, 4-, and 5-node graphlets . . . . . 12 3.1 3.2 3.3 Tier 1 and Tier 2 of the Cancer Gene Census . . . . . . . . . . . . . . . . . . 20 The Cancer Gene Census curation steps undertaken by COSMIC. . . . . . . . 22 Overview of the software’s operational steps . . . . . . . . . . . . . . . . . . 23 4.1 Steps to create cancer gene expression profile (CGEP) graphs . . . . . . . . . 28 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 Breast 20–30 age group and lung 30–40 age group carcinoma graphs . . . . . Breast 30–40 age group and lung 40–50 age group carcinoma graphs . . . . . Breast 90–100 age group and lung 80–90 age group carcinoma graphs . . . . Breast 20–30 age group and lung 30–40 age group orbit clustermap . . . . . . Breast 30–40 age group and lung 40–50 age group orbit clustermap . . . . . . Breast 90–100 age group and lung 80–90 age group orbit clustermap . . . . . Identified breast cancer gene distribution per given orbits. . . . . . . . . . . . Identified lung cancer gene distribution per given orbits. . . . . . . . . . . . . 36 37 38 39 40 41 42 43 A.1 A.2 A.3 A.4 A.5 A.6 A.7 A.8 A.9 A.10 A.11 A.12 Full-sized graph of 20–30 age group breast carcinoma (see Figure 5.1). . . . . Full-sized graph of 30–40 age group lung carcinoma(see Figure 5.1). . . . . . Full-sized graph of 30–40 age group breast carcinoma (see Figure 5.2). . . . . Full-sized graph of 40–50 age group lung carcinoma (see Figure 5.2). . . . . . Full-sized graph of 90–100 age group breast carcinoma (see Figure 5.3). . . . Full-sized graph of 80–90 age group lung carcinoma (see Figure 5.3). . . . . . Full-sized cluster map for the breast cancer 20–30 age group (see Figure 5.4). Full-sized cluster map for the lung cancer 30–40 age group (see Figure 5.4). . Full-sized cluster map for the breast cancer 30–40 age group (see Figure 5.5). Full-sized cluster map for the lung cancer 40–50 age group (see Figure 5.5). . Full-sized cluster map for the breast cancer 90–100 age group (see Figure 5.6). Full-sized cluster map for the lung cancer 80–90 age group (see Figure 5.6). . 58 59 60 61 62 63 64 65 66 67 68 69 2.2 viii Chapter 1 Introduction Cancer is one of the most prevalent, deadly, and challenging diseases worldwide. It is an umbrella term for many diseases whose pathological manifestations are a result of abnormal cell growth invading the rest of the human body. To properly understand how these manifestations are realized, a very effective approach is to collect an overwhelming amount of gene expression (GE) data from clinical samples and risk groups that have identified cancer-driving genes. Genes are generally considered as such when their involvement is observed in the development of specific cancer types [28], however there is no single gene attributed in tumour growth and proliferation, but rather a broad network of co-dependent genetic abnormalities [21, 27, 28]. Gene expression (GE) signatures are reliable indicators that can be used to draw meaningful analysis of cancerous genomic profiling and genetic abnormalities, which can then be used to serve as surrogates for cancer phenotypes such as resisting cell death, inducing angiogenesis, enabling replicative immortality, activating invasion and metastasis, evading growth suppressors, and sustaining proliferative signaling [13, 21, 27]. Therefore, it is reasonable to hypothesize that computational analysis of rich cancer GE datasets would reveal common graph signatures across cancer types. The technique of identifying patterns of mRNA expression has been used to determine genetic occurences of investigated biological traits. This is true for both normal and cancerous cellular functionality wherein biological pathways operate in tandem to ensure cellular operations are functional. In the case of a cancerous cell, genomic alterations can lead to abnormal GE levels, signatures that can be used to serve as surrogates of cancer phenotypes [21, 27]. As such, identifying key census genes associations patterns from cancer transcriptomic signatures could help with defining and understanding them. These signatures have been used in clinical settings where they serve as an assessment standard to facilitate the interpretation of cancer laboratory test results, simplifying cancer laboratory protocols, and reducing associated medical operation costs [13, 21, 27, 32]. In a research setting, signatures have shown to help in the observation of biological mechanisms and in identifying possible drug targets for future cancer 1 treatments [13, 21, 27, 28, 32]. The ongoing mission to accumulate GE signatures into data sets from many different cancer types provides an excellent opportunity for thorough computational data mining in search of such common patterns. A method to reveal meaningful signatures, structural workings, and genetic relationships from cancer census GE data is to create a graph wherein genes are interconnected through comparable levels of expression. A method of analysis to extract such properties is to analyze its local heuristics, or topology, by using graphlet analysis. This method of analysis, as the name implies, utilizes graphlets, small, induced, and connected graphs developed for the sole purpose to effectively characterize the interconnectivity of nodes of larger and more complex graphs [22, 24]. As graphs composed entirely of nodes interconnected with edges, graphlets inherently capture these node into topological positions, or automorphic orbits, as per their relationship to every other node with an undirected or directed edge. Graphlets are developed to compare appropriately large and highly connected biological graphs, as they provide local heuristic relationships and highlight structural patterns and are computationaly feasible [1, 5–7, 22, 24]. Graphlet analysis and metrics have been widely used on biological graphs for comparative analysis [9], to uncover protein-protein interactions [9], and to relate genes with their functional organization [1, 5–9, 20, 22]. The use of GE data is a very common tool to identify potential co-dependent gene pathways that a cell line, mutated or healthy, needs to control regulation for survival and function. Collecting and analyzing GE data to identify bio-markers for clinical applications such as diagnosis, prognosis, and/or prediction of suitable treatment has been widely practiced. Such methods provide impressive amounts of insight towards understanding biochemical and molecular mechanisms in human diseases. The most common method is to use Weighted Gene Co-expression Network Analysis (WGCNA), a popular R package to assists researchers and oncologists to identify clusters of co-expressed genes in GE data [1, 5–7, 16, 23, 35]. Its data mining capabilities has been an effective method in detecting co-expressed modules and hub genes [16, 23, 35]. In one study, Uddin et al. wanted to characterize the such pathways of ageassociated spatial learning impairment (ASLI), a human disease wherein many suffer from a decrease in normal brain functions, including spatial learning impairments, as they age [23]. In their study, they used a mathematical modeling approach similar to this study where large scale mRNA (level 3) GE datasets were created through the use of WGCNA graphs using mathematical modelling. The use of WGCNA allowed them to conduct differential analysis between young (learning unimpaired) and aged (predominantly learning impaired) rat brains in the context of ASLI [23]. Uddin et al.’s analysis have identified several reproducible and significant graph modules with genes functioning in specific biological functional categories including a “learning and memory” specific module containing many potential key ASLI hub genes, some of which were also identified (but not prioritized) in the meta-analysis [23]. Many 2 of these co-expressed candidate hub genes not only show differential co-expression between young and aged graphs, but are also reproducible in independent datasets. Uddin et al. are confident that in grouping three distinctive categories that candidate ASLI hub genes should be prioritized intoF follow-up research will help understand their potential molecular mechanisms underlying complex behavioral traits (such as cognitive impairments including ASLI), memory formation and synaptic plasticity [23]: 1. Hub genes whose role in learning (including spatial learning) is more transparent than others (CAMK1G, DLG3, MAPK1, PPP2R2C, PRKACB) 2. Hub genes where there is not enough information in the literature to support which direction their expression pattern contributes to the ASLI phenotype (CDK5R1, CNTN1, SCN2B, STXBP1, EIF5, GABRG1) 3. Hub genes where information is emerging indicating their direct or indirect role in learning and memory (PTEN, KCNAB2, MAPRE1, NDFIP1, RASGRP1, DPP6) For Yin et al., their use of WGCNA took a novel approach: understanding the mechanism underlying the progression of Hepatocellular carcinoma (HCC) through the use of gene cluster monitoring [35]. Yin et al. were able to provide insight into HCC’s CGEP dynamic changes by conducting a stepwise carcinogenic process, starting from pre-neoplastic lesions to HCC including normal, cirrhosis without HCC, cirrhosis, low-grade dysplastic, high-grade dysplastic, very early HCC, early HCC, advanced HCC and very advanced HCC stages [35]. Through the use of WGCNA for constructing gene co-expression graphs, they were able to narrow down five distinct genes related to different HCC stages with the help of Ingenuity Pathway Analysis (IPA) [35]. From those modules, hub genes are identified and tiered in stages related to HCC progression: 1. Normal to Cirrhosis: Energy metabolism changed significantly, leading to decreased oxidative metabolism (MUT, AZGP1, HBB, HBA1, HBA2, HBD, SUCLA2, ACADM, UQCRC2) . 2. Cirrhosis to Early Stage: Development of hepatic fibrosis and hepatic stellate cell activition (KBP1A, ARPC4, HSP90AB1, ENO1). 3. Early to Very Advanced Stage: destablize chromosomal/cell cycle control and DNA damage occurs, worsening with HCC progression (GINS1, NEK2, BUB1B, KIF11, TOP2A) Recent work done by Das et al. was done to prioritize performance and accuracy of attaining meaningful information from GE data using signed graphlet analysis with their newly 3 developed triangle counting algorithm [1]. Their study is aimed in providing additional graph information that could not be attained using WGCNA, wanting to capture a graphs’ structural interaction and topology with the use of graphlets and their ensuing orbits [1,5,6]. To ensure the correctness of their proposed computational technique, Das et al. were able to reduce publically accessible gastric cancer GE data into graphs for Graphlet Correlation Matrices (GCMs) [1]. The generated GCMs were then converted to simple heatmaps that observed patterns and differences found in the GE data, most notably [1, 5–7]: 1. Correlation intensity between investigated genes in a healthy cell line is much higher than the cancerous cell lines, suggesting higher amounts of gene regulation in healthy cell lines than in cancerous ones (signed graphs shows differences in GE regulatory control between these cell lines consistent in gene regulation of healthy and cancerous cells found in literature). 2. Orbit clustering identified in graphlet analyis suggests genomic lineage from gastric cell lines is preserved in derived healthy and cancerous cell lines (Orbits Orbit 0, Orbit 2, Orbit 3, Orbit 4, Orbit 9 and Orbits Orbit 1, Orbit 5, Orbit 6, Orbit 7, Orbit 8, Orbit 12, Orbit 13) 3. Graphlet and orbit patterns between graphs of 22,283 raw-data gene probes and 612 identified dysregulated gene probes appear identical (suggests that GE graph structure remains the same regardless of GE gene sample size) For the purpose of this study, all cancer co-expression graphs are generated based on probabilistic, unidirected interactions between two genes based on their correlative strength in mRNA expression levels. Each edge is assigned a weight, or correlative threshold, signifying the strength of potential interaction between associated genes. The frequency of certain graphlets and their automorphic orbits within each weighted graph provides an overview of descriptive and structural properties. These properties can then be identified and used to determine the complexity in node interactions. Three such properties, Graphlet Orbit Matrix (GOM), Graphlet Node Distribution (GND), and Graphlet Correlation Matrix (GCM), are used to compare and contrast generated cancer census gene graphs [1, 5–7, 22]. Both GOM and GND contain the frequency of each node participating in differing orbits and differing graphlets, respectively. Arguably the most important of the three properties, GCM identifies the dependencies of each orbit by compounding orbit participation of nodes found in GOM and GND. This provides a topological richness of nodes association and engagement in a graph’ structure and dynamics, and as such can be used for comparison with other graphs [1, 5–7]. To reiterate, graphlet analysis is used for exposing local graph topologies to highlight node interactions and functionality within a graph. Counting frequencies of graphlets and their re- 4 spective orbits maps out node connectivity, often manifesting significant structural information [1, 5, 6, 9]. Undirected graphs are widely used due to their simplicity but are limited by not representing edge direction. Having directionality provides critical information for understanding the power and influence of a single node in the graph. Networks, or graphs with directed edges, would be better used to represent the complexity of transcriptional regulation genetic components within organisms. Genetic Regulatory Networks (GRN) are such examples, highlighting genes promoting or inhibiting gene expression, which demands a particular architecture of governing circuitry shaped through generations of evolution [3]. Reconstructing GRNs for cancer genomes is a more tedious and complex affair compared to reconstruction of stable evolutionary genomes [3]. Therefore, it would be fair to assume that cancer genomics should be first understood through the use of graphs in order to determine the presence of census gene groupings before attempting to understand how each census gene influences the other’s expression. In this thesis, gene co-expression graphs are created from multiple cancers using rich data sets from the Catalogue of Somatic Mutations in Cancer (COSMIC) and applying graphlet analysis to observe potential gene clusterings and graph similarities.The cancer types, Lung Carcinoma (LC) and Breast Carcinoma (BC), are explored together. The purpose of this study is to explore observing gene co-expression patterns across multi-cancers, identifying commonly related structural graph clustering, and identifying census genes partaking in those clusters. An initial design and implementation of a new graphlet analysis alogirthm written in Python is proposed, followed by an overview of COSMIC, the collecting, filtering, and transforming of its archived LC and BC into generate respective graphs. From there, graphlet metrics are collected using the proposed algorithm from which frequently associated cancer census genes are highlighted. 5 Chapter 2 Graphlet Analysis Metrics and Algorithm Das et al. were able to develop a novel triangle counting algorithm for 3-node graphlet analysis of node orbits [1, 5–7]. They were not, however, interested in using higher-level graphlets such as 4- or 5-node graphlets. Das et al.’s pursuit in their endeavour generated a lot of interest in expanding the field of graphlet analysis. In particular, the use of 4- and 5-node graphlets to conduct analysis of highly-connected node-to-node relationships would prove useful in identifying gene clusters of similar GE [22]. As such, there is a gap for this thesis to address through achieving graph complexity measurements of the higher-level graphlets by developing an algorithm that would be able to do so. It should be noted that the development of this algorithm is not the main research goal. That would be the analysis of census gene expression profiles (CGEPs) from two of the most researched cancers in the world in an attempt to identify the most prevalent census genes that appear in their respective profiles. It should be made clear that the algorithm developed for this thesis is correct, confirming the same exact graphlet metrics found by Das et al. by using their sample graph for analysis [1, 5, 6]. It was the hope of this thesis to expand up to 5node graphlet analysis, allowing for increased sensitivity and detailed composition of large and complex graphs, however we were limited due to computation resources to complete 5-node graphlet analysis. 2.1 Graph Terminology In graph theory terminology, a graph G is composed of a set of points termed nodes N, or vertices, from which a set of links, termed edges E, extend to connect other nodes [4]. Each edge of G connects two distinct nodes u and v. A representation of this definition is reflected in (2.1). In graph theory terminology, a simple graph G is composed of points termed nodes, or vertices, from which edges E extend to connect other nodes. Mathematically speaking, G is 6 composed of two sets of elements; one is a set N whose members are nodes, and the other is a set E of two-node subsets of N whose members are edge [17]. In short, G = (N, E ) is a graph such that: E ⊆ N × N, (u, v) ∈ E ⇒ u = v and (v, u) ∈ E (2.1) As discussed previously, the intention of constructing a new algorithm was to tackle 4- and 5-node undirected graphlet metrics from large graphs. In order to do this, we must first be able develop an algorithm to identify orbits regardless of the node size of a graphlet. As such, the main objective for this algorithm is to prioritize correctness over performance, ensuring that it can not only identify orbits from x-node graphlets but to also identify automorphic orbits of subgraphs isomorphic to x-node graphlets. Figure 2.1: Visual representation of graph isomorphism between two graphs representing graphlet G7 7 2.1.1 Isomorphism Isomorphism preserves the connective properties between nodes in compared graphs despite the nodes being labelled differently (See Figure 2.1). More precisely, graph X and graph Y are isomorphic to each other if there is a bijective function f : NX → NY (2.2) between their nodes such that for every pair of nodes in NX : (u, v) ∈ EX if and only if ( f (u), f (v)) ∈ EY . (2.3) Figure 2.2: Visual representation of graph automorphism between two graphs representing graphlet G7. 8 2.1.2 Automorphism An automorphism is an isomorphism between a graph and itself (See Figure 2.2). More specifically, if G = (N, E ), a bijective function f : N → N is an automorphism when, for all (u, v) ∈ N × N, (u, v) ∈ E if and only if ( f (u), f (v)) ∈ E. (2.4) 2.1.3 Orbits For a graph G, let Aut G be the collection of all automorphisms of G. For x a node of G, the collection { f (x) | f ∈ Aut G} is called the orbit of x. Every node of G belongs to exactly one orbit, that is the orbits partition the nodes of G. 2.1.4 Subgraphs A subgraph H of a graph G consists of a subset of the nodes and edges of G that themselves form a graph. That is, if H = (NH , EH ) and G = (NG , EG ) then NH ⊆ NG and EH ⊆ EG . Note that since a subgraph is itself a graph, the endpoints of any edge in a subgraph must also be in the subgraph [17]. Equationally, EH ⊆ NH × NH . As we also know EH ⊆ EG , we have EH ⊆ NH × NH ∩ EG . (2.5) When EH = (NH × NH ) ∩ EG , we say that H is an induced subgraph of G. 2.2 Graphlets Graphlets are small, connected, and induced subgraphs that can be counted to find local structures, heuristic subcomponents or “building blocks”, of larger graphs [4]. They make perfect candidates in studying relationships between graph structure and paired node relationships, aided by a focus on the frequency of different graphlets and of nodes at particular locations in the graphlets [4]. Their use in the statistical analysis of large graphs directly resulted from metrics collected through the capture of unique graphlets identified to then determine the topological positions of their nodes participating in the investigated graph [1,4,5,22]. These discovered topological positions are grouped as automorphic orbits, or orbits, as seen in Figure 2.3, there are a total of 73 orbits and a total of 29 graphlets types from 3-, 4-, and 5-node graphlets. Through the use of identified graphlets and their nodes’ orbits, graphlet metrics are compiled to provide a structural understanding of the graphs they are measured in [1]. The fre- 9 quency of graphlet nodes and their corresponding orbits is what is counted to generate metrics. Below is a list of such metrics which have been modified from Das et al. [1, 5, 6] . Throughout this description N is the set of nodes, Os the set of orbits, and G is the collection of 3-, 4- and 5-node graphlets (see Figure 2.3). • Graphlet Orbit Matrix (GOM): A graphlet metric represented as a V × O matrix where the (i, j ) entry is the number nodes ni identified in orbit o j . This represents the occurences of a node (gene) within a given orbit: o1 GOM = n1 n2 n3 .. . ni o2 o3 ... k0,1 k0,2 k1,1 k1,2 k2,1 k2,2 .. .. . . ki,1 ki,2 k0,3 k1,3 k2,3 .. . . . . k0, j . . . k1, j . . . k2, j . .. . .. . . . ki, j ki,3 oj • Graphlet Node Distribution (GND)1 : A graphlet metric represented as a V × H matrix where the (i, h) entry is the number of nodes ni identified in graphlets gh . This represents the occurences of a node (gene) within a given graphlet, which is derived from the orbits identified from GOM: g1 GND = n1 n2 n3 .. . ni g2 g3 ... l0,1 l0,1 l1,1 l1,1 l2,1 l2,1 .. .. . . li,1 li,1 l0,2 l1,2 l2,2 .. . . . . l0,h . . . l1,h . . . l2,h . .. . .. . . . li,h li,2 gh • Graphlet Correlation Matrix (GCM): A graphlet metric represented by an O × O matrix where the (i, i) entry is the Spearman correlation calculated between the ith and jth orbit column of the GOM: 1 Also referred to as Graphlet Degree Distribution (GDD), and has been renamed as such to further identify its functionality [22]. 10 o1 GCM = o1 o2 o3 .. . oi o2 o3 ... m1,1 m1,2 m2,1 m2,2 m3,1 m3,2 .. .. . . mi,1 mi,2 m1,3 m2,3 m3,3 .. . . . . m0,i . . . m1,i . . . m2,i .. ... . . . . mi,i mi,3 oi In the context of this thesis, Cancer Gene Expression Profiles (CGEP) are the investigated graphs G in which graphlets g are found. Individual nodes in G are represented as a census cancer gene and are associated to other nodes based on the correlation of their mRNA expression levels, which is then represented as an undirected edge. Isomorphic g are identified from G based on the subset of node combinations to study correlational GE relationships of genes within the investigated CGEP. 2.3 Algorithm Proposal In order to study this relationship of genes within G for a given g through software analysis, the proposed algorithm must use the same method for identifying an orbit of a gene no matter the graphlets being used for comparison. As such, there must be measurable variables that are common throughout these graphlet analytics that are vital components to graphs and graphlets alike, to which there are two very obvious ones: node and edge. This algorithm is able to count the number of neighbouring node and connected edge from a given node. Since orbits are simply unique positions of node based on their topological positioning relative to neighbouring and connected node, it stands to reason that each and every node can have their orbits identified via a calculated orbit checksum OC. This is done by tallying the amount of edges eu found between a given node m to its immediate neighbouring nodes nu , the amount of immediate neighbouring nodes found, the amount of edges ev found between immediate neighbouring nodes and their adjacent neighbouring nodes nv , and finally the amount of the adjacent neighbouring nodes found. Equation 2.6 is the resulting formula that captures this process: OCn = |E1 | + |E2 | + |N1 | + |N2 | (2.6) where: E1 = {(n, u) : (n, u) ∈ E} edges from n to N1 N2 = {v : u ∈ N1 , (u, v) ∈ E, v = n} has a length two path from n N1 = {u | (n, u) ∈ E} distance one neighours 11 Figure 2.3: Automorphism orbits 1, 2, . . . , 72 for the twenty-nine 3-, 4-, and 5-node graphlets G1, G2, . . . , G29 [22]. In each graphlet, nodes belonging to the same orbit are of the same shade. E2 = {(u, v) | u ∈ N1 , v ∈ N2 , (u, v) ∈ E} edges from N1 to N2 This “orbit checksum” (OC) is calculated for each and every orbit in each x-node graphlet. This in turn will allow for nodes of Gsubgraph isomorphic to Ggraphlet to be identified with a given orbit that matches the checksum created from the corresponding node of Gsubgraph . The end result creates an automorphism between Gsubgraph and Ggraphlet . Table 2.1 provides 12 the results of the orbit checksums of each orbit found in each x-node graphlet. Note that orbit checksums are not necessarily unique to each graphlet; there will be orbits whose orbit checksums are identical. This is a result of the node topology within those seperate graphlets being identical. However, the argument to be made here is to identify unique orbits within graphlets, not necessarily across all graphlets. The intention is to determine the automorphism between the subgraph and graphlets being isomorphic to each other. Table 2.1: Orbit checksums of each orbit found in 3-, 4-, and 5-node graphlets. G O Σ G1 1 2 3 4 G2 3 8 G3 4 5 3 6 G4 6 7 5 7 G5 8 8 G6 9 5 10 10 11 11 G7 12 12 13 15 G8 14 19 G9 15 16 17 3 6 8 G10 3 5 8 9 18 19 20 21 G O Σ G G12 24 5 25 12 26 13 G20 49 12 50 13 G21 51 10 52 12 53 15 G22 54 16 55 22 G23 56 7 57 21 58 22 G24 59 14 60 19 61 22 G25 G17 39 7 40 14 41 17 42 18 62 12 63 17 64 19 G26 G18 43 12 44 18 65 16 66 23 67 26 G27 68 21 69 26 G13 27 3 28 8 29 10 30 13 G14 31 7 32 12 33 14 G15 34 8 G16 35 5 36 8 37 10 38 11 O Σ G19 45 5 G28 70 25 46 12 G11 22 7 71 30 47 15 23 10 G29 72 34 48 17 G is the graphlet number; 0 is the orbit number; and Σ is the corresponding checksum. 13 2.4 Software Implementation In this section, pseudocode is provided to show how OCs can be created, attributed, and compared. The software application designed for this thesis is aimed to be a simple and readable Python framework using previously existing network analysis packages for generating graphs and subgraphs for the graphlet algorithm introduced in the following section. No matter a user’s data or requirements, this application utilizies parallel computing for its graphlet algorithm to achieve better performance. The algorithm has been developed with parallelization in mind. The parallelization occurs at the graphlet level where a child process is spawned for each graphlet, and given the list of all known induced subgraphs in order to identifying isomorphic subgraphs and their node orbits. Future algorithms will continue to implement parallelization whenever possible for future graph and network analysis. The intent is to provide open-source software that can be run on any platform so users can feed in massive amounts of graph data on remote machinery operated by any private and public institutions. Since the project is open source, anyone with appropriate tools can use it to run directly on their systems. 2.5 Algorithm Implementation Dictionaries are Python’s implementation of a data structure that is more generally known as an associative array. A dictionary consists of a collection of key-value pairs, and each key-value pair maps the key to its associated value. A dictionary is defined by enclosing a commaseparated list of key-value pairs in curly braces “ ” where a colon “ ” separates each key from its associated value, as seen in the code block below: dict ={ Key1 : Value Key2 : Value ... (2.7) Keyi : Value } All graphs, subgraphs, and graphlets are NetworkX Graph Class objects [12]. These Graph objects are dictionary-of-dictionary-of-dictionary data structures where the outer dictionary (the node dictionary) holds adjacency information keyed by node. The next dictionary, the adjacency list dictionary, represents the node adjacency information and holds edge data keyed by neighbor. Finally, the inner dictionary, the edge attribute dictionary, represents the edge data 14 and holds edge attribute values keyed by attribute names, as seen below: Gdict ={ Node1 : { } } Node2 : {Value} (2.8) All variable references made in this section are found in Algorithm 1. All 3-, 4-, and 5-node graphlets are represented by the variable Ggraphlet . An undirected graph that consists of a set of nodes n and edges e from which subgraphs will be identified and compared isomorphically with Ggraphlet is represented by the variable G. Isomorphic subgraphs to Ggraphlet are represented by the variable Gsubgraph . The algorithm begins by initializing the GOM as a Python data structure, on line 3 of Algorithm 1. GOM is structured in such a way that each mapping entry contains a key hashed to node n and the total number of orbits it has been found in: GOMn ={ orbit1 : orbitcount (2.9) ... orbitn : orbitcount } A census cancer gene expression graph G and a set of graphlets are the inputs of this algorithm to compare and identify subgraphs. The GOM is the output created from each G containing all found nodes and orbits based on a graphlet of a x-node size. The algorithm first has a single graphlet Ggraphlet be generated and sent to the function to generate OCn , represented in the code as , specific to each node ni in their respective orbits 2.6. Then, a list of all found induced subgraphs of G are generated. Each subgraph Gsubgraph isomorphic to Ggraphlet will have its OCn , its , calculated using . Each node of Gsubgraph is then compared with every other node of Ggraphlet based on their . Once a match is made, ni is attributed the orbit that ni is situated in. Finally, all ni and their identified orbits are stored in the GOM, from which the GCM and the GND will be calculated. 15 ni : Node of a subgraph n j : Neighbouring node(s) to ni nk : Neighbouring node(s) to n j nl : Node of a graphlet (ni ,n j ) : edge connecting ni and its neighbour n j E : Number of edges between neighbours n j to ni NN : Number of neighbours ni to n j NNE : Number of edges neighbours nk to n j NNN : Number of neighbours nk to n j G : Graph Network Ggraphlet : Graphlet of G Gsubgraph : Subgraph of G GOM : Graphlet Orbit Matrix orbitchecksum: Orbit Checksum nx: NetworkX package object // Using a single graphlet INPUT: G = (V , E ), dict(Ggraphlet ) 2 OUTPUT: GOM for G 3 Shared Array: GOM; 4 for each Ggraphlet ∈ dict(Ggraphlet ) do GenerateOrbitChecksum(Ggraphlet ) 5 IdentifyGraphlets(Ggraphlet , G, nx) end 1      in parallel Algorithm 1: Graphlet and orbit counting algorithm: Part I. Syntax and nomenclature follow that from Das et al. [1]. 16 Procedure IdentifyGraphlets(Ggraphlet , G, nx) subgraphList = [ ] 2 for Gsubgraph ∈ G do 3 if nx.connected (Gsubgraph ) AND nx.isomorphic(Gsubgraph , Ggraphlet ) then 4 ValidateOrbitChecksums(Gsubgraph , Ggraphlet ) 5 end end end 1 Algorithm 2: Graphlet and orbit counting algorithm: Part II Procedure ValidateOrbitChecksums(Gsubgraph , Ggraphlet ) for ni ∈ Gsubgraph do 2 E ← len(subgraph.edges(node)) 3 4 NN ← len(subgraph[node]) NNE ← 0 5 NNN ← 0 6 for n j ∈ Gsubgraph do 7 8 NNE ← NNE + g.edges(n j ) NNN ← NNN + |g.ngbr (nk )| 9 end 10 g[ni ][orbitchecksum] ← E + NN + NNE + NNN for n ∈ Ggraphlet do 11 if ni [orbitchecksum] == n [orbitchecksum] then 12 [ni ][orbit ] ← n [orbit ] 13 end end end end Algorithm 3: Implemented graphlet and orbit counting algorithm. 1 17 Chapter 3 COSMIC Database This chapter outlines the significance of (COSMIC) in the pursuit of combating cancer by providing an expertly refined and updated database of identified cancer genes. In addition, we explore a way to identify (GE) profiles to better understand the global picture of cancer census genes involved in breast and lung cancers. 3.1 COSMIC COSMIC is the most detailed and comprehensive resource to date for scientific exploration into the effects of somatic mutations in human cancer [28]. It allows researchers to utilize its high-resolution resources for exploring target genetic markers and trends in GE profiles of human cancers [10]. COSMIC is built and designed to allow for expert manual curation that encapsulates very high detail across mutation positions, disease descriptions, and other patient and population data [33]. This curation process provides improved quality control over systematic approaches where experienced curators can identify inconsistencies or errors in publications, allowing for the rejection of untrustworthy, incomplete or unspecific data sources [33]. It should be noted that this curation process is done by COSMIC’s team of postdoctoral scientist curators that manually interpret peer reviewed publications, as stated by COSMIC on their Expert Curation of Genes documentation [33], and the validity and integrity of these claims should be analyzed and discussed openly. 3.1.1 Cancer Gene Census The (CGC) data are the primary data that COSMIC derives directly from the scientific literature through a broad-range manual curation process. The CGC is one 18 of COSMIC’s many maintained and continually-updated projects which undergoes a curation process that cultivates specific and detailed data via major cancer portals, such as (TCGA) and the (ICGC), and from census gene supplementary data associated with their curated papers [2,10,14,26,28,33]. It’s intended purpose is to address the challenges posed by genetic mutations and the wide range of consequences brought about by them. Driver mutations, which are mutations that determine the development of a cancerous state, will cause dysfunction (protein structural alterations), dysregulation (GE level changes), or complete abrogation (complete genetic deletion) of a gene [10, 26, 28, 33]. To better help encapsulate these effects on cancer progression, the CGC objective is to catalogue all genes that are causally implicated in cancer, or have extensive literary evidence, for having affected functionalities resulting in the involvement causing a cancer tumour’s development [10, 26, 28, 33]. As a result, the CGC categorizes cancer census genes in two tiers [10, 26, 28, 33]: • Tier 1 — genes that are characterized by the presence of mutational patterns that strongly support their involvement in causing a cancer’s development, progression, and proliferation. A gene’s qualification to be designated in Tier 1 must meet the following criteria: – The investigated gene must have at least two publications from two independent groups that describe its somatic mutations involved in at least one cancer type. – The investigated gene must have at least two independent publications that provides evidence of its functional involvement in the cancer type’s biological functionality. • Tier 2 — genes that don’t have enough research evidence to suggest the presence of mutational patterns to cause a cancer’s development, progression, and proliferation. A gene’s qualifications to be designated in Tier 2 is decided by at least two postdoctoral scientists following review of sufficient literature. To be classified into Tier 1, a gene must possess a documented and reproducible activity relevant to cancer, alongside evidence of somatic mutations that changes its product’s activity towards oncogenics [10,26,28,33]. Although only somatic mutations from cancer samples and cell lines are catalogued in COSMIC, the CGC additionally uses information about germline mutations and their consequences. COSMIC also acknowledges somatic mutation patterns across cancer samples gathered. Tumour suppressor genes have often been shown to become inactive or under-expressed as a result of a broad range of mutation types, whilst oncogenes usually become unchecked in their ex- pression levels as a result of well-defined missense or inframe indel mutations [31]. With both mutations patterns and functional evidence, a gene will be included into the Tier 1 list, indicative of the extensive evidence of their role in the investigated cancer (Figure 3.1). 19 To be classified into Tier 2, a gene with more recently and strongly identified roles in cancer but with less strong mechanistic or functional evidence are the most common candidates [10, 26, 28, 33]. These include genes with somatic mutational patterns atypical for either oncogenes or tumour suppressor genes but which won’t have enough functional evidence in the scientific literature to suggest their cancer role. Similarly, genes with strong published evidence for a function in cancer but unclear mutation patterns or known to be dysregulated, for instance solely by epigenetic means (for example by changes to promoter methylation), are also included [10, 26, 28, 33]. With neither supportive evidence of mutations patterns nor functionalities, a gene will only be listed as Tier 2, highlighting the need for more research to definitively uncover further evidence as to their role in the investigated cancer (Figure 3.1). Figure 3.1: (Adapted from Figure 2 of Sondka et al. [26]) Genes are classified into either Tier 1 or Tier 2 of the Cancer Gene Census based on: (a) the evidence of functional involvement in oncogenesis, and (b) the somatic mutations patterns in cancer samples indicative of a gene’s functionality investigated in the literature being curated [10, 26, 28, 33]. Only strong evidence from both analyses will allow a gene to be classified as Tier 1. If a investigated gene has been shown to have either cancer-driving somatic mutational patterns or descriptive biological pathways highlighting their involvement in cancer progression, then they are attributed as Tier 2. Genes that are targeted for manual curation are generally genes that meet the Tier 1 criteria of the CGC genes [28]. When a new cancer gene is updated into the CGC as a Tier 1 gene, COSMIC performs a literature review in PubMed to identity any publications reporting 20 the gene’s somatic mutations in cancer types. At the point at which a new gene is selected for comprehensive curation, COSMIC inspects for mutation data, GE, clinical details, and cancer phenotype information. The data are curated from both targeted gene studies and from whole genome screens, so that the full range of reported somatic mutations are represented in COSMIC [10, 26, 28, 33]. Each quarterly database release includes data for new CGC cancer genes, and it should be noted that the CGC is not static and is updated when new evidence comes to light (Figure 3.2). As a result of COSMIC’s dedication to its CGC project, all data are standardized and attributed to a given census gene and its mutation(s) [10, 26, 28, 33]. The CGC provides a wealth of information pertaining to a single census gene and its role in the development and progression of the cancer it is involved with. COSMIC also provides, from its database, associative variables such as GE and expression levels, cancer type, and cancer role for each individual census gene identified. With these variables the aim is to create an approximation of a given cancer’s in situ GE profile using graphlet network reconstruction and analysis, as well as identifying common census genes and their neighbouring associations. 3.1.2 Comparative Gene Expression In order to compare and contrast census genes within a network, a network association reconstruction based on individual GE values, cancer role type, and cancer association was proposed. These associations are represented as statistically significant over, under, or normal genetic expression levels, assigned as z-scores, referenced by individual sample mRNA expression microarray data per curated studies for the given census gene. Under-expressed genes have a z-score value lower than −2, over-expressed genes have a z-score value greater than 2, and normally-expressed genes have a z-score and between 2 and −2. These measures are useful to determine whether a gene is up-, down-, or regulated relative to the normal samples or all other tumour samples. A z-score simply represents the number of standard deviations away from the average GE mean of the reference population used in the given study The z-score is provided by TCGA, which COSMIC uses for its curation purposes. mRNA expression data is simply the computed relative expression of an individual gene and tumour to the gene’s expression distribution in a reference population [2, 33]. That reference population is either all tumours that are diploid for the gene in question, or, when available, normal adjacent tissue [2]. For research purposes, we want to include both Tier 1 and Tier 2 when conducting our investigations in understanding a cancer’s GE profile. Determining the placement of Tier 2 genes in association with Tier 1 genes could create preemptive discoveries of significant GE profiles of said genes with low cancer research evidence and investigations. Figure 3.3 provides an overview of the data flow and processing operations for graphlet analysis. 21 Figure 3.2: The Cancer Gene Census curation steps undertaken by COSMIC. Once the information about the gene is made public, COSMIC curators will explore for any presence of patterns of somatic mutations in COSMIC’s resources prior to going forward with the literature review. If COSMIC resources do not contain any information whatsoever for this new gene, then curators will perform a literature review to identify roles the gene has associated with a given cancer. Once there is minimum evidence from at least two publications across different research groups showing experimental evidence of functional involvement in the associated cancer, two COSMIC postdoc-scientist curators will then discuss if the gene in question needs more evidence to be designated a census gene. Once decided, the gene will be defined either as a Tier 1 or a Tier 2 census gene with mutational and functional information stored in the Cancer Gene Census database. With more research for that given gene, COSMIC will continually update it with additional information. 22 Figure 3.3: A generalized overview of the software’s operational steps and functionality to process raw cancer data, generate graphlets and cancer graphs, enumerate identified graphlets, and collect graphlets metrics. 23 Chapter 4 Graph Network Creation Since the introduction of DNA and RNA sequencing technologies, the collection and analysis of Cancer Gene Expression Profile (CGEP) data has been instrumental in identifying cancer bio-markers for clinical applications such as cancer diagnosis, prognosis, and/or prediction for suitable treatments for afflicted individuals. Accordingly, with more CGEP data being collected comes the demand for hosting and sharing that has led to the development of research databases that are accessible, cheap to maintain, fast to query, and straightforward to operate. In turn, the need for effective analysis techniques to model and analyze from CGEP data is unavoidable in order to draw meaningful conclusions. A very popular analysis technique is to model the data as connected graphs in order to effectively represent cancer genes and their interactions with each other in order to unveil hidden links and patterns. Using GE data accumulated in CGC, the need to determine, identify, and understand the underlying molecular biology and genetics of cancers would provide meaningful investigative directions for oncological research. It is with the use of graphlet analysis that we aim to identify the topological significance of cancer gene interactions within and in comparison of different CGEPs. 4.1 Data Analysis 4.1.1 Data Collection All data preparation steps including census GE graphs analysis and statistical analyses were performed in Python using appropriate modules: Pandas and Dask for managing large data files as dataframe datastructures to enable data analysis, NetworkX for creating graphs from the data analysis, MatPlotLib and Seaborn to visualize results of the data analysis as heatmaps and graphs. The use of a correlation matrix allows us to determine co-expression relationships between genes based on their z-score GE values. All data files were downloaded from the following COSMIC website: 24 as files, an archive file compressed by the standard GNU zip (gzip) compression algorithm [11]. The data download version is the v92 release of August 27th, 2020. The data files used are • (8.6 GBs in size), • (2.21 GBs in size), and • (16.6 GBs in size). All files were uncompressed into tab separated files to be used for data analysis. COSMIC data files are available for download with the creation of a free account. 1. Complete Gene Expression (CGE) — Listed as from COSMIC. This contains the census genes associated with their identified z-score per the study associated. COSMIC labels each of these as separate variables within the aforementioned data table as follows: : the gene name for which the data has been curated into COSMIC. In most cases this is the accepted Human Genome Organization Gene Nomenclature Committee (HGNC) identifier (ex: AKT1 is short-form for v-akt murine thymoma viral oncogene homolog 1). : the GE score taken by a study for a particular census gene. : A sample is an instance of a portion of a tumour being examined for mutations. A sample id is used to identify a sample within the COSMIC database. 2. Cancer Gene Census (CGC) — Listed as from COSMIC. This contains the list of cancer census genes curated by COSMIC. From this data table we are able to determine the associated cancer and the cancer role for each gene. Each of these are separate variables are listed by COSMIC as follows: : the gene name for which the data has been curated into COSMIC. In most cases this is the accepted Human Genome Organization Gene Nomenclature Committee (HGNC) identifier. This unique identifier is what allows cross referencing to retrieve additional COSMIC information from other projects and data tables. : the somatic cancer that a census gene has been identified as involved in its progression and development. A census gene can have more than one tumour type associated with it (ex: AKT1 is associated with breast, colorectal, ovarian, and non-small cell lung cancer). 25 : the census gene functionality whose mutations have been determined to be cancerous. It can be defined as an oncogene, a tumour suppressor gene, or a fusion/hybrid gene, or in any combination of the three. : A sample is an instance of a portion of a tumour being examined for mutations. A sample id is used to identify a sample within the COSMIC database. 3. Census Gene Mutation Data (CGMD) — Listed as from COSMIC. This contains all coding mutations in census genes listed in the CGC. It contains the Functional Analysis through Hidden Markov Models (FATHMM) descriptors and the somatic mutation status that allows us to to determine the pathogenic nature of a census gene. These variables used are described and listed by COSMIC as follows, among variables: : The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier. : A calculated p-value score ranging from 0.0 to 1.0 determining the cancerous affect of a given gene. Scores greater than 0.5 are considered deleterious, and scores greater than 0.7 are considered pathogenic. Mutations are classed as neutral if the score less than or equal to 0.5. : The age of the patient identified in the curated studies for the identified census gene and associated sample(s). : Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of Unknown Origin. : The primary tissue/cancer from which the sample originated. : A sample is an instance of a portion of a tumour being examined for mutations. A sample id is used to identify a sample within the COSMIC database. It should be carefully noted that the aforementioned variables from each data source are not the complete list of variables found. Those that are listed are only sourced for the intent of this thesis, which is to determine if graphlet analysis can provide a clearer understanding of frequently associated genes of an identified CGEP. As such, the variables excluded were not deemed necessary for this endeavour due to redundancy of representing qualititative descriptors of quantitative components (i.e. in CGMD, the variable FATHHM scores retains numeric scores from 0.0 to 1.0 where values greater than 0.7 are considered pathogenic, otherwise neutral, and reported in FATHMM prediction), or are completely unneeded in the creation of CGEP graph (i.e. in CGMD, histological sub-classification of sample tissues and of the variable . 26 4.1.2 Proposed Data Analysis It is important to note that the selected GE graphs variables were not selected without comparisons in current literature for differential expression analysis. To reiterate, the need for understanding molecular pathways and genetics is fundamental to improve targeted cancer treatments. In order to provide a reasonable estimate for a clinical investigation, we propose to use the variables listed in Section 4.1.1 from the data files used to recreate CGEPs for graphlet analysis following the data analysis steps in Figure 4.1. The Complete Gene Expression data table consists entirely of quantified mRNA GEs of individual census genes from studies that use standardized and tested RNA sequencing research tools and methods. The platform codes currently used to produce the COSMIC GE values are the following in order of preference: , , , and [33]; Illumina is simply the developer and manufacturer of the RNA sequencing machinery and protocols used for large-scale analysis of genetic variation and function. Additional COSMIC data is needed to identify co-expression relationships between census genes with associative cancer. The CosmicMutantExportCensus and cancer_gene_census data files provide ample information to create meaningful graphs that can lead potential CGEP patterns, and the data analysis process is outlined step by step in Figure 4.1. First and foremost, only mutation samples from CGMD identified as pathogenic by Functional Analysis through Markov Models (FATHMM) predictions are carried over for graph creations. The FATHMM variable is a machine learning approach (better known as FATHMMMKL) that integrates functional annotations from (ENCODE) consortium with nucleotide-based sequence conservation measures that can predict the functional consequences of non-coding variants [25]. If the mutation data is marked as pathogenic then it is considered to be cancerous. From the pathogenic mutation data, mutation samples are then processed for being confirmed as somatic mutations. The experiments that the mutation samples were taken from have had comparative genomic sequencing done between normal samples from the same patient. This ensures that the data collected is confirmed to contain not only pathogenic samples but is reassuringly a mutation in comparison to a normal human genome sample. The resulting mutation data that is then further refined can then only contain samples from the investigated cancer types (breast and lung). Once completed, the data is further distilled to primarily carcinoma histological samples. We acknowledge that there exist many cancer histologies, but for the sake of this study it will be easier to compare graphlet analysis results between the investigated cancers. Before extracting gene expression data based on census genes, the mutation data is compounded based on age ranges of 10 years. This is an attempt to further identify potential CGEP patterns related to age as it is a strong correlative factor in a 27 Figure 4.1: Data analysis steps to create meaningful CGEP graphs. COSMIC data are pruned by keeping sample mutations that are confirmed first by being pathogenic (confirmed through FATHMM predictions) and then being somatic (confirmed by sequencing both the tumour and a matched normal tissue sample from the same patient). Following this, tumour samples are separated based on cancer type and histology (breast and lung carcinomas). From the retained sample mutations, only those identified with GEs are used for the creation of graphs. The resulting graphs are then subjected to graphlet analysis, leading to the creation of comparative data (GOM, GCM, etc.) that will allow for CGEP co-expression comparisons. higher risk of developing cancer [34]. Finally, the last stages of the data analysis process is to collect gene expression data based on identified census genes from the refined mutation samples. The gene expression data stems from the CGE data file wherein the gene names and study IDs are matched with those from the collected mutation samples derived from the CGMD data file. A gene expression matrix is created based on the gene name and study IDs from both data files, which will then undergo Spearman correlation matrix calculations as described in detail in Section 4.2.1. After correlation matrices are completed, they are then used to generate graphs based on the strength in correlative expression levels between census genes. A threshold is applied to only retain meaningful relationships between these genes, and in Chapter 5 levels of thresholds are selected to provide a better comprehension of each cancer’s genetic co-expression profiles. To further enrich the CGEPs, metadata from the CGC data file is merged with the collected mutation sample dataset to identify the type of roles each census gene plays; genes that either became oncogenic 28 or lost their tumour suppressor ability. This merging of additional information is done using the sample IDs and gene names from both the CGC and mutation sample dataset. Generated graphs represent each of these genes accordingly. 4.2 Gene Expression Rationale As mentioned previously in Chapter 3, census genes are given unique identifiers attributed by COSMIC that allows us to not only match them with associated cancer histology but also determine whether or not their somatic mutations are deemed pathogenic, distinctive cancer roles, determined age brackets, and their gene expression levels. The census genes are curated as a result of having a mutation that would cause themselves or others to develop cancerdriving activities through uncontrolled expression. It is with these available data that graphs can be created from GE levels following correlation matrix calculations from groups of census genes for a given cancer. 4.2.1 Gene Expression Matrices The use of correlation matrices is necessary to determine the co-expression graphs of genes for their respective CGEPs. This allows us to determine the strength of GE correlations between pairs of specific cancer genes; a weighted correlation between two genes represents connection strength between them in a CGEP. We use Spearman’s correlation coefficient to calculate the pairwise GE, which results in an adjacency matrix defining the strength in correlative coexpression between each gene. For each dataset, a Weighted Gene Expression Correlation Matrix (WGECM) was created by taking the weight of correlative co-expression (w) between each pair of genes. These values are found in a range between −1.0 (strongly dissimilar correlative significance) and 1.0 (strongly similar correlative signficance). The self-correlated genes of each matrix were not carried over into the graph creation: a gene will always be correlated with itself does not provide meaningful information. For the purposes of our study, census genes with w greater than designated threshold values of 0.3 to 0.9 for ten year age groups starting from 0 to 100 are retained to generate CGEP graphs (i.e., only genes with above average median connectivity were kept for graph analysis). This will provide us comparative analysis of CGEPs increasing w to compare between those age categories. Following this filtering step, adjacency matrices containing census genes with w greater than designated thresholds are used to create co-expression graphs. In a co-expression graphs, an edge between two genes (nodes) represents the co-expression relationship w. All generated graphs are created with the use of NetworkX, a graph generation library for Python. In reference to Chapter 2, the steps for graphlet analysis was conducted on these graphs to create 29 GOMs, GCMs, and GNDs. Heatmaps of WGECMs, GOMs, GCMs, and GNDs were created using Python’s Seaborn, MatPlotLib, Seaborn, and SciPy libraries and used for visualization and analysis. These analyses and results are shown in full in Chapter 6. 30 Chapter 5 Cancer Network Results Graphlet analysis metrics were conducted following the generation of BCG and LCG from census genes provided by COSMIC. The following section provides a summary of results and observations found from each GE graph. The result summaries were measured at different correlation co-expression thresholds and age groups to further help with the investigation of GE graph size and connections. This investigation used a minimum correlation co-expression threshold of 0.3 and each collected GE data from recorded samples are compiled into 10-year age groupings, starting with the years 0–10 and ending with the years 90–100. 5.1 Cancer Graph Networks This section provides an overview as to the size and degree of connectivity in each investigated cancer at varying threshold levels and age groups. The graphs below are constructed from GE data with probabilistic thresholds of census gene co-expression greater than or equal to 0.3. For both LCG and BCG, it is very interesting to observe from the 10 year age groupings some highly occuring genes. 5.1.1 Structural Network Observations In each of the graphs generated, cancer census gene as nodes have been distinguished in individual colours and labels. As mentioned on Section 3.1.1, there are two types of COSMIC census genes used for generating graphs: genes with strong support in cancer development involvement (Tier 1), and genes that might suggest some form of influence on cancer development (Tier 2). As seen in the corresponding graphs in Figures 5.1–5.3, a Tier 1 gene is attributed with a single asterix to its gene name (i.e. “ATR*”) while a Tier 2 gene is attributed with a double asterix (i.e “SEPT6**”). Secondly, genes are also coloured based on their cancer role upon a mutation that would render them cancer-driving. As mentioned in Section 4.1.1 of 31 Chapter 4, each gene has been identified to have an either oncogenic, tumour suppressor, or a mutated fusion of both. In Figures 5.1–5.3, oncogenes are labelled green, tumour suppressor genes are labelled red, and fusions are labelled light-blue. It was briefly noted in Section 5.1, the constructed 10-year age ranged graphs represented in Figures 5.1–5.3 were the only graphs generated from the 0.3 GE probabilistic threshold. No other graphs were found as there were no edges found at correlation levels lower than 0.3. If the age ranges were shortened to 5-, 2-, or down to 1-year age differences we could see stronger and more positive GE correlations. An issue arose when trying to compute more complex, highly connected node-to-node relationships using 5-node graphlets, namely the graphlet algorithm’s consumption exceeding the allocated RAM resources provided for this research work. A hard limit of 46 gigabytes of RAM was allocated for this research work and as a result no 5-node graphlet analysis data was attained, however both 3- and 4-node graphlet analysis data were collected. Figures 5.4–5.6 showcase the GCMs calculated from the GOMs of the BCGs and LCGs. These are generated using Python’s Pandas’ Spearman correlation matrix and Seaborn’s Clustermap tools. Hierarchical groupings, or clusters, of certain orbits found between Orbit 1 – Orbit 14 are observed from the aforementioned figures. An orbit cluster is a collection of orbits such that any two orbits with a correlation coefficient value greater than or equal to 0.8. This collection can be as large as possible and is seen as the lightest coloured squares in Figures 5.4– 5.6. These orbit clusters are tallied and identified further in Table 5.1. Table 5.1: Orbit Clusters Found in Identified Age Groups of Breast and Lung Cancer Gene Co-expression Networks. B REAST C ANCER Age Group Orbit Clusters 20 to 30 30 to 40 90 to 100 5, 7, 2, 11, 12 10, 14, 3, 13 6, 4, 9 1, 8 11, 3, 10 6, 1, 4 5, 2, 7 Age Group L UNG C ANCER Orbit Clusters 30 to 40 10, 14, 3, 13, 5, 7, 2, 11 12, 1, 8 40 to 50 12, 13, 3, 10 6, 9, 8, 14 2, 7, 5, 11 80 to 90 13, 11, 3, 10 2, 7, 5, 1, 4 6, 9 12, 2, 11, 5, 7 14, 10, 3, 13 1, 8 6, 9 There is a clear visible overall distinction of node connectivity in the observed generated graphs as seen in Figures 5.1–5.3. In breast carcinoma graphs (BCG), Table 5.1 shows common 32 clusters of orbits (Orbit 2, Orbit 5, Orbit 7, Orbit 11, Orbit 12), orbits (Orbit 1, Orbit 8), orbits (Orbit 3, Orbit 10, Orbit 13, Orbit 14), and orbits (Orbit 6, Orbit 9) between 20–30 and 90–100 graphs. This is interesting to note as it implies that there are some conserved graph structure for GE between starkly different age groups. However, the graph structure shrinks in the 30-40 graph from the 20-30 graph. A smaller set of orbit clusters (Orbit 2, Orbit 5, Orbit 7) and (Orbit 3, Orbit 10) are shared across 20–30, 30–40, and 90–100 graphs, indicating again similarities in graph structure, but there is a transition back to identical orbit composition from 20–30 to 90–100 graphs. It would appear that GE levels vary differently in the 30–40 age groups in comparison to both the 20–30 and 90–100 age groups. In contrast, the lung carcinoma graphs (LCG) shown in Figures 5.4–5.6 exhibits truncation of orbit clusters with increased age. Table 5.1 shows common clusters of orbits (Orbit 2, Orbit 7, Orbit 5), orbits (Orbit 6, Orbit 9), and orbits (Orbit 3, Orbit 10, Orbit 13) between 40–50 and 80–90 graphs. However, the 30–40 graph does not share separate orbit clusters following 40–50 and 80–90 graphs. A single cluster orbit of (Orbit 10, Orbit 14, Orbit 3, Orbit 13, Orbit 5, Orbit 7, Orbit 2, Orbit 11) is observed instead. This doesn’t detract from the fact that all LCG graphs share similar orbits, it simply indicates that there is a loss of node connectivity from 30–40 onwards. Between LCG and BCG, it is interesting to note that combinations of orbits (Orbit 4, Orbit 6, Orbit 9), (Orbit 5, Orbit 7, Orbit 2, Orbit 11), and (Orbit 10, Orbit 3, Orbit 13) exist across all age groups graphs. This is quite an interesting result as it indicates that all these graphs share some common structural similarities, potentially indicating some sort of preserved genetic expression. To properly compare how similar/disimilar investigated graphs are to each other, a graphlet correlation distance (GCD) matrix was calculated from the GCM of each graph. This method of quantitative analysis measures the dependence of correlation of a set of vectors between two variables; in this case, a list of calculated GCD mean values is compared from two graphs to determine if they are more similar (a GCD value closer zero) or disimilar (a GCD value further from zero) they are by measuring the Euclidean distance. The list of GCD mean values is created by first calculating the mean count of individual orbits found in the GCM of each graph, then a correlation distance calculation is conducted between graphs by using the Python’s Scipy Spatial Distance Correlation function. As seen in Table 5.2, the bolded GCD values represent some interesting within- and between-group graph structure similarities. This can be seen in orbit cluster heatmaps in Figures 5.4–5.6 where there are preserved structural similarities. 33 Table 5.2: Graphlet Correlation Distance between in Identified Age Groups of Breast and Lung Cancer Gene Co-expression Networks. 0.3157 0.0043 0.0086 0.1614 0.2185 5.1.2 0.0043 0.2606 0.3157 0.2606 0.3981 0.0631 0.0252 0.0239 0.1187 0.1687 0.0086 0.3981 0.0239 0.1614 0.0631 0.1187 0.2348 0.2348 0.2968 0.0104 0.2185 0.0252 0.1687 0.2968 0.0104 Potential Genetic Implications Observed in the graphlet analysis are specific orbit clusters that are preserved across the generated breast and lung cancer graphs. Understanding what these clusters represent in terms of preserved oncological properties would be very interesting to explore in future studies. Now, in order to verify that there is some sort of preserved genetic expression across age groups within and between the investigated cancers, the frequency of census genes identified in those orbit clusters must be investigated. Table 5.3 shows the number of census genes found in identified BCG and LCG age groups. Table 5.3: Gene Distributions Identified Age Groups of Breast and Lung Cancer Gene Co-expression Networks. Gene Count versus Age Group B REAST C ANCER L UNG C ANCER Age Count Age Count 20–30 30–40 90–100 151 363 228 30–40 40–50 80–90 119 367 317 Following the common cluster groups identified between breast and lung age groups, the distribution of the common orbit clusters (Orbit 10, Orbit 3) and (Orbit 5, Orbit 7, Orbit 2) are subjected to a new metric that outlines which genes are observed to be predominantly present in those orbits. For each given orbit a distribution of census genes have been found. Each of these genes have associated with them a count or frequency in which they appear in a given orbit. With this information, a search can be conducted to identify highly frequent genes in any given orbit found to be integral to a graph’s structure. In reference to Figures 5.7 and 5.8, there is great confidence in knowing that situated in the 95% percentile orbit-gene frequency distribution are census genes with the highest order of appearances. As seen in each orbit-gene distribution graph (Figures 5.7 and 5.8), each common orbit from each age group, Kernel Density Estimate 34 (KDE) plots were generated to properly visualize and interpret the distribution of associated census gene frequencies and the amount of said distribution that contains census genes that are highly present in those orbits. Each orbit-gene distribution (OGD) graph (Figures 5.7 and 5.8) contains the 95% percentile that acts a threshold to single out the highly-occuring genes. As result of using a KDE, some of the x-axis representing the orbit frequencies show the curve extending from negative x-values. There is no negative presence of any census gene in any orbit. This is a result of the smoothing distortion when converting a data set into a distribution curve, which is more visually understandable than using a conventional histogram. Using a KDE does not destroy the data; it simply generates a continuous curve representing the data set’s distribution. All KDE graphs were created using Python’s Seaborn kdeplot function. 35 Figure 5.1: BCG 20–30 age group (top) and LCG 30–40 age group (bottom) carcinoma graphs using gene co-expression data at probabilistic thresholds equal to and greater than 0.3. Each graph is represented from data gathered in ten year age groupings, all collected from the data provided by COSMIC. 36 Figure 5.2: BCG 30–40 age group (top) and LCG 40–50 age group (bottom) carcinoma graphs using gene co-expression data at probabilistic thresholds equal to and greater than 0.3. Each graph is represented from data gathered in ten year age groupings, all collected from the data provided by COSMIC. 37 Figure 5.3: BCG 90–100 age group (top) and LCG 80–90 age group (bottom) carcinoma graphs using gene co-expression data at probabilistic thresholds equal to and greater than 0.3. Each graph is represented from data gathered in ten year age groupings, all collected from the data provided by COSMIC. 38 Figure 5.4: BCG 20–30 age group (top) and LCG 30–40 age group (bottom) orbit clustering of corresponding graphlet correlation matrices (GCM) of both generated BCG (top) and lung (bottom) graphs, as seen in Figure 5.1. 39 Figure 5.5: BCG 30–40 age (top) and LCG 40–50 age group (bottom) orbit clustering of corresponding graphlet correlation matrices (GCM) of both generated BCG (top) and LCG (bottom) graphs, as seen in Figure 5.2. 40 Figure 5.6: BCG 90–100 age group (top) and LCG 80–90 age group (bottom) orbit clustering of corresponding graphlet correlation matrices (GCM) of both generated BCG (top) and LCG (bottom) graphs, as seen in Figure 5.3. 41 Figure 5.7: Identified gene distribution per given orbits from census genes found between-groups common orbit clusters between generated breast age group graphs. Breast cancer age group graphs from left to right: 20–30, 30–40, and 90–100. Horizontal red marker and arrow provides visual confirmation of high-frequency gene found in 95% percentile of respective kde orbit distributions. 42 Figure 5.8: Identified gene distribution per given orbits from census genes found between-groups common orbit clusters between generated lung age group graphs. Lung cancer age group graphs from left to right: 30–40, 40–50, and 80–90. Horizontal red marker and arrow provides visual confirmation of high-frequency gene found in 95% percentile of respective kde orbit distributions. 43 Table 5.4: Identified Census Genes in Structurally Shared Orbits in Breast Cancer Networks. Orbit 10 Genes Counts KMT2D* 19704 NOTCH2* 18843 PCM1* 18743 CREB1* 18691 PRDM2** 18634 CDH10** 18290 ZFHX3* 18046 CCR4** 18010 Orbit 10 Genes Counts AR* 10 EGFR* 9 IKBKB* 8 PPFIBP1* 8 TLX1* 8 HOXC11* 7 KIT* 7 PICALM* 8 SLC34A2* 7 ARID2* 6 BCOR* 6 CBLC* 6 CNTNAP2** 6 ERC1* 6 KMT2C* 6 Orbit 10 Genes Counts NSD1* 48985 ATRX* 47918 ATM* 47131 MTCP1* 46931 PBRM1* 46042 MED12* 45888 NF1* 45523 PRDM2** 45316 SPEN* 45307 USP8* 45272 KMT2A* 45150 ARID2* 44868 Breast Cancer 20 to 30 Age Group Orbit 3 Orbit 5 Orbit 7 Genes Counts Genes Counts Genes Counts KMT2D* 1223 IL7R* 30035 KMT2D* 6083 NSD1* 1085 KMT2D* 29273 IL7R* 4249 CREB1* 1063 MAP2K4* 28496 CCR4** 4157 ZFHX3* 1013 CCR4** 28288 MAP2K4* 4140 NOTCH2* 1012 ATM* 28241 ATM* 4121 ARID2* 995 CDH10** 27150 PRDM2** 4084 PRDM2** 976 PRDM2** 26446 CDH10** 3522 KLF4* 940 PCM1* 26187 RNF43* 3516 Breast Cancer 30 to 40 Age Group Orbit 3 Orbit 5 Orbit 7 Genes Counts Genes Counts Genes Counts AR* 2 MITF* 150 MITF* 56 HOXC11* 2 PAFAH1B2* 142 PAFAH1B2* 50 IKBKB* 2 NFKBIE* 97 NFKBIE* 30 MYH9* 85 GRIN2A* 20 EIF3E* 66 MAP3K13* 20 MAP3K13* 65 MYH9* 20 PICALM* 65 SGK1** 20 PRDM16* 62 WT1* 20 ERC1* 58 DDX6* 16 DDX3X* 56 POU5F1* 16 GRIN2A* 55 PRDM16* 16 WT1* 55 RSPO2* 16 DDX6* 53 HOXC11* 12 FBXW7* 52 MECOM* 52 PPFIBP1* 52 ERG* 51 SGK1** 50 Breast Cancer 90 to 100 Age Group Orbit 3 Orbit 5 Orbit 7 Genes Counts Genes Counts Genes Counts ATRX* 1612 KMT2A* 89156 ACSL6** 12957 NSD1* 1566 MTCP1* 87707 USP8* 12050 ATM* 1544 ACSL6** 86446 MTCP1* 11616 PBRM1* 1532 USP8* 84033 KMT2A* 11330 SPEN* 1502 STAT3* 80351 STAT3* 10320 USP8* 1454 NSD1* 79379 NSD1* 10168 ARID2* 1453 LARP4B** 78582 ATM* 9820 PRDM2** 1441 MED12* 75942 MED12* 9553 MED12* 1439 MLLT10* 75510 MLLT10* 9340 SF3B1* 1426 ATM* 74065 LARP4B** 9261 MTCP1* 1420 NF1* 72151 LSM14A** 9076 LSM14A** 1403 CACNA1D* 70767 CACNA1D* 8702 Orbit 2 Genes Counts KMT2D* 1192 PRDM2** 915 ATM* 889 IL7R* 863 RNF43* 862 CCR4** 859 CREB1* 828 MAP2K4* 822 Orbit 2 Genes Counts MITF* 28 PAFAH1B2* 27 NFKBIE* 20 GRIN2A* 15 MAP3K13* 15 MYH9* 15 SGK1** 15 WT1* 15 DDX6* 14 POU5F1* 14 PRDM16* 14 RSPO2* 14 HOXC11* 13 Orbit 2 Genes Counts USP8* 1706 ACSL6** 1679 KMT2A* 1666 MTCP1* 1661 NSD1* 1594 STAT3* 1580 ATM* 1537 MED12* 1487 ATRX* 1469 LARP4B** 1455 CACNA1D* 1450 LSM14A** 1447 It should also be noted that each OGD graph in Figures 5.7 and 5.8 can be cross-referenced with their respective cancer graphs in Figures 5.1–5.3. For example, both BCG 30–40 and LCG 80–90 have the smallest amount of node-edge connections, indicating low amount of census gene co-expression similarities despite having some of the highest amount of genes, 363 and 317 genes respectively (Table 5.3). This is reflected in their respective OGD graphs wherein the distributions are in a smaller range (x < 150 and x < 1200, respectively) in comparison to other age groups (x < 40000). This is quite apparent when comparing higher-frequency genes 44 Table 5.5: Identified Census Genes in Structurally Shared Orbits in Lung Cancer Networks. Orbit 10 Genes Counts TSC2* 12991 KCNJ5* 12695 SND1* 12649 CUX1* 12644 FAM131B** 12510 ARHGEF10** 12385 Lung Cancer 30 to 40 Age Group Orbit 3 Orbit 5 Orbit 7 Genes Counts Genes Counts Genes Counts CHD2** 927 CHD2** 2120 MYH11* 16726 ARHGEF10** 911 TRAF7* 2055 PTPRT* 16726 TRAF7 831 AKAP9** 2044 PTPN13* 16637 AKAP9** 825 SETBP1* 2025 KCNJ5* 16489 KCNJ5* 779 ARHGEF10** 1974 SETBP1* 16334 PTPN13* 764 TRAF7* 15886 Orbit 10 Genes Counts MYO5A* 3898 PDGFRA* 3882 COL3A1** 3499 DDR2* 3090 ZEB1** 3088 BTK* 3024 NIN* 3011 FLI1* 2828 TCF12* 2591 IL7R* 2495 SFRP4* 2457 IKZF1* 2403 LCP1** 2306 POU2AF1* 2082 PPFIBP1* 2055 PTPRC* 1998 FBLN2** 1973 PRRX1* 1955 NOTCH2* 1757 Lung Cancer 40 to 50 Age Group Orbit 3 Orbit 5 Orbit 7 Genes Counts Genes Counts Genes Counts MYO5A* 154 MYO5A* 19706 MYO5A* 7883 PDGFRA* 149 ZEB1** 14506 ZEB1** 5095 ZEB1** 132 NIN* 13917 NIN* 4504 COL3A1** 126 PDGFRA* 13434 PDGFRA* 3727 DDR2* 115 FLI1* 11944 IKZF1* 3386 NIN* 115 IKZF1* 10869 FLI1* 3190 BTK* 111 IL7R* 10720 IL7R* 2724 FLI1* 108 DDR2* 10690 DDR2* 2688 TCF12* 100 BTK* 10387 BTK* 2335 IKZF1* 98 COL3A1** 9924 TCF12* 2196 IL7R* 93 TCF12* 9665 COL3A1** 2138 SFRP4* 81 PRRX1* 9085 PRRX1* 2033 LCP1** 78 PPFIBP1* 8046 VAV1** 1798 POU2AF1* 78 PRDM1* 8027 PTPRC* 1776 FBLN2** 74 PTPRC* 7888 MAP3K1* 1738 PPFIBP1* 74 LCP1** 7852 PPFIBP1* 1722 PRRX1* 74 MAP3K1* 7498 PRDM1* 1655 PTPRC* 72 NOTCH2* 7496 CBFA2T3* 1439 VAV1** 69 VAV1** 7272 LCP1** 1429 Orbit 10 Genes Counts FANCA* 135 NFE2L2* 116 FAM131B** 110 CUL3** 109 APC* 103 POLQ* 103 RANBP2* 95 KMT2D* 89 RFWD3** 87 ELK4* 84 RNF213* 83 ZFHX3* 81 PTPRK* 74 HOXA13* 73 KMT2C* 71 Lung Cancer 80 to 90 Age Group Orbit 3 Orbit 5 Orbit 7 Genes Counts Genes Counts Genes Counts CUL3** 10 ZFHX3* 284 ZFHX3* 1062 FANCA* 10 FOXO1* 192 FANCA* 933 RANBP2* 10 FANCA* 190 RANBP2* 775 FAM131B** 9 RANBP2* 188 FOXO1* 710 NFE2L2* 9 MLLT10* 183 MLLT10* 706 APC* 8 AFF4* 182 IDH1* 704 ELK4* 8 COL3A1** 174 TET2* 688 HOXA13* 8 HSP90AB1* 163 RFWD3** 661 POLQ* 8 IDH1* 148 CDC73* 654 EP300* 7 TET2* 147 ETNK1* 621 KMT2D* 7 CDC73* 141 CUL3** 615 RFWD3** 7 CUL3** 135 CBLB* 583 RNF213* 7 CBLB* 130 AFF4* 579 SMAD3* 7 POLE* 130 APC* 577 ZFHX3* 7 ETNK1* 115 POLQ* 569 TPR* 567 Orbit 2 Genes Counts CHD2** 726 ARHGEF10** 685 AKAP9** 660 TRAF7* 654 SETBP1* 636 Orbit 2 Genes Counts MYO5A* 792 ZEB1** 609 NIN* 551 PDGFRA* 517 IKZF1* 463 FLI1* 453 DDR2* 413 IL7R* 403 BTK* 385 COL3A1** 370 TCF12* 365 PRRX1* 332 VAV1** 309 PTPRC* 306 PPFIBP1* 304 MAP3K1* 294 PRDM1* 288 LCP1** 273 CBFA2T3* 265 Orbit 2 Genes Counts ZFHX3* 84 FANCA* 68 RANBP2* 68 FOXO1* 63 AFF4* 62 MLLT10* 62 COL3A1** 61 HSP90AB1* 60 CUL3** 56 IDH1* 53 TET2* 53 CDC73* 52 CBLB* 51 POLE* 51 ETNK1* 49 in Tables 5.4 and 5.5 between low-connectivity and high-connectivity graphs. For example, BCG 30–40’s MITF* has a count of 150 in Orbit 5, which the highest appearing census gene in the graph. In comparison, BCG 20–30’s MAP2K4* has a count of 822 in Orbit 2, which is the lowest appearing census gene in the graph. Likewise, LCG 80–90’s ZFHX3* has a count of 1062 in Orbit 7, which the highest appearing census gene in the graph, but while LCG 40–50’s VAV1** has a count of 69 in Orbit 3, which is the lowest appearing census gene, its highest 45 appearing gene is MYO5A* with 19,706 counts in Orbit 5 in the graph. Following the investigation of the common orbits for high-frequency genes, all highfrequency genes have been identified in Tables 5.4 and 5.5 in order of their orbit appearances. There are some observations noted from that table, namely that there appears to be certain genes that are present in across all orbits and age groups of individual cancers: • BCG 20–30:, KMT2D* and PRDM2** appears in orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 7, Orbit 2), CCR4** appears in only orbits (Orbit 10, Orbit 5, Orbit 7, Orbit 2). Furthermore, there is more genetic variation across its orbits in comparison to other graphs. As such, these census genes are all strongly presented in their respective orbits. This is indicative of the high amounts of correlative GE between census genes for this age group. • BCG 30–40 age group, only HOXC11* appears in most orbits, orbits (Orbit 10, Orbit 3, Orbit 7, Orbit 2). For the remainder, they show weakly presented in their respective orbits. There is also a lot of variation in census genes across all common orbits. This shows that there are fewer amounts of similar correlative GE. • For BCG 90–100 age group, NSD1*, MTCP1*, MED12*, and USP8* appear across all orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 7, Orbit 2), while KMT2A* appears in only orbits (Orbit 10, Orbit 5, Orbit 7, Orbit 2). This is an interesting finding as it shows that these could be commonly associated genes for this graph. In terms of genetic variation, there appears to be very little across some orbits. Again, another strong showing for the rest of the census genes being present in their respective orbits with high amounts of correlative GE between them. • For LCG 30–40 age group, ARGHGEF10** and TRAF7* appear in only orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 2) and (Orbit 3, Orbit 5, Orbit 7, Orbit 2), respectively. Regardles of just these two common genes, there isn’t a lot of genetic variation either, namely due to a smaller amount of genes present in the graph. Again, another strong showing for the rest of the census genes being present in their respective orbits with high amounts of correlative gene expression between them. • For LCG 40–50 age group, MYO5A*, PDGFRA*, COL3A1*, DDR2*, ZEB1*, BTK*, NIN*, FLI1*, TCF12*, IL7R*, IKZF1*, LCP1**, PPFIBPI*, PRRX1*, BTK* and PTPRC* all appear across orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 7, Orbit 2). This is another interesting finding that could very well showcase a common set of genes unique for this graph. In addition, there is not a lot of genetic variation across orbits here. • For LCG 80–90 age group, FANCA*, CUL3**, RANPB2*, and ZFHX3* appear across all orbits (Orbit 10, Orbit 3, Orbit 5, Orbit 7, Orbit 2). Very much like BCG 20–30, the 46 amount of genetic variation here is more than all other LCG graphs, showing that there is a low amount of correlative GE across the graph. Despite this, there still is a strong notion that a unique group of census genes has been associated with this graph. Looking closer at the results from each age group, there are commonly occuring genes that seem highly associated with other genes. Notable examples can be found in both BCG and LCG graphs, namely KMT2D* found in 20–30 and MYO5A* found in 40–50, respectively: • KTM2D: The KMT2D gene provides instructions for making an enzyme called lysine (K)-specific methyltransferase 2D that functions as a histone methyltransferase [29]. Histone methyltransferases are complex enzymes that modify histones: structural proteins that bind to DNA giving form to coiled structures called nucleosomes [19, 29]. Histones are needed to both compact and protect DNA, preventing unnecessary damage and entanglement to the genetic material. Being coiled prevents DNA from gene expression to occur as DNA transcription protein mechanism cannot bind to it. The histones must release a section of DNA exposed in order for gene expression to occurs. KMT2D can do so by methylating the H3 histone [19]; that is to say, adding a methyl group to the H3 histone of the nucleosome complex [29]. It is thought that KMT2D has this ability to regulate gene expression of certain genes as a result of its abilities, more specifically in the role of a tumour suppressor gene assisting in the prevention of uncontrolled cellullar growth and proliferation. With this in mind, it can be hypothesized from the results showcasing KMT2D with the highest degree of gene expression correlative association that the regulation of this gene can affect the regulation of other genes. • MYO5A: The MYO5A gene provides instructions for making an enzyme called myosin Va that functions as an intracellular motor transport protein [15, 30]. Motor transport proteins are by and large responsible for transporting/anchoring intracellular components such as vesicles, organelles, and mRNA to various locations within the cell along actin filaments, a major component of the cellular cytoskeletal structure [15, 18, 30]. Positioning and transportation of such components gives credence to a cell’s ability to not only move crucial materials between organelles, but there is increased evidence to support that it is involved in restructuring of a cell’s shape through the reorganizing of actin filaments [15, 18]. Being identified as a key oncogene in many cancers, including non-small cell lung cancer and laryngeal squamous cell carcinoma, overexpression of MYO5A has allowed for cellular proliferation and metastatis of cancerous cells [15, 18, 30]. With this in mind, it can be suggested from the results showcasing MYO5A with the highest degree of gene expression correlative association that it could have an effect on regulating expression of other genes to ensure metastasis occurs. 47 Overall, each and every age group has its own unique set of genes. This is especially apparent in graphs LCG 40–50, LCG 80–90, and BCG 90–100. In future studies, more work should be done to determine alternative methods to draw more meaningful metrics. Regardless, this is a rather interesting investigation that could potentially be used to further unravel how these genes are actually influencing each other and the cancer. Furthermore, these unique genes have been found in the common orbits shared across all age groups. In terms of graph structure, there must be some sort of similiarities in gene regulatory control. Moving forward, there is promise in assessing the composition of census genes that are present in higher abundance in common orbit clusters of differing graphs. This can be used to further determine and isolate core groups of influential census genes each cancer age group utilizes for growth and survival. 48 Chapter 6 Discussion and Overview It is the ambition of this thesis to highlight the potential use of graphlet analysis to identify and characterize cancer graphs in outlining involved genes and in comparing cancer genomic characteristics. Providing a general overview, as was shown in this study, of a cancer’s composition of unique group of census genes could be beneficial in furthering oncological research focusing more on gene clustering based on their expression levels and control. It should be made clear that this thesis project is very much exploratory. COSMIC has been an invaluable resource for this project, allowing the comparison of not only different cancer types (breast and lung) but with different age groups between them. However, there is a lot more to cancer research than knowing a patient’s age. Following this thesis project, it would be beneficial to evaluate the observed census genes in terms of their hallmarks, if confirmed and identified. COSMIC provides the (CGCHOC) data set wherein curated census genes are updated with known cancer hallmarks. Investigating the cancer hallmarks of census genes in those common orbit clusters will provide an awareness of how the generated cancer graphs are behaving in those particular age groups. COSMIC also provides additional histological context to cancer types, allowing to focus more on specific tissues within larger organs and the differing stages of cancer development. This study did not fully inspect the composition of census genes in orbits specific to graphs as the main objective was to compare common internal structures between cancer graphs. It would be advantageous to fully explore the unique orbit clusters of separate graphs to unearth potentialy genetic implications that could very well be individualistic to said graphs. Lastly, the co-expression graphs generated from census gene expression data does not yield any information regarding how each gene influences or distrupts the expression of another. This is easily represented with the use of directed graphlet analysis where edges are weighted vectors to and from given nodes. Representing these vectors would be guided by oncological literature where findings of a census gene is confirmed to affect the expression levels of other census genes, mutated or not. Such a analysis would further enrich hidden structural dependencies by illustrating how that dependency 49 is initiated towards a cancerous state. Another contribution this study provides is a preliminary graphlet algorithm that could see future use in the NetworkX [12]. Currently, this algorithm is parallelized and could be easily implemented and operated in computationally distributed environments to improve processing power for massive graphs. As such, this algorithm could potentially be integrated into NetworkX’s growing list of graph features, metrics, and analysis. Future improvements of the algorithm will require a revisit in data collection, processing, and deployed in a server database. Currently, all used data sets need to be separately downloaded from COSMIC and store initial graphlet analysis as Comma Delimited Values (CSV) files. Having the data be stored initialy into a relationanl database management system such as PostgreSQL on a server would allow for data streaming and serverless computing. In this study, Dask dataframes were used to simply store all the raw, unfiltered CSV files as a Pandas’ DataFrame could not store over 40 gigabytes of data; this is due to RAM being limited which a Pandas DataFrame uses, while a Dask DataFrame may live on any disk space for larger-than-memory computing on a single machine. Filtered data from Dask DataFrame is stored in a Pandas DataFrame, which improves further analysis. However from there leads to a weak and core component of the software itself; it does not fully utilize data structures with improved search performance such as Pandas’ DataFrame and relies mainly Python’s standard dictionary data structure. To leverage the abilities of Dask and Pandas, Apache Spark would be an appropriate replacement. Apache Spark, or Spark for short, is an open-source parallel processing, multi-language framework that supports in-memory procesing to boost the data analysis performance. Just like Pandas, Spark is able to achieve similar performance levels by using RAM to store immediate data rather than on-disk memory. It is also able to run data processing steps in parallelized batches across a cluster of computers, streaming data to and from each machine. This would handle massive datasets with issue. Spark also has a graph network tool called GraphX which is used to create, study, and modify graphs the same way that NetworkX does. It would be worthwhile to explore the implementation of the proposed algorithm with GraphX and its inherinted ability to handle massive amounts of data, resulting in a step towards graph network analysis at an industry level. There was no attempt to conduct false discovery experiments for this thesis. As mentioned in the beginning of Chapter 2, the correctness of the graphlet Algorithm 1 was verified by checking that it had similar results to Das et al.’s algorithm by computing their sample graph [1, 5]. Das et al. themselves conducted false discovery experiments to confirm that their algorithm did not generate clusters from nonsensical graph data. We were confident from our matching results that Algorithm 1 would be able to provide biological significance from analyzing the BCGs and LCGs generated. However, future studies should investigate how well Algorithm 1 can conduct graphlet analysis with large and small graphs. 50 From identifying census gene, to matching ten’s of thousands of mRNA expression data, and then generating co-expression graphs, dictionaries have been used almost exclusively due to personal familiarity. Algorithmic performance did benefit from the use of multicore/parallel processing, however it is not fully extended for certain aspects. Currently, parallelism is extended to each graphlet, however this implementation did not take into account for massive subgraph lists to be stored in memory. There have been cases when a list of subgraphs for a particular graphlet analysis would be so large that its processing runtime would exceed the runtime of other subgraphs lists. Further refining the multiprocessing capabilities to accomodate maximizing available core resources would further reduce the software’s overall runtime. Implementing the ability to have a single subgraph list to be shared amongst all graphlet processes running in parallel would help not only reduce the time needed for NetworkX to generate a new list for every process, but also reduce the memory footprint by simply having a single list stored in RAM. Lastly, another computational issue arose when exploring up to 5-node graphlets. This is why graphlets G9–G29 have been omitted from this research venture as the analysis of potential subgraphs resulted in extended periods of server runtime that lead to the server seizing up. This could very well be a limitation to the power of graphlet analysis, hence why future research lead by Pržulj et al. [22] and Das et al. will deal with minimizing orbit redundancies and using directed graphlet analysis algorithms for improving the field of graph analysis [1, 5]. 51 ni : Node of a subgraph n j : Neighbouring node(s) to ni nk : Neighbouring node(s) to n j nl : Node of a graphlet (ni ,n j ) : edge connecting ni and its neighbour n j E : Number of edges between neighbours n j to ni NN : Number of neighbours ni to n j NNE : Number of edges neighbours nk to n j NNN : Number of neighbours nk to n j Ggraphlet : Graphlet of G Gsubgraph : Subgraph of G GOM : Graphlet Orbit Matrix orbitchecksum: Orbit Checksum nx: NetworkX package object INPUT: list(Gsubgraph ), dict(Ggraphlet ) 2 OUTPUT: GOM for G 3 Shared Array: GOM; 4 for each Ggraphlet ∈ dict(Ggraphlet ) do IdentifyGraphlets(Ggraphlet , list(Gsubgraph ), nx) in parallel 5 end Algorithm 4: Revised graphlet and orbit counting algorithm: Part I. 1 52 Procedure IdentifyGraphlets(Ggraphlet , list(Gsubgraph ), nx) for Gsubgraph ∈ list(Gsubgraph ) do 2 if nx.connected (Gsubgraph ) AND nx.isomorphic(Gsubgraph , Ggraphlet ) then 3 ValidateOrbitChecksums(Gsubgraph , Ggraphlet ) 4 end end end 1 Algorithm 5: Revised graphlet and orbit counting algorithm, Part II [1]. Procedure ValidateOrbitChecksums(Gsubgraph , Ggraphlet ) for ni ∈ Gsubgraph do E ← len(subgraph.edges(node))) 3 NN ← len(subgraph[node]) 4 5 NNE ← 0 NNN ← 0 6 for n j ∈ Gsubgraph do 7 NNE ← NNE + |g.edges(n j )| 8 end g[ni ][orbitchecksum] ← E + (NNE − 1) 9 for n ∈ Ggraphlet do 10 11 if ni [orbitchecksum] == n [orbitchecksum] then [ni ][orbit ] ← n [orbit ] 12 end end end end 1 2 Algorithm 6: Revised graphlet and orbit counting algorithm: Part II. 53 Bibliography [1] Das A., Efficient enumeration of small graphlets and orbits., Master’s thesis, University of Northern British Columbia, Prince George, British Columbia, Canada, 2020. [2] F. Collins and A. Barker, Mapping the cancer genome. pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies, Scientific American 296 (2007), 50–57. [3] A.I. Compos and J.A. Freye-Gonzalez, Evolutionary constraints on the complexity of genetic regulatory networks allow predictions of the total number of genetic interactions., Sci Rep 9 (2019), 3618. [4] M.R.T. Dale, Applying graph theory in ecological research, Cambridge University Press, 2017. [5] A. Das, A. Aravind, and M. Dale, Algorithm and application for signed graphlets, (2019), 613–620. [6] A. Das, A. Aravind, and G.S. Deo, Signed graphlets based gene expression analysis, (2019), 2055–2062. [7] A. Das, M. Drakos, A. Aravind, and D. Horning, Water governance network analysis using graphlet mining, (2020), 633–640. [8] D. Davis et al., Topology-function conservation in protein-protein interaction networks., Bioinformatics 31 (2015), 1632–1639. [9] S. Doria-Belenguer, M.K. Youssef, N. Böttcher R., Malod-Dognin, and N. Pržulj, Probabilistic graphlets capture biological function in probabilistic molecular networks, Bioinformatics 36 (2020), no. Supplement 2, i804–i812. [10] S.A. Forbes, D. Beare, H. Boutselakis, S. Bamford, N. Bindal, J. Tate, C.G. Cole, S. Ward, L. Dawson E, Ponting, R. Stefancsik, B. Harsha, C.Y. Kok, M. Jia, H. Jubb, Z. Sondka, S. Thompson, T. De, and P.J. Campbell, COSMIC: somatic cancer genetics at highresolution, Nucleic Acids Research 45 (2016), no. D1, D777–D783. [11] J-L. Gailly, J. Meyering, and P. Eggert, Gnu operating system: Gnu gzip., 2004. [12] A. Hagberg, D. Schult, and P. Swart, Networkx: Network analysis in python., 2004. 54 [13] D. Hanahan and R.A. Weinberg, Hallmarks of cancer: the next generation., 2011 144 (2011), 646–674. [14] T.J. Hudson, W. Anderson, z A. Arte, A.D. Barker, C. Bell, R.R. Bernabe, M.K. Bhan, F. Calvo, I. Eerola, et al., International network of cancer genome projects., Nature 464 (2010), 993–998. [15] L. Lan, H. Han, H. Zuo, Z. Chen, Y. Du, W. Zhao, J. Gu, and Z. Zhang, Upregulation of myosin va by snail is involved in cancer cellmigration and metastasis., Int. J. Cancer 126 (2010). [16] P. Langfielder and S. Horvath, Wgcna: an r package for weighted correlation network analysis., BMC Bioinformatics 559 (2008). [17] T. Leighton and M. van Dijk, Massachusetts institute of technology opencouseware 6.042j/18.062j: Mathematics for computer science, https://ocw.mit.edu/, 2010, License: Creative Commons BY-NC-SA. [18] X. Liang, Z. Wu, S. Shen, Y. Niu, Y. Guo, J. Liang, and W. Guo, Linc01980 facilitates esophageal squamous cell carcinoma progression via regulation of mir-190a-5p/myo5a pathway, Archives of Biochemistry and Biophysics 686 (2020). [19] L. Mariño-Ramírez, M.G Kann, B.A. Shoemaker, and D. Landsman, Histone structure and nucleosome stability., Expert Rev Proteomics 5 (2005), 719–729. [20] T. Milenkovic et al., Uncovering biological network function via graphlet degree signatures., Cancer Informatics 8 (2008), 257–273. [21] M. Mina, F. Raynaud, D. Tavernari, E. Battistello, S. Sungalee, S. Saghafinia, T. Laessle, F. Sanchez-Vega, N. Schultz, E. Oricchio, and G. Ciriello, Conditional selection of genomic alterations dictates cancer evolution and oncogenic dependencies, Cancer Cell 32 (2017), no. 2, 155–168.e6. [22] N. Pržulj, Biological network comparison using graphlet degree distribution, Bioinformatics 23 (2007), no. 2, e177–e183. [23] Uddin R. and S.M. Singh, Gene network construction from microarray data identifies a key network module and several candidate hub genes in age-associated spatial learning impairment., Front. Syst. Neurosci. 75 (2017). [24] N. Shervashidze, S.V.N. Vishwanathan, T.H. Petri, K. Mehllorn, and K.M. Borgwardt, Efficient graphlet kernels for large graph comparison., Journal of Machine Learning Research 5 (2009), 488–495. [25] H.A. Shihab, M.F. Rogers, J. Gough, M. Mort, D.N. Cooper, I.N.M. Day, T.R. Gaunt, and C. Campbell, An integrative approach to predicting the functional effects of non-coding and coding sequence variation, Bioinformatics 31 (2015), no. 10, 1536–1543. 55 [26] Z. Sondka, S. Bamford, C.G. Cole, S.A. Ward, I. Dunham, and S.A. Forbes, The cosmic cancer gene census: describing genetic dysfunction across all human cancers., Nat. Rev. Cancer 18 (2018), 696–705. [27] Y. Sun, X. Liang, and K. Loparo, A common gene expression signature analysis method for multiple types of cancer, 2019, pp. 200–211. [28] J.G. Tate, S. Bamford, H.C. Jubb, Z. Sondka, D.M. Beare, l N. Binda, H. Boutselakis, and C.G. Cole, Cosmic: the catalogue of somatic mutations in cancer, Nucleic Acids Research 47 (2018), no. D1, D941–D947. [29] E. Tonska, H.U. Osmanbeyoglu, P. Castel, C. Chan, Hendrickson R.C., M. Elkabets, M.N. Dickler, M. Scaltriti, C.S. Leslie, S.A. Armstrong, and J. Baselga, Pi3k pathway regulates er-dependent transcription in breast cancer through the epigenetic regulator kmt2d, Science 355 (2017), 1324–1330. [30] H. Velvarska and D. Niessing, Structural insights into the globular tails of the human type v myosins myo5a, myo5b, and myo5c., PLoS One. 8 (2013). [31] B. Vogelstein, N. Papadopoulos, V.E. Velculescu, S. Zhou, L.A. Diaz, and K.W. Kinzler, Cancer genome landscapes, Science 339 (2013), no. 6127, 1546–1558. [32] B.A. Weir, M.S. Woo, G. Getz, Perner S., R. Beroukhim, W.M. Lin, M.A. Province, A. Kraja, et al., Characterizing the cancer genome in lung adenocarcinoma., Nature 450 (2007), 941–947. [33] Wellcome Sanger Institute, The catalogue of somatic mutations in cancer. [34] M.C. White, D.M. Holman, J.E. Boehm, L.A. Peipins, M. Grossman, and S.J. Henley, Age and cancer risk: a potentially modifiable relationship., Am. J. Prev. Med. 46 (2014). [35] L. Yin, Z. Cai, B. Zhu, and C. Xu, Identification of key pathways and genes in the dynamic progression of hcc based on wgcna, Genes 9 (2018), no. 2. 56 Appendix A Appendix A.1 Figures and Diagrams Some of the figures found in Chapters 5 are sized in such a way that does not make the data easily legible. They are resized and placed individually in the appendix to better read the data they are presenting. 57 Figure A.1: Full-sized graph of 20–30 age group breast carcinoma (see Figure 5.1). 58 Figure A.2: Full-sized graph of 30–40 age group lung carcinoma(see Figure 5.1). 59 Figure A.3: Full-sized graph of 30–40 age group breast carcinoma (see Figure 5.2). 60 Figure A.4: Full-sized graph of 40–50 age group lung carcinoma (see Figure 5.2). 61 Figure A.5: Full-sized graph of 90–100 age group breast carcinoma (see Figure 5.3). 62 Figure A.6: Full-sized graph of 80–90 age group lung carcinoma (see Figure 5.3). 63 Figure A.7: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 20–30 age group breast cancer data (see Figure 5.4). 64 Figure A.8: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 30–40 age group lung cancer data (see Figure 5.4). 65 Figure A.9: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 30–40 age group breast cancer data (see Figure 5.5). 66 Figure A.10: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 40–50 age group lung cancer data (see Figure 5.5). 67 Figure A.11: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 90–100 age group breast cancer data (see Figure 5.6). 68 Figure A.12: Full-sized cluster map of the graphlet correlation matrices (GCM) for the 80–90 age group lung cancer data (see Figure 5.6). 69