ADVANCING WATER QUALITY PREDICTION THROUGH INTEGRATING MACHINE LEARNING WITH DATA AUGMENTATION: A CASE STUDY FOR FIRST NATIONS COMMUNITIES IN BRITISH COLUMBIA

by

Anqi Chen

B.Sc., Wenzhou University, 2021

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN NATURAL RESOURCES AND ENVIRONMENTAL STUDIES

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

April 2024

© Anqi Chen, 2024

Abstract

Access to clean drinking water is essential for public health, yet it remains a scarce resource for Indigenous communities in rural and remote areas. This research reports a new method, based on data augmentation and machine learning algorithms, for predicting iron and manganese in drinking water in BC’s First Nations communities. GAN-based modelling and NI-BS-NI-based modelling were developed to investigate how different data augmentation methods and predictors affect iron and manganese prediction results. Reliable synthetic data was obtained through both data augmentation methods, allowing 4 machine learning algorithms to predict iron and manganese using 3 and 5 physical properties, respectively. Compared with the RF, XGB, and DT machine learning models, the GBR model showed the strongest fitting ability and the most accurate predictions of iron and manganese for both NI-BS-NI-based modelling and GAN-based modelling, with the train R² and test R² of the two models nearing 1 and all RMSE scores below 0.06. The decision-making tool developed using GAN technology is considered to have greater application potential because it provides accurate predictions while requiring only 3 input physical parameters.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgement
Chapter 1 Introduction
Chapter 2 Literature Review
  2.1 Overview of Canada's First Nations
    2.1.1 History and current situation
    2.1.2 First Nations communities in BC
    2.1.3 Drinking water quality and security
  2.2 Heavy metals: Iron and Manganese in Drinking Water
    2.2.1 Sources of iron and manganese
    2.2.2 Impact of iron and manganese
    2.2.3 Current detection approaches
  2.3 Machine learning overview
    2.3.1 Models introduction
    2.3.2 Application of the machine model in water quality
    2.3.3 Interpretability of machine learning
  2.4 Data augmentation (DA)
    2.4.1 Numerical Interpolation
    2.4.2 Bootstrapping
    2.4.3 Noise Injection
    2.4.4 GAN
  2.5 Discussion and conclusion
Chapter 3 Materials and Methods
  3.1 Data processing and methods
    3.1.1 Data sources and indicators
    3.1.2 Data preprocessing
    3.1.3 NI-BS-NI based Data Augmentation
    3.1.4 GAN based Data Augmentation
  3.2 Machine Learning Modelling
    3.2.1 Model development
    3.2.2 Model hyperparameter optimization
    3.2.3 Model evaluation
  3.3 Model interpretability analysis
    3.3.1 Impact magnitude of predictors
    3.3.2 Interactive effects of predictors
    3.3.3 Impact direction of predictors
Chapter 4 Results and Discussion
  4.1 NI-BS-NI Based Modelling
    4.1.1 Statistical analysis
    4.1.2 Model results
    4.1.3 Interpretable analysis
  4.2 GAN Based Modelling
    4.2.1 Statistical analysis
    4.2.2 GAN sample generation
    4.2.3 Model results
    4.2.4 Interpretable analysis
  4.3 Graphical user interface
Chapter 5 Conclusions
  5.1 Research summary
  5.2 Limitations and future research
References

List of Tables

Chapter 3
  Table 3.1 Details of water quality physical parameters
Chapter 4
  Table 4.1 Statistical summary of raw data
  Table 4.2 Statistical summary of NI-BS-NI augmented data
  Table 4.3 Performance of NI-BS-NI based models and best hyperparameters
  Table 4.4 Performance of GAN based models and best hyperparameters for Fe prediction
  Table 4.5 Performance of GAN based models and best hyperparameters for Mn prediction

List of Figures

Chapter 2
  Figure 2.1 Distribution of FNs communities in Canada
  Figure 2.2 Structure of Random Forest
Chapter 3
  Figure 3.1 Structure of GAN
  Figure 3.2 The 3-fold cross validation diagram
  Figure 3.3 Diagram of five-fold cross-validation
Chapter 4
  Figure 4.1 Comparison boxplots of raw data and NI-BS-NI augmented data
  Figure 4.2 Pearson Correlation Heatmap of (a) raw data (b) NI-BS-NI augmented data
  Figure 4.3 Scatter regression plots of (a) Fe (b) Mn of DT
  Figure 4.4 Scatter regression plots of (a) Fe (b) Mn of RF
  Figure 4.5 Scatter regression plots of (a) Fe (b) Mn of GBR
  Figure 4.6 Scatter regression plots of (a) Fe (b) Mn of XGB
  Figure 4.7 Feature importance of GBR model after NI-BS-NI augmentation
  Figure 4.8 Partial Dependence Plots of XGB model
  Figure 4.9 Two-way Partial Dependence Plots of XGB model
  Figure 4.10 SHAP summary plot of Fe of GBR model
  Figure 4.11 SHAP scatter diagram of Fe of GBR model
  Figure 4.12 SHAP summary plot of Mn of GBR model
  Figure 4.13 SHAP scatter diagram of Mn of GBR model
  Figure 4.14 Pearson Correlation Heatmap of Fe in GAN
  Figure 4.15 Pearson Correlation Heatmap of Mn in GAN
  Figure 4.16 Iron-cross-validation error and best error during loop iteration
  Figure 4.17 Manganese-cross-validation error and best error during loop iteration
  Figure 4.18 Scatter regression plots of (a) RF; (b) XGB; (c) GBR; (d) DT for Fe prediction
  Figure 4.19 Scatter regression plots of (a) RF; (b) XGB; (c) GBR; (d) DT for Mn prediction
  Figure 4.20 Feature importance of GBR model after GAN augmentation
  Figure 4.21 Partial Dependence Plots of GBR model
  Figure 4.22 Two-way Partial Dependence Plots of GBR model
  Figure 4.23 SHAP summary plot of GBR model after GAN
  Figure 4.24 SHAP scatter diagram of GBR model after GAN
  Figure 4.25 Graphical user interface with 5 parameters
  Figure 4.26 Graphical user interface with 3 parameters

Glossary

ANN       Artificial neural network
AO        Aesthetic objective
BC        British Columbia
BP-ANN    Backpropagation Artificial Neural Networks
BPNN      Back Propagation Neural Network
DA        Data augmentation
DT        Decision Tree
FNs       First Nations
GAN       Generative Adversarial Network
GBR       Gradient Boosting Regression
GRNN      Generalized Regression Neural Networks
KNN       K-Nearest Neighbors
MAE       Mean absolute error
ML        Machine learning
MLR       Multiple Linear Regression
NI-BS-NI  Numerical Interpolation-Bootstrapping-Noise Injection
NN        Neural Network
PM        Particulate matter
R²        Coefficient of determination
RF        Random Forest
RMSE      Root mean square error
SHAP      SHapley Additive exPlanations
SRR       Small, Rural, and Remote
SVM       Support Vector Machine
SWDNs     Small water distribution networks
TDS       Total dissolved solids
WQC       Water quality class
WQI       Water quality index
XGB       Extreme Gradient Boosting

Acknowledgement

First and foremost, I am profoundly thankful to my supervisor, Dr. Jianbing Li, for his invaluable guidance and feedback throughout my master's research at the University of Northern British Columbia. He is a scholar of great academic ability and a most patient and responsible mentor. My sincere thanks go to Dr. Li again.

I extend my sincere appreciation to my co-supervisor, Dr. Min Zhao, and my committee members, Dr. Oliver Iorhemen and Dave Tamblyn, for their valuable feedback and constructive criticism of this thesis. Their expertise and guidance have greatly enriched the quality of my research.

I am grateful to the members of my research group, Min Xie, Sorour Nasimi, and Mostafa Dorosti, for their collaboration and support. I am lucky to have had the opportunity to work alongside such talented individuals. I would like to extend a special thank you to Cheng Lu, whose mentorship and friendship have been a source of inspiration and support.
Lastly, I am deeply grateful to my friends and my best team members, Dixuan Li, Jianliang Mao, Qing Guo, Wanhua Shen, and Wei Deng, for their love, encouragement, and support. To my mom and dad, thank you for your endless encouragement and for always believing in me. Your simple yet powerful message, that happiness is everything, has kept me grounded and focused on what truly matters.

Chapter 1 Introduction

Ensuring the availability of safe drinking water is an essential concern for both public health and overall development (World Health Organization, 2023). Universal access to secure water and sanitation infrastructure remains an ongoing challenge, primarily because of historical inequities and the marginalization of particular demographic groups (Brown et al., 2023). The current scarcity of drinking water resources is mainly manifested in the contamination of drinking water sources, uneven distribution of water resources, over-exploitation of groundwater, inadequate sanitation facilities, and susceptibility to extreme weather events and changes in water circulation patterns (Patrick et al., 2019).

Many Indigenous communities live in remote regions and therefore face comparatively greater challenges in accessing clean drinking water than residents of urban areas (Balasooriya et al., 2023). In Canada, for example, the disparity is significant: Indigenous households are 90 times more likely to lack access to clean drinking water than non-Indigenous households (Wolfe, 2006; Balasooriya et al., 2023). These vulnerabilities do not stem from insufficient capacity or lack of interest within the communities, but are rather an outcome of the structural frameworks inherited from the colonial state (Baijius & Patrick, 2019; Wolfe, 2006).

The quality of drinking water is a critical determinant of public health, and the presence of metals can significantly influence its overall safety and potability (Stride et al., 2023).
Indigenous communities are particularly susceptible to exposure to toxic metals (Balasooriya et al., 2023; Navarro-Espinoza et al., 2021). It is worth noting that trace elements in water, such as iron (Fe) and manganese (Mn), are essential nutrients that, in appropriate amounts, are required to maintain human metabolism. They are crucial for physiological metabolic processes and for the human nervous system, and an excess or deficiency can lead to health issues (Le Bot et al., 2016; Zoni & Lucchini, 2013). The methods commonly employed for quantifying metal concentrations in water rely predominantly on sophisticated laboratory instruments, such as Atomic Absorption Spectroscopy (AAS) and Inductively Coupled Plasma Mass Spectrometry (ICP-MS). Consequently, they are limited by experimental conditions, substantial instrument costs, and time consumption (Hu et al., 2019).

Machine learning (ML) is a powerful approach to data analysis. Numerous models have been developed using ML algorithms for the analysis of water quality and water security issues (Azrour et al., 2022). ML is expected to serve as a viable alternative to traditional water sampling, especially for water quality parameters that are difficult to measure (Chowdhury et al., 2009; Shahi et al., 2020). Water samples from small, rural, and remote (SRR) communities must be transported to analytical laboratories for the measurement of metal concentrations, which entails considerable investments of both time and money (Mian et al., 2020). Hence, the search for an effective and resource-efficient method to monitor drinking water quality in First Nations SRR communities is a pressing concern, and it is significant in addressing environmental injustices resulting from political and historical factors (Wolfe, 2006).

In this study, the performance of several tree-based ML models was systematically examined for predicting metal concentrations from water quality monitoring data collected from 2019 to 2020. The ML algorithms employed are described, followed by a comprehensive explanation of the model prediction outcomes to understand the impact of various predictors on water quality fluctuations. The subsequent sections provide a thorough discussion of the results for comparative analysis.

This study has two objectives. First, it aims to predict the levels of iron and manganese using data on the primary indicators employed in water quality assessment. Through hyperparameter optimization, the ML models' parameters are fine-tuned to enhance predictive accuracy. Second, it develops a graphical user interface in Python, using the optimal features and ML algorithms, to facilitate the prediction of iron and manganese content at sampling sites.

Chapter 2 Literature Review

2.1 Overview of Canada's First Nations

2.1.1 History and current situation

2.1.1.1 History

First Nations, Inuit, and Métis collectively fall under the term "Aboriginal" in Canada and are referred to as "Indigenous" globally (Woodcock, 1988). Typically, the term "First Nations" in Canada refers to communities residing south of the tree line, predominantly situated below the Arctic Circle (Assembly of First Nations, 2021). First Nations in Canada were historically referred to as "Indians"; in contemporary discourse, many groups prefer the term "First Nations" as a more accurate and respectful alternative (Indigenous Foundations, 2009). Members of First Nations typically identify with their specific nation, such as Mohawk, Cree, Oneida, and others, emphasizing their unique cultural affiliations (Government of Canada, 2017). Indigenous Peoples form unique social and cultural communities.
These communities have experienced historical displacement as a result of European colonial expansion (Kingsbury, 1998). They differ from the current dominant society in their cultural, economic, and political characteristics, owing to the distinct cultural and traditional knowledge that shapes their connections with the environment and society (United Nations, 2023). First Nations communities in Canada share a similar colonial history with the Indigenous peoples of America and Australia (Daley et al., 2018; Rowles III et al., 2020). The historical legacies of colonialism and exclusion have led to widespread challenges of poor governance and obstacles to resource access (Brown et al., 2023), including access to clean water, which is a basic human right (United Nations, 2015).

2.1.1.2 Population and distribution

In the 2021 census, the Indigenous population in Canada was enumerated at 1.8 million. This figure significantly surpasses the counts of First Nations people residing in both Australia and New Zealand (Statistics Canada, 2022). Presently, Canada acknowledges 617 officially recognized First Nations governments or bands, approximately 50% of which are located in the provinces of Ontario and British Columbia (BC) (Figure 2.1) (Government of Canada, 2017).

Figure 2.1 Distribution of FNs communities in Canada

2.1.2 First Nations communities in BC

This study uses water quality data from five FN communities distributed across various regions of BC, Canada, together with ML models, to predict metal content in drinking water in First Nations regions, especially SRR communities in BC. FN communities are typically located in small rural and remote areas, characterized by dispersed rural living patterns (Baijius & Patrick, 2019).

2.1.3 Drinking water quality and security

Even high-income countries such as Canada and the United States exhibit disparities in water access, stemming primarily from the scale and geographical distribution of drinking water, racial wealth disparities, identity, and institutionalized structures of marginalization (Meehan et al., 2020). The dispersed rural living patterns of FNs often result in inadequate infrastructure, which, together with aging water treatment equipment and distance from urban areas, contributes to the challenges in accessing clean drinking water. A significant number of individuals in FNs communities, especially SRR communities, lack access to an adequate supply of quality tap water in their households, even though tap water is the main source of drinking water in Canada. Nevertheless, the crisis of water insecurity faced by Indigenous families in Canada remains insufficiently investigated (Duignan et al., 2022). As of October 25, 2021, long-term drinking water advisories remained in effect in 31 communities, affecting 43 small water systems on First Nations reserves (Government of Canada, 2023). In British Columbia, 19 SRR First Nations communities faced three water quality advisories, eight boil water advisories (mainly due to E. coli contamination), and ten "do-not-consume" advisories as of September 2021 (McLeod et al., 2020; First Nations Health Authority, 2023). Several remote First Nations communities in British Columbia, Canada, have experienced issues such as poor aesthetic properties, the presence of coliforms, and elevated concentrations of metals (Hu et al., 2022).

A water infrastructure widely used in many First Nations communities is the concrete household water cistern. Each week, a truck travels from the water treatment plant through the community to fill each household's water tank (McLeod et al., 2014). Aging and undisinfected tanks and trucks can contaminate the drinking water.
Winter freezing and thawing events damage concrete household water tanks, allowing contaminants such as organic matter and rodents to enter and thereby compromising the safety of drinking water (Baijius & Patrick, 2019). Currently, residents of the Lheidli T’enneh First Nation (LTFN) rely on bottled water, which is replenished every two weeks (Islam & Yuan, 2018; Pang et al., 2021). In a press conference held on November 19, 2021, the LTFN emphasized the urgent need for federal funding to ensure access to clean drinking water, stating, "We need to have water which is safe. There is no alternative." (Prince George Citizen, 2021). This predicament not only necessitates greater governmental investment in water infrastructure but also imposes a substantial financial burden on public health protection (Li et al., 2021).

2.2 Heavy metals: Iron and Manganese in Drinking Water

Ensuring the safety of drinking water is essential, as it is the foundation for human survival, ecological well-being, and agricultural systems. It is among the most important factors in guaranteeing the proper functioning of human society (Schimpf & Cude, 2020). Common water pollutants can generally be classified by their nature into organic pollutants, inorganic pollutants, and microbial pollutants (Martin & Johnson, 2012). Organic pollutants encompass various organic compounds originating from agriculture, industrial emissions, and urban sewage, including dissolved organic matter, fats, proteins, and organic solvents, as well as volatile organic compounds (VOCs) from processes such as chemical manufacturing, petroleum refining, and printing. Microbial pollutants include bacteria, viruses, and parasites from sewage, livestock farming, and agricultural runoff, which can spread waterborne diseases.
Heavy metals are among the most significant inorganic pollutants, and certain heavy metals, such as mercury, lead, cadmium, copper, and nickel, along with their compounds, can cause serious harm.

2.2.1 Sources of iron and manganese

2.2.1.1 Natural factors

Iron and manganese are abundant in nature and occur naturally in the water supply as a result of catchment and erosion processes. Iron is the fourth most abundant element in the Earth's crust (Sun et al., 2023). Iron in the Earth's crust enters groundwater through the following three routes:

a. The oxide of divalent iron in the rock layer is converted into soluble iron by groundwater containing carbonic acid.
b. The oxide of trivalent iron is reduced to divalent iron, which is then dissolved into the groundwater by carbonic acid.
c. The large amount of organic matter in the underground environment, such as organic acids, can dissolve iron.

Manganese is likewise widely sourced and distributed in the environment (Teng et al., 2001), being present in almost all rocks. Manganese is released from native minerals by weathering and combines with oxygen-containing ions or molecules to form secondary minerals. When the soil environment is acidic (pH < 5.5), these secondary minerals dissolve into soluble manganese, some of which enters the water (Zhai et al., 2021).

2.2.1.2 Human action factors

However, with the rapid development of global industry, anthropogenic interventions have induced the accumulation and cycling of heavy metal elements in ecosystems.
Emissions of heavy metals from industrial processes pose substantial pollution hazards to the environment. These emissions include the waste gases, wastewater, and waste residues generated in various industrial activities, namely ore extraction, alloy smelting, leather manufacturing, electroplating, battery production, plastic manufacturing, ceramic firing, paper printing, fossil fuel combustion, and chemical textile processes (Lim & Aris, 2014; R. Singh et al., 2011; Yeganeh et al., 2023). When such waste is discharged untreated, heavy metals emitted to the atmosphere can settle into water bodies through precipitation and atmospheric deposition. Waste residue deposited in the soil also pollutes surface water and groundwater through surface runoff, soil erosion, seepage, and infiltration. Consequently, heavy metals continue to accumulate in aquatic ecosystems (Ayangbenro & Babalola, 2017). In certain situations, iron used in the manufacturing of pipes and faucet components within water supply systems may be released into tap water (Veschetti et al., 2010).

2.2.2 Impact of iron and manganese

2.2.2.1 Human health concerns

Heavy-metal-induced water pollution poses severe environmental challenges and hazards to the entire ecosystem. As persistent and toxic pollutants, heavy metals can be transmitted through the food chain into human bodies. Proteins and biocatalysts in human tissues can react with heavy metal ions entering the body, leading to aggregation and structural changes that result in loss of activity. Meanwhile, heavy metal ions continue to accumulate in the human body until their concentration reaches or exceeds the detoxification threshold of human organs (Briffa et al., 2020). This accumulation leads to pathological changes in human organs, causing acute or chronic poisoning and even cancer.
Additionally, heavy metals can migrate into animal and human bodies through respiration, contact, and other pathways, exerting irreversible toxic effects and causing functional damage (Gumpu et al., 2015b). Compared with other pollutants, heavy metals in water exhibit more pronounced latency, toxicity, and recalcitrance. They can enter the human body directly or indirectly through drinking and skin absorption, accumulating in organs such as the kidneys and liver (Jaishankar et al., 2014; Singh et al., 2023; Zhang et al., 2023).

Iron and manganese are indispensable elements in human physiological metabolism. However, excessive concentrations of these two elements lead to metabolic disorders and induce various diseases (Gumpu et al., 2015a; Valko et al., 2005). According to the Guidelines for Canadian Drinking Water Quality, the maximum concentration allowed in drinking water is 0.12 mg/L for manganese and 0.3 mg/L for iron (Health Canada, 2019). International drinking water standards also specify limits for iron and manganese: a total iron content of 0.3 mg/L and an allowable manganese concentration of 0.1 mg/L (World Health Organization, 2017). Persistent intake of water with excessive iron content can lead to chronic poisoning. Symptoms include significant iron deposits in the liver and spleen, and the poisoning may also result in osteoporosis, cirrhosis, coronary heart disease, diabetes, and reduced insulin secretion, thereby disrupting carbohydrate metabolism in the human body. Manganese is more toxic in its divalent state than in its trivalent state and can lead to conditions such as tremor paralysis, memory decline, and pneumonia.
Elevated manganese levels can also have adverse effects on the central nervous system, initially manifesting as neurasthenia and dysfunction of the autonomic nervous system and potentially developing into Parkinson's syndrome in the later stages, along with certain impacts on reproductive capacity and cognitive functions (Kim et al., 2022). Survey data suggest that workers in manganese mines are susceptible to severe mental disorders resembling schizophrenia. Additionally, cases of illness and fatalities have been reported among residents in the outskirts of Tokyo, Japan, who consumed well water contaminated with manganese.

2.2.2.2 Impact on drinking water

Metals such as lead, arsenic, copper, and chromium can find their way into drinking water sources through geological processes, industrial discharges, or aging infrastructure. Their presence, even in trace amounts, can have profound effects on human health and ecosystems (Lu et al., 2015). The main concern with iron and manganese in drinking water is their effect on its taste, odor, and color (Schwartz et al., 2021). The Canadian drinking water guidelines state that manganese in drinking water requires monitoring on the basis of both health risks and aesthetic considerations (Health Canada, 2021). An aesthetic objective (AO), or recommended value, is specified for Fe and Mn in water quality guidelines (Hu et al., 2022). The amount of manganese directly affects the chromaticity of the water. If the iron concentration in water surpasses 0.3 mg/L, the water turns cloudy, and when it exceeds 1 mg/L, the water develops an iron-like taste. When the manganese content of the water is above 0.5 mg/L, the water develops a distinctive odor and an unpleasant color. Phenomena such as "red water" and "black water" are attributed to water with elevated levels of manganese and iron.
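
The aesthetic thresholds quoted above can be read as a simple rule set. The following Python sketch is illustrative only: it encodes the three concentration values stated in this section (0.3 mg/L and 1 mg/L for iron, 0.5 mg/L for manganese), and the function and constant names are hypothetical rather than part of any guideline or of the tool developed in this thesis.

```python
from typing import List

# Thresholds in mg/L, taken from the values quoted in this section.
# All names below are hypothetical and used for illustration only.
FE_CLOUDY_MG_L = 0.3       # above this, the water turns cloudy
FE_TASTE_MG_L = 1.0        # above this, the water has an iron-like taste
MN_ODOR_COLOR_MG_L = 0.5   # above this, odor and unpleasant color appear


def aesthetic_effects(fe_mg_l: float, mn_mg_l: float) -> List[str]:
    """Return the aesthetic effects expected for the given Fe and Mn levels."""
    if fe_mg_l < 0 or mn_mg_l < 0:
        raise ValueError("concentrations must be non-negative")
    effects = []
    if fe_mg_l > FE_CLOUDY_MG_L:
        effects.append("cloudy water")
    if fe_mg_l > FE_TASTE_MG_L:
        effects.append("iron-like taste")
    if mn_mg_l > MN_ODOR_COLOR_MG_L:
        effects.append("odor and unpleasant color")
    return effects
```

For instance, a sample with 0.5 mg/L iron and 0.1 mg/L manganese would be flagged only for cloudiness under these rules. Actual aesthetic effects also depend on conditions such as pH and oxygen levels, so this check is a rough screen and not a substitute for the guideline documents.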
According to CBC News, excess Mn and Fe concentrations in a Cape Breton FN reserve meant that the drinking water failed the aesthetic objectives stated in the Canadian guidelines, and the community received a "do-not-consume" advisory. Members of the Cape Breton FN reserve strongly protested the dark, malodorous tap water. The specific conditions under which manganese causes coloration vary with factors such as pH, oxygen levels, and the presence of other minerals in the water. From a sensory perspective, washing clothes and utensils with water containing high levels of iron and manganese can easily lead to discoloration, affecting functionality and aesthetics (Meena et al., 2005). When iron and manganese accumulate in water supply pipelines, the transport capacity of the pipeline is significantly reduced because the pipes become blocked (Tremblay et al., 1998). One study indicated that 4% of the surveyed FN households across Canada showed manganese concentrations in stagnant (first draw) or flushed tap water that exceeded the health-based maximum acceptable concentration (MAC) defined by the 2019 Guidelines for Canadian Drinking Water Quality (Health Canada, 2019). In addition, 12.8% of households had manganese concentrations higher than the AO in their flushed tap water, and 3.5% of households had iron levels over the AO (Schwartz et al., 2021). A survey conducted in metropolitan France covered families with children aged 6 months to 6 years. The results showed Mn and Fe concentrations in tap water high enough to exceed at least one of the limits set by regulatory authorities (Le Bot et al., 2016). The Lheidli T’enneh First Nation community (Prince George, British Columbia) has expressed apprehension regarding the presence of excessive iron and manganese in its drinking water.
The concern stems from the consistent failure of the treatment systems to meet the manganese or hardness treatment objectives. Even the modification of the existing equipment settings in August 2021 did not lead to proper treatment.

2.2.3 Current detection approaches

2.2.3.1 Field sampling and analysis

This conventional method for directly measuring metal concentrations involves on-site collection of water samples, followed by analysis using laboratory instruments. At present, the main laboratory analysis methods include Atomic Fluorescence Spectrometry (AFS) (Fernández-Martínez et al., 2015), Atomic Absorption Spectrometry (AAS) (Bua et al., 2016), Inductively Coupled Plasma Atomic Emission Spectroscopy (ICP-AES) (Zhao et al., 2015), and Inductively Coupled Plasma Mass Spectrometry (ICP-MS) (Deng et al., 2018). These methods quantify heavy metals accurately, but the pretreatment processes are inconvenient and costly, and they require a wide range of operational expertise.

2.2.3.2 Sensor technologies

Sensor technology uses sensors and monitoring devices to monitor water quality on site and record water data in real time. Owing to its strengths, such as immediacy, portability, and high selectivity for specific metals, sensor technology has been widely used to identify metal ions and detect variations in metal concentration. However, several challenges hinder wider application, including the expense of instrumentation and the sensor maintenance needed to ensure sensitivity and reliability. Microbial electrochemical sensors characterize the concentration of heavy metals by exploiting the decline in bacterial electrochemical activity under the influence of heavy metals, which degrades the output electrical signal (Wang et al., 2020).
Electrochemical sensors based on metal-organic frameworks (MOFs) can achieve robust, sensitive, selective, and reliable sensing of metal ions (Shafqat et al., 2023). Nanomaterial-based chemical sensors are widely employed as effective analytical tools for the detection of heavy metal ions. They offer high sensitivity, portability, and optimized overall detection capability and performance (Alias et al., 2020; Rasheed et al., 2022). In addition to simple and reliable electrochemical methods, spectroscopic and optical methods are also applied for sensing metal ions (Harrington et al., 2011).

2.2.3.3 Modelling and simulation

Multiple models, such as hydrogeological or mathematical models, combined with sampled data and Geographic Information Systems (GIS), have been used to simulate metal distribution and predict metal concentrations in water bodies (Motovilov & Fashchevskaya, 2021). In academic research, integrating several modelling strategies is common practice for obtaining comprehensive and reliable results. Hydrodynamic models simulate processes such as water flow, dissolution, sedimentation, and the transport of suspended particles to infer the transport and distribution patterns of heavy metals in water. Artificial intelligence models leverage techniques such as ML and deep learning to recognize complex patterns within water quality data, facilitating the prediction of heavy metal concentrations in water. A back propagation neural network (BPNN) was applied to heavy metal concentration prediction in the Qinghai-Tibet Plateau basin. The model predicted the content of 4 heavy metals (As, Sb, Mo, Mn), with pH, dissolved oxygen (DO), conductivity (EC), total phosphorus (TP), and iron (Fe) as the input values (Xiao et al., 2023).
ML models have become a popular topic in recent years for predicting water quality, although there is a dearth of research specifically addressing the prediction of heavy metal concentrations. ML is introduced in detail in the next section.

2.3 Machine learning overview

2.3.1 Models introduction

This study presents a new framework based on data augmentation algorithms. It combines recent water quality data from SRR First Nation areas in BC with 4 ML models, so that the importance of relevant features can be captured effectively and prediction accuracy improved. The main concepts and components of ML are as follows:

a. Training Data: The training process of an ML model relies on extensive data. This data contains information relevant to the task the model needs to learn.
b. Features: Features are critical attributes that describe the data. In ML, selecting and extracting appropriate features is crucial for the performance of the model.
c. Model: A model is a mathematical representation used to capture patterns within the data. The choice of model varies with the task, such as classification, regression, or clustering.
d. Training: During the training phase, the model learns patterns and relationships from the training data. This typically involves adjusting the model's parameters to accurately represent the data.
e. Testing and Validation: After training, the model needs to be validated using test data. This helps assess the model's generalization ability.
f. Prediction and Decision: Once training is complete, the model can be used to make predictions and support decisions on new data. This is the goal of ML.

The four ML models used in this study are introduced below.

2.3.1.1 Random Forest (RF)

Random Forest (RF), as depicted in Figure 2.2, is an ensemble learning approach that uses decision trees as sub-classifiers and combines multiple decision trees to form the forest.
In Random Forest, the original dataset is partitioned into multiple subsets, and each decision tree sub-classifier applies its own optimal attribute splitting. This ensures that the training process of each tree yields different results, guaranteeing their distinctiveness. RF is described as an improvement upon Bagging (Bootstrap Aggregating), with the key distinction being the introduction of random feature selection. When selecting split points for each decision tree, RF randomly chooses a subset of features and then performs conventional split point selection on this subset. In addition, compared to Bagging, RF trains quickly and generalizes better, benefiting from its flexibility (Lin et al., 2017).

Figure 2.2 Structure of Random Forest

2.3.1.2 Extreme Gradient Boosting (XGB)

XGB is a distributed gradient boosting model built from multiple decision trees. It is heavily optimized within the gradient boosting framework to achieve efficiency, flexibility, and portability. Regularization terms are added to the objective function to reduce model variance; that is, they limit overfitting by favoring simpler models. XGB also borrows from Random Forest by supporting column sub-sampling, which results in faster computation (Revathi et al., 2020).

2.3.1.3 Decision Tree (DT)

Decision Tree is among the most common ML models; it is a tree-based model that applies logical rules to predict an outcome. DT and its variations represent a family of algorithms, each with its own parameterization (Liu et al., 2024). DT is a tree-like structure with 3 main node types. It represents a process that compares the values of attributes in a dataset and uses each comparison to determine the direction of the following decision step.
State nodes represent the expected values of alternative solutions, and by comparing the nodes an optimized result is found and selected. Attributes are divided in this way, and the construction of a DT is completed through recursion. Additionally, branching can be halted during or after construction by pre- or post-pruning, which prevents overfitting (Ahmad et al., 2018).

2.3.1.4 Gradient Boosting Regression (GBR)

Gradient Boosting Regression (GBR) is a boosting ensemble learning algorithm based on DT, designed specifically for regression problems. GBR has been shown to be particularly effective in improving prediction accuracy compared to using DT alone. GBR first trains a DT model on the dataset and computes the residuals on the training set. It then trains subsequent DT models in successive iterations, each fitted to the current residuals and added to the existing ensemble, so that the predictions are adjusted step by step and the residual error is reduced. The final regression model is the sum of the regression functions from all iterations (Lu et al., 2018), as shown in Equation (2.1):

$$F_M(x) = \sum_{m=1}^{M} T(x; \theta_m) \qquad (2.1)$$

The loss function minimized for each weak learner is defined as:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(x_i) + T(x_i; \theta_m)\big) \qquad (2.2)$$

where m indexes the training iteration, x stands for the input data, and θm is the distribution weight vector. N is the number of training samples, y_i is the observed target for sample x_i, and L is the loss function. The model trains M times, with each iteration yielding a weak regression function T. F_{m-1}(x_i) represents the current model.

2.3.2 Application of the machine model in water quality

2.3.2.1 Water quality prediction

Numerous models have been developed using ML for the analysis of water quality and water security issues (Azrour et al., 2022). ML is expected to serve as a viable alternative to traditional water sampling, especially for water quality parameters that are difficult to measure (Chowdhury et al., 2009; Shahi et al., 2020).
In the field of water quality prediction, one of the most common ML research topics is the prediction of the Water Quality Index (WQI) (Aldhyani et al., 2020; Asadollah et al., 2021; Uddin et al., 2022). An ensemble ML model known as Extra Trees Regression (ETR) was employed to predict the monthly WQI values in the Lam Tsuen River in Hong Kong. Monthly water quality data, comprising chemical indicators such as biochemical oxygen demand and nitrite-nitrogen, along with physical indicators such as pH, turbidity, and temperature, were used as input features to construct the predictive model (Asadollah et al., 2021). Another study estimated the water quality index and water quality class (WQC) with ML, using four physical input parameters (Ahmed et al., 2019). ML also has a wide range of applications in predicting the chemical and physical characteristics of water bodies. The concentration of chlorophyll, DO, turbidity, and conductivity were determined using an artificial neural network (ANN) algorithm with nonlinear autoregressive time series from a monitoring station in New York State (Khan & See, 2016). In the Karoun River in southwest Iran, Najafzadeh and Ghaemi (2019) used Multivariate Adaptive Regression Splines and Least Squares Support Vector Machine as water quality simulation methods to predict BOD5 and COD. Various ANN models were used to forecast the weekly concentration of nitrate nitrogen in the Sangamon River near Decatur, Illinois. The comparison in that study showed that the ANN models performed better than linear regression (Markus et al., 2003).
Additionally, using sample data from 141 cases across small water distribution networks (SWDNs) and diverse ML methodologies, models were developed to predict three emerging disinfection byproducts (dichloroacetonitrile, chloropicrin, and trichloroacetone) within SWDNs (Hu et al., 2023). Several ML models, among other techniques, were constructed to predict 10 parameters related to irrigation water quality (IWQ) in order to assess the suitability of irrigation water (El Bilali & Taleb, 2020). ANN has also been used in an innovative way to predict water quality recovery, streamlining the resilience assessment process and removing the need for the parametric analyses traditionally employed in evaluating water quality recovery (Imani et al., 2021).

2.3.2.2 Classification

In terms of learning paradigm, ML can be categorized into supervised learning, semi-supervised learning, and unsupervised learning (The World Bank, 2021). When the training data includes corresponding labels, it is referred to as supervised learning, exemplified by algorithms such as Support Vector Machines and Random Forests. When the training data lacks labels, it falls under unsupervised learning, which is used to explore the intrinsic structure of the data rather than to predict a specific output. If the training data consists of both labeled and unlabeled portions, it is termed semi-supervised learning, with algorithms such as self-training and co-training as relevant instances. In the realm of water quality classification, extensive efforts have been dedicated to the application of ML. Dezfooli et al. employed three models, namely Probabilistic Neural Network (PNN), k-Nearest Neighbors, and Support Vector Machine (SVM), to classify water quality levels based on the water quality parameters of 172 water samples from the Karun River in Iran (Dezfooli et al., 2018).
Another study employed the SVM and Attribute Reduction (AR) algorithms to classify the water quality of the Mekong River. The input data consisted of monitoring data from 2008 to 2019, including turbidity, salinity, total coliform bacteria, biochemical oxygen demand, dissolved oxygen, ammonia nitrogen, and total nitrogen, among others. Research in Bangladesh calculated the water quality index using the Weighted Arithmetic Index method based on data obtained from the Ghorashal Lake. Subsequently, a Gradient Boosting Classifier was employed to categorize water quality into five classes ranging from “Excellent” to “Unsuitable for drinking” (AlRazee et al., 2019). Another study used five ML classification methods, namely k-nearest neighbors (KNN), decision tree (DT), Naive Bayes, artificial neural network (ANN), and support vector machine (SVM), to predict WQC. The results showed that the decision tree and support vector machine classifiers were the best prediction models, with an error rate of 0% (Babbar & Babbar, 2017).

2.3.2.3 Heavy metals prediction

Neural Networks (NN) possess proficient data mapping capabilities, while Support Vector Machines (SVM) excel at mapping small-sample datasets. These two ML models are commonly employed in current research for predicting heavy metal content. ML for the prediction of heavy metals in soil has been studied in recent years. Using the Random Forest (RF) model, the spatial distribution of soil-absorbable heavy metals in the arid regions of Iran for the years 1986, 1999, and 2010 was simulated. The results indicate that the RF model effectively predicts the distribution of heavy metals (Taghizadeh-Mehrjardi et al., 2021). In another study, the overall distribution of heavy metals in the soil in Hefei City, China, was predicted using RF, ANN, and SVM models.
Soil characteristics, urbanization history, and the area of different land-use types were employed as predictive factors to estimate the concentrations of arsenic, zinc, lead, mercury, nickel, copper, chromium, and cadmium in the soil (Zhang et al., 2020). ML methods can also be used to predict metals in the air. Research has been conducted using meteorological factors and particulate matter (PM) concentration as predictive factors. Using Multiple Linear Regression (MLR), Backpropagation Artificial Neural Networks (BP-ANN), and SVM, in conjunction with air PM data collected from Nanjing, China, rapid predictions of size-classified metals have been achieved (Wang et al., 2017). Additionally, predictive models for metal concentrations in atmospheric PM 2.5 were established with four ML methods (MLR, BP-ANN, SVM, and RF), using meteorological data, atmospheric pollutant data, and PM 2.5 data from the northeastern region of China for the years 2013 to 2018 (Lyu et al., 2023). ML can also be used to predict the concentration of heavy metals in living organisms and sediment. One study used MLR and RF methods to estimate heavy metal concentrations in the muscle and liver tissues of Psetta maxima maeotica, a subspecies of turbot known as "a suitable biological indicator of heavy metal contamination in aquatic environments" (Petrea et al., 2020). A study in China used ANN and SVM to predict heavy metal concentrations in sediments in Chaohu Lake and analyzed the associated ecological risk index (Li et al., 2021). However, the use of ML for predicting heavy metal concentrations in water bodies still lacks comprehensive research, primarily due to the limited availability of monitoring data. In the Taihu region of China, Lu et al. used the physical and chemical indexes of surface water from drinking water sources, combined with ANN and SVM models, to simulate dissolved substances, particulate matter, and the concentration of heavy metals (Lu et al., 2019). Furthermore, in the southeastern part of Iran, research employed BPNN, generalized regression neural networks (GRNN), and MLR methods to predict the heavy metal concentrations (Cu, Fe, Mn, Zn) in the acid mine drainage of the Sarcheshmeh porphyry copper deposit, with pH, Mg concentration, and sulfate content as input indicators (Rooki et al., 2011). Currently, there is still a gap in predicting the concentrations of Fe and Mn in drinking water. Therefore, this study used four ML models (RF, XGB, DT, and GBR) to predict the iron and manganese content in drinking water in remote Indigenous areas in the province of BC, Canada. This study also used a variety of data augmentation methods to address the limited availability of water quality data, which is a common challenge in modelling heavy metals in drinking water.

2.3.3 Interpretability of machine learning

The interpretability of models is one of the most critical issues in ML applications. Interpretable approaches allow for explanations of how predictions are made. In other words, the purpose of interpretability is to turn the behavior of the model into understandable relations among the various factors. Model-agnostic explanation systems offer a general framework for interpretability, enabling flexible selection based on the model itself, model features, and domain expertise. These model-agnostic tools enhance the credibility of ML applications in practice. They can be applied to any ML model after training.
Model-agnostic methods, such as Accumulated Local Effects (Apley & Zhu, 2020), Local Interpretable Model-agnostic Explanations (Wang et al., 2021), and SHapley Additive exPlanations (SHAP) (Baptista et al., 2022), typically operate by analyzing feature inputs and outputs to provide insights into the model's behavior. Currently, few studies have applied model interpretability to water quality prediction. One study employed an RF model and pollution concentration data, including nitrate (NO3-N), total phosphorus (TP), and Escherichia coli (E. coli), gathered from 1047 sampling stations in the Texas Gulf area to predict stream water quality under different urban development scenarios. Model interpretation was conducted using the SHAP method to explore the influence of urban development patterns on stream water quality. The SHAP results highlighted the significance of indicators such as the Landscape Division Index, Split Index, Maximum Patch Index, and Patch Cohesion Index in shaping stream water quality within the context of urban development. The study demonstrated that spatial variations in this pattern affect river water quality. Its interpretability analysis suggests that the deterioration in river water quality can be attributed to the effects of urban growth (Wang et al., 2021).

2.4 Data augmentation (DA)

It is widely acknowledged that a substantial sample size is required in the practical application of artificial intelligence modeling. When models are constructed with insufficient data, overfitting is prone to occur, resulting in decreased predictive accuracy (Ma et al., 2023; Shen & Qian, 2022). This limitation hinders the ability to effectively interpret the variability of the target variable.
Data Augmentation (DA) represents a strategy for increasing the quantity of training samples (Connor et al., 2021; Iglesias et al., 2022), aiming to ameliorate challenges associated with insufficient samples (Shao et al., 2019) and imbalanced datasets (Zhao & Yuan, 2021) during the model training process. Generalization capability is enhanced by reducing overfitting and expanding the decision boundaries of the model (Fekri et al., 2019; Shorten & Khoshgoftaar, 2019). Through the DA process, the newly generated samples help to build a more robust and diverse training set, helping ML models learn a wider range of patterns and improving generalization to unseen data. DA methods are mainly divided into two categories: supervised and unsupervised. Supervised data augmentation methods include operations such as flipping, rotating, cropping, adding noise, SMOTE (Zhang et al., 2023), sample pairing (Inoue, 2018), and mixing (Zhang et al., 2017). Unsupervised data augmentation methods encompass Generative Adversarial Networks (GAN) (Ma et al., 2023; Zhao & Yuan, 2021) and automatic data augmentation (Cubuk et al., 2018). This study involved multiple DA methods to enhance small sample datasets. The first strategy is the integration of three conventional DA methods, namely Numerical Interpolation, Bootstrapping, and Noise Injection. The second is the Generative Adversarial Network (GAN), a newer technology based on neural networks (NN). These DA methods are introduced below.

2.4.1 Numerical Interpolation

Numerical interpolation is a commonly used technique in data augmentation for ML. This strategy involves interpolating missing data or generating additional information from existing data. The process typically uses linear, polynomial, or nearest-neighbor-based interpolation to estimate values from the observed data points.
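As an illustration of the linear case described above, the following is a minimal sketch rather than the implementation used in this study. It assumes the rows of the input array are ordered observations and interpolates each feature independently with NumPy's `np.interp`; the function name `interpolate_rows` and the sample readings are hypothetical.

```python
import numpy as np

def interpolate_rows(data, n_new):
    """Return n_new synthetic rows by per-feature linear interpolation.

    Rows of `data` are treated as ordered observations (an assumption),
    and each column is interpolated independently with np.interp.
    """
    data = np.asarray(data, dtype=float)
    n_obs, n_feat = data.shape
    positions = np.arange(n_obs)
    # Evenly spaced fractional positions strictly between the first and last row.
    new_pos = np.linspace(0, n_obs - 1, n_new + 2)[1:-1]
    columns = [np.interp(new_pos, positions, data[:, k]) for k in range(n_feat)]
    return np.column_stack(columns)

# Hypothetical readings: columns are pH and turbidity (NTU).
observed = np.array([[6.8, 1.0], [7.0, 2.0], [7.4, 1.5], [7.2, 3.0]])
synthetic = interpolate_rows(observed, n_new=3)
```

Because every synthetic value lies between two neighbouring observations, this kind of interpolation cannot produce values outside the observed range of a feature, which is one reason it adds limited diversity on its own.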
2.4.2 Bootstrapping

Bootstrapping is a method used to generate additional training samples by resampling the existing dataset. When used for data augmentation, Bootstrapping involves drawing subsets from the original dataset with replacement, creating multiple bootstrap samples. Each sample is a variation of the original dataset, which introduces diversity to the training set. This process is valuable because it allows additional information to be generated without gathering new data.

2.4.3 Noise Injection

Noise injection is useful when data is scarce or lacks diversity. It is a regularization technique used to prevent a model from overfitting by exposing it to a broader set of input variations. In this process, Gaussian noise or random jitter is added to the input data. The resulting variation can be applied to different types of data, such as images, text, or numerical data. With the introduction of controlled noise, the model becomes more resilient to small variations in the input, which makes it better at handling unseen or noisy data during training and inference.

2.4.4 GAN

The Generative Adversarial Network (GAN) is an unsupervised learning algorithm first proposed in 2014 by Goodfellow et al. (Goodfellow et al., 2020). A GAN is a class of artificial intelligence algorithm consisting of a generator and a discriminator engaged in an adversarial game. The two networks are trained simultaneously through adversarial training (Shao et al., 2019). The generator creates synthetic samples, and the discriminator evaluates whether these samples are real or generated. So far, data augmentation based on generative adversarial networks has predominantly been employed in the domains of image processing (Wang et al., 2019) and fault signal generation (Zhao & Yuan, 2021). It is worth noting that numerous research works have employed GAN in the medical field (Chen et al., 2020; Srivastav et al., 2021; Tyagi & Talbar, 2022).
For example, GAN has been used to augment image data for simulating pulmonary nodule shapes in chest X-rays, which play a significant prognostic role in early screening for lung cancer (Shen et al., 2023). Qin et al. developed a GAN-based screening method for melanoma and other skin diseases in dermoscopy (Qin et al., 2020). However, there is limited research on augmenting small-sample continuous datasets. Compared with traditional data augmentation techniques, this synthesis-based method involves a more complicated process that typically requires training, but it yields a more diverse set of synthetic samples.

2.5 Discussion and conclusion

Ensuring drinking water quality and security is a necessary long-term step toward access to clean drinking water, which is a basic human right. However, First Nations communities in Canada still face challenges in accessing clean drinking water because of their dispersed rural living patterns, and heavy metal pollutants in drinking water in particular pose great risks to physical health. It is important to develop a new method for estimating metal concentrations in drinking water in rural First Nations communities, in order to fill the gaps in concentration data that result from the high cost and long turnaround of conventional laboratory analysis. ML is a potential method for simulating pollutant concentrations in water, although it requires large amounts of data to generate empirical models. Furthermore, this method can improve prediction accuracy efficiently by capturing the importance of relevant features, and it is a better alternative in rural Indigenous areas, which lack the resources to use conventional methods. The performance of ML models is heavily affected by the quality and quantity of training data. Data augmentation is therefore necessary to generate reliable synthetic data where water quality data is lacking.
ML algorithms are broadly applied in predicting water quality and evaluating water security. However, the prediction of heavy metals in drinking water still needs to be explored to fill the gap in effective and reliable alternatives for heavy metal detection in drinking water.

Chapter 3 Materials and Methods

3.1 Data processing and methods

3.1.1 Data sources and indicators

This study involved collecting aggregated groundwater data samples from five First Nations communities scattered across the province of British Columbia. The monitoring period spanned from 2019 to 2020, with water samples collected at approximately equal intervals. Common physical water quality indicators and the concentrations of iron and manganese were then analyzed for each sample, resulting in a total of 34 samples. The selected physical indicators detailed in Table 3.1 (Total Dissolved Solids (TDS), conductivity, pH, turbidity, and hardness) were used as predictive factors in the NI-BS-NI based data augmentation method. In the GAN based data augmentation method, only pH, turbidity, and hardness were used as input factors, to accommodate practical scenarios in which only a limited set of input parameters can be measured because of insufficient equipment. These physical indicators have the advantage of being detectable on-site using portable meters or simple titration, allowing rapid and cost-effective acquisition of data. To achieve a method that avoids the complexity of laboratory testing and rapidly obtains metal concentrations, the concentrations of iron and manganese were selected as the predicted indicators, based on the practicality of the available dataset.
Table 3.1 Details of water quality physical parameters

| Physical Parameter | Unit | Description | Importance | Detection Methods |
|---|---|---|---|---|
| TDS | mg/L | The total amount of dissolved solids in water, including inorganic salts, organic substances, and other dissolved materials. | Determines the level of dissolved pollutants in water, determining water suitability and usage. | Conductivity or evaporation methods |
| Conductivity | µS/cm | The ability of water to conduct electricity, primarily dependent on the dissolved ions present. | Monitors water salinity and pollution levels, and identifies sources of water contamination. | Conductivity meters |
| pH | - | The measure of acidity or alkalinity in water. | Affects ecological balance and chemical processes; crucial for the survival of organisms and water safety. | pH electrodes |
| Turbidity | NTU | Indicates the amount and size of suspended particles in water. | Evaluates water clarity and visibility; detects potential water pollution. | Turbidity meters |
| Hardness | mg/L | Reflects the concentration of calcium and magnesium in water. | Affects water utility, plumbing systems, and equipment maintenance; crucial for sustainable water resource use. | EDTA titration or complexometric titration methods |

3.1.2 Data preprocessing

Descriptive analysis of the raw data is beneficial for gaining an intuitive understanding of the predictor variables and the predicted target before modeling. In many instances, the quality of model prediction is directly linked to the raw data. For instance, the presence of outliers can significantly affect model results. Thus, conducting a descriptive analysis of the predicted target before modeling is a crucial step. In this study, data preprocessing primarily involves handling missing values and normalizing the data.

3.1.2.1 Missing value processing

Missing data is one of the most common challenges in data modelling.
Common approaches to handling missing values include direct deletion, nearest neighbor imputation, and linear regression fitting. How missing values are handled significantly influences the performance of the data model, so selecting a suitable method plays a crucial role in its overall effectiveness. This study used the K-Nearest Neighbors (KNN) method to fill in missing values. The basic idea is to estimate each missing value from the values of its nearest neighbors in the feature space (Zhang et al., 2017).

3.1.2.2 Normalization

Since the dataset contains variables with different ranges (i.e., the difference between maximum and minimum values), means, and standard deviations, data normalization is an important preprocessing step. Following the handling of missing values, the input features were normalized using Equation 3.1. This normalization standardizes all input features, ensuring a consistent distribution of features.

$$x_i^{*} = \frac{x_i - \mu}{s} \qquad (3.1)$$

In the equation, x_i represents the value of input feature i, x_i^* is the normalized value of the initial x_i, μ is the mean of x_i, and s is the standard deviation of x_i.

3.1.3 NI-BS-NI based Data Augmentation

One of the challenges faced in this study is the limited data scale resulting from constrained water quality testing conditions in the SRR region. Data augmentation methods are therefore needed to extract valuable information from the limited training data. In the first strategy, we developed a Numerical Interpolation-Bootstrapping-Noise Injection (NI-BS-NI) based data augmentation method to enhance the original dataset. This method combines three traditional DA algorithms: Numerical Interpolation, Bootstrapping, and Noise Injection. In the Numerical Interpolation step, we interpolated each feature to generate 40 new samples.
Because the original dataset is small, consisting of only 34 samples, generating too much data in this step is not advisable. In the Bootstrapping DA step, the 34 original samples were randomly resampled, resulting in the generation of 66 data sets. In this process, new samples are generated while preserving the distribution of the data. Noise Injection is a commonly employed data augmentation method. Noise can take various forms, such as Gaussian noise or uniform noise. Since the dataset primarily consists of continuous variables, Gaussian noise is often a preferable choice. With a designated noise level of 0.05, 140 sets of new data were generated. Noise injection increases data diversity and helps to improve the generalization ability of the model. After each round of data augmentation, the performance of the model was assessed using techniques such as cross-validation to determine whether data augmentation contributed to an improvement in model performance.

3.1.4 GAN based Data Augmentation

The second data augmentation method used to generate new data in this study is the Generative Adversarial Network (GAN). The overall framework is illustrated in Figure 3.1. It comprises four steps: dataset construction, GAN sample generation, removal of irrelevant values, and cross-validation loop iteration.

Figure 3.1 Structure of GAN

The specific process is outlined as follows:

Step 1: The original set of 34 data instances was divided into training and testing sets in an 80/20 ratio.

Step 2: Constructing a GAN model for generating new samples involves two core network structures: the Generator and the Discriminator. The Generator's objective is to generate synthetic data that closely approximates the real distribution, making it challenging for the Discriminator to distinguish the generated data from real data.
Meanwhile, the Discriminator's goal is to determine whether the data is real or fake, aiming to effectively distinguish between genuine and synthetic data. The Generator and Discriminator engage in an adversarial process, iteratively enhancing their respective generation and discrimination capabilities. When the loss functions of both the generator and discriminator networks converge, the Discriminator is typically reasonably adept at authenticating real samples. However, certain generated data may still be classified as real, which means the Generator has learned the properties of real samples and can produce plausible synthetic data. The learning rate is 0.001 and the batch size is 128 in this process. The Generator and Discriminator are both trained with the Adam optimization algorithm.

Step 3: Delete the similar samples and unreasonable samples generated in Step 2. Similar samples are removed by comparing the Euclidean distance between samples with a set threshold. The threshold was set at 0.5.

Step 4: A 3-fold cross-validation approach was applied at each iteration of the GAN, as shown in Figure 3.2, to assess the quality of the generated samples. MAE_GAN (Mean Absolute Error from GAN-generated data) was computed through three-fold cross-validation, while MAE_train was obtained through three-fold cross-validation on the training set data from Step 1. The two were then compared. If MAE_GAN is smaller than MAE_train, meaning that the newly generated data exhibits a smaller mean absolute error in three-fold cross-validation than the real data, then the generated samples are of higher quality. The loop concludes when the iteration count is greater than or equal to 10, resulting in the final set of data.
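The filtering in Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the text does not say which samples each generated row is compared against or whether distances are computed on normalized features, so the sketch assumes a greedy comparison against the rows already kept, treats any negative value as "unreasonable", and uses made-up rows. The function names `euclidean` and `filter_generated` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_generated(samples, threshold=0.5):
    """Drop rows with negative values, then drop near-duplicates.

    Assumption: a row counts as 'similar' if it lies closer than
    `threshold` to any row that has already been kept.
    """
    kept = []
    for row in samples:
        if any(value < 0 for value in row):
            continue  # unreasonable sample (negative measurement)
        if all(euclidean(row, other) >= threshold for other in kept):
            kept.append(row)
    return kept


# Hypothetical generated rows: [pH, turbidity, hardness, Fe]
generated = [
    [8.1, 0.30, 150.0, 0.04],
    [8.1, 0.32, 150.1, 0.04],  # closer than 0.5 to the first row
    [7.9, -0.10, 90.0, 0.01],  # negative turbidity
    [8.4, 2.50, 260.0, 0.35],
]
filtered = filter_generated(generated)
print(len(filtered), "rows kept out of", len(generated))
```

With unscaled features, the hardness column dominates the distance, so in practice the choice of feature scaling would matter as much as the threshold itself.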
Figure 3.2 The 3-fold cross-validation diagram

The GAN learns the distribution of real samples in order to generate synthetic data. During each training iteration, a set of random noise z ~ N(0, 1) is input into the generator to produce fake samples. The discriminator assesses the authenticity of the samples and assigns scores. The generator's objective is to deceive the discriminator into classifying fake samples as real, while the discriminator aims to distinguish between fake and real samples. Through adversarial training, the goal is to make the distribution of generated samples approach that of real samples. The objective function of the GAN is expressed by the following formula:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim P_g}\left[\log\left(1 - D(G(z))\right)\right] \tag{3.2}$$

In this expression, P_data and P_g represent the distributions of real samples x and random noise z, respectively. G(z) denotes the generated pseudo samples, D(x) signifies the probability of real samples being judged as authentic, and D(G(z)) indicates the probability of pseudo samples being judged as authentic.

The GAN model incorporates two distinct loss functions for training the generator and discriminator networks. The loss functions of the generator and the discriminator are the mean absolute error (MAE) and the binary cross-entropy (BCE), respectively. Their definitions are as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{3.3}$$

$$\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right] \tag{3.4}$$

where n represents the sample size, \hat{y}_i denotes the predicted values, and y_i represents the actual values.

3.2 Machine Learning Modelling

3.2.1 Model development

In this study, Python version 3.9 is used, and the ML models are taken from the Scikit-learn library. Python is a high-level scripting language that combines interpretability, compatibility, interactivity, and object-oriented programming. It is also the most popular language for ML, featuring the most comprehensive and up-to-date ML frameworks such as TensorFlow, PyTorch, and others.
In this study, Random Forest (RF), Gradient Boosting Regression (GBR), Extreme Gradient Boosting (XGB), and Decision Tree (DT) were selected for modeling based on the study requirements and model characteristics. These four tree-based models produce readily interpretable results and belong to different integration methods. After preprocessing and augmenting the dataset, the data points are further divided into training and test sets with the same distribution for multiple training sessions. In this study, the dataset is divided into a training set and a test set with an 8:2 ratio. In addition, 5-fold cross-validation is employed during model training to avoid overfitting and improve the predictive power of the applied ML methods. Single-parameter optimization and multi-parameter optimization were both used in modelling to find the best optimization method in this study. After combining the three traditional data augmentation processes, multi-parameter optimization was used to predict iron and manganese, while single-parameter optimization was studied after GAN data augmentation in ML modelling.

3.2.2 Model hyperparameter optimization

Hyperparameter tuning is very important to obtain the best model. The best set of hyperparameters plays an important role in model reliability and adaptability. The most common methods for hyperparameter tuning include manual search, grid search, and Bayesian optimization. In this study, the grid search method is employed for hyperparameter tuning because it is particularly precise when dealing with a limited number of input features. During the parameter tuning process, model parameters are optimized using the average result over the five validation subsets of the five-fold cross-validation.
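The tuning procedure described above can be sketched with Scikit-learn as follows. This is a hedged illustration rather than the study's actual script: the data are synthetic, the three columns merely stand in for three physical predictors, and the parameter grid is a small hypothetical one, since the grids that were actually searched are not listed in the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 3 predictors, 1 continuous target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(120, 3))
y = 2.0 * X[:, 1] + 0.1 * rng.normal(size=120)

# 8:2 split into training and test sets, as described in Section 3.2.1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; every combination is scored by 5-fold cross-validation
# and the mean validation score over the five folds decides the winner.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

best_params = search.best_params_
mean_cv_scores = search.cv_results_["mean_test_score"]
test_r2 = search.score(X_test, y_test)  # R2 of the refitted best model
print(best_params, round(test_r2, 3))
```

Grid search evaluates every combination exhaustively, so its cost grows multiplicatively with each added hyperparameter; with only a handful of inputs and a small grid, as here, that cost stays manageable.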
Figure 3.3 Diagram of five-fold cross-validation

3.2.3 Model evaluation

For an ML predictive model, classification predictions and numerical predictions are evaluated with different formulas. The evaluation formulas for predictions of the same type are generally consistent with one another, which is also a manifestation of the model's robustness. Common metrics for evaluating the accuracy of each model include the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). RMSE represents the errors between predicted samples and actual samples. MAE measures the average absolute difference between the predicted values and the actual values in a dataset. Lower RMSE and MAE values indicate more accurate predictions, with 0 being the optimal score (perfect predictions). The calculation formulas for RMSE and MAE are shown in Equations 3.5 and 3.6:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{3.5}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{3.6}$$

The study also calculated the coefficient of determination (R²) between the predicted values and observed values of the test set to assess the quality of the predictions. R² reflects the fitting ability of the established models. A higher R² value indicates a better fit of the model. The formula for R² is shown in Equation 3.7:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3.7}$$

In the above equations, n represents the sample size, \hat{y}_i denotes the predicted values, y_i represents the actual values, and \bar{y} represents the mean of the actual values.

3.3 Model interpretability analysis

This study attempts to use ML interpretability methods to explain and elucidate the model's prediction results. Once reliable predictions for iron and manganese have been obtained, further investigating the impact of predictor variables on water quality requires exploring the magnitude, direction (positive or negative), and interactive effects of predictor variables on the concentrations of iron and manganese in drinking water.
This helps to precisely identify the primary factors contributing to variations in target water quality parameters in specific contexts. ML model interpretability methods can effectively explain tree models.

3.3.1 Impact magnitude of predictors

Feature importance is often the first step in interpreting models in data mining. Feature importance in ML reflects the magnitude of the impact of features on model predictions (Ibrahim et al., 2019). Reading the feature_importances_ attribute of a fitted tree model in the Scikit-learn library allows direct retrieval of feature importance. It ranks features based on their frequency of use within the model, and the ranking can be output as a graph. Feature importance is very intuitive, making it easy to understand the weighting of each feature and recognize its importance to model predictions. However, feature importance is model-dependent; models that generate rankings differently will lead to differing results. Additionally, feature importance does not capture feature interactions, and rankings may be influenced by noise within the data, ultimately resulting in information that deviates from the original situation.

3.3.2 Interactive effects of predictors

This study used the Partial Dependence Plot (PDP) to investigate the interactions between indicators. Other common methods used for revealing the interaction effects of predictive factors in ML models include Individual Conditional Expectation (ICE) plots (Goldstein et al., 2015) and Local Interpretable Model-agnostic Explanations (LIME) plots (Ribeiro et al., 2016). PDP illustrates the marginal effects between one or two features and the prediction outcomes of the ML model, showing whether the relationship is linear, monotonic, or more complex (Nie & Wager, 2021; Yan et al., 2020).
Unlike feature importance in Section 3.3.1, which indicates the numerical magnitude of a feature's impact on the model, PDP presents the relationship between features and the impact on prediction outcomes. This study considers the impact of individual factors on prediction outcomes and the combined effects of two factors, specifically examining the synergistic effects of two features on predictions. Partial Dependence Plots are intuitively defined, easy to understand, highly interpretable, and computationally efficient. They can effectively explore the joint impact of two features on the model's predictions. However, their drawback is that they can describe the impact of at most two features simultaneously. Moreover, when there is a strong correlation between two features, the results may exhibit bias.

3.3.3 Impact direction of predictors

When a certain predictor is known to have a significant impact on water quality parameters, it is even more important to know the direction of this impact: in other words, as this predictor increases, do the water quality parameters increase or decrease? In this context, we use SHapley Additive exPlanations (SHAP) plots to observe the positive and negative impacts as well as their magnitudes. SHAP values are based on cooperative game theory and Shapley values (Winter, 2002), assigning a value to each feature based on its marginal contribution to different possible feature combinations. Additionally, SHAP provides the contribution of each feature of each data point to the predicted value (Lundberg & Lee, 2017). Thus, it can be applied to individual predictions. Feature importance, as mentioned in Section 3.3.1, provides a high-level overview of the relevance of features across the dataset. SHAP values offer a detailed breakdown of how each feature contributes to a specific prediction, considering interactions among features. Both can be valuable tools for model interpretation and for gaining insights into the factors driving model predictions.
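To make the idea of a "marginal contribution to different possible feature combinations" concrete, the sketch below computes exact Shapley values by brute force for a single prediction. It is an illustration under simplifying assumptions, not the computation performed by the SHAP library: features outside a coalition are simply replaced by a fixed baseline row, the model `toy_model` and all numbers are invented, and enumerating every coalition is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one prediction.

    Assumption: a feature that is 'absent' from a coalition takes its
    value from `baseline` instead of from `x`.
    """
    n = len(x)

    def value(coalition):
        row = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(row)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


def toy_model(row):
    """Invented response surface; pH is deliberately ignored."""
    ph, turbidity, hardness = row
    return 0.1 * turbidity + 0.001 * hardness + 0.0005 * turbidity * hardness


x = [8.2, 4.0, 200.0]         # hypothetical sample: pH, turbidity, hardness
baseline = [8.0, 1.0, 100.0]  # hypothetical reference sample
phi = shapley_values(toy_model, x, baseline)
for name, contribution in zip(["pH", "turbidity", "hardness"], phi):
    print(f"{name:<10} {contribution:+.4f}")
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that lets SHAP plots show both the sign and the size of each feature's effect.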
Chapter 4 Results and Discussion

4.1 NI-BS-NI Based Modelling

4.1.1 Statistical analysis

4.1.1.1 Statistical summary

Statistical analysis plays a foundational role in subsequent model establishment and interpretable analysis. Statistical information can intuitively demonstrate the data distribution, dispersion, and central tendency. Based on the data characteristics and information, adjustments can be made to the ML modelling operation. In this study, statistical analysis is performed on both the raw data and the augmented data to examine the reliability of the data augmentation methods. This involves assessing whether the data augmentation follows the characteristic distribution properties of the raw data, rather than generating data unrelated to the original dataset and leading to untrustworthy predictive outcomes.

Tables 4.1 and 4.2 respectively present the statistical information of the five predictive factors and the predicted indicators for the raw data and for the data after augmentation through the three traditional methods. The data has increased from 34 sets to 276 sets. The data covers a wide range, allowing for robust model results with strong generalization capabilities.
Table 4.1 Statistical summary of raw data

| | TDS (mg/L) | pH | Conductivity (μS/cm) | Turbidity (NTU) | Hardness (mg/L) | Fe (mg/L) | Mn (mg/L) |
|---|---|---|---|---|---|---|---|
| count | 20 | 34 | 20 | 34 | 34 | 34 | 34 |
| mean | 325.050 | 8.164 | 535.300 | 1.735 | 174.521 | 0.187 | 0.195 |
| std | 196.970 | 0.232 | 315.613 | 3.477 | 120.023 | 0.411 | 0.342 |
| min | 89.000 | 7.740 | 163.000 | 0.050 | 0.510 | 0.003 | 0.0001 |
| 25% | 240.750 | 7.968 | 369.750 | 0.173 | 82.625 | 0.011 | 0.003 |
| 50% | 299.500 | 8.175 | 483.500 | 0.235 | 155.000 | 0.036 | 0.030 |
| 75% | 345.000 | 8.385 | 567.750 | 0.875 | 260.500 | 0.158 | 0.203 |
| max | 870.00 | 8.480 | 1400.000 | 14.000 | 470.000 | 2.170 | 1.160 |

Table 4.2 Statistical summary of NI-BS-NI augmented data

| | TDS (mg/L) | pH | Conductivity (μS/cm) | Turbidity (NTU) | Hardness (mg/L) | Fe (mg/L) | Mn (mg/L) |
|---|---|---|---|---|---|---|---|
| count | 276 | 276 | 276 | 276 | 276 | 276 | 276 |
| mean | 287.282 | 8.158 | 480.981 | 1.646 | 176.618 | 0.182 | 0.193 |
| std | 161.066 | 0.229 | 255.833 | 3.223 | 114.049 | 0.382 | 0.317 |
| min | 88.915 | 7.587 | 162.963 | 0.0001 | 0.510 | 0.0001 | 0.0001 |
| 25% | 146.089 | 7.960 | 265.356 | 0.172 | 93.516 | 0.011 | 0.003 |
| 50% | 260.000 | 8.170 | 470.000 | 0.240 | 157.015 | 0.062 | 0.041 |
| 75% | 348.585 | 8.379 | 570.000 | 0.908 | 258.979 | 0.158 | 0.245 |
| max | 870.052 | 8.499 | 1400.038 | 14.088 | 470.078 | 2.262 | 1.232 |

It can be observed that after data augmentation, the central tendency statistics (mean and percentiles), the measures of dispersion (standard deviation), and the range statistics (maximum and minimum values) remain relatively consistent with those of the raw data, falling within a reasonable range. This supports the effectiveness of using this method for data augmentation.

Figure 4.1 shows comparison boxplots of the raw data and the new data generated by the NI-BS-NI data augmentation. The figures show the distribution of the data intuitively, including the median, the upper and lower quartiles, and other information. The results show that the data augmentation operation had no obvious effect on the data distribution: the middle 25%-75% of the data for each factor is still concentrated, and no obvious new outliers appeared.
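The consistency between the two tables can also be checked numerically. The short sketch below computes the relative change in mean and standard deviation for each variable, using only the values reported in Tables 4.1 and 4.2; the helper name `relative_change` and the choice of relative change as the comparison measure are illustrative assumptions, not part of the original analysis.

```python
# (mean, std) pairs copied from Table 4.1 (raw) and Table 4.2 (augmented).
raw = {
    "TDS": (325.050, 196.970),
    "pH": (8.164, 0.232),
    "Conductivity": (535.300, 315.613),
    "Turbidity": (1.735, 3.477),
    "Hardness": (174.521, 120.023),
    "Fe": (0.187, 0.411),
    "Mn": (0.195, 0.342),
}
augmented = {
    "TDS": (287.282, 161.066),
    "pH": (8.158, 0.229),
    "Conductivity": (480.981, 255.833),
    "Turbidity": (1.646, 3.223),
    "Hardness": (176.618, 114.049),
    "Fe": (0.182, 0.382),
    "Mn": (0.193, 0.317),
}


def relative_change(before, after):
    """Signed change of `after` relative to `before`."""
    return (after - before) / before


changes = {
    name: {
        "mean": relative_change(raw[name][0], augmented[name][0]),
        "std": relative_change(raw[name][1], augmented[name][1]),
    }
    for name in raw
}

for name, change in changes.items():
    print(f"{name:<12} mean {change['mean']:+.1%}   std {change['std']:+.1%}")
```

On these figures the largest shifts occur for TDS and conductivity, the two variables whose raw statistics in Table 4.1 rest on only 20 measured values; the other means move by roughly 5% or less.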
Figure 4.1 Comparison boxplots of raw data and NI-BS-NI augmented data

4.1.1.2 Correlation analysis

For the original and augmented data, a correlation analysis was conducted to detect the correlation between numerical independent variables. Pearson correlation coefficient heatmaps were generated, as shown in Figure 4.2. The correlation coefficient between hardness and conductivity is 0.87 (raw data). The strong correlation between them may be attributed to the influence of dissolved substances in water. Dissolved minerals such as carbonates, sulfates, and chlorides increase the hardness of water; at the same time, the water's conductivity also increases because calcium and magnesium ions commonly exist in hard water. Therefore, hardness and conductivity exhibit a positive correlation. Similarly, hardness shows a strong correlation with TDS, which may be because calcium and magnesium ions can also react with other substances in the water to form precipitates or suspended substances that increase the TDS. In addition, there is a strong correlation between the concentration of iron and turbidity, with a correlation coefficient greater than 0.8. This may be because iron present as suspended particles in water increases the turbidity, which is a measure of suspended particles. The correlations between other variables are not significantly strong, which means that most variables contain unique information, and there is no excessive information overlap.
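For reference, the Pearson coefficient behind such a heatmap can be computed as in the following sketch. The column values are invented for illustration and are not the study's measurements; `pearson_r` and `correlation_matrix` are hypothetical helper names, and a plotting library would normally be used to render the resulting matrix as a heatmap.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of named columns."""
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Invented example values (assumption), five samples per variable.
columns = {
    "hardness": [80.0, 150.0, 260.0, 310.0, 470.0],
    "conductivity": [170.0, 360.0, 480.0, 600.0, 1400.0],
    "pH": [8.4, 7.9, 8.2, 8.0, 8.3],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Because the coefficient only measures linear association, a low value between two variables does not by itself rule out a non-linear relationship, which is one reason the tree-based models are examined with partial dependence plots later on.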
Figure 4.2 Pearson Correlation Heatmap of (a) raw data (b) NI-BS-NI augmented data

4.1.2 Model results

Table 4.3 Performance of NI-BS-NI based models and best hyperparameters

| Model | Output Variable | Best Parameters | Train R² | Test R² | Train RMSE | Test RMSE |
|---|---|---|---|---|---|---|
| RF | Fe | {'max_depth': 40, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10} | 0.992 | 0.969 | 0.035 | 0.040 |
| RF | Mn | {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 20} | 0.968 | 0.943 | 0.055 | 0.093 |
| XGB | Fe | {'learning_rate': 0.1, 'max_depth': 20, 'n_estimators': 100, 'subsample': 0.7} | 1 | 0.974 | 0.002 | 0.036 |
| XGB | Mn | {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 100, 'subsample': 0.7} | 0.9999 | 0.933 | 0.002 | 0.101 |
| GBR | Fe | {'learning_rate': 0.1, 'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 50} | 0.999 | 0.969 | 0.01 | 0.04 |
| GBR | Mn | {'learning_rate': 0.1, 'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 100} | 0.9999 | 0.979 | 0.002 | 0.056 |
| DT | Fe | {'max_depth': 50, 'min_samples_leaf': 2, 'min_samples_split': 2} | 0.997 | 0.95 | 0.022 | 0.051 |
| DT | Mn | {'max_depth': 50, 'min_samples_leaf': 2, 'min_samples_split': 2} | 0.992 | 0.976 | 0.027 | 0.06 |

After augmenting the raw data with the NI-BS-NI method, this study used four ML algorithms, namely Random Forest (RF), Extreme Gradient Boosting (XGB), Decision Tree (DT) and Gradient Boosting Regression (GBR), to predict iron and manganese concentrations in drinking water, and searched for the best parameters during modelling to obtain the best performance of each model. The modelling achieved good results, as shown in Table 4.3. Compared with the other three models, GBR has the best overall performance in the simulation results. In the GBR rows, the Train R² values for iron and manganese are just below 1, and the Test R² values are also very good, at 0.969 and 0.979 respectively. The RMSE values are all below 0.06, so the errors between predicted data and actual data are minimal.
Although XGB has the highest Train R², its Test R² is comparatively low relative to the training results. When predicting manganese, XGB scores especially low in Test R² (0.933), which may be caused by overfitting. Besides, the Test RMSE of the XGB model was not good for predicting Mn concentration (RMSE_Mn = 0.101). In addition, the RF and DT models showed comparatively poor simulation performance, reflected in lower R² and higher RMSE scores. A possible reason is that RF and DT are simpler than the GBR model; as such, they perform poorly when simulating data with non-linear and complex feature relationships.

In addition, scatter diagrams and regression lines of the four models in predicting iron and manganese were produced to visualize the simulation performance, as shown in Figures 4.3-4.6. These diagrams show the consistency between predicted values and actual values of metal concentration. Relative to the reference line in each diagram, all 4 models present good predictive performance, expressed by scatter points lying close to the reference line. The GBR model showed excellent performance in predicting Fe and Mn: as shown in Figure 4.5, most scattered points fell on the reference line.

Figure 4.3 Scatter regression plots of (a) Fe (b) Mn of DT

Figure 4.4 Scatter regression plots of (a) Fe (b) Mn of RF

Figure 4.5 Scatter regression plots of (a) Fe (b) Mn of GBR

Figure 4.6 Scatter regression plots of (a) Fe (b) Mn of XGB

4.1.3 Interpretable analysis

4.1.3.1 Feature importance

In this section, GBR, the best model in predicting iron and manganese, was selected to present the effect of the 5 predictors on Fe and Mn concentration. Feature importance was obtained by directly reading the feature_importances_ attribute of the model and was output as Feature Importance Ranking Plots, as shown below.
Figure 4.7 shows the feature importance ranking of the 5 input predictors in predicting Fe and Mn, which can also be read as the ranking of the size of each feature's influence. From top to bottom, the importance of the features decreases gradually. For the Fe prediction, turbidity has an outstanding impact, with an importance score over 0.8. It is followed by hardness, TDS, conductivity and pH, all with low importance. For the Mn prediction, conductivity is the most important feature, followed by hardness and turbidity, which also show some impact on Mn prediction. TDS and pH again show little importance, as they do for iron.

Figure 4.7 Feature importance of GBR model after NI-BS-NI augmentation

4.1.3.2 Partial dependence

In this section, the dependence of the metal predictions on each feature is explained. Explaining how each feature affects the prediction is what distinguishes partial dependence from the feature importance discussed in Section 3.3.1. Taking the XGB model in this simulation as an example, 5 partial dependence plots were created to explain the effect of the 5 input features on iron and manganese concentration. In Figure 4.8, the x-axis represents the range of variation of each input variable, and the y-axis represents the predicted values. The partial dependence plots confirm the result obtained from the correlation analysis: iron and turbidity show a strong correlation, while iron has little correlation with other features. As shown in Figure 4.8, the predicted iron concentration increases significantly with the increase of turbidity. For TDS, pH, conductivity and hardness, the iron concentration curve stays essentially horizontal, indicating that these four physical characteristics have very limited impact on the iron concentration. In the dependence plots for manganese, all 5 features show a greater effect than in the iron plots: the Mn values fluctuate in every plot.
Mn concentration decreases from 0.3 to 0.1 mg/L as TDS increases up to 400 mg/L, then stabilizes at around 0.1 mg/L. pH and manganese concentration show a small negative correlation. Conductivity shows a significant positive correlation: the Mn concentration increases rapidly from 0.1 to 0.4 mg/L when the conductivity changes from 300 to 600 µS/cm. The Mn concentration generally increases as turbidity increases, as shown in the figure. When hardness is around 250 mg/L, the concentration of Mn peaks at about 0.35 mg/L.

Figure 4.8 Partial Dependence Plots of XGB model

Figure 4.9 Two-way Partial Dependence Plots of XGB model

After analyzing the partial dependence of individual features, the synergistic effect of two factors on the prediction of iron and manganese also needs to be considered. According to the correlation analysis in Section 4.1.1.2, most of the indicators show generally low correlations. Thus, it is worthwhile to explore the combined effect of two variables on the predicted values. Figure 4.9 shows the two-way partial dependence of the GBR model for iron and manganese prediction. The horizontal and vertical axes represent the variation range of the two features, and the third dimension is represented by color: yellow blocks show greater predicted concentrations, and purple indicates small concentrations. Under the joint influence of two features, the greater the turbidity and the lower the hardness, the greater the concentration of iron. The combined effect of TDS and hardness had little effect on iron concentration. When the turbidity is greater than 6 NTU and the TDS is greater than about 320 mg/L, the iron concentration increases to 0.8 mg/L or above. For manganese, the concentration is at its maximum with increasing conductivity and a hardness between 250-280 mg/L.
The joint plot of turbidity and conductivity indicates that the greater their values, the greater the concentration of manganese. However, the combined effect of hardness and turbidity had little effect on the manganese concentration.

4.1.3.3 SHAP analysis

Figure 4.10 SHAP summary plot of Fe of GBR model

In this section, the importance of each feature is expressed by its SHAP value. Figure 4.10 shows the ranking of the average impact on model output magnitude, determined by the mean absolute SHAP value (|SHAP value|). Among the five input features, the effect of turbidity is most pronounced for iron prediction, followed by hardness and TDS, which have an equal magnitude of impact. The effect of pH and conductivity on the iron prediction is not obvious.

Figure 4.11 SHAP scatter diagram of Fe of GBR model

A more detailed explanation is given by the SHAP scatter diagram shown in Figure 4.11. The horizontal axis represents the SHAP values: positive values represent a positive effect of the sample on the prediction, and negative values represent a negative effect. Crowded areas indicate a large number of samples gathered together. Color indicates the size of the feature values: red indicates high feature values and blue indicates low feature values. Turbidity shows a positive correlation with the SHAP value for the iron prediction. Large turbidity values strongly affect the Fe prediction, with clearly increasing SHAP values. The patterns for hardness and TDS are similar to that of turbidity, but a portion of their samples are concentrated around a SHAP value of 0, so they have much less impact on the prediction of Fe. The variation of pH and conductivity barely affects the prediction of Fe, as the majority of their samples cluster around a SHAP value of 0.
Figure 4.12 SHAP summary plot of Mn of GBR model

Figure 4.13 SHAP scatter diagram of Mn of GBR model

Conductivity is the least important factor in the prediction of Fe; conversely, it shows the most important effect for Mn prediction, as shown in Figure 4.12. It is followed by hardness, turbidity, TDS and pH. Among these features, pH and TDS tend to have less effect on the prediction of both Fe and Mn. From Figure 4.13, we can see that the mid-range conductivity values (purple) tend to have a positive impact on Mn prediction. As for hardness, the lower the value, the smaller the effect on Mn prediction. Most of the low turbidity values cluster on the negative side of the SHAP value, which means the lower the turbidity value, the more likely it is to show a negative effect. Most of the TDS and pH values concentrate around a SHAP value of 0, indicating that their effect on Mn prediction is low.

4.2 GAN Based Modelling

Single-parameter data augmentation was applied in the GAN data augmentation method: synthetic data for iron and for manganese were generated and analyzed separately.

4.2.1 Statistical analysis

4.2.1.1 Correlation analysis of iron

As Figure 4.14 shows, the correlation coefficient between iron and turbidity is 0.83, the strongest correlation in this heatmap. pH shows some correlation with iron and turbidity, with correlation coefficients of 0.4 and 0.47, respectively.

Figure 4.14 Pearson Correlation Heatmap of Fe in GAN

4.2.1.2 Correlation analysis of manganese

Compared with iron, the correlations between parameters in the Pearson Correlation Heatmap of manganese are not very significant. Turbidity and pH show the same correlation as in Figure 4.14, with a correlation coefficient of 0.47.

Figure 4.15 Pearson Correlation Heatmap of Mn in GAN

4.2.2 GAN sample generation

4.2.2.1 GAN sample generation for iron

After the GAN training, 1920 samples were generated initially.
The maximum number of iterations was set to 10 to acquire effective synthetic data in less time, and 1000 samples were generated in each iteration. The number of epochs and the batch size were set to 15 and 128, respectively. Of these samples, 1740 were identified as correct samples by the Discriminator in the process of GAN sample generation. However, 1516 generated samples were deleted because of high similarity, and 1 sample with negative values also had to be removed. No out-of-range samples were found in the generation process. As a result, a total of 223 sets of samples of iron and its three predictive parameters remained for the next ML process.

Figure 4.16 Iron cross-validation error and best error during loop iteration

4.2.2.2 GAN sample generation for manganese

As in the process of generating iron-related data, 1920 samples were generated initially. The maximum number of iterations was set to 10, with 1000 samples generated in each iteration. The number of epochs and the batch size were also set to 15 and 128. Of these samples, 1079 were identified as correct samples by the Discriminator, and 800 generated samples were deleted because of high similarity; no out-of-range samples or negative values were found in the generation process. In total, 276 sets of samples finally remained.

Figure 4.17 Manganese cross-validation error and best error during loop iteration

Figures 4.16 and 4.17 show the cross-validation error and best error during loop iteration for iron and for manganese, respectively. These two figures show similar trends. The best cross-validation error of the synthetic data decreased slightly over the iterations in both figures, which indicates an improvement in the quality of the augmented data. In addition, the cross-validation error overall remains at a similar level at the end.
A possible reason is that the GAN model had learned the characteristics of the data and reached its peak performance.

4.2.3 Model results

4.2.3.1 Model results of iron prediction

Four ML methods were used to simulate iron prediction with the GAN augmented data. Performance results and best parameters for iron prediction of each model are shown in Table 4.4, and the visualized model results are shown as scatter plots in Figure 4.18. Among the RF, XGB, GBR, and DT models, the GBR model shows the best performance in predicting iron concentration, with train R², test R², train RMSE, and test RMSE scores of 0.999, 0.994, 0.002, and 0.037, respectively. Besides, the RF, XGB and GBR models all achieved very good performance: the train and test R² of these 3 models are greater than 0.99, indicating strong fitting ability, and the train and test RMSE are all less than 0.04, reflecting small errors between predicted samples and actual samples. In the scatter plots in Figure 4.18, most of the training and testing data fell very close to the regression line.

Figure 4.18 Scatter regression plots of (a) RF; (b) XGB; (c) GBR; (d) DT for Fe prediction

Table 4.4 Performance of GAN based models and best hyperparameters for Fe prediction

| Model | Best Parameters | Train R² | Test R² | Train RMSE | Test RMSE |
|---|---|---|---|---|---|
| RF | {'max_depth': 40, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50} | 0.996 | 0.996 | 0.029 | 0.030 |
| XGB | {'learning_rate': 0.05, 'max_depth': 10, 'n_estimators': 100, 'subsample': 0.9} | 0.999 | 0.996 | 0.014 | 0.031 |
| GBR | {'learning_rate': 0.1, 'max_depth': 40, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100} | 1 | 0.994 | 0.002 | 0.037 |
| DT | {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 5} | 0.980 | 0.991 | 0.068 | 0.043 |

4.2.3.2 Model results of manganese prediction

For manganese prediction using GAN augmented data, the GBR model has the best performance, the same as for iron prediction.
The GBR model's train R2, test R2, train RMSE, and test RMSE (0.9995, 0.988, 0.005, and 0.028, respectively) are the best among the four models, reflecting an excellent fit and accurate prediction for manganese. However, RF did not perform as well here as elsewhere in this study, with a test R2 of only 0.819 and a test RMSE greater than 0.1.

Table 4.5 Performance of GAN based models and best hyperparameters for Mn prediction

Model  Train R2  Test R2  Train RMSE  Test RMSE  Best Parameters
RF     0.936     0.819    0.055       0.108      {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10}
XGB    0.941     0.929    0.053       0.068      {'learning_rate': 0.1, 'max_depth': 20, 'n_estimators': 20, 'subsample': 0.8}
GBR    0.9995    0.988    0.005       0.028      {'learning_rate': 0.1, 'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
DT     0.992     0.975    0.020       0.040      {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5}

Figure 4.19 Scatter regression plots of (a) RF; (b) XGB; (c) GBR; (d) DT for Mn prediction

4.2.4 Interpretable analysis

4.2.4.1 Feature importance

Feature importance rankings are presented for GBR, the best model for predicting both iron and manganese after GAN augmentation. As shown in Figure 4.20, turbidity is the only feature with a significant impact on iron prediction, with an importance score close to 1. The effect of pH and hardness is negligible. For Mn prediction, hardness is the most important feature; turbidity and pH follow and show some impact, but both score below 0.2.

Figure 4.20 Feature importance of GBR model after GAN augmentation

4.2.4.2 Partial importance

The dependence of the iron and manganese predictions on pH, turbidity, and hardness is discussed below, taking the best model, GBR, as an example. As shown in Figure 4.21, the predicted iron concentration increases significantly with increasing turbidity, consistent with the feature importance results.
However, pH and hardness show no obvious partial dependence for iron prediction. In the manganese partial dependence plots, there is a small negative correlation between pH and manganese. As turbidity increases, the manganese concentration fluctuates, drops sharply at a turbidity of around 2 NTU, and then increases gradually. Moreover, the predicted manganese content shows a strong positive correlation with hardness over the range of 0 to 400 mg/L.

Figure 4.21 Partial Dependence Plots of GBR model

Figure 4.22 shows the two-way partial dependence based on the GBR model. The joint influence of pH and turbidity on the iron concentration is regular: the iron concentration increases by around 0.2 mg/L for every 2 units of turbidity. The combined effect of hardness and turbidity shows a similar pattern: changes in hardness or pH have a very limited effect on the iron concentration. For manganese, the maximum concentration occurs when turbidity is about 2.4 NTU and hardness is about 260-300 mg/L. The combined effects of pH and turbidity, and of pH and hardness, are not significant for manganese prediction.

Figure 4.22 Two-way Partial Dependence Plots of GBR model

4.2.4.3 SHAP analysis

Figure 4.23 presents the average impact of pH, turbidity, and hardness on iron and manganese prediction, ranked by SHAP values. For manganese prediction, hardness is the most significant factor, followed by turbidity and pH. Turbidity is the most important factor affecting the predicted iron concentration, while pH and hardness have limited impact on iron, with SHAP values below 0.04.

Figure 4.23 SHAP summary plot of GBR model after GAN

Figure 4.24 presents the SHAP scatter diagrams for iron and manganese.
There is clear evidence of a positive correlation between turbidity and its SHAP values for iron prediction: as turbidity increases, its impact on the iron output shifts gradually from negative to positive. Hardness shows a similar positive correlation; however, its influence is much smaller than that of turbidity, and a large number of hardness samples gather around 0, which means those samples have little impact on iron prediction. The same holds for pH: most samples have a negligible effect, and only some large pH readings show some impact on the iron output.

Figure 4.24 SHAP scatter diagram of GBR model after GAN

As shown in the manganese SHAP scatter diagram, hardness and turbidity are positively correlated with SHAP values, while pH shows a very limited negative correlation. Additionally, most of the turbidity samples cluster around 0-0.1 on the x-axis and some low values (blue points) fall on the left side of the horizontal axis, indicating that most of them have a minor impact on model output; however, some small pH values have a negative impact on the manganese prediction.

4.3 Graphical user interface

A prediction application was developed for practical iron and manganese prediction. Figures 4.25 and 4.26 show the graphical user interfaces of two applications built on the two technologies explored in this study. The applications were developed in MATLAB. The graphical user interface based on GAN technology shows greater application potential, for two main reasons: (1) only 3 input parameters need to be measured beforehand to predict iron and manganese concentrations; (2) compared with traditional data augmentation methods, GAN shows a powerful ability to generate diverse synthetic samples, given its AI-based algorithm.
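At its core, a 3-parameter prediction tool of this kind needs a function that checks the three field measurements and passes them to the fitted Fe and Mn models. The Python sketch below illustrates that idea only. The tool in this study was built in MATLAB, and the names `predict_fe_mn` and `INPUT_RANGES`, the range limits, and the stand-in models are hypothetical placeholders rather than the fitted GBR models.

```python
from typing import Callable, Dict, Sequence

# Hypothetical plausibility limits for the three inputs (assumed values).
INPUT_RANGES = {
    "pH": (0.0, 14.0),
    "turbidity_NTU": (0.0, 100.0),
    "hardness_mg_L": (0.0, 1000.0),
}

Model = Callable[[Sequence[float]], float]


def predict_fe_mn(inputs: Dict[str, float], fe_model: Model, mn_model: Model) -> Dict[str, float]:
    """Validate the three inputs and return predicted Fe and Mn in mg/L."""
    features = []
    for name, (lo, hi) in INPUT_RANGES.items():
        if name not in inputs:
            raise ValueError(f"missing input: {name}")
        value = float(inputs[name])
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} is outside [{lo}, {hi}]")
        features.append(value)
    # Concentrations cannot be negative, so clip model outputs at zero.
    return {
        "Fe_mg_L": max(0.0, fe_model(features)),
        "Mn_mg_L": max(0.0, mn_model(features)),
    }


# Stand-in models with arbitrary coefficients (placeholders, not fitted).
def toy_fe_model(x: Sequence[float]) -> float:
    return 0.1 * x[1]      # depends on turbidity only


def toy_mn_model(x: Sequence[float]) -> float:
    return 0.0005 * x[2]   # depends on hardness only


result = predict_fe_mn(
    {"pH": 7.2, "turbidity_NTU": 2.0, "hardness_mg_L": 200.0},
    toy_fe_model,
    toy_mn_model,
)
print(result)
```

Rejecting implausible readings before prediction keeps the models from being queried far outside the range of the data they were trained on.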
Figure 4.25 Graphical user interface with 5 parameters

Figure 4.26 Graphical user interface with 3 parameters

Chapter 5 Conclusions

5.1 Research summary

In this study, prediction models for iron and manganese concentrations in the drinking water of small, rural and remote First Nations communities in BC were investigated using ML. Multiple data augmentation methods were developed to acquire effective synthetic data. Five physical indexes (TDS, conductivity, pH, turbidity, and hardness) were selected as predictors, i.e., input parameters. The results of 4 ML algorithms were compared to select the best model for developing the prediction interface. Statistical analysis and interpretable analysis were also conducted to collect information on the correlation and importance of the predictors. The main findings of this study are as follows:

(1) According to the trend statistics, the measures of dispersion, the distribution statistics analysis, and cross-validation, both Numerical Interpolation-Bootstrapping-Noise Injection (NI-BS-NI) based data augmentation and GAN based data augmentation synthesized reliable data for modelling.

(2) Two models based on different augmented data and predictors were developed in this study. The GAN based model is considered the better one for use in the prediction tool.

(3) The GBR model performed the best for both NI-BS-NI based modelling and GAN based modelling in predicting iron and manganese. The Train R2 values of the two models are close to 1, and the Test R2 values are also very good. All the RMSE scores are below 0.06. These evaluation results demonstrate the effectiveness of the GBR model.

(4) Iron shows a strong correlation with turbidity and little correlation with the other features. In addition, manganese concentration shows a significant positive correlation with conductivity.
(5) For iron prediction, turbidity has an outstanding impact, with an importance score over 0.8, followed by hardness, TDS, conductivity, and pH. For manganese prediction, conductivity is the most important feature, followed by hardness and turbidity.

5.2 Limitations and future research

In this study, the prediction of Fe and Mn concentrations in the drinking water of First Nations SRR areas in BC using ML was investigated in order to assess Fe and Mn concentrations rapidly on site, saving the time and cost of sending samples for laboratory tests. Although this attempt produced well-performing models in experiments, its feasibility for predicting Fe and Mn in practical field applications has not been verified. Further steps are needed before this method can be applied to actual water quality detection. Recommendations for follow-up optimization and possible future studies are as follows:

(1) Because the available water quality data were very limited, only 2 heavy metal pollutants, iron and manganese, were investigated in this study. In the future, more water quality data should be collected and combined with data augmentation methods, and more kinds of pollutants should be included to achieve fast and accurate detection. Prediction need not be limited to heavy metal pollutants; other hard-to-measure pollutants, such as E. coli and organic matter, could also be predicted.

(2) The prediction method developed in this study needs to be verified in practice. The predicted values obtained with the decision tool should be compared with true values measured in the laboratory to verify its effectiveness and reliability.

(3) Turbidity, which is one of the predictive parameters in this study, depends on the form of iron. The water samples were collected from groundwater, which contains more ferrous iron (Fe2+) owing to limited oxygen. While the samples were travelling to the lab, ferrous iron was oxidized to ferric iron (Fe3+) through exposure to air.
Then the solubility in the water samples decreased, affecting turbidity values. In this study, the models were established based on the laboratory water samples data. However, in actual use, all input index data will be obtained immediately by field meters, so it needs to be explored whether there is an impact on the accuracy of the prediction results in practical applications. (4) In GAN data augmentation process, the cross-validation error didn’t show an obvious reduction in the trend could also be caused by insufficient data or the inherent complexity of the data. This will be clarified if more raw sampling data can be acquired. (5) In future study, adding and adjusting different input parameters for modelling is worthy of consideration. The reason is unknown correlations between different physical or chemical parameters might help build up better models due to the interactive correlations between input values and predicted values. 79 References Ahmad, M. W., Reynolds, J., & Rezgui, Y. (2018). Predictive modelling for solar thermalenergy systems: A comparison of support vector regression, random forest, extra trees and regression trees. Journal of cleaner production, 203, 810-821. Ahmed, U., Mumtaz, R., Anwar, H., Shah, A. A., Irfan, R., & García-Nieto, J. (2019). Efficient water quality prediction using supervised machine learning. Water, 11(11), 2210. Al-Razee, A., Abser, M. N., Mottalib, M. A., Rahman, M. S., & Cho, N. (2019). Assessment of heavy metals in sediments of Shitalakhya River, Bangladesh. 분석과학, 32(5), 210-216. Aldhyani, T. H. H., Al-Yaari, M., Alkahtani, H., & Maashi, M. (2020). Water Quality Prediction Using Artificial Intelligence Algorithms. Applied Bionics and Biomechanics, 2020, 6659314. Alias, N., Rosli, S. A., Sazalli, N. A. H., Hamid, H. A., Arivalakan, S., Umar, S. N. H., . . . Lockman, Z. (2020). 15 - Metal oxide for heavy metal detection and removal. In Y. AlDouri (Ed.), Metal Oxide Powder Technologies (pp. 299-332): Elsevier. Apley, D. 
W., & Zhu, J. (2020). Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(4), 1059-1086. Asadollah, S. B. H. S., Sharafati, A., Motta, D., & Yaseen, Z. M. (2021). River water quality index prediction and uncertainty analysis: A comparative study of machine learning models. Journal of environmental chemical engineering, 9(1), 104599. Ayangbenro, A. S., & Babalola, O. O. (2017). A New Strategy for Heavy Metal Polluted Environments: A Review of Microbial Biosorbents. International Journal of Environmental Research and Public Health, 14(1), 94. Azrour, M., Mabrouki, J., Fattah, G., Guezzaz, A., & Aziz, F. (2022). Machine learning algorithms for efficient water quality prediction. Modeling Earth Systems and Environment, 8(2), 2793-2801. Babbar, R., & Babbar, S. (2017). Predicting river water quality index using data mining techniques. Environmental Earth Sciences, 76, 1-15. Baijius, W., & Patrick, R. J. (2019). “We don’t drink the water here”: the reproduction of undrinkable water for First Nations in Canada. Water, 11(5), 1079. Balasooriya, B. K., Rajapakse, J., & Gallage, C. (2023). A review of drinking water quality issues in remote and Indigenous communities in rich nations with special emphasis on Australia. Science of the Total Environment, 166559. 80 Baptista, M. L., Goebel, K., & Henriques, E. M. (2022). Relation between prognostics predictor evaluation metrics and local interpretability SHAP values. Artificial Intelligence, 306, 103667. Briffa, J., Sinagra, E., & Blundell, R. (2020). Heavy metal pollution in the environment and their toxicological effects on humans. Heliyon, 6(9), e04691. British Columbia Assembly of First Nations. (2023). Lheidli T'enneh First Nation. https://www.bcafn.ca/first-nations-bc/cariboo/lheidli-tenneh-first-nation Brown, J., Acey, C. S., Anthonj, C., Barrington, D. J., Beal, C. D., Capone, D., . . . Hicks, B. (2023). 
The effects of racism, social exclusion, and discrimination on achieving universal safe water and sanitation in high-income countries. The Lancet Global Health, 11(4), e606-e614. Bua, D. G., Annuario, G., Albergamo, A., Cicero, N., & Dugo, G. (2016). Heavy metals in aromatic spices by inductively coupled plasma-mass spectrometry. Food Additives & Contaminants: Part B, 9(3), 210-216. Chen, Y., Zhu, Y., & Chang, Y. (2020). CycleGAN Based Data Augmentation For Melanoma images Classification. Paper presented at the Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition, Xiamen, China. Chowdhury, S., Champagne, P., & McLellan, P. J. (2009). Models for predicting disinfection byproduct (DBP) formation in drinking waters: a chronological review. Science of the Total Environment, 407(14), 4189-4206. Connor, S., Khoshgoftaar, T. M., & Borko, F. (2021). Text data augmentation for deep learning. Journal of Big Data, 8(1). Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2018). Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Daley, K., Jamieson, R., Rainham, D., & Truelstrup Hansen, L. (2018). Wastewater treatment and public health in Nunavut: a microbial risk assessment framework for the Canadian Arctic. Environmental Science and Pollution Research, 25(33), 32860-32872. Deng, Z., Yang, Z., Ma, X., Tian, X., Bi, L., Guo, B., . . . Zhang, S. (2018). Urinary metal and metalloid biomarker study of Henoch-Schonlein purpura nephritis using inductively coupled plasma orthogonal acceleration time-of-flight mass spectrometry. Talanta, 178, 728-735. Dezfooli, D., Hosseini-Moghari, S.-M., Ebrahimi, K., & Araghinejad, S. (2018). Classification of water quality status based on minimum quality parameters: application of machine learning techniques. Modeling Earth Systems and Environment, 4, 311-324. 81 Duignan, S., Moffat, T., & Martin-Hill, D. (2022). 
Be like the running water: Assessing gendered and age-based water insecurity experiences with Six Nations First Nation. Social Science & Medicine, 298, 114864. El Bilali, A., & Taleb, A. (2020). Prediction of irrigation water quality parameters using machine learning models in a semi-arid environment. Journal of the Saudi Society of Agricultural Sciences, 19(7), 439-451. Fekri, M. N., Ghosh, A. M., & Grolinger, K. (2019). Generating energy data for machine learning with recurrent generative adversarial networks. Energies, 13(1), 130. Fernández-Martínez, R., Rucandio, I., Gómez-Pinilla, I., Borlaf, F., García, F., & Larrea, M. T. (2015). Evaluation of different digestion systems for determination of trace mercury in seaweeds by cold vapour atomic fluorescence spectrometry. Journal of Food Composition and Analysis, 38, 7-12. First Nations Health Authority. (2023). Monthly Drinking Water Advisories in First Nations Communities in BC - November 2023. https://www.fnha.ca/Documents/DrinkingWater-Advisory-Monthly-Summary.pdf Government of Canada. (2017). First Nations People in Canada. https://www.rcaanccirnac.gc.ca/eng/1307460755710/1536862806124 Government of Canada. (2021). Indigenous peoples and communities. https://www.rcaanccirnac.gc.ca/eng/1100100013785/1529102490303 Government of Canada. (2023). Ending long-term drinking water advisories. https://www.sac-isc.gc.ca/eng/1506514143353/1533317130660 Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. journal of Computational and Graphical Statistics, 24(1), 44-65. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., . . . Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139-144. Gumpu, M. B., Sethuraman, S., Krishnan, U. M., & Rayappan, J. B. B. (2015a). 
A review on detection of heavy metal ions in water–an electrochemical approach. Sensors and Actuators B: Chemical, 213, 515-533. 82 Gumpu, M. B., Sethuraman, S., Krishnan, U. M., & Rayappan, J. B. B. (2015b). A review on detection of heavy metal ions in water – An electrochemical approach. Sensors and Actuators B: Chemical, 213, 515-533. Health Canada. 2019. Guidelines for Canadian Drinking Water Quality. https://publications.gc.ca/collections/collection_2019/sc-hc/H144-13-14-2019-eng.pdf Health Canada. 2021. Guidelines for Canadian Drinking Water Quality - Summary Tables. https://www.canada.ca/en/health-canada/services/environmental-workplacehealth/reports-publications/water-quality/guidelines-canadian-drinking-water-qualitysummary-table.html Hu, G., Mian, H. R., Abedin, Z., Li, J., Hewage, K., & Sadiq, R. (2022). Integrated probabilistic-fuzzy synthetic evaluation of drinking water quality in rural and remote communities. Journal of Environmental Management, 301, 113937. Hu, G., Mian, H. R., Mohammadiun, S., Rodriguez, M. J., Hewage, K., & Sadiq, R. (2023). Appraisal of machine learning techniques for predicting emerging disinfection byproducts in small water distribution networks. Journal of hazardous materials, 446, 130633. Ibrahim, M., Louie, M., Modarres, C., & Paisley, J. (2019). Global explanations of neural networks: Mapping the landscape of predictions. Paper presented at the Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. Iglesias, G., Talavera, E., González-Prieto, Á., Mozo, A., & Gómez-Canaval, S. (2022). Data augmentation techniques in time series domain: A survey and taxonomy. arXiv preprint arXiv:2206.13508. Imani, M., Hasan, M. M., Bittencourt, L. F., McClymont, K., & Kapelan, Z. (2021). A novel machine learning application: Water quality resilience prediction Model. Science of the Total Environment, 768, 144459. Indigenous Foundations. (2009). Aboriginal Identity & Terminology. 
https://Indigenousfoundations.arts.ubc.ca/aboriginal_identity__terminology/ Inoue, H. (2018). Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929. Islam, M., & Yuan, Q. (2018). First Nations wastewater treatment systems in Canada: Challenges and opportunities. Cogent Environmental Science, 4(1), 1458526. 83 Jaishankar, M., Tseten, T., Anbalagan, N., Mathew, B. B., & Beeregowda, K. N. (2014). Toxicity, mechanism and health effects of some heavy metals. Interdiscip Toxicol, 7(2), 60-72. Khan, Y., & See, C. S. (2016). Predicting and analyzing water quality using machine learning: a comprehensive model. Paper presented at the 2016 IEEE Long Island Systems, Applications and Technology Conference (LISAT). Kim, H., Harrison, F. E., Aschner, M., & Bowman, A. B. (2022). Exposing the role of metals in neurological disorders: a focus on manganese. Trends in Molecular Medicine, 28(7), 555-568. Kingsbury, B. (1998). “Indigenous peoples” in international law: a constructivist approach to the Asian controversy. American Journal of International Law, 92(3), 414-457. Le Bot, B., Lucas, J.-P., Lacroix, F., & Glorennec, P. (2016). Exposure of children to metals via tap water ingestion at home: Contamination and exposure data from a nationwide survey in France. Environment International, 94, 500-507. Li, P., Karunanidhi, D., Subramani, T., & Srinivasamoorthy, K. (2021). Sources and consequences of groundwater contamination. Archives of environmental contamination and toxicology, 80, 1-10. Li, X., Yang, Y., Yang, J., Fan, Y., Qian, X., & Li, H. (2021). Rapid diagnosis of heavy metal pollution in lake sediments based on environmental magnetism and machine learning. Journal of hazardous materials, 416, 126163. Lim, A. P., & Aris, A. Z. (2014). A review on economically adsorbents on heavy metals removal in water and wastewater. Reviews in Environmental Science and Bio/Technology, 13, 163-181. Lin, L., Wang, F., Xie, X., & Zhong, S. (2017). 
Random forests-based extreme learning machine ensemble for multi-regime time series prediction. Expert Systems with Applications, 83, 164-176. Liu, Y., Wang, T., & Chu, F. (2024). Hybrid machine condition monitoring based on interpretable dual tree methods using Wasserstein metrics. Expert Systems with Applications, 235, 121104. Lu, H., Li, H., Liu, T., Fan, Y., Yuan, Y., Xie, M., & Qian, X. (2019). Simulating heavy metal concentrations in an aquatic environment using artificial intelligence models and physicochemical indexes. Science of the Total Environment, 694, 133591. Lu, S.-Y., Zhang, H.-M., Sojinu, S. O., Liu, G.-H., Zhang, J.-Q., & Ni, H.-G. (2015). Trace elements contamination and human health risk assessment in drinking water from Shenzhen, China. Environmental Monitoring and Assessment, 187, 1-8. 84 Lu, S., Zhou, Q., Ouyang, Y., Guo, Y., Li, Q., & Wang ,J. (2018). Accelerated discovery of stable lead-free hybrid organic-inorganic perovskites via machine learning. Nature Communications, 9, 3405 Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30. Lyu, T., Tang, Y., Cao, H., Gao, Y., Zhou, X., Zhang, W., . . . Jiang, Y. (2023). Estimating the geographical patterns and health risks associated with PM2. 5-bound heavy metals to guide PM2. 5 control targets in China based on machine-learning algorithms. Environmental Pollution, 337, 122558. Ma, Z., Wang, J., Feng, Y., Wang, R., Zhao, Z., & Chen, H. (2023). Hydrogen yield prediction for supercritical water gasification based on generative adversarial network data augmentation. Applied Energy, 336, 120814. Markus, M., Tsai, C. W.-S., & Demissie, M. (2003). Uncertainty of weekly nitrate-nitrogen forecasts using artificial neural networks. Journal of environmental engineering, 129(3), 267-274. Martin, Y. E., & Johnson, E. A. (2012). 
Biogeosciences survey:Studying interactions of the biosphere with the lithosphere, hydrosphere and atmosphere. Progress in Physical Geography: Earth and Environment, 36(6), 833-852. McLeod, L., Bharadwaj, L., & Waldner, C. (2014). Risk factors associated with the choice to drink bottled water and tap water in rural Saskatchewan. International Journal of Environmental Research and Public Health, 11(2), 1626-1646. McLeod, L., Bharadwaj, L. A., Daigle, J., Waldner, C., & Bradford, L. E. A. (2020). A quantitative analysis of drinking water advisories in Saskatchewan Indigenous and rural communities 2012–2016. Canadian Water Resources Journal/Revue canadienne des ressources hydriques, 45(4), 345-357. Meehan, K., Jepson, W., Harris, L. M., Wutich, A., Beresford, M., Fencl, A., . . . Wells, C. (2020). Exposing the myths of household water insecurity in the global north: A critical review. Wiley Interdisciplinary Reviews: Water, 7(6), e1486. Meena, A. K., Mishra, G., Rai, P., Rajagopal, C., & Nagar, P. (2005). Removal of heavy metal ions from aqueous solutions using carbon aerogel as an adsorbent. Journal of hazardous materials, 122(1-2), 161-170. Mian, H. R., Chhipi-Shrestha, G., Hewage, K., Rodriguez, M. J., & Sadiq, R. (2020). Predicting unregulated disinfection by-products in small water distribution networks: an empirical modelling framework. Environmental Monitoring and Assessment, 192, 1-20. Najafzadeh, M., & Ghaemi, A. (2019). Prediction of the five-day biochemical oxygen demand and chemical oxygen demand in natural streams using machine learning methods. Environmental Monitoring and Assessment, 191, 1-21. 85 Navarro-Espinoza, S., Angulo-Molina, A., Meza-Figueroa, D., López-Cervantes, G., Meza-Montenegro, M., Armienta, A., . . . Álvarez-Bajo, O. (2021). Effects of untreated drinking water at three Indigenous Yaqui towns in Mexico: insights from a murine model. International Journal of Environmental Research and Public Health, 18(2), 805. Nie, X., & Wager, S. 
(2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2), 299-319. Pang, X., Gao, T., Qiu, Y., Caffrey, N., Popadynetz, J., Younger, J., . . . Checkley, S. (2021). The prevalence and levels of enteric viruses in groundwater of private wells in rural Alberta, Canada. Water Research, 202, 117425. Patrick, R. J., Grant, K., & Bharadwaj, L. (2019). Reclaiming Indigenous Planning as a Pathway to Local Water Security. 11(5), 936. Petrea, Ș.-M., Costache, M., Cristea, D., Strungaru, Ș.-A., Simionov, I.-A., Mogodan, A., . . . Cristea, V. (2020). A machine learning approach in analyzing bioaccumulation of heavy metals in turbot tissues. Molecules, 25(20), 4696. Prince George Citizen. (2021). Lheidli T’enneh First Nation calls for federal funding for clean drinking water. https://www.princegeorgecitizen.com/local-news/lheidli-tenneh-first-nation-calls-forfederal-funding-for-clean-drinking-water-4778801 Qin, Z., Liu, Z., Zhu, P., & Xue, Y. (2020). A GAN-based image synthesis method for skin lesion classification. Computer Methods and Programs in Biomedicine, 195, 105568. Rasheed, T., Shafi, S., & Sher, F. (2022). Smart nano-architectures as potential sensing tools for detecting heavy metal ions in aqueous matrices. Trends in Environmental Analytical Chemistry, e00179. Revathi, M., Jeya, I. J. S., & Deepa, S. N. (2020). Deep learning-based soft computing model for image classification application. Soft Computing, 24(24), 18411-18430. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386. Rooki, R., Doulati Ardejani, F., Aryafar, A., & Bani Asadi, A. (2011). Prediction of heavy metals in acid mine drainage using artificial neural network from the Shur River of the Sarcheshmeh porphyry copper mine, Southeast Iran. Environmental Earth Sciences, 64, 1303-1316. 86 Rowles III, L. S., Hossain, A. I., Ramirez, I., Durst, N. J., Ward, P. M., Kirisits, M. J., . . . Saleh, N. B. 
(2020). Seasonal contamination of well-water in flood-prone colonias and other unincorporated US communities. Science of the Total Environment, 740, 140111. Schimpf, C., & Cude, C. (2020). A systematic literature review on water insecurity from an Oregon public health perspective. International Journal of Environmental Research and Public Health, 17(3), 1122. Schwartz, H., Marushka, L., Chan, H. M., Batal, M., Sadik, T., Ing, A., . . . Tikhonov, C. (2021). Metals in the drinking water of First Nations across Canada. Canadian Journal of Public Health, 112(Suppl 1), 113-132. Shafqat, S. S., Rizwan, M., Batool, M., Shafqat, S. R., Mustafa, G., Rasheed, T., & Zafar, M. N. (2023). Metal organic frameworks as promising sensing tools for electrochemical detection of persistent heavy metal ions from water matrices: A concise review. Chemosphere, 137920. Shahi, N. K., Maeng, M., & Dockko, S. (2020). Models for predicting carbonaceous disinfection by-products formation in drinking water treatment plants: a case study of South Korea. Environmental Science and Pollution Research, 27, 24594-24603. Shao, S., Wang, P., & Yan, R. (2019). Generative adversarial networks for data augmentation in machine fault diagnosis. Computers in Industry, 106, 85-93. Shen, L., & Qian, Q. (2022). A virtual sample generation algorithm supporting machine learning with a small-sample dataset: A case study for rubber materials. Computational Materials Science, 211, 111475. Shen, Z., Ouyang, X., Xiao, B., Cheng, J.-Z., Shen, D., & Wang, Q. (2023). Image synthesis with disentangled attributes for chest X-ray nodule augmentation and detection. Medical Image Analysis, 84, 102708. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1-48. Singh, R., Gautam, N., Mishra, A., & Gupta, R. (2011). Heavy metals and living systems: An overview. Indian J Pharmacol, 43(3), 246-253. Singh, V., Singh, N., Rai, S. N., Kumar, A., Singh, A. 
K., Singh, M. P., . . . Mishra, V. (2023). Heavy Metal Contamination in the Aquatic Ecosystem: Toxicity and Its Remediation Using Eco-Friendly Approaches. Toxics, 11(2), 147. Statistics Canada. (2022). Indigenous population continues to grow and is much younger than the non-Indigenous population, although the pace of growth has slowed. https://www150.statcan.gc.ca/n1/daily-quotidien/220921/dq220921a-eng.htm 87 Stride, B., Abolfathi, S., Odara, M., Bending, G. D., & Pearson, J. (2023). Modelling microplastic and solute transport in vegetated flows Dispersion of polyethylene in submerged model canopies. Water Resources Research, e2023WR034653. Sun, Y., Chen, F., Zafar, A., Khan, Z. I., Ahmad, K., Ch, S. A., . . . Nadeem, M. (2023). Assessment of potential toxicological risk for public health of heavy metal iron in diverse wheat varieties irrigated with various types of waste water in South Asian country. Agricultural Water Management, 276, 108044. Taghizadeh-Mehrjardi, R., Fathizad, H., Ali Hakimzadeh Ardakani, M., Sodaiezadeh, H., Kerry, R., Heung, B., & Scholten, T. (2021). Spatio-temporal analysis of heavy metals in arid soils at the catchment scale using digital soil assessment and a random forest model. Remote Sensing, 13(9), 1698. Teng, Z., Yuan Huang, J., Fujita, K., & Takizawa, S. (2001). Manganese removal by hollow fiber micro-filter. Membrane separation for drinking water. Desalination, 139(1), 411-418. Tremblay, C. V., Beaubien, A., Charles, P., & Nicell, J. A. (1998). Control of biological iron removal from drinking water using oxidation-reduction potential. Water science and technology, 38(6), 121-128. The Council of Canadians. (2020). Guidelines for drinking-water quality: Fourth edition incorporating the first and second addenda. https://canadians.org/analysis/fighting-covid-19-starts-universal-access-water-andsanitation The World Bank. (2021). 
Use of AI Technology to Support Data Collection for Project Prepa ration and Implementation: A ‘Learning-by-doing’ Proces. https://gpss.worldbank.org/sites/gpss/files/knowledge_products/2021/Use%20of%20AI% 20technology%20to%20support%20data%20collection.pdf Tyagi, S., & Talbar, S. N. (2022). CSE-GAN: A 3D conditional generative adversarial network with concurrent squeeze-and-excitation blocks for lung nodule segmentation. Computers in Biology and Medicine, 147, 105781. Uddin, M. G., Nash, S., Diganta, M. T. M., Rahman, A., & Olbert, A. I. (2022). Robust machine learning algorithms for predicting coastal water quality index. Journal of Environmental Management, 321, 115923. 88 United Nations. (2015). The human rights to safe drinking water and sanitation : resolution / adopted by the General Assembly. https://digitallibrary.un.org/record/821067 United Nations. (2023). Indigenous Peoples. https://social.desa.un.org/issues/Indigenouspeoples Valko, M., Morris, H., & Cronin, M. (2005). Metals, toxicity and oxidative stress. Current medicinal chemistry, 12(10), 1161-1208. Veschetti, E., Achene, L., Ferretti, E., Lucentini, L., Citti, G., & Ottaviani, M. (2010). Migration of trace metals in Italian drinking waters from distribution networks. Toxicological & Environmental Chemistry, 92(3), 521-535. Wang, H., Wei, L., Yang, C., Liu, J., & Shen, J. (2020). A pyridine-Fe gel with an ultralowloading Pt derivative as ORR catalyst in microbial fuel cells with long-term stability and high output voltage. Bioelectrochemistry, 131, 107370. Wang, J., Ji, H., Wang, Q. g., Li, H., Qian, X., Li, F., & Yang, M. (2017). Prediction of size-fractionated airborne particle-bound metals using MLR, BP-ANN and SVM analyses. Chemosphere, 180, 513-522. Wang, J., Yang, Z., Zhang, J., Zhang, Q., & Chien, W.-T. K. (2019). AdaBalGAN: An improved generative adversarial network with imbalanced learning for wafer defective pattern recognition. 
IEEE Transactions on Semiconductor Manufacturing, 32(3), 310-319.
Wang, R., Kim, J.-H., & Li, M.-H. (2021). Predicting stream water quality under different urban development pattern scenarios with an interpretable machine learning approach. Science of the Total Environment, 761, 144057.
Winter, E. (2002). The Shapley value. Handbook of Game Theory with Economic Applications, 3, 2025-2054.
Wolfe, P. (2006). Settler Colonialism and the Elimination of the Native. Journal of Genocide Research, 8(4), 387-409.
Woodcock, G. (1988). A social history of Canada. Toronto: Viking Penguin Group.
World Health Organization. (2017). Guidelines for drinking-water quality: fourth edition incorporating first addendum, 4th ed + 1st add. https://iris.who.int/handle/10665/254637
World Health Organization. (2023). Drinking-water. https://www.who.int/news-room/fact-sheets/detail/drinking-water
Yan, X., Zang, Z., Luo, N., Jiang, Y., & Li, Z. (2020). New interpretable deep learning model to monitor real-time PM2.5 concentrations from satellite data. Environment International, 144, 106060.
Yeganeh, M., Azari, A., Sobhi, H. R., Farzadkia, M., Esrafili, A., & Gholami, M. (2023). A comprehensive systematic review and meta-analysis on the extraction of pesticide by various solid phase-based separation methods: a case study of malathion. International Journal of Environmental Analytical Chemistry, 103(5), 1068-1085.
Zhai, Y., Han, Y., Xia, X., Li, X., Lu, H., Teng, Y., & Wang, J. (2021). Anthropogenic Organic Pollutants in Groundwater Increase Releases of Fe and Mn from Aquifer Sediments: Impacts of Pollution Degree, Mineral Content, and pH. Water, 13(14), 1920.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
Zhang, H., Yin, S., Chen, Y., Shao, S., Wu, J., Fan, M., . . . Gao, C. (2020).
Machine learning-based source identification and spatial prediction of heavy metals in soil in a rapid urbanization area, eastern China. Journal of Cleaner Production, 273, 122858.
Zhang, P., Yang, M., Lan, J., Huang, Y., Zhang, J., Huang, S., . . . Ru, J. (2023). Water Quality Degradation Due to Heavy Metal Contamination: Health Impacts and Eco-Friendly Approaches for Heavy Metal Remediation. Toxics, 11(10), 828.
Zhang, S., Li, X., Zong, M., Zhu, X., & Cheng, D. (2017). Learning k for kNN classification. ACM Transactions on Intelligent Systems and Technology (TIST), 8(3), 1-19.
Zhang, Y., Wang, Z., Zhang, Z., Liu, J., Feng, Y., Wee, L., . . . Traverso, A. (2023). GAN-based one dimensional medical data augmentation. Soft Computing, 1-11.
Zhao, B., & Yuan, Q. (2021). Improved generative adversarial network for vibration-based fault diagnosis with imbalanced data. Measurement, 169, 108522.
Zhao, J., Yan, X., Zhu, T., Wang, J., Li, H., Zhang, P., . . . Ding, L. (2015). Multi-throughput dynamic microwave-assisted leaching coupled with inductively coupled plasma atomic emission spectrometry for heavy metal analysis in soil. Journal of Analytical Atomic Spectrometry, 30(9), 1920-1926.
Zoni, S., & Lucchini, R. G. (2013). Manganese exposure: cognitive, motor and behavioral effects on children: a review of recent findings. Current Opinion in Pediatrics, 25(2), 255.
Assembly of First Nations. (2021). Description of the AFN. https://www.afn.ca/description-of-the-afn/