LUMPING OF ATMOSPHERIC ORGANIC CHEMICAL SPECIES BY MACHINE LEARNING

by

Pruthvi Polam

B.E., Mechanical Engineering, Bangalore University (India), 1999

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MATHEMATICAL, COMPUTER, AND PHYSICAL SCIENCES (CHEMISTRY AND COMPUTER SCIENCE)

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

April 2006

© Pruthvi Polam, 2006

Abstract

Lumping of atmospheric chemical species into different groups is one of the effective techniques used to reduce the complexity of reaction mechanisms. Since lumping of chemical species into different categories is a classification problem, the application of machine learning by Artificial Neural Networks (ANNs) is an appropriate way to address the problem from a computational perspective. The conventional notation used to represent chemical species is not in a form which can be given directly as input for machine learning. Issues such as what type of chemical information is appropriate, and how best to present it as input so that an ANN classifies the chemical species into different lumped categories with good results, are discussed. Both supervised and unsupervised learning methods are explored. The study in this thesis suggests that supervised ANNs can be more gainfully employed for lumping of atmospheric chemical species than unsupervised ANNs.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 Chemical Mechanisms
    1.1.1 Atmospheric Air Quality Simulation Modeling
  1.2 Chemical Mechanism Reduction Methods
    1.2.1 Mechanism Reduction without Time Scale Analysis
    1.2.2 Kinetic Lumping Approach
    1.2.3 Reduction Based on the Investigation of Time Scales
    1.2.4 Approximate Lumping in Systems with Time Scale Separation
    1.2.5 Structural and Molecular Lumping Approach
    1.2.6 Advantages and Disadvantages of Condensed Mechanism Approaches
  1.3 Motivation

2 Artificial Neural Networks
  2.1 Introduction
  2.2 Pattern Recognition
  2.3 Architecture
  2.4 Applications
  2.5 Learning Methods
    2.5.1 Supervised Learning
    2.5.2 Unsupervised Learning
  2.6 Area of Research

3 Methodology - Generation and Pruning of Chemical Species Database
  3.1 Introduction
  3.2 Generation of Chemical Species Database
  3.3 EPI (Estimation Program Interface) Suite
  3.4 Pruning of Chemical Species Database
    3.4.1 Functional Group Approach
    3.4.2 Vapor Pressure

4 Methodology - Representation of Chemical Species
  4.1 Introduction
  4.2 SMILES (Simplified Molecular Input Line Entry System) Notation
    4.2.1 Atoms
    4.2.2 Bonds
    4.2.3 Branches
    4.2.4 Cyclic Structures
    4.2.5 Aromaticity
  4.3 Matrix Notation
    4.3.1 Techniques used in the literature
  4.4 The Approach

5 Methodology - Reactions and Lumping
  5.1 Tropospheric Chemical Reactions
    5.1.1 Reactions of Alkanes
    5.1.2 Reactions of Alkenes
    5.1.3 Alkyne Reactions
    5.1.4 Reaction of Oxygen-containing Organic Species
  5.2 Lumping Approach Employed for Classification

6 Methodology - Artificial Neural Networks
  6.1 Nature of Input to ANN
  6.2 Supervised Learning - Multilayer Feedforward Neural Network
    6.2.1 Backpropagation Algorithm
  6.3 Unsupervised Learning - Competitive Neural Networks
    6.3.1 Similarity Measure Layer
    6.3.2 Competitive Layer (or Maxnet)
    6.3.3 The Combination of these Two Layers
  6.4 Usecase Diagram

7 Application of the ANN for Classification of Chemical Species: Implementation and Results
  7.1 Supervised Neural Networks
    7.1.1 Network Development
    7.1.2 Training and Testing the Network Model
  7.2 Unsupervised Neural Networks
    7.2.1 Network Development
    7.2.2 Training and Testing the Network Model
  7.3 Discussion

8 Conclusions and Future Directions
  8.1 Conclusions
  8.2 Future Directions
Appendices
  Appendix A - Back Propagation Algorithm
  Appendix B - Chemical Species List

Bibliography

List of Tables

3.1 List of the empirical formulas used to generate the chemical species database
4.1 Some examples of SMILES notation to represent molecules
4.2 A list of chemical species with SMILES notation
7.1 Number of chemical species in the dataset
7.2 Network parameters adopted for supervised neural network experimentation
7.3 Classification accuracy of lumping chemical species into appropriate groups with 27 and 35 hidden nodes (HN)
7.4 Classification accuracy of lumping chemical species into appropriate groups with 65 and 75 hidden nodes (HN)
7.5 Best classification accuracy of chemical species into appropriate lumping groups
7.6 Network parameters for unsupervised neural network
7.7 Examples for chemical species represented in vector notation (VN) and normalized vector notation (NVN)
7.8 Analysis of results for alcohols - misclassification of chemical species in supervised learning method obtained from 5 iterations
7.9 Analysis of results for other chemical species - misclassification of chemical species in supervised learning method obtained from 5 iterations

List of Figures

1.1 Taxonomy of air quality simulation modeling
2.1 A typical feedforward neural network architecture (after Figure 1 in [35])
2.2 Competitive neural network (after Figure in [34])
3.1 Methodology of the research
4.1 SMILES notation for molecules with branched structure
4.2 SMILES notation for cyclic structured molecules (after figure in [48])
4.3 SMILES notation for aromatic chemical species (after figure in [48])
4.4 BE matrix representations that specify atomic connectivity and electronic environment for (left) ethane, (center) ethyl radical, and (right) ethene (after Figure 5 in [1])
4.5 Reaction matrix representation for (a) H-abstraction, (b) β-scission, (c) Recombination, (d) Bond fission, and (e) Radical addition (after Figure 6 in [1])
4.6 Bond fission reaction in matrix notation
4.7 A set of reaction pathways and its matrix operations
4.8 Various transformations of the chemical species
5.1 Scission of C-C bond
6.1 Various transformations of the chemical species
6.2 Transfer functions
6.3 Training process for competitive neural network
6.4 Usecase diagram for the system
7.1 Training of a neural network
7.2 Prototype weight vector (W) formed after 1000 epochs [Columns represent the clusters and rows represent the connectivity information]
7.3 Results obtained for unsupervised learning method
7.4 Example for the input data

Acknowledgements

I would like to express my gratitude to all those whose support made it possible to complete this thesis. I am greatly indebted to my co-supervisor, Dr. Margot Mandy, for providing suggestions and encouragement which helped me throughout the research and writing of this thesis. Her comments have been of the greatest help at all times. My gratitude also goes to my co-supervisor, Dr. Charles Brown, for his suggestions and encouragement that led to substantial improvements of this thesis. I thank Dr. Peter Jackson for serving on my graduate committee and monitoring the work with his valuable suggestions. I would also like to thank the external examiner for reviewing this thesis.

I also extend my sincere thanks to Dr. Alex Aravind and Dr. Maheshwari for their guidance and assistance during my education. I thank the chair of the Computer Science department, Dr. Waqar Haque, for the departmental research assistantship, and the chair of the Chemistry department, Dr. Ron Thring, for providing me with a teaching assistantship. I would also like to thank my co-supervisor Dr. Margot Mandy for the financial support from her Natural Sciences and Engineering Research Council of Canada grant.

I thank Alida Hall and Janis Shandro for their support during my teaching assistantship. I acknowledge all persons in the programs of Chemistry and Computer Science at the University of Northern British Columbia for their efforts during my education. I thank Srinivas, Baljeet, Jeyaprakash, Joanne and Kouhyar for their good company and suggestions during my education. Lastly, but most importantly, I am very grateful for the love and support of my parents, Narasimha Reddy and Vanaja, and my fiancée Swapna Reddy in fulfilling my dreams.

Chapter 1

Introduction

1.1 Chemical Mechanisms

A chemical mechanism is a detailed description of the sequence of elementary processes which occur during an overall chemical reaction. It includes a list of all primary, secondary, and intermediate reactions, which gives certain, essentially quantitative, information about the fate of the chemical species. The chemical mechanisms which describe the pyrolysis, combustion, atmospheric, and oxidation chemistry of even light hydrocarbons can be extremely complex [1]. Hundreds or thousands (or more) of kinetically significant chemical species, elementary reactions, and a large number of reactive intermediates can be involved.
In atmospheric chemistry, even relatively minor emissions into the atmosphere can play an important role in the formation of undesirable byproducts, and their properties and reactions need to be modeled in some detail in order to make accurate predictions. There are many existing models for combustion, pyrolysis, and atmospheric chemistry. For example, in one pyrolysis model, considering only species with two or fewer carbon atoms generated 11 chemical species and 55 chemical reactions, but when the number of carbon atoms was increased to three, 99 chemical species and 611 chemical reactions were generated [1]. In combustion systems, kinetic models have thousands of elementary reactions and a large number of reactive intermediates. For example, there are 3,662 chemical reactions involving 470 chemical species considered in the simulations of n-hexane combustion by Glaude and co-workers [2], and 479,206 reactions and 19,052 species in the simulation of tetradecane combustion performed by De Witt and co-workers [3].

1.1.1 Atmospheric Air Quality Simulation Modeling

Ozone is not directly emitted into the atmosphere; rather, ozone and other oxidants are formed by complex reactions between nitrogen oxides (NOx) and reactive organic chemical species. Reliable and scientifically valid methods are required to formulate appropriate and cost-effective control strategies and to estimate the type of emission reductions needed to reduce the formation of ozone. Air quality simulation models can be used to address these kinds of problems.

Air quality simulation models are designed to estimate parameters related to air quality over large geographic regions. These types of problems can be addressed using available models of the chemical and physical processes which influence the formation of ozone. Apart from the meteorological data, an important component of such models is the gas-phase chemical reaction mechanism, which is used to describe the fate of emitted chemical species in the atmosphere. Atmospheric modeling can be broken down into three main components: atmospheric meteorology; emission inventory modeling, which includes the quantity, location, and rate of pollutant emissions; and the chemical mechanism.

1. Meteorological Data: Local and regional scale meteorological processes provide information such as wind speed, wind direction, cloud cover, and turbulence that affect the transport, dispersion, deposition, and chemistry of airborne pollutants.

2. Emission Inventory: The emissions of biogenic and anthropogenic organic chemical species and their precursors are simulated. This simulation includes the quantity, location, and rate of chemical species emissions, which are required to gain an accurate understanding of the different emission sources.

3. Atmospheric Chemical Mechanism: The atmospheric chemical mechanisms within air quality models have grown in complexity. These chemical mechanisms can either be constructed manually or the process can be automated. In manual construction of detailed chemical mechanisms, chemists examine which chemical species are most likely to be present in the system and which reactions are likely to occur under appropriate conditions. The interconversion between reactants and products also introduces a large number of intermediate species. A huge number of highly coupled reaction steps have to be examined, and there is always a possibility of human error.
Since atmospheric chemistry involves many different organic species, the number of reactions may become unmanageable for application in models used to describe a particular airshed. Manual construction of chemical mechanisms is extremely time- and labor-intensive even for simple systems, because the emission of even a light hydrocarbon into the atmosphere can involve hundreds of kinetically significant chemical species, elementary reactions, and reactive intermediates.

For the past two decades, it has been recognized that the process of constructing chemical kinetic models could be computationally automated [1,4-8]. In order to automate the mechanism generation, the following points have to be considered:

(a) The structure of the chemical species should be stored in a form which can be accessed and manipulated computationally.

(b) All combinations of species must be considered, but a specific reaction must not be produced twice.

(c) A program should be able to parameterize all the reactions based on empirical rules associated with the type of reaction and the size and structure of the reactants.

(d) Because a reaction generator produces a very large number of possible reactions, it should be possible to automatically filter out the reactions which are obviously unimportant.

The chemical mechanisms are the most computationally intensive aspects of photochemical air quality simulation models. This is because of the presence of thousands of atmospheric chemical species and reactions, as well as the amount of computer time required for the numerical integration of the rate equations associated with thousands of chemical reactions. This computational burden is partly due to the fact that atmospheric chemical kinetic systems are very "stiff", involving changes associated with disparate time scales [9]. This becomes a serious limitation for the application of simulations. The taxonomy of the air quality simulation models is shown in Figure 1.1.

[Figure 1.1: Taxonomy of air quality simulation modeling. Modeling atmospheric chemistry at urban and regional scale draws on meteorological data, other inputs, and the chemical mechanism; the chemical mechanism is obtained by manual construction or by automation, yielding either a detailed chemical mechanism or a condensed chemical mechanism, the latter via mathematical, structural, molecular, or other grouping approaches.]

1.2 Chemical Mechanism Reduction Methods

A fully explicit mechanism for representing gas-phase atmospheric chemistry would contain 20,000 or more reactions and thousands of chemical species. Due to the large numbers of chemical species and reactions present in atmospheric chemical mechanisms and limited computational resources, explicit chemical mechanisms are generally not used in atmospheric air quality simulation models. Rather, the mechanisms for air quality models are highly condensed in various ways to substantially reduce the number of reactions and species, in order to be computationally tractable while maintaining accuracy. Even the developers of highly detailed mechanisms adopt some method to limit the size. Lumping of chemical species has been widely employed in the development of condensed mechanisms, and several condensed chemical mechanisms have been designed in tractable form for air quality simulation modeling. Some of the mechanism reduction methods are:
1. Mechanism reduction without time scale analysis [10,11]
   • Identifying redundant species
   • Identifying redundant reactions
   • Sensitivity of temperature to rate coefficients
2. Formal lumping procedures [11-13]
3. Reduction based on the investigation of time scales [11,14-16]
   • Low-dimensional systems
   • Jacobian analysis
   • Computational singular perturbation theory
   • Slow/inertial manifolds
4. Approximate lumping in systems with time scale separation [17-19]
5. Structural and molecular lumping approach [20-24]

1.2.1 Mechanism Reduction without Time Scale Analysis

The first step of mechanism reduction is to find the subset of the detailed mechanism which consists of fewer chemical reactions and species and which still describes the system adequately. The reduced mechanism may be tailored later according to specific requirements. The primary stage in finding an appropriate mechanism is to find the redundant species.

Identifying Redundant Species

Species in a chemical mechanism can be classified into three categories: important species, which include reaction products or initial reactants; necessary species, which are the chemical species that assist in accurate reproduction of the concentration profiles of important species, temperature profiles, or other important reaction features; and the remaining species, which are redundant species [11]. Two methods have been proposed to identify redundant species.

1. If a species has no consuming reactions, a change in its concentration has no influence on the concentration of the other species. Therefore, a species which does have consuming reactions could be classified as redundant if the elimination of the reactions that consume it has no significant effect on the output of the model when compared with the full model.

2. A species may be considered redundant if a change in its concentration has no effect on the rate of production of important species. The Jacobian $J = \partial f/\partial c$ of the ordinary differential equations which describe the kinetic system, where $f$ is the rate of production of species and $c$ is concentration, is used for this investigation. An element of the normalized Jacobian, $\partial \ln f_i/\partial \ln c_j$, shows the fractional change of the rate of production of species $i$ caused by a fractional change of the concentration of species $j$. The influence of a change in the concentration of species $i$ on the rates of production of an $N$-membered group of important species is given by the sum of squared elements of the normalized Jacobian [11]:

$$B_i = \sum_{j=1}^{N} \left( \frac{\partial \ln f_j}{\partial \ln c_i} \right)^2 \qquad (1.1)$$

The higher the $B_i$ value for a species, the greater its direct effect on the concentrations of the important species. This provides a quantitative measure allowing the identification of possible redundant species.

Identifying Redundant Reactions

A reaction is also considered to be redundant if its contribution to the production rate of each necessary species is small throughout the modeling regime. To establish this, all reaction contributions to each necessary species at several reaction times need to be considered, which can require the analysis of very large matrices. An alternative technique for reducing the mechanism by eliminating the redundant reactions is through overall sensitivity measures and principal component analysis of the normalized rate sensitivity matrix [11]

$$\tilde{F}_{ij} = \frac{\partial \ln f_i}{\partial \ln k_j} = \frac{\nu_{ij} R_j}{f_i} \qquad (1.2)$$

where $\nu_{ij}$ is the stoichiometric coefficient of species $i$ in reaction $j$, $R_j$ is the rate of reaction $j$, $k_j$ is the rate coefficient for reaction $j$, and $f_i$ is the rate of production of species $i$. The reactions whose contributions, on the basis of the eigenvalues of $\tilde{F}^{\mathrm{T}}\tilde{F}$, are below a desired precision threshold may be eliminated in that region.
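Both screens come down to a few lines of linear algebra. The following is a minimal NumPy sketch of the two measures, using random toy matrices in place of a real kinetic model; it is our own illustration of Eqs. (1.1) and (1.2), not code from [11].

```python
import numpy as np

def species_redundancy(J_norm, important):
    """Eq. (1.1): B_i = sum over the important species j of
    (d ln f_j / d ln c_i)^2, where J_norm[j, i] = d ln f_j / d ln c_i.
    Species with small B_i are candidates for elimination."""
    return (J_norm[important, :] ** 2).sum(axis=0)

def reaction_screen(F_norm):
    """Principal component analysis of Eq. (1.2): eigenvectors of
    F~^T F~ with small eigenvalues point to groups of reactions whose
    contribution falls below the precision threshold."""
    return np.linalg.eigh(F_norm.T @ F_norm)

# Toy numbers only: 4 species (indices 0 and 1 important), 5 reactions.
rng = np.random.default_rng(0)
B = species_redundancy(rng.normal(size=(4, 4)), important=[0, 1])
eigenvalues, eigenvectors = reaction_screen(rng.normal(size=(4, 5)))
print(B)            # low values flag possibly redundant species
print(eigenvalues)  # small eigenvalues flag removable reaction groups
```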
Sensitivity of Temperature to Rate Coefficients

Temperature is one of the important features in reaction modeling, especially in combustion modeling. Reactions may be modeled over wide ranges of temperature profiles. Therefore, the sensitivity of the rate of change of temperature to a change in rate parameters is of importance. The temperature sensitivities become useful when the reduced model is required only to produce accurate temperature profiles [11]. The normalized temperature rate sensitivity is given by

$$\frac{\partial \ln(dT/dt)}{\partial \ln k_j} = \frac{Q_j R_j}{C_p\,(dT/dt)} \qquad (1.3)$$

where $T$ is the temperature, $k_j$ is the rate coefficient for reaction $j$, $Q_j$ is the exothermicity of reaction step $j$, $R_j$ is the rate of reaction $j$, and $C_p$ is the heat capacity per unit volume.

1.2.2 Kinetic Lumping Approach

In this approach, new lumped variables are related to the original variables by a mathematical lumping function which can be either linear or nonlinear and depends on the original species' concentrations as well as on other significant parameters. The main aim in this approach to lumping is to identify mathematical procedures which can be applied to a general reaction system and provide an automatic algorithm for reduction. Development of such procedures involves rigorous mathematical principles.

Linear methods, where the new species are represented as linear combinations of the original ones, work well for linear kinetic systems and also provide some degree of reduction that may be appropriate for nonlinear schemes. Nonlinear methods are more general, but can involve complicated algebraic methods which might limit their use.

The terms "exact" lumping and "approximate" lumping have been used to distinguish whether the lumped model has used approximations. The technique used in the kinetic lumping approach is an exact lumping method, which represents the exact features of the full model. The kinetics of a dynamic system with $n$ dependent variables can be described by an $n$-dimensional ordinary differential equation system $dy/dt = f(y)$. In mathematical terms, lumping reduces the system to $\hat{n}$ dimensions if a differential equation system $d\hat{y}/dt = \hat{f}(\hat{y})$ can be found that adequately models the kinetics of interest, where $\hat{n} < n$.
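As a worked illustration of exact linear lumping, consider a minimal example of our own (not one drawn from the cited lumping literature): two species consumed by first-order reactions with the same rate constant $k$ can be lumped into their sum with no approximation.

$$\frac{dy_1}{dt} = -k\,y_1, \qquad \frac{dy_2}{dt} = -k\,y_2, \qquad \hat{y} = h(y) = y_1 + y_2$$
$$\frac{d\hat{y}}{dt} = \frac{dy_1}{dt} + \frac{dy_2}{dt} = -k\,(y_1 + y_2) = -k\,\hat{y}$$

Here the lumping function is the linear map $\hat{y} = My$ with $M = (1\;\;1)$, and the one-variable lumped system reproduces the total concentration exactly. When the rate constants of the members differ, a lumping of this kind is generally only approximate.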
1.2.4 Approximate Lumping in Systems with Time Scale Separation

In systems with time scale separation, the fast variables $\phi(t)$ are approximated from the vector $z(t)$ of slow variables, leading to the elimination of $\phi(t)$ from the lumped differential equation system. It is assumed that these fast species equilibrate rapidly. The resulting lumped scheme will be valid over a wide range of conditions. The faster the variables $\phi$, the fewer are the terms needed in the approximation. Computational singular perturbation methods can be used to characterize the relative time scales in order to identify the faster species. This approach is related to the QSSA approach, with some extra terms added. The application of singular perturbation methods over QSSA has shown significant improvement in the accuracy of the resulting models [11].

1.2.5 Structural and Molecular Lumping Approach

There are two main diagnostic lumping approaches employed in the literature apart from the approaches discussed earlier in this chapter: the structural approach and the molecular approach. In the structural approach, the molecular structures or functional groups within the hydrocarbon molecules provide the lumping category. In the molecular lumping approach, the numerous emitted organic compounds are represented by a limited number of species, each of which represents a certain class of compounds. The principal requirement for this approach is that the average behavior of the lumped categories must not depart substantially from the behavior of the individual compounds that are lumped. The Carbon Bond IV Mechanism (CBM IV) [22] is an example of the lumped structure approach, developed by Gery in 1988-89. The Statewide Air Pollution Research Center (SAPRC) mechanism [20] is an example of the lumped molecule approach, developed by Carter in 1990. Other lumping approaches are used in the RADM (Regional Acid Deposition Model) mechanism [24] and the RACM (Regional Atmospheric Chemistry Mechanism) [23], developed by Stockwell. The morphecule approach is a new approach which is under development at the University of North Carolina [21]. Each of these previously published mechanisms is discussed in detail below.

CBM IV mechanism

In the CBM IV mechanism, organic compounds are grouped together according to bond type (e.g. carbon single bonds, carbon double bonds, or carbonyl bonds). The main advantage of this structure-lumping approach is that fewer surrogate categories are needed to represent bond groups [22], [9]. The CBM IV mechanism was evaluated against 170 experiments conducted in 3 different smog chambers. This mechanism allocates the chemical species in the atmosphere into four different classes (a small bookkeeping sketch follows the list):

1. Inorganic species are treated explicitly without lumping.

2. Organic species are represented by carbon bond surrogates. These carbon bond surrogates are used to describe the chemistry of three different types of carbon bonds:

(a) The single-bonded one-carbon-atom surrogate PAR is used to represent the chemistry of alkanes and most of the alkyl groups found in other organics.

(b) The double-bonded two-carbon-atom surrogate OLE (Olefins) is used to represent the chemistry of alkenes whose carbon-carbon double bonds are found in 1-alkenes.

(c) The third surrogate, the two-carbon-atom surrogate ALD2, is used to represent acetaldehyde and higher aldehydes that contain a -CHO group and adjacent carbon atoms. It is also used to represent 2-alkenes, because these species react very rapidly in the atmosphere to produce aldehyde products.

3. Organic species are represented by molecular surrogates. Two molecular surrogates are used to represent the chemistry of aromatic hydrocarbons:

(a) The surrogate TOL is a seven-carbon species used to categorize monoalkylbenzene structures.

(b) The surrogate XYL is an eight-carbon surrogate used to represent dialkylbenzene and trialkylbenzene structures.

4. Organic species like formaldehyde, ethene, and isoprene are treated explicitly because of their unique chemistry or special importance in the atmosphere.
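To convey the flavor of this carbon-bond bookkeeping, here is a small sketch that tallies surrogate concentrations for a mixture. The per-species surrogate counts are hand-assigned for a few familiar examples and are illustrative only; the authoritative allocations are those defined in [22].

```python
# Illustrative surrogate bookkeeping in the spirit of CBM IV. The
# per-species surrogate counts below are hand-assigned examples; a real
# implementation derives them from the bond types in each molecule.
CARBON_BOND_COUNTS = {
    "butane":  {"PAR": 4},            # four single-bonded carbons
    "propene": {"OLE": 1, "PAR": 1},  # one C=C pair plus one paraffinic carbon
    "toluene": {"TOL": 1},            # seven-carbon aromatic surrogate
}

def lump_mixture(mixture_ppm):
    """Sum surrogate concentrations over a {species: ppm} mixture."""
    totals = {}
    for species, ppm in mixture_ppm.items():
        for surrogate, count in CARBON_BOND_COUNTS[species].items():
            totals[surrogate] = totals.get(surrogate, 0.0) + count * ppm
    return totals

print(lump_mixture({"butane": 2.0, "propene": 1.0, "toluene": 0.5}))
# {'PAR': 9.0, 'OLE': 1.0, 'TOL': 0.5}
```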
Statewide Air Pollution Research Center (SAPRC) Mechanism

In the SAPRC mechanism, the chemical species which have similar reactivity contributing to the formation of ozone and other oxidants are lumped together. The SAPRC mechanism is based on detailed model species for which kinetic and mechanistic parameters have been evaluated against over 500 environmental chamber experiments [9], [20]. In this mechanism, the reactions of alkanes, alkenes (excluding ethene), aromatics, and biogenics are represented using generalized kinetic and mechanistic parameters specified by the user. The SAPRC mechanism contains more organic species than are represented in CBM IV.

1. Inorganic species, formaldehyde, acetaldehyde, acetone, glyoxal, methyl glyoxal, and ethene are explicitly represented in the mechanism.

2. Species such as the higher aldehydes and ketones are represented using the surrogate species approach.

3. Species such as alkanes, aromatics, and higher alkenes are represented in the mechanism using generalized reactions, with variable kinetic and mechanistic parameters assigned for each species by the user.

4. Species such as haloalkanes and haloalkenes, for which reaction mechanisms are highly uncertain, are represented by generic mechanism species.

According to the principles of this approach, organic species can be grouped into three functional groups: alkanes, alkenes, and aromatics. Alcohols and ethers are estimated to have mechanistic reactivity characteristics similar to alkanes; therefore they are lumped in the same group with alkanes. Within each group, the organic species can be specified further according to their reaction rates with OH radicals or other oxidizing agents. Generally, three classes can be specified within each of these three groups:

1. Slowly reacting species, for which only a relatively small fraction reacts during the model simulation.

2. Rapidly reacting species, which react essentially completely during a one-day simulation.

3. Species with intermediate reaction rates, which fall in neither of the other two categories above.

Thus, within each of these three classes, the organic species can be lumped together.

RADM (Regional Acid Deposition Model) Mechanism

The RADM mechanism was developed by Stockwell in 1990 [24] and has been used in the U.S. Environmental Protection Agency's Regional Acid Deposition Model (RADM). Like the SAPRC mechanism, the RADM is also a generalized species mechanism. The hydrocarbons are represented using lumped species with fixed rather than user-specified parameters.

1. Inorganic species, methane, formaldehyde, ethane, ethene, and isoprene are explicitly represented in the mechanism.

2. Alkanes other than methane and ethane are represented using three species, according to the rate coefficient for reaction with OH falling within the following ranges (a binning sketch follows this list):

(a) Between $3.4 \times 10^{3}$ and $6.8 \times 10^{3}$ ppm$^{-1}$ min$^{-1}$

(b) Less than $3.4 \times 10^{3}$ ppm$^{-1}$ min$^{-1}$

(c) Greater than $6.8 \times 10^{3}$ ppm$^{-1}$ min$^{-1}$

3. Alkenes other than ethene and isoprene are represented using two species:

(a) The first surrogate species (propene) is used to represent 1-alkenes.

(b) The second surrogate species (trans-2-butene) is used to represent internal alkenes, cyclic alkenes, and dienes.

4. Aromatic hydrocarbons are simulated using two surrogate species:

(a) Toluene, to represent aromatics of low reactivity.

(b) Xylene, to represent aromatics of high reactivity.

5. Five species are used to represent carbonyl compounds:

(a) Acetaldehyde is used as a surrogate to represent all aldehydes other than formaldehyde.

(b) Ketones are treated as a mixture of acetone and methyl ethyl ketone.

(c) Three species (glyoxal, methylglyoxal, and a lumped unsaturated dicarbonyl) are used to represent dicarbonyls formed during the oxidation reactions of aromatics.

6. Finally, generalized species are included to represent each of the following 9 organic species:

1) Alkyl nitrates
2) Formic acid
3) Peroxyacetic acid
4) Acetic acid and higher acids
5) Cresol
6) Unsaturated PANs
7) Methyl hydrogen peroxide
8) Higher organic peroxides
9) PAN and higher saturated acylperoxy nitrates
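The alkane grouping above amounts to a simple one-dimensional binning on the OH rate coefficient, sketched below. The class boundaries follow the ranges quoted above; the example rate coefficient values are rough illustrative magnitudes, not evaluated kinetic data.

```python
# Sketch of the RADM-style alkane binning by OH rate coefficient.
LOW, HIGH = 3.4e3, 6.8e3  # class boundaries, ppm^-1 min^-1

def alkane_surrogate(k_oh):
    if k_oh < LOW:
        return "low-reactivity alkane species"
    if k_oh > HIGH:
        return "high-reactivity alkane species"
    return "intermediate alkane species"

# Approximate k_OH magnitudes, for illustration only.
for name, k_oh in [("propane", 1.7e3), ("n-pentane", 5.7e3), ("n-octane", 1.2e4)]:
    print(f"{name}: {alkane_surrogate(k_oh)}")
```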
Stockwell in 1997 [23] developed another mechanism, the Regional Atmospheric Chemistry Mechanism (RACM), which is an updated and extended version of the RADM mechanism. A completely new condensed reaction scheme was included for biogenic compounds like isoprene, α-pinene, and d-limonene. This new scheme is based on recent kinetic and mechanistic data obtained for isoprene in various laboratory studies and includes methacrolein as one of the reaction products.

Morphecule mechanism

Recently, a method called the morphecule approach [21] has been under development at the University of North Carolina. The main objective of this approach is to eliminate some of the weaknesses in the existing condensed chemical mechanisms. This approach centers around the use of surrogate species called morphecules, the composition, concentration, and rate of reaction of which are updated after each time step in the simulation. Some of the weaknesses of other approaches which the morphecule approach is attempting to address are as follows (a toy sketch of the time-varying rate coefficient follows the list):

1. Hundreds of atmospheric VOCs are grouped into a few lumped surrogates, resulting in the loss of individual chemical species characteristics.

2. All the parameters, such as reaction rate coefficients and product yields of the lumped surrogates, are kept constant throughout the simulation, even though the atmospheric chemical mechanism progresses by depleting the more reactive species first. In the morphecule approach, the rate coefficients for a particular lumped surrogate and the type of products are updated at each time step.

3. Highly generic products are formed by lumping chemical species in a condensed mechanism.

4. In all lumping mechanisms, the number of organic radicals included in the mechanism is limited. In the CBM IV mechanism, alkyl radicals produced from the NO to NO2 oxidation reactions are classified as XO2. If the NO concentration is very low, XO2 reacts only with itself or with HO2; XO2 includes all such species even when NO concentrations are very low. The morphecule approach considers the rate of RO2 and its time evolution in more detail.
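The essence of weakness 2, and of the morphecule remedy, can be conveyed with a toy calculation in which the effective rate coefficient of a two-member surrogate is recomputed at each time step as the concentration-weighted mean over its members. All numbers and names here are invented for illustration; [21] should be consulted for the actual formulation.

```python
# Toy illustration of a time-varying lumped rate coefficient: as the
# more reactive member is depleted first, the surrogate's effective k
# drifts from the initial mean toward the slower member's value.

def effective_k(conc, ks):
    """Concentration-weighted mean rate coefficient of the surrogate."""
    return sum(c * k for c, k in zip(conc, ks)) / sum(conc)

conc = [1.0, 1.0]   # two member species of one surrogate
ks = [1.0, 10.0]    # the second member is ten times more reactive
dt = 0.01
for step in range(301):
    if step % 100 == 0:
        print(f"t = {step * dt:.1f}   k_eff = {effective_k(conc, ks):.2f}")
    conc = [c * (1.0 - k * dt) for c, k in zip(conc, ks)]  # simple decay step
```

Running this, the effective rate coefficient falls from 5.50 toward 1.00 as the fast member disappears, which is exactly the drift that a fixed-parameter lumped surrogate cannot capture.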
D isadvantages: There are always some uncertainties involved in the condensed mechanisms since there is a lot of flexibility and judgment involved in choosing the kinetics and prod­ ucts th a t represent the whole group of organics (i.e., the timing and magnitude of the chemical species produced during the chemical reactions). Despite the fact that these approaches have certainly reduced the complexity by condensing the chemical mechanism, it is still not an exact representation of the chemical processes and may not make accurate predictions. The major limitation of these kinds of approaches is often associated with inaccu­ racies due to the fact th at these lumped mechanisms have typically been optimized to fit the observed time concentration profile of a specific species. In order to in­ corporate the errors and uncertainties in kinetics and mechanisms of key reactions, these studies are frequently updated. These uncertainties vary from the one lumping 20 approach to the other depending upon the assumptions and techniques used. The limitation th a t there is a loss of accuracy in individual simulations by con­ densed representation of the chemical mechanism may be sufficiently compensated by increased computational tractability. 1.3 M otivation Applying computational techniques to chemistry is becoming increasingly popular in recent times. Computational modeling has become an essential tool to understand and trace the atmospheric chemical species. There are many attem pts in the literature to autom ate the generation of chemical species and reaction mechanisms [1,3-8]. Continued advances in the computational techniques are warranted because these models need to be more accurate and efficient. Automation of chemical mechanisms is very complex, fn this context, lumping techniques can be used to reduce the complexity of the process [12,13,17-24]. We have discussed the various attem pts made in the literature for reducing the complexity of the chemical mechanisms through lumping approaches. Most of the existing lumped models which were developed are done by brute force methods. fn another context, artificial neural networks are used as effective tools for pattern recognition and classification [32]. The ability of the neural network is to generalize. Generalization refers to the neural network producing reasonable outputs for inputs not encountered during learning. Since lumping of chemical species is a classification problem, from the computational perspective we believe applying neural networks is appropriate to solve this problem. The main objective of this approach is to reduce the “drudgery” involved in lumping problems by using a previously trained neural network. Once trained, a neural network is then used to classify a new chemical species which was not involved in the learning process to an appropriate lumped category. 21 C hapter 2 A rtificial N eural N etw orks 2.1 In troduction Artificial neural networks (ANN) are relatively crude electronic models based on the neural structure of the brain. The brain learns from experience as it stores the infor­ mation as patterns. This process of storing information as patterns, utilizing those patterns, and then solving problems, encompasses a new field in computing involv­ ing the creation of massively parallel networks and the training of those networks to solve specific problems. 
The training of these networks is achieved by extracting knowledge or patterns from complicated or imprecise data and detecting trends that are too complex to be noticed either by humans or by other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections for new situations of interest and to answer "what if" questions [32].

When using a neural network, it is always important to think about how to represent the data to the neural network. The usual method of data representation for a neural network is a vector. Each element in the vector represents a parameter of the pattern that influences the decision of assigning the pattern to a certain class. For example, in forecasting problems for air quality, parameters such as the concentrations of various pollutants, wind speed, wind direction, and temperature serve as the vector components. The neural network will be able, in this case, to forecast pollutant concentration.

Sometimes there is a need for input data to be normalized, depending upon the type and objective of the problem. For example, if we wanted to find the difference between two vectors, one approach would be to find the dot product of the normalized vectors: the dot product is maximal when the vectors have a minimal difference. The need for normalization varies from one problem to another and depends on the nature of the input data. Normalization of the input vector may be applied effectively only when it does not lead to the loss of any information the network needs in order to classify appropriately.

2.2 Pattern Recognition

Pattern recognition is widely used, often under the name of 'classification'. A pattern may be loosely defined as any entity that could be given a name. For example, a pattern could be a fingerprint image, a handwritten cursive word, a human face, or a speech signal [33]. Pattern recognition is defined formally as the process whereby a given pattern is assigned to one of a prescribed number of categories. A neural network may be used for pattern recognition if it first undergoes training. During training, the network is repeatedly presented with a set of input patterns only, for unsupervised learning; for supervised learning, the input patterns are presented along with the category to which each particular pattern belongs. Later, once training is terminated or completed, a pattern that has not been seen before (i.e. not been used in training) but belongs to the same population of patterns used to train the network is presented to the network. The network identifies the category of the pattern on the basis of the information it has extracted during the training process. Pattern recognition is achieved if the information is carried in the relative rather than the absolute values of the vector components and the category identified is correct or acceptable.
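The normalized dot-product comparison mentioned in Section 2.1 is easy to state in code. Here is a minimal sketch, with invented input patterns:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the normalized vectors: close to 1 when the two
    patterns differ only in overall scale, smaller as they diverge."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-element input patterns.
print(cosine_similarity([1, 2, 0, 4], [2, 4, 0, 8]))  # 1.0 (same direction)
print(cosine_similarity([1, 2, 0, 4], [4, 0, 2, 1]))  # ~0.38 (dissimilar)
```

This is the sense in which information carried in the relative, rather than the absolute, values of the vector components survives normalization.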
2.3 Architecture

The major building block of any ANN architecture is the processing element, or neuron. These neurons are located in one of three types of layers: the input layer, a hidden layer, or the output layer, as shown in Figure 2.1. The input neurons receive data from the outside environment, the hidden neurons receive signals from all of the neurons in the preceding layer, and the output neurons receive signals from all of the neurons in the preceding layer and send information back to the external environment. It is possible to have one or more hidden layers of neurons in a neural network, depending upon the complexity of the problem. These neurons are linked by lines of communication called connections. The way in which the neurons are connected has a great effect on the operation and performance of the network. ANN models can have a variety of topologies or paradigms. Detailed descriptions of all the paradigms are presented in Neural Network Design by Hagan [32]. A "feedforward" neural network has been used for the supervised method (see Figure 2.1), and a "competitive" neural network has been used for the unsupervised method (see Figure 2.2), in this thesis work to lump the chemical species into appropriate categories.

The network topology depends critically upon the number of training examples and the complexity of the pattern that the network is trying to learn. The optimum number of hidden nodes varies from one type of problem to another, and it is very difficult to determine a good network topology just from the number of inputs and outputs. Some authors refer to a "rule of thumb" for choosing the network topology, i.e., that the number of hidden nodes should be greater than the sum of input nodes and output nodes [36]. In this work we have followed this "rule of thumb" to determine the appropriate number of hidden nodes required for the neural network.

[Figure 2.1: A typical feedforward neural network architecture, showing an input layer, a hidden layer, and an output layer with an activation function (after Figure 1 in [35])]

[Figure 2.2: Competitive neural network, consisting of a similarity measure layer and a competitive layer governed by a learning rule (after Figure in [34])]

2.4 Applications

Artificial neural networks can be used in fields such as signal processing, robotics, pattern recognition, medicine, chemistry, speech recognition, business, and vision, including face recognition, edge detection, and visual search engines. Artificial neural networks have been applied in different fields of chemistry. A detailed description of the application of ANNs in chemistry and the representation of chemical information was given by John A. Burns and George M. Whitesides in 1993 [37], including applications to biological sequences, the interpretation of spectra, sensor arrays, and Quantitative Structure-Activity Relationships (QSAR).

2.5 Learning Methods

The purpose of learning is to train the network to perform the desired task. Learning rules are the methodologies used in support of training a neural network; they are used to extract information and knowledge of patterns from the training examples and to adjust the neural network accordingly. Each input has an associated weight that represents the strength of that particular connection. The learning rule allows the network to adjust two sets of parameters (the connection weights and the associated biases) in order to associate given input vectors with corresponding output vectors. The learning rules are methodologies for modifying the weights and biases dynamically, in an efficient way, such that accurate pattern recognition is achieved. During training, the input vectors are repeatedly presented, and the weights and biases are modified according to the learning rule until the network produces the desired associations with the desired accuracy.

There are as many learning rules as there are neural networks. As the architectures of neural networks vary, the learning rules also vary, but almost all learning rules fall into two main types: the supervised learning method and the unsupervised learning method.
During training periods, the input vectors are repeatedly presented, and the weights and biases are modified according to the learning rule, until the network produces the desired associations with the desired accuracy. There are as many learning rules as there are neural networks. As the architecture of the neural networks vary, the learning rules also vary, but mostly all the learning rules are categorized into two main types. They are the supervised learning method and the unsupervised learning method. In both cases, the neural network is able to generalize from what it has learned from the training patterns, so th a t when a pre­ 26 viously unseen input pattern is presented, the network responds with an appropriate answer. 2.5.1 S u p e rv ised L earning In the supervised learning method, each individual output node has an external “teacher” . Thus for each given input, the output unit is told what the desired re­ sponse ought to be. Supervised learning tries to match the output of the network to values th at have already been defined. Methods of supervised learning include error-correction learning, reinforcement learning, and stochastic learning. The im portant issue in supervised learning is th a t the total training error converges to a minimum, in th a t the error between the desired output (or the target output) and the computed network output decreases. One of the most commonly used meth­ ods in learning process is least mean square (LMS) convergence which minimizes the Euclidean distance between the desired output and the network output. Some of neu­ ral networks which use supervised methods are the Perceptron neural network, the Adaline neural network, the Feedforward neural network with backpropagation algo­ rithm (BP), and the Learning Vector Quantization (LYQ) [32]. Supervised learning methods usually perform better than unsupervised learning methods, but supervised training is not necessarily faster, or more efficient. W hether it is appropriate depends on the problem. We have discussed different methods used for lumping the chemical species to reduce the complexity in the chemical mechanism. Most of the methods used in the literature consist of rigorous mathematical methods to obtain the lumps such that the behavior of the lumped chemical species should not depart significantly from the actual chemical species. The objective of using machine learning through artificial neural network is to reduce “drudgery” involved in the existing methods by utilizing a previously trained neural network. A previously classified set of chemical species may be used for supervised training of neural network. 27 2 .5 .2 U n su p e r v ise d L earn in g Everyday life is filled with unexpected aspects of situations where exact training sets do not exist. Unsupervised learning may be used for problems in which we lack comprehensive prior knowledge. Unsupervised learning, in contrast to supervised learning, does not provide the network with target output values. For unsupervised learning, the training set consists of input training patterns only. As usual, inputs are applied to the input layer, and the outputs from the output layer nodes are considered. There are no known corresponding correct outputs, in contrast to the supervised learning. A raw datum with no prior knowledge about the desired output for a given input is analyzed and the network is trained without target values. Weights and biases are modified in response to network inputs only. 
2.5.2 Unsupervised Learning

Everyday life is filled with situations for which exact training sets do not exist, and unsupervised learning may be used for problems in which we lack comprehensive prior knowledge. Unsupervised learning, in contrast to supervised learning, does not provide the network with target output values; the training set consists of input training patterns only. As usual, inputs are applied to the input layer and the outputs from the output layer nodes are considered, but there are no known corresponding correct outputs. Raw data with no prior knowledge about the desired output for a given input are analyzed, and the network is trained without target values. Weights and biases are modified in response to network inputs only. The network learns to adapt based on the experience collected through the previous training patterns. The only possible way to classify is by enhancing differences as well as similarities among the training patterns and by arranging the data in clusters, so that vectors similar to each other are grouped together. After the network has been tuned to the statistical regularities of the input data, it develops internal representations for encoding the features of the input and creates new classes automatically.

Unsupervised learning usually performs a mapping from input to output space, data compression, or clustering. Some of the popular unsupervised neural networks are the Grossberg classifier, the Kohonen self-organizing feature map, competitive neural networks, and fuzzy associative memory [32]. An unsupervised neural network may be used with two layers, an input layer and a competitive layer, as in the case of the competitive neural network of Figure 2.2. The input layer receives the available input, and the competitive layer consists of neurons that compete with each other for the opportunity to respond to features contained in the input data. The network operates according to the "winner takes all" strategy: the neuron with the greatest total output wins the competition and all other neurons are switched off. The methodology of the competitive neural network is discussed in Section 6.3.

A potential advantage of the unsupervised learning method is that it does not require any prior classification process for training the neural network. In our approach to solving the lumping problem by neural network, both supervised and unsupervised methods have been attempted.
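A minimal sketch of the winner-takes-all idea for a competitive layer follows (our own illustration with invented data; the actual competitive network used in this work is described in Section 6.3):

```python
import numpy as np

def winner(weights, x):
    """Competitive layer: each row of `weights` is one neuron's
    prototype vector; only the neuron nearest the input responds."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

def train_step(weights, x, lr=0.1):
    """Winner-takes-all learning: move only the winning prototype
    toward the presented input; all other neurons stay switched off."""
    i = winner(weights, x)
    weights[i] += lr * (x - weights[i])

# Three competing neurons clustering hypothetical 4-element patterns.
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 4))
for _ in range(200):
    train_step(W, rng.normal(size=4))
print(W)  # prototypes drift toward dense regions of the inputs
```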
The feasibility of applying neural networks to solve the lumping problem is not straightforward because:

1. The conventional notation used to represent chemical species is not in a form which can be given directly as an input to a neural network.

2. A unique or canonical representation of each chemical species is required in order to avoid ambiguity and misinterpretation.

3. An atmospheric chemical species database needs to be available which is adequate for a neural network to be trained.

Machine learning techniques using artificial neural networks have the potential to automate the process of classifying atmospheric chemical species into appropriate lumped categories.

Chapter 3

Methodology - Generation and Pruning of Chemical Species Database

3.1 Introduction

The issue of applying neural networks to solve the problem of lumping chemical species is not straightforward. One of the major problems mentioned previously is the availability of a database of atmospheric chemical species which is adequate for a neural network to be trained. Even if such a database is generated, not all the chemical species will be important in the atmosphere. It was realized that a systematic pruning of the chemical species database would be required. The stages of this work are depicted in Figure 3.1 and are discussed in this chapter.

3.2 Generation of Chemical Species Database

[Figure 3.1: Methodology of the research. The flowchart proceeds through: generating the chemical species database using the SAMS software; converting the generated chemical species to SMILES notation; pruning of chemical species which are obviously unimportant; lumping of chemical species into different groups; converting the chemical species to a matrix notation with species uniqueness; converting to vector notation and reducing the dimensionality; application of an artificial neural network for pattern recognition (supervised and unsupervised learning methods); and classifying into the different proposed lumped categories.]

Once it has been decided to use a neural network to solve the problem, gathering the data for neural network training purposes is the first task. The training data set includes a number of categories (also called patterns) and a number of cases (also called samples) in each category. This requirement frequently presents difficulties. For most practical problems the number of cases required will be in the hundreds or thousands, and even more cases may be required if the problem is more complex. If the training dataset is smaller, the information given may not be adequate to train the neural network.

The purpose of using a neural network is to generalize (i.e., when inputs which are not in the training set are given to the network, the outputs of the network should closely approach the target values). Generalization requires prior knowledge. This can be achieved by knowing the relevant inputs (usually in large numbers) and an input-to-output relationship that contains adequate information for the network to be trained. The effective performance of the neural network lies in the accuracy of classification. For a neural network to have acceptable performance, the set of chemical species available in the database must be adequate for the network to be trained to give correct classification.
There are some existing databases available in the literature, such as the Master Chemical Mechanism (MCM) developed by the University of Leeds [38], a database of atmospheric chemical species developed by the Syracuse Research Corporation [39], and the Regional Atmospheric Chemistry Mechanism developed by Stockwell [23]. The limited numbers of chemical species in these databases are insufficient to train a neural network; for example, the MCM database consists of 124 chemical species divided into 14 lumped categories. This is why we were motivated to develop our own database of Volatile Organic Compounds (VOCs).

Generating the chemical species database was done with the help of existing software developed by the Spectrum Research group. The software is a Computer-Assisted Structure Elucidation tool called SAMS (Structure Assembly Made Simple) [40]. SAMS is a powerful tool used for both structure elucidation and the generation of New Chemical Entities (NCEs). SAMS was designed for optimized structure generation based on a known empirical formula and bond constraints derived from small molecule fragments. The software takes an empirical formula as input and generates all possible complete unique structures for that formula as output. A database of chemical species, excluding the cyclic chemical species, was generated for the empirical formulas shown in Table 3.1. Cyclic species are not considered in this thesis because few cyclic species are present in the atmosphere.

Table 3.1: List of the empirical formulas used to generate the chemical species database

n = 1 to 8: CnH2n+2, CnH2n, CnH2n-2
n = 1 to 7: CnH2n+2O, CnH2nO, CnH2n+2O2
n = 1 to 6: CnH2n-2O, CnH2n-2O2, CnH2n-4O, CnH2n-4O2
n = 1 to 5: CnH2n-6O, CnH2n-6O2, CnH2n-8O, CnH2n-8O2, CnH2n-10O, CnH2n-10O2

The SAMS software produced thousands of chemical species isomers for a given empirical formula. This is especially true for empirical formulas with more carbon atoms (usually for a carbon count greater than 5). Identifying the possible cyclic species in the atmosphere and considering them in the chemical species database can be considered an area of future work. Heavier chemical species are also not considered in this thesis: as the number of carbon atoms in the carbon chain increases and the molecular weight increases, the vapor pressure decreases, and organic compounds with low vapor pressures (i.e., less than 10 Pa at 20 °C) [41] are not considered to be volatile organic compounds in the atmosphere.

Only the Volatile Organic Compounds (VOCs) are considered in this thesis because VOCs are very important trace atmospheric constituents. In the atmosphere they play a critical role in tropospheric chemistry and can have strong direct adverse effects on the environment depending on their concentrations. VOCs affect the oxidation capacity of the troposphere and contribute to photochemical ozone formation. The formation of many important secondary pollutants in the atmosphere, such as ozone, peroxides, aldehydes, and secondary organic particulate matter, depends critically on the availability of VOCs. Tropospheric ozone is mainly formed when pollutants emitted by cars, power plants, chemical plants, and other sources react chemically in the presence of sunlight. Motor vehicle exhaust and industrial emissions, gasoline vapors, and chemical solvents are the major sources of NOx and VOCs, the two primary reactants in tropospheric ozone formation.
Ozone at ground level is considered "bad ozone" because it is a harmful pollutant and has proved to be toxic to living things. Hence, VOCs are of central importance for tropospheric chemistry. VOCs include a wide range of carbon-based molecules which participate in atmospheric photochemical reactions, including aldehydes, ketones, alcohols, and hydrocarbons with single, double, and triple bonds.

Excluding the cyclic species, the SAMS software assisted in generating 4200 unique chemical species for the empirical formulas listed in Table 3.1. Not all the chemical species generated by this process are important, and some may not exist in the atmosphere. A systematic pruning was done by eliminating the chemical species which are obviously unimportant in the atmosphere due to low volatility, lack of sources, or structural instability. The EPI (Estimation Program Interface) Suite assisted in pruning some of the chemical species from the database.

3.3 EPI (Estimation Program Interface) Suite

The EPI Suite is a group of programs that provides physical properties, chemical properties, and the environmental fate of chemical species. It was developed by the Environmental Protection Agency's (EPA's) Office of Pollution Prevention and Toxics and the Syracuse Research Corporation (SRC). The vapor pressure program of the suite has been used to determine whether or not a generated chemical species is a VOC. The EPI Suite [42] provides users with both experimental and estimated values of physical and chemical properties which assist in predicting the environmental fate of a chemical species. The software requires only the chemical structure of the compound in SMILES (Simplified Molecular Input Line Entry System) notation as input. A detailed description of the SMILES notation of chemical species structure is given in the next chapter. The interface of the EPI Suite transfers a single SMILES notation to ten separate structure estimation programs that are part of the suite:

1) Atmospheric oxidation rates
2) Biodegradation probability
3) Henry's law constant
4) Octanol-water partition coefficient
5) Soil absorption coefficient
6) Bioconcentration factor
7) Aquatic toxicity
8) Water solubility
9) Aqueous hydrolysis rates
10) Melting point, boiling point, and vapor pressure

3.4 Pruning of Chemical Species Database

After generation, the chemical species are first sorted according to different functional groups. The functional groups considered are:

1. Alkanes
2. Alkenes (with one, two, and three double bonds in the molecule)
3. Alkynes (with one, two, and three triple bonds in the molecule)
4. Combinations of double and triple bonds (with the combinations of 1 double and 1 triple, 1 double and 2 triple, 1 double and 3 triple, 2 double and 1 triple, 2 double and 2 triple, and 3 double and 1 triple bonds in the molecule)
5. Alcohols
6. Aldehydes
7. Ketones
8. Ethers
9. Esters
10. Carboxylic acids
11. Unstable chemical species (chemical species with the patterns -C(O)O, -C(O)(O)C-, and -COCOC-)
12. Vicinal diols (chemical species with the pattern -C(O)C(O)C-)

A systematic pruning of the chemical species database was done by excluding chemical species which are obviously unimportant in the atmosphere, using the following two approaches:

1. Functional Group Approach
2. Vapor Pressure
3.4.1 Functional Group Approach

Some of the chemical species which have been excluded from the chemical species database on the basis of functional group are as follows:

1. Ethers and esters: Ethers and esters have been excluded from the database because chemical species with these functional groups are considered unimportant in the atmosphere. The atmospheric fate of chemical species with these functional groups is not described in comprehensive books such as Chemistry of the Upper and Lower Atmosphere by Finlayson-Pitts and Pitts [43] and Atmospheric Chemistry and Global Change by Brasseur et al. [44].

2. Carboxylic acids: Carboxylic acids react very slowly in the atmosphere (for example, the lifetime of HCOOH at [OH] = 1 x 10^6 radicals cm^-3 is approximately 26 days). Due to the high solubility and stickiness of these acid molecules, they are likely to be removed by wet and dry deposition rather than by OH radical reactions [43].

3. Vicinal diols: Vicinal diols have been excluded from the database because only a limited number of studies have been done on the oxidation of diols by molecular oxygen as an oxidant. They are unstable in the atmosphere and undergo a rearrangement, cleaving the C-C bond and forming either an aldehyde group or a ketone group. As these two groups have already been included, the molecules formed by cleavage of the diols will be present in those groups.

4. Unstable molecules: The functional groups denoted by SMILES patterns such as -C(O)O, -C(O)(O)C-, or -COCOC- are unstable in the atmosphere and break down relatively quickly, forming either aldehydes or ketones, which are already included in the chemical species database.

3.4.2 Vapor Pressure

Not all the chemical species generated by the SAMS software are necessarily VOCs. VOCs are organic chemicals that vaporize easily at ambient temperatures. The remaining chemical species in the database were therefore further pruned by eliminating those found to be nonvolatile organic compounds. This can be done by considering the vapor pressures of the organic species: if a chemical species does not have a measurable vapor pressure, then it is a nonvolatile chemical species. The recently published EU VOC directive [41] defines a VOC as an organic compound which has a vapor pressure above 10 Pa at 20 °C, or which has a corresponding volatility under its particular conditions of use. In the absence of measured data, the vapor pressure of each chemical species in the database was estimated with the EPI Suite using three methods: 1. the Antoine method, 2. the modified Grain method, and 3. the Mackay method. All three methods use the normal boiling point to estimate the vapor pressure.

The Antoine method [45] was developed for liquids and gases. The general equation is:

$$\ln(VP) = \frac{\Delta H_{vb}\,(T_b - C)^2}{\Delta Z_b\,R\,T_b^2}\left[\frac{1}{T_b - C} - \frac{1}{T - C}\right] \qquad (3.1)$$

where
ΔHvb is the heat of vaporization at the boiling point (cal/mol),
Tb is the temperature of the normal boiling point in Kelvin,
C is a constant estimated as C = -18 + 0.19 Tb (in Kelvin),
T is the temperature in Kelvin,
ΔZb (the compressibility factor) is assumed to have the value 0.97, and
R is the gas constant, 1.987 cal/(mol K).

Vapor pressure is defined with respect to the reference state of standard pressure, 1 atm.
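As a concrete illustration, the Antoine estimate of equation 3.1 can be computed directly from a compound's normal boiling point. The following is a minimal sketch, assuming a Kistiakowsky-type relation ΔHvb = Tb KF (8.75 + R ln Tb) to supply the heat of vaporization (the same quantity appears in equation 3.2 below), with the compound-class constant KF taken as 1.0; the function name and example boiling point are our own:

```python
import math

R = 1.987          # gas constant [cal/(mol K)]
DELTA_ZB = 0.97    # compressibility factor at the boiling point

def antoine_vp(tb_kelvin, t_kelvin=293.15, kf=1.0):
    """Antoine-type vapor pressure estimate (atm) from the normal boiling point.

    The heat of vaporization is approximated with the Kistiakowsky
    relation; kf is the compound-class constant (assumed 1.0 here).
    """
    c = -18.0 + 0.19 * tb_kelvin                              # C = -18 + 0.19 Tb
    dhvb = tb_kelvin * kf * (8.75 + R * math.log(tb_kelvin))  # cal/mol
    ln_vp = (dhvb * (tb_kelvin - c) ** 2 / (DELTA_ZB * R * tb_kelvin ** 2)
             * (1.0 / (tb_kelvin - c) - 1.0 / (t_kelvin - c)))
    return math.exp(ln_vp)                                    # atm, reference state 1 atm

# n-butane boils near 272.7 K; the estimate at 20 C comes out near 2 atm,
# far above the 10 Pa VOC threshold (1 atm = 101325 Pa).
print(antoine_vp(272.7) * 101325.0, "Pa")
```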
The modified Grain method [46] is a modification of the Watson method and is applicable to solids, liquids, and gases. Equation 3.2 applies only to compounds which are liquid or gaseous at the temperature of interest, while equation 3.3 can be used for solid and liquid compounds.

$$\ln(VP) = \frac{K_F\,(8.75 + R\ln T_b)\,(T_b - C)^2}{0.97\,R\,T_b}\left[\frac{1}{T_b - C} - \frac{1}{T - C}\right] \qquad (3.2)$$

$$\ln(VP) = \frac{K_F\,(8.75 + R\ln T_b)}{0.97\,R}\left[1 - \frac{(3 - 2T^*)^m}{T^*} - 2m\,(3 - 2T^*)^{m-1}\ln T^*\right] \qquad (3.3)$$

where
VP = vapor pressure [atm],
KF = compound-class-specific constant,
R = gas constant [cal/(mol K)],
Tb = boiling point [K],
T = environmental temperature [K],
C = -18 + 0.19 Tb, and
T* = T/Tb.

The value of KF ranges between 0.97 and 1.23. The constant m depends on T* and on the physical state of the compound at the temperature of interest:

Liquids: m = 0.19
Solids: if T* > 0.6 then m = 0.36; if 0.5 < T* < 0.6 then m = 0.8; if T* < 0.5 then m = 1.19

Mackay [47] fitted the following empirical equation to estimate the vapor pressure:

$$\ln(VP) = -\left(4.4 + \ln T_b\right)\left[1.803\left(\frac{T_b}{T} - 1\right) - 0.803\ln\frac{T_b}{T}\right] - 6.8\left(\frac{T_m}{T} - 1\right) \qquad (3.4)$$

The equation includes the boiling point (Tb), the melting point (Tm), and the temperature (T), all in Kelvin. The melting-point term is ignored for liquids.

EPI reports the vapor pressure estimates from all three methods and also reports a "suggested" vapor pressure. The modified Grain method provides the suggested vapor pressure for solids, while for liquids and gases the suggested vapor pressure is the average of the Antoine and modified Grain estimates. The Mackay method is not used for the suggested vapor pressure because its application is limited to chemical species similar to those from which it was derived.

After pruning the list of chemical species by functional group, a further pruning was done by taking the vapor pressure into consideration. Following the recently published EU VOC directive's statement that a VOC is an organic compound which has a vapor pressure above 10 Pa at 20 °C, chemical species with a vapor pressure of less than 10 Pa were eliminated from the database. With the help of these two pruning approaches, a data set representative of the chemical species present in the atmosphere was obtained.
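The 10 Pa cutoff then reduces to a simple filter over the estimated vapor pressures. The following is a minimal sketch, assuming each species record carries Antoine and modified Grain estimates (in Pa) and a physical-state flag; the record layout, function names, and example numbers are our own illustration, not the EPI Suite's:

```python
def suggested_vp(antoine_pa, grain_pa, is_solid):
    """EPI-style suggested value: the modified Grain estimate for solids,
    the Antoine/modified Grain average for liquids and gases."""
    return grain_pa if is_solid else 0.5 * (antoine_pa + grain_pa)

def is_voc(species, threshold_pa=10.0):
    """EU VOC-directive criterion: vapor pressure above 10 Pa at 20 C."""
    return suggested_vp(species["antoine_pa"], species["grain_pa"],
                        species["is_solid"]) > threshold_pa

# Illustrative records (the vapor pressures are rough, made-up magnitudes).
database = [
    {"smiles": "CCCC", "antoine_pa": 2.1e5, "grain_pa": 2.0e5, "is_solid": False},
    {"smiles": "CCCCCCCCCCCCCC", "antoine_pa": 1.2, "grain_pa": 1.6, "is_solid": False},
]
pruned = [s for s in database if is_voc(s)]   # keeps only the volatile species
print([s["smiles"] for s in pruned])          # ['CCCC']
```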
Chapter 4

Methodology - Representation of Chemical Species

4.1 Introduction

After generating the chemical species database, the next task was to determine how to present the information about the chemical species to the neural network. Traditionally a chemical species is represented either by an empirical formula or by structural notation; neither is suitable as input to a neural network. There needs to be some intermediate notation which is easily accessible and provides appropriate information about the chemical species to the computer software. We have employed SMILES (Simplified Molecular Input Line Entry System) notation to generate a matrix notation which conveys the structural information to the neural network. SMILES notation, while not itself suitable for giving the chemical species information to the neural network, served as an intermediate representation into which the structural representation of the chemical species could be converted; the SMILES notation is later converted to a matrix notation which is accessible to the neural network. This chapter discusses the methodology of the SMILES notation and the techniques used in the literature to represent chemical species in matrix notation. It also presents our approach to the matrix representation of chemical species and its uniqueness, which avoids misinterpretation and ambiguity.

4.2 SMILES (Simplified Molecular Input Line Entry System) Notation

SMILES is an effective method which is widely used by chemists to encode chemical species data for computer use. SMILES is a line notation for chemical structures which represents the two-dimensional, valence-oriented picture that chemists often use to depict a molecule. SMILES notation is written as a single sequence of characters in the form of a string without any space characters [48]; a space character denotes the end of the string. Hydrogen atoms are suppressed in this notation. Among the several approaches to computerized chemical notation, this line notation is popular because it represents molecular structure by a linear string of symbols. Rules for generating SMILES for any chemical structure are illustrated in Tables 4.1 and 4.2 and Figures 4.1, 4.2, and 4.3 in this section.

4.2.1 Atoms

Atoms are represented by their atomic symbols. In general, the first or only letter of the symbol is written in upper case; the second letter (if present) must be lower case. Atoms in aromatic rings are specified by lower-case letters. Some examples of SMILES notation for single-atom molecules are shown in Table 4.1.

Table 4.1: Some examples of SMILES notation to represent molecules

Molecular Name | SMILES Notation | Molecular Formula
Methane | C | CH4
Ammonia | N | NH3
Water | O | H2O

4.2.2 Bonds

Single, double, and triple bonds are represented by the symbols - (hyphen), = (equals sign), and # (hash symbol) respectively. For 'cis' chemical species, a forward and a backward slash are introduced immediately before and after the two carbon atoms linked by the double bond. For 'trans' chemical species, two backward slashes are used. Single bonds between atoms are not explicitly shown. A list of chemical species with SMILES notation is shown in Table 4.2.

Table 4.2: A list of chemical species with SMILES notation

Molecular Name | SMILES Notation | Molecular Formula
Ethene | C=C | C2H4
Ethyne | C#C | C2H2
3-Heptene (cis) | CC/C=C\CCC | C7H14
3-Heptene (trans) | CC\C=C\CCC | C7H14

4.2.3 Branches

Branches are specified by enclosing the atoms within parentheses. Branches can be nested to any depth or stacked, as illustrated in Figure 4.1.

[Figure 4.1: SMILES notation for molecules with branched structure; 2,2,4-trimethylpentane is written as CC(C)(C)CC(C)C.]

4.2.4 Cyclic Structures

Cyclic structures are represented by breaking one single or aromatic bond in each ring and labeling the two atoms which participated in the broken bond with the same integer. The bonds are numbered in any order, designating ring opening (or ring closure). For example, the SMILES notation for 1-methyl-3-bromo-cyclohexene is shown in Figure 4.2. The notation CC1=CC(Br)CCC1 is the canonical notation according to the IUPAC convention.

[Figure 4.2: SMILES notation for cyclic structured molecules (after figure in [48]): (a) CC1=CC(Br)CCC1, (b) CC1=CC(CCC1)Br.]

4.2.5 Aromaticity

Aromatic structures may be distinguished from cyclic species by writing the atoms in the aromatic ring in lower-case letters. For example, the SMILES notation of benzoic acid, c1ccccc1C(=O)O, is shown in Figure 4.3.

[Figure 4.3: SMILES notation for aromatic chemical species (after figure in [48]).]
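Because the species in this work are acyclic and hydrogen-suppressed, the geometric markings just described can be read off a SMILES string with a short scan. The following minimal sketch is our own illustration (not the converter used in this work) of the cis/trans convention from Table 4.2, where a '/.../\' pair around the double bond marks cis and a '\...\' pair marks trans:

```python
def classify_double_bonds(smiles):
    """Classify each C=C in a simple acyclic SMILES as cis or trans using
    the convention of Table 4.2: '/C=C\\' is cis, '\\C=C\\' is trans.
    A minimal illustration; full SMILES geometry rules are more general."""
    results = []
    for i, ch in enumerate(smiles):
        if ch == "=":
            before = next((c for c in reversed(smiles[:i]) if c in "/\\"), None)
            after = next((c for c in smiles[i:] if c in "/\\"), None)
            if before and after:
                results.append("cis" if before != after else "trans")
            else:
                results.append("unspecified")
    return results

print(classify_double_bonds("CC/C=C\\CCC"))   # ['cis']   (cis-3-heptene)
print(classify_double_bonds("CC\\C=C\\CCC"))  # ['trans'] (trans-3-heptene)
print(classify_double_bonds("C=C"))           # ['unspecified']
```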
4.3 Matrix Notation

4.3.1 Techniques Used in the Literature

The bond and electron (BE) matrix provides a description of the connectivity of atoms within a molecule [1,4,7]. Atom connectivity is given by nonzero entries, equal to the bond order, in the off-diagonal locations of the matrix. Radicals include an entry on the diagonal of the matrix indicating the unpaired electron. The BE matrices of ethane, the ethyl radical, and ethene are shown in Figure 4.4.

[Figure 4.4: BE matrix representations that specify atomic connectivity and electronic environment for (left) ethane, (center) the ethyl radical, and (right) ethene (after Figure 5 in [1]).]

The sum of the entries in a given row is equal to the number of valence electrons for that atom. The diagonal elements in the ethane BE matrix are all zero, as all of the valence electrons of its carbon and hydrogen atoms are involved in bonding. The BE matrix for the ethyl radical contains a nonzero entry on the diagonal to denote the unpaired electron defining the radical center. Non-unity entries for multiple bonds are illustrated in the BE matrix of ethene.

Representation of Chemical Species and Reactions

The BE matrix is well suited to the description of chemical reactions. Broadbelt and co-workers [1] developed an automated system which describes the chemical reaction mechanism through BE matrix notation. The number of atoms in a molecule actually affected by a chemical reaction is small, and the BE sub-matrix comprises only those atoms which are actually affected. These sub-matrices are relatively small and dense, which makes the matrix operation R + B = E simple and effective: the reaction matrix, R, is added to the reactant sub-matrix, B, to yield the product sub-matrix, E, which gives the altered connectivity of the atoms involved in the reaction (a numerical sketch of this operation follows the list of reaction forms below). The product sub-matrix E can then be incorporated back into the overall BE matrix and adjacency structure to represent the entire product molecules [1,7]. Figure 4.5 shows the reaction matrices for different types of reactions.

[Figure 4.5: Reaction matrix representations for (a) H-abstraction, (b) β-scission, (c) recombination, (d) bond fission, and (e) radical addition (after Figure 6 in [1]).]

The general forms of the chemical reactions shown in Figure 4.5 are as follows:

(a) H-abstraction: R-H + ·OH → R· + H2O
(b) β-scission: ·Cα-Cβ-R → Cα=Cβ + R·
(c) Recombination: R1· + R2· + M → R1-R2 + M
(d) Bond fission: C1-C2 → C1· + ·C2
(e) Radical addition: R· + C1=C2 → R-C1-C2·
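The R + B = E bookkeeping is easy to reproduce numerically. A minimal sketch of the scheme in [1,7], using numpy: the 2x2 sub-matrix for ethane's two carbons has bond order 1 off-diagonal; adding the bond-fission reaction matrix removes that bond and places one unpaired electron on each carbon's diagonal, i.e., the two methyl-radical centers.

```python
import numpy as np

# Reactant sub-matrix B for the two carbons of ethane:
# off-diagonal 1 = the C-C single bond, zero diagonal = no unpaired electrons.
B = np.array([[0, 1],
              [1, 0]])

# Bond-fission reaction matrix R: delete the C-C bond (-1 off-diagonal)
# and give each carbon an unpaired electron (+1 on the diagonal).
R = np.array([[ 1, -1],
              [-1,  1]])

E = R + B   # product sub-matrix: two disconnected radical centers
print(E)    # [[1 0]
            #  [0 1]]
```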
The mechanism generator contains three species lists: unreacted components (molecules and radicals), reacted molecules, and reacted intermediates (radicals). These lists are visited through an iterative algorithm. Reaction generation begins by placing the reactants in the unreacted components array, a list of species which are as yet untested for reaction. The first species is then extracted from the unreacted components array. Sequential tests are done on the molecule to indicate which types of reactions are plausible, and then the reaction operations are applied. These tests are governed by user-defined rules which define the atoms in a molecule that are involved in the chemical reaction. For example [1]:

• H-abstraction: abstraction of any hydrogen atom in a molecule is allowed.

• β-scission: β-scission requires a C atom that is in the β position to the radical center and singly bonded to the α atom.

• Radical recombination: a radical recombination can occur if there exist any two radical centers.

• Bond fission: if there exists a carbon-carbon single bond, a bond fission reaction can occur.

• Radical addition: a radical addition reaction occurs if there exists a radical center together with the atoms of a double or triple bond.

Reaction operations are applied to the small area of the matrix containing the atoms that are actually involved in the reaction. A matrix representation of the products is obtained with the altered connectivity of the atoms in the molecule. A systematic connectivity check has to be done on each product matrix to determine the number of chemical species formed and their correct matrix representation.

The methodology of reaction generation is best explained with the pyrolysis of pure ethane [7]. The process first checks whether the species is a molecule or a radical, because the reaction properties differ. As ethane is a molecule, it is subjected to tests such as bond fission, H-abstraction, and radical addition. Ethane fails the tests for H-abstraction and radical addition, as there are no co-reactant radicals yet in the species lists for the pyrolysis mechanism. Ultimately the first applicable operation is the bond fission reaction. Bond fission requires a carbon-carbon single bond, so the molecule is first tested for the presence of one. After the single carbon-carbon bond is located, the connectivity information is placed into a BE matrix to compute the reaction. The fission reaction is carried out by addition of the reactant (ethane) sub-matrix and the reaction (fission) matrix, as shown in Figure 4.6. A new BE sub-matrix describing the connectivity and electron configuration of the reactive sites of the product molecules is obtained.

[Figure 4.6: Bond fission reaction in matrix notation.]

The connectivity of the unreacted sites is unaltered on the product side. The product molecules are produced by reassembling the adjacency information of the reacted and unreacted atoms. The entire matrix is then subjected to a connectivity check to determine the number of products formed.

The current molecule or radical subjected to the reaction tests (ethane, in this example) is now completed as a reactive component. It is removed from the unreacted components list and placed in the appropriate molecule or radical list, so that subsequently generated species can participate with it in bimolecular reactions. All combinations of species for a given reaction type will ultimately be tested. Subsequent passes through the generation algorithm follow the same logic and treat all species in a systematic manner to ensure that all possible combinations are generated.
The algorithm continues until all the new species in the unreacted components list have been processed. Thus, the methyl radical is in turn tested for the pyrolysis radical reactions of recombination, radical addition, disproportionation, β-scission, and hydrogen abstraction.

The functionality of the model developed by Broadbelt and co-workers [1] is also illustrated with a bimolecular reaction example. A set of chemical reaction pathways for the methyl radical (i.e., 2 CH3· → C2H6, C2H6 + ·OH → C2H5· + H2O) is considered in Figure 4.7. For bimolecular reactions, the two reactants are combined into a single reactant matrix. Step I describes the merging of the two methyl radicals into a single reactant matrix. In Step II the forward reaction operation describes the recombination of the two methyl radicals into one ethane molecule, while the reverse operation describes the bond fission of one ethane molecule into two methyl radicals. Steps III, IV, and V describe the H-abstraction from ethane, with Step III showing the merging of the two BE reactant matrices (i.e., ethane and the OH radical) into a single reactant matrix.

This section illustrates the way a chemical species can be represented in matrix notation and how well this matrix notation is suited to the description of chemical reactions for computational purposes. Although this notation is not the exact matrix notation employed in this thesis, the work produced by Broadbelt and co-workers introduced the idea of representing chemical species in a matrix notation which may be useful for application to neural network processing.

[Figure 4.7: A set of reaction pathways and its matrix operations.]

4.4 The Approach

After obtaining the pruned chemical species list, the chemical species in SMILES notation are converted to the internal matrix notation. Hydrogens are suppressed in the matrix notation approach used in this thesis. The various transformations of the chemical species are depicted in Figure 4.8. Figure 4.8(a) shows the molecular structure of the chemical species, where the numbers in parentheses refer to the corresponding atoms in the matrix notation shown in Figure 4.8(c). The structural notation is converted to SMILES notation, as shown in Figure 4.8(b), for computer use, in order to convert the chemical species from the external notation to an internal connectivity matrix notation, as shown in Figure 4.8(c). The connectivity matrix is the hydrogen-suppressed matrix. Atom connectivity in the connectivity matrix is given by nonzero entries, equal to the bond order, in the off-diagonal locations of the matrix.
In order to avoid ambiguity and misinterpretation, and to preserve the atom connectivity information of each chemical species, a unique representation of each chemical species is constructed. Uniqueness is obtained by labeling each atom connection with a real-number weight in each of the non-zero matrix elements. The decimal part of the entry represents which type of atom is connected and what type of bond conformation exists. In Figure 4.8(c) the entries are assigned as follows:

1. If a C atom is singly bonded to another C atom, the entry is 1.0.

2. If a C atom is doubly bonded to another C atom, the entry is 2.0.

3. If a C atom is singly bonded to an O atom, the entry is 1.5. The 5 in the decimal place represents the oxygen atom.

4. Similarly, if a C atom is doubly bonded to an O atom, the entry is 2.5.

5. If there is a double bond between two C atoms associated with the 'cis' conformation, the entry is 2.25.

6. If there is a double bond between two C atoms associated with the 'trans' conformation, the entry is 2.75. This initial choice for representing 'trans' was later modified, as discussed in Section 7.1.2.

After identifying the matrix notation for the chemical species, the next task is to find the appropriate lumped categories and to sort the chemical species database into those categories. These are discussed in Chapter 5.

[Figure 4.8: Various transformations of the chemical species: (a) 4-methyl-2-pentanone in structural notation, (b) its SMILES notation CC(=O)CC(C)C, and (c) the corresponding 7x7 hydrogen-suppressed connectivity matrix, with entries such as 2.5 for the C=O bond and 1.0 for each C-C bond.]
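The encoding above is mechanical enough to express in a few lines. A minimal sketch, assuming a species is supplied as a list of heavy-atom bonds (atom symbols, bond order, and an optional cis/trans flag); the entry-assignment rules are the six just listed, while the input format and function name are our own illustration:

```python
import numpy as np

def connectivity_matrix(n_atoms, bonds):
    """Hydrogen-suppressed connectivity matrix with the thesis encoding:
    C-C 1.0, C=C 2.0, C-O 1.5, C=O 2.5, cis C=C 2.25, trans C=C 2.75.
    bonds: (i, j, atoms, order, geometry) with atoms like 'CC' or 'CO'."""
    M = np.zeros((n_atoms, n_atoms))
    for i, j, atoms, order, geometry in bonds:
        entry = float(order)
        if "O" in atoms:
            entry += 0.5                  # the .5 marks an oxygen neighbor
        elif geometry == "cis":
            entry = 2.25
        elif geometry == "trans":
            entry = 2.75
        M[i, j] = M[j, i] = entry         # symmetric, zero diagonal
    return M

# 4-methyl-2-pentanone, CC(=O)CC(C)C: atoms 0..6 as numbered in Figure 4.8
bonds = [(0, 1, "CC", 1, None), (1, 2, "CO", 2, None), (1, 3, "CC", 1, None),
         (3, 4, "CC", 1, None), (4, 5, "CC", 1, None), (4, 6, "CC", 1, None)]
print(connectivity_matrix(7, bonds))
```

Running this reproduces the matrix of Figure 4.8(c), with the 2.5 entry marking the carbonyl bond.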
Chapter 5

Methodology - Reactions and Lumping

5.1 Tropospheric Chemical Reactions

5.1.1 Reactions of Alkanes

Alkanes are hydrocarbons which have only single carbon-carbon bonds and the general formula CnH2n+2. The hydroxyl radical has a strong tendency to abstract a hydrogen atom from an alkane, forming an alkyl radical and a water molecule. This is the only important initial chemical reaction of an alkane in the atmosphere [43]:

RH + OH → R + H2O

The other two possible reactions are:

RH + NO3 → R + HNO3 (relatively slow)
RH + Cl → R + HCl

The only significant reaction of the alkyl radical in the atmosphere is with O2, with which it readily combines to form an alkyl peroxy radical:

R + O2 → RO2

The reactions of RO2 with NO are fast, and these are the predominant reactions oxidizing NO to NO2:

RO2 + NO → RO + NO2

The second class of reactions of RO2 with NO is the addition of RO2 to NO, forming an alkyl nitrate. This type of reaction can be significant for larger molecules (with 5 or more carbons):

RO2 + NO → RONO2

The other two possible reactions of RO2 are with HO2 or with another RO2:

RO2 + HO2 → ROOH + O2
RO2 + HO2 → carbonyl compound + H2O
RO2 + HO2 → ROH + O3
RO2 + RO2 → 2RO + O2 (contributes significantly)
RO2 + RO2 → ROH + RCHO + O2 (contributes significantly)
RO2 + RO2 → ROOR + O2 (small contribution)

The alkoxy radical produced can react by several pathways depending upon the structure of the molecule: with O2, by decomposition, by isomerization, with NO, or with NO2.

Reaction with O2: Molecular oxygen reacts with the alkoxy radical by abstracting a hydrogen, producing a carbonyl group and HO2. The hydrogen abstracted is the one bonded to the same carbon as the alkoxy oxygen:

RO + O2 → carbonyl compound + HO2

5.1.2 Reactions of Alkenes

The OH radical initiates the oxidation of alkenes by adding across the double bond; the resulting hydroxyalkyl radical then adds O2 to form a hydroxyalkyl peroxy radical. For propene, for example, the peroxy radical reacts with NO as follows:

CH3C(OO·)H-CH2OH + NO → CH3C(O·)H-CH2OH + NO2
CH3C(OO·)H-CH2OH + NO → CH3C(ONO2)H-CH2OH

The CH3C(O·)H-CH2OH formed in the first reaction can then undergo the following processes:

1. Reaction with O2:
·CR1R2OH + O2 → R1R2C=O + HO2

2. Decomposition (carbon count ≤ 4):
CH3C(O·)H-CH2OH → CH3CHO + ·CH2OH

3. Isomerization (carbon count ≥ 5), in which the alkoxy radical abstracts a hydrogen internally to give an isomeric radical.

Initiation reaction with O3: Typical peak ozone concentrations found throughout the world currently range from 30 to 40 ppb in the most remote places to as high as 500 ppb or more in highly polluted urban areas. Peak levels in rural-suburban areas are typically in the 80-150 ppb range [43]. Under these conditions the reaction with O3 is a significant reaction for alkenes.

Symmetrical alkenes: The initial step in this type of reaction is the addition of O3 across the double bond, forming a primary ozonide:

R1R2C=CR3R4 + O3 → primary ozonide

These primary ozonides are not stable in the atmosphere. They readily break down into an aldehyde or ketone and a diradical Criegee intermediate:

primary ozonide → R1R2C=O + R3R4COO·
primary ozonide → R3R4C=O + R1R2COO·

Asymmetrical alkenes: The following are the reactions of asymmetrical alkenes with O3, with their approximate branching ratios:

R1CH=CH2 + O3 → R1CHOO· + HCHO (0.5)
R1CH=CH2 + O3 → R1CHO + HCHOO· (0.5)
R1R2C=CH2 + O3 → R1R2COO· + HCHO (0.65)
R1R2C=CH2 + O3 → R1C(O)R2 + HCHOO· (0.35)
R1R2C=CHR3 + O3 → R1R2COO· + R3CHO (0.65)
R1R2C=CHR3 + O3 → R1C(O)R2 + R3CHOO· (0.35)

Terminal alkenes and internal alkenes: Alkenes can be classified as terminal or internal. Alkenes with a double bond involving a terminal C (such as propene) are called terminal alkenes. Internal alkenes, such as trans-2-butene, have double bonds within the molecule which do not involve C atoms in the terminal positions. The reactivities and end products of internal and terminal alkenes with respect to the OH radical and O3 are significantly different: terminal alkenes react in the atmosphere to form aldehydes, while internal alkenes react in the atmosphere to form ketones.

5.1.3 Alkyne Reactions

Alkynes are hydrocarbons which have one or more carbon-carbon triple bonds. The only significant initial reaction of an alkyne is with the OH radical, which adds to the triple bond. These reactions give dicarbonyls as major products: acetylene gives glyoxal [(CHO)2], propyne gives methylglyoxal [CH3COCHO], and 2-butyne gives biacetyl [(CH3CO)2]. The following is the reaction mechanism for ethyne [43]:

HC≡CH + OH → HC(OH)=CH·
HC(OH)=CH· + O2 → HC(OH)=CHOO·
HC(OH)=CHOO· + NO → HC(OH)=C(O·)H + NO2
HC(OH)=C(O·)H → HC·(OH)-CH(=O)
HC·(OH)-CH(=O) + O2 → (CHO)2 + HO2

Terminal alkynes and internal alkynes: As with the alkenes, alkynes are also classified as terminal or internal. Alkynes with a triple bond involving a carbon atom at the end of the molecule, such as 1-butyne, are called terminal alkynes.
Terminal alkynes react with OH radical resulting in the formation of formaldehyde and other aldehyde radicals as the end product whereas in the case of internal alkynes the final products are ketones and some other aldehydes. 5 .1 .4 R e a c tio n o f O x y g en -co n ta in in g O rganic S p ecies The reactions of oxygen-containing organics with the OH radical are fast. In most cases, NO3 reactions are slower than reactions with the OH radical, the latter being the primary oxidant for the oxygen-containing compounds [43]. Aldehydes: Aldehydes are chemical species which have a carbonyl functional group ( 0 = 0 ) in the terminal position. In an aldehyde, at least one hydrogen atom is bonded to the carbon of the carbonyl group, so the aldehyde functional group is -OHOReactions with aldehydes occur with abstraction of the relatively weakly bonded aldehydic hydrogen R C H O + OH (N O 3 , Cl) R C O + H 2 O (H N O 3 , HCl) The RCO radical which was obtained is then added to Og. An illustration of the reaction mechanism for aldehydes is as follows: RCO + O2 RC(=0)00- RC(=0)00- + NO RC(=0)0 + NOg R C (=0)00 + NOg ^ RC(=0)-00N0g RC(=0)0 ^ R + COg K etones: Chemical species with a carbonyl group functional group(C =0) in the nonterminal position are referred to as ketones. The reactions of ketones are similar to those of alkanes with abstraction of hydrogen by OH, N O 3, and Cl. A lcohols: Alcohols are the chemical species which have a hydroxyl (-0H) functional group. The possible hydrogen abstraction sites for the reaction with alcohols are: 58 1. The alcohol 0-H 2. Primary hydrogen 3. Secondary hydrogen 4. Tertiary hydrogen The reaction with OH tends to abstract the hydrogen th at is the most weakly bonded. The hierarchy of the C-H and 0-H bond strengths are: Tertiary < Secondary < Primary < 0-H . 5.2 Lum ping A pproach E m ployed for C lassifica­ tion In our approach, we have employed features of both the structural and the molecular lumping approaches. The most important oxidation reaction for alkanes is reaction with an OH radical. The OH radical abstracts one hydrogen atom to form an alkyl radical and a water molecule. Under atmospheric conditions oxygen reacts rapidly with the alkyl radical to form an alkyl peroxy radical. The peroxy radical reacts with NO to make NOg, organic nitrates, and unstable alkoxy radicals. The latter may decompose or isomerize. Hydroxyl radical reaction rate coefficients and ozone reaction rate coefficients for VOCs can be estimated from structure-reactivity relationships. Kwok [49] developed an algorithm to estimate the hydroxyl radical reaction rate coefficients for gas-phase organic compounds. The EPI suite provides the experimental and estimated reaction rate coefficients for all the VOCs with respect to OH and O3 using structure-reactivity relationships. For example, hydrogen atom abstraction rate coefficients from C-H and 0-H are based on the estimation of group rate coefficients for hydrogen atom abstraction from 59 -CHa, -CH2-, >CH-, and -OH groups. The rate coefficient for hydrogen atom ab­ straction from these groups depends on the identity of the substituents attached to them [49]. 
The hydroxyl rate coefficient for n-butane, for example, is built up as:

$$k(\mathrm{CH_3CH_2CH_2CH_3}) = k_{prim}F(\text{-CH}_2\text{-}) + k_{sec}F(\text{-CH}_3)F(\text{-CH}_2\text{-}) + k_{sec}F(\text{-CH}_2\text{-})F(\text{-CH}_3) + k_{prim}F(\text{-CH}_2\text{-})$$

where k(CH3-X) = k_prim F(X), k(X-CH2-Y) = k_sec F(X)F(Y), and k(X-CH(-Z)-Y) = k_tert F(X)F(Y)F(Z), and where k_prim, k_sec, and k_tert are derived using the expression k = C T^2 e^{-D/T}, in which C (cm^3 molecule^-1 s^-1) and D (K) are temperature-dependent parameters. The values of these parameters for the group rate coefficients for H-atom abstraction from -CH3, -CH2-, >CH-, and -OH groups are given in Table 1 of Kwok's paper [49].

Relative rate coefficients (i.e., O3:OH) may be determined from the OH reaction rate coefficient, based on an ambient concentration of 1.5 x 10^6 OH radicals cm^-3, and the O3 reaction rate coefficient, based on an ambient O3 concentration of 7 x 10^11 molecules cm^-3 [42]. Relative rate coefficients are calculated to decide whether a chemical species will initiate its oxidation preferentially with the OH radical or with O3. All the chemical species which react preferentially with O3 are lumped into a separate group, because the product species follow a different chemical pathway of decomposition. For example, O3 reacts across the double bond of alkenes to make a very short-lived intermediate that decomposes to carbonyl compounds and Criegee intermediate radicals.

The important chemical loss processes for aldehydes include photolysis and reaction with OH, the NO3 radical, and atomic oxygen. All these reactions are first expected to form acylperoxy radicals; the subsequent reactions lead to the formation of peroxyacyl nitrates, alkyl peroxy radicals, and, eventually, formaldehyde.

Reactions of the alcohols in the atmosphere are of particular interest because of their use as alternative fuels. The reaction of alcohols with the OH radical depends on the possible hydrogen abstraction sites. For the reaction of methanol with the OH radical, hydrogen abstraction from the alkyl group is predominant; for ethanol, secondary hydrogen abstraction is the predominant reaction.

Terminal chemical species are those which have a double or triple bond in the terminal position of the carbon chain; they react in the atmosphere to form aldehydes. Internal chemical species are those which have a double or triple bond in an internal position of the carbon chain; they react in the atmosphere to form ketones.

The lumping of chemical species is done based on the knowledge gained from the existing lumped mechanisms [9,20-24] and from reviews of tropospheric chemical mechanisms [43,44]. The lumped groups utilized in this thesis are as follows (a classification sketch follows this list):

1. All chemical species whose dominant reaction is with ozone are lumped into one group.

2. All alkanes except methane are grouped together.

3. All aldehydes are grouped together.

4. All ketones are grouped together.

5. All alcohols are grouped together.

6. Alkynes and alkenes which react predominantly with the OH radical and have a double or triple bond at a terminal position in the carbon chain are grouped together.

7. Alkynes and alkenes which react predominantly with the OH radical and have a double or triple bond at an internal position in the carbon chain are grouped together.
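The following is a minimal sketch of these seven rules as a decision procedure. The feature names (functional group, relative O3:OH rate coefficient, bond position) are our own labels for the quantities described above, and the unit threshold on the relative rate is an assumption for illustration:

```python
def lumped_category(species):
    """Assign one of the seven lumped groups from simple species features.
    species: dict with 'group' (e.g. 'alkane', 'aldehyde'), 'o3_to_oh'
    (relative rate coefficient), and 'position' ('terminal'/'internal')."""
    if species["o3_to_oh"] > 1.0:        # rule 1: ozone-dominated species
        return "ozone-dominant"
    g = species["group"]
    if g == "alkane":
        return "alkanes"                 # rule 2 (methane excluded upstream)
    if g in ("aldehyde", "ketone", "alcohol"):
        return g + "s"                   # rules 3-5
    if g in ("alkene", "alkyne"):        # rules 6-7: OH-dominated unsaturates
        return "terminal" if species["position"] == "terminal" else "internal"
    raise ValueError("species outside the seven lumped groups")

print(lumped_category({"group": "alkene", "o3_to_oh": 0.2, "position": "terminal"}))
# -> 'terminal'
```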
Chapter 6

Methodology - Artificial Neural Networks

6.1 Nature of Input to ANN

The reliability of prediction depends not only on the architecture of the artificial neural network but also on the input data. To obtain reliable results, the input data should be concise and should minimize redundant or irrelevant information. To achieve this, the matrix notation is collapsed into a final vector notation of the chemical species which is suitable as an input to the neural network, as depicted in Figure 6.1. To represent the chemical species as an input to the neural network in an effective way, the following steps were applied (a sketch of the conversion follows this list):

1. Since the matrix has diagonal symmetry, it contains redundant information. The information below the diagonal of the matrix is sufficient to represent the structure of the chemical species; the upper half is redundant and is not considered further.

2. All input vectors should have the same size, but the length of the SMILES notation varies from one chemical species to another, resulting in matrices of different sizes. We restricted the maximum size of the SMILES notation and matrix by limiting molecules to at most 8 carbon atoms. The remaining parts of the matrix are filled with zeros, which describe no connectivity, as shown in Figure 6.1(e).

3. There are large numbers of zeros in the matrix notation even after reducing the size of the matrix. These zeros give little information to the ANN for classification, and if the input vector is large, the performance of the neural network may decrease. Since we are not considering any radical species in this work, the entire diagonal of zero elements (the diagonally shaded region in Figure 6.1(f)) is removed.

4. Since we did not consider cyclic species involving the first and last atoms, the shaded entries of the matrix in Figure 6.1(f) are always zero.

5. Converting from this matrix notation to the vector notation is then relatively straightforward: all the rows, excluding the shaded entries, are arranged in a sequential manner (as shown in Figure 6.1(g)), which preserves the connectivity information.

6. The ratio of rate coefficients (O3:OH radical) is added as the last element of the chemical species input vector.

[Figure 6.1: Various transformations of the chemical species: (a) structural notation of 4-methyl-2-pentanone, (b) SMILES notation CC(=O)CC(C)C, (c)-(f) the hydrogen-suppressed connectivity matrix and its reduction, and (g) the final input vector with the relative rate coefficient (ozone/OH radical) appended.]
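A minimal sketch of steps 1-6, assuming the connectivity matrix from the Chapter 4 encoding; the padding size and function name are our own, and for brevity the sketch keeps the full strict lower triangle rather than also dropping the always-zero corner entries of step 4:

```python
import numpy as np

def species_vector(conn, rate_ratio, max_atoms=8):
    """Collapse a hydrogen-suppressed connectivity matrix into the ANN
    input vector: pad to max_atoms, keep the strict lower triangle
    (diagonal and upper half are redundant), then append the O3:OH
    relative rate coefficient as the final element."""
    n = conn.shape[0]
    padded = np.zeros((max_atoms, max_atoms))
    padded[:n, :n] = conn                                # step 2: zero-pad to fixed size
    rows = [padded[i, :i] for i in range(1, max_atoms)]  # steps 1, 3: below-diagonal rows
    return np.concatenate(rows + [[rate_ratio]])         # steps 5-6

conn = np.array([[0.0, 1.0], [1.0, 0.0]])  # ethane's two carbons
v = species_vector(conn, rate_ratio=0.0)
print(v.shape)  # (29,) = 28 lower-triangle entries + 1 rate ratio
```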
6.2 Supervised Learning - Multilayer Feedforward Neural Network

Feedforward neural networks are the most popular and most widely used ANN models in practical applications, and they have been applied to a wide variety of chemistry-related problems. This class of networks consists of multiple layers of computational units interconnected in a feedforward manner: each neuron in one layer has a directed connection to each of the neurons in the subsequent layer. The first layer is called the input layer; each of its units represents an external input and does not perform any calculation. The last layer is called the output layer, whose units represent the "answer" of the network as a whole. The layers between these two are the hidden layers, which correspond to neither external inputs nor outputs of the network. Each unit in a layer generates an output which is a simple function of its inputs, which may include external data or the outputs of previous layers. Each unit takes the output generated by the previous layer as input, performs its calculation, and provides its output as input to the next unit in a sequential manner.

The coefficients which multiply the inputs to a unit are known as "weights". The weight between unit i and unit j of a network is denoted w_ji, assuming that all elements of a layer are connected to all elements of the successive layer. In this way, the connections between two layers can be represented by a weight matrix W, in which the entry ji corresponds to the connection between node i and node j of the succeeding layer. These weight coefficients are used to make the network "learn" the training data. A processing element multiplies each input by its connection weight and usually sums these products. The summed output n, often referred to as the net input, is used as the input to the transfer function, which produces the neuron output a.

The transfer function can be linear or nonlinear and is chosen according to the specification of the problem that the neuron is attempting to solve. Different transfer functions are used depending on the type of problem and the threshold value. Some of the standard transfer functions are [32]:

• The hard limit transfer function sets the output of the neuron to 0 if the function argument is less than 0, or to 1 if its argument is greater than or equal to 0.

• The symmetrical hard limit transfer function sets the output of the neuron to -1 if the function argument is less than 0, or to 1 if its argument is greater than or equal to 0.

• The linear transfer function, whose output is proportional to its input.

• The saturating linear transfer function, with the input/output relation: a = 0 if n < 0; a = n if 0 <= n <= 1; a = 1 if n > 1.

• The symmetric saturating linear transfer function, which saturates the output in the range -1 to 1: a = -1 if n < -1; a = n if -1 <= n <= 1; a = 1 if n > 1.

• The log-sigmoid transfer function, which takes an input of any value between minus and plus infinity and maps the output into the range 0 to 1, with the input/output relation:

$$a = \frac{1}{1 + e^{-n}}$$

In our approach, the log-sigmoid and saturating linear transfer functions have been used in the hidden layer and the output layer, respectively, of the feedforward neural network. The feedforward neural network uses the backpropagation algorithm, which requires differentiation of the transfer function; the most popular choice of hidden-layer transfer function for the feedforward neural network is therefore the sigmoid, because it has a continuous derivative.

[Figure 6.2: The log-sigmoid transfer function, a = logsig(n), and the saturating linear transfer function, a = satlin(n).]
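The two transfer functions used in this work are one-liners in numpy; a minimal sketch with our own function names:

```python
import numpy as np

def logsig(n):
    """Log-sigmoid: maps any real input into (0, 1); used in the hidden layer."""
    return 1.0 / (1.0 + np.exp(-n))

def satlin(n):
    """Saturating linear: identity on [0, 1], clipped outside; output layer."""
    return np.clip(n, 0.0, 1.0)

n = np.array([-2.0, 0.0, 0.5, 2.0])
print(logsig(n))   # [0.119 0.5   0.622 0.881] (rounded)
print(satlin(n))   # [0.  0.  0.5 1. ]
```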
The backpropagation algorithm is simple to use, and reaches an acceptable error level reasonably quickly. The network is provided with the inputoutput pairs and then tries to find the weights which minimize the squared error of the approximation produced by the network. In this case the best solution is obtained by the least mean squared method. In a multilayer networks with nonlinear transfer functions, the relation between the network weights and the error is more complex. An iterative gradient descent has to be used in order to minimize the squared error for the training set. This is achieved by means of the backpropagation algorithm. As the name implies in “backpropagation” , the error of the network is propagated “backwards” from output nodes to hidden layer(s). In this algorithm, the input data is repeatedly presented to the network. W ith each iteration of presented d ata an output is generated. The output of the neural network is compared to the target output and an error is computed. The error computed here is the mean square error (F(x) = E(e^) = E[(t-a)^]). The vector of network weights and biases is ‘x’. The target output is denoted as ‘t ’ and ‘a ’ is the network output. This error is fed back to the neural network and used to adjust the weights such th a t the error is decreased in each iteration. The weights are randomly assigned to the network in the first iteration. As a result in each iteration of the training process, the network model 67 gets closer to producing the desired output. The training of a neural network can be done by finding those weights th a t minimize the network’s error on the given samples. 6.2.1 B a ck p ro p a g a tio n A lg o r ith m The backpropagation algorithm consists of two phases: a forward propagation and a backward propagation. In forward propagation, the input units are applied to the input neurons and all the outputs are calculated using the sigmoid threshold of the inner product of the corresponding weight and input vector. The outputs of the pre­ vious layer k are propagated to the next layer k+l. Finally, a set of outputs are produced as the actual response of the neural network. During the forward propaga­ tion the weights of the network are not changed. The network output is compared with the target output, and calculates the overall system error by squaring the differ­ ence between this pair of vectors. The accumulated error for all of the input-output pairs is defined as the Euclidean distance in the weight space. During the backward propagation phase this error signal is then propagated backward through the network. The weights are adjusted in the direction of decreasing error in accordance with an error-correction rule in order to attem pt to minimize this distance using the gradient descent approach. The objective of the gradient descent approach is to make the function decrease for every iteration. Gradient descent is a function approximation which uses the derivative of a function to determine the steepest descent of the slope. The function moves in the negative direction of the slope so th a t the value of the function is reduced in each iteration. A detailed description of gradient descent in backpropagation algorithm is illustrated in Appendix A. Backpropagation provides a way to compute the necessary gradients, so th a t the network finds a local minimum of the training error function with respect to network weights. 
For a multilayer neural network the error is not an explicit function of the weights in the hidden layers, so the derivative cannot be computed directly. Because the error is an indirect function of the hidden-layer weights, the chain rule of differentiation is applied to compute the gradient of the error function with respect to the weights [32]. If y_j(n) is the output of the jth unit at the nth iteration, then for each weight w_ji(n) connecting the output of neuron i to the input of neuron j at iteration n, the partial derivative of the error function can be written as:

$$\frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = \frac{\partial \mathcal{E}(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)}\,\frac{\partial v_j(n)}{\partial w_{ji}(n)}$$

where
E(n) refers to the instantaneous value of the error at iteration n,
e_j(n) refers to the error signal at the output of neuron j at iteration n,
y_j(n) is the functional signal appearing at the output of neuron j at iteration n,
v_j(n) refers to the induced local field (i.e., the weighted sum of all inputs plus the bias) produced at the input of the activation function of neuron j at iteration n, and
φ_j is the transfer function.

The update for the weights is

$$\Delta w_{ji} = -\eta\,\frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)}$$

where η is a parameter known as the learning rate. The detailed mathematical description of the backpropagation algorithm is given in Appendix A.
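The following is a minimal backpropagation sketch for a two-layer network. For simplicity it uses a log-sigmoid hidden layer with a linear output unit and the squared error F = 0.5(t - a)^2; the toy data, layer sizes, and learning rate are our own illustration, not the configuration used in this thesis:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 3))
T = np.sin(X.sum(axis=1))                  # an arbitrary smooth target

W1, b1 = rng.normal(scale=0.5, size=(5, 3)), np.zeros(5)
w2, b2 = rng.normal(scale=0.5, size=5), 0.0
eta = 0.05                                 # learning rate

for epoch in range(500):
    for x, t in zip(X, T):
        a1 = logsig(W1 @ x + b1)           # forward propagation
        a = w2 @ a1 + b2
        e = t - a                          # error signal at the output
        delta1 = -e * w2 * a1 * (1 - a1)   # chain rule through the hidden layer
        W1 -= eta * np.outer(delta1, x)    # gradient descent: w <- w - eta*dF/dw
        b1 -= eta * delta1
        w2 -= eta * (-e) * a1
        b2 -= eta * (-e)

print(np.mean((T - (logsig(X @ W1.T + b1) @ w2 + b2)) ** 2))  # small after training
```

The delta1 term is exactly the chain-rule product above: the error derivative, the output-layer weight, and the derivative a1(1 - a1) of the log-sigmoid.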
6.3.1 Similarity Measure Layer

The main objective of this layer is to perform the correlation between the input vector and the weight vector. The correlation is done by generating a distance signal that indicates the similarity between those two vectors and leads to the formation of a prototype weight vector. The ||ndist|| box in Figure 2.2 accepts the input vector p and the input weight matrix W, and produces a vector of M elements which describes the distance between the input vector and the weight vectors. In order to associate an output node with a cluster for a given input, each perceptron at the similarity measure layer calculates the distance between the input vector and the weight vector by one of two methods:

1. Measure of the similarity for normalized vectors

2. Measure of the similarity for unnormalized vectors

Measure of the similarity for normalized vectors: The distance between the input vector and the prototype weight vector can be calculated by a weighted sum, which can be interpreted as the dot product of the input vector $p$ and the weight vector $w$. Through this weighted sum each perceptron calculates a measure of how closely its weight vector resembles the input vector; the highest dot product indicates the highest similarity between the two vectors:

$$a = \sum_i w_i p_i = \mathbf{w} \cdot \mathbf{p}$$

Measure of the similarity for unnormalized vectors: The most obvious similarity measure for unnormalized vectors is the Euclidean norm, i.e., the magnitude of the difference vector:

$$d = \lVert \mathbf{p} - \mathbf{w} \rVert = \sqrt{(p_1 - w_1)^2 + (p_2 - w_2)^2 + \cdots + (p_N - w_N)^2}$$

This measure is relatively complex to calculate, hence the square of the magnitude of the difference vector is used:

$$d^2 = \sum_{i=1}^{N} (p_i - w_i)^2$$

where N is the size of the input vector.

6.3.2 Competitive Layer (or Maxnet)

The objective of the competitive layer is to declare as winner the node whose weight vector is closest to the input vector. This layer is a fully connected network with each node connecting to every other node, including itself. The basic idea is that the nodes compete against each other by sending out inhibiting signals to each other. In this layer the net input n is computed by adding the distance signal and the biases b. The competitive transfer function in the competitive layer accepts the net input vector and returns neuron outputs of zero for all neurons except the winner, which has an output of 1.

6.3.3 The Combination of these Two Layers

The combination of these networks forms the simple competitive neural network. In a simple competitive network, a maxnet connects the top nodes of the similarity measure layer. Whenever an input is presented, the similarity measure layer finds the distance of the weight vector of each node from the input vector. The calculated signal is then fed to the competitive transfer function in the maxnet. Using this competitive transfer function, all nodes converge to 0 except for the node with the maximum initial value, which is deemed the winner. In this way the maxnet identifies the node with the maximum value, i.e., the one with the closest similarity to the input vector. Once the weight vector of the winning node has been found, this weight is updated by the training algorithm and all other weights remain unchanged. Thus the winning neuron's weight vector moves closer towards the input vector.
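As a sketch of the two similarity measures and the winner declaration described in this section (the matrix sizes and data are illustrative, not taken from the thesis implementation):

    % Similarity between an input vector p and the rows of a weight matrix W.
    M = 7; R = 19;                 % output nodes and input elements (illustrative)
    W = rand(M, R);                % prototype weight vectors, one per row
    p = rand(R, 1);                % input vector

    % (1) Normalized vectors: weighted sum (dot product); the largest wins.
    a = W * p;                     % a_j = sum_i w_ji * p_i
    [amax, winnerDot] = max(a);

    % (2) Unnormalized vectors: squared Euclidean distance; the smallest wins.
    d2 = sum((W - repmat(p', M, 1)).^2, 2);   % d_j^2 = sum_i (p_i - w_ji)^2
    [dmin, winnerDist] = min(d2);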
In this study, the competitive neural network is trained with the Kohonen learning rule, which adjusts the weight vector of the winning neuron $j$ in the direction of the input vector:

$$\Delta w_j(n) = \eta \, [\, p(n) - w_j(n) \,]$$

where $\eta$ is the learning rate.

[Figure 6.3: Training process for competitive neural network - the weight vectors shown during training and after training.]

This process is repeated iteratively over the entire training set until the weight change becomes minimal and the weight vectors can be considered representative of the clusters. If the input data sets are in the form of clusters, then every time a winning node is excited, the winning weight vector moves towards the particular cluster of data. Eventually, once the competitive neural network is trained, each of the weight vectors converges to the centroid of one cluster, ideally representing the prototypes of the clusters found in the dataset. This training process is depicted in Figure 6.3. When a new dataset is presented, the network calculates the similarity with the weight vectors, which are the centroids of each cluster, and generates the output that most closely resembles the given input vector.
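A compact sketch of this training loop, combining the squared-distance similarity measure with the Kohonen update above (the training data and parameters here are illustrative placeholders):

    % Sketch of competitive training with the Kohonen learning rule.
    P = rand(19, 300);                 % columns are training vectors (placeholder)
    M = 7; eta = 0.01; epochs = 1000;  % clusters, learning rate, training length
    W = rand(M, 19);                   % randomly assigned prototype weight vectors

    for epoch = 1:epochs
        for n = 1:size(P, 2)
            p  = P(:, n);
            d2 = sum((W - repmat(p', M, 1)).^2, 2);    % similarity measure layer
            [dmin, j] = min(d2);                       % maxnet declares the winner
            W(j, :) = W(j, :) + eta * (p' - W(j, :));  % move the winner toward p
        end
    end
    % After training, each row of W should approximate one cluster centroid.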
6.4 Usecase Diagram

The first step in any system design is to identify the usecases and actors. UML (Unified Modeling Language) tools are among the best tools for designing software systems. A 'usecase' is a system-level function that helps one to visualize the system by describing the interaction between the user and the system. It emphasizes the behavior as it appears to the outside user environment. The elements in a usecase diagram are "usecases", "actors", and the associations between them.

Identifying the usecases: Usecases are the services provided by the system from the user's perspective. In this system, we identified the following usecases, as shown in Figure 6.4:

[Figure 6.4: Usecase diagram for the system - the actor and the SAMS software interact through the chemical species database in SMILES notation, the functional group sorter, the SMILES-to-matrix converter, and the artificial neural network, ending in the network output.]

1. VOCs list extraction; the extracted list is given to the functional group sorter as an input

2. Pruned chemical species generation; the pruned list is given as an input to the SMILES to matrix notation converter

3. Optimized vector generation; the vector is given as an input to the artificial neural network for the training and testing process

4. Output

Identifying the actors: An actor is a person, organization, or external system that plays a role in interacting with the system. In this case, the researcher or user, the EPI Suite (external system), and the SAMS software are the only actors.

Usecases help to represent problem scenarios and to identify the cases that need to be taken care of during design or implementation. The implementation of lumping of atmospheric chemical species through artificial neural networks is discussed in Chapter 7.

Chapter 7

Application of the ANN for Classification of Chemical Species: Implementation and Results

Neural networks have the ability to learn and therefore to generalize. Generalization refers to the neural network producing reasonable outputs for inputs not encountered during training. Neural network performance depends on two attributes: knowledge representation and architecture.

Knowledge representation includes what type of information is actually made explicit to the network and how the information is physically encoded for subsequent use. An accurate solution to the problem depends upon an appropriate representation of the input data to the neural network.

For a given problem, an appropriate neural network architecture has to be considered. This consists of the selection of a suitable neural network model and network parameters. Unfortunately, there are currently no well defined rules to do this, so ad-hoc procedures are used to obtain good results. A neural network may perform badly if the network model is poorly fitted. There are many other reasons for a network model to underperform, among them improper input node assignment, insufficient hidden nodes, too few training epochs, inappropriate values of design parameters, and the nature of the dataset.

A number of experiments in lumping chemical species with artificial neural networks have been carried out. In this work, network modeling and programming is performed using MATLAB version 7. MATLAB (Matrix Laboratory) is a high performance interactive software package for performing numerical computations and graphics with matrices and vectors. MATLAB features a family of add-on application-specific solutions called toolboxes; toolboxes are available for signal processing, control systems, neural networks, fuzzy logic, wavelets, simulation, and many others. The MATLAB Neural Network Toolbox was used in this work.

This chapter discusses the network development, the parameters used, the testing methods, and the results obtained for both supervised and unsupervised learning. Finally, it discusses why one learning method performed better than the other.

7.1 Supervised Neural Networks

7.1.1 Network Development

The feedforward neural network has 19 elements in the input vector. The first 18 elements describe the connectivity of atoms within a molecule and the final element describes the relative rate coefficient for reaction of the chemical species with ozone and OH. There are seven output layer units, with each unit describing a lumped category. The lumped categories considered in this work are aldehydes, terminal alkenes or alkynes, internal alkenes or alkynes, alcohols, chemical species which react predominantly with ozone, alkanes, and ketones.

Various combinations of the training epochs, hidden layers, and hidden layer units have been considered to optimize the performance of the network. The training and testing process is carried out for 25 iterations. An epoch is defined as a single cycle in which the sequence of all the input vectors is presented to the neural network. An iteration is defined as a cycle of a designated number of epochs. At each epoch, when an input vector is presented to the neural network, an error is computed. The weights of the network are adjusted based on the error computed for all the input vectors. The modification of the weights is carried out for the complete cycle of epochs such that the mean squared error decreases with every epoch. This process is carried out until the network reaches the maximum number of epochs or reaches the minimum mean squared error. This complete process is one iteration.
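As a sketch of how one such network might be assembled with the Neural Network Toolbox of that era, assuming its newff/train/sim interface with 'traingda' as the gradient descent with adaptive learning rate algorithm; the data matrices here are placeholders, not the thesis dataset:

    % Sketch of the supervised network set-up (era toolbox API assumed).
    P = rand(19, 914);                        % placeholder training inputs
    T = full(ind2vec(ceil(7*rand(1, 914)))); % placeholder 7-way target vectors

    PR  = [min(P, [], 2) max(P, [], 2)];      % 19 x 2 matrix of input ranges
    net = newff(PR, [35 7], {'logsig' 'satlin'}, 'traingda');

    net.trainParam.epochs = 250;              % maximum number of training epochs
    net.trainParam.goal   = 0;                % performance goal (MSE)
    net.trainParam.lr     = 0.01;             % initial adaptive learning rate

    net = train(net, P, T);                   % backpropagation training
    A   = sim(net, P);                        % network outputs
    [amax, lump] = max(A, [], 1);             % winning lumped category per species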
If the results were presented for only one iteration, they could not be justified as representative. Therefore, to evaluate the performance of the network and the results obtained, the network training and testing process is carried out for 25 iterations.

Various combinations of the hidden layers and their units are considered because these greatly influence the performance of the network. If the number of nodes in the hidden layer is too low, the network may not yield an appropriate classification of the data. On the other hand, if more hidden neurons or hidden layers than necessary are used, the network may tend to overfit the data. According to a rule of thumb often mentioned in the literature, a network performs better if the number of hidden nodes is greater than the sum of the input nodes and output nodes [36]. We have carried out the experiments with 27, 35, 45, 55, 65, or 75 hidden nodes in a single hidden layer, and with [25, 10] nodes in a two-hidden-layer network. The notation [25, 10] indicates two hidden layers with 25 hidden nodes in the first hidden layer, which is connected to the input layer, and 10 nodes in the second hidden layer, which is connected to the first hidden layer and the output layer.

Prior to the network training operation, a dataset consisting of 1,016 chemical species is sorted into seven categories as shown in Table 7.1. The dataset is divided into two distinct parts: a training set and a testing set. A testing set of 102 chemical species is generated using a uniform random distribution within each lumped category, so that 10% of the chemical species from every lumped category is included for testing the network. This testing set is excluded from the training set and used later, once the model is developed, to test the performance of the network model. The network parameters for the supervised neural network are shown in Table 7.2.

Table 7.1: Number of chemical species in the dataset

    Lumped Category                                 Number of Chemical Species
    Aldehydes                                        98
    Terminal alkenes or alkynes                     500
    Internal alkenes or alkynes                     119
    Alcohols                                         36
    Molecules which react predominantly with O3     152
    Alkanes                                          38
    Ketones                                          73

Table 7.2: Network parameters adopted for supervised neural network experimentation

    Architecture                Feedforward neural network
    Learning algorithm          Gradient descent with adaptive learning rate backpropagation algorithm
    Input units                 19
    Hidden layers               One and two hidden layers
    Hidden layer nodes          Various combinations: 27, 35, 45, 55, 65, 75, and [20 15]
    Number of training epochs   250, 500, 1000, and 1500
    Output units                7 proposed lumped patterns
    Transfer function           'logsig' in hidden layer and 'satlin' in output layer
    Adaptive learning rate      0.01 (default)
    Initial weights and biases  Randomly generated
    Performance goal            0
    Training set                90% of the dataset
    Testing set                 10% of the dataset (uniform random distribution from each class)
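A sketch of the stratified hold-out split described above, drawing roughly 10% of each lumped category uniformly at random; the label vector is a placeholder for the actual category assignments:

    % Sketch: uniform random 10% test split within each lumped category.
    labels  = ceil(7*rand(1, 1016));      % placeholder category label per species
    testIdx = [];
    for c = 1:7
        members  = find(labels == c);     % species belonging to category c
        k        = round(0.10 * length(members));
        shuffled = members(randperm(length(members)));
        testIdx  = [testIdx shuffled(1:k)];   % hold out ~10% of category c
    end
    trainIdx = setdiff(1:1016, testIdx);  % the remaining ~90% trains the network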
7.1.2 Training and Testing the Network Model

In each epoch of training, when an input vector is presented, the seven elements of the output vector are generated. These are compared to the seven elements of the target vector and an error is computed. This error is the mean squared error,

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (t_i - a_i)^2,$$

which averages the error between the network's output and the target output over all n inputs. The error is fed back to the neural network, and the backpropagation mechanism adjusts the weights and biases of the neurons of the network according to the learning algorithm such that the error is decreased for every epoch. This training process is repeated until the sum of the mean squared error is minimized or the maximum number of epochs of training is reached. As a result, in each epoch of an iteration the network model gets closer to producing the desired output.

The final step in each iteration is testing the performance of the neural network. In the testing process, when an input vector is presented to the neural network, an output is generated. The seven elements of the output indicate the "goodness" of the match with the seven lumps of the chemical species; the output element with the maximum value indicates the lumped category of the chemical species represented by the input vector.

The percentage accuracy of classifying a chemical species into a distinct lumped group for various network designs is shown in Tables 7.3 and 7.4. The results presented are the average and the standard deviation of 25 iterations, each carried out for up to a different maximum number of epochs of training; the results in Tables 7.3 and 7.4 are for networks trained for 250 epochs. This process is repeated for two different representations of 'trans' double bonds:

• Representation as matrix element value (and input vector element) 2.75 for a double bond between two C atoms associated with the 'trans' conformation, as mentioned earlier in Section 2.6.2

• Representation as matrix element value (and input vector element) 1.75 for a double bond between two C atoms associated with the 'trans' conformation

The notation is transformed from 2.75 to 1.75 because '1.75' is closer to '2' (double bond), and this representation of 'cis' (2.25) and 'trans' (1.75) averages to the representation of a double bond (i.e., 2).

We can observe from the results that the performance of the network varied from one iteration to another for the same network model. The accuracy of classifying the chemical species into appropriate lumped categories varied approximately from 46% to 92%. This variation of the results is not considered unusual in the field of neural networks. For example, in Table 7.4, for the 5th iteration with 45 hidden nodes, the network responded with only 48% accuracy. Such results may be obtained if the network reaches either the minimum gradient or the maximum number of epochs before the performance goal has been met; this case is an example of reaching the minimum gradient before the performance goal has been met. For randomly assigned initial weights, there is a probability that the mean squared error may not converge within the given number of epochs.

The transformation of the 'trans' representation from 2.75 to 1.75 led to improved results, as shown in Tables 7.3 and 7.4. For most of the models, the mean of the results increased and the standard deviation decreased when the 'trans' double bond was represented by 1.75 instead of 2.75. This is a clear indication that 1.75 is a better representation than 2.75 for a 'trans' double bond, and that the results overall may be sensitive to the numerical values assigned to various chemical bonds.

The network is trained for various numbers of training epochs (250, 500, 1000, 1500). The training process by gradient descent with an adaptive learning rate backpropagation algorithm is stopped before reaching the maximum number of epochs if the algorithm has converged. The network is said to have converged if the mean squared error (MSE) is nearly constant over several epochs.
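One simple way to test the near-constant-MSE condition is to inspect the per-epoch error record returned by train; this sketch assumes the era's toolbox returns such a record in tr.perf, and the window and tolerance are illustrative:

    % Sketch: declare convergence when the MSE is nearly constant.
    [net, tr] = train(net, P, T);         % tr.perf holds the MSE for each epoch
    window = 10; tol = 1e-6;
    recent = tr.perf(max(1, end-window+1):end);
    converged = (max(recent) - min(recent)) < tol;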
The best results obtained for the supervised training method are shown in Table 7.5. A single-hidden-layer feedforward neural network with 35 or 65 hidden nodes produced the best result, classifying 92.16% of the chemical species into appropriate categories; less than 8% of the chemical species were misclassified. This is discussed further in Section 7.3.

[Table 7.3: Classification accuracy (%) of lumping chemical species into appropriate groups over 25 iterations for networks with 27, 35, 45, and 55 hidden nodes (HN), under the 'trans' = 2.75 and 'trans' = 1.75 representations, with the mean and standard deviation for each configuration.]

[Table 7.4: Classification accuracy (%) of lumping chemical species into appropriate groups over 25 iterations for networks with 65, 75, and [25, 10] hidden nodes (HN), under the 'trans' = 2.75 and 'trans' = 1.75 representations, with the mean and standard deviation for each configuration.]

[Figure 7.1: Training of a neural network - mean squared error (MSE) versus number of epochs.]

Table 7.5: Best classification accuracy of chemical species into appropriate lumping groups

    Network Design          Accuracy of       Performance Goal    Number of
                            Prediction (%)    Achieved (MSE)      Epochs
    27 hidden nodes         91.18             0.0014603900        83
    35 hidden nodes         92.16             0.0023992100        83
    45 hidden nodes         91.18             0.0006678037        61
    55 hidden nodes         89.22             0.0009909770        60
    65 hidden nodes         92.16             0.0006780370        64
    75 hidden nodes         90.20             0.0008345070        64
    [20 15] hidden nodes    87.26             0.0023994000        94
7.2 Unsupervised Neural Networks

7.2.1 Network Development

In contrast to supervised learning, unsupervised learning does not use any target outputs for training the neural network. Unsupervised learning methods can be used for applications such as mapping from input to output space, data compression, or clustering. The network parameters used for the unsupervised learning method are shown in Table 7.6.

Table 7.6: Network parameters for unsupervised neural network

    Architecture                Competitive neural network
    Learning algorithm          Kohonen learning rule for updating the weights and bias learning rule for updating the biases
    Input units                 19
    Number of layers            Similarity measure layer and competitive layer
    Number of training epochs   1000
    Output units                1 (i.e., the winner)
    Transfer function           Competitive transfer function (compet) in competitive layer
    Kohonen learning rate       0.01 (default)
    Conscience learning rate    0.001 (default)
    Initial weights and biases  Weights set to the midpoint of the input range; biases initialized by the conscience bias initialization function
    Training set                90% of the dataset
    Testing set                 10% of the dataset (uniform random distribution from each class)

The initial values of the weights assigned to the network are the midpoints of the input ranges, and the biases are initialized by a conscience bias initialization function. Midpoint is a weight initialization function that sets the weight (row) vectors to the center of the input ranges. This function takes two arguments (S, PR), where S is the number of neurons and PR is an R x 2 matrix of input value ranges [Pmin, Pmax], and returns an S x R matrix with rows set to (Pmin + Pmax)/2. The conscience bias initialization function assigns random values depending upon the number of neurons. No matter how long training is continued in a competitive network, there is a possibility that a randomly assigned neuron weight vector starts out so far from every input vector that it never wins the competition and never learns. This limitation can be mitigated by using the bias learning rule. The functionality of this algorithm is documented in the neural network toolbox [34].

Table 7.7: Examples of chemical species represented in vector notation (VN) and normalized vector notation (NVN)

    Formaldehyde  VN   [1.0 2.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
                  NVN  [0.3714 0.9285 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
    Hexaldehyde   VN   [1.0 1.0 0 1.0 0 0 1.0 0 0 0 1.0 0 0 0 0 2.5 0 0 0]
                  NVN  [0.2981 0.2981 0 0.2981 0 0 0.2981 0 0 0 0.2981 0 0 0 0 0.7454 0 0 0]
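Assuming the era's toolbox interface for competitive layers (newc, which takes the Kohonen and conscience learning rates as its third and fourth arguments and applies the midpoint and conscience bias initializations noted above), the set-up of Table 7.6 might be sketched as follows; the data matrix is a placeholder:

    % Sketch of the competitive network set-up (era toolbox API assumed).
    P  = rand(19, 914);                  % placeholder training vectors
    PR = [min(P, [], 2) max(P, [], 2)];  % 19 x 2 matrix of input ranges

    net = newc(PR, 7, 0.01, 0.001);      % 7 neurons; Kohonen and conscience rates
    net.trainParam.epochs = 1000;

    net = train(net, P);                 % unsupervised training (no targets)
    A = sim(net, P);                     % one-hot output: 1 for the winner
    cluster = vec2ind(A);                % winning neuron index per species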
7.2.2 Training and Testing the Network Model

The inputs are applied to the network through the input layer. As mentioned in Section 2.3, the first step for a competitive neural network is to calculate the distance between the input vector and the weight vector. This distance can be calculated by a weighted sum, which can be interpreted as the dot product of the input vector and the weight vector; the input vectors need to be normalized for this approach.

However, if the input vector is normalized, the values of the elements in the input vector which contain the connectivity information are changed. Consider the vector notation for the two chemical species from the aldehyde group shown in Table 7.7. The connectivity information of a carbon-carbon single bond is represented by 1.0 in the vector notation of every chemical species, but when the vector is normalized, the same carbon-carbon single bond is represented by different values in different species, changing the meaning of the representation. This means that similarity can no longer be measured consistently. Normalizing the vector alters the connectivity information of all the atoms within the molecule; because the information is degraded, normalization is not appropriate for this application. Therefore the similarity measure is carried out for the unnormalized input vectors.
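The effect can be reproduced directly from the vectors in Table 7.7; this small sketch shows how normalization maps the same carbon-carbon single bond (1.0) to different values in different species:

    % Normalization destroys the fixed meaning of the connectivity values.
    form = [1.0 2.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];          % formaldehyde
    hexa = [1.0 1.0 0 1.0 0 0 1.0 0 0 0 1.0 0 0 0 0 2.5 0 0 0];  % hexaldehyde

    nform = form / norm(form);   % first element becomes 0.3714
    nhexa = hexa / norm(hexa);   % first element becomes 0.2981
    % The C-C single bond (1.0) is now encoded by two different numbers, so a
    % distance-based similarity measure no longer compares like with like.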
The 87 1,2344 2.3544 1.7313 0.9929 0.2835 0.0000 0.9794 1.9291 0.0079 0.0000 0.2334 0.4881 0.8768 0.5008 0 0 0.0762 0.0822 0.2508 0.0542 1.0186 0.7385 0 0 0.0000 0.0101 0.0367 0.0060 0.2874 0.3103 0.7050 0.3949 0.2142 0.1542 0.3781 0.1508 1.9605 0.1009 1.9677 1.2074 0.3459 0.6554 0.0000 0.0000 2.4186 0 0.2276 0.2671 0.4878 0 0.0000 0.0000 0.0091 1.3530 0.2695 0.3224 0.2900 2.2025 1.0652 0.0457 1.7036 0.0002 0.5249 0.4787 0 0.0000 0.0000 2.5174 0 0.0000 0.0706 0.1535 0.5052 0.0542 0.3211 0.2900 1.2756 1.6809 0.4864 0.4736 0.0977 0.0224 0.9211 0 2.0091 1.1343 0.1667 1.0835 0.0080 0.0298 0.2884 1.0026 0.2478 0.9496 0.0182 0.3637 0.8597 0 0.1966 0.3377 0.4760 0 0.0012 0.1075 0.1195 0.3035 0.1682 0.4082 0 0.0000 0.0000 0.0000 0.9499 0 0.0705 0.0937 0.3756 0.2870 2.5526 0.4898 0.0750 0.1480 0.0852 0.0431 0.4751 0.2496 0.0000 0.3582 0.8510 0 2.1598 2.0534 0.2548 Figure 7.2: Prototype weight vector (W^) formed after 1000 epochs [Columns repre­ sent the clusters and rows represent the connectivity information] competitive layer assigns each input vector to one of the categories by producing an output of 1 for a neuron with the weight vector is closest to input vector. The results obtained for the unsupervised learning method after training the network for 1000 epochs are illustrated in the Figure 7.3. The numbers in the results (Figure 7.3) indicate the winning neuron for each chemical species given as test data. Terminal alkenes or alkynes Aldehyde 1 2 3 4 5 6 7 8 9 Id 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30' >— .Order of Chemical species in testing set 6 6 6 6 6 4 2 4 7 4 3 3 6 7 6 7 2 2 3 6 2 3 7 7 7 5 2 6 2 7- ■Output Neuron won Terminal alkenes or alkynes 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 5 4 5 3 7 6 2 2 2 4 2 2 2 3 3 6 6 7 2 2 2 4 7 2 3 7 7 7 7 2 f \ Internal alkenes or alkynes Alcohol Chemical species which predominently react with ozone 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 1 1 6 5 2 5 2 3 2 2 5 4 2 5 2 5 Alkanes 1 1 1 1 1 1 1 1 3 6 1 1 3 11 Ketones 92 93 94 95 96 97 98 99 100 101 102 5 5 5 5 2 5 3 2 3 6 5 Figure 7.3: Results obtained for unsupervised learning method In supervised learning method, a previously classified set of the chemical species is required for training the neural network. Once trained the network will be able to classify the chemical species into appropriate categories. The potential of the unsu­ pervised learning method is th a t it does not require any prior classification process for training the neural network. The unsupervised learning method is able classify the chemical species into categories based on the input patterns only. The results ob­ tained by the unsupervised learning method do not appear promising when compared to the supervised learning method. 89 7.3 D iscu ssion The main purpose of the experimentation with ANNs was to design and test ANNs for automating the classification of chemical species. We employed both the supervised learning and the unsupervised learning neural networks. The supervised learning method was able to classify accurately 92.16% of the chemical species. Misclassification is occured for less than 8% of the chemical species. An analysis of the results is done for the first five iterations of training and testing. In supervised learning, it was the alcohols th at were most frequently misclassihed. 
For example, the chemical species with the SMILES pattern C=CC(O)C=C was never successfully classified as a member of the alcohol lump: when this chemical species was given for testing, it was misclassified once into the aldehyde lump and four times into the terminal lump. From the neural network perspective this may be reasonable, because the molecule has two double bonds in the terminal positions of the carbon chain; but from the lumping point of view we have told the supervised network to classify it as a member of the alcohol lump. There are chemical species in all the lumped categories with more than one functional group, and we have not employed any ranking scheme for multiple functional groups within a chemical species. Chemical species with multiple functional groups were classified properly for the other lumps, but not for the alcohol lump. It was also observed from the results that there are a few cases in which chemical species from the alcohol lump could not be classified into any of the proposed lumped categories; two examples are presented in Table 7.8.

Another reason for the misclassification of alcohols may be the low number of chemical species available in this lump. There are only 36 chemical species in this lump in the training set, less than half the number of chemical species present in the other lumps. As shown in Table 7.8, two other chemical species in the alcohol lump which were misclassified frequently are CC(O)CC=C and CCCO.

[Table 7.8: Analysis of results for alcohols - for each of three alcohol-lump species (C=CC(O)C=C, CC(O)CC=C, and CCCO), the number of times over 5 iterations it was assigned to each lump (aldehyde, terminal, internal, alcohol, ozone, alkanes, ketones, or none).]

Another lump for which misclassification was frequent is the terminal lump.
These are the key elements in the vector no91 Table 7.9: Analysis of results for other chemical species - misclassification of chemical species in supervised learning method obtained from 5 iterations Lump Aldehyde Terminal Internal Alcohol Ozone Alkanes Ketones None Misclassification C-CCCC(-C)C C#CC(=C)C c#cc#c - 1 - 4 2 - 2 1 1 4 3 - 3 1 4 tation th at differentiate between the lumped groups. These weights to represent the connectivity information between the carbon and oxygen atoms have been chosen arbitrarily. For example, trans double bond is initially represented as 2.75 but later changed to 1.75 which increased the accuracy of results. However this value is very close to the values represented for carbon-carbon double bond (2.0) and carbon-oxygen double bond (2.5). Classes of similar properties may be separated by supervised learning because supervised learning has an exter­ nal teacher (target output) for each input vector during training. In the case of unsupervised learning they could be lumped into one class. If the choice of the numerical values representing different functional groups were more separated, improved results may be anticipated. This is definitely one of the approaches we would like to consider in our future work. 3. The supervised learning method can learn about the direction of the error in which the network is moving. In the case of unsupervised learning, the direction in which the error is changing does not play a role. Unsupervised learning methods have been gainfully applied to problems where each element in the vector of all input data indicates the fixed position or fixed characteristics. For example in the Figure 7.4 the character ‘A’ can be represented in 92 a vector notation as: [ 0 1 1 1 1 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 ] and the character ‘B’ can be represented in a vector notation as: [ 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 ]. In these two vector notations the first and second elements in the vector indicates the first and second cells in the first column respectively. This is not the case with the dataset in this work. Each element position in the vector does not represent the same functional group. This type of method may not be readily applicable to our dataset. Even though it has learned to form a prototype weight vector by calculating the similarity measure, these prototype vectors are not the actual representative of each cluster. To solve this problem, one approach could be representing the chemical species by using some form of canonical notation. Figure 7.4: Example for the input data In Figure 7.3, the chemical species numbered from 77 to 91 in the testing set belong to the lump th at reacts preferentially with ozone. When these chemical species are given as an input to the network, one of seven categories will produce an output of 1 for the neuron whose weight vector is closest to the input vector. Twelve of fifteen chemical species which are given as an input to the network were able to produce an output of 1 for the 1®* neuron. This explains th at the E* prototype weight vector in the weight m atrix closely resembles the ozone lump (i.e., the 1st vector in the weight matrix was able to form the centroid of the ozone lump cluster). Two of the chemical species from the ozone lump are misclassified to the 3'’'^ prototype vector and one is misclassified to the 6*^ prototype vector. We can also confirm th at the 1®* prototype vector in the weight matrix indicates the ozone lump with the help of a relative rate coefficient. 
Only the chemical species which predominantly react with ozone have a relative rate coefficient greater than unity, and the nineteenth element in the vector notation of every chemical species in the database contains the relative rate coefficient. It can be observed that only one weight vector in the prototype weight matrix, [1.2344 1.7313 0.2835 0.9794 0.0079 0.2334 0.8768 0 0.0762 0.2508 1.0186 0 0.0000 0.0367 0.2874 0.7050 0.2142 0.3781 1.9605], has a nineteenth element greater than unity. This is the reason the ozone lump was classified reasonably well compared to the other lumps.

Similarly, the alkane lump (chemical species numbers 92 to 95) mapped to the 5th prototype weight vector: all four chemical species from the alkane lump given for testing were classified properly. The elements in the vector notation of the chemical species in the alkane lump cannot be greater than or equal to 2, because the carbon atoms in the carbon chain of an alkane are single bonded, which is denoted by 1. It can be observed that none of the elements of the 5th prototype weight vector is greater than or equal to 2; that prototype weight vector was therefore able to train and map to form the centroid of the alkane lump cluster.

The aldehyde lump (chemical species numbers 1 to 10) mapped to the 6th prototype weight vector, but only the first five of the ten chemical species given for testing were classified properly. This is because the 16th element of that prototype weight vector, and of the first five species given for testing, is 2.5 (i.e., an oxygen atom connected to a carbon atom with a double bond), so those species mapped to the 6th prototype weight vector. A similar kind of mapping occurred for the rest of the chemical species in the aldehyde testing dataset, and they were misclassified accordingly, as shown in Tables 7.10(a) and 7.10(b). From this we can conclude that the unsupervised learning method itself is not performing badly; rather, it is the nature of the data that causes the unsupervised learning method to perform poorly. Similarly, other lumps were misclassified depending upon the input given for testing and the resulting prototype weight vectors formed.

In conclusion, it can be said that the dataset in its current form may not be appropriate for applying unsupervised learning methods. A potential future direction of research would be to improve the representation of the chemical species for use in unsupervised learning methods.

[Table 7.10(a): Classification and misclassification for the aldehyde lump - the element-by-element comparison of aldehyde test species 1-5 with prototype weight vector 6, and of species 6, 8, and 10 with prototype weight vector 4.]
[Table 7.10(b): Misclassification for the aldehyde lump - the element-by-element comparison of aldehyde test species 7 with prototype weight vector 2, and of species 9 with prototype weight vector 7.]

Chapter 8

Conclusions and Future Directions

8.1 Conclusions

During the past decade, it has been recognized that lumping atmospheric chemical species into groups is an effective technique to reduce the complexity of a reaction mechanism through a condensed representation of atmospheric chemistry. Lumping chemical species into different categories is a classification problem, and we identified from the computational perspective that the application of machine learning techniques by artificial neural networks has the potential to automate the process. In this study, we:

1. Generated a chemical species database for training the neural network. This included:
(a) The generation of all the possible chemical species isomers for a given empirical formula.
(b) Conversion of the structural notation of the chemical species to a SMILES notation for computer use.
(c) Pruning of the chemical species database by retaining the chemical species which are important in the atmosphere.

2. Proposed seven lumped categories using a functional group approach.

3. Implemented a method of transforming the chemical species structural information into a notation which can be used by the neural network.

4. Conducted training and testing processes for both supervised and unsupervised learning methods.

The results presented in this study for supervised learning are more promising than for unsupervised learning and suggest that supervised neural networks can be gainfully employed for lumping atmospheric chemical species. The best result obtained is 92.16% classification accuracy for a supervised neural network learning method. However, some improvements are needed to quantitatively describe practical systems. The results of unsupervised learning indicate that it is sensitive to the numerical values assigned to various chemical bonds. Improvements to the representation of the chemical species for unsupervised learning must be considered in future work.
The percentage accuracy of classification depends on the complexity of the problem and the availability of the data set. We could not compare the results obtained in this work directly with results in the literature because this is a novel approach. When the results obtained in this work are compared with some other chemistry and environmental science applications in the field of neural networks, these results can be considered acceptable. In "The integrated strategy of pattern classification and its application in chemistry" [50], Huafeng Wang and co-workers tried to classify Nature Spearmint Essence (NSE) specimens into three graded ranks of quality (good, middling, and bad) and also to classify the toxicity of amines into highly-toxic, medium-toxic, and low-toxic. The neural networks alone showed only 46% accuracy; an improvement of the model was obtained by integrating other approaches such as the Bayes method and correlative component analysis, leading to a classification accuracy of 100%. In another paper, "Assessment and prediction of tropospheric ozone concentration levels using artificial neural networks" by Abdul-Wahab and co-workers [35], the ozone concentration was predicted in advance with an accuracy rate of 98%.

The lumped mechanism (RACM) proposed by Stockwell [23] was used to conduct a simulation study of environmental chamber experiments, testing a complete system of an atmospheric chemical mechanism. The chamber contained a mixture of NOx and organic species exposed to either sunlight or artificial light, and concentration measurements were made as a function of time. Taking all the uncertainties into consideration, he was able to predict ozone concentration no better than ±30%. Taking this study into consideration, and through indirect comparisons, we conclude that the error rate of 8% obtained from the supervised learning method is very likely to be acceptable in kinetic models.

8.2 Future Directions

There are many directions in which the work presented in this thesis can be extended. We outline some possible directions for future work:

1. The neural network methods used in this work are general. An alternative approach such as "network pruning" [51] is a possible future direction. A minimum size neural network is less likely to be influenced by noise in the training data and thus will be able to generalize better; network pruning is one technique that can be used to achieve this objective, and it assists in minimizing system complexity. In network pruning, a large network with adequate performance is first considered; once the network has been trained, it is pruned by eliminating certain synaptic weights in a selective or orderly fashion. A potential exploration of network pruning would involve addressing issues such as which weights should be eliminated and how the remaining weights should be adjusted.

2. The work could be expanded by working with larger data sets, considering all possible chemical species in the atmosphere. In the present work we did not consider cyclic chemical species, aromatics, or radicals; by considering these, the system developed might evolve into a system with more practical applications. If the data set is large, a more adequate testing set (for example 20% - 30%) can be used to test the performance of the model.
It is also important to have a good number of chemical species in each lump to produce a good classification accuracy for all the categories. The problem of the unsupervised learning of the alcohols lump might also be examined in future work.

3. The application of neural networks to various fields has shown wide ranges in the quality of results. Many attempts have been made in the literature to improve the performance of models through hybrid networks when a neural network alone performs unsuccessfully. Future work could consider such hybrid models. The aim of a hybrid model is to combine the advantages of different architectures into a single system. This involves the use of more than one problem solving technique; through this approach a system can perform better by increasing the strength of the combined techniques and decreasing the weakness of using either technique alone.

4. Modification of the input vector notation, by choosing more widely separated numerical values to represent the different functional groups and by adopting a canonical representation, should be considered so that improved results can be anticipated for the unsupervised learning method.

The overall results confirm the hypothesis that it is possible to automate (with reasonable error) the classification of atmospheric chemical species by a neural network approach. The supervised learning method performed reasonably well when compared to the unsupervised learning method; this is because the dataset in its current form of representation may not be appropriate for applying unsupervised learning methods.

Appendices

Appendix A - Backpropagation Algorithm

The error signal at the output of neuron j at iteration n is defined as

$$e_j(n) = d_j(n) - y_j(n) \quad (A1)$$

where $y_j(n)$ and $d_j(n)$ are the functional signal appearing at the output and the desired response of neuron j at iteration n, respectively. The instantaneous value of the error energy for neuron j is defined as

$$\mathcal{E}_j(n) = \tfrac{1}{2} e_j^2(n) \quad (A2)$$

The instantaneous value of the total error energy is obtained by summing $\tfrac{1}{2} e_j^2(n)$ over all neurons:

$$\mathcal{E}(n) = \tfrac{1}{2} \sum_{j \in C} e_j^2(n) \quad (A3)$$

where C is the set of all neurons. Let N be the total number of training patterns. The average squared error energy is then

$$\mathcal{E}_{av} = \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}(n) \quad (A4)$$

For a given training set, $\mathcal{E}_{av}$ represents the cost function as a measure of learning performance. The objective of the learning process is to adjust the parameters of the network to minimize $\mathcal{E}_{av}$. The induced local field (i.e., the weighted sum of all synaptic inputs plus bias) produced at the input of the activation function of neuron j at iteration n is given as

$$v_j(n) = \sum_{i=0}^{m} w_{ji}(n) \, y_i(n) \quad (A5)$$

where m is the total number of inputs and $w_{ji}(n)$ is the synaptic weight connecting the output of neuron i to the input of neuron j. The bias applied to neuron j is denoted by $b_j$; its effect is represented by a synapse of weight $w_{j0} = b_j$ connected to a fixed input equal to +1. The functional signal $y_j(n)$ appearing at the output of neuron j at iteration n is

$$y_j(n) = \varphi_j(v_j(n)) \quad (A6)$$