TRANSFORMER MODELS FOR PROTEIN-GUIDED DRUG COMPOUND GENERATION: A COMPARISON OF AMINO ACID SEQUENCES, PRE-TRAINED PROTEIN EMBEDDINGS, SMILES, AND SELFIES

by

Dylan Fossl

BSc. Hons., University of Northern British Columbia, 2020

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

November 2024

© Dylan Fossl, 2024

Abstract

Drug discovery is a time-consuming and costly process that notoriously suffers from low success rates. Increased availability of chemically relevant data and advances in machine learning techniques offer potential solutions for aiding in the drug development pipeline. This thesis explores the use of Transformers in the conditional generation of potential drug compounds from protein context. Building on previous research, this work implements four transformer models that take in protein information as input and generate potential binding compounds. Each model uses either SMILES or SELFIES string representations of compounds, and either amino acid sequences or pretrained ESM-2 protein embeddings as contextual input. These models are trained and compared in their ability to generate chemically feasible compounds that approximate the physiochemical properties of the training set and show binding potential specific to the contextual proteins. The utilization of SELFIES increased compound validity and diversity but overall had a negative impact on performance compared to the SMILES counterparts. Pretrained protein embeddings were shown to decrease validity but improved model performance despite no change to model structure or size. These results highlight the potential of transformer models paired with pretrained protein embeddings to enhance the drug discovery process with the generation of lead compounds from novel proteins without any fine-tuning or retraining.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Dedication
Acknowledgement
Chapter 1 - Introduction
  1.1 Background
    1.1.1 Difficulties in Drug Discovery
    1.1.2 Brief Introduction to Machine Learning
    1.1.3 Machine Learning in Drug Discovery
    1.1.4 Generative Deep Learning
  1.2 Problem Statement
  1.3 Thesis Structure
Chapter 2 - Literature Review
  2.1 Deep Learning
    2.1.1 Neural Networks
    2.1.2 Recurrent Neural Networks
    2.1.3 Transformers
    2.1.4 Sequence Generation Methods
  2.2 Representation of Biological Data
    2.2.1 Compound Representations
    2.2.2 Protein Representations
  2.3 Deep Generative Models in Drug Discovery
    2.3.1 Recurrent Neural Networks
    2.3.2 Transformers
    2.3.3 Other Model Architectures
    2.3.4 Protein Embedding Use in Drug Discovery
Chapter 3 - Thesis Statement
Chapter 4 - Methodologies
  4.1 Transformers for Compound Generation
    4.1.1 Model Data
    4.1.2 Model Structure
    4.1.3 Model Training
  4.2 Binding Classifier for Evaluation
    4.2.1 Model Data
    4.2.2 Model Structure
    4.2.3 Model Training
  4.3 Experiment Setup
  4.4 Evaluation
    4.4.1 Chemical Feasibility
    4.4.2 Physiochemical Properties
    4.4.3 Similarity to Known Binders
Chapter 5 - Results and Discussion
  5.1 Transformer Training Metrics
  5.2 Binding Classifier Training Metrics
  5.3 Chemical Feasibility
  5.4 Physiochemical Properties
  5.5 Similarity to Known Binders
    5.5.1 Tanimoto Similarities
    5.5.2 Classifier Scores
    5.5.3 Case Study
Chapter 6 - Conclusions and Future Research
  6.1 Conclusions
    6.1.1 Effect of SELFIES
    6.1.2 Effect of Embeddings
    6.1.3 Effect of Generation Method
    6.1.4 Scaling Evaluation with a Classifier
    6.1.5 Transformers Moving Forward
  6.2 Future Research
    6.2.1 Shortcomings and Limitations
    6.2.2 Improvements
  6.3 Final Statement
Bibliography
Supplementary Figures

List of Tables

Table 1. Chemical Feasibility results for each model and generation set. Bolded values are the max values for that metric and generation set between the models. Ties are both bolded and underlined.
Table 2. Chemical Feasibility of reference sets. The Ideal Set is made up of random samples from the Test set. Random sets are made up of random samples from the Training set.
Table 3. Binned absolute count of outlier compounds (>0.2) for Mean Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.
Table 4. Binned absolute count of outlier compounds (>0.3) for Max Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.
Table 5. Binned absolute count of compounds classified as binders (>0.5) for Classifier Scores. Max values for each generation set are bolded. Ties are bolded and underlined.

List of Figures

Figure 1. Visual Example of Perceptron
Figure 2. Segler et al 2018 [35] "Symbol generation and sampling process. We start with a random seed symbol s1, here c, which gets converted into a one-hot vector x1 and input into the model. The model then updates its internal state h0 to h1 and outputs y1, which is the probability distribution over the next symbols. Here, sampling yields s2 = 1. Converting s2 to x2 and feeding it to the model leads to updated hidden state h2 and output y2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired. In this example, we stop it after observing an EOL (\n) symbol, and obtain the SMILES for benzene. The hidden state hi allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later."
Figure 3. Vaswani et al. 2017 [16] "The Transformer - model architecture"
Figure 4. Example of Lewis Structure Representation.
Figure 5. David et al 2020 [76] "Example graph representation for acetic acid. a Graph representation of acetic acid with nodes numbered from one to four. b Example adjacency matrix, A, for an acetic acid graph with the corresponding node ordering on the left. c Example node features matrix, X, which one-hot encodes a few selected properties. d Example edge features matrix, E, where each edge feature vector is a one-hot encoding of single, double, or triple bonds. "Implicit Hs" stands for the number of implicit hydrogens on a given node."
Figure 6. David et al 2020 [76] "Canonical (a) and randomized (b) SMILES representations of aspirin. Randomized SMILES correspond to the various representations of a molecule obtained by randomly selecting the starting node in the graph traversal algorithm, thus changing the order of the nodes traversed in the molecular graph (still using depth-first search). Numbers represent the order of graph traversal, where 1 is the initial node (user defined). Considering a as being the canonical representation of aspirin, b shows a different ordering of the atoms of the molecule. The final SMILES is one possible SMILES among all the randomized SMILES which can be generated. Green arrows indicate how the molecular graph is traversed. Both SMILES strings shown represent the same molecule but, as the atom numberings are different, the generated SMILES strings are, too."
Figure 7. Elnaggar et al 2021 [86] "Protein LMs learned constraints. t-SNE projections visualized information extracted by the unsupervised protein LMs (here best-performing ProtT5-XL-U50; upper row: before training (Random), and lower row: after pre-training on BFD & UniRef50. (A) The left-most column highlights single amino acids by biophysical features. (B) The middle column annotates protein structural class (taken from SCOPe). (C) The right-most column distinguishes proteins according to the kingdom of life in which it is native. Although the random projections on top may suggest some adequacy of the clusters, the trained models shown on the lower row clearly stood out. Incidentally, the comparison of the two also highlighted the potential pitfalls of using t-SNE projections from many dimensions onto 2D: although random, the human might see some correct patterns even in the top row."
Figure 8. Distribution of Needleman-Wunsch Similarities Between Test and Train Amino Acids.
Figure 9. Song et al. 2023 [104] "The framework of PMF-CPI. For a compound-protein pair, the protein sequence is transformed into context embedding via pretrained TAPE and then fed into LSTM. And a compound SMILES is turned into a molecular graph by RDKit and encoded with graph neural networks. After that, two encoding vectors are merged through the Kronecker product, and MLP provides a final output. PMF-CPI can process two task types, regression for CPI binding affinities or classification for interactions"
Figure 10. Train and Test loss curves for all four models with Ideal loss visualized.
Figure 11. Evaluation Binding Classifier Train and Test AUC scores over training.
Figure 12. Distributions of Molecular Weights for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 13. Distributions of the Number of Hydrogen Acceptors and Hydrogen Donors for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 14. Distributions of Rotatable Bonds for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 15. Distributions of LogP for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 16. Distributions of Total Polar Surface Area (TPSA) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 17. Distributions of Quantitative Estimates of Drug Likeness (QED) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 18. Distributions of Natural Product Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 19. Distributions of Synthetic Accessibility Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 20. Heatmap of Wasserstein distances between generated sets and the ideal set, and shuffled sets and the ideal set for Mean Tanimoto Similarity. For full distributions see Supplementary Figure 1.
Figure 21. Heatmap of Wasserstein distances between generated sets and the ideal set, and shuffled sets and the ideal set for Max Tanimoto Similarity. For full distributions see Supplementary Figure 2.
Figure 22. Heatmap of Wasserstein distances between generated sets and the ideal set, and shuffled sets and the ideal set for Classifier Scores. For full distributions see Supplementary Figure 3.
Figure 23. ROC Curve for Generated vs. Shuffled Sets across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.
Figure 24. ROC Curve for Generated vs. Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.
Figure 25. ROC Curve for IGF-1 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Figure 26. ROC Curve for IGF-1 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Figure 27. ROC Curve for VEGFR2 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Figure 28. ROC Curve for VEGFR2 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Supplementary Figure 1. Mean Tanimoto Similarity Distributions of Generated and Shuffled sets.
Supplementary Figure 2. Max Tanimoto Similarity Distributions of Generated and Shuffled sets.
Supplementary Figure 3. Classifier Score Distributions of Generated and Shuffled sets.
Supplementary Figure 4. ROC Curve for Generated Sets vs. Shuffled Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.

Dedication

Dedicated to My Wife, My Mother, and My Father

In Memory of Maria Borges, Gordon Fossl, and Gus

Acknowledgement

I wish to express my gratitude first and foremost to my supervisor, Dr. Fan Jiang. Thank you for believing in my work. Your support and guidance throughout my research have enabled me to produce work more valuable than I could have on my own.

I would like to thank my Committee Members, Dr. Alia Hamieh and Dr. Sean Maurice, for their guidance, support, and invaluable feedback throughout the course of this thesis. Their expertise and encouragement have been instrumental in shaping this work. My appreciation to Dr. Jianhui Zhou for participating in and chairing my defence. And a special thanks to Dr. Stephen Radar. The opportunity to do bioinformatic analysis in his lab was my first exposure to programming. I am not sure where I would be without his encouragement.

I want to acknowledge the support of my friends Galen Seilis, Daniel O'Reilly, and Edward Lee. Their academic support, advice, and comedic relief are the reasons I have been able to persevere through this work. I am grateful for the support of my family: my sister Cassandra Fossl, my mother Debbie Fossl, and my father Norman Fossl. This work is in memory of loved ones who have passed: my vavó Maria Borges and my uncle Gordon Fossl.
An extra special thanks to my wife, Braelyn Fossl. Her love and support have been at the foundation of all my effort throughout this research. Marrying Braelyn is an accomplishment that supersedes all else, including this work.

I wish to acknowledge the support of my colleagues at Northern Health, particularly my manager, Jim Condon. The opportunities and support you have provided me have made this work possible.

Lastly, I would like to acknowledge the financial and institutional support of the University of Northern British Columbia and the Natural Sciences and Engineering Research Council of Canada - CGSM, which made this research possible.

Chapter 1 - Introduction

1.1 Background

1.1.1 Difficulties in Drug Discovery

The search for cures to ailments and disease is likely one of the most consistent and universal human struggles. The importance of speed and innovation in the pursuit of these cures has become more salient in recent years with the emergence of COVID-19. The past decade has seen a massive increase in the amount of data available for analysis, and because of this we have seen the rise of machine learning techniques utilizing this data. The drug development space is not an exception to this trend. Advancements in RNA-Seq technologies allow for low-cost analysis of gene expression data for quick determination of compound effects, and these data are available on public databases such as the Sequence Read Archive [1] and the National Cancer Institute genomic data [2]. Protein data itself can be found in abundance on the Protein Data Bank (PDB), with more than 150,000 protein structures [3]. Drug compound information can be found thoroughly cataloged at ChEMBL, PubChem, and ZINC with 2.3 million, 112 million, and 750 million molecules, respectively [4], [5], [6]. This data provides the opportunity for the application of machine learning techniques to aid in the drug discovery process. In addition to abundant data, the cost and time demands of drug discovery also motivate this research.

The global pharmaceutical market is estimated to be worth 1.2 trillion dollars, but drug discovery itself is still plagued with low success rates [7]. A 2019 study evaluated 21,143 drug candidates and found that only 6.2% reach market [8]. This assessment only observed compounds that had made it to clinical trials and did not account for compounds that were invested in and researched but failed before clinical trials, suggesting an even lower figure. With this low success rate, the average drug development cycle from target screening to market takes 13.5 years and costs an estimated 1.8 billion dollars [9].

Finding potential drug-like compounds is also a technical challenge. It is estimated that the chemical space of drug-like compounds is as large as 10²³ to 10⁶⁰ molecules, with only approximately 10⁸ of those estimated to be therapeutically relevant [10]. With standard methods of high-throughput screening it is difficult, if not impossible, to navigate such a space efficiently [10]. It is these time, cost, and space problems, paired with the rich data availability, that motivate the research and application of machine learning in drug discovery.

1.1.2 Brief Introduction to Machine Learning

Machine Learning (ML)

Machine learning is a subset of artificial intelligence focused on developing algorithms that enable computers to learn from and make predictions or decisions based on known data.
Machine learning can be unsupervised, where the input data is unlabeled, or supervised, where the input data is mapped to output data. Supervised learning is performed by using a set of algorithms that are built to approximate the relationship between input data and some target output data [11].

Deep Learning

Deep learning is a subset of machine learning that utilizes neural networks with many layers, which are trained through a process called gradient descent [12]. It is best known for its ability to handle large unstructured datasets. The advantage of using deep learning is the ability to scale the neural network size or change the structure of the network to create many different architectures. Some popular architectures include:

Generative Adversarial Networks (GANs):
- Consists of two competing models, the generator and the discriminator. The generator's goal is to learn to generate realistic synthetic data through adversarial training, where the generator generates data and the discriminator identifies if it is real or fake [13].

Autoencoders:
- A neural network architecture primarily for data compression, noise reduction, and the creation of embeddings. It consists of two models, the Encoder and the Decoder. The Encoder takes input data and outputs a lower dimensional representation, while the Decoder takes in that representation and tries to reconstruct the original data input [12].

Recurrent Neural Networks (RNNs):
- Designed for sequential data, like time series or natural language. Built to take in sequence data one token at a time [12].

Long Short-Term Memory Networks (LSTMs):
- LSTMs are a type of RNN that contain additional mechanisms for learning long-term dependencies [14].

Gated Recurrent Units (GRUs):
- GRUs are a variant of RNN that, like LSTMs, have an additional mechanism for learning long-term dependencies. However, GRUs use one memory vector, where LSTMs use two. This makes GRUs computationally more efficient while still effectively capturing temporal dependencies in sequential data [15].

Transformers:
- Like RNNs, Transformers are designed for sequential data. The difference is that they use a self-attention mechanism to process data without recurrence. This enables parallelization for more efficient training and has shown improved performance on tasks like natural language processing [16].

Deep learning architectures relevant to this thesis will be reviewed in greater detail in Chapter 2.

1.1.3 Machine Learning in Drug Discovery

To date, machine learning has been researched as an intervention at almost all stages of the drug discovery pipeline. Many of these approaches utilize the discriminative approach of machine learning, where the models are classifying a trait or predicting some key indicator [10]. This could be predicting drug toxicity [17], solubility [18], [19], binding potential for a target [20], [21], [22], or even processing high-throughput screening images to predict candidate molecule effectiveness in assays [23]. The largest area of discriminative machine learning in drug discovery is quantitative structure-activity relationships (QSARs) and quantitative structure-property relationships (QSPRs) [24], [25], [26]. These methodologies can be used to search an already established chemical library, which is advantageous considering these compounds are known and synthesizable. However, this approach only evaluates a subset of the chemical space.
To complement discriminative approaches to machine learning, an exciting area of recent research has emerged to leverage the generative approach of machine learning. Where previous methods were concerned with identifying if a compound may have therapeutic properties, generative models seek to flip this flow and generate novel compounds given a particular property set or desired feature [10], [27], [28], [29]. Such a method carries more overhead, considering generated compounds still need to be synthesized; however, it allows for exploration of the chemical space through the generation of novel compounds.

1.1.4 Generative Deep Learning

Generative machine learning has risen in popularity as an innovative sub-field of deep learning where the goal of the model is to generate new data that resemble the training data. Generative deep learning has gained popularity recently with its advancements in text and image generation [30], [31]. In drug discovery, these models are particularly useful for generating novel molecular structures and optimizing these created structures for drug-like properties.

In the compound generation model space, a large number of models are recurrent neural networks, specifically long short-term memory models (LSTMs) or gated recurrent unit models (GRUs) [28], [32], [33], [34], [35], [36], [37], [38], [39]. Other notable models are Variational AutoEncoders/HeteroEncoders (VAEs/HEs) [27], [40], [41], [42], [43], [44], [45], [46], [47], Generative Adversarial Networks (GANs)/Adversarial AutoEncoders (AAEs) [48], [49], [50], [51], [52], [53], [54], [55], [56], and Transformers [57], [58], [59]. These architectures are often paired with a reinforcement learning element in studies to optimize the generated molecules for a variety of properties [48], [60], [61], [62], [63], [64], [65], [66].

For the compound generation methods above, additional steps are needed to generate compounds for a specific target. One option is to train or fine-tune the model on a subset of molecules that are known inhibitors of a target [35], [36]. The model can be conditioned on a property [40]. Or, the model can be reinforced for generating compounds that bind to a specific target [56]. A broader approach to this specific conditioning is general conditioning. Rather than being conditioned to some specific target or property, the model can take some condition as input and generate compounds relevant to that condition without further training or optimization. That condition can be chemical properties [43], [45], [47], [53], [57], [67], protein target information [59], [65], [68], or gene expression data [51], [66]. The motivation of a generally conditioned model is to potentially generate compounds for targets that may lack or have minimal training data available. Chapter 2, the Literature Review, will go into a more detailed exploration of some of the cited work above.

1.2 Problem Statement

In the drug development pipeline, one of the earliest goals is to identify a drug target. Much of the time this comes down to a specific protein or a set of protein targets. The next step is the search for a candidate drug compound that can bind and inhibit the chosen target [10], [69]. Currently, virtual screening techniques are reliant on having data sets of known compounds that already bind the protein and/or having detailed structural information about the target [69]. For new emerging targets, such as pandemic viruses, these techniques can be rendered ineffective.
The development of generative models that can generate compounds with only non-structural protein data can allow for the identification of lead compounds without any prior knowledge of what compounds interact with the target, or of the structure of the target. Of particular interest to us is the use of amino acid sequences of targets for the generation of potential binding molecules. An advantage of amino acid sequences is the ability to avoid the computational complexity that comes from working with protein 3-dimensional structures.

There are three models, to my knowledge, that attempt to tackle this issue. CogMol from Chenthamarakshan et al 2020 [68] is a model that generates potential binding molecules for a given target by utilizing two models in tandem: a model to generate compounds and a model to predict affinity. The affinity model is then used to fine-tune molecule generation and is trained using pretrained protein embeddings from UniRep [70]. A downside of this is the need to fine-tune the model for each target. Ghanbarpour et al 2020 [65] attempted this through the use of RNNs and reinforcement learning techniques for fine-tuning. The model also makes use of a pretrained protein embedding network, ELMo [71]. Grechishnikova 2021 [59] attempted this with the application of transformers, which take amino acid sequences as context to generate potential binders.

Ghanbarpour et al 2020 [65] and Grechishnikova 2021 [59] are unique in being the two models that attempt to create a single all-in-one model that takes in amino acid sequences as input and outputs potential binding compounds. This is to say, a model that does not need to do any optimization when given an amino acid sequence and can simply produce molecules directly, in a deterministic manner. I will be tackling this problem by utilizing transformers similar to Grechishnikova 2021 [59] with the additional improvement of new protein embeddings, similar to those utilized in Ghanbarpour et al 2020 [65].

1.3 Thesis Structure

The remainder of the thesis will have five chapters: Literature Review, Thesis Statement, Methodologies, Results and Discussion, and Conclusions and Future Research.

Chapter 2 - Literature Review:
- This chapter will provide a high-level background into the mathematics of deep learning, neural networks, recurrent neural networks, and transformers needed for understanding the proposed work and cited literature. The section will also review how biological data is represented for work in machine learning and review in detail the application of deep generative models in drug discovery.

Chapter 3 - Thesis Statement:
- This chapter will present the primary models to be explored in the thesis, their rationale, and any expectations or hypotheses.

Chapter 4 - Methodologies:
- This chapter will review in detail the structure of each of the models, the hyperparameters chosen, the training process for the models, the compound generation process, and the evaluation criteria. In addition to exploring the models' designs, the source data for the project will be introduced.

Chapter 5 - Results and Discussion:
- This chapter will review the results primarily through figures and tables with some notes on apparent trends and provide discussion on the findings. The results will review the models' training results, the chemical feasibility of the generated compounds, the physiochemical properties of the generated compounds, and how well the compounds approximate true binders.
Chapter 6 - Conclusions and Future Research:
- This chapter will review the results and their potential meanings and implications in detail. It will also review potential shortcomings and limitations of the work, in addition to suggestions for future work and improvements, and will conclude with a final statement.

Chapter 2 - Literature Review

2.1 Deep Learning

This work will make many references to deep learning and some specific deep learning architectures. In this section we hope to introduce neural networks and the associated architectures important for understanding this work.

2.1.1 Neural Networks

Neural networks are a machine learning technique that utilizes multiple layers of functions applied to some input vector, with these functions having parameters that can be learned [12]. The first function applied to the input vector is the first layer, the second function is the second layer, and so on. It is common convention to name the first layer the input layer, the last layer the output layer, and all layers in between hidden layers.

Neural networks are often defined through a fundamental unit called a perceptron. The perceptron is a node that represents a linear function wrapped with an activation function. Figure 1 shows an example of a perceptron where W_i and b are learnable parameters, and the sigmoid function is the activation function. The activation function is a non-linear function that is applied to the output of the perceptron. The sigmoid function is often used as an activation function as it maps the linear function's output between 0 and 1.

If we define x as the vector [x_1, x_2]^T and W as [W_1, W_2]^T, then a perceptron can be defined through a singular function:

    f(x) = σ(x^T W + b).    (1)

Note, the input vector can be any size; two is used as an example. A layer is then defined as a function that takes in an input vector of shape (n, 1) and applies k perceptrons concurrently to produce an output of shape (k, 1). This makes up the fundamental unit of neural networks, and all abstractions past this point vary primarily in the structure of these perceptrons and techniques for how the models learn.

Figure 1. Visual Example of Perceptron

2.1.2 Recurrent Neural Networks

A recurrent neural network is a neural network that is fed sequence data recursively [12]. This means that each input x_t depends on information from x_{t-1}. The RNN's structure at a fundamental level is a normal neural network where the input is an input vector x_t and a hidden vector h_{t-1}. These are put through the neural network and two things are output: y_t and h_t. y_t is the prediction for the next character in the sequence; h_t is the hidden vector. The next step of the process uses the same network and now puts in h_t and y_t as inputs. y_t is essentially the predicted x_{t+1}. In training it is popular to feed in the real x_{t+1} value, which is called teacher forcing [72]; however, when generating sequences after training, y_t is fed into the next step as the input.

Figure 2. Segler et al 2018 [35] "Symbol generation and sampling process. We start with a random seed symbol s1, here c, which gets converted into a one-hot vector x1 and input into the model. The model then updates its internal state h0 to h1 and outputs y1, which is the probability distribution over the next symbols. Here, sampling yields s2 = 1. Converting s2 to x2 and feeding it to the model leads to updated hidden state h2 and output y2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired.
In this example, we stop it after observing an EOL (\n) symbol, and obtain the SMILES for benzene. The hidden state hi allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later."

Figure 2 visualizes this process for generating a sequence of SMILES characters. RNNs are described by the following equations:

    h_t = G(W_h · [x_t, h_{t-1}] + b_h),    (2)
    y_t = W_hy h_t + b_y.    (3)

Where x_t is the input, y_t is the output, and h_t is the hidden state. The W's are weight matrices and the b's are biases. G is the activation function. Note that the bracket and comma notation "W_h · [x_t, h_{t-1}]" means two separate weight matrices are multiplied and added, which translates to "W_hx x_t + W_hh h_{t-1}".

Long Short-Term Memory (LSTM)

The long short-term memory cell was originally suggested by Hochreiter and Schmidhuber 1997 [14]. Regular RNNs must pass previous sequence information through a single hidden vector. This makes it difficult to hold onto long-term contextual sequence information. The LSTM cell allows for long-term memory storage using a context vector. As originally described, an LSTM cell is given by the following equations:

    f_t = σ(W_f · [x_t, h_{t-1}] + b_f),    (4)
    i_t = σ(W_i · [x_t, h_{t-1}] + b_i),    (5)
    C̃_t = tanh(W_C · [x_t, h_{t-1}] + b_C),    (6)
    C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t,    (7)
    o_t = σ(W_o · [x_t, h_{t-1}] + b_o),    (8)
    h_t = o_t ⊙ tanh(C_t).    (9)

Where σ is the sigmoid function, and f_t, i_t, C̃_t, C_t, and o_t are the forget gate, the input gate, the temporary context vector, the context vector, and the output gate, respectively. Equations (4) and (5) are similar to equation (2) for a regular RNN. As can be seen in (7), f_t and i_t are used to determine how much of the previous context vector and the temporary context vector will be added to make the new context vector. In (9) we see o_t dictating what of the new context vector should be passed as the hidden state. Overall, the RNN maintains its structure, but with the LSTM cell two vectors are output: the hidden state vector and the context vector.

Gated Recurrent Unit (GRU)

GRUs were described in Chung et al. 2014 [15]. The GRU cell, like the LSTM cell, has the goal of passing both long-term and short-term memory; however, GRU cells store long- and short-term memory in one hidden state. GRU cells can be described by the following equations:

    r_t = σ(W_r · [x_t, h_{t-1}] + b_r),    (10)
    z_t = σ(W_z · [x_t, h_{t-1}] + b_z),    (11)
    h̃_t = tanh(W_h · [x_t, (r_t ⊙ h_{t-1})] + b_h),    (12)
    h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t.    (13)

Where r_t, z_t, and h̃_t are the reset gate, update gate, and temporary hidden state, respectively. It can be seen in (12) that r_t is being used to determine what portions of the previous hidden state should be included in determining the temporary hidden state. Then in (13), z_t is used to determine what proportions of the previous hidden state and the temporary hidden state should be combined to form the new hidden state.

Bidirectionality and Layers

Bidirectional RNNs describe a similar structure to regular RNNs. The difference is that two RNNs are generated, each of which reads the sequence in the opposite direction [73]. All calculations are the same except that in the final output the hidden states are concatenated together. This does not change any equations. Bidirectional RNNs are often used to encode a sequence and are not used to generate. In a similarly trivial fashion, RNN layers describe constructing multiple RNNs that feed into each other. In this case, the first RNN will calculate an output y_t that will be fed into another RNN as an input.
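To make the gating equations concrete, below is a minimal sketch of a single LSTM cell step following equations (4) to (9), written in NumPy. It is an illustration only, not the implementation used in this thesis; the weight layout, dictionary keys, and dimensions are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following equations (4)-(9).

    W is a dict of weight matrices, each of shape (hidden, input + hidden),
    and b a dict of bias vectors of shape (hidden,). These names are
    illustrative choices, not the thesis implementation.
    """
    # Concatenate input and previous hidden state: [x_t, h_{t-1}]
    xh = np.concatenate([x_t, h_prev])

    f_t = sigmoid(W["f"] @ xh + b["f"])        # forget gate, eq. (4)
    i_t = sigmoid(W["i"] @ xh + b["i"])        # input gate, eq. (5)
    c_tilde = np.tanh(W["c"] @ xh + b["c"])    # temporary context vector, eq. (6)
    c_t = f_t * c_prev + i_t * c_tilde         # new context vector, eq. (7)
    o_t = sigmoid(W["o"] @ xh + b["o"])        # output gate, eq. (8)
    h_t = o_t * np.tanh(c_t)                   # new hidden state, eq. (9)
    return h_t, c_t

# Toy usage with random parameters
rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16
W = {k: rng.normal(scale=0.1, size=(n_hidden, n_in + n_hidden)) for k in "fico"}
b = {k: np.zeros(n_hidden) for k in "fico"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_cell_step(rng.normal(size=n_in), h, c, W, b)
```

In practice, deep learning frameworks provide optimized LSTM and GRU layers; the point of the sketch is only to show that each gate is a small perceptron-like function of the concatenated input and previous hidden state.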
In this case, the first RNN will calculate an output 𝑦𝑡 that will be fed into another RNN as an input. 2.1.3 Transformers The transformer architecture was originally introduced by Vaswani et al 2017 [16]. This architecture started a revolution in the processing of unstructured sequence data, namely textual data. Unlike recurrent neural networks (RNNs) and their variants long-short term memory (LSTM) networks and Gated Recurrent Units (GRUs), transformers do not rely on sequential data processing. Instead, they leverage a mechanism known as self-attention to process all elements of the input sequence simultaneously, which allows for more efficient computation and better handling of long-range dependencies. Transformers have an encoder-decoder structure where both the encoder and decoder have an almost identical structure composed of two main components, multi-head self-attention mechanisms and a position-wise feed forward network [16]. Attention The purpose of the attention mechanism is to allow the model to focus on different parts of sequences simultaneously. Attention is described by the following equation: 𝑄 ′𝐾′𝑇 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄′, 𝐾′, 𝑉′) = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 ( ) 𝑉′. √𝑑𝑚𝑜𝑑𝑒𝑙 (14) Where 𝑄′, 𝐾′, and 𝑉′ are the query, key, and value matrices, respectively. 𝑄′ has a shape of (𝑛, 𝑑𝑚𝑜𝑑𝑒𝑙 ) where 𝑛 is the length of the sequence and 𝑑𝑚𝑜𝑑𝑒𝑙 is the size of the embedding. 𝐾′ 16 and 𝑉′ both have a shape of (𝑚, 𝑑𝑚𝑜𝑑𝑒𝑙 ) where 𝑚 is the length of the sequences. 𝑄′, 𝐾′, and 𝑉′ are calculated from a linear transformation performed on three candidate sequences, 𝑄, 𝐾, and 𝑉. 𝑄 ′ = 𝑄𝑊𝑄 , (15) 𝐾 ′ = 𝐾𝑊𝐾 , (16) 𝑉 ′ = 𝑉𝑊𝑉 . (17) Where 𝑄 has shape (𝑛, 𝑑𝑚𝑜𝑑𝑒𝑙 ), 𝐾 and 𝑉 both have a shape of (𝑚, 𝑑𝑚𝑜𝑑𝑒𝑙 ), and 𝑊𝑄 , 𝑊𝐾 , and 𝑊𝑉 have a shape of (𝑑𝑚𝑜𝑑𝑒𝑙 , 𝑑𝑚𝑜𝑑𝑒𝑙 ). In Self-Attention 𝑄, 𝐾, and 𝑉 are all the same sequence. In Cross-Attention 𝐾 and 𝑉 are the same sequence, and 𝑄 is a different sequence. To control where attention is applied, Masked-Attention can be implemented. Masked-Attention can be defined with the following: 𝑀𝑎𝑠𝑘𝑒𝑑𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄 ′, 𝐾 ′ , 𝑉 ′ , 𝑀) = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 ( 𝑄 ′𝐾′𝑇 √𝑑𝑚𝑜𝑑𝑒𝑙 + 𝑀) 𝑉′, (18) Where 𝑀 is a matrix that contains −∞ where the attention is chosen to be masked and 0′𝑠 elsewhere. For example, when preventing paying attention to tokens ahead in the sequence in Masked Self-Attention, 𝑀 [𝑖, 𝑗] = −∞ for 𝑗 > 𝑖 is applied. Multi-Headed Attention extends attention to multiple subspaces using the following equations: 𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑(𝑄, 𝐾, 𝑉) = 𝐶𝑜𝑛𝑐𝑎𝑡(ℎ𝑒𝑎𝑑1 , ℎ𝑒𝑎𝑑2 , … , ℎ𝑒𝑎𝑑ℎ )𝑊𝑂 , (19) ℎ𝑒𝑎𝑑𝑖 = 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄𝑊𝑄𝑖 , 𝐾𝑊𝐾𝑖 , 𝑉𝑊𝑉𝑖 ). (20) Where the concatenated heads have a shape of (𝑛, ℎ ∙ 𝑑𝑚𝑜𝑑𝑒𝑙 ) and 𝑊𝑂 has a shape of (ℎ ∙ 𝑑𝑚𝑜𝑑𝑒𝑙 , 𝑑𝑚𝑜𝑑𝑒𝑙 ). 𝑊𝑂 helps project the concatenated heads back into the 𝑑𝑚𝑜𝑑𝑒𝑙 dimensional 17 space. The additional weighted matrices, 𝑊𝑄𝑖 , 𝑊𝐾𝑖 , and 𝑊𝑉𝑖 , are unique to each head and learn to perform a unique transformation on 𝑄, 𝐾, and 𝑉 for that attention head. This allows the transformer to capture different types of relationships in different subspaces [16]. Positional Encoding The transformer does not process the sequence sequentially so the concept of order and distance within the sequence needs to be added. To accomplish this the positional encodings are added to the inputs of the model. These encodings are calculated using sine and cosine: 𝑝𝑜𝑠 𝑃𝐸(𝑝𝑜𝑠,2𝑖) = 𝑠𝑖𝑛 ( ), 100002𝑖⁄𝑑𝑚𝑜𝑑𝑒𝑙 𝑝𝑜𝑠 𝑃𝐸(𝑝𝑜𝑠,2𝑖+1) = 𝑐𝑜𝑠 ( ). 100002𝑖⁄𝑑𝑚𝑜𝑑𝑒𝑙 (21) (22) Where 𝑑𝑚𝑜𝑑𝑒𝑙 is the dimension of the embedding, 𝑝𝑜𝑠 is the position of the embedding vector within the sequence and 𝑖 is the dimension within the embedding. 
Transformer Encoder and Decoder Structure

The Encoder and Decoder of a transformer have very similar structures. The Encoder takes in an input sequence, calculates an embedding, adds the positional encoding, calculates the Multi-Head Self-Attention, and passes the results through a feed forward network. The embedding layer's dimension is d_model and the feed forward layer's dimension is d_ff. For the Decoder the same initial steps are taken, except the Multi-Head Self-Attention is masked. The Decoder also has a second attention process that is Cross-Attention, where K and V are the outputs of the Encoder and Q is from the Decoder. If there are multiple layers in the Encoder and/or Decoder, the output of the feed forward network in the previous layer is passed to the next layer with the same structure. The Encoder-Decoder structure is outlined in Figure 3.

Figure 3. Vaswani et al. 2017 [16] "The Transformer - model architecture"

2.1.4 Sequence Generation Methods

For deep learning models there are different methodologies for generating token sequences from the model. Three common generation methods are argmax sampling, random sampling with temperature, and beam search. These models output a discrete probability distribution over the possible next token in a sequence. In argmax sampling, the most probable token is selected. In random sampling with temperature, this distribution is sampled from [74]. The temperature is a mechanism to scale the probability distribution. Given a discrete probability distribution p and a temperature T, the new discrete probability distribution p' is defined as:

    p'_i = p_i^(1/T) / Σ_{j=1}^{n} p_j^(1/T).    (23)

As can be seen in equation (23), as T approaches infinity, the larger probabilities are scaled down much more than the smaller probabilities, flattening the distribution and making all tokens equally likely. As T approaches 0, the higher probabilities are scaled up much more than the lower probabilities, making the sampling approach argmax sampling.

In beam search, a beam size is determined. For a beam size of n, the top n most probable tokens are sampled at each step. For the first step n tokens are sampled, then on the second step each of those n tokens generates k candidate tokens, where k is the size of the vocabulary. This creates n · k token sequences to choose from. The probabilities of the first token and the second token are multiplied to get the probabilities for the two-token sequences, and only the top n are kept. This is then repeated for the third token and so on until a max length is hit or all n token sequences hit an end token. This methodology of beam search is known as max-probability based beam search [75]. A downside of this methodology is that it can penalize longer sequences. To correct this, length normalization can be applied [75]. When using an implementation with normalization, the selection of sequences then depends on some scoring function. In Wu et al. 2016's [75] implementation, the top n scoring sequences are explored in each step. If any sequences reach an end token they are stored, and the remaining top n sequences are explored until either all sequences reach an end token or a max sequence length is reached [75]. This implementation of beam search is discussed further in Chapter 4.
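Below is a minimal sketch of argmax sampling and random sampling with temperature as defined in equation (23). The next-token distribution here is made up purely for illustration, and beam search is omitted for brevity.

```python
import numpy as np

def apply_temperature(p, T):
    """Rescale a discrete distribution p by temperature T, following eq. (23)."""
    scaled = p ** (1.0 / T)
    return scaled / scaled.sum()

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])   # model's next-token distribution (illustrative)

greedy = int(np.argmax(p))                               # argmax sampling
cool = rng.choice(len(p), p=apply_temperature(p, 0.5))   # T < 1 sharpens toward argmax
hot = rng.choice(len(p), p=apply_temperature(p, 2.0))    # T > 1 flattens the distribution
```

Repeating the sampling step token by token, and feeding each sampled token back into the model, yields the generation loop described for RNNs in Figure 2 and used by the transformer decoder at inference time.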
2.2 Representation of Biological Data

2.2.1 Compound Representations

The most common representation of molecules that people are familiar with is the molecular drawing, where atoms are represented by alphabetic characters and bonds between atoms are denoted with lines connecting them. These representations are called Lewis structures (Figure 4) [76]. Although simply utilizing molecular images is possible [77], for most computational problems an appropriate digital representation of the molecule is preferred.

Figure 4. Example of Lewis Structure Representation.

There are multiple goals when developing and deciding between chemical representations. Two of the most important qualities of representations are uniqueness and ambiguity. A chemical representation is considered unique if for every molecule it only produces one valid representation. A chemical representation is non-unique if for one molecule it can generate many representations. A representation is considered ambiguous if one representation can represent many molecules, and it is considered unambiguous if each representation can only represent one molecule. In simpler terms, uniqueness describes a one-to-one relationship from molecule to representation, and non-uniqueness represents a one-to-many relationship from molecule to representations. In reverse, ambiguity represents a one-to-many relationship from representation to molecules, while non-ambiguity represents a one-to-one relationship from representation to molecule [76]. Lewis structures under these categories would then represent a unique, non-ambiguous representation, since a drawn structure can only represent one molecule, and one molecule only has one correct drawing, after accounting for any rotational invariance [69].

Figure 5. David et al 2020 [76] "Example graph representation for acetic acid. a Graph representation of acetic acid with nodes numbered from one to four. b Example adjacency matrix, A, for an acetic acid graph with the corresponding node ordering on the left. c Example node features matrix, X, which one-hot encodes a few selected properties. d Example edge features matrix, E, where each edge feature vector is a one-hot encoding of single, double, or triple bonds. "Implicit Hs" stands for the number of implicit hydrogens on a given node."

To create suitable representations of molecules from structural drawings, researchers early on decided to model compounds as graphs. In this modelling, atoms are viewed as nodes in a graph and bonds are considered edges. Once a graph is constructed, opportunities for mapping the molecule to different representations become possible. The simplest representation to use in the case of a molecular graph is an adjacency matrix, where each index maps to a specific atom in the molecule (Figure 5) [76]. This representation has the advantage of being inherently numerical and has the extensibility to add additional information about the molecule (Figure 5). This information can include properties of the atoms as well as properties describing the bonds between atoms. Another advantage is that all sub-graphs within these representations are interpretable [76]. A representation like this is also unique and unambiguous. A significant disadvantage of this representation is the cost of space and its sparseness. This can lead to potential memory limitations and computational bottlenecks when applied to large problems [76]. A linear chemical representation can circumvent these limitations.
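As a brief illustration of the graph view described above, the following sketch builds the adjacency matrix and simple node and edge features for acetic acid using RDKit (assuming RDKit is installed); it is a simplified example rather than a full featurization pipeline.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)O")          # acetic acid, as in Figure 5

# Adjacency matrix: entry (i, j) is 1 if atoms i and j are bonded
A = Chem.GetAdjacencyMatrix(mol)

# Simple per-atom (node) features, analogous to the node feature matrix in Figure 5
node_features = [
    (atom.GetSymbol(), atom.GetTotalNumHs(), atom.GetFormalCharge())
    for atom in mol.GetAtoms()
]

# Per-bond (edge) features: the bond type for each edge in the molecular graph
edge_features = [
    (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), str(bond.GetBondType()))
    for bond in mol.GetBonds()
]
```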
The most prominent linear representation of molecules is the Simplified Molecular Input Line Entry System (SMILES). SMILES strings are constructed by traversing over the graph representation of a molecule in a depth-first manner. Atoms can be represented by their alphabetic symbols, and additional characters can be used to handle geometric information. A negative of this representation is that it loses the extensibility that graph representations have, and a SMILES substring cannot be guaranteed to be interpretable [76]. However, this representation is advantageous for its compact form. Any depth-first traversal of a molecular graph can generate a valid SMILES string. For this reason, SMILES strings on their own are non-unique. But any SMILES string can only be decoded into one molecule, making the representation non-ambiguous. To correct for the non-uniqueness of SMILES, several canonicalization algorithms have been developed with traversal rules that construct unique SMILES strings [76] (Figure 6). There are many existing canonicalization algorithms, but the most prominent of these is the open-source implementation in the Python chemical toolkit RDKit [78]. The non-uniqueness of SMILES in some circumstances can be an advantage, as shuffling SMILES into multiple valid configurations has been utilized as a data augmentation technique [79]. For generative deep learning tasks, a prominent issue with the use of SMILES strings is the frequent low success in generating valid molecules [80], [81]. For a SMILES string, if a single character is wrong the whole string can be invalid, resulting in no interpretable molecule being formed.

Figure 6. David et al 2020 [76] "Canonical (a) and randomized (b) SMILES representations of aspirin. Randomized SMILES correspond to the various representations of a molecule obtained by randomly selecting the starting node in the graph traversal algorithm, thus changing the order of the nodes traversed in the molecular graph (still using depth-first search). Numbers represent the order of graph traversal, where 1 is the initial node (user defined). Considering a as being the canonical representation of aspirin, b shows a different ordering of the atoms of the molecule. The final SMILES is one possible SMILES among all the randomized SMILES which can be generated. Green arrows indicate how the molecular graph is traversed. Both SMILES strings shown represent the same molecule but, as the atom numberings are different, the generated SMILES strings are, too."

Krenn et al 2020 [82] addressed this issue by developing a linear compound representation called Self-Referencing Embedded Strings (SELFIES). In their paper they describe an algorithm that can systematically construct a completely syntactically and semantically robust grammar from a set of provided example words, where "word" means a sequence of characters from a defined alphabet [82]. Using this method, the researchers were able to learn a grammar using 72 million molecules from PubChem. This learned grammar allows for the translation from SMILES strings to SELFIES strings and guarantees that any order of characters in the SELFIES alphabet can be decoded into a valid SMILES string, and therefore a valid molecule [82].
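To illustrate these string representations, the sketch below canonicalizes a SMILES string with RDKit, generates a randomized (non-canonical) SMILES, and converts to and from SELFIES with the selfies package. Both packages are assumed to be installed, and the aspirin SMILES is only an example input.

```python
from rdkit import Chem
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"            # aspirin (example input)
mol = Chem.MolFromSmiles(smiles)

# Canonical SMILES: RDKit's canonicalization yields one unique string per molecule
canonical = Chem.MolToSmiles(mol)

# Randomized SMILES: a different atom ordering gives a different, equally valid string
randomized = Chem.MolToSmiles(mol, canonical=False, doRandom=True)

# SELFIES round trip: every SELFIES string decodes to a valid SMILES string
selfies_string = sf.encoder(smiles)
decoded_smiles = sf.decoder(selfies_string)

# The decoded SMILES may differ textually but should describe the same molecule
roundtrip = Chem.MolToSmiles(Chem.MolFromSmiles(decoded_smiles))
print(canonical == roundtrip)
```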
Another common class of compound representations is feature-based representations, namely structural keys and molecular fingerprints [83]. These representations are important for many common metrics evaluating compound properties [80], [81]. Structural keys are a representation where a specific set of chemical substructures is scanned for in a molecule and the number of occurrences of each substructure is counted and held in an array where each index represents a specific substructure. This representation is unique but ambiguous [76]. The theory behind such an implementation is that these substructures hold important chemical activity, so recording their presence and occurrences is enough to capture the chemical essence of a compound in a compact and simple form factor [76].

Molecular fingerprints are hashed representations of molecules, with the most popular being Morgan fingerprints [76]. These representations are constructed by first defining a maximum path length. Then, for 0 to the maximum path length, every possible path in the molecular graph is generated as a string. The set of these strings is then hashed into an array of some predetermined size. Additions into the array are handled by an "exclusive or", meaning any even number of collisions results in a 0 bit and any odd number of collisions results in a 1 bit. This makes the representation ambiguous and potentially non-unique [76]. However, the advantages of such a method are that it is quick to calculate, it is uniform in size, and it uses a whole representation of the molecule to determine a specific hash rather than predefined substructures as in the structural key representation. Its structure also makes it possible to perform simple distance calculations for determining how similar two molecules are [76]. Both structural keys and molecular fingerprints are popular for comparing molecules with a variety of metrics [80], [81].

2.2.2 Protein Representations

Proteins, by definition, are still compounds and, in theory, can be represented with the aforementioned techniques. However, because of their size, such representations are impractical. For example, drug-like molecules often have under 50 atoms, while proteins can have on the order of thousands of atoms [76]. Although large, proteins have the advantage that they often have a specific three-dimensional structure. Because of this, a prominent and practical representation for proteins is a simple coordinate system that defines the position of each atom with an x, y, z coordinate in atomic measurement units (Angstroms) [76]. The Protein Data Bank (PDB) is a database that contains more than 150,000 protein structures [3]. A downside of this representation of proteins is that 3D data is more complicated to work with from a machine learning perspective. This motivates the development of linear representations of proteins.

Proteins can be described by a single linear sequence of 20 possible amino acids. This amino acid sequence can act as a unique identifier but doesn't hold any structural information. However, it is important to note that each amino acid sequence maps to a specific protein structure [84]. The prediction of protein structures from amino acid sequence information alone is an area of rich research, with AlphaFold2 recently achieving high accuracy in the prediction of protein structure from amino acid data alone [85].
Recently, a large area of study has emerged for applying large language model techniques to the generation of information rich linear protein embeddings from amino acid sequences alone [70], [71], [86], [87], [88]. These models are called Protein Language Models (PLMs). All these PLMs follow a similar methodology. An architecture for a large language model, likely a RNN variant or Transformer, is adapted to take in amino acid sequences and perform next-in-sequence or gap-in-sequence predictions. After performing this task the hidden layers of the model are used to construct a vector representation for the protein [70], [71], [86], [87], [88]. These embeddings have shown success in categorizing amino acid sequences with biological information [70], [71], [86], [87], [88]. Figure 7 shows a plot from Elnaggar et al 2021 [86] for the embeddings developing an unsupervised but biologically relevant structure in the latent space. These protein embeddings can therefore be repurposed into other machine learning tasks. 27 Figure 7. Elnaggar et al 2021 [86] “Protein LMs learned constraints. t-SNE projections visualized information extracted by the unsupervised protein LMs (here best-performing ProtT5-XL-U50; upper row: before training (Random), and lower row: after pre-training on BFD & UniRef50. (A) The left-most column highlights single amino acids by biophysical features. (B) The middle column annotates protein structural class (taken from SCOPe). (C) The right-most column distinguishes proteins according to the kingdom of life in which it is native. Although the random projections on top may suggest some adequacy of the clusters, the trained models shown on the lower row clearly stood out. Incidentally, the comparison of the two also highlighted the potential pitfalls of using t-SNE projections from many dimensions onto 2D: although random, the human might see some correct patterns even in the top row.” 2.3 Deep Generative Models in Drug Discovery Generative deep learning has been applied to the compound generation space for years. In these years multiple model architectures have been explored. In this section we will review some major papers in this research space highlighting the architectures most relevant to this thesis’ goals, RNNs and Transformers. 28 2.3.1 Recurrent Neural Networks Out of all the architectures utilized in generative drug discovery, RNN based models (LSTMs and GRUs) appear to be the most prevalent. Bjerrum et al. 2017 [32] was the first paper to generate SMILES compounds via an LSTM. Although not attempting to generate compounds for drug discovery, this paper showed the feasibility of generating valid, novel, chemically diverse SMILES strings with LSTMs. In the following year multiple papers arose exploring this technique for drug like compounds [35], [36], [37], [65], [67]. Segler et al. 2018 [35] took the method established in Bjerrum et al. 2017 [32] and expanded it specifically for drug discovery. After training an LSTM model to generate valid compound from ChEMBL, they fine-tuned the model on 732 molecules that are active for a specific receptor HT2A. The fine-tuning method was an iterative approach first fine tuning on real binders and then fine tuning on the best molecules generated. Gupta et al. 2018 [36] took this work and expanded it by training on a broader more diverse compound set. Merk et al. 2018 [37] was the first to validate this methodology of fine tuning in vitro. 
In their research they trained an LSTM to generate SMILES compounds and fine-tuned it on retinoid X and peroxisome proliferator-activated receptor agonists. The lab synthesized the five top ranking compounds of which four showed activity in cell-based assays. Kotsias et al. 2020 [67] was one of the first papers in the RNN space to explore context based generation. The goal of this research was to create a model that can generate compounds with a desired set of properties. In training they calculated a set of chemical properties and bioactivities for each molecule and used those values as an initial contextual input in the model 29 during training. This resulted in a model that provided a set of desirable values as inputs could generate a set of compounds that estimate those values. Ghanbarpour et al. 2020 [65] expanded on this concept of conditioning an RNN for compound generation by utilizing a dataset of protein compound binding pairs and training the RNN with protein data for contextual input. The protein context used in this work were embeddings calculated from amino acid sequences using a large protein language model ELMo [71]. 2.3.2 Transformers As Transformers have been used in place of RNNs in natural language problems in recent years, the same is true for RNNs in compound generation. Mandhana et al. 2020 [58] took the RNN approach originally proposed in Bjerrum et al. 2017 [32] and simply expanded it to a Transformer Decoder specific architecture. Bagal et al. 2022 [57] took an approach similar to Kotsias et al. 2020 [67], where they took at Transformer Decoder architecture and utilized compound property information as a conditional input for the model for the generation of compounds sets with desired properties. However, in addition to properties, the model also takes in compound scaffolds as an input SMILES string. In practice that means this model can take in reference molecules and desired properties and make additions to the scaffold that align with the provided properties. Haroon et al. 2023 [89] took this work and expanded on in by applying a different attention mechanism called relative attention. This change made improvements in novelty and validity metrics. Grechishnikova 2021 [59] followed a similar process as Ghanbarpour et al. 2020 [65] where protein information was used as contextual information in 30 compound generation. It diverts from this work, however, by not using pre-trained protein embeddings but rather the raw amino acid sequences as contextual input into the model. 2.3.3 Other Model Architectures Although less relevant to the models used in our research, it is important to recognize that other models outside of RNNs and Transformers have been extensively explored in compound generation to some success. Notably, Variational AutoEncoders/HeteroEncoders (VAEs/HEs) [27], [40], [41], [42], [43], [44], [45], [46], [47], and Generative Adversarial Networks (GANs)/Adversarial AutoEncoders (AAEs) [48], [49], [50], [51], [52], [53], [54], [55], [56]. VAEs are of particular interest because they are the Deep Learning architecture that has been used with the SELFIES compound format [90], [91]. The improved validity of SELFIES aids in the exploration of latent spaces created by VAEs, which when constructed by SMILES creates a space with many invalid molecules [82]. This guarantee correctness also makes it appealing for other usually noisy methods of generation, such as genetic algorithms which introduce random mutations to the output SELFIES [92], [93], [94]. 
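The robustness property that makes SELFIES attractive for noisy generation methods can be illustrated with a small sketch, assuming the selfies package [102]; the mutation loop below is a toy stand-in for a genetic algorithm, not an implementation from the cited works.

import random
import selfies as sf
from rdkit import Chem

random.seed(0)

# Encode a drug-like molecule (aspirin) as a list of SELFIES tokens.
tokens = list(sf.split_selfies(sf.encoder("CC(=O)Oc1ccccc1C(=O)O")))

# Randomly mutate a few tokens, as a genetic algorithm might.
alphabet = sorted(sf.get_semantic_robust_alphabet())
for _ in range(5):
    tokens[random.randrange(len(tokens))] = random.choice(alphabet)

# Any sequence of tokens from the SELFIES alphabet still decodes
# to a syntactically and semantically valid molecule.
mutated_smiles = sf.decoder("".join(tokens))
print(mutated_smiles, Chem.MolFromSmiles(mutated_smiles) is not None)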
2.3.4 Protein Embedding Use in Drug Discovery Some of the models above have used raw amino acid sequences [59] while others have utilized pretrained protein embeddings [65], [68]. The benefit of utilizing pretrained protein embeddings is having access to their immense size. These models can have total trainable parameters ranging from the 18.2 million up to 15 billion [90], [93]. This would make training these models from scratch a computational feat. Luckily, there is a large selection of open source 31 large protein language models available for use [70], [71], [86], [87], [88]. Previous models focused on LSTM architectures [70], [71]. More recent best performing models are transformer based [86], [87], [88]. ESM-2 (Evolutionary Scale Model-2) is one of the newest models, being released in the summer of 2022 [88]. Although protein embeddings have been used in compound generation in the past ESM-2 has not seen this application [65], [68]. However, ESM-2 has been utilized in models for affinity prediction and compound screening [95], [96], [97], [98], [99]. 32 Chapter 3 - Thesis Statement This thesis will explore the use of transformers for the generation of compounds using amino acid sequences as conditioning. The work will expand on the work of Grechishnikova 2021 [59] exploring two major adaptations, the use of transfer learning by using the pretrained protein embeddings from the ESM-2 model [88], and the use of SELFIES strings instead of SMILES strings [82]. The expectation is that the larger training corpus of ESM-2 will allow the model to better generalize between a broader set of proteins, potentially improving the model’s ability to make assumptions on new amino acid sequences and overall improving performance. SELFIES strings are guaranteed to be valid and therefore will increase the validity metrics of the model, but the expectation is they may also help with diversity, a cited benefit of their use in models [82]. The goal is to investigate how these changes effect the performance of a model with the same design and size as Grechishnikova 2021 [59]. In addition to exploring the effects of embeddings and SELFIES we will explore changes in the evaluation process. Grechishnikova 2021 [59] only explores beam search for evaluation, which is a deterministic generation mechanism. We will expand this by also exploring random sampling with temperature as a generation method. We will also attempt to scale the evaluation process. In Grechishnikova 2021 [59] a case study is performed by simulating the binding of a subset of generated compounds to a small selection of proteins. We will attempt to build a binding classifier for the purpose of benchmarking model performance across the entire dataset. For consistency all this work will use the same dataset as Grechishnikova 2021 [59]. 33 34 Chapter 4 - Methodologies This section reviews the methodology of training for models used in compound generation as well as models used in evaluation. And will overview the methods used to evaluate model performance. 4.1 Transformers for Compound Generation For this project four transformer models will be trained and compared in amino acid conditioned compound generation. The four models are: Base model (Base): - Model designed to mimic Grechishnikova 2021 [59]. Trained on SMILES and raw amino acid sequences. Base + SELFIES model (Base+S): - Same as base model but trained with SELFIES instead of SMILES. 
Base + Embedding model (Base+E): - Same as the base model but replaces the Encoder of the transformer with pretrained embeddings from ESM-2.

Base + SELFIES + Embedding model (Base+S+E): - The same as Base+E but with SELFIES instead of SMILES.

All these models were trained on Google's Colab platform with V100 GPUs and developed using TensorFlow 2, version 2.15.0 [100]. Transformers were implemented with the help of TensorFlow's documentation and tutorials [101].

4.1.1 Model Data

The dataset chosen for this analysis was from Grechishnikova 2021 [59]. Grechishnikova 2021 [59] had five train-test splits. Only one split was chosen for this thesis work due to constraints on computational resources and time. No amino acid sequences in the training split are present in the test split. The Needleman-Wunsch similarity distribution between the train and test splits can be seen in Figure 8. Additional cleaning steps were performed on the data following some of the processes outlined in Brown et al 2019 [81]. The additional cleaning steps are as listed:
- The SMILES string was cleared of salts.
- The SMILES string was canonicalized.
- Molecules were filtered to remove any that contained elements outside of H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
- The molecules were neutralized.

This ultimately removed 26 compounds from the dataset, all of which contained an element not found in the allowable group. This compound filtering was performed using the RDKit Python package, version 2023.9.5 [78]. These molecules were then also converted into SELFIES strings using the selfies Python package, version 2.1.1 [102]. After filtering, the dataset was left with 183,928 unique amino acid-compound pairs (1512 unique amino acid sequences) and 9198 unique amino acid-compound pairs (99 unique amino acid sequences) for the train and test set, respectively.

The final processing step for the data was tokenization. For SELFIES strings the unique tokens are already predefined in the grammar. For SMILES strings the unique tokens were the unique elements, charged or neutral. SMILES were tokenized using the selfies package's tokenize_smiles function [102]. For the embedding models, the ESM-2 650M parameter model was used. ESM-2 has models ranging from 8 million to 15 billion parameters [88]. The 650 million parameter model was chosen instead of the higher-parameter models because of memory limitations of the V100 GPUs used in training.

Figure 8. Distribution of Needleman-Wunsch Similarities Between Test and Train Amino Acids.

4.1.2 Model Structure

Encoder and Decoder Structure

The encoders and decoders for the models were modeled identically to the Grechishnikova 2021 specifications [59]. The Encoder and Decoder had the same structure, made up of 4 layers with a feed-forward dimension ($d_{ff}$) of 512, a $d_{model}$ of 128, and 4 attention heads. A dropout rate of 0.1 was applied after the embedding layer, each attention layer, and each feed-forward layer before normalization. The embedding models, Base+E and Base+S+E, do not have this encoder but rather replace it with ESM-2 pretrained embeddings.

Embedding Model

ESM-2 is a transformer model that was originally trained using PyTorch [88]. Since this work is developed in TensorFlow 2, combining the models together and using the pretrained weights directly is infeasible. To accommodate this, all the embeddings were pre-calculated for all the amino acid sequences in the dataset. Each embedding is calculated over the full length of the amino acid sequence, replacing each amino acid in the sequence with a 1280-dimensional embedding vector.
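A minimal sketch of this precomputation step is shown below, assuming the fair-esm package's published interface for the 650M-parameter ESM-2 checkpoint (esm2_t33_650M_UR50D, 33 layers, 1280-dimensional representations); the function and cache names are placeholders rather than the thesis code.

import torch
import esm  # the fair-esm package that distributes ESM-2

# Load the 650M-parameter ESM-2 model and its tokenizer.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed_sequence(sequence: str) -> torch.Tensor:
    """Return a (len(sequence), 1280) matrix of per-residue embeddings."""
    _, _, tokens = batch_converter([("protein", sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    representations = out["representations"][33]
    # Drop the begin/end-of-sequence positions so one row remains per amino acid.
    return representations[0, 1:len(sequence) + 1]

# Precompute once, keyed by the raw amino acid sequence, and store to disk.
embedding_cache = {seq: embed_sequence(seq).numpy() for seq in ["MKTAYIAKQR"]}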
Since ESM-2 is a transformer model, this output is the same shape as what is expected from the Encoder. For this reason, the embeddings can be used as a drop-in replacement for the Encoder and can be passed directly into the Decoder. However, because the precalculated embeddings expand the amino acid sequences into a large second dimension, loading the full dataset into memory is no longer feasible. To accommodate this, we made a custom TensorFlow 2 model layer that functions as a hash map. The layer, EmbedHashLookupLayer, on construction makes a one-to-one mapping from amino acid sequences to embedding representations. At model runtime this layer takes batches of amino acid sequences and returns batches of embedding representations. This approach is beneficial because the embedding representations can be stored in system RAM instead of GPU VRAM, which saves GPU memory. The non-embedding model utilized ~14 GB of VRAM while the embedding model only utilized ~5 GB of VRAM during training. An additional benefit is that the mappings can be passed each time the model is constructed, which means the model only ever needs to be provided with the minimum-sized hash map for the current context, also potentially saving memory. The major downside of this approach is that the ESM-2 model must be used to calculate embeddings for any new amino acid sequences before using the model. In addition, although more resilient to memory issues, there can still be memory errors if the hash map becomes too large. This implementation ultimately allows the embedding models to replace the Encoder with a single EmbedHashLookupLayer layer.

4.1.3 Model Training

The total number of epochs for training was not predetermined and was chosen by observing the progression of the loss. This will be discussed further in Chapter 5 Results and Discussion. The models were trained using a batch size of 72. This batch size was chosen to approximate the number of tokens per batch used in Grechishnikova 2021 [59]. In their work they utilized a batch size of 4096 tokens per batch. The batch size we utilized was estimated using the average compound size to achieve approximately 4096 tokens per batch on average. The V100 GPUs utilized in this work did not have enough memory to process this batch size in full. To accommodate this, we implemented gradient accumulation with 6 mini-batches of size 12. Gradient accumulation is a training process where the batch is split into mini-batches. For each mini-batch a forward pass is made, and the gradient is calculated and accumulated but not applied. On the final mini-batch the accumulated gradient is applied in backpropagation. The accumulation is either the summation or the average of the gradients, depending on whether the loss is normalized [103]. In our case the gradient is averaged across the mini-batches. This saves memory because only 12 examples need to be held in memory for each forward pass, but it doesn't impact training because backpropagation still only occurs every 72 examples, as originally intended.

Adam was the optimizer utilized in training, with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\varepsilon = 10^{-9}$, and a learning rate defined as:

$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right). \quad (24)$$

Where $warmup\_steps$ is 4000. This definition of learning rate and the parameters of the Adam optimizer are identical to Grechishnikova 2021 and the original transformer publication [16], [59].
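As an illustration only, the schedule in equation (24) can be expressed as a Keras LearningRateSchedule, following the pattern of the TensorFlow tutorials cited above [101]; the class name used here is a placeholder and this is not the thesis training code.

import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Learning rate of equation (24): linear warmup, then inverse-sqrt decay."""

    def __init__(self, d_model=128, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                    # step_num^-0.5
        arg2 = step * (self.warmup_steps ** -1.5)     # warmup ramp
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(d_model=128), beta_1=0.9, beta_2=0.98, epsilon=1e-9)

Gradient accumulation then sits on top of this optimizer: gradients from the six mini-batches of 12 are averaged and applied with a single apply_gradients call per effective batch of 72.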
The loss function for training was sparse categorical cross entropy which was masked to not penalize failing to predict padding values. Teacher forcing was used during training [72]. It is important to note that normally the theoretically ideal loss is 0, but in this case the same amino acid sequence can map to multiple different binding compounds. Because of this one-to-many mapping, even a perfectly fitted model cannot achieve a loss of 0. To accommodate this in evaluation the theoretically best loss was calculated for the SMILES and SELFIES training and test sets. This was done by calculating the ideal probabilities for each molecule and amino acid sequence pair. The process for this was for every amino acid sequence to take all of 40 the potential binding molecules and for every possible prefix calculate the appropriate next probability. This is best explained by an example. If an amino acid sequence bound to three compounds “AAC”, “AAB”, and “ABB” then the ideal probability for the first character would be 1/1 “A” since all three compounds start with “A”. Then for the next character after “A” would be 2/3 “A” and 1/3 “B”. Then for the prefix “AA” the next ideal probability would be 1/2 “C” and 1/2 “B”, but for the prefix “AB” it would be 1/1 “B”. Since the model is using teacher forcing in training, we know that for the example “AAC” we would expect and ideal model to guess 1/1 for “A” being the first character, 2/3 “A” and 1/3 “B” for the second character, and 1/2 “C” and 1/2 “B” for the final character. For both SMILES and SELFIES, the ideal loss was calculated to be ~0.11, for both train and test sets. 4.2 Binding Classifier for Evaluation To evaluate the similarity of generated compounds to real binding molecules a classifier was constructed with the goal of distinguishing real binders from random molecules for a given amino acid sequence. This is a similar approach to what was performed in Grechishnikova 2021 [59] where instead they simulated binding affinities for generated compounds and real binders. Constructing a classifier is advantageous because it scales and allows us to evaluate generated molecules without the computational cost of simulating binding for all real, random, and generated compounds. This allows for the evaluation of all generated compounds for all amino acid sequences and not a subset like in Grechishnikova 2021 [59]. Since the transformer models are conditioned to generate SMILES and SELFIES strings that approximate real binding SMILES and SELFIES strings, this method of evaluation is in line with model performance. It is 41 important to note that better performance of generated compounds in the classifier evaluation is not making a claim that these compounds are better binders, just that they better approximate the SMILES/SELFIES strings of true binders. In training the classifier, a delicate feature will be avoiding overfitting, or underfitting. If the model is overfit, it will be too discriminatory of potential binders and only recognize those almost identical to real binders. But if underfit it will not discriminate enough. To account for this the model size was limited, high dropout values were explored, and training was early stopped to prevent overfitting. As a waypoint for appropriate performance, we benchmarked against the simulation results of Grechishnikova 2021 [59]. In their work the best area under the receiver operating curve (AUC) for their simulations was 0.86. 
Therefore 0.86 was used as an approximate target for the classifier model’s AUC score to stop training. 4.2.1 Model Data The dataset used for the evaluation classifier was constructed by taking all unique amino acid sequences from the Grechishnikova 2021 [59] dataset and splitting them into a 50:50 train test split. The 190 amino acids with only one occurrence were randomly split evenly between the train and test split. For the remaining 1421 amino acid sequences that had multiple occurrences were evenly split with their binding pairs into the train and test sets. If the number of binding pairs for an amino acid sequence was odd the left over pair was randomly placed in the either set. This model also made use of ESM-2 for protein embeddings. For this data the 3 billion parameter model was used. The reason the larger model could be used in this instance was because the resulting protein embeddings were reduced via taking the mean across the length of 42 the sequence. This created a single 2560 vector for each amino acid sequence, which did not create memory issues. To train the classifier it is necessary to have negative binding examples. For the data, negative binders were molecules randomly sampled from the training and test set, respectfully. The method of using randomly sampled molecules for negative binding pairs was chosen because distinguishing between random molecules and true binders is the benchmark evaluation in Grechishnikova 2021 [59]. Random sampling does allow for the accidental sampling of true binders as negative examples, but with the size of the dataset this was not a major concern. A simulation of 200 full random samplings showed that in the training and test set the average percentage of collisions per amino acid was 0.12%. This methodology creates a training and test set with almost equal representations of each amino acid sequence in both sets. By not using the same train-test split as the transformer we ensure that the classifier learns to classify for amino acid sequences both in the transformers training set and the test set. Having the same amino acid sequences in the training set and test set was not a concern, considering the goal of the classifier is to perform evaluation on this dataset, and not generalize outside of it. 4.2.2 Model Structure The model is constructed to take in a potential binder SMILES string and an amino acid sequence embedding and return the probability that that molecule is a binder for that amino acid sequence. The structure of the binding classifier was motivated by Song et al. 2023 [104]. In their work they developed an architecture using pretrained protein embeddings from ESM-2 and 43 compounds to predict affinity and interactions [104]. Their model is constructed of three main parts, a model for processing compound graphs constructed from input SMILES strings into a single embedding, a model for processing protein embeddings into a single compressed embedding, and a final feed forward network that takes the Kronecker product of the embeddings (Figure 9). Our implementation simplifies this by using a GRU recurrent neural network for SMILES strings instead of a Graph Neural Network, and by replacing the protein embedding processing model with the mean reduced ESM-2 embeddings and simple single layer dimension reduction feed forward network. The SMILES string processing model has an initial embedding layer with a dimension of 32 and a single layer bidirectional GRU with a dimension of 32. 
The hidden state vectors from each step of the GRU are averaged together to get the final compound embedding. This is similar to the original protein embedding LSTM in Song et al. 2023 [104], as seen in Figure 9. The protein embedding dimensional reduction model was a simple feed-forward network with a dimension of 128. The results of these two models were concatenated with each other, and then concatenated with their Kronecker product, as in Song et al 2023 [104]. For two vectors $A$ and $B$, the Kronecker product is simply the vector $B$ multiplied by every value in $A$ and concatenated into a single vector. The resulting Kronecker product of a $(1, n)$ and a $(1, m)$ vector will be a vector of shape $(1, n \cdot m)$. The concatenated result of the original embeddings and the Kronecker product is passed through a two-layer feed-forward network with dimensions 128 and 64, respectively. Before input into the first feed-forward layer and after each layer a dropout of 0.3 is applied. A last layer of size 1 with a sigmoid activation is used to output the binding probability.

Figure 9. Song et al. 2023 [104] "The framework of PMF-CPI. For a compound-protein pair, the protein sequence is transformed into context embedding via pretrained TAPE and then fed into LSTM. And a compound SMILES is turned into a molecular graph by RDKit and encoded with graph neural networks. After that, two encoding vectors are merged through the Kronecker product, and MLP provides a final output. PMF-CPI can process two task types, regression for CPI binding affinities or classification for interactions"

4.2.3 Model Training

The model was trained on a V100 GPU using Google's Colab platform. The loss for training was binary cross entropy. The Adam optimizer was used with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-7}$ and a learning rate of 0.001. These values were informed by Song et al 2023 [104]. To prevent overfitting, training was to be stopped when the AUC score on the test set approached 0.86, since this is the maximum AUC score found in Grechishnikova 2021 [59] for binding simulations.

4.3 Experiment Setup

For each model three sets of compounds were generated. All generated SMILES were canonicalized using the RDKit package [78]. SELFIES strings were converted to SMILES strings and canonicalized. Two sets were generated using beam search with a beam size of 10, as done in Grechishnikova 2021 [59]. One set from beam search was the 1-1 set, which only kept the most probable compound for each unique amino acid sequence. The other set from beam search was the 1-10 set, which kept all 10 of the compounds for each amino acid sequence. The third set was generated via random sampling with a temperature of 1. For random sampling, 100 molecules were generated per amino acid sequence. For shorthand this set will be referred to as the 1-100 set. More molecules were generated for random sampling because of its lower computational cost compared to beam search. In Grechishnikova 2021 [59] only beam search is explored, but since beam search is a deterministic generation mechanism that generates the more probable molecules, the goal was to use random sampling as a more diverse generation method that also allows us to explore more of the chemical space.

The beam search methodology used in the project was the same as used in the Tensor2Tensor project [105]. This is an adaptation of the beam search methodology used in Wu et al. 2016's previous work, without the coverage penalty:

$$s(Y, X) = \frac{\log\left(P(Y|X)\right)}{lp(Y)}, \quad (25)$$

$$lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}. \quad (26)$$

Where $Y$ is the generated sequence and $X$ is the output of the encoder. $\alpha$ is a hyperparameter for normalization that was set to 0.6, as in Grechishnikova 2021 [59]. The value of 5 comes from the original work of Wu et al. 2016, where it was a tuned hyperparameter found to be optimal at 5 [75].
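A small sketch of the length-normalized scoring in equations (25) and (26) is given below; it is illustrative only and is not the Tensor2Tensor implementation referenced above [105].

import math

def length_penalty(length: int, alpha: float = 0.6) -> float:
    """lp(Y) of equation (26); the constant 5 follows Wu et al. 2016 [75]."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def beam_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """s(Y, X) of equation (25): length-normalized log-probability."""
    return log_prob / length_penalty(length, alpha)

# Two hypotheses with the same total log-probability: the longer one is
# penalized less per token, so the normalization keeps beam search from
# systematically favouring short sequences.
print(beam_score(math.log(1e-6), length=10))
print(beam_score(math.log(1e-6), length=40))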
For appropriate baselines, additional compound sets were added as references. An "Ideal set" was created as a subsample of the test set where 50% of the compounds per amino acid sequence were sampled and treated as generated. The dataset was created 5 times with different random splits and evaluations were calculated on each split to get a baseline for what real binders look like when compared to other real binders. The generated sets were also all shuffled to make "shuffle sets". Shuffled sets were used in evaluation calculations that approximate how well the compounds associate with their amino acid sequences. If shuffled and non-shuffled sets performed similarly, then the model was not generating molecules specific to the amino acid sequences. Lastly, random samples of compounds from the train and test set, or the whole train and test set, were used as references where appropriate.

Each of the generated sets was then put through evaluation. The sets were evaluated on their own for chemical feasibility and distribution of physicochemical properties. They were also compared to the real binder compounds using Tanimoto similarity and classification scores. For comparison, all amino acid sequences in the test set were explored and aggregated in the results. In addition to this, two specific proteins, IGF-1 and VEGFR2, were explored individually. This is a case study to evaluate the performance on specific proteins; these are the same two proteins utilized in Grechishnikova 2021 [59].

4.4 Evaluation

The evaluation performed has two main focuses: evaluating the overall chemical feasibility and the physiochemical properties of the overall compound sets, and evaluating how well the generated compounds approximate true binding compounds for their corresponding amino acid sequences. The code utilized in evaluation was adapted from MOSES, a benchmarking framework for molecular generation models [80].

4.4.1 Chemical Feasibility

Chemical feasibility metrics describe how viable the generated compound sets are. These metrics will be calculated for each model's sets, the ideal set, and random samples from the training set. All these metrics are single quantitative values calculated for a whole set of molecules. The reference sets for this section are the ideal set and random samples from training.

Validity

Validity is defined as the percentage of generated sequences that can be decoded into valid molecules [29], [80], [81]. This metric is important for comparing between models, but also for comparing between random sampling and beam search. Considering beam search generates the compounds of highest probability, there is an expectation that its validity may be higher.

Uniqueness

Uniqueness is simply the number of unique molecules in the generated set divided by the number of valid generated molecules [29], [80], [81]. Uniqueness is important for measuring the diversity of the generated distribution. It is a goal of generative models to produce a diverse chemical space, and this metric assesses whether the model can generate a diverse set of compounds. This metric compares performance between models but can also explore the difference between generation techniques. The expectation is that random sampling may generate better uniqueness.
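As an illustrative sketch of how validity and uniqueness can be computed with RDKit (the actual evaluation code was adapted from MOSES [80]):

from rdkit import Chem

def validity_and_uniqueness(generated_smiles):
    # A generated string is valid if RDKit can parse it into a molecule.
    valid = []
    for smiles in generated_smiles:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))  # compare in canonical form

    validity = len(valid) / len(generated_smiles)
    # Uniqueness: unique canonical molecules over valid molecules.
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness

# "CCO" and "OCC" are the same molecule; the last string is not parseable.
print(validity_and_uniqueness(["CCO", "OCC", "c1ccccc1", "C(C)(C)(C)(C)C"]))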
Novelty

Novelty is the percentage of generated compounds that are not seen in the training data. This is an important metric for ensuring the models are not simply copying molecules from the training set [29], [80], [81], and it can indicate whether any models are overfitting to the training set. If models can approximate the same novelty as the ideal set that would be appropriate, but more novelty can be considered desirable.

Internal Diversity

Internal diversity is a more sophisticated metric for determining how diverse a set of molecules is [80]. It can be described by the following equation:

$$IntDiv_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p}. \quad (27)$$

Where $T$ is the Tanimoto similarity, a well-defined metric of measuring compound similarity [80], $G$ is the generated set of compounds, and $p = 1$. The Internal Diversity metric amounts to one minus the normalized sum (the average) of the Tanimoto similarities over all pairs of molecules in the generated set. Tanimoto similarities will be expanded on in an upcoming section. This method expands on uniqueness because it doesn't stop at evaluating molecules as identical or not, but incorporates a similarity score. The expectation is that generated sets should be as, if not more, internally diverse than the random training set sample and the test set.

4.4.2 Physiochemical Properties

The evaluation of physiochemical properties allows for the evaluation of the potential drug-likeness and drug feasibility of generated compounds. Properties like molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, water-octanol partition coefficient (LogP), rotatable bonds, and total polar surface area (TPSA) are all metrics used to assess the drug-likeness of a compound [106], [107], [108]. Lipinski's rule is a common rule of thumb for assessing drug-likeness and states that an oral compound's molecular weight should not exceed 500 Da, that it should have no more than 5 hydrogen bond donors and no more than 10 hydrogen bond acceptors, and that its water-octanol partition coefficient (LogP) should be less than 5 [106]. It is also understood that the number of rotatable bonds ideally shouldn't exceed 10, and that TPSA values under 140 Å2 and under 90 Å2 are necessary for good absorption and for passing the blood-brain barrier, respectively [107], [108]. In addition to these more fundamental properties of compounds, there are other more abstract metrics that also hold valuable information.

Three more abstract metrics are the Quantitative Estimate of Drug-likeness (QED), the Natural Product likeness score (NP Score), and the Synthetic Accessibility Score (SA Score). QED was originally defined by Bickerton et al 2012 [109]. QED is a value between 0 and 1, where the closer to 1 the more drug-like a compound is. The score is calculated using 8 molecular properties: molecular weight, LogP, number of hydrogen bond donors, number of hydrogen bond acceptors, TPSA, number of rotatable bonds, number of aromatic rings, and number of structural alerts [109]. The distributions of these metrics in known drug-like compounds are used to fit asymmetric double sigmoid functions, which allows a score between 0 and 1 to be generated for each metric describing how far a compound is from the mean for that metric. The weighted geometric mean of these values is then calculated to create the QED score. This allows a single value to be used to approximate drug-likeness [109].
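The Lipinski-style descriptors and QED discussed above are all available in RDKit; the sketch below is illustrative (the cut-off values in the comments restate the rules of thumb cited above) and is not the thesis evaluation code.

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, QED

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

properties = {
    "MolWt": Descriptors.MolWt(mol),                        # Lipinski: <= 500 Da
    "LogP": Descriptors.MolLogP(mol),                       # Lipinski: < 5
    "HDonors": Lipinski.NumHDonors(mol),                    # Lipinski: <= 5
    "HAcceptors": Lipinski.NumHAcceptors(mol),              # Lipinski: <= 10
    "TPSA": Descriptors.TPSA(mol),                          # < 140 (absorption), < 90 (BBB)
    "RotatableBonds": Descriptors.NumRotatableBonds(mol),   # ideally <= 10
    "QED": QED.qed(mol),                                    # 0..1, higher is more drug-like
}
print(properties)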
The NP score was originally developed by Ertl et al 2008 as an estimate of how similar a compound is to a natural product [110]. The theory is that the more like a natural product the more viable a drug candidate may be [110]. The metric usually spans from -5 to 5, with a higher value representing a more natural product. The score is essentially calculated by using the precalculated frequency values for a set of compound fragments. For each fragment found in a compound the score is added to a total and then normalized by the size of the molecule. Fragments found more frequently in synthetic products have a negative score, while fragments found more frequently in natural products have a positive score. Ertl et al 2009 [111] went on to propose another heuristic, the Synthetic Accessibility Score. The SA score of a compound ranges from 1 to 10 with 10 being the hardest to synthesize. The score is calculated in a similar way, utilizing precalculated fragment scores. These fragment 51 scores are influenced by the frequency of these fragments, molecular complexity, size, symmetry, and rare structural motifs. More common fragments with a lower complexity have a lower SA score, while less common fragments with a higher complexity have a higher SA score [111]. A suitable SA score has conflicting citations. Lucio et al 2020 [51] uses 4.5 as a cut-off for considering a molecule synthesizable, while Grechishnikova 2021 [59] uses 6 as a cut-off. Other papers simply report distributions of SA scores for a higher level view [80]. These metrics will be evaluated by comparing their distributions to reference sets. The reference sets will be the full training and test set. These metrics are important to evaluate druglikeness and feasibility so their values will be evaluated on where they align with known cutoffs. How well the generated sets approximate the training and test distribution will also be evaluated. Since no additional property information is being purposely selected for, any deviations from the training and test set can be interpreted as a bias for that model and/or generation method. 4.4.3 Similarity to Known Binders These metrics will specifically be used to evaluate how well the compound’s generated are approximating real binders. In this section, for each amino acid sequence in the test set the metrics are calculated for the generated compounds and the shuffled sets. Tanimoto Similarity Tanimoto similarity is a well defined metric of measuring compound similarity [80]. Tanimoto similarity is calculated from the morgan fingerprints of two compounds and is calculated by dividing the sum of the bitwise-and of the two fingerprints by the sum of the 52 bitwise exclusive or of the two fingerprints. The tanimoto similarity was calculated between each generated compound and each real binder for each amino acid sequence. For each generated compound the mean tanimoto similarity and max tanimoto similarity was recorded. This created two metrics, the mean tanimoto similarity, and the max tanimoto similarity. Each of the generated sets were compared to the ideal set as a reference. The ideal set calculated the tanimoto similarity between two 50/50 splits of the real binders for each test amino acid sequence. To make an empirical comparison the Wasserstein Distance was calculated between the generated sets and the ideal set to determine their similarity. This comparison has been previously utilized compound generation benchmarks [80]. 
The Wasserstein Distance is a measure the amount of work it takes to transform a one distribution to another. The work is proportional to both the distance the probability needs to be moved and the amount of probability. The shuffled generated sets were also compared to the ideal set. This was to test if the compound generation is specific to the amino acid sequences. We would expect the shuffled sets to perform worse than the unshuffled sets if this were the case. Binding Classifier The binding classifier is trained to generate a binding probability for a given amino acid sequence and compound. By generating a classification score for each compound and its corresponding amino acid sequence we can compare the generated sets and shuffled generated sets to the full test set. Similar to tanimoto similarity evaluation we can compare the distributions of the classification scores for the generated set and shuffled generated sets with the test set by calculating the Wasserstein Distance. However, in addition to this comparison we can also 53 evaluate the performance of the generated sets by visualizing the receiver operating characteristic (ROC) curve and calculating the corresponding AUC score. This is the same technique utilized in Grechishnikova 2021 [59]. For the ROC-AUC evaluation three kinds of comparisons will be investigated. - Comparing Generated compounds against real binders. This will evaluate the ability of the generated compounds to approximate real binders. A successful result here would be an AUC score closer to 0.50, equivalent to the classifier randomly guessing. - Comparing Generated compounds against themselves shuffled. If compound generation is amino acid specific, we would expect the classifier to be able to tell the difference between the generated set that is appropriately paired with the right amino acid sequences and the shuffled set. A successful result would be an AUC score closer to 1.0. - Comparing Generated compounds against random shuffled sample from the test set. This will evaluate if the generated compounds are better than randomly sampled compounds from the test set. A successful result would be an AUC score closer to 1.0. As a reference point for all these calculations the AUC score for the test set real binders vs the shuffled test set will be calculated. In addition to performing this evaluation over the full test set, we also will explore two specific proteins as case studies. These proteins are insulin-like growth factor-1 (IGF-1) and vascular endothelial growth factor receptor 2 (VEGFR2). These proteins were chosen because they were also the proteins explored in Grechishnikova 2021’s simulation analysis [59]. The PDB ids explored for IGF-1 were 20J9, 3O23, and 4D2R. For VEGFR2 2P2H and 4ASE were 54 explored. These PDB ids had amino acid sequences that were not in the training set. For the case study the 1-1 generation set will not be used since it will only generate 3 compounds for IGF-1 and 2 for VEGFR2, which isn’t enough to make a meaningful evaluation. The known binders for each of these specific PDB ids were extracted from BindingDB in June of 2023 [112]. 55 56 Chapter 5 - Results and Discussion 5.1 Transformer Training Metrics The models were all trained to 50 epochs. This stopping point was chosen for computational resource reasons and was the point where the trained loss appeared to level off (Figure 10). This plateau occurs at different values for each model. 
Base+S model settled with a loss around 0.38, Base and Base+S+E both settled around 0.30 and the lowest loss was seen for Base+E with a loss around 0.22. Overall, the embedding models performed better than their nonembedding counterparts in both cases, by ~0.10. This could be evidence of the better generalization of the embedding models. However, to be certain this difference persists, more epochs would need to be explored. Overall SELFIES models perform worse than their SMILES counterparts when it comes to loss. This could potentially be explained by the difference in vocabulary sizes. For SMILES there were 79 unique tokens while for SELFIES there were 111 unique tokens. More tokens allow for more opportunities to be wrong or uncertain. The SELFIES grammar is also more complex than the SMILES grammar. For example, a single token in the SELFIES grammar can hold multiple meetings depending on the context of the sequence [82]. This added difficulty may also explain the loss difference. A striking feature of the training curves is the test sets’ losses not improving and mostly degrading continuously over training. The trends in validation loss appear to be opposite to the trend with training loss where embedding models had higher validation loss than their non- 57 Figure 10. Train and Test loss curves for all four models with Ideal loss visualized. embedding counter parts and SELFIE models having higher validation losses than their SMILES counter parts. The overall validation loss being high may be explained by the fact that for both SMILES and SELFIES strings multiple permutations of tokens can create the same molecule. For example, the SMILES BrCCBr and C(Br)CBr are identical molecules that share 0% sequence accuracy. Because of this there is a potential that the compounds that are being generated for the training set are valid, but simply don’t align with the precise expected permutation. A potential hypothesis for how this could cause a higher validation loss over time is that as a model improves, it becomes more certain with the molecule it is deciding to create. This certainty leads to a narrower probability distribution being predicted. Where in the beginning the model is less certain and predicts a much broader probability distribution. If the more certain prediction is narrowing in on a permutation that isn’t the expected permutation, then it will generate a worse loss than a broader prediction which has a better chance of overlapping with the 58 expected permutation. It is uncommon for other compound generation papers to display their loss curves, so it is unclear if this is a common occurrence or an anomaly. The later evaluation steps allow for a clearer picture of model performance. 5.2 Binding Classifier Training Metrics The classifier model was trained for 25 epochs, but the 7 th epoch was chosen as the early stoppage point (Figure 11). This is because it’s AUC score was approaching 0.86 which approximates the best simulation performance in Grechishnikova 2021 [59]. The training showed consistent performance between the train and test set, which gives confidence that the model was not overfitting to binding molecules in the training set. Figure 11. Evaluation Binding Classifier Train and Test AUC scores over training. 59 5.3 Chemical Feasibility In chemical feasibility results the SELFIES models stand out in validity, novelty, internal diversity and uniqueness; only falling to the Base+E model in the 1 to 10 generation for uniqueness (Table 1). 
High performance in internal diversity, novelty, and uniqueness is expected of SELFIES strings, since their propensity for greater compound diversity is a known effect [82]. With this being said, internal diversity is quite consistent between all models and aligns well with the ideal set and random samples (Table 2). The embedding models appear to do slightly better in uniqueness than their non-embedding counterparts, by an average of 3% across generation methods.

Table 1. Chemical Feasibility results for each model and generation set. Bolded values are the max values for that metric and generation set between the models. Ties are both bolded and underlined.

                    Base                 Base+S               Base+E               Base+S+E
                    1-1    1-10   1-100  1-1    1-10   1-100  1-1    1-10   1-100  1-1    1-10   1-100
Validity            92%    86%    79%    100%   100%   ~100%  72%    61%    77%    100%   ~99%   ~100%
Uniqueness          83%    92%    92%    94%    90%    97%    91%    98%    94%    96%    89%    98%
Novelty             87%    92%    97%    91%    96%    99%    78%    87%    95%    92%    96%    98%
Internal Diversity  0.83   0.84   0.87   0.86   0.87   0.89   0.85   0.86   0.87   0.87   0.89   0.89

Table 2. Chemical Feasibility of reference sets. The Ideal Set is made up of random samples from the Test set. Random sets are made up of random samples from the Training set.

                    Ideal Set   Random 1-1   Random 1-10   Random 1-100
Validity            100%        100%         100%          100%
Uniqueness          99%         99%          100%          98%
Novelty             79%         0%           0%            0%
Internal Diversity  0.86        0.86         0.87          0.87

An interesting result is the degradation in validity seen in the baseline embedding model when compared to the baseline model. The 1 to 1 generation validity drops by 20% and the 1 to 10 generation validity drops by 25% (Table 1). There are multiple potential reasons for this. One potential reason is that this model is simply failing to learn proper SMILES grammar and failing to generate valid SMILES strings. However, this is unlikely because the Decoder structure is identical to the base model, which has no trouble generating valid SMILES strings. A second potential reason is that the Base+E model is uncertain about the kind of molecule to generate and is generating a broader probability distribution for token prediction. This could be a sign of the improved protein embedding space providing more context for the model. For the Base model, when given a new amino acid sequence it may only be able to draw an association to a smaller set of the amino acid sequences it was trained on, giving it a narrower scope of potential compounds to generate. If the protein embeddings allow the Base+E model to relate to a broader set of proteins, then this could create a larger scope of potential compounds to generate. The broader scope may then cause the model to become less certain of the patterns it is trying to generate and result in invalid SMILES strings being generated. Essentially, the Base model may have a narrower scope on the kind of molecule it wants to generate, making it more certain, while the Base+E model has a broader scope, making it more uncertain and prone to error. It is possible this would affect beam search more than the random sampling method because beam search functions by selecting sequences that maximize the scoring probability. This cumulative probability score may cause the model to fixate on compound fragments that are chemically valid but put together are not valid SMILES strings. Random sampling, however, isn't constrained by this strict selection process and is only looking at what is the most probable next token.
Since the model has likely learned the general structure of valid SMILES string, even if it is uncertain about the exact molecule it should generate, it may be better at avoiding over fixating on some compound fragment patterns. The uniqueness being lower for the Base model compared to the Base+E model in beam search also supports the idea that the Base model may be narrowly focusing on a smaller chemical space and therefore regenerating the same molecules more frequently. As will be seen in results that will be discussed later, the Base+E model does appear to produce better molecules for the amino acid sequences than the Base model. This provides confidence that the Base+E is generalizing on the protein information better and that the low validity is not impacting performance. The Base+E model also has slightly lower novelty than the Base model, especially for the 1 to 1 generation set. However, for the ideal set the novelty is only 79%. Therefore, even though the Base+E generation set is producing lower novelty values it isn’t producing novelty much worse than a sample of the test set would. In some ways this can be seen as a positive as approximating the ideal set better could be a sign of better approximation of the real binders. A general trend that can be observed for all models is an increase in novelty and internal diversity when going from 1-1 generation, 1-10 generation and 1-100 generation, respectively. The increase in novelty is a more prominent effect and can be expected because 1 to 1 generation is generating the most probable compound, which is more likely to be one from the training set the model was trained on because of inherent bias for the training set. Expanding to 10 molecules 62 allows for slightly more randomness since less probable molecules are being generated, and the random sampling of the 1-100 set allows for even less probable molecules to be generated. This is because in beam search the probability of the whole sequence is being considered where in random sampling only the current token is being considered, and even then, it is being randomly sampled from the token probability distribution. Overall, in comparison to the ideal set the generated compounds are not deviating in a concerning manner with the only point of concern being the Base+E model’s lower validity. 5.4 Physiochemical Properties The molecular weight distributions for generated sets overlap appropriately with the test and train distributions (Figure 12). The generated sets trend on the lighter side. For each generated set the SELFIES models have the largest high molecular weight outliers. All the generated sets have a majority of their compounds are under 500 Da which meets our Lipinski ideal cut off [106]. For beam search, embedding models tend towards lighter compounds on average compared to non-embedding models. For hydrogen acceptors and hydrogen donors a majority of the generated compounds are under 10 and 5 respectfully, which aligns with Lipinski’s cut offs (Figure 13) [106]. Both of these metrics overlap appropriately with the test and train sets, but for hydrogen acceptors tends to be a bit lower. In beam search the embedding models tend to be a bit lower in H acceptors. 63 Figure 12. Distributions of Molecular Weights for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. Figure 13. 
Distributions of the Number of Hydrogen Acceptors and Hydrogen Donors for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. 64 A majority of molecules generated in each set do not have rotatable bonds exceeding 10 which is ideal (Figure 14) [107]. SELFIES seem to have more large outliers for rotatable bonds and consistently have a higher mean than the SMILES models. Overall, the distributions overlap appropriately with the train and test set. Figure 14. Distributions of Rotatable Bonds for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. A majority of generated compounds for all sets are under the LogP cutoff of 5 specified in Lipinski’s rules (Figure 15) [106]. The SELFIES models have the largest outliers for high LogP values compared to the other generated sets and the Base model has a slightly lower mean. Overall, the sets agree with the train and test sets. All the generated sets have a majority of molecules under the 140 A 2 cut off and the average sits close to the 90 A2 mark (Figure 16). The Base+S, Base+S+E, and the Base+E model 65 Figure 15. Distributions of LogP for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. appear to trend to lower TPSA values, especially the embedding models which appear to have averages in the 80-60 A2 range. This trend also matches the patterns seen in molecular weight. The overall overlap with training and test sets are appropriate. QED values in the generated sets seem to be more spread out with means tending higher than the train and test set (Figure 17). Embedding models appear to generate more higher value QED scores suggesting better drug likeness. Base+E seems to consistently have the highest average QED scores regardless of generation method. Base+S consistently is one of the lowest, with the exception of Base+S+E in the 1-100 generation set. The natural product scores of the train and test set actually group quite lower than the generated sets (Figure 18). For the Base and Base+E model the means appear to align closer with 66 Figure 16. Distributions of Total Polar Surface Area (TPSA) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. the train and test set. The SELFIE models appear to deviate from the training and test set with a distribution tending to higher natural products, primarily the Base+S+E model. This deviation may be evidence of the SELFIE model creating a more diverse chemical set. SA scores for Base and Base+E model trended slightly lower than the train and test set (Figure 19). SELFIE models created much more difficult to synthesize compounds. So much so that the averages for the SELFIE models are close to the high outliers for the SMILES models. Overall, the generated sets for all models across all metrics overlap very well. There are many slight trends with differing means, most are small. 
Figure 17. Distributions of Quantitative Estimates of Drug Likeness (QED) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.

Figure 18. Distributions of Natural Product Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.

Figure 19. Distributions of Synthetic Accessibility Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.

Comparing generation methods, the 1-100 sets generally had more outliers and a wider spread than the beam search generation sets. This is expected, since the 1-100 set generates more compounds and therefore has greater potential for diversity. The most striking deviations were in the natural product scores, where the Base+S+E model deviated markedly from the train and test sets, and in the SA score distributions, where the SELFIES models deviated substantially from the train and test sets. Since SA scores take compound complexity into account, this may be a sign of the structural diversity being generated by the SELFIES models. For other metrics the SELFIES models also tended to have larger outliers. This makes logical sense for SELFIES models, since their forced validity means that long, uncertain SELFIES strings are still guaranteed to be valid, unlike long, uncertain SMILES strings, which have a higher likelihood of being invalid and filtered out. It also shows the potential cost of this diversity: hard-to-synthesize compounds that may be impractical.

The tendency of the embedding models toward lower molecular weights, hydrogen acceptor counts, and TPSA values is an interesting pattern. A potential chemical explanation is that these models are favoring fewer functional groups, or a different distribution of functional groups. Functional groups in a molecule contribute to its mass, its hydrogen acceptor profile, and its polar surface area. For example, carbonyl groups (C=O) are a common hydrogen acceptor functional group with a molecular mass of 28 Da, so lowering carbonyl frequency could drop all of these metrics. Although this could explain the deviation, more exploration would be necessary to confirm it, and even if a functional group preference is the practical cause, why the models are making this preference would remain unknown.

5.5 Similarity to Known Binders

5.5.1 Tanimoto Similarities

To evaluate Tanimoto similarities, the distributions of the mean Tanimoto and max Tanimoto metrics are determined for each generated set, the shuffled sets, and the ideal set. The Wasserstein distance is then calculated between each generated set and the ideal set, and between each shuffled set and the ideal set. The lower the value, the more closely the set approximates the ideal set distribution. These calculations show that for all models the generated sets perform better than the shuffled sets (Figure 20, Figure 21). This supports that the models are learning some generalization and that the compounds being generated are specific to the amino acid sequences provided as context.
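The Tanimoto and Wasserstein computations described above can be sketched with RDKit fingerprints and SciPy. The fingerprint settings (Morgan radius 2, 2048 bits), the helper names, and the toy values below are assumptions for illustration; the thesis pipeline may differ in its exact parameters.

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from scipy.stats import wasserstein_distance

    def fingerprint(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) if mol else None

    def mean_max_tanimoto(generated_smiles, known_binder_smiles):
        """Per generated compound: mean and max Tanimoto similarity to the protein's known binders."""
        binder_fps = [fp for fp in map(fingerprint, known_binder_smiles) if fp is not None]
        means, maxes = [], []
        for smi in generated_smiles:
            fp = fingerprint(smi)
            if fp is None:
                continue  # invalid SMILES are excluded, as in the validity filtering
            sims = DataStructs.BulkTanimotoSimilarity(fp, binder_fps)
            means.append(float(np.mean(sims)))
            maxes.append(float(np.max(sims)))
        return means, maxes

    # Distribution-level comparison (lower = closer to the ideal set's distribution).
    ideal_means = [0.08, 0.12, 0.15, 0.22]        # toy reference values
    generated_means = [0.07, 0.10, 0.18, 0.25]    # toy generated values
    print(wasserstein_distance(generated_means, ideal_means))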
Figure 20. Heatmap of Wasserstein distances between generated sets and the ideal set, and between shuffled sets and the ideal set, for Mean Tanimoto Similarity. For full distributions see Supplementary Figure 1.

Figure 21. Heatmap of Wasserstein distances between generated sets and the ideal set, and between shuffled sets and the ideal set, for Max Tanimoto Similarity. For full distributions see Supplementary Figure 2.

However, the models are not equally efficient at this generalization. The worst model is Base+S, which on average across all generated sets had a Wasserstein distance of 0.138 for the mean Tanimoto comparison and 0.571 for the max Tanimoto comparison. The Base+S+E model is second worst, with average mean and max Tanimoto Wasserstein scores of 0.131 and 0.548, respectively. The Base model is second best, with average mean and max Tanimoto Wasserstein scores of 0.122 and 0.529, respectively. The best model is Base+E, with average mean and max Tanimoto Wasserstein scores of 0.097 and 0.471, respectively. The performance of Base+E is a large jump: where the previous models improved on one another by an average of 0.008 and 0.021 for the mean and max Tanimoto metrics, the Base+E model improved by 0.024 and 0.058, more than double the improvement. It is also important to note that across all models beam search performs better than random sampling and, further, that 1-1 generation is consistently the best performer among the generation methods, with few exceptions.

On this metric the models can be compared to each other to get a sense of the effect of adding embeddings and the effect of using SELFIES strings. Comparing the Base and Base+S models across generation sets gives an average increase in the Wasserstein distance of 0.017 and 0.042 for the mean and max metrics, respectively. Comparing the Base+E and Base+S+E models gives an average increase of 0.034 and 0.077 for the mean and max metrics, respectively. This increase suggests an overall degradation of generalization from the use of SELFIES strings. The impact seems to harm embedding models more, with the mean Tanimoto degradation 2x worse in the embedding models and the max Tanimoto degradation 1.8x worse. Comparing the Base and Base+E models gives an average decrease in the Wasserstein distance of 0.024 and 0.058 for the mean and max metrics, respectively. Comparing the Base+S and Base+S+E models gives an average decrease of 0.007 and 0.023 for the mean and max metrics, respectively. This suggests an overall improvement of generalization from the use of the embeddings, but the size of the improvement is affected by the presence of SELFIES: models using SELFIES strings saw the mean Tanimoto improvement shrink by 3.4x and the max Tanimoto improvement by 2.5x.

The Tanimoto results can also be compared on raw counts. Comparison of two random samples of compounds determined that a mean Tanimoto score over 0.2 can be considered an outlier, and a max Tanimoto score over 0.3 can be considered an outlier. Using these cutoffs, we can count the number of generated molecules above them to see compound generation performance in absolute terms. Table 3 and Table 4 show that, despite lower validity, the Base+E embedding model consistently produces more high-value compounds in total.
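The binned counts reported in Tables 3 to 5 amount to a simple histogram over the per-compound scores. A minimal sketch with NumPy is shown below; the bin edges follow Table 4 (Table 3 would start at 0.2 and Table 5 at 0.5), while the score values are toy numbers, not thesis data.

    import numpy as np

    # One max-Tanimoto-similarity value per valid generated compound (toy values).
    max_tanimoto = np.array([0.12, 0.34, 0.41, 0.55, 0.28, 0.93])

    # Bins used in Table 4: 0.3-0.4, 0.4-0.5, ..., 0.9-1.0; values below 0.3 are not counted.
    edges = np.arange(0.3, 1.01, 0.1)
    counts, _ = np.histogram(max_tanimoto, bins=edges)

    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        print(f"{lo:.1f}-{hi:.1f}: {n}")
    print("Total over 0.3:", counts.sum())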
Table 3. Binned absolute count of outlier compounds (>0.2) for Mean Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.

Set              0.2-0.3  0.3-0.4  0.4-0.5  0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9  0.9-1.0  Total
Base_1-1               7        1        3        1        0        0        0        0     12
Base+S_1-1             7        2        1        1        0        0        0        0     11
Base+E_1-1            14        3        3        1        0        0        0        0     21
Base+S+E_1-1           9        3        1        2        0        0        0        0     15
Base_1-10             64        6       17        9        0        0        0        0     96
Base+S_1-10           78       12        4        8        0        0        0        0    102
Base+E_1-10          100       19       21        4        0        0        0        0    144
Base+S+E_1-10         78       18        7        9        0        0        0        0    112
Base_1-100           466      144      119       68       24       43        0        0    864
Base+S_1-100         419       88       69       42        6        5        0        0    629
Base+E_1-100         804      248      177      103       20       39        2        3   1396
Base+S+E_1-100       442      133       94       55        4        9        1        0    738

Table 4. Binned absolute count of outlier compounds (>0.3) for Max Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.

Set              0.3-0.4  0.4-0.5  0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9  0.9-1.0  Total
Base_1-1               3        1        4        0        0        0        6     14
Base+S_1-1             4        1        1        0        0        1        3     10
Base+E_1-1             9        3        1        0        1        1        8     23
Base+S+E_1-1           4        0        1        1        0        0        8     14
Base_1-10             39       14       18        2        7        9       24    113
Base+S_1-10           53       10        5        3        2        4       18     95
Base+E_1-10           68       35       12       11       15       11       30    182
Base+S+E_1-10         43       18       18        9        9        6       28    131
Base_1-100           444      160      111       32       72       68      254   1141
Base+S_1-100         283       85       54       21       32       30      134    639
Base+E_1-100         656      279      166      116       84       76      288   1665
Base+S+E_1-100       329      139       70       54       46       33      154    825

The better performance of the Base+E model in both Wasserstein distance and absolute count shows that this model's performance cannot be attributed solely to producing fewer bad compounds; the model is also producing more high-quality compounds. The overall poorer performance of the SELFIES models could be a result of their guaranteed validity. Where the SMILES models may be uncertain and inevitably generate invalid SMILES, the SELFIES models force the generation of the uncertain compound, potentially leading to the inclusion of lower-quality compounds. Tables 3 and 4 do show that the SELFIES models produce fewer high-quality compounds, but not an unreasonably low number. This supports the idea that SELFIES performance may be poor not so much because of a lack of good compounds as because of the overproduction of poor compounds. On top of the generally poor performance of the SELFIES models, it appears that their inclusion potentially stifles improvement. An interesting result was the negative interaction between embeddings and SELFIES strings: the inclusion of embeddings improved Tanimoto performance, but less so when SELFIES were used, and SELFIES tended to hurt Tanimoto performance, but hurt the embedding models' performance more.

5.5.2 Classifier Scores

As was done for Tanimoto similarity, we can calculate the classifier scores for the generated sets and the shuffled sets. Since the classifier does not require a reference set of compounds for comparison, we can also calculate the classifier score distribution for the whole test set. Calculating the Wasserstein distance as before shows again that, overall, the shuffled sets perform worse than the generated sets, demonstrating the models' ability to generalize (Figure 22). In the performance comparison between models the order changes. The worst model is almost a tie between Base and Base+S, with average classifier Wasserstein scores of 0.306 and 0.305, respectively. The second-best model is Base+S+E, with an average classifier Wasserstein score of 0.25. The Base+E model remains the top model, with an average classifier Wasserstein score of 0.156. As with Tanimoto similarity, the jump the Base+E model makes is a larger gain than that of the other models.
Where Base+S+E makes roughly a 0.05 improvement, the Base+E model makes roughly a 0.10 improvement, almost double the gain. As seen in the Tanimoto results, beam search is consistently better than the 1-100 set, and the 1-1 set is the best performer.

Figure 22. Heatmap of Wasserstein distances between generated sets and the ideal set, and between shuffled sets and the ideal set, for Classifier Scores. For full distributions see Supplementary Figure 3.

Comparing the Base and Base+S models shows, on average, no overall change in classifier scores. For the random sampling generation set the Base model performs better by 0.051, while for beam search the Base+S model is better by 0.033 and 0.021 for the 1-10 and 1-1 generation sets, respectively. Comparing Base+E to Base+S+E, we see an average increase of 0.094. This continues the theme of SELFIES having a negative impact on performance and impacting the embedding model more severely. Comparing Base to Base+E we see an average improvement of 0.15, and comparing Base+S to Base+S+E we see an average improvement of 0.055. This is further evidence of the better generalization of the embedding models and of the hindrance SELFIES places on this improvement.

The classifier also allows us to visualize ROC curves comparing sets. When we ask the classifier to discriminate between the generated sets and the shuffled sets, all the models do better than a random guess (Figure 23). The test set result has an AUC score of 0.86. The best model is the Base+E model, with 1-100, 1-10, and 1-1 AUC scores of 0.73, 0.78, and 0.76, respectively. The second best is the Base+S+E model, with AUC scores of 0.64, 0.67, and 0.73, respectively. The Base and Base+S models come in almost tied in the low 0.60s. Paired with the previous Wasserstein distance results, this shows that all the models are generating compounds conditioned on their amino acid sequences, with the embedding models performing the best. When these results are compared to the models' performance against negative binding examples, the performance is almost identical, with the same models performing better against random samples as they did against their own shuffled values (Supplementary Figure 4).

Figure 23. ROC Curve for Generated vs. Shuffled Sets across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.

When the classifier is asked to discriminate between real binders and the generated sets, the Base+E model performs the best, with 1-100, 1-10, and 1-1 AUC scores of 0.72, 0.67, and 0.64 (Figure 24). The second best is the Base+S+E model with 0.78, 0.75, and 0.70, respectively. The Base and Base+S models again come in almost tied, in the high 0.70s to low 0.80s. The lower AUC scores mean the classifier has greater difficulty discerning between the real binders and the generated compounds. This very clearly shows that the embedding models are producing compound sets that better approximate the real binding compounds.

Figure 24. ROC Curve for Generated vs. Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.
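The ROC comparisons above reduce to labelling compounds by which set they came from and scoring them with the binding classifier. A minimal sketch with scikit-learn follows; the probability arrays are toy stand-ins for the output of the evaluation classifier from Chapter 4, not its actual interface.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical classifier binding probabilities for compounds paired with the same protein:
    # one array for the generated compounds, one for the shuffled (mismatched) compounds.
    p_generated = np.array([0.81, 0.64, 0.55, 0.92, 0.47])  # toy values
    p_shuffled = np.array([0.22, 0.58, 0.31, 0.44, 0.12])   # toy values

    scores = np.concatenate([p_generated, p_shuffled])
    labels = np.concatenate([np.ones_like(p_generated), np.zeros_like(p_shuffled)])

    fpr, tpr, _ = roc_curve(labels, scores)
    print("AUC (generated vs. shuffled):", roc_auc_score(labels, scores))
    # An AUC near 0.5 means the two sets are indistinguishable to the classifier;
    # for generated vs. real binders, a lower AUC therefore indicates better approximation.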
As with Tanimoto similarity, we can also evaluate absolute counts (Table 5). Table 5 shows total counts for a set of bins above a 50% binding probability score. Comparing the tables (Tables 3, 4, and 5), the Tanimoto outlier cutoff is stricter, with fewer molecules overall making the cutoff. Looking at the absolute counts, Base+S+E produces the highest number of high-quality binders; the Base+E model only outperforms the other models in the 0.7-0.8 and 0.8-0.9 bins of the 1-100 generation set. However, Base+E does outperform Base in total in all generation sets. The Base+E model's outsized performance in AUC scores and classifier Wasserstein distance, despite its lower absolute compound count, highlights the importance of compound filtering. Considering the validity scores in Table 1, it is clear that the Base+E model is the most efficient model in compound production. For example, the Base+E 1-100 set generates 4830 compounds over 50% binding probability and the Base+S+E 1-100 set generates 4931; but considering their validity difference of 77% and 100%, respectively, the Base+S+E model is generating roughly 2000 extra compounds to gain only an additional 101 high-probability compounds. For generation methods, the classifier results agree with the Tanimoto similarity results, with 1-1 generation performing the best and 1-10 and 1-100 generation coming in second and third, with few exceptions. The overall trend of embeddings improving performance and SELFIES hurting performance remains true.

Table 5. Binned absolute count of compounds classified as binders (>0.5) for Classifier Scores. Max values for each generation set are bolded. Ties are bolded and underlined.

Set              0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9  0.9-1.0  Total
Base_1-1              10        9       22        4        3     48
Base+S_1-1             8       10       16       10        5     49
Base+E_1-1             6        4       15       13       11     49
Base+S+E_1-1           5       13       16       15       11     60
Base_1-10             71       97      144       76       24    412
Base+S_1-10           89      123      114      103       70    499
Base+E_1-10           52       68      117      117       65    419
Base+S+E_1-10         60      101      127      127       91    506
Base_1-100           614      923     1209      716      409   3871
Base+S_1-100         743      990     1009      867      500   4109
Base+E_1-100         677      984     1402     1126      641   4830
Base+S+E_1-100       826     1079     1212     1110      704   4931

5.5.3 Case Study

The exploration of specific proteins allows us to see the effectiveness of the models in a case study. The classifier, when applied to real binders and a random sample for IGF-1, was able to distinguish between the two well, with an AUC of 0.90 (Figures 25, 26). For VEGFR2 the classifier was still successful in differentiating, but much less so, with an AUC score of only 0.76 (Figures 27, 28). This provides some confidence that the classifier can classify appropriately for these novel amino acid sequences and binders. For the case studies, only generation sets 1-10 and 1-100 are used, because the 1-1 set would only generate 3 and 2 molecules for IGF-1 and VEGFR2, respectively.

For IGF-1, when comparing real binders to the generated sets, only three sets gave reasonable performance: Base+E 1-10 and 1-100 and Base+S+E 1-10, with AUC scores of 0.69, 0.55, and 0.76, respectively (Figure 26). When looking at the performance of generated molecules against a random sample, only the Base+E 1-10 and 1-100 sets maintain reasonable performance, with AUC scores of 0.89 and 0.87, respectively (Figure 25). In both comparisons a surprising result is that the other generated sets actually perform worse than random molecules (Figures 25, 26).
That is, when the comparison is against real binders, the classifier separates the generated compounds more easily than it separates random compounds, and when the comparison is against random compounds, the generated compounds receive worse classifier probabilities than the random compounds do. What doing worse than random compounds may indicate is that these models are focusing on a specific, incorrect subset of the chemical space. The negative examples come from a random sample of the dataset, which means the diversity of compounds allows some, by chance, to score higher on the classifier. Performing worse than this set must therefore mean that the models are focusing their generation in an area of chemical space that is particularly wrong and are not diversifying. This could be evidence of a lack of generalization in the other models for the IGF-1 protein family: the other models are latching onto the wrong types of protein neighbours in the protein embedding space and therefore generating the wrong type of binders, whereas the Base+E model is correctly finding associations with a set of analogous proteins and therefore generating more appropriate binders.

For VEGFR2 the performance across all the generated sets is more reasonable. In the comparison of generated compounds and real binders, the Base and Base+E models stand out as the best performers (Figure 28). The Base model has AUC scores of 0.58 and 0.61 for the 1-100 and 1-10 generation sets, respectively, and the Base+E model has AUC scores of 0.54 and 0.48 for the 1-100 and 1-10 generation sets, respectively. For the comparison of generated sets to negative binding examples this result holds, with the Base model having AUC scores of 0.67 for both generated sets and the Base+E model having AUC scores of 0.73 and 0.79 for the 1-100 and 1-10 generation sets, respectively (Figure 27).

Through these results we see the Base+E model consistently performing the best. The effect of the embeddings seems greater in the IGF-1 case than in the VEGFR2 case. This could be because there may be closer examples to VEGFR2 in the training set than there are for IGF-1, so the advantage of the larger pretrained embedding is not as necessary or prominent. However, it is also clear that the classifier has a more difficult time even with real binders for VEGFR2, so it is possible that there is a larger benefit from the inclusion of embeddings for VEGFR2 that the classifier does not have the resolution to show. Consistent with the previous classifier and Tanimoto results, we see that the inclusion of SELFIES hinders the models and stifles the benefits of adding the embeddings. The IGF-1 results agree with the previous results, with Base+E being the best model and Base+S+E second best, although the performance gap between Base+E and Base+S+E is much larger than in the other results. VEGFR2 deviates from previous results in that the Base model is the second-best model, but agrees in that Base+E is still the best model. In the case study results the improvement from beam search is less prominent, with performance between 1-10 and 1-100 being very similar.

Figure 25. ROC Curve for IGF-1 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.

Figure 26. ROC Curve for IGF-1 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs.
Random Sample Set; AUC Scores are included in the legend.

Figure 27. ROC Curve for VEGFR2 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.

Figure 28. ROC Curve for VEGFR2 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.

Chapter 6 - Conclusions and Future Research

6.1 Conclusions

6.1.1 Effect of SELFIES

The results show that the introduction of SELFIES strings has a potential negative effect on performance. In chemical feasibility, SELFIES strings improved validity, novelty, and diversity. For many of the physiochemical properties, however, the SELFIES models consistently showed the largest deviations from the training and test sets. Some of these deviations may potentially be positive, like the shift toward more natural-product-like compounds, but others are negative, like the increase in SA score. The overall concern with this deviation is that if the models are generating outside the domain of the training set, this is a potential indication that the models are generating noise. In the Tanimoto results and classifier results, including the case studies, we consistently see the Base+S model perform the worst, and when comparing Base to Base+S and Base+E to Base+S+E we consistently see a performance drop. An interesting observation in these comparisons is that the SELFIES strings appear to have a larger negative performance impact on the Base+E model than on the Base model, suggesting that SELFIES strings are a hindrance to the performance gains facilitated by the embeddings. The deviations from the training set in physiochemical properties, together with this poor performance in producing binders, suggest that the SELFIES models' forced validity may result in more noise and low-quality molecules being produced, and that SMILES invalidity may actually be a useful filtering mechanism. This is most clear when taking the absolute counts into consideration: for the Tanimoto and classifier results we can see that the SELFIES models produce a reasonable number of high-quality molecules, but performance is still poor because of the total number of poor-quality molecules generated. A recent paper by Skinnider 2024 [113] supports this interpretation. That work provided causal evidence that invalid outputs are actually beneficial for filtering out low-likelihood samples and that enforcing valid outputs via SELFIES strings produces structural biases in the generated compounds [113]. Skinnider 2024 [113] ultimately suggests that the creation of invalid SMILES strings can be seen as a feature, not a bug, of generation models. This aligns well with our finding that SELFIES models could generate high-quality molecules but were overall hindered by their overgeneration of poor-quality molecules.
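The forced-validity behaviour discussed above can be illustrated directly with the selfies package [82] and RDKit. The sketch below is illustrative only: the example strings are toys, and the exact tokenization used by the thesis models is not reproduced here.

    import selfies as sf
    from rdkit import Chem

    # A syntactically broken SMILES string (unclosed aromatic ring) fails RDKit parsing,
    # so an uncertain SMILES sample can simply be discarded by the validity filter.
    print(Chem.MolFromSmiles("c1ccccc") is None)   # True

    # SELFIES strings, by construction, always decode to some valid molecule.
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    tokens = sf.encoder(aspirin)                   # SMILES -> SELFIES
    print(sf.decoder(tokens))                      # ...and back to a valid SMILES

    # Even a truncated (i.e., "uncertain") SELFIES string still decodes to a valid
    # molecule, so low-confidence samples are kept rather than filtered out.
    truncated = "".join(list(sf.split_selfies(tokens))[:5])
    decoded = sf.decoder(truncated)
    print(decoded, Chem.MolFromSmiles(decoded) is not None)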
6.1.2 Effect of Embeddings

The results revealed that pretrained embeddings can improve conditional compound generation performance. This improvement in compound quality is seen in the Tanimoto results, the classifier results, and the case studies explored. The improvement appears to come at no cost to physiochemical properties, with the only potential downside being a decrease in validity. Even so, the increase in compound quality outweighs the decrease in validity: the raw number of high-quality molecules generated is still the highest for the Tanimoto metrics and second highest for the classifier metrics. Notably, the classifier results, including the case studies, show a striking improvement in the Base+E model compared to the other models. The improvement from embeddings is also seen in the SELFIES models, although it is less prominent. This increased ability to generate SMILES strings that approximate true binding SMILES can potentially be attributed to the pretrained embeddings allowing the model to generalize and find better associations between new proteins and the proteins in the training set. This improvement from pretrained embeddings in conditional compound generation is a novel finding, but it aligns with the improvements seen in binding prediction and compound screening [95], [96], [97], [98], [99], [114]. The notable point is that this improvement comes with no change to the structure or size of the Transformer decoder.

An interesting note, when looking at the absolute count results, is that Base+S+E could potentially be a viable model if the generation task has an efficient way to filter generated compounds. If that is the case, then Base+S+E stands to produce more high-quality compounds in total. Unfortunately, in use cases where the model is generating new lead molecules for a novel amino acid sequence, filtering may not be feasible. This highlights Skinnider 2024's [113] assertion that low validity could be a feature of SMILES string generation rather than a bug.

6.1.3 Effect of Generation Method

Performance did vary between the methods of compound generation. In the chemical feasibility results, random sampling appeared to improve uniqueness, novelty, and internal diversity. For the Base model it had a negative effect on validity, but Base+E was unaffected. In the physiochemical properties, random sampling did tend to produce more outliers outside the domain of the training set, but the distribution shapes overall agreed with the training data. In the Tanimoto and classifier results there was a consistent trend of the beam search methods performing better than random sampling, with 1-1 being the best performing generation method. This is not a surprising result, given that beam search generates the most probable compound, which in turn means it is the compound the model thinks best matches the provided amino acid context. However, the difference in performance is not drastic, and considering that random sampling improved diversity and novelty, it presents itself as a reasonable method for generating and exploring a larger, more diverse set of lead molecules.

6.1.4 Scaling Evaluation with a Classifier

The results of the case study show how performance may vary from protein to protein and highlight the importance of scaling compound generation evaluation to acquire more robust results. The classifier used for evaluation was successful: it was able to distinguish between training and test sets and, when provided with novel amino acid sequences for the case study, showed effectiveness in distinguishing between real binders and random compounds. These results serve as a good proof of concept and motivate the creation of a more universal binding classifier for benchmarking compound generation at scale.
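For context on what the pretrained protein embeddings credited in Section 6.1.2 involve in practice, the following sketch extracts a per-protein ESM-2 representation with the fair-esm package [88]. The specific checkpoint (esm2_t33_650M_UR50D), the mean pooling over residues, and the toy sequence are illustrative assumptions; they are not necessarily the exact checkpoint or pooling used in this thesis.

    import torch
    import esm

    # Load a pretrained ESM-2 checkpoint and its tokenizer/alphabet.
    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    data = [("toy_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # illustrative sequence
    _, _, tokens = batch_converter(data)

    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=False)
    per_residue = out["representations"][33]       # (batch, sequence + special tokens, 1280)

    # One fixed-size embedding per protein: mean over residues, skipping the BOS/EOS tokens.
    seq_len = len(data[0][1])
    protein_embedding = per_residue[0, 1:seq_len + 1].mean(dim=0)
    print(protein_embedding.shape)                 # torch.Size([1280])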
6.1.5 Transformers Moving Forward

Recurrent Neural Networks have historically been the model of choice for compound generation [28]. However, Transformers provide a number of key benefits for the field moving forward. Their ability to parallelize and scale allows larger models to be trained faster, as seen with the current rise of large language models [16]. The use of attention allows the model to extract key information on how protein sequences and compound sequences attend to each other, which is potentially important for determining binding or other associations. Most notably, Transformers are proliferating in bioinformatic models more broadly, as seen with AlphaFold2 and ESM-2 [85], [88]. The utilization of Transformer models for compound generation allows for the more seamless repurposing of these pretrained models, as displayed in this work. It is important to reiterate that although compound generation methods have the advantage of potentially providing novel compounds, they also raise the concern of more overhead in synthesizing and validating the proposed compounds. Because of this, the use of discriminative machine learning approaches on established compound libraries can still be a more favorable approach.

6.2 Future Research

6.2.1 Shortcomings and Limitations

This research was restricted computationally, which led to some experimentation being less robust than we would hope. Performing multiple train-test splits would allow us to validate the results of this work with more confidence. In addition to multiple splits, the datasets could be improved by splitting not only on identical amino acid sequences but also on amino acid sequence similarity. This would create train-test splits that are more diverse and would better test the models' ability to generalize.

The training length of the models is also a point that can be explored further. In this work we trained the models for 50 epochs, whereas the models in Grechishnikova 2021 [59] were trained for approximately 200 epochs. The effects of longer training still need to be investigated. There is a possibility that with more training the non-embedding and embedding models may converge in performance, although the observed trend makes this seem unlikely.

The evaluation in this research makes use of a classifier, which is only a proxy for evaluating binding. With more compute power the models could be evaluated with more robust binding simulations. There is also an opportunity to explore the use of open-source binding and affinity prediction models to benchmark results against independently trained models. Ideally, utilizing multiple independent models could allow for more confidence in the results as well.

6.2.2 Improvements

Analysis of Performance on Protein Family

This research evaluates the models at an aggregate level and only goes into specifics for two example proteins in a case study. The case study demonstrated that performance for any given protein can be variable. A deeper dive into trends in performance related to protein families or amino acid sequence similarity would be valuable. This could uncover whether the models have any protein family bias. It could also aid a deeper investigation of the Base+E validity drop. If the validity drop is caused by focusing on a broader set of protein associations, we may expect amino acid sequences that do not have close associations to proteins in the training set, but rather multiple distant associations, to have worse validity.
The true binders for these amino acid associations could be identified in the training set and compared to the invalid molecules generated for that amino acid sequence, looking for potential substring similarity. This could further validate the hypothesis that validity drops due to uncertainty caused by broader amino acid sequence associations.

Exploring Different Generation Parameters

This research only explores beam search with an alpha value of 0.6 and random sampling with a temperature of 1. A more exhaustive exploration of the effect of these parameters on performance would be a valuable exercise. For example, increasing the temperature will increase the randomness of sampled generation, and measuring the potential increase in diversity and novelty against a potential decrease in performance would be interesting. The alpha value of 0.6 was determined to be optimal by Wu et al. 2016 [75] for translation tasks, so it would be interesting to re-evaluate this parameter empirically for compound generation to see whether a different value is preferable.

Exploring Few-Shot Performance

The evaluation of the models in this work focuses on zero-shot generation of compounds from amino acid sequences; that is, the models generate compounds for an unknown amino acid sequence with no additional training. Other papers, however, fine-tune a pretrained model with known binders for the protein of interest [35], [36], [37]. It would be interesting to compare the established RNN methods to the models explored in this work, which use amino acid sequence context. In theory, the models created in this work may be able to generalize to new examples faster, because they are not only learning the compounds but also learning the amino acid sequences and making associations to other amino acid sequences they already know. It would also be interesting to compare the models in this work against each other in that setting, to see whether the embedding models are able to perform better with fewer fine-tuning cycles. This could present a more practical use case for the models in this work, as foundational models that can be quickly fine-tuned for generation against a specific protein.

Generating Compounds from Scaffolds

All compounds generated in this research were generated from scratch, but it is possible to give the decoder a starting sequence to extend. It would be interesting to explore this functionality in a variety of ways, for example by giving the same scaffold to multiple amino acid sequences and comparing the different generations. It would also be possible to take a lead compound that has already been generated, make a scaffold from it, and have the models branch off of it while varying the random sampling temperature, to see the effectiveness of iterating on generated molecules repeatedly.

Graphical Network for Evaluation Classifier

The classifier used in evaluation takes in compounds as SMILES strings. This means that, at its core, the model is measuring whether a string is associated with a protein; the hope is that the classifier is in some way deriving the physical and chemical properties from the SMILES string. An interesting exploration would be retraining the classifier as a graph neural network that takes in compounds via their graph structure instead of their string structure, and seeing whether the results remain the same. Such a classifier could be more informative, since it would directly evaluate the compounds' molecular structure and its similarity to the structures of other binders, rather than a string proxy for that structure.
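As a sense of what such a graph-based classifier would consume, the sketch below converts a SMILES string into simple node features and a bond list with RDKit. The particular atom features and the output format are illustrative assumptions; an actual graph neural network classifier would likely use a richer featurization and a dedicated graph library.

    from rdkit import Chem

    def smiles_to_graph(smiles):
        """Turn a SMILES string into (atom feature tuples, bond list) for a graph model."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        atom_features = [
            (atom.GetAtomicNum(), atom.GetDegree(), atom.GetFormalCharge(), atom.GetIsAromatic())
            for atom in mol.GetAtoms()
        ]
        bonds = [
            (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondTypeAsDouble())
            for bond in mol.GetBonds()
        ]
        return atom_features, bonds

    atoms, edges = smiles_to_graph("c1ccccc1O")  # phenol as a toy example
    print(len(atoms), "atoms,", len(edges), "bonds")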
Exploring Model Size

This work explores models with Encoder and Decoder structures and sizes identical to Grechishnikova 2021 [59]. This leaves room to explore increasing (or decreasing) the model size, whether in d_ff, d_model, or the number of layers and attention heads. It would be interesting to explore the overall performance impact of a larger model, namely whether the non-embedding models eventually narrow the performance gap with the embedding models and whether an increased decoder size could help the performance of SELFIES string generation. When Skinnider 2024 [113] compares SMILES string and SELFIES string generation, the model sizes and structures are also identical, which leaves room for exploring the impact of model size on SELFIES string generation; there is a chance that a larger model may be needed to better grasp the more complex grammar of SELFIES strings. Outside of the architecture size, the other embedding sizes of ESM-2 could also be explored, to see the impact of the smaller models as well as the larger ones if computational resources allow.

6.3 Final Statement

This work explores how pretrained protein embeddings from a large protein language model, ESM-2 [88], and the use of an alternative string compound representation, SELFIES [82], affect the performance of compound generation in Transformer models. Overall, the results showed that the inclusion of pretrained embeddings has the potential to improve model performance in generating high-quality compounds for specific amino acid sequences. The utilization of SELFIES strings was seen to improve validity and diversity but promoted difficult-to-synthesize compounds and overall hurt the performance of the models. Random sampling with temperature was shown to be a less effective generation method than beam search, but considering its non-deterministic nature and increased speed it is a feasible methodology for exploring compound generation. These results ultimately show that, without an increase in model size or a change in structure, the application of pretrained protein embeddings can improve protein-specific compound generation in Transformer models. These models can potentially be used to generate lead compounds for novel proteins without any retraining or fine-tuning.

Bibliography

[1] R. Leinonen, H. Sugawara, M. Shumway, and International Nucleotide Sequence Database Collaboration, “The Sequence Read Archive,” Nucleic Acids Research, vol. 39, no. Database issue, pp. D19-21, Jan. 2011, doi: 10.1093/nar/gkq1019.
[2] M. A. Jensen, V. Ferretti, R. L. Grossman, and L. M. Staudt, “The NCI Genomic Data Commons as an Engine for Precision Medicine,” Blood, vol. 130, no. 4, pp. 453–459, Jul. 2017, doi: 10.1182/blood-2017-03-735654.
[3] H. M. Berman et al., “The Protein Data Bank,” Nucleic Acids Res, vol. 28, no. 1, pp. 235–242, Jan. 2000, doi: 10.1093/nar/28.1.235.
[4] A. Gaulton et al., “The ChEMBL Database in 2017,” Nucleic Acids Res, vol. 45, no. D1, pp. D945–D954, Jan. 2017, doi: 10.1093/nar/gkw1074.
[5] S. Kim et al., “PubChem Substance and Compound databases,” Nucleic Acids Res, vol. 44, no. D1, pp. D1202-1213, Jan. 2016, doi: 10.1093/nar/gkv951.
[6] J. J. Irwin and B. K. Shoichet, “Zinc--a Free Database of Commercially Available Compounds for Virtual Screening,” J Chem Inf Model, vol. 45, no. 1, pp. 177–182, Feb. 2005, doi: 10.1021/ci049714+.
[7] Research and Markets, “Global Pharmaceuticals Market Report 2021-2030 Featuring Major Players - Pfizer; F.
Hoffmann-La Roche Ltd; Sanofi; Johnson & Johnson and Merck & Co.” Accessed: Oct. 05, 2022. [Online]. Available: https://www.prnewswire.com/newsreleases/global-pharmaceuticals-market-report-2021-2030-featuring-major-players--pfizer-f-hoffmann-la-roche-ltd-sanofi-johnson--johnson-and-merck--co-301260487.html [8] C. H. Wong, K. W. Siah, and A. W. Lo, “Estimation of Clinical Trial Success Rates and Related Parameters,” Biostatistics, vol. 20, no. 2, pp. 273–286, Apr. 2019, doi: 10.1093/biostatistics/kxx069. [9] S. M. Paul et al., “How to Improve R&d Productivity: The Pharmaceutical Industry’s Grand Challenge,” Nat Rev Drug Discov, vol. 9, no. 3, Art. no. 3, Mar. 2010, doi: 10.1038/nrd3078. [10] Y. Cheng, Y. Gong, Y. Liu, B. Song, and Q. Zou, “Molecular Design in Drug Discovery: A Comprehensive Review of Deep Generative Models,” Brief Bioinform, vol. 22, no. 6, p. bbab344, Nov. 2021, doi: 10.1093/bib/bbab344. [11] I. H. Sarker, “Machine Learning: Algorithms, Real-World Applications and Research Directions,” SN COMPUT. SCI., vol. 2, no. 3, p. 160, Mar. 2021, doi: 10.1007/s42979021-00592-x. [12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org 95 [13] I. Goodfellow et al., “Generative Adversarial Networks,” Commun. ACM, vol. 63, no. 11, pp. 139–144, Oct. 2020, doi: 10.1145/3422622. [14] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735. [15] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” Dec. 11, 2014, arXiv: arXiv:1412.3555. doi: 10.48550/arXiv.1412.3555. [16] A. Vaswani et al., “Attention Is All You Need,” Dec. 06, 2017, arXiv: arXiv:1706.03762. doi: 10.48550/arXiv.1706.03762. [17] A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter, “DeepTox: Toxicity Prediction using Deep Learning,” Frontiers in Environmental Science, vol. 3, 2016, Accessed: Oct. 06, 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fenvs.2015.00080 [18] W. F. D. Bennett et al., “Predicting Small Molecule Transfer Free Energies by Combining Molecular Dynamics Simulations and Deep Learning,” J. Chem. Inf. Model., vol. 60, no. 11, pp. 5375–5381, Nov. 2020, doi: 10.1021/acs.jcim.0c00318. [19] A. Lusci, G. Pollastri, and P. Baldi, “Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules,” J Chem Inf Model, vol. 53, no. 7, pp. 1563–1575, Jul. 2013, doi: 10.1021/ci400187y. [20] J. Jiménez, M. Škalič, G. Martínez-Rosell, and G. De Fabritiis, “KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks,” J. Chem. Inf. Model., vol. 58, no. 2, pp. 287–296, Feb. 2018, doi: 10.1021/acs.jcim.7b00650. [21] H. Öztürk, A. Özgür, and E. Ozkirimli, “Deepdta: Deep Drug–Target Binding Affinity Prediction,” Bioinformatics, vol. 34, no. 17, pp. i821–i829, Sep. 2018, doi: 10.1093/bioinformatics/bty593. [22] T. Nguyen, H. Le, T. P. Quinn, T. Nguyen, T. D. Le, and S. Venkatesh, “GraphDTA: Predicting Drugd–Target Binding Affinity with Graph Neural Networks,” Oct. 02, 2020, bioRxiv. doi: 10.1101/684662. [23] P. T. Jackson et al., “Phenotypic Profiling of High Throughput Imaging Screens with Generic Deep Convolutional Features,” Mar. 15, 2019, arXiv: arXiv:1903.06516. Accessed: Oct. 05, 2022. [Online]. Available: http://arxiv.org/abs/1903.06516 [24] P. Karpov, G. 
Godin, and I. V. Tetko, “Transformer-CNN: Swiss Knife for QSAR Modeling and Interpretation,” Journal of Cheminformatics, vol. 12, no. 1, p. 17, Mar. 2020, doi: 10.1186/s13321-020-00423-w. [25] C.-T. Chi, M.-H. Lee, C.-F. Weng, and M. K. Leong, “In Silico Prediction of PAMPA Effective Permeability Using a Two-QSAR Approach,” Int J Mol Sci, vol. 20, no. 13, p. 3170, Jun. 2019, doi: 10.3390/ijms20133170. 96 [26] S. Kim and K.-H. Cho, “PyQSAR: A Fast QSAR Modeling Platform Using Machine Learning and Jupyter Notebook,” Bulletin of the Korean Chemical Society, vol. 40, no. 1, pp. 39–44, 2019, doi: 10.1002/bkcs.11638. [27] J. Vamathevan et al., “Applications of Machine Learning in Drug Discovery and Development,” Nat Rev Drug Discov, vol. 18, no. 6, Art. no. 6, Jun. 2019, doi: 10.1038/s41573-019-0024-5. [28] D. D. Martinelli, “Generative Machine Learning for De Novo Drug Discovery: A Systematic Review,” Computers in Biology and Medicine, vol. 145, p. 105403, Jun. 2022, doi: 10.1016/j.compbiomed.2022.105403. [29] X. Tong et al., “Generative Models for De Novo Drug Design,” J. Med. Chem., vol. 64, no. 19, pp. 14011–14027, Oct. 2021, doi: 10.1021/acs.jmedchem.1c00927. [30] T. B. Brown et al., “Language Models are Few-Shot Learners,” Jul. 22, 2020, arXiv: arXiv:2005.14165. doi: 10.48550/arXiv.2005.14165. [31] P. Dhariwal and A. Nichol, “Diffusion Models Beat GANs on Image Synthesis,” Jun. 01, 2021, arXiv: arXiv:2105.05233. doi: 10.48550/arXiv.2105.05233. [32] E. J. Bjerrum and R. Threlfall, “Molecular Generation with Recurrent Neural Networks (RNNs),” May 17, 2017, arXiv: arXiv:1705.04612. doi: 10.48550/arXiv.1705.04612. [33] P. Ertl, R. Lewis, E. Martin, and V. Polyakov, “In Silico Generation of Novel, Drug-Like Chemical Matter Using the LSTM Neural Network,” Jan. 08, 2018, arXiv: arXiv:1712.07449. doi: 10.48550/arXiv.1712.07449. [34] S. Harel and K. Radinsky, “Prototype-Based Compound Discovery Using Deep Generative Models,” Mol. Pharmaceutics, vol. 15, no. 10, pp. 4406–4416, Oct. 2018, doi: 10.1021/acs.molpharmaceut.8b00474. [35] M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller, “Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks,” ACS Cent. Sci., vol. 4, no. 1, pp. 120–131, Jan. 2018, doi: 10.1021/acscentsci.7b00512. [36] A. Gupta, A. T. Müller, B. J. H. Huisman, J. A. Fuchs, P. Schneider, and G. Schneider, “Generative Recurrent Networks for De Novo Drug Design,” Molecular Informatics, vol. 37, no. 1–2, p. 1700111, 2018, doi: 10.1002/minf.201700111. [37] D. Merk, L. Friedrich, F. Grisoni, and G. Schneider, “De Novo Design of Bioactive Small Molecules by Artificial Intelligence,” Mol Inform, vol. 37, no. 1–2, p. 1700153, Jan. 2018, doi: 10.1002/minf.201700153. [38] J. Arús-Pous et al., “Randomized SMILES Strings Improve the Quality of Molecular Generative Models,” Journal of Cheminformatics, vol. 11, no. 1, p. 71, Nov. 2019, doi: 10.1186/s13321-019-0393-0. 97 [39] T. Blaschke and J. Bajorath, “Fine-Tuning of a Generative Neural Network for Designing Multi-Target Compounds,” J Comput Aided Mol Des, vol. 36, no. 5, pp. 363–371, May 2022, doi: 10.1007/s10822-021-00392-8. [40] R. Gómez-Bombarelli et al., “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules,” ACS Cent. Sci., vol. 4, no. 2, pp. 268–276, Feb. 2018, doi: 10.1021/acscentsci.7b00572. [41] E. J. Bjerrum and B. Sattarov, “Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders,” Biomolecules, vol. 8, no. 4, Art. no. 
4, Dec. 2018, doi: 10.3390/biom8040131. [42] A. Kadurin et al., “The Cornucopia of Meaningful Leads: Applying Deep Adversarial Autoencoders for New Molecule Development in Oncology,” Oncotarget, vol. 8, no. 7, pp. 10883–10890, Dec. 2016, doi: 10.18632/oncotarget.14073. [43] S. Kang and K. Cho, “Conditional Molecular Design with Deep Generative Models,” J Chem Inf Model, vol. 59, no. 1, pp. 43–52, Jan. 2019, doi: 10.1021/acs.jcim.8b00263. [44] T. Blaschke, M. Olivecrona, O. Engkvist, J. Bajorath, and H. Chen, “Application of Generative Autoencoder in De Novo Molecular Design,” Molecular Informatics, vol. 37, no. 1–2, p. 1700123, 2018, doi: 10.1002/minf.201700123. [45] J. Lim, S. Ryu, J. W. Kim, and W. Y. Kim, “Molecular Generative Model Based on Conditional Variational Autoencoder for De Novo Molecular Design,” Journal of Cheminformatics, vol. 10, no. 1, p. 31, Jul. 2018, doi: 10.1186/s13321-018-0286-7. [46] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato, “Grammar Variational Autoencoder,” Mar. 06, 2017, arXiv: arXiv:1703.01925. Accessed: Oct. 06, 2022. [Online]. Available: http://arxiv.org/abs/1703.01925 [47] S. H. Hong, S. Ryu, J. Lim, and W. Y. Kim, “Molecular Generative Model Based on an Adversarially Regularized Autoencoder,” J. Chem. Inf. Model., vol. 60, no. 1, pp. 29–36, Jan. 2020, doi: 10.1021/acs.jcim.9b00694. [48] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, and A. AspuruGuzik, “Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models,” Feb. 06, 2018, arXiv: arXiv:1705.10843. Accessed: Oct. 06, 2022. [Online]. Available: http://arxiv.org/abs/1705.10843 [49] E. Putin et al., “Adversarial Threshold Neural Computer for Molecular de Novo Design,” Mol. Pharmaceutics, vol. 15, no. 10, pp. 4386–4397, Oct. 2018, doi: 10.1021/acs.molpharmaceut.7b01137. [50] O. Prykhodko et al., “A De Novo Molecular Generation Method Using Latent Vector Based Generative Adversarial Network,” Journal of Cheminformatics, vol. 11, no. 1, p. 74, Dec. 2019, doi: 10.1186/s13321-019-0397-9. 98 [51] O. Méndez-Lucio, B. Baillif, D.-A. Clevert, D. Rouquié, and J. Wichard, “De Novo Generation of Hit-Like Molecules from Gene Expression Signatures Using Artificial Intelligence,” Nat Commun, vol. 11, no. 1, Art. no. 1, Jan. 2020, doi: 10.1038/s41467-01913807-w. [52] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov, “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico,” Mol. Pharmaceutics, vol. 14, no. 9, pp. 3098–3104, Sep. 2017, doi: 10.1021/acs.molpharmaceut.7b00346. [53] D. Polykovskiy et al., “Entangled Conditional Adversarial Autoencoder for de Novo Drug Discovery,” Mol. Pharmaceutics, vol. 15, no. 10, pp. 4398–4405, Oct. 2018, doi: 10.1021/acs.molpharmaceut.8b00839. [54] Y. Liu and J. Bailey, “De Novo Molecular Generation with Stacked Adversarial Model,” Oct. 24, 2021, arXiv: arXiv:2110.12454. Accessed: Oct. 06, 2022. [Online]. Available: http://arxiv.org/abs/2110.12454 [55] X. Bai and Y. Yin, “Exploration and Augmentation of Pharmacological Space Via Adversarial Auto-Encoder Model for Facilitating Kinase-Centric Drug Development,” Journal of Cheminformatics, vol. 13, no. 1, p. 95, Dec. 2021, doi: 10.1186/s13321-02100574-4. [56] Q. Bai, S. Tan, T. Xu, H. Liu, J. Huang, and X. Yao, “MolAICal: A Soft Tool for 3d Drug Design of Protein Targets by Artificial Intelligence and Classical Algorithm,” Briefings in Bioinformatics, vol. 22, no. 3, p. 
bbaa161, May 2021, doi: 10.1093/bib/bbaa161. [57] V. Bagal, R. Aggarwal, P. K. Vinod, and U. D. Priyakumar, “MolGPT: Molecular Generation Using a Transformer-Decoder Model,” J. Chem. Inf. Model., vol. 62, no. 9, pp. 2064–2076, May 2022, doi: 10.1021/acs.jcim.1c00600. [58] V. Mandhana and R. Taware, “De Novo Drug Design Using Self Attention Mechanism,” in Proceedings of the 35th Annual ACM Symposium on Applied Computing, in SAC ’20. New York, NY, USA: Association for Computing Machinery, Mar. 2020, pp. 8–12. doi: 10.1145/3341105.3374024. [59] D. Grechishnikova, “Transformer Neural Network for Protein-Specific De Novo Drug Generation as a Machine Translation Problem,” Sci Rep, vol. 11, no. 1, Art. no. 1, Jan. 2021, doi: 10.1038/s41598-020-79682-4. [60] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, “Molecular De-Novo Design Through Deep Reinforcement Learning,” Journal of Cheminformatics, vol. 9, no. 1, p. 48, Sep. 2017, doi: 10.1186/s13321-017-0235-x. [61] M. Popova, O. Isayev, and A. Tropsha, “Deep Reinforcement Learning for De Novo Drug Design,” Science Advances, vol. 4, no. 7, p. eaap7885, Jul. 2018, doi: 10.1126/sciadv.aap7885. 99 [62] T. Blaschke, O. Engkvist, J. Bajorath, and H. Chen, “Memory-Assisted Reinforcement Learning for Diverse Molecular De Novo Design,” Journal of Cheminformatics, vol. 12, no. 1, p. 68, Nov. 2020, doi: 10.1186/s13321-020-00473-0. [63] B. Sanchez-Lengeling, C. Outeiral, G. L. Guimaraes, and A. Aspuru-Guzik, “Optimizing Distributions Over Molecular Space. an Objective-Reinforced Generative Adversarial Network for Inverse-Design Chemistry (ORGANIC),” Aug. 2017, doi: 10.26434/chemrxiv.5309668.v3. [64] Z. Zhou, S. Kearnes, L. Li, R. N. Zare, and P. Riley, “Optimization of Molecules via Deep Reinforcement Learning,” Sci Rep, vol. 9, no. 1, Art. no. 1, Jul. 2019, doi: 10.1038/s41598-019-47148-x. [65] A. Ghanbarpour and M. A. Lill, “Seq2Mol: Automatic Design of De Novo Molecules Conditioned by the Target Protein Sequences Through Deep Neural Networks,” Oct. 29, 2020, arXiv: arXiv:2010.15900. doi: 10.48550/arXiv.2010.15900. [66] J. Born, M. Manica, A. Oskooei, J. Cadow, G. Markert, and M. Rodríguez Martínez, “PaccMannRL: De Novo Generation of Hit-Like Anticancer Molecules from Transcriptomic Data Via Reinforcement Learning,” iScience, vol. 24, no. 4, p. 102269, Mar. 2021, doi: 10.1016/j.isci.2021.102269. [67] P.-C. Kotsias, J. Arús-Pous, H. Chen, O. Engkvist, C. Tyrchan, and E. J. Bjerrum, “Direct Steering of De Novo Molecular Generation with Descriptor Conditional Recurrent Neural Networks,” Nat Mach Intell, vol. 2, no. 5, Art. no. 5, May 2020, doi: 10.1038/s42256-0200174-5. [68] V. Chenthamarakshan et al., “CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models,” Jun. 23, 2020, arXiv: arXiv:2004.01215. doi: 10.48550/arXiv.2004.01215. [69] R. Gupta, D. Srivastava, M. Sahu, S. Tiwari, R. K. Ambasta, and P. Kumar, “Artificial Intelligence to Deep Learning: Machine Intelligence Approach for Drug Discovery,” Mol Divers, vol. 25, no. 3, pp. 1315–1360, Aug. 2021, doi: 10.1007/s11030-021-10217-3. [70] E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, and G. M. Church, “Unified Rational Protein Engineering with Sequence-Based Deep Representation Learning,” Nat Methods, vol. 16, no. 12, Art. no. 12, Dec. 2019, doi: 10.1038/s41592-019-0598-1. [71] M. Heinzinger et al., “Modeling Aspects of the Language of Life Through TransferLearning Protein Sequences,” BMC Bioinformatics, vol. 20, no. 1, p. 723, Dec. 
2019, doi: 10.1186/s12859-019-3220-8. [72] R. J. Williams and D. Zipser, “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,” Neural Comput., vol. 1, no. 2, pp. 270–280, Jun. 1989, doi: 10.1162/neco.1989.1.2.270. 100 [73] M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov. 1997, doi: 10.1109/78.650093. [74] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” Mar. 09, 2015, arXiv: arXiv:1503.02531. doi: 10.48550/arXiv.1503.02531. [75] Y. Wu et al., “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” Oct. 08, 2016, arXiv: arXiv:1609.08144. doi: 10.48550/arXiv.1609.08144. [76] L. David, A. Thakkar, R. Mercado, and O. Engkvist, “Molecular Representations in AiDriven Drug Discovery: A Review and Practical Guide,” Journal of Cheminformatics, vol. 12, no. 1, p. 56, Sep. 2020, doi: 10.1186/s13321-020-00460-5. [77] G. B. Goh, C. Siegel, A. Vishnu, N. O. Hodas, and N. Baker, “Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the Performance of Expertdeveloped QSAR/QSPR Models,” Jun. 20, 2017, arXiv: arXiv:1706.06689. doi: 10.48550/arXiv.1706.06689. [78] “RDKit.” Accessed: Oct. 17, 2022. [Online]. Available: http://www.rdkit.org/ [79] E. J. Bjerrum, “SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules,” May 17, 2017, arXiv: arXiv:1703.07076. doi: 10.48550/arXiv.1703.07076. [80] D. Polykovskiy et al., “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models,” Frontiers in Pharmacology, vol. 11, 2020, Accessed: Oct. 08, 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fphar.2020.565644 [81] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo Molecular Design,” J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096– 1108, Mar. 2019, doi: 10.1021/acs.jcim.8b00839. [82] M. Krenn, F. Häse, A. Nigam, P. Friederich, and A. Aspuru-Guzik, “Self-Referencing Embedded Strings (SELFIES): A 100% Robust Molecular String Representation,” Mach. Learn.: Sci. Technol., vol. 1, no. 4, p. 045024, Oct. 2020, doi: 10.1088/2632-2153/aba947. [83] D. S. Wigh, J. M. Goodman, and A. A. Lapkin, “A Review of Molecular Representation in the Age of Machine Learning,” WIREs Computational Molecular Science, vol. 12, no. 5, p. e1603, 2022, doi: 10.1002/wcms.1603. [84] D. Nelson and M. Cox, “Lehninger Principles of Biochemistry,” in Lehninger Principles of Biochemistry, 6th ed., New York, NY: W.H. Freeman, 2017. [85] J. Jumper et al., “Highly Accurate Protein Structure Prediction with Alphafold,” Nature, vol. 596, no. 7873, Art. no. 7873, Aug. 2021, doi: 10.1038/s41586-021-03819-2. 101 [86] A. Elnaggar et al., “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing,” May 04, 2021, arXiv: arXiv:2007.06225. doi: 10.48550/arXiv.2007.06225. [87] A. Rives et al., “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences,” Proceedings of the National Academy of Sciences, vol. 118, no. 15, p. e2016239118, Apr. 2021, doi: 10.1073/pnas.2016239118. [88] Z. Lin et al., “Language Models of Protein Sequences at the Scale of Evolution Enable Accurate Structure Prediction,” Jul. 21, 2022, bioRxiv. doi: 10.1101/2022.07.20.500902. [89] S. Haroon, H. C.a., and J. 
A.s., “Generative Pre-Trained Transformer (gpt) Based Model with Relative Attention for De Novo Drug Design,” Computational Biology and Chemistry, vol. 106, p. 107911, Oct. 2023, doi: 10.1016/j.compbiolchem.2023.107911. [90] P. Eckmann, K. Sun, B. Zhao, M. Feng, M. K. Gilson, and R. Yu, “LIMO: Latent Inceptionism for Targeted Molecule Generation,” Jun. 17, 2022, arXiv: arXiv:2206.09010. doi: 10.48550/arXiv.2206.09010. [91] K. Kaitoh and Y. Yamanishi, “Scaffold-Retained Structure Generator to Exhaustively Create Molecules in an Arbitrary Chemical Space,” J. Chem. Inf. Model., vol. 62, no. 9, pp. 2212–2225, May 2022, doi: 10.1021/acs.jcim.1c01130. [92] A. Nigam, R. Pollice, and A. Aspuru-Guzik, “Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design,” Digital Discovery, vol. 1, no. 4, pp. 390–404, Aug. 2022, doi: 10.1039/D2DD00003B. [93] A. Nigam, R. Pollice, M. Krenn, G. dos P. Gomes, and A. Aspuru-Guzik, “Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (stoned) Algorithm for Molecules Using Selfies,” Chem. Sci., vol. 12, no. 20, pp. 7079–7090, May 2021, doi: 10.1039/D1SC00231G. [94] S. Park and H. Lee, “A Molecular Generative Model with Genetic Algorithm and Tree Search for Cancer Samples,” Dec. 16, 2021, arXiv: arXiv:2112.08959. doi: 10.48550/arXiv.2112.08959. [95] S. Bhat et al., “De Novo Generation and Prioritization of Target-Binding Peptide Motifs from Sequence Alone,” Jun. 28, 2023, bioRxiv. doi: 10.1101/2023.06.26.546591. [96] H. Wu et al., “AttentionMGT-DTA: A Multi-Modal Drug-Target Affinity Prediction Using Graph Transformer and Attention Mechanism,” Neural Networks, vol. 169, pp. 623–636, Jan. 2024, doi: 10.1016/j.neunet.2023.11.018. [97] F. Wu, L. Wu, D. Radev, J. Xu, and S. Z. Li, “Integration of Pre-Trained Protein Language Models into Geometric Deep Learning Networks,” Commun Biol, vol. 6, no. 1, pp. 1–8, Aug. 2023, doi: 10.1038/s42003-023-05133-1. 102 [98] Y. Kalakoti, S. Gawande, and D. Sundar, “Estimating Protein-Ligand Interactions with Geometric Deep Learning and Mixture Density Models,” Dec. 25, 2023, bioRxiv. doi: 10.1101/2023.10.03.560738. [99] Z. Du, X. Ding, W. Hsu, A. Munir, Y. Xu, and Y. Li, “pLM4ACE: A Protein Language Model Based Predictor for Antihypertensive Peptide Screening,” Food Chem, vol. 431, p. 137162, Jan. 2024, doi: 10.1016/j.foodchem.2023.137162. [100] T. Developers, TensorFlow. (Jul. 11, 2024). Zenodo. doi: 10.5281/zenodo.12726004. [101] “Neural Machine Translation with a Transformer and Keras | Text,” TensorFlow. Accessed: Sep. 14, 2024. [Online]. Available: https://www.tensorflow.org/text/tutorials/transformer [102] A. Lo, R. Pollice, A. Nigam, A. D. White, M. Krenn, and A. Aspuru-Guzik, “Recent advances in the self-referencing embedded strings (SELFIES) library,” Digital Discovery, vol. 2, no. 4, pp. 897–908, Aug. 2023, doi: 10.1039/D3DD00044C. [103] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” J. Mach. Learn. Res., vol. 12, no. null, pp. 2121–2159, Jul. 2011. [104] N. Song, R. Dong, Y. Pu, E. Wang, J. Xu, and F. Guo, “PMF-CPI: Assessing Drug Selectivity with a Pretrained Multi-Functional Model for Compound–Protein Interactions,” Journal of Cheminformatics, vol. 15, no. 1, p. 97, Oct. 2023, doi: 10.1186/s13321-023-00767-z. [105] A. Vaswani et al., “Tensor2Tensor for Neural Machine Translation,” Mar. 16, 2018, arXiv: arXiv:1803.07416. doi: 10.48550/arXiv.1803.07416. [106] C. A. Lipinski, F. 
Lombardo, B. W. Dominy, and P. J. Feeney, “Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings,” Advanced Drug Delivery Reviews, vol. 46, no. 1, pp. 3–26, Mar. 2001, doi: 10.1016/S0169-409X(00)00129-0. [107] D. F. Veber, S. R. Johnson, H.-Y. Cheng, B. R. Smith, K. W. Ward, and K. D. Kopple, “Molecular Properties That Influence the Oral Bioavailability of Drug Candidates,” J. Med. Chem., vol. 45, no. 12, pp. 2615–2623, Jun. 2002, doi: 10.1021/jm020017n. [108] S. A. Hitchcock and L. D. Pennington, “Structure−Brain Exposure Relationships,” J. Med. Chem., vol. 49, no. 26, pp. 7559–7583, Dec. 2006, doi: 10.1021/jm060642i. [109] G. R. Bickerton, G. V. Paolini, J. Besnard, S. Muresan, and A. L. Hopkins, “Quantifying the Chemical Beauty of Drugs,” Nature Chem, vol. 4, no. 2, Art. no. 2, Feb. 2012, doi: 10.1038/nchem.1243. [110] P. Ertl, S. Roggo, and A. Schuffenhauer, “Natural Product-likeness Score and Its Application for Prioritization of Compound Libraries,” J. Chem. Inf. Model., vol. 48, no. 1, pp. 68–74, Jan. 2008, doi: 10.1021/ci700286x. 103 [111] P. Ertl and A. Schuffenhauer, “Estimation of Synthetic Accessibility Score of Drug-Like Molecules Based on Molecular Complexity and Fragment Contributions,” Journal of Cheminformatics, vol. 1, no. 1, p. 8, Jun. 2009, doi: 10.1186/1758-2946-1-8. [112] T. Liu, Y. Lin, X. Wen, R. N. Jorissen, and M. K. Gilson, “Bindingdb: A Web-Accessible Database of Experimentally Determined Protein–Ligand Binding Affinities,” Nucleic Acids Res, vol. 35, no. Database issue, pp. D198–D201, Jan. 2007, doi: 10.1093/nar/gkl999. [113] M. A. Skinnider, “Invalid SMILES Are Beneficial Rather Than Detrimental to Chemical Language Models,” Nat Mach Intell, vol. 6, no. 4, pp. 437–448, Apr. 2024, doi: 10.1038/s42256-024-00821-x. [114] Y. Huang et al., “A Robust Drug–Target Interaction Prediction Framework with Capsule Network and Transfer Learning,” Int J Mol Sci, vol. 24, no. 18, p. 14061, Sep. 2023, doi: 10.3390/ijms241814061. 104 Supplementary Figures Supplementary Figure 1. Mean Tanimoto Similarity Distributions of Generated and Shuffled sets. 105 Supplementary Figure 2. Max Tanimoto Similarity Distributions of Generated and Shuffled sets. 106 Supplementary Figure 3. Classifier Score Distributions of Generated and Shuffled sets. 107 Supplementary Figure 4. ROC Curve for Generated Sets vs. Shuffled Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend. 108