TRANSFORMER MODELS FOR PROTEIN-GUIDED DRUG COMPOUND GENERATION: A COMPARISON OF AMINO ACID SEQUENCES, PRE-TRAINED PROTEIN EMBEDDINGS, SMILES, AND SELFIES

by

Dylan Fossl

BSc. Hons., University of Northern British Columbia, 2020

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

November 2024

© Dylan Fossl, 2024

Abstract

Drug discovery is a time-consuming and costly process that notoriously suffers from low success rates. Increased availability of chemically relevant data and advances in machine learning techniques offer potential solutions for aiding in the drug development pipeline. This thesis explores the use of Transformers in the conditional generation of potential drug compounds from protein context. Building on previous research, this work implements four transformer models that take in protein information as input and generate potential binding compounds. Each model uses either SMILES or SELFIES string representations of compounds, and either amino acid sequences or pretrained ESM-2 protein embeddings as contextual input. These models are trained and compared in their ability to generate chemically feasible compounds that approximate the physiochemical properties of the training set and show binding potential specific to the contextual proteins. The utilization of SELFIES increased compound validity and diversity but overall had a negative impact on performance compared to the SMILES counterparts. Pretrained protein embeddings were shown to decrease validity but improved model performance despite no change to model structure or size. These results highlight the potential of transformer models paired with pretrained protein embeddings to enhance the drug discovery process with the generation of lead compounds from novel proteins without any fine-tuning or retraining.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Dedication
Acknowledgement
Chapter 1 - Introduction
  1.1 Background
    1.1.1 Difficulties in Drug Discovery
    1.1.2 Brief Introduction to Machine Learning
    1.1.3 Machine Learning in Drug Discovery
    1.1.4 Generative Deep Learning
  1.2 Problem Statement
  1.3 Thesis Structure
Chapter 2 - Literature Review
  2.1 Deep Learning
    2.1.1 Neural Networks
    2.1.2 Recurrent Neural Networks
    2.1.3 Transformers
    2.1.4 Sequence Generation Methods
  2.2 Representation of Biological Data
    2.2.1 Compound Representations
    2.2.2 Protein Representations
  2.3 Deep Generative Models in Drug Discovery
    2.3.1 Recurrent Neural Networks
    2.3.2 Transformers
    2.3.3 Other Model Architectures
    2.3.4 Protein Embedding Use in Drug Discovery
Chapter 3 - Thesis Statement
Chapter 4 - Methodologies
  4.1 Transformers for Compound Generation
    4.1.1 Model Data
    4.1.2 Model Structure
    4.1.3 Model Training
  4.2 Binding Classifier for Evaluation
    4.2.1 Model Data
    4.2.2 Model Structure
    4.2.3 Model Training
  4.3 Experiment Setup
  4.4 Evaluation
    4.4.1 Chemical Feasibility
    4.4.2 Physiochemical Properties
    4.4.3 Similarity to Known Binders
Chapter 5 - Results and Discussion
  5.1 Transformer Training Metrics
  5.2 Binding Classifier Training Metrics
  5.3 Chemical Feasibility
  5.4 Physiochemical Properties
  5.5 Similarity to Known Binders
    5.5.1 Tanimoto Similarities
    5.5.2 Classifier Scores
    5.5.3 Case Study
Chapter 6 - Conclusions and Future Research
  6.1 Conclusions
    6.1.1 Effect of SELFIES
    6.1.2 Effect of Embeddings
    6.1.3 Effect of Generation Method
    6.1.4 Scaling Evaluation with a Classifier
    6.1.5 Transformers Moving Forward
  6.2 Future Research
    6.2.1 Shortcomings and Limitations
    6.2.2 Improvements
  6.3 Final Statement
Bibliography
Supplementary Figures

List of Tables

Table 1. Chemical Feasibility results for each model and generation set. Bolded values are the max values for that metric and generation set between the models. Ties are both bolded and underlined.
Table 2. Chemical Feasibility of reference sets. The Ideal Set is made up of random samples from the Test set. Random sets are made up of random samples from the Training set.
Table 3. Binned absolute count of outlier compounds (>0.2) for Mean Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.
Table 4. Binned absolute count of outlier compounds (>0.3) for Max Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.
Table 5. Binned absolute count of compounds classified as binders (>0.5) for Classifier Scores. Max values for each generation set are bolded. Ties are bolded and underlined.

List of Figures

Figure 1. Visual Example of Perceptron
Figure 2. Segler et al 2018 [35] "Symbol generation and sampling process. We start with a random seed symbol s1, here c, which gets converted into a one-hot vector x1 and input into the model. The model then updates its internal state h0 to h1 and outputs y1, which is the probability distribution over the next symbols. Here, sampling yields s2 = 1. Converting s2 to x2 and feeding it to the model leads to updated hidden state h2 and output y2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired. In this example, we stop it after observing an EOL (\n) symbol, and obtain the SMILES for benzene. The hidden state hi allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later."
Figure 3. Vaswani et al. 2017 [16] "The Transformer - model architecture"
Figure 4. Example of Lewis Structure Representation.
Figure 5. David et al 2020 [76] "Example graph representation for acetic acid. a Graph representation of acetic acid with nodes numbered from one to four. b Example adjacency matrix, A, for an acetic acid graph with the corresponding node ordering on the left. c Example node features matrix, X, which one-hot encodes a few selected properties. d Example edge features matrix, E, where each edge feature vector is a one-hot encoding of single, double, or triple bonds. "Implicit Hs" stands for the number of implicit hydrogens on a given node."
Figure 6. David et al 2020 [76] "Canonical (a) and randomized (b) SMILES representations of aspirin. Randomized SMILES correspond to the various representations of a molecule obtained by randomly selecting the starting node in the graph traversal algorithm, thus changing the order of the nodes traversed in the molecular graph (still using depth-first search). Numbers represent the order of graph traversal, where 1 is the initial node (user defined). Considering a as being the canonical representation of aspirin, b shows a different ordering of the atoms of the molecule. The final SMILES is one possible SMILES among all the randomized SMILES which can be generated. Green arrows indicate how the molecular graph is traversed. Both SMILES strings shown represent the same molecule but, as the atom numberings are different, the generated SMILES strings are, too."
Figure 7. Elnaggar et al 2021 [86] "Protein LMs learned constraints. t-SNE projections visualized information extracted by the unsupervised protein LMs (here best-performing ProtT5-XL-U50; upper row: before training (Random), and lower row: after pre-training on BFD & UniRef50. (A) The left-most column highlights single amino acids by biophysical features. (B) The middle column annotates protein structural class (taken from SCOPe). (C) The right-most column distinguishes proteins according to the kingdom of life in which it is native. Although the random projections on top may suggest some adequacy of the clusters, the trained models shown on the lower row clearly stood out. Incidentally, the comparison of the two also highlighted the potential pitfalls of using t-SNE projections from many dimensions onto 2D: although random, the human might see some correct patterns even in the top row."
Figure 8. Distribution of Needleman-Wunsch Similarities Between Test and Train Amino Acids.
Figure 9. Song et al. 2023 [104] "The framework of PMF-CPI. For a compound-protein pair, the protein sequence is transformed into context embedding via pretrained TAPE and then fed into LSTM. And a compound SMILES is turned into a molecular graph by RDKit and encoded with graph neural networks. After that, two encoding vectors are merged through the Kronecker product, and MLP provides a final output. PMF-CPI can process two task types, regression for CPI binding affinities or classification for interactions"
Figure 10. Train and Test loss curves for all four models with Ideal loss visualized.
Figure 11. Evaluation Binding Classifier Train and Test AUC scores over training.
Figure 12. Distributions of Molecular Weights for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 13. Distributions of the Number of Hydrogen Acceptors and Hydrogen Donors for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 14. Distributions of Rotatable Bonds for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 15. Distributions of LogP for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 16. Distributions of Total Polar Surface Area (TPSA) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 17. Distributions of Quantitative Estimates of Drug Likeness (QED) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 18. Distributions of Natural Product Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 19. Distributions of Synthetic Accessibility Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.
Figure 20. Heatmap of Wasserstein distances between generated sets and the ideal set, and shuffled sets and the ideal set for Mean Tanimoto Similarity. For full distributions see Supplementary Figure 1.
Figure 21. Heatmap of Wasserstein distances between generated sets and the ideal set, and shuffled sets and the ideal set for Max Tanimoto Similarity. For full distributions see Supplementary Figure 2.
Figure 22. Heatmap of Wasserstein distances between generated sets and the ideal set, and shuffled sets and the ideal set for Classifier Scores. For full distributions see Supplementary Figure 3.
Figure 23. ROC Curve for Generated vs. Shuffled Sets across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.
Figure 24. ROC Curve for Generated vs. Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.
Figure 25. ROC Curve for IGF-1 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Figure 26. ROC Curve for IGF-1 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Figure 27. ROC Curve for VEGFR2 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Figure 28. ROC Curve for VEGFR2 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.
Supplementary Figure 1. Mean Tanimoto Similarity Distributions of Generated and Shuffled sets.
Supplementary Figure 2. Max Tanimoto Similarity Distributions of Generated and Shuffled sets.
Supplementary Figure 3. Classifier Score Distributions of Generated and Shuffled sets.
Supplementary Figure 4. ROC Curve for Generated Sets vs. Shuffled Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.

Dedication

Dedicated to My Wife, My Mother, and My Father

In Memory of Maria Borges, Gordon Fossl, and Gus

Acknowledgement

I wish to express my gratitude first and foremost to my supervisor, Dr. Fan Jiang. Thank you for believing in my work. Your support and guidance throughout my research have enabled me to produce work more valuable than I could have on my own.

I would like to thank my Committee Members, Dr. Alia Hamieh and Dr. Sean Maurice, for their guidance, support, and invaluable feedback throughout the course of this thesis. Their expertise and encouragement have been instrumental in shaping this work. My appreciation to Dr. Jianhui Zhou for participating in and chairing my defence. And a special thanks to Dr. Stephen Radar. The opportunity to do bioinformatic analysis in his lab was my first exposure to programming. I am not sure where I would be without his encouragement.

I want to acknowledge the support of my friends Galen Seilis, Daniel O'Reilly, and Edward Lee. Their academic support, advice, and comedic relief are the reasons I have been able to persevere through this work. I am grateful for the support of my family: my sister Cassandra Fossl, my mother Debbie Fossl, and my father Norman Fossl. This work is in memory of loved ones who have passed: my vavó Maria Borges and my uncle Gordon Fossl.
An extra special thanks to my wife, Braelyn Fossl. Her love and support have been at the foundation of all my effort throughout this research. Marrying Braelyn is an accomplishment that supersedes all else, including this work.

I wish to acknowledge the support of my colleagues at Northern Health, particularly my manager, Jim Condon. The opportunities and support you have provided me have made this work possible.

Lastly, I would like to acknowledge the financial and institutional support of the University of Northern British Columbia and the Natural Sciences and Engineering Research Council of Canada - CGSM, which made this research possible.

Chapter 1 - Introduction

1.1 Background

1.1.1 Difficulties in Drug Discovery

The search for cures to ailments and disease is likely one of the most consistent and universal human struggles. The importance of speed and innovation in the pursuit of these cures has become more salient in recent years with the emergence of COVID-19. The past decade has seen a massive increase in the amount of data available for analysis, and because of this we have seen the rise of machine learning techniques utilizing this data. The drug development space is not an exception to this trend. Advancements in RNA-Seq technologies allow for low-cost analysis of gene expression data for quick determination of compound effects, and these data are available on public databases such as the Sequence Read Archive [1] and the National Cancer Institute genomic data [2]. Protein data itself can be found in abundance on the Protein Data Bank (PDB), with more than 150,000 protein structures [3]. Drug compound information can be found thoroughly cataloged at ChEMBL, PubChem, and ZINC with 2.3 million, 112 million, and 750 million molecules, respectively [4], [5], [6]. This data provides the opportunity for the application of machine learning techniques to aid in the drug discovery process. In addition to abundant data, the cost and time demands of drug discovery also motivate this research.

The global pharmaceutical market is estimated to be worth 1.2 trillion dollars, but drug discovery itself is still plagued with low success rates [7]. A 2019 study evaluated 21,143 drug candidates and found that only 6.2% reach market [8]. This assessment only observed compounds that had made it to clinical trials and did not account for compounds that were invested in and researched but failed before clinical trials, suggesting an even lower figure. With this low success rate, the average drug development cycle from target screening to market takes 13.5 years and costs an estimated 1.8 billion dollars [9].

Finding potential drug-like compounds is also a technical challenge. It is estimated that the chemical space of drug-like compounds is as large as 10²³ to 10⁶⁰ molecules, with only approximately 10⁸ of those estimated to be therapeutically relevant [10]. With standard methods of high-throughput screening it is difficult, if not impossible, to navigate such a space efficiently [10]. It is these time, cost, and space problems, paired with the rich data availability, that motivate the research and application of machine learning in drug discovery.

1.1.2 Brief Introduction to Machine Learning

Machine Learning (ML)

Machine learning is a subset of artificial intelligence focused on developing algorithms that enable computers to learn from and make predictions or decisions based on known data.
Machine learning can be unsupervised, where the input data is unlabeled, or supervised, where the input data is mapped to output data. Supervised learning is performed by using a set of algorithms that are built to approximate the relationship between input data and some target output data [11].

Deep Learning

Deep learning is a subset of machine learning that utilizes neural networks with many layers, which are trained through a process called gradient descent [12]. It is best known for its ability to handle large unstructured datasets. The advantage of using deep learning is the ability to scale the neural network size or change the structure of the network to create many different architectures. Some popular architectures include:

Generative Adversarial Networks (GANs):
- Consists of two competing models, the generator and the discriminator. The generator's goal is to learn to generate realistic synthetic data through adversarial training, where the generator generates data and the discriminator identifies if it is real or fake [13].

Autoencoders:
- A neural network architecture primarily for data compression, noise reduction, and the creation of embeddings. It consists of two models, the Encoder and the Decoder. The Encoder takes input data and outputs a lower dimensional representation, while the Decoder takes in that representation and tries to reconstruct the original data input [12].

Recurrent Neural Networks (RNNs):
- Designed for sequential data, like time series or natural language. Built to take in sequence data one token at a time [12].

Long Short-Term Memory Networks (LSTMs):
- LSTMs are a type of RNN that contain additional mechanisms for learning long-term dependencies [14].

Gated Recurrent Units (GRUs):
- GRUs are a variant of RNN that, like LSTMs, have an additional mechanism for learning long-term dependencies. However, GRUs use one memory vector, where LSTMs use two. This makes GRUs computationally more efficient while still effectively capturing temporal dependencies in sequential data [15].

Transformers:
- Like RNNs, Transformers are designed for sequential data. The difference is that they use a self-attention mechanism to process data without recurrence. This enables parallelization for more efficient training and has shown improved performance on tasks like natural language processing [16].

Deep learning architectures relevant to this thesis will be reviewed in greater detail in Chapter 2.

1.1.3 Machine Learning in Drug Discovery

To date, machine learning has been researched as an intervention at almost all stages of the drug discovery pipeline. Many of these approaches utilize the discriminative approach of machine learning, where the models are classifying a trait or predicting some key indicator [10]. This could be predicting drug toxicity [17], solubility [18], [19], binding potential for a target [20], [21], [22], or even processing high-throughput screening images to predict candidate molecule effectiveness in assays [23]. The largest area of discriminative machine learning in drug discovery is quantitative structure-activity relationships (QSARs) and quantitative structure-property relationships (QSPRs) [24], [25], [26]. These methodologies can be used to search an already established chemical library, which is advantageous considering these compounds are known and synthesizable. However, this approach only evaluates a subset of the chemical space.
To complement discriminative approaches to machine learning, an exciting area of recent research has emerged to leverage the generative approach of machine learning. Where previous methods were concerned with identifying if a compound may have therapeutic properties, generative models seek to flip this flow and generate novel compounds given a particular property set or desired feature [10], [27], [28], [29]. Such a method carries more overhead, considering generated compounds still need to be synthesized; however, it allows for exploration of the chemical space through the generation of novel compounds.

1.1.4 Generative Deep Learning

Generative machine learning has risen in popularity as an innovative sub-field of deep learning where the goal of the model is to generate new data that resemble the training data. Generative deep learning has gained popularity recently with its advancements in text and image generation [30], [31]. In drug discovery, these models are particularly useful for generating novel molecular structures and optimizing these created structures for drug-like properties.

In the compound generation model space, a large number of models are recurrent neural networks, specifically long short-term memory models (LSTMs) or gated recurrent unit models (GRUs) [28], [32], [33], [34], [35], [36], [37], [38], [39]. Other notable models are Variational AutoEncoders/HeteroEncoders (VAEs/HEs) [27], [40], [41], [42], [43], [44], [45], [46], [47], Generative Adversarial Networks (GANs)/Adversarial AutoEncoders (AAEs) [48], [49], [50], [51], [52], [53], [54], [55], [56], and Transformers [57], [58], [59]. These architectures are often paired with a reinforcement learning element in studies to optimize the generated molecules for a variety of properties [48], [60], [61], [62], [63], [64], [65], [66].

For the compound generation methods above, additional steps are needed to generate compounds for a specific target. One option is to train or fine-tune the model on a subset of molecules that are known inhibitors of a target [35], [36]. The model can be conditioned on a property [40]. Or, the model can be reinforced for generating compounds that bind to a specific target [56]. A broader approach to this specific conditioning is general conditioning. Rather than being conditioned to some specific target or property, the model can take some condition as input and generate compounds relevant to that condition without further training or optimization. That condition can be chemical properties [43], [45], [47], [53], [57], [67], protein target information [59], [65], [68], or gene expression data [51], [66]. The motivation of a generally conditioned model is to potentially generate compounds for targets that may lack or have minimal training data available. Chapter 2, the Literature Review, will go into a more detailed exploration of some of the cited work above.

1.2 Problem Statement

In the drug development pipeline, one of the earliest goals is to identify a drug target. Much of the time this comes down to a specific protein or a set of protein targets. The next step is the search for a candidate drug compound that can bind and inhibit the chosen target [10], [69]. Currently, virtual screening techniques are reliant on having data sets of known compounds that already bind the protein and/or having detailed structural information about the target [69]. For new emerging targets, such as pandemic viruses, these techniques can be rendered ineffective.
The development of generative models that can generate compounds with only non-structural protein data can allow for the identification of lead compounds without any prior knowledge of what compounds interact with the target, or of the structure of the target. Of particular interest to us is the use of amino acid sequences of targets for the generation of potential binding molecules. An advantage of amino acid sequences is the ability to avoid the computational complexity that comes from working with protein 3-dimensional structures.

There are three models, to my knowledge, that attempt to tackle this issue. CogMol from Chenthamarakshan et al 2020 [68] is a model that generates potential binding molecules for a given target by utilizing two models in tandem: a model to generate compounds and a model to predict affinity. The affinity model is then used to fine-tune molecule generation and is trained using pretrained protein embeddings from UniRep [70]. A downside of this is the need to fine-tune the model for each target. Ghanbarpour et al 2020 [65] attempted this through the use of RNNs and reinforcement learning techniques for fine-tuning. The model also makes use of a pretrained protein embedding network, ELMo [71]. Grechishnikova 2021 [59] attempted this with the application of transformers, which take amino acid sequences as context to generate potential binders.

Ghanbarpour et al 2020 [65] and Grechishnikova 2021 [59] are unique in being the two models that attempt to create a single all-in-one model that takes in amino acid sequences as input and outputs potential binding compounds. This is to say, a model that does not need to do any optimization when given an amino acid sequence and can simply produce molecules directly, in a deterministic manner. I will be tackling this problem by utilizing transformers similar to Grechishnikova 2021 [59] with the additional improvement of new protein embeddings, similar to those utilized in Ghanbarpour et al 2020 [65].

1.3 Thesis Structure

The remainder of the thesis will have five chapters: Literature Review, Thesis Statement, Methodologies, Results and Discussion, and Conclusions and Future Research.

Chapter 2 - Literature Review:
- This chapter will provide a high-level background into the mathematics of deep learning, neural networks, recurrent neural networks, and transformers needed for understanding the proposed work and cited literature. The section will also review how biological data is represented for work in machine learning and review in detail the application of deep generative models in drug discovery.

Chapter 3 - Thesis Statement:
- This chapter will present the primary models to be explored in the thesis, their rationale, and any expectations or hypotheses.

Chapter 4 - Methodologies:
- This chapter will review in detail the structure of each of the models, the hyperparameters chosen, the training process for the models, the compound generation process, and the evaluation criteria. In addition to exploring the models' designs, the source data for the project will be introduced.

Chapter 5 - Results and Discussion:
- This chapter will review the results primarily through figures and tables with some notes on apparent trends and provide discussion on the findings. The results will review the models' training results, the chemical feasibility of the generated compounds, the physiochemical properties of the generated compounds, and how well the compounds approximate true binders.
Chapter 6 - Conclusions and Future Research:
- This chapter will review the results and their potential meanings and implications in detail. It will also review potential shortcomings and limitations of the work, in addition to suggestions for future work and improvements, and will conclude with a final statement.

Chapter 2 - Literature Review

2.1 Deep Learning

This work will make many references to deep learning and some specific deep learning architectures. In this section we hope to introduce neural networks and the associated architectures important for understanding this work.

2.1.1 Neural Networks

Neural networks are a machine learning technique that utilizes multiple layers of functions applied to some input vector, with these functions having parameters that can be learned [12]. The first function applied to the input vector is the first layer, the second function is the second layer, and so on. It is common convention to name the first layer the input layer, the last layer the output layer, and all layers in between hidden layers.

Neural networks are often defined through a fundamental unit called a perceptron. The perceptron is a node that represents a linear function wrapped with an activation function. Figure 1 shows an example of a perceptron where W_i and b are learnable parameters, and the sigmoid function is the activation function. The activation function is a non-linear function that is applied to the output of the perceptron. The sigmoid function is often used as an activation function as it maps the linear function's output between 0 and 1.

If we define x as the vector [x_1, x_2]^T and W as [W_1, W_2]^T, then a perceptron can be defined through a singular function:

    f(x) = σ(x^T W + b).    (1)

Note, the input vector can be any size; two is used as an example. A layer is then defined as a function that takes in an input vector of shape (n, 1) and applies k perceptrons concurrently to produce an output of shape (k, 1). This makes up the fundamental unit of neural networks, and all abstractions past this point vary primarily in the structure of these perceptrons and techniques for how the models learn.

Figure 1. Visual Example of Perceptron

2.1.2 Recurrent Neural Networks

A recurrent neural network is a neural network that is fed sequence data recursively [12]. This means that each input x_t depends on information from x_{t-1}. The RNN's structure at a fundamental level is a normal neural network where the input is an input vector x_t and a hidden vector h_{t-1}. These are put through the neural network and two things are output: y_t and h_t. y_t is the prediction for the next character in the sequence; h_t is the hidden vector. The next step of the process uses the same network and now puts in h_t and y_t as inputs. y_t is essentially the predicted x_{t+1}. In training it is popular to feed in the real x_{t+1} value, which is called teacher forcing [72]; however, when generating sequences after training, y_t is fed into the next step as the input.

Figure 2. Segler et al 2018 [35] "Symbol generation and sampling process. We start with a random seed symbol s1, here c, which gets converted into a one-hot vector x1 and input into the model. The model then updates its internal state h0 to h1 and outputs y1, which is the probability distribution over the next symbols. Here, sampling yields s2 = 1. Converting s2 to x2 and feeding it to the model leads to updated hidden state h2 and output y2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired.
In this example, we stop it after observing an EOL (\n) symbol, and obtain the SMILES for benzene. The hidden state hi allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later."

Figure 2 visualizes this process for generating a sequence of SMILES characters. RNNs are described by the following equations:

    h_t = G(W_h · [x_t, h_{t-1}] + b_h),    (2)
    y_t = W_hy h_t + b_y.    (3)

Where x_t is the input, y_t is the output, and h_t is the hidden state. The W's are weight matrices and the b's are biases. G is the activation function. Note that the bracket and comma notation "W_h · [x_t, h_{t-1}]" means two separate weight matrices are multiplied and added, which translates to "W_hx x_t + W_hh h_{t-1}".

Long Short-Term Memory (LSTM)

The long short-term memory cell was originally suggested by Hochreiter and Schmidhuber 1997 [14]. Regular RNNs must pass previous sequence information through a single hidden vector. This makes it difficult to hold onto long-term contextual sequence information. The LSTM cell allows for long-term memory storage using a context vector. As originally described, an LSTM cell is given by the following equations:

    f_t = σ(W_f · [x_t, h_{t-1}] + b_f),    (4)
    i_t = σ(W_i · [x_t, h_{t-1}] + b_i),    (5)
    C̃_t = tanh(W_C · [x_t, h_{t-1}] + b_C),    (6)
    C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t,    (7)
    o_t = σ(W_o · [x_t, h_{t-1}] + b_o),    (8)
    h_t = o_t ⊙ tanh(C_t).    (9)

Where σ is the sigmoid function, and f_t, i_t, C̃_t, C_t, and o_t are the forget gate, the input gate, the temporary context vector, the context vector, and the output gate, respectively. Equations (4) and (5) are similar to equation (2) for a regular RNN. As can be seen in (7), f_t and i_t are used to determine how much of the previous context vector and the temporary context vector will be added to make the new context vector. In (9) we see o_t dictating what of the new context vector should be passed as the hidden state. Overall, the RNN maintains its structure, but with the LSTM cell two vectors are output: the hidden state vector and the context vector.

Gated Recurrent Unit (GRU)

GRUs were described in Chung et al. 2014 [15]. The GRU cell, like the LSTM cell, has the goal of passing both long-term and short-term memory; however, GRU cells store long- and short-term memory in one hidden state. GRU cells can be described by the following equations:

    r_t = σ(W_r · [x_t, h_{t-1}] + b_r),    (10)
    z_t = σ(W_z · [x_t, h_{t-1}] + b_z),    (11)
    h̃_t = tanh(W_h · [x_t, (r_t ⊙ h_{t-1})] + b_h),    (12)
    h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t.    (13)

Where r_t, z_t, and h̃_t are the reset gate, update gate, and temporary hidden state, respectively. It can be seen in (12) that r_t is being used to determine what portions of the previous hidden state should be included in determining the temporary hidden state. Then in (13), z_t is used to determine what proportions of the previous hidden state and the temporary hidden state should be combined to form the new hidden state.

Bidirectionality and Layers

Bidirectional RNNs describe a similar structure to regular RNNs. The difference is that two RNNs are generated, each of which reads the sequence in the opposite direction [73]. All calculations are the same except that in the final output the hidden states are concatenated together. This does not change any equations. Bidirectional RNNs are often used to encode a sequence and are not used to generate. In a similarly trivial fashion, RNN layers describe constructing multiple RNNs that feed into each other. In this case, the first RNN will calculate an output y_t that will be fed into another RNN as an input.
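To make the gating equations concrete, below is a minimal sketch of a single LSTM cell step following equations (4) to (9), written in NumPy. It is an illustration only, not the implementation used in this thesis; the weight layout, dictionary keys, and dimensions are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following equations (4)-(9).

    W is a dict of weight matrices, each of shape (hidden, input + hidden),
    and b a dict of bias vectors of shape (hidden,). These names are
    illustrative choices, not the thesis implementation.
    """
    # Concatenate input and previous hidden state: [x_t, h_{t-1}]
    xh = np.concatenate([x_t, h_prev])

    f_t = sigmoid(W["f"] @ xh + b["f"])        # forget gate, eq. (4)
    i_t = sigmoid(W["i"] @ xh + b["i"])        # input gate, eq. (5)
    c_tilde = np.tanh(W["c"] @ xh + b["c"])    # temporary context vector, eq. (6)
    c_t = f_t * c_prev + i_t * c_tilde         # new context vector, eq. (7)
    o_t = sigmoid(W["o"] @ xh + b["o"])        # output gate, eq. (8)
    h_t = o_t * np.tanh(c_t)                   # new hidden state, eq. (9)
    return h_t, c_t

# Toy usage with random parameters
rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16
W = {k: rng.normal(scale=0.1, size=(n_hidden, n_in + n_hidden)) for k in "fico"}
b = {k: np.zeros(n_hidden) for k in "fico"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_cell_step(rng.normal(size=n_in), h, c, W, b)
```

In practice, deep learning frameworks provide optimized LSTM and GRU layers; the point of the sketch is only to show that each gate is a small perceptron-like function of the concatenated input and previous hidden state.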
In this case, the first RNN will calculate an output 𝑦𝑡 that will be fed into another RNN as an input. 2.1.3 Transformers The transformer architecture was originally introduced by Vaswani et al 2017 [16]. This architecture started a revolution in the processing of unstructured sequence data, namely textual data. Unlike recurrent neural networks (RNNs) and their variants long-short term memory (LSTM) networks and Gated Recurrent Units (GRUs), transformers do not rely on sequential data processing. Instead, they leverage a mechanism known as self-attention to process all elements of the input sequence simultaneously, which allows for more efficient computation and better handling of long-range dependencies. Transformers have an encoder-decoder structure where both the encoder and decoder have an almost identical structure composed of two main components, multi-head self-attention mechanisms and a position-wise feed forward network [16]. Attention The purpose of the attention mechanism is to allow the model to focus on different parts of sequences simultaneously. Attention is described by the following equation: 𝑄 ′𝐾′𝑇 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄′, 𝐾′, 𝑉′) = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 ( ) 𝑉′. √𝑑𝑚𝑜𝑑𝑒𝑙 (14) Where 𝑄′, 𝐾′, and 𝑉′ are the query, key, and value matrices, respectively. 𝑄′ has a shape of (𝑛, 𝑑𝑚𝑜𝑑𝑒𝑙 ) where 𝑛 is the length of the sequence and 𝑑𝑚𝑜𝑑𝑒𝑙 is the size of the embedding. 𝐾′ 16 and 𝑉′ both have a shape of (𝑚, 𝑑𝑚𝑜𝑑𝑒𝑙 ) where 𝑚 is the length of the sequences. 𝑄′, 𝐾′, and 𝑉′ are calculated from a linear transformation performed on three candidate sequences, 𝑄, 𝐾, and 𝑉. 𝑄 ′ = 𝑄𝑊𝑄 , (15) 𝐾 ′ = 𝐾𝑊𝐾 , (16) 𝑉 ′ = 𝑉𝑊𝑉 . (17) Where 𝑄 has shape (𝑛, 𝑑𝑚𝑜𝑑𝑒𝑙 ), 𝐾 and 𝑉 both have a shape of (𝑚, 𝑑𝑚𝑜𝑑𝑒𝑙 ), and 𝑊𝑄 , 𝑊𝐾 , and 𝑊𝑉 have a shape of (𝑑𝑚𝑜𝑑𝑒𝑙 , 𝑑𝑚𝑜𝑑𝑒𝑙 ). In Self-Attention 𝑄, 𝐾, and 𝑉 are all the same sequence. In Cross-Attention 𝐾 and 𝑉 are the same sequence, and 𝑄 is a different sequence. To control where attention is applied, Masked-Attention can be implemented. Masked-Attention can be defined with the following: 𝑀𝑎𝑠𝑘𝑒𝑑𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄 ′, 𝐾 ′ , 𝑉 ′ , 𝑀) = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 ( 𝑄 ′𝐾′𝑇 √𝑑𝑚𝑜𝑑𝑒𝑙 + 𝑀) 𝑉′, (18) Where 𝑀 is a matrix that contains −∞ where the attention is chosen to be masked and 0′𝑠 elsewhere. For example, when preventing paying attention to tokens ahead in the sequence in Masked Self-Attention, 𝑀 [𝑖, 𝑗] = −∞ for 𝑗 > 𝑖 is applied. Multi-Headed Attention extends attention to multiple subspaces using the following equations: 𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑(𝑄, 𝐾, 𝑉) = 𝐶𝑜𝑛𝑐𝑎𝑡(ℎ𝑒𝑎𝑑1 , ℎ𝑒𝑎𝑑2 , … , ℎ𝑒𝑎𝑑ℎ )𝑊𝑂 , (19) ℎ𝑒𝑎𝑑𝑖 = 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄𝑊𝑄𝑖 , 𝐾𝑊𝐾𝑖 , 𝑉𝑊𝑉𝑖 ). (20) Where the concatenated heads have a shape of (𝑛, ℎ ∙ 𝑑𝑚𝑜𝑑𝑒𝑙 ) and 𝑊𝑂 has a shape of (ℎ ∙ 𝑑𝑚𝑜𝑑𝑒𝑙 , 𝑑𝑚𝑜𝑑𝑒𝑙 ). 𝑊𝑂 helps project the concatenated heads back into the 𝑑𝑚𝑜𝑑𝑒𝑙 dimensional 17 space. The additional weighted matrices, 𝑊𝑄𝑖 , 𝑊𝐾𝑖 , and 𝑊𝑉𝑖 , are unique to each head and learn to perform a unique transformation on 𝑄, 𝐾, and 𝑉 for that attention head. This allows the transformer to capture different types of relationships in different subspaces [16]. Positional Encoding The transformer does not process the sequence sequentially so the concept of order and distance within the sequence needs to be added. To accomplish this the positional encodings are added to the inputs of the model. These encodings are calculated using sine and cosine: 𝑝𝑜𝑠 𝑃𝐸(𝑝𝑜𝑠,2𝑖) = 𝑠𝑖𝑛 ( ), 100002𝑖⁄𝑑𝑚𝑜𝑑𝑒𝑙 𝑝𝑜𝑠 𝑃𝐸(𝑝𝑜𝑠,2𝑖+1) = 𝑐𝑜𝑠 ( ). 100002𝑖⁄𝑑𝑚𝑜𝑑𝑒𝑙 (21) (22) Where 𝑑𝑚𝑜𝑑𝑒𝑙 is the dimension of the embedding, 𝑝𝑜𝑠 is the position of the embedding vector within the sequence and 𝑖 is the dimension within the embedding. 
Transformer Encoder and Decoder Structure

The Encoder and Decoder of a transformer have very similar structures. The Encoder takes in an input sequence, calculates an embedding, adds the positional encoding, calculates the Multi-Head Self-Attention, and passes the results through a feed forward network. The embedding layer's dimension is d_model and the feed forward layer's dimension is d_ff. For the Decoder the same initial steps are taken, except the Multi-Head Self-Attention is masked. The Decoder also has a second attention process that is Cross-Attention, where K and V are the outputs of the Encoder and Q is from the Decoder. If there are multiple layers in the Encoder and/or Decoder, the output of the feed forward network in the previous layer is passed to the next layer with the same structure. The Encoder-Decoder structure is outlined in Figure 3.

Figure 3. Vaswani et al. 2017 [16] "The Transformer - model architecture"

2.1.4 Sequence Generation Methods

For deep learning models there are different methodologies for generating token sequences from the model. Three common generation methods are argmax sampling, random sampling with temperature, and beam search. These models output a discrete probability distribution over the possible next token in a sequence. In argmax sampling, the most probable token is selected. In random sampling with temperature, this distribution is sampled from [74]. The temperature is a mechanism to scale the probability distribution. Given a discrete probability distribution p and a temperature T, the new discrete probability distribution p' is defined as:

    p'_i = p_i^(1/T) / Σ_{j=1}^{n} p_j^(1/T).    (23)

As can be seen in equation (23), as T approaches infinity, the larger probabilities are scaled down much more than the smaller probabilities, flattening the distribution and making all tokens equally likely. As T approaches 0, the higher probabilities are scaled up much more than the lower probabilities, making the sampling approach argmax sampling.

In beam search, a beam size is determined. For a beam size of n, the top n most probable tokens are sampled at each step. For the first step n tokens are sampled, then on the second step each of those n tokens generates k candidate tokens, where k is the size of the vocabulary. This creates n · k token sequences to choose from. The probabilities of the first token and the second token are multiplied to get the probabilities for the two-token sequences, and only the top n are kept. This is then repeated for the third token and so on until a max length is hit or all n token sequences hit an end token. This methodology of beam search is known as max-probability based beam search [75]. A downside of this methodology is that it can penalize longer sequences. To correct this, length normalization can be applied [75]. When using an implementation with normalization, the selection of sequences then depends on some scoring function. In Wu et al. 2016's [75] implementation, the top n scoring sequences are explored in each step. If any sequences reach an end token they are stored, and the remaining top n sequences are explored until either all sequences reach an end token or a max sequence length is reached [75]. This implementation of beam search is discussed further in Chapter 4.
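Below is a minimal sketch of argmax sampling and random sampling with temperature as defined in equation (23). The next-token distribution here is made up purely for illustration, and beam search is omitted for brevity.

```python
import numpy as np

def apply_temperature(p, T):
    """Rescale a discrete distribution p by temperature T, following eq. (23)."""
    scaled = p ** (1.0 / T)
    return scaled / scaled.sum()

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])   # model's next-token distribution (illustrative)

greedy = int(np.argmax(p))                               # argmax sampling
cool = rng.choice(len(p), p=apply_temperature(p, 0.5))   # T < 1 sharpens toward argmax
hot = rng.choice(len(p), p=apply_temperature(p, 2.0))    # T > 1 flattens the distribution
```

Repeating the sampling step token by token, and feeding each sampled token back into the model, yields the generation loop described for RNNs in Figure 2 and used by the transformer decoder at inference time.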
2.2 Representation of Biological Data

2.2.1 Compound Representations

The most common representation of molecules that people are familiar with is the molecular drawing, where atoms are represented by alphabetic characters and bonds between atoms are denoted with lines connecting them. These representations are called Lewis structures (Figure 4) [76]. Although simply utilizing molecular images is possible [77], for most computational problems an appropriate digital representation of the molecule is preferred.

Figure 4. Example of Lewis Structure Representation.

There are multiple goals when developing and deciding between chemical representations. Two of the most important qualities of representations are uniqueness and ambiguity. A chemical representation is considered unique if for every molecule it only produces one valid representation. A chemical representation is non-unique if for one molecule it can generate many representations. A representation is considered ambiguous if one representation can represent many molecules, and it is considered unambiguous if each representation can only represent one molecule. In simpler terms, uniqueness describes a one-to-one relationship from molecule to representation, and non-uniqueness represents a one-to-many relationship from molecule to representations. In reverse, ambiguity represents a one-to-many relationship from representation to molecules, while non-ambiguity represents a one-to-one relationship from representation to molecule [76]. Lewis structures under these categories would then represent a unique, non-ambiguous representation, since a drawn structure can only represent one molecule, and one molecule only has one correct drawing, after accounting for any rotational invariance [69].

Figure 5. David et al 2020 [76] "Example graph representation for acetic acid. a Graph representation of acetic acid with nodes numbered from one to four. b Example adjacency matrix, A, for an acetic acid graph with the corresponding node ordering on the left. c Example node features matrix, X, which one-hot encodes a few selected properties. d Example edge features matrix, E, where each edge feature vector is a one-hot encoding of single, double, or triple bonds. "Implicit Hs" stands for the number of implicit hydrogens on a given node."

To create suitable representations of molecules from structural drawings, researchers early on decided to model compounds as graphs. In this modelling, atoms are viewed as nodes in a graph and bonds are considered edges. Once a graph is constructed, opportunities for mapping the molecule to different representations become possible. The simplest representation to use in the case of a molecular graph is an adjacency matrix, where each index maps to a specific atom in the molecule (Figure 5) [76]. This representation has the advantage of being inherently numerical and has the extensibility to add additional information about the molecule (Figure 5). This information can include properties of the atoms as well as properties describing the bonds between atoms. Another advantage is that all sub-graphs within these representations are interpretable [76]. A representation like this is also unique and unambiguous. A significant disadvantage of this representation is the cost of space and its sparseness. This can lead to potential memory limitations and computational bottlenecks when applied to large problems [76]. A linear chemical representation can circumvent these limitations.
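As a brief illustration of the graph view described above, the following sketch builds the adjacency matrix and simple node and edge features for acetic acid using RDKit (assuming RDKit is installed); it is a simplified example rather than a full featurization pipeline.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)O")          # acetic acid, as in Figure 5

# Adjacency matrix: entry (i, j) is 1 if atoms i and j are bonded
A = Chem.GetAdjacencyMatrix(mol)

# Simple per-atom (node) features, analogous to the node feature matrix in Figure 5
node_features = [
    (atom.GetSymbol(), atom.GetTotalNumHs(), atom.GetFormalCharge())
    for atom in mol.GetAtoms()
]

# Per-bond (edge) features: the bond type for each edge in the molecular graph
edge_features = [
    (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), str(bond.GetBondType()))
    for bond in mol.GetBonds()
]
```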
The most prominent linear representation of molecules is the Simplified Molecular Input Line Entry System (SMILES). SMILES strings are constructed by traversing over the graph representation of a molecule in a depth-first manner. Atoms can be represented by their alphabetic symbols, and additional characters can be used to handle geometric information. A negative of this representation is that it loses the extensibility that graph representations have, and a SMILES substring cannot be guaranteed to be interpretable [76]. However, this representation is advantageous for its compact form. Any depth-first traversal of a molecular graph can generate a valid SMILES string. For this reason, SMILES strings on their own are non-unique. But any SMILES string can only be decoded into one molecule, making the representation non-ambiguous. To correct for the non-uniqueness of SMILES, several canonicalization algorithms have been developed with traversal rules that construct unique SMILES strings [76] (Figure 6). There are many existing canonicalization algorithms, but the most prominent of these is the open-source implementation in the Python chemical toolkit RDKit [78]. The non-uniqueness of SMILES in some circumstances can be an advantage, as shuffling SMILES into multiple valid configurations has been utilized as a data augmentation technique [79]. For generative deep learning tasks, a prominent issue with the use of SMILES strings is the frequent low success in generating valid molecules [80], [81]. For a SMILES string, if a single character is wrong the whole string can be invalid, resulting in no interpretable molecule being formed.

Figure 6. David et al 2020 [76] "Canonical (a) and randomized (b) SMILES representations of aspirin. Randomized SMILES correspond to the various representations of a molecule obtained by randomly selecting the starting node in the graph traversal algorithm, thus changing the order of the nodes traversed in the molecular graph (still using depth-first search). Numbers represent the order of graph traversal, where 1 is the initial node (user defined). Considering a as being the canonical representation of aspirin, b shows a different ordering of the atoms of the molecule. The final SMILES is one possible SMILES among all the randomized SMILES which can be generated. Green arrows indicate how the molecular graph is traversed. Both SMILES strings shown represent the same molecule but, as the atom numberings are different, the generated SMILES strings are, too."

Krenn et al 2020 [82] addressed this issue by developing a linear compound representation called Self-Referencing Embedded Strings (SELFIES). In their paper they describe an algorithm that can systematically construct a completely syntactically and semantically robust grammar from a set of provided example words, where "word" means a sequence of characters from a defined alphabet [82]. Using this method, the researchers were able to learn a grammar using 72 million molecules from PubChem. This learned grammar allows for the translation from SMILES strings to SELFIES strings and guarantees that any order of characters in the SELFIES alphabet can be decoded into a valid SMILES string, and therefore a valid molecule [82].
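To illustrate these string representations, the sketch below canonicalizes a SMILES string with RDKit, generates a randomized (non-canonical) SMILES, and converts to and from SELFIES with the selfies package. Both packages are assumed to be installed, and the aspirin SMILES is only an example input.

```python
from rdkit import Chem
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"            # aspirin (example input)
mol = Chem.MolFromSmiles(smiles)

# Canonical SMILES: RDKit's canonicalization yields one unique string per molecule
canonical = Chem.MolToSmiles(mol)

# Randomized SMILES: a different atom ordering gives a different, equally valid string
randomized = Chem.MolToSmiles(mol, canonical=False, doRandom=True)

# SELFIES round trip: every SELFIES string decodes to a valid SMILES string
selfies_string = sf.encoder(smiles)
decoded_smiles = sf.decoder(selfies_string)

# The decoded SMILES may differ textually but should describe the same molecule
roundtrip = Chem.MolToSmiles(Chem.MolFromSmiles(decoded_smiles))
print(canonical == roundtrip)
```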
Another common class of compound representations is feature-based representations, namely structural keys and molecular fingerprints [83]. These representations are important for many common metrics evaluating compound properties [80], [81]. Structural keys are a representation where a specific set of chemical substructures is scanned for in a molecule and the number of occurrences of each substructure is counted and held in an array where each index represents a specific substructure. This representation is unique but ambiguous [76]. The theory behind such an implementation is that these substructures hold important chemical activity, so recording their presence and occurrences is enough to capture the chemical essence of a compound in a compact and simple form factor [76].

Molecular fingerprints are hashed representations of molecules, with the most popular being Morgan fingerprints [76]. These representations are constructed by first defining a maximum path length. Then, for 0 to the maximum path length, every possible path in the molecular graph is generated as a string. The set of these strings is then hashed into an array of some predetermined size. Additions into the array are handled by an "exclusive or", meaning any even number of collisions results in a 0 bit and any odd number of collisions results in a 1 bit. This makes the representation ambiguous and potentially non-unique [76]. However, the advantages of such a method are that it is quick to calculate, it is uniform in size, and it uses a whole representation of the molecule to determine a specific hash rather than predefined substructures as in the structural key representation. Its structure also makes it possible to perform simple distance calculations for determining how similar two molecules are [76]. Both structural keys and molecular fingerprints are popular for comparing molecules with a variety of metrics [80], [81].

2.2.2 Protein Representations

Proteins, by definition, are still compounds and, in theory, can be represented with the aforementioned techniques. However, because of their size, such representations are impractical. For example, drug-like molecules often have under 50 atoms, while proteins can have on the order of thousands of atoms [76]. Although large, proteins have the advantage that they often have a specific three-dimensional structure. Because of this, a prominent and practical representation for proteins is a simple coordinate system that defines the position of each atom with an x, y, z coordinate in atomic measurement units (Angstroms) [76]. The Protein Data Bank (PDB) is a database that contains more than 150,000 protein structures [3]. A downside of this representation of proteins is that 3D data is more complicated to work with from a machine learning perspective. This motivates the development of linear representations of proteins.

Proteins can be described by a single linear sequence of 20 possible amino acids. This amino acid sequence can act as a unique identifier but doesn't hold any structural information. However, it is important to note that each amino acid sequence maps to a specific protein structure [84]. The prediction of protein structures from amino acid sequence information alone is an area of rich research, with AlphaFold2 recently achieving high accuracy in the prediction of protein structure from amino acid data alone [85].
Recently, a large area of study has emerged for applying large language model techniques to the generation of information rich linear protein embeddings from amino acid sequences alone [70], [71], [86], [87], [88]. These models are called Protein Language Models (PLMs). All these PLMs follow a similar methodology. An architecture for a large language model, likely a RNN variant or Transformer, is adapted to take in amino acid sequences and perform next-in-sequence or gap-in-sequence predictions. After performing this task the hidden layers of the model are used to construct a vector representation for the protein [70], [71], [86], [87], [88]. These embeddings have shown success in categorizing amino acid sequences with biological information [70], [71], [86], [87], [88]. Figure 7 shows a plot from Elnaggar et al 2021 [86] for the embeddings developing an unsupervised but biologically relevant structure in the latent space. These protein embeddings can therefore be repurposed into other machine learning tasks. 27 Figure 7. Elnaggar et al 2021 [86] “Protein LMs learned constraints. t-SNE projections visualized information extracted by the unsupervised protein LMs (here best-performing ProtT5-XL-U50; upper row: before training (Random), and lower row: after pre-training on BFD & UniRef50. (A) The left-most column highlights single amino acids by biophysical features. (B) The middle column annotates protein structural class (taken from SCOPe). (C) The right-most column distinguishes proteins according to the kingdom of life in which it is native. Although the random projections on top may suggest some adequacy of the clusters, the trained models shown on the lower row clearly stood out. Incidentally, the comparison of the two also highlighted the potential pitfalls of using t-SNE projections from many dimensions onto 2D: although random, the human might see some correct patterns even in the top row.” 2.3 Deep Generative Models in Drug Discovery Generative deep learning has been applied to the compound generation space for years. In these years multiple model architectures have been explored. In this section we will review some major papers in this research space highlighting the architectures most relevant to this thesis’ goals, RNNs and Transformers. 28 2.3.1 Recurrent Neural Networks Out of all the architectures utilized in generative drug discovery, RNN based models (LSTMs and GRUs) appear to be the most prevalent. Bjerrum et al. 2017 [32] was the first paper to generate SMILES compounds via an LSTM. Although not attempting to generate compounds for drug discovery, this paper showed the feasibility of generating valid, novel, chemically diverse SMILES strings with LSTMs. In the following year multiple papers arose exploring this technique for drug like compounds [35], [36], [37], [65], [67]. Segler et al. 2018 [35] took the method established in Bjerrum et al. 2017 [32] and expanded it specifically for drug discovery. After training an LSTM model to generate valid compound from ChEMBL, they fine-tuned the model on 732 molecules that are active for a specific receptor HT2A. The fine-tuning method was an iterative approach first fine tuning on real binders and then fine tuning on the best molecules generated. Gupta et al. 2018 [36] took this work and expanded it by training on a broader more diverse compound set. Merk et al. 2018 [37] was the first to validate this methodology of fine tuning in vitro. 
In their research they trained an LSTM to generate SMILES compounds and fine-tuned it on retinoid X and peroxisome proliferator-activated receptor agonists. The lab synthesized the five top ranking compounds of which four showed activity in cell-based assays. Kotsias et al. 2020 [67] was one of the first papers in the RNN space to explore context based generation. The goal of this research was to create a model that can generate compounds with a desired set of properties. In training they calculated a set of chemical properties and bioactivities for each molecule and used those values as an initial contextual input in the model 29 during training. This resulted in a model that provided a set of desirable values as inputs could generate a set of compounds that estimate those values. Ghanbarpour et al. 2020 [65] expanded on this concept of conditioning an RNN for compound generation by utilizing a dataset of protein compound binding pairs and training the RNN with protein data for contextual input. The protein context used in this work were embeddings calculated from amino acid sequences using a large protein language model ELMo [71]. 2.3.2 Transformers As Transformers have been used in place of RNNs in natural language problems in recent years, the same is true for RNNs in compound generation. Mandhana et al. 2020 [58] took the RNN approach originally proposed in Bjerrum et al. 2017 [32] and simply expanded it to a Transformer Decoder specific architecture. Bagal et al. 2022 [57] took an approach similar to Kotsias et al. 2020 [67], where they took at Transformer Decoder architecture and utilized compound property information as a conditional input for the model for the generation of compounds sets with desired properties. However, in addition to properties, the model also takes in compound scaffolds as an input SMILES string. In practice that means this model can take in reference molecules and desired properties and make additions to the scaffold that align with the provided properties. Haroon et al. 2023 [89] took this work and expanded on in by applying a different attention mechanism called relative attention. This change made improvements in novelty and validity metrics. Grechishnikova 2021 [59] followed a similar process as Ghanbarpour et al. 2020 [65] where protein information was used as contextual information in 30 compound generation. It diverts from this work, however, by not using pre-trained protein embeddings but rather the raw amino acid sequences as contextual input into the model. 2.3.3 Other Model Architectures Although less relevant to the models used in our research, it is important to recognize that other models outside of RNNs and Transformers have been extensively explored in compound generation to some success. Notably, Variational AutoEncoders/HeteroEncoders (VAEs/HEs) [27], [40], [41], [42], [43], [44], [45], [46], [47], and Generative Adversarial Networks (GANs)/Adversarial AutoEncoders (AAEs) [48], [49], [50], [51], [52], [53], [54], [55], [56]. VAEs are of particular interest because they are the Deep Learning architecture that has been used with the SELFIES compound format [90], [91]. The improved validity of SELFIES aids in the exploration of latent spaces created by VAEs, which when constructed by SMILES creates a space with many invalid molecules [82]. This guarantee correctness also makes it appealing for other usually noisy methods of generation, such as genetic algorithms which introduce random mutations to the output SELFIES [92], [93], [94]. 
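The robustness property that makes SELFIES attractive for noisy generation methods can be illustrated with a small sketch, assuming the selfies package [102]; the mutation loop below is a toy stand-in for a genetic algorithm, not an implementation from the cited works.

import random
import selfies as sf
from rdkit import Chem

random.seed(0)

# Encode a drug-like molecule (aspirin) as a list of SELFIES tokens.
tokens = list(sf.split_selfies(sf.encoder("CC(=O)Oc1ccccc1C(=O)O")))

# Randomly mutate a few tokens, as a genetic algorithm might.
alphabet = sorted(sf.get_semantic_robust_alphabet())
for _ in range(5):
    tokens[random.randrange(len(tokens))] = random.choice(alphabet)

# Any sequence of tokens from the SELFIES alphabet still decodes
# to a syntactically and semantically valid molecule.
mutated_smiles = sf.decoder("".join(tokens))
print(mutated_smiles, Chem.MolFromSmiles(mutated_smiles) is not None)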
2.3.4 Protein Embedding Use in Drug Discovery Some of the models above have used raw amino acid sequences [59] while others have utilized pretrained protein embeddings [65], [68]. The benefit of utilizing pretrained protein embeddings is having access to their immense size. These models can have total trainable parameters ranging from the 18.2 million up to 15 billion [90], [93]. This would make training these models from scratch a computational feat. Luckily, there is a large selection of open source 31 large protein language models available for use [70], [71], [86], [87], [88]. Previous models focused on LSTM architectures [70], [71]. More recent best performing models are transformer based [86], [87], [88]. ESM-2 (Evolutionary Scale Model-2) is one of the newest models, being released in the summer of 2022 [88]. Although protein embeddings have been used in compound generation in the past ESM-2 has not seen this application [65], [68]. However, ESM-2 has been utilized in models for affinity prediction and compound screening [95], [96], [97], [98], [99]. 32 Chapter 3 - Thesis Statement This thesis will explore the use of transformers for the generation of compounds using amino acid sequences as conditioning. The work will expand on the work of Grechishnikova 2021 [59] exploring two major adaptations, the use of transfer learning by using the pretrained protein embeddings from the ESM-2 model [88], and the use of SELFIES strings instead of SMILES strings [82]. The expectation is that the larger training corpus of ESM-2 will allow the model to better generalize between a broader set of proteins, potentially improving the model’s ability to make assumptions on new amino acid sequences and overall improving performance. SELFIES strings are guaranteed to be valid and therefore will increase the validity metrics of the model, but the expectation is they may also help with diversity, a cited benefit of their use in models [82]. The goal is to investigate how these changes effect the performance of a model with the same design and size as Grechishnikova 2021 [59]. In addition to exploring the effects of embeddings and SELFIES we will explore changes in the evaluation process. Grechishnikova 2021 [59] only explores beam search for evaluation, which is a deterministic generation mechanism. We will expand this by also exploring random sampling with temperature as a generation method. We will also attempt to scale the evaluation process. In Grechishnikova 2021 [59] a case study is performed by simulating the binding of a subset of generated compounds to a small selection of proteins. We will attempt to build a binding classifier for the purpose of benchmarking model performance across the entire dataset. For consistency all this work will use the same dataset as Grechishnikova 2021 [59]. 33 34 Chapter 4 - Methodologies This section reviews the methodology of training for models used in compound generation as well as models used in evaluation. And will overview the methods used to evaluate model performance. 4.1 Transformers for Compound Generation For this project four transformer models will be trained and compared in amino acid conditioned compound generation. The four models are: Base model (Base): - Model designed to mimic Grechishnikova 2021 [59]. Trained on SMILES and raw amino acid sequences. Base + SELFIES model (Base+S): - Same as base model but trained with SELFIES instead of SMILES. 
Base + Embedding model (Base+E): - Same as the base model but replaces the Encoder of the transformer with pretrained embeddings from ESM-2.

Base + SELFIES + Embedding model (Base+S+E): - The same as Base+E but with SELFIES instead of SMILES.

All these models were trained on Google's Colab platform with V100 GPUs and developed using TensorFlow 2, version 2.15.0 [100]. Transformers were implemented with the help of TensorFlow's documentation and tutorials [101].

4.1.1 Model Data

The dataset chosen for this analysis was from Grechishnikova 2021 [59]. Grechishnikova 2021 [59] had five train-test splits. Only one split was chosen for this thesis work due to constraints on computational resources and time. No amino acid sequences in the training split are present in the test split. The Needleman-Wunsch similarity distribution between the train and test splits can be seen in Figure 8. Additional cleaning steps were performed on the data following some of the processes outlined in Brown et al 2019 [81]. The additional cleaning steps are as listed:
- The SMILES string was cleared of salts.
- The SMILES string was canonicalized.
- Molecules were filtered to remove any that contained elements outside of H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
- The molecules were neutralized.

This ultimately removed 26 compounds from the dataset, all of which contained an element not found in the allowable group. This compound filtering was performed using the RDKit Python package, version 2023.9.5 [78]. These molecules were then also converted into SELFIES strings using the selfies Python package, version 2.1.1 [102]. After filtering, the dataset was left with 183,928 unique amino acid-compound pairs (1512 unique amino acid sequences) and 9198 unique amino acid-compound pairs (99 unique amino acid sequences) for the train and test set, respectively.

The final processing step for the data was tokenization. For SELFIES strings the unique tokens are already predefined in the grammar. For SMILES strings the unique tokens were the unique elements, charged or neutral. SMILES were tokenized using the selfies package's tokenize_smiles function [102]. For the embedding models, the ESM-2 650M parameter model was used. ESM-2 has models ranging from 8 million to 15 billion parameters [88]. The 650 million parameter model was chosen instead of the higher-parameter models because of memory limitations of the V100 GPUs used in training.

Figure 8. Distribution of Needleman-Wunsch Similarities Between Test and Train Amino Acids.

4.1.2 Model Structure

Encoder and Decoder Structure

The encoders and decoders for the models were modeled identically to the Grechishnikova 2021 specifications [59]. The Encoder and Decoder had the same structure, made up of 4 layers with a feed-forward dimension ($d_{ff}$) of 512, a $d_{model}$ of 128, and 4 attention heads. A dropout rate of 0.1 was applied after the embedding layer, each attention layer, and each feed-forward layer before normalization. The embedding models, Base+E and Base+S+E, do not have this encoder but rather replace it with ESM-2 pretrained embeddings.

Embedding Model

ESM-2 is a transformer model that was originally trained using PyTorch [88]. Since this work is developed in TensorFlow 2, combining the models together and using the pretrained weights directly is infeasible. To accommodate this, all the embeddings were pre-calculated for all the amino acid sequences in the dataset. Each embedding is calculated over the full length of the amino acid sequence, replacing each amino acid in the sequence with a 1280-dimensional embedding vector.
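A minimal sketch of this precomputation step is shown below, assuming the fair-esm package's published interface for the 650M-parameter ESM-2 checkpoint (esm2_t33_650M_UR50D, 33 layers, 1280-dimensional representations); the function and cache names are placeholders rather than the thesis code.

import torch
import esm  # the fair-esm package that distributes ESM-2

# Load the 650M-parameter ESM-2 model and its tokenizer.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed_sequence(sequence: str) -> torch.Tensor:
    """Return a (len(sequence), 1280) matrix of per-residue embeddings."""
    _, _, tokens = batch_converter([("protein", sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    representations = out["representations"][33]
    # Drop the begin/end-of-sequence positions so one row remains per amino acid.
    return representations[0, 1:len(sequence) + 1]

# Precompute once, keyed by the raw amino acid sequence, and store to disk.
embedding_cache = {seq: embed_sequence(seq).numpy() for seq in ["MKTAYIAKQR"]}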
Since ESM-2 is a transformer model, this output is the same shape as what is expected from the Encoder. For this reason, the embeddings can be used as a drop-in replacement for the Encoder and can be passed directly into the Decoder. However, because the precalculated embeddings expand the amino acid sequences into a large second dimension, loading the full dataset into memory is no longer feasible. To accommodate this, we made a custom TensorFlow 2 model layer that functions as a hash map. The layer, EmbedHashLookupLayer, on construction makes a one-to-one mapping from amino acid sequences to embedding representations. At model runtime this layer takes batches of amino acid sequences and returns batches of embedding representations. This approach is beneficial because the embedding representations can be stored in system RAM instead of GPU VRAM, which saves GPU memory. The non-embedding model utilized ~14 GB of VRAM while the embedding model only utilized ~5 GB of VRAM during training. An additional benefit is that the mappings can be passed each time the model is constructed, which means the model only ever needs to be provided with the minimum-sized hash map for the current context, also potentially saving memory. The major downside of this approach is that the ESM-2 model must be used to calculate embeddings for any new amino acid sequences before using the model. In addition, although more resilient to memory issues, there can still be memory errors if the hash map becomes too large. This implementation ultimately allows the embedding models to replace the Encoder with a single EmbedHashLookupLayer layer.

4.1.3 Model Training

The total number of epochs for training was not predetermined and was chosen by observing the progression of the loss. This will be discussed further in Chapter 5 Results and Discussion. The models were trained using a batch size of 72. This batch size was chosen to approximate the number of tokens per batch used in Grechishnikova 2021 [59]. In their work they utilized a batch size of 4096 tokens per batch. The batch size we utilized was estimated using the average compound size to achieve approximately 4096 tokens per batch on average. The V100 GPUs utilized in this work did not have enough memory to process this batch size in full. To accommodate this, we implemented gradient accumulation with 6 mini-batches of size 12. Gradient accumulation is a training process where the batch is split into mini-batches. For each mini-batch a forward pass is made, and the gradient is calculated and accumulated but not applied. On the final mini-batch the accumulated gradient is applied in backpropagation. The accumulation is either the summation or the average of the gradients, depending on whether the loss is normalized [103]. In our case the gradient is averaged across the mini-batches. This saves memory because only 12 examples need to be held in memory for each forward pass, but it doesn't impact training because backpropagation still only occurs every 72 examples, as originally intended.

Adam was the optimizer utilized in training, with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\varepsilon = 10^{-9}$, and a learning rate defined as:

$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right). \quad (24)$$

Where $warmup\_steps$ is 4000. This definition of learning rate and the parameters of the Adam optimizer are identical to Grechishnikova 2021 and the original transformer publication [16], [59].
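As an illustration only, the schedule in equation (24) can be expressed as a Keras LearningRateSchedule, following the pattern of the TensorFlow tutorials cited above [101]; the class name used here is a placeholder and this is not the thesis training code.

import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Learning rate of equation (24): linear warmup, then inverse-sqrt decay."""

    def __init__(self, d_model=128, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                    # step_num^-0.5
        arg2 = step * (self.warmup_steps ** -1.5)     # warmup ramp
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(d_model=128), beta_1=0.9, beta_2=0.98, epsilon=1e-9)

Gradient accumulation then sits on top of this optimizer: gradients from the six mini-batches of 12 are averaged and applied with a single apply_gradients call per effective batch of 72.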
The loss function for training was sparse categorical cross entropy which was masked to not penalize failing to predict padding values. Teacher forcing was used during training [72]. It is important to note that normally the theoretically ideal loss is 0, but in this case the same amino acid sequence can map to multiple different binding compounds. Because of this one-to-many mapping, even a perfectly fitted model cannot achieve a loss of 0. To accommodate this in evaluation the theoretically best loss was calculated for the SMILES and SELFIES training and test sets. This was done by calculating the ideal probabilities for each molecule and amino acid sequence pair. The process for this was for every amino acid sequence to take all of 40 the potential binding molecules and for every possible prefix calculate the appropriate next probability. This is best explained by an example. If an amino acid sequence bound to three compounds “AAC”, “AAB”, and “ABB” then the ideal probability for the first character would be 1/1 “A” since all three compounds start with “A”. Then for the next character after “A” would be 2/3 “A” and 1/3 “B”. Then for the prefix “AA” the next ideal probability would be 1/2 “C” and 1/2 “B”, but for the prefix “AB” it would be 1/1 “B”. Since the model is using teacher forcing in training, we know that for the example “AAC” we would expect and ideal model to guess 1/1 for “A” being the first character, 2/3 “A” and 1/3 “B” for the second character, and 1/2 “C” and 1/2 “B” for the final character. For both SMILES and SELFIES, the ideal loss was calculated to be ~0.11, for both train and test sets. 4.2 Binding Classifier for Evaluation To evaluate the similarity of generated compounds to real binding molecules a classifier was constructed with the goal of distinguishing real binders from random molecules for a given amino acid sequence. This is a similar approach to what was performed in Grechishnikova 2021 [59] where instead they simulated binding affinities for generated compounds and real binders. Constructing a classifier is advantageous because it scales and allows us to evaluate generated molecules without the computational cost of simulating binding for all real, random, and generated compounds. This allows for the evaluation of all generated compounds for all amino acid sequences and not a subset like in Grechishnikova 2021 [59]. Since the transformer models are conditioned to generate SMILES and SELFIES strings that approximate real binding SMILES and SELFIES strings, this method of evaluation is in line with model performance. It is 41 important to note that better performance of generated compounds in the classifier evaluation is not making a claim that these compounds are better binders, just that they better approximate the SMILES/SELFIES strings of true binders. In training the classifier, a delicate feature will be avoiding overfitting, or underfitting. If the model is overfit, it will be too discriminatory of potential binders and only recognize those almost identical to real binders. But if underfit it will not discriminate enough. To account for this the model size was limited, high dropout values were explored, and training was early stopped to prevent overfitting. As a waypoint for appropriate performance, we benchmarked against the simulation results of Grechishnikova 2021 [59]. In their work the best area under the receiver operating curve (AUC) for their simulations was 0.86. 
Therefore 0.86 was used as an approximate target for the classifier model’s AUC score to stop training. 4.2.1 Model Data The dataset used for the evaluation classifier was constructed by taking all unique amino acid sequences from the Grechishnikova 2021 [59] dataset and splitting them into a 50:50 train test split. The 190 amino acids with only one occurrence were randomly split evenly between the train and test split. For the remaining 1421 amino acid sequences that had multiple occurrences were evenly split with their binding pairs into the train and test sets. If the number of binding pairs for an amino acid sequence was odd the left over pair was randomly placed in the either set. This model also made use of ESM-2 for protein embeddings. For this data the 3 billion parameter model was used. The reason the larger model could be used in this instance was because the resulting protein embeddings were reduced via taking the mean across the length of 42 the sequence. This created a single 2560 vector for each amino acid sequence, which did not create memory issues. To train the classifier it is necessary to have negative binding examples. For the data, negative binders were molecules randomly sampled from the training and test set, respectfully. The method of using randomly sampled molecules for negative binding pairs was chosen because distinguishing between random molecules and true binders is the benchmark evaluation in Grechishnikova 2021 [59]. Random sampling does allow for the accidental sampling of true binders as negative examples, but with the size of the dataset this was not a major concern. A simulation of 200 full random samplings showed that in the training and test set the average percentage of collisions per amino acid was 0.12%. This methodology creates a training and test set with almost equal representations of each amino acid sequence in both sets. By not using the same train-test split as the transformer we ensure that the classifier learns to classify for amino acid sequences both in the transformers training set and the test set. Having the same amino acid sequences in the training set and test set was not a concern, considering the goal of the classifier is to perform evaluation on this dataset, and not generalize outside of it. 4.2.2 Model Structure The model is constructed to take in a potential binder SMILES string and an amino acid sequence embedding and return the probability that that molecule is a binder for that amino acid sequence. The structure of the binding classifier was motivated by Song et al. 2023 [104]. In their work they developed an architecture using pretrained protein embeddings from ESM-2 and 43 compounds to predict affinity and interactions [104]. Their model is constructed of three main parts, a model for processing compound graphs constructed from input SMILES strings into a single embedding, a model for processing protein embeddings into a single compressed embedding, and a final feed forward network that takes the Kronecker product of the embeddings (Figure 9). Our implementation simplifies this by using a GRU recurrent neural network for SMILES strings instead of a Graph Neural Network, and by replacing the protein embedding processing model with the mean reduced ESM-2 embeddings and simple single layer dimension reduction feed forward network. The SMILES string processing model has an initial embedding layer with a dimension of 32 and a single layer bidirectional GRU with a dimension of 32. 
The hidden state vectors from each step of the GRU are averaged together to get the final compound embedding. This is similar to the original protein embedding LSTM in Song et al. 2023 [104], as seen in Figure 9. The protein embedding dimensional reduction model was a simple feed-forward network with a dimension of 128. The results of these two models were concatenated with each other, and then concatenated with their Kronecker product, as in Song et al 2023 [104]. For two vectors $A$ and $B$, the Kronecker product is simply the vector $B$ multiplied by every value in $A$ and concatenated into a single vector. The resulting Kronecker product of a $(1, n)$ and a $(1, m)$ vector will be a vector of shape $(1, n \cdot m)$. The concatenated result of the original embeddings and the Kronecker product is passed through a two-layer feed-forward network with dimensions 128 and 64, respectively. Before input into the first feed-forward layer and after each layer a dropout of 0.3 is applied. A last layer of size 1 with a sigmoid activation is used to output the binding probability.

Figure 9. Song et al. 2023 [104] "The framework of PMF-CPI. For a compound-protein pair, the protein sequence is transformed into context embedding via pretrained TAPE and then fed into LSTM. And a compound SMILES is turned into a molecular graph by RDKit and encoded with graph neural networks. After that, two encoding vectors are merged through the Kronecker product, and MLP provides a final output. PMF-CPI can process two task types, regression for CPI binding affinities or classification for interactions"

4.2.3 Model Training

The model was trained on a V100 GPU using Google's Colab platform. The loss for training was binary cross entropy. The Adam optimizer was used with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-7}$ and a learning rate of 0.001. These values were informed by Song et al 2023 [104]. To prevent overfitting, training was to be stopped when the AUC score on the test set approached 0.86, since this is the maximum AUC score found in Grechishnikova 2021 [59] for binding simulations.

4.3 Experiment Setup

For each model three sets of compounds were generated. All generated SMILES were canonicalized using the RDKit package [78]. SELFIES strings were converted to SMILES strings and canonicalized. Two sets were generated using beam search with a beam size of 10, as done in Grechishnikova 2021 [59]. One set from beam search was the 1-1 set, which only kept the most probable compound for each unique amino acid sequence. The other set from beam search was the 1-10 set, which kept all 10 of the compounds for each amino acid sequence. The third set was generated via random sampling with a temperature of 1. For random sampling, 100 molecules were generated per amino acid sequence. For shorthand this set will be referred to as the 1-100 set. More molecules were generated for random sampling because of its lower computational cost compared to beam search. In Grechishnikova 2021 [59] only beam search is explored, but since beam search is a deterministic generation mechanism that generates the more probable molecules, the goal was to use random sampling as a more diverse generation method that also allows us to explore more of the chemical space.

The beam search methodology used in the project was the same as used in the Tensor2Tensor project [105]. This is an adaptation of the beam search methodology used in Wu et al. 2016's previous work, without the coverage penalty:

$$s(Y, X) = \frac{\log\left(P(Y|X)\right)}{lp(Y)}, \quad (25)$$

$$lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}. \quad (26)$$

Where $Y$ is the generated sequence and $X$ is the output of the encoder. $\alpha$ is a hyperparameter for normalization that was set to 0.6, as in Grechishnikova 2021 [59]. The value of 5 comes from the original work of Wu et al. 2016, where it was a tuned hyperparameter found to be optimal at 5 [75].
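A small sketch of the length-normalized scoring in equations (25) and (26) is given below; it is illustrative only and is not the Tensor2Tensor implementation referenced above [105].

import math

def length_penalty(length: int, alpha: float = 0.6) -> float:
    """lp(Y) of equation (26); the constant 5 follows Wu et al. 2016 [75]."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def beam_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """s(Y, X) of equation (25): length-normalized log-probability."""
    return log_prob / length_penalty(length, alpha)

# Two hypotheses with the same total log-probability: the longer one is
# penalized less per token, so the normalization keeps beam search from
# systematically favouring short sequences.
print(beam_score(math.log(1e-6), length=10))
print(beam_score(math.log(1e-6), length=40))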
For appropriate baselines, additional compound sets were added as references. An "Ideal set" was created as a subsample of the test set where 50% of the compounds per amino acid sequence were sampled and treated as generated. The dataset was created 5 times with different random splits and evaluations were calculated on each split to get a baseline for what real binders look like when compared to other real binders. The generated sets were also all shuffled to make "shuffle sets". Shuffled sets were used in evaluation calculations that approximate how well the compounds associate with their amino acid sequences. If shuffled and non-shuffled sets performed similarly, then the model was not generating molecules specific to the amino acid sequences. Lastly, random samples of compounds from the train and test set, or the whole train and test set, were used as references where appropriate.

Each of the generated sets was then put through evaluation. The sets were evaluated on their own for chemical feasibility and distribution of physicochemical properties. They were also compared to the real binder compounds using Tanimoto similarity and classification scores. For comparison, all amino acid sequences in the test set were explored and aggregated in the results. In addition to this, two specific proteins, IGF-1 and VEGFR2, were explored individually. This is a case study to evaluate the performance on specific proteins; these are the same two proteins utilized in Grechishnikova 2021 [59].

4.4 Evaluation

The evaluation performed has two main focuses: evaluating the overall chemical feasibility and the physiochemical properties of the overall compound sets, and evaluating how well the generated compounds approximate true binding compounds for their corresponding amino acid sequences. The code utilized in evaluation was adapted from MOSES, a benchmarking framework for molecular generation models [80].

4.4.1 Chemical Feasibility

Chemical feasibility metrics describe how viable the generated compound sets are. These metrics will be calculated for each model's sets, the ideal set, and random samples from the training set. All these metrics are single quantitative values calculated for a whole set of molecules. The reference sets for this section are the ideal set and random samples from training.

Validity

Validity is defined as the percentage of generated sequences that can be decoded into valid molecules [29], [80], [81]. This metric is important for comparing between models, but also for comparing between random sampling and beam search. Considering beam search generates the compounds of highest probability, there is an expectation that its validity may be higher.

Uniqueness

Uniqueness is simply the number of unique molecules in the generated set divided by the number of valid generated molecules [29], [80], [81]. Uniqueness is important for measuring the diversity of the generated distribution. It is a goal of generative models to produce a diverse chemical space, and this metric assesses whether the model can generate a diverse set of compounds. This metric compares performance between models but can also explore the difference between generation techniques. The expectation is that random sampling may generate better uniqueness.
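As an illustrative sketch of how validity and uniqueness can be computed with RDKit (the actual evaluation code was adapted from MOSES [80]):

from rdkit import Chem

def validity_and_uniqueness(generated_smiles):
    # A generated string is valid if RDKit can parse it into a molecule.
    valid = []
    for smiles in generated_smiles:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))  # compare in canonical form

    validity = len(valid) / len(generated_smiles)
    # Uniqueness: unique canonical molecules over valid molecules.
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness

# "CCO" and "OCC" are the same molecule; the last string is not parseable.
print(validity_and_uniqueness(["CCO", "OCC", "c1ccccc1", "C(C)(C)(C)(C)C"]))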
Novelty

Novelty is the percentage of generated compounds that are not seen in the training data. This is an important metric for ensuring the models are not simply copying molecules from the training set [29], [80], [81], and it can indicate whether any models are overfitting to the training set. If models can approximate the same novelty as the ideal set that would be appropriate, but more novelty can be considered desirable.

Internal Diversity

Internal diversity is a more sophisticated metric for determining how diverse a set of molecules is [80]. It can be described by the following equation:

$$IntDiv_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p}. \quad (27)$$

Where $T$ is the Tanimoto similarity, a well-defined metric of measuring compound similarity [80], $G$ is the generated set of compounds, and $p = 1$. The Internal Diversity metric amounts to one minus the normalized sum (the average) of the Tanimoto similarities over all pairs of molecules in the generated set. Tanimoto similarities will be expanded on in an upcoming section. This method expands on uniqueness because it doesn't stop at evaluating molecules as identical or not, but incorporates a similarity score. The expectation is that generated sets should be as, if not more, internally diverse than the random training set sample and the test set.

4.4.2 Physiochemical Properties

The evaluation of physiochemical properties allows for the evaluation of the potential drug-likeness and drug feasibility of generated compounds. Properties like molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, water-octanol partition coefficient (LogP), rotatable bonds, and total polar surface area (TPSA) are all metrics used to assess the drug-likeness of a compound [106], [107], [108]. Lipinski's rule is a common rule of thumb for assessing drug-likeness and states that an oral compound's molecular weight should not exceed 500 Da, that it should have no more than 5 hydrogen bond donors and no more than 10 hydrogen bond acceptors, and that its water-octanol partition coefficient (LogP) should be less than 5 [106]. It is also understood that the number of rotatable bonds ideally shouldn't exceed 10, and that TPSA values under 140 Å2 and under 90 Å2 are necessary for good absorption and for passing the blood-brain barrier, respectively [107], [108]. In addition to these more fundamental properties of compounds, there are other more abstract metrics that also hold valuable information.

Three more abstract metrics are the Quantitative Estimate of Drug-likeness (QED), the Natural Product likeness score (NP Score), and the Synthetic Accessibility Score (SA Score). QED was originally defined by Bickerton et al 2012 [109]. QED is a value between 0 and 1, where the closer to 1 the more drug-like a compound is. The score is calculated using 8 molecular properties: molecular weight, LogP, number of hydrogen bond donors, number of hydrogen bond acceptors, TPSA, number of rotatable bonds, number of aromatic rings, and number of structural alerts [109]. The distributions of these metrics in known drug-like compounds are used to fit asymmetric double sigmoid functions, which allows a score between 0 and 1 to be generated for each metric describing how far a compound is from the mean for that metric. The weighted geometric mean of these values is then calculated to create the QED score. This allows a single value to be used to approximate drug-likeness [109].
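The Lipinski-style descriptors and QED discussed above are all available in RDKit; the sketch below is illustrative (the cut-off values in the comments restate the rules of thumb cited above) and is not the thesis evaluation code.

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, QED

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

properties = {
    "MolWt": Descriptors.MolWt(mol),                        # Lipinski: <= 500 Da
    "LogP": Descriptors.MolLogP(mol),                       # Lipinski: < 5
    "HDonors": Lipinski.NumHDonors(mol),                    # Lipinski: <= 5
    "HAcceptors": Lipinski.NumHAcceptors(mol),              # Lipinski: <= 10
    "TPSA": Descriptors.TPSA(mol),                          # < 140 (absorption), < 90 (BBB)
    "RotatableBonds": Descriptors.NumRotatableBonds(mol),   # ideally <= 10
    "QED": QED.qed(mol),                                    # 0..1, higher is more drug-like
}
print(properties)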
The NP score was originally developed by Ertl et al 2008 as an estimate of how similar a compound is to a natural product [110]. The theory is that the more like a natural product the more viable a drug candidate may be [110]. The metric usually spans from -5 to 5, with a higher value representing a more natural product. The score is essentially calculated by using the precalculated frequency values for a set of compound fragments. For each fragment found in a compound the score is added to a total and then normalized by the size of the molecule. Fragments found more frequently in synthetic products have a negative score, while fragments found more frequently in natural products have a positive score. Ertl et al 2009 [111] went on to propose another heuristic, the Synthetic Accessibility Score. The SA score of a compound ranges from 1 to 10 with 10 being the hardest to synthesize. The score is calculated in a similar way, utilizing precalculated fragment scores. These fragment 51 scores are influenced by the frequency of these fragments, molecular complexity, size, symmetry, and rare structural motifs. More common fragments with a lower complexity have a lower SA score, while less common fragments with a higher complexity have a higher SA score [111]. A suitable SA score has conflicting citations. Lucio et al 2020 [51] uses 4.5 as a cut-off for considering a molecule synthesizable, while Grechishnikova 2021 [59] uses 6 as a cut-off. Other papers simply report distributions of SA scores for a higher level view [80]. These metrics will be evaluated by comparing their distributions to reference sets. The reference sets will be the full training and test set. These metrics are important to evaluate druglikeness and feasibility so their values will be evaluated on where they align with known cutoffs. How well the generated sets approximate the training and test distribution will also be evaluated. Since no additional property information is being purposely selected for, any deviations from the training and test set can be interpreted as a bias for that model and/or generation method. 4.4.3 Similarity to Known Binders These metrics will specifically be used to evaluate how well the compound’s generated are approximating real binders. In this section, for each amino acid sequence in the test set the metrics are calculated for the generated compounds and the shuffled sets. Tanimoto Similarity Tanimoto similarity is a well defined metric of measuring compound similarity [80]. Tanimoto similarity is calculated from the morgan fingerprints of two compounds and is calculated by dividing the sum of the bitwise-and of the two fingerprints by the sum of the 52 bitwise exclusive or of the two fingerprints. The tanimoto similarity was calculated between each generated compound and each real binder for each amino acid sequence. For each generated compound the mean tanimoto similarity and max tanimoto similarity was recorded. This created two metrics, the mean tanimoto similarity, and the max tanimoto similarity. Each of the generated sets were compared to the ideal set as a reference. The ideal set calculated the tanimoto similarity between two 50/50 splits of the real binders for each test amino acid sequence. To make an empirical comparison the Wasserstein Distance was calculated between the generated sets and the ideal set to determine their similarity. This comparison has been previously utilized compound generation benchmarks [80]. 
The Wasserstein Distance is a measure the amount of work it takes to transform a one distribution to another. The work is proportional to both the distance the probability needs to be moved and the amount of probability. The shuffled generated sets were also compared to the ideal set. This was to test if the compound generation is specific to the amino acid sequences. We would expect the shuffled sets to perform worse than the unshuffled sets if this were the case. Binding Classifier The binding classifier is trained to generate a binding probability for a given amino acid sequence and compound. By generating a classification score for each compound and its corresponding amino acid sequence we can compare the generated sets and shuffled generated sets to the full test set. Similar to tanimoto similarity evaluation we can compare the distributions of the classification scores for the generated set and shuffled generated sets with the test set by calculating the Wasserstein Distance. However, in addition to this comparison we can also 53 evaluate the performance of the generated sets by visualizing the receiver operating characteristic (ROC) curve and calculating the corresponding AUC score. This is the same technique utilized in Grechishnikova 2021 [59]. For the ROC-AUC evaluation three kinds of comparisons will be investigated. - Comparing Generated compounds against real binders. This will evaluate the ability of the generated compounds to approximate real binders. A successful result here would be an AUC score closer to 0.50, equivalent to the classifier randomly guessing. - Comparing Generated compounds against themselves shuffled. If compound generation is amino acid specific, we would expect the classifier to be able to tell the difference between the generated set that is appropriately paired with the right amino acid sequences and the shuffled set. A successful result would be an AUC score closer to 1.0. - Comparing Generated compounds against random shuffled sample from the test set. This will evaluate if the generated compounds are better than randomly sampled compounds from the test set. A successful result would be an AUC score closer to 1.0. As a reference point for all these calculations the AUC score for the test set real binders vs the shuffled test set will be calculated. In addition to performing this evaluation over the full test set, we also will explore two specific proteins as case studies. These proteins are insulin-like growth factor-1 (IGF-1) and vascular endothelial growth factor receptor 2 (VEGFR2). These proteins were chosen because they were also the proteins explored in Grechishnikova 2021’s simulation analysis [59]. The PDB ids explored for IGF-1 were 20J9, 3O23, and 4D2R. For VEGFR2 2P2H and 4ASE were 54 explored. These PDB ids had amino acid sequences that were not in the training set. For the case study the 1-1 generation set will not be used since it will only generate 3 compounds for IGF-1 and 2 for VEGFR2, which isn’t enough to make a meaningful evaluation. The known binders for each of these specific PDB ids were extracted from BindingDB in June of 2023 [112]. 55 56 Chapter 5 - Results and Discussion 5.1 Transformer Training Metrics The models were all trained to 50 epochs. This stopping point was chosen for computational resource reasons and was the point where the trained loss appeared to level off (Figure 10). This plateau occurs at different values for each model. 
Base+S model settled with a loss around 0.38, Base and Base+S+E both settled around 0.30 and the lowest loss was seen for Base+E with a loss around 0.22. Overall, the embedding models performed better than their nonembedding counterparts in both cases, by ~0.10. This could be evidence of the better generalization of the embedding models. However, to be certain this difference persists, more epochs would need to be explored. Overall SELFIES models perform worse than their SMILES counterparts when it comes to loss. This could potentially be explained by the difference in vocabulary sizes. For SMILES there were 79 unique tokens while for SELFIES there were 111 unique tokens. More tokens allow for more opportunities to be wrong or uncertain. The SELFIES grammar is also more complex than the SMILES grammar. For example, a single token in the SELFIES grammar can hold multiple meetings depending on the context of the sequence [82]. This added difficulty may also explain the loss difference. A striking feature of the training curves is the test sets’ losses not improving and mostly degrading continuously over training. The trends in validation loss appear to be opposite to the trend with training loss where embedding models had higher validation loss than their non- 57 Figure 10. Train and Test loss curves for all four models with Ideal loss visualized. embedding counter parts and SELFIE models having higher validation losses than their SMILES counter parts. The overall validation loss being high may be explained by the fact that for both SMILES and SELFIES strings multiple permutations of tokens can create the same molecule. For example, the SMILES BrCCBr and C(Br)CBr are identical molecules that share 0% sequence accuracy. Because of this there is a potential that the compounds that are being generated for the training set are valid, but simply don’t align with the precise expected permutation. A potential hypothesis for how this could cause a higher validation loss over time is that as a model improves, it becomes more certain with the molecule it is deciding to create. This certainty leads to a narrower probability distribution being predicted. Where in the beginning the model is less certain and predicts a much broader probability distribution. If the more certain prediction is narrowing in on a permutation that isn’t the expected permutation, then it will generate a worse loss than a broader prediction which has a better chance of overlapping with the 58 expected permutation. It is uncommon for other compound generation papers to display their loss curves, so it is unclear if this is a common occurrence or an anomaly. The later evaluation steps allow for a clearer picture of model performance. 5.2 Binding Classifier Training Metrics The classifier model was trained for 25 epochs, but the 7 th epoch was chosen as the early stoppage point (Figure 11). This is because it’s AUC score was approaching 0.86 which approximates the best simulation performance in Grechishnikova 2021 [59]. The training showed consistent performance between the train and test set, which gives confidence that the model was not overfitting to binding molecules in the training set. Figure 11. Evaluation Binding Classifier Train and Test AUC scores over training. 59 5.3 Chemical Feasibility In chemical feasibility results the SELFIES models stand out in validity, novelty, internal diversity and uniqueness; only falling to the Base+E model in the 1 to 10 generation for uniqueness (Table 1). 
High performance in internal diversity, novelty, and uniqueness is expected of SELFIES strings, since their propensity for greater compound diversity is a known effect [82]. With this being said, internal diversity is quite consistent between all models and aligns well with the ideal set and random samples (Table 2). The embedding models appear to do slightly better in uniqueness than their non-embedding counterparts, by an average of 3% across generation methods.

Table 1. Chemical Feasibility results for each model and generation set. Bolded values are the max values for that metric and generation set between the models. Ties are both bolded and underlined.

                    Base                 Base+S               Base+E               Base+S+E
                    1-1    1-10   1-100  1-1    1-10   1-100  1-1    1-10   1-100  1-1    1-10   1-100
Validity            92%    86%    79%    100%   100%   ~100%  72%    61%    77%    100%   ~99%   ~100%
Uniqueness          83%    92%    92%    94%    90%    97%    91%    98%    94%    96%    89%    98%
Novelty             87%    92%    97%    91%    96%    99%    78%    87%    95%    92%    96%    98%
Internal Diversity  0.83   0.84   0.87   0.86   0.87   0.89   0.85   0.86   0.87   0.87   0.89   0.89

Table 2. Chemical Feasibility of reference sets. The Ideal Set is made up of random samples from the Test set. Random sets are made up of random samples from the Training set.

                    Ideal Set   Random 1-1   Random 1-10   Random 1-100
Validity            100%        100%         100%          100%
Uniqueness          99%         99%          100%          98%
Novelty             79%         0%           0%            0%
Internal Diversity  0.86        0.86         0.87          0.87

An interesting result is the degradation in validity seen in the baseline embedding model when compared to the baseline model. The 1 to 1 generation validity drops by 20% and the 1 to 10 generation validity drops by 25% (Table 1). There are multiple potential reasons for this. One potential reason is that this model is simply failing to learn proper SMILES grammar and failing to generate valid SMILES strings. However, this is unlikely because the Decoder structure is identical to the base model, which has no trouble generating valid SMILES strings. A second potential reason is that the Base+E model is uncertain about the kind of molecule to generate and is generating a broader probability distribution for token prediction. This could be a sign of the improved protein embedding space providing more context for the model. For the Base model, when given a new amino acid sequence it may only be able to draw an association to a smaller set of the amino acid sequences it was trained on, giving it a narrower scope of potential compounds to generate. If the protein embeddings allow the Base+E model to relate to a broader set of proteins, then this could create a larger scope of potential compounds to generate. The broader scope may then cause the model to become less certain of the patterns it is trying to generate and result in invalid SMILES strings being generated. Essentially, the Base model may have a narrower scope on the kind of molecule it wants to generate, making it more certain, while the Base+E model has a broader scope, making it more uncertain and prone to error. It is possible this would affect beam search more than the random sampling method because beam search functions by selecting sequences that maximize the scoring probability. This cumulative probability score may cause the model to fixate on compound fragments that are chemically valid but put together are not valid SMILES strings. Random sampling, however, isn't constrained by this strict selection process and is only looking at what is the most probable next token.
Since the model has likely learned the general structure of valid SMILES string, even if it is uncertain about the exact molecule it should generate, it may be better at avoiding over fixating on some compound fragment patterns. The uniqueness being lower for the Base model compared to the Base+E model in beam search also supports the idea that the Base model may be narrowly focusing on a smaller chemical space and therefore regenerating the same molecules more frequently. As will be seen in results that will be discussed later, the Base+E model does appear to produce better molecules for the amino acid sequences than the Base model. This provides confidence that the Base+E is generalizing on the protein information better and that the low validity is not impacting performance. The Base+E model also has slightly lower novelty than the Base model, especially for the 1 to 1 generation set. However, for the ideal set the novelty is only 79%. Therefore, even though the Base+E generation set is producing lower novelty values it isn’t producing novelty much worse than a sample of the test set would. In some ways this can be seen as a positive as approximating the ideal set better could be a sign of better approximation of the real binders. A general trend that can be observed for all models is an increase in novelty and internal diversity when going from 1-1 generation, 1-10 generation and 1-100 generation, respectively. The increase in novelty is a more prominent effect and can be expected because 1 to 1 generation is generating the most probable compound, which is more likely to be one from the training set the model was trained on because of inherent bias for the training set. Expanding to 10 molecules 62 allows for slightly more randomness since less probable molecules are being generated, and the random sampling of the 1-100 set allows for even less probable molecules to be generated. This is because in beam search the probability of the whole sequence is being considered where in random sampling only the current token is being considered, and even then, it is being randomly sampled from the token probability distribution. Overall, in comparison to the ideal set the generated compounds are not deviating in a concerning manner with the only point of concern being the Base+E model’s lower validity. 5.4 Physiochemical Properties The molecular weight distributions for generated sets overlap appropriately with the test and train distributions (Figure 12). The generated sets trend on the lighter side. For each generated set the SELFIES models have the largest high molecular weight outliers. All the generated sets have a majority of their compounds are under 500 Da which meets our Lipinski ideal cut off [106]. For beam search, embedding models tend towards lighter compounds on average compared to non-embedding models. For hydrogen acceptors and hydrogen donors a majority of the generated compounds are under 10 and 5 respectfully, which aligns with Lipinski’s cut offs (Figure 13) [106]. Both of these metrics overlap appropriately with the test and train sets, but for hydrogen acceptors tends to be a bit lower. In beam search the embedding models tend to be a bit lower in H acceptors. 63 Figure 12. Distributions of Molecular Weights for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. Figure 13. 
Distributions of the Number of Hydrogen Acceptors and Hydrogen Donors for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. 64 A majority of molecules generated in each set do not have rotatable bonds exceeding 10 which is ideal (Figure 14) [107]. SELFIES seem to have more large outliers for rotatable bonds and consistently have a higher mean than the SMILES models. Overall, the distributions overlap appropriately with the train and test set. Figure 14. Distributions of Rotatable Bonds for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. A majority of generated compounds for all sets are under the LogP cutoff of 5 specified in Lipinski’s rules (Figure 15) [106]. The SELFIES models have the largest outliers for high LogP values compared to the other generated sets and the Base model has a slightly lower mean. Overall, the sets agree with the train and test sets. All the generated sets have a majority of molecules under the 140 A 2 cut off and the average sits close to the 90 A2 mark (Figure 16). The Base+S, Base+S+E, and the Base+E model 65 Figure 15. Distributions of LogP for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. appear to trend to lower TPSA values, especially the embedding models which appear to have averages in the 80-60 A2 range. This trend also matches the patterns seen in molecular weight. The overall overlap with training and test sets are appropriate. QED values in the generated sets seem to be more spread out with means tending higher than the train and test set (Figure 17). Embedding models appear to generate more higher value QED scores suggesting better drug likeness. Base+E seems to consistently have the highest average QED scores regardless of generation method. Base+S consistently is one of the lowest, with the exception of Base+S+E in the 1-100 generation set. The natural product scores of the train and test set actually group quite lower than the generated sets (Figure 18). For the Base and Base+E model the means appear to align closer with 66 Figure 16. Distributions of Total Polar Surface Area (TPSA) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines. the train and test set. The SELFIE models appear to deviate from the training and test set with a distribution tending to higher natural products, primarily the Base+S+E model. This deviation may be evidence of the SELFIE model creating a more diverse chemical set. SA scores for Base and Base+E model trended slightly lower than the train and test set (Figure 19). SELFIE models created much more difficult to synthesize compounds. So much so that the averages for the SELFIE models are close to the high outliers for the SMILES models. Overall, the generated sets for all models across all metrics overlap very well. There are many slight trends with differing means, most are small. 
Figure 17. Distributions of Quantitative Estimates of Drug Likeness (QED) for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.

Figure 18. Distributions of Natural Product Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.

Figure 19. Distributions of Synthetic Accessibility Scores for four model groups (Base, Base+S, Base+E, Base+S+E) across three Compound Generation Sets (1-1, 1-10, 1-100), with Training and Test Set comparisons. Means visualized with vertical dashed lines.

Comparing generation methods, the 1-100 sets generally had more outliers and a wider spread than the beam search generation sets. This is expected, since the 1-100 set generates more compounds and therefore has greater potential for diversity. The most striking deviations were in the natural product scores, where the Base+S+E model deviated markedly from the train and test sets, and in the SA score distributions, where the SELFIES models deviated substantially from the train and test sets. Since SA scores take compound complexity into account, this may be a sign of the structural diversity being generated by the SELFIES models. For other metrics the SELFIES models also tended to have larger outliers. This makes logical sense for SELFIES models, since their forced validity means that long, uncertain SELFIES strings are still guaranteed to be valid, unlike long, uncertain SMILES strings, which have a higher likelihood of being invalid and filtered out. It also shows the potential cost of this diversity: hard-to-synthesize compounds that may be impractical.

The tendency of the embedding models toward lower molecular weights, hydrogen acceptor counts, and TPSA values is an interesting pattern. A potential chemical explanation is that these models are favoring fewer functional groups, or a different distribution of functional groups. Functional groups in a molecule contribute to its mass, its hydrogen acceptor profile, and its polar surface area. For example, carbonyl groups (C=O) are a common hydrogen acceptor functional group with a molecular mass of 28 Da, so lowering carbonyl frequency could drop all of these metrics. Although this could explain the deviation, more exploration would be necessary to confirm it, and even if a functional group preference is the practical cause, why the models are making this preference would remain unknown.

5.5 Similarity to Known Binders

5.5.1 Tanimoto Similarities

To evaluate Tanimoto similarities, the distributions of the mean Tanimoto and max Tanimoto metrics are determined for each generated set, the shuffled sets, and the ideal set. The Wasserstein distance is then calculated between each generated set and the ideal set, and between each shuffled set and the ideal set. The lower the value, the more closely the set approximates the ideal set distribution. These calculations show that for all models the generated sets perform better than the shuffled sets (Figure 20, Figure 21). This supports that the models are learning some generalization and that the compounds being generated are specific to the amino acid sequences provided as context.
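The Tanimoto and Wasserstein computations described above can be sketched with RDKit fingerprints and SciPy. The fingerprint settings (Morgan radius 2, 2048 bits), the helper names, and the toy values below are assumptions for illustration; the thesis pipeline may differ in its exact parameters.

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from scipy.stats import wasserstein_distance

    def fingerprint(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) if mol else None

    def mean_max_tanimoto(generated_smiles, known_binder_smiles):
        """Per generated compound: mean and max Tanimoto similarity to the protein's known binders."""
        binder_fps = [fp for fp in map(fingerprint, known_binder_smiles) if fp is not None]
        means, maxes = [], []
        for smi in generated_smiles:
            fp = fingerprint(smi)
            if fp is None:
                continue  # invalid SMILES are excluded, as in the validity filtering
            sims = DataStructs.BulkTanimotoSimilarity(fp, binder_fps)
            means.append(float(np.mean(sims)))
            maxes.append(float(np.max(sims)))
        return means, maxes

    # Distribution-level comparison (lower = closer to the ideal set's distribution).
    ideal_means = [0.08, 0.12, 0.15, 0.22]        # toy reference values
    generated_means = [0.07, 0.10, 0.18, 0.25]    # toy generated values
    print(wasserstein_distance(generated_means, ideal_means))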
Figure 20. Heatmap of Wasserstein distances between generated sets and the ideal set, and between shuffled sets and the ideal set, for Mean Tanimoto Similarity. For full distributions see Supplementary Figure 1.

Figure 21. Heatmap of Wasserstein distances between generated sets and the ideal set, and between shuffled sets and the ideal set, for Max Tanimoto Similarity. For full distributions see Supplementary Figure 2.

However, the models are not equally efficient at this generalization. The worst model is Base+S, which on average across all generated sets had a Wasserstein distance of 0.138 for the mean Tanimoto comparison and 0.571 for the max Tanimoto comparison. The Base+S+E model is second worst, with average mean and max Tanimoto Wasserstein scores of 0.131 and 0.548, respectively. The Base model is second best, with average mean and max Tanimoto Wasserstein scores of 0.122 and 0.529, respectively. The best model is Base+E, with average mean and max Tanimoto Wasserstein scores of 0.097 and 0.471, respectively. The performance of Base+E is a large jump: where the previous models improved on one another by an average of 0.008 and 0.021 for the mean and max Tanimoto metrics, the Base+E model improved by 0.024 and 0.058, more than double the improvement. It is also important to note that across all models beam search performs better than random sampling and, further, that 1-1 generation is consistently the best performer among the generation methods, with few exceptions.

On this metric the models can be compared to each other to get a sense of the effect of adding embeddings and the effect of using SELFIES strings. Comparing the Base and Base+S models across generation sets gives an average increase in the Wasserstein distance of 0.017 and 0.042 for the mean and max metrics, respectively. Comparing the Base+E and Base+S+E models gives an average increase of 0.034 and 0.077 for the mean and max metrics, respectively. This increase suggests an overall degradation of generalization from the use of SELFIES strings. The impact seems to harm embedding models more, with the mean Tanimoto degradation 2x worse in the embedding models and the max Tanimoto degradation 1.8x worse. Comparing the Base and Base+E models gives an average decrease in the Wasserstein distance of 0.024 and 0.058 for the mean and max metrics, respectively. Comparing the Base+S and Base+S+E models gives an average decrease of 0.007 and 0.023 for the mean and max metrics, respectively. This suggests an overall improvement of generalization from the use of the embeddings, but the size of the improvement is affected by the presence of SELFIES: models using SELFIES strings saw the mean Tanimoto improvement shrink by 3.4x and the max Tanimoto improvement by 2.5x.

The Tanimoto results can also be compared on raw counts. Comparison of two random samples of compounds determined that a mean Tanimoto score over 0.2 can be considered an outlier, and a max Tanimoto score over 0.3 can be considered an outlier. Using these cutoffs, we can count the number of generated molecules above them to see compound generation performance in absolute terms. Table 3 and Table 4 show that, despite lower validity, the Base+E embedding model consistently produces more high-value compounds in total.
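The binned counts reported in Tables 3 to 5 amount to a simple histogram over the per-compound scores. A minimal sketch with NumPy is shown below; the bin edges follow Table 4 (Table 3 would start at 0.2 and Table 5 at 0.5), while the score values are toy numbers, not thesis data.

    import numpy as np

    # One max-Tanimoto-similarity value per valid generated compound (toy values).
    max_tanimoto = np.array([0.12, 0.34, 0.41, 0.55, 0.28, 0.93])

    # Bins used in Table 4: 0.3-0.4, 0.4-0.5, ..., 0.9-1.0; values below 0.3 are not counted.
    edges = np.arange(0.3, 1.01, 0.1)
    counts, _ = np.histogram(max_tanimoto, bins=edges)

    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        print(f"{lo:.1f}-{hi:.1f}: {n}")
    print("Total over 0.3:", counts.sum())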
Table 3. Binned absolute count of outlier compounds (>0.2) for Mean Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.

Set              0.2-0.3  0.3-0.4  0.4-0.5  0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9  0.9-1.0  Total
Base_1-1               7        1        3        1        0        0        0        0     12
Base+S_1-1             7        2        1        1        0        0        0        0     11
Base+E_1-1            14        3        3        1        0        0        0        0     21
Base+S+E_1-1           9        3        1        2        0        0        0        0     15
Base_1-10             64        6       17        9        0        0        0        0     96
Base+S_1-10           78       12        4        8        0        0        0        0    102
Base+E_1-10          100       19       21        4        0        0        0        0    144
Base+S+E_1-10         78       18        7        9        0        0        0        0    112
Base_1-100           466      144      119       68       24       43        0        0    864
Base+S_1-100         419       88       69       42        6        5        0        0    629
Base+E_1-100         804      248      177      103       20       39        2        3   1396
Base+S+E_1-100       442      133       94       55        4        9        1        0    738

Table 4. Binned absolute count of outlier compounds (>0.3) for Max Tanimoto Similarity. Max values for each generation set are bolded. Ties are bolded and underlined.

Set              0.3-0.4  0.4-0.5  0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9  0.9-1.0  Total
Base_1-1               3        1        4        0        0        0        6     14
Base+S_1-1             4        1        1        0        0        1        3     10
Base+E_1-1             9        3        1        0        1        1        8     23
Base+S+E_1-1           4        0        1        1        0        0        8     14
Base_1-10             39       14       18        2        7        9       24    113
Base+S_1-10           53       10        5        3        2        4       18     95
Base+E_1-10           68       35       12       11       15       11       30    182
Base+S+E_1-10         43       18       18        9        9        6       28    131
Base_1-100           444      160      111       32       72       68      254   1141
Base+S_1-100         283       85       54       21       32       30      134    639
Base+E_1-100         656      279      166      116       84       76      288   1665
Base+S+E_1-100       329      139       70       54       46       33      154    825

The better performance of the Base+E model in both Wasserstein distance and absolute count shows that this model's performance cannot be attributed solely to producing fewer bad compounds; the model is also producing more high-quality compounds. The overall poorer performance of the SELFIES models could be a result of their guaranteed validity. Where the SMILES models may be uncertain and inevitably generate invalid SMILES, the SELFIES models force the generation of the uncertain compound, potentially leading to the inclusion of lower-quality compounds. Tables 3 and 4 do show that the SELFIES models produce fewer high-quality compounds, but not an unreasonably low number. This supports the idea that SELFIES performance may be poor not so much because of a lack of good compounds as because of the overproduction of poor compounds. On top of the generally poor performance of the SELFIES models, it appears that their inclusion potentially stifles improvement. An interesting result was the negative interaction between embeddings and SELFIES strings: the inclusion of embeddings improved Tanimoto performance, but less so when SELFIES were used, and SELFIES tended to hurt Tanimoto performance, but hurt the embedding models' performance more.

5.5.2 Classifier Scores

As was done for Tanimoto similarity, we can calculate the classifier scores for the generated sets and the shuffled sets. Since the classifier does not require a reference set of compounds for comparison, we can also calculate the classifier score distribution for the whole test set. Calculating the Wasserstein distance as before shows again that, overall, the shuffled sets perform worse than the generated sets, demonstrating the models' ability to generalize (Figure 22). In the performance comparison between models the order changes. The worst model is almost a tie between Base and Base+S, with average classifier Wasserstein scores of 0.306 and 0.305, respectively. The second-best model is Base+S+E, with an average classifier Wasserstein score of 0.25. The Base+E model remains the top model, with an average classifier Wasserstein score of 0.156. As with Tanimoto similarity, the jump the Base+E model makes is a larger gain than that of the other models.
Where Base+S+E makes roughly a 0.05 improvement, the Base+E model makes roughly a 0.10 improvement, almost double the gain. As seen in the Tanimoto results, beam search is consistently better than the 1-100 set, and the 1-1 set is the best performer.

Figure 22. Heatmap of Wasserstein distances between generated sets and the ideal set, and between shuffled sets and the ideal set, for Classifier Scores. For full distributions see Supplementary Figure 3.

Comparing the Base and Base+S models shows, on average, no overall change in classifier scores. For the random sampling generation set the Base model performs better by 0.051, while for beam search the Base+S model is better by 0.033 and 0.021 for the 1-10 and 1-1 generation sets, respectively. Comparing Base+E to Base+S+E, we see an average increase of 0.094. This continues the theme of SELFIES having a negative impact on performance and impacting the embedding model more severely. Comparing Base to Base+E we see an average improvement of 0.15, and comparing Base+S to Base+S+E we see an average improvement of 0.055. This is further evidence of the better generalization of the embedding models and of the hindrance SELFIES places on this improvement.

The classifier also allows us to visualize ROC curves comparing sets. When we ask the classifier to discriminate between the generated sets and the shuffled sets, all the models do better than a random guess (Figure 23). The test set result has an AUC score of 0.86. The best model is the Base+E model, with 1-100, 1-10, and 1-1 AUC scores of 0.73, 0.78, and 0.76, respectively. The second best is the Base+S+E model, with AUC scores of 0.64, 0.67, and 0.73, respectively. The Base and Base+S models come in almost tied in the low 0.60s. Paired with the previous Wasserstein distance results, this shows that all the models are generating compounds conditioned on their amino acid sequences, with the embedding models performing the best. When these results are compared to the models' performance against negative binding examples, the performance is almost identical, with the same models performing better against random samples as they did against their own shuffled values (Supplementary Figure 4).

Figure 23. ROC Curve for Generated vs. Shuffled Sets across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.

When the classifier is asked to discriminate between real binders and the generated sets, the Base+E model performs the best, with 1-100, 1-10, and 1-1 AUC scores of 0.72, 0.67, and 0.64 (Figure 24). The second best is the Base+S+E model with 0.78, 0.75, and 0.70, respectively. The Base and Base+S models again come in almost tied, in the high 0.70s to low 0.80s. The lower AUC scores mean the classifier has greater difficulty discerning between the real binders and the generated compounds. This very clearly shows that the embedding models are producing compound sets that better approximate the real binding compounds.

Figure 24. ROC Curve for Generated vs. Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend.
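The ROC comparisons above reduce to labelling compounds by which set they came from and scoring them with the binding classifier. A minimal sketch with scikit-learn follows; the probability arrays are toy stand-ins for the output of the evaluation classifier from Chapter 4, not its actual interface.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical classifier binding probabilities for compounds paired with the same protein:
    # one array for the generated compounds, one for the shuffled (mismatched) compounds.
    p_generated = np.array([0.81, 0.64, 0.55, 0.92, 0.47])  # toy values
    p_shuffled = np.array([0.22, 0.58, 0.31, 0.44, 0.12])   # toy values

    scores = np.concatenate([p_generated, p_shuffled])
    labels = np.concatenate([np.ones_like(p_generated), np.zeros_like(p_shuffled)])

    fpr, tpr, _ = roc_curve(labels, scores)
    print("AUC (generated vs. shuffled):", roc_auc_score(labels, scores))
    # An AUC near 0.5 means the two sets are indistinguishable to the classifier;
    # for generated vs. real binders, a lower AUC therefore indicates better approximation.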
As with Tanimoto similarity, we can also evaluate absolute counts (Table 5). Table 5 shows total counts for a set of bins above a 50% binding probability score. Comparing the tables (Tables 3, 4, and 5), the Tanimoto outlier cutoff is stricter, with fewer molecules overall making the cutoff. Looking at the absolute counts, Base+S+E produces the highest number of high-quality binders; the Base+E model only outperforms the other models in the 0.7-0.8 and 0.8-0.9 bins of the 1-100 generation set. However, Base+E does outperform Base in total in all generation sets. The Base+E model's outsized performance in AUC scores and classifier Wasserstein distance, despite its lower absolute compound count, highlights the importance of compound filtering. Considering the validity scores in Table 1, it is clear that the Base+E model is the most efficient model in compound production. For example, the Base+E 1-100 set generates 4830 compounds over 50% binding probability and the Base+S+E 1-100 set generates 4931; but considering their validity difference of 77% and 100%, respectively, the Base+S+E model is generating roughly 2000 extra compounds to gain only an additional 101 high-probability compounds. For generation methods, the classifier results agree with the Tanimoto similarity results, with 1-1 generation performing the best and 1-10 and 1-100 generation coming in second and third, with few exceptions. The overall trend of embeddings improving performance and SELFIES hurting performance remains true.

Table 5. Binned absolute count of compounds classified as binders (>0.5) for Classifier Scores. Max values for each generation set are bolded. Ties are bolded and underlined.

Set              0.5-0.6  0.6-0.7  0.7-0.8  0.8-0.9  0.9-1.0  Total
Base_1-1              10        9       22        4        3     48
Base+S_1-1             8       10       16       10        5     49
Base+E_1-1             6        4       15       13       11     49
Base+S+E_1-1           5       13       16       15       11     60
Base_1-10             71       97      144       76       24    412
Base+S_1-10           89      123      114      103       70    499
Base+E_1-10           52       68      117      117       65    419
Base+S+E_1-10         60      101      127      127       91    506
Base_1-100           614      923     1209      716      409   3871
Base+S_1-100         743      990     1009      867      500   4109
Base+E_1-100         677      984     1402     1126      641   4830
Base+S+E_1-100       826     1079     1212     1110      704   4931

5.5.3 Case Study

The exploration of specific proteins allows us to see the effectiveness of the models in a case study. The classifier, when applied to real binders and a random sample for IGF-1, was able to distinguish between the two well, with an AUC of 0.90 (Figures 25, 26). For VEGFR2 the classifier was still successful in differentiating, but much less so, with an AUC score of only 0.76 (Figures 27, 28). This provides some confidence that the classifier can classify appropriately for these novel amino acid sequences and binders. For the case studies, only generation sets 1-10 and 1-100 are used, because the 1-1 set would only generate 3 and 2 molecules for IGF-1 and VEGFR2, respectively.

For IGF-1, when comparing real binders to the generated sets, only three sets gave reasonable performance: Base+E 1-10 and 1-100 and Base+S+E 1-10, with AUC scores of 0.69, 0.55, and 0.76, respectively (Figure 26). When looking at the performance of generated molecules against a random sample, only the Base+E 1-10 and 1-100 sets maintain reasonable performance, with AUC scores of 0.89 and 0.87, respectively (Figure 25). In both comparisons a surprising result is that the other generated sets actually perform worse than random molecules (Figures 25, 26).
That is, when the comparison is against real binders, the classifier separates the generated compounds more easily than it separates random compounds, and when the comparison is against random compounds, the generated compounds receive worse classifier probabilities than the random compounds do. What doing worse than random compounds may indicate is that these models are focusing on a specific, incorrect subset of the chemical space. The negative examples come from a random sample of the dataset, which means the diversity of compounds allows some, by chance, to score higher on the classifier. Performing worse than this set must therefore mean that the models are focusing their generation in an area of chemical space that is particularly wrong and are not diversifying. This could be evidence of a lack of generalization in the other models for the IGF-1 protein family: the other models are latching onto the wrong types of protein neighbours in the protein embedding space and therefore generating the wrong type of binders, whereas the Base+E model is correctly finding associations with a set of analogous proteins and therefore generating more appropriate binders.

For VEGFR2 the performance across all the generated sets is more reasonable. In the comparison of generated compounds and real binders, the Base and Base+E models stand out as the best performers (Figure 28). The Base model has AUC scores of 0.58 and 0.61 for the 1-100 and 1-10 generation sets, respectively, and the Base+E model has AUC scores of 0.54 and 0.48 for the 1-100 and 1-10 generation sets, respectively. For the comparison of generated sets to negative binding examples this result holds, with the Base model having AUC scores of 0.67 for both generated sets and the Base+E model having AUC scores of 0.73 and 0.79 for the 1-100 and 1-10 generation sets, respectively (Figure 27).

Through these results we see the Base+E model consistently performing the best. The effect of the embeddings seems greater in the IGF-1 case than in the VEGFR2 case. This could be because there may be closer examples to VEGFR2 in the training set than there are for IGF-1, so the advantage of the larger pretrained embedding is not as necessary or prominent. However, it is also clear that the classifier has a more difficult time even with real binders for VEGFR2, so it is possible that there is a larger benefit from the inclusion of embeddings for VEGFR2 that the classifier does not have the resolution to show. Consistent with the previous classifier and Tanimoto results, we see that the inclusion of SELFIES hinders the models and stifles the benefits of adding the embeddings. The IGF-1 results agree with the previous results, with Base+E being the best model and Base+S+E second best, although the performance gap between Base+E and Base+S+E is much larger than in the other results. VEGFR2 deviates from previous results in that the Base model is the second-best model, but agrees in that Base+E is still the best model. In the case study results the improvement from beam search is less prominent, with performance between 1-10 and 1-100 being very similar.

Figure 25. ROC Curve for IGF-1 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.

Figure 26. ROC Curve for IGF-1 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs.
Random Sample Set; AUC Scores are included in the legend.

Figure 27. ROC Curve for VEGFR2 Generated Sets vs. Real Binder Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.

Figure 28. ROC Curve for VEGFR2 Generated Sets vs. Random Sample Set across four model groups (Base, Base+S, Base+E, Base+S+E) and two generation sets (1-10, 1-100), with reference to Random Guess and Real Binders vs. Random Sample Set; AUC Scores are included in the legend.

Chapter 6 - Conclusions and Future Research

6.1 Conclusions

6.1.1 Effect of SELFIES

The results show that the introduction of SELFIES strings has a potential negative effect on performance. In chemical feasibility, SELFIES strings improved validity, novelty, and diversity. For many of the physiochemical properties, however, the SELFIES models consistently showed the largest deviations from the training and test sets. Some of these deviations may potentially be positive, like the shift toward more natural-product-like compounds, but others are negative, like the increase in SA score. The overall concern with this deviation is that if the models are generating outside the domain of the training set, this is a potential indication that the models are generating noise. In the Tanimoto results and classifier results, including the case studies, we consistently see the Base+S model perform the worst, and when comparing Base to Base+S and Base+E to Base+S+E we consistently see a performance drop. An interesting observation in these comparisons is that the SELFIES strings appear to have a larger negative performance impact on the Base+E model than on the Base model, suggesting that SELFIES strings are a hindrance to the performance gains facilitated by the embeddings. The deviations from the training set in physiochemical properties, together with this poor performance in producing binders, suggest that the SELFIES models' forced validity may result in more noise and low-quality molecules being produced, and that SMILES invalidity may actually be a useful filtering mechanism. This is most clear when taking the absolute counts into consideration: for the Tanimoto and classifier results we can see that the SELFIES models produce a reasonable number of high-quality molecules, but performance is still poor because of the total number of poor-quality molecules generated. A recent paper by Skinnider 2024 [113] supports this interpretation. That work provided causal evidence that invalid outputs are actually beneficial for filtering out low-likelihood samples and that enforcing valid outputs via SELFIES strings produces structural biases in the generated compounds [113]. Skinnider 2024 [113] ultimately suggests that the creation of invalid SMILES strings can be seen as a feature, not a bug, of generation models. This aligns well with our finding that SELFIES models could generate high-quality molecules but were overall hindered by their overgeneration of poor-quality molecules.
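The forced-validity behaviour discussed above can be illustrated directly with the selfies package [82] and RDKit. The sketch below is illustrative only: the example strings are toys, and the exact tokenization used by the thesis models is not reproduced here.

    import selfies as sf
    from rdkit import Chem

    # A syntactically broken SMILES string (unclosed aromatic ring) fails RDKit parsing,
    # so an uncertain SMILES sample can simply be discarded by the validity filter.
    print(Chem.MolFromSmiles("c1ccccc") is None)   # True

    # SELFIES strings, by construction, always decode to some valid molecule.
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    tokens = sf.encoder(aspirin)                   # SMILES -> SELFIES
    print(sf.decoder(tokens))                      # ...and back to a valid SMILES

    # Even a truncated (i.e., "uncertain") SELFIES string still decodes to a valid
    # molecule, so low-confidence samples are kept rather than filtered out.
    truncated = "".join(list(sf.split_selfies(tokens))[:5])
    decoded = sf.decoder(truncated)
    print(decoded, Chem.MolFromSmiles(decoded) is not None)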
6.1.2 Effect of Embeddings

The results revealed that pretrained embeddings can improve conditional compound generation performance. This improvement in compound quality is seen in the Tanimoto results, the classifier results, and the case studies explored. The improvement appears to come at no cost to physiochemical properties, with the only potential downside being a decrease in validity. Even so, the increase in compound quality outweighs the decrease in validity: the raw number of high-quality molecules generated is still the highest for the Tanimoto metrics and second highest for the classifier metrics. Notably, the classifier results, including the case studies, show a striking improvement in the Base+E model compared to the other models. The improvement from embeddings is also seen in the SELFIES models, although it is less prominent. This increased ability to generate SMILES strings that approximate true binding SMILES can potentially be attributed to the pretrained embeddings allowing the model to generalize and find better associations between new proteins and the proteins in the training set. This improvement from pretrained embeddings in conditional compound generation is a novel finding, but it aligns with the improvements seen in binding prediction and compound screening [95], [96], [97], [98], [99], [114]. The notable point is that this improvement comes with no change to the structure or size of the Transformer decoder.

An interesting note, when looking at the absolute count results, is that Base+S+E could potentially be a viable model if the generation task has an efficient way to filter generated compounds. If that is the case, then Base+S+E stands to produce more high-quality compounds in total. Unfortunately, in use cases where the model is generating new lead molecules for a novel amino acid sequence, filtering may not be feasible. This highlights Skinnider 2024's [113] assertion that low validity could be a feature of SMILES string generation rather than a bug.

6.1.3 Effect of Generation Method

Performance did vary between the methods of compound generation. In the chemical feasibility results, random sampling appeared to improve uniqueness, novelty, and internal diversity. For the Base model it had a negative effect on validity, but Base+E was unaffected. In the physiochemical properties, random sampling did tend to produce more outliers outside the domain of the training set, but the distribution shapes overall agreed with the training data. In the Tanimoto and classifier results there was a consistent trend of the beam search methods performing better than random sampling, with 1-1 being the best performing generation method. This is not a surprising result, given that beam search generates the most probable compound, which in turn means it is the compound the model thinks best matches the provided amino acid context. However, the difference in performance is not drastic, and considering that random sampling improved diversity and novelty, it presents itself as a reasonable method for generating and exploring a larger, more diverse set of lead molecules.

6.1.4 Scaling Evaluation with a Classifier

The results of the case study show how performance may vary from protein to protein and highlight the importance of scaling compound generation evaluation to acquire more robust results. The classifier used for evaluation was successful: it was able to distinguish between training and test sets and, when provided with novel amino acid sequences for the case study, showed effectiveness in distinguishing between real binders and random compounds. These results serve as a good proof of concept and motivate the creation of a more universal binding classifier for benchmarking compound generation at scale.
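For context on what the pretrained protein embeddings credited in Section 6.1.2 involve in practice, the following sketch extracts a per-protein ESM-2 representation with the fair-esm package [88]. The specific checkpoint (esm2_t33_650M_UR50D), the mean pooling over residues, and the toy sequence are illustrative assumptions; they are not necessarily the exact checkpoint or pooling used in this thesis.

    import torch
    import esm

    # Load a pretrained ESM-2 checkpoint and its tokenizer/alphabet.
    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    batch_converter = alphabet.get_batch_converter()
    model.eval()

    data = [("toy_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # illustrative sequence
    _, _, tokens = batch_converter(data)

    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=False)
    per_residue = out["representations"][33]       # (batch, sequence + special tokens, 1280)

    # One fixed-size embedding per protein: mean over residues, skipping the BOS/EOS tokens.
    seq_len = len(data[0][1])
    protein_embedding = per_residue[0, 1:seq_len + 1].mean(dim=0)
    print(protein_embedding.shape)                 # torch.Size([1280])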
6.1.5 Transformers Moving Forward

Recurrent Neural Networks have historically been the model of choice for compound generation [28]. However, Transformers provide a number of key benefits for the field moving forward. Their ability to parallelize and scale allows larger models to be trained faster, as seen with the current rise of large language models [16]. The use of attention allows the model to extract key information on how protein sequences and compound sequences attend to each other, which is potentially important for determining binding or other associations. Most notably, Transformers are proliferating in bioinformatic models more broadly, as seen with AlphaFold2 and ESM-2 [85], [88]. The utilization of Transformer models for compound generation allows for the more seamless repurposing of these pretrained models, as displayed in this work. It is important to reiterate that although compound generation methods have the advantage of potentially providing novel compounds, they also raise the concern of more overhead in synthesizing and validating the proposed compounds. Because of this, the use of discriminative machine learning approaches on established compound libraries can still be a more favorable approach.

6.2 Future Research

6.2.1 Shortcomings and Limitations

This research was restricted computationally, which led to some experimentation being less robust than we would hope. Performing multiple train-test splits would allow us to validate the results of this work with more confidence. In addition to multiple splits, the datasets could be improved by splitting not only on identical amino acid sequences but also on amino acid sequence similarity. This would create train-test splits that are more diverse and would better test the models' ability to generalize.

The training length of the models is also a point that can be explored further. In this work we trained the models for 50 epochs, whereas the models in Grechishnikova 2021 [59] were trained for approximately 200 epochs. The effects of longer training still need to be investigated. There is a possibility that with more training the non-embedding and embedding models may converge in performance, although the observed trend makes this seem unlikely.

The evaluation in this research makes use of a classifier, which is only a proxy for evaluating binding. With more compute power the models could be evaluated with more robust binding simulations. There is also an opportunity to explore the use of open-source binding and affinity prediction models to benchmark results against independently trained models. Ideally, utilizing multiple independent models could allow for more confidence in the results as well.

6.2.2 Improvements

Analysis of Performance on Protein Family

This research evaluates the models at an aggregate level and only goes into specifics for two example proteins in a case study. The case study demonstrated that performance for any given protein can be variable. A deeper dive into trends in performance related to protein families or amino acid sequence similarity would be valuable. This could uncover whether the models have any protein family bias. It could also aid a deeper investigation of the Base+E validity drop. If the validity drop is caused by focusing on a broader set of protein associations, we may expect amino acid sequences that do not have close associations to proteins in the training set, but rather multiple distant associations, to have worse validity.
The true binders for these amino acid associations could be identified in the training set and compared to the invalid molecules generated for that amino acid sequence, looking for potential substring similarity. This could further validate the hypothesis that validity drops due to uncertainty caused by broader amino acid sequence associations.

Exploring Different Generation Parameters

This research only explores beam search with an alpha value of 0.6 and random sampling with a temperature of 1. A more exhaustive exploration of the effect of these parameters on performance would be a valuable exercise. For example, increasing the temperature will increase the randomness of sampled generation, and measuring the potential increase in diversity and novelty against a potential decrease in performance would be interesting. The alpha value of 0.6 was determined to be optimal by Wu et al. 2016 [75] for translation tasks, so it would be interesting to re-evaluate this parameter empirically for compound generation to see whether a different value is preferable.

Exploring Few-Shot Performance

The evaluation of the models in this work focuses on zero-shot generation of compounds from amino acid sequences; that is, the models generate compounds for an unknown amino acid sequence with no additional training. Other papers, however, fine-tune a pretrained model with known binders for the protein of interest [35], [36], [37]. It would be interesting to compare the established RNN methods to the models explored in this work, which use amino acid sequence context. In theory, the models created in this work may be able to generalize to new examples faster, because they are not only learning the compounds but also learning the amino acid sequences and making associations to other amino acid sequences they already know. It would also be interesting to compare the models in this work against each other in that setting, to see whether the embedding models are able to perform better with fewer fine-tuning cycles. This could present a more practical use case for the models in this work, as foundational models that can be quickly fine-tuned for generation against a specific protein.

Generating Compounds from Scaffolds

All compounds generated in this research were generated from scratch, but it is possible to give the decoder a starting sequence to extend. It would be interesting to explore this functionality in a variety of ways, for example by giving the same scaffold to multiple amino acid sequences and comparing the different generations. It would also be possible to take a lead compound that has already been generated, make a scaffold from it, and have the models branch off of it while varying the random sampling temperature, to see the effectiveness of iterating on generated molecules repeatedly.

Graphical Network for Evaluation Classifier

The classifier used in evaluation takes in compounds as SMILES strings. This means that, at its core, the model is measuring whether a string is associated with a protein; the hope is that the classifier is in some way deriving the physical and chemical properties from the SMILES string. An interesting exploration would be retraining the classifier as a graph neural network that takes in compounds via their graph structure instead of their string structure, and seeing whether the results remain the same. Such a classifier could be more informative, since it would directly evaluate the compounds' molecular structure and its similarity to the structures of other binders, rather than a string proxy for that structure.
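As a sense of what such a graph-based classifier would consume, the sketch below converts a SMILES string into simple node features and a bond list with RDKit. The particular atom features and the output format are illustrative assumptions; an actual graph neural network classifier would likely use a richer featurization and a dedicated graph library.

    from rdkit import Chem

    def smiles_to_graph(smiles):
        """Turn a SMILES string into (atom feature tuples, bond list) for a graph model."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        atom_features = [
            (atom.GetAtomicNum(), atom.GetDegree(), atom.GetFormalCharge(), atom.GetIsAromatic())
            for atom in mol.GetAtoms()
        ]
        bonds = [
            (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondTypeAsDouble())
            for bond in mol.GetBonds()
        ]
        return atom_features, bonds

    atoms, edges = smiles_to_graph("c1ccccc1O")  # phenol as a toy example
    print(len(atoms), "atoms,", len(edges), "bonds")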
Exploring Model Size

This work explores models with Encoder and Decoder structures and sizes identical to Grechishnikova 2021 [59]. This leaves room to explore increasing (or decreasing) the model size, whether in d_ff, d_model, or the number of layers and attention heads. It would be interesting to explore the overall performance impact of a larger model, namely whether the non-embedding models eventually narrow the performance gap with the embedding models and whether an increased decoder size could help the performance of SELFIES string generation. When Skinnider 2024 [113] compares SMILES string and SELFIES string generation, the model sizes and structures are also identical, which leaves room for exploring the impact of model size on SELFIES string generation; there is a chance that a larger model may be needed to better grasp the more complex grammar of SELFIES strings. Outside of the architecture size, the other embedding sizes of ESM-2 could also be explored, to see the impact of the smaller models as well as the larger ones if computational resources allow.

6.3 Final Statement

This work explores how pretrained protein embeddings from a large protein language model, ESM-2 [88], and the use of an alternative string compound representation, SELFIES [82], affect the performance of compound generation in Transformer models. Overall, the results showed that the inclusion of pretrained embeddings has the potential to improve model performance in generating high-quality compounds for specific amino acid sequences. The utilization of SELFIES strings was seen to improve validity and diversity but promoted difficult-to-synthesize compounds and overall hurt the performance of the models. Random sampling with temperature was shown to be a less effective generation method than beam search, but considering its non-deterministic nature and increased speed it is a feasible methodology for exploring compound generation. These results ultimately show that, without an increase in model size or a change in structure, the application of pretrained protein embeddings can improve protein-specific compound generation in Transformer models. These models can potentially be used to generate lead compounds for novel proteins without any retraining or fine-tuning.

Bibliography

[1] R. Leinonen, H. Sugawara, M. Shumway, and International Nucleotide Sequence Database Collaboration, “The Sequence Read Archive,” Nucleic Acids Research, vol. 39, no. Database issue, pp. D19-21, Jan. 2011, doi: 10.1093/nar/gkq1019.
[2] M. A. Jensen, V. Ferretti, R. L. Grossman, and L. M. Staudt, “The NCI Genomic Data Commons as an Engine for Precision Medicine,” Blood, vol. 130, no. 4, pp. 453–459, Jul. 2017, doi: 10.1182/blood-2017-03-735654.
[3] H. M. Berman et al., “The Protein Data Bank,” Nucleic Acids Res, vol. 28, no. 1, pp. 235–242, Jan. 2000, doi: 10.1093/nar/28.1.235.
[4] A. Gaulton et al., “The ChEMBL Database in 2017,” Nucleic Acids Res, vol. 45, no. D1, pp. D945–D954, Jan. 2017, doi: 10.1093/nar/gkw1074.
[5] S. Kim et al., “PubChem Substance and Compound databases,” Nucleic Acids Res, vol. 44, no. D1, pp. D1202-1213, Jan. 2016, doi: 10.1093/nar/gkv951.
[6] J. J. Irwin and B. K. Shoichet, “Zinc--a Free Database of Commercially Available Compounds for Virtual Screening,” J Chem Inf Model, vol. 45, no. 1, pp. 177–182, Feb. 2005, doi: 10.1021/ci049714+.
[7] Research and Markets, “Global Pharmaceuticals Market Report 2021-2030 Featuring Major Players - Pfizer; F.
Hoffmann-La Roche Ltd; Sanofi; Johnson & Johnson and Merck & Co.” Accessed: Oct. 05, 2022. [Online]. Available: https://www.prnewswire.com/newsreleases/global-pharmaceuticals-market-report-2021-2030-featuring-major-players--pfizer-f-hoffmann-la-roche-ltd-sanofi-johnson--johnson-and-merck--co-301260487.html [8] C. H. Wong, K. W. Siah, and A. W. Lo, “Estimation of Clinical Trial Success Rates and Related Parameters,” Biostatistics, vol. 20, no. 2, pp. 273–286, Apr. 2019, doi: 10.1093/biostatistics/kxx069. [9] S. M. Paul et al., “How to Improve R&d Productivity: The Pharmaceutical Industry’s Grand Challenge,” Nat Rev Drug Discov, vol. 9, no. 3, Art. no. 3, Mar. 2010, doi: 10.1038/nrd3078. [10] Y. Cheng, Y. Gong, Y. Liu, B. Song, and Q. Zou, “Molecular Design in Drug Discovery: A Comprehensive Review of Deep Generative Models,” Brief Bioinform, vol. 22, no. 6, p. bbab344, Nov. 2021, doi: 10.1093/bib/bbab344. [11] I. H. Sarker, “Machine Learning: Algorithms, Real-World Applications and Research Directions,” SN COMPUT. SCI., vol. 2, no. 3, p. 160, Mar. 2021, doi: 10.1007/s42979021-00592-x. [12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org 95 [13] I. Goodfellow et al., “Generative Adversarial Networks,” Commun. ACM, vol. 63, no. 11, pp. 139–144, Oct. 2020, doi: 10.1145/3422622. [14] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735. [15] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” Dec. 11, 2014, arXiv: arXiv:1412.3555. doi: 10.48550/arXiv.1412.3555. [16] A. Vaswani et al., “Attention Is All You Need,” Dec. 06, 2017, arXiv: arXiv:1706.03762. doi: 10.48550/arXiv.1706.03762. [17] A. Mayr, G. Klambauer, T. Unterthiner, and S. Hochreiter, “DeepTox: Toxicity Prediction using Deep Learning,” Frontiers in Environmental Science, vol. 3, 2016, Accessed: Oct. 06, 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fenvs.2015.00080 [18] W. F. D. Bennett et al., “Predicting Small Molecule Transfer Free Energies by Combining Molecular Dynamics Simulations and Deep Learning,” J. Chem. Inf. Model., vol. 60, no. 11, pp. 5375–5381, Nov. 2020, doi: 10.1021/acs.jcim.0c00318. [19] A. Lusci, G. Pollastri, and P. Baldi, “Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules,” J Chem Inf Model, vol. 53, no. 7, pp. 1563–1575, Jul. 2013, doi: 10.1021/ci400187y. [20] J. Jiménez, M. Škalič, G. Martínez-Rosell, and G. De Fabritiis, “KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks,” J. Chem. Inf. Model., vol. 58, no. 2, pp. 287–296, Feb. 2018, doi: 10.1021/acs.jcim.7b00650. [21] H. Öztürk, A. Özgür, and E. Ozkirimli, “Deepdta: Deep Drug–Target Binding Affinity Prediction,” Bioinformatics, vol. 34, no. 17, pp. i821–i829, Sep. 2018, doi: 10.1093/bioinformatics/bty593. [22] T. Nguyen, H. Le, T. P. Quinn, T. Nguyen, T. D. Le, and S. Venkatesh, “GraphDTA: Predicting Drugd–Target Binding Affinity with Graph Neural Networks,” Oct. 02, 2020, bioRxiv. doi: 10.1101/684662. [23] P. T. Jackson et al., “Phenotypic Profiling of High Throughput Imaging Screens with Generic Deep Convolutional Features,” Mar. 15, 2019, arXiv: arXiv:1903.06516. Accessed: Oct. 05, 2022. [Online]. Available: http://arxiv.org/abs/1903.06516 [24] P. Karpov, G. 
Godin, and I. V. Tetko, “Transformer-CNN: Swiss Knife for QSAR Modeling and Interpretation,” Journal of Cheminformatics, vol. 12, no. 1, p. 17, Mar. 2020, doi: 10.1186/s13321-020-00423-w. [25] C.-T. Chi, M.-H. Lee, C.-F. Weng, and M. K. Leong, “In Silico Prediction of PAMPA Effective Permeability Using a Two-QSAR Approach,” Int J Mol Sci, vol. 20, no. 13, p. 3170, Jun. 2019, doi: 10.3390/ijms20133170. 96 [26] S. Kim and K.-H. Cho, “PyQSAR: A Fast QSAR Modeling Platform Using Machine Learning and Jupyter Notebook,” Bulletin of the Korean Chemical Society, vol. 40, no. 1, pp. 39–44, 2019, doi: 10.1002/bkcs.11638. [27] J. Vamathevan et al., “Applications of Machine Learning in Drug Discovery and Development,” Nat Rev Drug Discov, vol. 18, no. 6, Art. no. 6, Jun. 2019, doi: 10.1038/s41573-019-0024-5. [28] D. D. Martinelli, “Generative Machine Learning for De Novo Drug Discovery: A Systematic Review,” Computers in Biology and Medicine, vol. 145, p. 105403, Jun. 2022, doi: 10.1016/j.compbiomed.2022.105403. [29] X. Tong et al., “Generative Models for De Novo Drug Design,” J. Med. Chem., vol. 64, no. 19, pp. 14011–14027, Oct. 2021, doi: 10.1021/acs.jmedchem.1c00927. [30] T. B. Brown et al., “Language Models are Few-Shot Learners,” Jul. 22, 2020, arXiv: arXiv:2005.14165. doi: 10.48550/arXiv.2005.14165. [31] P. Dhariwal and A. Nichol, “Diffusion Models Beat GANs on Image Synthesis,” Jun. 01, 2021, arXiv: arXiv:2105.05233. doi: 10.48550/arXiv.2105.05233. [32] E. J. Bjerrum and R. Threlfall, “Molecular Generation with Recurrent Neural Networks (RNNs),” May 17, 2017, arXiv: arXiv:1705.04612. doi: 10.48550/arXiv.1705.04612. [33] P. Ertl, R. Lewis, E. Martin, and V. Polyakov, “In Silico Generation of Novel, Drug-Like Chemical Matter Using the LSTM Neural Network,” Jan. 08, 2018, arXiv: arXiv:1712.07449. doi: 10.48550/arXiv.1712.07449. [34] S. Harel and K. Radinsky, “Prototype-Based Compound Discovery Using Deep Generative Models,” Mol. Pharmaceutics, vol. 15, no. 10, pp. 4406–4416, Oct. 2018, doi: 10.1021/acs.molpharmaceut.8b00474. [35] M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller, “Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks,” ACS Cent. Sci., vol. 4, no. 1, pp. 120–131, Jan. 2018, doi: 10.1021/acscentsci.7b00512. [36] A. Gupta, A. T. Müller, B. J. H. Huisman, J. A. Fuchs, P. Schneider, and G. Schneider, “Generative Recurrent Networks for De Novo Drug Design,” Molecular Informatics, vol. 37, no. 1–2, p. 1700111, 2018, doi: 10.1002/minf.201700111. [37] D. Merk, L. Friedrich, F. Grisoni, and G. Schneider, “De Novo Design of Bioactive Small Molecules by Artificial Intelligence,” Mol Inform, vol. 37, no. 1–2, p. 1700153, Jan. 2018, doi: 10.1002/minf.201700153. [38] J. Arús-Pous et al., “Randomized SMILES Strings Improve the Quality of Molecular Generative Models,” Journal of Cheminformatics, vol. 11, no. 1, p. 71, Nov. 2019, doi: 10.1186/s13321-019-0393-0. 97 [39] T. Blaschke and J. Bajorath, “Fine-Tuning of a Generative Neural Network for Designing Multi-Target Compounds,” J Comput Aided Mol Des, vol. 36, no. 5, pp. 363–371, May 2022, doi: 10.1007/s10822-021-00392-8. [40] R. Gómez-Bombarelli et al., “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules,” ACS Cent. Sci., vol. 4, no. 2, pp. 268–276, Feb. 2018, doi: 10.1021/acscentsci.7b00572. [41] E. J. Bjerrum and B. Sattarov, “Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders,” Biomolecules, vol. 8, no. 4, Art. no. 
4, Dec. 2018, doi: 10.3390/biom8040131. [42] A. Kadurin et al., “The Cornucopia of Meaningful Leads: Applying Deep Adversarial Autoencoders for New Molecule Development in Oncology,” Oncotarget, vol. 8, no. 7, pp. 10883–10890, Dec. 2016, doi: 10.18632/oncotarget.14073. [43] S. Kang and K. Cho, “Conditional Molecular Design with Deep Generative Models,” J Chem Inf Model, vol. 59, no. 1, pp. 43–52, Jan. 2019, doi: 10.1021/acs.jcim.8b00263. [44] T. Blaschke, M. Olivecrona, O. Engkvist, J. Bajorath, and H. Chen, “Application of Generative Autoencoder in De Novo Molecular Design,” Molecular Informatics, vol. 37, no. 1–2, p. 1700123, 2018, doi: 10.1002/minf.201700123. [45] J. Lim, S. Ryu, J. W. Kim, and W. Y. Kim, “Molecular Generative Model Based on Conditional Variational Autoencoder for De Novo Molecular Design,” Journal of Cheminformatics, vol. 10, no. 1, p. 31, Jul. 2018, doi: 10.1186/s13321-018-0286-7. [46] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato, “Grammar Variational Autoencoder,” Mar. 06, 2017, arXiv: arXiv:1703.01925. Accessed: Oct. 06, 2022. [Online]. Available: http://arxiv.org/abs/1703.01925 [47] S. H. Hong, S. Ryu, J. Lim, and W. Y. Kim, “Molecular Generative Model Based on an Adversarially Regularized Autoencoder,” J. Chem. Inf. Model., vol. 60, no. 1, pp. 29–36, Jan. 2020, doi: 10.1021/acs.jcim.9b00694. [48] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, and A. AspuruGuzik, “Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models,” Feb. 06, 2018, arXiv: arXiv:1705.10843. Accessed: Oct. 06, 2022. [Online]. Available: http://arxiv.org/abs/1705.10843 [49] E. Putin et al., “Adversarial Threshold Neural Computer for Molecular de Novo Design,” Mol. Pharmaceutics, vol. 15, no. 10, pp. 4386–4397, Oct. 2018, doi: 10.1021/acs.molpharmaceut.7b01137. [50] O. Prykhodko et al., “A De Novo Molecular Generation Method Using Latent Vector Based Generative Adversarial Network,” Journal of Cheminformatics, vol. 11, no. 1, p. 74, Dec. 2019, doi: 10.1186/s13321-019-0397-9. 98 [51] O. Méndez-Lucio, B. Baillif, D.-A. Clevert, D. Rouquié, and J. Wichard, “De Novo Generation of Hit-Like Molecules from Gene Expression Signatures Using Artificial Intelligence,” Nat Commun, vol. 11, no. 1, Art. no. 1, Jan. 2020, doi: 10.1038/s41467-01913807-w. [52] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov, “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico,” Mol. Pharmaceutics, vol. 14, no. 9, pp. 3098–3104, Sep. 2017, doi: 10.1021/acs.molpharmaceut.7b00346. [53] D. Polykovskiy et al., “Entangled Conditional Adversarial Autoencoder for de Novo Drug Discovery,” Mol. Pharmaceutics, vol. 15, no. 10, pp. 4398–4405, Oct. 2018, doi: 10.1021/acs.molpharmaceut.8b00839. [54] Y. Liu and J. Bailey, “De Novo Molecular Generation with Stacked Adversarial Model,” Oct. 24, 2021, arXiv: arXiv:2110.12454. Accessed: Oct. 06, 2022. [Online]. Available: http://arxiv.org/abs/2110.12454 [55] X. Bai and Y. Yin, “Exploration and Augmentation of Pharmacological Space Via Adversarial Auto-Encoder Model for Facilitating Kinase-Centric Drug Development,” Journal of Cheminformatics, vol. 13, no. 1, p. 95, Dec. 2021, doi: 10.1186/s13321-02100574-4. [56] Q. Bai, S. Tan, T. Xu, H. Liu, J. Huang, and X. Yao, “MolAICal: A Soft Tool for 3d Drug Design of Protein Targets by Artificial Intelligence and Classical Algorithm,” Briefings in Bioinformatics, vol. 22, no. 3, p. 
bbaa161, May 2021, doi: 10.1093/bib/bbaa161. [57] V. Bagal, R. Aggarwal, P. K. Vinod, and U. D. Priyakumar, “MolGPT: Molecular Generation Using a Transformer-Decoder Model,” J. Chem. Inf. Model., vol. 62, no. 9, pp. 2064–2076, May 2022, doi: 10.1021/acs.jcim.1c00600. [58] V. Mandhana and R. Taware, “De Novo Drug Design Using Self Attention Mechanism,” in Proceedings of the 35th Annual ACM Symposium on Applied Computing, in SAC ’20. New York, NY, USA: Association for Computing Machinery, Mar. 2020, pp. 8–12. doi: 10.1145/3341105.3374024. [59] D. Grechishnikova, “Transformer Neural Network for Protein-Specific De Novo Drug Generation as a Machine Translation Problem,” Sci Rep, vol. 11, no. 1, Art. no. 1, Jan. 2021, doi: 10.1038/s41598-020-79682-4. [60] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, “Molecular De-Novo Design Through Deep Reinforcement Learning,” Journal of Cheminformatics, vol. 9, no. 1, p. 48, Sep. 2017, doi: 10.1186/s13321-017-0235-x. [61] M. Popova, O. Isayev, and A. Tropsha, “Deep Reinforcement Learning for De Novo Drug Design,” Science Advances, vol. 4, no. 7, p. eaap7885, Jul. 2018, doi: 10.1126/sciadv.aap7885. 99 [62] T. Blaschke, O. Engkvist, J. Bajorath, and H. Chen, “Memory-Assisted Reinforcement Learning for Diverse Molecular De Novo Design,” Journal of Cheminformatics, vol. 12, no. 1, p. 68, Nov. 2020, doi: 10.1186/s13321-020-00473-0. [63] B. Sanchez-Lengeling, C. Outeiral, G. L. Guimaraes, and A. Aspuru-Guzik, “Optimizing Distributions Over Molecular Space. an Objective-Reinforced Generative Adversarial Network for Inverse-Design Chemistry (ORGANIC),” Aug. 2017, doi: 10.26434/chemrxiv.5309668.v3. [64] Z. Zhou, S. Kearnes, L. Li, R. N. Zare, and P. Riley, “Optimization of Molecules via Deep Reinforcement Learning,” Sci Rep, vol. 9, no. 1, Art. no. 1, Jul. 2019, doi: 10.1038/s41598-019-47148-x. [65] A. Ghanbarpour and M. A. Lill, “Seq2Mol: Automatic Design of De Novo Molecules Conditioned by the Target Protein Sequences Through Deep Neural Networks,” Oct. 29, 2020, arXiv: arXiv:2010.15900. doi: 10.48550/arXiv.2010.15900. [66] J. Born, M. Manica, A. Oskooei, J. Cadow, G. Markert, and M. Rodríguez Martínez, “PaccMannRL: De Novo Generation of Hit-Like Anticancer Molecules from Transcriptomic Data Via Reinforcement Learning,” iScience, vol. 24, no. 4, p. 102269, Mar. 2021, doi: 10.1016/j.isci.2021.102269. [67] P.-C. Kotsias, J. Arús-Pous, H. Chen, O. Engkvist, C. Tyrchan, and E. J. Bjerrum, “Direct Steering of De Novo Molecular Generation with Descriptor Conditional Recurrent Neural Networks,” Nat Mach Intell, vol. 2, no. 5, Art. no. 5, May 2020, doi: 10.1038/s42256-0200174-5. [68] V. Chenthamarakshan et al., “CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models,” Jun. 23, 2020, arXiv: arXiv:2004.01215. doi: 10.48550/arXiv.2004.01215. [69] R. Gupta, D. Srivastava, M. Sahu, S. Tiwari, R. K. Ambasta, and P. Kumar, “Artificial Intelligence to Deep Learning: Machine Intelligence Approach for Drug Discovery,” Mol Divers, vol. 25, no. 3, pp. 1315–1360, Aug. 2021, doi: 10.1007/s11030-021-10217-3. [70] E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, and G. M. Church, “Unified Rational Protein Engineering with Sequence-Based Deep Representation Learning,” Nat Methods, vol. 16, no. 12, Art. no. 12, Dec. 2019, doi: 10.1038/s41592-019-0598-1. [71] M. Heinzinger et al., “Modeling Aspects of the Language of Life Through TransferLearning Protein Sequences,” BMC Bioinformatics, vol. 20, no. 1, p. 723, Dec. 
2019, doi: 10.1186/s12859-019-3220-8. [72] R. J. Williams and D. Zipser, “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,” Neural Comput., vol. 1, no. 2, pp. 270–280, Jun. 1989, doi: 10.1162/neco.1989.1.2.270. 100 [73] M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov. 1997, doi: 10.1109/78.650093. [74] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” Mar. 09, 2015, arXiv: arXiv:1503.02531. doi: 10.48550/arXiv.1503.02531. [75] Y. Wu et al., “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” Oct. 08, 2016, arXiv: arXiv:1609.08144. doi: 10.48550/arXiv.1609.08144. [76] L. David, A. Thakkar, R. Mercado, and O. Engkvist, “Molecular Representations in AiDriven Drug Discovery: A Review and Practical Guide,” Journal of Cheminformatics, vol. 12, no. 1, p. 56, Sep. 2020, doi: 10.1186/s13321-020-00460-5. [77] G. B. Goh, C. Siegel, A. Vishnu, N. O. Hodas, and N. Baker, “Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the Performance of Expertdeveloped QSAR/QSPR Models,” Jun. 20, 2017, arXiv: arXiv:1706.06689. doi: 10.48550/arXiv.1706.06689. [78] “RDKit.” Accessed: Oct. 17, 2022. [Online]. Available: http://www.rdkit.org/ [79] E. J. Bjerrum, “SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules,” May 17, 2017, arXiv: arXiv:1703.07076. doi: 10.48550/arXiv.1703.07076. [80] D. Polykovskiy et al., “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models,” Frontiers in Pharmacology, vol. 11, 2020, Accessed: Oct. 08, 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fphar.2020.565644 [81] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo Molecular Design,” J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096– 1108, Mar. 2019, doi: 10.1021/acs.jcim.8b00839. [82] M. Krenn, F. Häse, A. Nigam, P. Friederich, and A. Aspuru-Guzik, “Self-Referencing Embedded Strings (SELFIES): A 100% Robust Molecular String Representation,” Mach. Learn.: Sci. Technol., vol. 1, no. 4, p. 045024, Oct. 2020, doi: 10.1088/2632-2153/aba947. [83] D. S. Wigh, J. M. Goodman, and A. A. Lapkin, “A Review of Molecular Representation in the Age of Machine Learning,” WIREs Computational Molecular Science, vol. 12, no. 5, p. e1603, 2022, doi: 10.1002/wcms.1603. [84] D. Nelson and M. Cox, “Lehninger Principles of Biochemistry,” in Lehninger Principles of Biochemistry, 6th ed., New York, NY: W.H. Freeman, 2017. [85] J. Jumper et al., “Highly Accurate Protein Structure Prediction with Alphafold,” Nature, vol. 596, no. 7873, Art. no. 7873, Aug. 2021, doi: 10.1038/s41586-021-03819-2. 101 [86] A. Elnaggar et al., “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing,” May 04, 2021, arXiv: arXiv:2007.06225. doi: 10.48550/arXiv.2007.06225. [87] A. Rives et al., “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences,” Proceedings of the National Academy of Sciences, vol. 118, no. 15, p. e2016239118, Apr. 2021, doi: 10.1073/pnas.2016239118. [88] Z. Lin et al., “Language Models of Protein Sequences at the Scale of Evolution Enable Accurate Structure Prediction,” Jul. 21, 2022, bioRxiv. doi: 10.1101/2022.07.20.500902. [89] S. Haroon, H. C.a., and J. 
A.s., “Generative Pre-Trained Transformer (gpt) Based Model with Relative Attention for De Novo Drug Design,” Computational Biology and Chemistry, vol. 106, p. 107911, Oct. 2023, doi: 10.1016/j.compbiolchem.2023.107911. [90] P. Eckmann, K. Sun, B. Zhao, M. Feng, M. K. Gilson, and R. Yu, “LIMO: Latent Inceptionism for Targeted Molecule Generation,” Jun. 17, 2022, arXiv: arXiv:2206.09010. doi: 10.48550/arXiv.2206.09010. [91] K. Kaitoh and Y. Yamanishi, “Scaffold-Retained Structure Generator to Exhaustively Create Molecules in an Arbitrary Chemical Space,” J. Chem. Inf. Model., vol. 62, no. 9, pp. 2212–2225, May 2022, doi: 10.1021/acs.jcim.1c01130. [92] A. Nigam, R. Pollice, and A. Aspuru-Guzik, “Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design,” Digital Discovery, vol. 1, no. 4, pp. 390–404, Aug. 2022, doi: 10.1039/D2DD00003B. [93] A. Nigam, R. Pollice, M. Krenn, G. dos P. Gomes, and A. Aspuru-Guzik, “Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (stoned) Algorithm for Molecules Using Selfies,” Chem. Sci., vol. 12, no. 20, pp. 7079–7090, May 2021, doi: 10.1039/D1SC00231G. [94] S. Park and H. Lee, “A Molecular Generative Model with Genetic Algorithm and Tree Search for Cancer Samples,” Dec. 16, 2021, arXiv: arXiv:2112.08959. doi: 10.48550/arXiv.2112.08959. [95] S. Bhat et al., “De Novo Generation and Prioritization of Target-Binding Peptide Motifs from Sequence Alone,” Jun. 28, 2023, bioRxiv. doi: 10.1101/2023.06.26.546591. [96] H. Wu et al., “AttentionMGT-DTA: A Multi-Modal Drug-Target Affinity Prediction Using Graph Transformer and Attention Mechanism,” Neural Networks, vol. 169, pp. 623–636, Jan. 2024, doi: 10.1016/j.neunet.2023.11.018. [97] F. Wu, L. Wu, D. Radev, J. Xu, and S. Z. Li, “Integration of Pre-Trained Protein Language Models into Geometric Deep Learning Networks,” Commun Biol, vol. 6, no. 1, pp. 1–8, Aug. 2023, doi: 10.1038/s42003-023-05133-1. 102 [98] Y. Kalakoti, S. Gawande, and D. Sundar, “Estimating Protein-Ligand Interactions with Geometric Deep Learning and Mixture Density Models,” Dec. 25, 2023, bioRxiv. doi: 10.1101/2023.10.03.560738. [99] Z. Du, X. Ding, W. Hsu, A. Munir, Y. Xu, and Y. Li, “pLM4ACE: A Protein Language Model Based Predictor for Antihypertensive Peptide Screening,” Food Chem, vol. 431, p. 137162, Jan. 2024, doi: 10.1016/j.foodchem.2023.137162. [100] T. Developers, TensorFlow. (Jul. 11, 2024). Zenodo. doi: 10.5281/zenodo.12726004. [101] “Neural Machine Translation with a Transformer and Keras | Text,” TensorFlow. Accessed: Sep. 14, 2024. [Online]. Available: https://www.tensorflow.org/text/tutorials/transformer [102] A. Lo, R. Pollice, A. Nigam, A. D. White, M. Krenn, and A. Aspuru-Guzik, “Recent advances in the self-referencing embedded strings (SELFIES) library,” Digital Discovery, vol. 2, no. 4, pp. 897–908, Aug. 2023, doi: 10.1039/D3DD00044C. [103] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” J. Mach. Learn. Res., vol. 12, no. null, pp. 2121–2159, Jul. 2011. [104] N. Song, R. Dong, Y. Pu, E. Wang, J. Xu, and F. Guo, “PMF-CPI: Assessing Drug Selectivity with a Pretrained Multi-Functional Model for Compound–Protein Interactions,” Journal of Cheminformatics, vol. 15, no. 1, p. 97, Oct. 2023, doi: 10.1186/s13321-023-00767-z. [105] A. Vaswani et al., “Tensor2Tensor for Neural Machine Translation,” Mar. 16, 2018, arXiv: arXiv:1803.07416. doi: 10.48550/arXiv.1803.07416. [106] C. A. Lipinski, F. 
Lombardo, B. W. Dominy, and P. J. Feeney, “Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings,” Advanced Drug Delivery Reviews, vol. 46, no. 1, pp. 3–26, Mar. 2001, doi: 10.1016/S0169-409X(00)00129-0. [107] D. F. Veber, S. R. Johnson, H.-Y. Cheng, B. R. Smith, K. W. Ward, and K. D. Kopple, “Molecular Properties That Influence the Oral Bioavailability of Drug Candidates,” J. Med. Chem., vol. 45, no. 12, pp. 2615–2623, Jun. 2002, doi: 10.1021/jm020017n. [108] S. A. Hitchcock and L. D. Pennington, “Structure−Brain Exposure Relationships,” J. Med. Chem., vol. 49, no. 26, pp. 7559–7583, Dec. 2006, doi: 10.1021/jm060642i. [109] G. R. Bickerton, G. V. Paolini, J. Besnard, S. Muresan, and A. L. Hopkins, “Quantifying the Chemical Beauty of Drugs,” Nature Chem, vol. 4, no. 2, Art. no. 2, Feb. 2012, doi: 10.1038/nchem.1243. [110] P. Ertl, S. Roggo, and A. Schuffenhauer, “Natural Product-likeness Score and Its Application for Prioritization of Compound Libraries,” J. Chem. Inf. Model., vol. 48, no. 1, pp. 68–74, Jan. 2008, doi: 10.1021/ci700286x. 103 [111] P. Ertl and A. Schuffenhauer, “Estimation of Synthetic Accessibility Score of Drug-Like Molecules Based on Molecular Complexity and Fragment Contributions,” Journal of Cheminformatics, vol. 1, no. 1, p. 8, Jun. 2009, doi: 10.1186/1758-2946-1-8. [112] T. Liu, Y. Lin, X. Wen, R. N. Jorissen, and M. K. Gilson, “Bindingdb: A Web-Accessible Database of Experimentally Determined Protein–Ligand Binding Affinities,” Nucleic Acids Res, vol. 35, no. Database issue, pp. D198–D201, Jan. 2007, doi: 10.1093/nar/gkl999. [113] M. A. Skinnider, “Invalid SMILES Are Beneficial Rather Than Detrimental to Chemical Language Models,” Nat Mach Intell, vol. 6, no. 4, pp. 437–448, Apr. 2024, doi: 10.1038/s42256-024-00821-x. [114] Y. Huang et al., “A Robust Drug–Target Interaction Prediction Framework with Capsule Network and Transfer Learning,” Int J Mol Sci, vol. 24, no. 18, p. 14061, Sep. 2023, doi: 10.3390/ijms241814061. 104 Supplementary Figures Supplementary Figure 1. Mean Tanimoto Similarity Distributions of Generated and Shuffled sets. 105 Supplementary Figure 2. Max Tanimoto Similarity Distributions of Generated and Shuffled sets. 106 Supplementary Figure 3. Classifier Score Distributions of Generated and Shuffled sets. 107 Supplementary Figure 4. ROC Curve for Generated Sets vs. Shuffled Test Set across four model groups (Base, Base+S, Base+E, Base+S+E) and three generation sets (1-1, 1-10, 1-100), with reference to Random Guess and Test Set vs. Shuffled Test Set; AUC Scores are included in the legend. 108