COMBINING ACTIVE LEARNING AND DATA AUGMENTATION TO REDUCE LABELLED TRAINING DATA FOR SENTIMENT ANALYSIS

by

Colton Aarts

B.Sc., University of Northern British Columbia, 2019

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

UNIVERSITY OF NORTHERN BRITISH COLUMBIA

April 2025

© Colton Aarts, 2025

Abstract

Creating a sentiment analysis classifier requires a large amount of labelled training data. Labelling this data is an expensive and time-consuming process. Because of this, reducing the amount of labelled data required leads to classifiers that are cheaper to train and more accessible to all disciplines. Many different methods can be used to reduce the amount of labelled data. For this research, we focused on combining active learning and lexical expansion techniques. By combining these two techniques, this research examined an underexplored area of study. Active learning focuses on letting the classifier select the data to learn from, while lexical expansion creates more data for the classifier. While there are a large number of different techniques in both fields, little work has been done to combine them. We felt this was a natural progression for these techniques as they complement each other well. The active learning technique selects the data to be labelled, and the lexical expansion technique generates high-quality artificial data from this hand-selected information. In addition to combining these techniques, we examined how different neural network structures interact with our new technique. Our research found that the combination of active learning and lexical expansion improved the performance of our classifiers when very small amounts of data were available. We found a significant difference between the performance of our two classifiers. While there was an improvement at low levels of training data, at higher levels we found that the combined technique did not offer any improvement over active learning alone. Overall, we found potential benefits to combining the two techniques, and further research is required to understand how best to leverage these improvements.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements

1 Introduction and Motivation
  1.1 Overview
  1.2 Motivation
  1.3 Objectives and Contributions
  1.4 Thesis Layout

2 Background
  2.1 Neural Networks
    2.1.1 Convolutional Neural Networks
    2.1.2 Recurrent Neural Networks
    2.1.3 Encoder-Decoder
    2.1.4 Transfer Learning
    2.1.5 Sequential Task Learning
      2.1.5.1 Pretraining
      2.1.5.2 Adoption
    2.1.6 Attention
      2.1.6.1 Inter-attention
      2.1.6.2 Intra-attention
      2.1.6.3 Multi-Headed Attention
    2.1.7 Transformers
  2.2 Pre-processing
  2.3 Language Models
  2.4 Sentiment Analysis
  2.5 Improvements
    2.5.1 Active Learning
    2.5.2 Data Augmentation

3 Related Research
  3.1 Data set
  3.2 Literature Review
    3.2.1 Neural Networks
    3.2.2 Stacked 1D CNN
    3.2.3 CoLSTM
      3.2.3.1 Behera et al.
      3.2.3.2 Vo et al.
    3.2.4 BERT
      3.2.4.1 Original BERT
      3.2.4.2 ALBERT
    3.2.5 Active Learning
      3.2.5.1 Exploration-Exploitation
    3.2.6 Lexical Expansion and Data Augmentation
      3.2.6.1 PLSDA

4 Algorithm
  4.1 Preprocessing
    4.1.1 Over Length Sequences
    4.1.2 Tokenization
    4.1.3 Intermediate Information
    4.1.4 Build Vocabulary
    4.1.5 Convert Words to Numbers
    4.1.6 Padding
    4.1.7 Complete Preprocessing
  4.2 AL + LE
    4.2.1 Combining AL and LE
      4.2.1.1 AL
      4.2.1.2 LE
      4.2.1.3 Combining AL and LE
  4.3 Contributions

5 Experiment Set Up
  5.1 Models
    5.1.1 CNN
    5.1.2 CoLSTM
    5.1.3 ALBERT
  5.2 Algorithm Selection

6 Evaluation and Analysis
  6.1 Evaluation Criteria
  6.2 Evaluation
    6.2.1 CNN
    6.2.2 LSTM
    6.2.3 BERT
  6.3 Comparison and Analysis
    6.3.1 Compare Networks
    6.3.2 Compare Algorithms
  6.4 Overall Comparisons

7 Conclusion
  7.1 Future Work

Bibliography

LIST OF TABLES

3.1 CNN vs. RNN
3.2 Stacked 1D CNN
5.1 Our CNN
5.2 Our CoLSTM
5.3 ALBERT
6.1 Average F1 Score CNN
6.2 Max F1 Score CNN
6.3 Standard Deviation CNN
6.4 P-Values CNN
6.5 Average F1 Score LSTM
6.6 Max F1 Score LSTM
6.7 Standard Deviation LSTM
6.8 P-Values LSTM
6.9 BERT F1 Score
6.10 Active Average F1 Score Comparison
6.11 Active Max F1 Score Comparison
6.12 PLSDA Average F1 Score Comparison
6.13 PLSDA Max F1 Score Comparison
6.14 APLSDA Average F1 Score Comparison
6.15 APLSDA Max F1 Score Comparison
6.16 P-Values Between LSTM and CNN

LIST OF FIGURES

2.1 A Single Neuron
2.2 A Simple Network
2.3 RNN Neuron
2.4 LSTM Neuron [1]
3.1 Exploitation Code
3.2 Exploration Code
3.3 Substitution Candidate Selection
3.4 Instance Generation
4.1 Create SVD Code
4.2 Create Vocab Code
4.3 Add Sequence Code
4.4 Preprocessing Code
4.5 Generated Data
4.6 Training Data
4.7 Decision Boundary
4.8 Active Learning Data
4.9 PLSDA Data
4.10 New Training Data
4.11 Updated Decision Boundary
6.1 Average F1 Score CNN
6.2 Change in Average F1 Score CNN
6.3 Max F1 Score
6.4 Change in Max F1 Score CNN
6.5 Change in Standard Deviation F1 Score
6.6 Average F1 Score LSTM
6.7 Change in Average F1 Score LSTM
6.8 Max F1 Score LSTM
6.9 Change in Max F1 Score LSTM
6.10 Change STD LSTM
6.11 BERT F1 Score

LIST OF ACRONYMS

AL      Active Learning
BERT    Bidirectional Encoder Representations from Transformers
CNN     Convolutional Neural Network
CoLSTM  Convolutional Long Short Term Memory
LSTM    Long Short Term Memory
NLP     Natural Language Processing
PLSDA   Part-of-Speech Focused Lexical Substitution for Data Augmentation
POS     Part-of-Speech

Acknowledgements

I would first like to express my gratitude to my supervisor, Dr. Fan Jiang (Terry). Without his continued support and advice, I would never have started, let alone completed, my Master's. He has constantly provided me with guidance as well as opportunities for success. From research papers to industry grants, Terry has been instrumental in my success as a Master's student at UNBC. I would next like to thank my committee, Dr. Chen and Dr. Monu, who have supported me as much as Terry has. They have provided advice and guidance for my research and work at the university. Without their input, I would not have been able to complete this research. Next, I would like to thank the fantastic Computer Science department here at UNBC. From professors like Dr. Haque, who has been a constant source of support and motivation, to our outstanding Admin Assistants, who helped me navigate the mysteries of required paperwork, everyone in the department has helped me during my time at UNBC. Finally, I would like to thank my friends and family. Without my friend group's unconditional love and support, I would never have been able to complete my research. Alongside this, my mom's and my sister's motivation, drive, and love were instrumental in my ability to complete my Master's.

Chapter 1

Introduction and Motivation

1.1 Overview

Sentiment Analysis (SA) is a field of Natural Language Processing (NLP) concerned with detecting positive, neutral, and negative text. Currently, the best-performing models use various deep-learning techniques and neural network structures. While these models perform well, they share the drawback of requiring a large amount of labelled data to achieve the desired performance. In addition, many SA models are domain-dependent, adding the constraint that a model that performs well on one domain may not perform well on a different one. Acquiring large domain-specific datasets can be expensive, as domain experts must provide the labels.
Utilizing methods that reduce the required labelled data helps combat these costs. Active Learning (AL) and Data Augmentation (DA) are popular methods to reduce the total data needed.

1.2 Motivation

A trend that has appeared in many different data mining disciplines is using more and more data in training, and training for longer, to achieve better performance. While this approach has achieved the desired results, it is not always feasible for smaller research groups or many industrial partners. For example, OpenAI used over 40 GB of data for ChatGPT and built a custom supercomputer to train it, the BERT pre-trained word embeddings are trained on 13 GB of text, AlphaGo was trained on over 1000 CPUs and over 100 GPUs, and for text mining, there has been research showing that having over 200,000 labelled data points can improve performance [2]. While continually pushing the bleeding edge of performance is important, it is also important to remember that producing tools other domains can use is essential to research. With this in mind, finding ways to reduce the amount of data or computational resources required to achieve a model that still performs at a high level is essential. When focusing on sentiment analysis, acquiring high-quality labels can be expensive [3]. The cost of labelling is compounded by the fact that current sentiment analysis techniques remain domain-dependent, and creating high-quality labels requires experts in that domain. While there are methods that can acquire labels more cheaply, these run the risk of creating a noisy dataset, which can affect the model's performance [4].

1.3 Objectives and Contributions

While there has been extensive research into different active learning and lexical expansion techniques, combining them is under-researched. In addition, the existing research tends to focus on a single model and demonstrate how the individual techniques improve the performance of that specific model. For our research, we proposed four research questions:

1. Does combining AL and lexical expansion (LE) offer better performance?
2. Is it possible to use this technique to create a classifier whose performance improves faster?
3. Does the proposed algorithm behave differently with different classifier architectures?
4. Is there a large variance between the average and best performance of the classifiers?

The main goals of our research are Questions 1 and 2. We expect that combining the two techniques will allow us to perform better with less training data. To test this, we will examine our classifiers' performance when they are trained with various amounts of training data. We propose Questions 3 and 4 to better understand how our algorithm interacts with different classifier architectures and to investigate whether our proposed algorithm produces reliable results. We expect that our proposed algorithm will increase the performance of any classifier it uses. While we expect the overall performance of the classifier to improve, we predict that the variance of our proposed APLSDA algorithm will be higher than that of the base classifier, as we introduce noise through the PLSDA technique.

1.4 Thesis Layout

The thesis is organized as follows. We introduce the background information in Chapter 2. In Chapter 3, we introduce the different AL and LE algorithms we are using and the neural network architectures we are using to create our classifiers. Our combined algorithm is presented in Chapter 4, and our experiment setup is found in Chapter 5.
Finally, we evaluate the performance in Chapter 6, and the final Chapter 7 contains our conclusions and future work.

Chapter 2

Background

This chapter reviews concepts related to our research. We will start by providing background knowledge on neural networks (NNs). We will then introduce the different types of Language Models (LMs) used in Sentiment Analysis. We will then explore how SA combines the different LM and NN structures. We will conclude this chapter by examining techniques used to improve the performance of SA models.

2.1 Neural Networks

The idea of neural networks was introduced in the mid-1940s by McCulloch and Pitts in [5]. In this work, the authors proposed a computational model to represent how the brain learns. Their work laid the foundation for creating NNs. The smallest component of a NN is the neuron, or perceptron. A perceptron works by summing its inputs i_1, ..., i_n multiplied by their associated weights w_1, ..., w_n. The output of the perceptron results from applying an activation function f to this sum:

o = f(∑_{x=1}^{n} i_x w_x)

The basic structure of a single perceptron can be seen in Figure 2.1. Many different activation functions are popular in research, including softmax and tanh. The weights are first initialized to random values. These values are updated during training to minimize a given error function.

Figure 2.1: A Single Neuron

One of the main drawbacks of a single perceptron is that it can only learn linear relationships between its inputs. However, arranging multiple perceptrons into different layers can create a network that learns non-linear relationships. The weights of each perceptron are learned through backpropagation. This is the basic structure of neural networks today, and a simple example can be seen in Figure 2.2. This network would be described as a fully connected network whose first two layers have three perceptrons each, while the final layer has a single perceptron. Fully connected networks are simply networks where the output from all perceptrons in the previous layer is provided as input to each perceptron in the following layer. Combining layers that contain different numbers of perceptrons allows NNs to learn a vast variety of relationships. While NNs with structures similar to Figure 2.2 are useful, different types of networks have been developed to help solve complex problems. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are particularly interesting to our research and will be examined in more detail in this section.

Figure 2.2: A Simple Network

2.1.1 Convolutional Neural Networks

CNNs are neural networks that contain at least one convolutional layer. These layers were initially developed to be used on image data but have been adapted to text processing. CNNs were inspired by the brain's visual cortex and how specific regions focus on sections of the visual field. CNNs also help reduce the number of neurons needed in the fully connected layers, which reduces the computation time needed to train the network. CNNs use convolutions to detect different features in the dataset. A convolution is an integral whose result is the overlap between two functions. CNNs use this idea by applying several filters to the input. The filters are all the same size and are moved across the data. Each of these filters will detect a different feature. Stacking multiple convolutional layers on top of one another is also popular.
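As a rough illustration, a small stack of one-dimensional convolutional layers for text (the one-dimensional variant discussed at the end of this subsection) could be written in TensorFlow/Keras as follows. Every size in this sketch is a placeholder rather than a value used later in this thesis.

```python
# Illustrative sketch only: a small stack of 1D convolutional layers for text.
# Vocabulary size, embedding dimension, and filter counts are placeholders.
import tensorflow as tf

model = tf.keras.Sequential([
    # Map each token id to a 300-dimensional vector.
    tf.keras.layers.Embedding(input_dim=15000, output_dim=300),
    # Each Conv1D layer slides its filters along the word dimension only.
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=3),
    tf.keras.layers.GlobalAveragePooling1D(),
    # Final fully connected layer produces a single sentiment score.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```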
As you add more convolutional layers, the complexity of the features that can be detected increases. While convolutions will reduce the data's dimensionality, using a pooling layer is also common. This layer looks at a section of the output from the convolution and applies a function to return a single number. Two popular functions are max pooling and average pooling. After applying the pooling layer, the final results are flattened and fed into a fully connected layer for classification. CNNs were adapted to be used on text data by modifying how the filter is moved across the data. Instead of moving the filter in two dimensions, the filter is only moved in one. This allows the CNN to consider the words that occur near each other and allows the network to learn different relationships in the text.

2.1.2 Recurrent Neural Networks

While CNNs offer an advantage over regular NNs by changing the structure of the layers and reducing the number of neurons needed, RNNs change the internal structure of the neurons, allowing them to remember the information they have previously seen. Adding memory to the neurons makes RNNs especially useful for time series data. This includes text, as the meaning of words is influenced by all the context that occurs before them. RNNs modify the internals of a neuron to include past information. These neurons work by taking the previous state h_{t-1} and combining it with the current input. The output from this is then used as the state for the next time step and the output for time step t. This can be seen in Equation 2.1.

h_t = f(W_x x_t + W_h h_{t-1} + b_h)    (2.1)

where f is some activation function, W_x and W_h are the weights associated with the inputs and the previous state, respectively, and b_h is the neuron's bias. This configuration allows for the passage of information between neurons in the same layer. An example RNN neuron can be seen in Figure 2.3. While RNNs provide advantages over a standard NN, they suffer from drawbacks, namely vanishing and exploding gradients. A solution to the vanishing gradient problem can be found in Long Short-Term Memory units (LSTMs).

Figure 2.3: RNN Neuron

LSTMs were introduced in 1997 by Hochreiter and Schmidhuber [6]. LSTMs modify regular RNNs to allow the neuron to remember information longer. This modification comes in the form of three gates and an additional state. These gates are the forget, output, and input gates. The state is a long-term state that accumulates and forgets information as time passes. The forget gate was added by Gers, Schmidhuber, and Cummins in 2000 [7], allowing the LSTM to remove information from the long-term state. The input gate allows information to be added to the long-term state, and finally, the output gate creates the next hidden state and the output for the current time step. An example LSTM neuron can be seen in Figure 2.4, where × is the Hadamard product, + is element-wise addition, and σ is the sigmoid function. We will start by examining the forget-gate, the out-gate, and the in-gate in Equations 2.2, 2.3, and 2.6.

Figure 2.4: LSTM Neuron [1]

for(t) = σ(W_for x_t + V_for h_{t-1} + b_for)    (2.2)

out(t) = σ(W_out x_t + V_out h_{t-1} + b_out)    (2.3)

Both the forget-gate and the out-gate are straightforward: they apply a sigmoid function to the sum of their input weights multiplied by the current input, W x_t, their state weights multiplied by the previous state, V h_{t-1}, and their bias b.
The in-gate is somewhat more complicated, as it has two parts that are combined by taking the Hadamard product. These are Equations 2.4 and 2.5.

in_a(t) = σ(W_{in_a} x_t + V_{in_a} h_{t-1} + b_{in_a})    (2.4)

in_b(t) = tanh(W_{in_b} x_t + V_{in_b} h_{t-1} + b_{in_b})    (2.5)

in(t) = in_a(t) × in_b(t)    (2.6)

We can now calculate the long-term memory state s(t) and the hidden state h(t).

s(t) = for(t) × s(t-1) + in(t)    (2.7)

h(t) = out(t) × tanh(s(t))    (2.8)

With the addition of three gates and a long-term state, LSTMs help the network perform better on longer series. The final improvement for recurrent networks is bi-directional networks. In traditional RNNs, the network can only look backward and cannot relate what it sees to what comes next. While this is suitable for some scenarios, to determine the context of text data it is necessary to consider the entire sentence, not simply what has come before. Bi-directional LSTMs (BiLSTMs) are one of the most popular bidirectional RNNs. To create a BiLSTM, you simply have an LSTM layer for each direction and combine the output of the neurons at each time step.

2.1.3 Encoder-Decoder

In natural language processing, several problems require a sequence of words to be transformed into a different sequence of words. Standard neural networks can struggle to accomplish this task [8]. Encoder-decoder neural networks were introduced to help with this task. In addition to helping with tasks such as machine translation, researchers have discovered that an encoder-decoder structure can be used to learn a general representation of a language. The BERT model that will be discussed in Chapter 3 is of particular interest to this research. First, we will explore encoder-decoder models here, and in Section 2.1.7, we will examine transformers. An encoder-decoder network can be broken into three important parts: the encoder, the encoded vector, and the decoder. The encoder is composed of an RNN cell that reads each input symbol sequentially. The goal of the encoder is to read a sequence of any length and to create the encoded vector, which has a specified length. This is accomplished by having the hidden state of the RNN be updated after each symbol. Cho et al. proposed a new method to update the RNN cell in the encoder, which can be seen in Equation 2.9. Their method improved the RNN's ability to forget while remaining simpler to compute than an LSTM cell [9]. This method uses a reset gate r_j and an update gate z_j, seen in Equations 2.11 and 2.12 respectively, where the subscript j denotes the j-th element of a vector. While the network is reading through the input symbols, we discard the outputs from the RNN. After reading the entire input sequence, the hidden state of the encoder RNN is the encoded vector.

h_j^t = z_j h_j^{t-1} + (1 - z_j) h̃_j^t    (2.9)

where h̃_j^t is:

h̃_j^t = tanh([W x]_j + [U(r ⊙ h_{t-1})]_j)    (2.10)

r_j = σ([W_r x]_j + [U_r h_{t-1}]_j)    (2.11)

z_j = σ([W_z x]_j + [U_z h_{t-1}]_j)    (2.12)

In these equations, W_r and U_r are learned matrices, and the logistic sigmoid function is represented by σ. The encoded vector is a vector that summarizes the input sequence. This part of the encoder-decoder model is worth highlighting, as it is the area that BERT focuses on. This vector aims to encapsulate the information contained in the input sequence, and it acts as the initial hidden state of the decoder. There has been additional research into providing the decoder with more information about the input sequence, which will be discussed in Section 2.1.6.
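A minimal NumPy sketch of the gated update in Equations 2.9 to 2.12 may help make the encoder step concrete. The matrices here are random stand-ins for learned parameters; in a real encoder they would be trained.

```python
# Minimal NumPy sketch of the gated recurrent update in Equations 2.9-2.12.
# W, U, W_r, U_r, W_z, U_z stand in for learned parameters.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

hidden, features = 8, 5
rng = np.random.default_rng(0)
W, U = rng.normal(size=(hidden, features)), rng.normal(size=(hidden, hidden))
W_r, U_r = rng.normal(size=(hidden, features)), rng.normal(size=(hidden, hidden))
W_z, U_z = rng.normal(size=(hidden, features)), rng.normal(size=(hidden, hidden))

def step(x, h_prev):
    r = sigmoid(W_r @ x + U_r @ h_prev)            # reset gate, Eq. 2.11
    z = sigmoid(W_z @ x + U_z @ h_prev)            # update gate, Eq. 2.12
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))    # candidate state, Eq. 2.10
    return z * h_prev + (1 - z) * h_tilde          # new hidden state, Eq. 2.9

h = np.zeros(hidden)
for x in rng.normal(size=(4, features)):   # read a four-symbol input sequence
    h = step(x, h)
# After the whole sequence has been read, h plays the role of the encoded vector.
```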
The decoder is another RNN that is trained to generate the next word in the target sequence. The hidden state of the decoder is initialized as the encoded vector. As the decoder proceeds from one output to the next, the hidden state evolves, as seen in Equation 2.13.

h_t = f(h_{t-1}, y_{t-1}, c)    (2.13)

where c is the encoded vector, and f is some activation function that provides valid probabilities.

2.1.4 Transfer Learning

Unlike traditional neural networks, encoder-decoders work by exploiting the idea of transfer learning. The main idea behind transfer learning is that knowledge learned from one task should transfer to a similar task. This parallels human learning, where knowledge from one domain can be transferred to another if they are similar enough. This is useful for tasks where there is a lack of labelled data for the target task, but there is a large amount of data in a related task. Specifically for NLP, there is a large amount of unlabeled text available. Still, when we want to focus on a specific task such as sentiment analysis, question answering, or named entity recognition, we can run into scenarios where there is a lack of labelled data available. Transfer learning can be broken down into two sub-groups: transductive and inductive transfer learning. Inductive transfer learning is used in NLP, where the target task differs from the source task. Inductive transfer learning can be further sub-grouped into multi-task and sequential task learning. Of these two, sequential task learning is the most common in NLP. Unlike multi-task learning, there is only one target task in sequential task learning. We will explore the specifics of how sequential task learning works in the following section.

2.1.5 Sequential Task Learning

Sequential task learning (STL) can be broken down into two main stages: pretraining with source data and adoption with the target data. Ideally, the source data will be similar to the target data. We will break these two stages down in more depth.

2.1.5.1 Pretraining

In the pretraining stage, the network is presented with data that is similar to the target data, but the task that the network is learning is different. In NLP, this generally looks like training the network on unlabeled data to learn a general representation of the target language. This can be done with a variety of different source training tasks, but they are generally some form of word prediction. A common pretraining task is next-word prediction. This involves training the network to create a probability distribution over all words, with the network predicting which words are the most likely to occur next in the sentence. Another task that has been used is next-sentence prediction, where, in addition to learning which words follow each other, the network is trained to identify which sentences follow each other. A drawback of this kind of learning is that you still require a large amount of unlabeled data, as seen in [When Do You Need Billions of Words of Pretraining Data]. For classification tasks, you need upwards of one billion words for the pre-trained model.

2.1.5.2 Adoption

After pretraining the model, the next step is to decide how to transfer the information to the target task. There are two approaches to facilitate this transfer: fine-tuning or embedding/feature extraction. The main distinction between these two is whether the weights of the pre-trained model are adjusted for the new task (fine-tuning) or the weights are kept as is and a new model is trained (embedding/feature extraction).
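In Keras-style code the distinction can be sketched roughly as follows. The tiny stand-in base model and all sizes here are hypothetical; in practice the base would be a real pre-trained language model such as BERT or ALBERT.

```python
import tensorflow as tf

def make_pretrained_base():
    # Stand-in for a real pretrained language model; kept tiny so the sketch
    # stays self-contained. Its weights represent "pretrained" knowledge.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(1000, 32),
        tf.keras.layers.GlobalAveragePooling1D(),
    ])

# Embedding / feature-extraction approach: freeze the pretrained weights and
# train only the new classification head on the target task.
frozen_base = make_pretrained_base()
frozen_base.trainable = False
feature_extractor = tf.keras.Sequential(
    [frozen_base, tf.keras.layers.Dense(1, activation="sigmoid")])

# Fine-tuning approach: keep the pretrained weights trainable so the whole
# network is adjusted, typically with a small learning rate.
tuned_base = make_pretrained_base()
fine_tuned = tf.keras.Sequential(
    [tuned_base, tf.keras.layers.Dense(1, activation="sigmoid")])
fine_tuned.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                   loss="binary_crossentropy")
```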
Fine-tuning is the more flexible of the two approaches, as it does not require any specific changes to the pre-trained model's architecture. It can be accomplished by selecting specific outputs representing the target task or adding a final layer to compute the needed task [10]. One of the main drawbacks of fine-tuning is that the relationships between the tokens in the pre-trained network can be lost as the model learns the specifics of the target task. This is called "catastrophic forgetting," and there are some proposed solutions, including freezing parts of the network, adjusting learning rates, and regularization. In contrast to fine-tuning, the embedding approach freezes the weights of the pre-trained model. This allows a fixed-size representation of the tokens to be extracted. For NLP, you can create word embeddings of a fixed size representing the word and its context with respect to all other words in the pre-trained model's data set. You can build a new network on top of this embedding to take this contextual information and use it to complete the target task. A drawback of the embedding approach is that if the usages and meanings of words change between the pretraining and adoption stages, there is no way to update the embeddings.

2.1.6 Attention

In natural language processing, attention is the network's ability to know which words relate to each other. The relationship between words can be demonstrated best with an example. Consider the sentence "The chicken did not cross the road because it was tired." Determining what the word "it" refers to requires the ability to recall previous information. Specifically, the network must know that the chicken is the associated noun, not the road. Basic attention accomplishes this by having different components share information. As the encoder-decoder section mentions, a neural network without attention will create a context vector c to predict outputs. One of the downsides to this vector is that it lacks information on how the different words relate to each other. Attention mechanisms work to add this information back to the vector. There are two general types of attention: inter-attention (also called cross-attention) and intra-attention (self-attention).

2.1.6.1 Inter-attention

Inter-attention networks are generally encoder-decoder models that are frequently used in machine translation. The context vector is created so that the decoder understands which inputs could influence the correct output. Hence, the name inter-attention refers to the interdependencies between the encoder and decoder networks. The context vector is changed by adding a vector of hidden states. The decoder uses these context vectors to help create its hidden state. This gives the decoder a unique context vector for each time step. The context vector contains the dependencies between the current word and all the other words in the input. The context vector adds an alignment score to the summation of the hidden states, as seen in Equation 2.14.

c_t = ∑_{i=0}^{T_x} α_{t,i} h_i    (2.14)

The hidden state h_i concatenates the bidirectional states for each element i, calculated as:

h_i = [→h_i ; ←h_i]    (2.15)

The alignment score α_{t,i} represents how relevant the input is to the current time step of the decoder. It is calculated by taking the softmax of a score function, as seen in Equation 2.16. This score function can be calculated in a variety of different ways, including concat, general, location-based, and dot-product, seen in Equations 2.17, 2.18, 2.19, and 2.20 respectively.
α_{t,i} = softmax(score(s_t, h_i))    (2.16)

score(s_t, h_i) = v_α^T tanh(W_α [s_t ; h_i])    (2.17)

score(s_t, h_i) = s_t^T W_α h_i    (2.18)

α_{t,i} = softmax(W_α s_t)    (2.19)

score(s_t, h_i) = s_t^T h_i    (2.20)

Here W_α is a learned matrix and v_α is a learned vector. This process allows the attention mechanism to determine how each word relates to all other words in the sentence. With the context vector calculated, we can now determine the current state of the decoder. This is accomplished by applying the decoding function to the previous state, the previous output, and the context vector. This function may consist of multiple neural layers, but it will ultimately provide the state seen in Equation 2.21. This state will be used to calculate the correct output by being fed into an additional neural network.

s_t = f(s_{t-1}, y_{t-1}, c_t)    (2.21)

2.1.6.2 Intra-attention

Intra-attention or self-attention networks do not generally have an encoder-decoder structure. These networks can be used for a variety of tasks that are not suitable for an encoder-decoder structure, including sentiment analysis. Self-attention is calculated using three sets of vectors: query, key, and value. These vectors are obtained by multiplying the word vector with three different weight matrices, as seen in Equation 2.22.

[Q ; K ; V] = H [W_Q ; W_K ; W_V]    (2.22)

where H is the hidden states of the output layer. The values of the W_Q, W_K, and W_V matrices are learned during network training. The final output is calculated by multiplying these matrices together, as seen in Equation 2.23.

Attention(Q, K, V) = softmax(QK^T / √d) V    (2.23)

where d is the dimensionality of the layers and √d is the scaling factor. Self-attention has been shown to help improve various NLP tasks, including machine translation and linguistic probing.

2.1.6.3 Multi-Headed Attention

Normal attention will learn the relationships between the words in the sequence using the entire word embedding. While this is effective, it can lose some of the nuance encoded into the embedding. Different parts of the embedding may be capturing different aspects of the word. Multi-headed attention looks to solve this problem by splitting the embedding into separate sections, with a separate attention head attending to each of these sections. In [11], the authors propose using eight attention heads and splitting the embedding dimension evenly between these heads. This changes the attention calculation into eight different attention calculations, as seen in Equation 2.24, which are concatenated together at the end, as seen in Equation 2.25.

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2.24)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^A    (2.25)

The Attention function is similar to the one described in Equation 2.23. The only change is that √d becomes √d_k, because the attention heads no longer examine the entire embedding space; they now examine sections of size d_k. This d_k is calculated by dividing the model's dimension by the number of heads.

2.1.7 Transformers

As language models have progressed, the computing constraints of using various RNN cell structures have become problematic. The desire for a more efficient neural network model has increased as the sequence length and the number of sequences required for training have increased. This is where the Transformer was introduced in [11]. The transformer does not use any RNN cells or any convolutions.
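Before describing the architecture, a short NumPy sketch of the attention computation it is built around (Equations 2.23 to 2.25) may be helpful. All matrices here are random stand-ins for learned parameters, and the eight-head split follows the configuration described above.

```python
# NumPy sketch of the attention computation in Equations 2.23-2.25.
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V          # Equation 2.23

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 64, 8
d_k = d_model // n_heads                                # per-head width
H = rng.normal(size=(seq_len, d_model))                 # hidden states

heads = []
for _ in range(n_heads):
    # Per-head projections of the hidden states into query, key, and value.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(H @ W_Q, H @ W_K, H @ W_V))  # Equation 2.24

W_A = rng.normal(size=(n_heads * d_k, d_model))
multi_head_out = np.concatenate(heads, axis=-1) @ W_A   # Equation 2.25
```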
Transformers are encoder-decoder networks that are built from stacked, fully connected layers. The encoder is built from six layers. These layers have two components. The first is a multi-headed self-attention mechanism. The second is a position-wise fully connected network. A position-wise neural network uses the same layers to transform all the words from the input sequence [12]. This can be seen in equcation 2.26, and can be explained as two sets of matrix multiplication with a ReLU activation between them. In addition, each sub-layer has a residual connection that allows the input from each layer to bypass the network and be included in the output. This makes the output of each sub-layer: PWN(x) = max(0, xW1 + b1 )W2 + b2 (2.26) SublayerOut = SubLayerNorm(x + Sublayer(x)) (2.27) The decoder is composed of six identical layers. These layers are composed of three sub-layers. The first and last sub-layers are the same as the encoder. The second sub-layer is composed of an additional multi-headed attention layer. The attention layer takes its Q and K values from the encoder, while its V value comes from the first sub-layer. In addition, the attention mechanisms in the decoder are modified to ensure that they only attend to words in the sequence that have already been seen. This prevents the decoder from looking into the future and ensures it only uses information that should be available. Since transformers do not use RNNs or any convolutions, they need to add additional information about the position of the words in the sequence. This is 18 accomplished using the sine and cosine functions: PE(pos,2i) = sin(pos/100002i/dmodel ) (2.28) PE(pos,2i+1) = cos(pos/100002i/dmodel ) (2.29) where pos is the word’s position in the sequence, and i is the dimension. These positional encodings will have the same dimension as the model, allowing them to be summed into word encoding before being fed into the encoder or decoder. With this setup, the authors found that the proposed transformer was nearly identical or better than all previous models. In addition to being competitive, their model requires significantly less training time than previous models. 2.2 Pre-processing Before the computer can start to create a language model, it is important to preprocess the text. This entails various activities, including but not limited to removing unwanted symbols, stemming or lemmatization, and removing stopwords. A typical text analysis application will use some pre-processing techniques to ensure that the text being used is suitable for their chosen model. Removing unwanted symbols is a common practice when the text being analyzed is collected from an environment where it is common for unusual or unique characters (or combinations of characters) to be included in the text. For example, if the text corpus is collected from Twitter, it is common to remove the @ sign as it is used to identify a specific account. When the data was collected using the Twitter API, it is typical for the character ”RT” to be appended to the beginning of the text in certain circumstances. Removing these characters helps the model avoid 19 learning relationships that are overly specific to where the data was collected. Stemming and lemmatization are techniques used to convert words with similar meanings into a single word. Stemming is the process of removing characters from a word to reduce it to a base form known as a lemma. This process is generally accomplished by following rules that can result in misspellings. 
For example, the words funny and funnier can be stemmed into the word funni. Lemmatization is similar to stemming in that the end goal is to convert words to a simpler lemma. However, lemmatization considers the context of the word and can replace the entire word with a lemma if one is suitable. An example of lemmatization would be the nouns leafs and leaves both converting to the word leaf. By utilizing stemming and lemmatization techniques, it is possible to drastically reduce the number of words the language model needs to learn, thus helping both performance and run time. Removing stopwords from the text is another process that can reduce the overall number of words that the model is exposed to and required to learn. Stopwords are words that have been identified in a language as common and not necessary for the content of the text to be learned. Some examples of stopwords for English are the, a, and as. Removing these words from the text reduces the number of words that are required to be learned and helps reduce the time required to run the model.

2.3 Language Models

Converting text into something a computer can understand is one of the most important parts of any text-processing application. Creating a systematic representation of a language is called creating a language model. Language models are a probabilistic representation of the language. This means that the model is trained to determine the probability of words in a given sentence. The simplest LM is an n-gram model. These LMs are trained to predict the next word based on the previous N-1 words. This can be accomplished by using Equation 2.30.

P(w_n | w_{n-N+1:n-1}) = C(w_{n-N+1:n-1} w_n) / C(w_{n-N+1:n-1})    (2.30)

where w_n | w_{n-N+1:n-1} denotes the word w_n preceded by the N-1 specific words w_{n-N+1:n-1}, and C(x) is a function that returns the frequency of a sequence of words. Several improvements have been proposed over the years; however, with the rise of neural networks, neural LMs have become more popular. Neural language models (NLMs) are a type of LM created by a neural network. These models create embeddings to represent the text. These embeddings allow the LM to encode additional information about the language. Similar words will be encoded closer together in the embedding space, allowing the model to understand the language better while requiring less training data. A popular approach with neural networks is to use pre-trained language models. These pre-trained LMs can provide significant advantages compared to having the network learn to represent the language alongside its prescribed task. Pre-trained LMs leverage transfer learning, the idea that knowledge learned on one task can be transferred to a similar task. This applies to language models by training the pre-trained model on a general language task, such as next-sentence prediction or word prediction, and then taking that model and fine-tuning it to more precise tasks. An additional advantage of pre-trained LMs is that they can be pre-trained on large unlabelled datasets. This is because the tasks they are initially learning can be trained using a self-supervised approach, eliminating the requirement of labelled data. With these advantages, several different pre-trained LMs have recently become popular [13, 14]. Of particular interest to our work is the BERT family of pre-trained LMs. We will introduce BERT here, and in Chapter 3, we will further explore the specific architectures we will use.
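For contrast with these neural LMs, the count-based estimate of Equation 2.30 can be computed directly from corpus counts. The following short sketch uses a made-up toy corpus and a bigram model (N = 2); it is an illustration only.

```python
# Minimal bigram (N = 2) illustration of the count-based estimate in
# Equation 2.30: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
from collections import Counter

corpus = "the movie was great the movie was boring the acting was great".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_probability(prev_word, word):
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_probability("movie", "was"))   # C("movie was") / C("movie") = 2/2
print(bigram_probability("was", "great"))   # C("was great") / C("was") = 2/3
```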
In their work, Devlin, Chang, Lee, and Toutanova introduce BERT [13], an LM trained on a corpus composed of the BooksCorpus [15] and Wikipedia that totalled over 3 billion words. The main advantage of BERT over previous pre-trained LMs is that it is a fine-tuned bidirectional model. Being a fine-tuned model means that when BERT is adapted to a specific task, the only change that needs to happen is for an additional layer to be added to the output of BERT. This layer is trained on the specific task instead of the entire model. Being a bidirectional model means that BERT has been trained to look both ahead in the sentence and behind when analyzing a particular word. BERT is pre-trained on two tasks: masked word prediction and next-sentence prediction. Training to predict masked words is fundamental to BERT's success. Without masking words, BERT would struggle to learn an accurate representation of the language. In a traditional unidirectional LM, when considering a word in a sentence, you can train it to predict what the next word most likely is. However, in a bidirectional model, the word has access to too much information about itself and about what words come next in the sentence. Because of this, the model would struggle to learn an actual representation rather than overfit the given data. The solution to this problem is to mask random words in the sentence and ask BERT to predict what each masked word should be. The authors tested various masking strategies that varied how a word is masked. The authors mask fifteen percent of the total words in each sentence. For each masked word, there is an eighty percent chance that the target token is replaced with the "MASK" token, a ten percent chance that it is replaced with a random token, and a ten percent chance that it is replaced with the original token. The authors found that these percentages resulted in the best overall performance of the model.

2.4 Sentiment Analysis

Determining whether a piece of text expresses positive or negative sentiment is useful in many fields, including security, finance, and medicine [16]. This process is called Sentiment Analysis (SA) and has been an area of research since 1940 [16]. There are two broad categories of sentiment analysis: binary (positive or negative) and ternary (positive, negative, neutral). The specific models may differ between the two categories, while the general strategies remain similar. Over time, the methods used have evolved into the sophisticated ones used today. Alongside the models' evolution, the domains in which they were being used evolved as well. The growth of Web 2.0 opened up new and exciting areas for applying sentiment analysis. The rise of social media, online blogs, and companies moving customer reviews online created a vast source of potential training data for SA. While this data is easily available, the amount of data gave rise to a new problem: acquiring high-quality labels is expensive. Coupling this with the fact that the best-performing models are neural networks that require a large amount of labelled data to train creates a problem for developers and researchers. A variety of different solutions have been proposed, including:

• Creating models more resilient to noise.
• Improving the quality of labels acquired from cheap sources.
• Creating artificial data from high-quality labels.
• Reducing the number of labels required for current models.

For our purposes, we will focus on the last two ideas.
2.5 Improvements

One of the main drawbacks when creating an SA model is the requirement for a large training set. While there are several publicly available datasets, SA is domain-dependent: if you are working on data that is not similar to any public dataset, you must create your own. Considering that it is expensive to obtain high-quality labels, ensuring that you are labelling the correct pieces of data and getting the most out of your labels is important. The areas of active learning and data augmentation work to solve these problems.

2.5.1 Active Learning

Active Learning (AL) is a machine learning approach in which the algorithm chooses the data from which it will learn [17]. The main idea behind these algorithms is that if you allow the classifier to choose which data points to label, you will achieve a higher level of performance while requiring fewer labels. To train a classifier using AL, we first need two sets of data: U, the unlabeled data, and L, the labelled data. We train our classifier C on L: train(C, L). After completing this training step, we need to query U for the data we want to add to L to improve the classifier's performance. This requires the creation of some query function f that returns a value that can be used to compare the data points in U. There are several different query strategies, including uncertainty sampling, query-by-committee, and expected model change. For our research, uncertainty sampling is the most important. The most common uncertainty sampling strategy is entropy:

E(x) = -∑_{i∈Y} p_i(x) log(p_i(x))    (2.31)

where Y is the set of all possible labels and p_i(x) is the probability that x belongs to label i. This allows us to create our function f as:

f(S) = max(E(x), x ∈ S)    (2.32)

where S is a data set for which the classifier has provided labels. With this function, our classifier can provide labels for U and pass U and the labels to f. This gives us the next data point for which to obtain the correct label. This process is repeated until the desired performance is reached. In traditional AL, new data is added one point at a time. However, this is often impractical: labelling potentially hundreds of data points one by one is expensive, and some classifiers will overfit if trained on datasets where there has been little change between one step and the next. This leads to the idea of batch learning, where the classifier selects multiple data points at each step to learn from. This modifies Equation 2.32 to return multiple points. While this can lead to some data points being labelled when they did not necessarily require it, the benefits of cost reduction and preventing overfitting outweigh this cost.

2.5.2 Data Augmentation

Data Augmentation (DA) expands the provided training data to increase performance without manually labelling more data. It is widely used in computer vision and has started to be used more in other areas, including text classification tasks like sentiment analysis. There are a variety of different types of DA; however, we will be focusing on word-level replacement methods. These methods exploit the fact that many sentiment analysis techniques work at the word level. This means that the classifier attempts to determine a representation of the individual words in the dataset and decide how the words relate to the label. Over the training steps, the classifier will encounter words and learn this representation. Problems arise
when the classifier encounters words in the testing stage that were not known in the training stage. When this happens, the classifier cannot use the new word and must discard it. This can lead to losing large amounts of information, especially when the training data set is small. Word-level data augmentation looks to solve this problem by introducing a wider variety of words in the training step to ensure that the classifier can recognize as many words as possible. Two main approaches have been proposed for word-level replacement in sentiment analysis. The first utilizes word embeddings, and the second depends on a thesaurus. There are a variety of different approaches that utilize word embeddings. These range from generating data to balance classes to using cosine comparison and k-nearest neighbours to select words for replacement. The other methods utilize thesauruses to replace selected words with their synonyms. The selection of the original words has been an area of active research. We have chosen to implement the PLSDA algorithm, which selects words belonging to specified parts of speech and ensures that all replacement words belong to the same part of speech.

Chapter 3

Related Research

In this chapter, we will discuss the research directly related to ours. We will examine the dataset on which we chose to train our NNs. Then, we will introduce the different NN structures that we are comparing, and finally, we will look at the AL and DA techniques we are combining.

3.1 Data set

The data set we trained on is the IMDB Movie Ratings Sentiment Analysis dataset, which can be obtained from [18]. This is a publicly available dataset that contains a large number of labelled movie reviews. There are 20,019 negative reviews and 19,981 positive reviews.

3.2 Literature Review

3.2.1 Neural Networks

There is a wide variety of neural networks used in natural language processing. The most common structures are CNNs, LSTMs, and pre-trained networks. We have implemented one of each of these networks. We will introduce the networks that we have based our implementations on here; we discuss the exact structures of our implementations in Chapter 5. We are comparing three different neural network structures. The first is a stacked one-dimensional CNN proposed by Dang, Moreno-Garcia, and De la Prieta in [19]; the second is a convolutional LSTM from the work of Behera et al. in [20]; the last is a BERT classifier based on the work of Devlin, Chang, Lee, and Toutanova in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" [13].

3.2.2 Stacked 1D CNN

The first neural network we use for our research comes from Dang, Moreno-Garcia, and De la Prieta's work in "Sentiment Analysis Based on Deep Learning: A Comparative Study" [19]. The authors reviewed many different SA techniques and model structures in their work. They compare the performance of three different neural network structures and two sentence representation techniques across eight datasets. From these comparisons, the authors found that, using a stacked CNN structure, you can create an SA model that achieves competitive performance while requiring at most 65% of the training time of the top-performing model. The F-score and run time comparisons can be seen in Table 3.1. The neural network structure can be found in Table 3.2. The run times reported in Table 3.1 come from running the selected neural network structure on 100% of the data. While the authors found that the RNN outperformed the CNN, the run times are much longer. This increase in run time can make the RNN unsuitable for certain tasks where getting results quickly is an important factor for the model. Overall, the authors' comparisons in this work provided us with a CNN structure that has been shown to perform well on various datasets.

Table 3.1: CNN vs. RNN

Dataset                 CNN F Score   RNN F Score   CNN Time     RNN Time
Sentiment140            0.8006        0.8297        7 min 3 s    1 h 4 min 16 s
Tweets Airline          0.9406        0.9406        1 min 22 s   2 min 41 s
Tweets SemEval          0.8288        0.8387        1 min 11 s   2 min 43 s
IMDB Movie Reviews (1)  0.8591        0.8702        33 s         7 min 42 s
IMDB Movie Reviews (2)  0.8273        0.8697        37 s         8 min 23 s
Cornell Movie Reviews   0.7156        0.7759        21 s         4 min 40 s
Book Reviews            0.7773        0.7340        21 s         4 min 40 s
Music Reviews           0.7403        0.7321        17 s         4 min 42 s

Table 3.2: Stacked 1D CNN

Layer           Output Shape   Parameters
Embedding       40, 300        4500300
1D Conv         40, 64         57664
1D Conv         40, 32         6176
Max Pooling 1D  13, 32         0
1D Conv         13, 16         1552
1D Conv         13, 8          264
Global Avg 1D   8              0
Dense           1              9

3.2.3 CoLSTM

To create our Convolutional Long Short-Term Memory neural network, we combined the ideas presented in the following papers:

• "Co-LSTM: Convolutional LSTM model for sentiment analysis in social big data" by Behera et al. [20].
• "Multi-channel LSTM-CNN model for Vietnamese sentiment analysis" by Vo et al. [21].

We will explore these works in this chapter, and in Chapter 4, we will introduce how we have combined these ideas.

3.2.3.1 Behera et al.

With the introduction of their Co-LSTM model, Behera et al. provide an exciting framework for creating a neural network to perform sentiment analysis [20]. The model that the authors propose combines CNNs and LSTMs into a single network to leverage the advantages that the different layer types bring. The authors use a CNN layer to capture the relationships between different words. The LSTM layer is used to help identify how these different relationships interact with each other and how they relate to the overall sentiment of the input. The model proposed by the authors has six layers. The first is a standard embedding layer that embeds the input into a 128-dimensional space. The subsequent two layers are the convolutional layer and the max pooling layer, respectively. The authors use seven 3x3 filters for their convolutions. After the max pooling layer comes the LSTM layer. The output from the LSTM is then fed through two fully connected layers to determine the final label for the input.

3.2.3.2 Vo et al.

In their work, Vo et al. introduce a neural network structure that combines CNNs and LSTMs [21]. Of particular interest to our work, the authors propose a CNN structure that utilizes multiple kernels of different sizes. This allows their network to identify relationships between multiple different combinations of words. The authors created an LSTM-CNN network by having a separate LSTM and CNN network, concatenating the outputs from each, and then feeding this through a final dense layer. For both the CNN and LSTM networks, they use word embeddings with a dimension of 200. They then use three different kernel sizes to create 450 different convolutions: 150 each of sizes 3, 5, and 7. The max of each output is selected and fed through a dense layer to get the final output with a dimension of 100. The LSTM network feeds the word embeddings into an LSTM layer with 128 nodes. The output of this matches the dimension of the CNN network at 100. Finally, these two outputs are concatenated and fed into a final neural network that predicts the sentiment of the original input.
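A rough TensorFlow/Keras sketch of this two-branch design is given below. It follows the dimensions quoted above in the spirit of Vo et al., but it is not the exact implementation used in this thesis (described in Chapter 5); the vocabulary size and sequence length are placeholders.

```python
# Rough sketch of a multi-kernel CNN branch combined with an LSTM branch.
# Embedding size (200), kernel sizes (3, 5, 7) with 150 filters each, the
# 128-unit LSTM, and the 100-dimensional branch outputs follow the numbers
# quoted in the text; vocab_size and seq_len are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len = 20000, 100
inputs = layers.Input(shape=(seq_len,))
embedded = layers.Embedding(vocab_size, 200)(inputs)

# CNN branch: three kernel sizes, global max over each feature map.
conv_outputs = []
for kernel_size in (3, 5, 7):
    conv = layers.Conv1D(150, kernel_size, activation="relu")(embedded)
    conv_outputs.append(layers.GlobalMaxPooling1D()(conv))
cnn_branch = layers.Dense(100, activation="relu")(
    layers.Concatenate()(conv_outputs))

# LSTM branch reduced to the same output dimension as the CNN branch.
lstm_branch = layers.Dense(100, activation="relu")(
    layers.LSTM(128)(embedded))

# Concatenate both branches and predict the sentiment label.
merged = layers.Concatenate()([cnn_branch, lstm_branch])
outputs = layers.Dense(1, activation="sigmoid")(merged)
model = tf.keras.Model(inputs, outputs)
```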
3.2.4 BERT

We have chosen the ALBERT model for our work [10]. We will summarize what BERT is here and the improvements that ALBERT offers. In Chapter 5, we will explore how we have used ALBERT for our experiment.

3.2.4.1 Original BERT

BERT (Bidirectional Encoder Representations from Transformers) was proposed in [13] in 2018. The main idea behind BERT was to improve on previous approaches to unsupervised pre-training by utilizing a bidirectional architecture, attention, and transformers. Devlin, Chang, Lee, and Toutanova achieved this by stacking many transformers on top of one another and training them on two unsupervised tasks: masked word prediction and next-sentence prediction. In the word prediction task, BERT is required to predict which word should occur in a sentence when a word has been hidden. For the next-sentence task, BERT must predict whether sentence A is followed by sentence B. The authors demonstrate that these tasks create a pre-trained model that can be used as the basis for many other, more complicated NLP tasks.

After the model is pre-trained, it will have learned a general representation of the text in the training materials. This representation can then be used for the target task. To use the pre-trained representation, all that is needed is to supply the BERT model with the new task inputs, take the outputs from the model, and feed them into a classifier. The classifier is trained as usual and only needs to learn how to solve the new task, not how to represent the language as well. BERT is pre-trained on large amounts of unlabelled data, and the representation it learns can be distributed. This saves training time, as only one base model is needed to create many different classifiers, and it is one of the main advantages that BERT offers. Overall, the ability of BERT to learn accurate representations of a language and to transfer this learning to task-specific models can be a significant advantage when creating accurate models in many NLP domains.

3.2.4.2 ALBERT

One of the main drawbacks of the base BERT model is that the pre-trained model is large, as it contains a very large number of parameters used to represent the language. Lan et al. proposed ALBERT (A Lite BERT), a model that reduces the number of parameters needed in the final pre-trained product [10]. Their research aimed to determine whether a larger model is required to produce good results for NLP tasks. The authors propose two changes to the architecture of BERT and one change in how the model is pre-trained. The first architectural change reduces the number of embedding parameters by first projecting the encodings into a lower-dimensional space before embedding them into the hidden space. The authors found that this enhanced the model's performance in all the test cases. The second change shares parameters across the different layers. Unlike the first change, the authors found that this caused their models to lose some performance. Finally, the authors tested a pre-training method different from the original BERT: they changed the sentence prediction task from next-sentence prediction to sentence ordering. Instead of simply determining whether one sentence occurs after another, ALBERT is trained to determine the order of sentences. They found that this change in pre-training increased the performance on all other tasks.
Overall, the authors found that their model offers better performance when compared to BERT models that contain a similar number of parameters. For our research, we use the ALBERT implementation that TensorFlow provides [22]. We use this network as a baseline against which to compare the performance of our proposed algorithm.

3.2.5 Active Learning

Active learning in natural language processing is a diverse field with many different areas of active research. These include a large variety of querying strategies, annotation techniques, and various network structures and learning techniques [23]. We will briefly explore the available research in these areas before examining the research we use directly in our work.

We will start by examining the three general types of querying strategies: informativeness, representativeness, and hybrid [23]. These strategies all decide how data is selected from the unknown data set.

Informativeness strategies use sampling methods that examine each data point individually. These range from uncertainty sampling strategies, such as entropy-based and local-divergence methods [24, 25], to gradient-based [26] and performance prediction methods [27]. The general motivation behind these strategies is to optimize how new information is added to the model.

Representativeness strategies help to deal with the fact that informativeness selection strategies can be susceptible to sampling bias and outlier selection [23]. The representativeness strategies include density selection, discriminative selection, and batch diversity [23]. Density selection attempts to avoid outliers by selecting data points based on their average similarity to all other points. Discriminative selection selects samples that are different from the data points that already have labels, which will hopefully give the classifier a good representation of the entire data set. Batch diversity strategies optimize the ability to select multiple data points at once. In other selection strategies, only a single data point is selected in each iteration. While this can ensure optimal data points are selected each iteration, acquiring labels one at a time can be expensive. Supplying the labellers with a more extensive selection of data points helps ensure their time is used well.

Hybrid strategies are a combination of the previous two. These strategies aim to combine the advantages of the other strategies and minimize the disadvantages. There are numerous ways to combine informative and representative approaches. Common combinations include entropy and density combinations [28, 29] or combinations of uncertainty, representativeness, and diversity [30]. Many of these strategies naturally combine different approaches; examples are weighted clustering, or filtering for uncertain samples and then clustering to select a diverse subset of those samples [23]. While these natural combinations can provide advantages, dynamic combinations can further improve the performance of active learning algorithms. Dynamic combinations are combinations where the specific selection method evolves or changes over time. For example, it can be helpful to use a representativeness sampling method at the start of the active learning process and then switch to uncertainty sampling as more data is acquired.
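As a concrete illustration of the informativeness family of querying strategies, the sketch below shows entropy-based uncertainty sampling, which simply queries the unlabelled examples the classifier is least sure about. This is a generic example of ours, not a method taken from any of the cited works.

    import numpy as np

    def entropy_query(probs: np.ndarray, batch_size: int) -> np.ndarray:
        """Return the indices of the unlabelled examples with the highest
        predictive entropy (the points the classifier is least certain about).

        probs: array of shape (n_unlabelled, n_classes) containing the
               classifier's predicted class probabilities.
        """
        eps = 1e-12  # avoid log(0) for fully confident predictions
        entropy = -np.sum(probs * np.log(probs + eps), axis=1)
        return np.argsort(entropy)[-batch_size:]

    # Usage: probs could come from model.predict(unlabelled_x) in Keras, e.g.
    # selected = entropy_query(probs, batch_size=64)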
While tasks like sentiment analysis and other classification tasks that require the classifier to provide a single label can be costly to train, other tasks like event extraction and named entity recognition are even more expensive to label. Different annotation strategies can help reduce the costs for these types of tasks [31]. While annotation strategies are an essential aspect of AL, we will focus on the querying strategy we used in our research.

3.2.5.1 Exploration-Exploitation

In their work "Deep similarity-based batch mode active learning with exploration-exploitation," Yin et al. identified two limitations of previous AL algorithms. The first limitation was that the similarity measure used to compare instances was computed directly in the feature space. The second limitation was that previous algorithms focused heavily on the "exploitation" of the data, meaning that they concentrated on information close to the decision boundary. The authors argue that focusing too heavily on the data around the decision boundary limits the algorithm's "exploration" of the entire data distribution. With these limitations in mind, the authors propose an AL algorithm that uses two different equations to ensure that the entire distribution is explored while the decision boundaries are still sufficiently exploited.

The first equation is used in the exploitation step:

I(x) = E(x) − Sim(x, S)    (3.1)

where E(x) is the informativeness measure from Equation 2.31 and Sim(x, S) is the similarity between an element x and the set of selected elements S. This equation estimates the amount of new information each data point contains. To use Equation 3.1, we first need to construct the set S, which represents the selected data; we initialize it with the data point that maximizes Equation 2.31. We then calculate Equation 3.1 for every element of U and add the element with the maximum value to S. We repeat this step until we have added a total of numexploit data points. After completing the exploitation step, we have added the numexploit points that contain the most information. The pseudocode for this step can be found in Figure 3.1. The algorithm then moves into the exploration step.

Given: U, numexploit
Find x ∈ U such that x maximizes E(x)
S = {x}
U = U − x
While |S| < numexploit:
    For i ∈ U:
        I(i) = E(i) − Sim(i, S)
    End For
    x = argmax_i I(i)
    S = S ∪ {x}
    U = U − x
End While
Return S, U

Figure 3.1: Exploitation Code

The equation used in the exploration step is:

x = argmin_{i ∈ U} Sim(i, L ∪ S)    (3.2)

This equation ensures that the element we select is the one most dissimilar to the data that has already been labelled or selected. We add the selected element to S and recompute Equation 3.2 until numexplore additional points have been added. The pseudocode for the exploration step can be found in Figure 3.2.

Given: U, L, S, numexploit, numexplore
total = numexploit + numexplore
While |S| < total:
    For i ∈ U:
        I(i) = −Sim(i, L ∪ S)
    End For
    x = argmax_i I(i)
    S = S ∪ {x}
    U = U − x
End While
Return S, U

Figure 3.2: Exploration Code

After completing the exploitation and exploration steps, we retrain our classifier and calculate the new performance. We repeat these steps until we achieve the desired performance. We make minor modifications to the Exploration-Exploitation algorithm for our research; these changes are discussed in Chapter 4.
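The following is a minimal Python sketch of the exploration-exploitation selection loop described above. It assumes that each sequence is represented by a feature vector (for example, the SVD-reduced TF-IDF vectors we use in Chapter 4), that E(x) is a precomputed informativeness score such as predictive entropy, and that Sim is an average cosine similarity. The original work instead learns the similarity with a Siamese network, so this is an approximation rather than the authors' algorithm.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def select_batch(pool_vecs, pool_info, labelled_vecs, n_exploit, n_explore):
        """Select a batch of pool indices: first exploit (informative but dissimilar
        to what is already selected), then explore (dissimilar to everything labelled).
        pool_vecs, pool_info, and labelled_vecs are NumPy arrays."""
        remaining = list(range(len(pool_vecs)))

        # Initialise S with the single most informative point (Equation 2.31).
        first = max(remaining, key=lambda i: pool_info[i])
        selected = [first]
        remaining.remove(first)

        # Exploitation step (Equation 3.1): I(x) = E(x) - Sim(x, S).
        while len(selected) < n_exploit and remaining:
            sims = cosine_similarity(pool_vecs[remaining], pool_vecs[selected]).mean(axis=1)
            scores = pool_info[remaining] - sims
            best = remaining[int(np.argmax(scores))]
            selected.append(best)
            remaining.remove(best)

        # Exploration step (Equation 3.2): pick the point least similar to L ∪ S.
        while len(selected) < n_exploit + n_explore and remaining:
            reference = np.vstack([labelled_vecs, pool_vecs[selected]])
            sims = cosine_similarity(pool_vecs[remaining], reference).mean(axis=1)
            best = remaining[int(np.argmin(sims))]
            selected.append(best)
            remaining.remove(best)

        return selected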
3.2.6 Lexical Expansion and Data Augmentation

As discussed previously, one of the main challenges that NLP models can face is a lack of diversity in the training vocabulary. This is especially prevalent when the number of training samples is limited. LE and DA techniques aim to fix this problem by introducing a wider range of words into the training data. The motivation behind this is simple: unknown words cause errors. Following the work of [32], we can see four main areas of DA: token-level augmentation, sentence-level augmentation, adversarial augmentation, and hidden-level augmentation. We will examine these individually before highlighting the algorithm we have implemented for our research.

Starting from the individual token level, LE has been used to improve the performance of NLP techniques. This level of augmentation focuses on changing the individual words or simple phrases found in the training data. This can take the form of replacing words with synonyms, or of inserting and deleting words. Many approaches have been used to decide how to insert new words into existing sentences. These include using part-of-speech information to ensure that the new words fulfill the same role as the originals [33], changing the structure of the classifier to detect the changes made in the new data [34], and combining the identification of target and non-target words with ideas from image data augmentation to handle incorrect word insertions [35]. Modifying the individual tokens of the training sentences has been shown to improve classifier performance without requiring more labelled data.

This approach can be expanded to the sentence level as well. Tasks like machine translation and summarization depend not only on individual words but also on a wider understanding of language structure, and generating sentences in different contexts can improve the performance of these tasks.

Instead of targeting the words in the training sentences, it is also possible to augment the data by examining what the classifier has been learning. Adversarial augmentation techniques directly challenge the classifier with unknown data. These techniques train the model to be more robust and help increase the overall performance [36].

Finally, it is possible to create artificial data by augmenting the hidden representations of the words. Humans cannot read these new samples, but they still help improve classifier performance. An example of this can be found in [37], where the authors combine two real samples to create a new virtual sample.

LE and DA techniques all aim to create more data from the limited amount of available labelled data. For our research, we focused on a token-level technique, PLSDA.

3.2.6.1 PLSDA

Xiang et al. proposed the PLSDA algorithm to ensure that any generated data follows the syntactic consistency principle [33]: any changes to the data must not change the syntactic information that the data originally contained. To ensure this principle is followed, the authors propose the constraint of selecting words for replacement based on specified parts of speech and ensuring that any selected synonyms are of the same part of speech.

Choosing a word from a sentence and replacing it with its synonyms is accomplished in two steps. The first is the Substitution Candidate Selection step, and the second is the Instance Generation step. The first step chooses which words could be replaced with their synonyms. We assume a training sentence S composed of n words, S = w1, w2, ..., wn, along with the associated POS tag for each of these words. The authors present two ways to select potential words from this set.
The first finds all possible synonyms with the same POS tag as the original word. The second, while slightly more complicated, provides a higher degree of quality in the generated sentences: it incorporates word similarity into the synonym selection process. After generating the synonyms for a selected word and filtering them based on their POS tag, there is an additional comparison of the similarity between the synonym and the original word. Only synonyms whose similarity is above a threshold are considered for possible replacement. The algorithm goes through all n words and determines whether synonyms should be generated. The pseudocode for this step can be found in Figure 3.3. Once complete, the words and their synonyms are passed on to the next step.

Given: S, pos, sim
SCL_S = set()
For each word w in S:
    SCL_w = set()
    wp = POS(w)
    If wp ∈ pos:
        syns = synonyms of w
        For sy ∈ syns:
            syp = POS(sy)
            If syp = wp and SIM(sy, w) ⩾ sim:
                SCL_w = SCL_w ∪ {sy}
            End If
        End For
    End If
    SCL_S = SCL_S ∪ {(w, SCL_w)}
End For
Return SCL_S

Figure 3.3: Substitution Candidate Selection

The second step in the PLSDA process is instance generation. This is where the artificial data is created. The number of possible new sentences that could be created is relatively high. Consider a sentence where three words fulfil the POS constraints: if each of these words has five possible synonyms, then the number of potential artificial sentences is 5^3 = 125. Creating all these possible sentences is computationally expensive and, more importantly, they would require a large amount of storage. The authors therefore propose an additional constraint, exp, which is the expected number of generated instances. The PLSDA algorithm stops generating new instances once it reaches this number. If the number of possible combinations is less than exp, PLSDA generates all possible combinations.

To ensure a random sampling of all possible instances, the authors propose using a probabilistic distribution to choose the words for replacement and a random selection from the potential synonyms. Each word is selected for replacement based on a Bernoulli distribution. The synonym for each selected word is then drawn from its possible synonyms, with each synonym being equally likely to be selected. Utilizing this method, it is possible to generate artificial instances of a sentence; the pseudocode for instance generation can be found in Figure 3.4. Using the PLSDA algorithm can greatly increase the number of training samples that a classifier can access without manually labelling more samples. We discuss the details of our implementation of PLSDA in Chapter 4.

Given: S, SCL_S
InsGen = set()
S_new = S
ben_S = BernoulliDist(SCL_S)
For index in ben_S:
    If ben_S[index] == 1:
        w = SCL_S[index][0]
        SCL_w = SCL_S[index][1]
        prob_w = Prob(SCL_w)
        For index_B in prob_w:
            If prob_w[index_B] == 1:
                S_new.replace(w, SCL_w[index_B])
            End If
        End For
    End If
End For
InsGen = InsGen ∪ {S_new}
Return InsGen

Figure 3.4: Instance Generation
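To make the substitution-candidate step concrete, the sketch below shows one way to gather POS-constrained synonym candidates with NLTK's WordNet interface, similar in spirit to Figure 3.3. The target POS tags, the use of Wu-Palmer similarity, and the threshold value are illustrative assumptions; they are not the parameters used by Xiang et al. or by our implementation in Chapter 4.

    import nltk
    from nltk.corpus import wordnet as wn

    # Assumes the usual NLTK data has been downloaded, e.g.
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

    POS_MAP = {'JJ': wn.ADJ, 'RB': wn.ADV, 'NN': wn.NOUN, 'VB': wn.VERB}

    def substitution_candidates(sentence, target_pos=('JJ', 'RB'), sim_threshold=0.5):
        """Return {word: set_of_synonyms} for words whose POS tag is in target_pos,
        keeping only same-POS synonyms that pass a WordNet similarity check."""
        candidates = {}
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            wn_pos = POS_MAP.get(tag[:2])
            if wn_pos is None or tag[:2] not in target_pos:
                continue
            word_synsets = wn.synsets(word, pos=wn_pos)
            if not word_synsets:
                continue
            synonyms = set()
            for synset in word_synsets:
                # Wu-Palmer similarity between this synset and the word's primary synset.
                sim = synset.wup_similarity(word_synsets[0])
                if sim is None or sim < sim_threshold:
                    continue
                for lemma in synset.lemmas():
                    name = lemma.name().replace('_', ' ')
                    if name.lower() != word.lower():
                        synonyms.add(name)
            if synonyms:
                candidates[word] = synonyms
        return candidates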
Chapter 4

Algorithm

This chapter will examine the proposed algorithm's structure and setup. We have designed the algorithm modularly so that either the active learning or the lexical expansion component can be used separately or together. Additionally, the selection of the neural network is independent of the algorithm; this independence allows the easy insertion of any appropriate network into the algorithm. We will examine the algorithm sequentially, starting with the preprocessing, then moving into the AL and LE stages, and ending with the training stage. The final section of this chapter will summarize the contributions this research has made to advancing sentiment analysis research.

4.1 Preprocessing

Our preprocessing stage can be broken down into six steps:

1. Remove over-length sequences.
2. Tokenization.
3. Generate intermediate information.
4. Build vocabulary.
5. Convert words to numbers.
6. Padding.

4.1.1 Over-Length Sequences

The first step in our preprocessing stage is to remove overly long sequences. We do this because our neural networks have a set input length, which we chose to be 800 words. We chose this length as the average sequence length in our training set is 757, and selecting a number larger than the average ensured we kept most of the data. Another approach would be to truncate sequences longer than the input size. We wanted to avoid the noise this approach would introduce, as the longest sequence in the dataset is 8574 words long. In addition, we felt that there may be some differences between classifying long sequences and short ones; by removing overly long sequences, we avoid any of the potential differences between these tasks.

4.1.2 Tokenization

The tokenization process that we use is composed of four steps. We first remove special characters. The second step is to remove single characters. The third step is to remove numbers and stop words; the stop word list we use is the English stop word list from nltk [38]. We then finally convert the entire sentence to lowercase. These four steps help ensure we offer our neural network the highest quality sequences.

4.1.3 Intermediate Information

The next step in the preprocessing stage is the creation of the intermediate information needed for similarity comparisons. This is required because the active learning stage needs to compare the similarity of unknown sequences with sequences that the classifier has seen. We examined two approaches to this step, as proposed by the authors in [39]. It is possible to use BERT embeddings to map all the sequences into an easy-to-compare state. However, we felt that using the BERT embedding for the similarity comparison but not for the actual network representation would be utilizing information the network did not have access to. We wanted to compare the performance of the active learning technique when the similarity between sequences was limited to information the network had access to. To do this, we used a latent semantic analysis approach: we take all the sequences in the training set and create a term frequency-inverse document frequency matrix, then use a singular value decomposition to reduce the dimensions of this matrix to 100. With this final matrix, we can easily compare any unknown sequence by running it through the same process. Any words unknown to the network are given the same label, and similarity is only calculated based on available information. This process can be seen in Figure 4.1. We wanted to avoid using the network-learned embeddings, as these are actively being learned and would be a poor representation of the information until the network is more established.

Define svd() as:
Given: dataTrain, dataTest, tfidf, svd
temp = tfidf.fit_transform(dataTrain)
svd_train = svd.fit_transform(temp)
temp = tfidf.transform(dataTest)
svd_test = svd.transform(temp)
Return svd_train, svd_test

Figure 4.1: Create SVD Code
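For reference, a minimal scikit-learn sketch of this LSA-style representation is shown below. The library choice is ours; the thesis itself only specifies a TF-IDF matrix reduced to 100 dimensions with SVD, and the vectorizer settings here are defaults rather than our exact configuration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def build_lsa_space(train_texts, test_texts, n_components=100):
        """Fit TF-IDF + truncated SVD on the training texts and project both sets
        into the shared low-dimensional space used for similarity comparisons."""
        tfidf = TfidfVectorizer()                      # vocabulary comes from the training data only
        svd = TruncatedSVD(n_components=n_components)

        train_vecs = svd.fit_transform(tfidf.fit_transform(train_texts))
        test_vecs = svd.transform(tfidf.transform(test_texts))   # unseen words are simply ignored
        return train_vecs, test_vecs

    # The similarity between an unknown sequence i and the labelled pool can then be
    # computed with, for example, cosine_similarity(test_vecs[i:i+1], train_vecs).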
4.1.4 Build Vocabulary

Using SVD to determine the similarity between two sequences requires maintaining a vocabulary of the words we have seen. We do this by uniquely mapping the words in the training set to numbers. Additionally, we need to be able to map the numbers back to the words. When we add a sequence from the test set to the training set, we update the mapping with any new words. This process can be seen in Figures 4.2 and 4.3.

Define vocab() as:
Given: dataTrain, STOPWORDS as SW
index = 1
map = dictionary()
indices = dictionary()
For sentence ∈ dataTrain:
    For w ∈ sentence:
        If w ∉ SW and w ∉ map:
            map[w] = index
            indices[index] = w
            index = index + 1
        End If
    End For
End For
Return map, index, indices

Figure 4.2: Create Vocab Code

4.1.5 Convert Words to Numbers

After building the vocabulary, we convert all the words to numbers. Using the dictionary created in Figure 4.2, words that are present in the training set are replaced with their numbers. Words that are not present in the training set are removed. This is applied to both the training and the testing set, and it results in sequences of varying length. Our neural networks require sequences to be the same length; this is handled by the next step: padding.

Define addVocab() as:
Given: sent, map, index, indices, SW
For w in sent:
    If w ∉ SW and w ∉ map:
        map[w] = index
        indices[index] = w
        index = index + 1
    End If
End For
Return index

Figure 4.3: Add Sequence Code

4.1.6 Padding

Padding is a simple preprocessing step that ensures all sequences are the same length. This is accomplished by inserting zeros at the end of any sequences that are shorter than needed. It is essential to mask these zeros from the neural networks to ensure they are not learning from these extra inputs; this is why the indexing in Figure 4.2 starts from one, not zero.

4.1.7 Complete Preprocessing

Now that the different preprocessing stages have been examined, we can look at the code for the entire process in Figure 4.4.

4.2 AL + LE

We will now examine our algorithm's Active Learning and Lexical Expansion sections. These are based on the works of [39] and [33], respectively. The basics of these algorithms were discussed in Chapter 3; here, we will examine the changes that we have made and how we combine the two approaches. In addition, as we have created our algorithm modularly, we will assume the network is provided to us as a variable.

Define preprocessing() as:
Given: data, STOPWORDS as SW, svd, tfidf
data = removeOverLength(data)
data = tokenize(data)
dTr, dTe = split(data)
svdTr, svdTe = svd(dTr, dTe, tfidf, svd)
map, index, indices = vocab(dTr, SW)
dTrNum = convertNum(dTr)
dTeNum = convertNum(dTe)
dTrNum = pad(dTrNum)
dTeNum = pad(dTeNum)

Figure 4.4: Preprocessing Code

4.2.1 Combining AL and LE

We made small changes to both the Active Learning and Lexical Expansion algorithms. We will discuss these changes and then outline how we combine the techniques.

4.2.1.1 AL

The main change we make to the Exploration-Exploitation approach proposed in [39] is replacing the Siamese network with a singular value decomposition (SVD) matrix. Training an additional network alongside the classification network is computationally expensive. The original paper did not focus solely on text analysis, so the authors needed a method of comparing the similarity of a wide variety of data types. For our purposes, we can replace the Siamese network with a similarity comparison that is widely used for text: a sparse term frequency-inverse document frequency (TF-IDF) matrix reduced using SVD. Using this matrix has been shown to give good results for similarity comparisons [40].
By removing the Siamese network, the overall computation time of the algorithm should be reduced, and the accuracy of the similarity calculation should improve as well.

4.2.1.2 LE

We also changed the PLSDA algorithm proposed in [33] by changing the similarity measurement. In the original paper, the authors used pre-trained GloVe embeddings to determine the similarity between the original sentence and the new, modified sentence. We did not want to use embeddings that were not being used by the classifier. Instead, we used the WordNet similarity comparison to compare words [41]. WordNet generates the synonyms used by the algorithm, and it contains a built-in word similarity comparison. We feel that relying on a pre-trained embedding that may not generalize to the current training set could introduce noise into the algorithm; using the similarity comparison from WordNet removes some of this noise [41].

4.2.1.3 Combining AL and LE

To combine the two strategies, we first run the Active Learning approach to select the new sequences to be added to the data set. We then take these sequences and apply the Lexical Expansion technique to them, which generates additional sequences. This process is visualized in the following figures. The motivation behind our approach is that the exploration-exploitation algorithm will select data that confuses the classifier and is close to the decision boundary, as well as data that is far from the decision boundary and on which the classifier has little information. Adding the PLSDA algorithm to this will generate more information in the areas that are either the most confusing or the most unknown.

Figure 4.5: Generated Data

Figure 4.5 contains artificially generated data demonstrating our algorithm. The data is separated into two classes: the squares and the crosses.

Figure 4.6: Training Data

In Figure 4.6, we have selected a small amount of the data to train our classifier, denoted by the circles around the data points. The remaining data will be used to test the classifier.

Figure 4.7: Decision Boundary

We create an artificial decision boundary based on the selected training data; the line in Figure 4.7 shows this boundary. The data points above and to the right of the line will be labelled as crosses, and the ones below and to the left will be labelled as squares.

Figure 4.8: Active Learning Data

After the initial labelling, we can apply our proposed algorithm. The first step is to run the active learning algorithm: exploration-exploitation. We have it select eight data points: four from exploration and four from exploitation. The exploration stage selects four points far from the original training set. The exploitation stage selects four data points that are confusing to the classifier: points that the classifier labelled incorrectly or was unsure about. If our algorithm is used on real-world data, where the correct labels are unknown beforehand, the newly selected data would be provided to topic experts to label.

Figure 4.9: PLSDA Data

After the Active Learning section of our algorithm has been completed, we move on to the PLSDA section. For this demonstration, we create three new data points. These will be within a preset similarity threshold of the originals. The newly created data points are shown in Figure 4.9 as the diamonds. The generated data is given the same label as its parent data point.
The new training set is assembled after the PLSDA section has processed all the data points selected in the Active Learning section. This can be seen in Figure 4.10.

Figure 4.10: New Training Data

Figure 4.11: Updated Decision Boundary

After completing our algorithm, the classifier is retrained on the new training data. This creates a new decision boundary that could look something like the line in Figure 4.11. An aspect of the PLSDA approach that is not illustrated in Figure 4.11 is that when the training data is text, the generated data can add information that was not initially present in the training set. Ideally, this new information will reshape the decision space, making the distinction between the classes more apparent.

4.3 Contributions

The contributions that this thesis has made to advance the current state of sentiment analysis research are:

• Combining EE and PLSDA.
• Testing a different similarity measure for EE.
• Testing a different similarity measure for PLSDA.
• Training and comparing two neural networks with EE, PLSDA, and APLSDA.

The main contribution of this work is the combined algorithm, APLSDA. This newly synthesized algorithm tests whether the advantages of its component algorithms can be combined while minimizing the weaknesses that those algorithms introduce. By changing the similarity measures of both the EE and the PLSDA algorithms, this research aims to determine whether pre-trained embeddings are necessary for sentiment analysis research. One of the main motivating factors for this research is to reduce the amount of labelled training data needed to create a classifier and to reduce the overall cost of the classifier. If pre-trained embeddings are required to accomplish this goal, then the cost of creating a classifier remains high, as a general end user will not be able to create these embeddings themselves. Finally, by training and comparing two different neural network structures, we examine whether different neural networks interact with our APLSDA algorithm differently.

In the next chapter, we describe how we test this algorithm to determine whether it offers better performance when compared to the baseline networks.

Chapter 5

Experiment Set Up

Our experiment aims to determine if combining the PLSDA lexical expansion algorithm with the Exploration-Exploitation active learning algorithm provides better results when compared to either of the algorithms alone or the base performance of the model. To determine this, we measure each algorithm's average and maximum performance. In addition to changes in performance, we also analyze changes in processing time between the different combinations of approaches. We are examining three different neural networks because we are interested in seeing whether the changes in performance depend on the network structure. We will first describe the three models we are evaluating before examining the structure of our experiments.

5.1 Models

We have two main types of models: general neural network models and BERT-based models. Overall, we tested three different neural networks. The two general neural network structures are a stacked CNN model based on [19] and a CoLSTM, which is a combination of the works of [20] and [21]. For the BERT-based model, we have selected an ALBERT model [10] for evaluation. All of our neural networks use the Python API for TensorFlow [22]. We use the Keras layers for both the CNN and the CoLSTM.
We use the pre-trained models for the BERT model, which are available through TensorFlow Hub.

5.1.1 CNN

Our CNN is based on the network found in [19]. The structure of the authors' model can be found in Table 3.2; our structure can be found in Table 5.1. The main difference between the authors' network and ours is that we allow the input sentences to be longer: in the original paper, the maximum length of a sentence was 40, while we allow sentences up to a length of 800. With this in mind, we follow the example of [19] by setting the kernel size for all of the 1D CNN layers to 3. The first 1D CNN layer has 64 filters, and the next has 32. A max-pooling layer follows. After this, there are two more 1D CNN layers, the first with 16 filters and the second with 8. The last two layers are a global average pooling layer and, finally, the fully connected dense layer. The authors in [19] showed that this neural network structure performs well while requiring relatively little training time.

Table 5.1: Our CNN

Layer            Output Shape   Parameters
Embedding        800, 300       18000000
1D Conv          800, 64        57664
1D Conv          800, 32        6176
Max Pooling 1D   266, 32        0
1D Conv          266, 16        1552
1D Conv          266, 8         264
Global Avg 1D    8              0
Dense            1              9

5.1.2 CoLSTM

As discussed in Chapter 3, our CoLSTM is loosely based on the works of [20] and [21]. From [20], we adopt the idea of first using a CNN layer to reduce the dimensionality of the data before providing it as input to the LSTM. From [21], we take the idea of using multiple CNN kernel sizes. Combining these two ideas gives our final structure, where the word embeddings are fed into three separate 1D CNNs with kernel sizes of 3, 5, and 7, respectively. Each of these learns five filters. An average pooling layer is then applied to each of the CNN outputs, which provides us with three tensors of 800 elements each. We stack these on one another to create a 3 × 800 tensor that is the input to our LSTM. The LSTM's output is fed through two dense layers to get our final output. The idea behind this structure is that the three CNN layers will learn the different word relationships in the text, and the LSTM layer will take these different relationships and learn any temporal relationships between them. The layer-by-layer breakdown can be seen in Table 5.2.

Table 5.2: Our CoLSTM

Layer                        Output Shape   Parameters
Input                        800            0
Embedding                    800, 300       18000000
1D Conv Kernel 3             800, 5         4505
1D Conv Kernel 5             800, 5         7505
1D Conv Kernel 7             800, 5         10505
Average Pooling 1D (for 3)   800            0
Average Pooling 1D (for 5)   800            0
Average Pooling 1D (for 7)   800            0
Stack                        3, 800         0
LSTM                         100            360400
Dense                        100            10100
Dense                        1              101

5.1.3 ALBERT

We use the large ALBERT model [10] that is available from TensorFlow [22]. We made this choice as it offers performance comparable to the base BERT model while requiring far fewer parameters. We did not allow the weights of the pre-trained embeddings to be updated. On top of the embeddings, we have a network that is similar to our CNN network: a single 1D CNN with a kernel of size 3 followed by a pooling layer, then two more 1D CNN layers with kernels of size 3, and a final pooling layer that feeds into a dense layer to produce the final classification. This structure can be seen in Table 5.3.

Table 5.3: ALBERT

Layer                Output Shape   Parameters
Input                800            0
Embedding            128, 1024      17683968
1D Conv Kernel 3     128, 64        196672
Max Pooling          42, 64         0
1D Conv Kernel 3     42, 16         3088
1D Conv Kernel 3     42, 8          392
Global Avg Pooling   8              0
Dense                1              9
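As a concrete reference for the structure in Table 5.1, the following is a minimal Keras sketch of the stacked CNN. The vocabulary size is inferred from the embedding parameter count in Table 5.1 (18,000,000 / 300 = 60,000), while the activation functions, padding mode, and compilation settings are assumptions of this sketch rather than reported details.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    VOCAB_SIZE = 60000   # implied by the 18,000,000 embedding parameters in Table 5.1
    SEQ_LEN = 800        # maximum input length used in the thesis

    def build_stacked_cnn():
        model = models.Sequential([
            layers.Input(shape=(SEQ_LEN,)),
            layers.Embedding(VOCAB_SIZE, 300),                        # index 0 reserved for padding
            layers.Conv1D(64, 3, padding="same", activation="relu"),
            layers.Conv1D(32, 3, padding="same", activation="relu"),
            layers.MaxPooling1D(pool_size=3),                         # 800 -> 266 time steps
            layers.Conv1D(16, 3, padding="same", activation="relu"),
            layers.Conv1D(8, 3, padding="same", activation="relu"),
            layers.GlobalAveragePooling1D(),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model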
5.2 Algorithm Selection

We have four different algorithms that we are testing:

1. Basic
2. PLSDA
3. Exploration-Exploitation
4. PLSDA + Exploration-Exploitation

Depending on which algorithm we are testing, two different experiment setups are used. When we test the basic or PLSDA algorithm, we randomly select a portion of the data set for training and use the rest for testing. The amount of data selected for training starts at one percent and increases by one percent up to five percent; after the amount reaches five percent, we increase it by five percent at a time up to sixty percent. The motivation is to use the smallest amount of training data possible while keeping the computational costs reasonable. We train ten of the selected neural networks at each increment and record their performances.

The iterative nature of the Exploration-Exploitation algorithm makes the previous experiment setup ineffective for testing it. Instead, we start the algorithm with one percent of the data, allowing it to choose 64 sequences from the testing set at each iteration. Each time the algorithm adds data to the training set, we train ten networks and keep the best-performing one for the next iteration. Again, we track the performance of all the networks for comparison. We have the active learning algorithm run until it has selected sixty percent of the data to train from.

For our combined algorithm, we use the same setup as the Exploration-Exploitation algorithm. We start from one percent and, using the active learning section of our algorithm, we add 64 sequences at each iteration. Additionally, our algorithm adds generated sequences from the lexical expansion step.

Chapter 6

Evaluation and Analysis

6.1 Evaluation Criteria

To measure the performance of our classifiers, we use three metrics: Precision, Recall, and F1-Score. To understand these metrics, we must introduce the confusion matrix. The confusion matrix contains four quadrants, labelled True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). The TP and TN terms are used when the classifier correctly identifies whether a data point belongs to the target class. The FP and FN terms are used when the classifier incorrectly identifies a data point as belonging to the target class when it does not (FP) or says a data point does not belong to the class when it does (FN). By calculating a classifier's TP, FP, TN, and FN rates, we can calculate that classifier's precision, recall, and F1-Score.

Precision measures how often the classifier correctly makes a positive prediction. The formula can be found in Equation 6.1:

precision = TP / (TP + FP)    (6.1)

Recall measures how often the classifier correctly identified that a data point belonged to the target class. The equation for recall can be seen in Equation 6.2:

recall = TP / (TP + FN)    (6.2)

Finally, the F1-Score is the harmonic mean of precision and recall. The formula for the F1-Score can be found in Equation 6.3:

F1 = 2 × (precision × recall) / (precision + recall)    (6.3)

In addition to the F1-Score, we measure the mean and variance of the different networks and algorithms. The F1-Score is measured across all labels in a testing set, and we take the average F1-Score to measure our classifiers' overall performance.
In both of our datasets, there are two labels, so our final metric is:

F1_average = (F1_0 + F1_1) / 2    (6.4)

We also measure the statistical significance of the results of our different networks and algorithms. This is done using Student's t-test, shown in Equation 6.5:

t = (X̄1 − X̄2) / s_Δ    (6.5)

where s_Δ is calculated as in Equation 6.6:

s_Δ = sqrt(s1²/n1 + s2²/n2)    (6.6)

Here, si² is the estimated variance of each of the two series, and ni is the number of sequences in each sample. We set our confidence level to 0.95 and our significance threshold to p = 0.05.

6.2 Evaluation

We use the IMDB movie review dataset [18] for our evaluation. We will analyze the average and the maximum performance of the classifiers. We have taken ten observations for each network and each algorithm. Although this is a small number of observations for significance testing, we believe these tests will still provide some helpful information.

6.2.1 CNN

We will start by analyzing the averages from the CNN. The averages can be found in Table 6.1, and they are graphed in Figure 6.1. The highest average at each sampling step is in bold. Values followed by an asterisk are statistically significant compared to the normal, unmodified results with a p-value of p < 0.05. The p-values are in Table 6.4, in the A-N, P-N, and AP-N lines for Active-Normal, PLSDA-Normal, and APLSDA-Normal, respectively.

Table 6.1: Average F1 Score CNN

          1       2       3       4       5       10      15      20
Normal    0.335   0.335   0.335   0.558   0.666   0.791   0.812   0.806
PLSDA     0.335   0.452   0.683*  0.684*  0.720   0.782   0.782   0.800
Active    0.335   0.433   0.626*  0.667   0.697   0.801   0.816   0.824*
APLSDA    0.335   0.671*  0.706*  0.717   0.771   0.788   0.819   0.830*

          25      30      35      40      45      50      55      60
Normal    0.825   0.823   0.827   0.827   0.830   0.830   0.832   0.826
PLSDA     0.820   0.822   0.823   0.820   0.825   0.828   0.832   0.832
Active    0.840*  0.848*  0.852*  0.869*  0.879*  0.894*  0.894*  0.914*
APLSDA    0.840*  0.851*  0.864*  0.865*  0.875*  0.891*  0.902*  0.913*

* denotes statistical significance compared to the normal approach

This table shows that our proposed method outperforms the other approaches for training amounts of less than ten percent. This difference is only statistically significant for the values two and three. For values of ten and above, either the AL approach or our APLSDA approach had the highest average; among these values, the AL and APLSDA approaches are only significantly different from the normal approach for values greater than 15. There seems to be no predictable pattern in which one outperforms the other. It is also worth noting that we have calculated the p-values between the different approaches. These can be found in Table 6.4, and as seen in the A-AP line, there is no significant difference between the two approaches for values greater than two.

Figure 6.1: Average F1 Score CNN (average combined F1 score vs. percent training data for the Normal, PLSDA, Active, and APLSDA approaches)

The graph in Figure 6.1 helps to visualize the difference between the active learning and APLSDA algorithms, which is minor for values above 25. To further illustrate the differences between the algorithms and the normal CNN network, we have plotted the differences in the average F1 score in Figure 6.2. This figure shows that our APLSDA algorithm improves slightly faster than the active learning algorithm. While the PLSDA algorithm offers some improvements, it quickly falls off, offering only a slight improvement and, in some cases, performing worse than the standard network.
Figure 6.2: Change in Average F1 Score CNN (difference in average combined F1 score relative to the normal approach vs. percent training data)

We also want to look at the maximum values the neural network achieves. These values are in Table 6.2 and Figure 6.3. From the table and the figure, we can see that the max value follows a similar pattern to the average value. Our APLSDA algorithm achieves the highest value four out of five times for values of five and less. There is little pattern to the best-performing algorithm for values of ten and above. We can see from Figure 6.4 that the changes produced by the three different algorithms are similar to the changes in the average F1 score.

Table 6.2: Max F1 Score CNN

          1       2       3       4       5       10      15      20
Normal    0.335   0.335   0.367   0.706   0.786   0.817   0.824   0.824
PLSDA     0.335   0.706   0.756*  0.766*  0.791   0.813   0.816   0.826
Active    0.335   0.651   0.783*  0.753   0.795   0.815   0.829   0.837*
APLSDA    0.335   0.745*  0.779*  0.794   0.797   0.814   0.826   0.839*

          25      30      35      40      45      50      55      60
Normal    0.845   0.848   0.848   0.840   0.854   0.838   0.839   0.839
PLSDA     0.826   0.830   0.830   0.831   0.835   0.850   0.837   0.836
Active    0.847*  0.854*  0.870*  0.873*  0.883*  0.896*  0.908*  0.918*
APLSDA    0.846*  0.858*  0.867*  0.874*  0.883*  0.893*  0.904*  0.915*

* denotes statistical significance compared to the normal approach

Figure 6.3: Max F1 Score CNN (maximum combined F1 score vs. percent training data)

Figure 6.4: Change in Max F1 Score CNN (difference in maximum combined F1 score relative to the normal approach vs. percent training data)

Table 6.3: Standard Deviation CNN

          1           2        3        4        5        10       15       20
Normal    5.551E-17   0.0743   0.0331   0.0842   0.1580   0.0194   0.0108   0.0119
PLSDA     0.1006      0.1881   0.1203   0.0696   0.0247   0.0105   0.0101   0.0143
Active    5.551E-17   0.1148   0.0940   0.0913   0.1257   0.0254   0.0225   0.0109
APLSDA    5.551E-17   0.0488   0.0997   0.1127   0.0229   0.0220   0.0164   0.0096

          25       30       35       40       45       50       55       60
Normal    0.0132   0.0077   0.0043   0.0034   0.0038   0.0021   0.0051   0.0024
PLSDA     0.0046   0.0057   0.0273   0.0047   0.0020   0.0038   0.0039   0.0110
Active    0.0100   0.0037   0.0234   0.0020   0.0033   0.0015   0.0267   0.0029
APLSDA    0.0059   0.0037   0.0028   0.0111   0.0114   0.0023   0.0022   0.0016

In Table 6.3, the standard deviation for the four different algorithms is listed. Unlike the previous tables, the smallest value is highlighted in bold; we are interested in seeing whether any algorithm consistently provides a smaller standard deviation. The change in standard deviation can be seen in Figure 6.5. While the PLSDA approach does offer a smaller standard deviation at six of the sixteen points, it is important to remember that it resulted in worse performance than the baseline, and the PLSDA approach is statistically significant at only one of these points. The APLSDA and Active algorithms reliably produce values with a lower standard deviation for values above 15: the Active algorithm has a smaller standard deviation in seven of the nine values, and the APLSDA algorithm in six. The Active algorithm has the smallest values at three of the points.
Figure 6.5: Change in Standard Deviation F1 Score CNN (difference in standard deviation relative to the normal approach vs. percent training data)

When looking at the statistically significant differences between the algorithms, we can see a clear difference in how the algorithms behave for values of five and less, the values of ten and fifteen, and the values of twenty and higher. For the values of one to five, there is little pattern to the statistical significance between the different algorithms. For ten and fifteen, none of the algorithms are significantly different from each other. Finally, for all the training values of twenty-five and higher, all comparisons but Active-APLSDA and PLSDA-Normal are significant.

Table 6.4: P-Values CNN

          1        2        3        4       5       10      15      20
A-P       0.348    0.829    0.217    0.087   0.482   0.63    0.308   0.032
A-N       4E-225   0.136    4E-06    0.741   0.479   0.444   0.667   0.034
A-AP      1E-223   6E-05    0.073    0.24    0.187   0.309   0.873   0.217
P-N       0.343    0.427    0.001    0.035   0.166   0.612   0.318   0.784
P-AP      0.344    0.002    0.009    0.812   0.018   0.391   0.131   0.003
AP-N      6E-212   1E-08    5E-07    0.139   0.069   0.732   0.442   0.002

          25       30       35      40      45      50      55      60
A-P       1E-04    7E-09    0.001   6E-12   2E-16   1E-14   4E-05   3E-10
A-N       0.001    6E-07    0.005   9E-15   9E-17   1E-21   4E-05   2E-22
A-AP      0.819    0.315    0.154   0.461   0.607   0.012   0.449   0.164
P-N       0.63     0.98     0.08    0.276   0.156   0.29    0.989   0.2
P-AP      7E-07    3E-09    1E-04   3E-07   4E-07   8E-17   3E-17   1E-09
AP-N      0.001    3E-07    1E-13   1E-06   9E-08   1E-21   4E-14   1E-22

Taking all the results from this section together, we can see that our APLSDA algorithm and the Active algorithm perform similarly. Overall, the performance at five percent or less is unpredictable. Our algorithm produces slightly more consistent results for values over fifteen when considering the standard deviation. For the CNN, the APLSDA and AL algorithms perform similarly, and neither outperforms the other.

6.2.2 LSTM

Similarly, for the LSTM, we will first examine the average performance of the network. The numeric values can be found in Table 6.5, and the graphical representations can be found in Figures 6.6 and 6.7. In all cases, either the Active or the APLSDA algorithm performed the best; however, the statistical significance of these improvements is unreliable for values less than 30.

Table 6.5: Average F1 Score LSTM

          1       2       3       4       5       10      15      20
Normal    0.335   0.516   0.519   0.565   0.646   0.708   0.748   0.751
PLSDA     0.381*  0.505   0.541   0.59    0.668   0.721   0.751   0.755
Active    0.34    0.441   0.596   0.649*  0.703   0.75    0.774   0.783
APLSDA    0.531*  0.552   0.568   0.608   0.638   0.735   0.767*  0.775

          25      30      35      40      45      50      55      60
Normal    0.767   0.783   0.795   0.8     0.798   0.806   0.809   0.81
PLSDA     0.775   0.776*  0.784*  0.791   0.801   0.812   0.815   0.81
Active    0.794   0.805*  0.818*  0.819*  0.831*  0.848*  0.847*  0.854*
APLSDA    0.785   0.789   0.813*  0.826*  0.836*  0.844*  0.848*  0.866*

* denotes statistical significance compared to the normal approach

Figure 6.6: Average F1 Score LSTM (average combined F1 score vs. percent training data)

Figure 6.7: Change in Average F1 Score LSTM (difference in average combined F1 score relative to the normal approach vs. percent training data)

The max F1-Scores for the LSTM network can be found in Table 6.6, Figure 6.8, and Figure 6.9.
Again, the Active and the APLSDA algorithms outperform the PLSDA algorithm at all values. The statistical significance is the same as for the average: there is no pattern to the significance for values less than 30.

Table 6.6: Max F1 Score LSTM

          1       2       3       4       5       10      15      20
Normal    0.336   0.541   0.582   0.667   0.709   0.757   0.765   0.794
PLSDA     0.458*  0.532   0.618   0.7     0.715   0.746   0.773   0.795
Active    0.366   0.651   0.684   0.718*  0.732*  0.767   0.795*  0.809
APLSDA    0.571*  0.652   0.661   0.747   0.722   0.763   0.794*  0.8

          25      30      35      40      45      50      55      60
Normal    0.798   0.801   0.81    0.811   0.818   0.816   0.815   0.826
PLSDA     0.796   0.798   0.802*  0.808   0.814   0.823   0.824   0.823
Active    0.806   0.825*  0.836*  0.837*  0.843*  0.86*   0.868*  0.872*
APLSDA    0.81    0.819   0.833*  0.844*  0.852*  0.856*  0.864*  0.875*

* denotes statistical significance compared to the normal approach

Figure 6.8: Max F1 Score LSTM (maximum combined F1 score vs. percent training data)

Figure 6.9: Change in Max F1 Score LSTM (difference in maximum combined F1 score relative to the normal approach vs. percent training data)

We will next examine the standard deviation of the LSTM network, shown in Table 6.7 and Figure 6.10. Unlike the CNN, our APLSDA algorithm with the LSTM produced a smaller standard deviation for the training values of 2, 50, and 60. All algorithms produced the smallest deviations for some values, and there is no observable pattern for which algorithms will produce small standard deviations.

Table 6.7: Standard Deviation LSTM

          1        2        3        4        5        10       15       20
Normal    0        0.0134   0.0461   0.0659   0.041    0.072    0.0206   0.062
PLSDA     0.0432   0.0159   0.042    0.0692   0.0415   0.0234   0.0165   0.045
Active    0.0096   0.1146   0.0883   0.0474   0.0407   0.0087   0.0113   0.0119
APLSDA    0.0372   0.0566   0.0733   0.0798   0.0828   0.0247   0.0125   0.0123

          25       30       35       40       45       50       55       60
Normal    0.0544   0.0142   0.0092   0.0097   0.0167   0.0105   0.0047   0.0151
PLSDA     0.0148   0.0232   0.0113   0.0111   0.0132   0.0085   0.0075   0.0103
Active    0.0082   0.0114   0.0116   0.0107   0.0102   0.0119   0.0223   0.0174
APLSDA    0.015    0.0332   0.0157   0.0266   0.0124   0.0103   0.0214   0.0094

Figure 6.10: Change in Standard Deviation LSTM (difference in standard deviation relative to the normal approach vs. percent training data)

The p-values found in Table 6.8 show that three percent is the only training value at which the APLSDA and AL algorithms are not statistically different from each other. Overall, we can see that all algorithms except the PLSDA algorithm are significantly different from all others for values greater than thirty. For values less than thirty, there are no clear patterns to the significant differences between algorithms except for the previously mentioned APLSDA and AL comparison.
Table 6.8: P-Values LSTM

          1        2        3        4        5        10       15       20
A-P       0.032    0.136    0.167    0.062    0.151    0.008    0.003    0.081
A-N       0.189    0.072    0.072    0.003    0.017    0.125    0.016    0.188
A-AP      5E-08    0.009    0.274    0.032    0.001    1E-06    2E-04    9E-07
P-N       0.012    0.133    0.385    0.513    0.204    0.582    0.728    0.902
P-AP      1E-04    0.01     0.601    0.839    0.014    3E-05    5E-04    4E-05
AP-N      7E-08    0.098    0.107    0.222    0.789    0.313    0.03     0.287

          25       30       35       40       45       50       55       60
A-P       0.012    0.009    3E-04    8E-05    3E-04    5E-06    0.002    1E-04
A-N       0.185    0.003    0.002    0.006    0.001    5E-06    0.001    0.002
A-AP      5E-04    2E-05    5E-06    2E-05    7E-06    9E-09    5E-07    5E-07
P-N       0.622    0.312    0.048    0.184    0.582    0.216    0.052    0.903
P-AP      0.001    0.003    1E-04    3E-04    9E-05    3E-07    2E-08    1E-05
AP-N      0.372    0.657    0.01     0.017    5E-05    3E-07    3E-04    1E-07

Overall, the Active and APLSDA algorithms significantly improved the performance of the LSTM network. These two algorithms achieved the best average performance and the best max performance, and they were both consistently statistically significant for values greater than 25. However, the standard deviation of these algorithms is not consistently lower than that of the normal algorithm. Unlike the CNN, these two algorithms are significantly different from each other at nearly all points.

The most notable result from the LSTM is that the APLSDA algorithm started with a max F1 score of 0.571. This is over twenty points above the performance of the CNN at the same amount of training data. This result is surprising when examining the rest of the LSTM's performance; it indicates that with one percent of the data being used for training, the LSTM is already slightly better than guessing. While the LSTM does not progress to the same levels as the CNN, this result indicates some benefit to the LSTM structure. Notably, neither the PLSDA nor the AL algorithm offers the same performance at one percent. This could indicate that there is some value in combining them.

6.2.3 BERT

The final model that we tested was our BERT-based model. We had to treat this model differently than the others: we applied only the normal approach and trained the network once for each dataset division. We had to do this from a computational standpoint, as the BERT model took over three hours per epoch. While this prevents us from calculating an average or finding a true maximum, we can treat these results as a baseline for a large pre-trained model. The results from our testing can be found in Table 6.9 and Figure 6.11. From this table, we can see that the BERT model is unpredictable for values less than 15. In addition, the performance of the BERT model is lower than what we would have expected from other research. This may be because the ALBERT model struggles to transfer its learning from its pre-trained knowledge base of books and Wikipedia to movie reviews.

Table 6.9: BERT F1 Score

          1       2       3       4       5       10      15      20
Normal    0.384   0.539   0.336   0.365   0.506   0.339   0.602   0.635

          25      30      35      40      45      50      55      60
Normal    0.656   0.647   0.540   0.588   0.642   0.685   0.681   0.713
For all the algorithms, either BERT or the LSTM starts from a higher F1 score. The CNN outperforms the other networks as early as three percent training data. Even in the notable case with the APLSDA algorithm, where the LSTM starts from 0.571, the CNN reaches an F1 score of 0.745 at two percent. When examining the standard deviation of the networks, we can see that the 74 CNN has a smaller deviation. In general, the deviation decreases as the amount of training data increases. There is some variability in the change in deviation, but it trends down. This implies that the CNN is learning a better representation of the data in addition to being able to classify positive and negative sentiment. Finally, we will examine if there is a significant difference in how the different networks performed with the different algorithms. From Table 6.16, we can see a significant difference between the networks for values greater than twenty. There is no clear pattern of significance for values of twenty and under. There are no significant differences for the values of two, three, and four. This implies that for small values of training data, the networks are too variable to be able to statistically pick one as better than the other, regardless of the technique used. We can see that the CNN is statistically the better choice for higher values. 6.3.2 Compare Algorithms In addition to comparing the differences between the networks, we want to examine the different algorithms to see any patterns in how they performed with the two networks. Looking back at Figures 6.3 and 6.8, we can see that with less than ten percent of the training data, all the algorithms improved the performance of both networks. From values equal to or greater than ten percent, the AL and APLSDA algorithms outperform the PLSDA algorithm. The AL and APLSDA algorithms start outperforming the PLSDA algorithm for both networks by 20 points and outperforming the PLSDA algorithm by at least 50 points by the end. Comparing how the different algorithms affect the average and max scores as the amount of training data changes shows some interesting differences between the networks. For the CNN, the APLSDA algorithm increases the average performance by 0.336 from one percent to two and the max by 0.310. All the algorithms created a significant increase in the max performance for the CNN, but only the APLSDA al75 gorithm created such a significant increase in the average. For the LSTM, there were no sudden changes in performance, but the APLSDA algorithm did start surprisingly high. The standard deviation is unpredictable when determining if the algorithms create a smaller concentration of results. There are no apparent patterns for either network. 6.4 Overall Comparisons Overall, we can conclude that the CNN outperformed the other networks across all algorithms. The only exception is for values one and two, where either the LSTM or BERT would perform better. As for which algorithm performs best, the AL and APLSDA algorithms outperformed the PLSDA algorithm. However, we can not differentiate between these two algorithms as they performed almost the same. There are two notable exceptions to this. One is for the CNN. The APLSDA algorithm increases the average performance from one percent to two significantly more than the AL algorithm. The other exception is the LSTM, where the APLSDA algorithm produced surprisingly high values for one percent of the training data. 
Table 6.10: Active Average F1 Score Comparison

Percent   CNN     LSTM    BERT
1         0.335   0.34    0.384
2         0.433   0.441   0.539
3         0.626   0.596   0.336
4         0.667   0.649   0.365
5         0.697   0.703   0.506
10        0.801   0.75    0.339
15        0.816   0.774   0.602
20        0.824   0.783   0.635
25        0.84    0.794   0.656
30        0.848   0.805   0.647
35        0.852   0.818   0.540
40        0.869   0.819   0.588
45        0.879   0.831   0.642
50        0.894   0.848   0.685
55        0.894   0.847   0.681
60        0.914   0.854   0.713

Table 6.11: Active Max F1 Score Comparison

Percent   CNN     LSTM    BERT
1         0.335   0.366   0.384
2         0.651   0.651   0.539
3         0.783   0.684   0.336
4         0.753   0.718   0.365
5         0.795   0.732   0.506
10        0.815   0.767   0.339
15        0.829   0.795   0.602
20        0.837   0.809   0.635
25        0.847   0.806   0.656
30        0.854   0.825   0.647
35        0.87    0.836   0.540
40        0.873   0.837   0.588
45        0.883   0.843   0.642
50        0.896   0.86    0.685
55        0.908   0.868   0.681
60        0.918   0.872   0.713

Table 6.12: PLSDA Average F1 Score Comparison

Percent   CNN     LSTM    BERT
1         0.335   0.381   0.384
2         0.452   0.505   0.539
3         0.683   0.541   0.336
4         0.684   0.59    0.365
5         0.72    0.668   0.506
10        0.782   0.721   0.339
15        0.782   0.751   0.602
20        0.8     0.755   0.635
25        0.82    0.775   0.656
30        0.822   0.776   0.647
35        0.823   0.784   0.540
40        0.82    0.791   0.588
45        0.825   0.801   0.642
50        0.828   0.812   0.685
55        0.832   0.815   0.681
60        0.832   0.81    0.713

Table 6.13: PLSDA Max F1 Score Comparison

Percent   CNN     LSTM    BERT
1         0.335   0.458   0.384
2         0.706   0.532   0.539
3         0.756   0.618   0.336
4         0.766   0.7     0.365
5         0.791   0.715   0.506
10        0.813   0.746   0.339
15        0.816   0.773   0.602
20        0.826   0.795   0.635
25        0.826   0.796   0.656
30        0.831   0.798   0.647
35        0.83    0.802   0.540
40        0.831   0.808   0.588
45        0.835   0.814   0.642
50        0.85    0.823   0.685
55        0.837   0.824   0.681
60        0.836   0.823   0.713

Table 6.14: APLSDA Average F1 Score Comparison

Percent   CNN     LSTM    BERT
1         0.335   0.531   0.384
2         0.671   0.552   0.539
3         0.706   0.568   0.336
4         0.717   0.608   0.365
5         0.771   0.638   0.506
10        0.788   0.735   0.339
15        0.819   0.767   0.602
20        0.83    0.775   0.635
25        0.84    0.785   0.656
30        0.851   0.789   0.647
35        0.864   0.813   0.540
40        0.865   0.826   0.588
45        0.875   0.836   0.642
50        0.891   0.844   0.685
55        0.902   0.848   0.681
60        0.913   0.866   0.713

Table 6.15: APLSDA Max F1 Score Comparison

Percent   CNN     LSTM    BERT
1         0.335   0.571   0.384
2         0.745   0.652   0.539
3         0.779   0.661   0.336
4         0.794   0.747   0.365
5         0.797   0.722   0.506
10        0.814   0.763   0.339
15        0.826   0.794   0.602
20        0.839   0.8     0.635
25        0.846   0.81    0.656
30        0.858   0.819   0.647
35        0.867   0.833   0.540
40        0.874   0.844   0.588
45        0.883   0.852   0.642
50        0.893   0.856   0.685
55        0.904   0.864   0.681
60        0.915   0.875   0.713

Table 6.16: P-Values Between LSTM and CNN

Percent   Normal   PLSDA   Active   APLSDA
1         1E-05    0.089   0.007    0.104
2         0.097    0.144   0.118    0.084
3         0.095    0.093   0.095    0.116
4         0.088    0.1     0.075    0.115
5         0.119    0.049   0.096    0.091
10        0.069    0.043   0.033    0.037
15        0.036    0.031   0.027    0.029
20        0.055    0.044   0.024    0.031
25        0.048    0.026   0.026    0.031
30        0.024    0.03    0.024    0.04
35        0.016    0.024   0.026    0.028
40        0.016    0.02    0.027    0.029
45        0.019    0.017   0.025    0.024
50        0.015    0.011   0.025    0.025
55        0.012    0.01    0.035    0.032
60        0.015    0.014   0.034    0.025

Chapter 7
Conclusion

We will examine our research questions individually, starting with Question 1.

1. Does combining AL and LE offer better performance?

Our combined APLSDA algorithm performed significantly better than the PLSDA algorithm and the unmodified approaches. However, it was only significantly different from the AL approach for the CNN at three percent training data. For the LSTM, the differences between the APLSDA and AL approaches were significant at all values except three and four percent.
However, the performance of the algorithms was not consistent: sometimes the APLSDA algorithm performed best and sometimes the AL algorithm did, with no clear pattern. With regard to our expectations, our algorithm did result in better performance than the unmodified approach, while our APLSDA approach and the AL approach were comparable. Although the average performance of our APLSDA algorithm was better than that of the AL algorithm for the CNN at training values up to 40 percent, the differences were minor and not significant.

2. Is it possible to use this technique to create a classifier whose performance improves faster?

Our algorithm did help the classifiers improve faster than all the other techniques. For training values of one or two percent, our algorithm significantly outperformed the other techniques. This is most apparent in the LSTM, where the APLSDA algorithm started over ten points higher than the other algorithms for both its max F1 and its average. For higher training values, the APLSDA and AL algorithms are again similar when examining the rates of improvement. These results match our expectation that our technique would help the classifiers improve faster than the other techniques.

3. Does the proposed algorithm behave differently with different classifier architectures?

To answer Question 3, we can conclude that the CNN outperformed both the LSTM and the BERT model for training values higher than two percent. For the very small values, the LSTM performed surprisingly well. While this performance is notable, overall the CNN is the more useful architecture. Our expectation that the APLSDA algorithm would help improve the performance of any classifier regardless of architecture was supported by our research: both the LSTM and the CNN saw significant improvements with our APLSDA algorithm. We did not expect that the CNN would outperform the LSTM so drastically.

4. Is there a large variance between the average and best performance of the classifiers?

By examining the standard deviation of the different algorithms, we were able to see how their performance changed as the amount of training data increased. Overall, all the algorithms trended towards a smaller standard deviation. Our APLSDA algorithm had the smallest deviation at 60 percent training data; however, the other algorithms had smaller deviations leading up to 60 percent. We can conclude that all algorithms become more consistent in their predictions as the amount of training data increases. While the PLSDA algorithm performed the worst of the three improvements tested, it did have much lower standard deviations for some of the smaller training values. This result did not align with our expectations: we had predicted that introducing artificial data through the PLSDA and APLSDA approaches might increase the deviation of those classifiers, but this does not appear to be the case. This result is encouraging, as it suggests that the data points generated by the PLSDA and APLSDA algorithms are similar to the real data.

Overall, the performance of our algorithm and the AL algorithm is too similar to say that one is better than the other. However, our research produced some useful results. We found significant improvements at all training percentages when using either technique, and when focusing on very small amounts of training data, we found that our algorithm outperformed the AL algorithm.
Finally, by examining the average performance of the classifiers as well as the max, we saw that there can be a significant amount of variance between the performance of two classifiers trained on the same data. All these findings open up interesting avenues for continuing this research.

7.1 Future Work

Our research brought forth several future research avenues. These include investigating the LSTM further, trying different combinations of techniques, further investigating the average performance of classifiers, and examining additional classifier architectures.

One of the most interesting results from this research was the surprisingly strong performance of the LSTM with the APLSDA algorithm at one percent training data. It achieved a performance that was slightly better than guessing when the other classifiers were only capable of supplying a single label. Investigating what this classifier has learned and how it differs from the CNN and the BERT model is an area of potential future research.

The second avenue of research is combining other techniques. We selected two techniques to combine, but several other AL and LE algorithms could be beneficial to use together. There is also potential to examine different ways of combining the techniques. We only examined a complete combination of the techniques; it would be useful to examine whether starting from one technique before transitioning to another would offer any benefits.

We also found that the performance of the classifiers was not as consistent as we would have hoped, especially for training values of less than five percent. Future research could focus on reducing the variance of the classifiers at low training values.

Finally, our research showed that there can be considerable differences in the performance of one classifier when compared to another. This encourages further research into different architectures and experimentation with unusual neural network structures.