STUDY OF DOCUMENT RETRIEVAL USING LATENT SEMANTIC INDEXING (LSI) ON A VERY LARGE DATA SET

by

A N K Zaman

B.Sc. Computer Science and Technology, University of Rajshahi, Bangladesh
M.Sc. Computer Science and Technology, University of Rajshahi, Bangladesh

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MATHEMATICAL, COMPUTER, AND PHYSICAL SCIENCES (COMPUTER SCIENCE)

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

January 2010

© A N K Zaman, 2010

Abstract

The primary purpose of an information retrieval system is to retrieve all the documents that are relevant to a user query. The Latent Semantic Indexing (LSI) based ad hoc document retrieval task investigates the performance of retrieval systems that search a static set of documents using new questions/queries. The performance of LSI has been tested on several smaller datasets (e.g., the MED and CISI abstracts); however, LSI has not been tested on a large dataset. In this research, we concentrate on the performance of LSI on a large dataset. Stop word lists and term weighting schemes are two key parameters in the area of information retrieval. We investigated the performance of LSI using three different stop word lists, and also without removing stop words from the test collection.
We also applied three different term weighting schemes (raw term frequency, log-entropy, and tf-idf) to measure the retrieval performance of LSI. We observed that, firstly, for an LSI based ad hoc information retrieval system, a tailored stop word list must be assembled for every unique large dataset. Secondly, the tf-idf term weighting scheme shows better retrieval performance than the log-entropy and raw term frequency weighting schemes even when the test collection becomes large.

• Plurals:
  2. The word stresses is stemmed as stress because of the rules sses -> ss and ss -> ss
• Participles:
  1. The word examined is stemmed as examin because of the rule ed -> 0
  2. The word playing is stemmed as play because of the rule ing -> 0
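The following is a minimal Python sketch of suffix stripping of the kind illustrated above. It is not the thesis software and applies only the rules named here; the real Porter stemmer runs several ordered steps of rules guarded by measure-based conditions on the remaining stem.

```python
# Minimal illustration of Porter-style suffix-stripping rules.
# Only the first matching rule is applied.
RULES = [
    ("sses", "ss"),   # stresses -> stress
    ("ing", ""),      # playing  -> play
    ("ed", ""),       # examined -> examin
]

def stem(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ("stresses", "examined", "playing"):
    print(w, "->", stem(w))   # stress, examin, play
```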
1.3 Problem Statement

We have reviewed the published work on LSI over the last 20 years, and we found open questions concerning LSI, stop word lists, term weighting schemes, and large datasets. We are therefore interested in exploring the following issues:

• LSI has not been tested on a large standard dataset with the most accepted weighting schemes, tf-idf and log-entropy. So, we are interested in the performance of LSI on a large standard dataset.
• TREC encourages text retrieval research, but TREC does not provide any standard rule for using stop words in IR research. So, we propose an algorithm to produce a stop word list based on the TREC-8 LA Times dataset for our experiments.
• Different information retrieval research groups use different stop word lists. Our aim is to find the effect of stop words on the IR process in the case of a large standard dataset, and also to find the effect of the arbitrary use of a stop word list (i.e., one not tailored for a unique dataset) on an LSI-based retrieval system.
• Term weighting schemes improve retrieval performance; however, we did not find any concrete evidence of which weighting scheme gives better performance in the case of a large dataset. So, we decided to find the effects of different term weighting schemes (raw term frequency, tf-idf, and log-entropy) on the LSI based IR process in the case of a large standard dataset.

1.4 Thesis Outline

The rest of the thesis is organized as follows: Chapter 2 focuses on the theories of IR and LSI as well as the evaluation methods of retrieval performance. Chapter 3 is the literature review of Latent Semantic Indexing (LSI) research. Chapter 4 focuses on the characteristics of the large test collection and the associated query set; it also illustrates the different phases of data pre-processing, e.g., stemming. Chapter 5 focuses on the experimental setup, the obtained results, and the evaluations. Chapter 6 contains the concluding remarks.

Chapter-2 Theories of Information Retrieval Techniques

2.1 Introduction

With the exponential development of information technologies, a vast amount of intellectual resources has been recorded in computer-readable forms that can be digitally transmitted or processed. The evolution of communication networks has made an immense number of datasets available in the public domain that can be used by ordinary users. However, due to the tremendous volume of the data, it has become extremely difficult for users to find the desired information. This challenge has led to a huge demand for information management technologies that can facilitate access to information for users. [Zeng 2005] The objective of this thesis is to study the effects of different stop word lists and different term weighting schemes on an LSI based IR system with a very large dataset.

2.2 Background and Related Theory

Information retrieval encompasses the techniques for retrieving relevant information from storage in response to a user's request. In the following sections, a brief overview of the field of information retrieval is provided in order to introduce the proposed research work. The following sections present a historical review of a number of IR approaches that are relevant to the thesis topic. The evaluation strategy that is used to estimate the retrieval performance of an IR method is also covered.

2.3 Information Retrieval vs Data Retrieval

When the term information retrieval (IR) is mentioned in this thesis, it refers to automatic information retrieval systems that search unstructured databases for data records related to the user's queries, and inform the user of the existence/nonexistence as well as the location of the documents. Although the term information has a close association with the term data, they are not equivalent concepts. The differences between information retrieval and data retrieval are illustrated in Table 2.1. [Rijsbergen 1979]

                        Information Retrieval           Data Retrieval
Query language          Natural                         Artificial (not natural)
Query specification     Incomplete                      Complete
Matching                Best match, partial matching    Exact match
Items wanted            Relevant                        Matching
Model                   Probabilistic or statistical    Deterministic

Table 2.1: Information Retrieval vs Data Retrieval

In IR, the query statement is usually expressed in natural language and does not necessarily need to be structured. On the contrary, data retrieval usually requires the request to comply with a specified syntax and to provide a complete description of the desired information. With regard to evaluating the retrieved records, IR judges relevant items to be good matches, and among them the most relevant one, as judged by the user, is determined to be the best match; data retrieval only regards exact matches as acceptable. Due to the differences between information retrieval documents (unstructured records) and data retrieval documents (structured records), information retrieval engines might, but are not guaranteed to, retrieve all the relevant documents from storage. Even if IR systems manage to retrieve a list of relevant records, such a list might not be complete. In contrast, data retrieval systems are guaranteed to output all occurrences of the records that match.

2.4 Information Retrieval Systems

An information retrieval system is a "device interposed between a potential user of information and the information collection itself". [Harter 1986] The purpose of an IR system is to capture wanted items and to filter out unwanted items. [Harter 1986] A typical IR system performs representation, storage, and retrieval of unstructured data, and may contain some or all of the following parts/components: indexing, query operation, matching, output module, and feedback module. The indexing usually contains two primary processes.
The first process conducts conceptual analysis of the information resources in the collection to obtain the contained concepts. These concepts, usually called effective terms, make up a system vocabulary that is applicable to all the information pieces in the system. The output of the first process flows to the second stage, in which a translation mechanism is employed and a database of information representations is created. When an information request is posed, the query operation process parses it into its constituent elements. An analysis is conducted over its conceptual content, and the query is transformed into a representation that is consistent with all the information items in storage. Given the query representation, the matching mechanism evaluates the relatedness of all the potential targets to the query and obtains a ranking by relevance. An ordered set of information items is returned to the user by the output module. When interaction between the user and the retrieval system is available, one can communicate with the system through the feedback module by refining the query during a search session in the light of sample retrievals.

[Figure 2.1: Framework of an Information Retrieval System. Information resources pass through conceptual analysis and translation into a database of information representations; an information request passes through the same conceptual analysis and translation into a request representation, which the evaluation strategy matches against the database.]

The user interface serves as a bridge connecting the client to the other modules of the system. The infrastructure of a typical IR system is depicted in Fig. 2.1.

2.5 Information Retrieval Approaches

In the field of information retrieval, there are two major categories of techniques: statistical analysis and semantic analysis. Statistical approaches consider free-form natural language documents to be pure data records and index them in terms of some statistical measure. Assessment of the relevance between a pair of documents is also based on a certain statistical metric. Semantic approaches, however, reproduce some degree of the understanding of natural language texts that a human may provide. By far the greatest amount of work has been devoted to statistical approaches, which fall into four categories: classical Boolean, extended Boolean, vector space, and probabilistic. The following sections briefly discuss the classical Boolean, extended Boolean, and vector space approaches.

2.5.1 Classical Boolean Approach

The classical Boolean approach is based on the theory of Boolean algebra. A conventional Boolean query combines terms with the classical Boolean operators AND, OR, and NOT, and is evaluated by the classical rules of Boolean algebra. Such a model is very straightforward and relatively tractable to implement. However, the classical Boolean method has some major limitations due to the characteristics of the standard Boolean model. Like any Boolean expression, the query has only two values: true or false. Correspondingly, a document is either relevant to a query or non-relevant to it. Therefore, no ranking strategy is possible in the classical Boolean method. With regard to term weighting, only two values are available: 0 for an absent term and 1 for a present term. Such an all-or-nothing condition tends to have the effect that either an intimidatingly large number of documents, or none at all, are retrieved. [Harman 1992] As well, the classical Boolean rules tend to produce counterintuitive results because of this all-or-nothing characteristic. For example, in response to a multi-term OR operation, "a document containing all (or many of) the query terms is not treated better than a document containing one term" [Salton & Buckley 1987]. Similarly, in response to a multi-term AND operation, "a document containing all but one query term is treated just as badly as a document containing no query term at all" [Salton & Buckley 1987]. The ranking inefficiency of the classical Boolean model is a considerable issue that makes this method ineffective in document retrieval. The sketch below illustrates this all-or-nothing behaviour.
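The following is a minimal sketch of the classical Boolean model; the documents and query terms are invented for illustration.

```python
# Classical Boolean retrieval: each document is a set of terms, and a
# query is evaluated with set operations. Matching is all-or-nothing,
# so no ranking of the results is possible.
docs = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "theory"},
    "d3": {"retrieval", "ranking"},
}

def matches(doc_terms, required, forbidden=frozenset()):
    """AND over `required`, NOT over `forbidden`; returns True or False."""
    return required <= doc_terms and not (forbidden & doc_terms)

# Query: information AND retrieval AND NOT theory
hits = [d for d, terms in docs.items()
        if matches(terms, {"information", "retrieval"}, {"theory"})]
print(hits)  # ['d1'] -- d3, which matches one query term, fares no
             # better than d2, which is excluded outright
```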
2.5.2 Extended Boolean Approach

As mentioned above, the classical Boolean scheme suffers from ranking inefficiency, and several extended Boolean models that integrate ranking strategies have been discussed in the literature to overcome it. [Waller & Kraft 1979], [Salton & McGill 1983], [Paice 1984], [Zimmerman 1991], [Grei et al 1997] The extended Boolean models employ extended Boolean operators, also called soft Boolean operators. These operators assign weights to the terms in each document. They also extend the truth value range from the discrete two-element set {0, 1} of the classical Boolean model to the continuous range [0, 1]. In other words, the operators evaluate their arguments to a number corresponding to the estimated degree to which a given logical expression matches the given document. By doing this, the extended Boolean methods are able to provide ranked output, allowing some documents to satisfy the query condition more closely than others. [Lee 1994] Therefore, extended Boolean methods manage to overcome the limitation of the classical Boolean approach. The literature suggests that the extended Boolean model can achieve greater IR performance than either the classical Boolean or the vector space model. [Greengrass 2001] However, this performance improvement comes at a price: formulating effective extended Boolean queries involves more thought and expertise in the query domain than either the classical Boolean method or the vector space approach. [Greengrass 2001]

2.5.3 Vector Space Approach

A vector space model (or term vector model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as terms. A vector space model is used for various IR operations such as information filtering, information retrieval, indexing, and relevance ranking. The vector space approaches have achieved great success in IR by applying the theory of linear algebra to this representation model. In the traditional vector space method, the union of the effective terms defines a document space, so that each distinct term represents one dimension of this space. For a given document, each term is assigned a numeric weight which estimates the usefulness of the term as a descriptor of the given document, i.e., the discriminatory power of the term for the document. By exploiting the weights of all the terms of a document, the document is then encoded as a term vector in the document space. It is worth noting that a query is usually considered to be a pseudo-document and can also be represented as a term vector.
Perspectives of the document space and the term space can be combined by viewing the entire collection as a term-by-document matrix, also called an indexing matrix. Each row of this matrix represents a term, and each column represents a document. The element at row i, column j reflects the importance of term i in representing the characteristics of document j. A data set of d documents and t terms can be represented by the matrix shown below:

$$M = \begin{pmatrix} m_{11} & m_{12} & \cdots & m_{1d} \\ m_{21} & m_{22} & \cdots & m_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ m_{t1} & m_{t2} & \cdots & m_{td} \end{pmatrix}$$

Please note that these denotations (d, t, and M) apply throughout the rest of this text. An example of a term-by-document matrix is given below. Let us consider three documents (d1, d2, and d3), each containing one sentence:

d1: You read magazine.
d2: You play cricket.
d3: You like like like pizza.

The term-by-document matrix takes the following form for the above documents:

             you   read   play   cricket   like   pizza
d1            1     1      0        0        0      0
d2            1     0      1        1        0      0
d3            1     0      0        0        3      1

Table 2.2: Term-by-document Matrix

We can also write this term-by-document matrix as follows:

$$M_{6\times 3} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 3 \\ 0 & 0 & 1 \end{pmatrix}$$
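As a sketch, the matrix above can be built mechanically from the three example sentences. The tokenizer here is a plain lowercase split, an assumption made only for illustration:

```python
# Build the 6x3 term-by-document matrix of Table 2.2 from the example
# sentences. Rows are terms, columns are documents; entries are raw
# term frequencies.
docs = {
    "d1": "You read magazine.",
    "d2": "You play cricket.",
    "d3": "You like like like pizza.",
}

terms = ["you", "read", "play", "cricket", "like", "pizza"]

def tokenize(text):
    return text.lower().replace(".", "").split()

matrix = [[tokenize(docs[d]).count(t) for d in ("d1", "d2", "d3")]
          for t in terms]

for t, row in zip(terms, matrix):
    print(f"{t:8s} {row}")
# you      [1, 1, 1]
# read     [1, 0, 0]
# ...
# like     [0, 0, 3]
```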
2.5.3.1 Term Weighting

One significant issue that any vector space model needs to address is term weighting, i.e., assigning a weight to a term for a document so that the assigned weight properly reflects the contribution of the term in distinguishing the document from other documents. A variety of weighting schemes have been proposed, which basically fall into two categories: local weighting and global weighting. Local weighting schemes attempt to reflect the importance of a term within a given document by a document-specific statistic. Usually, the local weights assigned to identical terms vary among documents. Some popular local weighting functions include the raw term frequency, binary, and logarithm of the term frequency. Let $L_{ij}$ denote the local weight of term i in document j, and let $tf_{ij}$ denote the frequency with which term i appears in document j. The local weights for the three weighting schemes are as follows:

$$\text{Raw term frequency:}\quad L_{ij} = tf_{ij} \tag{1a}$$

$$\text{Binary:}\quad L_{ij} = \begin{cases} 1 & \text{if } tf_{ij} \ge 1 \\ 0 & \text{otherwise} \end{cases} \tag{1b}$$

$$\text{Logarithm:}\quad L_{ij} = \begin{cases} 1 + \log tf_{ij} & \text{if } tf_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{1c}$$

There is a critical problem associated with raw term frequency: all terms are considered equally important when assessing relevancy to a query, even though certain terms have little or no discriminating power in determining relevance. Moreover, a document with 10 occurrences of a term is more relevant than a document with one occurrence of the term, but not ten times as relevant.

The problem with the binary weighting scheme is that it fails to capture the fact that some terms are more important than others, since it only considers the presence or absence of a term in a document. The logarithm weighting scheme, in turn, exploits the logarithm function to transform the raw term frequency, thereby dampening the effects of large differences in the frequencies of individual terms. For example, let $tf_1 = 1000$ and $tf_2 = 10$ be the raw frequencies of two terms. The ratio of the two raw term frequencies is $tf_1/tf_2 = 100$. Under the logarithm weighting scheme (equation 1(c)), taking base-10 logarithms, the local weight $L_1$ for $tf_1$ is 4 and the local weight $L_2$ for $tf_2$ is 2, so the ratio of the two local weights is $L_1/L_2 = 4/2 = 2$. It is thus clear that the logarithm weighting scheme reduces large differences in the frequencies of individual terms.

Global weighting schemes, on the other hand, attempt to attenuate the effect of terms that occur too often in the collection to be meaningful in determining the relevance of documents to a query. The idea is to scale down the weights of terms with high collection frequency (defined as the total number of occurrences of a term in the collection) by reducing the tf (term frequency) weight of a term by a factor that grows with its collection frequency. In addition to estimating document-specific statistics, characterizing a term's overall importance in the whole collection can also be useful. Global weighting strategies are designed to measure the distribution of a term within the given collection, and are thus able to estimate how frequently a term occurs in a certain document by chance. Generally, they give less weight to terms that occur frequently or occur in many documents, because these terms are not considered to be strong descriptors of any specific document. Four well-known global weighting schemes are: normalized term frequency (normal for short), inverse document frequency (idf for short), term frequency-idf (tf-idf for short), and entropy. Let $G_i$ be the global weight assigned to term i, let $gf_i$ be the frequency with which term i occurs in the entire collection, let $df_i$ be the number of documents in which term i occurs, and let $d$ be the number of documents in the whole collection. The evaluation of $G_i$ by the four global weighting methods is discussed below.

The normal weighting scheme normalizes the length of the vector for a term to 1, giving high weight to infrequent terms in the collection. However, normal weighting depends only on the sum of the squared frequencies, not on the distribution of those frequencies per se. The following equation presents the normal weighting scheme:

$$\text{Normal:}\quad G_i = \frac{1}{\sqrt{\sum_j tf_{ij}^2}} \tag{2a}$$

The tf-idf and idf weighting schemes are closely related because both weight terms inversely by the number of documents in which they appear. Neither method depends on the distribution of terms; rather, they depend on the number of different documents in which a term occurs. The tf-idf weighting scheme is used to scale the frequency of a term. By nature, tf assigns high weights to frequent terms and low weights to rare terms; idf (inverse document frequency), on the other hand, assigns high weight to a rare term and low weight to a frequent term. So the tf-idf term weighting scheme produces a composite weighting scheme with the following characteristics:

• The weight of a term is highest when the term occurs many times within a small number of documents.
• The weight of a term becomes lower when the term occurs fewer times in a document or occurs in many documents.
• The weight of a term is lowest when the term occurs in virtually all documents.
• The tf-idf weight values are strictly smaller than 1.

Reviewing the tf-idf weighting scheme in the IR literature leads to the following points being noted:

• tf-idf attenuates the effect of terms that occur too often in the input dataset.
• The discriminatory power of tf-idf allows the retrieval engine to quickly find relevant documents that are likely to satisfy the user.
• tf-idf exhibits an overall advantage over other term weighting systems. [TREC-3, 1997]

The following equations present the idf and tf-idf weighting schemes:

$$\text{idf:}\quad G_i = \log\!\left(\frac{d}{df_i}\right) \tag{2b}$$

$$\text{tf-idf:}\quad G_{ij} = tf_{ij} \times idf_i \tag{2c}$$

The entropy scheme is based on information-theoretic ideas and is the most sophisticated weighting scheme. It exploits the distribution of terms over documents. The main idea is to assign less weight to terms that are equally distributed over all the documents, and more weight to terms that are concentrated in a few documents. [Dumais 1991], [Manning et al 2008] The following equation presents the entropy weighting scheme:

$$\text{Entropy:}\quad G_i = 1 + \sum_j \frac{p_{ij}\,\log p_{ij}}{\log d}, \quad\text{where } p_{ij} = \frac{tf_{ij}}{gf_i} \tag{2d}$$

Since $\log p_{ij} \le 0$, the sum is non-positive, and $G_i$ lies between 0 (for a term spread evenly over all documents) and 1 (for a term concentrated in a single document).

All of the global weighting schemes share the principle of assigning less weight to terms that occur frequently. Global weights involve variations in the relative importance of local frequency, global frequency, and document frequency. We can combine local and global weighting schemes to measure the weight of a term, as each weighting scheme has its own advantages and limitations; there is no fixed solution for choosing a term weighting scheme. Both local and global weights are used to measure the term weights; the value of the element in the i-th row and j-th column can be evaluated as shown below:

$$m_{ij} = L_{ij} \times G_i \tag{3}$$

Looking at the individual values of m in the matrix M [section 2.5.3] and the local and global weighting equations [1(a)-1(c) and 2(a)-2(d)], it is clear that the value of m depends on both the local and global weights of terms. (A short sketch combining these schemes in code appears at the end of Section 2.5.3.2.)

2.5.3.2 Normalization (equivalence classing of terms)

After obtaining the tokens from the documents and the query, the simple case is when tokens in the query exactly match tokens in the token list of a document. However, there are many situations in which two character sequences are not quite the same, but we would like a match to occur. For instance, if we search for UK, we might hope to also match documents containing U.K. Term/token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For example, if the tokens anti-war and antiwar are both mapped onto the term anti-war, in both the document text and queries, then searches for one term will retrieve documents that contain either. [Manning et al 2008]
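To make Section 2.5.3.1 concrete, the following is a minimal numpy sketch (not the thesis software) that combines a local and a global weight according to equation (3): raw term frequency with idf (tf-idf), and logarithm local weights with entropy global weights (log-entropy), applied to the toy matrix of Table 2.2.

```python
import numpy as np

# Term-by-document matrix of Table 2.2 (rows: terms, columns: documents).
M = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 3],
              [0, 0, 1]], dtype=float)
t, d = M.shape

# tf-idf: raw term frequency (1a) scaled by idf (2b), per equation (3).
df = np.count_nonzero(M, axis=1)       # documents containing each term
idf = np.log(d / df)
tfidf = M * idf[:, None]               # m_ij = L_ij * G_i

# log-entropy: logarithm local weight (1c) times entropy global weight (2d).
L = np.zeros_like(M)
L[M > 0] = 1 + np.log(M[M > 0])
gf = M.sum(axis=1)                     # collection frequency of each term
p = M / gf[:, None]
plogp = np.zeros_like(p)
plogp[p > 0] = p[p > 0] * np.log(p[p > 0])
G = 1 + plogp.sum(axis=1) / np.log(d)  # in [0, 1]; 0 for evenly spread terms
log_entropy = L * G[:, None]

print(np.round(tfidf, 3))        # "you" occurs in every document: weight 0
print(np.round(log_entropy, 3))
```

Note how the term "you", which occurs in all three documents, receives a global weight of 0 under both schemes, matching the characteristics listed above.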
The major problem with this method is that it assumes all terms are independent, orthogonal dimensions of the document space, and thus simply ignores the relationships among terms. However, since there are in fact strong associations among terms in natural languages, this assumption is never satisfied. [Hull 1994] Another drawback of this model in some applications is that the number of terms occurring in a collection can be quite large, so the traditional term-based document space tends to have a large number of dimensions. Some alternative approaches, such as LSI, build on the traditional vector space model and have overcome these limitations.

2.5.4 Latent Semantic Indexing

Latent semantic indexing (LSI) is a method that combines the ideas of the vector space model and singular value decomposition (SVD).

2.5.4.1 Motivation of LSI

In the research of retrieving free-form natural language data, it is always useful to analyze the features of human verbal behavior. Two issues are discussed the most: synonymy and polysemy. Synonymy refers to the state in which two or more words or expressions have the same, or nearly the same, meaning in some or all senses. Polysemy describes the case in which one word has multiple meanings. These characteristics of natural languages cause deficiencies in IR algorithms that do not offer comparison methods on terms. The latent semantic indexing method was proposed in order to capture the statistical relationships among terms and accordingly provide an effective solution to the problems of synonymy and polysemy, which cannot be tackled by either word-based approaches or the traditional vector space approach.

2.5.4.2 Theory

Latent semantic indexing, also called latent semantic analysis (LSA), was first proposed by Deerwester et al. [Deerwester et al 1990] LSI assumes that, in textual data, there is some underlying latent semantic structure that is partially obscured by the randomness of word choice with respect to retrieval. This structure can be estimated by statistical techniques and the obscuring noise removed. Like the traditional vector space approach, the LSI method starts with a term-by-document matrix that represents the association of terms to documents. It applies a dimensional reduction scheme, singular value decomposition (SVD), to the matrix to construct a reduced, unified semantic space for retrieval. This smaller space, called the LSI space, consists of derived dimensions that are assumed to convey truly independent concepts. By employing the dimensional reduction strategy, LSI not only captures most of the important underlying semantic structure in associating terms with documents, but also has a better chance of removing the noise or possible variability in word usage.

2.5.4.3 Singular Value Decomposition

Singular value decomposition is an effective dimensional reduction scheme. It is closely related to a number of mathematical and statistical techniques that have been widely used in other fields, such as principal component analysis (PCA) for image processing and face recognition. SVD has proven to be a very good choice for uncovering latent semantic structure (see [Deerwester et al 1990] for a further discussion of SVD and the alternative models).
SVD can be applied to an arbitrary rectangular matrix to decompose it into three matrices containing singular vectors and/or singular values. These three matrices, with their special forms, show a breakdown of the original matrix into linearly independent components or factors. Many of these components are very small, leading to an approximate model that contains many fewer dimensions. Thus, for information retrieval purposes, SVD provides a reduced model for representing the term-to-term, document-to-document, and term-to-document relationships. By dimension reduction, it is possible for documents with somewhat different profiles of term usage to be mapped into the same vector of factor values [Deerwester et al 1990]. This property helps to eliminate the noise in the original data, thus improving the reliability of the algorithm.

Suppose we obtained a t×d term-by-document matrix M from the collection indexing process of the traditional vector space method. We can apply SVD to M, which is then decomposed into three special matrices U, S, and V. The decomposition can be written as

$$M = USV^T \tag{4a}$$

where U is the t×t orthogonal matrix ($UU^T = I_t$) having the left singular vectors of M as its columns, V is the d×d orthogonal matrix ($VV^T = I_d$) having the right singular vectors as its columns, and S is the t×d diagonal matrix having the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(t,d)} \ge 0$ on its diagonal. As an example, consider the following t×d matrix M, where t = 3 and d = 2:

$$M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$$

Decomposing this matrix by equation 4(a), where U is t×t, S is t×d, and $V^T$ is d×d, we get the following values:

$$U = \begin{pmatrix} -0.816 & 0.000 & -0.577 \\ -0.408 & 0.707 & 0.577 \\ -0.408 & -0.707 & 0.577 \end{pmatrix} \qquad S = \begin{pmatrix} 1.732 & 0.000 \\ 0.000 & 1.000 \\ 0.000 & 0.000 \end{pmatrix} \qquad V^T = \begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix}$$

The reduced singular value decomposition of matrix M, which drops the zero row of S and the corresponding column of U, is given below:

$$M = USV^T = \begin{pmatrix} -0.816 & 0.000 \\ -0.408 & 0.707 \\ -0.408 & -0.707 \end{pmatrix} \begin{pmatrix} 1.732 & 0.000 \\ 0.000 & 1.000 \end{pmatrix} \begin{pmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{pmatrix}$$
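The decomposition above can be checked with a few lines of numpy. This is only a sketch; the signs of the singular vectors may differ between libraries, since singular vectors are determined only up to sign.

```python
import numpy as np

# The 3x2 example matrix from Section 2.5.4.3.
M = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Full SVD: M = U S V^T (equation 4(a)).
U, s, Vt = np.linalg.svd(M, full_matrices=True)
print(np.round(s, 3))            # [1.732 1.   ]  -- the singular values

# Reduced (rank-k) SVD with k = 2 keeps only the first k factors.
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
print(np.allclose(Uk @ np.diag(sk) @ Vtk, M))   # True, since rank(M) = 2
```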
2.5.4.4 Document and Query as Vectors

To find a ranked list of documents we need to apply a similarity measurement between a set of documents and a query. In this thesis we used the cosine similarity measure [section 2.5.4.5]. To do this we need to derive individual document and query vectors. From the SVD calculation we have equation 4(a), $M = USV^T$; transposing both sides gives:

$$M^T = (USV^T)^T \tag{5}$$
$$M^T = V S^T U^T \tag{6}$$

Multiplying on the right by $US^{-1}$ and using $U^T U = I$:

$$M^T U S^{-1} = V S^T U^T U S^{-1} = V \tag{7}$$
$$V = M^T U S^{-1} \tag{8}$$

Equation (8) shows that the rows of V (the document vectors in the LSI space) are obtained by projecting the columns of M (term vectors) through $US^{-1}$. A new document or query term vector DR can therefore be folded into the space in the same way:

$$\text{Document vector:}\quad DR_j = DR_j^T U S^{-1} \qquad\qquad \text{Query vector:}\quad DR_q = DR_q^T U S^{-1}$$

Thus, in the reduced k-dimensional LSI space:

$$\text{Document vector:}\quad DR_j = DR_j^T U_k S_k^{-1} \tag{9a}$$
$$\text{Query vector:}\quad DR_q = DR_q^T U_k S_k^{-1} \tag{9b}$$

2.5.4.5 Similarity Measurement

After deriving the term vectors of documents and queries, a similarity measurement technique is applied to estimate the similarity between a pair of vectors. This similarity value is an indicator of how relevant the document is to the query. A common similarity measure employed in the vector space model is cosine similarity. With cosine similarity, a smaller angle indicates higher similarity, and a larger angle indicates dissimilarity between a query and a document. Let $|DR|$ denote the length of vector DR. The cosine similarity between vectors $DR_j$ and $DR_q$, denoted $\cos(DR_j, DR_q)$, can be evaluated by the following equation:

$$\cos(DR_j, DR_q) = \frac{DR_j \cdot DR_q}{|DR_j|\,|DR_q|} \tag{10}$$

Two strategies are commonly used to exploit the similarity values for retrieving relevant documents. One is range queries, which retrieve all documents up to a distance threshold. The other is nearest-neighbor queries, which retrieve the N best matches. Although we do not expect any similarity metric to be a perfect model that corresponds exactly to human judgments of relevance, the measurement should be able to assign higher values to documents with a higher proportion of relevant text and lower values to documents with less relevant content. By integrating the ranking strategy into IR systems, the human user can restrict their attention to a set of documents of manageable size instead of having to go through every single record in the data set. The following figure presents the idea of cosine similarity graphically:

[Figure: two document vectors D1 and D2 and a query vector Q plotted against the axes Term A and Term B; α1 is the angle between D1 and Q, and α2 is the angle between D2 and Q.]

Angle α (degrees)    0      30     45     60     90
cos(α)               1      0.86   0.70   0.5    0

Table 2.3: Angles and Their Cosine Values

Table 2.3 shows that when the angle between two vectors is very small, the cosine approaches 1. More precisely, when two documents are identical their cosine is 1; when they are orthogonal (no common terms, totally dissimilar), their cosine is 0. As we are interested in rank-based retrieval, and cosine similarity is a tool that helps us rank documents by their similarity values against a query, we use cosine similarity to rank the documents for retrieval purposes.

2.5.4.6 Retrieving Documents Using LSI

The retrieval process represents documents and queries in the same manner and applies a function to estimate the similarity between them. In the LSI method, each document and query is projected onto the LSI space obtained by SVD. LSI then exploits the cosine measurement between the projections of a pair of term vectors in the LSI space to compare the two. Thus, the similarity can be obtained by computing the cosine of the angle between the document's term vector and the query's term vector. All the documents can then be ranked according to their similarity values, and an ordered set of documents retrieved and returned to the user. The sketch below puts these steps together.
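Putting Sections 2.5.4.3 through 2.5.4.6 together, the following sketch folds a query into a k-dimensional LSI space and ranks the documents by cosine similarity. It reuses the toy matrix of Table 2.2; the query vector is invented for illustration.

```python
import numpy as np

# Term-by-document matrix of Table 2.2 (terms: you, read, play,
# cricket, like, pizza; documents: d1, d2, d3).
M = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 3],
              [0, 0, 1]], dtype=float)

k = 2                                  # number of LSI dimensions
U, s, Vt = np.linalg.svd(M, full_matrices=False)
Uk, Sk = U[:, :k], np.diag(s[:k])

# Project documents and the query into the LSI space
# (equations 9a and 9b): DR = DR^T U_k S_k^{-1}.
docs = M.T @ Uk @ np.linalg.inv(Sk)    # one k-dimensional row per document

query = np.array([0, 0, 0, 0, 1, 1], dtype=float)   # "like pizza"
q = query @ Uk @ np.linalg.inv(Sk)

# Rank documents by cosine similarity (equation 10).
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

ranked = sorted(((cosine(dv, q), f"d{i+1}") for i, dv in enumerate(docs)),
                reverse=True)
print(ranked)   # d3 ("You like like like pizza.") should rank first
```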
2.6 TREC and Relevance Judgment

Relevance judgments are an essential part of a standard IR dataset. They allow recall-precision to be measured by indicating which documents are relevant to which topics (queries). The following sections discuss the background of the TREC-8 relevance judgments.

2.6.1 Text REtrieval Conference-8 (TREC-8) Relevance Judgment

Relevance judgments are of critical importance to a test collection. For each topic it is necessary to compile a list of relevant documents, as comprehensive a list as possible. All TRECs have used the pooling method to assemble the relevance assessments. In this method, a pool of possibly relevant documents is created by taking a sample of the documents selected by the various systems of the participating research groups. The pool is then shown to a human assessor, who makes a binary (yes/no) relevance judgment for each document in the pool. Unjudged documents are assumed to be not relevant. The particular sampling method used in TREC is to take the top 100 documents (pooling with depth 100) retrieved per judged run for a given topic and merge them into the pool for assessment. This is a valid sampling technique since all the systems used ranked retrieval methods, with the documents most likely to be relevant returned first. Each pool is sorted by document identifier, so assessors cannot tell whether a document was highly ranked by some system, or how many (or which) systems retrieved it. The method can be summarized as:

1. Dozens of research groups from universities and companies participate.
2. Each group runs all 50 queries through its system (TREC-8: topics 401-450).
3. Each group submits its raw retrieval results; the top 100 highest-ranked documents per topic are taken from each run (e.g., TREC-8: 7100 documents).
4. These are merged into the candidate set (e.g., TREC-8: 1736 documents); see the sketch after this list.
5. Human assessors judge the relevance of each document.
6. Results are evaluated and performance statistics compiled (e.g., TREC-8: 94 documents).
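The following is a minimal sketch of the pool construction in steps 3 and 4; the run names and document IDs are invented for illustration.

```python
# Sketch of TREC-style pooling with depth 100: take the top-100 ranked
# documents from each participating run for one topic, merge them, and
# discard duplicates. Sorting by document ID hides each document's
# source run and rank from the assessor.
DEPTH = 100

def build_pool(runs, depth=DEPTH):
    """runs: dict mapping run name -> ranked list of document IDs."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return sorted(pool)     # sorted by document ID for the assessor

# Invented example, with a depth of 2 for readability.
runs = {
    "run_A": ["LA010189-0001", "LA010189-0042", "LA010189-0007"],
    "run_B": ["LA010189-0042", "LA010189-0099", "LA010189-0001"],
}
print(build_pool(runs, depth=2))
# ['LA010189-0001', 'LA010189-0042', 'LA010189-0099']
```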
2.6.2 Relevance Judgment Problems

There are two potential problems in preparing the relevance judgments:

1. Human assessors make errors; and
2. There are often many more relevant documents in the corpus beyond the candidate set; the TREC procedure considers them all irrelevant.

A recent research paper published at SIGIR 2008 in Singapore assessed the accuracy of the TREC relevance judgments against relevance judgments created by its own human users, and found 63% agreement with the TREC relevance judgments. [Al-Maskari et al 2008]

2.7 Evaluation of Information Retrieval Performance

In previous sections, we presented the basic architecture of an IR system and introduced some representative approaches that can be used to implement it. In this part of the chapter, we discuss the criteria that may be investigated while evaluating an IR system. Frakes et al. [Frakes & Yates 1992] provided a summary of the evaluation process: an information retrieval system can be evaluated in terms of many criteria, including execution efficiency, storage efficiency, retrieval effectiveness, and the features offered to the user. System designers weigh the relative importance of these factors based on the particular environment and expectations, and accordingly select appropriate data structures and algorithms for implementation.

Execution efficiency is measured by the time it takes a system, or part of a system, to perform a computation. In C based systems, this can be measured using profiling tools such as prof on UNIX. This factor has always been a major concern for interactive IR systems, because a long retrieval time interferes with the usefulness of the system. The requirements of such IR systems usually specify maximum acceptable times for searching, and for database maintenance operations such as adding and deleting documents.

Storage efficiency is measured by the number of bytes needed to store the data. Space overhead, a common measure of storage efficiency, is the ratio of the size of the index files plus the size of the document files to the size of the document files alone.

Most IR experimentation has focused on retrieval effectiveness, which is based on relevance judgment. Relevance is an inherently subjective concept, in that the ultimate goal is to satisfy the human user's needs. Due to the variety of users' personal backgrounds, it is impossible to design a perfect system that meets the expectations of all human users. Therefore, it is necessary to evaluate the performance of a retrieval process by estimating the degree of relevance at which the retrieved information matches the query. In this thesis we used recall-precision and the paired sample t-test to evaluate our retrieval systems.

2.7.1 Recall and Precision

Two widely used parameters to measure IR success, both based on the concept of relevance, are precision and recall. Precision is the ratio of relevant items retrieved to all items retrieved. Recall is the ratio of relevant items retrieved to all relevant items. The following equations define recall and precision:

$$\text{recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}} \tag{11}$$

$$\text{precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}} \tag{12}$$

The expectations of users vary from one person to another. Some users attach more importance to precision, i.e., they want to see relevant information without going through a lot of junk. Others prefer recall, i.e., they want to see all the documents that are highly relevant. Hence, an evaluation that involves only one of the two parameters may be biased. For this reason, methods have been developed that evaluate IR performance in terms of precision and recall simultaneously.

2.7.2 Recall-Precision Calculations and Plotting

The following section explains how to calculate and plot recall-precision using a numerical example; for more information, see [Manning et al 2008] and [Croft et al 2009]. According to the theory of information retrieval evaluation, precision is calculated at 11 standard levels of recall: 0, 10, 20, ..., 100 percent (or 0.0, 0.1, 0.2, ..., 1.0). To measure the performance of an IR system, multiple queries or topics (e.g., 50 queries) are used. The precision averages at the 11 standard recall levels are used to compare the performance of different systems, and as the input for plotting the recall-precision graph (see Figs. 2.2 and 2.3). Each recall-precision average is computed from the interpolated precisions at the specified recall level using the following standard rule:

• For any standard recall level i, take the maximum precision at any actual recall level >= i.
• This defines a precision at the standard recall level of 0, even though precision at an actual recall of 0 is undefined.

A small sketch of this interpolation rule is given below; the numerical example then follows.
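The following is a minimal sketch of the interpolation rule just stated. The (recall, precision) pairs are those of the four relevant documents in the worked example below; note that the rule as stated yields 0.5 at recall level 0 for this data, whereas Table 2.5 starts at 1.0, apparently pinning precision at recall 0 to 1.0.

```python
# 11-point interpolated precision:
# p_interp(r) = max precision at any actual recall level >= r.
def interpolate(points):
    """points: list of (recall, precision) pairs at the relevant ranks."""
    levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# (recall, precision) at the four relevant documents of Table 2.4.
points = [(0.111, 0.5), (0.222, 0.4), (0.333, 0.375), (0.444, 0.4)]
print(interpolate(points))
# [0.5, 0.5, 0.4, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```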
Let us consider the following list of relevant documents (9) by DocID:

0123  0132  0241  0256  0299  0324  0311  0357  0399

Now consider an IR system that retrieved 12 documents, as shown in Table 2.4; the relevant documents are marked with an asterisk. The recall and precision values of this single query were calculated using equations 11 and 12.

Rank   DocID    Recall   Precision
1      0234     0.000    0.000
2      0132 *   0.111    0.500
3      0078     0.111    0.333
4      0231     0.111    0.250
5      0123 *   0.222    0.400
6      0177     0.222    0.333
7      0115     0.222    0.286
8      0256 *   0.333    0.375
9      0193     0.333    0.333
10     0311 *   0.444    0.400
11     0345     0.444    0.364
12     0387     0.444    0.333

Table 2.4: Recall-precision of a single query

Table 2.5 represents the interpolated precision of the same query.

Recall level (%)         0     10    20    30    40    50   60   70   80   90   100
Interpolated precision   1.0   0.5   0.4   0.4   0.4   0    0    0    0    0    0

Table 2.5: Interpolated precision of a query

[Figure 2.2: Recall-Precision Graph of a Query — the interpolated precision of Table 2.5 plotted against the 11 standard recall levels.]

We already mentioned that the performance on a single query does not necessarily represent the performance of a retrieval system, so we need to compute the precision for multiple queries and average the interpolated precision values at each recall level. Table 2.6 presents the average interpolated precision of two different retrieval systems (System 1 and System 2).

Recall level (%)   0     10    20     30     40     50     60     70     80     90     100
System 1           1.0   0.5   0.4    0.37   0.33   0.33   0.33   0.2    0.2    0.12   0.0
System 2           1.0   0.8   0.75   0.6    0.6    0.5    0.4    0.37   0.33   0.2    0.0

Table 2.6: Interpolated Average Precision of Two Different Systems

[Figure 2.3: Recall-precision Graph of Two Retrieval Systems — the averages of Table 2.6 plotted for System 1 and System 2.]

2.8 Paired Sample T-test

The paired samples t-test compares the means of two variables to determine whether there is a statistically significant difference between two populations. We could do this by calculating confidence intervals for the two populations and seeing whether the mean of one population lies outside the 95% confidence interval of the other. A formalized and slightly less cumbersome approach to answering this question is the t-test. A confidence interval (CI) is a range of values within which the true population mean lies; each interval specifies a probability of "capturing" the true population mean. [Triola 2006] To perform the t-test we follow these steps:

1. Choose the hypotheses:
   • Null: there is no significant difference between the means of the two variables.
   • Alternative: there is a significant difference between the means of the two variables.
2. Select the level of significance:
   • In most cases the significance level for a paired sample t-test is 5%, or 0.05. If the calculated significance value (p-value) of the t-test is smaller than 0.05, there is a significant difference between the two systems; otherwise the null hypothesis holds (no significant difference). Table C in Appendix D lists the critical values for the 5% (0.05) significance level.

We calculated precision for different retrieval systems (varying parameters such as the weighting scheme and the stop word list). Every retrieval system is independent; however, precision was calculated on the same dataset. A paired sample t-test is used to compare two means that are repeated measures on the same dataset. So, we used a paired sample t-test to determine the significance of the differences among the precision levels of the retrieval systems, and thereby to compare their performance. A minimal sketch of such a test is given below.
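The following sketch runs a paired t-test on the 11 average interpolated precision values of Table 2.6. The scipy library is assumed to be available here; the thesis itself may have used a different tool.

```python
from scipy import stats

# Average interpolated precision at the 11 standard recall levels
# for the two systems of Table 2.6 (paired by recall level).
system_1 = [1.0, 0.5, 0.4, 0.37, 0.33, 0.33, 0.33, 0.2, 0.2, 0.12, 0.0]
system_2 = [1.0, 0.8, 0.75, 0.6, 0.6, 0.5, 0.4, 0.37, 0.33, 0.2, 0.0]

# Paired (repeated-measures) t-test: both samples are measurements at
# the same recall levels over the same dataset.
t_stat, p_value = stats.ttest_rel(system_1, system_2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Reject the null hypothesis (no difference) at the 5% level if p < 0.05.
if p_value < 0.05:
    print("Significant difference between the two systems")
```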
Chapter-3 Issues in Latent Semantic Indexing (LSI) Research

3.1 Introduction

Information retrieval (IR) deals with the representation, storage, and organization of, and access to, information. The representation and organization of the information should provide users with easy access to the information in which they are interested. Unfortunately, characterizing a user's information need is not a simple problem. [Yates & Neto 1999] For hundreds of years, people have understood the importance of archiving and finding information. With the advancement of computer technology, it became possible to store large amounts of information, and finding useful information in such collections became a necessity. From this necessity, the field of Information Retrieval (IR) was born in the 1950s. Over the last few decades, the field of IR has matured considerably, and a number of automatic information retrieval systems are used on a daily basis by a variety of users. [Singhal 2008]

3.2 Related Research Work on Latent Semantic Indexing (LSI)

This chapter examines the shortcomings and problems in existing published works on Latent Semantic Indexing (LSI), one of the most popular techniques in the area of IR. LSI is discussed more fully in Section 2.5.4, should the reader require more background. The published works reveal open questions about very large datasets, stop words, and term weighting schemes in LSI. This thesis attempts to answer these questions, particularly for very large datasets used with LSI. The following sections discuss the related published research that used the LSI algorithm.

3.2.1 Using Latent Semantic Analysis to Improve Access to Textual Information

The first research work published on LSI was entitled "Using latent semantic analysis to improve access to textual information". [Dumais et al 1988] The authors used two different datasets, described below:

• MED: The first database consisted of 1033 medical reference abstracts and titles. A 5823×1033 term-by-document matrix was obtained and retrieval effectiveness evaluated against the 30 queries available with the dataset.
Homogeneity of documents and poorly stated queries in the second dataset caused very poor performance in precision level, so, the authors said that CISI was not a reliable dataset. 3.2.2 Indexing by Latent Semantic Analysis Deerwester's [Deerwester et al 1990] paper "Indexing by latent semantic analysis" describes automatic indexing and retrieval. This paper also described the use of singular value decomposition. They used datasets with standard queries from the Medlars collection (same dataset mentioned in section 3.2.1). They removed stop words (SMART'S stop word list was used) but they did not use stemming on the input dataset. By reviewing their results we find the following weakness in their work: LSI was evaluated with a very small dataset (input data and query set). It was a poor choice for a test collection because the selected dataset contains homogeneous documents. Some queries are vague and poorly stated. In the first experiment, they claimed 13% average improvement in LSI results over the raw term matching technique. In another experiment, the authors claimed 50% improvement in retrieval performance in case of LSI (k=100/100-factor) against factor analytic techniques for information retrieval using small numbers of factors e.g., 7 dimensions, 13 dimensions and, 21 dimensions. 29 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The authors only tested the above mentioned dataset with LSI and compared their results against a straightforward term matching method, a version of SMART, and the vector method reported by Voorhees (1985) for the same standard datasets. The SMART and Voorhees systems are information retrieval systems, and have different indexing, term weighting, and query processing procedure than LSI method. A set of 439 common (stop) words was applied to reduce the number of terms. However, they did not use stemming on the input dataset. They summarize their findings, as "These results are modestly encouraging. The latent semantic indexing method is superior to simple term matching in one standard case and equal in another" when considering two different datasets. [Deerwester et al 1990] 3.2.3 Large-Scale Information Retrieval Using Latent Semantic Indexing Todd A. Letsche [Letsche 1996] used the datasets shown in Table 3.1 in his master's thesis at the University of Tennessee, Knoxville, USA. Document Collection Abbrev. Number of Terms Number of Documents PVM 2547 170 PLMTP 9842 1154 CCE 31110 15486 USENET 534493 100000 PVM: parallel Virtual machine [GBD+94] All of the following Combined: PVM: parallel Virtual machine [GBD+94] LAPACK User's Guide Release 2.0 [ABB+95] MPI: The Complete Reference [SOHL+96] Templates for the Solution of Linear Systems [BBC94] Parallel Computing works [FWM94] The Concise Columbia Encyclopedia Usenet News Archives Table 3.1: A set of datasets [Letsche 1996] By reviewing his results we find the following issues: Letsche removed stop words and applied Porter's stemming algorithm to reduce the terms. However, the length of the stop word list is not mentioned. LSI was 30 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. used to search the World Wide Web. He used a large dataset of 100,000 documents from USENET. However, Letsche claimed due to memory constraint (low hardware resources), the system was unable to search the USENET dataset. [Letsche 1996] In his thesis, the results are only based on small dataset. 
LSI achieved up to 30% better retrieval performance than lexical searching technique (Lexical analysis based). Lexical analysis searching technique compare tokens/terms to find the matching. This technique does not use singular value decomposition (SVD) or cosine similarity to measure the relevance. 3.2.4 Automatic Cross-Language Information Retrieval using Latent Semantic Indexing Michael [Dumais et al 1997] used 2,482 documents for experimentation in the area of automatic cross-language information retrieval using latent semantic indexing. They used French and English documents. In this work, queries in one language can retrieve documents in the other language (as well as the original language). Reviewing the findings of this paper, the following is noted: They did not use any standard dataset (like TREC). They did not remove stop words or apply stemming. The authors compared LSI result with human generated retrieval. With the human generated text, the overall performance of this LSI system was about 10% worse for retrieving the corresponding category, but 15% better when looking at the top 10 retrievals. [Dumais et al 1997] 3.2.5 Latent Semantic Indexing-Based Intelligent Information Retrieval System for Digital Libraries Aswani [Kumar et al 2006] used titles of 116 journals and periodicals available in the Vellore Institute of Technology, India as their own input dataset. Reviewing the findings of this paper, the following is noted: The authors used a very small dataset and the data set and queries were not standard. There is no numerical indication of the improvements in retrieval performance. They applied the Porter's stemming algorithm on the input data set; however, they did not remove stop words. Authors mentioned in their paper that LSI has superiority over the standard vector space model. [Kumar et al 2006] 31 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3.2.6 Singular Value Decomposition and Rank K Approximation Bast and Majumder's [Bast & Majumder 2005] paper "Why Spectral Retrieval Works" studied singular value decomposition method by varying the value of K (rank K approximation) on three different (in size and content) datasets. They tried to find a feasible value of K by measuring the relatedness scores between two terms. Using this method they choose some values of K and measured the retrieval performance of LSI and LSI-based methods. They used the following dataset and corresponding queries for their experiments: • Time Collection contains 425 documents and 3882 unique terms. • Reuters Collection contains 21,578 documents and 5701 unique terms. • Ohsumed Collection contains 233,445 documents and 99,117 unique terms Reviewing the findings of this paper, the following is noted: The Ohsumed dataset is large. For all three datasets, by varying the value of K it is found that the value of K does not play a significant positive role in the retrieval performance. For all three datasets, a bigger value of K reduces the relatedness of word pairs, so, reduces the retrieval performance in terms of recall-precision measures. 3.2.7 Singular Value Decomposition for Text Retrieval Husbands' [Husbands et al 2001] paper entitled "On the Use of the Singular Value Decomposition for Text Retrieval" explored LSI retrieval performance for large datasets. The authors used the following datasets for their experiments: • MEDLINE/ MED collection (please see section 3.2.1 for details): This dataset is an example of a small dataset. 
• TREC-6 Dataset: a collection with 115,000 terms and 528,155 documents. This is an example of a large dataset.
• NPL Dataset: contains 11,429 documents and is an example of a small dataset.

In their results they compare the performance of LSI with term matching retrieval. For the MED dataset, LSI (K=200) showed better retrieval performance than the term matching technique. However, for the large TREC-6 dataset, LSI (K=200) showed poor retrieval performance (lower recall-precision values). The corresponding recall-precision graphs appear in the following figures:

[Figures 3.1 and 3.2: recall-precision graphs, not reproduced here.]
Figure 3.1: LSI vs. Term Matching on MEDLINE [Husbands et al 2001]
Figure 3.2: LSI vs. Term Matching on TREC6 [Husbands et al 2001]

Reviewing the findings of this paper, the following is noted: Most LSI research shows LSI to be superior to the term matching technique and other methods [Dumais et al 1988][Deerwester et al 1990][Kumar et al 2006][Letsche 1996]; however, this paper shows poorer performance of LSI than the term matching technique on a specific large dataset. Stemming improves retrieval performance [Hull 1995], but the authors did not use stemming on their input dataset. Term weighting schemes have positive effects on retrieval performance, and tf-idf and log-entropy are proven weighting schemes in IR; however, the authors only used the idf weighting scheme in their experiments. The authors proposed a new method called "term norms" and mentioned that this method has great influence and variability; however, they showed that LSI (K=300) had poor performance. Husbands and co-authors reported poorer performance of LSI (K=200 and K=300), and they did not use stop word removal or stemming on the TREC-6 input dataset. They only used the idf term weighting scheme in their experiments. From the IR literature, we know that pre-processing (stemming and stop word removal) and term weighting schemes improve retrieval performance significantly, so we are encouraged to test the performance of LSI on a large dataset with pre-processing of the input dataset and the most accepted weighting schemes (tf-idf and log-entropy).

3.3 Conclusion

From the above discussion it is found that the performance of LSI has been tested on several smaller datasets (e.g., MED, CISI abstracts, etc.); however, LSI has not been tested on a large standard dataset. In this research, we concentrate on the performance of LSI on a standard large dataset by varying different parameters, e.g., stop word lists and term weighting schemes. The specification of the chosen standard large dataset (test collection) is given below:

• Dataset: the LA Times dataset from the TREC-8 text collection. It includes articles published in the years 1989 and 1990 (all articles from 1 Jan 1989 to 31 Dec 1990). The number of documents is 131,321 and the size is 508 MB.
• Topics (queries): the TREC-8 query set includes 50 standard queries (401-450).
• The TREC-8 test collection also provides the relevance judgment information associated with each topic (query) for the above-mentioned dataset.

The significance of the performance of the retrieval systems is reported in terms of recall-precision values/graphs and a t-test.
Chapter-4 Test Collection and Pre-processing

4.1 Introduction

To answer our research questions with an LSI based information retrieval system, we need to perform the following steps:

1. Standard dataset/test collection selection and pre-processing;
2. Processing of the data and queries using the LSI based IR system to retrieve information; and
3. Evaluating the LSI based IR system for retrieval performance.

This chapter thoroughly discusses the first step, which focuses on the standard test collection and the pre-processing of the input data. Pre-processing includes extracting documents from large SGML files, stop word removal, and stemming. We developed our own software for pre-processing of the selected dataset. The following sections discuss the pre-processing steps in more detail.

4.2 Standard Test Collection

Ad hoc information retrieval is the task where the document collection is fixed, users submit queries to the information retrieval system, and the system returns a set of ranked retrieval results (i.e., documents). To experiment with an LSI based ad hoc information retrieval (IR) system, it is important to have a standard test collection consisting of three things: a document collection, a test suite of information needs expressible as queries, and a set of relevance judgments (a binary assessment of either relevant or non-relevant for each query-document pair). [Manning et al 2008] For LSI based ad hoc IR, the best known and most recognized test collections are provided by the Text REtrieval Conference (TREC).

4.2.1 TREC and Standard Test Collections

The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tasks/tracks (e.g., Blog Track, Chemical IR Track, Legal Track, Filtering Track, Genomics Track, etc.) over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc task/track during the first eight TREC evaluations between 1992 and 1999. In total, these test collections contain 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and are specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of fifty information needs, evaluated over different but overlapping sets of documents. TRECs 6 through 8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best sub-collection to use in future work, because it is the largest and the topics are consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are available only for the documents that were among the top K documents (e.g., top K=100 documents) returned by some system that was entered in the TREC evaluation for which the information need was developed. For this thesis, the Los Angeles Times (LA Times) test collection from TREC-8 has been chosen for conducting this research.

4.2.2 Los Angeles Times (LA Times) Text Collection (TREC-8)

The data/text collection includes the articles published by the Los Angeles Times in the two-year period from January 1, 1989 to December 31, 1990.
Each file contains the articles from one day (e.g., a file with the name "LA123190" contains articles published on 31 Dec 1990). Every such file contains a number of documents (e.g., LA123190 contains 134 different documents). The following figure reproduces part of the content of the file LA123190.

<DOC>
<DOCNO> LA123190-0001 </DOCNO>   <-- Document Number
<DOCID> 329361 </DOCID>
<DATE> December 31, 1990, Monday, Home Edition </DATE>
<SECTION> Metro; Part B; Page 4; Column 3; Letters Desk </SECTION>
<LENGTH> 33 words </LENGTH>
<HEADLINE> "TAGGER ARREST" </HEADLINE>
<TEXT> The only way we are ever going to end the nasty, filthy graffiti problem is to come down hard on the idiots doing it. I would be happy to contribute to a reward fund. </TEXT>
<BYLINE> IRV BUSH, Marina del Rey </BYLINE>
<TYPE> Letter to the Editor </TYPE>
</DOC>
<DOC>
<DOCNO> LA123190-0002 </DOCNO>   <-- Document Number
<DOCID> 329362 </DOCID>
...
Figure 4.1: An extract from the LA Times collection, file LA123190, illustrating one document

Some notable characteristics of the LA Times files:

1. LA Times files are encoded in the SGML (Standard Generalized Markup Language) file format (SGML is a superset of the XML file format).
2. The uppercase words enclosed by < and > are tag names.
3. Every document is delimited by the <DOC> and </DOC> tag names.
4. The <DOCNO> and </DOCNO> tag names enclose the unique name of each document.

The following table summarizes the characteristics of the input dataset:

Characteristics of the LA Times Data Set (1989, 1990)
Number of documents:                          131,321
Size of the input data set:                   476 MB
Average vocabulary size (approximately):      500
Average document size (approximately):        40 KB
Largest file size:                            828 KB (LA052089_0101)
Smallest file size:                           352 bytes (LA070189_0001)
Number of words in the smallest file:         91
Number of words in the largest file:          167,045
Number of relevant files (out of 131,321):    1,151 with respect to the TREC-8 query set

Table 4.1: Input Data Characteristics (LA Times Data Set (1989, 1990))

4.3 Pre-processing of Data

Pre-processing of data is a very important step in information retrieval research. The pre-processing of data encompasses the following steps:

1. Extract the documents from each large SGML data file (e.g., LA123190) into individual files, one per document.
2. Remove stop words from the input dataset.
3. Apply stemming to the input dataset.

4.3.1 Method to Extract Documents

In the selected corpora, every file contains approximately 100 to 130 documents. Each document is identified by a document number that is presented using a special tag (e.g., <DOCNO> LA010189-0001 </DOCNO>). So, in the very first step of processing the raw data, smaller files are extracted from the large files, and each smaller file is named using the document tag number, e.g., LA010189_0001. After extracting all the documents from TREC-8, 131,321 files are found. The following presents the automatic TREC file extractor program in more detail.

The TREC file extractor is a software system written in PHP. The flow diagram of the software system is depicted in Figure 4.2. The TREC file extractor takes the large TREC-8 SGML files as input, one after another, and produces the small individual documents. The name of each extracted document is the unique name found between its <DOCNO> and </DOCNO> tags. The TREC file extractor was tested in Windows Vista and Windows XP environments. The code developed to extract documents from the TREC-8 SGML files is available in Appendix B. The output of the TREC file extractor is used as the input of the pre-processing software.

[Figure 4.2: TREC File Extractor; flow diagram in which the large TREC documents are fed to the extractor, which emits the extracted small documents.]

Algorithmic steps of the TREC file extractor software (a sketch of this logic follows the list):

1. Read each SGML file (LA123190 is the name of one such file), one after another, from the input directory.
2. Find each <DOC> and </DOC> tag pair (top-down methodology), as these tag pairs identify individual documents.
3. Extract (read) the content between a <DOC> and </DOC> tag pair, including the tags, and write the extracted content to a new SGML file in lower case (i.e., change all upper case letters to lower case).
4. Give the new file a unique name of the form "input file name_document number" (e.g., LA123190_0001).
5. Write the new files as text files to the output directory.
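For concreteness, the extraction logic can be sketched as follows. This is a minimal illustrative sketch in Python rather than the PHP program of Appendix B; the directory arguments and the latin-1 encoding are our assumptions.

    import os
    import re

    def extract_documents(input_dir, output_dir):
        """Split each TREC-8 SGML file into one file per <DOC>...</DOC> block,
        named after the document number it contains (steps 1-5 above)."""
        os.makedirs(output_dir, exist_ok=True)
        doc_re = re.compile(r"<DOC>.*?</DOC>", re.DOTALL | re.IGNORECASE)
        docno_re = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>", re.IGNORECASE)
        for name in os.listdir(input_dir):                  # e.g., LA123190
            with open(os.path.join(input_dir, name), encoding="latin-1") as f:
                text = f.read()
            for doc in doc_re.findall(text):                # step 2: one <DOC> block at a time
                match = docno_re.search(doc)
                if match is None:
                    continue                                # skip any block with no <DOCNO>
                out_name = match.group(1).replace("-", "_")  # LA123190-0001 -> LA123190_0001
                with open(os.path.join(output_dir, out_name), "w", encoding="latin-1") as out:
                    out.write(doc.lower())                  # step 3: write in lower case

The non-greedy match between the tag pair corresponds to the top-down pairing of step 2: each <DOC> is closed by the nearest following </DOC>.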
4.3.2 Stop Word Removal and Stemming

The IR group at the University of Glasgow provides a list of 319 stop words, and the IR group at the University of Tennessee provides a list of 439 stop words. In this research, the University of Tennessee list, the University of Glasgow list, and an extended stop word list were used in our experiments. We call the extended stop word list that we have prepared the UNBC_LAT stop word list; it contains 1,911 stop words. The algorithmic steps to create this stop word list are given below (a short sketch of the frequency-ordering step follows at the end of this subsection):

1. Consider as candidates for the stop word list only terms with frequency of at least 2 (a term must be present in a document at least twice).
2. Create an initial stop word list by combining the stop word lists of both IR groups, the University of Tennessee and the University of Glasgow (with no duplication in the list).
3. Remove all punctuation from the input TREC-8 LA Times dataset.
4. Create a list of terms from the input dataset in descending order of term frequency (i.e., the terms with the highest term frequency are at the top of the list).
5. Manually extract the special items to be added to the initial list (terms not already in the initial list) to create the extended stop word list:
   o Add all file names to the initial list, as every file contains file names, e.g., LA123190.
   o Add all tag names (e.g., <DOC> and <DOCNO>) to the initial stop word list.
   o Add roman numerals to the initial list, e.g., xvii.
   o Add scale units, e.g., ft, mm, etc.
   o Add adjectives and adverbs (e.g., ago, bye).
   o Add prefixes split off from words (e.g., non, from non-governmental).
   o Add special words (e.g., haven (from haven't), doesn (from doesn't)).
   o Add dates, e.g., feb-92, 11-apr.
   o Add foreign words, as the dataset consists of newspaper articles, and suspicious words, e.g., aaftink, aachen, ora, etc.
   o Add other words, e.g., ext (telephone extension), 19th, z90, v6 (engine).

To create this stop word list, we manually searched the high and low frequency terms among the 132,785 terms in the frequency table at the initial stage. We repeated this over a number of cycles by removing different stop word lists (the stop word lists of the other IR groups) from the TREC-8 LA Times dataset and manually searching the frequency table for the remaining stop words. Searching for stop words is very time consuming, as both the dataset and the number of terms are large. The most difficult part is deciding whether a word is a stop word. The TREC-8 LA Times dataset contains newspaper articles, so the contents vary widely. When we started to work with the stop word lists of the other IR groups, we found that many high and low frequency stop words remained in our term list (the list used to create the term-by-document matrix), so we decided to build our own stop word list for efficient processing of the input dataset, because the existing stop word lists do not include many of the high and low frequency terms in our dataset. All the stop word lists used in this thesis are given in Appendix A. After creating the extended stop word list, we remove these stop words from the input dataset.

The idea of stemming was introduced in Section 1.2.4. In this thesis, we used Porter's stemming on the input dataset to produce stemmed words.
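The frequency-ordering step (step 4 above), from which the candidate stop words were hand-picked, can be sketched as follows. This is a minimal Python sketch under our own assumptions; in particular, the tokenization rule is illustrative, not the one used by the thesis software.

    import re
    from collections import Counter

    def term_frequency_table(file_paths):
        """Step 4 above: a collection-wide term list in descending order of
        frequency, after stripping punctuation (step 3). Terms occurring only
        once are dropped (one reading of step 1)."""
        counts = Counter()
        for path in file_paths:
            with open(path, encoding="latin-1") as f:
                # keep alphanumeric runs only; everything else counts as punctuation
                counts.update(re.findall(r"[a-z0-9]+", f.read().lower()))
        return [(term, n) for term, n in counts.most_common() if n >= 2]

Both ends of the returned list are then inspected by hand, since both very high and very low frequency terms were candidates for the UNBC_LAT list.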
The software developed to remove stop words from the input dataset and to stem the input dataset is discussed in Section 4.4. Fig. 4.3 and Fig. 4.4 present a single input file before and after pre-processing, respectively. All the stop words (all tags between < >, etc.) are removed from the input file (LA052190_0019) during pre-processing. After stop word removal, an input file still contains a few symbols (<, >, /) and numerical values; these special symbols and numerical values are removed by the LSI system while parsing the input files during creation of the term-by-document matrix.

[Figure 4.3: An Input File before Pre-processing; a screenshot of file LA052190_0019 with its SGML markup (docid 221951; date May 21, 1990, Monday, Home Edition; Sports, Part C, Page 7, Column 4, Sports Desk; 17 words; a newswire item on the retirement of Caldwell Jones, at 39 the oldest player in the NBA, from the San Antonio Spurs). Not reproduced here.]

[Figure 4.4: An Input File after Pre-processing; the same file reduced to stemmed content words, e.g., "center caldwel jone 39 oldest player nba retir san antonio spur ... basketbal player san antonio spur basketbal team jone caldwel retir".]

4.4 Stop Word Removal and Stemming Program

Pre-processing of the input data includes stop word removal and stemming. The concepts of stop words and stemming were introduced in Sections 2.4 and 2.5, respectively. The following explains the working procedure of the data pre-processing phase. The pre-processing software takes as input the output of the TREC file extractor program. It reads each word of each file (from the input folder) one after another. It reads every term/word of a file from the beginning and checks whether the word is in the stop word list. If the word is in the stop word list, it simply deletes that word and looks for the next word in the same file. If the word is absent from the stop word list, the word is stemmed using Porter's stemming algorithm [Porter 1997], and the stemmed word is saved in a file (in the output folder) with the same name as the file from which it was read. The pre-processing software repeats the same process for all input files to discard all the stop words and to produce stemmed output for all the input files. The code developed for the pre-processing software is available in Appendix B. The flow diagram of the pre-processing software system is depicted in Figure 4.5, and the procedure is sketched below. The pre-processing software has been tested in Windows Vista and Windows XP environments. The output of the pre-processing software system is the input of the Latent Semantic Indexing (LSI) retrieval system.
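The per-file procedure just described (enumerated as steps 1 to 6 after Figure 4.5 below) can be sketched as follows. This is a minimal illustrative sketch rather than the Appendix B program: it is written in Python, it substitutes NLTK's implementation of Porter's algorithm for the thesis's own [Porter 1997], and it assumes stop_words is a set of lower-case words such as the UNBC_LAT list.

    import os
    from nltk.stem import PorterStemmer   # stand-in for the thesis's own Porter implementation

    def preprocess(input_dir, output_dir, stop_words):
        """For every extracted document: drop stop words, stem what remains, and
        write the result under the same file name (steps 1-6 after Figure 4.5)."""
        stemmer = PorterStemmer()
        os.makedirs(output_dir, exist_ok=True)
        for name in os.listdir(input_dir):                  # step 1: every input file
            with open(os.path.join(input_dir, name), encoding="latin-1") as f:
                words = f.read().split()                    # step 2: word by word
            kept = [stemmer.stem(w) for w in words          # step 5: stem the survivors
                    if w not in stop_words]                 # steps 3-4: drop stop words
            with open(os.path.join(output_dir, name), "w", encoding="latin-1") as out:
                out.write(" ".join(kept))                   # step 5: same name, output folder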
[Figure 4.5: Pre-processing Software; flow diagram in which each word of an input file is checked against the stop word list: stop words are removed from the input file, and the remaining words pass through Porter's stemmer into the output files of stemmed words.]

Algorithmic steps of the stop word removal and stemming software:

1. Read every input file (e.g., LA123190_0001), one after another, from the input directory.
2. Read a term/word from the input file.
3. Search for the word in the stop word list.
4. If the term is not in the stop word list, go to step 5; if the term is in the stop word list, delete the term, put a NULL in the position of the deleted word, and go to step 2 to read the next word/term.
5. Apply Porter's stemming to create a stemmed word, write the stemmed word to a file with the same name as the file from which the word was read, and save the file to the output directory.
6. Repeat steps 2 to 5 to remove or stem all words from the input files.

4.5 Query Selection

In this research, the TREC-8 query set (401 to 450) is used to evaluate the efficiency of the Latent Semantic Indexing (LSI) technique for information retrieval on the dataset described in Section 4.2.2. The TREC-8 query set is taken from the Text REtrieval Conference (TREC, http://trec.nist.gov/data/topics_eng/index.html), which is maintained by the National Institute of Standards and Technology (NIST). The relevance judgment information is also taken from TREC. Fig. 4.6 shows the first two queries from the TREC-8 query set (a sketch of a parser for this format follows the figure). The characteristics of a TREC-8 query are:

Every TREC-8 query starts with the tag <top> and ends with </top>.
<num>: the num tag presents the query number; in Fig. 4.6 the query numbers are 401 and 402.
<desc>: the description tag describes the actual query. The text for query number 401 is "What language and cultural differences impede the integration of foreign minorities in Germany?"
<narr>: the narrative tag contains the information used by the human assessors to judge the relevance of the documents retrieved for this query.

<top>
<num> Number: 401
foreign minorities, Germany
<desc> Description:
What language and cultural differences impede the integration of foreign minorities in Germany?
<narr> Narrative:
A relevant document will focus on the causes of the lack of integration in a significant way; that is, the mere mention of immigration difficulties is not relevant. Documents that discuss immigration problems unrelated to Germany are also not relevant.
</top>

<top>
<num> Number: 402
behavioral genetics
<desc> Description:
What is happening in the field of behavioral genetics, the study of the relative influence of genetic and environmental factors on an individual's behavior or personality?
<narr> Narrative:
Documents describing genetic or environmental factors relating to understanding and preventing substance abuse and addictions are relevant. Documents pertaining to attention deficit disorders tied in with genetics are also relevant, as are genetic disorders affecting hearing or muscles. The genome project is relevant when tied in with behavior disorders (i.e., mood disorders, Alzheimer's disease).
</top>

Figure 4.6: First Two Queries (401 and 402) from the TREC-8 Query Set
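Reading queries out of this format reduces to matching the tag pairs described above. The following is a minimal Python sketch under our own assumptions (it expects the topics file already read into a single string, and it keeps only the number and description of each topic):

    import re

    def parse_topics(topics_text):
        """Extract the number and description of every <top>...</top> block in a
        TREC-8 topics file; the <desc> text is what is submitted as the query."""
        topics = {}
        for block in re.findall(r"<top>(.*?)</top>", topics_text, re.DOTALL):
            num = re.search(r"<num>\s*Number:\s*(\d+)", block)
            desc = re.search(r"<desc>\s*Description:\s*(.*?)\s*<narr>", block, re.DOTALL)
            if num and desc:
                topics[int(num.group(1))] = desc.group(1).strip()
        return topics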
4.7 Conclusion

This chapter discussed the idea and characteristics of a test collection. The advantage of a standard test collection is that it contains human relevance judgments, which can be used in the evaluation of the LSI based IR system to be tested. The pre-processing of the input dataset and the programs developed for pre-processing were also covered. The next chapter describes the experiments carried out after the pre-processing of the input dataset.

Chapter-5 Experimental Studies on Stop Word Lists and Term Weighting Schemes

5.1 Introduction

This chapter describes the experiments that were conducted using the Latent Semantic Indexing (LSI) based information retrieval system. The basic research procedures are demonstrated in this chapter. The experimental results, as well as their evaluations, are also covered in the following sections.

5.2 GTP (General Text Parser)

The LSI based retrieval system used for the experiments is called GTP (General Text Parser) [Giles et al 2001]. GTP is a general purpose text parser with a matrix decomposition option that can be used for generating vector space IR models. GTP is an open source tool developed by a group in the Department of Computer Science, University of Tennessee. The specific vector space model exploited by GTP is Latent Semantic Indexing (LSI). The architecture of the GTP software is depicted in Figure 5.1. The pre-processed data (stemmed, with stop words removed) is submitted to the GTP processor for further processing.

[Figure 5.1: Latent Semantic Indexing (LSI) Retrieval System [Giles et al 2001]; flow diagram in which the pre-processed dataset passes through the GTP parser to a raw term-by-document matrix, term weight calculation, and SVD, yielding the reduced-rank document vector space and a summary of the run; the pre-processed query passes through GTPQUERY to a raw query vector and then a reduced-rank query vector; a cosine similarity calculation between the two produces the ranked list of documents.]

GTP takes pre-processed text files as input and parses the files to create a database of terms/keys/words and their corresponding weights, from which a raw term-by-document matrix is created; LSI then performs singular value decomposition (SVD) on that raw matrix to produce the low-rank vector space in a binary file. Finally, LSI cleans up its temporary working files and writes the summary of the run to the RUN SUMMARY file.

GTPQUERY takes a query file as input and extracts the terms/keywords from the database, along with the corresponding weight of each term, to create a raw query vector. Then, using the SVD output, a reduced-rank query vector is created. Once the query vector is generated, a cosine similarity calculation is performed between the document vectors and the query vector to create a ranked list of documents based on the similarity values. The results (the ranked list of documents) are written to a file whose name takes the form q_result.#, where # is the query number. The results written in these files are sorted from most relevant to least relevant.

5.3 Experimental Design

The following sections discuss the experiments done for this thesis. The experiments are performed to answer the research questions (Chapter 1, Section 1.3) associated with stop word lists and term weighting issues in LSI based information retrieval (IR) for the TREC-8 LA Times dataset (1989-1990). The query-time computation performed by GTPQUERY is sketched below.
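The following is a minimal sketch of the standard LSI fold-in using a dense SVD. It is not GTP's actual code (GTP relies on sparse SVD routines, cf. SVDPACKC [Berry et al 1993], and conventions for scaling the document and query vectors vary between implementations); the function name and arguments are ours.

    import numpy as np

    def lsi_rank(A, q, k):
        """Rank documents for one query in the manner of Figure 5.1: decompose
        the weighted term-by-document matrix A (terms x documents), fold the
        query vector q (length = number of terms) into the rank-k space, and
        sort the documents by cosine similarity. Assumes k <= min(A.shape) and
        no all-zero document vectors."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # rows of Vk are document vectors
        q_hat = (Uk.T @ q) / sk                     # fold the query into the k-space
        sims = (Vk @ q_hat) / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat))
        return np.argsort(-sims)                    # most to least relevant, as in q_result.#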
5.3.1 Experiment-1: Study of Stop Words

This study tries to find the effect of stop words (common words) on a text based LSI information retrieval (IR) process. Evidence will be developed to indicate the most effective stop word lists for LSI based ad hoc IR processes. Surveying the literature, we found that different research groups use their own stop word lists, and these lists vary in the number of stop words they contain. For example, the Information Retrieval Group at the University of Glasgow (319 stop words) and the University of Tennessee (439 stop words) have their own stop word lists. TREC encourages text retrieval research; however, TREC does not recommend any standard stop word list for IR research. Christopher D. Manning [Manning et al 2008] describes a brief way to prepare a list of stop words: "the general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing."

Following this idea, this thesis uses an extended stop word list (also called the UNBC_LAT stop word list; please see Section 4.3.2) that includes the University of Glasgow and University of Tennessee stop word lists, 730 TREC file names (from the input dataset), 22 tag names (e.g., doc, docno, etc.), and other words (e.g., alphanumeric words, roman numerals, etc.). The total length of the UNBC_LAT stop word list is 1,911. The following table lists the various inputs and parameters used in this study.

Data Set            TREC-8 Los Angeles Times (LA Times)
Stop Word List      University of Glasgow; University of Tennessee; UNBC_LAT; without removing stop words
Stemming            Porter's stemming
Weighting Scheme    tf-idf
Number of Queries   50 queries for each experiment

Table 5.1: Different Parameters for the Study of Stop Words

5.3.2 Experiment-2: Study of Weighting Schemes

Surveying the literature on text based information retrieval, we found that research suggests the use of the log-entropy [Dumais 1991] or tf-idf [Manning et al 2008] (term frequency-inverse document frequency) weighting schemes, because they yield better retrieval performance than idf and raw term frequency. We also studied the raw term frequency weighting scheme out of research interest. The objective of the following three experiments is to study the two most popular weighting schemes, tf-idf and log-entropy, to find which weighting scheme is effective on a large dataset in an LSI based ad hoc text IR system (both schemes are sketched after Table 5.2). The following table lists the various inputs and parameters used in this study.

Data Set            TREC-8 Los Angeles Times (LA Times)
Stop Word List      the best stop word list from Experiment-1
Stemming            Porter's stemming
Weighting Scheme    tf-idf; log-entropy; raw term frequency
Number of Queries   50 queries for each experiment

Table 5.2: Different Parameters for the Study of Weighting Schemes
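Both schemes are simple transformations of the raw term-by-document count matrix. The sketches below are minimal Python illustrations of one common variant of each scheme; the thesis does not print GTP's exact formulas, so the precise formulas, the function names, and the assumption that every term occurs in at least one document are ours.

    import numpy as np

    def tfidf(F):
        """tf-idf on a raw term-by-document count matrix F (terms x documents):
        local term frequency times a global inverse document frequency."""
        n_docs = F.shape[1]
        df = np.count_nonzero(F, axis=1)            # document frequency of each term
        return F * np.log(n_docs / df)[:, None]     # tf * idf

    def log_entropy(F):
        """Log-entropy in the style of [Dumais 1991]: local log(1 + tf) times a
        global factor in [0, 1] that down-weights terms spread evenly over the
        collection."""
        n_docs = F.shape[1]
        gf = F.sum(axis=1, keepdims=True)           # global frequency of each term
        p = np.where(F > 0, F / gf, 1.0)            # p_ij; p log p -> 0 where tf = 0
        g = 1.0 + (p * np.log(p)).sum(axis=1) / np.log(n_docs)
        return np.log(F + 1.0) * g[:, None]

Raw term frequency corresponds to using F itself, with no reweighting at all.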
5.4 Results and Evaluations

The following sections present the findings and the evaluations of the experiments.

5.4.1 Result: Experiment-1

Table 5.3 shows the 10-point interpolated precision of four different systems. The recall-precision graph based on the Table 5.3 data is shown in Figure 5.2.

Recall   PR_mylist_tfidf   PR_stem_tfidf   PR_ten_tfidf   PR_glasgow_tfidf
         (System-1)        (System-2)      (System-3)     (System-4)
0.1      0.1757            0.1385          0.1513         0.1221
0.2      0.1108            0.1058          0.0994         0.0864
0.3      0.0888            0.0841          0.0735         0.0799
0.4      0.0724            0.0743          0.0693         0.0722
0.5      0.0678            0.0710          0.0672         0.0694
0.6      0.0659            0.0662          0.0647         0.0671
0.7      0.0646            0.0632          0.0609         0.0660
0.8      0.0621            0.0592          0.0588         0.0628
0.9      0.0554            0.0553          0.0489         0.0578
1.0      0.0436            0.0247          0.0316         0.0374

Table 5.3: 10-point Interpolated Precision of Four Different Systems

In Table 5.3, the recall value 0.1 represents the point at which the top 10% of the documents relevant to a query have been retrieved. As an example, for PR_mylist_tfidf the precision associated with recall 0.1 is 0.1757, or 17.57%. This value is calculated by interpolating the precision values of all 50 queries used for this thesis at the standard recall value 0.1; details of the interpolation technique can be found in [Manning et al 2008], and a sketch is given at the end of this subsection. We can compare retrieval systems with different parameters (e.g., stop word list) in terms of precision at the standard recall points 0.1, 0.2, and so on. For example, at recall point 0.3 (i.e., when 30% of the relevant documents have been retrieved) the precision for System-1 (PR_mylist_tfidf) is 8.88% and for System-3 (PR_ten_tfidf) is 7.35%. Subtracting, 8.88% - 7.35% = 1.53%, so System-1 shows 1.53% better retrieval performance than System-3 at this recall level. The difference is visible in Figure 5.2 at the recall point 0.3.

[Figure 5.2: Recall-Precision Graph for Experiment-1; precision versus recall for PR_mylist_tfidf (System-1), PR_stem_tfidf (System-2), PR_ten_tfidf (System-3), and PR_glasgow_tfidf (System-4).]

System-1 (mylist_tfidf), which uses the extended stop word list ("mylist" denotes the UNBC_LAT stop word list), provides the best result when compared to the other three systems. At recall 0.1, System-1 shows 5.37% better retrieval performance than System-4 ("glasgow" denotes the University of Glasgow stop word list), 3.68% better retrieval performance than System-2 ("stem" denotes stemming without stop word removal), and 2.44% better retrieval performance than System-3. However, beyond recall 0.4 all the systems show almost the same retrieval performance. In System-2 (stem_tfidf), we applied Porter's stemming without removing stop words; it is interesting that the retrieval performance of System-2 is 1.64% better than that of System-4 (glasgow_tfidf). From the above results, we find that for LSI based ad hoc information retrieval the appropriate set of stop words depends on the input data. However, given a sufficiently powerful computer, one can still expect respectable retrieval performance without removing stop words at all. The results also make it clear that using an arbitrary stop word list (one not tailored to the dataset) reduces retrieval performance in LSI based ad hoc information retrieval on a large dataset.
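The 10-point figures reported above are produced by interpolation. A minimal sketch of the standard procedure [Manning et al 2008] for a single query follows; the function name is ours, and the per-query values are then averaged over the 50 TREC-8 queries to produce one column of Table 5.3.

    import numpy as np

    def interpolated_precision(ranked_docs, relevant, recall_points):
        """Interpolated precision for one query: at each standard recall point,
        take the best precision achieved at that recall level or beyond."""
        relevant = set(relevant)
        if not relevant:
            return [0.0] * len(recall_points)
        precisions, recalls, hits = [], [], 0
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)      # precision at this rank
                recalls.append(hits / len(relevant))  # recall at this rank
        precisions, recalls = np.array(precisions), np.array(recalls)
        return [float(precisions[recalls >= r].max()) if (recalls >= r).any() else 0.0
                for r in recall_points]

For the tables in this chapter, recall_points would be [0.1, 0.2, ..., 1.0], ranked_docs a q_result.# ranking, and relevant the TREC-8 relevance judgments for that topic.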
5.4.1.1 Paired Sample T-test

The idea of a t-test was introduced in Section 2.8. We calculated the t-test using SPSS 17.0. Table 5.4 presents the recall-precision values of the four systems together with the mean and standard deviation of the precision values. Table 5.5 shows the output of the t-test (detailed step-by-step calculations are available in Appendix D; a sketch reproducing one of the pairs follows Table 5.5). We compared System-1 with the three other systems in terms of the level of significance and found a different significance level for each pair of systems. By analyzing the significance levels (see Table C in Appendix D), we found that the t-test produced results below the threshold level. This is an interesting finding: when the paired sample t-test is applied to compare the different retrieval systems in terms of the level of significance, its results indicate insignificant differences in all cases.

Recall      PR_mylist_tfidf   PR_stem_tfidf   PR_ten_tfidf   PR_glasgow_tfidf
            (System-1)        (System-2)      (System-3)     (System-4)
0.1         0.1757            0.1385          0.1513         0.1221
0.2         0.1108            0.1058          0.0994         0.0864
0.3         0.0888            0.0841          0.0735         0.0799
0.4         0.0724            0.0743          0.0693         0.0722
0.5         0.0678            0.0710          0.0672         0.0694
0.6         0.0659            0.0662          0.0647         0.0671
0.7         0.0646            0.0632          0.0609         0.0660
0.8         0.0621            0.0592          0.0588         0.0628
0.9         0.0554            0.0553          0.0489         0.0578
1.0         0.0436            0.0247          0.0316         0.0374
Mean        0.0807            0.0742          0.0726         0.0721
Std. Dev.   0.0381            0.0307          0.0326         0.0219

Table 5.4: 10-point Interpolated Precision, Mean and Standard Deviation of the Four Systems

Paired Samples Test
                                95% Confidence Interval
                                of the Difference
                                Lower        Upper        t       Degrees of      Level of
                                                                  freedom (N-1)   Significance
Pair 1   System1 - System2      -.0024183    .0153783     1.647   9               .134
Pair 2   System1 - System3      .0027259     .0135741     3.399   9               .008
Pair 3   System1 - System4      -.0041491    .0213491     1.526   9               .161

Table 5.5: Comparison among Systems in Terms of the T-test
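The same comparison can be reproduced outside SPSS. As a check, the following minimal SciPy sketch recomputes Pair 2 of Table 5.5 from the Table 5.4 precision values:

    from scipy import stats

    # 10-point interpolated precision of System-1 and System-3 (from Table 5.4)
    sys1 = [0.1757, 0.1108, 0.0888, 0.0724, 0.0678,
            0.0659, 0.0646, 0.0621, 0.0554, 0.0436]
    sys3 = [0.1513, 0.0994, 0.0735, 0.0693, 0.0672,
            0.0647, 0.0609, 0.0588, 0.0489, 0.0316]

    t, p = stats.ttest_rel(sys1, sys3)   # paired samples, df = 10 - 1 = 9
    print(f"t = {t:.3f}, p = {p:.3f}")   # t = 3.399, p = .008, matching Pair 2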
5.4.2 Result: Experiment-2

Table 5.6 shows the 10-point interpolated precision of three different systems. The recall-precision graph based on the Table 5.6 data is shown in Figure 5.3.

Recall   PR_mylist_tfidf   PR_mylist_logentropy   PR_mylist_raw_freq
         (System-1)        (System-5)             (System-6)
0.1      0.1757            0.1464                 0.0910
0.2      0.1108            0.1061                 0.0874
0.3      0.0888            0.0803                 0.0789
0.4      0.0724            0.0784                 0.0720
0.5      0.0678            0.0743                 0.0690
0.6      0.0659            0.0714                 0.0636
0.7      0.0646            0.0688                 0.0626
0.8      0.0621            0.0666                 0.0596
0.9      0.0554            0.0515                 0.0569
1.0      0.0436            0.0388                 0.0339

Table 5.6: 10-point Interpolated Precision of Three Different Systems

The objective of this study was to find the right term weighting scheme for an LSI based retrieval system on a large dataset. From the results presented in Table 5.6, at recall 0.1 System-1 with the tf-idf term weighting scheme achieved 17.57% precision, while System-6 with raw term frequency achieved 9.10%. Subtracting, 17.57% - 9.10% = 8.47%, roughly 9%, so System-1 showed about 9% better retrieval performance than System-6. Similarly, log-entropy (System-5) showed 5.54% better retrieval performance than raw term frequency (System-6). On the other hand, tf-idf (System-1) showed 2.93%, roughly 3%, better retrieval performance than System-5 with the log-entropy term weighting scheme.

[Figure 5.3: Recall-Precision Graph for Experiment-2; precision versus recall for PR_mylist_tfidf (System-1), PR_mylist_logentropy (System-5), and PR_mylist_raw_freq (System-6).]

Figure 5.3 presents the above results in graphical form; the differences are clearest at recall level 0.1. So, for a large dataset, tf-idf performs better than the log-entropy and raw term frequency weighting schemes. However, beyond recall 0.4 (in Figure 5.3, see the recall values 0.4 to 1.0) all the retrieval systems show almost the same retrieval performance. As 1,151 documents are relevant out of 131,321 documents in the LA Times TREC-8 standard test collection (only 0.88% of the documents are relevant), we suspect that almost all relevant documents are retrieved by the recall point 0.3, which is why every retrieval system shows equal performance after the recall point 0.4, as seen in Figures 5.2 and 5.3.

5.5 Conclusion

In the two studies above (Sections 5.4.1 and 5.4.2), we performed seven sets of experiments and used 50 queries (TREC-8) to evaluate the results. 10-point recall-precision graphs and a t-test were used to measure the retrieval performance of LSI on the large test collection. From these experiments we obtained the following research results:

• Different input datasets contain different contents (e.g., newspaper articles, medical journals, etc.), so the term list of one input dataset differs from those of other datasets. It is clear that for any unique document collection a tailored stop word list must be assembled.
• In the existing published results on LSI based retrieval systems, there was no concrete evidence about which term weighting scheme works better for large datasets. From our research, we conclude that the results (in numerical form) show that tf-idf term weighting performs better than the log-entropy and raw term frequency weighting schemes when the test collection becomes large.
• Our findings on an LSI based retrieval system with a large dataset extend the earlier published results discussed in Chapter 3 by providing concrete evidence of relevance performance.
• The t-test failed to distinguish among the retrieval performance of the different retrieval systems used in this research.

LSI showed promising results on the TREC-8 LA Times dataset, which is an example of a large unstructured digital textual dataset. So, we can benefit from using LSI to find information in large digital unstructured textual archives.

Chapter-6 Conclusion and Future Directions

6.1 Introduction

In this chapter, we present a summary of this thesis and mention possible directions for our future research.

6.2 Summary

The field of Information Retrieval (IR) was born in the 1950s, and over the last forty years the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. [Singhal 2008] Retrieving information from digital archives is a challenge to users, and an automatic IR system is an obvious solution to this challenge. In Chapter 1, we introduced the fundamental terms of LSI research and presented the problem statement. In Chapter 2, we discussed the theory of LSI based IR systems and related matters: term weighting schemes, the rank-based similarity measurement technique (between a query and documents), singular value decomposition, the TREC-8 relevance judgments, and the ideas of recall-precision and the t-test.
We discussed the TREC-8 relevance judgments and the recall-precision graph because they help us evaluate the retrieval results. In Chapter 3, we surveyed the published results on LSI from the last 20 years, which helped us identify the potential research questions for an LSI based IR system. In Chapter 4, we described the characteristics of the standard large TREC-8 test collection and the pre-processing of the selected input dataset. To address the research questions on a large dataset, we created and used our own stop word list, Porter's stemming, and an LSI based vector space model (singular value decomposition) in a structured set of experiments.

In Chapter 5, we presented the experimental design, the experimental results, and the list of findings from the experimental results. All the headline results are quoted at the recall point 0.1. We performed two different experiments to judge the performance of LSI on a large dataset. Briefly, the experiments were:

• A study of the effect of different stop word lists.
• A study of the effect of different term weighting schemes.

In the first experiment, we tested the retrieval performance of LSI using three different stop word lists and without removing stop words from the input dataset. The stop word list we created and used (the UNBC_LAT stop word list) showed 5.37% better performance than the University of Glasgow stop word list, 2.44% better than the University of Tennessee stop word list, and 4% better than not removing stop words at all. In the second experiment, we tested three different term weighting schemes (raw term frequency, tf-idf, and log-entropy) on a large dataset. The tf-idf weighting scheme showed 9% better retrieval performance than raw term frequency, and log-entropy showed 5.54% better performance than raw term frequency. On the other hand, tf-idf showed 3% better retrieval performance than log-entropy. In summary, tf-idf performed better than the log-entropy and raw term frequency weighting schemes on the large dataset.

In general, LSI has a number of advantages for processing large datasets:

1. Both terms and documents are explicitly represented in the same space.
2. Queries and new documents can easily be added.
3. LSI uses the SVD to reduce dimensionality and to remove noise.
4. LSI is able to handle polysemy and synonymy.

Thus, LSI is able to represent and manipulate large datasets, making it viable for real-world applications. [Deerwester et al 1990] Our results on LSI with a large dataset also show that LSI is able to handle large datasets.

6.3 Possible Extensions of the Experiments

Latent Semantic Indexing (LSI) is used in a variety of information retrieval and text processing applications, although its primary applications have been conceptual text retrieval and automated document categorization. [Dumais 2004] In this thesis we studied the accuracy of LSI on a large dataset for document retrieval. Some additional research related to this study might include the following:

1. By analyzing the results of this research (Sections 5.4.1 and 5.4.2), we found that every experiment shows almost equal performance at the recall points 0.4 to 1.0.
As only 0.88% of the documents are relevant, we suspect that almost all relevant documents are retrieved within the top-ranked documents by recall point 0.3, meaning that the graphs show almost equal retrieval performance after recall point 0.4. It might be an interesting study to find the reason(s) behind this equal performance on this dataset; the question could also be explored for other large datasets.

2. Term weighting schemes are very influential in LSI based IR systems. One possible study that might be interesting is to apply a position based term weighting scheme (sometimes called positional indexing). This weighting scheme is very costly in terms of computing time, but may produce better retrieval performance. We plan to work on position based term weighting schemes in the future.

3. The General Text Parser (GTP) [Giles et al 2001] used in this research has a command line interface. Adding a Graphical User Interface (GUI) could enhance user interaction with GTP. A Java version of GTP is available with a GUI; however, it has a memory limitation that prevents it from processing a large dataset.

Bibliography

[Al-Maskari et al 2008] Azzah Al-Maskari, Mark Sanderson, and Paul Clough. Relevance Judgments between TREC and Non-TREC Assessors. SIGIR'08, Singapore, July 20-24, 2008.

[Anderson et al 2009] Anderson, David Ray, Sweeney, Dennis J., and Williams, Thomas Arthur (2009). Statistics for Business and Economics. 11th ed. Cengage Learning Inc., Florence, KY.

[Baeza-Yates & Ribeiro-Neto 1999] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman Publishing Co. Inc., first edition, 1999.

[Berry & Dumais 2008] Berry, Michael W. & Dumais, Susan. Latent Semantic Indexing Web Site, http://www.cs.utk.edu/~lsi/, access time: Jan 2008.

[Berry et al 1993] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACKC (version 1.0) user's guide. Technical Report No. CS-93-194, University of Tennessee, May 1993.

[Chakravarthy & Haase 1995] A. Chakravarthy and K. Haase. NetSerf: Using semantic knowledge to find internet archives. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, 1995, pp. 4-11.

[Chen et al 2001] L. Chen, N. Tokuda, and A. Nagai. Probabilistic information retrieval method based on differential latent semantic index space. IEICE Trans. on Information and Systems, E84-D(7):910-914, 2001.

[Chen et al 2003] L. Chen, N. Tokuda, and A. Nagai. A new differential LSI space-based probabilistic document classifier. Information Processing Letters, 88:203-212, 2003.

[Chen et al 2004] L. Chen, J. Zeng, and J. Pei. Classifying noisy and incomplete medical data by a differential latent semantic indexing approach. In Data Mining in Biomedicine. Springer Press, 2004.

[Chen et al 2005] L. Chen, J. Zeng, and N. Tokuda. A "stereo" document representation for textual information retrieval. Journal of the American Society for Information Science and Technology, 2005.

[Chen et al 2006] Liang Chen, Jia Zeng and Naoyuki Tokuda. A "Stereo" Document Representation for Textual Information Retrieval. Journal of the American Society for Information Science and Technology (JASIST), 57:6, pp. 768-774, 2006.
[Chen et al 1999] Liang Chen, Naoyuki Tokuda and Akira Nagai. Probabilistic Information Retrieval Method Based on Differential Latent Semantic Index Space. IEICE Trans. Fundamentals, Vol. E82, No. 1, January 1999.

[Coefficient of Variation] http://www.graphpad.com/help/prism5/prism5help.html?coefficient_of_variation_%28cv%29.htm (access time: Dec 2009)

[Cohen & Hirsh 1998] W. Cohen and H. Hirsh. Text categorization using WHIRL. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 169-173, New York, 1998.

[Cooper & Dabney 1992] W. Cooper, F. Gey, and D. Dabney. Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198-210, Copenhagen, Denmark, 1992.

[Croft et al 2009] W. Bruce Croft, Donald Metzler, and Trevor Strohman. Information Retrieval in Practice. Addison Wesley, 2009.

[Deerwester et al 1990] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6), pp. 391-407.

[Downing & Clark 2003] Downing, Douglas and Clark, Jeffrey (2003). Business Statistics. 4th ed. Barron's Educational Series, Inc., Hauppauge, NY.

[Dumais et al 1988] Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. Using latent semantic analysis to improve access to textual information. In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 1988, pp. 281-285.

[Dumais et al 1997] Susan T. Dumais, Todd A. Letsche, Michael L. Littman, Thomas K. Landauer. Automatic Cross-Language Retrieval Using Latent Semantic Indexing. AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, 1997, Stanford University, pp. 18-24.

[Dumais et al 1996] Susan T. Dumais, Thomas K. Landauer, Michael L. Littman. Automatic Cross-Linguistic Information Retrieval using Latent Semantic Indexing. SIGIR 1996.

[Dumais 1991] S. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, & Computers, 23(2):229-236, 1991.

[Dumais 2004] Dumais, S. Latent Semantic Analysis. ARIST Review of Information Science and Technology, vol. 38, 2004, Chapter 4.

[Dumais 1992] Dumais, S. T. Enhancing performance in LSI (Latent Semantic Indexing) Retrieval. Technical memo, 1992.

[Fisher & Yates 1995] R. A. Fisher and F. Yates. Statistical Tables for Biological, Agricultural and Medical Research. 6th ed. London: Longman Group, 1995.

[Foltz 1990] P. W. Foltz. Using Latent Semantic Indexing for Information Filtering. In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 1990, pp. 40-47.

[Frakes & Yates 1992] W. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.

[Gansterer et al 2007] Wilfried Gansterer, Andreas Janecek, and Robert Neumayer. Spam filtering based on latent semantic indexing. 5th SIAM Workshop on Text Mining, Minneapolis, MN, USA, 2007.

[Greengrass 2001] E. Greengrass. Information retrieval: A survey. DOD Technical Report TR-R52-008-001, 2001.

[Grei et al 1997] W. Grei, W. Croft, and H. Turtle. Computationally tractable probabilistic modeling of Boolean operators.
In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997, pp. 119-128.

[Giles et al 2001] Justin T. Giles, Ling Wo, Michael W. Berry. GTP (General Text Parser) Software for Text Mining. CRC Press, 2001.

[Harman 1992] D. Harman. User-friendly systems instead of user-friendly front-ends. Journal of the American Society for Information Science, 43(2):164-174, 1992.

[Harman 1995] D. Harman. Overview of the third text retrieval conference. Third Text REtrieval Conference (TREC-3), pp. 1-19. National Institute of Standards and Technology Special Publication 500-207, 1995.

[Harter 1986] S. Harter. Online Information Retrieval: Concepts, Principles, and Techniques. Academic Press, Inc., 1986.

[Hofmann 1999] T. Hofmann. Probabilistic latent semantic indexing. 22nd Annual ACM Conference on Research and Development in Information Retrieval, pp. 50-57, Berkeley, California, 1999.

[Hull 1994] D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 282-291, 1994.

[Husbands et al 2001] Husbands, P., Simon, H., & Ding, C. On the use of singular value decomposition for text retrieval. In Computational Information Retrieval (SIAM), pp. 145-156, Philadelphia, PA, 2001.

[Kumar et al 2006] Ch. Aswani Kumar, Ankush Gupta, Mahmooda Batool and Shagun Trehan. Latent Semantic Indexing-Based Intelligent Information Retrieval System for Digital Libraries. Journal of Computing and Information Technology - CIT 14, India, 2006, pp. 191-196.

[Labsky et al 2005-A] Labsky M., Vacura M., and Praks P. Web Image Classification for Information Extraction. First International Workshop on Representation and Analysis of Web Space (RAWS-05), Prague, 2005, pp. 55-62.

[Labsky et al 2005-B] Martin Labský, Vojtěch Svátek and Ondřej Šváb. Information Extraction from HTML Product Catalogues: from Source Code and Images to RDF. The 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 401-404, Washington, USA, 2005.

[Lee 1994] J. Lee. Properties of extended Boolean models in information retrieval. 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 182-190, 1994.

[Letsche 1996] Todd A. Letsche. Toward Large-Scale Information Retrieval Using Latent Semantic Indexing. Master's thesis supervised by Dr. Michael Berry, The University of Tennessee, Knoxville, USA, 1996.

[Letsche & Berry 2009] Todd A. Letsche and Michael W. Berry. Large-Scale Information Retrieval with Latent Semantic Indexing. http://www.cs.utk.edu/~berry/lsi++/, access time: Feb 2009.

[Liddy et al 1994] E. Liddy, W. Paik, and E. Yu. Text categorization for multiple users based on semantic features from a machine-readable dictionary. ACM Transactions on Information Systems, 12(3):278-295, 1994.

[Lynn 2002] Patrick A. Lynn. Evolving the General Text Parser (GTP) Utility into a Usable Application Via Interface Design. A thesis presented for the Master of Science degree, The University of Tennessee, Knoxville, December 2002.

[Manning et al 2008] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press, Cambridge, 2008.
[Mayfield & McNamee 1997] James Mayfield and Paul McNamee. N-grams vs. words as indexing terms. In TREC-6 Conference Notebook Papers, 1997.

[Montague 2002] M. Montague. Metasearch: Data fusion for document retrieval. PhD thesis, Dartmouth College, Hanover, New Hampshire, 2002.

[Machala et al 2006] Praks P., Machala L., Snasel V., Strizik M. Intelligent information retrieval using numerical linear algebra and applications. International Symposium GIS Ostrava 2006 Proceedings, Ed. J. Ruzicka, VSB-TU Ostrava, January 23-25, 2006, pp. 1-13 (CD-ROM).

[Ogilvie & Callan 2003] P. Ogilvie and J. Callan. Combining document representations for known item search. Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 143-150, Toronto, Canada, 2003.

[Ott & Larsen 1992] Ott, R. Lyman and Larson, Richard (1992). Statistics: A Tool for the Social Sciences. PWS-Kent Publishing Company, Boston.

[Paice 1984] C. Paice. Soft evaluation of Boolean search queries in information retrieval systems. Butterworth Publishers, Stoneham, MA, USA, pp. 33-42, 1984.

[Pearce & Nicholas 1996] C. Pearce and C. Nicholas. TELLTALE: Experiments in a dynamic hypertext environment for degraded and multilingual data. Journal of the American Society for Information Science, 47(4):263-275, 1996.

[Porter 1980] M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.

[Porter 1997] M. Porter. An algorithm for suffix stripping. In K. S. Jones and P. Willett, editors, Readings in Information Retrieval, pp. 313-316. Morgan Kaufmann Publishers Inc., 1997.

[Praks et al 2004] Pavel Praks, Libor Machala and Vaclav Snasel. On SVD-Free Latent Semantic Indexing for Iris Recognition of Large Databases. Fifth International Workshop on Multimedia Data Mining (MDM/KDD'04), in conjunction with ACM SIGKDD Tenth International Conference on Knowledge Discovery and Data Mining, pp. 67-71, Seattle, USA, 2004.

[Praks et al 2003] Praks P., Dvorsky J., Snasel V. Latent Semantic Indexing for Image Retrieval Systems. SIAM Conference on Applied Linear Algebra, Williamsburg, VA, USA, 2003.

[Qin et al 2005] Bing Qin, Ting Liu, Zhang Yu, Sheng Li. Research on Multi-Document Summarization Based on Latent Semantic Indexing. Journal of Harbin Institute of Technology, China, 2005, pp. 91-94.

[Rehder et al 1998] B. Rehder, M. Schreiner, M. Wolfe, D. Laham, T. Landauer, and W. Kintsch. Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25:337-354, 1998.

[Rijsbergen 1979] C. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.

[Rilo 1995] E. Riloff. Little words can make a big difference for text classification. 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 130-136, 1995.

[Rose et al 1990] K. Rose, E. Gurewitz, and G. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11:589-594, 1990.

[Salton 1971] G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, New Jersey, 1971.

[Salton 1989] G. Salton. Automatic text processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, Reading, MA, 1989.

[Salton & Buckley 1987] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval.
Technical Report, Cornell University, USA, 1987.

[Salton & McGill 1983] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Publishing Company, New York, 1983.

[Saul & Pereira 1997] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Claire Cardie and Ralph Weischedel, editors, Second Conference on Empirical Methods in Natural Language Processing, pages 81-89. Association for Computational Linguistics, Somerset, New Jersey, 1997.

[Schutze & Silverstein 1997] H. Schütze and C. Silverstein. Projections for efficient document clustering. In ACM/SIGIR, pages 74-81, 1997.

[Singhal 2008] Amit Singhal. Modern Information Retrieval: A Brief Overview. Google, Inc. http://pages.cs.wisc.edu/~anhai/, access time: Jan 2008.

[Smith 2003] Lindsay I. Smith. A Tutorial on Principal Component Analysis. Unpublished document, downloaded from kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf, 9 Sep 2007.

[Spigel 1988] Spiegel, Murray R. (1988). Theory and Problems of Statistics. 2nd ed. Schaum's Outline Series, McGraw-Hill.

[Spoerri 2007] Anselm Spoerri. A Visual Tool for Information Retrieval. Ph.D. research and thesis at MIT, 1995. http://www.scils.rutgers.edu/~aspoerri/InfoCrystal/InfoCrystal.htm, access time: Sep 2007.

[Strang 1998] G. Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, 3rd edition, 1998.

[SVDA 2008] Singular value decomposition. http://www.mathworks.com/access/helpdesk/help/techdoc/ref/svd.html, access time: Dec 2008.

[STLR 2007] Standard Template Library Resources. http://www.sgi.com/tech/stl/, access time: Dec 2007.

[Taha & Gosh 1996] I. Taha and J. Gosh. Characterization of the Wisconsin breast cancer database using a hybrid symbolic-connectionist system. Technical Report UT-CVISS-TR-97-007, the Computer and Vision Research Center, University of Texas, 1996.

[Triola 2006] Triola, Mario F. (2006). Elementary Statistics. 3rd ed. Pearson, Addison-Wesley.

[TREC] The Text REtrieval Conference (TREC), http://trec.nist.gov/

[Triola 2006] Mario F. Triola. Elementary Statistics. 10th edition (2006), Addison Wesley.

[Turtle & Croft 1991] H. Turtle and W. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), 1991.

[Wegner 2010] Wegner, T. (2010). Applied Business Statistics. 2nd ed. Juta Academic.

[Wolfe & Goldman 2003] Wolfe, M. B., Goldman, S. R. Use of latent semantic analysis for predicting psychological phenomena: two issues and proposed solutions. Behavior Research Methods, Instruments and Computers, 2003, 35(1):22-31.

[Wiemer-Hastings et al 1999] P. Wiemer-Hastings, K. Wiemer-Hastings, and A. Graesser. How Latent is Latent Semantic Analysis? In Proceedings of the 16th International Joint Congress on Artificial Intelligence, pp. 932-937, Morgan Kaufmann, San Francisco, 1999.

[Wikipedia] Wikipedia, the free encyclopedia, http://en.wikipedia.org/

[Waller & Kraft 1979] W. Waller and D. Kraft. A mathematical model of a weighted Boolean retrieval system. Information Processing and Management, 15:235-245, 1979.

[Wolberg & Mangasarian 1990] W. Wolberg and O. Mangasarian. Multisurface method of pattern separation for medical diagnosis applied to breast cytology.
In Proceedings of the National Academy of Sciences, volume 87, pp. 9193-9196, 1990.
[Wu & Zhou 2002] J. Wu and Z. Zhou. Face recognition with one training image per person. Pattern Recognition Letters, 23(14):1711-1719, 2002.
[Zeng 2005] Jia Zeng. Information Retrieval by Multi-Perspective Representation. Thesis (M.Sc.), University of Northern British Columbia, 2005.
[Zimmerman 1991] H. Zimmerman. Fuzzy Set Theory and its Applications. Kluwer Academic Publishers, 2nd edition, 1991.

Appendix A

1. Stop Words List: Generated Stop Words List (University of Northern British Columbia)

apr-31 aug-44 dec-66 feb-92 11-apr 15-aug 6-dec a al alO all al2 al3 al4 al5 al6 al8 al9 a2 a20 a21 a22 a23 a24 a25 a26 a27 a28 a29 a3 a30 a300 a301 a31 a310 a32 a320 a320s a321 a33 a330 a330s a340 a340s a35 a36 a4 a41 a43 a5 a6 a7 a8 a9 aa aaa aaaa aaaah aable aachen aad aaf aafitink aagla aah aahed aahing aahs aai aar aas ab aba abb abc abl about above abt ac4 aca acc accordingly ace ach acn acoc across ad ada adl adm adn ae afc afi afl after afterwards ag again against agn ago ah ahl ahm ak al ala alf all allows aim almost alone along alp already also although always am ama amc ami amo among amongst amp an ana anc and ang anh ani ann another any anyawi anybody anyhow anyone anything anyway anywhere ap apart appear appropriate ar are around as asa ash aside aso associated ast asu at att av ava available aw away awfully b bl blO blOO bl2 b2 b210 b3 b4 b5 b52 b6 bl b747 b8 b9 ba ba2 ba3 ba4 ba6 baa baal baa2 baa3 back bbdo be became because become becomes becoming been before beforehand behind being below ben ber beside besides best better
go gone goo good got gq graphic graphics great grp gte gto gtp gu h hi hi 1 h 13 hi 8 h2 h20 h2a h2o h3 h4 h5 h6 h7 ha haa had hadn hardly has have haven having he headline hem hence her here hereafter hereby herein hereupon hers herself hi him himself his hither hj hmmm hmmmmm ho hoo how howbeit however hoy ht hu hy ibf iby ic ida ie if ignored ii iii ik il im ima immediate in ina inasmuch inc indeed indicate ing inner insofar instead into inward ip q ir iri irk is isl ! isn it its itself iv ivo ix j jl jio j2 j3 j4 j5 j6 j8 ja jc ji jo ji" just k kl kin k2 k2r k9 ka ka7 kan kao kee keep kept kg2 kg7 kg8 khl kh2 kh7 kh8 know knx ko ky 1 10 12 14 15th 16 17 185 1983 1985 1986 1987 1989 1990 19th la laO10189 la010190 laO 10289 laO 10290 la010389 la010390 laO10489 laO10490 la010589 laO10590 laO 10689 laO10690 laO10789 laO10790 laO 10889 laO 10890 laO 10989 laO 10990 laOl 1089 laOl 1090 laO11189 laOl 1190 laOl1289 laO11290 laO11389 laO 11390 laO 11489 laOl 1490 laO11589 laOl 1590 laO 11689 laOl 1690 laO11789 laOl 1790 laOl 1889 laOl 1890 laO11989 laO 11990 laO12089 laO12090 la012189 la012190 laO12289 laO12290 la012389 laO12390 laO12489 laO12490 laO12589 laO12590 laO12689 laO12690 laO12789 laO12790 laO12889 la012890 laO12989 laO12990 la013089 la013090 la013189 laO13190 la020189 la020190 la020289 la020290 la020389 la020390 la020489 la020490 la020589 la020590 la020689 la020690 la020789 la020790 la020889 la020890 la020989 la020990 la021089 la021090 la021189 la021190 la021289 la021290 la021389 la021390 69 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Ia021489 la021490 la021589 la021590 la021689 la021690 la021789 la021790 la021889 la021890 la021989 la021990 la022089 la022090 la022189 la022190 la022289 la022290 la0223 89 la022390 la022489 la022490 la022589 la022590 la022689 la022690 la022789 la022790 la022889 la022890 la030189 la030190 la030289 la030290 la030389 la030390 la030489 la030490 la030589 la030590 la030689 la030690 la030789 la030790 la030889 la030890 Ia030989 la030990 la031089 la031090 la031189 la031190 la031289 la031290 la031389 la031390 la031489 la031490 la031589 la031590 la031689 la031690 la031789 la031790 la031889 la031890 la031989 la031990 la032089 la032090 la032189 la032190 la032289 la032290 la032389 la032390 la032489 la032490 la032589 la032590 la032689 la032690 la032789 la032790 la032889 la032890 la032989 la032990 la033089 la033090 la033189 la033190 la040189 la040190 la040289 la040290 la040389 la040390 la040489 la040490 la040589 la040590 la040689 la040690 la040789 la040790 la040889 la040890 la040989 la040990 la041089 la041090 la041189 la041190 la041289 la041290 la041389 la041390 la041489 la041490 la041589 la041590 la041689 la041690 la041789 la041790 la041889 la041890 la041989 la041990 la042089 la042090 la042189 la042190 la042289 la042290 la042389 la042390 la042489 la042490 la042589 la042590 la042689 la042690 la042789 la042790 la042889 la042890 la042989 la042990 la043089 la043090 la050189 la050190 la050289 la050290 la050389 la050390 la050489 la050490 la050589 la050590 la050689 la050690 la050789 la050790 la050889 la050890 la050989 la050990 la051089 la051090 la051189 la051190 la051289 la051290 la051389 la051390 la051489 la051490 la051589 la051590 la051689 la051690 la051789 la051790 la051889 la051890 la051989 la051990 la052089 la052090 la052189 la052190 la052289 la052290 la052389 la052390 la052489 la052490 la052589 la052590 la052689 la052690 la052789 la052790 la052889 la052890 la052989 la052990 la053089 la053090 la053189 la053190 
la060189 la060190 la060289 la060290 la060389 la060390 la060489 la060490 la060589 la060590 la060689 la060690 la060789 la060790 la060889 la060890 la060989 la060990 la061089 la061090 la061189 la061190 la061289 la061290 la061389 la061390 la061489 la061490 la061589 la061590 la061689 la061690 la061789 la061790 la061889 la061890 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Ia061989 la061990 la062089 la062090 la062189 la062190 la062289 la062290 la062389 la062390 la062489 la062490 la062589 la062590 la062689 la062690 la062789 la062790 la062889 la062890 la062989 la062990 la063089 la063090 la070189 la070190 la070289 la070290 la070389 la070390 la070489 la070490 la070589 la070590 la070689 la070690 la070789 la070790 la070889 la070890 la070989 la070990 la071089 la071090 la071189 la071190 la071289 la071290 la071389 la071390 la071489 la071490 la071589 la071590 la071689 la071690 la071789 la071790 la071889 la071890 la071989 la071990 la072089 la072090 la072189 la072190 la072289 la072290 la072389 la072390 la072489 la072490 la072589 la072590 la072689 la072690 la072789 la072790 la072889 la072890 la072989 la072990 la073089 la073090 la073189 la073190 la080189 la080190 la080289 la080290 la080389 la080390 la080489 la080490 la080589 la080590 la080689 la080690 la080789 la080790 la080889 la080890 la080989 la080990 la081089 la081090 la081189 la081190 la081289 la081290 la081389 la081390 la081489 la081490 la081589 la081590 la081689 la081690 la081789 la081790 la081889 la081890 la081989 la081990 la082089 la082090 la082189 la082190 la082289 la082290 la082389 la082390 la082489 la082490 la082589 la082590 la082689 la082690 la082789 la082790 la082889 la082890 la082989 la082990 la083089 la083090 la083189 la083190 la090189 la090190 la090289 la090290 la090389 la090390 la090489 la090490 la090589 la090590 la090689 la090690 la090789 la090790 la090889 la090890 la090989 la090990 la091089 la091090 la091189 la091190 la091289 la091290 la091389 la091390 la091489 la091490 la091589 la091590 la091689 la091690 la091789 la091790 la091889 la091890 la091989 la091990 la092089 la092090 la092189 la092190 la092289 la092290 la092389 la092390 la092489 la092490 la092589 la092590 la092689 la092690 la092789 la092790 la092889 la092890 la092989 la092990 la093089 la093090 lal00189 lal00190 lal00289 lal00290 lal00389 lal00390 lal00489 lal00490 lal00589 M00590 lal00689 lal00690 lal00789 lal00790 lal00889 lal00890 lal00989 lal00990 lal01089 lal01090 lalOl189 lal01190 lal01289 lal01290 lal01389 lal01390 lal01489 lal01490 lalOl589 lal01590 lal01689 lal01690 lal01789 lal01790 lalOl889 lal01890 lal01989 lal01990 lal02089 lal02090 lal02189 lal02190 71 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Ial02289 lal02290 la102389 lal02390 lal02489 lal02490 lal02589 lal02590 lal02689 lal02690 lal02789 lal02790 lal02889 la102890 lal02989 lal02990 lal 03089 la103090 lal03189 lal03190 lal10189 lal10190 lal10289 lal10290 lal10389 lal10390 lal10489 lal10490 lal10589 lal10590 lal10689 lal10690 lal10789 lal10790 lal10889 lal10890 lal10989 lal10990 lal11089 lal11090 lal11189 lal11190 lal11289 lal11290 lal11389 lal11390 lal11489 lal11490 lal11589 lal11590 lal 1689 lal 1690 lal 1789 lal 1790 lal 1889 lal 1890 lal 1989 lal 1990 lal 2089 lal 2090 lal 2189 lal 2190 lal 2289 lal 2290 lal 2389 lal 2390 lal 2489 lal 2490 lal 2589 lal 2590 lal 2689 lal 2690 lal 2789 lal 2790 lal 2889 lal 2890 lal 2989 lal 2990 lal 3089 lal 3090 lal20189 lal20190 lal20289 la120290 lal20389 lal20390 lal20489 la120490 lal20589 la120590 lal20689 lal20690 lal20789 lal20790 lal20889 lal20890 lal20989 lal20990 lal21089 lal21090 lal21189 lal21190 lal21289 lal21290 lal21389 lal21390 lal21489 lal21490 lal21589 lal21590 lal21689 lal21690 lal21789 lal21790 lal21889 lal21890 lal21989 lal21990 lal22089 lal22090 lal22189 lal22190 lal22289 lal22290 lal22389 lal22390 lal22489 lal22490 lal22589 lal22590 lal22689 lal22690 lal22789 lal22790 lal22889 lal22890 lal22989 la122990 lal23089 lal23090 lal23189 lal23190 la2 la23 la89 la90 last latter latterly le least length less lest let li life like little 11 lo long loo lot lou Ip Is It ltd lx iy m ml in 16 m2 m3 m4 m5 m6 m62 m71 m78 ma maa made mae make mal man many may mc mca mcb md mdc me mea meanwhile mee men mfg mg mi might mk mm mme mo moc moo more moreover most mostly mott mr ms msg mt mu much must mvp mx my myself n nl n2 n5 n6 na na3 na6 name namely ncols ncr ne near necessary neither nev never nevertheles s new next nfc 72 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. nfl ng ni nine nl nlrb nmb no nobody nom non none noone nor normally not nothing nov novel now nowhere nr nu o oat oc occ och od oda odd of off oft often oh °j ol old on once one ones only onto ooh ooo 0000 ooooo ooz op opt or ora ord ot other others otherwise ou ought our ours ourselves out outside over overall ow own P Pi p2 pa page part particular particularly pb pba pc pcb pcc pel pep pe people per perhaps pfc Pg ph pic Pj placed pic please plo plus possible PP pr pre pre pro probably provides Pt pta ptl q ql06 qb qd2 qd3 qd4 qd7 qd8 qe2 qe4 qe5 qe6 qe7 qe8 qed qf2 qf5 qf7 qg4 qg5 qg6 qh4 qi qtr que quite r rl r2 r2d2 r3 ra ra3 ra6 rather rb rbi rc rda re really relatively respectivel y right rk ro roo rowrule rv s si sl630 slw s2 s358 s4 s605 said same sb sba sbk sc sea see scca see scr sdg sdy se second secondly section see seem seemed seeming seems self selves sensible sent serious seven several shall she shoo should si since six smc smu so some somebody somehow someone something sometime sometimes somewhat somewhere soo sou specified specify specifying spn sq sr ss ssi ssy st state still stu su sub subject such sup sw sy syd t t4 ta table tablecel tablecell tablerow take taken tcu td 73 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
te text than that the their theirs them themselves then thence there thereafter thereby therefore therein thereupon these they thi third this thorough thoroughly those though three through throughout thru thus thy time tmc tnt to together too tot toward towards ts tu twa twice two type u u2 u60 ua uc ul ulf un unc und under unless uno until unto up upa upon us use used useful uses usf using usually ut uw V v20 v2500 v6 v8 v8v v91 v9w va value various ve very vh vi via vii viii vill vin viz vs vt vu vw vy w w6 w8 wa was wasn way wbc web wdm we wee well went were what whatever whatnot when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why wig will with within without wk wok word words work world would wouldn wow wr wt wu wuz wwii x6 xi xiii xiv xix xr xt xtra XV xvi xxi xxiii xxiv XXV xxvii XXX x2 xyz y y95 yap ye year years yet yo yoo you your yours yourself yourselves yve z z28 z90 zac zero zx zz zzzz alone along already also although always am among X 2. Stop Words List: University of Tennessee a about above accordingly across after afterwards again against all allows almost 74 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. amongst an and another any anybody anyhow anyone anything anywhere apart appear appropriate are around as aside associated at available away awfully b back be became because become becomes becoming been before beforehand behind being below beside besides best better between beyond both brief but by c came can cannot cant cause causes certain changes CO come consequentl y contain containing contains correspondi ng could currently d day described did different do does doing done down downwards during e each eg eight either else elsewhere enough et etc even ever every everybody everyone everything everywhere ex example except f far few fifth first five followed following for former formerly forth four from further furthermore g get gets given gives go gone good got great h had hardly has have having he hence her here hereafter hereby herein hereupon hers herself him himself his hither how howbeit however i ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is it its itself j just k keep kept know 1 last latter latterly least less lest life like little long ltd m made make man many may me meanwhile men might more moreover most mostly mr much must my myself n name namely near necessary neither never nevertheless new next nine no nobody none noone nor normally not nothing novel now nowhere o of off often oh old on once one ones only onto or other others otherwise ought our ours ourselves out outside over 75 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
overall own P particular particularly people per perhaps placed please plus possible probably provides q que quite r rather really relatively respectively right s said same second secondly see seem seemed seeming seems self selves sensible sent serious seven several shall she should since six so some somebody somehow someone something sometime sometimes somewhat somewhere specified specify specifying state still sub such sup t take taken than that the their theirs them themselves then thence there thereafter thereby therefore therein thereupon these they third this thorough thoroughly those though three through throughout thru thus time to together too toward towards twice two u under unless until unto up upon us use used useful uses using usually V value various very via viz vs w was way we well went were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without work world would X y year years yet you your yours yourself yourselves z zero 3. Stop Words List: University of Glasgow 76 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant CO computer con could couldnt cry de describe detail do done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fify fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheles s next nine no nobody none noone nor not nothing now nowhere of off often on once one only onto or other others otherwise our ours ourselves out over own part per perhaps please put rather re same see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten than that the their them themselves then thence there thereafter thereby 77 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under until up upon us very via was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you your yours yourself yourselv Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Appendix B

Source Code: TREC File Extractor

Download and install the following XAMPP software:
http://www.apachefriends.org/download.php?xampp-win32-1.6.6a-installer.exe
Click on the EXE and follow the easy steps (please check "Install Apache as Service" and "Install MySQL as Service"). It will create a PHP/MySQL and Apache environment on the local PC. Please choose a short name for the installation directory (for example, "c:\php" or "c:\apache" are short enough to access easily from code).

How to install: Extract "TREC file spliter.zip". Put all the files under a webroot directory. As you are using XAMPP (Apache Friends), the webroot directory is "htdocs". Please create a directory (e.g., dp) under the directory "c:\php\htdocs" and put all files there (e.g., c:\php\htdocs\dp). Then open the ".htaccess" file using WordPad. This assumes your webroot is "C:/php/htdocs/" and "dp" is the project directory where you keep all files. Now type this address in your browser: http://localhost/dp/
user: ankzbd
password: unbcca

TIPS: In the "dp" directory there are two other directories. Copy all large files into "new_input", go to http://localhost/dp/ to run the program, and then check the "new_output" directory for the output files. This avoids having to upload and download the files through the browser.

Version 1.0

Main.php
/*
1. Creates the initial GUI
2. Calls show_input_files.php
3. Calls show_output_files.php
4. Calls upload_file.php
*/
<table align="left" width="90%">
<tr>
<td><img src="images/unbc_logo.gif" border="0" align="absmiddle"></td>
<td>&nbsp;</td>
</tr>
<tr bgcolor="#fcfcfc">
<td colspan="2" align="center" class="header"><h2>Data Pre-processing</h2></td>
</tr>
<?php if(!empty($msg)) { ?>
<tr bgcolor="#C1CDCD">
<td>&nbsp;</td><td colspan="1" align="center"><?=$msg?></td>
</tr>
<?php } ?>
<tr valign="top">
<td width="15%">
<table width="90%" style="border: 1px solid #9c9c9c">
<tr>
<td><a href="?action=show_input_files" class="<?=$class1?>">Input Files</a></td>
</tr>
<tr>
<td><a href="?action=show_output_files" class="<?=$class2?>">Output Files</a></td>
</tr>
<tr>
<td><a href="?action=upload_file" class="<?=$class3?>">Upload Input</a></td>
</tr>
</table>
</td>
<td id="content">
<?php include($file_to_include); ?>
</td>
</tr>
<tr bgcolor="#fcfcfc">
<td colspan="2" align="center" class="footer">
© A N K Zaman, zaman@unbc.ca, University of Northern British Columbia.
</td>
</tr>
</table>

index.php
/*
1. Calls common.lib.php and XmlToArray.class.php
2. Switches between the following four operations:
   i. create_output_files
   ii. show_output_files
   iii. show_uploader (to upload files to be processed)
   iv. show_input_files
3. Handles some error messages and the file deletion operation (if needed)
*/
<?php
ob_start(); // Turn on output buffering
require_once("common.lib.php");
require_once("XmlToArray.class.php");
$files = array();
$action = 'showinput';
if(!empty($_GET['created']))
    $msg = "Output files created successfully";
if(!empty($_GET['action']))
    $action = trim($_GET['action']);
if(!empty($_GET['action2']))
{
    $action2 = trim($_GET['action2']);
    $file2del = $_GET['file'];
    $filetype = $_GET['type'];
    $file2del = 'new_' . $filetype . "/$file2del";
    if(unlink($file2del))
    {
        $msg = "File deleted successfully";
    }
}
if(!empty($_POST['delete_selected']))
{
    $filetype = $_POST['file_type'];
    $del_cnt = 0;
    foreach($_POST['files'] as $file2del=>$val)
    {
        if($val == 1)
        {
            $file2del = 'new_' . $filetype . "/$file2del";
            unlink($file2del);
            $del_cnt++;
        }
    }
    if($del_cnt)
    {
        $msg = "$del_cnt files deleted successfully";
    }
}
if(!empty($_FILES['input_file']['name']))
{
    $dest = "new_input/" . $_FILES['input_file']['name'];
    if(move_uploaded_file($_FILES['input_file']['tmp_name'], $dest))
    {
        $msg = 'File <' . $_FILES['input_file']['name'] . '> uploaded successfully';
    }
    else
    {
        $msg = 'Error: file <' . $_FILES['input_file']['name'] . '> was not uploaded';
    }
}
$class1 = 'unselected';
$class2 = 'unselected';
$class3 = 'unselected';
$class4 = 'unselected';
switch($action)
{
    case 'create':
        $file_to_include = "tpls/create_output_files.php";
        break;
    case 'show_output_files':
        $files = findDirFiles("new_output");
        $file_to_include = "tpls/show_output_files.php";
        $class2 = 'selected';
        break;
    case 'upload_file':
        $file_to_include = "tpls/show_uploader.php";
        $class3 = 'selected';
        break;
    case 'show_input_files':
    default:
        $files = findDirFiles("new_input");
        $file_to_include = "tpls/show_input_files.php";
        $class1 = 'selected';
}
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Data Pre-processing</title>

Show_uploader.php
/* Creates the file upload form for files to be processed, and displays the files after uploading by calling show_input_files.php */
Please Upload Your Input File
Common.lib.php (traverses the input directory)
/* Traverses the input folder, reads every input file one after another, and writes every document out as a separate file with a unique name */
<?php
function createOutputFile($input_file)
{
    // The angle-bracket string literals in this listing were lost in
    // extraction; the TREC-style <DOC>/<DOCNO> tags below are restored
    // from the dataset format and should be read as a reconstruction.
    $xml_data = file_get_contents($input_file);
    $arrayData = explode('</DOC>', $xml_data);
    // Displaying the array
    echo "<pre>";
    print_r($arrayData);
    echo "</pre>";
    if($num_docs = sizeof($arrayData))
    {
        foreach($arrayData as $v)
        {
            $doc_no = getbetweenTwo('<DOCNO>', '</DOCNO>', $v);
            $doc_no = str_replace('-', '', trim($doc_no));
            //$file_data = "<DOC>\r\n" . $v . "\r\n</DOC>";
            $file_data = "<DOC>" . strtolower($v) . "</DOC>";
            $fp = fopen("new_output/$doc_no", "w");
            fwrite($fp, $file_data);
            fclose($fp);
        }
    }
} // End of createOutputFile() function
?>

Show_input_files.php
/*
1. Displays the list of input files (file name and size)
2. Controls the deletion of input files
3. Controls the download option of input files by calling download.php
4. Calls showFile()
*/
File Name | File Size (Bytes)
View Content | Create Output Files | Delete | Download
Create_output_files.php
/* Calls the createOutputFile() function */

Show_output_files.php
/*
1. Displays the list of output files (file name and size)
2. Controls the deletion of output files
3. Controls the download option of output files by calling download.php
4. Calls showFile()
*/
<table width="100%">
File Name | File Size (Bytes)
View Content | Delete | Download
Showfile.php
/* Creates a "CLOSE WINDOW" button at the top while displaying a file's content */
CLOSE WINDOW

Download.php
/* Controls the download of input and output files; stops with an error when the requested file does not exist */
die("This resource does not exist"); // Equivalent to exit()
}
?>
"); // Equivalent to exit() } ? INTRODUCTION: Stemming and Stop Word Removal Program Version 1.0 pstem folder contains the following two folders inputs : this folder contains the input text files outputs: generated output text files will be saved here and the following files index.php : this file start to access the input folder by calling the process.php file output.php : this file helps to display the generated output files process.php: this file creates temporary array of terms and compare then with stop word list for (to keep or to delete it) then call the stemming.php for stemming. stemming.php: this file contains the code of actual Porter's stemming. stopword.txt: it contains the list of stop words that will be discarded form the input files 89 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. How to Install Download and Install following XAMPP software http://www.apachefriends.Org/download.php7xampp-win32-l.6.6a-installer.exe Click on the EXE and follow the easy steps (Please check "Install Apache as Serivce","Install MySQL as Serivce"). It will create PHP/MySQL & Apache environment for you in the local PC. Please choose a small named directory as the installation directory (for example "c:\php" or "c:\apache" names are very smalls enough to easily access from code). How to install: Extract "pstem.zip". Copy "pstem" folder and put it into webroot directory. As you are using XAMPP (apache 2 friends), the webroot directory is "htdocs". I consider your webroot is "C:/php/htdocs/" and "pstem" is your program dir where you kept all files. INDEX.PHP /* This program is free to use (and modification) for education and research. This program has developed as part of masters thesis in the area of Information Retrieval(IR). If you have any question Contact with Zaman email zaman@unbc.ca or Dr. Charles Brown email brownc@unbc.ca Computer Science Program University of Northern British Columbia */ /* This program creates a button called "CONTIONUE" and call the process.php program segment.*/ Stop Word Removal and Porter's Stemming :: Step l Please place all input files in dir "inputs" and then press "Continue" button.

PROCESS.PHP
/*
This program is free to use (and modify) for education and research. It was developed as part of a master's thesis in the area of Information Retrieval (IR). If you have any questions, contact Zaman (zaman@unbc.ca) or Dr. Charles Brown (brownc@unbc.ca), Computer Science Program, University of Northern British Columbia.
*/
/*
This program segment calls stemming.php, reads the input files, removes stop words, performs stemming, and creates new files with the same names as the input files.
*/
Step 2 :: Processing Files. Please wait while processing ...
"; include('stemming.php'); $obj = new PorterStemmerQ; $stop_words = array(); $dir_files = array(); $dir_files = findDirFiles("inputs"); if(!sizeof($dir_files)) { echo "
No input files found. Go again."; exit; back and try } $file_index = 1; if(!empty($_GET['file_index'])) { $file_index = $_GET['file_index']; } 91 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. if(!empty($dir_files) && $file_index <= sizeof($dir_files)) $inp = $dir_files[$file index - 1]; $input_file = "inputs/$inp"; $file_data = file_get_contents($input_file); $stop_w°rds = file('stopword.txt'); array_walk($stop_w°rds, 'trim_value'); foreach($stop_words as $word) Sword = trim($word); //cleans every word with extra space, tab, new //line etc $words = array(" $word "\n$word ", "\r\n$word " $word\r\n", "\n$word\n", "\r\n$word\r\n"); Sreplacements = array('"); $file_data = str_ireplace($words, $replacements, //replaces stop words with NULL character ", " $word\n", $file_data); } // Implementing Porter's Stemmer $file_data = preg replace callback('/(\w+)/i', $file_data); 'replace_fn', /* preg_replace_callback() - returns an array if the subject parameter is an array, or a string otherwise.If matches are found, the new subject will be returned, otherwise subject will be returned unchanged. */ $output_file = "outputs/$inp"; $fp = fopen($output_file, "w"); fwrite($fp, $file_data); fclose($fp); echo "$file_index input files processed so far."; $file_index++; echo ""; } else { 92 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. echo "
Processing Finished. ."; href=output.php>See output } function trim_value(&$v) //Referencing the word to be trimmed { $v = trim($v); } function replace_fn($matches) //Calling stem() function from stemming.php { global $obj; $matches[l] = $obj->Stem($matches[l]); return $matches[l]; } function findDirFiles($path) //to access and organize inputs directory { $dir = opendir ($path); $files = array(); while ($file = readdir ($dir)) { if (($file = or ($file == { continue; } if (filetype("$path/$file") == "dir") { } else { $files[] = $file; } } //End of while closedir(Sdir); sort($files); return $files; }//End of function findDirFilesQ 93 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. ?> OUTPUT. PHP /* *This program is free to use (and modification) for education and research. This program has developed as part of masters thesis in the area of Information Retrieval(IR). *If you have any question Contact with *Zaman email zaman@unbc.ca *or *Dr. Charles Brown email brownc@unbc.ca *Computer Science Program *University of Northern British Columbia */ /* This program segment organizes (by sorting) and displays the output files*/ Step 3:: Output"; $outputs = array(); Soutputs = findDirFiles('outputs'); if(sizeof($outputs)) { sort($outputs); foreach($outputs as $filename) { echo '
        echo $filename . '<br>'; // prints each output file name; the markup originally wrapped around it was lost in extraction
    }
}
else
{
    echo "No output files found.";
}
function findDirFiles($path) // to access and organize the outputs directory
{
    $dir = opendir($path);
    $files = array();
    while ($file = readdir($dir))
    {
        if (($file == ".") or ($file == ".."))
        {
            continue;
        }
        if (filetype("$path/$file") == "dir")
        {
        }
        else
        {
            $files[] = $file;
        }
    } // End of while
    closedir($dir);
    sort($files);
    return $files;
} // End of function findDirFiles()
?>

Appendix C

General Text Parser [University of Tennessee][Giles et al 2001]

GTP has a command-line interface of the form

gtp filename -c common_word_file -t tempdir [options]

where filename must be the first argument. The -c and -t specifiers must exist, but they may appear in any order. The options are listed in the following table:

Options | Explanation
-i | Index of input files
-c | Common words file (stop word file)
-w | Specify a custom weighting scheme. Local and global refer to local and global weighting formulas. Local can be tf (term frequency), log, or binary. Global can be normal, idf, or entropy.
-t | Temp directory (where temporary working files are stored)
-z | Perform singular value decomposition
-R | Specify a name for the current run of GTP
-h | Create the Harwell-Boeing compressed matrix
-u | Keep the Harwell-Boeing compressed matrix in an uncompressed file (on output) if the matrix is created
-O | Specify that the output file is to be in one binary file for SVD (keys.gdbm)
-o | Specify that the key, id#, global frequency, ... (e.g., freqtable.rtf)

Table B.1: GTP Options [Giles et al 2001]

The first argument, filename, is the name of the file or directory to be parsed. If filename is actually a directory, gtp traverses this directory and all subdirectories in a recursive fashion and parses each regular file it encounters. If filename is a single file, gtp simply parses only it. Gtp moves sequentially through each file, extracting keys/terms comprised of relevant characters, and ignoring keys contained in the common word list specified by the argument -c common_word_file (the set of stop words). As gtp tokenizes the terms, it associates them with a number. The number is incremented sequentially in the order the terms are encountered. After tokenizing the keys and associating each one with the document it was extracted from, gtp begins calculating term weights. The (global) weights of the terms are computed over the collection of documents; if no global weighting scheme is specified, each term carries a global weight of 1. By default, only a local weight is assigned, and this is simply the frequency with which the term appears in a document. Two thresholds exist for term frequencies: global and local. By default, the global and local thresholds are both 1. A term must appear more than one time in the entire collection AND in more than one document in the collection before it will be weighted. With the use of different options, GTP produces several other files in varying formats. The output file, created by specifying the -O option, is a binary file that contains all the vector (term and document) and singular value information produced by the SVD.
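As an illustration, a plausible run over the extractor's "new_output" directory from the previous appendix, using the tailored stop word list, might look like the following; the run name is hypothetical, and because this appendix does not show the exact argument form expected by -w, no weighting option is included:

gtp new_output -c stopword.txt -t temp -z -R la_times_run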
The layout of the output file is as follows: header information consisting of the number of terms parsed, the number of documents encountered, and the number of factors used; term vectors for all terms parsed; document vectors for all documents parsed; and the singular values computed. The following table presents the files/folders generated by GTP:

File name | Type | Description
RUN SUMMARY | | Summary of options used (listed in Table 3)
LAST RUN | | Summary of options used on the most recent GTP run (listed in Table 3)
Output | Binary | Vector information generated by SVD
keys.gdbm | DBM | Database of keys generated (http://gnuwin32.sourceforge.net/packages/gdbm.htm)
Rawmatrix.rtf | | Raw term-by-document matrix
matrix.hb.rtf | | Contains the Harwell-Boeing matrix
lao2 | | Summary of the SVD calculation
nonz | | Total terms in the document collection, including those that are not parsed
index.i.rtf | | Index of input files
temp | Directory | Where temporary working files are stored
Frequency.rtf | | Term, serial, raw_global_frequency, #files_present_the_term, global_weight (idf2) [the idf2 formula is found in the files wgt.h and weight.cc in the folder src\wingtp\]

Table B.2: GTP Generated Files/Folders [Giles et al 2001]

Query Processing with GTPQUERY

GTPQUERY relies completely on files produced by a standard GTP run. Those files are output, keys (keys.gdbm), and LAST RUN. Query processing will fail if these files are not in the working directory. Gtpquery has a command-line interface of the form

gtpquery filename -c common_word_file [options]

where filename must be the first argument. The -c specifier must exist, but it may appear anywhere in the argument order. The first argument, filename, is the name of the file or directory containing the queries to be parsed. If filename is actually a directory, gtpquery traverses this directory and all subdirectories in a recursive fashion and parses each regular file it encounters. If filename is a single file, gtpquery simply parses only it. Gtpquery moves sequentially through each file, extracting keys comprised of relevant characters and ignoring keys contained in the common word list specified by the argument -c common_word_file. By default, only keys that begin with the characters A-z will be parsed. Keys beginning with a digit (0-9), with the exception of numbers that could be interpreted as dates in the 1700s, 1800s, and 1900s, will be ignored up to the next whitespace character. By default, only the characters A-z and 0-9 are considered characters to be included as part of the remainder of a key. The other options are listed below:

-help | Summarize options.
-S | Scale the query vector by the singular values before calculating cosine similarity.
-n | Set the number of factors to use (the default is the value of nfact found in the LAST RUN file; the LAST RUN file is a shortened version of the RUN SUMMARY file and contains all the parameters used in your last gtp run).
-k# | Set the number of results returned (default is all).
-P | Do not print query completion messages to standard output.
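Continuing the hypothetical example above, with the files from that gtp run in the working directory, a directory of query files named queries could be processed, scaling by the singular values and returning the ten best matches per query (this assumes the -k# option is written as, e.g., -k10):

gtpquery queries -c stopword.txt -S -k10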
The number represents the corresponding number id of the query that was performed. The file contains a list of document ids and cosine similarity measures organized by most relevant to least relevant. GTP is a software package that provides text parsing of small to large document collections and matrix decomposition for use in information retrieval applications. The brief summery of GTP is presented below(i) Parse text (single files or directories), (ii) Construct sparse matrix data structures (with choices of different term weighting strategies), (iii) Perform selected matrix decompositions for the representation of terms, documents, and queries in a reduced-rank vector space, and (iv) Convert user-supplied natural language queries into appropriate query vectors for cosine-based matching against term and/or document vectors in that vector space. GTP is a public domain software, which (includes documentations) is freely available for anyone to download form the following linkhttp://www.cs.utk.edu/~lsi 99 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Appendix D T-Test Calculations Notes Output Created 20-Aug-2009 16:17:02 Comments Input Data Wpg-uni-fs01\zaman\TSProfile\Desktop\data.sav Active Dataset DataSetl Filter Weight Split File 10 N of Rows in Working Data File Missing Value Handling Definition of Missing User defined missing values are treated as missing. Cases Used Statistics for each analysis are based on the cases with no missing or out-ofrange data for any variable in the analysis. T-TEST PAIRS=System1 Systeml Syntax System1 WITH system2 system3 system4 (PAIRED) /CRITERIA=CI(.9500) /MISSING=ANALYSIS. Resources Processor Time 0:00:00.000 Elapsed Time 0:00:00.000 [DataSetl ] \\pg-uni-fs-01\zaman\TSProfile\Desktop\data.sav 100 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Paired Samples Statistics Mean Pair 1 Pair 2 Pair 3 Std. Deviation N Std. Error Mean System1 .080710 10 .0381119 .0120520 system2 .074230 10 .0306572 .0096946 Systeml .080710 10 .0381119 .0120520 system3 .072560 10 .0326347 .0103200 Systeml .080710 10 .0381119 .0120520 system4 .072110 10 .0219340 .0069361 Paired Samples Correlations N Correlation Sig. Pair 1 Systeml & system2 10 .958 .000 Pair 2 Systeml & system3 10 .989 .000 Pair 3 Systeml & system4 10 .967 .000 Paired Samples Test Paired Differences Mean Std. Deviation Std. Error Mean Pair 1 Systeml -system2 .0064800 .0124390 .0039336 Pair 2 Systeml - system3 .0081500 .0075823 .0023977 Pair 3 Systeml - system4 .0086000 .0178220 .0056358 Paired Samples Test Paired Differences 95% Confidence Interval of the Difference Lower Upper df t Sig. (2-tailed) Pair 1 Systeml -system2 -.0024183 .0153783 1.647 9 .134 Pair 2 Systeml -system3 .0027259 .0135741 3.399 9 .008 Pair 3 Systeml -system4 -.0041491 .0213491 1.526 9 .161 101 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Level of significance for two-tailed t-test Degrees of Freedom (df) 1 2 3 4 ... 
GTP is a software package that provides text parsing of small to large document collections and matrix decomposition for use in information retrieval applications. A brief summary of GTP is presented below:
(i) Parse text (single files or directories),
(ii) Construct sparse matrix data structures (with choices of different term weighting strategies),
(iii) Perform selected matrix decompositions for the representation of terms, documents, and queries in a reduced-rank vector space, and
(iv) Convert user-supplied natural language queries into appropriate query vectors for cosine-based matching against term and/or document vectors in that vector space.

GTP is public domain software which (documentation included) is freely available for anyone to download from the following link: http://www.cs.utk.edu/~lsi

Appendix D

T-Test Calculations

Notes
Output Created: 20-Aug-2009 16:17:02
Input Data: \\pg-uni-fs-01\zaman\TSProfile\Desktop\data.sav
Active Dataset: DataSet1
Filter: <none>; Weight: <none>; Split File: <none>
N of Rows in Working Data File: 10
Missing Value Handling: User-defined missing values are treated as missing. Statistics for each analysis are based on the cases with no missing or out-of-range data for any variable in the analysis.
Syntax: T-TEST PAIRS=System1 System1 System1 WITH system2 system3 system4 (PAIRED) /CRITERIA=CI(.9500) /MISSING=ANALYSIS.
Resources: Processor Time 0:00:00.000; Elapsed Time 0:00:00.000
[DataSet1] \\pg-uni-fs-01\zaman\TSProfile\Desktop\data.sav

Paired Samples Statistics
       |         | Mean    | N  | Std. Deviation | Std. Error Mean
Pair 1 | System1 | .080710 | 10 | .0381119       | .0120520
       | system2 | .074230 | 10 | .0306572       | .0096946
Pair 2 | System1 | .080710 | 10 | .0381119       | .0120520
       | system3 | .072560 | 10 | .0326347       | .0103200
Pair 3 | System1 | .080710 | 10 | .0381119       | .0120520
       | system4 | .072110 | 10 | .0219340       | .0069361

Paired Samples Correlations
                           | N  | Correlation | Sig.
Pair 1: System1 & system2  | 10 | .958        | .000
Pair 2: System1 & system3  | 10 | .989        | .000
Pair 3: System1 & system4  | 10 | .967        | .000

Paired Samples Test (paired differences)
                           | Mean     | Std. Deviation | Std. Error Mean
Pair 1: System1 - system2  | .0064800 | .0124390       | .0039336
Pair 2: System1 - system3  | .0081500 | .0075823       | .0023977
Pair 3: System1 - system4  | .0086000 | .0178220       | .0056358

Paired Samples Test (95% confidence interval of the difference)
                           | Lower     | Upper    | t     | df | Sig. (2-tailed)
Pair 1: System1 - system2  | -.0024183 | .0153783 | 1.647 | 9  | .134
Pair 2: System1 - system3  | .0027259  | .0135741 | 3.399 | 9  | .008
Pair 3: System1 - system4  | -.0041491 | .0213491 | 1.526 | 9  | .161

With df = 9, only Pair 2 is statistically significant: its t = 3.399 exceeds the two-tailed 0.01 critical value of 3.25 in Table C below, while Pairs 1 and 3 (t = 1.647 and t = 1.526) fall short of the 0.05 critical value of 2.26.

Level of significance for two-tailed t-test

df        0.1     0.05    0.01    0.001
1         6.31    12.71   63.66   636.62
2         2.92    4.30    9.93    31.60
3         2.35    3.18    5.84    12.92
4         2.13    2.78    4.60    8.61
5         2.02    2.57    4.03    6.87
6         1.94    2.45    3.71    5.96
7         1.89    2.37    3.50    5.41
8         1.86    2.31    3.36    5.04
9         1.83    2.26    3.25    4.78
10        1.81    2.23    3.17    4.59
11        1.80    2.20    3.11    4.44
12        1.78    2.18    3.06    4.32
13        1.77    2.16    3.01    4.22
14        1.76    2.14    2.98    4.14
15        1.75    2.13    2.95    4.07
16        1.75    2.12    2.92    4.02
17        1.74    2.11    2.90    3.97
18        1.73    2.10    2.88    3.92
19        1.73    2.09    2.86    3.88
20        1.72    2.09    2.85    3.85
21        1.72    2.08    2.83    3.82
22        1.72    2.07    2.82    3.79
23        1.71    2.07    2.82    3.77
24        1.71    2.06    2.80    3.75
25        1.71    2.06    2.79    3.73
26        1.71    2.06    2.78    3.71
27        1.70    2.05    2.77    3.69
28        1.70    2.05    2.76    3.67
29        1.70    2.05    2.76    3.66
30        1.70    2.04    2.75    3.65
40        1.68    2.02    2.70    3.55
60        1.67    2.00    2.66    3.46
120       1.66    1.98    2.62    3.37
infinity  1.65    1.96    2.58    3.29

Table C: Level of significance for two-tailed t-test [Fisher & Yates 1995]

Publications

1. A N K Zaman and Charles Grant Brown, "Latent Semantic Indexing and Large Dataset: Study of Term-Weighting Schemes", accepted for publication in the IEEE Fifth International Conference on Digital Information Management (ICDIM 2010), Lakehead University, Thunder Bay, Canada, July 5-8, 2010.
2. A N K Zaman and Charles Grant Brown, "Information Retrieval (IR) from a Large Data Set", published in the proceedings of the International Conference on Electronics, Computer and Communication (ICECC 2008), June 27-29, 2008, University of Rajshahi, Bangladesh. ISBN 984-300-002131-3.