INFORMATION RETRIEVAL BY MULTI-PERSPECTIVE REPRESENTATION

by

Jia Zeng
B.E., Huazhong University of Science and Technology, 2003

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MATHEMATICAL, COMPUTER AND PHYSICAL SCIENCES (COMPUTER SCIENCE)

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

July 2005

Jia Zeng, 2005

Abstract

This thesis proposes an innovative document representation approach, called the multi-perspective representation (MPR) model, for textual information retrieval (IR). It assumes that a document can be observed from multiple perspectives to obtain its derivatives, and that the document can be associated with multiple representations obtained through these derived documents. We propose a generalized MPR solution that is applicable to the MPR model. We also present the differential latent semantic indexing (DLSI) method proposed in the literature, which can be considered a special application of the MPR model. Experiments conducted on standard collections, using the latent semantic indexing method and the term vector approach with both the generalized MPR solution and the traditional representation method, show that the generalized MPR solution makes a significant contribution to the retrieval performance of both IR methods. The applicability of the DLSI method is also explored, and an experiment on a benchmark database demonstrates its effectiveness.
Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgement

I  Introduction
   1  Problem Statement
   2  Our Contribution
   3  Overview

II  Background and Literature Review
   1  Introduction to Information Retrieval
      1.1  Information Retrieval vs Data Retrieval
      1.2  Information Retrieval Systems
   2  Information Retrieval Approaches
      2.1  Statistical Analysis
         2.1.1  Classical Boolean Approach
         2.1.2  Extended Boolean Approach
         2.1.3  Vector Space Approach
         2.1.4  Probabilistic Approach
      2.2  Semantic Analysis
   3  Evaluation of Information Retrieval Performance
      3.1  Precision and Recall
      3.2  E Measure
      3.3  Recall-Precision Graph
      3.4  Average Precision
   4  Special Topics
      4.1  Representation Models
      4.2  Retrieval Tasks

III  Methodology
   1  Overview of the MPR Method
   2  A Generalized MPR Solution
      2.1  Guidelines for Implementation
         2.1.1  Obtain Multi-perspective Representations
         2.1.2  Combination Strategies
      2.2  Applications
         2.2.1  LSI Retrieval System
         2.2.2  LSI-MPR Retrieval System
   3  Differential Latent Semantic Indexing Method
      3.1  Motivation
      3.2  Theory
         3.2.1  Differential Term Vector
         3.2.2  Posterior Model
         3.2.3  Geometric Interpretation
         3.2.4  Algorithm
   4  Use MPR Solutions for Classifiers

IV  Experiments
   1  The Generalized MPR Solution Field Test
      1.1  Document Collections
      1.2  Procedure
         1.2.1  LSI Retrieval System
         1.2.2  LSI-MPR Retrieval Systems
      1.3  Results and Evaluation
         1.3.1  Standard LSI vs LSI-MPR
         1.3.2  Standard Term Vector Approach vs Term Vector Approach with MPR Schemes
         1.3.3  Statistical Analysis
   2  The DLSI Approach Field Test
      2.1  Problem Statement
      2.2  Procedure
      2.3  Results and Evaluation

V  Conclusion
   1  Summary
   2  Future Work
References
Appendix

List of Tables

1  Information Retrieval vs Data Retrieval
2  Evaluation Contingency Table
3  Characteristics of TIME & ADI
4  Retrieval Performances of Term Vector Approach using Traditional Representation Model & Multi-perspective Representation Model
5  Significance Levels of t-tests on all the Samples on TIME & ADI

List of Figures

1  Framework of an Information Retrieval System
2  Recall-Precision Graph
3  LSI Retrieval System
4  LSI-MPR Retrieval System
5  Projecting a Vector
6  Retrieval Performances on TIME using LSI, LSI-MPR-A and LSI-MPR-V Methods
7  Retrieval Performances on ADI using LSI, LSI-MPR-A and LSI-MPR-V Methods
8  Significance Level of LSI/(Term Vector) Approach with MPR Schemes on TIME
9  Significance Level of LSI/(Term Vector) Approach with MPR Schemes on ADI
10  Classification Accuracy of DLSI Classifier on WBCD

Acknowledgement

I would like to thank my supervisor, Dr. Liang Chen, for his inspiration, support and guidance throughout my study and research work at UNBC. I also wish to thank Dr. Charles Brown and Dr. Jianbing Li for their encouragement and valuable advice on my thesis. I would also like to acknowledge the external reviewer for his/her willingness to serve on my thesis committee. Special thanks go to Kevin Brammer, who helped me along the way. Personally, I would like to thank my parents, without whom I could not have reached this point.

This thesis contains the contents of our previous publications on the subject (please refer to the appendix for a list of the publications). I thank all the coauthors.

Chapter I
Introduction

This chapter presents an introduction to the topic of this thesis and the proposed methodology. The organization of the rest of the thesis is provided as well.

1  Problem Statement

With the exponential development of information technologies, more and more intellectual resources have been recorded in numerical forms of information that can be digitally transmitted or processed. The evolution of communication networks has made an immense number of these data sets available in the public domain, where they can be obtained by ordinary users.
However, due to the tremendous volume of these data collections, it has become more and more difficult for users to find the information that interests them. This has led to a huge demand for information management technologies that can facilitate easy access to the information.

Information retrieval (IR) is a discipline which looks into developing algorithms that can retrieve relevant information from storage in response to a user's request. Information can be recorded in many forms: textual, image, audio, video, and so on. In this thesis, our research focuses on a major branch of information retrieval, called textual retrieval or document retrieval, which seeks to retrieve free-form natural language textual records that may satisfy the user's information needs from an organized repository. By convention, we call the textual records documents, the information requests queries, and the repository a collection or corpus.

A number of approaches have been proposed for textual retrieval in the literature. One common method for document representation and indexing is called the vector space model. It considers each document as a set of terms and represents it as a term vector in a document space. Based on this model, a series of IR methods have been developed. Most of these approaches employ the traditional representation model, i.e., each document is associated with one and only one representation.

Are there any other possible ways to represent a document besides the traditional model? How would they affect the performance of IR methods? These questions are the inspiration for our research. The observation that multi-perspective descriptions have always been beneficial in visual and acoustic recognition has led to our belief that perceiving objects from two or more perspectives is always an advantage. Our research objective was to find out whether it is helpful to observe a document from two or more perspectives, i.e., to associate the document with multiple representations.

2  Our Contribution

In this thesis, we propose a multi-perspective representation (MPR) method for document representation. Two sample solutions that are applicable to the MPR method are provided. Experiments based upon these solution strategies have been conducted and their results are reported as well. The following is a list of our contributions.

• The multi-perspective representation model we propose is an innovative document representation approach that attempts to imitate a human's manner of depicting objects, that is, to describe them from different perspectives. It is assumed that each document can be observed from multiple perspectives to obtain its multiple derivatives, called perspective documents, and the original document can be represented by all the perspective documents that are derived from it. By applying the multi-perspective representation to documents, the procedure of measuring the similarity between a query and a document is converted into a process of integrating values that indicate the degrees of similarity between the query and each of the perspective documents derived from that document.

• We have provided a generalized MPR solution, which is expected to be applicable to most IR methods that are based upon the vector space model.
Experiments have been conducted on several standard corpora using two IR methods, the latent semantic indexing (LSI) method and the term vector method, with both the generalized MPR solution and the traditional representation model. The results show that both IR methods achieve significant improvements in retrieval performance with the MPR model over the traditional model [CZT05].

• We also discovered that the differential latent semantic indexing (DLSI) method proposed in the literature exhibits attributes that fall into the category of multi-perspective representation. We have conducted experiments that employ the DLSI method to solve an IR-related problem: medical data classification. Its performance on a benchmark data set demonstrates the advantage of this approach [CZP04].

3  Overview

The purpose of this thesis is to present the multi-perspective representation model, to explore its applicability and to demonstrate its effectiveness. The following chapters outline the background, development and application of the MPR approach. Chapter II provides an overview of the basic IR background and reviews the approaches in the literature that are related to the proposed method. Chapter III presents the rationale of the MPR method and offers some guidelines for its implementation. In Chapter IV, the experimental procedures and results are presented. A summary of this thesis and a discussion of future work are provided in Chapter V.

Chapter II
Background and Literature Review

In this chapter, we provide a brief overview of the field of information retrieval in order to facilitate the introduction of the proposed method. A historical review of a number of IR approaches is presented, with emphasis on the cognate methods that are relevant to the thesis topic. Evaluation strategies that are used to estimate the retrieval performance of an IR method are also covered. As well, we discuss some special topics that are closely related to our proposed approach.

1  Introduction to Information Retrieval

In Chapter I, we stated that "Information retrieval is a discipline which looks into developing algorithms that can retrieve relevant information from storage in response to a user's request." Before we proceed, it is useful to discuss this definition and clarify a few points.

1.1  Information Retrieval vs Data Retrieval

When the term information retrieval is mentioned in this thesis, we refer to automatic information retrieval systems that can search a database for data records relating to the user's information need, and inform the user of the existence or non-existence, as well as the whereabouts, of the documents. Although the term information has a close association with the term data, they are not equivalent concepts. The field of data retrieval (DR) possesses some properties that distinguish it from information retrieval. Table 1 illustrates some differences between the two topics [vR79].

Table 1: Information Retrieval vs Data Retrieval

                        Information Retrieval    Data Retrieval
  query language        natural language         artificial language
  query specification   incomplete               complete
  matching              best match               exact match
  satisfied retrieval   relevant                 exact
  matching model        probabilistic            deterministic

In IR, the query statement is usually expressed in natural language and does not necessarily need to be complete.
On the contrary, DR usually requires the request to comply with a specified syntax and to provide as complete a description of the information need as possible. With regard to evaluating the retrieved records, IR considers the items that are relevant to the query to be good matches, and among them the most relevant one, as judged by the user, is determined to be the best match, whereas DR only regards exact matches as successful. Due to the differences between the characteristics of IR documents (unstructured records) and DR documents (structured records), information retrieval engines may, but are not guaranteed to, find all the relevant documents in storage. Even if they manage to retrieve a list of relevant records, such a list might not be complete. On the other hand, DR systems are guaranteed to output all occurrences of the records satisfying a match.

1.2  Information Retrieval Systems

An information retrieval system is a device interposed between a potential user of information and the information collection itself [Har86]. For a given information problem, the purpose of an IR system is to capture wanted items and to filter out unwanted items [Har86]. A typical IR system deals with the representation, storage and retrieval of unstructured data, and thus should contain some or all of the following parts: an indexing module, a query operation module, a matching module, an output module, a feedback module and a user interface.

The indexing component usually contains two primary processes. The first process is to conduct conceptual analysis of the information resources in the collection to obtain the concepts they contain. These concepts, usually called effective terms, make up a system vocabulary that is applicable to all the information pieces in the system. The output of the first process flows to the second stage, in which a translation mechanism is employed and a database of information representations is obtained.

When an information request is posed, the query operation process parses it into its constituent elements. An analysis is conducted of its conceptual content, and the query is transformed into a representation that is consistent with all the information items in storage. Given the query representation, the matching mechanism evaluates the relatedness of all the potential targets to the query and obtains a ranking by relevance. An ordered set of information items is then returned to the user by the output module. When interaction between the user and the information retrieval system is available, the user can communicate with the system through the feedback module by refining the query during a search session in the light of a sample retrieval. The user interface serves as a bridge connecting the client to the other modules of the system. The infrastructure of a typical IR system is depicted in Fig. 1.

Fig. 1: Framework of an Information Retrieval System (conceptual analysis and translation of both the collection and the request into representations, a database of information representations, and an evaluation strategy matching request representations against the database)
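To make the modular structure above concrete, the following is a minimal sketch of such a framework in Python. The class, its method names and the simple term-overlap matching rule are illustrative assumptions rather than the design of any system described in this thesis.

```python
# A minimal sketch of the modular IR framework of Fig. 1: indexing builds a
# database of representations, the query operation analyzes the request, the
# matching module scores candidates, and the output module returns a ranking.
from collections import Counter

class SimpleIRSystem:
    def __init__(self):
        self.representations = {}          # database of document representations

    def _analyze(self, text):
        # conceptual analysis + translation: here, just lowercase word counts
        return Counter(text.lower().split())

    def index(self, doc_id, text):
        self.representations[doc_id] = self._analyze(text)

    def _match(self, query_rep, doc_rep):
        # evaluation strategy: a simple term-overlap score
        return sum(min(query_rep[t], doc_rep[t]) for t in query_rep)

    def search(self, query, top_n=5):
        query_rep = self._analyze(query)   # query operation module
        scored = [(doc_id, self._match(query_rep, rep))
                  for doc_id, rep in self.representations.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]              # output module: a ranked list

ir = SimpleIRSystem()
ir.index("d1", "latent semantic indexing for text retrieval")
ir.index("d2", "boolean queries over structured data")
print(ir.search("semantic retrieval"))
```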
2  Information Retrieval Approaches

In the field of information retrieval, there are two major categories of techniques: statistical analysis and semantic analysis. The statistical approaches consider the free-form natural language documents to be pure data records and index them in terms of some statistical measure. The assessment of the relevance between a pair of documents is also based on a certain statistical metric. The semantic approaches, however, attempt to reproduce to some degree the understanding of the natural language text that a human may provide. This section selectively reviews some of the IR methods in the literature.

2.1  Statistical Analysis

Statistical methods break documents and queries into terms, which make up a population that is counted and measured statistically. Most commonly, these terms are the words that occur in a collection and/or a given query. Textual records often undergo pre-processing, in which words are usually converted into a standard form, e.g., lowercase. Other techniques such as stemming and stop-listing are also frequently employed in the pre-processing stage. The stemming algorithm, first introduced by Porter [Por80, Por97], extracts the root of each word in order to eliminate the variation that arises from the occurrence of different grammatical forms of the same word. By replacing the terms with their base forms, the number of terms is reduced at the same time. The objective of stop-listing is to remove the common words that have little discriminatory power. The list describing such words is usually called a stop list or negative dictionary.

In addition to using words as terms, some sophisticated methods also extract phrases as terms, where a phrase is a combination of adjacent words which may be recognized by frequency of co-occurrence in a given collection or by presence in a phrase dictionary. Some other methods break documents into n-grams, i.e., arbitrary strings of n consecutive characters [Dam95]. It is worth noting that the use of n-grams has proved to be language independent and appears to be relatively insensitive to degraded text, e.g., typos or errors due to poor print quality in optical character reader (OCR) transmission [PN96].
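As an illustration of these pre-processing steps, the sketch below lowercases the text, removes stop words, applies a crude suffix-stripping rule (a stand-in for a full Porter stemmer), and extracts character n-grams. The stop list and suffix rules are placeholder assumptions, not those used in the thesis experiments.

```python
# A minimal sketch of pre-processing: lowercasing, stop-listing, crude
# suffix stripping, and character n-gram extraction.
STOP_LIST = {"the", "a", "an", "of", "and", "to", "in", "is"}

def crude_stem(word):
    # strip a few common suffixes; a real system would use the Porter stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    words = [w.lower() for w in text.split()]
    return [crude_stem(w) for w in words if w not in STOP_LIST]

def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(preprocess("The retrieved documents and the indexing terms"))
print(char_ngrams("retrieval", 3))
```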
By far, the greatest amount of work has been devoted to the statistical approaches, which fall into four categories: classical Boolean, extended Boolean, vector space and probabilistic. The rest of this section presents some representative methods from these categories.

2.1.1  Classical Boolean Approach

The classical Boolean approach is based upon the theory of Boolean algebra. A conventional Boolean query combines terms with the classical Boolean operators AND, OR and NOT, and is evaluated by the classical rules of Boolean algebra. Such a model is very straightforward and easy to implement.

However, due to the characteristics of the standard Boolean model, the classical Boolean method encounters some major limitations in the field of information retrieval. Like any Boolean expression, the query only has two values: true or false. Correspondingly, a document is either relevant to a query or non-relevant to it. Therefore, no ranking strategy is possible. With regard to effective term weighting, only two values are available: 0 for an absent term and 1 for a present term. Such an all-or-nothing condition tends to have the effect that either an intimidatingly large number of documents, or none at all, are retrieved [Har92]. As well, the classical Boolean rules tend to produce counter-intuitive results because of this all-or-nothing characteristic. For example, in response to a multi-term OR operation, "a document containing all (or many of) the query terms is not treated better than a document containing one term" [SB88]. Similarly, in response to a multi-term AND operation, "a document containing all but one query term is treated just as badly as a document containing no query term at all" [SB88]. These features of the classical Boolean model have emerged as a considerable issue that needs to be overcome.

2.1.2  Extended Boolean Approach

As mentioned above, the classical Boolean scheme is based upon an all-or-nothing assumption and expresses the relevance relationship between documents in a crude way. For this reason, a number of extended Boolean models have been proposed in the literature [WK79, SM83, Pai84, Zim91, GCT97], trying to integrate a ranking strategy into the Boolean model. The extended Boolean models employ extended Boolean operators, also called soft Boolean operators. These operators make use of the weights assigned to the terms in each document. They also extend the truth value range from the discrete two-element set {0, 1} of the classical Boolean model to the continuous range [0, 1]. In other words, the operators evaluate their arguments to a number corresponding to the estimated degree to which the given logical expression matches the given document. By doing this, the extended Boolean methods are able to provide a ranked output, allowing some documents to satisfy the query condition more closely than others [Lee94], and thus manage to overcome the limitation of the classical Boolean approach. Experiments have shown that the extended Boolean model can achieve better IR performance than either the classical Boolean or the vector space model [Gre01]. However, there is a considerable price for this performance improvement: formulating effective extended Boolean queries involves more thought and expertise in the query domain than either the classical Boolean method or the vector space approach [Gre01].

2.1.3  Vector Space Approach

One common solution to document representation for statistical purposes is to represent each document as a set of terms. The vector space approaches have achieved great success in IR by applying the theory of linear algebra to this representation model. In the traditional vector space method, also called the term vector method, the union of the effective terms defines a document space, so that each distinct term represents one dimension of this space. For a given document, each term is assigned a numeric weight, which indicates an estimate of the usefulness of the term as a descriptor of the given document, i.e., the discriminatory power of the term for this document. By exploiting the weights of all the terms for a document, the document is then encoded as a term vector in the document space. It is worth noting that a query is usually considered to be a pseudo-document and can also be represented as a term vector.

Sometimes it is also desirable to define a term space, in which each document corresponds to one dimension. Accordingly, a term is represented by a document vector in this space. We can combine the perspectives of document space and term space by viewing the entire collection as a term-by-document matrix, also called an indexing matrix. Each row of this matrix represents a term and each column represents a document. The element m_{ij} at row i, column j reflects the importance of term i in representing the characteristics of document j. A data set of d documents and t terms can be represented by the matrix shown below.
M = \begin{pmatrix}
m_{11} & m_{12} & \cdots & m_{1d} \\
m_{21} & m_{22} & \cdots & m_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
m_{t1} & m_{t2} & \cdots & m_{td}
\end{pmatrix}

Please note that any notation defined here, e.g., d, t and M, and any abbreviation, applies throughout the rest of the text.

Term Weighting

One significant issue that any vector space model needs to consider is term weighting, i.e., how to assign a weight to a certain term for a given document so that it properly estimates the contribution this term makes to distinguishing the document from other documents. A variety of weighting schemes have been proposed, which basically fall into two categories: local weighting and global weighting.

Local weighting schemes attempt to reflect the importance of a term within a given document by document-specific statistics. Usually, the local weights assigned to the same term vary from document to document. Some popular local weighting functions include the raw term frequency, binary, and logarithm of the term frequency (or logarithm for short). Let L_{ij} denote the local weight of term i in document j, and let tf_{ij} denote the frequency with which term i appears in document j. The local weights given by the three types of schemes are evaluated as follows.

• Raw term frequency: L_{ij} = tf_{ij}

• Binary: L_{ij} = \begin{cases} 1 & \text{if } tf_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}

• Logarithm: L_{ij} = \log(1 + tf_{ij})

It is worth mentioning that the logarithm weighting scheme uses the logarithmic function to transform the raw term frequency so that it dampens the effect of large differences in frequencies.

In addition to estimating document-specific statistics, it can also be useful to characterize a term's overall importance in the whole collection. Global weighting strategies are designed to measure the distribution of a term within the given collection, and are thus able to estimate how likely a term is to occur in a certain document by chance. Generally, they give less weight to terms that occur frequently or occur in many documents, because these terms are not considered to be strong descriptors for any specific document in which they appear. Four well-known global weighting schemes are: normalized term frequency (or normal for short), inverse document frequency (or idf for short), global frequency-idf (or gf-idf for short) and entropy. Let G_i be the global weight assigned to term i, gf_i be the frequency with which term i occurs in the entire collection, df_i be the number of documents in which term i occurs, and d be the number of documents in the whole collection. The evaluation of G_i by the four types of global weighting methods is as follows.

• Normal: G_i = 1 / \sqrt{\sum_j tf_{ij}^2}

• Idf: G_i = \log(d / df_i) + 1

• Gf-idf: G_i = gf_i / df_i

• Entropy: G_i = 1 - \frac{H_i}{\log d}, where H_i = -\sum_j p_{ij} \log p_{ij} and p_{ij} = tf_{ij} / gf_i

All of the global weighting schemes share the principle of assigning less weight to terms that occur frequently or in many documents. The ways in which this is done involve variations in the relative importance of local frequency, global frequency and document frequency. The normal weighting scheme normalizes the length of the vector for a term to 1. This has the effect of giving high weight to infrequent terms; however, it depends only on the sum of the squared frequencies and not on the distribution of those frequencies per se. Gf-idf and idf are closely related because both schemes weight terms inversely by the number of different documents in which they appear. Gf-idf also increases the weight of frequently occurring terms. Neither method depends on the distribution of terms over documents, only on the number of different documents in which a term occurs. The entropy scheme is based on information-theoretic ideas and is the most sophisticated weighting scheme. It takes into account the distribution of terms over documents. The main idea is to assign small weights to terms that are equally distributed over documents and large weights to terms that are concentrated in a few documents [Dum91]. After all, each weighting scheme has its advantages and limitations, and there is no fixed rule for choosing one.

In cases where both local weights and global weights are used to measure the term weights, the value of the element in row i, column j can be evaluated as follows:

m_{ij} = L_{ij} \cdot G_i    (1)
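To make these weighting schemes concrete, the sketch below computes logarithm local weights together with idf and entropy global weights on a toy count matrix, and combines one local/global pair as in Eq. (1). The toy matrix and the choice of which pair to combine are assumptions for illustration only.

```python
# A small sketch of the local/global weighting schemes above, applied to a
# toy term-by-document count matrix tf (t terms x d documents).
import numpy as np

tf = np.array([[3, 0, 1],      # term 1 counts in documents 1..3
               [1, 1, 1],      # term 2
               [0, 5, 0]],     # term 3
              dtype=float)
t, d = tf.shape

# Local weights: logarithm scheme L_ij = log(1 + tf_ij)
L = np.log(1.0 + tf)

# Global weights
df = np.count_nonzero(tf, axis=1)                 # document frequency per term
gf = tf.sum(axis=1)                               # global frequency per term
G_idf = np.log(d / df) + 1.0                      # idf scheme
p = tf / gf[:, None]                              # p_ij = tf_ij / gf_i
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
H = -plogp.sum(axis=1)                            # entropy of each term over documents
G_entropy = 1.0 - H / np.log(d)                   # entropy scheme

# Combined term-by-document weight matrix, Eq. (1): m_ij = L_ij * G_i
M = L * G_entropy[:, None]
print(np.round(M, 3))
```

Note how the evenly distributed term 2 receives an entropy weight of 0, while term 3, concentrated in a single document, receives the full weight of 1, which is exactly the behaviour described above.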
The entropy scheme is based on information theoretic ideas and is the most sophisticated weighting scheme. It takes into account the distribution of terms over documents. The main idea is to assign httle weights to terms th at are equally distributed over documents and assign big weights to terms which are concentrated in a few documents [Dum91]. After all, weighting scheme contains advantages and limitations. There is not a fixed solution for choosing a term weighting scheme. In the cases when both local weights and global weights are used to measure the term weights, the value of the ith row, j t h column element can be evaluated as follows. TTiij — L 'lj Gi ( 1) N orm alization Another technique th at is always employed in vector space approaches is normaliza­ tion. It is based on the recognition th at the size of documents in the collection may be different from each other. In order to allow for the variation, it normahzes each term vector so th a t its length is 1 for a document of any size. 15 Sim ilarity M easurem ent We have discussed about how to obtain term vectors to represent documents (includ­ ing queries), and how to construct the representation for the entire document collection accordingly. The next step is to apply a similarity measurement th at can estimate the similarity between a pair of vectors. A practical use of this metric is to evaluate the sim­ ilarity between a pair of term vectors representing a document and a query respectively. This similarity value is used as an indicator of how relevant the document is to the query. A usual similarity measure employed in the vector space model is the inner product [Sal89]. It can be computed by multiplying one element in one of the two vectors by the corresponding element in the other vector and summing these products over all dimensions in the vector. Suppose a term vector has t dimensions. Let us denote the vector representation of a document (query) by D R and denote the fth element of the vector D R by DR 4 . The inner product of document D /s vector, denoted by D R j and query g’s vector, denoted by DR^, is defined as follows: t D R j ■ D R q = D R j ^ ■ D R q , (2) i=l Another popular metric derived from the inner product measure is called cosine sim i­ larity. Denote \DR\ to be the length of vector DR. The cosine similarity between vector D R j and DRq, denoted by cos{DRj, DRq), can be evaluated by the following equation: Two strategies are commonly used to utilize the similarity values for retrieving rele­ vant documents. One is called range queries, which is to retrieve all documents up to a distance threshold. The other one is called nearest-neighbour queries, which is to retrieve 16 the N best matches. Although we do not expect any similarity metric to be a perfect model that corresponds exactly with the human judgement of relevance, the measure­ ment should somehow be able to assign higher values to the documents with a higher proportion of the relevant text as well as assigning lower values to the documents with fewer relevant content. By integrating the ranking strategy in IR systems, the human user can restrict his/her attention to a set of documents of manageable size instead of having to go through every single document in the corpus. D iscu ssion As an efficient model, the traditional vector space scheme is becoming very popular in the IR research. 
Since it has a sound mathematical foundation, a variety of similarity measures can be developed based on this model and some of them could be transformed into a linear form. It is also accounted as an advantage th at there exists a probabilistic interpretation of this model. The traditional vector space approach provides an effective way to approximate the statistical properties of the document set, however, it is obviously an oversimplification. The m ajor problem th a t exists with this method is that it assumes all the terms are independent, orthogonal dimensions of the document space so it simply ignores the re­ lationship among terms. However, it is a fact th at there are strong associations among terms in natural languages, the above assumption is never satisfied [Hul94]. Another feature this model has th at can be a drawback in some applications is th at the number of terms, which occur in a collection can be quite large, the traditional term-based document space tends to have a large number of dimensions. Some alternative approaches based 17 on the traditional vector space model have been developed to overcome these limitations. Latent semantic indexing method is one of them. L atent Sem antic Indexing M otivation In the research of retrieving free-form natural language data, it is al­ ways useful to analyze the features of human verbal behavior. There are two issues th at are discussed the most: synonymy and polysemy. Synonymy refers to the states when two or more words or expressions have the same or nearly the same meaning in some or all senses [MW98]. Polysemy describes the cases when one word has multiple meanings. These characteristics of natural languages result in the deficiencies of some IR algorithms th at do not offer comparison methods on terms. Latent semantic indexing method was proposed in order to capture the statistical relationship among terms and accordingly provide an effective solution to the problems of synonymy and polysemy th at cannot be tackled by either word-based approaches or the traditional vector space approach. T heory Latent semantic indexing, also called the latent semantic analysis (LSA) method, was first proposed by Deerwester et al. [DDL+90]. It assumes that, in the textual data, there is some underlying latent semantic structure th at is partially obscured by the randomness of word choice with respect to retrieval. This structure can be estim ated by statistical techniques and the obscuring noise can be removed. Like the traditional vector space approach, the LSI method starts with a term-bydocument m atrix th at represents the association of terms to documents. It applies a 18 dimensional reduction scheme, singular value decomposition (SVD), on the m atrix to construct a reduced unified semantic space for retrieval. This smaller space, called L SI space, consists of derived dimensions that are assumed to convey truly independent con­ cepts. By employing the dimensional reduction strategy, LSI not only captures most of the im portant underlying semantic structure in associating terms with documents, but also has a better chance in removing the noise or possible variability in word usage. Singular V alue D ecom p osition mensional reduction scheme. Singular value decomposition is an effective di­ It is closely related to a number of mathem atical and statistical techniques th at have been widely used in other fields, such as the principal component analysis (PCA) for image processing and face recognition [TP91]. 
It has been proved to be a very good choice for uncovering latent semantic structure. (See [DDL+90] for a further discussion of SVD and the other alternative models.) It begins with an arbitrary rectangular matrix with different entries on the rows and columns. The m atrix is then decomposed into three matrices containing singular vec­ tors and/or singular values. These three matrices with special forms show a breakdown of the original m atrix into linearly independent components or factors. Many of these components are very small, leading to an approximate model th at contains many fewer dimensions. Thus, for information retrieval purposes, SVD provides a reduced model for representing the term-to-term, document-to-document and term-to-document rela­ tionships. By dimension reduction, it is possible for documents with somewhat different profiles of term usage to be mapped into the same vector of factor values [DDL+90]. This property helps to eliminate the noise in the original data, thus improving the reliability 19 of the algorithm. Suppose we obtained a t x d term-by-dociiment m atrix M from the collection indexing process of the traditional vector space method (page 12). We can apply SVD on M , which is then decomposed into three special matrices V, S and V. The decomposition can be w ritten as: Itf = (4) U is the t x t orthogonal m atrix having the left singular vectors of M as its columns, and V is the d x d orthogonal m atrix having the right singular vectors as its columns, and S is the t x d diagonal m atrix having the singular values 0 -2 > ■• • > crmin{t,d) of M in order along its diagonal. It should be noted th at for any arbitrary m atrix, such a factorization exists. (For details, see [Str98] for reference.) Generally, in Eq.4, the matrices U, S and V must all be of full rank.However, one great facilitation th at SVD offersis to allow a simple strategy foroptimal approximate fit using smaller matrices [DDL+90]. If the singular values in S are ordered by size, the first k largest values may be kept and the remaining smaller ones are set to zero. The product of the resulting matrices is a m atrix which is only approximately equal to M, and is of rank k. Since zeros were introduced into S', the representation can be simplified by deleting the zero rows and columns of S to obtain a new diagonal m atrix Sk, and then deleting the corresponding columns of U and V to obtain Uk and \4 respectively. The rank-A; model with the best possible least-squares-fit to M can be written as follows: = HkSktïT (5) where Mk is a m atrix of size t x d, Uk is of size t x k, Sk is of size k x k, and Vk is o f 20 size k X d. SVD provides an optimal solution to dimensionality reduction in th a t it derives an orthonormal space, where the dimensions are ordered. Therefore, projecting the set of documents onto the k lowest dimensions is guaranteed to have, among all possible projections to a & dimensional space, the lowest possible least-square distance to the original documents [SS97]. R etriev in g D ocu m en ts U sing LSI The process of retrieving is to sort the doc­ uments according to their similarity degrees to the query and to return a ranked list of documents to the user. This involves the consideration of representing documents and queries in the same manner and applying a certain function to estimate the similarity between them. In the LSI method, each document/ query is projected onto the LSI space th at is ob­ tained by using SVD. 
LSI then exploits the cosine measurement between the projections of a pair of term vectors in the LSI space to make comparison between the two docu­ ments. Thus, the similarity can be obtained by computing the cosine value of the angle between the document’s term vector and the query’s term vector. All the documents can be ranked according to their similarity values and an ordered set of documents wiU be retrieved. 2 .1 .4 P rob ab ilistic A pproach T h eory There is no clear hne which separates probabilistic methods from statistical methods. 21 as probabilities are often calculated on the basis of some statistical evidence. The biggest feature of probabilistic methods th at distinguishes them from other statistical methods is th at, in probabilistic methodology, formal probability theory and statistics are used to arrive at the estimates of probability of relevance by which the documents are ranked. On the other hand, in pure statistical approaches, a similarity measure is used, whose value is not necessarily directly interpretable as probabilities [CGD92]. In a probabilistic method, one usually computes the conditional probabihty P {D \R ) th a t a given document D is observed on a random basis given event R, th at D is relevant to a given query [Sal89, vR79]. Typically, queries and documents are represented by sets of terms, P {D \R ) is then calculated as a function of the probability of occurrence of these terms in relevant versus non-relevant documents [GreOl]. The term probabilities are analogous to the term weights in the vector space model and may be computed using the same statistical measures. A probabilistic formula is then employed to evaluate P{D \R) and it depends on the specific model used as well as on the assumptions made about the distribution of terms, etc. In a more general case, P{D \R) may be computed based on any clues available about the document. This has led to the idea of a staged computation, in which a probabilistic model is first applied to each composite clue (stage one), and then applied to the combi­ nation of these composite clues (stage two) [CGD92]. The logistic regression analytical m ethod is derived from this idea. Another scheme originating from this idea is the in­ ference net in which rules can be specified for combining different sources of evidence to compute a belief th at an information need has been satisfied by a given document [TG91]. Bayesian inference network model is a famous application of this scheme. 22 P rob ab ilistic Latent Sem antic Indexing Probabilistic latent semantic indexing (PLSI) is a probabilistic approach, which was proposed by Hofmann [Hof99]. He pointed out that, although LSI has been applied with remarkable success in different domains, it has some deficits due to its unsatisfactory statistical foundation. The PLSI approach is based on a statistical latent class model for factor analysis. It has a sofid statistical background since it is based on the likelihood principle and defines a proper generative model of the data [Hof99]. Moreover, the factor representation obtained by PLSI allows us to deal with polysemous words and to explicitly distinguish between different meanings and different types of word usage [Hof99]. The core of PLSI is a statistical model called aspect model [SP97]. It exploits a latent variable model for general co-occurrence data that associates an unobserved class variable z with each occurrence of a word w in a document d. Let F{d) denote the probability of selecting a document d. 
Discussion  Despite all the advantages that a truly probabilistic design methodology can offer, advocates of non-probabilistic methods regard the formulation of exact statistical assumptions as an unnecessary theoretical burden on the researcher. They maintain that the time and effort spent on such analysis would be better spent on ad hoc experimentation using formalisms looser and friendlier than probability theory [CGD92].

2.2  Semantic Analysis

Semantic analysis, also called natural language processing (NLP), refers to methods that are based on knowledge of the syntax and/or semantics of the natural language in which the document text is written. Unlike statistical approaches, which merely use statistical measurement, semantic approaches attempt to address the structure and meaning of textual documents directly. The idea of retrieving relevant information based on an understanding of the context seems very intuitive to human beings. However, in automated systems it is much more difficult to implement than the statistical methods. As a result, within the realm of IR, semantic approaches are often used as a supplement to statistical analysis, since even the best statistical method to date will fail to grasp all the features of the documents that a semantic approach may be able to discover. Some research has tried to combine semantic and statistical methods in the hope that this combination will perform better than the statistical methods alone [R1195, CH95, LPY94].

3  Evaluation of Information Retrieval Performance

In the previous sections, we presented the basic architecture of an IR system and introduced some representative approaches that can be used to implement it. In this part of the chapter, we discuss criteria for evaluating an information retrieval system. Frakes et al. [FBY92] provide a summary of the evaluation process: an information retrieval system can be evaluated in terms of many criteria, including execution efficiency, storage efficiency, retrieval effectiveness and the features offered to the user. System designers weigh the relative importance of these factors, based on the particular environment and expectations, in order to make an appropriate selection of data structures and algorithms.

Execution efficiency is measured by the time it takes a system, or part of a system, to perform a computation. This can be measured in C-based systems by using profiling tools such as prof on UNIX [Ear84]. This factor has always been a major concern for interactive IR systems, because a long retrieval time will interfere with the usefulness of the system.
The requirements of such IR systems usually specify maximum acceptable times for searching, and for database maintenance operations such as adding and deleting documents. Storage efficiency is measured by the number of bytes needed to store the data. Space overhead, a common measure of storage efficiency, is the ratio of the size of the index files plus the size of the document files to the size of the document files.

Most IR experimentation has focused on retrieval effectiveness, which is based on relevance judgements. Relevance is an inherently subjective concept, in that the ultimate goal is to satisfy the human user's needs. Due to variation in users' personal backgrounds, it is impossible to design a perfect system that meets all of the expectations of all users. Therefore, it is necessary to introduce some measures to evaluate the performance of a retrieval process by estimating the degree to which the retrieved information matches the query.

3.1  Precision and Recall

Two widely used parameters for measuring IR success, which are based on the concept of relevance, are precision and recall. Precision is the ratio of relevant items retrieved to all items retrieved. Recall is the ratio of relevant items retrieved to all relevant items. To facilitate the understanding of these definitions, we present an evaluation contingency table:

Table 2: Evaluation Contingency Table

                   Relevant    Irrelevant
  Retrieved        A           B
  Not Retrieved    C           D

where A denotes the number of relevant items retrieved, B denotes the number of irrelevant items retrieved, C denotes the number of relevant items not retrieved but in the database, and D denotes the number of irrelevant items not retrieved but in the database.

Precision = \frac{A}{A + B}    (6)

Recall = \frac{A}{A + C}    (7)

The expectations of users may vary from one person to another. Some users attach more importance to precision, i.e., they want to see relevant information without going through a lot of junk. Others prefer recall, i.e., they want to see all the documents that are considered to be highly relevant. Hence an evaluation that involves only one of the two parameters may be biased. For this reason, methods that evaluate IR performance in terms of precision and recall simultaneously have been developed. They are discussed in the following sections.

3.2  E Measure

A method that combines both recall and precision, called the E measure (short for effective measure), was proposed by van Rijsbergen [vR79]. It allows the user to specify the relative importance of precision and recall. The evaluation measure is defined as

E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (8)

where P is precision, R is recall, and \alpha is a parameter that varies from 0 to 1. The higher the value of \alpha, the more important the measure of precision is considered, and vice versa.

3.3  Recall-Precision Graph

Another method, called the recall-precision graph, depicts the tendencies with which recall and precision values change and demonstrates the inter-relationship between them. It is illustrated by a bivariate plot in which one axis is recall and the other precision. Recall-precision plots show that recall and precision are inversely related: when precision goes up, recall typically goes down, and vice versa. Fig. 2 illustrates this curve.

Fig. 2: Recall-Precision Graph (precision on the vertical axis and recall on the horizontal axis, both ranging from 0 to 1)
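The following sketch computes precision, recall and the E measure for a single query from a toy result list and toy relevance judgements; the data are illustrative, not drawn from the thesis test collections.

```python
# Precision, recall and van Rijsbergen's E measure for one query.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)               # relevant items retrieved
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    return precision, recall

def e_measure(precision, recall, alpha=0.5):
    # alpha close to 1 emphasizes precision; close to 0 emphasizes recall
    if precision == 0.0 or recall == 0.0:
        return 1.0
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)

retrieved = ["d3", "d1", "d7", "d9"]            # system output
relevant = {"d1", "d2", "d3", "d5"}             # relevance judgements
p, r = precision_recall(retrieved, relevant)
print(p, r, e_measure(p, r, alpha=0.5))         # 0.5 0.5 0.5
```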
3.4  Average Precision

Although the recall-precision graph is a good indicator of the trade-off between precision and recall, it is difficult to compare retrieval effectiveness with it. A more common type of measure that is widely used in the research community, called average precision, attempts to summarize this kind of curve as a single value in order to make different IR algorithms, or the same algorithm on different document collections, comparable. For a given query, a ranked list of documents ranging from the most relevant to the least relevant is obtained, and an average precision for this query can be computed. This is done by computing the precision each time a new relevant document is retrieved and then averaging the precision values computed in this way. These per-query average precisions are then combined (averaged) across all query topics in the collection to create the average precision for that collection.

One variant of average precision is called interpolated average precision. It introduces the concept of standard recall levels by defining a set of recall values, e.g., 0.25, 0.5, 0.75. However, among the set of experimental recall values, some or all of these standard recall levels might not occur. In order to solve this problem, we employ interpolation. For a given query, to interpolate the precision at a standard recall level p, we take the highest precision observed at any experimental recall greater than or equal to p. The interpolated average precision for this query can be computed by averaging the interpolated precision values at all standard recall levels.

4  Special Topics

4.1  Representation Models

Most current IR research employs a traditional model to represent a document, i.e., it regards the document as a bag of terms. This thesis proposes an innovative representation method, called multi-perspective ("stereo") document representation, which assumes that each document is associated with multiple representations obtained by observing the document from different perspectives.

By reviewing some recent research work in the area of information science, we were able to find some traces of the multi-perspective idea [CZT05]. Researchers in the area of face recognition have realized that by associating each person with more than one picture in the database, the system usually achieves better performance [WZ02]. As well, data fusion, a method based on the idea of integrating many answers to one question into a single best answer, has been successful in its application to meta-search [Mon02]. The main idea of meta-search is to combine the results from many different search engines, each employing different search strategies, to produce a single list. We can interpret this as each document being viewed in many different ways by many different search strategies. Experiments have shown that meta-search can significantly improve on the raw performance of the input search engines [Mon02]. Researchers have also achieved success in information retrieval for structured documents, such as HTML documents, by using language models derived from multiple sources of structural information, such as in-links, titles, URLs and out-degrees [OC03].
4 Special Topics

4.1 Representation Models

Most of the current IR research employs a traditional model to represent a document, i.e., regarding it as a bag of terms. This thesis proposes an innovative representation method, called multi-perspective ("stereo") document representation, which assumes that each document is associated with multiple representations that are obtained by observing the document from different perspectives.

By reviewing some recent research work in the area of information science, we were able to find some traces of the multi-perspective idea [CZT05]. Researchers in the area of face recognition have realized that by associating each person with more than one picture in the database, the system usually achieves better performance [WZ02]. As well, data fusion, a method based on the idea of integrating many answers to one question into a single best answer, has been successful in its application to meta-search [Mon02]. The main idea of meta-search is to combine the results from many different search engines, each employing different search strategies, to produce a single list. We can interpret this as each document being viewed in many different ways by many different search strategies. Experiments have shown that meta-search can significantly improve the raw performance of the input search engines [Mon02]. Researchers have also achieved success in information retrieval for structured documents, such as HTML documents, by using language models derived from multiple sources of structural information, such as in-links, titles, URLs and out-degrees [OC03].

A successful automated tutoring system, AutoTutor, evaluates a student's response to a question by selecting the optimal value out of a set of matching values that result from comparing the response with each correct answer (aspect) and all possible combinations of the correct answers (aspects) [GWHWH+99]. An intelligent essay assessor, developed by researchers at the University of Colorado, attempted to find out whether it is valuable to evaluate a student's essay by representing it with two vectors, one based on the technical words only and the other based on the non-technical words only, and found that nothing was gained by doing so [RSW+98].

4.2 Retrieval Tasks

Based on the traditional information retrieval modes, there are two major types of tasks: an ad hoc task and a routing task. In the ad hoc task, it is assumed that new questions are being asked against a static set of data. In other words, the user can formulate any number of arbitrary queries but applies them to a fixed search target. In the routing task, it is assumed that the same questions are always being asked, but that new data is being searched [Har95]. That is, the queries cover a fixed number of topics, and when a new message comes, it is routed to the class corresponding to the topic that fits the message most closely. In many routing experiments there is just one topic or query, and therefore there are just two classes, relevant and non-relevant [Gre01].

It is worth mentioning that an information retrieval task has a very close association with a classification task. The problem a classifier tries to address is to compare an unlabeled data item to all the existing clusters and group it into the cluster that shares the most commonalities with the data. If we consider the unlabeled data as an arbitrary query input by the user and the clusters as the documents in the collection, we can see that the way in which a classifier finds the cluster that is closest to the data is analogous to how an IR system manages to find the document that is most relevant to a query. Thus the classification process can be viewed as a combination of information retrieval and database updating processes. For this reason, an IR method can have a direct application to a classification system.

Chapter III

Methodology

This chapter looks into the methodology of the multi-perspective representation method. It presents an overview of the MPR approach. It also provides a generalized MPR solution that we have proposed in [CZT05], as well as a differential latent semantic indexing method that was proposed in the literature [CTN01, CTN03].

1 Overview of the MPR Method

Throughout humans' experience of recognizing objects, it has been noted that multi-perspective descriptions have always been beneficial. Take visual and acoustic recognition for example. We know that two eyes are required for the stereographic recognition of objects, which gives us our sense of depth perception. As well, two ears are necessary for distinguishing subtle differences among different music styles. Our experience in reading a document also seems to support this idea. We believe that it is not necessary to read the entire content of a document before we can make a decision on its relevance to a posed query. Reading one part of a document may be enough for judging whether the document is relevant to a query or not. Reading two or more parts can enhance the confidence of the decision.
These observations have led to our belief that perceiving objects from two or more perspectives is always an advantage. In order to explore the applicability of the idea of multi-perspective descriptions in the field of information retrieval, we have proposed the multi-perspective representation (MPR) method.

Traditionally in information retrieval, a document is associated with one and only one representation. However, in the MPR method, it is assumed that a document can be observed from different perspectives to obtain its multiple derivatives (which we call the perspective documents). The original document, also called the origin, can then be represented by all the perspective documents that are derived from it. We classify any solution that uses multi-perspective representations of a document as a multi-perspective representation solution.

2 A Generalized MPR Solution

This section presents a generalized solution to the multi-perspective representation model. This solution is based upon the vector space model and its basic strategies are described as follows. Each document is perceived from different points of view to obtain its perspective documents. These derivatives are then temporarily regarded as ordinary documents and analyzed by some vector space based IR method. At the completion of these processes, each perspective document will be assigned a value that indicates its similarity to the given query. Then the similarity values of the perspective documents that are derived from the same origin are combined to arrive at one single similarity value for the original document. In other words, by exploiting the perspective documents as intermediates, the analysis of an original document can be indirectly reconstructed by integrating the results obtained from the analysis conducted on its derivatives.

The following sections provide a guideline for how this solution can be developed, as well as presenting some sample implementations that can facilitate the understanding of the proposed methodology.

2.1 Guidelines for Implementation

2.1.1 Obtain Multi-perspective Representations

In order to apply the MPR method to the vector space model, it is essential that we look into how to generate distinct perspective documents from an original document and how to obtain their vector representations accordingly. We call the vector representation of a perspective document the perspective document's term vector, or perspective term vector for short. The following paragraphs offer some ideas on how this process can be implemented.

We split each original document into several perspective documents, where each perspective document conveys partial information of the original document. We do not have a fixed solution for creating the sub-files. The only requirement we have is that each created perspective document should contain enough information to represent its origin. When we analyze different pictures taken of the same object in order to construct a visual perception of its full dimensions, we are only interested in the ones that share some commonalities. The concept of perspective documents is similar to the above picture example, thus we suggest that perspective documents derived from the same origin should contain overlapped content as well.
Having generated perspective documents from each original document, the perspective documents can then directly obtain their vector representations in the document space by following the indexing procedure of the vector space model (page 12).

2.1.2 Combination Strategies

To effectively reconstruct the analysis for each original document through the intermediates' contributions, we have to employ a combination strategy. Two sample schemes are presented as follows.

The first scheme is called the multi-perspective representation average scheme, or MPR-A for short. It exploits a direct strategy, which takes the average of the similarity values that are evaluated between a query q and each of the perspective documents that are used to represent the original document D_j. Let us denote D_{j_i} to be the perspective document of original document D_j from the ith perspective, and denote p to be the number of different perspectives applied to D_j. We define the similarity between query q and original document D_j as:

    sim(D_j, q) = \frac{1}{p} \sum_{i=1}^{p} sim(D_{j_i}, q)    (9)

The second strategy is called the multi-perspective representation voting scheme, or MPR-V for short. It is based on the idea of interpreting the original document D_j as a cluster and its associated perspective documents as the member documents in this cluster. The retrieval process can be converted into a classification problem by regarding the query as the unlabeled document and its most relevant document as the cluster the unlabeled document belongs to. We can interpret the similarity between a member document in a cluster and the query as the probability that this member document votes for the query to be labeled by this class. We can then use the noisy-or operation to combine the similarity values of all the member documents in one cluster to arrive at one single value per class [CH88]. Accordingly, the similarity between query q and original document D_j is defined as:

    sim(D_j, q) = 1 - \prod_{i=1}^{p} (1 - sim(D_{j_i}, q))    (10)

We would like to point out that the MPR-V scheme is associated with probabilities, whose value range is from 0 to 1. Before deciding to employ this voting strategy, it is advisable to make sure that the similarity metric of the IR method outputs a value that is within the range [0, 1].

In addition to the two sample strategies we proposed above, there are many other possible solutions for integrating the results of assessment between a query and the intermediates to compute the similarity value between the query and the original document. Making a comparison of the effectiveness of all the alternatives is beyond the scope of this thesis. (For more alternatives, refer to [RSW+98].)
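A minimal sketch of the two combination schemes of Eq. 9 and Eq. 10 is shown below (Python; the similarity values are hypothetical placeholders for whatever vector space based method is in use):

    def mpr_average(perspective_sims):
        """MPR-A (Eq. 9): average the perspective documents' similarities."""
        return sum(perspective_sims) / len(perspective_sims)

    def mpr_voting(perspective_sims):
        """MPR-V (Eq. 10): noisy-or combination; assumes each value is in [0, 1]."""
        prod = 1.0
        for s in perspective_sims:
            prod *= (1.0 - s)
        return 1.0 - prod

    # Hypothetical similarities between a query and the p perspective
    # documents derived from one original document.
    sims = [0.42, 0.18, 0.35]
    print(mpr_average(sims), mpr_voting(sims))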
2.2 Applications

We have applied the multi-perspective document representation method to two vector space model related IR approaches: the standard latent semantic indexing method and the standard term vector method.

As described in Chapter II, the LSI method applies SVD to the standard term vector method (i.e., the traditional vector space model) in order to derive an LSI space of k dimensions, where k is less than or equal to the full rank of the term-by-document matrix. It should be noted that when k is equal to the full rank of the matrix, the LSI method is equivalent to the standard term vector approach. From this perspective, we consider the standard term vector method to be a special case of the LSI method, and we will use the standard LSI method as the representative of vector space based methods to demonstrate the implementation of the MPR model.

In order to highlight the characteristics of our representation method, we start with an introduction to a standard LSI retrieval system, followed by an LSI-MPR retrieval system, which adds the multi-perspective representation idea onto the standard LSI method. The complexity analyses of the standard LSI system and the LSI-MPR system will be discussed in terms of the analyses of their components' complexities.

2.2.1 LSI Retrieval System

In the previous chapter, we presented the methodologies of the latent semantic indexing method. The following paragraphs present a detailed description of its implementation, together with some important formulae for computing the vector representations of documents using the SVD matrices.

A typical LSI retrieval system may consist of six major components: parser, indexer, deriver, transformer, evaluator and ranker. Fig. 3 illustrates the system's basic framework. We name the sub-system consisting of the first five components (the parts inside the dotted box of fig. 3) the LSI base.

Fig. 3: LSI Retrieval System

Parser

Given a document collection, the parser conducts lexical analysis on each document and converts it into a set of terms. We exploit the technique of stop listing, which helps to remove the punctuation and common words in the files. A term list can then be generated from the remaining terms. This term list, also called the system vocabulary, will be applicable to the following processes.

Lexical analysis is expensive because it requires examination of every input character. Although no studies on the cost of lexical analysis in information retrieval systems have been done, lexical analysis has been shown to account for as much as 50% of the computational expense of compilation [FBY92]. The specific data structure our lexical analyzer utilizes is the finite state machine.

Indexer

In LSI, or any other vector space based approach, documents are represented in terms of vectors. The LSI system includes an indexing process to analyze the association between terms and documents. The indexer can simply apply raw frequency analysis or employ more complicated term weighting techniques. At the completion of indexing, each document is represented as a vector of terms. A term-by-document matrix is then constructed, where each column corresponds to the term vector of a document.

The basic data structure that we employ for implementing the indexer is the C++ standard template library (STL) map class template. We use the terms as the keys for the map and store the statistical information of a given term in its related entry. While the C++ standard does not specifically require that the map container be implemented using any specific data structure, the time complexity requirements imposed by the standard on each map operation suggest a balanced binary search tree. Many STL implementations use a red/black tree to implement a map. Map operations such as searching for an element or adding an element require O(log n) operations.^1 Similarly, the time complexity of the major map operation in our implementation, operator[], is O(log n). In STL, the operation of comparing two strings has been implemented, and a map container with a key of string type will be sorted alphabetically. Therefore, the indexer will output a list of terms in a dictionary format, which is beneficial. Another benefit of using the STL map container is that the common operations on a map take logarithmic time, which is very suitable for storing a collection of any size; thus it is an advantage for information retrieval systems.

^1 Retrieval Time: July 2005. http://www.mtsu.edu/~csjudy/STL/Map.html
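To illustrate the indexing step, the sketch below (Python rather than the C++/STL implementation described above; the toy documents are hypothetical) builds a sorted term dictionary and a raw-frequency term-by-document matrix:

    # Toy collection; each document is already parsed into terms.
    docs = [["latent", "semantic", "indexing"],
            ["semantic", "retrieval", "semantic"],
            ["vector", "space", "retrieval"]]

    # Dictionary-style vocabulary: term -> per-document raw frequencies.
    vocab = {}
    for j, doc in enumerate(docs):
        for term in doc:
            vocab.setdefault(term, [0] * len(docs))[j] += 1

    # Term-by-document matrix with rows in alphabetical term order,
    # mirroring the sorted map container used by the indexer.
    terms = sorted(vocab)
    matrix = [vocab[t] for t in terms]

    for t, row in zip(terms, matrix):
        print(t, row)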
Deriver

As we learned from Chapter II, LSI assumes that a latent semantic structure underlies all the textual data and that it can be estimated by some statistical techniques. In order to derive such a higher-order structure, which we call the LSI structure or LSI space, the deriver applies the singular value decomposition algorithm, which has been generally applied in many of the LSI related approaches.

SVD is a technique that demands considerable computing resources. Its algorithm has been well developed and implemented in a number of programming languages, such as Matlab,^2 C [BDO+96], C++^3 and Java.^4 Our experiments have applied the svd function that is provided by the Matlab software. More precisely, we employ a function in Matlab called svds to approximate the original term-by-perspective-document matrix by truncating the components of the smallest singular values. Matlab's SVD uses the LAPACK routines to compute the singular value decomposition, and the solution will converge if the limit of 75 QR step iterations is exhausted.^5

Although SVD is a very powerful tool for factor analysis, its computational complexity is too big to be ignored. This has become a bottleneck for the application of information retrieval systems based on this technique.

^2 Retrieval Time: July 2005. http://www.mathworks.com/access/helpdesk/help/techdoc/ref/svd.html
^3 Retrieval Time: July 2005. http://www.cs.utexas.edu/users/suvrit/work/progs/ssvd.html
^4 Retrieval Time: July 2005. http://jmat.sourceforge.net
^5 QR is a matrix decomposition method. Suppose M is an arbitrary rectangular matrix of size t x d. It can be factorized into QR matrices as M = QR, where Q is a t x t orthogonal matrix and R is a t x d upper triangular matrix.
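For readers who prefer a concrete reference point, the following sketch (Python with NumPy, standing in for the Matlab svds call used in our implementation; the matrix is hypothetical) derives a rank-k LSI structure from a term-by-document matrix:

    import numpy as np

    # Hypothetical term-by-document matrix M (t terms x d documents).
    M = np.array([[1.0, 0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])

    k = 2  # number of retained dimensions (the dimension of the LSI space)

    # Full SVD: M = U S V^T, with singular values in descending order.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # Truncate to the k largest singular values (what svds approximates).
    U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    # Rank-k approximation of the original matrix: M_k = U_k S_k V_k^T.
    M_k = U_k @ S_k @ V_k.T
    print(np.round(M_k, 3))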
Transformer

Having obtained the LSI structure from the deriver module, we exploit the transformer to obtain the document representations in the LSI space. There is a one-to-one correspondence between the representation in the document space and the representation in the LSI space. To implement the transforming process, we project the vectors in the document space onto the LSI space. The specific formulas used to complete this procedure are demonstrated as follows.

Let us start from the principal formula of SVD, as shown in Eq. 5 (page 20), and derive the formula that computes the representation of a document using the SVD components. Let U_k, S_k and V_k be the matrices that approximate U, S and V respectively. Let M_k be the approximate term-by-document matrix resulting from the product of U_k, S_k and V_k. M_k's rows correspond to terms and its columns correspond to documents. Please note that U_k and V_k are orthogonal matrices and S_k is a diagonal matrix.^6 Due to the characteristics of these special matrices, M_k^T M_k can be derived as follows:

    M_k^T M_k = (U_k S_k V_k^T)^T (U_k S_k V_k^T)
              = V_k S_k U_k^T U_k S_k V_k^T
              = V_k S_k S_k V_k^T
              = (V_k S_k)(V_k S_k)^T    (11)

From the above equations, it can be seen that the dot product between two columns of M_k reflects the extent to which two documents have a similar profile of terms. Observe that the matrix M_k^T M_k is a square symmetric matrix containing all the document-to-document dot products, i.e., cell (i, j) of the matrix equals the dot product between the ith row and the jth row of the matrix V_k S_k. So we can consider the rows of matrix V_k S_k as coordinates for documents. Let e_j denote the jth canonical vector of the d x d identity matrix. For a document j, its representation by the jth row of matrix V_k S_k is given by:

    DR_j = (V_k S_k)^T e_j = S_k V_k^T e_j    (12)

The time complexity for loading the SVD matrices U_k, S_k and V_k will be O(max(t, d) x k) (see page 21 for a more detailed description of these matrices). To transform one vector from the document space to the LSI space, we need to apply the multiplication operation on the matrices V_k and S_k. This procedure costs O(d x k) time.

Process Queries

Up to this point, all the documents have been pre-processed and represented by vectors in the LSI space. The next step is to generate similar representations for query statements. As in most IR systems, a query is considered to be a pseudo-document in the LSI retrieval system. The process of analyzing a query is very close to the one applied to documents; however, there are still some differences between these two processes in our LSI system that are worth mentioning.

When a query comes, it is first processed by the parser to obtain a set of terms. These terms are compared to the system vocabulary established when the documents were parsed. We keep the terms that are accepted by the system environment and simply discard the ones that never appeared before. The indexer then takes the filtered set of terms as input and outputs its representation as a vector of terms.

When the transformer was used to compute the documents' representations in the LSI space, each document corresponded to a point in the LSI space and its coordinates could be computed in terms of the SVD component matrices. A query, however, can be a completely new textual object that was not considered in the original analysis. Thus we need to find a way to fold the query into the system. Suppose there were a row vector v_q in matrix V_k corresponding to the query vector q, so that q = U_k S_k v_q^T. Then the representation of v_q can be given as follows:

    v_q = q^T U_k S_k^{-1}    (13)

Applying Eq. 12 (page 42), the representation for query q can be obtained by

    DR_q = S_k v_q^T = S_k (q^T U_k S_k^{-1})^T = U_k^T q    (14)

By exploring the algebraic attributes of the orthogonal and diagonal matrices, we can find a way to represent a new document based on the existing LSI structure. Thus, given any document, whether or not it exists in the database, we will be able to represent it in a standard format.

^6 For an orthogonal matrix A: A A^T = A^T A = I, where I is an identity matrix and A^T denotes the transpose of A. For a diagonal matrix B: B = B^T.
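The transformer and the query fold-in of Eq. 12-14 reduce to a few matrix products; the sketch below (Python with NumPy; the term-by-document matrix and query vector are hypothetical) illustrates them:

    import numpy as np

    # Hypothetical term-by-document matrix (t = 4 terms, d = 4 documents).
    M = np.array([[1.0, 0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])
    k = 2
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    # Eq. 12: the representation of document j is the j-th row of V_k S_k.
    def document_representation(j):
        return S_k @ V_k.T[:, j]        # equals (V_k S_k)^T e_j

    # Eq. 13-14: fold a previously unseen query vector q into the LSI space;
    # DR_q = S_k v_q^T = U_k^T q.
    def query_representation(q):
        return U_k.T @ q

    q = np.array([1.0, 0.0, 1.0, 0.0])  # hypothetical query term vector
    print(document_representation(0), query_representation(q))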
Evaluator

Once vectors have been computed for the query and for all the documents in the collection, the next target is to assess the similarity between a pair of vectors, most often a query's term vector and a document's term vector. The evaluator is designed for this purpose.

The evaluation criterion we use is the cosine similarity (page 16). Denote DR_j to be the vector representation of document D_j in the LSI space, DR_q to be the query vector representation in the same space, and \theta_j to be the angle between the above two vectors. By applying Eq. 12 (page 42) and Eq. 14 (page 44), the similarity between D_j and q can be estimated by

    sim(DR_j, DR_q) = \cos \theta_j = \frac{DR_j \cdot DR_q}{|DR_j| |DR_q|} = \frac{(S_k V_k^T e_j)^T (U_k^T q)}{|S_k V_k^T e_j| |U_k^T q|}    (15)

At the completion of the evaluation process, each of the documents in the collection will be assigned a similarity value indicating its similarity (i.e., relevance) to the given query.

Suppose all d documents in the collection and the query have been represented in the LSI space by k-dimensional vectors. The time complexity for the evaluator to estimate the similarity between a query and all the documents in the collection is O(d x k).

Ranker

In order to allow the human user to restrict his/her attention to a set of documents of manageable size, the output of the retrieval system needs to be an ordered set of documents, with the most relevant documents appearing prior to the least relevant ones. This is why we have introduced the ranker component into our system. The ranker employs a ranking strategy, called nearest-neighbour queries (page 17), on all the similarity values assigned to all the documents. The documents can be sorted accordingly, and the ones with the top N ranks will be selected, which are considered to have the greatest potential of meeting the user's information request.

The specific sorting algorithm we employ is the STL sorting method. The time complexity is O(N log N) for the average case and O(N^2) for the worst case. Although the cost is very high in the worst scenario, such a scenario is also very rare. In general, the algorithm is very efficient and the average complexity applies. It should be noted that, no matter how good an IR method is, there is no guarantee that all the retrieved documents are relevant, nor is it certain that all the relevant documents will be returned.
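The evaluator and ranker steps can be sketched as follows (Python with NumPy; the LSI-space vectors are hypothetical); the function computes Eq. 15 for every document and returns the top-N ranking:

    import numpy as np

    def cosine(a, b):
        """Cosine similarity of two vectors (Eq. 15)."""
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def rank_documents(doc_reps, query_rep, top_n=3):
        """Score every document representation against the query and
        return (document index, similarity) pairs, best first."""
        sims = [(j, cosine(dr, query_rep)) for j, dr in enumerate(doc_reps)]
        return sorted(sims, key=lambda pair: pair[1], reverse=True)[:top_n]

    # Hypothetical k-dimensional LSI representations of d documents and a query.
    doc_reps = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]])
    query_rep = np.array([0.8, 0.2])
    print(rank_documents(doc_reps, query_rep))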
2.2.2 LSI-MPR Retrieval System

In order to integrate the concept of perspective documents and to reconstruct the analysis for each original document based on the analytical results obtained from its derivatives, the LSI-MPR system introduces two new components onto the standard LSI system: the observer and the combiner. The structure of the new system is depicted in fig. 4.

Fig. 4: LSI-MPR Retrieval System

Observer

The observer aims at generating perspective documents for an ordinary document collection. It perceives each document in the original collection from different perspectives and outputs a derived collection consisting of all the perspective documents.

Having observed the fact that perspectives of the same object should share some properties, we make sure that, in the process of deriving multiple variants from the same document, overlapped content exists among all of its associated derivatives, i.e., the perspective documents. There are many possible solutions for implementing an observer component that meets the requirement of our guideline in Section 2.1.1.

A very straightforward idea is to collect abstracts of each original document from different individuals and regard the different versions of the abstracts as the different perspective documents for the original document. This solution rigidly follows the linguistic definition of the word perspective, i.e., the interrelation in which a subject or its parts are mentally viewed [MW98], and it certainly makes sure that all the derived documents share some properties. However, it involves a lot of human interference, which conflicts with our expectation of developing a fully automated system.

Another solution, which leads to a much wider application, is to evenly distribute the content of each original document to its variants. We introduce a definition of overlapping rate, denoted by r, to represent the quota of the overlapped content in one perspective document. r can be defined as the ratio of the overlapped content in any perspective document to the entire content of this perspective document.

In the LSI-MPR system we have developed, each original document is divided by sentences and is observed from p different perspectives. The time complexity for the observer to process the original collection and to generate the perspective document collection is O(d x p).

LSI Base

The LSI base is an abstraction of the sub-system of the LSI system we discussed earlier (Section 2.2.1). It includes the components of parser, indexer, deriver, transformer and evaluator. In our LSI-MPR system, the LSI base takes the collection of perspective documents generated by the observer and the user's query as input. It considers the perspective documents to be ordinary documents and utilizes the parser, indexer, deriver and transformer to conduct analysis on them. It also exploits the evaluator to compare these perspective documents with the query. This leads to the output of the similarity values for the perspective documents.

Combiner

Having obtained the results from the analysis conducted on the perspective documents, the combiner employs a combination strategy on the intermediate results. Examples were presented in Section 2.1.2. By doing so, it outputs the similarity values that indicate the relevance of the original documents to a given query. The time complexity for the combiner to compute the similarity values for all the original documents from their intermediates' similarity values is O(d x p).

Ranker

The ranker of the LSI-MPR system has the same functionality as the LSI system's. Please refer to page 45 for further details.

3 Differential Latent Semantic Indexing Method

We discovered that, in the literature, the differential latent semantic indexing (DLSI) method, proposed by Chen et al., exhibits attributes that fall into the category of multi-perspective representation. In this section, we provide a brief overview of this method.

3.1 Motivation

As we mentioned in Chapter II, the standard latent semantic indexing method applies a dimension reduction method, SVD, to the traditional vector space model and has been shown to have a distinct advantage in dampening the effect of the synonymy and polysemy problems. However, it also has some drawbacks that should not be neglected.

In the LSI method, the term vector of each document is projected onto the same reduced dimensional space. The projection is measured and used as the representation of the document, whereas the distance from the term vector to the reduced space is neglected. Fig. 5 illustrates the process of projecting. Accordingly, to evaluate the similarity between a pair of documents, the LSI method only measures the similarity between their term vectors' projections. As pointed out by Schutze et al. [SS97], the LSI method is indeed a global dimensional reduction (global projection) approach and, because of this, it encounters a difficulty in adapting to the unique characteristics of each document. The DLSI method was proposed in order to overcome this disadvantage of the standard LSI method.
Besides the projection of term vectors onto a reduced dimensional space, the DLSI method also makes use of the distance from these vectors to the reduced space.

Fig. 5: Projecting a Vector

3.2 Theory

3.2.1 Differential Term Vector

Like the generalized solution strategy, the DLSI method assumes that each document is associated with multiple perspective documents. It introduces the concept of the differential term vector. Denote DR_{i_j} to be the vector representation of a perspective document D_{i_j} (i.e., the jth perspective document of original document D_i). A differential term vector is defined as DD = DR_{i_j} - DR_{i'_{j'}}, where DR_{i_j} and DR_{i'_{j'}} are term vector representations of two perspective documents D_{i_j} and D_{i'_{j'}}. In particular, when these two perspective documents are distinct derivatives of the same original document, DD is called an intra differential term vector; when they are distinct perspective documents having different origins, DD is called an extra differential term vector.

The way to select two perspective documents for constructing a differential term vector can be arbitrary. A pragmatic solution is called the mean vector strategy. A mean vector takes the average of the perspective documents' vectors that have the same origin. Denote S_i to be the mean vector of original document D_i. It can be defined as:

    S_i = \frac{1}{p} \sum_{j=1}^{p} DR_{i_j}    (16)

By introducing mean vectors, an intra differential term vector can be constructed by DR_{i_j} - S_i, and an extra differential term vector can be constructed by DR_{i_j} - S_{i'} (i' \neq i). Please note that S_i is derived from the perspective documents' term vectors of the original document D_i, thus it is also considered a perspective term vector of D_i and can be used for constructing the differential term vectors.

The intra and extra differential term-by-document matrices, M_I and M_E, can be defined accordingly, each column of which is an intra or extra differential term vector respectively.
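As an illustration of Eq. 16 and of how the differential matrices can be assembled under the mean vector strategy, consider the following sketch (Python with NumPy; the perspective term vectors are hypothetical, and pairing each perspective vector with every foreign mean vector for M_E is only one of the arbitrary selections permitted above):

    import numpy as np

    # Hypothetical perspective term vectors: docs[i][j] is the term vector of
    # the j-th perspective document of original document D_i.
    docs = [
        [np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])],   # D_1
        [np.array([3.0, 1.0, 0.0]), np.array([2.0, 2.0, 0.0])],   # D_2
    ]

    # Eq. 16: mean vector of each original document.
    means = [np.mean(p, axis=0) for p in docs]

    # Intra differential vectors: DR_{i_j} - S_i (same origin).
    intra = [dr - means[i] for i, persp in enumerate(docs) for dr in persp]

    # Extra differential vectors: DR_{i_j} - S_{i'} with i' != i (different origins).
    extra = [dr - means[i2]
             for i, persp in enumerate(docs) for dr in persp
             for i2 in range(len(docs)) if i2 != i]

    # Columns of M_I and M_E are the differential term vectors.
    M_I = np.column_stack(intra)
    M_E = np.column_stack(extra)
    print(M_I.shape, M_E.shape)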
3.2.2 Posterior Model

DLSI applies SVD to both the M_I and M_E matrices and obtains the intra and extra DLSI spaces. In order to estimate the similarity between a query q and a document D_i, we can construct a differential term vector x by assigning q - DR_{i_j} or q - S_i to it. A likelihood function, which exploits the vector's projection onto and its distance to the space, can be evaluated in both DLSI spaces. A Bayesian posterior function then combines the likelihood values in both cases to arrive at an estimate of the similarity between the two term vectors that are used to construct the differential term vector.

3.2.3 Geometric Interpretation

In the DLSI method, both the differential term vector and the likelihood function convey a rich geometric sense. As mentioned earlier, a differential term vector DD can be obtained by conducting a subtraction over two different term vectors D_{i_j} and D_{i'_{j'}}. In the likelihood function, the length of DD is exploited, which is approximately equivalent to the cosine value of the angle between D_{i_j} and D_{i'_{j'}}, assuming that normalization has been applied to all the term vectors. In addition, the distance of DD to the DLSI space is also taken into consideration, providing a representation of the characteristics of an individual document, which the LSI method is unable to offer.

3.2.4 Algorithm

The implementation of DLSI consists of two major components. One is to process the document collection and to set up the retrieval system. The other is to apply the retrieval system to a query and to obtain an ordered document set accordingly. The algorithms of these procedures are shown as follows.

Set up System

(1) Generate perspective documents for each original document in the database and obtain their term vector representations.

(2) Construct an intra differential term-by-document matrix M_I such that each of its columns is an intra differential term vector.

(3) Decompose M_I, by the SVD algorithm, into M_I = U_I S_I V_I^T, where S_I = diag(\delta_{I,1}, \delta_{I,2}, ...). Find an appropriate k_I^7 and apply it to M_I in order to get an approximate matrix M_{I,k_I}, where M_{I,k_I} = U_{k_I} S_{k_I} V_{k_I}^T. Then evaluate the likelihood function [CTN03]:

    P(x | M_I) = \frac{ \exp\left( -\frac{1}{2} \sum_{i=1}^{k_I} \frac{y_i^2}{\delta_{I,i}^2} \right) \cdot \exp\left( -\frac{\epsilon_I^2(x)}{2 \rho_I} \right) }{ (2\pi)^{n/2} \prod_{i=1}^{k_I} \delta_{I,i} \cdot \rho_I^{(n - k_I)/2} }    (17)

where y = U_{k_I}^T x, \epsilon_I^2(x) = \|x\|^2 - \sum_{i=1}^{k_I} y_i^2, \rho_I = \frac{1}{n - k_I} \sum_{i=k_I+1}^{r_I} \delta_{I,i}^2, n is the dimension of the differential term vectors, and r_I is the rank of matrix M_I.

(4) Construct an extra differential term-by-document matrix M_E such that each of its columns is an extra differential term vector.

(5) Decompose M_E, by the SVD algorithm, into M_E = U_E S_E V_E^T, where S_E = diag(\delta_{E,1}, \delta_{E,2}, ...). Find an appropriate k_E and apply it to obtain an approximate matrix M_{E,k_E}, where M_{E,k_E} = U_{k_E} S_{k_E} V_{k_E}^T. Then calculate the likelihood function [CTN03]:

    P(x | M_E) = \frac{ \exp\left( -\frac{1}{2} \sum_{i=1}^{k_E} \frac{y_i^2}{\delta_{E,i}^2} \right) \cdot \exp\left( -\frac{\epsilon_E^2(x)}{2 \rho_E} \right) }{ (2\pi)^{n/2} \prod_{i=1}^{k_E} \delta_{E,i} \cdot \rho_E^{(n - k_E)/2} }    (18)

where y = U_{k_E}^T x, \epsilon_E^2(x) = \|x\|^2 - \sum_{i=1}^{k_E} y_i^2, \rho_E = \frac{1}{n - k_E} \sum_{i=k_E+1}^{r_E} \delta_{E,i}^2, and r_E is the rank of matrix M_E.

(6) Define the posterior function [CTN03] as

    P(M_I | x) = \frac{ P(x | M_I) P(M_I) }{ P(x | M_I) P(M_I) + P(x | M_E) P(M_E) }    (19)

where P(M_I) is set to 1/d, d being the number of original documents in the collection, and P(M_E) is set to 1 - P(M_I).

Automated Document Search

(1) Given a query, set up its term vector. For each original document in the collection, repeat steps (2) to (5) below.

(2) Construct a differential term vector x by subtracting any of the perspective term vectors, or the mean vector, of the original document from the query's term vector.

(3) Evaluate the intra-document likelihood function P(x | M_I) and the extra-document likelihood function P(x | M_E).

(4) Calculate the Bayesian posterior probability function P(M_I | x).

(5) Conduct a ranking strategy over the P(M_I | x) values for all the original documents and return the most relevant ones.

^7 The way to choose a value of k is not fixed. Although a reduction in k can help remove noise, keeping too few dimensions may lose important information. Therefore, only by applying experiments on a certain data set and observing its performance for different values of k can we find the most appropriate value.
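A compact sketch of these procedures is given below (Python with NumPy). It is only an illustration of the formulas in steps (3), (5) and (6) under the definitions stated above; the toy differential matrices, the choice of k, and the use of the mean vector in the search step are assumptions made for the example.

    import numpy as np

    def dlsi_space(M, k):
        """SVD of a differential term-by-document matrix; keep what Eq. 17/18 need."""
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        r = int(np.sum(s > 1e-12))              # numerical rank
        n = M.shape[0]                          # dimension of the vectors
        rho = np.sum(s[k:r] ** 2) / (n - k)     # residual variance
        return {"U_k": U[:, :k], "delta": s[:k], "rho": rho, "n": n}

    def likelihood(x, sp):
        """Eq. 17 / Eq. 18: projection term times distance term."""
        y = sp["U_k"].T @ x
        eps2 = float(x @ x) - float(np.sum(y ** 2))
        k, n = len(sp["delta"]), sp["n"]
        num = (np.exp(-0.5 * np.sum(y ** 2 / sp["delta"] ** 2))
               * np.exp(-eps2 / (2 * sp["rho"])))
        den = ((2 * np.pi) ** (n / 2) * np.prod(sp["delta"])
               * sp["rho"] ** ((n - k) / 2))
        return num / den

    def posterior(x, intra_sp, extra_sp, num_docs):
        """Eq. 19: Bayesian posterior that x is an intra differential vector."""
        p_i, p_e = 1.0 / num_docs, 1.0 - 1.0 / num_docs
        li = likelihood(x, intra_sp) * p_i
        le = likelihood(x, extra_sp) * p_e
        return li / (li + le)

    # --- Set up (hypothetical data) ---------------------------------------
    rng = np.random.default_rng(0)
    M_I = rng.normal(scale=0.1, size=(6, 8))    # toy intra differential matrix
    M_E = rng.normal(scale=0.5, size=(6, 8))    # toy extra differential matrix
    intra_sp, extra_sp = dlsi_space(M_I, k=2), dlsi_space(M_E, k=2)

    # --- Search: x = q - S_i for one original document --------------------
    q = rng.normal(size=6)                      # query term vector
    S_i = q + rng.normal(scale=0.05, size=6)    # mean vector of document D_i
    print(posterior(q - S_i, intra_sp, extra_sp, num_docs=4))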
4 Use MPR Solutions for Classifiers

In addition to its application in information retrieval systems, the multi-perspective representation method can also be utilized by classifiers. On page 31, we discussed the close relationship between a retrieval task and a classification problem, and pointed out that an IR method can be applied to set up a classifier as well. In particular, there exists a direct classification interpretation of our multi-perspective representation model. This statement can be verified as follows.

In a typical classification data set, there are many different classes consisting of multiple different member items, and all the member items from the same class share some common attributes. This can be interpreted by the MPR method as: in this data set, there exist different documents (classes), and each of them is associated with multiple distinct perspective documents (member items from the same class) that are used to represent it. By doing so, the MPR method can be easily extended to the field of document classification, which greatly widens the fields of its application.

Chapter IV

Experiments

This chapter focuses on presenting the experiments that have been conducted using the multi-perspective representation solutions we provided in Chapter III. The basic research procedures will be demonstrated. The experimental results, as well as their evaluation, will also be covered.

1 The Generalized MPR Solution Field Test

This series of experiments intends to conduct field tests of the generalized MPR solution, to apply it to the standard LSI and the standard term vector method, and to evaluate its performance.

1.1 Document Collections

We conducted our experiments on two standard IR document collections, TIME^8 and ADI,^9 where queries and relevance judgements are readily available. The TIME collection consists of articles from Time magazine's world news section in the year 1963. ADI is a test collection of document abstracts from library science and related areas. Table 3 shows some characteristics of these data sets. The vocabulary size refers to the number of terms that are accepted by the system. A document's size is the total number of terms in this document. In cases where a term appears more than once, all its occurrences are counted.

    Table 3: Characteristics of TIME & ADI

                                                        TIME       ADI
    number of documents                                  425        82
    number of queries                                     83        35
    average number of documents relevant to a query        4         5
    vocabulary size                                    20959      1513
    average document size                             588.12     66.72

^8 Retrieval Time: July 2005. http://www.cs.utk.edu/~lsi/
^9 Retrieval Time: July 2005. http://www.cs.utk.edu/~lsi/

1.2 Procedure

Firstly, we set up an LSI retrieval system using the standard LSI method. Then we construct two LSI-MPR retrieval systems by integrating the multi-perspective representation into the LSI system. We call the variant using the MPR-A strategy the LSI-MPR-A system and the variant using the MPR-V strategy the LSI-MPR-V system (refer to page 35 for details of the MPR strategies). Following that, we conducted experiments on both the TIME and ADI collections.

1.2.1 LSI Retrieval System

In the LSI system, the parser applies the SMART stop list to eliminate the stop words from the documents. The SMART stop list is a negative dictionary that is used in the vector space model based IR system SMART [Sal71].

The indexer then analyzes the output of the parser to generate the indexing matrix, i.e., the term-by-document matrix. It is worth mentioning that, in the literature, experiments have been done using the standard LSI method on the same test collections by Dumais [Dum91]. She not only removed the words that occur in the SMART stop list, but also eliminated the ones that occur only once in the collection. In order to make a fair comparison to her system, we followed her way of constructing effective terms. Consequently, in the process of indexing, only the terms that occur more than once are used to construct the term-by-document matrix. As well, Dumais employed the raw term frequency strategy for local term weighting and assigned the global weight to be 1 for all terms. Therefore our indexer employs the same strategies.

The deriver performs the SVD algorithm on the term-by-document matrix that is generated by the indexer in order to build up the LSI structure.
As presented in Eq. 4 (page 20) and Eq. 5 (page 20), SVD provides a simple scheme to approximate the original term-by-document matrix with a smaller matrix of rank k. The amount of dimension reduction, i.e., the choice of k, is a very critical parameter in our experiments. Ideally, we want a value of k that is large enough to fit all the real structure in the data, but small enough so that the sampling error or noise data are filtered out. However, the proper way to make such a choice remains an open issue in the factor analytic literature. In practice, we use an operational criterion: the value of k that yields the best retrieval performance is the best. A rule that most researchers follow is to assign k to be a certain percentage (e.g., 10%) of the full rank of the original term-by-document matrix, or to assign it a value between [100, 300] for very large collections. To provide a proper representation of the dimension reduction, we introduce a term, dimension reduction rate, which is defined as the dimension of the reduced LSI space (i.e., the value k) over the full rank of the original indexing matrix.

Based upon this higher-order structure, the transformer uses Eq. 12 (page 42) to construct the document representations in the LSI space from the SVD components. Each query in the query list undergoes similar processes to obtain its representation in the LSI space (Eq. 14, page 44). Then the evaluator employs the cosine similarity metric (Eq. 15, page 45) to compute the similarity values for each document, and the ranker outputs the documents in an order such that the most relevant ones come prior to the least relevant ones.

1.2.2 LSI-MPR Retrieval Systems

To set up the LSI-MPR retrieval systems, we reuse the LSI base of the standard LSI system and add in the observer, combiner and ranker components.

In the observer, we apply two different perspectives (i.e., p = 2) to observe each original document. Due to the fact that both collections consist of relatively short articles and no abstracts are available, the observer constructs the perspective documents in the following way. Each original document is divided by sentences. Denote o to be the number of overlapped sentences, and p to be the number of different perspectives. For every o + p sentences in the document, o sentences are shared by all the perspective documents and the other p sentences are evenly distributed to the p perspective documents. Please note that the sizes of the perspective documents associated with the same origin do not have to be exactly the same. Thus the overlapping rate r, which is used to represent the quota of the overlapped content in a perspective document, equals o/(o + 1).

For example, suppose there is an original document D_j that has 7 lines, and we assign o = 1 and p = 2. The method to distribute D_j's content into its perspective documents D_{j_1} and D_{j_2} is demonstrated as follows:

    D_j:     Line 1. Line 2. Line 3. Line 4. Line 5. Line 6. Line 7.
    D_{j_1}: Line 1. Line 2. Line 4. Line 5. Line 7.
    D_{j_2}: Line 1. Line 3. Line 4. Line 6. Line 7.

In order to make sure all perspective documents have reasonable sizes (i.e., contain enough information to represent their origin), we set r = 2/3 for the TIME collection and r = 5/6 for the ADI collection.^10

^10 For the TIME collection, o = 2 and p = 2. For the ADI collection, o = 5 and p = 2.
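One way to realize the splitting scheme described above is sketched below (Python; the seven-line document is the illustrative example given earlier, and the sentence segmentation is assumed to have been done already). It reproduces the example split exactly:

    def observe(sentences, o, p):
        """Split a document into p perspective documents: in every block of
        o + p sentences, the first o are shared by all perspectives and the
        remaining sentences are distributed one per perspective."""
        perspectives = [[] for _ in range(p)]
        for start in range(0, len(sentences), o + p):
            block = sentences[start:start + o + p]
            shared, rest = block[:o], block[o:]
            for i, persp in enumerate(perspectives):
                persp.extend(shared)
                if i < len(rest):
                    persp.append(rest[i])
        return perspectives

    doc = [f"Line {i}." for i in range(1, 8)]    # the 7-line example document
    for persp in observe(doc, o=1, p=2):
        print(" ".join(persp))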
In the parsing process, the LSI-MPR systems are implemented in a way that is slightly different from the LSI system. In order to keep a consistent system vocabulary for both systems, we filter the perspective documents with the effective term list that is generated by the standard LSI system. In other words, we only keep the words that are not in the stop list and that occur more than once in the original document collection. The reason why we do not try to set up a different system vocabulary for the LSI-MPR systems is that, in the observer, we introduced overlapped content into the perspective document collection; therefore, a word that occurs only once in the original document collection might occur more than once in the derived collection if it belongs to the overlapped part and is assigned to all the perspective documents having the same origin.

The LSI base processes the perspective document collection and the query to output the results of the relevance judgement analysis conducted on the perspective documents. These intermediate results are then analyzed by the combiner, which exploits a combination strategy (Eq. 9, page 35 or Eq. 10, page 36) to integrate the similarity values of the associated perspective documents. It then outputs the reconstructed similarity values for the original documents. The ranker then simply applies a sorting operation over the list of similarity values and, according to that, an ordered set of original documents can be obtained.

1.3 Results and Evaluation

In our experiments, the criterion employed for evaluating the retrieval performance of the IR methods is the interpolated average precision (see page 29 for details). Following Dumais' report, we used three standard recall levels: 0.25, 0.50 and 0.75.

1.3.1 Standard LSI vs LSI-MPR

In order to compare the retrieval performance of the LSI method using the traditional representation model and the multi-perspective representation model, we conducted experiments on the LSI system and on the LSI-MPR-A and LSI-MPR-V systems with different dimension reduction rates (i.e., the dimension of the reduced space over the rank of the term-by-document matrix, page 59): 0.1, 0.2, ..., 0.9. The results on the TIME and ADI collections are shown in fig. 6 and fig. 7 respectively, indicating that the retrieval performance of the LSI approach can be significantly improved by using the multi-perspective document representation method.

Fig. 6: Retrieval Performances on TIME using LSI, LSI-MPR-A and LSI-MPR-V Methods
Fig. 7: Retrieval Performances on ADI using LSI, LSI-MPR-A and LSI-MPR-V Methods

1.3.2 Standard Term Vector Approach vs Term Vector Approach with MPR Schemes

As stated on page 36, we consider the standard term vector method to be a special case of the LSI method when the dimension of the reduced space equals the rank of the term-by-document matrix, i.e., when the dimension reduction rate is 1.0. Table 4 demonstrates that the performance of the term vector method can be greatly improved by employing the multi-perspective document representations.

    Table 4: Retrieval Performances of Term Vector Approach using Traditional
    Representation Model & Multi-perspective Representation Model

    Collection   Approach                            Average Precision
    TIME         standard term vector approach                  57.92%
    TIME         term vector approach with MPR-A                58.95%
    TIME         term vector approach with MPR-V                58.87%
    ADI          standard term vector approach                  28.28%
    ADI          term vector approach with MPR-A                30.95%
    ADI          term vector approach with MPR-V                30.66%

1.3.3 Statistical Analysis

We have conducted statistical t-tests to examine whether the improvement obtained from the multi-perspective representation model remains significant, in order to ensure the significance of the results. Based upon the assumption that the difference between the precision values obtained through the standard LSI/term vector approach and the LSI/term vector approach with the MPR model conforms to a normal distribution, we exploit the tripled precision values for all queries at all standard recall levels at each dimension reduction rate using the standard LSI/term vector approach and the LSI/term vector approaches with the MPR schemes. The paired sample t-test has been applied to test the null hypothesis that the standard LSI/term vector approach and either of the two LSI/term vector approaches with MPR schemes do not have different performances on any dimension. The levels of significance, which are used to indicate whether the improved retrieval performance of our representation model has occurred merely by chance, are shown in fig. 8 and fig. 9. From the figures we can see that the improvements of the multi-perspective representation schemes are significant in most cases.

Fig. 8: Significance Level of LSI/(Term Vector) Approach with MPR Schemes on TIME

Fig. 9: Significance Level of LSI/(Term Vector) Approach with MPR Schemes on ADI

Furthermore, we have conducted t-tests on the tripled samples over all the dimension reduction rates we employed. The results of this statistical analysis are demonstrated in table 5. The significance levels, which fall far below conventional significance thresholds for both TIME and ADI, show that the improved performances of the multi-perspective representation schemes are significant for both the LSI and term vector approaches. By considering the overall performance on either the TIME or the ADI collection, the probability that the improvements come about by chance is extremely small. Consequently, we can claim that the improvements gained through using the multi-perspective representation solutions are statistically significant.

    Table 5: Significance Levels of t-tests on all the Samples on TIME & ADI

                                                              TIME            ADI
    improved performance of the LSI/(term vector)
    approach with MPR-A over the standard approach     7.95 x 10^-..   2.13 x 10^-.
    improved performance of the LSI/(term vector)
    approach with MPR-V over the standard approach     1.29 x 10^-9    1.13 x 10^-.
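For reference, the paired t-test used here can be reproduced with standard statistical tooling; the sketch below (Python with SciPy; the paired precision samples are hypothetical stand-ins for the tripled precision values described above) shows the computation of the significance level:

    import numpy as np
    from scipy import stats

    # Hypothetical paired samples: precision values for the same queries and
    # recall levels under the standard approach and under an MPR scheme.
    standard = np.array([0.52, 0.47, 0.61, 0.33, 0.58, 0.44, 0.50, 0.39])
    with_mpr = np.array([0.55, 0.49, 0.64, 0.37, 0.57, 0.48, 0.53, 0.42])

    # Paired sample t-test of the null hypothesis that the two approaches
    # perform identically; a small p-value indicates a significant difference.
    t_stat, p_value = stats.ttest_rel(with_mpr, standard)
    print(t_stat, p_value)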
2 The DLSI Approach Field Test

2.1 Problem Statement

The differential latent semantic indexing method has been applied in the fields of information retrieval and document classification. Experiments have shown that it demonstrates a significant improvement over the standard LSI method [CTN01, CTN03]. The intention of our experiments with the DLSI method is to test its effectiveness and to explore its applicability to a medical data classification problem.

We conducted our study on a well-known medical data set: the Wisconsin breast cancer data (WBCD), which is often used as a benchmark for testing the effectiveness of classifiers.^11 WBCD is a breast cancer sample collection periodically collected by Dr. Wolberg from his clinical cases. Each sample has been assigned a vector of 9 elements, all of which are in the interval of 1 to 10, with value 1 corresponding to a normal state and 10 to the most abnormal state [MW90]. The problem our classifier needs to address is how to classify an unlabeled data entry into one of the two classes: benign and malignant. The data set contains 699 entries, including 16 incomplete entries and 1 outlying entry, and its class distribution is 65% benign points and 35% malignant points.

^11 Retrieval Time: July 2005. http://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer

2.2 Procedure

In Chapter III, we provided an algorithm that applies the DLSI method to set up an information retrieval system. In order to utilize this algorithm in a classifier, we have to convert the classification process into a retrieval problem. Suppose we have a data set of n classes, and that class i contains p_i member items. We can regard these classes as distinct original documents. The jth member entry in class i can be considered a perspective document of original document i from the jth perspective.

We separated the entire WBCD data set into two subsets that do not intersect: a training set and a testing set. To set up the classifier, we process each member item in the training data set and construct its vector representation accordingly. We then employ the mean vector strategy (see page 51 for details) to construct the intra and extra differential term-by-document matrices, which are then decomposed by the SVD algorithm. For a testing data entry q, we assign x to be q - S_i, where S_i is the mean vector of cluster i. Following Eq. 17 (page 53), Eq. 18 (page 53) and Eq. 19 (page 53), we calculate P(x|M_I), P(x|M_E) and P(M_I|x). The cluster that has the largest P(M_I|x) value is chosen as the cluster to which entry q belongs.

2.3 Results and Evaluation

We conducted field tests with different training sets. The ratio of the training set to the complete data set ranges from 20% to 80%. Fig. 10 depicts the performance of our classifier.

Fig. 10: Classification Accuracy of DLSI Classifier on WBCD

To date, much research work has been conducted on the WBCD data set.
Methods such as the rule generation approach, the fuzzy-genetic approach and the neural network approach have been presented in the literature [CZP04]. Taha and Ghosh proposed a method using a neural network to extract rules [TG96], but it can only be applied to data with binary attributes. Setiono proposed a method based on the idea of finding a set of concise rules using a pruned neural network [Set96]. His classifier achieves good performance, but the extraction of rules is manually processed, which involves much human intervention.

The DLSI method, however, can overcome this limitation. Our classifier is a fully automated classification system and it obtains an average of 96.6% accuracy (derived from fig. 10). The best accuracy of 97.6% is obtained when we use 50% of the data for training. The classification performance still achieves 95.5% accuracy when only 20% of the data (i.e., 136 entries) are used for training. As far as we know, no known method has ever reached such a high performance with fewer than 200 training samples. The only comparable result using a small training set can be found in Wolberg's paper [WM90], where an accuracy of 93.5% is achieved by using 185 entries for training.

Chapter V

Conclusion

In this chapter, we present a summary of the thesis as well as describing directions for our future research.

1 Summary

In this thesis, we have proposed the multi-perspective representation (MPR) method, which is an innovative document representation model that imitates a human's manner of depicting objects, i.e., describing them from different perspectives. Two sample solutions were presented: a generalized MPR solution and the differential latent semantic indexing (DLSI) approach.

The generalized MPR solution is expected to be applicable to many IR methods that are based on the vector space model. Experiments have been conducted employing the generalized MPR solution on the latent semantic indexing method and the term vector method. Two standard document collections (TIME and ADI) have been used in field testing. The experimental results and the statistical analysis conducted on them have demonstrated the effectiveness of the generalized MPR solution on both IR methods.

The DLSI method is an LSI-based IR approach proposed in the literature, which also exhibits the attributes of the MPR model. We have explored its applicability in an IR-related problem: medical data classification. Results from these experiments have demonstrated that the DLSI method achieves high performance on a benchmark medical data set (WBCD).

2 Future Work

It might be argued that the improved performance of the IR methods using the MPR model does not agree with a common understanding in IR, that retrieval performance normally increases with the size of the documents, because our MPR model proposes the use of perspective documents, which are necessarily smaller than their origin (the original document). We would like to point out that although the size of an individual perspective document is smaller than that of an ordinary document, the sum of all the perspective documents is equal to or greater than the original document. We also believe that the improvement of the MPR solution results from the final fusion process, which integrates all the perspective documents.

We believe that the improvement of the MPR model should be more significant for longer documents than for shorter ones. The curves of significance level illustrated in fig. 8 and
fig. 9 seem to support this idea (TIME's average document size is bigger than ADI's). However, since both the TIME and ADI collections consist of relatively short documents, at our current stage of development we still do not know whether the observed improvement resulting from the MPR model's contribution will decrease or even disappear as the corpus size increases. Experiments on larger corpora will be necessary for further verification.

While presenting the generalized MPR solution, we provided some sample implementation strategies. We presented how we split the original document into multiple perspective documents and suggested that overlapped content should exist in all the perspective documents having the same origin. Yet we do not know how different strategies for obtaining multi-perspective documents will affect the retrieval performance. Experiments on more data collections may shed more light on these points.

We also introduced two sample combination strategies to reconstruct the analysis of the original document through its intermediates. Further research on other alternative strategies will be an interesting topic. As well, using n-grams as a scheme for extracting terms might bring to the IR methods the advantages of language independency as well as insensitivity to degraded text. In our future work, we can also explore the application of this scheme to the MPR solutions.

References

[BDO+96] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACKC (version 1.0) user's guide, 1996.

[CGD92] W. Cooper, F. Gey, and D. Dabney. Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198-210, Copenhagen, Denmark, 1992.

[CH88] W. Cohen and H. Hirsh. Text categorization using WHIRL. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 169-173, New York, 1988.

[CH95] A. Chakravarthy and K. Haase. NetSerf: Using semantic knowledge to find internet archives. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4-11, 1995.

[CTN01] L. Chen, N. Tokuda, and A. Nagai. Probabilistic information retrieval method based on differential latent semantic index space. IEICE Transactions on Information and Systems, E84-D(7):910-914, 2001.

[CTN03] L. Chen, N. Tokuda, and A. Nagai. A new differential LSI space-based probabilistic document classifier. Information Processing Letters, 88:203-212, 2003.

[CZP04] L. Chen, J. Zeng, and J. Pei. Classifying noisy and incomplete medical data by a differential latent semantic indexing approach. In Data Mining in Biomedicine. Springer Press, 2004. Accepted for publication.

[CZT05] L. Chen, J. Zeng, and N. Tokuda. A "stereo" document representation for textual information retrieval. Journal of the American Society for Information Science and Technology, 2005. Accepted for publication.

[Dam95] M. Damashek. Gauging similarity with n-grams. Science, 267:843-848, 1995.

[DDL+90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990.

[Dum91] S. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, and Computers, 23:229-236, 1991.

[Ear84] S. Earhart. The UNIX Programming Language, volume 1.
References

[BDO+96] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACKC (version 1.0) user's guide, 1996.

[CGD92] W. Cooper, F. Gey, and D. Dabney. Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198-210, Copenhagen, Denmark, 1992.

[CH88] W. Cohen and H. Hirsh. Text categorization using WHIRL. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 169-173, New York, 1988.

[CH95] A. Chakravarthy and K. Haase. Netserf: Using semantic knowledge to find internet archives. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4-11, 1995.

[CTN01] L. Chen, N. Tokuda, and A. Nagai. Probabilistic information retrieval method based on differential latent semantic index space. IEICE Transactions on Information and Systems, E84-D(7):910-914, 2001.

[CTN03] L. Chen, N. Tokuda, and A. Nagai. A new differential LSI space-based probabilistic document classifier. Information Processing Letters, 88:203-212, 2003.

[CZP04] L. Chen, J. Zeng, and J. Pei. Classifying noisy and incomplete medical data by a differential latent semantic indexing approach. In Data Mining in Biomedicine. Springer Press, 2004. Accepted for publication.

[CZT05] L. Chen, J. Zeng, and N. Tokuda. A "stereo" document representation for textual information retrieval. Journal of the American Society for Information Science and Technology, 2005. Accepted for publication.

[Dam95] M. Damashek. Gauging similarity with n-grams. Science, 267:843-848, 1995.

[DDL+90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the Society for Information Science, 41:391-407, 1990.

[Dum91] S. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, and Computers, 23:229-236, 1991.

[Ear84] S. Earhart. The UNIX Programming Language, volume 1. Holt, Rinehart and Winston, New York, 1984.

[FBY92] W. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.

[GCT97] W. Greiff, W. Croft, and H. Turtle. Computationally tractable probabilistic modeling of Boolean operators. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 119-128, 1997.

[Gre01] E. Greengrass. Information retrieval: A survey. DOD Technical Report TR-R52-008-001, 2001.

[GWHWH+99] A. Graesser, K. Wiemer-Hastings, P. Wiemer-Hastings, R. Kreuz, and the Tutoring Research Group, University of Memphis. AutoTutor: A simulation of a human tutor. Journal of Cognitive Systems Research, 1:35-51, 1999.

[Har86] S. Harter. Online Information Retrieval: Concepts, Principles, and Techniques. Academic Press, Inc., 1986.

[Har92] D. Harman. User-friendly systems instead of user-friendly front-ends. Journal of the American Society for Information Science, 43(2):164-174, 1992.

[Har95] D. Harman. Overview of the third text retrieval conference. In Third Text REtrieval Conference (TREC-3), pages 1-19. National Institute of Standards and Technology Special Publication 500-207, 1995.

[Hof99] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM Conference on Research and Development in Information Retrieval, pages 50-57, Berkeley, California, 1999.

[Hul94] D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 282-291, 1994.

[Lee94] J. Lee. Properties of extended Boolean models in information retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 182-190, 1994.

[LPY94] E. Liddy, W. Paik, and E. Yu. Text categorization for multiple users based on semantic features from a machine-readable dictionary. ACM Transactions on Information Systems, 12(3):278-295, 1994.

[Mon02] M. Montague. Metasearch: Data Fusion for Document Retrieval. PhD thesis, Dartmouth College, Hanover, New Hampshire, 2002.

[MW90] O. Mangasarian and W. Wolberg. Cancer diagnosis via linear programming. SIAM News, 23(5):1-18, 1990.

[MW98] Merriam-Webster, editor. Merriam-Webster's Collegiate Dictionary. 10th edition, 1998.

[OC03] P. Ogilvie and J. Callan. Combining document representations for known item search. In Proceedings of the Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 143-150, Toronto, Canada, 2003. ACM.

[Pai84] C. Paice. Soft evaluation of Boolean search queries in information retrieval systems. Information Technology: Research and Development, 3(1):33-42, 1984.

[PN96] C. Pearce and C. Nicholas. TELLTALE: Experiments in a dynamic hypertext environment for degraded and multilingual data. Journal of the American Society for Information Science, 47(4):263-275, 1996.

[Por80] M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.

[Por97] M. Porter. An algorithm for suffix stripping. In K. S. Jones and P. Willett, editors, Readings in Information Retrieval, pages 313-316. Morgan Kaufmann Publishers Inc., 1997.

[RGF90] K. Rose, E. Gurewitz, and G. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11:589-594, 1990.

[Ril95] E. Riloff. Little words can make a big difference for text classification. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 130-136, 1995.
[RSW+98] B. Rehder, M. Schreiner, M. Wolfe, D. Laham, T. Landauer, and W. Kintsch. Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25:337-354, 1998.

[Sal71] G. Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, New Jersey, 1971.

[Sal89] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.

[SB88] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.

[Set96] R. Setiono. Extracting rules from pruned neural network for breast cancer diagnosis, pages 37-51, 1996.

[SM83] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Publishing Company, New York, 1983.

[SP97] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Claire Cardie and Ralph Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81-89. Association for Computational Linguistics, Somerset, New Jersey, 1997.

[SS97] H. Schutze and C. Silverstein. Projections for efficient document clustering. In ACM/SIGIR, pages 74-81, 1997.

[Str98] G. Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, 3rd edition, 1998.

[TC91] H. Turtle and W. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), 1991.

[TG96] I. Taha and J. Ghosh. Characterization of the Wisconsin breast cancer database using a hybrid symbolic-connectionist system. Technical Report UT-CVISS-TR-97-007, Computer and Vision Research Center, University of Texas, 1996.

[TP91] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 1991.

[vR79] C. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.

[WK79] W. Waller and D. Kraft. A mathematical model of a weighted Boolean retrieval system. Information Processing and Management, 15:235-245, 1979.

[WM90] W. Wolberg and O. Mangasarian. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, volume 87, pages 9193-9196, 1990.

[WZ02] J. Wu and Z. Zhou. Face recognition with one training image per person. Pattern Recognition Letters, 23(14):1711-1719, 2002.

[Zim91] H. Zimmerman. Fuzzy Set Theory and its Applications. Kluwer Academic Publishers, 2nd edition, 1991.

Appendix

List of Publications

1. Refereed Journal Article: Liang Chen, Jia Zeng and Naoyuki Tokuda, A "stereo" document representation for textual information retrieval, Journal of the American Society for Information Science and Technology (JASIST), accepted for publication (2005).

2. Refereed Book Chapter: Liang Chen, Jia Zeng and Jian Pei, Classifying Noisy and Incomplete Medical Data by a Differential Latent Semantic Indexing Approach, in P. Pardalos et al., Eds., Data Mining in Biomedicine, Springer Press, accepted for publication (2004).