STUDY OF DOCUMENT RETRIEVAL USING LATENT SEMANTIC
INDEXING (LSI) ON A VERY LARGE DATA SET

by

A N K Zaman
B.Sc. Computer Science and Technology, University of Rajshahi, Bangladesh
M.Sc. Computer Science and Technology, University of Rajshahi, Bangladesh

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
MATHEMATICAL, COMPUTER, AND PHYSICAL SCIENCES
(COMPUTER SCIENCE)

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA
January 2010

© A N K Zaman, 2010

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

1+1

Library and Archives
Canada

Bibliothgque et
Archives Canada

Published Heritage
Branch

Direction du
Patrimoine de l'6dition

395 Wellington Street
Ottawa ON K1A0N4
Canada

395, rue Wellington
Ottawa ON K1A0N4
Canada
Your file Votre reference
ISBN: 978-0-494-60849-4
Our file Notre r6f6rence
ISBN: 978-0-494-60849-4

NOTICE:

AVIS:

The author has granted a non­
exclusive license allowing Library and
Archives Canada to reproduce,
publish, archive, preserve, conserve,
communicate to the public by
telecommunication or on the Internet,
loan, distribute and sell theses
worldwide, for commercial or non­
commercial purposes, in microform,
paper, electronic and/or any other
formats.

L'auteur a accorde une licence non exclusive
permettant a la Bibliotheque et Archives
Canada de reproduire, publier, archiver,
sauvegarder, conserver, transmettre au public
par telecommunication ou par I'lnternet, preter,
distribuer et vendre des theses partout dans le
monde, a des ftns commerciales ou autres, sur
support microforme, papier, electronique et/ou
autres formats.

The author retains copyright
ownership and moral rights in this
thesis. Neither the thesis nor
substantial extracts from it may be
printed or otherwise reproduced
without the author's permission.

L'auteur conserve la propriete du droit d'auteur
et des droits moraux qui protege cette these. Ni
la these ni des extraits substantiels de celle-ci
ne doivent etre imprimes ou autrement
reproduits sans son autorisation.

In compliance with the Canadian
Privacy Act some supporting forms
may have been removed from this
thesis.

Conformement a la loi canadienne sur la
protection de la vie privee, quelques
formulaires secondaires ont ete enleves de
cette these.

While these forms may be included
in the document page count, their
removal does not represent any loss
of content from the thesis.

Bien que ces formulaires aient inclus dans
la pagination, il n'y aura aucun contenu
manquant.

I+I

Canada
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Abstract

The primary purpose of an information retrieval system is to retrieve all the
relevant documents, which are relevant to the user query. The Latent Semantic Indexing
(LSI) based ad hoc document retrieval task investigates the performance of retrieval
systems that search a static set of documents using new questions/queries. Performance of
LSI has been tested for several smaller datasets (e.g., MED, CISI abstracts etc) however,
LSI has not been tested for a large dataset. In this research, we concentrated on the
performance of LSI on large dataset. Stop word list and term weighting schemes are two
key parameters in the area of information retrieval. We investigated the performance of
LSI by using three different set of stop word lists and, also, without removing the stop
words from the test collection. We also applied three different term-weighting (raw term
frequency, log-entropy, and tf-idf) schemes to measure retrieval performance of LSI. We
observed that, firstly, for a LSI based ad hoc information retrieval system, a tailored stop
word list must be assembled for every unique large dataset. Secondly, the use of tf-idf
term weighting scheme shows better retrieval performance than log-entropy and raw term
frequency weighting schemes even when the test collection became large.

ii

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

®e3ioa£ed,

ta nujr juwesitb,

my,
<m2r nujr {Ulovted aoa

iii

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

TABLE OF CONTENTS
ABSTRACT

II

LIST OF TABLES

VI

LIST OF FIGURES

VII

LIST OF FIGURES

VII

ACKNOWLEDGEMENT

VIII

ACKNOWLEDGEMENT

VIII

CHAPTER-1

1

INTRODUCTION

1

1.1 BACKGROUND
1.2 BASIC TERMINOLOGY
1.2.1 TREC
1.2.2 Term Weighting Scheme
1.2.3 Stop Words
1.2.4 Stemming
1.3 PROBLEM STATEMENT
1.4 THESIS OUTLINE

1
1
1
2
2
2
2
3

CHAPTER-2

4

THEORIES OF INFORMATION RETRIEVAL TECHNIQUES

4

2.1 INTRODUCTION
2.2 BACKGROUND AND RELATED THEORY
2.3 INFORMATION RETRIEVAL vs DATA RETRIEVAL
2.4 INFORMATION RETRIEVAL SYSTEMS
2.5 INFORMATION RETRIEVAL APPROACHES
2.5.1 Classical Boolean Approach
2.5.2 Extended Boolean Approach
2.5.3 Vector Space Approach
2.5.4 Latent Semantic Indexing
2.6 TREC AND RELEVANCE JUDGMENT
2.6.1 Text REtrieval Conference-8 (TREC-8) Relevance Judgment
2.6.2 Relevance Judgement Problems
2.7 EVALUATION OF INFORMATION RETRIEVAL PERFORMANCE
2.7.1 Recall and Precision
2.7.2 Recall-Precision Calculations and Plotting
2.8 PAIRED SAMPLE T-TEST

4
4
4
5
6
7
7

8
14
20
21
21
22
23
23
26

CHAPTER-3

28

ISSUES IN LATENT SEMANTIC INDEXING (LSI) RESEARCH

28

3.1 INTRODUCTION
28
3.2 RELATED RESEARCH WORK ON LATENT SEMANTIC INDEXING (LSI)
28
3.2.1 Using Latent Semantic Analysis to Improve Access to Textual Information
28
3.2.2 Indexing by Latent Semantic Analysis
29
3.2.3 Large-Scale Information Retrieval Using Latent Semantic Indexing
30
3.2.4 Automatic Cross-Language Information Retrieval using Latent Semantic Indexing
31
3.2.5 Latent Semantic Indexing-Based Intelligent Information Retrieval System for Digital Libraries
31
3.2.6 Singular Value Decomposition and Rank K Approximation
32
3.2.7 Singular Value Decomposition for Text Retrieval
32
3.3 CONCLUSION
34
CHAPTER-4

35

TEST COLLECTION AND PRE-PROCESSING

35

4.1 INTRODUCTION

35

iv

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

4.2 STANDARD TEST COLLECTION
4.2.1 TREC and Standard Test Collection
4.2.2 Los Angeles Times (LA Times) Text Collection (TREC-8)
4.3 PRE-PROCESSING OF DATA
4.3.1 Method to Extract Documents
4.3.2 Stop Word Removal and Stemming
4.4 STOP WORD REMOVAL AND STEMMING PROGRAM
4.5 QUERY SELECTION
4.7 CONCLUSION
CHAPTER-5

35
35
36
38
38
39
42
44
45
46

EXPERIMENTAL STUDIES ON STOP WORD LISTS AND TERM WEIGHTING SCHEMES.... 46
5.1 INTRODUCTION
5.2 GTP (GENERAL TEXT PARSER)
5.3 EXPERIMENTAL DESIGN
5.3.1 Experiment-1: Study of Stop Words
5.3.2 Experiment-2: Study of Weighting Schemes
5.4 RESULTS AND EVALUATIONS
5.4.1 Result: Experiment-1
5.4.2 Result: Experiment-2
5.5 CONCLUSION

46
46
48
48
49
49
49
53
54

CHAPTER-6

56

CONCLUSION AND FUTURE DIRECTIONS

56

6.1 INTRODUCTION
6.2 SUMMARY
6.3 POSSIBLE EXTENSIONS OF THE EXPERIMENTS

56
56
57

BIBLIOGRAPHY

59

APPENDIX A

67

1. STOP WORDS LIST: GENERATED STOP WORDS LIST (UNIVERSITY OF NORTHERN BRITISH COLUMBIA)67
2. STOP WORDS LIST: UNIVERSITY OF TENNESSEE
74
3. STOP WORDS LIST: UNIVERSITY OF GLASGOW
76
APPENDIX B
SOURCE CODE: TREC FILE EXTRACTOR
APPENDIX C
GENERAL TEXT PARSER [UNIVERSITY OF TENNESSEE][GILES ET AL 2001]
QUERY PROCESSING WITH GTPQUERY
APPENDIX D
T-TEST CALCULATIONS
LEVEL OF SIGNIFICANCE FOR TWO-TAILED T-TEST
PUBLICATIONS

79
79
96
96
98
100
100
102
103

v

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

List of Tables

Table 2.1: Information Retrieval Vs Data Retrieval
Table 2.2:Term-by-document matrix
Table 2.3: Angles and their Cosine values
Table 2.4: Recall-precision of a single query
Table 2.5: Interpolated precision of a query
Table 2.6: Interpolated average precision of two different systems

5
9
20
24
24
25

Table 3.1: A set of datasets [Letsche 1996]

30

Table 4.1: Input Data Characteristics (LA Times Data Set (1989, 1990))

38

Table 5.1: Different Parameters for the Study of Stop Words
49
Table 5.2: Different Parameters for the Study of Weighting Schemes
49
Table 5.3: 10-point Interpolated Precision of Four Different Systems
50
Table 5.4: 10-point Interpolated Precision, Mean and Standard Deviation of Four Systems
52
Table 5.5: Comparison among Systems in Terms of T-test
53
Table 5.6: 10-point Interpolated Precision of Three Different Systems
53

vi

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

List of Figures
Figure 2.1: Framework of an Information Retrieval System
Figure 2.2: Recall-Precision Graph of a query
Figure 2.3: Recall-precision graph of two retrieval systems

6
25
26

Figure 3.1: LSI vs. Term Matching on MEDLINE [Husbands et al 2001]
Figure 3.2: LSI vs. Term Matching on TREC6 [Husbands et al 2001]

33
33

Figure 4.2: An extract from the LA Times Collection, file name LAI 23190, and
illustrates one document
Figure 4.3: TREC File Extractor
Figure 4.4: An Input File before Pre-processing
Figure 4.5: An Input File after Pre-processing
Figure 4.6: Pre-processing Software
Figure 4.7: First Two Queries (401 and 402) from the TREC-8 Query Set

37
39
41
42
43
45

Figure 5.1: Latent Semantic Indexing (LSI) Retrieval Systems [GILES ET AL 2001]....47
Figure 5.2: Recall-Precision Graph for Experiment-1
51
Figure 5.3: Recall-Precision Graph for Experiment-2
54

vii

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Acknowledgement
I would like to express my deepest gratitude to my supervisors, Dr. Charles Brown and
Dr. Liang Chen, for their active support throughout my master's study at UNBC. Their
suggestions, comments, and encouragement helped me to complete this thesis. I also wish
to thank Dr. Jennifer Hyndman for her encouragement and valuable advice on my thesis.
As well, I would acknowledge the external reviewer for his willingness to serve on my
thesis committee.

The work carried out in this thesis is financially

partly supported by the Research

Assistantship (RA) award from the Computer Science Program, UNBC. I am also very
much thankful to the RA committee for granting me the award.

I would also like to extend my thanks to Heath de la Giroday who actively helped me in
editing my thesis.

I am indebted to my parents and in-laws for their encouragement and support in
everything I have done. Finally, I would like to thank my wife Majida Akter for her
constant support and encouragement which made it possible for me to pursue and
complete this work.

viii

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Chapter-1
Introduction
1.1 Background
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data,
especially textual documents, in response to a query or topic statement. The need for
effective methods of automated IR has grown in importance because of the tremendous
explosion in the amount of unstructured data in digital form.

IR focuses on retrieving documents based on the content of their unstructured
components. An IR request (typically called a "query") may specify desired
characteristics of both the structured and unstructured components of the documents to be
retrieved, e.g., "The documents should be about 'Global Warming' and their author must
be 'Rosemary'." In this example, the query asks for documents whose body (the
unstructured part) is "about" a certain topic and whose author (a structured part) has a
specified value.

IR typically seeks to find documents in a given collection that are "about" a given topic or
that satisfy a given information need. The topic or information need is expressed by a
query, generated by the user. Documents that satisfy the given query in the judgment of
the user are said to be "relevant." Documents that are not about the given topic are said to
be "non-relevant."
1.2 Basic Terminology
Before mentioning the problem statement we present some terminology associated with
this thesis:
1.2.1 TREC
TREC (Text REtrieval Conference) is part of U.S. National Institute of Standards and
Technology (NIST), and TREC provides the infrastructure (e.g., organizing text retrieval
conferences, provide standard dataset etc) for large-scale testing of text retrieval
technology.

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

1.2.2 Term Weighting Scheme
In the text based IR process term weighting schemes help us to assign a weight/value to
every term of a document. Tf-idf, and log-entropy are examples of term weighing
schemes. Term weighting schemes are discussed in Chapter-2 in details.
1.2.3 Stop Words
Stop words are the list of words (e.g., Articles: a, an, and the, Prepositions: at, by, in, to,
from, and with, Conjunctions: and, but, as, and because, others: become, everywhere,
and likely) that are not considered influential during the execution of Latent Semantic
Analysis (LSA) process. A stop word list is also called a negative-dictionary.
1.2.4 Stemming
Stemming is a technique to reduce a word to its stem or root form. Stemming improves
retrieval performance by reducing distinct index terms. For example:
• Plurals:
1. The word cats is stemmed as cat because of the rule s -> 0
2. The word stresses is stemmed as stress because of the rules sses ->ss and ss->ss
• Participles:
1. The word examined is stemmed as examin because of the rule ed -> 0
2. The word playing is stemmed as play because of the rule ing -> 0
1.3 Problem Statement
We have reviewed the published work on LSI for the last 20 years, and we found some
open questions on LSI, stop word lists, term weighting schemes, and large datasets. So,
we are interested in exploring the following issues:
•

LSI has not been tested on a large standard dataset with the most accepted
weighting schemes tf-idf and log-entropy. So, we are interested in the
performance of LSI on a large standard dataset.

•

TREC encourages text retrieval research, but TREC does not provide any
evidence/standard rule to use stop words in IR research. So, we propose an
algorithm to produce a stop word list based on TREC-8 LA Times dataset for our
experiments.

•

Different information retrieval research groups use different stop words lists. Our
aim is to find the effect of stop words in the IR process in the case of a large

2

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

standard dataset. Also to find the effect of arbitrary (a stop word lists is not
tailored for a unique dataset) use of stop word list on LSI-based retrieval system.
•

Term weighting schemes improve retrieval performance, however, we did not find
any concrete evidence that mentions which weighting scheme gives better
performance in case of a large dataset. So, we decided to find the effects of
different term weighting schemes (raw term frequency, tf-idf, and log-entropy) on
LSI based IR process in case of large standard dataset.

1.4 Thesis Outline
The rest of the thesis is organized as follows:
Chapter-2 focuses on the theories of IR and LSI as well as the evaluation methods of
retrieval performance. Chapter-3 is the literature review of the Latent Semantic Indexing
(LSI) research. Chapter-4 focuses on the characteristics of the large test collection and
associated query set. It also illustrates the different phases of data pre-processing, e.g.,
stemming. Chapter-5 focuses on the experimental setup, obtained results, and the
evaluations. Chapter-6 contains the concluding remarks.

3

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Chapter-2
Theories of Information Retrieval Techniques
2.1 Introduction
With the exponential development of information technologies, a vast amount of
intellectual resources has been recorded in computer-readable forms of information that
can be digitally transmitted or processed. The evolution of communication networks has
made an immense number of datasets available in the public domain that can be used by
ordinary users. However, due to the tremendous volume of the data, it has become
extremely difficult for the users to find the desired information. This challenge has led to
a huge demand for information management technologies which can facilitate access to
the information for users. [Zeng 2005] The objective of this thesis to study the effects of
different stop word lists, different term weighting schemes on LSI based IR system with a
very large dataset.
2.2 Background and Related Theory
Information retrieval encompasses the techniques for retrieving relevant information from
storage in response to a user's request. In the following sections, a brief overview of the
field of information retrieval is provided in order to introduce the proposed research
work. The following sections present a historical review of a number of IR approaches
that are relevant to the thesis topic. The evaluation strategy that is used to estimate the
retrieval performance of an IR method is also covered.
2.3 Information Retrieval vs Data Retrieval
When the term information retrieval (IR) is mentioned in this thesis, it refers to the
automatic information retrieval systems that search unstructured databases for data
records related to the user's queries, and inform the user regarding the existence/nonexistence as well as the location of the document. Although the term information has a
close association with the term data, they are not equivalent concepts. The differences
between information retrieval and data retrieval are illustrated in Table 2.1. [Rijsbergen
1979]

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Information Retrieval

Data Retrieval

Natural

Artificial (not natural)

Incomplete

Complete

Best Match, Partial Matching

Exact match

Relevance

Exact matching

Probabilistic or Statistical

Deterministic

Query Language
Query Specification
Matching
Items Wanted
Model

Table 2.1: Information Retrieval vs Data Retrieval

In IR, the query statement is usually expressed in natural language and does not
necessarily need to be structured. On the contrary, data retrieval usually requires the
request to comply with specified syntax and to provide a complete description of the
desired information. With regard to evaluating the retrieved records, IR judges relevant
items to be good matches, and among them the most relevant one as judged by the user, is
determined to be the best match. Whereas, data retrieval only regards exact matches as an
acceptable match. Due to the differences between information retrieval documents
(unstructured records) and data retrieval documents (structured records), information
retrieval engines might, but are not guaranteed to, retrieve all the relevant documents
from the storage. Even if IR systems manage to retrieve a list of relevant records, such a
list might not be complete. In contrast, data retrieval systems are guaranteed to output all
occurrences of the records that match.
2.4 Information Retrieval Systems
An information retrieval system is a "device interposed between a potential user of
information and the information collection itself'. [Harter 1986] The purpose of an IR
system is to capture wanted items and to filter out unwanted items. [Harter 1986] A
typical IR system performs representation, storage, and retrieval of unstructured data, and
may contain some or all of the following parts/components: indexing, query operation,
matching, output module, and feedback module. The indexing usually contains two
primary processes. The first

process conducts conceptual analysis of information

resources in the collection to obtain the contained concepts. These concepts, usually
called effective terms, make up a system vocabulary that is applicable to all the
information pieces in this system. The output of the first process flows to the second
stage, in which a translation mechanism is employed and a database of information
representations is created. When an information request is posed, the query operation

5

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

process parses it into its constituent elements. An analysis is conducted over its
conceptual content and the query is transformed into a representation that is consistent
with all the information items in storage. Given the query representation, the matching
mechanism evaluates the relatedness of all the potential targets to the query and obtains a
rank of relevance. An ordered set of information items is returned to the user by the
output module. When interaction between the user and the retrieval system is available,
one can communicate with the system through the feedback module by refining the query
during one search session in the light of sample retrieval.

information
Resources>

Conceptual
* Analysis

•

Translation

Database of
Information
Representations

<Retrieved
> Information>

Evaluation
Strategy

Request
Representation

<Information
Needs>

Conceptual
Analysis

Translation

Figure 2.1: Framework of an Information Retrieval System

The user interface serves as a bridge connecting the client to the other modules of the
system. The infrastructure of a typical IR system is depicted in Fig. 2.1.
2.5 Information Retrieval Approaches
In the field of information retrieval, there are two major categories of techniques:
statistical analysis and semantic analysis. Statistical approaches consider the freeform
natural language documents to be pure data records and index them in terms of some
statistical measure. Assessment of the relevance between a pair of documents is also

6

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

based on a certain statistical metric. Semantic approaches, however, reproduce some
degree of understanding of natural language texts that a human may provide.

By far, the greatest amount of work has been devoted to statistical approaches which fall
into four categories: classical Boolean, extended Boolean, vector space, and probabilistic.
The following sections briefly discuss classical Boolean, extended Boolean, and vector
space approaches.
2.5.1 Classical Boolean Approach
The classical Boolean approach is based on the theory of Boolean algebra. A
conventional Boolean query combines terms with the classical Boolean operators AND,
OR, and NOT, and is evaluated by the classical rules of Boolean algebra. Such a model is
very straightforward and relatively tractable to implement.

However, the classical Boolean method has some major limitations due to the
characteristics of the standard Boolean model. Like all Boolean expression, the query
only has two values: true or false. Correspondingly, a document is either relevant to a
query or non-relevant to it. Therefore, no ranking strategy is possible in the classical
Boolean method. With regard to the term weighting, only two values are available: 0 for
an absent term and 1 for a present term. Such an all-or-nothing condition tends to have
the effect that either an intimidating large amount of documents or none at all are
retrieved [Harman 1992], As well, the classical Boolean rules tend to produce
counterintuitive results because of this all-or-nothing characteristic. For example, in
response to a multi-term OR operation, "a document containing all (or many of) the query
terms is not treated better than a document containing one term" [Salton & Buckley
1987], Similarly, in response to a multi-term AND operation "a document containing all
but one query term is treated just as badly as a document containing no query term at all"
[Salton & Buckley 1987]. The ranking inefficiency of the classical Boolean model is a
considerable issue that makes this method ineffective in document retrieval.
2.5.2 Extended Boolean Approach
As mentioned above, the classical Boolean scheme has ranking inefficiency and to
overcome this inefficiency several extended Boolean models that integrate ranking

7

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

strategies are discussed in the literature. [Waller & Kraft 1979], [Salton & McGill 1983],
[Paice 1984], [Zimmerman 1991], [Grei et al 1997]

The extended Boolean models employ extended Boolean operators, also called soft
Boolean operators. These operators assign weights to the terms in each document. They
also extend the truth value range from a discrete two-element-set: {0, 1} in the case of
classical Boolean model to a consecutive range: [0, 1]. In other words, the operators
evaluate their arguments to a number, corresponding to the estimated degree to which a
given logical expression matches the given document. By doing this, the extended
Boolean methods are able to provide a ranked output allowing some documents to satisfy
the query condition more closely than others. [Lee 1994] Therefore extended Boolean
methods manage to overcome the limitation of the classical Boolean approach. From
literature it is found that the extended Boolean model can achieve greater IR performance
than either the classical Boolean or the vector space model. [Greengrass 2001] However,
there is a big price for this performance improvement. Formulating effective extended
Boolean queries involves more thought and expertise in the query domain than either the
classical Boolean method or the vector space approach. [Greengrass 2001]
2.5.3 Vector Space Approach
A Vector space model (or term vector model) is an algebraic model for representing text
documents (and any objects, in general) as vectors of identifiers, such as terms. A Vector
space model is used for various operations of IR such as information filtering,
information retrieval, indexing and relevancy rankings. The vector space approaches have
achieved great success in IR by applying the theory of linear algebra in this representation
model.

In the traditional vector space method, the union of the effective terms defines a
document space so that each distinct term represents one dimension in this space. For a
given document, each term is assigned a numeric weight which estimates the usefulness
of the term as a descriptor of the given document, i.e., the discriminatory power of the
term for the document. By exploiting the weights of all the terms for a document, the
document is then encoded as a term vector in the document space. It is worth noting that a
query is usually considered to be a pseudo-document and can also be represented as a
term vector.

8

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Perspectives of document space and term space can be combined by viewing the entire
collection as a term-by-document matrix, also called an indexing matrix. Each row of this
matrix represents a term, and each column of this matrix represents a document. The
element at row /, column j reflects the importance of term i in representing the
characteristics of document j. A data set of d documents and t terms can be represented by
the matrix shown below:

M=
\ m n m,2 . . . mtd j
Please note that any defined denotation, e.g., d, t and M, will be applicable to the rest of
this text. An example of term-by-document matrix is given below:

Let us consider three documents (dl, d2, and d3) and each document contains one
sentence:
dl: You read magazine.
d2: You play cricket.
d3: You like like like pizza.
The term-by-document matrix takes the following form for the above documents:

Terms
you
read
play
cricket
like
pizza

dl
1
1
0
0
0
0

Documents
d2
1
0
1
1
0
0

d3
1
0
0
0
3
1

Table 2.2: Term-by-document Matrix

We can also write down this term-by-document matrix as follows:

9

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

'i

i

1

0 0

Mg*3—

r

0 1

0

0 1

0

0

3

0

,0 0 1,
2.5.3.1 Term Weighting
One significant issue is that any vector space model needs to address term weighting, i.e.
assigning weight to a term for a document so that the assigned weight properly reflects
the contribution of the term in distinguishing the document from other documents. A
variety of weighting schemes have been proposed, which basically fall into two
categories: local weighting and global weighting.

Local weighting schemes attempt to reflect the importance of a term within a given
document by a document-specific statistic. Usually, the local weights assigned to
identical terms vary among documents. Some popular local weighting functions include
the raw term frequency, binary, and logarithm of the term frequency. Let us denote L,j to
be the local weight of term i in document j and denote tfj to be the frequency with which
term i appears in document j. The local weights for three weighting schemes are as
follows:
o

Raw Term Frequency: L,j = tfj

o

Binary L,y=

o

1(a)

fl

if
J

[0

otherwise

fl + log tfu
Logarithm L,j = <

0

tf
>1
J,j

if

tf>0
''
otherwise

1(b)

1 (c)

There is a critical problem associated with raw term frequency. In the case of raw term
frequency, all terms are considered equally important when it comes to assessing
relevancy on a query; as a result, certain terms have little or no discriminating power in
determining relevance, and thus have no effect in scoring. For example, a document with
10 occurrences of the term is more relevant than a document with one occurrence of the
term but not ten times as relevant.

10

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

The problem with the binary weighting scheme is that it fails to capture the fact that some
terms might be more important than others as the binary weighting scheme only considers
the presence or absence of a term in a document.

It is worth mentioning that the logarithm weighting scheme exploits the logarithm
function to transform the raw term frequency, thereby dampening the effects of large
differences in frequencies of individual terms. For example, let us consider tfi = 1000 is
the term frequency of a term i in the entire collection and tf^ =10 is the term frequency of
a term j in the entire collection. The ratio of two raw term frequencies is tfi/tf2 =100. If we
go for logarithm weighting scheme (please see equation 1(c)) then local weight, Lj for tfi
is 4 and local weight, L2 for tf2 is 2, so after applying logarithm weighting schemes the
ratio of two local weights L1/L2-4/2=2, so it is clear that the logarithm weighting scheme
reduces the large differences in frequencies of individual terms.

Global weighting schemes on the other hand attempts to attenuate the effect of terms that
occur too often in the collection to be meaningful in determining relevance of documents
against a query. The idea is to scale down the term weights of terms with high collection
frequency (defined to be the total number of occurrences of a term in the collection) by
reducing the tf (term frequency) weight of a term by a factor that grows with its collection
frequency. In addition to estimating the document-specific statistics, characterizing a
term's overall importance in the whole collection can also be useful. Global weighting
strategies are designed to measure the distribution of a term within the given collection.
Thus they are able to estimate how frequently a term occurs in a certain document by
chance. Generally, they give less weight to terms that occur frequently or occur in many
documents because these terms are not considered to be strong descriptors for any
specific document. Four well-known global weighting schemes are: normalized term
frequency (or normal for short), inverse document frequency (or idf for short), term
frequency-idf (or tf-idf for short), and entropy. Let G, be the global weight assigned to
term /, let gfi be the frequency term i occurs in the entire collection, let df be the
frequency of documents in which term i occurs, and let d be the number of documents in
the whole collection.

The evaluation of G, by the four types of global weighting methods is discussed below:
11

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

The normal weighting scheme normalizes the length of the vector for a term to 1, giving
high weight to infrequent terms in the collection. However, normal weighting only
depends on the sum of the squared frequencies, not the distribution of those frequencies
per se. The following equation presents the normal weighting scheme:

o

Normal: Gj

2(a)

The tf-idf and idf weighting schemes are closely related because both schemes weight
terms inversely by the number of documents in which they appear. Neither method
depends on the distribution of terms; rather they depend on the number of different
documents in which a term occurs. The tf-idf weighting scheme is used to scale the
frequency of a term. By nature tf assigns high weights to the frequent terms and low
weights to the rare terms. On the other hand, idf (inverse document frequency) assigns
high weight to a rare term and low weight to a frequent term. So, tf-idf term weighting
scheme produces a composite weighting schemes with the following characteristics:
•

Weight of a certain term becomes highest when the term occurs many times
within a small number of documents.

•

Weight of a certain term becomes lower when the term occurs fewer times in a
document or occurs in many documents.

•

Weight of a certain term becomes lowest when the term occurs in virtually all
documents.

•

The tf-idf weight values are strictly smaller than 1.

Reviewing the tf-idf weighting scheme in the IR literature leads to the following points
being noted:
•

tf-idf attenuates the effect of terms that occur too often in the input dataset.

•

The discriminatory power of tf-idf allows the retrieval engine to quickly find
relevant documents that are likely to satisfy the user.

•

tf-idf exhibits an overall advantage over other term weighting system. [TREC-3,
1997]

The following equations present the idf and tf-idf weighting schemes:

12

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

o

2(b)

Idf: G,= logf -j\dfi)

o

Tf-idf: G, = tft] x idf

2(c)

The entropy scheme is based on information theoretic ideas and is the most
sophisticated weighting scheme. It exploits the distribution of terms over documents.
The main idea is to assign less weight to terms that are equally distributed over all the
documents and assign more weight to terms that are concentrated in a few documents.
[Dumais 1991], [Manning et al 2008] The following equation presents the entropy
weighting scheme:

Entropy: G,= 1 - V'' ——7^7^ where p =—
^ \og{d)
gf,

2(d)

All of the global weighting schemes share a principle of assigning less weight to terms
that occur frequently. Global weights involve variations in the relative importance of
local frequency, global frequency, and document frequency.

We can combine local and global weighting schemes to measure weight of a term as each
of the weighting schemes contains advantages and limitations. There is not a fixed
solution for choosing a term weighting scheme. Both local weights and global weights are
used to measure the term weights; the value of the z'th row, y'th column element can be
evaluated as shown below:

o

m,j = Ljj x G ,

(3)

If you look at the individual value of m in the matrix M [section 2.5.3] and the local and
global weightings equations [1(a), 1(b), 1(c) and 2(a), 2(b), 2(c)], it is clear that the value
of m is dependent on both local and global weights of terms.
2.5.3.2 Normalization (equivalence classing of terms)
After getting the tokens from the documents and query, the simple case is if tokens in the
query just match with the tokens in the token list of a document. However, there are many

13

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

situations when two character sequences are not quite the same but you would like a
match to occur. For instance, if we search for UK, we might hope to also match
documents containing U.K.

Term/Token normalization is the process of canonicalizing tokens so that matches occur
despite superficial differences in the character sequences of the tokens. The standard way
to normalize is to implicitly create equivalence classes, which are normally named after
one member of the set. For example, if the tokens anti-war and antiwar are both mapped
onto the term anti-war, in both the document text and queries, then searches for one term
will retrieve documents that contain either. [Manning et al 2008]
2.5.3.3 Discussion
As an efficient model, the traditional vector space scheme is becoming very popular in IR
research. Since it has a sound mathematical foundation, it applies a similarity
measurement [see section 2.5.4.5] technique between document and query to produce a
ranked list of relevant documents.

The traditional vector space approach provides an effective way to approximate the
statistical properties of the document set. The major problem that exists with this method
is that it assumes all the terms are independent, orthogonal dimensions of the document
space so it simply ignores the relationship among terms. However, as it is a fact that there
are strong associations among terms in natural languages, the above assumption is never
satisfied [Hull 1994], Another drawback of this model in some applications is that the
number of terms that occur in a collection can be quite large; the traditional term-based
document space tends to have a large number of dimensions. Some alternative approaches
such as LSI based on the traditional vector space model have overcome these limitations.
2.5.4 Latent Semantic Indexing
Latent semantic indexing (LSI) is a method that exploits the idea of vector space model
and singular value decomposition (SVD).
2.5.4.1 Motivation of LSI
In the research of retrieving free-form natural language data, it is always useful to analyze
the features of human verbal behavior. There are two issues that are discussed the most:

14

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

synonymy and polysemy. Synonymy refers to the states when two or more words or
expressions have the same or nearly the same meaning in some or all senses. Polysemy
describes the cases when one word has multiple meanings. These characteristics of
natural languages result in the deficiencies of some IR algorithms that do not offer
comparison methods on terms. The Latent semantic indexing method was proposed in
order to capture the statistical relationship among terms and accordingly provide an
effective solution to the problems of synonymy and polysemy that cannot be tackled by
either word-based approaches or the traditional vector space approach.
2.5.4.2 Theory
Latent semantic indexing, also called latent semantic analysis (LSA) method, was first
proposed by Deerwester et al. [Deerwester et al 1990], LSI assumes that, in the textual
data, there is some underlying latent semantic structure that is partially obscured by the
randomness of word choice with respect to retrieval. This structure can be estimated by
statistical techniques and the obscuring noise can be removed.

Like the traditional vector space approach, the LSI method starts with a term-bydocument matrix that represents the association of terms to documents. It applies a
dimensional reduction scheme, singular value decomposition (SVD), on the matrix to
construct a reduced unified semantic space for retrieval. This smaller space, called LSI
space, consists of derived dimensions that are assumed to convey truly independent
concepts. By employing the dimensional reduction strategy, LSI not only captures most
of the important underlying semantic structure in associating terms with documents, but
also has a better chance in removing the noise or possible variability in word usage.
2.5.4.3 Singular Value Decomposition
Singular value decomposition is an effective dimensional reduction scheme. It is closely
related to a number of mathematical and statistical techniques that have been widely used
in other fields, such as the principal component analysis (PCA) for image processing and
face recognition. SVD has been proved to be a very good choice for uncovering latent
semantic structure (See [Deerwester et al 1990] for a further discussion of SVD and the
other alternative models).

15

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

SVD can be applied with an arbitrary rectangle matrix to decompose the matrix into three
matrices containing singular vectors and/or singular values. These three matrices with
special forms show a breakdown of the original matrix into linearly independent
components or factors. Many of these components are very small, leading to an
approximate model that contains many fewer dimensions. Thus, for information retrieval
purposes, SVD provides a reduced model for representing the term-to-term, document-todocument and term-to-document relationships.

By dimension reduction, it is possible for documents with somewhat different profiles of
term usage to be mapped into the same vector of factor values [Deerwester et al 1990].
This property helps to eliminate the noise in the original data, thus improving the
reliability of the algorithm.

Suppose we obtained a t*d term-by-document matrix M from the collection indexing
process of the traditional vector space method. We can apply SVD on M, which is then
decomposed into three special matrices U, S, and V. The decomposition can be written as:
M = USVr

4(a)

U is the t*t orthogonal matrix (UUT=It) having the left singular vectors of M as its
columns, and V is the d*d orthogonal matrix (VVT=Id) having the right singular vectors
as its columns, and S is the t*d diagonal matrix having the singular values oj > 02 > • • • >
<Jmin(t,d) of M in order along its diagonal. It should be noted that for any arbitrary matrix,
such a factorization exists. For details, please see [Strang 1998],

Generally, in Eq. 6, the matrices U, S and V must all be of full rank. However, one great
facilitation is that SVD offers a simple strategy for optimal approximate fit using smaller
matrices. [Deerwester et al 1990] If the singular values in S are ordered by size, the first k
largest values may be kept and the remaining smaller ones are set to zero. The product of
the resulting matrices is a matrix Aft which is only approximately equal to M, and is of
rank k. Since zeros were introduced into S, the representation can be simplified by
deleting the zero rows and columns of S to obtain a new diagonal matrix Sk, and then
deleting the corresponding columns of U and V to obtain Uk and F* respectively. The
rank-k model with the best possible least-squares-fit to M can be written as following:
Mk = Uk Sk VkT

4(b)

16

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Where, Mk is a matrix of size t*d, Uk is of size t*k, S* is of size k*k, and V* is of size k*d.

SVD provides an optimal solution to dimensionality reduction in that it derives an
orthonormal space, where the dimensions are ordered. Therefore, projecting the set of
documents onto the k lowest dimensions is guaranteed to have, among all possible
projections to a k dimensional space, the lowest possible least-square distance to the
original documents. [Schutze & Silverstein 1997]

The following section presents a numerical example of SVD calculation.
Let us consider a t*d (t>d) matrix M, where t=3 and d=2:

M= 0

/

1

0

We decompose this matrix by using the equation 6, and we get the following values:
M=USVt where, U is a t*t, S is a t*d, and VT is a d*d matrix.

U=

f-0.816

0.000

- 0.057^

0.408

-0.707

-0.057

V -0.408

-0.707

0.057 /

1.732 0.000
0.000 1.000
0.000

0.000

-0.707

0.707

-0.707 -0.707

M=USVT=

-0.816

0.000

0.408

-0.707 -0.057

-0.408 -0.707

-0.057

1.732 0.000
'- 0.707
0.000 1.000
0.707
0.057 y 0.000 0.000

0.707
-0.707

The reduced singular value decomposition of matrix M is given below:
f -0.816

0.000 A
-0.707 1.732 0.000
0.000 1.000
-0.408 -0.707

M=USV = 0.408

-0.707

0.707

-0.707 -0.707 y

17

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

2.5.4.4 Document and Query as Vectors
To find the ranked list of documents we need to apply a similarity measurement between
a set of documents against a query. In this thesis we used cosine similarity [section
2.5.4.5] measure. To do this we need to derive individual document and query vectors.
From SVD calculation we have the following equation:

Equation-4(a): M=USV T

Equation-5: MT =(USVT )T

Equation-6: M T = U T S T V

Equation-7: M T US~ X =U T SVUS~ L

Equation-8: V=M T US~ 1

Now the document and query vectors are represented by the following equations:

Document Vector: DRJ = DRJ T US

Query Vector:

1

DRQ =DR Q US

Thus the reduced JC-dimensional LSI space can be presented as:
T

Document Vector:

DR j—DR j U^S^

Query Vector:

DR^=DR (J U^S^

—1

9(a)

9(b)

2.5.4.5 Similarity Measurement
After deriving the term vector of documents and queries, a similarity measurement
technique is applied to estimate the similarity between a pair of vectors. This similarity
value is an indicator of how relevant the document is to the query.

18

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

A common similarity measure employed in the vector space model is called cosine
similarity. In case of cosine similarity, the lower angles presents higher similarity and
higher value of an angle represent dissimilarity between a query and a set of documents.
Denote \DR\ to be the length of vector DR. The cosine similarity between vector DR; and
DRq denoted by cos(DRj, DRq), can be evaluated by the following equation:

(10)

o

Two strategies are commonly used to utilize the similarity values for retrieving relevant
documents. One is called range queries, which is to retrieve all documents up to a
distance threshold. The other one is called nearest-neighbor queries, which is to retrieve
the N best matches. Although we do not expect any similarity metric to be a perfect
model that corresponds exactly with human judgment of relevance, the measurement
should somehow be able to assign higher values to the documents with a higher
proportion of the relevant text as well as assigning lower values to the documents with
less relevant content. By integrating the ranking strategy into IR systems, the human user
can restrict their attention to a set of documents of manageable size instead of having to
go through every single record in the data set.

The following figure presents the idea of cosine similarity graphically:

Term B
1.0

0.8

D2

0.6

0.4

D,
0.2

0

0.2

0.4

0.6

0.8

1.0

Term A

Where:
Di represents document 1,
D2 represents document 2 and
Q represents a query

19

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

ai is the angle between Di and Q
(X2 is the angle between D2 and Q

Angle a (Degree)
0
30
45
60
90

Cosine(a)
1
0.86
0.70
0.5
0

Table 2.3: Angles and Their Cosine Values.

The Table 2.3 shows that when the angle between two vectors is very small, the cosine
approaches 1. More precisely, when two documents are identical their cosine is 1; when
they are orthogonal (no common terms/ totally dissimilar), their cosine is 0. As we are
interested in rank-based retrieval and cosine similarity measure is a tool that helps us to
rank the documents based on the similarity values against a query, we are using cosine
similarity to rank the documents for retrieval purpose.
2.5.4.6 Retrieving Documents Using LSI
The retrieval process sorts the documents according to their similarity values, and returns
a ranked list of documents to the user. The retrieval process represents documents and
queries in the same manner and applies a certain function to estimate the similarity
between them.

In the LSI method, each document/query is projected onto the LSI space that is obtained
by using SVD. LSI then exploits the cosine measurement between the projections of a
pair of term vectors in the LSI space to make comparison between the two documents.
Thus, the similarity can be obtained by computing the cosine value of the angle between
the document's term vector and the query's term vector. All the documents can be ranked
according to their similarity values and an ordered set of documents retrieved.
2.6 TREC and Relevance Judgment
Relevance judgment is an obvious part of a standard IR dataset. Relevance judgments
help us to measure recall-precision by indicating which documents are relevant to which
topics or query. The following sections discuss the background information of TREC-8
relevance judgment.

20

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

2.6.1 Text REtrieval Conference-8 (TREC-8) Relevance Judgment
Relevance Judgments are of critical importance to a test collection. For each topic it is
necessary to compile a list of relevant documents - as comprehensive a list as possible.
All TRECs have used the pooling method to assemble the relevance assessments. In this
method a pool of possible relevant documents is created by taking a sample of documents
selected by the various systems of participating research groups. The pool is then shown
to the human assessor, who makes a binary (yes/no) relevance judgment for each
document in the pool. Un-judged documents are assumed to be not relevant. The
particular sampling method used in TREC is to take the top 100 documents (pooling
method with the depth 100) retrieved per judge run for a given topic and merge them into
the pool for assessment. This is a valid sampling technique since all the systems used
ranked retrieval methods, with those documents most likely to be relevant returned first.
Each pool is sorted by document identifier so assessors cannot tell if a document was
highly ranked by some systems or how many systems (or which systems) retrieved the
document.
The method can be summarized as:
1. To have dozens of research groups from universities and companies participate.
2. To run all 50 queries through their system (TREC-8, Topic 401-450).
3. To submit raw retrieval results (Take the top 100 highest ranked documents from
each topic e.g., TREC-8: 7100 documents).
4. To merge them into the candidate set e.g., TREC-8: 1736 documents.
5. To have human assessors judge relevance of each document.
6. To evaluate results and compile of performance statistics (e.g., TREC-8: 94
documents).
2.6.2 Relevance Judgement Problems
There are two potential problems to preparing the relevance judgments. Those are:
1. Human assessors make errors; and
2. There are often many more relevant documents in the corpus beyond the candidate
set; TREC procedure will consider them all irrelevant.

A very recent research paper published in SIGIR 2008, Singapore tried to assess the
accuracy of TREC relevance judgments with their own relevance judgments created by

21

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

their own human users and they found 63% similarity with TREC relevance judgments.
[Al-Maskari et al 2008]
2.7 Evaluation of Information Retrieval Performance
In previous sections, we presented the basic architecture of an IR system and introduced
some representative approaches that can be used to implement it. In this part of the
chapter, we discuss the criteria that may be investigated while evaluating an IR system.

Frakes et al. [Frakes & Yates 1992] provided a summary of the evaluation process: an
information retrieval system can be evaluated in terms of many criteria, including
execution efficiency, storage efficiency, retrieval effectiveness and the features they offer
a user. The system designers look into the relative importance of these factors based on
the particular environment and expectations. Accordingly, they select appropriate data
structures and algorithms for implementation.

Execution efficiency is measured by the time it takes a system, or part of a system, to
perform a computation. This can be measured in C based systems by using profiling tools
such as proof on UNIX. This factor has always been a major concern for the interactive
IR systems because a long retrieval time will interfere with the usefulness of the system.
The requirements of such IR systems usually specify maximum acceptable times for
searching, and for database maintenance operations such as adding and deleting
documents.

Storage efficiency is measured by the number of bytes needed to store the data. Space
overhead, a common measure of storage efficiency, is the ratio of the sizes of the index
files plus the size of the document files over the size of the document files.

Most IR experimentation has focused on retrieval effectiveness, which is based on
relevance judgment. Relevance is an inherently subjective concept in that the ultimate
goal is to satisfy the human users' needs. Due to the variety of the user's personal
background, it is impossible to design a perfect system that meets all the expectations of
all human users. Therefore, it is necessary to introduce a method to evaluate the
performance of a retrieving process by estimating the degree of relevance at which the

22

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

retrieved information matches the query.

In this thesis we used recall-precision and

paired sample t-test to evaluate our retrieval systems.
2.7.1 Recall and Precision
Two widely used parameters to measure IR success, which are based on the concept of
relevance, are precision and recall. Precision is the ratio of relevant items retrieved to all
items retrieved. Recall is the ratio of relevant items retrieved to all the relevant items. To
facilitate the understanding of the definitions, the following equations represent the recall
and precision:
„ Number of relevant documents retrieved
recall =
Total number of relevant documents
. .
Number of relevant documents retrieved
precision —
Total number of documents retrieved

(11)

(11\
y
j

The expectation of the users may vary from one person to another. Some users attach
more importance to precision, i.e., they want to see relevant information without going
through a lot of junk. Others take recall as a preference, i.e., they want to see all the
documents that are considered to be highly relevant. Hence, the evaluation that involves
only one of the two parameters may be biased. Due to this reason, some methods that
evaluate the IR performance in terms of precision and recall simultaneously, have been
developed.
2.7.2 Recall-Precision Calculations and Plotting
The following section explains how to calculate and plot recall-precision using a
numerical example. For more information [Manning et al 2008], [Croft et al 2009].

According to the theory of information retrieval evaluation, precision is calculated at 11
standard levels of recall 0, 10, 20 .... 100 percent (or 0, 0.1, 0.2 ... 1.0). To measure the
performance of an IR system, multiple queries or topics (e.g. 50 queries) are used. The
precision averages at 11 standard recall levels are used to compare the performance of
different systems and as the input for plotting the recall-precision graph (see Fig-2.2 and
2.3). Each recall-precision average is computed by summing the interpolated precisions at
the specified recall level by using the following standard rule:

23

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

•

For any standard recall level i, take maximum precision at any actual recall level
>= i

•

This defines a precision at the standard recall of 0 even though precision at actual
recall 0 is undefined

Let us consider the list of relevant documents:

0123

0132

0241

Relevant Documents (9)/DocID
0256
0299
0324
0311

0357

0399

Let us consider an IR system that retrieved 12 documents as shown in Table-2.4. The
relevant documents are shaded. The values of recall and precision of a single query were
calculated by using the equations 11 and 12. Table-2.5 represents the interpolated
precision of the same query.

Rank DocID Recall

Precision

Recall level

0
10

1

0234

0

2
3
4

0132

0.111
0.111
0.111

0.5

5
6
7

0123

0.4

8

0256
0078
0311
0231
0177

0.222
0.222
0.222
0.333
0.333

9

10
11
12

0115
0193
0345
0387

0.444

20
30
40

0.375
0.4

0.444
0.444

Table 2.4: Recall-precision of a single query

50
60
70
80
90
100

Interpolated
Precision
1.0

0.5
0.4
0.4
0.4
0
0
0
0
0
0

Table 2.5: Interpolated precision of a query

Fig 2.2 represents the recall-precision graph using the Table 2.5.

24

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

0.6

| •

Example IR System)

0.5

0.4

e

.2,;0.3

0.
0.2

0.1

00

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Recall

Figure 2.2: Recall-Precision Graph of a Query

We already mentioned that the performance of a single query does not necessarily
represent the performance of a retrieval system. So, we need to compute the precision for
multiple queries and average the interpolated precision values at each recall level. Table2.6 represents the average interpolated precision of two different retrieval systems
(Sytem-1 and System-2).
Recall
level
0
10
20
30
40
50
60
70
80
90
100

Average Interpolated Precision
System 1
1.0
0.5
0.4
0.37
0.33
0.33
0.33
0.2
0.2
0.12
0.0

System 2
1.0
0.8
0.75
0.6
0.6
0.5
0.4
0.37
0.33
0.2
0.0

Table 2.6: Interpolated Average Precision of Two Different Systems

25

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

System -1

0.2

0.3

0.4

0.5

0.6

-H -System-2

0.9

Recall

Figure 2.3: Recall-precision Graph of Two Retrieval Systems

Fig 2.3 represents the recall-precision graph of the Table 2.6
2.8 Paired Sample T-test
The paired samples t-test compares the means of two variables to determine whether there
is a statistically significant difference between two populations. We could do this by
calculating confidence intervals for the two populations and seeing if the mean (x) of
one population lies outside the 95% confidence interval of the other population.

A

formalized and slightly less cumbersome approach to answering this question is the t-test.
A confidence interval (CI) is a range of values within which lies the true population
mean. There are many possible confidence intervals. Each interval specifies a probability
of "capturing" the true population mean. [Triola 2006] To perform the t-test we need to
follow the following steps:
1. Choose the hypothesis:
•

Null: There is no significant difference between the means of the two
variables.

•

Alternate: There is a significant difference between the means of the two
variables.

2. Select the level of significance:

26

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

•

In paired sample t-test, after making the hypothesis, we choose the level of
significance. In most of the cases in the paired sample t-test, significance
level is 5% or 0.05.

After doing the calculations of the t-test if we get the significance values greater than 0.05
it means that there is a significant difference between two systems otherwise the null
hypothesis holds (no significant difference). Table-C in the appendix D represents the
different values for significance level 5% or 0.05.

We calculated precision for different retrieval systems (by varying parameters e.g.
weighting scheme, stop word). Every retrieval system is independent; however precision
was calculated on the same dataset. Paired sample t-test is used to compare two means
that are repeated measures for the same dataset. So, we used a paired sample t-test to find
the significance of the difference among the precision levels of retrieval systems to
compare the performance of the retrieval systems.

27

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Chapter-3
Issues in Latent Semantic Indexing (LSI) Research
3.1 Introduction
Information retrieval (IR) deals with the representation, storage, organization of, and
access to information. The representation and organization of the information should
provide users with easy access to the information in which they are interested.
Unfortunately characterization of the user's information need is not a simple problem.
[Yates & Neto 1999]

For hundreds of years, people have understood the importance of archiving and finding
information. With the advancement of computer technology, it is now possible to store
large amounts of information and finding useful information from such collections
became a necessity. On the basis of this necessity, the field of Information Retrieval (IR)
was born in the 1950s. Over the last few decades, the field

of IR has matured

considerably. A number of automatic Information Retrieval systems are used on a daily
basis by a variety of users. [Singhal 2008]
3.2 Related Research Work on Latent Semantic Indexing (LSI)
This chapter examines the short comings and problems in existing published works on
Latent Semantic Indexing (LSI), one of the most popular techniques in the area of IR. LSI
is more fully discussed in the section 2.5.4, if the reader requires more background. The
published works revealed that there are open questions about very large datasets, stop
words, and term weighting schemes in LSI. This thesis attempts to answer these
questions, particularly for very large datasets used in LSI. The following sections discuss
the related published research that used the LSI algorithm.
3.2.1 Using Latent Semantic Analysis to Improve Access to Textual Information
The first research work published on LSI was entitled "Using latent semantic analysis to
improve access to textual information". [Dumais et al 1988] The authors used two
different datasets. The descriptions of the datasets are given below:
•

MED: The first database consisted of 1033 medical reference abstracts and titles.
A 5823*1033 term-document matrix was obtained and retrieval effectiveness
evaluated against 30 queries available with the dataset.

28

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

•

CISI: The second standard dataset consisted of 1460 information science
abstracts. A 5135*1460 term-by-document matrix was obtained and retrieval
performance evaluated using 35 queries available with the dataset.

They did not use any stemming algorithm or remove stop words. Points to be concluded
about this paper are:
The dataset is small for use in IR. For the first dataset (1033 medical reference
abstracts and titles), the researchers claim that LSI shows 13% average
improvement over raw term matching and show that LSI captured some structure
in the data that was missed by raw term matching. For the second dataset (1460
information science abstracts), the authors do not claim any improvement of LSI
over raw term matching technique. Homogeneity of documents and poorly stated
queries in the second dataset caused very poor performance in precision level, so,
the authors said that CISI was not a reliable dataset.
3.2.2 Indexing by Latent Semantic Analysis
Deerwester's [Deerwester et al 1990] paper "Indexing by latent semantic analysis"
describes automatic indexing and retrieval. This paper also described the use of singular
value decomposition. They used datasets with standard queries from the Medlars
collection (same dataset mentioned in section 3.2.1).

They removed stop words (SMART'S stop word list was used) but they did not use
stemming on the input dataset. By reviewing their results we find the following weakness
in their work:

LSI was evaluated with a very small dataset (input data and query set). It was a
poor choice for a test collection because the selected dataset contains
homogeneous documents. Some queries are vague and poorly stated.

In the first experiment, they claimed 13% average improvement in LSI results over the
raw term matching technique. In another experiment, the authors claimed 50%
improvement in retrieval performance in case of LSI (k=100/100-factor) against factor
analytic techniques for information retrieval using small numbers of factors e.g., 7
dimensions, 13 dimensions and, 21 dimensions.

29

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

The authors only tested the above mentioned dataset with LSI and compared their results
against a straightforward term matching method, a version of SMART, and the vector
method reported by Voorhees (1985) for the same standard datasets. The SMART and
Voorhees systems are information retrieval systems, and have different indexing, term
weighting, and query processing procedure than LSI method.

A set of 439 common (stop) words was applied to reduce the number of terms. However,
they did not use stemming on the input dataset. They summarize their findings, as "These
results are modestly encouraging. The latent semantic indexing method is superior to
simple term matching in one standard case and equal in another" when considering two
different datasets. [Deerwester et al 1990]
3.2.3 Large-Scale Information Retrieval Using Latent Semantic Indexing
Todd A. Letsche [Letsche 1996] used the datasets shown in Table 3.1 in his master's
thesis at the University of Tennessee, Knoxville, USA.
Document Collection

Abbrev.

Number
of Terms

Number of
Documents

PVM

2547

170

PLMTP

9842

1154

CCE

31110

15486

USENET

534493

100000

PVM: parallel Virtual machine [GBD+94]
All of the following Combined:
PVM: parallel Virtual machine [GBD+94]
LAPACK User's Guide Release 2.0 [ABB+95]
MPI: The Complete Reference [SOHL+96]
Templates for the Solution of Linear Systems
[BBC94]
Parallel Computing works [FWM94]
The Concise Columbia Encyclopedia
Usenet News Archives

Table 3.1: A set of datasets [Letsche 1996]

By reviewing his results we find the following issues:
Letsche removed stop words and applied Porter's stemming algorithm to reduce
the terms. However, the length of the stop word list is not mentioned. LSI was

30

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

used to search the World Wide Web. He used a large dataset of 100,000
documents from USENET. However, Letsche claimed due to memory constraint
(low hardware resources), the system was unable to search the USENET dataset.
[Letsche 1996] In his thesis, the results are only based on small dataset. LSI
achieved up to 30% better retrieval performance than lexical searching technique
(Lexical analysis

based). Lexical

analysis

searching technique compare

tokens/terms to find the matching. This technique does not use singular value
decomposition (SVD) or cosine similarity to measure the relevance.
3.2.4 Automatic Cross-Language Information Retrieval using Latent Semantic
Indexing
Michael [Dumais et al 1997] used 2,482 documents for experimentation in the area of
automatic cross-language information retrieval using latent semantic indexing. They used
French and English documents. In this work, queries in one language can retrieve
documents in the other language (as well as the original language).

Reviewing the findings of this paper, the following is noted:
They did not use any standard dataset (like TREC). They did not remove stop
words or apply stemming. The authors compared LSI result with human generated
retrieval. With the human generated text, the overall performance of this LSI
system was about 10% worse for retrieving the corresponding category, but 15%
better when looking at the top 10 retrievals. [Dumais et al 1997]
3.2.5 Latent Semantic Indexing-Based Intelligent Information Retrieval System for
Digital Libraries
Aswani [Kumar et al 2006] used titles of 116 journals and periodicals available in the
Vellore Institute of Technology, India as their own input dataset. Reviewing the findings
of this paper, the following is noted:
The authors used a very small dataset and the data set and queries were not
standard. There is no numerical indication of the improvements in retrieval
performance. They applied the Porter's stemming algorithm on the input data set;
however, they did not remove stop words. Authors mentioned in their paper that
LSI has superiority over the standard vector space model. [Kumar et al 2006]

31

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

3.2.6 Singular Value Decomposition and Rank K Approximation
Bast and Majumder's [Bast & Majumder 2005] paper "Why Spectral Retrieval Works"
studied singular value decomposition method by varying the value of K (rank K
approximation) on three different (in size and content) datasets. They tried to find a
feasible value of K by measuring the relatedness scores between two terms. Using this
method they choose some values of K and measured the retrieval performance of LSI and
LSI-based methods. They used the following dataset and corresponding queries for their
experiments:
•

Time Collection contains 425 documents and 3882 unique terms.

•

Reuters Collection contains 21,578 documents and 5701 unique terms.

•

Ohsumed Collection contains 233,445 documents and 99,117 unique terms

Reviewing the findings of this paper, the following is noted:
The Ohsumed dataset is large. For all three datasets, by varying the value of K it is
found that the value of K does not play a significant positive role in the retrieval
performance. For all three datasets, a bigger value of K reduces the relatedness of
word pairs, so, reduces the retrieval performance in terms of recall-precision
measures.
3.2.7 Singular Value Decomposition for Text Retrieval
Husbands' [Husbands et al 2001] paper entitled "On the Use of the Singular Value
Decomposition for Text Retrieval" explored LSI retrieval performance for large datasets.
The authors used the following datasets for their experiments:
•

MEDLINE/ MED collection (please see section 3.2.1 for details): This dataset is
an example of a small dataset.

•

TREC-6 Dataset: A collection with 115,000 terms and 528,155 documents. This
is an example of a large dataset.

•

NPL Dataset: NPL dataset contains 11,429 documents and it is an example of a
small dataset.

In their results they compare the performance of LSI with term matching retrieval. For
MED data set, LSI (K=200) showed better retrieval performance than the term matching
technique. However, the large TREC-6 dataset LSI (K=200) showed poor retrieval (lower

32

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

recall-precision values) performance. The following figure presents the performance in
terms of recall-precision graph:

c
.2

it

\
t
Figure 3.1: LSI vs. Term Matching on
MEDLINE [Husbands et al 2001]

Figure 3.2: LSI vs. Term Matching on TREC6
[Husbands et al 2001)

Reviewing the findings of this paper, the following is noted:
Most LSI research shows LSI is superior than term matching technique and other
methods [Dumais et al 1988][Deerwester et al 1990][Kumar et al 2006][Letsche
1996], however, this paper shows poor performance of LSI over the term
matching technique in case of a specific large dataset. Stemming improves
retrieval performance [Hull 1995], but the authors did not use stemming on their
input dataset. Term weighting schemes have positive effects on retrieval
performance, and tf-idf and log-entropy weighting schemes are proven weighting
schemes in IR. However, the authors only used idf weighting schemes for their
experiments. The authors proposed a new method called "Term norms", and
mentioned that this method has great influence and variability however; they
showed LSI (Jf=300) had poor performance.
Husbands and co-authors reported poorer performance of LSI (K= 200 and jfiT=300), and
they did not use stop word removal or stemming on the TREC-6 input dataset. They only
used the idf term weighting scheme in their experiment. From the IR literature, we know
that pre-processing (stemming and stop word removal) and term weighting schemes
improve retrieval performance significantly, so, we are encouraged to test the

33

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

performance of LSI on a large dataset with the use of pre-processing of the input dataset
and the most accepted weighting schemes (tf-idf and log-entropy).
3.3 Conclusion
From the above discussion it is found that performance of LSI has been tested for several
smaller datasets (e.g., MED, CISI abstracts etc) however, LSI has not been tested for a
large standard dataset. In this research, we concentrated on the performance of LSI on a
standard large dataset by varying different parameters e.g. stop word lists and term
weighting schemes. The specification of the chosen standard large dataset or test
collection is given below:
•

Dataset: LA Times dataset TREC-8 text collection. It includes articles published
in the years 1989 and 1990 (all articles from 1 Jan 1989 to 31 Dec 1990). The
number of the documents is 131,321 and size is 508 MB.

•

Topics (query): TREC-8 query set includes 50 standard queries (401-450).

•

TREC-8 test collection also provides the relevant judgment information associated
with each topic (query) for the above-mentioned dataset.

The significance of performance of retrieval systems is mentioned in terms of recallprecision values/graphs, and t-test.

34

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Chapter-4
Test Collection and Pre-processing
4.1 Introduction
To have the answers to our research questions from LSI based information retrieval
system, we need to perform the following steps:
1. Standard dataset/test collection selection and pre-processing;
2. Processing of data and query using LSI based IR system to retrieve information;
and
3. Evaluating the LSI based IR system for retrieval performance.
This chapter thoroughly discusses the above mentioned first step that focuses on the
standard test collection and pre-processing of input data. Pre-processing includes
extracting documents from large SGML files, stop word removal, and stemming. We
developed our own software for pre-processing of the selected dataset. The following
sections discuss the preprocessing steps in mere detail.
4.2 Standard Test Collection
Ad hoc information retrieval is the task where the document collection is fixed, and users
submit queries to the information retrieval system, and the system returns a set of ranked
retrieval results (i.e., documents). To experiment with a LSI based ad hoc information
retrieval (IR) system, it is important to have a standard test collection consisting of three
things: A document collection, a test suite of information needs expressible as queries,
and a set of relevance judgments (a binary assessment of either relevant or non-relevant
for each query-document pair). [Manning et al 2008] For LSI based ad hoc IR, the most
well known and recognized test collection is provided by the Text Retrieval Conference
(TREC).
4.2.1 TREC and Standard Test Collection
The U.S. National Institute of Standards and Technology (N1ST) has run a large IR test
bed evaluation series since 1992. Within this framework, there have been many
tasks\tracks (e.g., Blog Track, Chemical IR Track, Legal Track, Filtering Track and
Genomics Track, etc.) over a range of different test collections, but the best known test
collections are the ones used for the TREC Ad Hoc task or track during the first eight
TREC evaluations between 1992 and 1999. In total, these test collections contain 1.89

35

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

million documents (mainly, but not exclusively, newswire articles) and relevance
judgments for 450 information needs, which are called topics and specified in detailed
text passages. Individual test collections are defined over different subsets of this data.
The early TRECs each consisted of fifty information needs, evaluated over different but
overlapping sets of documents. TRECs 6 through 8 provide 150 information needs over
about 528,000 newswire and Foreign Broadcast Information Service articles. This is
probably the best sub-collection to use in future work, because it is the largest and the
topics are consistent. Because the test document collections are so large, there are no
exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are
available only for the documents that were among the top K documents (e.g., top K= 100
documents) returned for some system that was entered in the TREC evaluation for which
the information need was developed. For this thesis, Los Angeles Times (LA Times) test
collection from TREC-8 has been chosen for conducting this research.
4.2.2 Los Angeles Times (LA Times) Text Collection (TREC-8)
The data/text collection includes the articles published by the Los Angeles Times in the
two year period from Jan 1, 1989 - December 31, 1990. Each file contains the articles
from one day (e.g., a file with the name "LA123190" contains articles published on 31
Dec 1990). Every such file contains a number of documents (e.g., the LA123190 contains
134 different documents). The following figure represents part of the content of file
LA123190.

36

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

<DOC>
<DQCNO> LA123190-0001 </DOCNO>
<DOCID> 329361 </DOCID>

Document Number

<DATE>
<P>
December 31, 1390, Monday, Home Edition
</P>
</DATE>

<SECTICN>
<P>
2*letro; Parr B; Page 4; Col^smn 3; Letters Desk:
</P>

<7 SECTION
•<L ENGT H>
<P>
33 words
</F>

</LENGTH>
<H£ADLINE>
<P>

"TAGGER ARREST'
</P>

</HEADLINE>
<XEXT>
<P>
The only way we are ever going to end the nasty, filthy graffiti
problem is to- come down hard on the idiots doing it. I would be happy
to contribute to a reward fund.
</P>
<P>

IRY BUSH, Marina del Rey
</P>
</TEXT>

<TYPT>
<P>
Letter to the Editor
</P>

</TYPE>
</DOC>
<D0C>
<DOCNO> LA12319Q-00Q2 </DOCNO>
<D0CID> 329362 </DOCID>

Document Number

Figure 4.1: An extract from the LA Times Collection, file name LA123190, and illustrates one
document

Some notable characteristics of LA Times files:
1. LA Times files are encoded as SGML (Standard Generalized Markup Language)
file format (SGML is a superset of XML file format).
2. The uppercase words enclosed by o and </> are tag names.
3. Every document is separated by <DOC> and </DOC> tag names.
4. <DOCNO> and </DOCON> tag names enclose the unique name of each
document
The following table summarizes the characteristics of the input data set:

37

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Characteristics of the LA Times Data Set (1989,1990)
Number of documents

131,321

Size of the Input Data Set

476MB

Average Vocabulary size (approximately)

500

Average document size (approximately)

40 KB

Largest file size

828 KB (LA052089 0101)

Smallest Size

352 Bytes (LA070189_0001)

Number of words in the smallest file

91

Number of words in the largest file

167,045

Number of relevant files (Out of 131,312 files)
1,151
with respect to TREC-8 query set

Table 4.1: Input Data Characteristics (LA Times Data Set (1989,1990))

4.3 Pre-processing of Data
Pre-processing of data is a very important step in doing research in the area of
information retrieval. The pre-processing of data encompasses the following steps:
1. Extract documents from the large SGML data file (e.g., LA 123190) into
individual files for each document.
2. Remove stop words from the input dataset.
3. Apply stemming to the input dataset.
4.3.1 Method to Extract Documents
In the selected corpora every file contains approximately 100 to 130 documents. Each
document is separated by a document number that is presented using a special tag (e.g.,
<DOCNO> LA010189-0001 </DOCNO>). So, in the very first step of the processing of
raw data, smaller files are extracted from large files and each smaller file is named using
the document tag number e.g., LA010189 0001. After extracting all the documents from
TREC-8, 131,321 files are found. The following section presents the automatic TREC file
extractor program in better detail:

TREC file extractor is a software system written in PHP. The flow diagram of the
software system is depicted in figure 4.2. TREC file extractor takes the large TREC-8
SGML files as input one after another and produces the small individual documents. The

38

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

name of the extracted document is the same as the unique name found between the
<DOCNO> and </DOCNO> tags. TREC file extractor was tested in Windows Vista and
Windows Xp environments. The code developed to extract documents from TREC-8
SGML files is available in appendix B. The output of the TREC file extractor is used as
the input of the pre-processing software, which is presented in the figure 4.2.

TKECFi®
Extractor

Largs TREC
Doruratnt

Extracted
SvQlSii
Documents

—'

Figure 4.2: TREC File Extractor

Algorithmic steps of TREC File Extractor software:
1. Read each SGML files (LA123190 is a name of a file) one after another from the
input directory.
2. Find each <DOC> and </DOC> tag pair (top-down methodology) as these tag
pairs identify individual documents.
3. Extract (read) the content between a <DOC> and </DOC> tag pair including the
tags and write the extracted content in a new SGML file as lower case (e.g.,
change all upper case letters to lower case).
4. Specify a unique name for the new file as "input file namedocument number"
(e.g., LA123190_0001).
5. Write the new files as text files to the output directory.
4.3.2 Stop Word Removal and Stemming
The IR group, University of Glasgow provides a list of stop words that contains 319 stop
words and the IR group, University of Tennessee also provides a list of 439 stop words.
In this research, University of Tennessee, University of Glasgow and an extended version
of stop word lists were used in our experiments. We call the extended stop word list that
we have prepared the UNBCLAT stop word list. UNBCLAT stop word list contains
1911 stop words. Algorithmic steps to create this stop word list are given below:
1. Consider the terms to create a stop word list those that have at least frequency 2 (a
term must be present in a document at least twice).

39

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

2. Create an initial stop word list by combining the stop word list of both IR groups,
University of Tennessee and University of Glasgow (no duplication in the list).
3. Remove all the punctuation from the input TREC-8 LA Times dataset.
4. Create a list of terms from the input dataset, in descending order of term
frequencies (i.e. the terms with the highest term frequency will be at the top of the
list).
5. Manually extract the special items to be added to the initial list (those terms are
not already in the initial list) to create an extended stop word list:
o

Add all file names to the initial list as every file contains file names e.g.,
LA123190.

o

Add all tags names (e.g., <doc>, and <docno>) to the initial stop word list,

o

Add roman numbers to the initial list e.g., xvii.

o

Add scale units e.g., ft, mm, etc.

o

Add all adjectives, adverbs (e.g., ago, bye ).

o

Add prefixes from words (e.g., non comes from non-governmental)

o

Add special words (e.g., haven (comes from haven't), doesn (comes from
doesn't)

o

Add dates e.g., feb-92, 11-apr

o

Add foreign words as dataset is newspaper articles. Add suspicious words
e.g., aaftink, aachen, ora etc

o

Add other words e.g., ext (telephone extension), 19th, z90, v6 (engine)

To create this stop word list, we manually searched the high and low frequency terms out
of 132,785 terms in the frequency table at the initial stage. We repeated this in a number
of cycles by removing different stop word lists (stop word list of the other IR groups)
from the TREC-8 LA Times dataset, and manually searched for the remaining stop words
from the frequency table. To search stop words is very time consuming as the dataset as
well as the number of terms is also large. The most difficult thing is to choose a word as a
stop word.

TREC-8 LA Times dataset contains newspaper articles, so, there are variations in the
contents. As we started to work with the stop word list of the other IR groups and found
that there are many high and low frequency stop words in our terms list (to be used to
create a term-by-document matrix), we decided to build our own stop word list for

40

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

efficient processing of the input dataset because the existing stop word lists do not include
many of the high and low frequency terms in our dataset.

All the stop word lists used in this thesis work are given in Appendix A. After creating
the extended stop words list we remove these stop words from the input dataset.

The idea of stemming is introduced in section 1.2.4. In this thesis, we used Porter's
stemming on the input dataset to produce stem words. Software developed to remove stop
words from the input dataset and to stem the input dataset is discussed in Section 4.2.2.

Fig. 4.3 and Fig. 4.4 present a single input file before and after pre-processing
respectively. All the stop words (all tags between < >, etc.) are removed from the input
file (LA052190 0019) after pre-processing. After removing stop words an input file still
contains a few symbols (<, >, /) and numerical values. These special symbols and
numerical values will be removed by the LSI system while parsing the input files during
creation of the term-by-document matrix.
OSS
D cS S3 & Bk

£4

•!-.

* B. •"

% j

<DOC>
•<docno> la052190 - OOX3 </docna>
<
docid
> 221951 <
/
docid
< tiste >
<
p
>
itay 21. .1990
<
/
p >

<

moncSay

>

home edicion

/ dare >

<
ssccion
>
<
p
>
sports
pare c

<

page 7

/ p >

<
/
section
<
lengt£i
>
<
p
>
17 wc-rds
<
/
p
>

colair«n 4

sports desk-

>

<
/ length >
< iieadline >
<
p
>
newswire ssaroes in tiie news
<
/ p >
<
<

<

/
headline
text
>

p

>

>

center oaldwsii jones
he sen antcr.io spurs
<
/ P
>

<

/

text

>

<
<

/
/

p
>
type

>

< TYPE >
< P >
column

< subject >

<
p
>
fcasJcetfcall players
retirement
<
/ P
>
<
J
subject
>

at 39 t*ie oldest player in tfee nfcsa

san arstonio spurs

basketball team

retired from

3ones

aldwell

</£OC>

Figure 4.3: An Input File before Pre-processing

41

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

The following figure represents an input file after pre-processing.
I.WhV

NOV.-.K!

File Edit Format View Help

1-0019
<>221951 < / >
<>
<>

21 raofidai home
</>
</>
<>
<>

sport 7 4 sport
</>
</>
<>
<>

17
</>
</>
<>
<>

newswir
</>
</>
<>

<>

center caidvve! jone 39 oldest player nba retir san
antonio spur
</>
</>
<>
<>
</ >
</>
<>
<>

basketbal player san antonio spur basketbal team jone caldwel
retir
</>
</>

Figure 4.4: An Input File after Pre-processing

4.4 Stop Word Removal and Stemming Program
Pre-processing of input data includes stop word removal and stemming. Concepts of stop
words and stemming have been introduced in the sections 2.4 and 2.5 respectively. The
following section explains the working procedure of the data pre-processing phase.

The pre-processing software takes input from the output of the TREC file extractor
program. Pre-processing software reads each word from each file (from input folder) one
after another. It reads every term/word of a file from the beginning, and checks the word
to determine whether it is in the stop word list or not. If the word is in the stop word list

42

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

then it simply deletes that word and looks for the next word from the same file. If the
word is absent from the stop word list then the word is stemmed using Porter's Stemming
algorithm [Porter 1997] and the stemmed word is saved in a file (in the output folder)
with the same name as the file it has read. Pre-processing software repeats the same
process for all input files to discard all the stop words and to produce stemmed output of
all the input files. The code developed for the pre-processing software is available in
appendix B.

The flow diagram of the pre-processing software system is depicted in the figure 4.5. Pre­
processing software has been tested in Windows Vista and Windows Xp environments.
The output of the pre-processing software system is the input of the Latent Semantic
Indexing (LSI) retrieval system.
Stop
Word
List

No
Check for
stop words

Iiqiut Files

Porter's
Stemmer

Output Files
with
stemmed
words

Yes
Remove Word
from Input File

Figure 4.5: Pre-processing Software

Algorithmic steps of stop word removal and stemming software are:
1. Read every input file (LA123 231 is a name of a file) one after another from the
input directory.
2. Read a term/word from an input file.
3. Search for the word in the stop word list.
4. If the term is not in the stop word list, go to step-5 or if the term is in the stop
word list, delete the term and put a NULL (in the position of the deleted word)
then go to step-2 to read next word/term.
5. Apply Porter's stemming to create a stemmed word and write the stemmed word
in a file with the same name of the file from which the word was read and save the
file to the output directory.
6. Repeat steps-2 to step-5 to remove and stem all words from input files.

43

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

4.5 Query Selection
In this research work TREC-8 query sets (401 to 450) are used to evaluate the efficiency
of the Latent Semantic Indexing (LSI) technique in terms of information retrieval for the
mentioned dataset (section 4.2.2). TREC-8 query sets are taken from Text Retrieval
Conference (TREC, http://trec.nist.gov/data/topics_eng/index.html) that is maintained by
the National Institute of Standards and Technology (NIST). The relevance judgment
information also taken from TREC. Fig. 4.6 represents the first two queries from the
TREC-8 query set. The characteristics of a TREC-8 query: Every TREC-8 query starts
with the tag <top> and ends with </top>. <num>: num tag presents the query number, in
Fig 4.6 query numbers are 401 and 402. <desc>: description tag describes the actual
query. The text for query number 401 is "What language and cultural differences impede
the integration of foreign minorities in Germany"? <narr>: narrative tag contains the
information for the human assessors to judge the relevance of documents retrieved by
this query.

44

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

jlTbJspics,481-45P - Wargp^d

D & H S BT #4

'-!r

&

< FFL

%

<top>

\M

<n-inn> Nuxsbe r: 401
<titie> foreign minorities, Germany
<desc> Description:
What language and cultural differences impede the integration
of foreign minorities in Germany?
<narr> Narrative:
A relevant document' will focus on the causes of the lack" of
integration in a significant way; that is, the mere mention of
immigration difficulties is not relevant. Documents that discuss
immigration problems unrelated to G-ermany are also not relevant.
</top>
<top>
<num> Number: *§02
<title> behavioral genetics
<desc> Description:
What is happening in the field of behavioral genetics,
the study of the relative influence of genetic
and environmental factors on an individual's behavior
or personality?
<narr> Narrative:
Documents describing genetic or environmental factors relating
to understanding and preventing substance abuse and addictions
are relevant. Documents pertaining to attention deficit disorders
tied in with, genetics are also relevant, as are genetic disorders
affecting hearing or muscles. The genome project is relevant
when tied in with behavior disorders (i.e., mood disorders,
Alzheimer's disease).
</top>
For Help; press F1

Figure 4.6: First Two Queries (401 and 402) from the TREC-8 Query Set

4.7 Conclusion
This chapter discussed the idea and characteristics of test collection. The advantage of the
standard test collection is that it contains human evaluation of relevance judgment which
can be used in the evaluation of the LSI based IR system to be tested. Pre-processing of
the input dataset and the program developed for pre-processing are also covered. The next
chapter describes the experiments carried out after the pre-processing of the input dataset.

45

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Chapter-5
Experimental Studies on Stop Word Lists and Term Weighting Schemes
5.1 Introduction
This chapter describes the experiments that have been conducted using the Latent
Semantic Indexing (LSI) based information retrieval system. The basic research
procedures are demonstrated in this chapter. The experimental results as well as their
evaluations are also covered in the following sections.
5.2 GTP (General Text Parser)
The LSI based retrieval system used for the experiments is called the GTP (General Text
Parser) [Giles et al 2001]. GTP is a general purpose text parser with a matrix
decomposition option which can be used for generating vector space IR models. GTP is
an open source tool developed by a group in the Department of Computer Science,
University of Tennessee. The specific vector-space model exploited by GTP is Latent
Semantic Indexing (LSI). The architecture of the GTP software is depicted in the Figure
5.1. The pre-processed (with stemming and stop word removal) data is submitted to the
GTP processor for further processing.

46

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Pre-processed
Query

Pre-processed
Dataset

GTP

GTPQUERY

Parser
Raw Term-byDocument
Matrix

Term Weight
Calculation

Raw Query
Matrix

Cleaning
Temporary file

SVD

Reduced ranked
Document Vector
Space

Reduced Ranked
Query Vector

Summary of
Run

Cosine
Similarity
Calculation

Ranked List of
Documents

Figure 5.1: Latent Semantic Indexing (LSI) Retrieval Systems [GILES ET AL 2001]

GTP takes pre-processed text files as input and parses the files to create a database of
terms/keys/words and their corresponding weights to create a raw term-by-document
matrix, and then LSI performs Singular Value decomposition (SVD) on that raw matrix to
produce the low-rank vector space in a binary file. Finally, LSI cleans up temporary
working files and writes the summary of the run to the RUN SUMMARY file.

47

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

GTPQUERY takes a query file as input and extracts the terms/keywords from the
database and the corresponding weights of each term to create a raw query vector. Then,
SVD creates a reduced ranked query vector.

Once the query vector is generated, a cosine similarity calculation is performed between
the document vectors and the query vector to create a ranked list of documents based on
the similarity values. The results (ranked list of documents) are written to a file and the
name of the file takes the form q_result.#, where # is the query number. The results
written in these files are sorted by the most relevant to least relevant.
5.3 Experimental Design
The following sections discuss the experiments done for this thesis. The experiments are
performed to determine the answers to the research questions (Chapter-1, Section 1.3)
associated with stop word lists and term weighting issues in LSI based information
retrieval (IR) for the TREC-8 LA Times dataset (1989-1990).
5.3.1 Experiment-1: Study of Stop Words
This study tries to find out the effect of "stop words/Common words" on a text based LSI
information retrieval (IR) process. Evidence will be developed to indicate the most
effective stop word lists for LSI based ad hoc IR processes.
By surveying the literature, it was found that different research groups use their own stop
word lists and these lists vary in the number of stop words they contain. For example,
Information Retrieval Group, University of Glasgow (319 stop words) and University of
Tennessee (439 stop words) have their own stop word lists. TREC encourages text
retrieval research; however, TREC does not recommend any standard stop word list for
IR based research. Christopher D. Manning [Manning et al 2008], in his book mentioned
a brief way to prepare the list of stop words. The process is "the general strategy for
determining a stop list is to sort the terms by collection frequency (the total number of
times each term appears in the document collection), and then to take the most frequent
terms, often hand-filtered for their semantic content relative to the domain of the
documents being indexed, as a stop list, the members of which are then discarded during
indexing." By following the above idea in this thesis, an extended stop word (also called
UNBC_LAT stop word list) list (please see Section 4.3.2) is used that includes University
of Glasgow and University of Tennessee stop word lists, 730 TREC file names (input
48

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

data set), 22 tag names (e.g., doc, docno etc) and other words (e.g., alpha numeric words,
roman numbers etc.). The total length of the UNBCLAT stop word list is 1911.
The following table lists the various inputs and parameters used in this study.
Data Set

Stop Word List

Stemming

Weighting
Scheme

Number of
Queries

TREC-8 Los
Angeles
Times (LA
Times)

University of Glasgow
University of Tennessee
UNBC LAT stop word List
Without removing stop words

Porter's
Stemming

tf-idf

50 Queries
for each
experiment

Table 5.1: Different Parameters for the Study of Stop Words

5.3.2 Experiment-2: Study of Weighting Schemes
By surveying the literature of text based information retrieval, it was found that research
suggested the use of log-entropy [Dumais 1991] or tf-idf [Manning et al 2008] (term
frequency- inverse document frequency) weighting schemes because those result in better
retrieval performance over idf and raw term frequency. We also studied raw term
frequency weighting scheme for our research interest.

The objective of the following three experiments is to study the two most popular
weighting schemes, tf-idf and log-entropy, to find which weighting scheme is effective on
a large data set on LSI based ad hoc text based IR system.
The following table lists the various inputs and parameters used in this study.
Data Set

Stop Word List

Stemming

TREC-8 Los
Angeles Times
(LA Times)

The Best Stop
Word List from
Experiment -1

Porter's
Stemming

Number of
Queries

Weighting Scheme
tf-idf
log-entropy
raw term frequency

50 Queries for
each
Experiment

Table 5.2: Different Parameters for the Study of Weighting Schemes

5.4 Results and Evaluations
The following sections present the findings and the evaluations of the experiments.
5.4.1 Result: Experiment-1
Table 5.3 shows the 10-point interpolated precision of four different systems. The recall
precision graph based on the Table 5.3 data is shown in Figure 5.2.

49

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

10Points
Recall
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0

PRmylisttfidf
(System-1)

PRStemtfidf
(System-2)

PRTentfidf
(System-3)

PRGlasgowtfidf
(System-4)

0.1757
0.1108
0.0888
0.0724
0.0678
0.0659
0.0646
0.0621
0.0554
0.0436

0.1385
0.1058
0.0841
0.0743
0.0710
0.0662
0.0632
0.0592
0.0553
0.0247

0.1513
0.0994
0.0735
0.0693
0.0672
0.0647
0.0609
0.0588
0.0489
0.0316

0.1221
0.0864
0.0799
0.0722
0.0694
0.0671
0.0660
0.0628
0.0578
0.0374

Table 5.3: 10-point Interpolated Precision of Four Different Systems

In the Table 5.3, the recall value of 0.1 represents the top 10% of the documents (in the
collection) which are relevant to a query set. As an example using PRmylisttfidf the
precision associated with top 10% of the documents is 0.1757 or 17.57%. This value is
calculated by interpolating the precision values of all 50 queries used for this thesis at the
standard recall value 0.1. Details of the interpolation technique can be found in [Manning
et al 2008],

We can compare retrieval systems with different parameters (e.g., stop word list) in terms
of precision in different standard recall points e.g., 0.1, 0.2 etc. For example, at recall
point 0.3 means, (top ranked 30% documents) the precision values for system1(PR mylist tfidf) is 8.88% and for system-3(PR_ten_tfidf) is 7.35%. So, if we subtract
(8.88-7.35) %=1.53%, means that system-1 shows 1.53% better retrieval performance
than sytem-2 retrieval system for the top 30% retrieved documents. If we look at the
Figure 5.2 at the point of recall 0.3, we can see the difference.

50

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

- PR_mylist_tfidf (System-1)
-PR_stein_tfidf(System-2)
~ A - PR_ten_tfi df (Svstem-3}

(3Sgow_tfidtXSysfein-4}

—» • PR_G

e

0.0

0.5

0.1

Recall

o.s

1

Figure 5.2: Recall-Precision Graph for Experiment-1

The system-1 (mylist tf idf) with extended stop word list (mylist means UNBC LAT
stop word list) provides the best result when compared to the other 3 systems. For the top
10% retrieval, system-1 shows 5.37% better retrieval performance than system-4
(Glasgow means University of Glasgow stop word list), 3.68% better retrieval
performance than system-2 (stemtfidf, stem means without removing stop word list),
and 2.44% better retrieval performance than system-3. However, after top 40% retrieval
all the systems showed almost the same retrieval performance.

In the system-2 (stem tf idf), we just applied Porter's stemming without removing stop
words. It is pretty interesting that the retrieval performance of system-2 compared to
system-4 (glasgow_tf_idf) is 1.64% better.

From the above results, it is found that, in case of LSI based ah hoc information retrieval,
the set of stop words is dependent on the set of input data. However, if someone has a
powerful computer then one can expect significant retrieval performance without
removing stop words. From the above result is it clear that the use of an arbitrary set of

51

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

stop words reduces retrieval performance in case of LSI-based ah hoc information
retrieval with large dataset.
5.4.1.1 Paired Sample T-test
The idea of a t-test is introduced in section 2.8. We calculated the t-test using software
SPSS 17.0. Table-5.4 represents recall-precision of four different systems and means and
standard deviations of precision. Table 5.5 shows output of the t-test (detail step by step
calculations are available in Appendix D). We compared system-1 with three other
systems in terms of the level of significance. We found 3 different level of significance
for every pair of systems. By analyzing the significance level (see Table-C in Appendix
D) it is found that the t-test produced a result that is below the threshold level. This is an
interesting finding in that if we apply paired sample t-test to compare different retrieval
systems in terms of level of significance, paired sample t-test results indicate insignificant
differences for all cases.

10-Points
Recall
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Mean
Standard
Deviation

PR_mylist_tfidf
(System-1)
0.1757
0.1108
0.0888
0.0724
0.0678
0.0659
0.0646
0.0621
0.0554
0.0436
0.0807
0.0381

PR Stem tfidf
(System-2)
0.1385
0.1058
0.0841
0.0743
0.0710
0.0662
0.0632
0.0592
0.0553
0.0247
0.0742
0.0307

PR Ten tfidf
(System-3)
0.1513
0.0994
0.0735
0.0693
0.0672
0.0647
0.0609
0.0588
0.0489
0.0316
0.0726
0.0326

PR_Glasgow_tfidf
(System-4)
0.1221
0.0864
0.0799
0.0722
0.0694
0.0671
0.0660
0.0628
0.0578
0.0374
0.0721
0.0219

Table 5.4: 10-point Interpolated Precision, Mean and Standard Deviation of Four Systems

52

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Paired Samples Test
Paired Differences
Degree
of
Level of
freedom
(N-l)
Significance

95% Confidence Interval of

the Difference
Lower

Upper

t

Pair 1

System 1 - System2

-.0024183

.0153783

1.647

9

.134

Pair 2

System 1 - System3

.0027259

.0135741

3.399

9

.008

Pair 3

System 1 - System4

-.0041491

.0213491

1.526

9

.161

Table 5.5: Comparison among Systems in Terms of T-test

5.4.2 Result: Experiment-2
Table 5.6 shows the 10-point interpolated precision of three different systems. The recall
precision graph based on the Table 5.6 data is shown in Figure 5.3.
10-Points
Recall
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0

PR_mylist_tfidf
(System -1)
0.1757
0.1108
0.0888
0.0724
0.0678
0.0659
0.0646
0.0621
0.0554
0.0436

PR_mylist_lofentropy
(System - 5)
0.1464
0.1061
0.0803
0.0784
0.0743
0.0714
0.0688
0.0666
0.0515
0.0388

PR_mylist_raw_freq
(System - 6)
0.0910
0.0874
0.0789
0.0720
0.0690
0.0636
0.0626
0.0596
0.0569
0.0339

Table 5.6: 10-point Interpolated Precision of Three Different Systems

The objective of this study was to find the right term weighting scheme for LSI based
retrieval system for a large dataset. From the results presented in Table 5.6, it is found
that in case of top 10% retrieved documents, system-1 with tf-idf term weighting scheme
showed 17.57% and system-6 with raw term frequency showed 9.1% retrieval
performance. So if you subtract 17.57-9.1= 8.47 = 9%, means system-1 showed 9% better
retrieval performance than system-6. Similarly, log-entropy (system-5) showed 5.54%
better retrieval performance than raw term frequency (system-6). On the other hand, tf-idf
(system-1) showed 2.93=3% better retrieval performance than system-5 with the logentropy term weighting scheme.

53

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

•?R_mylist_tfidf(System-!)
~M ~PR_mySst_logen«opy{System-5)

k* • PR_myHst_ra\v_fre q(System -6)

0.2

c
©

o«»
o.

0.0

0. 1

0.2

0.3

0.4

0.5

0.$

0.7

0.8

0.9

1

Recall

Figure 5.3: Recall-Precision Graph for Experiment-2

If we look at Figure 5.3 in the recall-precision graph at the recall level 0.1, we can see the
above results in a graphical form. So, in case of a large dataset tf-idf performs better than
log-entropy and raw term frequency weighting schemes. However, after top 40% retrieval
(in Figure 5.3 please look at the recall values 0.4 to 1.0) all the retrieval systems showed
almost the same retrieval performance. As 1151 documents are relevant out of 131,321
documents (only 0.88% documents are relevant in this test collection) in LA Times
TREC-8 standard test collection, we suspect that almost all relevant documents are
retrieved in the top ranked 30% retrieved documents, that's why after 40% retrieved
documents every retrieval system showed equal performance as shown in the Figures 5.2
and 5.3.
5.5 Conclusion
In the above two studies (section 5.4.1 and 5.4.2), we performed seven sets of
experiments, and used 50 queries (TREC-8) to evaluate the results of the experiments. 10point recall-precision graphs, and a t-test have been used to measure the retrieval

54

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

performance of LSI for large test collection. By performing these experiments we found
the following research results:
•

Different input datasets contain different contents (e.g., news paper articles,
medical journals etc), so, the term list of an input dataset also is different for the
other datasets. It is clear that for any unique document collection a tailored stop
word list must be assembled.

•

In the existing published results on the LSI based retrieval system there was no
concrete evidence about which term weighting scheme works better in case of
large datasets. In our research, we conclude that the result (in numerical forms)
shows tf-idf term weighting performs better than log-entropy and raw term
frequency weighting schemes when the test collection becomes large.

•

Our findings on LSI based retrieval system with large dataset extend the earlier
published results discussed in chapter-3 by providing concrete evidence of
relevance performance.

•

The t-test failed to distinguish among the retrieval performance of different
retrieval systems used in this research work.

LSI showed promising results in case of TREC-8 LA Times dataset. TREC-8 LA Times
dataset is an example of a large unstructured digital textual dataset. So, we can benefit
from using LSI to find information from large digital unstructured textual archives.

55

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Chapter-6
Conclusion and Future Directions
6.1 Introduction
In this chapter, we present a summary of this thesis as well as mention possible directions
of our future research.
6.2 Summary
The field of Information Retrieval (IR) was born in the 1950s and over the last forty
years, the field has matured considerably. Several IR systems are used on an everyday
basis by a wide variety of users. [Singhal 2008] Retrieving information from digital
archives is a challenge to the users, and an automatic IR system is an obvious solution to
this challenge.

In Chapter-1, we introduced the fundamental terms of LSI research and presented the
problem statement.

In Chapter-2, we discussed the theories of LSI based IR system and related matters. This
chapter covers the term weighting schemes, ranked based similarity (between a query and
documents) measurement technique, singular value decomposition, TREC-8 relevance
judgment, and the idea of recall-precision and t-test. We discussed TREC-8 relevance
judgment and the recall-precision graph because those help us to evaluate the retrieval
results.

In Chapter-3, we surveyed the published results on LSI for last 20 years that helped us to
identify the potential research questions on LSI based IR system.

In Chapter-4, we described the characteristics of standard large TREC-8 test collection,
and the pre-processing of the selected input dataset. To address the research questions on
a large dataset, we created and used our own stop words lists, Porter's stemming, and LSI
based vector space model (singular value decomposition), for a structured set of
experiments.

56

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

In Chapter-5, we presented the experimental design, experimental results, and the list of
findings from the experimented results. All the results are based on the top 10% retrieval
of documents. We performed two different experiments to judge the performance of LSI
in case of large dataset. The experiments are given below briefly:
•

Study the effect of different stop word lists.

•

Study the effect of different term-weighting schemes.

In the first experiment, we tested the retrieval performance of LSI using 3 different stop
word lists and without removing stop words from the input dataset. The stop word list
(UNBC LAT stop word list) we created and used in LSI showed 5.37% better
performance than University of Glasgow stop word list, 2.44% better than the University
of Tennessee stop word list, and 4% better than without removing stop word list.

In the second experiment, we tested three different term-weighting schemes (raw term
frequency, tf-idf, and log-entropy) on a large dataset. tf-ifd weighting schemes showed
9% better retrieval performance, and log-entropy showed 5.54% better performance than
raw term frequency. On the other hand, tf-idf showed 3% better retrieval performance
than log-entropy. In summary, tf-idf showed better performance than log-entropy
weighting and raw term frequency schemes in case of large dataset.

In general, LSI has a number of advantages to process large dataset as follows:
1. Both terms and documents are explicitly represented in the same space.
2. Queries and new documents can easily be added.
3. LSI uses SVD to reduce dimension and to remove noise.
4. LSI is able to handle polysemy and synonymy.

Thus, LSI is able to represent and manipulate large data sets, making it viable for realworld applications. [Deerwester et al 1990] Our results on LSI with large dataset also
proved that LSI is able to handle large dataset.
6.3 Possible Extensions of the Experiments
Latent Semantic Indexing (LSI) is being used in a variety of information retrieval and text
processing applications, although its primary application has been for conceptual text

57

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

retrieval and automated document categorization. [Dumais 2004] In this thesis we studied
the accuracy of LSI in case of large data set for document retrieval.

Some additional research works related to this study might include the following:
1. By analyzing the results [section 4.3.1 and 4.3.2] of this research, it is found that
every experiment shows almost equal performance at the recall points 0.4 to 1.0.
As only 0.88% documents are relevant, we suspect that almost all relevant
documents are retrieved in 30% top ranked documents; meaning that the graphs
show almost equal retrieval performance after recall point 0.4. This suspicion
might be an interesting study to find the reason(s) behind this equal performance
in this dataset. This suspicion could also be explored for other large dataset.
2. Term-weighting schemes are very influential in LSI based IR systems. One
possible study that might be interesting is to apply a position based term
weighting scheme (sometimes called positional indexing). This weighting scheme
is very costly in terms of computing time, but may produce better retrieval
performance. We plan to work on position based term-weighting schemes in the
future.
3. The General Text Parser [GTP] [Giles et al 2001] used in this research has a
command line interface. Adding Graphical User Interface (GUI) could enhance
user interaction with the GTP. A java version of GTP is available with GUI;
however, it has memory limitation that prevents it from processing a large dataset.

58

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Bibliography
[Al-Maskari et al 2008]

Azzah Al-Maskari, Mark Sanderson, and Paul Clough.
Relevance Judgments between TREC and Non-TREC.
Assessors. SIGIR' 08, Singapore, July 20-24, 2008.

[Anderson et al 2009]

Anderson, David Ray, Sweeney, Dennis J., and Williams,
Thomas Arthur (2009). Statistics for business and
economics. 11th ed. Cengage Learning Inc. lorence, KY

[Baeza-Yates & RibeiroNeto 1999]

Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
Information Retrieval. Addison Wesley Longman
Publishing Co. Inc., first edition, 1999.

[Berry & Dumais 2008]

Berry Michael W. & Dumais Susan. Latent Semantic
Indexing Web Site, http://www.cs.utk.edu/~lsi/, access
time: Jan 2008.

[Berry et al 1993]

M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan.
Svdpackc (version 1.0) user's guide. Technical Report No.
CS-93-194, University of Tennessee, May 1993.

[Chakravarthy & Netserf A. Chakravarthy and K. Haase. Netserf. Using semantic
1995]
knowledge to find internet archives. In Proceedings of the
18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval,
Seattle, Washington, USA, 1995. Pp. 4-11.
[Chen et al 2001]

L. Chen, N. Tokuda, and A. Nagai. Probabilistic
information retrieval method based on differential latent
semantic index space. IEICE Trans, on Information and
Systems, E84-D (7):910-914, 2001.

[Chen et al 2003]

L. Chen, N. Tokuda, and A. Nagai. A new differential Isi
space-based probabilistic document classifier. Information
Processing Letters, 88:203-212, 2003.

[Chen et al 2004]

L. Chen, J. Zeng, and J. Pei. Classifying noisy and
incomplete medical data by a differential latent semantic
indexing approach. In Data Mining in Biomedicine.
Springer Press, 2004.

[Chen et al 2005]

L. Chen, J. Zeng, and N. Tokuda. A "stereo" document
representation for textual information retrieval. Journal of
the American Society for Information Science and
Technology, 2005.

[Chen et al 2006]

Liang Chen, Jia Zeng and Naoyuki Tokuda. A "Stereo"
Document Representation for Textual Information
Retrieval. Journal of the American Society for Information
59

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Science and Technology (JASIST). 57:6, pp. 768—774.
2006.
[Chen et al 1999]

Liang Chen, Naoyuki Tokuda and Akira Nagai,
Probabilistic Information Retrieval Method Based on
Differential Latent Semantic Index Space. IEICE Trans.
Fundamentals, Vol.E82, No.l January 1999.

[Coefficient of
Variation]

http.//www.graphpad.com/help/prism5/pri sm 5help.htm l?co
efficient_of_variation _%28cv%29.htm. (Access time Dec
2009)

[Cohen & Hirsh 1998]

W. Cohen and H. Hirsh. Text categorization using whirl. In
Proceedings of the Fourth International Conference on
Knowledge Discovery and Data Mining, pages 169-173,
New York, 1988.

[Cooper & Dabney
1992]

W. Cooper, F. Gey, and D. Dabney. Probabilistic retrieval
based on staged logistic regression. In Proceedings of the
15th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages
198-210, Copenhagen, Denmark, 1992.

[Croft et al 2009]

W. Bruce Croft, Donald Metzler, and Trevor Strohman.
Information Retrieval in Practice. Addison Wesley, 2009.

[Deerwester et al 1990]

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T.
K., and Harshman, R. A. Indexing by latent semantic
analysis. Journal of the American Society for Information
Science, 1990, 41(6), pp. 391-407.

[Downing & Clark
2003]

Downing, Douglas and Clark, Jeffrey (2003). Business
Statictics. 4th ed. Barron's
Educational
Series,
Inc.
Hauppauge, NY

[Dumais et al 1988]

Dumais S. T., Furnas, G. W., Landauer, T. K. and
Deerwester, S. Using latent semantic analysis to improve
access to textual information. In Proceedings of CHI'88:
Conference on Human Factors in Computing, New York:
ACM, 1988. pp. 281-285.

[Dumais et al 1997]

Susan T. Dumais, Todd A. Letsche, Michael L. Littman,
Thomas K. Landauer. Automatic Cross-Language
Retrieval Using Latent Semantic Indexing. AAAI-97
Spring Symposium Series: Cross- Language Text and
Speech Retrieval, 1997. Stanford University, pp. 18-24.

[Dumais et al 1996]

Susan T Dumais, Thomas K Landauer, Michael L Littman.
Automatic Cross Linguistic Information Retrieval using
Latent Semantic Indexing. SIGIR 1996.

60

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

[Dumais 1991]

S. Dumais. Improving the retrieval of information from
external
sources.
Behavior
Research
Methods,
Instruments, & Computers, 23(2):229—236, 1991.

[Dumais 2004]

Dumais, S. Latent Semantic Analysis. ARIST Review of
Information Science and Technology, vol. 38, 2004,
Chapter 4.

[Dumais 1992]

Dumais, S. T. Enhancing performance in LSI (Latent
Semantic Indexing) Retrieval. Technical memo, 1992.

[Fisher & Yates 1995]

R.A. Fisher and F. Yates. 6th ed. Statistical tables for
biological, agricultural and medical research. London:
Longman Group, 1995.

[Foltz 1990]

P. W. Foltz. Using Latent Semantic Indexing for
Information Filtering. In R. B. Allen (Ed.) Proceedings of
the Conference on Office Information Systems,
Cambridge, MA, 1990. pp. 40-47.

[Frakes & Yates 1992]

W. Frakes and R. Baeza-Yates. Information Retrieval:
Data Structures and Algorithms. Prentice Hall, 1992.

[Gansterer et al 2007]

Wilfried Gansterer, Andreas Janecek, and Robert
Neumayer. Spam filtering based on latent semantic
indexing. 5th SIAM Workshop on Text Mining,
Minneapolis, MN, USA, 2007.

[Greengrass 2001]

E. Greengrass. Information retrieval: A survey. Technical
Report DOD Technical Report TR-R52-008-001, 2001.

[Grei et al 1997]

W. Grei, W. Croft, and H. Turtle. Computationally
tractable probabilistic modeling of Boolean operators. In
Proceedings of the 20th Annual International ACM SIGIR
Conference on Research and Development in Information
Retrieval, 1997. pp. 119-128.

[Giles et al 2001]

Justin T. Giles, Ling Wo, Michael W. Berry. GTP
(General Text Parser) Software for Text Mining. CRC
Press 2001.

[Harman 1992]

D. Harman. User-friendly systems instead of user-friendly
front-ends. Journal of the American Society for
Information Science, 43(2):164-174, 1992.

[Harman 1995]

D. Harman. Overview of the third text retrieval conference.
Third Text REtrieval Conference (TREC-3), pp. 1-19.
National Institute of Standards and Technology Special
Publication 500-207, 1995.

61

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

[Harter 1986]

S. Harter. Online Information Retrieval: Concepts,
Principles, and Techniques. Academic Press, Inc., 1986.

[Hofmann 1999]

T. Hofmann. Probabilistic latent semantic indexing. 22nd
Annual ACM Conference on Research and Development
in Information Retrieval, pp. 50-57, Berkeley, California,
1999.

[Hull 1994]

D. Hull. Improving text retrieval for the routing problem
using latent semantic indexing. 17th Annual International
ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 282-291, 1994.

[Husbands et al 2001]

Husbands, P., Simon, H., & Ding, C. On the use of
singular value decomposition for text retrieval. In
Computational Information Retrieval (SIAM) (pp. 45156), Philadelphia, PA, 2001.

[Kumar et al 2006]

Ch. Aswani Kumar, Ankush Gupta, Mahmooda Batool and
Shagun
Trehan. Latent Semantic Indexing-Based
Intelligent Information Retrieval System for Digital
Libraries. Journal of Computing and Information
Technology - CIT 14, India 2006, pp. 191-196

[Labsky et al 2005-A]

Labsky M., Vacura M., and Praks P. Web Image
Classification
for
Information
Extraction.
First
International Workshop on Representation and Analysis of
Web Space (RAWS-05), Prague, 2005; pp. 55-62;

[Labsky et al 2005-B]

Martin Labsk'y, Vojfech Sv'atek and Ond"rej "Sv'ab.
Information Extraction from HTML Product Catalogues:
from Source Code and Images to RDF. The 2005
IEEE/WIC/ACM International Conference on Web
Intelligence, pp. 401-404, Washington, USA, 2005

[Lee 1994]

J. Lee. Properties of extended Boolean models in
information retrieval. 17th Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, pages 182-190, 1994.

[Letsche 1996]

Todd A. Letsche. Toward Large-Scale Information
Retrieval Using Latent Semantic Indexing. A masters thesis
supervised by Dr. Michael Berry, The University of
Tennessee, Knoxville, USA, 1996.

[Letsche & Berry 2009]

Todd A. Letsche and Michael W. Berry. Large-Scale
Information retrieval with Latent Semantic Indexing.
http://www.cs.utk.edu/~berry/lsi++/, access time: Feb 2009

62

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

[Liddy et al 1994]

E. Liddy, W. Paik, and E. Yu. Text categorization for
multiple users based on semantic features from a machinereadable dictionary. ACM Transactions on Information
Systems, 12(3):278—295, 1994.

[Lynn 2002]

Patrick A. Lynn. Evolving the General Text Parser (GTP)
Utility into a Usable Application Via Interface Design. A
Thesis Presented for the Master of Science Degree, The
University of Tennessee, Knoxville By, December 2002

[Manning et al 2008]

Christopher D. Manning, Prabhakar Raghavan, Hinrich
Schiitze. An Introduction to Information Retrieval.
Cambridge University Press Cambridge, 2008.

[Mayfield & McNamee
1997]

James Mayfield and Paul McNamee. N-grams vs. words as
indexing terms. In TREC-6 Conference Notebook Papers,
1997.

[Montague 2002]

M. Montague. Metasearch. Data fusion for document
retrieval. PhD thesis, Dartmouth College, Hanover, New
Hampshire, 2002.

[Machala et al 2006]

Praks P., Machala L., Snasel V., Strizik M. Intelligent
Information retrieval using numerical linear algebra and
applications. International Symposium GIS Ostrava 2006
Proceedings. Ed. J. Ruzicka VSB-TU Ostrava, January 2325, 2006. pp. 1-13 (CD-ROM).

[Ogilvie & Callan 2003]

P. Ogilvie and J. Callan. Combining document
representations for known item search. Twenty Sixth
Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 143-150,
Toronto, Canada, 2003.

[Ott & Larsen 1992]

Ott, R Lyman and Larson, Richard (1992). Statistics a tool
for the social sciences,
ed. Pws-Kent publishing
company, Boston.

[Paice 1984]

C. Paice. Soft evaluation of Boolean search queries in
information retrieval systems. Butterworth Publishers
Stoneham, MA, USA, pp.33—42, 1984.

[Pearce & Nicholas
1996]

C. Pearce and C. Nicholas. Telltale: Experiments in a
dynamic hypertext environment for degraded and
multilingual data. Journal of the American Society for
Information Science, 47(4):263-275, 1996.

[Porter 1980]

M. Porter. An algorithm for suffix stripping. Program,
14(3): 130-137, 1980.

63

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

[Porter 1997]

M. Porter. An algorithm for suffix stripping. In K. S. Jones
and Willett P., editors, Readings in information retrieval,
pp. 313-316. Morgan Kaufmann Publishers Inc., 1997.

[Praks et al 2004]

Pavel Praks, Libor Machala and Vaclav Snasel. On SVDFree Latent Semantic Indexing for Iris Recognition of
Large Databases. Fifth International Workshop on
Multimedia Data Mining (MDM/KDD'04) in conjunction
with ACM SIGKDD Tenth International Conference on
Knowledge Discovery and Data Mining, pp. 67-71, Seattle,
USA, 2004.

[Praks et al 2003]

Praks P., Dvorsky J., Snasel V. Latent Semantic Indexing
for Image Retrieval Systems. SIAM Conference on Applied
Linear Algebra, Williamsburg, VA - USA, 2003

[Qin et al 2005]

Bing Qin, Ting Liu, Zhang Yu, Sheng Li. Research on
Multi-Document Summarization Based on Latent Semantic
Indexing. Journal of Harbin Institute of Technology, China
2005, pp. 91-94

[Rehder et al 1998]

B. Rehder, M. Schreiner, M. Wolfe, D. Laham, T.
Landauer, and W. Kintsch. Using latent semantic analysis
to assess knowledge: Some technical considerations.
Discourse Processes, 25:337-354, 1998.

[Rijsbergen 1979]

C. van Rijsbergen. Information Retrieval. Butterworths,
London, 2nd edition, 1979.

[Rilo 1995]

E. Rilo. Little words can make a big difference for text
classification. 18th Annual International ACM SIGIR
conference on Research and Development in Information
Retrieval, pages 130-136, 1995.

[Rose et al 1990]

K. Rose, E. Gurewitz, and G. Fox. A deterministic
annealing approach to clustering. Pattern Recognition
Letters 11, 11:589-594, 1990.

[Salton 1971]

G. Salton. The SMART Retrieval System: Experiments in
Automatic
Document
Processing.
Prentice
Hall,
Englewood Clis, New Jersey, 1971.

[Salton 1989]

G. Salton. Automatic text processing: The transformation,
analysis, and retrieval of information by computer.
Addison-Wesley, Reading, MA, 1989.

[Salton & Buckley
1987]

G. Salton and C. Buckley. Term-weighting approaches in
automatic text retrieval. Technical Report, Cornell
University, USA, 1987.

64

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

[Salton & McGill 1983]

G. Salton and M. McGill. Introduction to Modern
Information Retrieval. McGraw Hill Publishing company,
New York, 1983.

[Saul & Pereira 1997]

L. Saul and F. Pereira. Aggregate and mixed-order Markov
models for statistical language processing. In Claire
Cardie and Ralph Weischedel, editors, Second Conference
on Empirical Methods in Natural Language Processing,
pages 81-89. Association for Computational Linguistics,
Somerset, New Jersey, 1997.

[Schutze & Silverstein
1997]

H. Schutze and C. Silverstein. Projections for efficient
document clustering. In ACM/SIGIR, pages 74-81, 1997.

[Singhal 2008]

Amit Singhal. Modern Information Retrieval: A Brief
Overview. Google, Inc. http://pages.cs.wisc.edu/~anhai/,
access time Jan 2008

[Smith 2003]

Lindsay I. Smith. A Tutorial on Principal Component
Analysis. Unpublish Document downloaded for the link
kybele.psych.cornell.edu/~edelman/Psych-465-Spring2003/PCA- tutorial.pdf, in the date, 9 Sep 2007

[Spigel 1988]

Spigel, Murray R (1988). Theory and problem of Statictics.
2nd ed. Schaum's
outline series, McGraw-Hill.

[Spoerri 2007]

Anselm Spoerri. A Visual Tool for Information Retrieval.
Ph.D. Research and Thesis at MIT, 1995.
http://www.scils.rutgers.edu/~aspoerri/InfoCrystal/InfoCry
stal.htm, access time: Sep 2007.

[Strang 1998]

G. Strang. Introduction to Linear Algebra. WellesleyCambridge Press, 3rd edition, 1998.

[SVDA 2008]

Singular
value
decomposition,
http://www.mathworks.com/access/helpdesk/help/techdoc/
ref/svd.html, accessing time: Dec 2008.

[STLR 2007]

Standard
Template
Library
Resources.
http://www.sgi.com/tech/stl/, accessing time: Dec 2007.

[Taha & Gosh 1996]

I. Taha and J Gosh. Characterization of the Wisconsin
breast cancer database using a hybrid symbolicconnectionist system. Technical Report UT-CVISS-TR-97007, the Computer and Vision Research Center, University
of Texas, 1996.

[Triola 2006]

Triola, Mario F (2006). Elementary Statistics. 3rd ed.
Pearson, Addison-Wesley.

65

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

[TREC]

The
Text
REtrieval
http://trec.nist.gov/

Conference

(TREC),

[Triola 2006]

Mario F. Triola. Elementary Statistics. 10* edition (2006),
Addison Wesley.

[Turtle & Croft 1991]

H. Turtle and W. Croft. Evaluation of an inference
network-based retrieval model. ACM Transactions on
Information Systems, 9(3), 1991.

[Wegner 2010]

Wegner T (2010). Applied Business Statistics. 2nd ed. Juta
Academic.

[Wolfe & Goldman
2003]

Wolfe MB, Goldman SR. Use of latent semantic analysis
for predicting psychological phenomena: two issues and
proposed solutions. Behavior Research Methods,
Instruments and Computers, 2003, 35(1):22-31.

[Wiemer-Hastings et al
1999]

P. Wiemer-Hastings, K. Wiemer-Hastings, and A.
Graesser. How Latent is Latent Semantic Analysis? In
Proceedings of the 16th International Joint Congress on
Artificial Intelligence, pp. 932-937, Morgan Kaufmann,
San Francisco, 1999.

[Wikipedia]

Wikipedia, the free encyclopedia, http://en.wikipedia.org/

[Waller & Kraft 1979]

W. Waller and D. Kraft. A mathematical model of a
weighted
Boolean
Retrieval
system. Information
Processing and Management, 15:235-245, 1979.

[Wolberg &
Mangasarian 1990]

W. Wolberg and O. Mangasarian. Multisurface method of
pattern separation for medical diagnosis applied to breast
cytology. In Proceedings of the National Academy of
Sciences, volume 87, 1990. pp. 9193-9196.

[Wu & Zhou 2002]

J. Wu and Z. Zhou. Face recognition with one training
image per person. Pattern Recognition Letters, 23(14):
1711-1719, 2002.

[Zeng 2005]

Jia Zeng. Information Retrieval by Multi-Perspective
Representation. Thesis (M.Sc.)--University of Northern
British Columbia, 2005.

[Zimmerman 1991]

H. Zimmerman. Fuzzy Set Theory and its Applications.
Kluwer Academic Publishers, 2nd edition, 1991.

66

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Appendix A
1. Stop Words List: Generated Stop Words List (University of Northern British
Columbia)

apr-31
aug-44
dec-66
feb-92
11-apr
15-aug
6-dec
a
al
alO
all
al2
al3
al4
al5
al6
al8
al9
a2
a20
a21
a22
a23
a24
a25
a26
a27
a28
a29
a3
a30
a300
a301
a31
a310
a32
a320
a320s
a321
a33
a330
a330s
a340

a340s
a35
a36
a4
a41
a43
a5
a6
a7
a8
a9
aa
aaa
aaaa
aaaah
aable
aachen
aad
aaf
aafitink
aagla
aah
aahed
aahing
aahs
aai
aar
aas
ab
aba
abb
abc
abl
about
above
abt
ac4
aca
acc
accordingly
ace
ach
acn

acoc
across
ad
ada
adl
adm
adn
ae
afc
afi
afl
after
afterwards
ag
again
against
agn
ago
ah
ahl
ahm
ak
al
ala
alf
all
allows
aim
almost
alone
along
alp
already
also
although
always
am
ama
amc
ami
amo
among
amongst

amp
an
ana
anc
and
ang
anh
ani
ann
another
any
anyawi
anybody
anyhow
anyone
anything
anyway
anywhere
ap
apart
appear
appropriate
ar
are
around
as
asa
ash
aside
aso
associated
ast
asu
at
att
av
ava
available
aw
away
awfully
b
bl

blO
blOO
bl2
b2
b210
b3
b4
b5
b52
b6

bl
b747
b8
b9
ba
ba2
ba3
ba4
ba6
baa
baal
baa2
baa3
back
bbdo
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
ben
ber
beside
besides
best
better

67

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

between
beyond
blm
bmg
bnd
bo
boa
bol
boo
both
bp
brief
bs
bsd
bta
btu
bu
but
bv
by
bye
byline
byu
c
cl
clO
ell
cl2
cl3
cl4
cl5
cl6
c!7
cl8
cl9
c2
c20
c21
c22
c23
c24
c2h2
c3
c4
c5
c6
c7
C759915
c8
c9

cac
came
can
cannot
cant
cause
causes
cb
cbn
cbo
cc
ccaa
ccdc
cch
cct
ce
cee
cellrule
certain
cfa
ch
cha
changes
chi
chj
chp
chr
chu
cio
ck
clo
cmc
cmdr
cmv
CO

column
come
consequentl
y
contain
containing
contains
coq
correction
correspondi
ng
cot
could
cpa
cpl

era
crc
currently
cvj
cwl
cy
cya
d
dl
dlO
dl 1
dl2
dl3
dl4
dl 5
dl6
d2
d3
d4
d5
d6
d7
d8
d9
date
dateline
day
db
de
dea
ded
def
described
dh
di
dib
did
didn
different
dl
dmc
do
doc
doc
doc
docid
docno
docno
docno
does

doesn
doing
don
done
doo
dosen
down
downwards
dr
ds
du
during
dy
e
el
e5
each
eb
ebb
ec
ed
edit
ee
eel
eg
eh
ei
eight
eip
either
ek
el
elk
else
elsewhere
em
en
enough
eo
ep
epa
epp
er
es
et
etc
ev
even
ever
every

everybody
everyone
everything
everywhere
ew
ex
example
except
ext
f
fl
faa
far
fee
fe
few
fg
fi
fifth
fig
first
five
fl
flo
followed
following
foo
for
former
formerly
forth
four
from
ft
fu
further
furthermore
g
g3
g4
g5
g6
ga
gc
ge
get
gets
given
gives
gn

68

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

go
gone
goo
good
got
gq
graphic
graphics
great
grp
gte
gto
gtp
gu
h
hi
hi 1
h 13
hi 8
h2
h20
h2a
h2o
h3
h4
h5
h6
h7
ha
haa
had
hadn
hardly
has
have
haven
having
he
headline
hem
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
hi

him
himself
his
hither
hj
hmmm
hmmmmm
ho
hoo
how
howbeit
however
hoy
ht
hu
hy
ibf
iby
ic
ida
ie
if
ignored
ii
iii
ik
il
im
ima
immediate
in
ina
inasmuch
inc
indeed
indicate
ing
inner
insofar
instead
into
inward
ip
q
ir
iri
irk
is
isl

!

isn
it
its
itself
iv
ivo
ix
j
jl
jio
j2
j3
j4
j5
j6
j8
ja
jc
ji
jo
ji"
just
k
kl
kin
k2
k2r
k9
ka
ka7
kan
kao
kee
keep
kept
kg2
kg7
kg8
khl
kh2
kh7
kh8
know
knx
ko
ky
1
10
12
14

15th
16
17
185
1983
1985
1986
1987
1989
1990
19th
la
laO10189
la010190
laO 10289
laO 10290
la010389
la010390
laO10489
laO10490
la010589
laO10590
laO 10689
laO10690
laO10789
laO10790
laO 10889
laO 10890
laO 10989
laO 10990
laOl 1089
laOl 1090
laO11189
laOl 1190
laOl1289
laO11290
laO11389
laO 11390
laO 11489
laOl 1490
laO11589
laOl 1590
laO 11689
laOl 1690
laO11789
laOl 1790
laOl 1889
laOl 1890
laO11989
laO 11990

laO12089
laO12090
la012189
la012190
laO12289
laO12290
la012389
laO12390
laO12489
laO12490
laO12589
laO12590
laO12689
laO12690
laO12789
laO12790
laO12889
la012890
laO12989
laO12990
la013089
la013090
la013189
laO13190
la020189
la020190
la020289
la020290
la020389
la020390
la020489
la020490
la020589
la020590
la020689
la020690
la020789
la020790
la020889
la020890
la020989
la020990
la021089
la021090
la021189
la021190
la021289
la021290
la021389
la021390

69

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Ia021489
la021490
la021589
la021590
la021689
la021690
la021789
la021790
la021889
la021890
la021989
la021990
la022089
la022090
la022189
la022190
la022289
la022290
la0223 89
la022390
la022489
la022490
la022589
la022590
la022689
la022690
la022789
la022790
la022889
la022890
la030189
la030190
la030289
la030290
la030389
la030390
la030489
la030490
la030589
la030590
la030689
la030690
la030789
la030790
la030889
la030890
Ia030989
la030990
la031089
la031090

la031189
la031190
la031289
la031290
la031389
la031390
la031489
la031490
la031589
la031590
la031689
la031690
la031789
la031790
la031889
la031890
la031989
la031990
la032089
la032090
la032189
la032190
la032289
la032290
la032389
la032390
la032489
la032490
la032589
la032590
la032689
la032690
la032789
la032790
la032889
la032890
la032989
la032990
la033089
la033090
la033189
la033190
la040189
la040190
la040289
la040290
la040389
la040390
la040489
la040490

la040589
la040590
la040689
la040690
la040789
la040790
la040889
la040890
la040989
la040990
la041089
la041090
la041189
la041190
la041289
la041290
la041389
la041390
la041489
la041490
la041589
la041590
la041689
la041690
la041789
la041790
la041889
la041890
la041989
la041990
la042089
la042090
la042189
la042190
la042289
la042290
la042389
la042390
la042489
la042490
la042589
la042590
la042689
la042690
la042789
la042790
la042889
la042890
la042989
la042990

la043089
la043090
la050189
la050190
la050289
la050290
la050389
la050390
la050489
la050490
la050589
la050590
la050689
la050690
la050789
la050790
la050889
la050890
la050989
la050990
la051089
la051090
la051189
la051190
la051289
la051290
la051389
la051390
la051489
la051490
la051589
la051590
la051689
la051690
la051789
la051790
la051889
la051890
la051989
la051990
la052089
la052090
la052189
la052190
la052289
la052290
la052389
la052390
la052489
la052490

la052589
la052590
la052689
la052690
la052789
la052790
la052889
la052890
la052989
la052990
la053089
la053090
la053189
la053190
la060189
la060190
la060289
la060290
la060389
la060390
la060489
la060490
la060589
la060590
la060689
la060690
la060789
la060790
la060889
la060890
la060989
la060990
la061089
la061090
la061189
la061190
la061289
la061290
la061389
la061390
la061489
la061490
la061589
la061590
la061689
la061690
la061789
la061790
la061889
la061890

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Ia061989
la061990
la062089
la062090
la062189
la062190
la062289
la062290
la062389
la062390
la062489
la062490
la062589
la062590
la062689
la062690
la062789
la062790
la062889
la062890
la062989
la062990
la063089
la063090
la070189
la070190
la070289
la070290
la070389
la070390
la070489
la070490
la070589
la070590
la070689
la070690
la070789
la070790
la070889
la070890
la070989
la070990
la071089
la071090
la071189
la071190
la071289
la071290
la071389
la071390

la071489
la071490
la071589
la071590
la071689
la071690
la071789
la071790
la071889
la071890
la071989
la071990
la072089
la072090
la072189
la072190
la072289
la072290
la072389
la072390
la072489
la072490
la072589
la072590
la072689
la072690
la072789
la072790
la072889
la072890
la072989
la072990
la073089
la073090
la073189
la073190
la080189
la080190
la080289
la080290
la080389
la080390
la080489
la080490
la080589
la080590
la080689
la080690
la080789
la080790

la080889
la080890
la080989
la080990
la081089
la081090
la081189
la081190
la081289
la081290
la081389
la081390
la081489
la081490
la081589
la081590
la081689
la081690
la081789
la081790
la081889
la081890
la081989
la081990
la082089
la082090
la082189
la082190
la082289
la082290
la082389
la082390
la082489
la082490
la082589
la082590
la082689
la082690
la082789
la082790
la082889
la082890
la082989
la082990
la083089
la083090
la083189
la083190
la090189
la090190

la090289
la090290
la090389
la090390
la090489
la090490
la090589
la090590
la090689
la090690
la090789
la090790
la090889
la090890
la090989
la090990
la091089
la091090
la091189
la091190
la091289
la091290
la091389
la091390
la091489
la091490
la091589
la091590
la091689
la091690
la091789
la091790
la091889
la091890
la091989
la091990
la092089
la092090
la092189
la092190
la092289
la092290
la092389
la092390
la092489
la092490
la092589
la092590
la092689
la092690

la092789
la092790
la092889
la092890
la092989
la092990
la093089
la093090
lal00189
lal00190
lal00289
lal00290
lal00389
lal00390
lal00489
lal00490
lal00589
M00590
lal00689
lal00690
lal00789
lal00790
lal00889
lal00890
lal00989
lal00990
lal01089
lal01090
lalOl189
lal01190
lal01289
lal01290
lal01389
lal01390
lal01489
lal01490
lalOl589
lal01590
lal01689
lal01690
lal01789
lal01790
lalOl889
lal01890
lal01989
lal01990
lal02089
lal02090
lal02189
lal02190

71

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Ial02289
lal02290
la102389
lal02390
lal02489
lal02490
lal02589
lal02590
lal02689
lal02690
lal02789
lal02790
lal02889
la102890
lal02989
lal02990
lal 03089
la103090
lal03189
lal03190
lal10189
lal10190
lal10289
lal10290
lal10389
lal10390
lal10489
lal10490
lal10589
lal10590
lal10689
lal10690
lal10789
lal10790
lal10889
lal10890
lal10989
lal10990
lal11089
lal11090
lal11189
lal11190
lal11289
lal11290
lal11389
lal11390
lal11489
lal11490
lal11589
lal11590

lal 1689
lal 1690
lal 1789
lal 1790
lal 1889
lal 1890
lal 1989
lal 1990
lal 2089
lal 2090
lal 2189
lal 2190
lal 2289
lal 2290
lal 2389
lal 2390
lal 2489
lal 2490
lal 2589
lal 2590
lal 2689
lal 2690
lal 2789
lal 2790
lal 2889
lal 2890
lal 2989
lal 2990
lal 3089
lal 3090
lal20189
lal20190
lal20289
la120290
lal20389
lal20390
lal20489
la120490
lal20589
la120590
lal20689
lal20690
lal20789
lal20790
lal20889
lal20890
lal20989
lal20990
lal21089
lal21090

lal21189
lal21190
lal21289
lal21290
lal21389
lal21390
lal21489
lal21490
lal21589
lal21590
lal21689
lal21690
lal21789
lal21790
lal21889
lal21890
lal21989
lal21990
lal22089
lal22090
lal22189
lal22190
lal22289
lal22290
lal22389
lal22390
lal22489
lal22490
lal22589
lal22590
lal22689
lal22690
lal22789
lal22790
lal22889
lal22890
lal22989
la122990
lal23089
lal23090
lal23189
lal23190
la2
la23
la89
la90
last
latter
latterly
le

least
length
less
lest
let
li
life
like
little
11
lo
long
loo
lot
lou
Ip
Is
It
ltd
lx
iy
m
ml
in 16
m2
m3
m4
m5
m6
m62
m71
m78
ma
maa
made
mae
make
mal
man
many
may
mc
mca
mcb
md
mdc
me
mea
meanwhile
mee

men
mfg
mg
mi
might
mk
mm
mme
mo
moc
moo
more
moreover
most
mostly
mott
mr
ms
msg
mt
mu
much
must
mvp
mx
my
myself
n
nl
n2
n5
n6
na
na3
na6
name
namely
ncols
ncr
ne
near
necessary
neither
nev
never
nevertheles
s
new
next
nfc

72

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

nfl
ng
ni
nine
nl
nlrb
nmb
no
nobody
nom
non
none
noone
nor
normally
not
nothing
nov
novel
now
nowhere
nr
nu
o
oat
oc
occ
och
od
oda
odd
of
off
oft
often
oh
°j
ol
old
on
once
one
ones
only
onto
ooh
ooo
0000
ooooo
ooz

op
opt
or
ora
ord
ot
other
others
otherwise
ou
ought
our
ours
ourselves
out
outside
over
overall
ow
own
P
Pi
p2
pa
page
part
particular
particularly
pb
pba
pc
pcb
pcc
pel
pep
pe
people
per
perhaps
pfc
Pg
ph
pic
Pj
placed
pic
please
plo
plus
possible

PP
pr
pre
pre
pro
probably
provides
Pt
pta
ptl
q
ql06
qb
qd2
qd3
qd4
qd7
qd8
qe2
qe4
qe5
qe6
qe7
qe8
qed
qf2
qf5
qf7
qg4
qg5
qg6
qh4
qi
qtr
que
quite
r
rl
r2
r2d2
r3
ra
ra3
ra6
rather
rb
rbi
rc
rda
re

really
relatively
respectivel
y
right
rk
ro
roo
rowrule
rv
s
si
sl630
slw
s2
s358
s4
s605
said
same
sb
sba
sbk
sc
sea
see
scca
see
scr
sdg
sdy
se
second
secondly
section
see
seem
seemed
seeming
seems
self
selves
sensible
sent
serious
seven
several
shall
she
shoo

should
si
since
six
smc
smu
so
some
somebody
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soo
sou
specified
specify
specifying
spn
sq
sr
ss
ssi
ssy
st
state
still
stu
su
sub
subject
such
sup
sw
sy
syd
t
t4
ta
table
tablecel
tablecell
tablerow
take
taken
tcu
td

73

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

te
text
than
that
the
their
theirs
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
thi
third
this
thorough
thoroughly
those
though
three
through
throughout
thru
thus
thy
time
tmc
tnt
to
together
too
tot
toward

towards
ts
tu
twa
twice
two
type
u
u2
u60
ua
uc
ul
ulf
un
unc
und
under
unless
uno
until
unto
up
upa
upon
us
use
used
useful
uses
usf
using
usually
ut
uw
V

v20
v2500
v6
v8

v8v
v91
v9w
va
value
various
ve
very
vh
vi
via
vii
viii
vill
vin
viz
vs
vt
vu
vw
vy
w
w6
w8
wa
was
wasn
way
wbc
web
wdm
we
wee
well
went
were
what
whatever
whatnot
when

whence
whenever
where
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
whoever
whole
whom
whose
why
wig
will
with
within
without
wk
wok
word
words
work
world
would
wouldn
wow
wr
wt
wu
wuz
wwii

x6
xi
xiii
xiv
xix
xr
xt
xtra
XV

xvi
xxi
xxiii
xxiv
XXV

xxvii
XXX

x2

xyz
y
y95
yap
ye
year
years
yet
yo
yoo
you
your
yours
yourself
yourselves
yve
z
z28
z90
zac
zero
zx
zz
zzzz

alone
along
already
also

although
always
am
among

X

2. Stop Words List: University of Tennessee

a
about
above
accordingly

across
after
afterwards
again

against
all
allows
almost

74

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

amongst
an
and
another
any
anybody
anyhow
anyone
anything
anywhere
apart
appear
appropriate
are
around
as
aside
associated
at
available
away
awfully
b
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
best
better
between
beyond
both
brief
but
by
c
came
can
cannot

cant
cause
causes
certain
changes
CO
come
consequentl
y
contain
containing
contains
correspondi
ng
could
currently
d
day
described
did
different
do
does
doing
done
down
downwards
during
e
each
eg
eight
either
else
elsewhere
enough
et
etc
even
ever
every
everybody
everyone
everything
everywhere
ex
example
except
f
far

few
fifth
first
five
followed
following
for
former
formerly
forth
four
from
further
furthermore
g
get
gets
given
gives
go
gone
good
got
great
h
had
hardly
has
have
having
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
hither
how
howbeit
however
i
ie
if

ignored
immediate
in
inasmuch
inc
indeed
indicate
indicated
indicates
inner
insofar
instead
into
inward
is
it
its
itself
j
just
k
keep
kept
know
1
last
latter
latterly
least
less
lest
life
like
little
long
ltd
m
made
make
man
many
may
me
meanwhile
men
might
more
moreover
most
mostly

mr
much
must
my
myself
n
name
namely
near
necessary
neither
never
nevertheless
new
next
nine
no
nobody
none
noone
nor
normally
not
nothing
novel
now
nowhere
o
of
off
often
oh
old
on
once
one
ones
only
onto
or
other
others
otherwise
ought
our
ours
ourselves
out
outside
over

75

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

overall
own
P
particular
particularly
people
per
perhaps
placed
please
plus
possible
probably
provides
q
que
quite
r
rather
really
relatively
respectively
right
s
said
same
second
secondly
see
seem
seemed
seeming
seems
self
selves

sensible
sent
serious
seven
several
shall
she
should
since
six
so
some
somebody
somehow
someone
something
sometime
sometimes
somewhat
somewhere
specified
specify
specifying
state
still
sub
such
sup
t
take
taken
than
that
the
their

theirs
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
third
this
thorough
thoroughly
those
though
three
through
throughout
thru
thus
time
to
together
too
toward
towards
twice
two
u
under
unless

until
unto
up
upon
us
use
used
useful
uses
using
usually
V

value
various
very
via
viz
vs
w
was
way
we
well
went
were
what
whatever
when
whence
whenever
where
whereafter
whereas
whereby
wherein

whereupon
wherever
whether
which
while
whither
who
whoever
whole
whom
whose
why
will
with
within
without
work
world
would
X

y
year
years
yet
you
your
yours
yourself
yourselves
z
zero

3. Stop Words List: University of Glasgow

76

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between

beyond
bill
both
bottom
but
by
call
can
cannot
cant
CO

computer
con
could
couldnt
cry
de
describe
detail
do
done
down
due
during
each
eg
eight
either
eleven
else
elsewhere
empty
enough
etc
even
ever
every
everyone
everything
everywhere
except
few
fifteen
fify
fill
find
fire
first
five
for

former
formerly
forty
found
four
from
front
full
further
get
give
go
had
has
hasnt
have
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
how
however
hundred
i
ie
if
in
inc
indeed
interest
into
is
it
its
itself
keep
last
latter
latterly
least
less

ltd
made
many
may
me
meanwhile
might
mill
mine
more
moreover
most
mostly
move
much
must
my
myself
name
namely
neither
never
nevertheles
s
next
nine
no
nobody
none
noone
nor
not
nothing
now
nowhere
of
off
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves

out
over
own
part
per
perhaps
please
put
rather
re
same
see
seem
seemed
seeming
seems
serious
several
she
should
show
side
since
sincere
six
sixty
so
some
somehow
someone
something
sometime
sometimes
somewhere
still
such
system
take
ten
than
that
the
their
them
themselves
then
thence
there
thereafter
thereby

77

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

therefore
therein
thereupon
these
they
thick
thin
third
this
those
though
three
through
throughout
thru
thus
to
together
too
top
toward
towards
twelve
twenty
two
un
under
until
up
upon
us
very
via
was
we
well
were
what
whatever
when
whence
whenever
where
whereafter
whereas
whereby
wherein
whereupon
wherever
whether

which
while
whither
who
whoever
whole
whom
whose
why
will
with
within
without
would
yet
you
your
yours
yourself
yourselv

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Appendix B
Source Code: TREC File Extractor
Download and Install following XAMPP software
http://www.apachefriends.Org/download.php7xampp-win32-l.6.6a-installer.exe
Click on the EXE and follow the easy steps (Please check "Install Apache as
Serivce","Install MySQL as Serivce"). It will create PHP/MySQL & Apache environment
for you in the local PC. Please choose a small named directory as the installation
directory (for example "c:\php" or "c:\apache" names are very smalls enough to easily
access from code).
How to install:
Extract "TREC file spliter.zip". Put all the files under a webroot directory. As you are
using XAMPP (apache 2 friends), the webroot directory is "htdocs".
Please create a directory (e.g., dp) under directory "c:\php\htdocs" and put all files there
(e.g., c:\php\htdocs\dp). Then open the ".htaccess" file using wordpad.
I consider your webroot is "C:/php/htdocs/" and "dp" is your project dir where you kept
all files.
Now type this address in your browser:
http://localhost/dp/
user:ankzbd
password:unbcca

TIPS:
Goto the "dp" directory there are two other directories "new input" Copy all large files
here and goto http://localhost/dp/ (to access the program) to process the files and now
check the "new output" directory for output files. It will help you to avoid upload and
download the files.
Version 1.0

Main.php
/*

1. Creates the initial GUI
2. Calls show_input_files.php
3. Calls show output files.php
4. Calls uploadfile.php

*/

<table align="left" width="90%">
<tr>
79

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

<td><img src="images/unbc_logo.gif" border="0" align="absmiddle"><td>
<td>&nbsp;</td>
</tr>
<tr bgcolor="#fcfcfc">
<td colspan="2" align="center" class="header"><h2>Data Pre- processing</h2></td>
</tr>
<?php
if(!empty($msg))

{
?>
<tr bgcolor="#C 1CDCD">
<td>&nbsp;</td><td colspan="l" align="center"><?=$msg?></td>
</tr>
<?php
}
?>
<tr valign="top">
<td width="15%">
<table vvidth="90%" style="border: lpx solid #9c9c9c">
<tr>
<td><a href="?action=show_input_files"
class="<?=$classl?>">Input Files</a></td>
</tr>
<tr>
<td><a href="?action=show_output_files"
class="<?=$class2?>">Output Files</a></td>
</tr>
<tr>
<td><a href="?action=upload_file" class="<?=$class3?>">Upload
Input</a></td>
</tr>
</table>
</td>
<td id="content">
<?php
include($file_to_include);
?>
</td>
</tr>
80

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

<tr bgcolor="#fcfcfc">
<td colspan="2" align="center" class-'footer">
&copy;ANK Zaman, zaman@unbc.ca, University of Northern British
Columbia.
</td>
</tr>
</table>
index.php
/*

1. Calling common.lib.php and XmlToArray.class.php
2. Switches between following four operations
i. createoutputfiles ii. show output files iii. show_uploader (to
upload files to be processed iv. show input files
3. Handles some error messages and file deletion operation (if needed)

*/

<?php
ob_start(); // Turn on output buffering
require_once("common.lib.php");
require_once("XmlToArray.class.php");
$files = array();
Saction = 'showinput';
if(!empty($_GET['created']))
$msg = "Output files created successfully";

if(!empty($_GET['action']))
$action = trim($_GET['action']);

if(!empty($_GET['action2']))
$action2 = trim($_GET['action2']);
$file2del = $_GET['file'];
$filetype = S_GET['type'];
$file2del = 'new_'. $filetype . 7$file2del";
if(unlink($file2del))

{
$msg = "File deleted succesfully";

}
}
if(!empty($_POST['delete_selected']))
81

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

{
$filetype = $_POST['file_type'];
$del_cnt = 0;
foreach($_POST['files'] as $file2del=>$val)
{

if($val == 1)

{
$file2del = 'new_'. $filetype . 7$file2del";
unlink($file2del);
$del_cnt++;

}
}
if($del_cnt)
{

$msg = "$del_cnt files deleted succesfully";

}

}
if(!empty($_FILES['input_file']['name']))

{
$dest = "new_input/". $_FILES['input_file']['name'];
if(move_uploaded_file($_FILES['input_file']['tmp_name'], $dest))

{
$msg = Tile <'. $_FILES['input_file']['name'] . '> uploaded

successfully';

}
else
{
$msg = 'Error: file <'. $_FILES['input_file']['name'] . '> was not

uploaded';

}
}
Sclassl = 'unselected';
$class2 = 'unselected';
$class3 = 'unselected';
$class4 = 'unselected';
switch($action)

{
case 'create':
$file_to_include = "tpls/create_output_files.php";
break;
case 'show output files':
$files = findDirFiles("new_output");
$file_to_include = "tpls/show_output_files.php";
$class2 = 'selected';
82

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

break;
case 'upload file':
$file_to_include = "tpls/show_uploader,php";
$class3 = 'selected';
break;
case 'show_input_files':
default:
$files = findDirFiles("new_input");
$file_to_include = "tpls/show_input_files.php";
Sclassl = 'selected';

}
?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtmll/DTD/xhtmll-transitional.dtd">
<html>
<head>
<title>Data Pre-processing</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="common.js"></script>
</head>
<body bgcolor="#cccccc">
<?php
include("tpls/main.php");
?>
</body>
</html>

Show uploader.php
/*
It creates file upload form to upload files to be processed
uploading by calling show input files.php

and displays files after

*/

<table width="100%">
<form name="inputForm" action="?action=show_input_files" enctype="multipart/formdata" method="POST">
<input type="hidden" name="upload" value="yes">
<tr bgcolor="#D4D0C2" height="20">
<th colspan="2" align="lefit">Please Upload Your Input File</th>
</tr>
<tr>
<td><input type="file" name="input_file"></td>
</tr>

83

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

<tr>
<td><input type="submit" value="Upload"></td>
</tr>
<tr bgcolor="#D4D0C2 ">
<td colspan="2">&nbsp;</td>
</tr>
</form>
</table>
Common.lib.php (traverse the input directory)
/*
Travers the input folder reads every input file one after another and
every document as a separate file with unique name
*/
<?php
function getbetweenTwo($textBefore,$textAfter,$allText,$offset=0)

extract

{
$pattern='#'.$textBefore.'(.*)'.$textAfter.'#isU';
preg m ate h a11($ pattern, $allText,$matches);
/* Searches s u b j e c t for all matches to the regular expression given in p a t t e r n
and puts
them in matches in the order specified by flags . After the first match is
found, the
subsequent searches are continued on from end of the last match. */

return @$matches[l][$offset];
}//EO Method
function getbetween2($textBefore,$textAfter,$allText,$offset=0)
{

$pattern='#'.$textBefore.'(.*)'.$text After.'# isU';
preg_match_all($pattern, $allText,$matches);
return @$matches[l];
}//End of function getbetween2()
/**

* Traverses & Find Files in Directory
* @param $path to traverse
* @return $files
*/

function findDirFiles($path)

{
$dir = opendir ($path);
$files = array();
while ($file = readdir ($dir))

{
84

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

if (($file ==

or ($file ==

{

continue;

}
if (filetype("$path/$file") = "dir")

{
?

}
else
{
$files[] = $file;

}
} //End of while
closedir($dir);
return $files;
}//End of function findDirFiles()
function createOutputFile($inputFile)//extracting documents form large files.

{
$xml_data = file_get_contents("new_input/$inputFile");
SarrayData = array();
$arrayData = getbetween2('<DOC>', '</DOC>', $xml_data);
//Displaying the Array
echo "<pre>";
print_r($arrayData);
echo "</pre>";
if($num_docs = sizeof($arrayData))

{
foreach($arrayData as $v)
{

$doc_no = getbetweenTwo('<DOCNO>', '</DOCNO>', $v);
$doc_no = str_replace('-',trim($doc_no));
//$file_data = "<DOC>\r\n". $v . "\r\n</DOC>";
$file_data = "<DOC>". strtolower($v). "</DOC>";
$fp = fopen("new_output/$doc_no", "w");
fwrite($fp, $file_data);
fclose($fp);

}
}//End of createOutputFileQ Function

85

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

?>
ShowinputFile.php
/*

1. Displays the list of input files (file name and size)
2. Controls the deletion of input files
3. Controls the download option of input files by calling
4. Calls showFile()

download.php

*/

<table width="100%">
<form name="inputForm" action="?action=show_input_files" method="POST">
<input type="hidden" name="delete_selected" value="yes">
<input type="hidden" name="file_type" value="input">
<tr bgcolor="#D4D0C2" >
<td width-'10"xinput type="checkbox" name="checkall"
onclick="checkUncheckAll(this);"/></td>
<td>File Name</td>
<td>File Size (Bytes)</td>
</tr>
<?php
if(sizeof($files))
{

foreach($files as $fname)

{
$fsize = filesize("new_input/$fname");
?>
<tr>
<td><input type="checkbox" name="files[<?=$fname?>]"
value="1 "/></td>
<td><?=$fname?></td>
<td><?=$fsize?></td>
<td><a href="javascript: showFile('input',
'<?=$fname?>');">View Content</a> | <a
href="?action=create&file=<?=$fname?>">Create
Output

Files</a>

<a
I
href="?action=<?=$_GET['action']?>&action2=delete&type=input&file=<
=$fiiame?>" onClick="javascript: return confirm('Are you sure to
delete?')">Delete</a>
|
<a
href="download.php?type=input&file=<?=$fname?>">Download</a></td

?

>

</tr>
<?php
}

}
?>
<tr bgcolor="#D4D0C2">
86

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

<td colspan="4"><input type="submit" name="delete_selected" value="
Delete Selected "></td>
</tr>
</form>
</table>
Creat_output_files.php
/*
calls createOutputFile() function */
<?php
Sifile = $_GET['file'];
createOutputF ile($ ifile);
header("location: ?action=show_output_files&created=l");
exit;
?>
Show_outputjfiles.php
/*

1. Displays the list of output files (file name and size)
2. Controls the deletion of output files
3. Controls the download option of output files by calling
4. Calls showFile()

download.php

*/

ctable width-'100%">
<form name="inputForm" action="?action=show_output_files" method="POST">
<input type="hidden" name="delete_selected" value="yes">
<input type-'hidden" name="file_type" value="output">
<tr bgcolor="#D4D0C2" >
<td width="10"xinput type="checkbox" name="checkall"
onclick="checkUncheckAll(this);"/x/td>
<td>File Name</td>
<td>File Size (Bytes)</td>
<td></td>
</tr>
<?php
if(sizeof($files))

{
foreach($files as $fname)

{
$fsize = filesize("new_output/$fname");
?>
<tr>
<td><input type="checkbox" name="files[<?=$fname?>]"
value=" 1"/></td>
<td><?=$fname?></td>
<td><?=$fsize?></td>

87

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

<td><a href-'javascript: showFile('output', '<?=$fname?>');">View
Content</a>
|
<a
href="?action=<?=$_GET['action']?>&action2=delete&type=output&file=
<?=$fname?>" onClick="javascript: return confirm('Are you sure want to
delete?')">Delete</a>
|
<a
href="download.php?type=output&file=<?=$fname?>">Download</a></t
d>
</tr>
<?php
}

}
?>
<tr bgcolor="#D4D0C2">
<td colspan="4"><input type="submit" name="delete_selected" value="
Delete Selected "></td>
</tr>
</form>
</table>

Showflle.php
/*
creates a button "close window" at the top (while displaying a file content
*/
<?php
$dir = "new_". $_GET['ftype'];
$file = $_GET['fname'];
Sfilename = "$dir/$file";
?>
<a href="javascript:window.close();">CLOSE WINDOW</a><br/>
<p style="width: 100%; background-color: #9c9c9c;">
<?php
highlight_file($filename); // //highlights the selected file.
?>
</p>
Download.php (
<?php

88

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

/* PHP scripts often generate dynamic content that must not be cached by the client
browser or any proxy caches between the server and the client browser. Many proxies and
clients can be forced to disable caching with following headerQ:*/
## avoid all caching
header('Cache-Control: no-cache, must-revalidate');
header('Pragma: no-cache');
header('Expires: Mon,26 Jul 1980 05:00:00 GMT'); // date in the past
$type = $_GET['type'];
$filename = "new_$type/".$_GET['file'];
if(file_exists($filename))//to download extracted files
{
header('Content-Type: application/octet-stream');
header("Content-Length: ". filesize($filename));
header('Content-Transfer-Encoding: base64');
header('Content-Disposition:
filename='".basename($filename)."");
readfile($filename);

attachment;

}
else
{
die("<font color=RED><h2>This resource does not exist</h2></font>");
// Equivalent to exit()

}
?

INTRODUCTION: Stemming and Stop Word Removal Program
Version 1.0
pstem folder contains the following two folders
inputs : this folder contains the input text files
outputs: generated output text files will be saved here
and the following files
index.php : this file start to access the input folder by calling the process.php file
output.php : this file helps to display the generated output files
process.php: this file creates temporary array of terms and compare then with stop word
list for (to keep or to delete it) then call the stemming.php for stemming.
stemming.php: this file contains the code of actual Porter's stemming.
stopword.txt: it contains the list of stop words that will be discarded form the input files

89

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

How to Install
Download and Install following XAMPP software
http://www.apachefriends.Org/download.php7xampp-win32-l.6.6a-installer.exe
Click on the EXE and follow the easy steps (Please check "Install Apache as
Serivce","Install
MySQL as Serivce"). It will create PHP/MySQL & Apache environment for you in the
local PC.
Please choose a small named directory as the installation directory (for example "c:\php"
or
"c:\apache" names are very smalls enough to easily access from code).
How to install:
Extract "pstem.zip". Copy "pstem" folder and put it into webroot directory. As you are
using
XAMPP (apache 2 friends), the webroot directory is "htdocs".
I consider your webroot is "C:/php/htdocs/" and "pstem" is your program dir where you
kept all files.

INDEX.PHP
/*

This program is free to use (and modification) for education and research.
This program has developed as part of masters thesis in the area of Information
Retrieval(IR).
If you have any question Contact with
Zaman email zaman@unbc.ca
or
Dr. Charles Brown email brownc@unbc.ca
Computer Science Program
University of Northern British Columbia
*/

/* This program creates a button called "CONTIONUE" and call the process.php program
segment.*/
<hl> Stop Word Removal and Porter's Stemming :: Step l</hl>
Please place all input files in dir "inputs" and then press "Continue" button.
<br />
<br/>
<input
type="button"
value="Continue
&raquo;
onClick="window.location.href='process.php'">

PROCESS.PHP
90

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

/*

This program is free to use (and modification) for education and research.
This program has developed as part of masters thesis in the area of Information
Retrieval(IR).
If you have any question Contact with
Zaman email zaman@unbc.ca
or
Dr. Charles Brown email brownc@unbc.ca
Computer Science Program
University of Northern British Columbia
*/

/*

This program segment calls the stemming.php
reads the input files
removes stop words
performs stemming
creates new files with the same name as input files

*/

<?php
ob_start();// Turn on output buffering
set_time_limit(0);
// Limits the maximum execution time, If set to zero, no time limit is imposed
echo "<hl>Step 2 :: Processing Files</hl>Please wait while processing ...<BR/>";
include('stemming.php');
$obj = new PorterStemmerQ;
$stop_words = array();
$dir_files = array();
$dir_files = findDirFiles("inputs");
if(!sizeof($dir_files))

{
echo "<br />No input files found. <a href=index.php>Go
again.";
exit;

back</a>

and

try

}
$file_index = 1;
if(!empty($_GET['file_index']))

{
$file_index = $_GET['file_index'];

}
91

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

if(!empty($dir_files) && $file_index <=

sizeof($dir_files))

$inp = $dir_files[$file index - 1];
$input_file = "inputs/$inp";
$file_data = file_get_contents($input_file);
$stop_w°rds = file('stopword.txt');
array_walk($stop_w°rds, 'trim_value');
foreach($stop_words as $word)
Sword = trim($word);
//cleans every word with extra space, tab, new

//line

etc
$words = array(" $word "\n$word ", "\r\n$word
" $word\r\n", "\n$word\n",
"\r\n$word\r\n");
Sreplacements = array('");
$file_data = str_ireplace($words, $replacements,
//replaces stop words with NULL character

", " $word\n",

$file_data);

}
// Implementing Porter's Stemmer
$file_data = preg replace callback('/(\w+)/i',
$file_data);

'replace_fn',

/* preg_replace_callback() - returns an array if the subject parameter is
an array, or a string otherwise.If matches are found, the new
subject will be returned, otherwise subject will be returned unchanged.
*/

$output_file = "outputs/$inp";

$fp = fopen($output_file, "w");
fwrite($fp, $file_data);
fclose($fp);
echo "<b>$file_index input files processed so

far.</b>";

$file_index++;
echo "<meta http-equiv=\"refresh\" content=\"2;
url=process.php?file_index=$file_index\">";

}
else
{

92

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

echo "<br /><b>Processing Finished.</b> <a
files</a>.";

href=output.php>See

output

}
function trim_value(&$v)
//Referencing the word to be trimmed

{
$v = trim($v);

}
function replace_fn($matches)
//Calling stem() function from stemming.php

{
global $obj;
$matches[l] = $obj->Stem($matches[l]);
return $matches[l];

}
function findDirFiles($path)
//to access and organize inputs directory
{
$dir = opendir ($path);
$files = array();
while ($file = readdir ($dir))

{
if (($file =

or ($file ==

{
continue;

}
if (filetype("$path/$file") == "dir")
{

}
else
{
$files[] = $file;

}
} //End of while
closedir(Sdir);
sort($files);
return $files;
}//End of function findDirFilesQ

93

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

?>

OUTPUT. PHP
/*

*This program is free to use (and modification) for education and research.
This program has developed as part of masters thesis in the area of Information
Retrieval(IR).
*If you have any question Contact with
*Zaman email zaman@unbc.ca
*or
*Dr. Charles Brown email brownc@unbc.ca
*Computer Science Program
*University of Northern British Columbia
*/

/* This program segment organizes (by sorting) and displays the output files*/
<?php
echo "<hl>Step 3:: Output</hl>";
$outputs = array();
Soutputs = findDirFiles('outputs');
if(sizeof($outputs))

{
sort($outputs);
foreach($outputs as $filename)

{
echo '<br /><b><a href="outputs/'.$filename.'">'.$filename.'</a></b>';

}
else
{
echo "No output files found.";

}
function findDirFiles($path)
//to access and organize outputs directory
{
$dir = opendir (Spath);
$files = array();
while ($file = readdir ($dir))

{
if (($file == ".") or ($file == ".."))
94

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

{
continue;

}
if (filetype("$path/$file") == "dir")

{
?

}
else
{
$files[] = $file;

}
} //End of while
closedir($dir);
sort($files);
return $files;
}//End of function findDirFiles()
?>

95

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Appendix C
General Text Parser [University of Tennessee][Giles et al 2001]
GTP has a command-line interface of the form gtp filename -c common_wordJile -t tempdir [options]
where, filename must be the first argument. The -c and -t specifiers must exist, but they
may be in any order. Options are enlisted in the following table-

Options

Explanation

-i

Index of input files

-c

common words file (stop word file

-w

Specify a custom weighting scheme. Local and global refer to local
and global weighting formulas. Local can be tf (term frequency),
log, or binary. Global can be normal, idf, or entropy.

-t

Temp Directory (where temporary working files are stored.)

-z

Perform singular value decomposition.

-R

Specify a name for the current run of GTP.

-h

Create the Harwell-Boeing compressed matrix.

-u

Keep the Harwell-Boeing compressed matrix in an uncompressed
file (on output) if the matrix is created.

-0

Specify that the output file is to be in one binary file for SVD.
(keys.gdbm)

-o

Specify that the key, id# global frequency, (e.g., freqtable.rtf)
Table Table B. 1: GTP Options [Giles et al 2001]

The first argument, filename, is the name of the file or directory to be parsed. If filename
is actually a directory, gtp traverses this directory and all subdirectories in a recursive
fashion and parses each regular file it encounters. If filename is a single file, gtp simply
parses it only. Gtp moves sequentially through each file, extracting keys/terms comprised
of relevant characters, and ignoring keys contained in the common word list specified by
the arguments -c common_word_file (set of stop words).

As gtp tokenizes the terms, it associates them with a number. The number is incremented
sequentially in the order they are encountered. After tokenizing the keys and associating
96

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

each one with the document it was extracted from, gtp begins calculating term weights.
The (global) weights of the terms are computed over the collection of documents if no
global weighting scheme is specified each term carries a global weight of 1. By default,
only a local weight is assigned and this is simply the frequency with which the term
appears in a document. Two thresholds exist for term frequencies: Global and local. By
default, the global and local thresholds are both 1. A term must appear more than 1 time
in the entire collection AND in more than one document in the collection before it will be
weighted.

With the use of different options, GTP produces several other files in varying formats.
The output file, created by specifying the -O option, is a binary file that contains all the
vector (term and document) and singular value information produced by the SVD. The
layout of the output file is as follows: header information consisting of the number of
terms parsed, number of documents encountered, and the number of factors used; term
vectors for all terms parsed; document vectors for all documents parsed; singular values
computed. The following table presents the generated files/folder by GTP:

File name

Type Description (Section)

RUN SUMMARY

Summary of options used (Listed in table 3)

LAST RUN

Summary of options used on most recent GTP run (Listed in table 3)

Output

Binary Vector information generated by SVD

keys.gdbm

DBM Database of keys generated
(http://gnuwin32.sourceforge.net/packages/gdbm.htm)

Rawmatrix.rtf

Raw term-by-document matrix

matrix.hb.rtf

Contains the Harwell-Boeing matrix.

lao2

Summary of SVD calculation.

nonz

Total terms in the document collection, including those are not parsed.

index,i.rtf

Index of input files

temp

Directory where temporary working files are stored.

Frequency.rtf

Term,

serial,

raw_global_frequency,

#files_present_the_term,

global_weight (idf2) [ idf2 formula is fount in the files wgt.h and
weight.cc in the folder src\wingtp\

Table B. 2 GTP Generated Files/Folder [Giles et al 2001]

97

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Query Processing with GTPQUERY

GTPQUERY relies completely on files produced by a standard GTP run. Those files are
output, keys (keys.gdbm), and LAST RUN. Query processing will fail if these files are
not in the working directory.

Gtpquery has a command-line interface of the form gtpquery filename -c common word Jile [options]
- where filename must be the first argument. The -c specifier must exist, but may be in
any order.

The first argument, filename, is the name of the file or directory containing queries to be
parsed. If filename is actually a directory, gtpquery traverses this directory and all
subdirectories in a recursive fashion and parses each regular file it encounters .

If

filename is a single file, gtpquery simply parses it only. Gtpquery moves sequentially
through each file, extracting keys comprised of relevant characters, and ignoring keys
contained in the common word list specified by the arguments -c commonwordfile. By
default, only keys that begin with characters A-z will be parsed. Keys beginning with a
digit (0-9), with the exception of numbers that could be interpreted as dates in the 1700's
1800's and 1900's, will be ignored to the next whitespace character. By default, only
characters A-z and 0-9 are considered to be characters to be included as part of the
remainder of a key. Other options are enlisted below-

-help
Summarize options.
-S
Scale the query vector by the singular values before calculating cosine similarity.
-n
Set the number of factors to use (default is found in the LASTRUN file; the
value of nfact.

The LAST RUN file

is a shortened version of the

RUNSUMMARY file. It contains all the parameters used in your last gtp run.).
-k#
Set the number of results returned (default is all).
-P
98

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Do not print query completion messages to standard output.

Query processing is performed by finding a cosine similarity measure between a query
vector and document vectors. The query vector can be considered a pseudo-document.

The files that GTPQUERY produces are strictly results files. Each file has a prefix of q
result.#, where # is a number starting with 1. The number represents the corresponding
number id of the query that was performed. The file contains a list of document ids and
cosine similarity measures organized by most relevant to least relevant. GTP is a software
package that provides text parsing of small to large document collections and matrix
decomposition for use in information retrieval applications. The brief summery of GTP is
presented below(i) Parse text (single files or directories),
(ii) Construct sparse matrix data structures (with choices of different term
weighting strategies),
(iii) Perform selected matrix decompositions for the representation of terms,
documents, and queries in a reduced-rank vector space, and
(iv) Convert user-supplied natural language queries into appropriate query vectors
for cosine-based matching against term and/or document vectors in that vector
space.

GTP is a public domain software, which (includes documentations) is freely available for
anyone to download form the following linkhttp://www.cs.utk.edu/~lsi

99

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Appendix D
T-Test Calculations

Notes
Output Created

20-Aug-2009 16:17:02

Comments
Input

Data

Wpg-uni-fs01\zaman\TSProfile\Desktop\data.sav

Active Dataset

DataSetl

Filter

<none>

Weight

<none>

Split File

<none>
10

N of Rows in Working Data
File
Missing Value Handling

Definition of Missing

User defined missing values are treated
as missing.

Cases Used

Statistics for each analysis are based on
the cases with no missing or out-ofrange data for any variable in the
analysis.
T-TEST PAIRS=System1 Systeml

Syntax

System1 WITH system2 system3
system4 (PAIRED)
/CRITERIA=CI(.9500)
/MISSING=ANALYSIS.

Resources

Processor Time

0:00:00.000

Elapsed Time

0:00:00.000

[DataSetl ] \\pg-uni-fs-01\zaman\TSProfile\Desktop\data.sav

100

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Paired Samples Statistics
Mean
Pair 1

Pair 2

Pair 3

Std. Deviation

N

Std. Error Mean

System1

.080710

10

.0381119

.0120520

system2

.074230

10

.0306572

.0096946

Systeml

.080710

10

.0381119

.0120520

system3

.072560

10

.0326347

.0103200

Systeml

.080710

10

.0381119

.0120520

system4

.072110

10

.0219340

.0069361

Paired Samples Correlations
N

Correlation

Sig.

Pair 1

Systeml & system2

10

.958

.000

Pair 2

Systeml & system3

10

.989

.000

Pair 3

Systeml & system4

10

.967

.000

Paired Samples Test
Paired Differences
Mean

Std. Deviation

Std. Error Mean

Pair 1

Systeml -system2

.0064800

.0124390

.0039336

Pair 2

Systeml - system3

.0081500

.0075823

.0023977

Pair 3

Systeml - system4

.0086000

.0178220

.0056358

Paired Samples Test
Paired Differences
95% Confidence Interval of the
Difference
Lower

Upper

df

t

Sig. (2-tailed)

Pair 1

Systeml -system2

-.0024183

.0153783

1.647

9

.134

Pair 2

Systeml -system3

.0027259

.0135741

3.399

9

.008

Pair 3

Systeml -system4

-.0041491

.0213491

1.526

9

.161

101

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Level of significance for two-tailed t-test
Degrees of
Freedom (df)
1
2
3
4
... 5

6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
40
60
120
infinity

0.1
6.31
2.92
2.35
2.13
2.02
1.94
1.89
1.86
1.83
1.81
1.80
1.78
1.77
1.76
1.75
1.75
1.74
1.73
1.73
1.72
1.72
1.72
1.71
1.71
1.71
1.71
1.70
1.70
1.70
1.70
1.68
1.67
1.66
1.65

Level of significance for two-tailed test
0.05
0.01
12.71
63.66
9.93
4.30
3.18
5.84
2.78
4.60
4.03
2.57
3.71
2.45
3.50
2.37
3.36
2.31
2.26
3.25
3.17
2.23
2.20
3.11
3.06
2.18
2.16
3.01
2.98
2.14
2.95
2.13
2.92
2.12
2.90
2.11
2.10
2.88
2.86
2.09
2.09
2.85
2.08
2.83
2.07
2.82
2.82
2.07
2.06
2.80
2.79
2.06
2.78
2.06
2.77
2.05
2.05
2.76
2.05
2.76
2.75
2.04
2.02
2.70
2.66
2.00
2.62
1.98
1.96
2.58

0.001
636.62
31.60
12.92
8.61
6.87
5.96
5.41
5.04
4.78
4.59
4.44
4.32
4.22
4.14
4.07
4.02
3.97
3.92
3.88
3.85
3.82
3.79
3.77
3.75
3.73
3.71
3.69
3.67
3.66
3.65
3.55
3.46
3.37
3.29

Table C: Level of significance for two-tailed t-test [Fisher & Yates 1995]

102

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Publications
1. A N K Zaman, and Charles Grant Brown, - "Latent Semantic Indexing and
Large Dataset: Study of Term-Weighting Schemes", accepted for publication in
the IEEE Fifth International Conference on Digital Information Management
(ICDIM 2010), Lakehead University, Thunder Bay, Canada from July 5-8, 2010.

2. A N K Zaman, and Charles Grant Brown, - "Information Retrieval (IR) from a
Large Data Set" published in the proceedings of International Conference on
Electronics, Computer and Communication (ICECC 2008), June 27-29, 2008,
University of Rajshahi, Bangladesh. ISBN 984-300-002131-3.

103

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.