P arallelism In M u ltip le Sequence
A lign m en t
Q iong B ai
B. Sc., Simon Fraser University, 2003

Thesis Submitted In Partial Fulfillment Of
The Requirements For The Degree Of
Master Of Science
in
M athematical, Computer, and Physical Sciences
(Computer Science)

The University Of Northern British Columbia
February 2006
0 Qiong Bai, 2006

1^1

Library and
Archives Canada

Bibliothèque et
Archives Canada

Published Heritage
Branch

Direction du
Patrimoine de l'édition

395 W ellington Street
Ottawa ON K 1A 0N 4
Canada

395, rue W ellington
Ottawa ON K 1A 0N 4
Canada
Your file Votre référence
ISBN: 978-0-494-28349-3
Our file Notre référence
ISBN: 978-0-494-28349-3

NOTICE:
The author has granted a non­
exclusive license allowing Library
and Archives Canada to reproduce,
publish, archive, preserve, conserve,
com m unicate to the public by
telecom m unication or on the Internet,
loan, distribute and sell theses
worldwide, for com mercial or non­
com mercial purposes, in microform,
paper, electronic and/or any other
formats.

AVIS:
L'auteur a accordé une licence non exclusive
permettant à la Bibliothèque et Archives
Canada de reproduire, publier, archiver,
sauvegarder, conserver, transm ettre au public
par télécom m unication ou par l'Internet, prêter,
distribuer et vendre des thèses partout dans
le monde, à des fins com m erciales ou autres,
sur support microforme, papier, électronique
et/ou autres formats.

The author retains copyright
ownership and moral rights in
this thesis. Neither the thesis
nor substantial extracts from it
may be printed or otherwise
reproduced without the author's
permission.

L'auteur conserve la propriété du droit d'auteur
et des droits moraux qui protège cette thèse.
Ni la thèse ni des extraits substantiels de
celle-ci ne doivent être imprimés ou autrem ent
reproduits sans son autorisation.

In com pliance with the Canadian
Privacy Act some supporting
form s may have been removed
from this thesis.

Conform ém ent à la loi canadienne
sur la protection de la vie privée,
quelques form ulaires secondaires
ont été enlevés de cette thèse.

W hile these form s may be included
in the docum ent page count,
their removal does not represent
any loss of content from the
thesis.

Bien que ces formulaires
aient inclus dans la pagination,
il n'y aura aucun contenu manquant.

Canada

Abstract
Aligning multiple sequences is a daunting task that can be extremely de­
manding on computer power and memory resources. By comparing the ho­
mologues sequences from different species of animals, one can draw inferences
about the evolution of these species from their common ancestors. This thesis
applied shared memory parallel processing techniques to the global alignment
of multiple DNA and protein sequences, as well as gene orders. The multi­
threaded method deployed for multiple DNA and protein sequence alignment
was based on the divide-and-conquer technique, which firstly cut the sequences
into sub-sequences, but it is computational expensive to find the optimal cut
positions. Since speed performance is the focus of this thesis, we did not con­
sider the optimal cut positions in our implementations. Meanwhile, an original
and promising graph-based algorithm with parallel processing properties was
introduced to simplify and speed up the alignment operations for multiple gene
orders.

C ontents

1

A b s tr a c t......................................................................................................................

ii

C o n te n ts ......................................................................................................................

iii

List of F ig u res............................................................................................................

vii

List of T a b l e s ............................................................................................................

viii

Publications from this T h e s i s ................................................................................

ix

A cknowledgm ents......................................................................................................

x

In trod u ction

1

1.1 Biological Background for SequenceA n a ly sis .............................................

1

1.1.1 Biological S e q u e n c e s .........................................................................

1

1.1.2 Phylogenetic T r e e s .............................................................................

4

1.1.3 Gene O rd e rs ..........................................................................................

6

1.2 Why Sequence Analysis

2

...............................................................................

7

1.3 Why Parallelizing Multiple A lig n m e n ts .....................................................

9

1.4 Thesis O rganization.........................................................................................

11

A lgorith m s and R ela ted W ork

13

2.1 Pairwise Sequence A lig n m en ts......................................................................

13

2.1.1

Dynamic Programming A lg o rith m s ...............................................

2.2 Multiple Sequence Alignment

14

......................................................................

16

Exact A lg o rith m s................................................................................

17

2.2.2 Divide-and-Conquer A lig n m en ts......................................................

18

2.2.3 Progressive A lig n m e n t......................................................................

20

2.2.1

111

2.3

2.4
3

Iterative Algorithms

........................................................................

2.2.5

Parallelized Multiple Sequence Alignment

23

................................

23

Multiple Gene Order A lig n m e n t.................................................................

25

2.3.1

Multiple Gene Order Alignments

.................................................

25

2.3.2

Representation of a g e n o m e ...........................................................

26

2.3.3

Genome R e a rra n g e m e n t.................................................................

27

2.3.4

Rearrangement Distances

..............................................................

28

2.3.5

Algorithms and M e th o d s .................................................................

30

C onclusion........................................................................................................

34

M u ltith read ed M u ltiple Sequence A lign m en t

35

3.1

Complexity A n a ly s is .....................................................................................

35

3.2

Evaluations and C o m p a ris o n s ....................................................................

36

3.3

Research P ro b le m ...........................................................................................

37

3.4

Approaches and S im u latio n s........................................................................

38

3.4.1

Single-Tree A lig n m e n t....................................................................

39

3.4.2

Multiple-Tree A lig n m e n t.................................................................

40

3.4.3

Speed P e rfo rm a n c e ...........................................................................

40

3.4.4

Alignment S ensitivity........................................................................

41

Sensitivity Im provem ents..............................................................................

43

3.5.1

Overlapping A lig n m en t....................................................................

43

3.5.2

Sliding Windows for Cut Points C alcu latio n s.............................

44

C o n c lu sio n s .....................................................................................................

45

3.5

3.6
4

2.2.4

G raph-based C onsensus M u ltip le G ene Order A lign m en t

46

4.1

Complexity A n a ly s is .....................................................................................

46

4.2

R a tio n a le for G r a p h T h e o r y ...................................................................................

47

4.3

The Precedence Graph-Based Consensus A lg o rith m ...............................

48

4.3.1

The Minimum Spanning T r e e .......................................................

48

4.3.2

An E x am p le ........................................................................................

51

IV

4.3.3 Algorithm F o rm alizatio n ...................................................................

54

4.4 Parallelism of the A lg o rith m .........................................................................

58

4.5 Evaluations and C o m p a ris o n s......................................................................

58

4.5.1 Evaluation without a Reference a lig n m e n t..................................

58

4.5.2 Evaluation against a Reference alig n m en t.....................................

59

4.6 V a lid a tio n .........................................................................................................

60

4.7 C onclusion..........................................................................................................

62

5 D iscu ssio n and Future D irection s

63

5.1 C o n trib u tio n......................................................................................................

65

5.2

Future D ire c tio n s ............................................................................................

66

5.2.1

Multiple Sequence A lig n m e n t.........................................................

66

5.2.2

Multiple Gene Order A lig n m en t.....................................................

67

B ibliography

68

List o f Figures
1.1 An example of single strand DNA s e q u e n c e s...........................................

2

1.2 A protein example: a goose hemoglobin protein known as “HAGSI” .

2

1.3 A phylogenetic tree for some warbler s p e c ie s ...........................................

4

1.4 A phylogenetic tree built for five a n im a ls ..................................................

6

1.5 The transform ation of the mouse gene order into the human gene order
on the X chromosome: only the 6 longest syntency blocks are shown
...................................................................................................................

8

2.1 Alignment examples of Needleman-Wunsch a lg o rith m ...........................

15

2.2 A sample alignment of three s e q u e n c e s .....................................................

17

2.3 Illustrating the search space with the divide-and-conquer technique

.

19

2.4 Demonstrating the 3-step progressive alignment by C lu s ta lW ..............

20

2.5 Mouse vs. Human: different X chromosome gene o r d e r ........................

27

here

2.6 Examples for breakpoints: “-> ” means ascending order; “< -” means
descending order; a single gene can be represented by either “- > ” or
............................................................................................................................. 29
3.1 Divide-and-conquer te c h n iq u e .....................................................................

38

3.2 Details of single-tree alignment

..................................................................

40

D e ta ils o f m u ltip le -tre e a l i g n m e n t .........................................................................

41

3.3

3.4 Speed improvement for single-tree and multiple-tree implementations
in term s of different number of t h r e a d s .....................................................

VI

42

3.5

SP-scores for single-tree and multiple-tree implementations with re­
spect to different number of th r e a d s ...........................................................

3.6

Illustrating the sliding window technique for improving alignment ac­
curacy ...............................................................................................................

4.1

..................................................................................................

54

Illustrating the steps of the precedence graph-based consensus algo­
rithm for genome sequence a lig n m e n t........................................................

4.6

53

The minimum spanning tree built from the summarized final prece­
dence m atrix

4.5

50

The minimum spanning tree built from the precedence m atrix of se­
quence ( a ) .........................................................................................................

4.4

49

A simple network represented by a weighted graph and an adjacency
m a t r i x ...............................................................................................................

4.3

44

A simple network represented by a weighted graph and an adjacency
m a t r i x ...............................................................................................................

4.2

43

55

An example showing how to obtain different final alignments by travers­
ing the minimum spanning tree in different o r d e r s .................................

VII

57

List o f Tables
1.1

The twenty amino acids found in proteins

V lll

P u b lica tio n s from th is T hesis

[1] “M ultithreaded Multiple Sequence Alignments” , Joanne Bai, Siamak Rezaei.
The Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th
Annual Conference, Shanghai, China, September 1-4, 2005.

[2] “Multiple Gene Order Alignmen” , Siamak Rezaei, Joanne Bai. The Proceedings
of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference,
Shanghai, China, September 1-4, 2005.

IX

A ck now ledgm ents
I would like to express my gratitude to all those who have given the support to
complete this thesis.
I am greatly indebted to my supervisor, Dr. Siamak Rezaei, for kindly providing
suggestions and encouragement which helped me in all the time of research and writing
of this thesis. His comments have been of greatest help at all times.
My gratitude also goes to my supervision commettee members Dr. Liang Chen
and Dr. Cecilia Alstrom-Rapaport, for their suggestions and encouragement th a t led
to substantial improvements of this thesis.
I would also like to thank the external examiner Dr. Alan Wagner for reviewing
this thesis and monitoring the work with his valuable suggestions.
I thank the Dean of G raduate Studies Dr. Robert Tait, the secretaries of the
Dean of G raduate Studies Ms. Bethany Haffner and Ms. Trinity Posteraro for having
always been so supportive.
I thank Pruthvi, Kouhyar, Abbie, Jeremy, and Sean for their nice company and
suggestions during my education.
Lastly, but most im portantly I am very grateful for the love and support of my
parents Fan Bai and Yufen Chen, who have been always encouraging me to fulfill my
dreams.

C hapter 1
Introduction
The focus of this thesis is the alignment of multiple biological sequences, including
gene sequences (i.e. DNA and RNA), protein sequences, and genome sequences (a
list of ordered gene blocks). In order to distinguish these two levels of sequences,
DNA, RNA and protein sequences will be referred to as normal ‘sequence’, while the
genome sequences will sometimes be referred to as ‘gene orders’ in this thesis.

1.1

B iological B ackground for S eq u en ce A n alysis

1.1.1

B io lo g ic a l S eq u en ces

DNA (Deoxyribonucleic Acid) is a nucleic acid th a t carries the genetic information
in the cell and is capable of self-replication and synthesis of RNA (Ribonucleic Acid).
DNA consists of two long chains of nucleotides twisted into a double helix and joined
by hydrogen bonds between the complementary bases A and T or C and G. The
sequence of nucleotides determines individual hereditary characteristics. RNA is a
very similar polymer. It is usually a single-stranded chain of alternating phosphate
and ribose units with the bases A, U, G, G bonded to the ribose. The structure and
base sequence of RNA are determinants of protein synthesis and the transmission of
genetic information. Figure 1.1 shows a simple example of DNA sequences.

> A Single S t r a n d D N A S e q u e n c e
ATTTCGTGCATATCTGACGTTAGGACCACGT

Figure 1.1; An example of single strand DNA sequences
The fundamental building blocks of life are proteins.

Twenty different amino

acids are commonly found in proteins, and each protein has a unique, genetically
defined amino acid sequence. They serve as enzymes, structural elements, hormones,
immunoglobulins, etc., and are involved in oxygen transport, muscle contraction, elec­
tron transport, and other activities throughout the body. One of the most im portant
concepts in modern biology is th a t the functional properties of proteins is determined
largely by the sequence of the 20 amino acids shown in Table 1.1.1 in the linear
polypeptide chain. Thus, in theory, knowing the sequence of a protein (the order
with which the amino acids occurred) one could infer its function.
W hat determines the order of amino acids in a protein? The central dogma of
Molecular Biology states the relationships between genes and proteins. It describes
th at each gene in the DNA molecule carries the information needed to construct one
protein, which, acting as an enzyme, controls one chemical reaction in the cell. The
basic structure of any protein can be described by its sequence of amino acids, and the
shapes those proteins fold up into make them different from one another. However,
in this thesis, other than primary structure, the protein shapes, such as secondary
structure, tertiary structure, and quaternary structure, will not be discussed. Figure
1.2 shows an example of a protein sequence.

> HAGSI

h e m o g l o b i n alp h a - A c h a i n - b a r - h e a d e d g o o s e

V LSA A D K TN V K G V FSK ISG H A EE Y G A ET LE R M FTA Y PQ T K T Y FPH FD L Q H G SA Q IK A H G K
K V V A A LV E A V N H ID D IA G A LSK L S D L H A Q K L R V D P V N F K F L G H C FL V V V A IH H PS A L T A E V
HAS L D K F L C A V G T VLT A K Y R

Figure 1.2: A protein example: a goose hemoglobin protein known as “HAGSI”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Name
Alanine
Cysteine
Aspartic Acid
Glutamic Acid
Phenylalanine
Glycine
Histidine
Isoleucine
Lysine
Leucine
Methionine
Asparagine
Proline
Glutamine
Arginine
Serine
Threonine
Valine
Tryptophan
Tyrosine

Three-letter Code
Ala
Cys
Asp
Glu
Phe
Gly
His
He
Lys
Leu
Met
Asn
Pro
Gin
Arg
Ser
Thr
Val
Trp
Tyr

One-letter Code
A
C
D
E
F
G
H
I
K
L
M
N
P

Q
R
S
T
V
W
Y

Table 1.1; The twenty amino acids found in proteins

1.1.2

P h y lo g e n e tic T rees

Biologists often use the idea of portraying graphically how extant species evolved
to reconstrnct evolutionary trees. “Phylogenetic tree” is a more technical name for
“evolutionary tree” . The sample tree shown in Figure 1.3 ^ portrays how certain
warblers are related to one another. From this tree one can guess th a t at one time
there was one general, warbler-like species, and this general species evolved into at
least three different species. Each of these species further diversified into the extant
genera known as Vermivora, Seirus, and Dendroica. These three genera, in turn,
have eventually fragmented into numerous and various species recognizable today.
For example, the genus Dendroica includes more than twenty-five species.
Palm
Y ellow -R um ped

O range-C row ned

O venbird
N orthern
W aterth ru sh

N ashv ille

M a g n o l ia
B lac k po ll

Tennessee

D endroica
V erm ivora

S e ir u s

Figure 1.3; A phylogenetic tree for some warbler species
In the past, biologists working on phylogenetic trees had to find plain evidences or
characteristics to content themselves. The more similar organisms looked, the more
closely related they were assumed to be. T h a t’s often really the way it works, but also,
th is a p p ro a c h c o u ld le a d to so m e p r e tty serio u s e rro rs. N o w a d ay s, n ew te c h n iq u e s

have given us a more clear idea of what the real phylogenetic trees should look like.
For instance, with a process known as DNA hybridization, scientists can actually
^adapted from http://w w w .b ack y ard n atu re.n et/ev o trees.h tm

determine how much genetic material th a t different species have in common. If a
large portion of the genetic code (i.e. DNA) of two species or two groups of species
is identical, then they are considered to be closely related. Otherwise, if only a small
portion of them is identical, then their relationship is distant.
A phylogenetic tree is constructed with extant species on the leaves (terminal
nodes), and the interior nodes representing hypothesized ancestors. Usually all inte­
rior nodes in such trees are binary. In the context of sequences, the nodes represent
sequences, and the edges (branches) represent mutations, either explicitly or by a
number indicating how many mutations have happened between the end nodes of an
edge. In a tree constructed from protein sequences or DNA sequences, an interior
node can in principle represent the same sequence as a leaf node.

Consider a set of DNA sequences 1, 2, 3, 4, 5.

Sequence 1: CA G G T A
Sequence 2: CA G A C A
Sequence 3: CG G A T G
Sequence 4: T G C G C T
Sequence 5: T G C G C A

The task is to construct a phylogenetic tree and find out which sequences are more
closely related to each other. This is the same as finding the mutations th a t have
occurred. A distance score will be used to reflect the number of mutations needed
from one sequence to another. For example, transforming sequence 4 into 5 needs
only 1 mutation, which occurs at the last position of sequence 4 and replaces T with
A. Therefore, t h e d is ta n c e sco re between se q u e n c e 4 a n d sequence 5 is 1. The smaller

the distance score is, the closer the two sequences are related. Assume th a t if the
distance score is no less than half of the average sequence length (no less than 3 in
this case), then the two sequences won’t belong to the same subset.

1. Looking at the first column, it seems th a t a m utation from C to T, or from T
to C, has occurred in the past. In the mean time, check the distance scores in
Figure 1.4 and divide the 5 sequences into two subsets; 1, 2, 3 and 4, 5
2. Looking at the second column in the larger subset 1, 2, 3 there could have
been a m utation from A to G, or from G to A. Again, check the corresponding
distance scores and divide it into 2 even smaller subsets: 1, 2 and 3
3. Repeating the same thing for the smaller subset 4, 5 and their distance score
(which is 1) tells th a t they are closely related and may not be further divided.
Finally, we end up with a phylogenetic tree shown in Figure 1.4.

0

2

3

5

4

-

0

3

5

4

-

-

0

5

5

-

-

-

0

1

-

-

-

-

0

S e q u e n c e D i s t a n c e M a trix

2

3

4

5

T h e C o n s tr u c te d Phylogenetic T ree

Figure 1.4: A phylogenetic tree built for five animals

1.1.3

G en e O rders

A g en e o rd e r re fe rs to t h e s e q u e n tia l lo c a tio n o f g en es o n a c h ro m o so m e. A c h ro m o ­

some is defined as a sequence of genes while a genome is defined as a set of chromo­
somes. Genes of an organism are arranged in a linear order in eukaryotes. This linear
order of genes can be examined to understand the similarity between the genomes of

6

two species. A given genome may be transformed or evolved into the genome of a
different organism by a sequence of elementary rearrangement events acting on the
genes. Hannenhalli and Pevzner have shown how cabbage gene orders were trans­
formed into turnip gene orders [30] by using an efficient algorithm th a t performs only
three rearrangement operations.
In the early 1990s, it was found th a t there are groups of genes in mice th a t appear
in the same order as they do in humans. These genes are likely to be present in
the same order in a common ancestor of human and mice - the ancient mammalian
genome. The human genome appears just the mouse genome cut into about 300
large genomic fragments, called syntency blocks, which have been pasted together
in a different order. Both sequences are just two different shufflings of the ancient
mammalian genome. For example, chromosome 2 in humans is built from fragments
th a t are similar to mouse DNA residing on chromosomes 1, 2, 3, 5, 6, 7, 10, 11, 12, 14
and 17. Hence, a location of a gene in mice can often lead to clues about the location
of a related gene in humans.
Every genome rearrangement results in a change of gene ordering, and a series of
these rearrangements can significantly alter the genomic architecture of a species. The
study of genome rearrangements involves solving the combinatorial puzzle of finding a
series of rearrangements th a t transform one genome into another. Figure 1.5 presents
a rearrangement scenario in which the mouse X chromosome is transformed into the
human X chromosome. The elementary rearrangement event in this scenario is the
flipping of a genomic segment, and it is called a reversal or an inversion. Details will
be introduced in the next chapter.

1.2

Why Sequence Analysis

Sequence analysis is the process of making biological inferences from the known se­
quence of monomers in protein, DNA and RNA polymers. Currently, many worth-

Mouse ( X chrom. )

Human ( X chrom. )

Figure 1.5: The transformation of the mouse gene order into the human gene order
on the X chromosome: only the 6 longest syntency blocks are shown here
while things can be done with sequence analysis.
Regions of DNA th a t encode proteins are first transcribed into messenger RNA
and then translated into protein. By examining the DNA sequence alone we can
determine the sequence of amino acids th a t will appear in the final protein.

In

translation, codons of three nucleotides determine which amino acid will be added
next in the growing protein chain. It is im portant then to decide which nucleotide
to start translation, and when to stop, this is called an open reading frame. Every
region of DNA has six possible reading frames, three in each of the two strands. The
reading frame th a t is used determines which amino acids will be encoded by a gene.
Typically only one reading frame is used in translating a gene in eukaryotes, and this
is often the longest open reading frame. Once the open reading frame is known the
DNA sequence can be translated into its corresponding amino acid sequence.
Sequence analysis helps to identify a protein’s primary sequence according to the
corresponding gene sequence. For example, if we need to find out where a protein
is encoded in a DNA sequence, it is very useful to know w hat peptides would be
encoded by all six reading frames.
Secondly, sequence analysis can be used to search databases for sequences similar
to a new sequence. If someone has just determined a sequence of an interesting bit of
DNA, one of the first questions he is likely to ask himself is “has anybody else seen
anything like this?” Fortunately, there has been a very successful international effort
to collect all the sequences people have determined in one place so th a t they can be
searched.

Thirdly, sequence analysis performs a calculation of sequence alignments for evo­
lutionary inferences and it aids in structural and functional analysis. Although it is
not possible to completely predict the function or shape (structure) of a protein from
a sequence, some useful inferences about structure and function can be drawn, by
comparing the sequence of a protein of unknown structure and function to sequences
of proteins with known structure and function. Second, the partial analysis done in
the present will help reach the goal of structure or function prediction in the future.
Third, by comparing the sequences of equivalent proteins from different species of
animals (such equivalent proteins are called “homologues” ), one can draw inferences
about the evolution of these species from their common ancestors.
The third application of sequence analysis, sequence alignments, is going to be
the topic of this thesis, and this includes multiple sequence alignments, and multiple
gene order alignments.
Comparing sequences, structures, and sequences with structures is the most fun­
damental operation in biological sequence (i.e. DNA, RNA and protein) and structure
analysis. W hen a comparison indicates a similarity between two proteins, it can im­
mediately suggest relationships involving structure, function and the evolution of the
two protein sequences from a common ancestor protein. When one of the proteins is
well characterized in terms of structure and function, a close connection with a novel
protein sequence may allow all the hard-earned biological d ata to be transferred to
the new protein. The degree of certainty with which this transfer can be made de­
pends on how similar the two sequences are. The two related protein sequences are
said to be homologous, and the information are transferred by homology.

1.3

W h y P arallelizin g M u ltip le A lign m en ts

Parallel computers are used primarily to speed up computations. A prallel algorithm
can be significantly faster than the best possible sequential solution. There is a grow­
ing number of applications in sciences, engineering, business, and medicine requiring

computing speeds th a t cannot be delivered by conventional computer. These applica­
tions involve processing huge amount of data, or perform a large number of iterations,
thus leading to inordinate running times. Parallel computation is the only approaches
known today th a t would make these computation feasible.
Research of genome projects generates enormous amounts of information. Re­
searchers want to access this information quickly and easily and also be able to trans­
form this information into other useful information. Researchers also want to access
cognate information, such as bibliographic or biological information associated with a
given DNA sequence. Subsequently, there is a demand for increased computer power,
both in speed and performance, and for enlarged memory capability, rapid networked
communication, and improved database design.
Multiprocessing or parallel processing in general means the use of more than
one processor or process in the computer handling of a given task. M ultithreaded
programming is one of the forms for parallel processing. So, what is a thread? Think
of sewing needles as the CPUs (or Light Weighted Processes) and the threads in a
program as the fiber. If you had two needles but only one thread, it would take longer
to finish the job than if you split the thread into two and used both needles at the
same time. Taking this analogy a little further, if one needle had to sew on a button
(blocking I/O ), the other needle could continue doing other useful work. If only one
needle is used there would be some extra hours for the single needle to do other useful
stuff. Moving to something more concrete, a thread is a sequence of instructions th at
can be executed in parallel with other threads. They are not entire processes, but
rather lightweight threads of execution. Threads of a program are smaller portions
of a process running concurrently (or in parallel).
Multiple sequence alignment is a demanding task to automatically generate an ac­
curate alignment. An in-depth knowledge of evolutionary and structural relationships
within a species family is often lacking or hard to use. General empirical models based
on mathematically sound principles can be extremely demanding in CPU power and
memory, and are difficult to apply. For some cases, statistical heuristics have been

10

developed to be able to cope with practical d ata set size. Therefore, in order to reduce
the computation time and obtain useful data more efficiently it would be very ideal
if some parallelism techniques could be applied for multiple sequence analysis.
Similarly, these complexity problems also occur in the computations for multiple
gene order alignment. For instance, there are two different complexities for gene order
alignment:
• For signed reversal and translocation distance, which marks the gene positively
or negatively, the computation complexity is polynomial.
• For unsigned reversal distance, which only marks the gene positively, the com­
plexity is much more than polynomial.
Besides, the existing algorithms for gene order alignment, including breakpoint anal­
ysis and reversal distance analysis, lack the parallelism properties for computational
speedup.

1.4

T h esis O rganization

The rest of the thesis will investigate the following issues:
• Chapter 2: discusses the related work th a t has been done for multiple sequence
and genome (gene order) alignments, including literature reviews on alignment
methods, algorithms, and applications.
• Chapter 3: presents a m ultithreaded implementation for multiple sequence
alignments based on the mixed idea of Progressive alignment and Divide-andconquer alignment. Depending on how the guide tree(s) would be applied, two
d iffe re n t a p p ro a c h e s a re im p le m e n te d for c h eck in g th e im p ro v e m e n ts of align­

ment speed and sensitivity.
• Chapter 4: describes a new algorithm with parallelism properties for aligning
two or more genome sequences, which is based on a precedence graph-based
11

consensus method and able to obtain the ancestor genome sequence. In this
section, the algorithm formalization and the issues related to alignment result
evaluation are included.
• Chapter 5: concludes the thesis and talks about the future work th a t is going
to be involved.

12

C hapter 2
A lgorithm s and R elated W ork
In this chapter, algorithms and a broad range of the preceding work in multiple
sequence and gene order alignment is described. It provides an overview of the earliest
papers to the most recent development, including the approaches and applications for
multiple alignment, and the work for parallelized multiple alignment.
Sequence alignments are either global or local. Global alignments find the best
match over the to tal length of both sequences. In many cases, however, sequences
share only a limited region of similarity. This may be a common domain or simply
a short region of recognizable similarity. This case is dealt with by local alignment.
Local alignment aims at identifying the best pair of regions, one from each sequence,
such th a t the optimal alignment of these two regions is the best possible. In this
thesis, only global alignment will be considered and discussed.

2.1

P airw ise Sequence A lign m en ts

A residue is a single unit within a polymer, such as an amino acid within a polypeptide
o r p ro te in .

T h is te r m re flects th e fa c t t h a t s u g a rs, n u c le o tid e s, a n d a m in o ac id s

usually lose a few atoms (usually hydrogen and oxygen) when they are polymerised
into a larger molecule. In making an alignment, a one-to-one correspondence is set
up between the residues of the two sequences. This has the evolutionary implication

13

th a t at one time the paired residues were the same in an ancestral sequence and have
diverged through the accumulation of point mutations in their DNA. Point m utation
(or sometimes called substitution) is not the only process at work and extra residues
may have been inserted or deleted giving rise to breaks or gaps in the alignment.
These are referred to as insertions and deletions or, as indels.
For example, one possible alignment between the sequences

S = A G C A C A C A
T = A C A C A C T A

is A l:

A G C A C A C - A
A - C A C A C T A

Or A2:

A G - C A C A C A
A C A C A C T - A
A l contains one insertion and one deletion. A2 needs one insertion, one deletion,
and two replacements. If we assume th a t the cost for each of these one-bit operations
(insertion, deletion, replacement) is 1, then the cost of A l is 2, and the cost of A2 is
4. Therefore, A l is a better alignment than A2 as it is cost effective.

2.1 .1

D y n a m ic P r o g r a m m in g A lg o r ith m s

There are different types of sequence comparison algorithms using dynamic program­
ming and the corresponding parallel formulations. The first algorithm for comparing
biological sequences using the dynamic programming techniques was introduced by
Needleman and Wunsch [42] in 1970. The algorithm consists of two parts; the cal­
14

culation of the total score indicating the similarity between the two given sequence,
and the identification of the alignments th a t leads the score.
Seque nc e Y:

A
X

I

T

A

-1
-2

T

G

C

A

-1

-2

-3

-4

-5

-6

-7

\

0

-1

-2

-3

-4

-5

0

-1

-2

-3

\

1

0

-1

1

0

0

\

\

-3

-1

1

-4

-2

0

1

2

\

-5

-3

-1

1

1

2

\

-6

-4

-2

0

1

1

2

A
G
T

2

A LIG N M EN T I:
Se que nce X:
S eque nce Y:
Score:
Total:

A T -- A A G T
A T G C A G T
I 1- 10 1 I I
4

A LIG N M EN T 2:
Seque nc e X:
Sequen ce Y:
Sc ore:
Total:

A T A -- A G T
A T G C A G T
1 1 0 - 1 1
11
4

Figure 2.1: Alignment examples of Needleman-Wunsch algorithm
Considering the two sequences X and Y as shown in 2.1, a similarity m atrix can
be initialized with deceasing values (0, -1, -2, -3, -4, ....) along the first row and the
first column to parallelize the consecutive gaps( insertions and deletions). The other
elements of the m atrix d[i,j] are calculated and filled by the maximum of the three
values: d[i-l,j-l], d[i-l,j], and d[i,j-l]. In another words, the value of cell d[i,j] solely
depends on the values of its 3 adjacent neighbors at the previous row, column and
diagnose positions. The similarity m atrix d[n, m] is built by applying the following
recurrence equation:

d[i, j] = max (d[i, j-1] + gp, d[i-l, j-1] + ss, d[i-l, j] + gp)

In this example, gp is -1. ss is 1 if the elements match or 0 otherwise. Since
global alignment takes into account the entire sequences, the final score will always
be found in the bottom right hand corner of the matrix. In our example, the final
score 4 gives us a measure of how similar the two sequences are. Figure 2.1 shows the
similarity m atrix and the two possible alignments. The two arrows going up and left
15

represent two optimal paths. Tracing an optimal path backwards leads to an optimal
alignment for two sequences.
The basic dynamic programming algorithm for making a global alignment is per­
haps the most widely used and im portant algorithm in bioinformatics. Variations of it
are used for local alignment and it can be extended to align more than two sequences
(multiple alignment), such as Smith-W aterman’s Algorithm, Pickett’s Algorithm [42],
and W ilbur-Lipman’s algorithm [88].
Almost all alignment methods find the best alignment between two sequences
under some scoring scheme. These scoring schemes can be as simple as '4-1 for a
match, -1 for a m ism atch’. Indeed, many early sequence alignment algorithms were
described in these terms. However, since a scoring scheme to give the biologically
best score is wanted, the fact th a t biological molecules have evolutionary histories,
three-dimensional folded structures, and other features which constrain their primary
sequence evolution need to be taken into account.

Therefore, in addition to the

mechanics of alignment and comparison algorithms, the scoring system can be very
complex.

2.2

M u ltip le Sequence A lign m en t

A multiple sequence alignment is an alignment of two or more sequences. It is a
natural extension of two sequence alignment, called pairwise alignment. A simple
example of multiple sequence alignment is shown in figure 2.2. Sequences may be
multiply aligned to visualize the effect of evolution across homologues proteins or
DNAs. Multiple alignment makes it possible to investigate a wide range of im portant
biological phenomena like the following:
• Phylogenetic analysis
• Identification of conserved motifs and domains
• Structure prediction
16

S e q l : V T S I T T C G S N I

G N V K W Y L P G

S e q 2 : V T S I T T C G S N I

-

-

V N W Y L P G

S e q 3 : V S T L L L C V G Y P

-

-

V E

W E G

-

-

Figure 2.2; A sample alignment of three sequences
Depending on the sequences compared and the goal or application, how the com­
parison is performed will be different.

For these reasons, there have been many

different kinds of algorithms and programs w ritten to compare sequences. Consider­
ing the obvious properties of existing multiple alignment algorithms, it is convenient
to classify them in three main categories [59]: exact, progressive and iterative.

2.2.1

E x a c t A lg o r ith m s

The simultaneous alignment of all the sequences is called dynamic programming align­
ment or exact algorithm. Exact algorithms are high quality heuristics th a t deliver an
alignment usually very close to optimality. When solving the com putational prob­
lem of multiple sequence alignment, a natural generalization is to expand the two
sequence case. The basic algorithm for the global comparison of two sequences, is to
assign a score. In the multiple sequence case, the issue of scoring becomes a little
more complex.
W ith several sequences, scores are calculated by assigning a score to each column
of the scoring matrix. The sum-of-pairs (or SP measure) is often used to calculate
values for a sequence. The sum-of-pairs function is the sum of pairwise scores of
all pairs of symbols in a given column. A k-dimensional array would be an obvious
solution for dynamic programming for the alignment of k sequences, but a standard
application of dynamic programming for multiple sequences takes exponential time.
Even with time saving measures, a multiple sequence alignment of three sequences
takes O(n^) time. As a result, only 3 or 4 sequences can be realistically used when
implementing a global dynamic programming algorithm.
Carrillo and Lipman [51] recognized a lower bound on the cost of the optimal

17

multiple sequence alignment, and presented an elegant branch-and-bound approach.
They described a technique for reducing the part of the graph (the dynamic program­
ming m atrix)that has to be examined. Only the paths th at are contained in a certain
“polytope” around the shortest path are explored by their algorithm.
The MSA program [52] implemented a slightly modified version of the Carrillo and
Lipman’s algorithm. Since the bounds of Carrillo and Lipman were not sufficiently
tight for solving the “real world” multiple sequence alignment instances, Lipman et al.
proposed heuristics to improve bounds. However, the number and length of sequences
th a t can be aligned is limited because the number of computational steps and the
amount of memory grow exponentially with the number of sequences to be analyzed.

2 .2 .2

D iv id e -a n d -C o n q u e r A lig n m e n ts

In Exact algorithm, the time and space complexity grows exponentially with the
number of sequences (O (m") for equal sequence length m), and for practical solutions
of such computationally expensive problems, generally, two approaches are used;
• One approach attem pts to reduce the running time by using pruning techniques
which still guarantee finding the highest-scoring alignment, but not reducing the
worst-case complexity. Divide-and-Conquer is one example of these approaches.
• The other approach uses heuristics. This means th at some ‘rules of thum b’ are
used in the solution, and the best (or correct) solution is not necessarily found.
For some of the heuristic methods it is possible to find an upper bound on the
deviation of the result from the correct one.
Divide and Conquer Alignment [79] [80] is one of the implementations based on
the Carillo and Lipman algorithm to limit computations to a small area in the multi­
dimensional search matrix. The idea of divide-and-conquer is straightforward: DCA
guesses a point in the lattice on the optimal path, splits the sequences at th a t point,
and recurses. On sequence fragments th a t are short enough, the exact algorithm

18

is invoked. Once all fragments are aligned, they are concatenated to yield an ap­
proximate alignment. By doing this, the search space for each sub-alignment can
be significantly reduced (as shown in Figure 2.3). This method finds the split point
at the middle way of the longest sequence, and then searches the multi-dimensional
m atrix for a set of coordinates with minimal score. DCA does not take direct advan­
tage of heuristic procedures for finding the split points, such as using the presence of
strong motifs in subsets of the input sequences. In molecular biology, motif refers to
the conserved smallest group of atoms in a polymer th at, when under the influence
of a rotation-translation operator, will assemble the rest of the atoms in the chain.

/

1/4

/ /

1/4

/

14

/4

/ -wr / / wr /
/ 1/4 / / 1/4 /
/
(a) T h e original s e a r c h in g s p a c e ; I

1/4

/

/

/

(b) T h e s e a r c h in g s p a c e after d v i d e - a n d - c o n q u e r :
( 1 / 4 * 1 /4 * 1/4) * 4 = 1/ 16

0

( c ) T h e d i v i d e - a n d - c o n q u e r t e c h n i q u e : T h e o r i g in a l 3 s e q u e n c e s a r e d i v i d e d
into 2 s e t s o f s u b - s e q u e n c e s . K e e p d i v id i n g in e a c h s u b s e t a n d ali g n i n g e a c h
i n d iv i d u a l s u b s e t . W h e n all t h e s u b - a l i g n m e n t s a r e o b t a i n e d , a s s e m b l i n g
th em t o g e th e r to f o rm the final glob al alignm en t.

Figure 2.3: Illustrating the search space with the divide-and-conquer technique

19

General Dynamic Programming (GDP) [27] is similar to DCA. The major differ­
ence is th a t DCA only commits to a set of split points without considering the ge­
nomics information hiding behind the sequences, whereas GDP uses local and global
approximate alignments to generate a large number of plausible “anchor points”
aimed at the optimal path through the lattice, and progressively computes the score
moving from one anchor to the next. It has been claimed by Gracy and Sallantin [27]
th at DCP successfully aligned up to 18 sequences with results superior to GlustalW

2 .2 .3

P r o g r e ssiv e A lig n m e n t

Among the many strategies implemented for multiple sequence alignment, progres­
sive alignment is the most successful and by far the most widely used. The idea is to
establish an initial order (i.e. a guide tree) for joining the sequences and to follow this
order to gradually build up the alignment. The strategies depend on a progressive
assembly of the multiple alignments where sequences or alignments are added one by
one so th a t never more than two sequences are simultaneously aligned using dynamic
programming. This approach has the great advantage of speed and simplicity com­
bined with reasonable sensitivity, even if it by nature is a heuristic method th a t does
not guarantee any level of optimization.
The most prominent implementation for this approach is GlustalW [82], which,
together with the window graphic user interface (GUI) version GlustalX [83], belongs
to the Glustal family. Glustal [84] progressive sequence alignment includes following
three steps, which are demonstrated in Figure 2.4;
4 Se qu ence s

Guide T r e e

;
3

- >

r
(A

Iterative Pairwise Alignment

:

: : : : :

A

,

:

--------------------------

4

4

Figure 2.4: Demonstrating the 3-step progressive alignment by GlustalW
20

• Step 1: Pairwise Alignment
- Aligns each sequence to each of the other sequences in the set, giving a
similarity score for each comparison.

Scores of each sequence pair are

displayed in a similarity score matrix
- The pairwise alignments are done by using dynamic programming (i.e.
Needleman-Wunsch global alignment algorithm).
• Step 2; Guide Tree
- The similarity m atrix is transformed into a distance matrix, and is used
by an algorithm (e.g. Neighbor-join algorithm [69] or UPGMA [54] [77])^
to create a Guide Tree. The guide tree here is similar to the concepts
of the phylogenetic tree introduced in Section 1.1.2, which specifies the
relationships between sequences.
• Step 3: Progressive Alignment
- Following the guide tree, the two most similar sequences are aligned, giving
the consensus sequence. Consensus sequence is the sequence th a t reflects
the most common choice of base or amino acid at each position of a series
of related DNA, RNA or protein sequences. Areas of particularly good
agreement often represent conserved functional domains.
- The next sequence will be repeatedly added and aligned to the existing
new consensus sequence, until the final alignment is achieved.
^UPGM A and Neighbor-join algorithms:
* U PG M A (Unweighted P air G roup M ethod using A rithm etic averages); It works by initially
having all sequences in separate clusters and continuously joining these. T he tree is constructed
by considering all initial clusters as leaf nodes in th e tree, and each tim e two clusters are joined,
a node is added to th e tree as th e p aren t of th e two chosen nodes. T he clusters to be joined
a re c h o se n a s th o s e w ith m in im a l p a irw is e d is ta n c e . T h e b r a n c h le n g th s a re s e t c o rre s p o n d in g

to the distance between clusters, which is calculated as th e average distance between pairs of
sequences in each cluster.
* Neighbor Joining: T he m ethod works very m uch like U PG M A . T he m ain difference is th a t
instead of using pairwise distance, th is m ethod su b tracts th e distance to all other nodes from
th e pairw ise distance.

21

ClustalW works well in alignment when sequences are closely matched to one
another, or in another words, have many positions in common. However, in cases
where the sequences are far apart from each other when pairwisely aligned, more
errors appear, and these errors are propagated to the tree construction as well as the
final results. Since the pairwise alignments are used in a greedy manner for progres­
sive alignment, the alignments formed during the progression towards the final MSA
cannot be changed any more. Thus, the difficulty with progressive alignment highly
depends upon the initial pairwise sequence alignment. T-Coffee [60] is an alignment
application th a t attem pts to mitigate this shortcoming by using information from
other global and local alignments to guide the progressive alignment.
T-Coffee is currently the most reliable MSA method available [49]. The basic
philosophy is th a t instead of looking at pairs of sequences in isolation, this proce­
dure allows information to be included from all other pairwise alignments. Thus the
resulting guide tree makes the sequences to be aligned based on how well aligned
they are with respect to the rest, so th a t the more confidently aligned sequences are
matched up first and the least confidently aligned last. However, this method suf­
fers from exaggerating the significance of shorter residue segments th a t share high
percentage identity.

As a result, although the method is beneficial for anchoring

conserved domains in an alignment of closely related sequences, it has the opposite
effect on outlier sequences th a t are usually misaligned due to the lack of a commonly
conserved segment.
In most of the current top performing progressive alignment methods, such as
Glustal series, DiAlign2 [58], POA [50], Praline [35], and T-Coffee [60], the dynamic
programming (DP) strategy is adopted. The main difference between the available
DP-based progressive methods is the way in which the information of aligned blocks of
sequences is represented. While early methods used consensus sequences to represent
alignment blocks, current methods mostly use a profile formalism to represent the
information in a MSA [76].

22

2 .2 .4

Ite r a tiv e A lg o r ith m s

Recent developments in multiple sequence alignment techniques have mainly focused
on sensitive and optimal models to represent MSA information. A class of techniques
th a t are able to revisit and optimize the MSA is th a t of iterative multiple alignment
techniques. Pioneered by Hogeweg and Hesper (1984) [36], iterative techniques de­
pend on algorithms able to produce an alignment and to refine it through a series of
cycles or iterations until no more improvements can be made.
Iterative methods for MSA can be deterministic or stochastic, depending on the
strategy used to improve the alignment. Deterministic iterative strategies are the
simplest. They extract sequences one by one from a multiple alignment and realign
them to the remaining sequences [29] [35]. This procedure is term inated when no
more improvement can be made. Stochastic iterative methods include HMM training
[33] [6], simulated annealing [45], and evolutionary computations such as genetic algo­
rithms [28] [90] and evolutionary programming [13]. The main advantage is to allow
for a good conceptual separation between the optimization processes and objective
functions (evaluation criteria). Objective function defines the aim of any optimization
procedure.
PR R P [29], SAM [39], HMMER [23], SAGA [61], and MUSCLE [24] are some of
the recent and less recent available iterative methods for multiple sequence alignment.
Some of these methods are a mixture of progressive and iterative strategies.

2 .2 .5

P a r a lle liz e d M u ltip le S e q u en ce A lig n m e n t

Parallel processing, sometimes called concurrent processing contains two or more
processors th a t work together to perform a task. Each process is a sequential pro­
gram, namely, a sequence of statem ents th a t are executed one after another another.
Whereas a sequential program has a single thread of control, a concurrent program
has multiple threads of control.
The processes in a parallel program work together by communicating with each

23

other.

Communication is programmed using shared variables or message passing.

In shared memory model, when shared variables are used, one process writes into a
variable th a t is read by another. In contrast, in message passing model, when message
passing is used, one process sends a message th a t is receied by another. The focus of
this thesis is on shared memory parallel algorithms.
Parallel algorithms for analyzing DNA and protein sequences are becoming in­
creasingly im portant as sequence data continues to grow. While dynamic program­
ming algorithms make large sequence alignment feasible, the quadratic time require­
ment still makes it a time-consuming process. A natnral approach is to reduce the
time requirement with the use of parallel computers.
Iyengar [5] examined the parallel characteristics of four sequence alignment al­
gorithms. The four algorithms presented were the dynamic programming algorithm
developed by Needleman, Wunsch, and Sellers (the NWS algorithm), Pickett’s algo­
rithm [42], a parallel algorithm using some of Pickett’s ideas, and an algorithm which
uses some of W ilbur and Lipman’s ideas [88] for constructing alignments which are
not always optimal. Iyengar found out th a t the NWS algorithm contains the most
properties to be parallelized but also does more work than any of the other algorithms
which were studied, and Pickett’s algorithm contains the least parallelism properties.
A shared-memory multithreaded parallel version of the Needleman-Wunsch’s Al­
gorithm [42] using dynamic programming for pairwise alignment is presented by Mar­
tins [53], which handles the d ata dependencies very well and performs as many opera­
tions as possible independently. Another algorithm, Berger-Munson algorithm [8] was
initially parallelized by Ishikawa et al. [40] [41] on a parallel inference machine (PIM)
using a parallel logic programming language K Ll. Later, Yap et al. [89] extended
and evaluated this approach on an Intel iPSC/860 parallel computer by applying
speculative computation to the parallelization of the Berger-Munson algorithm, and
achieved a higher speedup and a more scalable implementation.
Algorithms th a t both retain time optimality and reduce space requirement were
first presented by Edmiston et al. [26] and further developed by Aluru et al on an IBM

24

SP-2 and a Pentium cluster [3]. Assume th a t n and m are the lengths of the sequences
to be pairwisely aligned. Edmiston et. al. [26] discuss parallel algorithms for sequence
and subsequence alignment th a t achieve linear speedup and can use up to min(m, n)
processors. Lander et. al. [48] discusses an implementation on a shared memory
parallel computer. These algorithms store the entire dynamic programming table.
Huang [38] presented a parallel sequence alignment algorithm which increases the
run-time to O ((m +n)^/p), which is intended for a message-passing architecture with
one-dimensional-array topology. Recently, Rajko et al. [64] claimed having developed
the first space and time optimal parallel algorithm on an IBM xSeries cluster for the
pairwise sequence alignment problem, which requires only O ((m4-n)/p) space and O
(m n/p) time, and is suitable for implementation on parallel computers.
A widely studied problem th a t is identical to a special case of the sequence align­
ment problem is string editing. Highly parallel algorithms for this problem have been
developed for the hypercube models of computation [4], [65] using almost quadratic
number of processors.
The parallel algorithms mentioned above for dynamic programming pairwise align­
ment can be extended to align more than two sequences. At present, several alignment
tools for multiple sequences are parallelized and have become available online for free
user access, such as pClustalW [14], Parallel ClustalW on 16 CPUs[55], DiAlign p [75],
Praline [46], MUSCLE-p [25], and etc. The parallelism applied in these applications
is not limited to the alignment algorithm itself, but also to the sequence processes.

2.3

M u ltip le G ene Order A lign m en t

2 .3 .1

M u ltip le G en e O rder A lig n m e n ts

Gene order alignment is another form of biological sequence alignment. If one gene
can be thought as a word, gene order then can be presented as the order of words in
the sentence. In multiple sequence alignment, the positions of residues are fixed, and
a set of DNA or protein sequences are compared with each other, whereas in multiple
25

gene order alignment, a set of blocks of genes are compared in terms of their orders
or positions.
The DNA of eukaryotes is subdivided into chromosomes. In prokaryotes, chro­
mosomal DNA is circular, and the entire genome is carried on one chromosome. A
chromosome has the self-replicating genetic structures of cells containing the cellular
DNA th a t bears in its nucleotide sequence, the linear array of genes. Chromosome
breakage and mistakes in repair, along with a number of other processes, give rise to
changes in gene order. These have im portant consequences for the cell, the organ­
ism, the population, and for the evolution of species. Figure 2.5 shows the different
X-chromosome gene orders for mice and humans. Gene order alignments will help to
provide following information:
• Aligned gene orders allow us to speculate the closest common ancestor of the
genomes.
• Gene order alignment is useful when faced with “missing link” problem where
evolutionary intermediates are not known.
• Gene order alignment information could be used across multiple genomes to
form a phylogenetic tree.

2 .3 .2

R e p r e s e n ta tio n o f a g e n o m e

A unichromosomal genome can be considered as a sequence of n genes. Let’s denote
the genes by numbers 1, 2, ..., n, and represent the two signed orientations of gene i
as i and -i. As such, a genome is represented as a signed perm utation of the numbers
1, 2, ..., n. For example, a unichromosomal genome with n—5 genes is

5 -3 4 2 -1

26

3

5

2

4

3

2

5

4

3

2

1

4

1

2

3

4

M ouse

5

Human

Figure 2.5: Mouse vs. Human: different X chromosome gene order
A multichromosomal genome consists of n genes spread over m chromosomes, ft is
represented as a signed perm utation of 1, 2, ..., n, with delimiters

inserted between

the chromosomes. For example, a genome with 12 genes spread over 3 chromosomes
could be w ritten as

7 -2 8 3 ;
5 9 -6 -1 12 ;
11 4 10 ;

The genome alignment being discussed here is about unichromosomal genome or
one chromosome of the multichromosomal genome. Currently, neither repeated genes,
nor the gaps arised from insertion or deletion, are considered.

2 .3 .3

G en o m e R ea rr a n g e m en t

The algorithmic study of comparative genomics tries to explain differences in gene
orders in two or more genomes in terms of a limited number of rearrangement opera­
tions. For unichromosomal genomes, this requires the calculations of an edit distance
between two linear orders on the same set of objects, representing the ordering of
27

homologous genes in two genomes. In the “signed” version of the problem, a plus or
minus is associated with each gene, representing the direction of transcription.
Genome rearrangement operations considered here include reversal (also referred
to as inversion), transposition, and translocation [70]. Some other operations also
exist such as duplication, fusion and fission. Every study of genome rearrangement
involves solving a combinatorial “puzzle” to find a shortest series of operations th at
transform one genome into another.
• R eversal; the inversion of any number of consecutive term s in the ordered set,
which also reverses the polarity of each term within the scope of the inversion
in the case of signed orders.
• Transposition: the movement of a piece of DNA around the chromosome
(from one gene to another part of the genome), usually through the function of
a transposable element. Transposition may or may not involve an inversion.
• Translocation: the rearrangement of a chromosome in which a segment is
moved from one location to another, either within the same chromosome or
to another chromosome. This is sometimes reciprocal, when one fragment is
exchanged for another.

2 .3 .4

R e a rr a n g e m en t D ista n c e s

Literature often refers to the rearrangement events, reversal and translocation, as the
genomic sorting problem. The key question is to find the minimum number of steps
needed to transform genome A to genome B using reversal and translocation.
B reakpoint analysis
Breakpoint analysis [73] tries to minimize the breakpoint distance between two gene
order sequences. A pair of elements in two perm utations forms a breakpoint if they are
consecutive in one but nonconsecutive in the other sequence, as illustrated in Figure
2,6. The breakpoint distance between two permutations is the number of breakpoints.
28

— ^ ; c on se cut i v e ascendi ng o r d e r
^

>
0 1 2 *

: c o ns e cut i v e des c en di ng o r d e r

<
—>
7 6 5 * 8

<
*

4 3 *

><—
>
9 1 0 * 1 3 * 1112

i

b reakpoi nt

Figure 2.6: Examples for breakpoints:
means ascending order;
descending order; a single gene can be represented by either
or

means

R eversal D ista n ce
One can also calculate the reversal distance between two genome permutations as
well. A reversal in a signed perm utation is an operation th a t takes an interval in
a perm utation, reverses the order of the numbers, and changes all their signs. For
example, a reversal R(i, j) applied to the permutation:

P (l), ..., P (i-l), P(i), ..., P(j), P(j+1), ..., P(n)

and gives the following permutation:

P (l),

P (i-l), -P(j),

-P(i), P(j+1),

P(n)

By applying R(4, 8) to the following gene order:

123456789

One will obtain:

1 2 3 - 8 - 7 - 6 -5 -4 9
29

The reversal distance for a pair of genomes can be computed in polynomial time,
but the extension of pairwise alignment to multiple alignment using reversal distance
has faced computability problems in the past, as a result multiple alignment using
breakpoint distance has become a dominant technique for gene rearrangement.
GRAPPA[57] and MGR[9] are two applications designed for gene order analysis.
Moret and his colleagues developed GRAPPA by improving the breakpoint analysis
technique, while the MGR is based on reversal analysis.

2 .3 .5

A lg o r ith m s an d M e th o d s

Observations of gene duplication and repetitive sequences are much more common
among eukaryotes than prokaryotes, while genome rearrangement can be readily ob­
served between both closely-related and divergent organisms of all types. These ad­
ditional evolutionary mechanisms distinguish the genome comparison and alignment
task from traditional sequence alignment [15].
Comparing long sequences of genomes tends to be a very computationally expen­
sive task, in term s of both time and memory. Traditional algorithms and methods
for multiple sequence alignment do not scale to genomic size. If these algorithms
were applied to genomic sequences of over a million bases, they would likely run out
of memory or run for an unreasonable long time. In the past few years, there have
been several attem pts to solve these problems. Accordingly, several new methods
and tools for genome comparison have been developed. However, most of them hold
several techniques in common. Specifically, the existing tools can be grouped into one
of two categories: iterative pairwise alignment, or anchor-based m ultiple alignment.
MLAGAN [12] is the most well-known tool based on iterative pairwise alignment,
which will be introduced below. The m ajority of the remaining tools fall into anchorbased multiple alignment category, and almost all of these tools share very similar
algorithmic concepts.
Prom Darling et. al.’s [15] point of view, an anchor-based alignment typically
30

proceeds in three steps.

First, the aligner identifies a set of local alignments in

regions of high similarity among the genomes. Next, a subset of the regions identified
in the first step are selected as alignment anchors, based on whether the tool considers
them as are part of the correct alignment. Finally, the alignment anchors are used
to restrict the number of possible alignments considered when performing an (0(n^))
gapped alignment using dynamic programming. Many tools assume th a t the genomes
are collinear to complete a gapped alignment. Collinear means no significant inversion
or rearrangement events took place since their divergence.
P airw ise G enom e A lign m en t
Early research into genome alignment focused on scaling traditional pairwise align­
ment methods (0(n^)) to handle much longer genome sequences. Pairwise genome
alignment tools, such as MUMer [19], GLASS [7], and WABA [44] pioneered the use
of anchoring to accelerate the alignment process.
MUMer, released in 1999, is the first software program capable of aligning whole
genomes.

It combines a suffix tree, longest increasing subsequence, and Smith-

W aterman algorithms to align a pair of whole genomes.
LAGAN [10] is another pairwise whole genome alignment tool, which produces a
global alignment of two genomes in three steps. First, it finds a set of local alignments
using CHAOS algorithm [10]. Then, it chains an ordered subset of these local align­
ments to form a rough global alignment. Finally, it uses a bounded error dynamic
programming algorithm to find the best alignment within a certain distance of the
rough global alignment.
M u ltip le G enom e A lign m en t
When the genomic DNA sequences of closely related organisms became available,
there was an immediate need for reliable and automatic software to align three or
more genomic sequences.
MGA ([37], 2002) was the first software tool for multiple genome alignment, which
31

is an anchor-based method th a t uses the same idea as the pairwise genome alignment
tools (i.e. MUMer). The MGA algorithm works as follows;
• Find a set of exact matches in terms of the calculations of MultiMEMs (Maximal
Multiple Exact Matches) [37]. A MultiMEM is an exact match th a t occurs in
each sequence and cannot be extended in either direction without encountering
a mismatch.
• The alignment is anchored at different subsets (or matches). Select an optimal
chain of matches among all matches found.
• Close the gaps between chained matches using recursive calls.
• If the remaining gaps are short, hand them over to another sequence alignment
tool (e.g., ClustalW); If the gaps remained are long, leave them open.
Compared with MUMer 2 [20], the extension of MUMer, which allows alignment
of multiple genomes and translated protein sequences, MGA presents a significant
improvement. The MUMer systems identify maximal unique matches and use them
to anchor the alignments. MGA, however, completely discarded the uniqueness con­
straint, and searches for maximal multiple exact matches instead, which will theo­
retically increase the sensitivity of the alignments. Another improvement th a t MGA
makes over the MUMer systems is th a t MGA needs less memory by constructing a
virtual suffix tree.
Besides MGA and MUMer 2 mentioned above, some other anchor-based multiple
genome alignment system are also available, such as EMAGEN [21], CHAINER [1][2],
and M-GCAT [86].
Hohl et al. [37] also pointed out th a t an alignment of the genomes (gene orders)
of several organisms makes sense only if the organisms are closely related. Otherwise,
genome rearrangement should be taken into account. Shuflfle-LAGAN [11] , a variant
of the LAGAN alignment system, is the first genome comparison method th a t explic­
itly deals with genome rearrangements during the alignment process. Rather than
32

selecting a single collinear set of anchors, Shuffle-LAGAN selects anchors collinear in
the first sequence with rearrangements perm itted in the other sequence. However,
Shuffle-LAGAN works only for pairwise comparison.
Mauve [16] presented an anchor-based alignment algorithm to address the pres­
ence of significant inversion and rearrangements in a set of genomes to be aligned.
This algorithm is known as the first multiple alignment tool th a t considers genome
rearrangement during the alignment process. Mauve’s algorithm can be summarized
as follows [16]:
• Find local alignment in term s of the calculations of MultiMEMs (Maximal Mul­
tiple Exact Matches) [37].
• Use the multiMEMs to build a phylogenetic guide tree.
• Select a subset of the multiMEMs to use as anchors. These anchors are parti­
tioned into collinear groups called LGBs.
• Perform recursive anchoring to identifying additional alignment anchors within
and outside each LCB.
• Perform a progressive alignment of each LCB using the guide tree.
Manual cure of a multiple genome alignment on actual genome sequence is too
costly. Thus there is no ‘gold standard’ alignment to use when assessing the qual­
ity of calculated alignments.

Darling et.

al.

[15] performed several experiments

using simulated evolution environment to compare the accuracy to Multi-LAGAN,
Shuffle-LAGAN, and MAVID. These experiments demonstrated th a t Mauve’s algo­
rithm clearly excels at aligning genomes with rearrangement.
A s m e n tio n e d a b o v e , o th e r t h a n a n c h o r-b a s e d , M L A G A N [12] is a tool b a se d

on iterative pairwise alignment. It performs best when aligning and analyzing the
regions of homogeneity among distinct species, such as human and turkey. However, it
was not designed to yield the same results when dealing with closely related multiple

33

genomes. MLAGAN performs a progressive alignment of the genomes using LAGAN,
and takes a phylogenetic guide tree together with the sequences as input. Each time,
by following the input guide tree, LAGAN will select two closest genome sequences
and produce a pairwise alignment until all the sequences are aligned.

MLAGAN

then gives options to iteratively refine the resulted alignment until no significant
improvement can be made.

2.4

C onclusion

During the past decades, considerable work for aligning multiple sequences has been
done, but there is still room for improving alignment speed and sensitivity.
Multiple genome alignment is a large-scaled multiple sequence alignment. Tradi­
tional methods for sequence alignments cannot be applied directly for genome align­
ments. Since 1999, there have been several independent approaches to align multiple
genomes. However, most of them hold several techniques in common and lack paral­
lelism properties th a t would speed up the aligning process.

34

C hapter 3
M ultithreaded M u ltip le Sequence
A lignm ent
This chapter introduces a m ultithread approach for aligning multiple sequences. De­
pending on how the guide tree(s) would be applied for progressive alignment, two
different software programs, a single-tree program and a multiple tree program, have
been implemented. Their alignment results will be compared at the end to check the
improvements of alignment speed and sensitivity. To our best knowledge, this is a
novel approach to consider progressively aligning the same set of multiple sequences
by building a different number of guide trees.

3.1

C om p lexity A n alysis

According to the results of Wang and Jiang [87], the multiple sequence alignment
problem is NP-complete for a class of scoring matrices used for reality biological
applications.
P ro m a c o m p u ta tio n a l p o in t o f view , th e r e a re se v e ra l w ays to a d d re s s th e lack

of hard computing power for bioinformatics, especially in multiple sequence and gene
order alignments. The first is by developing new, faster heuristic algorithms th a t
reduce computational space for the most time-consuming tasks. The second is incor­
35

porating these algorithms into the ROM of a specialized chip. The third and most
promising consideration, is parallel computing. In parallel processing, two or more
microprocessors (or threads) can be used simultaneously to divide and conquer tasks
th at would overwhelm a single, sequential processor. However, parallel computing
still requires new paradigms in order to harness the additional processing power for
bioinformatics.

3.2

E valuations and C om parisons

The most frequently used reference MSA database is BAliBASE (Benchmark Align­
ment database) [85]. It is the only reference database for multiple sequence alignment
th a t has been specifically designed for MSA

benchmarking.

T h a t’swhyBAliBASE

is so appealing to MSA method developers.

As a result, it has been used in many

studies as the standard of tru th for comparing the performance of new MSA methods
with older ones. The current version of BAliBASE (version 2.0) [82] contains a total
of 167 reference alignments placed in eight different categories, which are aimed at
covering most of the problems alignment engines come up against:
1. Multiple sequence alignments containing equi-distant sequences of various con­
servation levels. Conserved gene is a gene th a t has remained essentially un­
changed throughout evolution. Conservation of a gene indicates th a t it is unique
and essential. There is not an extra copy of th a t gene with which evolution can
tinker. Changes in the gene are likely to be lethal.
2. Alignments with a single orphan sequence.
3. Alignments comprising two distant groups of less than 25% sequence identity.
4. Alignments containing long insertions.
5. Alignments containing long deletions.
6. Sequence repeats.
36

7. Transmembrane sequences.
8. Domain permutation.

3.3

R esearch P rob lem

Progressive alignments (Section 2.2.3) use an approximation of a guide tree between
the sequences as a guide tree th a t dictates the alignment order. The progressive
strategy is appropriate for many alignment problems, but also suffers from greedi­
ness. Errors made in the first alignments during the progressive protocol cannot be
corrected later as the remaining sequences are added in. A ttem pts to minimize such
alignment errors have generally been targeted at global sequence weighting [81], where
the contributions of individual sequences are weighted during the alignment process.
However, such global sequence-weighting schemes carry the risk of propagating rather
than reducing error when used in progressive multiple-alignment strategies [35].
Simultaneous alignments are high quality heuristics th a t deliver an alignment
usually very close to optimality. Nonetheless, they remain an extremely CPU and
memory-intensive approach, applicable only to about nine sequences of average length
of 20 characters for the fastest implementation (DCA). Prom Figure 3.1 one can
easily notice th a t the divide-and-conquer technique (Section 2.2.2) actually provides
a perfect structure for parallel programming, and each sub problem can be computed
independently. Another major advantage of using the divide-and-conquer technique is
th a t extremely long sequences can be also acceptable by a multiple sequence alignment
program as long as the sequences can be cut into small enough pieces.
Based on the characteristics of both progressive and divide-and-conquer align­
ments, long sequences will first be cut into several sets of sub-sequences, and each
of these sub-sequences will be aligned progressively and independently by a lightweighted process, thread. In some cases, the sequences are extremely long and cannot
be fed into a simultaneous alignment program even after they are divided into several
shorter pieces. T h a t’s why sometimes progressive alignment will be still considered
37

for the sub-sequences alignments.

Sequence X:

xo

XI

X2

X3

Sequence Y:

YO

Y1

Y2

Y3

Decomposed

XO

XI

X2

X3

YO

Y1

Y2

Y3

f

f
PC

f
PI

>
P2

f

>
P3
J

Figure 3.1: Divide-and-conquer technique
This multithreaded multiple sequence alignment approach actually combines the
idea of divide-and-conquer alignment and progressive alignment. In order to check
the alignment speed and sensitivity, two different alignment programs, depending on
how the guide tree(s) would be applied, are developed for getting a better sense of
which approach works better than the other.

3.4

A pproaches and Sim u lations

The single-tree and multiple-tree alignment programs were implemented using shared
memory M ultithreaded Java Programming. There are two concepts for parallel pro­
cessing, namely, processes and threads. A process is a single executable module that
runs concurrently with other executable modules. For example, in a multi-tasking
environment th a t supports processes, like Microsoft Windows, a word processor, an
internet browser are separate processes and can run concnrrently. Processes are sep­
arate executable, loadable modules as opposed to threads which are not loadable.
38

Multiple threads of execution may occur within a process. For example, from within
a database application, a user may start both a spellin check and a time consuming
sort. In the meantime, a thread can also be a task th a t runs concurrently with other
tasks within a single executable file (e.g., within a single MS-DOS EXE file). Unlike
processes, threads have access to common d ata through global variables.
Parallelism deployed in the program was based on the divide-and-conquer tech­
nique structure. On one hand it is clear th a t optimal cut positions exist; on the
other hand it is clear th a t it is computational expensive to find them. Since the focus
of this thesis is speed performance, only trivial idea are used now to determine the
cut positions in the implementations, such as choosing the middle points recursively
during each cut.
Since progressive alignment only performs global alignment and match sequences
over their full lengths, problems with this approach can arise when highly dissimilar
sequences are compared. Especially when there is a large difference in the lengths of
the two sequences to be compared, global alignment routines become unwarranted.
This is because highly similar internal regions may be overshadowed by dissimilar
regions and the high gap penalties normally are required to achieve proper global
matching. Moreover, many biological sequences are modular and show shuffled do­
mains, and the repeats of internal sequence can also severely limit the applicability
of global methods. Therefore, in the simulations, only long sequences with similar
length and over 40% identical are tested for the single-tree and multiple-tree align­
ment programs.

3 .4 .1

S in g le-T ree A lig n m e n t

Single-tree alignment uses a uniform guide tree built for the full-length sequences
at the beginning of the alignment. All the sub-alignment processes w ill follow this
single tree when the sequences are cut into sub-sequences. Details of the single-tree
alignment are presented in Figure 3.2.
The guide tree for single-tree alignment is built using UPGMA algorithm.
39

G u i d e - tree
D ecom pose

'N

P2

P3

V

PO

PI

_____ ;

'------ '

sub alignm ents
C oncatenate

V
_____________________________________________________________________o v e r a l l a l i g n m e n t

Figure 3.2: Details of single-tree alignment

3 .4 .2

M u ltip le -T r e e A lig n m e n t

As opposed to single-tree alignment, multiple-tree alignment use different guide trees
for each set of sub-sequences. Sequences will be cut into pieces first, then each set of
the sub-sequences will build their own guide tree to guide their individual alignments.
Figure 3.3 shows the details of multiple-tree alignment.
The guide tree for multiple-tree alignment is also built using UPGMA algorithm.

3 .4 .3

S p ee d P erfo rm a n ce

One of the major advantages of multithreaded programming is program speedup with
respect to time efficiency, because each thread processes a different piece of the same
job simultaneously and independently. However, this is obviously the case for the
multiple-tree alignment program, but not quite true for the single-tree alignment
(Figure 3.4), which does not gain any speed improvement after some point. The
reason is th a t in single-tree implementation, no m atter how many threads are used,
every time a single guide tree for the full length sequences are built at the start
of alignment for all the sub-alignments done by different threads. It appears th a t
40

Decompose

V

'X
PO

PI

P2

P3

sub alignm ents
C oncatenate

V
o v e r a ll a l i g n m e n t

Figure 3.3: Details of multiple-tree alignment
building the uniformed alignment guide tree becomes the hot spot (the most timeconsuming part) for all alignment procedures in the single-tree approach.
The figure also shows th a t the multiple-tree alignment program achieves seven
times faster than the sequential program where one thread was used. We also note
th a t the m ultithreaded multiple-tree algorithm is faster than the multithreaded single­
tree algorithm. In other words, using the same number of threads, the multiple-tree
approach performs faster and uses less memory.

3 .4 .4

A lig n m e n t S e n s itiv ity

There are several publicly available databases which have benchmark alignments.
A widely used one is BAliBase by Julie Thompson et al (Section 3.2). BAliBASE
provides a module (BaliScore) th a t defines two scores. SP (Sum-of-Pair) score is
the ratio of the number of correctly aligned pairs of positions in the test (predicted)
alignment to the number of aligned pairs in the reference (structurally informed)
alignment. TC (Total-Column) score is the ratio of the number of correctly aligned

41

Speedup with Respect to Number of
Threads
1600
1400
1200
lo 1000
E
0) 800
E 600
F
400
200
0
1

3

4

5

6

7

8

9

10

Number of Threads
-A— Multiple tree alignm ent —

Si ngl e tree alignm ent

Figure 3.4: Speed improvement for single-tree and multiple-tree implementations in
terms of different number of threads
columns in the test alignment to the number of aligned columns in the reference
alignment. Both SP and TC scores range from 1.0 for perfect agreement to 0.0 for
no agreement. The designers of BAliBASE recommend SP score as the best quality
score for Refsl, 2 and 3, TC score as the best score for Refs4 and Refs5 [81].
Currently, the tests were done mainly based on Refs2, thus the following Figure
3.5 inflects the average SP scores calculated by BAliScore in terms of the number of
threads used by the single-tree and multiple-tree alignment programs. It turns out
th a t the quality of alignments drops down for both approaches when the number of
threads increases, as unwanted gaps are inserted at the start or the end positions of
the sub-alignments; This thus brings more gaps in the final full-length alignments
and infects the values of SP scores.

42

SP-Scores with Respect to the
Number of Threads

£
8
(/)
D.
m

1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1

2

3

4

5

N um b^ of Threads
■Multiple tree alignment —

Single tree alignment

Figure 3.5: SP-scores for single-tree and multiple-tree implementations with respect
to different number of threads

3.5

S en sitiv ity Im provem ents

3.5 .1

O v erla p p in g A lig n m e n t

This approach is considered to reduce the number of unwanted gaps introduced at
the start or the end positions of the sub-alignments when multiple threads are used.
In this approach, cut points are actually extended half length of the original sub se­
quence. Assume the length of original sub-sequence is t. The real length participating
in the alignment is t 4- (t/2 ), where the (t/2) part overlaps the first half of its next
neighbor. However, at the reassembling stage, those

(t/2) overlapping parts will be

ignored and only the first t characters in each subsequence will be extracted to build
the fin a l full le n g th a lig n m e n t.

43

3 .5 .2

S lid in g W in d o w s for C u t P o in ts C a lc u la tio n s

When divide-and-conquer technique is applied for sequence alignment, finding an
effective and efficient algorithm to calculate the cut positions for sequences is always
a big issue.

The basic idea of sliding window th a t we come up for cut position

calculation is explained below.
Assume we have two sequences SI, 82, and N number of threads are going to be
used for their alignment.
• Firstly, cut SI evenly in terms of the number of threads, and mark the first cut
position with 0. Then, mark the positions left to 0 with -1, -2, , -(S l/N )/2 ; and
mark the positions right to 0 with -fl, -f-2, , -|-(Sl/N )/2
• Secondly, in S2 mark the same position with the same number as th a t in SI
accordingly
• For simplicity purpose, we assume th a t SI has sliding window of size 2, and an
optimal cut position for S2 will reside within 1 left or right shifts of position
0, the sliding windows for the first cut position of both sequences will look like
below (Figure 3.6):

SI:

-1

+1

-1

+1

S2:

-2

+2

Figure 3.6: Illustrating the sliding window technique for improving alignment accu­
racy
The cut position of S2 will be computed in terms of the m a x im u m score of the
following alignments:

Score(O) = align [ S l(-l,-t-l), S2(-l,-l-l)]
44

Score(-1) = align [ Sl(-1,+1), S2(-2,-l)]
Score(+l) = align [ Sl(-1,+1), S2(+l,+2)]

For example, if max = score(+l), then the first cut point of S2 will be in between
cells +1 and +2.
Therefore, for general cases, if multiple sequences have similar length S and N
number of threads will be used for the alignments, then the size of the sliding windows
(thresholds) for all sequences will be less or equal than S/N, with half of the sliding
cells being negatively marked and the other half positively marked. The maximum
score calculated to get the cut position for next sequence upon the previous one
will then be computed among score(O), score(-l), ..., score(-S/2N), score(+l),

...,

score(+S/2N), depending on the sliding window size of its previous sequence.

3.6

C onclusions

The same set of multiple sequences can be progressively aligned either by a single guide
tree or multiple guide trees. The multiple-tree and single-tree alignment programs
presented in this thesis are coming from the idea of combining the use of divide-andconquer technique with the progressive alignment.

Multiple-tree alignment seems

having a better speedup performance than single tree alignment. The results also
show th a t in terms of the number of threads used, using two threads has the best
performance, which retains the similar accuracy as using one thread but accelerate
the alignment by saving half of the time. However, neither the single tree program
nor the multiple tree program shows very satisfying sensitivity results as the number
of threads increases.

45

C hapter 4
G raph-based C onsensus M nltip le
G ene Order A lignm ent
Multiple gene order alignment, sometimes called multiple genome alignment, is a
larger-scaled multiple sequence alignment. In the last chapter, a m ultithreaded ver­
sion for sequence alignment, to accelerate the alignment processing was introduced.
This chapter will focus on the parallel processing techniques for multiple genome
alignment. Previous approaches for gene-order analysis were largely based on break­
point analysis or gene rearrangements, and lack parallelism properties. As a result,
a novel algorithm for alignment of genome sequences based on a graph-based con­
sensus method is described in this chapter. This algorithm can be applied to align
two or more gene order sequences, obtain the ancestor gene order sequence, and be
implemented in parallel, to further increase its computation speed.

4.1

C om p lexity A n alysis

In m u ltip le g e n e -o rd e r a lig n m e n t, o n e e d it o p e ra tio n c o n sists o f th e in v e rsio n , o r

reversal, of any number of consecutive terms in the ordered set. In the case of signed
orders, the operation also reverses the polarity of each term within the scope of the
inversion. The calculation of the distance for unsigned genomes with inversions only
46

is NP-hard; for signed problem it is of polynomial complexity. For multi-chromosome
genomes, another im portant edit operation is reciprocal translocation, representing
the exchange of term inal fragments between two chromosomes. Some formulations
of the distance problem for translocation are of polynomial complexity, and some are
of NP-hardness. For the algorithm proposed here, the inversion operation is only
considered for a single genome sequence.

4.2

R ation ale for G raph T heory

Conceived by Euler, Cayley, and Hamilton, graph theory flourished in the twenties
century to become a critical component of discrete mathematics. In the 1950s, Sey­
mour Benzer applied graph theory to show th a t genes are linear [43]. W hen the
Human Genome Project started, DNA sequencing was a routine but time-consuming
and hard-to-autom ate procedure. In 1988 four groups of biologists independently and
simultaneously suggested a different sequencing technique called Sequencing by Hy­
bridization, abbreviated as SBH. SBH as a Hamiltonian P ath Problem, and SBH as
an Eulerian P ath Problem lead to the graph algorithms for sequence reconstruction.
Similarly, in this chapter, a graph algorithm for multiple gene order alignment will
be introduced.
As mentioned at the beginning of this chapter, previous approaches to gene-order
analysis were largely based on break-point analysis or gene rearrangements, and lack
of parallelism properties. The algorithm th a t is going to be proposed here introduces
a potentially simpler approach for multiple alignment of gene orders based on a notion
of precedence. The algorithm can also be implemented in parallel to further increase
its speed.
The original idea of this algorithm came from the mathematical and com putational
linguistics [71] [67]. In this view, “gene order” can be compared with “canonical word
order” for linguistic sequences [34]. The approach presented here is an extension of
an earlier work for calculating the canonical word order based on word order and

47

precedence constraints [66].

4.3

T he P reced en ce G raph-B ased C onsen su s A l­
gorith m

In the following an alternative algorithm is presented for aligning multiple genome
sequences and constructing consensus sequences. Similarly to the other graph based
approaches, a graph will be constructed over sequence fragments representing the data
set. However, this is a different algorithm th a t does not explicitly try to calculate
sequence alignments or overlaps. Instead, the graph will be traversed greedily, and
a path will be tried to pursue through the graph so th at each branch followed is as
consistent with the previous one as possible. The goal of this alignment heuristic are
to allow for more efficient gene order alignments th a t can accommodate for inversions,
translocations, deletions, and insertions in the evolutionary process between ancestor
and successor organisms. The precedence based gene order alignment algorithm pre­
sented here have a time complexity of 0 (d * n^)) where d is the number of genomes
being aligned and n is the number of gene blocks contained in the genomes.

4.3 .1

T h e M in im u m S p a n n in g T ree

A weighted graph, also known as a network is a graph whose lines are weighted. The
meaning of the weights depends on the application. For example, an airline might use
a weighted graph to represent the routes between cities th a t it serves. In this example,
the vertices represent the cities and the edges represent a route between two cities.
The weight of the edge could represent the flight distance or the price of the flight
between the two cities. The weight information, in this case, can be stored as the
intersection value in an adjacency matrix. The representation of a simple network in
adjacency m atrix is shown in Figure 4.1.
A spanning tree is a tree th a t contains all of the vertices in the graph. The network

48

0

6

3

0

0

0

6

0

2

5

0

0

3

2

0

3

4

0

0

5

3

0

2

3

0

0

4

2

0

5

0

0

0

3

5

0

Figure 4.1: A simple network represented by a weighted graph and an adjacency
matrix
shown in Figure 4.1 is also a spanning tree of the adjacency matrix.
Interesting algorithms, such as Kruskal’s algorithm [47] and Prim ’s algorithm [63],
derive the m inim um spanning tree of a weighted graph such th a t the sum of its
weights are guaranteed to be minimal. If the weights in the graph are unique, then
there will be only one minimum spanning tree.

Otherwise, there may be one or

more minimum spanning trees. To create a minimum spanning tree in a strongly
connected network, th a t is, in a weighted graph in which there is a path between
any two vertices, the edges for the minimum spanning tree are chosen so th a t the
following properties exist:
1. Every vertex is included.
2. The total edge weight of the spanning tree is the minimum possible th a t includes
a path between any two vertices.
Figure 4.2 demonstrates how the spanning tree shown in Figure 4.1 is developed
into a minimum spanning tree.
Similarly, in multiple gene order alignment, a precedence m atrix is built for each
49

genome sequence. The intersection values interpret the relative positions of different
genes th a t reside in the same genome, the details of which is described in the next
section.

T h e o rigin a l s p a n n i n g tree

0
(a) I n s e r t f i rs t v e r t e x

( b ) I n s e r t e g d e AC

( c ) I n s e r t e d g e BC

(e) I n s e r t e d g e D E

(d ) I n s e r t e d g e C D

(g) T h e final m i n i m u m s p a n n i n g tree

(f) I n s e r t e d g e D F

Figure 4.2; A simple network represented by a weighted graph and an adjacency
matrix

50

4 .3 .2

A n E x a m p le

This algorithm is based on precedence distance and contrasts with breakpoint or re­
versal distances. Consider the following three gene sequences:

a) 1 4 6 7 5 3 2
b) 1 4 6 5 7 2 3
c) 1 4 5 7 6 2 3

A gene order alignment of these sequences would like to be obtained. For this
purpose a precedence m atrix for each sequence is calculated. For the first sequence
(a), its precedence m atrix is:
1

4

6

7

5

3

2

1

0

-1

-2

-3

-4

-5

-6

4

1

0

-1

-2

-3

-4

-5

6

2

1

0

-1

-2

-3

-4

7

3

2

1

0

-1

-2

-3

5

4

3

2

1

0

-1

-2

3

5

4

3

2

1

0

-1

2

6

5

4

3

2

1

0

In the matrix, each cell (i, j) has a corresponding value n, which specifies the
distance between genes i and j in the same genome, and is distinguished by positive
direction and negative direction. Positive n specifies gene i (the row list) recedes n
positions from gene j (the column list) ; negative n specifies gene i precedes gene j by
n positions; 0 simply shows gene i and gene j are the same.
The precedence m atrix for the genome (b) is displayed as following:

51

1 4

6

5

7

2

3

1

0

-1

-2

-3

-4

-5

-6

4

1

0

-1

-2

-3

-4

-5

6

2

1

0

-1

-2

-3

-4

5

3

2

1

0

-1

-2

-3

7

4

3

2

1

0

-1

-2

2

5

4

3

2

1

0

-1

3

6

5

4

3

2

1

0

Similarly, one can compute the precedence m atrix for the third sequence (c)
1

4

5

7

6

2

3

1

0

-1

-2

-3

-4

-5

-6

4

1

0

-1

-2

-3

-4

-5

5

2

1

0

-1

-2

-3

-4

7

3

2

1

0

-1

-2

-3

6

4

3

2

1

0

-1

-2

2

5

4

3

2

1

0

-1

3

6

5

4

3

2

1

0

Next, these three matrices are merged by adding corresponding values of (i, j)
together to obtain the summarized m atrix th a t results in the final alignment. For
example, to calculate the value for cell (4,5) for the final matrix, which is in fact the
distance between gene 4 and gene 5, the values of correspondent cell (4,5) in the three
matrices are added up, i.e. (-3)+ (-2) + (-1). The value corresponding to cell (4,5)
will then be -6.
The finally merged m atrix is shown as following:

52

1

2

3

4

5

6

7

1

0

-16

-17

-3

-9

-8

-10

2

16

0

-1

-13

7

8

6

3

17

1

0

14

8

9

7

4

3

-13

-14

0

-6

-5

-7

5

9

-7

-8

6

0

1

-1

6

8

-8

-9

5

-1

0

-2

7

10

-6

-7

7

1

2

0

Number 0 in the m atrix does not necessarily represent two identical genes i and j
now, as the addition operation is involved. Moreover, the resulted m atrix is diagonally
symmetric, thus only half of it needs to be calculated and the calculation for the other
half is about copying the symmetric value and in the meantime, flipping the negative
or positive sign accordingly.

The merged m atrix gives a precedence m atrix from

aligning the three sequences. If the aligned sequence can be reconstructed from this
matrix, then the resulted multiple alignment for the three sequences will thus be
obtained.

Figure 4.3; The minimum spanning tree built from the precedence m atrix of sequence
(a)

53

Taking a look at the precedence m atrix corresponding to sequence (a), one can
realize th a t traversing the vertices of th e m inim um spanning tree for this m atrix
will be 1,4,6,7,5,3,2 which represents exactly the original sequence (a) (Figure 4.3).
The same thing happens for sequence (b) and (c) as well. In the ideal case, a directed
graph is constructed, where each gene or gene block is a vertex, and edges represent
distances between genes and gene blocks. A correct assembly of the original sequence
is then constructed through this graph by pre-order traversing the minimum spanning
tree of a precedence m atrix of a genome sequence.
W hen coming to find out the consensus or ancestor of these three sequences, the
same minimum tree traversal technique applies: calculating the minimum spanning
tree for the final merged matrix, then the pre-order traversal of this tree will give the
final aligned consensus sequence, or called the ancestor sequence for all (Figure 4.4).
The consensus sequence of sequence (a), (b), and (c) is: 1 4 6 5 7 2 3

Figure 4.4: The minimum spanning tree built from the summarized final precedence
matrix

4 .3 .3

A lg o r ith m F o rm a liza tio n

Sequence alignment and gene order alignment consist of different operations.

Se­

quence alignment is about aligning individual characters, and thus insertion and dele­
54

tion are involved. Gene order alignment cares more about gene order arrangement
than words matching, and does not allow any change or modification applied to the
current genes themselves. Another key difference between sequence comparison and
gene order comparison is th a t in the former, gaps are introduced as required, whereas
in the latter, the rearrangement of gene orders does not allow the insertion of gaps.
Figure 4.5 shows the flow of the algorithm.
SI

S2

S4

S3

S um

M inim al Spanning T re e

J^ ^ ra v e rse

Consensus G enom e Sequence

Figure 4.5; Illustrating the steps of the precedence graph-based consensus algorithm
for genome sequence alignment
Based on a summarized precedence m atrix for a set of genome sequences, a con­
sensus genome sequence is constructed by building a minimum spanning tree through
the matrix. The tree is traversed, trying to follow edges so th a t the vertices along
the path are consistent with a subset of the genes at corresponding positions of the
55

sequences.
The algorithm takes as parameters all the genome sequences to be aligned. They
will be used to build their own precedence m atrix respectively. Assume there are n
genome sequences, then n precedence m atrix will be built accordingly for each of the
sequences.
Building the final consensus matrix, the algorithm starts out by filling the first
row and first column of the m atrix with the union set of the genes appeared in all
the sequences in sorted ascending order. The rest of the cells of the m atrix will then
be filled with the summarized value from the corresponding cells of the sequence
matrices.
Then, the algorithm constructs a minimum spanning tree for the summarized ma­
trix. Each time it picks the edge(s) in positive or negative direction with the possible
minimum abstract value. By pre-order traversing the tree, a consensus sequence will
be obtained, which is also the final alignment for all the sequences. Note th a t a
slightly different final aligned sequence will be obtained depending on whether the
tree is traversed in an increasing or decreasing order for each vertex of sub trees from
right to left. Figure 4.6 shows an example for different final alignments obtained by
different traversal orders.

Many algorithms and methods th a t have been widely used for current multiple
genome sequence alignment.

However, problems like repeated genes in the same

sequence, different sequence containing different kinds of genes, and different sequence
with different number of genes, are not considered in most of the current genome
alignment software, such as GRIMM [92].
The graph-based consensus algorithm, a completely different approach from the
existent anchor-based genome alignment algorithms and methods, can be used to
align unsigned multiple genomes with some or all of the following properties:
• Genomes contain repeated genes

56

T r a v e r s i n g th e a b o v e t r e e in d e c r e a s i n g p r e - o r d e r , o b tain:

1, 4, 6, 7, 9, 8, 5, 3, 2, 10
I n s t e a d , t r a c e r s i n g it in i n c r e a s i n g p r e - o r d e r , obtain:

1, 4, 6, 7, 8, 3, 2, 10, 5, 9

Figure 4.6: An example showing how to obtain different final alignments by traversing
the minimum spanning tree in different orders
• Genomes are of different length
• Genomes consist of different kinds of genes
However, at this stage, the algorithm is at its early stage of evolution, and does
not give any suggestions for aligning signed genome sequences. For example, the two
orientations i and -i of gene i are treated as two completely different genes regardless
of the fact th a t one is the reversal of the other. Besides, the algorithm now does not
automatically change the sign of gene -i or i during the alignment. Gene -i and i will
retain their original signs throughont the alignment process.

57

4.4

P arallelism o f th e A lgorith m

The graph-based algorithm for multiple gene order alignment is straightforward, sim­
ple and easy to perform, and contains four steps:
1. Calculate an initial precedence m atrix for each individual genome sequence.
2. Summarize the individual matrices and get the final m atrix for consensus se­
quence.
3. Build a minimum spanning tree in terms of the final matrix.
4. Traverse the minimum spanning tree to get the consensus sequence.
The step th a t can be performed in parallel is the first step where the initial
precedence m atrix calculation for each genome sequence is independent of each other.
This step would become the most time-consuming part of the algorithm if the number
of genome sequences is fairly large and the computation is performed sequentially. In
another words, simultaneously calculating the initial precedence matrices for all the
sequences will tremendously reduce the computation time and increase the speed.

4.5

E valuations and C om parisons

To the best of our knowledge, there is not yet a standard reference database used
as benchmark for multiple gene order (or genome) alignment. But there are still
ways available for evaluating the sensitivity or accuracy of an alignment generated by
certain algorithms or software.

4 .5 .1

E v a lu a tio n w ith o u t a R e fer en ce a lig n m en t

W ithont a correct alignment of the multiple genomes, an alignment calculated by any
algorithm or method cannot be evaluated for accuracy. In fact, no manually generated
multiple alignment benchmark data sets account for genome-scale evolutionary events

58

such as inversion, rearrangement, and horizontal transfer. Despite the lack of correct
alignments, the alignment accuracy can be estimated by modeling evolution and
aligning simulated d ata sets.
The inferential power yielded by evaluating alignment accuracy using simulated
evolution is only as strong as the degree to which the simulation faithfully represents
the actual evolutionary processes th a t governed the history of the genomes under
study. In order to be able to evaluate a genome alignment, at lease a simplistic model
of genome evolution needs to be constructed, which captures the major types, pat­
terns, and frequencies of events in the history of the related genomes. Building such
a genome evolution model involves many complex elements, such as a rooted phylo­
genetic tree, an ancestral sequence, evolved sequences, regions conserved throughout
the simulated evolution. To effectively represent genome evolution, the simulation
must include nucleotide substitutions and indels (insertion and deletion) in addition
to genome-scale events such as horizontal transfer, inversion, and rearrangement.

4 .5 .2

E v a lu a tio n a g a in st a R e fe r e n c e a lig n m en t

Due to the complexity involved in evaluating genomes without a reference alignment,
current evaluations are mostly done with a reference alignment based on what is
available. There are two main schemes for comparing a proposed multiple genome
alignment to a reference multiple genome alignment: the gene score and the sum-ofpairs score (different to the SP score mentioned in the previous section).
It is well known th a t pairwise alignment optimize residue exchange scores and gap
penalties. Extending the pairwise sequence scores to get a single score for a multiple
alignment would be an obvious way of scoring multiple alignments. This is referred
to as the Sum -of-P airs (SP) score for alignment, which is used widely by multiple
sequence a lig n m e n t. T h e sum-of-pairs (S P ) is a c o m m o n sc o rin g sch em e, w h e re th e
score of each pair of sequences of the multiple alignment is added up to form the
overall score. The higher the SP-score is, the better the multiple alignment is. For
example, assume there are 4 sequences SI, S2, S3, S4. The sum-of-pairs score of the
59

alignment for all the 4 sequences is:

SP(S1, S2, S3, 84) = score(Sl, 82) + score(Sl, S3) +score(Sl, S4) + score(S2,
S3) + score (S2, S4) + score(S3, S4)

The gene score of a multiple genome alignment (MGA) is calculated by comparing
the alignment genes of the proposed MGA with those in the corresponding reference
MGA, and only those identical ones will be taken as correct ones. This is a more
salient measure than the sum-of-pairs scores, where over all observed aligned gene
pairs in a reference MGA, the fraction of those observed in the corresponding target
MGA is compiled. Whereas, a single misaligned sequence can zero the gene score,
the SP score only gradually goes down with more misaligned sequences. Note th a t
the SP scoring system here involves two MGAs, and is therefore different than the
previously mentioned SP scoring system for a single MGA without a reference.
As for the reference alignment, currently, there are quite a few multiple genome
alignment software th a t is available online or free for downloading, such as Mauve
[94], MGA [95], M-GGAT [96], GRIMM [92], GHAINER [91], and MALGEN(2.0)
[93].

4.6

V alidation

As mentioned above, the evaluation of multiple gene order (genome) alignment could
be done with or without a reference alignment. Due to the complexity of evaluation
without a reference alignment, most of the contemporary validations are done with a
reference alignment. Since this algorithm is at its early stage of evolution and does
not automatically change the signs of the genes, or in another words, does not con­
sider the self-reversal of the genes, genes will retain their original signs throughout the
alignment process. As a result, it would not be very ideal to evaluate the accuracy of
the graph-based consensus algorithm using any reference alignments th a t are gener-

60

ated by various anchor-based algorithms th a t have taken the genes’ self-reversal into
consideration. Some heuristics and modifications need to be applied to this algorithm
to automatically take genes’ self-reversal into consideration, and thus make it more
reasonable to be compared with any other reference alignments.
Fortunately, based on this work Pringle et al.

[62] extended the graph-based

approach proposed in this thesis and developed a more advanced precedence graphbased consensus approach and parallelized it utilizing the Message Passing Interface
System. In order to accommodate for the self-reversal, insertion and deletion cases,
a presence m atrix has been added to our approach, which basically keeps track of
the number of contributions th a t have been made to a particular gene pair distance.
Pringle et al. also took weight distances into consideration by applying a weight
function to the cells of precedence m atrix based on the count and the distance sum of a
particular gene pair, and penalizing precedence cells th a t do not have full participation
from all gene orders being aligned.
In the evaluation of [62], two metrics were considered, the timing of the program
and accuracy of the results it produces. The program was timed and the accuracy
of a particular gene order arrangement was determine from the average percentage
displacement between genes in the gene alignment and gene in the proper result. As
well, the presence or absence of genes was considered between the final and alignment
in the accuracy tests.
Pringle et al. shows th a t there is a linear relation between the timing of the utility
and the number of genomes being considered by our proposed graph-based method.
In the mean time, there is a polynomial relation between the size of the genomes and
the program timing which is what was expected from the time complexity analysis.
Pringle et al. also showed th a t the accuracy of the results obtained by our proposed
approach seemed to remain constant when switching between the original precedence
system, the weighted system and the weighted penalty system. The accuracy of the
results seems to indicate th a t our graph-based approach can handle deletions and
inversions quite well.

61

4.7

C onclu sion

Precedence gene order alignment is a promising approach to genome alignment that
offers a below exponential solutions to the problem. The research in multiple sequence
analysis can be compared to the developments in the study of canonical word order in
linguistics [34] [66]. The multiple gene order analysis can be compared with a level of
analysis in linguistics where the focus is on diachronic change in structure and syntax
rather than lexicon. The precedence graph-based consensus algorithm for calculating
a multiple alignment for gene orders is a linguistically motivated approach in the
area of multiple genome alignment, which is completely different from the existing
anchor-based alignment algorithms and methods. The core of the algorithm is th a t
pre-order traversing the minimum spanning tree of a precedence m atrix of a genome
sequence constructs the original sequence. The advantage of this method is partly
efficiency, since it computes the final alignment by computing a minimum spanning
tree from a two-dimensional m atrix and finding a path through the minimum spanning
tree, which is computable in n^ time, while computing the final alignment from the
anchored-based approaches for the unsigned genome, the complexity is more than
polynomial. However, This original method of precedence calculation is valid in the
cases of translocation and reversal mutations between gene orders and a common
ancestor but in the case of deletions or insertions it can cause a false minimal edge
in the precedence m atrix which could have ill effects on the result of the alignment.

62

C hapter 5
D iscussion and Future D irections
This thesis focused on improving the computational speed of the multiple biologi­
cal sequence alignment, including aligning characters (for DNA, RNA and protein
sequences) and aligning gene orders (for genome sequences) in com putational molec­
ular biology. Some issues related to multiple alignment accuracy and evaluation were
also introduced and discussed.
Sequence alignment deals with comparing different DNA or different protein se­
quences. This is done by writing one on top of the other padding them with spaces
( “indel” , for insertion or deletion) to achieve identical length. In DNA, the crite­
rion to distinguish among the many possibilities of this arrangement is the number
of unequal letters ending up on top of each other minus the number of spaces that
were introduced. For protein sequence comparison the pairs of matched letters are
weighted and the adjacent spaces are summarized into blocks which receive a penalty.
Alignments may be performed on a pairwise basis, across multiple sequences or
they can involve the alignment of a sequence to a previously aligned set of sequences
(sometimes called a profile). The information exhibited by a multiple sequence align­
m e n t allow s th e d e d u c tio n of p u ta tiv e s tr u c tu r a l a n d fu n c tio n a l features. Through

a reliable multiple sequence alignment of a set of homologous sequences, the evolu­
tionary pathway, corresponding to mutations as well as insertions and deletions of
sequence fragments, can often be traced under the model of divergent evolution.
63

Based on the mixed idea of Progressive alignment and Divide-and-conquer align­
ment, two different multithreaded multiple sequence alignment programs, depending
on how the guide tree(s) would be applied, were implemented for checking the im­
provements of alignment speed and sensitivity.

The single-tree alignment built a

uniform guide tree for the full-length sequences at the beginning, which were used by
all the sub-alignments as the guide tree. In the multiple-tree alignment, sequences
were firstly cut into pieces and these sub-sequences built their own guide trees to guide
their individual alignments. Multiple-tree alignment seemed having a better speedup
performance than the single tree alignment, but neither of them, at this stage, showed
ideal sensitivity results as the number of threads increases. Therefore, some heuristic
methods for fixing the cut points were suggested for future improvement, such as
overlapping alignment and sliding window alignment.
Gene order alignment is a larger-scale sequence alignment. The key difference be­
tween gene order analysis and sequence analysis is th a t sequence analysis is at the gene
level (i.e. individual character mutations), and gene order analysis is at the chromo­
some or genome level (i.e. gene order mutations). High-throughput DNA sequencing
technology has enabled researchers to rapidly determine the genome sequences of a
wide variety of organisms, laying the fonndation for comparative genomics [15]. Gene
order analysis, sometimes called Genome sequence analysis, help scientists make sense
of diverse elements in the genome, understanding how they are organized within the
genome of each species, and characterizing the changes in genome organization dur­
ing evolution. Comparative genomics plays an important role in making inferences
and gathering information specific to the evolution of a species or genetics diseases.
It provides insights to im portant tasks such as identifying regions of homogeneity or
regions of genetic anomalies. The research in multiple gene rearrangement has been
the basis for development of algorithms for multiple gene order alignment [22].
There are similar complexity problems with multiple gene order alignment (or
referred as multiple genome alignment) algorithms.

Besides, the existing anchor-

based gene order alignment algorithms and methods lack the parallelism properties for

64

computational speedup. As a result, a precedence graph-based consensus algorithm
was formalized for aligning two or more genome sequences, which is very original,
and was able to efficiently obtain the ancestor genome sequence for multiple genome
sequences. The algorithm argues for a potentially simpler approach for multiple gene
order alignment based on a notion of precedence derived from the canonical word
order studies in linguistics. This algorithm also possesses the parallelism properties
to be implemented in parallel for further speed improvement. However, at the current
stage, this algorithm does not consider the self-reversal of genes during the alignment
process, and thus some improvement and extension of this work needs to be done in
the future.

5.1

C on trib u tion

In this thesis, two levels of parallelism are considered for multiple biological sequence
alignments to accelerate the alignment processes: one is at the sequence level (aligning
the characters in a DNA, RNA or protein sequence), the other one is at the gene
order level (aligning the gene orders for a genome sequence). To the best of our
knowledge, it is a novel algorithm to progressively align the same set of multiple
sequences in parallel by building different number of phylogenetic guide trees. In the
meantime, the precedence based gene order alignment approach is a promising first
step in non-exponential gene order approaches. Its timing was shown to have a cubic
time complexity and the approach overall was shown to have a high to moderate
accuracy range. Other than anchor-based algorithms and methods, it is the first time
th a t a graph algorithm was developed for multiple genome (gene order) alignment.
Briefly, the contribution can be summarized as the following:
• Sequence level: multithreaded techniques together with different tree-building
techniques was deployed specifically for the multiple alignment of extremely
long sequences. It turns out th a t the multiple-tree alignment program achieves
seven times faster than the sequential program where one thread was used.
65

• Gene order level: opposed to the existing complex anchor-based algorithms and
methods for handling gene order alignments th a t lack parallelism properties
and cannot handle gene insertion, deletion, and duplications, a simpler and
novice precedence graph-based consensus algorithm th a t can be parallelized is
formalized to speed up the alignment process, which is also able to handle gene
insertion, deletion, and duplications.

5.2

Future D irection s

5.2.1

M u ltip le S e q u en ce A lig n m e n t

The multiple-tree and single-tree alignment implementations described in this thesis
presented ideal efficiency improvement for multiple sequence alignment. The opti­
mal m ulti-threaded approach is with two threads, where the alignment sensitivity
is retained and the speed is increased by saving half of the alignment time. How­
ever, neither of the multiple-tree implementation nor the single-tree implementation
shows very satisfying sensitivity results as the number of threads increases. There­
fore, improving their computation efficiency and in the meantime maintaining their
sensitivity performance is still an issue. The Following three ways could be considered
and investigated in the future.
1. Firstly, some more effective calculations should be found and performed to
decide the cut points.
2. Secondly, weights could be considered and given to the sequences: down weight­
ing the sequences th a t are very similar to other ones in the d ata set and up
weighting the most divergent sequences. The weights could be calculated di­
rectly from the branch lengths in the initial guide tree and the guide tree guides
all the subsequent alignments.
3. Thirdly, affine gap penalties and varying substitution matrices may be applied
dynamically in the progressive alignment.
66

5 .2 .2

M u ltip le G en e O rder A lig n m e n t

The original method proposed in this thesis for determining precedence simply uses
the sum of relative positions of gene pairs across all gene orders to determine th at
gene orders precedence value in the final precedence matrix. Though Pringle et al.
[62] has made a considerate contribution to it, there is still a lot of room for improving
its performance, such as:
1. Modifying the algorithm and making it be able to consider the self-reversal of
genes during alignment process. Some heuristics may be applied.
2. Clarifying the ways for minimum spanning tree traversal: whether increasing
pre-order traversal or decreasing pre-order traversal gives a better result.
3. Evaluating its alignment results against the reference alignments generated by
other approaches.
4. Developing a parallelized version of this algorithm for arbitrary number of
genome sequences and making it publicly available.
5. Currently this metric does not handle insertions th at well so future implemen­
tations may better utilize the presence m atrix to increase its overall accuracy
in the case of insertion mutations.

67

B ibliography
[1] M. I. Abouelhoda, and E. Ohlebusch. A Local Chaining Algorithm and
its Applications in Comparative Genomics. Proceedings of the 3rd Work­
shop on Algorithms in Bioinformatics, 2812:1-16, 2003.
[2] M. I. Abonelhoda, and E. Ohlebusch. CHAINER: Software for Compar­
ing Genomes. In 12th International Conference on Intelligent Systems
for Molecular Biology/3rd European Conference on Computational Biobgy
[3] S. Alum , N. Futam ura and K. Mehrotra. Parallel Biological Sequence
Comparison Using Prefix Computations. J. Parallel and Distributed
Computing, 63(3):264-272, 2003.
[4] A. Apostolico , M. J. Atallah , L. L. Larmore and S. MacFaddin. Efficient
Parallel Algorithms for String Editing and Related Problems. SIAM J.
Computing, 19(5):968-988, 1990.
[5] Arun K. Iyengar, Parallel Characteristics of Sequence Alignment Algo­
rithms, In the Proceedings of the ACM /IEEE conference on Supercom­
puting, Pp. 304-313, Reno, United States, 1989.
[6] P. Baldi, Y. Chauvin, T. Hunkapiller, and M. A. McClure. Hidden
Markov models of biological primary sequence information. Proc. Nat.
Acad. Sci. 91:1059-1063, 1994.

68

[7] S. Batzoglou, L. Pachter, J. P. Mesirov, B. Berger and E. S. Lander. Hu­
man and Mouse Gene Structure: Comparative Analysis and Application
to Exon Prediction. Genome Research, 10:950-958, 2000.
[8] M. P. Berger and P.J. Munson. A Novel Randomized Iteration Strategy
for Aligning Multiple Protein Sequences. Computer Applications in the
Biosciences, 7:479-484, 1991.
[9] G. Bourque and P. A. Pevzner, Genome-Scale Evolution : Reconstruct­
ing Gene Orders in the Ancestral Species, Genome Res. 12(l):26-36, Jan
2002 .

[10] M. Brudno, and B. Morgenstern. Fast and sensitive alignmetn of large
genomic sequences. Lecture notes in Computer Science. 138-147, 2002.
[11] M. Brudno, S. Malde, A. Poliakov, C. B. Do, and et. al. Global
alignment: finding rearrangements during alignment. Bioinformatics,
19(1):154-162, 2003.
[12] M. Brudno, C. B. Do, and G. Cooper. Langan and nulti-lagan: Efficient
tools for large-scale multiple alignment of genomic DNA. Genome Res.
13:721-731, 2003.
[13] L. Cai, D. Juedes and E. Liakhovitch. Evolutionary Com putation Tech­
niques for mutiple sequence alignment. Congress on evolutionary computation 2000, 829-835, 2000.
[14] J. Cheetham, F. Dehne, S. Pitre, A. Rau-Chaplin, and P. J. Taillon.
Parallel CLUSTAL W For PC Clusters. International Conference on
Computational Sciences and Its Applications, 2003.
[15] A. E. Darling, B. Man, M. Craven, and N. T. Perna. Multiple alignment
of rearranged genomes. Proceedings of the 2004 IEEE Computational
Systems Bioinformatics Comference (CSB 2004), 748-749, 2004.
69

[16] A. E. Darling, B. Mau, F. R. Blattner, and N. T. Perna. Mauve: multiple
alignment of conserved genomic sequence with rearrangements. Genome
Research, 14:1394-1403, 2004.
[17] David Sankoff. Historical linguistics as a stochastic process. PhD thesis,
McGill University, 1969
[18] David Sankoff, M athematical developments in lexicostatistical theory. In
T. A. Sebeok (Ed.), Gurrent trends in linguistics 11: Diachronic, areal
and typological linguistics, 93-112. The Hague: Mouton.
[19] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. W hite and
S. L. Salzberg. Alignment of Whole Genomes. Nucleic Acids Research
27:2369-2376, 1999.
[20] A. L. Delcher, A. Phillippy, J. Carlton and S. L. Salzberg. Fast Algo­
rithm s for largescale genome alignment and comparison. Nucleic Acids
Research 30:2478-2483, 2002.
[21] J. S. Deogun, J. Yang, and F. Ma. EMEGEN: an efficient approach to
multiple whole genome alignment. The 2nd Asia Pacific Bioinformatics
Conference (APBG2004), Dunedin, New Zealand. Conferences in Re­
search and Practice in Information Technology, 29:113-121, 2004.
[22] T. Dobzhansky and A. Sturtevant, Inversions in the chromosomes of
Drosophila pseudoobscura. Genetics 23:28-64, 1938.
[23] S. R. Eddy. Multiple alignment using hidden Markov models. Third
international conference on intelligent systems for molecular biology
(IS M B ). C a m b rid g e E n g la n d : M en lo P a r k CA : A A A I P re s s (1995).

[24] R. C. Edgar. MUSCLE: a multiple sequence alignment m ethod with
reduced time and space complexity. BMC Bioinformatics, 5: 113, 2004.

70

[25] R. C. Edgar. MUSCLE; multiple sequence alignment with high accuracy
and high throughput. Nucleic Acids Research 32(5):1792-97, 2004.
[26] E.W. Edmiston, N.G. Core, J.H. Saltz and R.M. Smith, Parallel process­
ing of biological sequence comparison algorithms. International Journal
of Parallel Programming, 17(3), Pp. 259-275, 1988.
[27] Jenome Gracy and Jean Sallantin. Multiple sequence alignment us­
ing anchor points through generalized dynamic programming. Proceed­
ings of Genome Informatics Workshop 1994. Universal Academic Press,
Tokyo, 1994.
[28] R. R. Gonzalez. Multiple protein sequence comparison by genetic algo­
rithms, SPIE-98, 1999.
[29] O. Gotoh. Significant improvement in accuracy of multiple protein se­
quence alignments by iterative rehnements as assessed by reference to
structured alignments. J. Mol. Biol., 264:823-838, 1996.
[30] S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip (poly­
nomial algorithm for sorting signed permutations by reversals). In Proc.
27th Annual ACM Symposium on the Theory of Computing, (a): 178189,1995.
[31] S. Hannenhalli and P. Pevzner. Transforming men into mice (applica­
tions of comparative physical maps in molecular evolution). The pro­
ceedings of the 7th annual AVM-SIAM symposium on discrete algo­
rithms, Philadelphia: SIAM, 304-313, 1996.
[32] S. H a n n e n h a lli a n d P. P e v z n e r. T o cut, o r n o t to c u t (p o ly n o m ia l a l­

gorithm for sorting signed permutations by reversals). In Proc. 27th
Annual ACM Symposium on the Theory of Computing, (a):178-189,
1995.
71

[33] D. Haussier, A. Krogh, I. S. Mian, and K. Sander. Protein modeling
using hidden markov models: analysis of globins. Proceedings for the
26th Hawaii International Conference on Systems Sciences. Wailea HI
U.S.A.: Los Alamitos CA: IEEE Computer Society Press, 1993.
[34] J. Hawkins. A Performance Theory of Order and Constituency, Cam­
bridge University Press, 1994.
[35] J. Heringa. Two strategies for sequence comparison: profile-preprocessed
and secondary structure-induced multiple alignment. Comput. Chem.,
23(3-4):341-64, 1999.
[36] P. Hogeweg, B. Hesper. The alignment of sets of sequences and the con­
struction of phyletic trees: an integrated method. J Mol Evoh, 20(2):17586,1984.
[37] M. Hohl, S. Kurtz, and E. Ohlebusch. Efficient Multiple Genome Align­
ment. Proceedings of the Tenth International Conference on Intelligent
Systems for Molecular Biology, Bioinformatics, 18(S1):S312-S320, 2002.
[38] X. Huang, A space-efficient parallel sequence comparison algorithm for
a message passing multiprocessor, International Journal of Parallel Pro­
gramming, 18(3), Pp 223-239, 1989.
[39] R. Hughey, A. Krogh. Hidden Markov models for sequence analysis:
extension and analysis of the basic method. Computer Applications in
Biological Science, 12:95-107, 1996.
[40] M. Ishikawa, M. Hoshida, M. Hirosawa, T. Toya, O. Kentaro, and K.
N itta . P r o te in S e q u en ce A n a ly sis P ro g ra m : M u ltip le S e q u e n c e Align­

ment by Parallel Iterative Aligner. Demonstrations In t’l Conf. Fifth
Generation Computer Systems, Tokyo, 57-62, 1992.

72

[41] M. Ishikawa, M. Hoshida, M. Hirosawa, T. Toya, O. Kentaro, and K.
N itta. Protein Sequence Analysis Program by Parallel Inference Ma­
chine. Proc. In t’l Conf. Fifth Generation Computer Systems, Tokyo,
294-299, 1992.
[42] James W. Fickett, Fast Optimal Alignment, Nucl. Acids Res. 12(1) :175179,1984.
[43] N. C. Jones and P. Pevzner, An Introduction to Bioinformatics Algo­
rithms, MIT Press, 2004.
[44] W. J. Kent and A. M. Zahler. Conservation, Regulation, Synteny, and
Introns in a Large-scale C. briggsae-C. elegans Genomic Alignment.
Genome Research 10:1115-1125, 2000.
[45] J. Kim, S. Pram anik and M. J. Chung. Multiple sequence alignment
using simulated annealing. Comp. Applic. Biosci., 10:419-426, 1994
[46] J. Kleinjung, N. Douglas, J. Heringa. Parallelized multiple alignment.
Bioinformatics, 18:1270-1271, 2002.
[47] J. B. Kruskal. On the shortest spanning subtree and the traveling sales­
man problem. In: Proceedings of the American M athematical Society,
7:4850, 1956.
[48] E. Lander, J.P. Mesirov and W. Taylor, Protein sequence comparison
on a d ata parallel computer. In the Proceedings of the International
Conference on Parallel Processing , Pp.257-263, 1998.
[49] T. Lassmann, E. L. Sonnhammer. Quality assessment of multiple align­
ment programs. FEBS Lett., 529(l):126-30, 2002.
[50] C. Lee, C. Grasso, and M. F. Sharlow. Multiple sequence alignment
using partial order graphs. Bioinformatics, 18(3):452-64, 2002.
73

[51] H. Carrillo and D. J. Lipman. The multiple sequence alignment problem
in biology. SIAM J. Appl. Math., 48(5):1073-1082, 1988.
[52] D. Lipman, S. Altschul, and J. Kececioglu. A tool for multiple sequence
alignment. In Proc. Natl. Acad. Sci. U.S.A., 86:4412-4415, 1989.
[53] W. S. Martins, et al. A M ultithreaded Parallel Implementation of a Dy­
namic Programming Algorithm for Sequence Comparison. Pacific Sym­
posium on Biocomputing, 6:311-322, 2001.
[54] C. Michener and R. Sokal. A quantitative approach to a problem in
classification. Evolution, 11:130-162, 1957.
[55] D. Mikhailov, H. Cofer, and R. Gomperts. Performance Optimization
of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL:
SCR, 2001.
[56] M. Monwar, and S. Rezaei. Parallelized Multiple Biological Sequence
Alignment with MPI: The Divide-and-Conquer Approach. To appear in
the proceedings of the IMECS 2006 Conference, Hong Kong, June, 2006.
[57] B. M.E. Moret, D.A. Bader, T. Warnow, S.K. Wyman, and M. Yan.
GRAPPA: a high-performance computational tool for phylogeny recon­
struction from gene-order data. Botany 2001, Albuquerque, NM, August
12-16, 2001.
[58] B. Morgenstern. DIALIGN 2: improvement of the segment-to-segment
approach to multiple sequence alignment. Bioinformatics, 15(3):211-8,
1999.
[59] C. Notredame. Recent progress in multiple sequence alignment: a sur­
vey. Pharmacogenomics, 3:131-144, 2002

74

[60] C. Notredame, D. Higgins, J. Heringa. T-CofFee: A novel method for
multiple sequence alignments. Journal of Molecular Biology, 302:205217, 2000
[61] C. Notredame, D. Higgins. SAGA: sequence alignment by genetic algo­
rithm. Nucleic Acids Res., 24:1515-1524, 1996.
[62] R. Pringle, and S. Rezaei. Precedence Based Multiple Gene Order Align­
ment. To appear in UNBC working papers in Bioinformatics, 2006.
[63] R. C. Prim. Shortest connection networks and some generalisations. In:
Bell System Technical Journal, 36:1389-1401, 1957.
[64] S. Raj ko and S. Alum. Space and Time Optimal Parallel Sequence
Alignments. IEEE transactions on parallel and distributed systems,
15(12):1070-1081, 2004.
[65] S. Ranka and S. Sahni. String Editing on an SIMD Hypercube Multi­
computer. J. Parallel and Distributed Computing, 9:411-418, 1990.
[66] S. Rezaei. Fuzzy Word Order Constraints. Constraints vs. Preferences
workshop, Poznan, Poland, 1999.
[67] S. Rezaei. Linguistic and Computational Analysis of Word Order and
Scrambling in Persian, PhD thesis, Edinburgh University, Feb. 1999.
[68] S. Rezaei, and M. Monwar. Divide and Conquer Algorithm for ClustalWMPI. To appear in the proceedings of the IEEE Canadian Conference
on Electrical and Computer Engineering, Ottawa, May, 2006.
[69] N. Saitou, and M. Nei. The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Molecular Biology Evol, 4(4):406-425,
1987.

75

[70] D. Sankoff and M. Blanchette. Multiple genome rearrangement and
breakpoint phylogeny. Journal of Computational Biology, (5)555-570,
1998.
[71] D. Sankoff. Historical linguistics as a stochastic process. PhD thesis,
McGill University, 1969.
[72] D. Sankoff. M athematical developments in lexicostatistical theory. In T.
A. Sebeok (Ed.), Current trends in linguistics 11: Diachronic, areal and
typological linguistics, 93-112. The Hague: Mouton.
[73] D. Sankoff and M. Blanchette, Multiple genome rearrangement, RECOMB, 1998.
[74] D. Sankoff, Genome rearrangement with gene families. Bioinformatics,
15:909-917, 1999.
[75] M. Schmollinger, K. Nieselt, M. Kaufmann, and B. Morgenstern. Dialign P: fast pair-wise and multiple sequence alignment using parallel
processors. BMC Bioinformatics, 5:128-135, 2004.
[76] V. A. Simossis, J. Kleinjung and J. Heringa. Homology-extended se­
quence alignment. Nucleic Acids Research, 33(3):816-824, 2005.
[77] P. Sneath and R. Sokal. Numerical Taxonomy. Freeman, San Francisco,
1973.
[78] Stjepan Rajko, Space and Time Optimal Parallel Sequence Alignments,
In the Proceedings of the International Conference on Parallel Processing (ICPP'03), 2003.
[79] Jens Stoye. Multiple sequence alignment with the divide-and-conquer
method. Gene, 211:GC45-GC56, 1998.

76

[80] J. Stoye, S. W. Perry, and A. Dress. Improving the divide-and-conquer
approach to sum-of-pairs multiple sequence alignment. Appl. Math.
Lett., 10(2):67-73, 1997.
[81] J. Thompson, F. Plewniak, et al. BAliBASE: A Comprehensive
comparison of multiple alignment programs. Nucleic Acids Research,
27(13):2682-2690, 1999.
[82] J. D. Thompson., D. G. Higgins, and T. J. Gibson. CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, positions-specific gap penalties and weight
m atrix choice. Nucleic Acids Research, 22:4673-4680, 1994.
[83] J. D. Thompson, T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G.
Higgins. The ClustalX windows interface: flexible strategies for multi­
ple sequence alignment aided by quality analysis tools. Nucleic Acids
Research, 24:4876-4882, 1997.
[84] D. G. Higgins, J. D. Thompson and T. J. Gibson. Using CLUSTAL for
multiple sequence alignments. Methods EnzimoL, 266:237-244, 1996.
[85] J. Thompson, F. Plewniak, et al. BAliBASE (version 2.0): A bench­
mark alignment database, including enhancements for repeats, trans­
membrane sequences and circular permutations, http://ww w -igbm c.ustrasbg.fr/Biolnfo/BAliBASE2/, 2003.
[86] T. Treangen, and X. Messeguer. M-GCAT: multiple genome compari­
son and alignment tool. The 5th Annual Spanish Bioinformatics Conference(JBI 2004), 30-33, 2004.
[87] L. Wang, and T. Jiang. On the complexity of multiple sequence align­
ment. J. Comp. Biol. l(4):337-348, 1994.

77

[88] W. W ilbur and D. Lipman, Rapid similarity searches of nucleic acid and
protein d ata banks, Proc. Nat. Acad. Sci. USA, 726-730, 1983.
[89] T. K. Yap, O. Frieder, and R. L. Martino. Parallel computation in bio­
logical sequence analysis. IEEE transactions on parallel and distributed
systems, 9(3);283-293, 1998.
[90] C. Zhang and A. K. Womg. A genetic algorithm for multiple molecular
sequence alignment, compt. Appl. Biosci., 13:561-581, 1997
[91] CHAINER:

Software

for

Comparative

Ge­

nomics. http://theorie.inform atik.uni-ulm .de/Personen/m ibrahim /chainer.htm l
[92] GRIMM:

Genome

Rearrangement

Algorithms.

http://www.google.ca/firefox?client=hrefox-ar&:ls=org.mozilla:enUS:ofRcial
[93] MALGEN:

Multiple

ALignment

of

GENomes.

http://alggen.lsi.upc.es/recerca/align/intro-align.htm l
[94] Mauve:

Multiple

Genome

Alignment,

http: / / gel.ahabs.wisc.edu/mauve/
[95] MGA:

Multiple

Genome

Aligner.

http://bibiserv.techfak.uni-

bielefeld.de / m ga/
[96] M-GGAT:

Multiple

Genome

Comparison

and

Alignment

http://alggen.lsi.upc.es/recerca/align/m gcat/intro-m gcat.htm l

78

Tool.