Exact Algorithms for the Longest Common Subsequence Problem for Arc-Annotated Sequences

Jiong Guo

May 13, 2002
Contents

1 Introduction . . . 5
2 Biological Motivation . . . 9
  2.1 Some Molecular Biology . . . 9
  2.2 Biological Motivation . . . 12
3 Some Basic Definitions . . . 13
  3.1 LCS and Some Problems from Graph Theory . . . 13
  3.2 Parameterized Complexity . . . 15
  3.3 Arc Annotation . . . 20
4 Previous Results . . . 27
  4.1 Classical Complexity . . . 27
  4.2 Parameterized Complexity . . . 29
  4.3 Complexity of the Arc-Preserving Subsequence Problem . . . 30
  4.4 Overview of This Work . . . 31
5 c-fragment, c-diagonal LAPCS . . . 33
  5.1 c-fragment LAPCS(crossing, crossing) . . . 33
  5.2 c-diagonal LAPCS(crossing, crossing) . . . 38
  5.3 LAPCS(unlimited, unlimited) . . . 40
6 An Algorithm for LAPCS(nested, nested) . . . 43
7 Arc-Preserving Subsequence Problems . . . 53
  7.1 NP-Hardness of APS(crossing, chain) . . . 53
  7.2 APS(nested, nested) . . . 57
8 Conclusions . . . 69
  8.1 Summary of Results . . . 69
  8.2 Future Work . . . 70
Chapter 1
Introduction

Algorithms on sequences of symbols have been studied for a long time and now form a fundamental part of computer science. One of the most important problems in the analysis of sequences is the LONGEST COMMON SUBSEQUENCE (LCS) problem. The computational problem of finding the longest common subsequence of k sequences has been researched extensively over the last twenty years, and it plays a special role in the field of sequence algorithms. This is partly for historical reasons (many sequence and alignment ideas were first worked out for the special case of LCS), and partly because LCS often seems to capture the desired relationship between the strings of interest. The problem has many applications [6, 14, 25]. For k = 2, the longest common subsequence is a measure of the similarity of two sequences and is thus useful in pattern recognition [21], text compression [22] and, particularly, in molecular biology.

Sequence-level investigation has become essential in modern molecular biology. "The digital information that underlies biochemistry, cell biology, and cell development can be represented by a simple string over letters G, A, T and C. This string is the root data structure of an organism's biology [23]." But to consider genetic molecules only as long sequences consisting of the 4 basic constituents is too simple an approach to determine the function and physical structure of the molecules. For this purpose, other information about the sequences and their parts should be added to the sequences. One prominent source of such
information in molecular biology is the secondary and tertiary structure of the molecules. For example, it is well known that the secondary and tertiary structural features of RNAs are important in the molecular mechanisms involved in their functions. While the primary structure of a molecule is the sequence of bases, its secondary and tertiary structures reveal how the sequence folds into a three-dimensional structure. RNA secondary and tertiary structures are represented as a set of bonded pairs of bases. A bonded pair of bases (base pair) is usually represented as an edge between the two complementary bases involved in the bond. In tertiary structure, the bonds can cross each other, while secondary structure has no crossing bonds. A bond in secondary structure can lie either inside or outside other bonds. Hence, the ability to analyze molecules requires taking all the primary, secondary and tertiary information into account. More biological background is discussed in Chapter 2.

Early works with this additional information are primary-structure based: the sequence comparison is basically done on the primary structure while trying to incorporate secondary structure data [3, 8]. This approach has the weakness that it does not treat a base pair as a whole entity. Recently, an improved model was proposed [10, 11]. In this model, the secondary and tertiary information is combined with the basic sequence, which represents the primary information, to affect subsequent analysis. This system of representing additional information is called an annotation scheme. The objects used in this annotation are so-called arcs. An arc is a link or an edge that joins two symbols of the sequence; it corresponds to the chemical bond between a base pair in the RNA sequence. The RNA structure can then be represented as a base sequence with arc annotations. We call these sequences arc-annotated sequences. Arc annotations are defined and discussed in Chapter 3.
For related studies concerning algorithmic aspects of (protein) structure comparison using "contact maps", refer to [13, 18]. In this work, we will follow this new model and examine the classical LCS problem for sequences with different arc annotations. The focal points are two arc annotations: (crossing, crossing), where the two sequences represent tertiary
structures of two RNAs, and (nested, nested), which corresponds to an instance of two RNA sequences with secondary structure. Since superimposing arc structures on the basic sequences creates many natural parameters, we explore both the classical and the parameterized complexity [1, 9, 12] of the LCS problem for sequences with different arc annotation schemes.

A summary of previous work is given in Chapter 4. In Chapter 5, we prove that c-fragment (or c-diagonal) LAPCS(crossing, crossing), parameterized by the length l of the desired subsequence, is fixed-parameter tractable, i.e., it belongs to the complexity class FPT. In Chapter 6, we give an FPT algorithm for LAPCS(nested, nested) with parameters k1 and k2, where k1 and k2 are the numbers of deletions from the two sequences that we have to make to obtain an arc-preserving common subsequence. In Chapter 7, we answer some open questions for the Arc-Preserving Subsequence problem and give an algorithm which solves the Arc-Preserving Subsequence problem with arc structure (nested, nested) in polynomial time. The last chapter summarizes the results of this work and outlines some directions for future research.
Chapter 2
Biological Motivation

The purpose of this chapter is to provide a brief introduction to molecular biology, especially to DNA and RNA sequences. Here, we only give a few basics; more details can be found in [26].
2.1
Some Molecular Biology
A cell has two classes of molecules: large and small. The large molecules, known as macromolecules, are of three types: DNA, RNA, and protein, among which DNA and RNA are the molecules of most interest to us. DNA is the basis of heredity and is composed of small molecules called nucleotides, which are referred to as bases: adenine (A), cytosine (C), guanine (G), and thymine (T). For our purpose, a DNA molecule can be viewed as a long sequence over the four-letter alphabet Σ = {A, C, G, T}. The DNA contained in the cell is known as the genome. The genome of a human has about 3 × 10^9 letters, and each human cell contains the same DNA.

For each base, there is a complementary base: A is paired with T, and C is paired with G. This pairing is formed by hydrogen bonds, and it is essential for the structure of the DNA and for the replication and transcription of its code. The idea is that a single DNA sequence (or strand), e.g., ACCTGAA, is paired to a complementary strand TGGACTT, as shown in Figure 2.1. DNA usually occurs double-stranded, and the bases on one strand fit together
with a complementary sequence of bases on the other strand. These two strands form a helical three-dimensional structure. Figure 2.2 presents such a structure.

[Figure 2.1: Two complementary DNA strands, ACCTGAA paired with TGGACTT.]
[Figure 2.2: The double helix.]
DNA can be replicated from existing DNA. This replication starts with a double helix that has been separated into two single strands. Each single strand is then used as a template for a new double strand. In this way, two identical DNA molecules are produced, each containing one strand of the original molecule.

DNA strands can also be transcribed into RNA. RNA is a related nucleic acid (ribonucleic acid), and it can be modeled as a word over another four-letter alphabet of ribonucleotides, Σ = {A, C, G, U}, where thymine (T) is replaced by uracil (U). RNA is single-stranded. One strand of the DNA is used as a template for a single strand of RNA that is built by moving along the DNA strand. Afterwards, the double-stranded DNA remains as before, and a single strand of RNA has been generated.

A specific type of RNA, messenger RNA (mRNA), is read to produce a protein. The genetic code on the mRNA is a language in which triples of the 4 bases (there are 64 possible combinations) specify either a single amino acid or the termination of the protein sequence; such a triple of nucleotides is called a codon. Proteins are built at the ribosomes of a cell, where the mRNA picks up complementary transfer RNA (tRNA). tRNA is another single-stranded RNA molecule, lacking the complementary strand that DNA has. This molecule tends to fold back on itself to form a three-dimensional cloverleaf structure built from approximately 80 bases. See Figure 2.3 for a tRNA.
[Figure 2.3: A tRNA.]

Amino acids are linked to these small tRNA molecules, and the tRNA interacts with the codon of the mRNA. In this way, tRNA carries the appropriate amino acid to the mRNA. The ribosomes, complexes made of RNA and protein where the protein defined by a messenger RNA is synthesized, also contain RNA whose three-dimensional structure enables them to interact physically with the other molecules. The three-dimensional structure of RNA can, thus, be extremely important for its function, and evolution is likely to preserve common structures. Determining the correct fold of a protein is a major open problem in protein analysis. The converse problem, finding an amino acid sequence that will produce a particular folding or structure, is another great challenge in molecular biology.
2.2
Biological Motivation
Arc-annotated sequences can be used to describe the secondary and tertiary structures of RNA and protein sequences. Therefore, the problem of comparing arc-annotated sequences has applications in the structural comparison of RNA and protein sequences, and it has received much attention in the literature recently. One common way to measure the similarity of two sequences is pairwise sequence comparison, e.g., the longest common subsequence algorithm.

RNA performs a wide range of functions in biological systems. In particular, it is RNA that contains the genetic information of viruses such as HIV and therefore regulates the functions of such viruses. Furthermore, it is widely known that the secondary and tertiary structural features of RNA are essential for the molecular mechanisms involved in its function. Thus, it is of great interest to know how RNA folds to achieve its specific biological functions.

A typical feature of RNA molecules is that the comparison of individual sequences can provide information concerning their common structural features. During the course of evolution, a number of mutations have occurred in these molecules. Comparative analysis of those variations may clarify how such mutations can happen. The common features preserved in the course of evolution are likely to be of importance for function. Hence, the ability to compare RNA structures forms the foundation for further study of RNA. When we represent the secondary and tertiary structure of RNA as a basic sequence with arc annotation, algorithms for the longest common subsequence problem for two arc-annotated sequences can play a key role in identifying a preserved secondary and tertiary structure, which corresponds to a preserved molecular conformation and to a preserved function.
Chapter 3
Some Basic Definitions

Since we will explore the classical and parameterized complexity of the LCS problem for arc-annotated sequences, this chapter gives some basic definitions and terminology which we will use in the following chapters. In the first section, we give a formal definition of the LCS problem and introduce some problems that originally arise in graph theory and are useful for our analysis of LCS on arc-annotated sequences. Section 3.2 is concerned with parameterized complexity. Since we cannot cover all aspects of parameterized complexity, interested readers are referred to [9]. Arc annotation and the LCS problem for arc-annotated sequences are the main subjects of the last section. The definitions of the various levels of arc annotation and of the Longest Arc-Preserving Common Subsequence problem are taken from [11].
3.1
LCS and Some Problems from Graph Theory
As mentioned in Chapter 1, our main method to analyze the similarity of sequences is pairwise comparison. Thus, the central problem in this work derives from the LCS problem, which is very important in both classical and parameterized complexity. Here, we give definitions of subsequence and the LCS problem.

Definition 3.1 Subsequence
Given two sequences S1, S2 over some alphabet Σ, S2 is a subsequence
of S1 if S2 can be obtained from S1 by deleting some letters from S1. The length of a sequence S is denoted by |S|. For simplicity, we use S[i] to refer to the i-th letter in S, and S[i1, i2] to denote the subsequence of S from the i1-th letter to the i2-th letter (1 ≤ i1 ≤ i2 ≤ |S|).

Definition 3.2 Longest Common Subsequence Problem (LCS)
Given a set of k sequences S1, S2, ..., Sk over some alphabet Σ, the longest common subsequence problem asks for a longest sequence P that is a subsequence of S1, S2, ..., and Sk.

To date, most research has focused on deriving efficient algorithms for the LCS problem for k = 2. This problem can be solved by dynamic programming in time O(|S1| · |S2|) [15]. If the number of sequences k is unrestricted, the LCS problem is NP-complete [22]. However, certain algorithms for the case k = 2 have been extended to yield algorithms that require O(n^(k−1)) time and space, where n is the length of the longest of the k sequences [2, 16].

In order to prove some complexity results for the LCS problem on arc-annotated sequences, we will use reductions to or from some problems in graph theory with known complexity. These problems are vertex cover, independent set, and clique.

Definition 3.3 Vertex Cover, VC
An edge e of an undirected graph G = (V, E) is incident to a vertex v if v is one of the endpoints of e. A set of vertices V′ ⊆ V is called a vertex cover of G if for each e ∈ E there exists a v ∈ V′ such that e is incident to v (i.e., ∃u ∈ V such that (u, v) = e). Given an undirected graph G = (V, E) and a positive integer k, the vertex cover problem asks whether G has a vertex cover of size at most k.

Definition 3.4 Independent Set, IS
Let G = (V, E) be an undirected graph, and let I ⊆ V. We say that I is an independent set if for each pair i, j ∈ I with i ≠ j there is no edge between i and j. The independent set problem asks, given a parameter k, whether there is an independent set I with |I| ≥ k.

Definition 3.5 Clique
Given an undirected graph G = (V, E) and a parameter k, the clique problem asks whether there is a vertex set C ⊆ V with |C| ≥ k such that for all vertices u, v ∈ C with u ≠ v there is an edge between u and v.

These three problems are all known to be NP-complete [24]. It is easy to see that a set VC is a vertex cover of G if and only if V \ VC is an independent set of G. Also, there is a vertex cover of G of size k if and only if there is an independent set of G of size |V| − k. Note that the vertex cover problem is a minimization problem, while the independent set problem is a maximization problem.
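The k = 2 dynamic program mentioned above can be sketched as follows; this is a minimal illustration of the classical O(|S1| · |S2|) recurrence, with function and variable names of our own choosing.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of a longest common subsequence of s1 and s2,
    computed by the classical O(|S1| * |S2|) dynamic program."""
    n, m = len(s1), len(s2)
    # T[i][j] = LCS length of the prefixes s1[:i] and s2[:j]
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                # matching last letters extend the LCS of the shorter prefixes
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                # otherwise drop the last letter of one of the two prefixes
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
    return T[n][m]
```

For example, lcs_length("AGGTAB", "GXTXAYB") returns 4 (a longest common subsequence is GTAB).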
3.2
Parameterized Complexity
In this section, we give an overview of the aspects of parameterized complexity which are relevant to this work. Classical polynomial complexity, its reductions, and NP-hardness are discussed in depth by Papadimitriou [24]. Parameterized complexity was introduced by Downey and Fellows [9].

Many natural problems have by now been shown to be NP-complete or worse, which means it is highly unlikely that efficient algorithms exist for these problems. However, for some of these problems we can find algorithms such that the main part of the problem instance contributes to the overall running time "in a good way" (e.g., polynomially), and identify those aspects of the input which determine the combinatorial explosion of the running time. These aspects can then be used as parameters, in the hope that they are small in applications. While the notion of polynomial time is central to the classical formulation of computational complexity, central to parameterized complexity is the notion of fixed-parameter tractability.
Definition 3.6 Fixed-Parameter Tractability
A parameterized problem L is fixed-parameter tractable if and only if there is an algorithm which can decide in time f(k) · n^c whether (x, k) ∈ L, where x is the input and k is the parameter. Further, n := |x|, c is a constant that is independent of both n and k, and f: N → R is an arbitrary function. We denote the family of all fixed-parameter tractable parameterized problems by FPT.

Here, we give an example of fixed-parameter tractability, which will be used in Chapter 5.

Definition 3.7 Maximum Independent Set-B, MAX-IS-B
Given a simple graph G = (V, E) in which each vertex has degree at most B, the MAX-IS-B problem asks for a maximum independent set of G.

Lemma 3.8 MAX-IS-B, parameterized by the size k of the independent set, can be solved by an FPT algorithm in time O((B + 1)^k · B^2).

Proof. We use the notation G − {u} to denote the deletion of the vertex u and all edges incident to u from the graph G. We construct a search tree of height k as follows. The root of the tree is labeled with an empty independent set I and the graph G. First, we find a vertex u of minimum degree, which has at most B neighbors {v1, v2, ...}. Any maximal independent set of G must contain either u or one of its neighbors, so we create the children of the root corresponding to these possibilities. The first child is labeled with {u} and G − {u} − {all neighbors of u}, the second is labeled with {v1} and G − {v1} − {all neighbors of v1}, and the other children are labeled in the same way for the remaining neighbors of u. There are at most B + 1 children of the root node. The set of vertices labeling a node represents a "possible" independent set, and the graph labeling the node represents what remains to be checked in G. In general, for a node labeled with a set of vertices S and a subgraph H of G, we choose a vertex v of minimum degree in H and create at most B + 1 child nodes. These child nodes are labeled in the same way as the children of the root node.
If we can create a node at height k in the tree, then an independent set of cardinality at least k has been found. There is no need to explore the tree beyond height k. As we can easily see, each node has at most B + 1 children. Thus, the tree has size at most (B + 1)^k. At each node, the deletion of a vertex u and its neighbors, together with all edges incident to them, can be done in time O(B^2). Therefore, this algorithm takes O((B + 1)^k · B^2) steps. □
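The search tree of Lemma 3.8 can be sketched as follows. This is a simplified recursive version (names and the adjacency-dict representation are our own; it realizes the at most (B + 1)^k branches but does not spell out the O(B^2) deletion bookkeeping of the proof):

```python
def has_independent_set(adj: dict, k: int) -> bool:
    """Bounded search tree for MAX-IS-B: decide whether the graph,
    given as a dict mapping each vertex to its set of neighbors,
    has an independent set of size at least k.

    Branch on a minimum-degree vertex u: any maximal independent set
    contains u or one of its at most B neighbors, so there are at
    most B + 1 branches per node, hence at most (B+1)^k tree nodes."""
    if k <= 0:
        return True
    if not adj:
        return False
    # pick a vertex of minimum degree
    u = min(adj, key=lambda v: len(adj[v]))
    for w in [u] + sorted(adj[u]):
        # take w into the independent set: delete w and all its neighbors
        removed = {w} | set(adj[w])
        sub = {v: adj[v] - removed for v in adj if v not in removed}
        if has_independent_set(sub, k - 1):
            return True
    return False
```

For the path graph 1–2–3–4 this reports an independent set of size 2 (e.g., {1, 3}) but none of size 3.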
As we have seen in classical complexity, the basic idea behind virtually all completeness results is the notion of a reduction. Therefore, we need a new kind of reduction which is "parameter preserving" and can be used to show that two problems have the same parameterized complexity.

Definition 3.9 Fixed-Parameter Reducibility
Let L and L′ be two parameterized problems, L ⊆ Σ∗ × N and L′ ⊆ Γ∗ × N. We say that L is fixed-parameter reducible to L′ if there are functions k ↦ k′ and k ↦ k′′ on N and a function (x, k) ↦ x′ from Σ∗ × N to Γ∗ such that
(a) (x, k) ↦ x′ is computable in time k′′ · |x|^O(1), and
(b) (x, k) ∈ L ⇔ (x′, k′) ∈ L′.

Before we establish a hierarchy of parameterized complexity, we need some definitions which help to define the classes in the hierarchy.

Definition 3.10 Boolean Circuit
A Boolean circuit is a directed graph G = (V, E), where the nodes in V = {1, ..., n} are called the gates of G. There are no cycles in the graph. Each node in the graph has a fan-in (the number of incoming edges). Each gate i ∈ V in the graph has a sort s(i) associated with it, where s(i) ∈ {TRUE, FALSE, AND, OR, NEGATION} ∪ {x1, x2, ...}. If s(i) ∈ {TRUE, FALSE} ∪ {x1, x2, ...}, then the fan-in of i is 0, that is, i has no incoming edges. Gates with no incoming edges are called input gates. Finally, there is one gate with no outgoing edges; it is called the output gate of the circuit.
Circuits can have gates of two types: a small gate has bounded fan-in, while a large gate has unbounded fan-in. A circuit which has no inputs of sort TRUE or FALSE can be thought of as representing a Boolean expression. Conversely, given a Boolean expression δ, there is a simple way to construct a circuit Cδ such that, for any truth assignment T appropriate to both (i.e., all variables in δ and Cδ are defined in T), T(Cδ) = TRUE if and only if δ is satisfied by the assignment T. A truth assignment T satisfies a Boolean expression δ if all variables in δ are defined in T and δ becomes true when the variables are replaced by their truth values in T. A circuit C has a weight-k satisfying assignment if the Boolean expression δ which corresponds to C has a satisfying assignment in which exactly k variables are set to TRUE. The construction of Cδ follows the inductive definition of δ and builds a new gate i for each subexpression encountered.

Definition 3.11 Circuit Depth
The depth of a circuit is the maximum number of gates on any path from an input gate to the output gate.

Definition 3.12 Circuit Weft
The weft of a circuit is the maximum number of large gates on any path from an input gate to the output gate.

Let Γ = {C1, C2, C3, ...} be a family of circuits. Associated with Γ is a basic parameterized language LΓ = {⟨Ci, k⟩ | Ci has a weight-k satisfying assignment}. By LΓ(t,h), we denote the subset of LΓ of circuits with weft t and depth h.

Definition 3.13 Basic Hardness Class
A parameterized problem L is in the complexity class W[t] if it is fixed-parameter reducible to LΓ(t,h), where the depth h is constant.

Definition 3.14 W Hierarchy
The W hierarchy is the set of the classes W[t] together with two further classes, W[SAT] and W[P]. W[P] denotes the class obtained by having no restriction
on the depth, i.e., circuits of polynomial size, and W[SAT] denotes the restriction to Boolean formulas of polynomial size. Hence, the W hierarchy is

FPT ⊆ W[1] ⊆ W[2] ⊆ ... ⊆ W[SAT] ⊆ W[P].

We conjecture that each of the containments is proper. W[SAT] denotes the class of problems reducible to weighted satisfiability, while W[P] denotes the class of problems reducible to weighted circuit satisfiability. Given a Boolean formula X and a positive integer k, weighted satisfiability asks whether X has a weight-k satisfying assignment; weighted circuit satisfiability asks whether a given decision circuit C has a weight-k satisfying assignment.

Some common problems known to be NP-complete in classical complexity fall into different classes of the W hierarchy when their natural parameters are used. For example, if the parameter is the desired size of the vertex subset, independent set and clique are both W[1]-complete, while vertex cover ∈ FPT. Vertex cover can be solved by an algorithm with running time O(kn + 1.2852^k) [7].
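To illustrate why vertex cover is fixed-parameter tractable, here is a much simpler O(2^k · |E|) branching sketch than the O(kn + 1.2852^k) algorithm of [7] (names are our own): pick any uncovered edge (u, v); every vertex cover must contain u or v, so branch on these two possibilities.

```python
def has_vertex_cover(edges: list, k: int) -> bool:
    """Decide whether the graph with the given edge list (pairs of
    vertices) has a vertex cover of size at most k, by 2-way
    branching on an arbitrary uncovered edge."""
    if not edges:
        return True   # nothing left to cover
    if k == 0:
        return False  # edges remain but no budget left
    u, v = edges[0]
    # branch 1: put u into the cover; branch 2: put v into the cover
    return (has_vertex_cover([e for e in edges if u not in e], k - 1)
            or has_vertex_cover([e for e in edges if v not in e], k - 1))
```

The search tree has depth at most k and branching factor 2, so the combinatorial explosion is confined to the parameter k, exactly as Definition 3.6 requires.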
Definition 3.15 Parameterized Variations of LCS
Given a set of k sequences S1, ..., Sk and a positive integer m, the parameterized variations of LCS ask for a sequence P of length at least m that is a subsequence of each of S1, ..., Sk. We refer to the variation with parameter k as LCS1, with parameter m as LCS2, and with parameters k and m as LCS3.

The parameterized complexity of these variations of the LCS problem is summarized in Table 3.1. The results are all due to Bodlaender et al. [4, 5].

The fixed-parameter tractability of a problem can illustrate some of the possibilities for problem parameterization. Computer scientists with a practical orientation frequently complain that the classical complexity framework is not sufficiently realistic. As shown by the example of vertex cover, parameterization provides a way to cope with NP-hardness.
Problem   Parameter   |Σ| unbounded        |Σ| fixed
LCS1      k           W[t]-hard, t ≥ 1     unknown
LCS2      m           W[2]-hard            FPT
LCS3      k, m        W[1]-complete        FPT

Table 3.1: Parameterized complexity of LCS
3.3
Arc Annotation
While the previous two sections provided some basic knowledge of classical and parameterized complexity, we will from now on focus on the main problem of this work, the Longest Arc-Preserving Common Subsequence problem.

The purpose of arc annotation is to express additional information about a sequence in such a way that the sequence and the additional information can be analyzed and manipulated simultaneously. Arcs represent binary relations between sequence symbols. Hence, they can be used to join base pairs that are chemically bonded in the represented biological sequence. This application is particularly relevant to RNA sequences, whose chemical bonds can be described by annotating the sequence with arcs. Figure 3.1 shows a part of a tRNA and its corresponding arc-annotated sequence.

Definition 3.16 Arc Annotation
An arc annotation set A of a sequence S is a set of pairs of positions in S:

A = {(i1, i2) | 1 ≤ i1 < i2 ≤ |S|} ⊆ {1, ..., |S|}^2

The sequence S with such an arc annotation is called an arc-annotated sequence, denoted by (S, A).

Since we incorporate both arcs and sequences into an overall measure of similarity, the definition of LCS must be adjusted to incorporate the arc structure. A common subsequence should also preserve the common arcs of the input sequences: a subsequence that selects both endpoints of an arc from one sequence must map those endpoints to the endpoints of some arc from the other sequence.
[Figure 3.1: A tRNA and its corresponding arc-annotated sequence ACGUGACGUAGCGUAGGGCCCGUAC.]

Definition 3.17 Longest Arc-Preserving Common Subsequence Problem (LAPCS)
Given two arc-annotated sequences (S1, A1) and (S2, A2), the LAPCS problem asks for the longest common subsequence of S1 and S2 which preserves the arcs, i.e., for a mapping MS ⊆ {1, ..., |S1|} × {1, ..., |S2|} such that

1. the mapping is one-to-one and preserves the order of the subsequence:
   ∀(i1, j1), (i2, j2) ∈ MS: i1 = i2 ⇔ j1 = j2 and i1 < i2 ⇔ j1 < j2;

2. the arcs induced by the mapping are preserved:
   ∀(i1, j1), (i2, j2) ∈ MS: (i1, i2) ∈ A1 ⇔ (j1, j2) ∈ A2;

3. the mapping produces a common subsequence:
   ∀(i, j) ∈ MS: S1[i] = S2[j].
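The three conditions of Definition 3.17 can be checked directly for a candidate mapping. The following sketch (names are our own) takes MS as a list of 1-based position pairs and the arc sets as sets of pairs (i1, i2) with i1 < i2:

```python
def is_arc_preserving_mapping(s1, a1, s2, a2, ms):
    """Check the three LAPCS conditions for a candidate mapping ms,
    a list of 1-based position pairs (i, j)."""
    ms = sorted(ms)
    # 1. one-to-one and order-preserving: after sorting, both
    #    coordinates must be strictly increasing
    for (i1, j1), (i2, j2) in zip(ms, ms[1:]):
        if not (i1 < i2 and j1 < j2):
            return False
    # 2. arcs induced by the mapping are preserved (it suffices to
    #    check ordered pairs, since arcs are stored as (smaller, larger))
    for (i1, j1) in ms:
        for (i2, j2) in ms:
            if i1 < i2 and (((i1, i2) in a1) != ((j1, j2) in a2)):
                return False
    # 3. matched positions must carry equal letters
    return all(s1[i - 1] == s2[j - 1] for i, j in ms)
```

For instance, mapping positions 1 and 4 of (ACGU, {(1, 4)}) onto positions 1 and 4 of (AGCU, {(1, 4)}) is arc-preserving, but the same mapping fails if the second sequence carries no arc.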
We call the pair ⟨i, j⟩ a base match if S1[i] = S2[j] for a pair of positive integers i and j. If S1[i1] = S2[j1], S1[i2] = S2[j2], (i1, i2) ∈ A1, and (j1, j2) ∈ A2 for integers i1 < i2 and j1 < j2, then the pair ⟨(i1, i2), (j1, j2)⟩ is an arc match.

A restricted version of the above problem is defined as follows.

Definition 3.18 Arc-Preserving Subsequence Problem (APS)
Given two arc-annotated sequences (S1, A1) and (S2, A2) with |S1| ≤ |S2|, the APS problem asks whether there is an arc-preserving mapping from S1 to S2, i.e., whether S1 can be obtained from S2 by deleting some bases, together with the arcs incident on these bases, from S2.

When arcs are used to link sequence symbols to represent non-sequential information, comparing the resulting annotated sequences is much more complex than classical LCS. Since, in the practice of RNA and protein sequence comparison, arc sets are likely to satisfy some constraints (e.g., bond arcs do not cross in the case of tRNA sequences), it is of interest to consider various restrictions on the arc structure. As we will see, the different restrictions on arc annotation can alter the computational complexity of the LCS problem.

Definition 3.19 Restricted Variations of LAPCS
There are four natural restrictions on the arc set A of a sequence S:

1. no two arcs share an endpoint:
   ∀(i1, i2), (i3, i4) ∈ A: (i1 ≠ i4) ∧ (i2 ≠ i3) ∧ (i1 = i3 ⇔ i2 = i4)

2. no two arcs cross each other:
   ∀(i1, i2), (i3, i4) ∈ A: i1 ∈ [i3, i4] ⇔ i2 ∈ [i3, i4]

3. no two arcs nest:
   ∀(i1, i2), (i3, i4) ∈ A: i1 ≤ i3 ⇔ i2 ≤ i3

4. no arcs:
   A = ∅

These four restrictions produce five levels of permitted arc structures:

• unlimited: no restrictions,
• crossing: restriction (1),
• nested: restrictions (1) and (2),
• chain: restrictions (1), (2), and (3),
• plain: restriction (4).

In the following, LAPCS(x, y) denotes an LAPCS problem where the arc structure of S1 is of level x and the arc structure of S2 is of level y; we assume that x is at the same level as y or higher. Note that the problem LAPCS(nested, nested) effectively models the similarity between two tRNA sequences, in particular their secondary structures. Table 3.2 shows the inclusion relations between the levels of restriction on LAPCS(x, y).

Moreover, we give the definitions of two special cases of the LAPCS problem, which were first studied in [20]. These special cases are motivated by biological applications [14, 19].

Definition 3.20 c-fragment LAPCS Problem (c ≥ 1)
Given two arc-annotated sequences which are divided into fragments of length exactly c (the last fragment may have length less than c), the allowed matches are those between fragments at the same location. For example, all matches induced by a 2-fragment LAPCS are required to have the form ⟨2i − 1/2 ± 1/2, 2i − 1/2 ± 1/2⟩, i ≥ 1.
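The restrictions of Definition 3.19 can be tested directly. The following sketch (names are our own) classifies an arc set, given as pairs (i1, i2) with i1 < i2, into the lowest of the five levels that permits it:

```python
def arc_level(arcs):
    """Return the lowest of the five levels of Definition 3.19 that
    permits the given arc set (pairs (i1, i2) with i1 < i2)."""
    a = sorted(arcs)
    pairs = [(p, q) for k, p in enumerate(a) for q in a[k + 1:]]
    # violation of restriction (1): two arcs share an endpoint
    share = any(set(p) & set(q) for p, q in pairs)
    # violation of restriction (2): two arcs cross, i1 < i3 < i2 < i4
    cross = any(p[0] < q[0] < p[1] < q[1] for p, q in pairs)
    # violation of restriction (3): two arcs nest, i1 < i3 < i4 < i2
    nest = any(p[0] < q[0] and q[1] < p[1] for p, q in pairs)
    if share:
        return "unlimited"
    if cross:
        return "crossing"
    if nest:
        return "nested"
    return "chain" if a else "plain"
```

For example, {(1, 4), (2, 3)} is nested, {(1, 3), (2, 4)} is crossing, and {(1, 3), (1, 4)} needs the unlimited level because the two arcs share an endpoint.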
(unlim, unlim) ⊃ (unlim, cross) ⊃ (unlim, nest)  ⊃ (unlim, chain) ⊃ (unlim, plain)
                        ∪                ∪                ∪                ∪
                 (cross, cross) ⊃ (cross, nest)  ⊃ (cross, chain) ⊃ (cross, plain)
                                         ∪                ∪                ∪
                                  (nest, nest)   ⊃ (nest, chain)  ⊃ (nest, plain)
                                                          ∪                ∪
                                                   (chain, chain) ⊃ (chain, plain)
                                                                           ∪
                                                                    (plain, plain)

Table 3.2: Problem inclusions for the different levels of restriction. We use (x, y) to denote the arc structures of the two arc-annotated sequences. The symbols ⊃ and ∪ indicate the inclusion relations between different levels resulting from the restriction hierarchy. (unlim, unlim) is the most general Longest Arc-Preserving Common Subsequence problem, and (plain, plain) is the unannotated Longest Common Subsequence problem.
Definition 3.21 cdiagonal LAPCS problem (c ≥ 0) cdiagonal LAPCS is an extension of cfragment LAPCS, where base S1 [i] is allowed only to match bases in the range S2 [i − c, i + c]. The cdiagonal and cfragment LAPCS problems are relevant in the comparison of conserved RNA sequences where we already have a rough idea about the correspondence between bases in the two sequences. The arc structure can provide many natural parameters for the Longest ArcPreserving Common Subsequence problem. In the following, we give two examples of such parameters concerning arc structure. Definition 3.22 Cutwidth Given an arcannotated sequence (S, A), the cutwidth of the arc structure is the maximum number of arcs that pass by or end at any position of the sequence.
Definition 3.23 Bandwidth
Given an arc-annotated sequence (S, A), the bandwidth of the arc structure is the maximum distance between the two endpoints of an arc, i.e., if we denote the bandwidth by d, then for every (i1, i2) ∈ A, i2 − i1 ≤ d. Figure 3.2 illustrates an arc-annotated sequence with a cutwidth of 3 and a bandwidth of 6.

Figure 3.2: Cutwidth and Bandwidth. The arc-annotated sequence S has 8 bases. There are 3 arcs passing by or ending at the 4th base, and no other base has more arcs passing by or ending at it. Thus, S has a cutwidth of 3, denoted by k. It is obvious that the arc between the 2nd and 8th bases is the longest arc of S. The bandwidth of S is then 6, denoted by d.
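Both parameters are straightforward to compute from the definitions. The sketch below uses a hypothetical arc set chosen to be consistent with Figure 3.2 (the thesis does not list the arcs of the figure explicitly):

```python
def cutwidth(n, arcs):
    # Definition 3.22: maximum number of arcs passing by or ending at
    # any position p of the sequence (positions are 1-based).
    return max(sum(1 for (i, j) in arcs if i <= p <= j)
               for p in range(1, n + 1))

def bandwidth(arcs):
    # Definition 3.23: maximum distance between the two endpoints of an arc.
    return max(j - i for (i, j) in arcs)

# Hypothetical arcs for an 8-base sequence: three arcs pass by or end at
# position 4, and the longest arc spans positions (2, 8).
arcs = [(2, 8), (1, 5), (4, 7)]
print(cutwidth(8, arcs), bandwidth(arcs))  # -> 3 6
```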
Chapter 4
Previous Results

As already mentioned in Chapter 3, for referring to the problems, we follow the convention that the arc structure of sequence S1 is at least as complex as that of sequence S2. Using five levels of arc structures, we distinguish 15 variations of LAPCS where S1 and S2 may have different levels of arc structure (see Table 3.2). This chapter summarizes previous results on these 15 LAPCS variations, in the classical as well as in the parameterized complexity framework. Various parameters have been used to examine LAPCS, such as the length l of the desired subsequence, the cutwidth k, and the bandwidth d. The third section is concerned with the Arc-Preserving Subsequence problem; the 5 levels of arc structure can also be applied to APS. Finally, we address the problems that are explicitly discussed in this work.
4.1
Classical Complexity
When the arc structure x of sequence S1 is at least crossing, LAPCS(x, y) is NP-hard [11]: INDEPENDENT SET, which is known to be NP-complete, can be reduced to LAPCS(unlimited, plain) and LAPCS(crossing, plain). If the arc structures of both sequences are lower than nested, then the LAPCS problem is solvable in polynomial time [11, 15]. The NP-hardness of the problem LAPCS(nested, nested) was shown in [20]. Jiang et al. [17] presented a dynamic programming algorithm to compute LAPCS(nested, chain)
and LAPCS(nested, plain) in running time O(nm³). LAPCS(crossing, crossing) admits a 2-approximation algorithm running in O(nm) time, and LAPCS(unlimited, plain) cannot be approximated within ratio n^ε for any ε ∈ (0, 1/4), where n denotes the length of the longer input sequence [17]. Table 4.1 gives a summary of results concerning the classical complexity of LAPCS.

             unlimited    crossing     nested       chain        plain
unlimited    NP-c* [11]   NP-c* [11]   NP-c* [11]   NP-c* [11]   NP-c* [11]
crossing     —            NP-c+ [11]   NP-c+ [11]   NP-c+ [11]   NP-c+ [11]
nested       —            —            NP-c# [20]   O(nm³) [17]  O(nm³) [17]
chain        —            —            —            O(nm) [11]   O(nm) [11]
plain        —            —            —            —            O(nm) [15]

*: not approximable within n^ε, ε < 1/4 [17]
+: 2-approximable, MAX SNP-hard [17]
#: 2-approximable

Table 4.1: Classical Complexity
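For comparison, the (plain, plain) entry of the table is the unannotated LCS problem, solvable by the textbook O(nm) dynamic program:

```python
def lcs_length(s1, s2):
    """Length of a longest common subsequence of two plain sequences."""
    n, m = len(s1), len(s2)
    # dp[i][j] = length of an LCS of the prefixes s1[:i] and s2[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # extend by the match
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs_length("GACT", "AGCAT"))  # -> 3
```

The table is filled in O(nm) time and space; arc annotations are exactly what invalidates this simple recurrence for the harder variants.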
Lin et al. [20] also showed NP-hardness results for c-fragment (with c > 1) and c-diagonal (with c > 0) LAPCS. The 1-fragment LAPCS(crossing, crossing) and 0-diagonal LAPCS(crossing, crossing) problems are solvable in time O(n). See Table 4.2.

             unlimited      crossing       nested          chain   plain
unlimited    NP-hard [20]   NP-hard [20]   NP-hard [20]    ?       ?
crossing     —              NP-hard [20]   NP-hard [20]    ?       ?
nested       —              —              NP-hard# [20]   ?       ?

#: admits a PTAS

Table 4.2: Complexity results for c-fragment (c > 1) and c-diagonal (c > 0) LAPCS
4.2 Parameterized Complexity
Since many of these 15 variations of LAPCS are NP-hard or have no currently known polynomial-time algorithm, the parameterized complexity of these problems has also been investigated in some of the above works. The parameters used include the length l of the desired subsequence, the cutwidth k of the arc structure, and the bandwidth d of the arc structure. The length l of the desired subsequence is independent of the other parameters, while the cutwidth of an arc structure lower than unlimited is upper-bounded by the bandwidth of the arc structure. If parameterized by the length of the desired subsequence, the LAPCS problem with at least one sequence having an unlimited arc structure was shown to be W[1]-complete [11]. If the arc structures of both sequences are crossing, the problem also turns out to be W[1]-complete [11]. The same reductions as for the classical hardness results can be used to show the corresponding parameterized complexity. For the other variations, in which the arc structure of sequence S1 is crossing or nested and the arc structure of S2 is at most nested, the parameterized complexity of LAPCS is still unknown. Table 4.3 summarizes the parameterized complexity of LAPCS when parameterized by the length of the desired subsequence.

             unlimited            crossing             nested  chain  plain
unlimited    W[1]-complete [11]   W[1]-complete [11]   W[1]-complete [11]   W[1]-complete [11]   W[1]-complete [11]
crossing     —                    W[1]-complete [11]   ?       ?      ?
nested       —                    —                    ?       ?      ?

Table 4.3: Parameterized by the length l of the desired subsequence
Evans [11] presented an algorithm running in time O(9^k nm), where k is the cutwidth or bandwidth of the arc structure, to solve the LAPCS problem for the variations in which the arc structures of both sequences are at most crossing. It uses multiple tables to compute the length of a longest arc-preserving common subsequence in a manner similar to the algorithm without arcs. To enable matched final endpoints of arcs to be aligned with matched starting endpoints, the algorithm uses a tree data structure to keep track of all combinations of initial endpoint matches that lie on a path producing this maximum value. Since the bandwidth of a crossing arc structure is an upper bound on its cutwidth, the algorithm developed for the parameter cutwidth can also be used for the parameter bandwidth. Therefore, Table 4.4 and Table 4.5 are identical.

             unlimited  crossing         nested             chain              plain
unlimited    ?          ?                ?                  ?                  ?
crossing     —          O(9^k nm) [11]   O(9^k nm) [11]     O(9^k nm) [11]     O(9^k nm) [11]
nested       —          —                O(k² 4^k nm) [11]  O(k² 4^k nm) [11]  O(k² 4^k nm) [11]

Table 4.4: Parameterized by the cutwidth k of the arc structure of both sequences
             unlimited  crossing         nested             chain              plain
unlimited    ?          ?                ?                  ?                  ?
crossing     —          O(9^d nm) [11]   O(9^d nm) [11]     O(9^d nm) [11]     O(9^d nm) [11]
nested       —          —                O(d² 4^d nm) [11]  O(d² 4^d nm) [11]  O(d² 4^d nm) [11]

Table 4.5: Parameterized by the bandwidth d of the arc structure of both sequences
4.3
Complexity of the Arc-Preserving Subsequence Problem
The exact matching version of LAPCS, the Arc-Preserving Subsequence (APS) problem, arises in widely varying applications, for example, searching for a specific pattern in a DNA/RNA database. The existing works that analyze the LAPCS problem give no hardness results for this problem. However, we can extend the reductions for the classical complexity of LAPCS to show that for some arc structures even the APS problem is NP-hard. Assuming the shorter sequence always has the same or a lower level of arc structure, we summarize the classical complexity of the APS problem in Table 4.6.

             unlimited      crossing       nested         chain         plain
unlimited    NP-hard [11]   NP-hard [11]   NP-hard [11]   NP-hard [11]  NP-hard [11]
crossing     —              NP-hard [11]   NP-hard [11]   ?             ?
nested       —              —              ?              O(nm³) [20]   O(nm³) [20]

Table 4.6: Classical complexity of the APS problem
4.4
Overview of This Work
Comparing the summaries of the previous three sections with the fact that the classical LCS problem for two sequences can be solved in polynomial time, we come to the conclusion that adding arc annotations to the basic sequences makes the LCS problem much more complex. However, the arc annotation model provides the most natural and intuitive way to describe the structure of these large molecules. Thus, finding exact and effective algorithms for the LAPCS problem with various arc annotations is the main goal of this work. First, we prove fixed-parameter tractability of the restricted versions of LAPCS, c-fragment and c-diagonal LAPCS, when taking the length of the common subsequence as the problem parameter. Lin et al. [20] gave polynomial-time approximation schemes (PTASs) for c-fragment and c-diagonal LAPCS(nested, nested). Our fixed-parameter tractability result also holds for more general arc structures, (crossing, crossing) and even (unlimited, unlimited), with the degree of the sequences as a second parameter (see Chapter 5). For the most important variant of the LAPCS problem, general LAPCS(nested, nested), there are only an FPT algorithm by Evans [11] with the cutwidth as parameter and a quadratic-time factor-2 approximation algorithm by Jiang et al. [17]. However, the fixed-parameter tractability of this problem, when parameterized by the length l of the desired subsequence, is still an open question. We will give an exact, fixed-parameter algorithm
that solves the LAPCS(nested, nested) problem in time O(3.31^(k1+k2) · n), where n is the maximum input sequence length and k1 and k2 are the numbers of deletions allowed in S1 and S2, respectively. Note that l = |S1| − k1 and l = |S2| − k2. This algorithm provides an effective solution for the case of reasonably small values of k1 and k2 (see Chapter 6). Furthermore, we answer some open questions from Table 4.6, namely the complexity of APS(crossing, chain), APS(crossing, plain), and APS(nested, nested) (see Chapter 7).
Chapter 5
c-fragment, c-diagonal LAPCS

In this chapter, we investigate the c-fragment LAPCS(crossing, crossing) and the c-diagonal LAPCS(crossing, crossing) problems. We give algorithms for these problems when parameterized by the length l of the desired subsequence. The restricted versions c-diagonal and c-fragment of LAPCS(crossing, crossing) were already treated by Lin et al. [20], who gave PTASs for these problems. We want to remark that the running times of the following algorithms are based on worst-case analysis; the algorithms are expected to perform much better in practice.
5.1
cfragment LAPCS(crossing, crossing)
Before entering into the details of the algorithm for c-fragment LAPCS(crossing, crossing), we briefly review an algorithm which solves 1-fragment LAPCS(crossing, crossing) in linear time [20]. Let (S1, A1) and (S2, A2) be an instance of 1-fragment LAPCS(crossing, crossing). We assume that n = |S1| = |S2|. If the sequences do not have the same length, we can extend the shorter one by appending a run of a letter not in the alphabet. We construct a graph G as follows. If the two sequences induce a base match ⟨i, i⟩, then we create a vertex v_i. If the sequences
induce a pair of base matches ⟨i, i⟩ and ⟨j, j⟩ and (i, j) is an arc in either A1 or A2 but not both, then we impose an edge connecting v_i and v_j in G. It is clear that G has maximum degree 2 and that the independent sets of G correspond one-to-one to the arc-preserving common subsequences of (S1, A1) and (S2, A2). Since G consists only of a collection of disjoint cycles and paths, we can compute a maximum independent set of G in linear time. Therefore, 1-fragment LAPCS(crossing, crossing) is solvable in O(n) time. Since 1-fragment limits the base matches to bases at the same position in the two sequences, the resulting graph G is simple. c-fragment relaxes this limitation and allows base matches of the form ⟨i, j⟩, where S1[i] and S2[j] are not at the same position but in the same fragment. Using the above reduction, the resulting graph will be much more complicated. However, we will show in the following that this graph has bounded degree, such that the fixed-parameter tractable algorithm of Lemma 3.8 can be used to find a maximum independent set of such a graph.

Lemma 5.1 The c-fragment LAPCS(crossing, crossing) problem is polynomially reducible and fixed-parameter reducible to MAX-IS-B, in time O(c³n).

Proof. Reduction: Let (S1, A1) and (S2, A2) be an instance of c-fragment LAPCS(crossing, crossing), where S1 and S2 are over a fixed alphabet Σ. We assume that both sequences have the same length, as in the algorithm for the 1-fragment variant, n = |S1| = |S2| = p · c, where p ∈ ℕ. We construct a graph G = (V, E) as follows. Each base of S1 in the ith fragment (1 ≤ i ≤ p) can only be matched to the bases in the ith fragment of S2. If there is such a base match, then we create a vertex in G, i.e., we define

V := { v_{i,j} | S1[i] = S2[j] and ⌈i/c⌉ = ⌈j/c⌉ }.
As explained in Definition 3.17, the LAPCS problem asks for a matching which is one-to-one, order-preserving, and arc-preserving. Since we want to translate
the LAPCS instance into an instance of the independent set problem on G, the edges of G will represent all conflicting matches. Therefore, for any two vertices v_{i1,j1} and v_{i2,j2}, i.e., for two base matches S1[i1] = S2[j1] and S1[i2] = S2[j2] (i1 ≠ i2 or j1 ≠ j2), such a conflict may arise from three different situations:

1. Both matches are in the same fragment and both matches involve the same position in S2 or in S1, i.e., (i1 ≠ i2 ∧ j1 = j2) ∨ (i1 = i2 ∧ j1 ≠ j2).

2. Both matches are in the same fragment and they cross each other, i.e., they do not preserve the order of the subsequence: ((i1 < i2) ∧ (j1 > j2)) ∨ ((i1 > i2) ∧ (j1 < j2)).

3. The two matches represented by v_{i1,j1} and v_{i2,j2} are not arc-preserving, i.e., ((i1, i2) ∈ A1 ∧ (j1, j2) ∉ A2) ∨ ((i1, i2) ∉ A1 ∧ (j1, j2) ∈ A2).

Figure 5.1 illustrates an example of this reduction.
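The construction can be sketched directly from the three situations above. The following Python sketch is our own illustration (the quadratic vertex-pair loop is for clarity only and does not reflect the O(c³n) construction time):

```python
from math import ceil

def cfragment_conflict_graph(s1, arcs1, s2, arcs2, c):
    """Conflict graph G of the reduction in Lemma 5.1 (illustrative sketch).

    Vertices are the allowed base matches <i, j> (same symbol, same
    fragment); edges join two matches that (1) share a base, (2) cross,
    or (3) violate arc preservation. Positions are 1-based as in the text.
    """
    a1, a2 = set(arcs1), set(arcs2)
    V = [(i, j)
         for i in range(1, len(s1) + 1)
         for j in range(1, len(s2) + 1)
         if s1[i - 1] == s2[j - 1] and ceil(i / c) == ceil(j / c)]
    E = set()
    for v1 in V:
        for v2 in V:
            if v1 >= v2:
                continue                                  # each pair once
            (i1, j1), (i2, j2) = v1, v2
            share = i1 == i2 or j1 == j2                  # situation 1
            cross = (i1 != i2 and j1 != j2
                     and (i1 < i2) != (j1 < j2))          # situation 2
            arc1 = (min(i1, i2), max(i1, i2)) in a1
            arc2 = (min(j1, j2), max(j1, j2)) in a2
            if share or cross or (arc1 != arc2):          # situation 3
                E.add((v1, v2))
    return V, E
```

For instance, on s1 = "ab", s2 = "ba" with c = 2 and no arcs, the two candidate matches ⟨1, 2⟩ and ⟨2, 1⟩ cross, so G is a single edge and a maximum independent set has size 1, matching the LAPCS length.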
For the running time analysis of this construction, note that there can be up to c² vertices in G for each fragment of the sequences. Hence, we have a total of cn vertices. Each vertex in G can have at most c² + 2c − 1 adjacent edges, which come from the following three groups:

• If a base match ⟨i, j1⟩ shares with another base match ⟨i, j2⟩ the same base S1[i], then an edge must be imposed between the vertices v_{i,j1} and v_{i,j2}. There can be at most c − 1 base matches which share S1[i] with ⟨i, j⟩, and at most c − 1 base matches which share S2[j] with ⟨i, j⟩. Thus, v_{i,j} can have at most 2(c − 1) adjacent edges due to the first situation.

• If S1[i] is the first base in one fragment of S1 and S2[j] is the last base in the same fragment of S2, then the base match ⟨i, j⟩ can violate the order of the original sequences with at most (c − 1)² other base matches. Thus, at most (c − 1)² edges will be imposed on vertex v_{i,j} due to the second situation.

• If S1[i] and S2[j] both are endpoints of arcs (i, i′) and (j, j′), then all base matches involving S1[i′] or S2[j′] (but not both) with base match
Figure 5.1: 2-fragment LAPCS. There are four base matches in the two fragments shown in this figure; they correspond to the four vertices in G. The edge (1) in G is imposed due to the first situation in our construction: the base matches ⟨2i − 1, 2i − 1⟩ and ⟨2i − 1, 2i⟩ share the base S1[2i − 1]. Since the base matches ⟨2j − 1, 2j⟩ and ⟨2j, 2j − 1⟩ fit the second situation, their corresponding vertices are joined by an edge denoted by (2). While an arc joins the bases S1[2i − 1] and S1[2j − 1], there is no arc in S2 between the ith and jth fragments. According to the third situation, edges are imposed between the vertices which correspond to the base matches involving the endpoints of the arc (2i − 1, 2j − 1). These edges are marked with (3).
⟨i, j⟩ cannot be arc-preserving. Since S1[i′] and S2[j′] can be in two different fragments and each of them has at most c matched bases, the edges imposed on vertex v_{i,j} due to the third situation can amount to 2c.

Thus, the resulting graph G has a vertex degree bounded by B = c² + 2c − 1. Moreover, since we have cn vertices, G can have at most O(c³n) edges. The construction of G can be carried out in time O(c³n). To show that the above construction is a correct reduction from c-fragment LAPCS(crossing, crossing) to MAX-IS-B, we need to verify that there is a
mapping MS of size l, MS ⊆ {1, . . . , |S1|} × {1, . . . , |S2|}, i.e., an APCS of length l, if and only if the graph G has an independent set of size l.

"=⇒": Assume that there is an APCS of length l. Then there is a mapping MS of size l for (S1, A1) and (S2, A2), MS = { (j1, j2) | S1[j1] = S2[j2], ⌈j1/c⌉ = ⌈j2/c⌉ }. For each element (j1, j2) in MS, there is a vertex v_{j1,j2} in the graph G. We claim that these vertices form an independent set. To prove this, we show that the opposite leads to a contradiction. If the set of these vertices is not an independent set, then there are at least two vertices joined by an edge. Assume that there is an edge between the vertices v_{j1,j2} and v_{j3,j4}. From the construction above, one of the following cases must hold for the two matches (j1, j2) ∈ MS and (j3, j4) ∈ MS:

• j1 = j3 or j2 = j4, but not both. In this case, these two matches cannot both be in MS, because they violate the property that the matching is one-to-one.

• (j1 < j3) ∧ (j2 > j4) or (j1 > j3) ∧ (j2 < j4). Then they violate the order of the subsequence and, thus, they cannot both be in MS.

• There is an arc between (j1, j3) or between (j2, j4), but not both. If this holds, the mapping is not arc-preserving, because there is only one arc between the two base matches.

Consequently, the two vertices cannot be connected by an edge. Hence, the vertex set { v_{j1,j2} | (j1, j2) ∈ MS } is an independent set of G and its size is l.

"⇐=": Assume that there is an independent set V′ of size l in G. Then we have a mapping T of size l, i.e., T = { (j1, j2) | v_{j1,j2} ∈ V′ }. Because a vertex of G is created only if the two bases at these positions match, we have S1[j1] = S2[j2], and both bases are in the same fragment, so each element (j1, j2) ∈ T represents
a base match of S1 and S2. Then, T induces a common subsequence. Since no pair of vertices v_{j1,j2} and v_{j3,j4} in V′ is linked by an edge, the matches (j1, j2) ∈ T and (j3, j4) ∈ T cannot fit any of the three situations. This means that j1 ≠ j3 and j2 ≠ j4, so T is a one-to-one matching. Furthermore, the matches preserve the order of the subsequence, i.e., j1 < j3 ⇐⇒ j2 < j4. They preserve the arcs, too, i.e., there are arcs (j1, j3) ∈ A1 and (j2, j4) ∈ A2, or there is no arc between either pair. Thus, the sequence induced by T is an APCS of (S1, A1) and (S2, A2) and its length is l. □
Theorem 5.2 The c-fragment LAPCS(crossing, crossing) problem, parameterized by the length l of the desired subsequence, is fixed-parameter tractable and can be solved in time O((B + 1)^l B² + c³n), where B = c² + 2c − 1.

Proof. The problem MAX-IS-B has a straightforward bounded search tree FPT algorithm (see Lemma 3.8), and c-fragment LAPCS(crossing, crossing), parameterized by the length l of the desired subsequence, is fixed-parameter reducible to MAX-IS-B in time O(c³n). The resulting graph has cn vertices and a bounded degree B = c² + 2c − 1. Thus, c-fragment LAPCS(crossing, crossing) is also fixed-parameter tractable and solvable in time O((B + 1)^l B² + c³n). □
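The bounded search tree algorithm for MAX-IS-B invoked in this proof can be sketched as follows (decision version, our own naming):

```python
def has_independent_set(adj, l):
    """Bounded search tree for MAX-IS-B (cf. Lemma 3.8): decide whether a
    graph of maximum degree B has an independent set of size l.

    For any vertex v, some maximum independent set contains v or one of
    its at most B neighbors, so we branch into |N[v]| <= B + 1 subcases,
    giving a search tree of size O((B + 1)^l).
    `adj` maps every vertex to the set of its neighbors.
    """
    if l <= 0:
        return True          # nothing left to find
    if not adj:
        return False         # vertices exhausted, but l > 0
    v = next(iter(adj))
    for u in [v] + list(adj[v]):
        # take u into the independent set, delete its closed neighborhood
        removed = {u} | adj[u]
        sub = {w: nbrs - removed
               for w, nbrs in adj.items() if w not in removed}
        if has_independent_set(sub, l - 1):
            return True
    return False
```

On the path 1–2–3, for example, an independent set of size 2 exists ({1, 3}) but none of size 3.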
5.2 c-diagonal LAPCS(crossing, crossing)
The reduction of the last section can easily be extended to a reduction from c-diagonal LAPCS(crossing, crossing) to MAX-IS-B. For given arc-annotated sequences (S1, A1) and (S2, A2), the set of vertices now becomes V := { v_{i,j} | S1[i] = S2[j] and j ∈ [i − c, i + c] }, since each position i in sequence S1 can only be matched to positions j ∈ [i − c, i + c] of S2. The definition of the edge set E can be adapted from the c-fragment case: we put an edge {v_{i1,j1}, v_{i2,j2}} into E iff the corresponding matches ⟨i1, j1⟩ and ⟨i2, j2⟩ (1) share a common base, (2) are not order-preserving, or (3) are not arc-preserving. Figure 5.2 illustrates an example of the extended reduction.
Figure 5.2: c-diagonal LAPCS. The vertex v_{i,j} is created for the base match ⟨i, j⟩. The dashed lines 1, 2, 3, and 4 represent four other base matches, each of which corresponds to a vertex in G. Since base match 1 shares the base S1[i] with ⟨i, j⟩, an edge is imposed between vertex v_{i,j} and vertex 1. The base matches 2 and ⟨i, j⟩ cross each other; hence, an edge is also imposed between their corresponding vertices. It is clear that neither the pair of base matches 3 and ⟨i, j⟩ nor 4 and ⟨i, j⟩ can preserve the arcs of the sequences; two corresponding edges are added to the graph G. Note that A, B, X, and Y are all substrings of length c.
Obviously, |V| ≤ (2c + 1) · n. In the following, we argue that the degree of G = (V, E) is upper-bounded by B = 2c² + 7c + 2:

• Because a base can be matched to at most 2c + 1 bases of the other sequence, a base match can have common bases with up to 2c + 2c = 4c other base matches. In Figure 5.2, the base match ⟨i, j⟩ shares bases with base matches between S1[i] and bases in the substrings X and Y, or between S2[j] and bases in the substrings A and B. Since all these substrings have length c, there can be at most 4c such base matches; one example is the base match denoted by 1.

• We can observe in Figure 5.2 that a vertex in G has the maximum number of edges imposed due to the second situation if the distance between the bases involved in its corresponding base match is equal to c. Consider, e.g., the base match ⟨i, j⟩ in Figure 5.2. There, a base match crossing ⟨i, j⟩ must be from one of the following sets: M1 = { ⟨i1, j1⟩ | S1[i1] is in substring B, S2[j1] is in substring X }, M2 = { ⟨i2, j2⟩ | S1[i2] is in substring B, S2[j2] is in substring Y, and j2 − i2 ≤ c }, and M3 = { ⟨i3, j3⟩ |
S1[i3] is in substring A, S2[j3] is in substring X, and j3 − i3 ≤ c }. The set M1 can have at most c² elements, and the elements of the other two sets can amount to c² − c. Therefore, each vertex of V can have at most 2c² − c edges which are imposed to guarantee the order-preserving property.

• If the two bases which form a base match are both endpoints of arcs, like the base match ⟨i, j⟩ in Figure 5.2, then this base match cannot be in an arc-preserving match with base matches which involve only one of the other endpoints of the arcs. Two such base matches are marked in Figure 5.2 with 3 and 4. These base matches can amount to 4c + 2.
Consequently, the graph G has degree bounded by B = 2c² + 7c + 2. With (2c + 1)n vertices, G has at most O(c³n) edges. The construction of G can be done in time O(c³n).
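The modified vertex set is easy to generate; a small sketch (our own helper, 1-based positions as in the text):

```python
def cdiagonal_vertices(s1, s2, c):
    # Allowed base matches for c-diagonal LAPCS: s1[i] may only be
    # matched to s2[j] with j in [i - c, i + c], which caps the
    # number of vertices at (2c + 1) * n.
    return [(i, j)
            for i in range(1, len(s1) + 1)
            for j in range(max(1, i - c), min(len(s2), i + c) + 1)
            if s1[i - 1] == s2[j - 1]]
```

For c = 0 this degenerates to the diagonal matches of the 0-diagonal case, i.e., only positions i of equal symbols in both sequences.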
The proof of correctness of this reduction works analogously to the one shown in Section 5.1.

Theorem 5.3 The c-diagonal LAPCS(crossing, crossing) problem, parameterized by the length l of the desired subsequence, is fixed-parameter tractable and can be solved in time O((B + 1)^l B² + c³n), where B = 2c² + 7c + 2.

Proof. Analogous to the proof of Theorem 5.2. □

5.3 c-fragment (c-diagonal) LAPCS(unlimited, unlimited)
Note that the fact that the graph G = (V, E) constructed in the previous sections has bounded degree depends heavily on the two underlying sequences having a crossing arc structure. Hence, the same method does not directly apply to c-fragment (c-diagonal) LAPCS(unlimited, unlimited). However, if we use the so-called "degree of a sequence" as an additional parameter, we can upper-bound the degree of G. The degree of
an arc-annotated sequence (S, A) with unlimited arc structure is the maximum number of arcs from A that start or end at a base of S. Clearly, the cutwidth (see Definition 3.22) of an arc-annotated sequence is an upper bound on its degree. The number of vertices in the resulting graph G is unchanged, but the degree bound changes. In the construction for the arc structure (crossing, crossing), we added three groups of edges to G. Since the first two groups have nothing to do with arcs, these edges remain in the graph for unlimited arc structures. Due to the third situation, 2c edges for c-fragment and 4c + 2 edges for c-diagonal are added for a base match ⟨i, j⟩ with two arc endpoints, (i, i1) ∈ A1 and (j, j1) ∈ A2. These edges are between the vertex v_{i,j} and the vertices which correspond to the base matches involving one of S1[i1] and S2[j1]. In an unlimited arc structure with bounded degree b, a base S1[i] can be an endpoint of at most b arcs; we denote them by (i, i1), (i, i2), . . . , (i, ib). The third group of edges must be extended to include the edges between v_{i,j} and all vertices which correspond to base matches involving one of S1[i1], . . . , S1[ib], S2[j1], . . . , S2[jb]. The number of edges in this group can increase to 2bc for c-fragment and to b(4c + 2) for c-diagonal LAPCS(unlimited, unlimited). The degree of the resulting graph for c-fragment is then bounded by B = c² + 2bc − 1, and the one for c-diagonal by B′ = 2c² + (4b + 3)c + 2b. The construction can be carried out in time O((c³ + bc²)n). Thus, c-fragment and c-diagonal LAPCS(unlimited, unlimited) are also fixed-parameter tractable when the parameters are the length l of the desired subsequence and the maximum degree b of the two sequences; they can be solved in time O((B + 1)^l B² + (c³ + bc²)n), where B = c² + 2bc − 1, and in time O((B′ + 1)^l B′² + (c³ + bc²)n), where B′ = 2c² + (4b + 3)c + 2b, respectively.
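The degree of a sequence can be computed in linear time; a small sketch (our own helper, 1-based positions):

```python
def sequence_degree(n, arcs):
    # Degree of an arc-annotated sequence (S, A) with |S| = n: the
    # maximum number of arcs starting or ending at a single base.
    count = [0] * (n + 1)
    for (i, j) in arcs:
        count[i] += 1
        count[j] += 1
    return max(count[1:], default=0)
```

Since every arc counted at a base also passes by or ends at that base, the degree never exceeds the cutwidth, as stated above.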
Chapter 6
An Algorithm for LAPCS(nested, nested)

In this chapter, we describe and analyze Algorithm LAPCS, which solves the LAPCS(nested, nested) problem in time O(3.31^(k1+k2) · n), where n is the maximum length of the input sequences. It is a search tree algorithm and, for the sake of clarity, we choose a presentation in a recursive style: based on the current instance, we make a case distinction, branch into one or more subcases of somewhat simplified instances, and invoke the algorithm recursively on each of these subcases. Note, however, that we require the resulting search tree to be traversed in a breadth-first manner, which will be important in the running time analysis. Before presenting the algorithm, we define the employed notation. Recall that the considered sequences are seen as arc-annotated sequences; a comparison S1 = S2 includes the comparison of the arc structures. Additionally, we use a modified comparison S1 ≈_{i,j} S2 that is satisfied when S1 = S2 after deleting at most i bases in S1 and at most j bases in S2. Note that we can check whether S1 ≈_{1,0} S2 or whether S1 ≈_{0,1} S2 in linear time. The subsequence obtained from an arc-annotated sequence S by deleting S[i] is denoted by S − S[i]. For handling branches in which no solution is found, we use a modified addition operator "+̇", defined as follows: a +̇ b := a + b if a ≥ 0 and b ≥ 0, and a +̇ b := −1 otherwise. We abbreviate n1 := |S1| and n2 := |S2|.

The most involved case in the algorithm is Case (2.5), which will also determine our upper bound on the search tree size. The focus of our analysis will, in particular, be on Subcase (2.5.3). For the sake of clarity, we first give an overview of the algorithm which omits the details of Case (2.5), and then present Case (2.5) in detail separately. Although the algorithm as given reports only the length of a longest arc-preserving common subsequence (lapcs), it can easily be extended to compute the lapcs itself within the same running time.

Algorithm LAPCS(S1, S2, k1, k2)
Input: Arc-annotated sequences S1 and S2, positive integers k1 and k2.
Return value: Integer denoting the length of an lapcs of S1 and S2 which can be obtained by deleting at most k1 symbols in S1 and at most k2 symbols in S2. Return value −1 if no such subsequence exists.

(Case 0) /* Recursion ends. */
  If k1 < 0 or k2 < 0, then return −1.        /* No solution found. */
  If |S1| = 0 and |S2| = 0, then return 0.    /* Success! Solution found. */
  If |S1| = 0 and |S2| > 0, then              /* One sequence done... */
    if k2 ≥ |S2|, then return 0, else return −1.
  If |S1| > 0 and |S2| = 0, then              /* ...but not the other. */
    if k1 ≥ |S1|, then return 0, else return −1.

(Case 1) /* Non-matching bases. */
  If S1[1] ≠ S2[1], then return the maximum of the following values:
  • LAPCS(S1[2, n1], S2, k1 − 1, k2)  /* delete S1[1] */
  • LAPCS(S1, S2[2, n2], k1, k2 − 1)  /* delete S2[1] */.

(Case 2) /* Matching bases. */
  If S1[1] = S2[1], then
  (2.1) /* No arcs involved. */
    If both S1[1] and S2[1] are not endpoints of arcs, then return
    1 +̇ LAPCS(S1[2, n1], S2[2, n2], k1, k2).
    /* Since no arcs are involved, it is safe to match the bases. */
  (2.2) /* Only one arc. */
    If S1[1] is the left endpoint of an arc (1, i) but S2[1] is not an endpoint of an arc, then return the maximum of the following values:
    • LAPCS(S1[2, n1], S2, k1 − 1, k2)  /* delete S1[1] */,
    • LAPCS(S1, S2[2, n2], k1, k2 − 1)  /* delete S2[1] */, and
    • 1 +̇ LAPCS(S1[2, n1] − S1[i], S2[2, n2], k1 − 1, k2)  /* match */.
    /* Since there is an arc in one sequence only, S1[1] and S2[1] can be matched only if S1[i] and the arc (1, i) are deleted. */

  (2.3) /* Only one arc. */
    If S2[1] is the left endpoint of an arc (1, j) but S1[1] is not an endpoint of an arc, then proceed analogously to (2.2).

  (2.4) /* Non-matching arcs. */
    If S1[1] is the left endpoint of an arc (1, i), S2[1] is the left endpoint of an arc (1, j), and S1[i] ≠ S2[j], then return the maximum of the following values:
    • LAPCS(S1[2, n1], S2, k1 − 1, k2)  /* delete S1[1] */,
    • LAPCS(S1, S2[2, n2], k1, k2 − 1)  /* delete S2[1] */, and
    • 1 +̇ LAPCS(S1[2, n1] − S1[i], S2[2, n2] − S2[j], k1 − 1, k2 − 1)  /* match */.
    /* Since the arcs cannot be matched, S1[1] and S2[1] can be matched only if S1[i], S2[j], and the arcs are deleted. */

  (2.5) /* An arc match is possible. */
    If S1[1] is the left endpoint of an arc (1, i), S2[1] is the left endpoint of an arc (1, j), and S1[i] = S2[j], then go through Cases (2.5.1), (2.5.2), and (2.5.3), which are presented below (one of them will apply and will return the length of the lapcs of S1 and S2, if such an lapcs can be obtained with k1 deletions in S1 and k2 deletions in S2, or will return −1 otherwise).

In Case (2.5), it is possible to match the arcs (1, i) in S1 and (1, j) in S2 since S1[1] = S2[1] and S1[i] = S2[j]. Our first observation is that, if S1[2, i − 1] = S2[2, j − 1] (which will be handled in Case (2.5.1)) or if S1[i + 1, n1] = S2[j + 1, n2] (which will be handled in Case (2.5.2)), it is safe to match arc (1, i) with arc (1, j): no
longer apcs would be possible when not matching them. We match the equal parts of the sequences (either those inside the arcs or those following the arcs) and call Algorithm LAPCS recursively only on the remaining subsequences. These cases only simplify the instance and do not require branching into several subcases:

  (2.5.1) /* Sequences inside the arcs match. */
    If S1[2, i − 1] = S2[2, j − 1], then return
    i +̇ LAPCS(S1[i + 1, n1], S2[j + 1, n2], k1, k2).

  (2.5.2) /* Sequences following the arcs match. */
    If S1[i + 1, n1] = S2[j + 1, n2], then return
    2 +̇ (n1 − i) +̇ LAPCS(S1[2, i − 1], S2[2, j − 1], k1, k2).

If neither Case (2.5.1) nor Case (2.5.2) applies, this is handled by Case (2.5.3), which branches into four recursive calls: we have to consider breaking at least one of the arcs (handled by the first three recursive calls in (2.5.3)) or matching the arcs (handled by the fourth recursive call in (2.5.3)):

  (2.5.3) Return the maximum of the following four values:
    • LAPCS(S1[2, n1], S2, k1 − 1, k2)  /* delete S1[1]. */,
    • LAPCS(S1, S2[2, n2], k1, k2 − 1)  /* delete S2[1]. */,
    • 1 +̇ LAPCS(S1 − S1[i], S2 − S2[j], k1 − 1, k2 − 1)  /* match S1[1] and S2[1], but do not match the arcs (1, i) and (1, j); this implies the deletion of S1[i], S2[j], and the incident arcs. */,
    • l (computed as given below)  /* match the arcs. */

Value l denotes the length of the lapcs of S1 and S2 in the case of matching arc (1, i) with arc (1, j). It can be computed as the sum of the lengths l′, denoting the length of an lapcs of S1[2, i − 1] and S2[2, j − 1], and l′′, denoting the length of an lapcs of S1[i + 1, n1] and S2[j + 1, n2]; each of l′ and l′′ can be computed by one recursive call. Remember that we have already excluded S1[2, i − 1] = S2[2, j − 1] (by Case (2.5.1)) and S1[i + 1, n1] = S2[j + 1, n2] (by Case (2.5.2)). For the analysis
of running time, however, we will require that the deletion parameters k1 and k2 are decreased by two in both recursive calls computing l′ and l′′. Therefore, we will further exclude those special cases in which l′ or l′′ can be found by exactly one deletion, either in S1 or in S2 (this can be checked in linear time); then, we need only one recursive call to compute l. Only if this is not possible do we invoke the two calls for l′ and l′′. Therefore, l is computed as follows:
l :=
  j +̇ LAPCS(S1[i + 1, n1], S2[j + 1, n2], k1 − 1, k2)            if S1[1, i] ≈_{1,0} S2[1, j],
  i +̇ LAPCS(S1[i + 1, n1], S2[j + 1, n2], k1, k2 − 1)            if S1[1, i] ≈_{0,1} S2[1, j],
  2 +̇ (n2 − j) +̇ LAPCS(S1[2, i − 1], S2[2, j − 1], k1 − 1, k2)   if S1[i + 1, n1] ≈_{1,0} S2[j + 1, n2],
  2 +̇ (n1 − i) +̇ LAPCS(S1[2, i − 1], S2[2, j − 1], k1, k2 − 1)   if S1[i + 1, n1] ≈_{0,1} S2[j + 1, n2],
  2 +̇ l′ +̇ l′′ (defined below)                                   otherwise.
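The special-case test S ≈_{1,0} T ("T arises from S by exactly one deletion in S, none in T") can indeed be done in linear time. A minimal sketch of our reading of this test, ignoring the arc annotations (the function name is ours, not the thesis's):

```python
def eq_after_one_deletion(s, t):
    """Linear-time test whether t arises from s by deleting exactly one
    symbol of s (our reading of s ~_{1,0} t, ignoring arcs)."""
    if len(s) != len(t) + 1:
        return False
    i = 0
    while i < len(t) and s[i] == t[i]:   # skip the common prefix
        i += 1
    return s[i + 1:] == t[i:]            # drop s[i], compare the rest

print(eq_after_one_deletion("gacu", "gcu"))   # True: deleting 'a' works
```

The symmetric test ≈_{0,1} is obtained by swapping the arguments.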
Computing l′, we credit the two deletions that will certainly be needed when computing l′′. Depending on the lengths of S1[i + 1, n1] and S2[j + 1, n2], we have to decide which parameter to decrease: if |S1[i + 1, n1]| > |S2[j + 1, n2]|, we will certainly need at least two deletions in S1[i + 1, n1] and can start the recursive call with parameter k1 − 2 (analogously, with k2 − 2 if |S1[i + 1, n1]| < |S2[j + 1, n2]|, and with both k1 − 1 and k2 − 1 if S1[i + 1, n1] and S2[j + 1, n2] are of the same length):

l′ :=
  LAPCS(S1[2, i − 1], S2[2, j − 1], k1 − 2, k2)       if n1 − i > n2 − j,
  LAPCS(S1[2, i − 1], S2[2, j − 1], k1, k2 − 2)       if n1 − i < n2 − j,
  LAPCS(S1[2, i − 1], S2[2, j − 1], k1 − 1, k2 − 1)   if n1 − i = n2 − j.

Computing l′′, we decrease k1 and k2 by the deletions already spent when computing l′, with k′_{1,1} := i − 2 − l′ denoting the deletions spent in S1[1, i] and k′_{2,1} := j − 2 − l′ denoting the deletions spent in S2[1, j]:

l′′ := LAPCS(S1[i + 1, n1], S2[j + 1, n2], k1 − k′_{1,1}, k2 − k′_{2,1}).
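To illustrate the overall recursion scheme, here is a small Python sketch of the search tree for the arc-free special case only (our reading of Cases (0), (1), and (2.1); all arc cases of the thesis are omitted). The dotted plus +̇ is modelled by a helper that propagates the failure value −1:

```python
def plus(a, b):
    """The dotted plus: propagate the failure value -1."""
    return -1 if b == -1 else a + b

def lcs_budget(s1, s2, k1, k2):
    """Search-tree skeleton for the arc-free special case: length of a
    common subsequence of s1 and s2 reachable with at most k1 deletions
    in s1 and k2 deletions in s2, or -1 if no such subsequence exists."""
    if k1 < 0 or k2 < 0:
        return -1                        # deletion budget exceeded
    if not s1 or not s2:
        # all remaining symbols must be deleted
        return 0 if len(s1) <= k1 and len(s2) <= k2 else -1
    if s1[0] == s2[0]:                   # equal first symbols: match them
        return plus(1, lcs_budget(s1[1:], s2[1:], k1, k2))
    # first symbols differ: branch into deleting one of them
    return max(lcs_budget(s1[1:], s2, k1 - 1, k2),
               lcs_budget(s1, s2[1:], k1, k2 - 1))

print(lcs_budget("abcde", "ace", 2, 0))  # 3: "ace", two deletions in s1
```

Matching equal first symbols without branching is safe here only because arcs are ignored; the arc cases above are exactly what makes the full algorithm branch on equal symbols as well.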
Correctness of Algorithm LAPCS. To show the correctness, we have to make sure that, if an lapcs with the specified properties exists, then the algorithm finds one; the reverse direction can be seen by checking, for every case of the above algorithm, that we only make matches when they extend the lapcs and that the bookkeeping of the "mismatch counters" k1 and k2 is correct. In the following, we omit the details for the easier cases of our search tree algorithm and, instead, focus on the most involved situation, Case (2.5).

In Case (2.5), S1[1] = S2[1], there is an arc (1, i) in S1 and an arc (1, j) in S2, and S1[i] = S2[j]. In Cases (2.5.1) and (2.5.2), we handled the special situations that S1[1, i] = S2[1, j] or that S1[i + 1, n1] = S2[j + 1, n2]. Observe that, if we decide to match the arcs (Case (2.5.3)), we can divide the current instance into two subinstances: bases from S1[2, i − 1] can only be matched to bases from S2[2, j − 1], and bases from S1[i + 1, n1] can only be matched to bases from S2[j + 1, n2]. We will, in the following, denote the subinstance given by S1[2, i − 1] and S2[2, j − 1] as part 1 of the instance and the one given by S1[i + 1, n1] and S2[j + 1, n2] as part 2 of the instance.

We have the choice of breaking at least one of the arcs (1, i) and (1, j) or of matching them. We distinguish two cases. Firstly, suppose we want to break at least one arc. This can be achieved by deleting either S1[1] or S2[1]. If we do not delete either of these bases, we obtain a base match. But, in addition, we must then delete both S1[i] and S2[j], since otherwise we cannot maintain the arc-preserving property. Secondly, we can match the arcs (1, i) and (1, j). Then we know, since neither Case (2.5.1) nor (2.5.2) applies, that an optimal solution will require at least one deletion in part 1 and at least one deletion in part 2. We can further compute, in linear time, whether part 1 (or part 2, resp.) can be handled by exactly one deletion and start the algorithm recursively only on part 2 (part 1, resp.), decreasing one of k1 or k2 by the deletion already spent.
In the remaining case, we start the algorithm recursively first on part 1 (to compute l′) and then on part 2 (to compute l′′). At this point we know, however, that an optimal solution will require at least two deletions in part 1 and at least two deletions in part 2. Thus, when starting the algorithm on part 1, we can "spare" two of the k1 + k2 deletions for part 2, taken from k1 or k2 depending on the lengths in part 2 (as outlined above). Having, thus, found an optimal solution for part 1 of length l′, the number of allowed deletions remaining for part 2 is determined: we have, in part 1, already spent k′_{1,1} := i − 2 − l′ deletions in S1[2, i − 1] and k′_{2,1} := j − 2 − l′ deletions in S2[2, j − 1]. Thus, there remain, for part 2, k1 − k′_{1,1} deletions for S1[i + 1, n1] and k2 − k′_{2,1} deletions for S2[j + 1, n2].
This discussion showed that, in Case (2.5.3), our case distinction covers all subcases in which we can find an optimal solution; hence, Case (2.5.3) is correct.

Running time of Algorithm LAPCS.

Lemma 6.1 Given two arc-annotated sequences S1 and S2, suppose that we have to delete k1′ symbols in S1 and k2′ symbols in S2 in order to obtain an lapcs.¹ Then, the search tree size (i.e., the number of nodes in the search tree) for a call LAPCS(S1, S2, k1′, k2′) is upper-bounded by 3.31^{k1′+k2′}.

Proof. Algorithm LAPCS constructs a search tree. In each of the given cases, we do a branching and perform a recursive call of LAPCS with a smaller value of the sum of the parameters in each of the branches. We now discuss some cases which, in some sense, have a special branching structure. Firstly, Cases (2.1), (2.5.1), and (2.5.2) do not cause any branching of the recursion. Secondly, for Case (2.5.3), we perform five recursive calls in total. For the following running time analysis, the last two recursive calls of this case (i.e., the ones needed to evaluate l′ and l′′) will be treated together. More precisely, we treat Case (2.5.3) as if it were a branching into four subcases, where, in each of the first three branches, we have one recursive call and, in the fourth branch, we have two recursive calls.

In a search tree produced by Algorithm LAPCS, every search tree node corresponds to one of the cases mentioned in the algorithm. Let m be the number of nodes corresponding to Case (2.5.3) that appear in such a search tree. We prove the claim on the search tree size by induction on m.

¹ Note that there might be several lapcs for two given sequences S1 and S2. The length ℓ of such an lapcs, however, is uniquely defined. Since, clearly, k1′ = |S1| − ℓ and k2′ = |S2| − ℓ, the values k1′ and k2′ are also uniquely defined for given S1 and S2.
For m = 0, we do not have to deal with Case (2.5.3). Hence, we can determine the search tree size by the corresponding branching vectors: Suppose that, in one search tree node with current sequences S1, S2 and parameters k1′, k2′, we have q branches. Moreover, suppose that, in branch t, 1 ≤ t ≤ q, we call LAPCS with new parameter values k′_{1,t} and k′_{2,t}. Then, the branching vector for this branching is given by p = (p1, …, pq), where pt := (k1′ + k2′) − (k′_{1,t} + k′_{2,t}). Assuming that all branchings in the search tree had branching vector p, we can compute a basis c_p which yields an upper bound on the search tree size of the form c_p^{k1′+k2′}.
The branching vectors which appear in our search tree are (1, 1) (Case 1), (1, 1, 1) (Cases 2.2, 2.3), (1, 1, 2) (Case 2.4), and (1, 1, 2) (Case 2.5.3 with m = 0). The worst case basis for these branching vectors is given for p = (1, 1, 1) with c_p = 3 ≤ 3.31.

Now suppose that the claim is true for all values m′ ≤ m − 1. In order to prove the claim for m, we have to analyze, for a given search tree, a search tree node corresponding to Case (2.5). Suppose that the current sequences in this node are S1 and S2 with lengths n1 and n2 and that the optimal parameter values are k1′ and k2′. Our goal is to show that the branching of the recursion for Case (2.5.3) has branching vector p = (1, 1, 2, 1), which corresponds to a basis c_p = 3.31.

As discussed above, for the first three branches of Case (2.5.3), we only need one recursive call of the algorithm. The fourth branch is more involved; we will have a closer look at it in the following. Let us evaluate the search tree size for a call of this fourth subcase. It is clear that the optimal parameter values for the subsequences S1[2, i − 1] and S2[2, j − 1] are k′_{1,1} = (i − 2) − l′ and k′_{2,1} = (j − 2) − l′. Moreover, the optimal parameter values for the subsequences S1[i + 1, n1] and S2[j + 1, n2] are k′_{1,2} = (n1 − i) − l′′ and k′_{2,2} = (n2 − j) − l′′. Since, by Cases (2.5.1) and (2.5.2) and by the first four cases in the fourth branch of Case (2.5.3), the cases where k′_{1,1} + k′_{2,1} ≤ 1 or k′_{1,2} + k′_{2,2} ≤ 1 are already considered, we may assume that k′_{1,1} + k′_{2,1} ≥ 2 and k′_{1,2} + k′_{2,2} ≥ 2.

Hence, by induction hypothesis, the search tree size for the computation of l′ is 3.31^{k′_{1,1}+k′_{2,1}}, and the computation of l′′ needs a search tree of size 3.31^{k′_{1,2}+k′_{2,2}}.
This means that the total search tree size for this fourth subcase is upper-bounded by

3.31^{k′_{1,1}+k′_{2,1}} + 3.31^{k′_{1,2}+k′_{2,2}}.   (6.1)

Note that, since kt′ is assumed to be the optimal value, we have

kt′ = nt − l′ − l′′ − 2   for t = 1, 2,

and, hence, an easy computation shows that

k′_{t,1} + k′_{t,2} = kt′   for t = 1, 2.

From this we conclude that

3.31^{k′_{1,1}+k′_{2,1}} + 3.31^{k′_{1,2}+k′_{2,2}} ≤ 3.31^{k1′+k2′−1}.   (6.2)

Inequality (6.2) holds true since, by assumption, k′_{1,1} + k′_{2,1} ≥ 2 and k′_{1,2} + k′_{2,2} ≥ 2.
′
for this fourth case of (2.5.3) is upperbounded by 3.31k1 +k2 −1 . Besides, by induction hypothesis the search trees for the first and the second branch of ′
′
Case (2.5.3) also have size upperbounded by 3.31k1 +k2 −1 and the search tree ′
′
for the third branch of Case (2.5.3) has size upperbounded by 3.31k1 +k2 −2 Hence, the overall computations for Case (2.5.3) can be treated as branching vector p = (1, 1, 2, 1). The corresponding basis cp of this branching vector is 3.31, which again is the worst case basis among all branchings. Hence, the ′
′
full search tree has size 3.31k1 +k2 .
2
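The branching-vector bases used in the proof can be checked numerically: the basis c_p of a branching vector p = (p1, …, pq) is the positive root of 1 = Σ_t c^{−p_t}. A small sketch that finds this root by bisection (the function is ours, for verification only):

```python
def branching_basis(p, lo=1.0, hi=10.0, eps=1e-9):
    """Positive root c of 1 = sum_t c^(-p_t) for branching vector p;
    a search tree with branching vector p has size O(c^k)."""
    f = lambda c: 1.0 - sum(c ** (-pt) for pt in p)
    # f is increasing in c and negative at c = 1 for q >= 2 branches
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(branching_basis((1, 1, 1)), 3))      # 3.0
print(round(branching_basis((1, 1, 2, 1)), 3))   # 3.303, i.e. <= 3.31
```

For p = (1, 1, 2, 1), the root of c² = 3c + 1 is (3 + √13)/2 ≈ 3.303, which is why 3.31 suffices as a uniform worst-case basis.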
Now, suppose that we run Algorithm LAPCS with sequences S1, S2 and parameters k1, k2. As before, let k1′ and k2′ be the numbers of deletions in S1 and S2 needed to find an lapcs. As pointed out at the beginning of this section, the search tree will be traversed in a breadth-first manner. Hence, on the one hand, we may stop the computation if, at some search tree node, an lapcs is found (even though the current parameters at this node may be nonzero). On the other hand, if it is not possible to find an lapcs with k1 and k2 deletions, then the algorithm terminates automatically by Case (0). Observe that the time needed in each search tree node is upper-bounded by O(n) if both sequences S1 and S2 have length at most n. This gives a total running time of O(3.31^{k1+k2} · n) for the algorithm. The following theorem summarizes the results of this section.

Theorem 6.2 The problem LAPCS(nested, nested) for two sequences S1 and S2 with |S1|, |S2| ≤ n can be solved in time O(3.31^{k1+k2} · n), where k1 and k2 are the numbers of deletions needed in S1 and S2.
Chapter 7

Arc-Preserving Subsequence Problems

In this chapter, we deal with the Arc-Preserving Subsequence problem (APS). This problem can be encountered when we search for a certain RNA pattern in an RNA database, or when one of the two parameters of the algorithm in the previous chapter is equal to zero. The problem is NP-complete whenever one of the input sequences has an unlimited arc structure or the arc structures of both sequences are crossing [11]. To the best of our knowledge, the complexity of this problem for the arc structures (crossing, nested), (crossing, chain), (crossing, plain), and (nested, nested) seems not to have been investigated prior to this work. In Section 7.1, we show that APS(crossing, chain) is NP-hard. This implies that APS(crossing, nested) is NP-hard, too. Section 7.2 is concerned with an algorithm which solves APS(nested, nested) in polynomial time.
7.1 NP-Hardness of APS(crossing, chain)

The NP-hardness of LAPCS(crossing, crossing) can be shown by a reduction from clique [11]. We use a similar construction to find a reduction from independent set to APS(crossing, chain). From a given independent set instance, we construct an instance for the APS(crossing, chain) problem consisting of two arc-annotated sequences, S1 and S2, where S2 represents the graph and S1 represents an independent set of size k. Since the information about the edges in the graph must also be incorporated in S2, we use a fragment of S2¹ with length equal to the number of vertices of the graph in order to encode a vertex, and we use arcs between fragments in order to encode the edges. A similar concept can be used to build S1, but in S1 there is no arc between fragments, because it represents an independent set. Then, the question whether the graph has an independent set of size k can be transformed into the question whether S1 is an arc-preserving subsequence of S2.
Figure 7.1: APS(crossing, chain). [The figure shows a graph G with vertices 1, …, 5, the sequences S1 = baaaaabbaaaaabbaaaaab and S2 = baaaaabbaaaaabbaaaaabbaaaaabbaaaaab with their arc annotations, and an alignment of S1 in S2.]

A small example of this construction is illustrated in Figure 7.1. The graph G in Figure 7.1 has 5 vertices and we want to know if G has an independent set of size 3, i.e., n = 5 and k = 3. We construct the sequence S2 with five fragments; each fragment has five a symbols and two b symbols. The two b symbols of the same fragment are joined by an arc. They are used to separate fragments from each other. Each edge in G is represented by an arc between two a symbols from
¹ Note that we use "fragment of S2" here to denote a substring of S2 together with the arcs of S2 between two bases in this substring.
two different fragments in S2. For example, edge {1, 2} has a corresponding arc between the second a in the first fragment and the first a in the second fragment. Sequence S1, which should represent a graph with k vertices and no edges, has only three such fragments, and none of the a symbols in S1 is an endpoint of an arc. The alignment of S1 in S2 indicates that the three fragments of S1 can be matched to the first, the third, and the fifth fragment of S2. Therefore, the graph G has an independent set of size three, namely {1, 3, 5}. Note that S1 has a chain arc structure and S2 has a crossing arc structure.

In the following, we give a more detailed description of the construction. Given a graph G = (V, E), where V = {v1, v2, …, vn}, we first construct the shorter sequence S1 with chain arc structure representing an independent set of size k in the same way as illustrated in Figure 7.1. For each vertex in the independent set, we create a fragment in S1 which consists of n symbols a and two symbols b. S1 has length k(n + 2). The two b symbols form the beginning and the end of the fragment. We join the pair of b's with an arc. Because we want to encode an independent set by S1 and no two vertices in an independent set are joined by an edge, there is no arc between the a's in S1. Therefore, S1 has a chain arc structure.

Next, we construct the second sequence S2 to encode G in a similar way. For each vertex in V, we again create a fragment of n symbols a and two symbols b. The length of S2 is n(n + 2). The two b's are again joined by an arc. If there is an edge between two vertices vi and vj in G, we impose an arc between the ith and the jth fragment in S2. This arc, which represents the edge between vi and vj, starts at the jth a in the ith fragment and ends at the ith a in the jth fragment. Since the arcs between two a's run between two different fragments, they necessarily cross the arcs between the b's of these fragments. Thus, the arc structure of S2 is crossing. The construction above can be formally described by:

S2 = (ba^n b)^n,
A2 = {((i − 1)(n + 2) + 1, i(n + 2)) | vi ∈ V} ∪ {((i − 1)(n + 2) + j + 1, (j − 1)(n + 2) + i + 1) | {vi, vj} ∈ E};
S1 = (ba^n b)^k,
A1 = {((i − 1)(n + 2) + 1, i(n + 2)) | 1 ≤ i ≤ k}.
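The construction can be sketched directly from these formulas (arc endpoints are 1-based positions; the edge set used below is only a hypothetical example, since the text does not list the full edge set of Figure 7.1):

```python
def build_aps_instance(n, k, edges):
    """APS(crossing, chain) instance from the reduction: S2 = (b a^n b)^n
    encodes the graph, S1 = (b a^n b)^k an edgeless graph on k vertices."""
    fragment = "b" + "a" * n + "b"
    s2, s1 = fragment * n, fragment * k
    # arcs between the two b's of every fragment of S2
    a2 = {((i - 1) * (n + 2) + 1, i * (n + 2)) for i in range(1, n + 1)}
    # one arc per edge {v_i, v_j}, i < j: the j-th a of fragment i
    # to the i-th a of fragment j
    a2 |= {((i - 1) * (n + 2) + j + 1, (j - 1) * (n + 2) + i + 1)
           for (i, j) in edges}
    a1 = {((i - 1) * (n + 2) + 1, i * (n + 2)) for i in range(1, k + 1)}
    return s1, a1, s2, a2

# n = 5, k = 3 as in Figure 7.1; this edge set is an assumption
s1, a1, s2, a2 = build_aps_instance(5, 3, {(1, 2), (2, 4), (4, 5)})
print(s1)            # baaaaabbaaaaabbaaaaab
print((3, 9) in a2)  # True: edge {1, 2} spans the 2nd a of fragment 1
                     # and the 1st a of fragment 2
```

As a sanity check, the arc for edge {1, 2} lands at positions (3, 9), matching the text: the second a of the first fragment and the first a of the second fragment.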
Based on the construction, there are only two types of base matches: matches between a symbols and matches between b symbols. In order to verify whether S1 is an arc-preserving subsequence of S2, we observe that a fragment in S1 can only be matched to an entire fragment in S2, because of the arc between the two b symbols, which marks the beginning and the end of the fragment. If there are no edges between vertices in G, which implies that S2 has no arc between a symbols, then S1 certainly is an arc-preserving subsequence of S2. If G has some edges and an independent set I of size k, where k < n, S1 has length k(n + 2). According to the construction above, S2 then has arcs between pairs of a symbols, one of which lies in a fragment representing a vertex in I and the other in a fragment representing a vertex in V \ I. Nevertheless, no arcs in S2 are imposed between two a's in the same fragment or in two different fragments which correspond to two vertices in I. Thus, after deleting the fragments which correspond to the vertices in V \ I, S2 also has no arcs between a symbols. Now we have two identical sequences and can affirm that S1 is an arc-preserving subsequence of S2. Thus, this construction reduces independent set to APS(crossing, chain). Since the lengths of the sequences are bounded by a polynomial in n, the construction can be done in polynomial time.

Lemma 7.1 independent set is polynomially reducible to APS(crossing, chain).

Proof. We have shown above that independent set can be transformed to APS(crossing, chain) and that the construction works in polynomial time. The only thing that we must prove is that the graph G = (V, E) has an independent set of size k if and only if S1 is an arc-preserving subsequence of S2.

=⇒: Let V′ ⊆ V be an independent set in G of size k. Each vertex u ∈ V′ corresponds to a fragment of S2. We match each of the fragments which correspond to the vertices in V′ to an entire fragment of S1 and denote this matching as MS. Because there is no arc between two a symbols in two such fragments of S2, in MS we only have arcs between two b symbols from the same fragment and no arcs between a symbols. Thus, MS is an arc-preserving matching, and there are exactly k such fragments in S2. Hence, S1 is an arc-preserving subsequence of S2.

⇐=: Assume that S1 is an arc-preserving subsequence of S2. The linked pairs of b symbols in both sequences enforce the matching of symbols which come from only k fragments of S2. Since the fragments in S1 are not linked with each other and S1 is an arc-preserving subsequence of S2, there is also no arc between the k selected fragments of S2. Since, according to the construction, every edge in G results in an arc linking the two corresponding fragments in S2, the vertices in G which correspond to these k selected fragments in S2 cannot be joined by edges. Hence, these k vertices form an independent set of G. □
Theorem 7.2 APS(crossing, chain) is NP-complete.

Proof. The NP-completeness of APS(crossing, chain) follows directly from Lemma 7.1 and the fact that independent set is NP-complete. □
The NP-completeness result for APS(crossing, chain) implies that the APS problem for the arc structure combination (crossing, nested) is also NP-complete, which answers two open questions in Table 4.6.
7.2 APS(nested, nested)

In this section, we investigate the APS problem for the arc structure (nested, nested) and describe an algorithm which solves this problem in time O(n^4 m), where n is the length of the shorter sequence and m is the length of the longer sequence. Note that, in the fixed-parameter algorithm for the NP-complete LAPCS(nested, nested) in Chapter 6, we obtain an APS(nested, nested) problem if one of the parameters becomes 0. Using this polynomial time algorithm, the size of the search tree can be significantly reduced, because there is no need to branch at search tree nodes with one parameter equal to 0.
Assume that we have an instance of APS(nested, nested), (S1, A1) and (S2, A2). The sequences are over some alphabet Σ. We denote the lengths of S1 and S2 by n and m, respectively, and we assume that S1 is the shorter sequence, i.e., n ≤ m. As shown in [17], the problems LAPCS(nested, chain) and LAPCS(nested, plain) can be solved by a dynamic programming algorithm in time O(n^3 m). Since the APS problem is easier than the LAPCS problem, we can also use this algorithm to solve the problems APS(nested, chain) and APS(nested, plain) in polynomial time. With nested arc structures in both sequences, we use the algorithm from [17] recursively from the inner arcs to the outer arcs in S1. We want to find, for all arcs in S1, all possible matching arcs in S2 and mark these possible candidates for the arcs in S1 with some new letters which are not in Σ. These markings enable us to treat the arcs in S1 and the substrings inside these arcs as black boxes, whereby the nested arc structure of S1 can be resolved into a chain structure. At the end, we have an instance of APS(nested, chain), which the dynamic programming algorithm from [17] for LAPCS(nested, chain) can solve in polynomial time.

Before giving a formal description of the algorithm in detail, we define some notation. As mentioned above, we will use the dynamic programming algorithm from [17] to determine whether the sequence inside an arc of S1 is an arc-preserving subsequence of the sequence inside an arc of S2. We call this algorithm DPA. The cutwidth of S1 is denoted by d (for cutwidth, see Definition 3.22). We can divide the arc set A1 of S1 into d subsets A1^1, …, A1^d. A1^1 includes the arcs which are not inside any other arcs. The arcs in A1^t are directly inside the arcs from A1^{t−1}, which means that, in the nested arc structure, there are no arcs between an arc from A1^{t−1} and an arc from A1^t. It can easily be seen that the arcs from the same subset form at most a chain arc structure and that each arc in A1 can belong to only one of the d subsets. If we use k to denote the size of A1 and k_i to denote the size of a subset A1^i, then we have k = k_1 + … + k_d. In order to identify the arcs of S1, we use a set of symbols to denote the arcs in A1. This set has k symbols, a_1, a_2, …, a_k. Because the subsets of A1 have no common element, we can assume that a
Figure 7.2: An instance of APS(nested, nested). S1 has 5 arcs, denoted by the symbols a1, a2, a3, a4, and a5. The cutwidth of S1 is 2. Hence, the arcs can be divided into 2 groups. The first group has 2 arcs, a4 and a5, which are not inside any arcs. We call these arcs external arcs. The other three arcs, a1, a2, and a3, which are inside a4 or a5, form the second group. They are called internal arcs. The arcs a1 and a3 can be matched to the arcs b1, b2, and b4, where b2 is inside b4. However, the matches between a1 or a3 and b2 are more advantageous for the other bases of S1 than the matches between a1 or a3 and b4, because more bases of S2 are left to be matched to other bases of S1. For example, the match between a1 and b4 excludes the possible match between the arcs a2 and b3, while the match between a1 and b2 does not. Therefore, we ignore the arc matches between a1 or a3 and b4. The arc a2 can only be matched to the arc b3.
subset A1^i is given by A1^i = {a_{i+δ} | δ = 0, …, k_i − 1}. In the following, we will use an example, illustrated in Figure 7.2, to explain the steps of the algorithm.
Algorithm for APS(nested, nested)

This algorithm has three phases. The first phase checks and replaces the innermost arcs, i.e., the arcs in A1^d. The second phase uses DPA to process and resolve the arcs in the subsets A1^i, i = d − 1, …, 1, until S1 has a chain arc structure. In the last phase, we then use DPA to verify whether the remaining S1 is an arc-preserving subsequence of S2.
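The partition of a nested arc set A1 into the subsets A1^1, …, A1^d by nesting depth, which all three phases rely on, can be sketched as follows (a simple quadratic-time illustration, not the thesis's own procedure):

```python
def depth_groups(arcs):
    """Partition a nested arc set into groups by nesting depth:
    group 1 holds the outermost (external) arcs, group t the arcs
    directly inside arcs of group t-1."""
    groups = {}
    for (l, r) in arcs:
        # depth = 1 + number of arcs strictly enclosing (l, r)
        depth = 1 + sum(1 for (l2, r2) in arcs if l2 < l and r < r2)
        groups.setdefault(depth, []).append((l, r))
    return [sorted(groups[t]) for t in sorted(groups)]

# arc positions shaped like Figure 7.2: two external, three internal arcs
print(depth_groups([(1, 10), (2, 5), (6, 9), (12, 20), (13, 16)]))
# [[(1, 10), (12, 20)], [(2, 5), (6, 9), (13, 16)]]
```

The number of groups equals the cutwidth d, and each group indeed forms at most a chain arc structure.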
Phase 1: For each arc a_{d+δ} in A1^d with left endpoint S1[i1] and right endpoint S1[i2], 0 ≤ δ ≤ k_d − 1, we search for all arcs in A2 whose corresponding endpoints are the same as S1[i1] and S1[i2] and denote the resulting arc set by A2^{d+δ}. The set A2^{d+δ} has at least one element; otherwise, S1 cannot be an arc-preserving subsequence of S2. Assume that an arc (j1, j2) is in A2^{d+δ}. Because the arc structure of the sequence S1[i1 + 1, i2 − 1] must be plain and the sequence S2[j1 + 1, j2 − 1] has an at most nested arc structure, we use DPA to check whether S1[i1 + 1, i2 − 1] is an arc-preserving subsequence of S2[j1 + 1, j2 − 1]. If the answer is negative, we delete the arc (j1, j2) from the set A2^{d+δ}. If the set A2^{d+δ} is empty after all arcs in A2^{d+δ} have been checked, S1 cannot be an arc-preserving subsequence of S2, because for the arc a_{d+δ} we cannot find a matching arc in S2. If the set is not empty, we replace S1[i1, i2] by an arc connecting the symbols x_{d+δ} and y_{d+δ}; this arc is also inserted directly outside all arcs in A2^{d+δ}. The bases x_{d+δ} and y_{d+δ} are new symbols which do not come from Σ. For example, for an arc (j1, j2) in A2^{d+δ}, we insert x_{d+δ} directly before S2[j1] and y_{d+δ} directly behind S2[j2] and join the two new symbols x_{d+δ} and y_{d+δ} by an arc. Figure 7.3 illustrates the example after this step.

Note that an arc in A1^d can have more than one possible matching arc in S2. If some of these possible matching arcs form a nested structure, i.e., one arc is inside another arc, we consider only the innermost arcs, because a match between an arc of S1 and an innermost arc of S2 leaves more bases of S2 for the other bases of S1. The arc a1 in Figure 7.2 is an example of such an arc of S1. It has three possible matching arcs in S2. Two among them, b2 and b4, form a nested structure. Hence, we consider only the match between a1 and b2.

Note that the original two sequences are changed after this phase. The innermost arcs of S1 are replaced, together with the bases inside them, by new arcs with endpoints not in Σ; the same new arcs are also inserted into S2 directly outside the appropriate arcs of S2. We use S1^d and S2^d to denote the new sequences. After all arcs in A1^d have been processed, we go to the second phase.

Phase 2: In this phase, we deal with the remaining arcs of the original sequence S1 in a recursive manner, starting with the arcs in A1^{d−1} and continuing up to the arcs in A1^1. Each iteration processes one subset. In the (d − i)th iteration,
Figure 7.3: The instance after the processing of the internal arcs a1, a2, and a3. The arcs a1, a2, and a3 and the substrings inside them are now replaced by the three arcs (x1, y1), (x2, y2), and (x3, y3). Note that there are no bases between the endpoints of the three new arcs. The sequence S2 is extended by the arcs (x1, y1), (x2, y2), and (x3, y3), which record all possible arc matches involving the internal arcs of S1.
1 ≤ i ≤ d − 1, the arcs in the subset A1^i are processed in a similar way as the arcs in A1^d, i.e., we search for and mark all possible matching arcs in S2 for them and delete them afterwards. However, there are two differences between the processing of the arcs in these subsets and the processing of the arcs in A1^d. The first is that the sequences inside the arcs in A1^i, i < d, are not of plain arc structure. They can have arcs with both endpoints not in Σ, namely the replacements of the arcs in A1^{i+1} inside the arcs in A1^i. Note that, while processing the subset A1^i, all arcs inserted before processing A1^{i+1} have been deleted. As we have mentioned, the arcs in A1^{i+1} form at most a chain structure. Hence, their replacements can also be of at most chain structure. However, the DPA algorithm can be applied to the arc structure (nested, chain), too. The second difference relates to the arcs inserted into the longer sequence as markings for the arcs in A1^{i+1}. Since all arcs in A1^{i+1} are inside the arcs in A1^i, their matching arcs must also be inside the matching arcs of the arcs in A1^i. While searching for possible matching arcs for the arcs in A1^i, we take into account the markings for the arcs in A1^{i+1}, which record the possible matching arcs for the arcs in A1^{i+1}. If we have found and marked, for each arc in A1^i, all possible matching arcs in A2, then these newly inserted marking arcs represent not only that the arcs in A1^i have a possibly
Figure 7.4: The instance after the processing of the external arc a4. In Figure 7.3, we can see that the arc a4 can be matched to the arc b4 in S2 and that the substring inside a4 is of chain arc structure; this is the first difference between the first phase and the second phase. We replace a4 and the substring inside it by an arc (x4, y4) and add the same arc outside b4. It is also clear that the arcs (x1, y1) and (x2, y2) are no longer needed, so we can delete them; this is the second difference.
matching arc but also that all arcs in A1^{i+1} have possibly matching arcs inside the marking arcs. This means that, for the following iterations, we do not need the markings for the arcs in A1^{i+1}. Hence, we can delete from S2 all arcs which were inserted while processing A1^{i+1}. These two differences are also illustrated in Figure 7.4.

Phase 3: After all arcs in A1 have been processed, we have only a chain arc structure in S1, while the arc structure of S2 remains nested throughout this algorithm. Therefore, DPA can find out whether the remaining sequence of S1 is an arc-preserving subsequence of S2. If it is, then the original S1 must also be an arc-preserving subsequence of S2. (Note that, if we make a copy of S2 before deleting arcs from S2 in each iteration of the second phase, we can also trace back the positions in S2 where each base of S1 has a match.)
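The marker insertion used in Phase 1 and in each Phase 2 iteration can be sketched as follows (1-based arc positions; the candidate arcs are assumed to be non-overlapping, as after the innermost-arc filtering; the function name is ours):

```python
def mark_candidates(s2, candidates, x, y):
    """Insert the fresh endpoints x, y directly before S2[j1] and directly
    behind S2[j2] for every surviving candidate arc (j1, j2) (1-based,
    assumed non-overlapping), and return the new marker arcs."""
    chars, marker_arcs, shift = list(s2), [], 0
    for (j1, j2) in sorted(candidates):      # left to right
        chars.insert(j2 + shift, y)          # directly behind S2[j2]
        chars.insert(j1 - 1 + shift, x)      # directly before S2[j1]
        marker_arcs.append((j1 + shift, j2 + shift + 2))
        shift += 2                           # later positions move by 2
    return "".join(chars), marker_arcs

new_s2, arcs = mark_candidates("gaucg", [(2, 4)], "x", "y")
print(new_s2)  # gxaucyg
print(arcs)    # [(2, 6)]
```

Each candidate adds exactly two symbols, which is what makes the length bound |S2^d| ≤ m + 2nm in the running time analysis below work out.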
Correctness of the algorithm: The DPA algorithm can verify whether a sequence with plain or chain arc structure is an arc-preserving subsequence of a sequence with nested arc structure. Since the first phase and all iterations of the second phase involve only instances of APS(nested, chain) or APS(nested, plain), by using DPA we can find out recursively, from the inner arcs to the outer arcs, whether the arcs of S1 together with the subsequences inside them are arc-preserving subsequences of some subsequences of S2. After DPA gives positive answers for all arcs in A1^1, we then treat S1 in its entirety. By using DPA in each phase and in each iteration, the one-to-one and order-preserving properties are checked for each subsequence of S1 and, at the end, for the entire S1. The recursive way of processing the arcs in A1 corresponds exactly to the nested arc structure and guarantees that two matching arcs in A2 for two nested arcs in A1 also form a corresponding nested structure. This implies that the arc structure of S1 is also preserved.
Running time analysis: After the first phase and after each iteration of the second phase, the two sequences in our instance are changed. We use S1^d to denote the first sequence and S2^d to denote the second sequence after the arcs in A1^d have been processed, namely after the first phase. Similarly, S1^i denotes the first sequence and S2^i the second sequence after the arcs in A1^i have been processed. Further notation employed in the running time analysis is given in the following list:

• n_{i+δ}: the length of the δth arc of A1^i in the sequence S1^{i+1}. (Note that, for our purpose, the length of an arc does not include the endpoints of this arc. For example, an arc a_{i+δ} with endpoints S1^{i+1}[i1] and S1^{i+1}[i2] has length i2 − i1 − 1.)
• A2^{i+δ}: the set of arcs in A2 which are possible matching arcs for the arc a_{i+δ}.
• l_{i+δ}: the size of the set A2^{i+δ}.
• b_{i+δ}^{(j)}: the jth arc in the set A2^{i+δ}.
• m_{i+δ}^{(j)}: the length of the arc b_{i+δ}^{(j)} in the sequence S2^{i+1}.
Phase 1: In this phase, we use the DPA algorithm to check whether the sequences inside the innermost arcs in A_1^d are arc-preserving subsequences of the sequences inside arcs of S_2. Owing to its dynamic programming nature, DPA needs at most O((n_{d+δ})^3 · m_{d+δ}^{(j)}) time to test a matching arc b_{d+δ}^{(j)} in S_2 for an arc a_{d+δ} in A_1^d. Since 0 ≤ δ ≤ k_d − 1 and 1 ≤ j ≤ l_{d+δ}, the total time for the first phase sums to

Σ_{δ=0}^{k_d−1} Σ_{j=1}^{l_{d+δ}} O((n_{d+δ})^3 · m_{d+δ}^{(j)}).
Phase 2: This phase consists of (d − 1) iterations of the first phase with the two differences mentioned above. The first difference, namely that the subsequences to be checked in this phase can have a chain structure, does not affect the running time of the DPA algorithm, because DPA works for both (nested, chain) and (nested, plain) in time O(n^3 · m). However, by inserting new arcs into the second sequence in each iteration, the length of S_2 grows by the number of endpoints of the new arcs. We must upper-bound this growth; otherwise we cannot obtain a polynomial-time algorithm in the end. The second difference, the deletion of the marking arcs from previous iterations, provides such an upper bound. In the first iteration, i.e., for the subset A_1^{d−1}, the two sequences have the following lengths:

|S_1^d| = n − Σ_{δ=0}^{k_d−1} n_{d+δ},

|S_2^d| = m + 2 · Σ_{δ=0}^{k_d−1} l_{d+δ}.

Since k_d ≤ n and l_{d+δ} ≤ m, we obtain |S_2^d| ≤ m + 2nm. At the beginning of the ith iteration, which processes the subset A_1^{d−i}, the arcs inserted into the second sequence in the (i − 1)th iteration are deleted. There are then no arcs in the second sequence that are not in S_2, and thus the second sequence is again identical to the original sequence S_2. After the new arcs representing the matching possibilities for arcs in A_1^{d−i} have been inserted into the second sequence, we obtain S_2^{d−i}, whose length equals |S_2| plus the number of endpoints of the new arcs inserted in this iteration:

|S_2^{d−i}| = m + 2 · Σ_{δ=0}^{k_{d−i}−1} l_{d−i+δ} ≤ m + 2nm.

The upper bound on the length of S_2^{d−i} is thus the same as the one for S_2^d. The first sequence becomes shorter after each iteration:

|S_1^{d−i}| = |S_1^{d−i+1}| − Σ_{δ=0}^{k_{d−i}−1} n_{d−i+δ}.
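Because each iteration deletes the previous iteration's marking arcs before inserting its own, the bound |S_2^{d−i}| ≤ m + 2nm holds uniformly across iterations. A small numeric sanity check of this bookkeeping (the arc counts below are invented purely for illustration):

```python
# Hypothetical instance sizes: n, m, and the number l[i][delta] of
# possible matching arcs found for each arc at each level.
n, m = 10, 50
l = [[7, 3], [50, 12, 4], [9]]  # each entry <= m, each level has <= n arcs

length_s2 = m
for level in l:
    length_s2 = m                # previous markings deleted: back to |S2| = m
    length_s2 += 2 * sum(level)  # two endpoints per newly inserted arc
    assert length_s2 <= m + 2 * n * m  # the uniform bound from the text
print(length_s2)  # prints: 68 (length after the last level's insertions)
```

Without the deletion step, the inserted endpoints would accumulate across all d − 1 iterations, and the length of the second sequence would no longer be bounded by m + 2nm.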
For an arc a_{d−i+δ} in A_1^{d−i}, 0 ≤ δ ≤ k_{d−i} − 1, the DPA algorithm needs time

Σ_{j=1}^{l_{d−i+δ}} O((n_{d−i+δ})^3 · m_{d−i+δ}^{(j)})

to find all possible matching arcs. Note that n_{d−i+δ} and m_{d−i+δ}^{(j)} are the lengths of arc a_{d−i+δ} and arc b_{d−i+δ}^{(j)} in the sequences S_1^{d−i+1} and S_2^{d−i+1}, respectively. Hence, the ith iteration takes total time

Σ_{δ=0}^{k_{d−i}−1} Σ_{j=1}^{l_{d−i+δ}} O((n_{d−i+δ})^3 · m_{d−i+δ}^{(j)}).

The deletion of the arcs inserted in the (i − 1)th iteration and the insertion of the new arcs can be done in time O(nm). Putting the times for all d − 1 iterations together, we obtain the total time for the second phase:

Σ_{i=1}^{d−1} Σ_{δ=0}^{k_i−1} Σ_{j=1}^{l_{i+δ}} O((n_{i+δ})^3 · m_{i+δ}^{(j)}).
After all arcs in A_1^1 have been processed, we are left with two sequences of the following lengths for the third phase:

|S_1^1| = |S_1^2| − Σ_{δ=0}^{k_1−1} n_{1+δ},

|S_2^1| = m + 2 · Σ_{δ=0}^{k_1−1} l_{1+δ} ≤ m + 2nm.

Phase 3: In this phase, none of the original arcs of S_1 remain. Thus, we use the DPA algorithm only once, on S_1^1 and S_2^1, and the running time is O(|S_1^1|^3 · |S_2^1|).
The overall running time of the whole algorithm is then the sum of the running times of the three phases:

Σ_{δ=0}^{k_d−1} Σ_{j=1}^{l_{d+δ}} O((n_{d+δ})^3 · m_{d+δ}^{(j)})   (Phase 1)

+ Σ_{i=1}^{d−1} Σ_{δ=0}^{k_i−1} Σ_{j=1}^{l_{i+δ}} O((n_{i+δ})^3 · m_{i+δ}^{(j)})   (Phase 2)

+ O(|S_1^1|^3 · |S_2^1|)   (Phase 3).
Using the following three conditions:

• regarding Phase 1: Σ_{j=1}^{l_{d+δ}} m_{d+δ}^{(j)} ≤ |S_2| = m,

• regarding Phase 2: Σ_{j=1}^{l_{i+δ}} m_{i+δ}^{(j)} ≤ |S_2^i| = m + 2 · Σ_{δ=0}^{k_i−1} l_{i+δ} ≤ m + 2nm, for each iteration, 1 ≤ i ≤ d − 1, and

• regarding Phase 3: |S_2^1| ≤ m + 2nm,

we can upper-bound the length of the second sequence in all phases by m + 2nm. The overall running time is then bounded by

Σ_{δ=0}^{k_d−1} O((n_{d+δ})^3 · (m + 2nm))   (Phase 1)

+ Σ_{i=1}^{d−1} Σ_{δ=0}^{k_i−1} O((n_{i+δ})^3 · (m + 2nm))   (Phase 2)

+ O(|S_1^1|^3 · (m + 2nm))   (Phase 3).
Since the subsets A_1^1, …, A_1^d are disjoint and n_{i+δ} is defined as an arc's length in sequence S_1^{i+1}, we have a fourth condition:

Σ_{i=1}^{d} Σ_{δ=0}^{k_i−1} n_{i+δ} + |S_1^1| = n.

Therefore, we can conclude that the total running time of the algorithm cannot exceed O(n^4 · m). The next theorem summarizes the result of this section.

Theorem 7.3 APS(nested, nested) can be solved in time O(n^4 · m), where n is the length of the shorter sequence and m is the length of the longer sequence.
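The last step folds the per-arc sums into a single bound: since the n_{i+δ} and |S_1^1| sum to n, the superadditivity of x ↦ x^3 on nonnegative integers gives Σ (n_{i+δ})^3 + |S_1^1|^3 ≤ n^3, and multiplying by the uniform factor m + 2nm = O(nm) yields O(n^4 · m). A quick numeric check of that inequality on an arbitrary partition of n (the part sizes are made up):

```python
# Any partition of n into nonnegative parts: the sum of cubes is at
# most the cube of the sum, so the per-arc n^3 terms collapse into a
# single n^3 factor in the running time bound.
parts = [3, 1, 4, 2]  # hypothetical arc lengths together with |S_1^1|
n = sum(parts)        # here n = 10
assert sum(x**3 for x in parts) <= n**3
print(sum(x**3 for x in parts), n**3)  # prints: 100 1000
```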
Chapter 8

Conclusions

8.1 Summary of Results
In this study, we considered the LAPCS problem for arc-annotated sequences with various types of arc structures, a problem motivated by biological structure comparison. The classical and parameterized complexity results with various parameters known prior to this work were summarized in Chapter 4.

The present work examined a new aspect of the c-fragment and c-diagonal LAPCS problems, namely their parameterized complexity with the length l of the desired subsequence as the parameter. In particular, we provide an algorithm that solves c-fragment LAPCS(crossing, crossing) in time O((B + 1)^l · B^2 + c^3 · n), where B = c^2 + 2c − 1, and c-diagonal LAPCS(crossing, crossing) in time O((B + 1)^l · B^2 + c^3 · n), where B = 2c^2 + 7c + 2. This shows that these problems are fixed-parameter tractable when neither of the two sequences has an unlimited arc structure. The algorithm can also be extended to solve restricted versions of c-fragment and c-diagonal LAPCS(unlimited, unlimited) in which both sequences have a bounded degree b; it then works in time O((B + 1)^l · B^2 + (c^3 + 2bc^2) · n), where B = c^2 + 2bc − 1, and in time O((B′ + 1)^l · B′^2 + (c^3 + 2bc^2) · n), where B′ = 2c^2 + (4b + 3)c + 2b, respectively. Lin et al. [20] have shown that c-fragment and c-diagonal LAPCS(nested, nested) admit a PTAS. Our algorithm, however, is the first fixed-parameter solution for c-fragment and c-diagonal LAPCS, even for the more general arc structure (crossing, crossing).

The parameterized complexity of the LAPCS(nested, nested) problem with the length l of the desired subsequence as the parameter is still unknown. On the positive side, we have shown in this work that this problem is fixed-parameter tractable when parameterized by k_1 and k_2, the numbers of deletions allowed in the two sequences. We designed an algorithm that finds a longest arc-preserving common subsequence in time O(3.31^{k_1+k_2} · n). This algorithm works efficiently for small values of k_1 and k_2 and provides a starting point for future work on the LAPCS problem with the length l of the desired subsequence as the parameter (note that l = |S_1| − k_1 = |S_2| − k_2).

The APS problem for arc-annotated sequences had not been discussed explicitly before this work. Because the LAPCS problem is known to be NP-complete and W[1]-complete for many arc structures, examining the exact-matching version of the problem is of practical significance: algorithms verifying that one arc-annotated sequence is an arc-preserving subsequence of another can be used for pattern searching in DNA/RNA databases. In this work, we have shown that APS is also NP-complete if one of the sequences has an unlimited arc structure, or if one sequence has a crossing arc structure and the other has an arc structure higher than plain. An algorithm was given that solves the APS(nested, nested) problem in time O(n^4 · m), where n is the length of the shorter sequence and m is the length of the longer sequence.
8.2 Future Work

The obvious next step is to optimize the algorithms in this work. For example, the algorithm for LAPCS(nested, nested) with k_1 and k_2 as parameters is not very efficient for large values of k_1 and k_2. The algorithm for APS(nested, nested) is based on the algorithm from [17], which was designed for LAPCS(nested, plain) and LAPCS(nested, chain) and has a running time of O(n^3 · m). We are confident that the problems APS(nested, plain) and APS(nested, chain) can be solved more efficiently than the corresponding LAPCS problems. If so, making the degree of the polynomial as small as possible remains a research issue. It is also a topic for future investigation to study the practical usefulness of our algorithms through implementations and experiments.

As we observed in Chapter 4, there are still many unsolved LAPCS problems for arc-annotated sequences. One of the most interesting open questions is to determine the parameterized complexity of LAPCS(nested, nested) when parameterized by the length of the desired subsequence. Another interesting research topic would be to examine restricted versions of the LAPCS problem; for example, we could allow the resulting longest common subsequence to have a fixed number of arc mismatches. Algorithms for this restricted LAPCS may provide a compromise between the NP-hardness of LAPCS and the need for efficient exact solutions.
Bibliography

[1] J. Alber, J. Gramm, and R. Niedermeier. Faster exact solutions for hard problems: a parameterized point of view. Discrete Mathematics, 229:3–27, 2001.

[2] R.A. Baeza-Yates. Searching subsequences. Theoret. Computer Sci., 78:363–376, 1991.

[3] V. Bafna, S. Muthukrishnan, and R. Ravi. Comparing similarity between RNA strings. Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, LNCS 937:1–16, 1995.

[4] H.L. Bodlaender, R.G. Downey, M.R. Fellows, M.T. Hallett, and H.T. Wareham. Parameterized complexity analysis in computational biology. CABIOS, 11(1):49–57, 1994.

[5] H.L. Bodlaender, R.G. Downey, M.R. Fellows, and H.T. Wareham. The parameterized complexity of sequence alignment and consensus. Theoret. Computer Science, 147:31–54, 1995.

[6] P. Bonizzoni, G. Della Vedova, and G. Mauri. Experimenting an approximation algorithm for the LCS. Discrete Applied Mathematics, 110:13–24, 2001.

[7] J. Chen, I.A. Kanj, and W. Jia. Vertex cover: further observations and further improvements. Journal of Algorithms, 41:280–301, 2001.
[8] F. Corpet and B. Michot. RNAlign program: alignment of RNA sequences using both primary and secondary structures. Comput. Appl. Biosci., 10:389–399, 1994.

[9] R.G. Downey and M.R. Fellows. Parameterized Complexity. Springer-Verlag New York Inc., 1999.

[10] P.A. Evans. Finding common subsequences with arcs and pseudoknots. Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching, LNCS 1645:270–280, 1999.

[11] P.A. Evans. Algorithms and Complexity for Arc-Annotated Sequence Analysis. PhD Thesis, University of Victoria, 1999.

[12] M.R. Fellows. Parameterized complexity: the main ideas and some research frontiers. Proc. of 12th ISAAC, LNCS 2223:291–307, 2001.

[13] D. Goldman, S. Istrail, and C.H. Papadimitriou. Algorithmic aspects of protein structure similarity. Proc. of 40th IEEE FOCS, pages 512–521, 1999.

[14] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[15] D.S. Hirschberg. The Longest Common Subsequence Problem. PhD Thesis, Princeton University, 1975.

[16] R.W. Irving and C.B. Fraser. Two algorithms for the longest common subsequence of three (or more) strings. Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching, LNCS 644:214–229, 1992.

[17] T. Jiang, G.H. Lin, B. Ma, and K.Z. Zhang. The longest common subsequence problem for arc-annotated sequences. Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, LNCS 1848:154–165, 2000.
[18] G. Lancia, R. Carr, B. Walenz, and S. Istrail. 101 optimal PDB structure alignments: a branch-and-cut algorithm for the maximum contact map overlap problem. Proc. of 32nd ACM STOC, pages 425–434, 2000.

[19] M. Li, B. Ma, and L. Wang. Near optimal multiple alignment within a band in polynomial time. Cambridge University Press, 1997.

[20] G.H. Lin, Z.Z. Chen, T. Jiang, and J.J. Wen. The longest common subsequence problem for sequences with nested arc annotations. Proceedings of the 28th International Colloquium on Automata, Languages and Programming, LNCS 2076:444–455, 2001.

[21] S.Y. Lu and K.S. Fu. A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans. Syst., Man, and Cybern., 8:381–389, 1978.

[22] D. Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25:322–336, 1978.

[23] M.V. Olson. A time to sequence. Science, 270:394–396, 1995.

[24] C.H. Papadimitriou. Computational Complexity. Addison-Wesley Publishing Company, 1994.

[25] D. Sankoff and J. Kruskal (eds.). Time Warps, String Edits, and Macromolecules. Addison-Wesley, 1983 (reprinted in 1999 by CSLI Publications).

[26] M. Waterman. Introduction to Computational Biology. Chapman and Hall, 1995.