Abstract
Motivation: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence.
Results: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats.
Availability: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.
Contact:sokol@sci.brooklyn.cuny.edu
1 INTRODUCTION
A tandem repeat, or tandem array, in DNA, is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats appear in biological sequences with a wide variety. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies.
Most genomes have a high content of repetitive DNA. Repeated sequences make up 50% of the human genome (Collins et al., 2003). The repeats in the human genome are important as genetic markers (Kannan and Myers, 1996), and they are also responsible for a number of inherited diseases involving the central nervous system. For example, in a normal FMR-1 gene, the triplet CGG is tandemly repeated 6 to 54 times, while in patients with Fragile X Syndrome the pattern occurs >200 times. Kennedy disease and Myotonic Dystrophy are two other diseases that have been associated with triplet repeats (Frazier et al., 2003). In addition, tandem repeats are used in population studies (Uform and Wayne, 1993), conservation biology (The International Human Genome Mapping Consortium, 2001) and in conjunction with multiple sequence alignments (Benson, 1997; Kolpakov and Kucherov, 2001).
Many of the repeats appear in non-coding regions of DNA. Although useful for identity testing, these regions were thought to carry no function. Recently, however, it has been shown that repeats in the genome of a rodent provide code for its sociobehavioral traits (Jeffreys, 1993). Scientists currently believe that the non-coding tandem repeats do affect the function of the DNA in ways yet unknown.
Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence, while allowing for mutations. Due to the recent sequencing of the human genome by the Human Genome Project (Groult et al., 2003; Fu et al., 1992), it is now possible to analyze the sequence of the human genome and create a listing of all tandem repeats in the genome. Detecting all tandem repeats in protein sequences is an important goal as well (Kitada et al., 1996).
In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence. We have already implemented the algorithm in C++, and a website for the program is under construction. Our program automates the task of listing all repeats that occur in a biological sequence. It is our hope that with the use of our program, biologists will discover new ways that tandem repeats affect both the structure and function of DNA and protein molecules.
The contributions of this paper are 2-fold. The theoretical contributions consist of an extension of the algorithm of Landau and Schmidt (Landau et al., 1998) to locate evolutive tandem repeats over the edit distance and a more careful analysis of the algorithm eliminating the need for suffix trees. The practical contribution is an efficient program for finding tandem repeats that will become available on the web for all to use.
1.1 Problem definition
We define tandem repeats over the edit distance using the model of evolutive tandem repeats (Groult et al., 2004; Hammock and Young, 2005). The model assumes that each copy of the repeat, from left to right, is derived from the previous copy through zero or more mutations. Thus, each copy in the repeat is similar to its predecessor and successor copy.
Let ed(s1, s2) denote the minimum edit distance between two strings, s1 and s2.
A string R is a k-edit repeat if it can be partitioned into consecutive substrings, R = r1, r2, … , rℓ, such that. The last copy of the repeat does not have to be complete, and therefore ed (rℓ−1, rℓ) refers to the edit distance between a prefix of rℓ−1 and rℓ.
A k-edit repeat is an evolutive tandem repeat in which there are at most k insertions, deletions and mismatches, overall, between all consecutive copies of the repeat. For example, the stringR = caagct∣cagct∣ccgct is a 2-edit repeat. A k-edit repeat is maximal if it cannot be extended to the right or left. Maximal repeats can be overlapping, as shown in the following example.
Repeat of length 14 with 2 errors:
Repeat of length 18 with 2 errors:
In this paper, we address the k-edit Repeat Problem, defined as follows. Given a string, S, and an integer k, find all maximalk-edit repeats in S. From the theoretical viewpoint, we provide an efficient algorithm that locates all maximal k-edit repeats in a string. We implemented this algorithm and found that our program locates many repeats that are similar one to another. Hence, we incorporated several heuristics in our program, to make the output more succinct and useful. The heuristics are described in Section 2.3. The result is a concise listing of the k-edit repeats occurring in the input sequence.
1.2 Related work
The early work on finding tandem repeats in strings dealt with simple repeats, i.e. repeats that contained exactly two parts. Kannan and Meyers (Katti et al., 2000), Benson (Benson, 1995) and Schmidt (Spong and Hellborg, 2002) present algorithms for finding simple repeats using the weighted edit distance.
The true goal of biologists, which turns out to be a much more difficult problem, is to find all maximal repeats in a string. A maximal repeat within a string is a repeat that contains two or more consecutive copies, and it cannot be extended further to the left or to the right. Each part of the repeat is called a period. When considering maximal repeats without errors, all periods (except possibly the last one) have equal length. For example, in the string aatgtgtgt the stringtgtgtgt is a maximal repeat with 3.5 periods.
More recently, the concentration has been on searching for maximal repeats with errors. In this case, however, the definition is not obvious. Given a repeat with several periods, what does it mean for the parts to be ‘similar?’ Benson (Benson, 1999) requires that a consensus string must exist which is similar to all periods of the repeat. Using this approach, it is difficult to provide a deterministic algorithm to find tandem repeats. Hence, Tandem Repeats Finder (TRF), developed by Benson1 uses a collection of statistical criteria in combination with k-tuple matches to detect statistically significant tandem repeats. Similar criteria are used by Wexler et al. (2004).
The goal of this paper is to provide a rigorous definition of tandem repeats and to provide a deterministic algorithm to detect all repeats that satisfy the definition. All of the algorithms that use this approach build upon the Hamming distance, which measures the number of mismatching characters between two strings, maintaining the property that each period of a repeat has the same length.
In (Groult et al., 2004; Hammock and Young, 2005), evolutive tandem repeats are defined over the Hamming distance as follows. A string is an evolutive tandem repeat if it can be broken up into substrings r1…rℓ such that the Hamming distance between ri and ri+1 is smaller than a given threshold, for 1 ≤ i < ℓ. The definition also allows for a small gap between copies of the repeat. The algorithm presented in (Hammock and Young, 2005) has worst-case quadratic runtime.
Kolpakov and Kucherov (Kolpakov et al., 2003, http://www.loria.fr/mreps/) present two different definitions. The first definition sums the mismatches between all neighboring copies of the repeat. Formally, a repeat is called a k-repetition if the Hamming distance between and is ≤k. The second definition allows k mismatches between each pair of consecutive copies, and this is called a k-run, and is similar to the evolutive tandem repeats over the Hamming distance. The algorithm of (Kolpakov et al., 2003,http://www.loria.fr/mreps/) has O(nk log k) time. The algorithm has been implemented in themreps software (Landau).
Landau, Schmidt and Sokol (Landau et al., 2001) define a tandem repeat to have k mismatches if the alignment constructed from its periods has k nonuniform columns. Their algorithm runs in O(nka log (n/k)) where n is the sequence, k is the error bound and a is the maximum number of periods in any reported repeat. The algorithm has been implemented and is available at (Landau and Vishkin, 1989http://csweb.cs.haifa.ac.il/library/ andhttp://www.sci.brooklyn.cuny.edu/~sokol/trepeats/).
The disadvantage of the Hamming distance is that it only accounts for point mutations and does not allow insertions and deletions. To model mutations more generally, it is preferable to use the edit distance, defined by Levenshtein (Main and Lorentz, 1984) as the minimum number of insertions, deletions and substitutions necessary to transform one string into the other. Ideally, we would like to differentially weight mutations; substitutions are typically scored more permissively than insertions and deletions. In this paper, we weight all differences uniformly, i.e. we use an edit distance scheme.
2 METHODS
In this section we describe the algorithm for finding k-edit repeats. We first describe a straightforward solution, following which we describe three speedups to achieve a time and space efficient algorithm. The speedups use similar ideas to those in (Landau et al., 2001).
2.1 A straightforward solution
The classical method for calculating the edit distance between two strings S = s1…sm and S′ =s′1…s′n is to construct an n × m dynamic programming matrix, to initialize (row 0, column j) to j (column 0, row i) to i, and to fill it in for i, j > 0 using the following formula.We define a p-restricted edit distance alignment as an edit distance alignment between two strings that disallows p insertions into the first string (alternatively, p deletions in the second string). This definition is necessary, since we would like to align a prefix of a string with a proper suffix of the same string. If insertions into the prefix are not restricted, then the string may eventually ‘catch up’ with itself, aligning characters in the prefix with the exact corresponding characters in the suffix.
In terms of the edit distance matrix, if we assume that the first input string is to the left of the matrix, a p-restricted alignment corresponds to the regular edit distance matrix, allowing onlyp − 1 diagonals to the left of the main diagonal.
A string R of length q is a k-edit repeat if and only if a p-restricted edit distance alignment can be constructed between a suffix of R, beginning at location p + 1, and a prefix of R, with ≤k differences, for some p < q/2.
The proof follows from the fact that we can obtain a k-edit repeat from a 2-sequence alignment of the prefix and suffix, and alternatively, given a k-edit repeat, the prefix/suffix can be derived from the alignment of the repeat. Given a p-restricted alignment of a suffix/prefix of a string R, the copies of the repeat can be derived from the alignment as follows. The first copy consists of the first p characters of the prefix. The second copy consists of all characters in the suffix that appear in the alignment against the first p characters in the prefix. Each successive copy is calculated by using the previous known copy in the prefix to demarcate the following copy in the suffix.
If a string R is a k-edit repeat, the prefix/suffix alignment can be obtained from the copy-to-copy alignments. The suffix can be taken as the start of the second copy of the repeat. The prefix is the string R minus the last copy of the repeat. Each has length ≥q/2 since only one copy of the repeat is removed. The prefix/suffix alignment can be constructed by aligning copy i in the prefix with copy i + 1 in the suffix.
We can use lemma 1 to derive a straightforward algorithm for finding all k-edit repeats within a string. The algorithm finds all substrings of the string for which a 2-sequence alignment exists between a proper prefix and a proper suffix of the substring. It then breaks up the 2-sequence alignment into its copies using the method in the first part of the proof of the lemma.
Simple Algorithm
Input: A string S, and an integer k.
Output: All k-edit repeats in S.
Consider each index 1 ≤ i < n as the starting point of a potential repeat.
Consider each index (to the right of location i) as the start of the proper suffix of the repeat.
For each pair (i, j) perform a (j − i)-restricted alignment of the two strings S1 = si … snand S2 = sj … sn using the classical dynamic programming method (allowing only j − i− 1 diagonals to the left of the main diagonal).
If there is a match between S1 and S2, of length at least j−i, with ≤k errors, then a repeat exists, beginning at location i and ending at the location before the k + 1st error found in step 3.
Complexity analysis: The time complexity of the naive algorithm is O(n4). There are O(n2) iterations, and in each iteration we compute the dynamic programming matrix of size O(n2). The space used is that of the edit distance matrix, which is of size O(n2).
In the following section we explain the efficient algorithm. We present it as a three-tier modification to the simple algorithm. The new algorithm reduces the time to O(nk log k log(n/k)) and it reduces the space from O(n2) to O(k2).
2.2 An efficient algorithm
We present a three-tier speedup to the naive algorithm. First, we reduce the number of iterations. Then, we show how each iteration can be improved by pruning the edit distance matrix. The third speedup fine-tunes the computation time of the partial edit distance matrix.
Speedup #1: reduce the number of iterations. We use the idea2 of Main and Lorentz (Schmidt) to reduce the number of iterations from O(n2) to O(nlog(n/k)). The algorithm, rather than considering all possible starts, anchors the comparisons at the center of the string. In the first iteration, the input is S = s1, … , sn, and all repeats that cross the center of the string (i.e. include character sn2) are found. In the following iteration, S is divided into two substrings, S = uv, u =s1…sn/2 and v = sn/2+1…sn. The algorithm which finds repeats crossing the center is applied recursively to both u and v.
In order to locate all repeats that include the character sn/2, it is necessary to consider all alignments in which sn/2 corresponds to another character in the string. Following we describe the procedure which aligns location n/2 with all indices p > n/2. By symmetry, alignments for values p < n/2 can be produced.
Find repeats
For p = 1 to n/2 do // find repeats with period p
Forward Direction: Find the longest prefix of sn/2+p…sn that p-restricted matches a prefix of sn/2…sn with ≤k errors. (Fig. 1).
Backward Direction: Find the longest suffix of s1…sn/2+p−1 that p-restricted matches a suffix of s1…sn/2−1 with ≤k errors.
Consider all pairs k1, k2 for which k1 + k2 = k. Let ℓ1 be the length of the backward direction comparison with k1 errors, and ℓ2 be the length of the forward direction comparison with k2 errors. If ℓ1 + ℓ2 ≥ p then there is a tandem repeat extending from
We illustrate this idea with the following example.
Let k = 4, find the k-edit repeats in the string S = ctcgagctcctgacctcgtga.
We show the iteration of p = 6, assuming that the first appearance of ‘t’ in the string is locationn/2. We omitted the first part of the string since it is not necessary for the example. The forward direction comparison is done by aligning the string sn/2…sn with the stringsn/2+6…sn. The optimal alignment (i.e. obtaining the minimum edit distance) is shown inFigure 2. Two gaps are introduced, and there are two mismatches.
The length of the forward extension with 4 errors is 14 (i.e. the alignment extends up until location n/2 + 6 + 14 − 1). Since we allowed 4 errors in the forward extension, and k = 4, we do not allow any errors in the backward extension. The length of the backward extension with 0 errors is 1 (the single character c).
The copies of the repeat are derived from the 2-sequence alignment as follows:
Repeat of length 21 with 4 errors.
Speedup #2: Reduce the Size of the Dynamic Programming Matrix. In the straightforward algorithm, the forward and reverse direction extensions are computed by building the dynamic programming edit distance matrix for each pair of substrings. The second idea for speeding up the algorithm uses the observation that it is not necessary to compute the entire edit distance matrix, since we are only looking for k errors.
Using the ideas of Ukkonen (1983) and Landau/Vishkin (Levenshtein, 1966), we can reduce the size of the matrix from O(n2) to O(k2). Consider the diagonals in the edit distance matrix, where diagonal d corresponds to the diagonal with indices {(i, j)|i − j = d}. The main diagonal is diagonal 0. Any diagonal in the edit distance matrix that has distance >k from diagonal 0 is not of interest since all of its values will be >k. Thus, it is only necessary to compute the values on 2k + 1 diagonals. Furthermore, on a given diagonal, successive values differ by at most 1. Therefore, it is only necessary to store the location of the last row on each diagonal that has value h for each 0 ≤ h ≤ k. Instead of the edit dist matrix, we compute a k × k matrix L such that:
L[d, h] = largest row in the edit_dist matrix on diagonal d that has the value h.
Analysis: The matrix L can be computed by filling in n × 2k entries in the dynamic programming matrix. Each row needs 2k values, and potentially n rows will have to be computed. Using the classical dynamic programming edit distance algorithm (as in Section 2.1), this can be done in O(nk) time. Only a constant number of rows need be stored, hence the space complexity is O(k2). This looks excellent, however, consider the fact that the matrix Lneeds to be computed, in a given iteration, for each possible 1 ≤ p < n/2. This results in the matrix being computed potentially O(n) times, resulting in O(n2k) time per iteration. This is where the third speedup comes into play.
Speedup #3: reduce the computation time of the dynamic programming matrix. The third speedup uses information from one period size for the following period size. In each iteration, the algorithm computes the edit distance matrix separately for each period 1 ≤ p ≤ n/2. For p= 1, this translates into the computation of an edit distance matrix for the two strings sn/2…snand sn/2+1…sn. For p = 2, a new matrix is computed for the strings sn/2…sn and sn/2+2…sn, and so on. The only change from one period size to the next is the deletion of the first character in the second string. There is a lot of overlapping computation between different period sizes. Landau, Myers and Schmidt (LMS) (Landau and Schmidt, 1993) called this problem ‘incremental string comparison’, and showed how to get from one matrix to the next in O(k) time, assuming we are only searching for up to k errors.
Thus, we can compute the first O(k2) matrix using the previous speedup (Levenshtein, 1966), and then for each period size spend O(k) time using LMS to modify the matrix.
The only remaining issue is that when using the LMS algorithm to update the matrix it is not trivial to figure out the values for ℓ1 and ℓ2, i.e. the longest common extension forward and backward. These values correspond to the rightmost column in the dynamic programming matrix, for which a certain value k1 < k appears. In (Landau et al., 2001) this is solved by maintaining a heap for each value k1 < k, which contains all columns having the value k1. Thus, the rightmost column can be located and updated in O(log k) time.
Complexity analysis: There are O(log (n/k)) iterations. In each iteration, we construct a matrix of size O(k2) in time O(k2 + nk). The matrix is modified for each period size in O(k log k) time. Overall, each iteration takes O(nk log k) time, and the algorithm has time complexity O(nk log klog(n/k)). The space complexity is O(n + k2).
2.3 Implementation
We have implemented the program in C++ including the first two speedups. The third speedup is complicated, and it is our guess that in practice it will not significantly affect the program. This is based on the observation that in practice we almost never compute the full n × kmatrix, as we stop calculating each of the diagonals after k + 1 errors are found. Our program can process a string of several thousand characters in a fraction of a second. Currently, the running time of the program is proportional to the time it takes to print the output.
An inherent flexibility exists in the definition of approximate repeats, with the goal of allowing for mutations. This flexibility poses an issue when the goal is to detect all substrings of the input string that satisfy the definition. The issue is that in practice there is too much output. Each repeat that is reported satisfies the mathematical definition; however, many reported repeats differ only slightly one from another. Hence, our challenge was to modify our algorithm in a way that it reports a meaningful and significant subset of the found repeats. To this end, we filter out all repeats that prove to be redundant. The filtering is incorporated into the algorithm itself, which is much more efficient than doing a post-processing phase to filter the output.
We deal with the issue of filtering by augmenting the different phases of our algorithm. The first filtering technique filters repeats found within an individual iteration, while the second technique filters repeats found in different iterations.
Filtering a given iteration: 1. Choose the Optimal Period using the Local Maximum. The first point of comparison is anchored at the center position of the string, but the second point varies, and only by one character at a time. Suppose there exists an alignment for some p withh errors. Then, this alignment can be converted into an alignment for p + 1 with at most h + 1 errors. Thus, we do not want to report every possible k-edit repeat. We are only interested in the value p that produces the best matching of the surrounding string segments. This leads to a local maximum idea. As the first point is anchored, and the second point moves forward, the comparison produces better matches as the second point nears a corresponding position in a different part of the repeat. The comparison gives worse results as the second point moves away from that ‘optimal’ position, until it begins nearing the corresponding position of another repeat.
2. Combine several neighboring repeats. Once a local maximum, p, is found, we have potentially k different repeats with period p, each shifted over several characters. This is because we consider k1 errors to the left and k2 errors to the right, for all values of k1, k2 such that k1 + k2 = k. For purposes of reporting the repeats, it is much clearer to see the shifting repeats as a single repeat. Hence, in this case, we combine all repeats with period p into a single repeat. We note that the combined repeat has at most 2k errors.
Filtering between iterations: The idea for filtering between iterations is a simple one. If a repeat reaches either end of the string segment being examined by the current iteration, it will also be detected by a different iteration at a higher level in the recursion and thus need not be reported in the current iteration.
3 RESULTS
The table on the left in Figure 3 contains a summary of the repeats that our program detected in the first 14 000 bp of human chromosome 18 (May 2004). The exact filtering criteria are still being improved, and thus we manually filtered the repeats to clarify results. The criterion for this run allows up to k = 40 errors and requires a minimum length of at least 100 characters more than the number of errors. This particular criterion was tested on pseudorandom strings created using the C++ function rand(), which produces statistically random output in a single run. The function rand() was initialized using the system clock time, ensuring a different set of statistically random strings for each execution. It was tested on 20 strings, each 20 000 in length, and returned a single repeat in 6 of them, and no repeats in the rest. The table on the right of Figure 3 shows the results of TRF (Benson, 1999) on the same sequence.
In Figure 4 we show the alignments of the copies of the two new repeats found by our program, beginning at positions 795 and 10 690.
4 DISCUSSION
Many of the repeats that are found by our program are also found by TRF (Benson, 1999), which uses a non-evolutive definition. This is due to the fact that the consensus type repeat often satisfies the evolutive definition as well. Sometimes, our program will detect more errors than TRF, because in the case of a deviation in a single period, we count each deviation twice: once from the period before to the period with the deviation and once again from the period with the deviation to the period after. However, more often, changes will occur and be repeated for several periods before they change again. It is in these situations that the evolutive definition is most applicable. A non-evolutive definition causes an error to be reported for each period in which the deviation occurs, whereas an evolutive definition counts it as only one error until it is changed again.
We conclude that our novel definition and efficient program provide a useful tool for analyzing whole genomes. It is our hope that with its widespread use new scientific discoveries and inferences will be achieved.