[Source: http://blog.naver.com/holywater86/50098589686]


6. Heapsort

1) A heap is a complete binary tree satisfying the heap property (the last level is filled from the left, and its right end may be empty).

2) Heap

- Min (max) heap: each node's value is less (greater) than or equal to its children's.

- The root holds the minimum (maximum) value, which can be read in O(1).

- The maximum (minimum) value lies at the bottom, among the leaves.

3) Tree

- Family relationships: nodes have parents and children.

- With n elements, the depth is O(log n).

- Running time is proportional to the depth.

4) Array representation of a complete binary tree

- Left child of A[k]: A[2k]

- Right child of A[k]: A[2k+1]

- Parent of A[k]: A[⌊k/2⌋]
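A minimal sketch of this index arithmetic in Python (helper names are ours, not from the notes):

```python
# 1-based index arithmetic for a complete binary tree stored in an array
def left(k):
    return 2 * k

def right(k):
    return 2 * k + 1

def parent(k):
    return k // 2  # floor(k/2)

# e.g. the node at index 5 has children at indices 10 and 11 and parent at 2
```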

 

5) Running time: O(n log n)

 

6)

heapSort(A, n)
{
   buildHeap(A, n);       // build the initial heap
   for i ← n downto 2     // shrink the active region one slot at a time
   {
      A[1] ↔ A[i];        // swap the root into the inactive region (reverse order)
      heapify(A, 1, i-1); // restore the heap property
   }
}

 

buildHeap(A[], n)
{
   for i ← ⌊n/2⌋ downto 1   // start from the last node that has a child
      heapify(A, i, n);
}

 

heapify(A[], k, n)
{
   left ← 2k; right ← 2k+1;
   if (right ≤ n)            // both children exist: pick the smaller
   {
      if (A[left] < A[right])
         smaller ← left;
      else smaller ← right;
   }
   else if (left ≤ n)
      smaller ← left;        // only a left child exists
   else return;              // no children: nothing to do

   if (A[smaller] < A[k])    // swap with the smaller child and recurse
   {                         // to restore the heap property
      A[k] ↔ A[smaller];
      heapify(A, smaller, n);
   }
}
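The pseudocode above can be turned into a runnable Python sketch (our own transcription; since the notes use a min-heap, the result comes out in descending order, and 1-based indexing is simulated with a dummy slot):

```python
def heapify(a, k, n):
    """Sift a[k] down until the min-heap property holds in a[1..n] (1-based)."""
    left, right = 2 * k, 2 * k + 1
    if right <= n:                      # both children exist: pick the smaller
        smaller = left if a[left] < a[right] else right
    elif left <= n:                     # only a left child
        smaller = left
    else:                               # no children: nothing to do
        return
    if a[smaller] < a[k]:               # swap and recurse to restore the heap
        a[k], a[smaller] = a[smaller], a[k]
        heapify(a, smaller, n)

def build_heap(a, n):
    for i in range(n // 2, 0, -1):      # last node with a child, going up
        heapify(a, i, n)

def heap_sort(items):
    """Min-heap heapsort, as in the notes: sorts in descending order."""
    n = len(items)
    a = [None] + list(items)            # shift to 1-based indexing
    build_heap(a, n)
    for i in range(n, 1, -1):           # move the current minimum to the end
        a[1], a[i] = a[i], a[1]
        heapify(a, 1, i - 1)
    return a[1:]
```

To sort ascending instead, flip the comparisons (a max-heap), exactly as the min/max parentheticals in the notes suggest.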

 

[Source: http://siderite.blogspot.com/2006/11/fast-string-distance-sift-algorithm.html]

Super Fast and Accurate string distance algorithm: Sift3

foreach phase
    remove identical overlapping characters
shake strings
return number of shakes + the length of the longest string between the two.



        public float Distance(string s1, string s2, int maxOffset)
        {
            if (String.IsNullOrEmpty(s1))
                return String.IsNullOrEmpty(s2) ? 0 : s2.Length;
            if (String.IsNullOrEmpty(s2))
                return s1.Length;
            int c = 0;
            int offset1 = 0;
            int offset2 = 0;
            int lcs = 0;
            while ((c + offset1 < s1.Length)
                   && (c + offset2 < s2.Length))
            {
                if (s1[c + offset1] == s2[c + offset2]) lcs++;
                else
                {
                    offset1 = 0;
                    offset2 = 0;
                    for (int i = 0; i < maxOffset; i++)
                    {
                        if ((c + i < s1.Length)
                            && (s1[c + i] == s2[c]))
                        {
                            offset1 = i;
                            break;
                        }
                        if ((c + i < s2.Length)
                            && (s1[c] == s2[c + i]))
                        {
                            offset2 = i;
                            break;
                        }
                    }
                }
                c++;
            }
            return (s1.Length + s2.Length)/2 - lcs;
        }
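A direct Python transcription of the C# above (our own port; the function name and the default of 5 for maxOffset are our choices, not from the article):

```python
def sift3(s1, s2, max_offset=5):
    """Sift3 approximate string distance, ported from the C# version above."""
    if not s1:
        return len(s2) if s2 else 0
    if not s2:
        return len(s1)
    c = offset1 = offset2 = lcs = 0
    while c + offset1 < len(s1) and c + offset2 < len(s2):
        if s1[c + offset1] == s2[c + offset2]:
            lcs += 1                          # characters match at this phase
        else:
            offset1 = offset2 = 0             # "shake": look for a re-sync point
            for i in range(max_offset):
                if c + i < len(s1) and s1[c + i] == s2[c]:
                    offset1 = i
                    break
                if c + i < len(s2) and s1[c] == s2[c + i]:
                    offset2 = i
                    break
        c += 1
    return (len(s1) + len(s2)) / 2 - lcs
```

Note that, like the original, this returns an approximation of the edit distance, traded for speed; it is not guaranteed to equal the true Levenshtein distance.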
[Source: http://bioinformatics.oxfordjournals.org/content/23/2/e30.full]


Abstract

Motivation: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence.

Results: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats.

Availability: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

Contact: sokol@sci.brooklyn.cuny.edu

1 INTRODUCTION

A tandem repeat, or tandem array, in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats appear in great variety across biological sequences. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies.

Most genomes have a high content of repetitive DNA. Repeated sequences make up 50% of the human genome (Collins et al., 2003). The repeats in the human genome are important as genetic markers (Kannan and Myers, 1996), and they are also responsible for a number of inherited diseases involving the central nervous system. For example, in a normal FMR-1 gene, the triplet CGG is tandemly repeated 6 to 54 times, while in patients with Fragile X Syndrome the pattern occurs >200 times. Kennedy disease and Myotonic Dystrophy are two other diseases that have been associated with triplet repeats (Frazier et al., 2003). In addition, tandem repeats are used in population studies (Uform and Wayne, 1993), conservation biology (The International Human Genome Mapping Consortium, 2001) and in conjunction with multiple sequence alignments (Benson, 1997; Kolpakov and Kucherov, 2001).

Many of the repeats appear in non-coding regions of DNA. Although useful for identity testing, these regions were thought to carry no function. Recently, however, it has been shown that repeats in the genome of a rodent provide code for its sociobehavioral traits (Jeffreys, 1993). Scientists currently believe that the non-coding tandem repeats do affect the function of the DNA in ways yet unknown.

Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence, while allowing for mutations. Due to the recent sequencing of the human genome by the Human Genome Project (Groult et al., 2003; Fu et al., 1992), it is now possible to analyze the sequence of the human genome and create a listing of all tandem repeats in the genome. Detecting all tandem repeats in protein sequences is an important goal as well (Kitada et al., 1996).

In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence. We have already implemented the algorithm in C++, and a website for the program is under construction. Our program automates the task of listing all repeats that occur in a biological sequence. It is our hope that with the use of our program, biologists will discover new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

The contributions of this paper are 2-fold. The theoretical contributions consist of an extension of the algorithm of Landau and Schmidt (Landau et al., 1998) to locate evolutive tandem repeats over the edit distance and a more careful analysis of the algorithm eliminating the need for suffix trees. The practical contribution is an efficient program for finding tandem repeats that will become available on the web for all to use.

1.1 Problem definition

We define tandem repeats over the edit distance using the model of evolutive tandem repeats (Groult et al., 2004; Hammock and Young, 2005). The model assumes that each copy of the repeat, from left to right, is derived from the previous copy through zero or more mutations. Thus, each copy in the repeat is similar to its predecessor and successor copy.

Let ed(s1, s2) denote the minimum edit distance between two strings, s1 and s2.

Definition 1

A string R is a k-edit repeat if it can be partitioned into consecutive substrings, R = r1 r2 … rℓ, such that ed(r1, r2) + ed(r2, r3) + … + ed(rℓ−1, rℓ) ≤ k. The last copy of the repeat does not have to be complete, and therefore ed(rℓ−1, rℓ) refers to the edit distance between a prefix of rℓ−1 and rℓ.

A k-edit repeat is an evolutive tandem repeat in which there are at most k insertions, deletions and mismatches, overall, between all consecutive copies of the repeat. For example, the string R = caagctcagctccgct is a 2-edit repeat. A k-edit repeat is maximal if it cannot be extended to the right or left. Maximal repeats can be overlapping, as shown in the following example.
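The 2-edit claim can be checked mechanically. The partition below is our own reading of the example (the excerpt does not spell it out); summing the edit distances between consecutive copies gives 2:

```python
def ed(a, b):
    """Plain Levenshtein edit distance (two-row DP), a small helper."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # match/substitution
        prev = cur
    return prev[-1]

# One possible partition of R into consecutive, drifting copies:
copies = ["caagct", "cagct", "ccgct"]
assert "".join(copies) == "caagctcagctccgct"

# ed(r1, r2) + ed(r2, r3) == 2, so R is a 2-edit repeat
errors = sum(ed(copies[i], copies[i + 1]) for i in range(len(copies) - 1))
```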

Example

Repeat of length 14 with 2 errors:


Repeat of length 18 with 2 errors:


In this paper, we address the k-edit Repeat Problem, defined as follows. Given a string, S, and an integer k, find all maximal k-edit repeats in S. From the theoretical viewpoint, we provide an efficient algorithm that locates all maximal k-edit repeats in a string. We implemented this algorithm and found that our program locates many repeats that are similar to one another. Hence, we incorporated several heuristics in our program to make the output more succinct and useful. The heuristics are described in Section 2.3. The result is a concise listing of the k-edit repeats occurring in the input sequence.

1.2 Related work

The early work on finding tandem repeats in strings dealt with simple repeats, i.e. repeats that contain exactly two parts. Kannan and Myers (Kannan and Myers, 1996), Benson (Benson, 1995) and Schmidt (Schmidt, 1998) present algorithms for finding simple repeats using the weighted edit distance.

The true goal of biologists, which turns out to be a much more difficult problem, is to find all maximal repeats in a string. A maximal repeat within a string is a repeat that contains two or more consecutive copies, and it cannot be extended further to the left or to the right. Each part of the repeat is called a period. When considering maximal repeats without errors, all periods (except possibly the last one) have equal length. For example, in the string aatgtgtgt, the string tgtgtgt is a maximal repeat with 3.5 periods.

More recently, the concentration has been on searching for maximal repeats with errors. In this case, however, the definition is not obvious. Given a repeat with several periods, what does it mean for the parts to be ‘similar’? Benson (Benson, 1999) requires that a consensus string must exist which is similar to all periods of the repeat. Using this approach, it is difficult to provide a deterministic algorithm to find tandem repeats. Hence, Tandem Repeats Finder (TRF), developed by Benson, uses a collection of statistical criteria in combination with k-tuple matches to detect statistically significant tandem repeats. Similar criteria are used by Wexler et al. (2004).

The goal of this paper is to provide a rigorous definition of tandem repeats and to provide a deterministic algorithm to detect all repeats that satisfy the definition. All of the algorithms that use this approach build upon the Hamming distance, which measures the number of mismatching characters between two strings, maintaining the property that each period of a repeat has the same length.

In (Groult et al., 2004; Hammock and Young, 2005), evolutive tandem repeats are defined over the Hamming distance as follows. A string is an evolutive tandem repeat if it can be broken up into substrings r1 … rℓ such that the Hamming distance between ri and ri+1 is smaller than a given threshold, for 1 ≤ i < ℓ. The definition also allows for a small gap between copies of the repeat. The algorithm presented in (Hammock and Young, 2005) has worst-case quadratic runtime.

Kolpakov and Kucherov (Kolpakov et al., 2003; http://www.loria.fr/mreps/) present two different definitions. The first definition sums the mismatches between all neighboring copies of the repeat. Formally, a repeat r1 … rℓ is called a k-repetition if the sum of the Hamming distances between consecutive copies ri and ri+1 is ≤ k. The second definition allows k mismatches between each pair of consecutive copies; this is called a k-run, and is similar to the evolutive tandem repeats over the Hamming distance. The algorithm of (Kolpakov et al., 2003; http://www.loria.fr/mreps/) runs in O(nk log k) time. The algorithm has been implemented in the mreps software.

Landau, Schmidt and Sokol (Landau et al., 2001) define a tandem repeat to have k mismatches if the alignment constructed from its periods has k nonuniform columns. Their algorithm runs in O(nka log(n/k)) time, where n is the sequence length, k is the error bound and a is the maximum number of periods in any reported repeat. The algorithm has been implemented and is available at http://csweb.cs.haifa.ac.il/library/ and http://www.sci.brooklyn.cuny.edu/~sokol/trepeats/.

The disadvantage of the Hamming distance is that it only accounts for point mutations and does not allow insertions and deletions. To model mutations more generally, it is preferable to use the edit distance, defined by Levenshtein (Levenshtein, 1966) as the minimum number of insertions, deletions and substitutions necessary to transform one string into the other. Ideally, we would like to differentially weight mutations; substitutions are typically scored more permissively than insertions and deletions. In this paper, we weight all differences uniformly, i.e. we use an edit distance scheme.

2 METHODS

In this section we describe the algorithm for finding k-edit repeats. We first describe a straightforward solution, following which we describe three speedups to achieve a time and space efficient algorithm. The speedups use similar ideas to those in (Landau et al., 2001).

2.1 A straightforward solution

The classical method for calculating the edit distance between two strings S = s1 … sm and S′ = s′1 … s′n is to construct an n × m dynamic programming matrix, to initialize row 0 to D[0, j] = j and column 0 to D[i, 0] = i, and to fill it in for i, j > 0 using the following formula: D[i, j] = min( D[i−1, j] + 1, D[i, j−1] + 1, D[i−1, j−1] + δ(s′i, sj) ), where δ(s′i, sj) = 0 if s′i = sj and 1 otherwise. We define a p-restricted edit distance alignment as an edit distance alignment between two strings that disallows p insertions into the first string (alternatively, p deletions in the second string). This definition is necessary, since we would like to align a prefix of a string with a proper suffix of the same string. If insertions into the prefix are not restricted, then the string may eventually ‘catch up’ with itself, aligning characters in the prefix with the exact corresponding characters in the suffix.
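The recurrence can be sketched directly in Python (a full-matrix version, without the p-restriction; function and variable names are ours):

```python
def edit_distance(s, t):
    """Classical dynamic-programming edit distance over a full matrix."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # deleting i characters of s
    for j in range(m + 1):
        D[0][j] = j                      # inserting j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,           # deletion
                          D[i][j - 1] + 1,           # insertion
                          D[i - 1][j - 1] + delta)   # match / substitution
    return D[n][m]
```

The p-restriction described above would simply forbid cells more than p − 1 diagonals to the left of the main diagonal.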

In terms of the edit distance matrix, if we assume that the first input string is to the left of the matrix, a p-restricted alignment corresponds to the regular edit distance matrix, allowing only p − 1 diagonals to the left of the main diagonal.

Lemma 1

A string R of length q is a k-edit repeat if and only if a p-restricted edit distance alignment can be constructed between a suffix of R, beginning at location p + 1, and a prefix of R, with ≤k differences, for some p < q/2.

Proof

The proof follows from the fact that we can obtain a k-edit repeat from a 2-sequence alignment of the prefix and suffix, and alternatively, given a k-edit repeat, the prefix/suffix can be derived from the alignment of the repeat. Given a p-restricted alignment of a suffix/prefix of a string R, the copies of the repeat can be derived from the alignment as follows. The first copy consists of the first p characters of the prefix. The second copy consists of all characters in the suffix that appear in the alignment against the first p characters in the prefix. Each successive copy is calculated by using the previous known copy in the prefix to demarcate the following copy in the suffix.

If a string R is a k-edit repeat, the prefix/suffix alignment can be obtained from the copy-to-copy alignments. The suffix can be taken as the start of the second copy of the repeat. The prefix is the string R minus the last copy of the repeat. Each has length ≥q/2 since only one copy of the repeat is removed. The prefix/suffix alignment can be constructed by aligning copy i in the prefix with copy i + 1 in the suffix.

We can use lemma 1 to derive a straightforward algorithm for finding all k-edit repeats within a string. The algorithm finds all substrings of the string for which a 2-sequence alignment exists between a proper prefix and a proper suffix of the substring. It then breaks up the 2-sequence alignment into its copies using the method in the first part of the proof of the lemma.

Simple Algorithm

Input: A string S, and an integer k.

Output: All k-edit repeats in S.

  1. Consider each index 1 ≤ i < n as the starting point of a potential repeat.

  2. Consider each index j to the right of location i (i < j ≤ n) as the start of the proper suffix of the repeat.

  3. For each pair (i, j) perform a (j − i)-restricted alignment of the two strings S1 = si … sn and S2 = sj … sn using the classical dynamic programming method (allowing only j − i − 1 diagonals to the left of the main diagonal).

  4. If there is a match between S1 and S2, of length at least j − i, with ≤k errors, then a repeat exists, beginning at location i and ending at the location before the (k + 1)st error found in step 3.

Complexity analysis: The time complexity of the naive algorithm is O(n4). There are O(n2) iterations, and in each iteration we compute the dynamic programming matrix of size O(n2). The space used is that of the edit distance matrix, which is of size O(n2).

In the following section we explain the efficient algorithm. We present it as a three-tier modification to the simple algorithm. The new algorithm reduces the time to O(nk log k log(n/k)) and it reduces the space from O(n2) to O(k2).

2.2 An efficient algorithm

We present a three-tier speedup to the naive algorithm. First, we reduce the number of iterations. Then, we show how each iteration can be improved by pruning the edit distance matrix. The third speedup fine-tunes the computation time of the partial edit distance matrix.

Speedup #1: reduce the number of iterations. We use the idea of Main and Lorentz (Main and Lorentz, 1984) to reduce the number of iterations from O(n2) to O(n log(n/k)). The algorithm, rather than considering all possible starts, anchors the comparisons at the center of the string. In the first iteration, the input is S = s1, … , sn, and all repeats that cross the center of the string (i.e. include the character sn/2) are found. In the following iteration, S is divided into two substrings, S = uv, with u = s1 … sn/2 and v = sn/2+1 … sn. The algorithm which finds repeats crossing the center is applied recursively to both u and v.

In order to locate all repeats that include the character sn/2, it is necessary to consider all alignments in which sn/2 corresponds to another character in the string. In the following we describe the procedure which aligns location n/2 with all indices p > n/2. By symmetry, alignments for values p < n/2 can be produced.

Find repeats

For p = 1 to n/2 do // find repeats with period p

  1. Forward Direction: Find the longest prefix of sn/2+p … sn that p-restricted matches a prefix of sn/2 … sn with ≤k errors (Fig. 1).

  2. Backward Direction: Find the longest suffix of s1 … sn/2+p−1 that p-restricted matches a suffix of s1 … sn/2−1 with ≤k errors.

  3. Consider all pairs (k1, k2) for which k1 + k2 = k. Let ℓ1 be the length of the backward direction comparison with k1 errors, and ℓ2 be the length of the forward direction comparison with k2 errors. If ℓ1 + ℓ2 ≥ p then there is a tandem repeat extending from sn/2−ℓ1+1 to sn/2+ℓ2.

Fig. 1

Computing repeats: ℓ1 is the length of the forward direction comparison with k1 < k errors, and ℓ2 is the length of the backward direction comparison with k2 < k errors. If k1 + k2 ≤ k and ℓ1 + ℓ2 ≥ p, then there is a k-edit repeat at sn/2−ℓ2+1 … sn/2+ℓ1.

We illustrate this idea with the following example.

Let k = 4, and find the k-edit repeats in the string S = ctcgagctcctgacctcgtga.

We show the iteration of p = 6, assuming that the first appearance of ‘t’ in the string is location n/2. We omitted the first part of the string since it is not necessary for the example. The forward direction comparison is done by aligning the string sn/2 … sn with the string sn/2+6 … sn. The optimal alignment (i.e. the one obtaining the minimum edit distance) is shown in Figure 2. Two gaps are introduced, and there are two mismatches.

Fig. 2

Optimal alignment obtaining the minimum edit distance.

The length of the forward extension with 4 errors is 14 (i.e. the alignment extends up until location n/2 + 6 + 14 − 1). Since we allowed 4 errors in the forward extension, and k = 4, we do not allow any errors in the backward extension. The length of the backward extension with 0 errors is 1 (the single character c).

The copies of the repeat are derived from the 2-sequence alignment as follows:

Repeat of length 21 with 4 errors.


Speedup #2: reduce the size of the dynamic programming matrix. In the straightforward algorithm, the forward and reverse direction extensions are computed by building the dynamic programming edit distance matrix for each pair of substrings. The second idea for speeding up the algorithm uses the observation that it is not necessary to compute the entire edit distance matrix, since we are only looking for k errors.

Using the ideas of Ukkonen (1983) and Landau and Vishkin (Landau and Vishkin, 1989), we can reduce the size of the matrix from O(n2) to O(k2). Consider the diagonals in the edit distance matrix, where diagonal d corresponds to the diagonal with indices {(i, j)|i − j = d}. The main diagonal is diagonal 0. Any diagonal in the edit distance matrix that has distance >k from diagonal 0 is not of interest, since all of its values will be >k. Thus, it is only necessary to compute the values on 2k + 1 diagonals. Furthermore, on a given diagonal, successive values differ by at most 1. Therefore, it is only necessary to store the location of the last row on each diagonal that has value h, for each 0 ≤ h ≤ k. Instead of the edit distance matrix, we compute a k × k matrix L such that:

L[d, h] = largest row in the edit_dist matrix on diagonal d that has the value h.
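A simplified sketch of the band idea (our own, and deliberately cruder than the paper's L matrix: it fills only the in-band cells of a rolling two-row table, and reports the distance only when it is at most k):

```python
def banded_edit_distance(s, t, k):
    """Edit distance computed only on the 2k+1 diagonals around diagonal 0.

    Cells farther than k diagonals from the main diagonal are known to
    exceed k, so they are never filled in. Returns the exact distance if
    it is <= k, otherwise None.
    """
    n, m = len(s), len(t)
    if abs(n - m) > k:                 # the end cell lies outside the band
        return None
    INF = k + 1                        # stand-in for "greater than k"
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i                 # column 0 only exists inside the band
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,   # match / substitution
                         prev[j] + 1,          # deletion
                         cur[j - 1] + 1)       # insertion
        prev = cur
    return prev[m] if prev[m] <= k else None
```

Any path forced through an out-of-band cell costs more than k, so capping those cells at k + 1 never turns a true distance > k into a reported one.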

Analysis: The matrix L can be computed by filling in n × 2k entries in the dynamic programming matrix. Each row needs 2k values, and potentially n rows will have to be computed. Using the classical dynamic programming edit distance algorithm (as in Section 2.1), this can be done in O(nk) time. Only a constant number of rows need be stored, hence the space complexity is O(k2). This looks excellent; however, consider the fact that the matrix L needs to be computed, in a given iteration, for each possible 1 ≤ p < n/2. This results in the matrix being computed potentially O(n) times, resulting in O(n2k) time per iteration. This is where the third speedup comes into play.

Speedup #3: reduce the computation time of the dynamic programming matrix. The third speedup uses information from one period size for the following period size. In each iteration, the algorithm computes the edit distance matrix separately for each period 1 ≤ p ≤ n/2. For p = 1, this translates into the computation of an edit distance matrix for the two strings sn/2 … sn and sn/2+1 … sn. For p = 2, a new matrix is computed for the strings sn/2 … sn and sn/2+2 … sn, and so on. The only change from one period size to the next is the deletion of the first character in the second string. There is a lot of overlapping computation between different period sizes. Landau, Myers and Schmidt (LMS) (Landau et al., 1998) called this problem ‘incremental string comparison’, and showed how to get from one matrix to the next in O(k) time, assuming we are only searching for up to k errors.

Thus, we can compute the first O(k2) matrix using the previous speedup, and then for each period size spend O(k) time using LMS to modify the matrix.

The only remaining issue is that when using the LMS algorithm to update the matrix it is not trivial to figure out the values for ℓ1 and ℓ2, i.e. the longest common extension forward and backward. These values correspond to the rightmost column in the dynamic programming matrix, for which a certain value k1 < k appears. In (Landau et al., 2001) this is solved by maintaining a heap for each value k1 < k, which contains all columns having the value k1. Thus, the rightmost column can be located and updated in O(log k) time.

Complexity analysis: There are O(log(n/k)) iterations. In each iteration, we construct a matrix of size O(k2) in time O(k2 + nk). The matrix is modified for each period size in O(k log k) time. Overall, each iteration takes O(nk log k) time, and the algorithm has time complexity O(nk log k log(n/k)). The space complexity is O(n + k2).

2.3 Implementation

We have implemented the program in C++ including the first two speedups. The third speedup is complicated, and it is our guess that in practice it would not significantly affect the program's speed. This is based on the observation that in practice we almost never compute the full n × 2k matrix, as we stop calculating each of the diagonals after k + 1 errors are found. Our program can process a string of several thousand characters in a fraction of a second. Currently, the running time of the program is proportional to the time it takes to print the output.

An inherent flexibility exists in the definition of approximate repeats, with the goal of allowing for mutations. This flexibility poses an issue when the goal is to detect all substrings of the input string that satisfy the definition. The issue is that in practice there is too much output. Each repeat that is reported satisfies the mathematical definition; however, many reported repeats differ only slightly one from another. Hence, our challenge was to modify our algorithm in a way that it reports a meaningful and significant subset of the found repeats. To this end, we filter out all repeats that prove to be redundant. The filtering is incorporated into the algorithm itself, which is much more efficient than doing a post-processing phase to filter the output.

We deal with the issue of filtering by augmenting the different phases of our algorithm. The first filtering technique filters repeats found within an individual iteration, while the second technique filters repeats found in different iterations.

Filtering a given iteration: 1. Choose the Optimal Period using the Local Maximum. The first point of comparison is anchored at the center position of the string, but the second point varies, and only by one character at a time. Suppose there exists an alignment for some p with h errors. Then, this alignment can be converted into an alignment for p + 1 with at most h + 1 errors. Thus, we do not want to report every possible k-edit repeat. We are only interested in the value p that produces the best matching of the surrounding string segments. This leads to a local maximum idea. As the first point is anchored, and the second point moves forward, the comparison produces better matches as the second point nears a corresponding position in a different part of the repeat. The comparison gives worse results as the second point moves away from that ‘optimal’ position, until it begins nearing the corresponding position of another repeat.

2. Combine several neighboring repeats. Once a local maximum, p, is found, we have potentially k different repeats with period p, each shifted over several characters. This is because we consider k1 errors to the left and k2 errors to the right, for all values of k1, k2 such that k1 + k2 = k. For purposes of reporting the repeats, it is much clearer to see the shifting repeats as a single repeat. Hence, in this case, we combine all repeats with period p into a single repeat. We note that the combined repeat has at most 2k errors.

Filtering between iterations: The idea for filtering between iterations is a simple one. If a repeat reaches either end of the string segment being examined by the current iteration, it will also be detected by a different iteration at a higher level in the recursion and thus need not be reported in the current iteration.

3 RESULTS

The table on the left in Figure 3 contains a summary of the repeats that our program detected in the first 14 000 bp of human chromosome 18 (May 2004). The exact filtering criteria are still being improved, and thus we manually filtered the repeats to clarify the results. The criterion for this run allows up to k = 40 errors and requires a minimum length of at least 100 characters more than the number of errors. This particular criterion was tested on pseudorandom strings created using the C++ function rand(), which was initialized with the system clock time to ensure a different set of pseudorandom strings for each execution. It was tested on 20 strings, each 20 000 characters in length, and returned a single repeat in 6 of them, and no repeats in the rest. The table on the right of Figure 3 shows the results of TRF (Benson, 1999) on the same sequence.

Fig. 3

The repeats found by our program are compared and contrasted with the repeats found by TRF (Benson, 1999).

In Figure 4 we show the alignments of the copies of the two new repeats found by our program, beginning at positions 795 and 10 690.

Fig. 4

Alignments of the copies of two new repeats found by our program.

4 DISCUSSION

Many of the repeats that are found by our program are also found by TRF (Benson, 1999), which uses a non-evolutive definition. This is due to the fact that the consensus type repeat often satisfies the evolutive definition as well. Sometimes, our program will detect more errors than TRF, because in the case of a deviation in a single period, we count each deviation twice: once from the period before to the period with the deviation and once again from the period with the deviation to the period after. However, more often, changes will occur and be repeated for several periods before they change again. It is in these situations that the evolutive definition is most applicable. A non-evolutive definition causes an error to be reported for each period in which the deviation occurs, whereas an evolutive definition counts it as only one error until it is changed again.

We conclude that our novel definition and efficient program provide a useful tool for analyzing whole genomes. It is our hope that with its widespread use new scientific discoveries and inferences will be achieved.

[Source: http://www.codeproject.com/KB/recipes/Levenshtein.aspx]


Introduction

The Levenshtein distance measures the difference between two strings: the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. I use it in a web crawler application to compare the new and old versions of a web page. If it has changed enough, I update it in my database.

Description

The original algorithm creates a matrix, where the size is StrLen1*StrLen2. If both strings are 1000 chars long, the resulting matrix is 1M elements; if the strings are 10,000 chars, the matrix will be 100M elements. If the elements are integers, it will be 4*100M == 400MB. Ouch!

This version of the algorithm uses only 2*StrLen elements, so the latter example would need 2*10,000*4 = 80 KB. The result is that not only does it use less memory, but it is also faster, because the memory allocation takes less time. When both strings are about 1K in length, the new version is more than twice as fast.

Example

The original version would create a matrix[6+1,5+1], my version creates two vectors[6+1] (the white elements). In both versions, the order of the strings is irrelevant, that is, it could be matrix[5+1,6+1] and two vectors[5+1].

The new algorithm

Steps

Step 1. Set n to be the length of s ("GUMBO").
        Set m to be the length of t ("GAMBOL").
        If n = 0, return m and exit.
        If m = 0, return n and exit.
        Construct two vectors, v0[m+1] and v1[m+1], containing 0..m elements.
Step 2. Initialize v0 to 0..m.
Step 3. Examine each character of s (i from 1 to n).
Step 4. Examine each character of t (j from 1 to m).
Step 5. If s[i] equals t[j], the cost is 0.
        If s[i] is not equal to t[j], the cost is 1.
Step 6. Set cell v1[j] equal to the minimum of:
        a. The cell immediately above plus 1: v1[j-1] + 1.
        b. The cell immediately to the left plus 1: v0[j] + 1.
        c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost.
Step 7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell v1[m].
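The steps above can be sketched as follows (in Python rather than the article's C#, for brevity; the function name is mine):

```python
def levenshtein(s, t):
    """Two-vector Levenshtein distance: O(len(t)) memory instead of a full matrix."""
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    v0 = list(range(m + 1))  # step 2: initialize v0 to 0..m
    v1 = [0] * (m + 1)
    for i in range(1, n + 1):           # step 3: each character of s
        v1[0] = i                       # distance from s[:i] to the empty string
        for j in range(1, m + 1):       # step 4: each character of t
            cost = 0 if s[i - 1] == t[j - 1] else 1   # step 5
            v1[j] = min(v1[j - 1] + 1,  # step 6a: cell above
                        v0[j] + 1,      # step 6b: cell to the left
                        v0[j - 1] + cost)  # step 6c: diagonal
        v0, v1 = v1, v0  # "SWAP(v0,v1)": exchange references, not contents
    return v0[m]         # step 7 (the last computed vector ends up in v0)

print(levenshtein("GUMBO", "GAMBOL"))  # 2
```

Note that the swap exchanges only the references, exactly as the article describes, so no element copying occurs between iterations.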

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL":

Steps 1 and 2

  v0 v1        
    G U M B O
  0 1 2 3 4 5
G 1          
A 2          
M 3          
B 4          
O 5          
L 6          

Steps 3 to 6, when i = 1

  v0 v1        
    G U M B O
  0 1 2 3 4 5
G 1 0        
A 2 1        
M 3 2        
B 4 3        
O 5 4        
L 6 5        

Steps 3 to 6, when i = 2

SWAP(v0,v1): If you look at the code, you will see that I don't swap the contents of the vectors; I just swap the references to them.

Set v1[0] to the column number, e.g. 2.

    v0 v1      
    G U M B O
  0 1 2 3 4 5
G 1 0 1      
A 2 1 1      
M 3 2 2      
B 4 3 3      
O 5 4 4      
L 6 5 5      

Steps 3 to 6, when i = 3

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 3.

      v0 v1    
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2    
A 2 1 1 2    
M 3 2 2 1    
B 4 3 3 2    
O 5 4 4 3    
L 6 5 5 4    

Steps 3 to 6, when i = 4

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 4.

        v0 v1  
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3  
A 2 1 1 2 3  
M 3 2 2 1 2  
B 4 3 3 2 1  
O 5 4 4 3 2  
L 6 5 5 4 3  

Steps 3 to 6, when i = 5

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 5.

          v0 v1
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3 4
A 2 1 1 2 3 4
M 3 2 2 1 2 3
B 4 3 3 2 1 2
O 5 4 4 3 2 1
L 6 5 5 4 3 2

Step 7

The distance is in the lower right hand corner of the matrix, v1[m] == 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = two changes).

Improvements

If you are sure that your strings will never be longer than 2^16 chars, you could use ushort instead of int; if the strings are shorter than 2^8 chars, you could use byte. The algorithm might be even faster in unmanaged code, but I have not tried it.

[Source: http://wiki.kldp.org/Translations/html/UTF8-Unicode-KLDP/UTF8-Unicode-KLDP-7.html]


First of all, UCS and Unicode are just code tables that assign integers to characters. Several alternatives exist for how a sequence of such characters, or of their respective integer values, can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2 or 4 bytes; the official names for these are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). An ASCII or Latin-1 file can be converted into a UCS-2 file by inserting a 0x00 byte in front of every ASCII byte. For a UCS-4 file, three 0x00 bytes must instead be inserted before every ASCII byte.
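The byte-insertion conversion described above can be observed with Python's big-endian UTF-16 and UTF-32 codecs, which coincide with UCS-2 and UCS-4 for ASCII characters (a small illustration; the example string is mine):

```python
text = "ABC"

# UCS-2 (big-endian): one 0x00 byte inserted before each ASCII byte
ucs2 = text.encode("utf-16-be")
print(ucs2.hex())  # 004100420043

# UCS-4 (big-endian): three 0x00 bytes inserted before each ASCII byte
ucs4 = text.encode("utf-32-be")
print(ucs4.hex())  # 000000410000004200000043
```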

Using UCS-2 (or UCS-4) under Unix would cause very serious problems. Strings in these encodings can contain, as parts of many wide characters, bytes such as '\0' or '/', which have special meanings in filenames and in C library function parameters. In addition, most Unix tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode for filenames, text files, environment variables, and so on.

The UTF-8 encoding, defined in Annex R of ISO 10646-1 and in RFC 2279, does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.

UTF-8 has the following properties:

  • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization, makes the encoding stateless, and makes it robust against missing bytes.
  • All possible 2^31 UCS codes can be encoded.
  • UTF-8 encoded characters may theoretically be up to six bytes long, but 16-bit BMP characters are at most three bytes long.
  • The sorting order of big-endian UCS-4 byte strings is preserved.
  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
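The self-synchronization property described above can be checked directly (a small Python illustration; the example character is my choice):

```python
# U+2260 (not equal to) encodes as a three-byte UTF-8 sequence.
seq = "\u2260".encode("utf-8")
print([hex(b) for b in seq])  # ['0xe2', '0x89', '0xa0']

# Every byte of a multibyte sequence has its most significant bit set,
# so no ASCII byte (0x00-0x7F) can appear inside another character:
assert all(b & 0x80 for b in seq)

# All continuation bytes follow the 10xxxxxx pattern (0x80-0xBF),
# while the lead byte does not, which is what makes resynchronization easy:
assert all(0x80 <= b <= 0xBF for b in seq[1:])
assert not (0x80 <= seq[0] <= 0xBF)
```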

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:


U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation, the rightmost x bit being the least significant. Only the shortest possible multibyte sequence that can represent the code number of the character may be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.

Example: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

and the character U+2260 = 0010 0010 0110 0000 (the "not equal to" symbol) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
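The two worked examples can be reproduced by packing bits according to the table above (a minimal Python sketch covering only the 1-, 2-, and 3-byte cases; the helper name is mine):

```python
def utf8_encode(cp):
    """Encode a code point per the UTF-8 bit-pattern table (BMP only, up to U+FFFF)."""
    if cp <= 0x7F:
        return bytes([cp])                        # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),           # 110xxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    return bytes([0xE0 | (cp >> 12),              # 1110xxxx
                  0x80 | ((cp >> 6) & 0x3F),      # 10xxxxxx
                  0x80 | (cp & 0x3F)])            # 10xxxxxx

print(utf8_encode(0x00A9).hex())  # c2a9   (copyright sign)
print(utf8_encode(0x2260).hex())  # e289a0 (not equal to)
```

Because each branch handles only the largest code points that fit its pattern, the function always emits the shortest possible sequence, as the specification requires.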

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not write UTF-8 in documentation in any other way (such as utf8 or UTF_8), unless of course you are referring to a variable name and not the encoding itself.

An important note for developers of UTF-8 decoding routines: for security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:

  0xc0 0x8A
  0xe0 0x80 0x8A
  0xf0 0x80 0x80 0x8A
  0xf8 0x80 0x80 0x80 0x8A
  0xfc 0x80 0x80 0x80 0x80 0x8A

Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns:


1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)

Also, the code positions U+D800 to U+DFFF (UTF-16 surrogates), as well as U+FFFE and U+FFFF, must not occur in normal UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed or overlong sequences, for safety reasons.
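Python's built-in strict UTF-8 decoder enforces the overlong and surrogate rules above, so it can serve as an illustration (the overlong form is the first one from the list; the helper name is mine):

```python
def decodes(data):
    """Return True if the byte string is accepted by a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodes(b"\x0a"))          # True:  the only valid encoding of U+000A
print(decodes(b"\xc0\x8a"))      # False: overlong two-byte form of U+000A
print(decodes(b"\xed\xa0\x80"))  # False: would decode to U+D800, a UTF-16 surrogate
```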

Markus Kuhn's UTF-8 decoder stress test file contains a systematic collection of malformed and overlong UTF-8 sequences and will help you verify the robustness of your decoder.
