[출처 : http://siderite.blogspot.com/2006/11/fast-string-distance-sift-algorithm.html]

foreach phase
    remove identical overlapping characters
shake strings
return number of shakes + the length of the longest string between the two.

Super Fast and Accurate string distance algorithm: Sift3



        // Sift3: a fast approximation of string edit distance.
        // maxOffset bounds how far ahead we look to re-synchronize
        // the two strings after a mismatch.
        public float Distance(string s1, string s2, int maxOffset)
        {
            // Empty-string cases: the distance is the other string's length.
            if (String.IsNullOrEmpty(s1))
                return
                    String.IsNullOrEmpty(s2) ? 0 : s2.Length;
            if (String.IsNullOrEmpty(s2))
                return s1.Length;
            int c = 0;        // cursor advanced through both strings
            int offset1 = 0;  // extra shift into s1
            int offset2 = 0;  // extra shift into s2
            int lcs = 0;      // count of matched characters (a rough "common subsequence" length)
            while ((c + offset1 < s1.Length)
                   && (c + offset2 < s2.Length))
            {
                if (s1[c + offset1] == s2[c + offset2]) lcs++;
                else
                {
                    // Mismatch: reset the shifts, then look up to maxOffset
                    // characters ahead in either string for a re-match.
                    offset1 = 0;
                    offset2 = 0;
                    for (int i = 0; i < maxOffset; i++)
                    {
                        if ((c + i < s1.Length)
                            && (s1[c + i] == s2[c]))
                        {
                            offset1 = i;
                            break;
                        }
                        if ((c + i < s2.Length)
                            && (s1[c] == s2[c + i]))
                        {
                            offset2 = i;
                            break;
                        }
                    }
                }
                c++;
            }
            // Average length minus matched characters: 0 for identical strings.
            return (s1.Length + s2.Length) / 2 - lcs;
        }
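
For reference, a few example calls, hand-traced against the code above (the containing class name Sift3Metric is hypothetical, since the post shows only the method):

    var sift3 = new Sift3Metric();  // hypothetical class holding the Distance method above
    float d1 = sift3.Distance("hello", "hello", 5);    // 0 - identical strings
    float d2 = sift3.Distance("kitten", "sitting", 5); // 2 - the exact Levenshtein distance is 3
    float d3 = sift3.Distance("", "abc", 5);           // 3 - empty vs. non-empty: the other length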
[출처 : http://bioinformatics.oxfordjournals.org/content/23/2/e30.full]


Abstract

Motivation: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence.

Results: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats.

Availability: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

Contact: sokol@sci.brooklyn.cuny.edu

1 INTRODUCTION

A tandem repeat, or tandem array, in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats appear in a wide variety of biological sequences. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies.

Most genomes have a high content of repetitive DNA. Repeated sequences make up 50% of the human genome (Collins et al., 2003). The repeats in the human genome are important as genetic markers (Kannan and Myers, 1996), and they are also responsible for a number of inherited diseases involving the central nervous system. For example, in a normal FMR-1 gene, the triplet CGG is tandemly repeated 6 to 54 times, while in patients with Fragile X Syndrome the pattern occurs >200 times. Kennedy disease and Myotonic Dystrophy are two other diseases that have been associated with triplet repeats (Frazier et al., 2003). In addition, tandem repeats are used in population studies (Uform and Wayne, 1993), conservation biology (The International Human Genome Mapping Consortium, 2001) and in conjunction with multiple sequence alignments (Benson, 1997; Kolpakov and Kucherov, 2001).

Many of the repeats appear in non-coding regions of DNA. Although useful for identity testing, these regions were thought to carry no function. Recently, however, it has been shown that repeats in the genome of a rodent provide code for its sociobehavioral traits (Jeffreys, 1993). Scientists currently believe that the non-coding tandem repeats do affect the function of the DNA in ways yet unknown.

Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence, while allowing for mutations. Due to the recent sequencing of the human genome by the Human Genome Project (Groult et al., 2003; Fu et al., 1992), it is now possible to analyze the sequence of the human genome and create a listing of all tandem repeats in the genome. Detecting all tandem repeats in protein sequences is an important goal as well (Kitada et al., 1996).

In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence. We have already implemented the algorithm in C++, and a website for the program is under construction. Our program automates the task of listing all repeats that occur in a biological sequence. It is our hope that with the use of our program, biologists will discover new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

The contributions of this paper are 2-fold. The theoretical contributions consist of an extension of the algorithm of Landau and Schmidt (Landau et al., 1998) to locate evolutive tandem repeats over the edit distance and a more careful analysis of the algorithm eliminating the need for suffix trees. The practical contribution is an efficient program for finding tandem repeats that will become available on the web for all to use.

1.1 Problem definition

We define tandem repeats over the edit distance using the model of evolutive tandem repeats (Groult et al., 2004; Hammock and Young, 2005). The model assumes that each copy of the repeat, from left to right, is derived from the previous copy through zero or more mutations. Thus, each copy in the repeat is similar to its predecessor and successor copy.

Let ed(s1, s2) denote the minimum edit distance between two strings, s1 and s2.

Definition 1

A string R is a k-edit repeat if it can be partitioned into consecutive substrings, R = r1 r2 … rℓ, such that

ed(r1, r2) + ed(r2, r3) + … + ed(rℓ−1, rℓ) ≤ k.

The last copy of the repeat does not have to be complete, and therefore ed(rℓ−1, rℓ) refers to the edit distance between a prefix of rℓ−1 and rℓ.

A k-edit repeat is an evolutive tandem repeat in which there are at most k insertions, deletions and mismatches, overall, between all consecutive copies of the repeat. For example, the string R = caagctcagctccgct is a 2-edit repeat (partition it as caagct, cagct, ccgct: each copy is within one edit of its predecessor). A k-edit repeat is maximal if it cannot be extended to the right or left. Maximal repeats can be overlapping, as shown in the following example.

Example

Repeat of length 14 with 2 errors: (table not shown)

Repeat of length 18 with 2 errors: (table not shown)
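
To make Definition 1 concrete, here is a minimal C# sketch (all names mine) that verifies the partition caagct, cagct, ccgct of the earlier example string by summing the pairwise edit distances:

    using System;

    class KEditCheck
    {
        // Plain dynamic programming edit distance, playing the role of ed(s1, s2).
        static int Ed(string a, string b)
        {
            var d = new int[a.Length + 1, b.Length + 1];
            for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
            for (int j = 0; j <= b.Length; j++) d[0, j] = j;
            for (int i = 1; i <= a.Length; i++)
                for (int j = 1; j <= b.Length; j++)
                    d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                       d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
            return d[a.Length, b.Length];
        }

        static void Main()
        {
            // A partition of R = caagctcagctccgct into consecutive copies.
            string[] r = { "caagct", "cagct", "ccgct" };
            int k = 0;
            for (int i = 0; i + 1 < r.Length; i++)
                k += Ed(r[i], r[i + 1]);   // ed(r_i, r_{i+1})
            Console.WriteLine(k);          // prints 2, so R is a 2-edit repeat
        }
    }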

In this paper, we address the k-edit Repeat Problem, defined as follows. Given a string, S, and an integer k, find all maximal k-edit repeats in S. From the theoretical viewpoint, we provide an efficient algorithm that locates all maximal k-edit repeats in a string. We implemented this algorithm and found that our program locates many repeats that are similar to one another. Hence, we incorporated several heuristics into our program to make the output more succinct and useful. The heuristics are described in Section 2.3. The result is a concise listing of the k-edit repeats occurring in the input sequence.

1.2 Related work

The early work on finding tandem repeats in strings dealt with simple repeats, i.e. repeats that contain exactly two parts. Kannan and Myers (Katti et al., 2000), Benson (Benson, 1995) and Schmidt (Spong and Hellborg, 2002) present algorithms for finding simple repeats using the weighted edit distance.

The true goal of biologists, which turns out to be a much more difficult problem, is to find all maximal repeats in a string. A maximal repeat within a string is a repeat that contains two or more consecutive copies, and it cannot be extended further to the left or to the right. Each part of the repeat is called a period. When considering maximal repeats without errors, all periods (except possibly the last one) have equal length. For example, in the string aatgtgtgt, the string tgtgtgt is a maximal repeat with 3.5 periods.

More recently, the concentration has been on searching for maximal repeats with errors. In this case, however, the definition is not obvious. Given a repeat with several periods, what does it mean for the parts to be ‘similar’? Benson (Benson, 1999) requires that a consensus string must exist which is similar to all periods of the repeat. Using this approach, it is difficult to provide a deterministic algorithm to find tandem repeats. Hence, Tandem Repeats Finder (TRF), developed by Benson, uses a collection of statistical criteria in combination with k-tuple matches to detect statistically significant tandem repeats. Similar criteria are used by Wexler et al. (2004).

The goal of this paper is to provide a rigorous definition of tandem repeats and to provide a deterministic algorithm to detect all repeats that satisfy the definition. All of the algorithms that use this approach build upon the Hamming distance, which measures the number of mismatching characters between two strings, maintaining the property that each period of a repeat has the same length.

In (Groult et al., 2004; Hammock and Young, 2005), evolutive tandem repeats are defined over the Hamming distance as follows. A string is an evolutive tandem repeat if it can be broken up into substrings r1 … rℓ such that the Hamming distance between ri and ri+1 is smaller than a given threshold, for 1 ≤ i < ℓ. The definition also allows for a small gap between copies of the repeat. The algorithm presented in (Hammock and Young, 2005) has worst-case quadratic runtime.

Kolpakov and Kucherov (Kolpakov et al., 2003; http://www.loria.fr/mreps/) present two different definitions. The first definition sums the mismatches between all neighboring copies of the repeat. Formally, a repeat r = r1 … rℓ is called a k-repetition if the sum of the Hamming distances between consecutive copies ri and ri+1 is ≤k. The second definition allows k mismatches between each pair of consecutive copies; this is called a k-run, and is similar to the evolutive tandem repeats over the Hamming distance. The algorithm of (Kolpakov et al., 2003; http://www.loria.fr/mreps/) runs in O(nk log k) time. The algorithm has been implemented in the mreps software (Landau).

Landau, Schmidt and Sokol (Landau et al., 2001) define a tandem repeat to have k mismatches if the alignment constructed from its periods has k nonuniform columns. Their algorithm runs in O(nka log(n/k)) time, where n is the length of the sequence, k is the error bound and a is the maximum number of periods in any reported repeat. The algorithm has been implemented and is available at (Landau and Vishkin, 1989; http://csweb.cs.haifa.ac.il/library/ and http://www.sci.brooklyn.cuny.edu/~sokol/trepeats/).

The disadvantage of the Hamming distance is that it only accounts for point mutations and does not allow insertions and deletions. To model mutations more generally, it is preferable to use the edit distance, defined by Levenshtein (Main and Lorentz, 1984) as the minimum number of insertions, deletions and substitutions necessary to transform one string into the other. Ideally, we would like to differentially weight mutations; substitutions are typically scored more permissively than insertions and deletions. In this paper, we weight all differences uniformly, i.e. we use an edit distance scheme.

2 METHODS

In this section we describe the algorithm for finding k-edit repeats. We first describe a straightforward solution, following which we describe three speedups to achieve a time and space efficient algorithm. The speedups use similar ideas to those in (Landau et al., 2001).

2.1 A straightforward solution

The classical method for calculating the edit distance between two strings S = s1 … sm and S′ = s′1 … s′n is to construct an n × m dynamic programming matrix, to initialize row 0 (entry at column j) to j and column 0 (entry at row i) to i, and to fill it in for i, j > 0 using the following formula:

d[i, j] = min( d[i−1, j] + 1, d[i, j−1] + 1, d[i−1, j−1] + cost ), where cost = 0 if si = s′j and cost = 1 otherwise.

We define a p-restricted edit distance alignment as an edit distance alignment between two strings that disallows p insertions into the first string (alternatively, p deletions in the second string). This definition is necessary, since we would like to align a prefix of a string with a proper suffix of the same string. If insertions into the prefix are not restricted, then the string may eventually ‘catch up’ with itself, aligning characters in the prefix with the exact corresponding characters in the suffix.

In terms of the edit distance matrix, if we assume that the first input string is to the left of the matrix, a p-restricted alignment corresponds to the regular edit distance matrix, allowing only p − 1 diagonals to the left of the main diagonal.
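
As a sketch (class and function names mine), the p-restricted alignment can be computed with the classical matrix by simply treating every cell that lies p or more diagonals to the left of the main diagonal as unreachable:

    using System;

    static class Restricted
    {
        // A large-but-overflow-safe stand-in for "unreachable".
        const int INF = int.MaxValue / 2;

        // p-restricted edit distance between a (the first string, on the rows)
        // and b: following the description above, cells with i - j >= p are
        // banned, leaving only p - 1 diagonals left of the main diagonal.
        // A result >= INF means no restricted alignment exists.
        public static int RestrictedEditDistance(string a, string b, int p)
        {
            int n = a.Length, m = b.Length;
            var d = new int[n + 1, m + 1];
            for (int i = 0; i <= n; i++)
                for (int j = 0; j <= m; j++)
                {
                    if (i - j >= p) { d[i, j] = INF; continue; }  // banned diagonal
                    if (i == 0) { d[i, j] = j; continue; }
                    if (j == 0) { d[i, j] = i; continue; }
                    int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                    d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                       d[i - 1, j - 1] + cost);
                }
            return d[n, m];
        }
    }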

Lemma 1

A string R of length q is a k-edit repeat if and only if a p-restricted edit distance alignment can be constructed between a suffix of R, beginning at location p + 1, and a prefix of R, with ≤k differences, for some p < q/2.

Proof

The proof follows from the fact that we can obtain a k-edit repeat from a 2-sequence alignment of the prefix and suffix, and alternatively, given a k-edit repeat, the prefix/suffix can be derived from the alignment of the repeat. Given a p-restricted alignment of a suffix/prefix of a string R, the copies of the repeat can be derived from the alignment as follows. The first copy consists of the first p characters of the prefix. The second copy consists of all characters in the suffix that appear in the alignment against the first p characters in the prefix. Each successive copy is calculated by using the previous known copy in the prefix to demarcate the following copy in the suffix.

If a string R is a k-edit repeat, the prefix/suffix alignment can be obtained from the copy-to-copy alignments. The suffix can be taken as the start of the second copy of the repeat. The prefix is the string R minus the last copy of the repeat. Each has length ≥q/2 since only one copy of the repeat is removed. The prefix/suffix alignment can be constructed by aligning copy i in the prefix with copy i + 1 in the suffix.

We can use lemma 1 to derive a straightforward algorithm for finding all k-edit repeats within a string. The algorithm finds all substrings of the string for which a 2-sequence alignment exists between a proper prefix and a proper suffix of the substring. It then breaks up the 2-sequence alignment into its copies using the method in the first part of the proof of the lemma.

Simple Algorithm

Input: A string S, and an integer k.

Output: All k-edit repeats in S.

  1. Consider each index 1 ≤ i < n as the starting point of a potential repeat.

  2. Consider each index j (to the right of location i), i < j ≤ n, as the start of the proper suffix of the repeat.

  3. For each pair (i, j) perform a (j − i)-restricted alignment of the two strings S1 = si … sn and S2 = sj … sn using the classical dynamic programming method (allowing only j − i − 1 diagonals to the left of the main diagonal).

  4. If there is a match between S1 and S2, of length at least j − i, with ≤k errors, then a repeat exists, beginning at location i and ending at the location before the (k + 1)-st error found in step 3.

Complexity analysis: The time complexity of the naive algorithm is O(n^4). There are O(n^2) iterations, and in each iteration we compute a dynamic programming matrix of size O(n^2). The space used is that of the edit distance matrix, which is of size O(n^2).
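
Using the RestrictedEditDistance sketch from above, the simple algorithm is essentially two nested loops. This is a loose transcription: it reports a candidate when the two suffixes align within k errors, whereas the paper extends the alignment only until the (k + 1)-st error.

    // Naive enumeration from the Simple Algorithm: O(n^2) pairs (i, j),
    // each checked with an O(n^2) dynamic program -> O(n^4) overall.
    // (Assumes `using System;` and the Restricted class from the sketch above.)
    static void SimpleKEditRepeats(string s, int k)
    {
        int n = s.Length;
        for (int i = 0; i < n; i++)            // step 1: potential repeat start
            for (int j = i + 1; j < n; j++)    // step 2: proper suffix start
            {
                int p = j - i;                 // candidate period
                if (Restricted.RestrictedEditDistance(s.Substring(i), s.Substring(j), p) <= k)
                    Console.WriteLine($"candidate k-edit repeat at {i}, period {p}");
            }
    }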

In the following section we explain the efficient algorithm. We present it as a three-tier modification to the simple algorithm. The new algorithm reduces the time to O(nk log k log(n/k)) and it reduces the space from O(n^2) to O(k^2).

2.2 An efficient algorithm

We present a three-tier speedup to the naive algorithm. First, we reduce the number of iterations. Then, we show how each iteration can be improved by pruning the edit distance matrix. The third speedup fine-tunes the computation time of the partial edit distance matrix.

Speedup #1: reduce the number of iterations. We use the idea of Main and Lorentz (Schmidt) to reduce the number of iterations from O(n^2) to O(n log(n/k)). The algorithm, rather than considering all possible starts, anchors the comparisons at the center of the string. In the first iteration, the input is S = s1, … , sn, and all repeats that cross the center of the string (i.e. include character sn/2) are found. In the following iteration, S is divided into two substrings, S = uv, where u = s1 … sn/2 and v = sn/2+1 … sn. The algorithm which finds repeats crossing the center is applied recursively to both u and v.

In order to locate all repeats that include the character sn/2, it is necessary to consider all alignments in which sn/2 corresponds to another character in the string. In the following, we describe the procedure which aligns location n/2 with all indices p > n/2. By symmetry, alignments for values p < n/2 can be produced.

Find repeats

For p = 1 to n/2 do // find repeats with period p

  1. Forward Direction: Find the longest prefix of sn/2+p … sn that p-restricted matches a prefix of sn/2 … sn with ≤k errors (Fig. 1).

  2. Backward Direction: Find the longest suffix of s1 … sn/2+p−1 that p-restricted matches a suffix of s1 … sn/2−1 with ≤k errors.

  3. Consider all pairs (k1, k2) for which k1 + k2 = k. Let ℓ1 be the length of the backward direction comparison with k1 errors, and ℓ2 be the length of the forward direction comparison with k2 errors. If ℓ1 + ℓ2 ≥ p then there is a tandem repeat extending from sn/2−ℓ1 to sn/2+p+ℓ2−1.

Fig. 1

Computing repeats: ℓ1 is the length of the backward direction comparison with k1 ≤ k errors, and ℓ2 is the length of the forward direction comparison with k2 ≤ k errors. If k1 + k2 ≤ k and ℓ1 + ℓ2 ≥ p, then there is a k-edit repeat at sn/2−ℓ1 … sn/2+p+ℓ2−1.
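
In code form, step 3 reduces to scanning the error splits. A sketch with illustrative names: bwd[h] and fwd[h] would hold the backward/forward extension lengths computed in steps 1 and 2 for the current period p, and center is the index of sn/2.

    // For each split k1 + k2 = k, the backward extension (length l1, k1 errors)
    // and the forward extension (length l2, k2 errors) together must cover at
    // least one full period p for a k-edit repeat to exist.
    // (Assumes `using System;`.)
    static void ReportRepeats(int[] bwd, int[] fwd, int k, int p, int center)
    {
        for (int k1 = 0; k1 <= k; k1++)
        {
            int k2 = k - k1;
            int l1 = bwd[k1];   // backward direction, k1 errors
            int l2 = fwd[k2];   // forward direction, k2 errors
            if (l1 + l2 >= p)   // spans a full period
                Console.WriteLine(
                    $"k-edit repeat from s[{center - l1}] to s[{center + p + l2 - 1}]");
        }
    }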

We illustrate this idea with the following example.

Let k = 4; find the k-edit repeats in the string S = ctcgagctcctgacctcgtga.

We show the iteration of p = 6, assuming that the first appearance of ‘t’ in the string is location n/2. We omitted the first part of the string since it is not necessary for the example. The forward direction comparison is done by aligning the string sn/2 … sn with the string sn/2+6 … sn. The optimal alignment (i.e. obtaining the minimum edit distance) is shown in Figure 2. Two gaps are introduced, and there are two mismatches.

Fig. 2

Optimal alignment obtaining the minimum edit distance.

The length of the forward extension with 4 errors is 14 (i.e. the alignment extends up until location n/2 + 6 + 14 − 1). Since we allowed 4 errors in the forward extension, and k = 4, we do not allow any errors in the backward extension. The length of the backward extension with 0 errors is 1 (the single character c).

The copies of the repeat are derived from the 2-sequence alignment as follows:

Repeat of length 21 with 4 errors.

(table not shown)

Speedup #2: Reduce the Size of the Dynamic Programming Matrix. In the straightforward algorithm, the forward and reverse direction extensions are computed by building the dynamic programming edit distance matrix for each pair of substrings. The second idea for speeding up the algorithm uses the observation that it is not necessary to compute the entire edit distance matrix, since we are only looking for k errors.

Using the ideas of Ukkonen (1983) and Landau/Vishkin (Levenshtein, 1966), we can reduce the size of the matrix from O(n^2) to O(k^2). Consider the diagonals in the edit distance matrix, where diagonal d corresponds to the diagonal with indices {(i, j) | i − j = d}. The main diagonal is diagonal 0. Any diagonal in the edit distance matrix that has distance >k from diagonal 0 is not of interest, since all of its values will be >k. Thus, it is only necessary to compute the values on 2k + 1 diagonals. Furthermore, on a given diagonal, successive values differ by at most 1. Therefore, it is only necessary to store the location of the last row on each diagonal that has value h, for each 0 ≤ h ≤ k. Instead of the edit distance matrix, we compute a k × k matrix L such that:

L[d, h] = largest row in the edit distance matrix on diagonal d that has the value h.

Analysis: The matrix L can be computed by filling in n × 2k entries in the dynamic programming matrix. Each row needs 2k values, and potentially n rows will have to be computed. Using the classical dynamic programming edit distance algorithm (as in Section 2.1), this can be done in O(nk) time. Only a constant number of rows need be stored, hence the space complexity is O(k^2). This looks excellent; however, consider the fact that the matrix L needs to be computed, in a given iteration, for each possible 1 ≤ p < n/2. This results in the matrix being computed potentially O(n) times, resulting in O(n^2 k) time per iteration. This is where the third speedup comes into play.

Speedup #3: reduce the computation time of the dynamic programming matrix. The third speedup uses information from one period size for the following period size. In each iteration, the algorithm computes the edit distance matrix separately for each period 1 ≤ p ≤ n/2. For p = 1, this translates into the computation of an edit distance matrix for the two strings sn/2 … sn and sn/2+1 … sn. For p = 2, a new matrix is computed for the strings sn/2 … sn and sn/2+2 … sn, and so on. The only change from one period size to the next is the deletion of the first character in the second string. There is a lot of overlapping computation between different period sizes. Landau, Myers and Schmidt (LMS) (Landau and Schmidt, 1993) called this problem ‘incremental string comparison’, and showed how to get from one matrix to the next in O(k) time, assuming we are only searching for up to k errors.

Thus, we can compute the first O(k^2) matrix using the previous speedup (Levenshtein, 1966), and then for each period size spend O(k) time using LMS to modify the matrix.

The only remaining issue is that when using the LMS algorithm to update the matrix it is not trivial to figure out the values for ℓ1 and ℓ2, i.e. the longest common extension forward and backward. These values correspond to the rightmost column in the dynamic programming matrix, for which a certain value k1 < k appears. In (Landau et al., 2001) this is solved by maintaining a heap for each value k1 < k, which contains all columns having the value k1. Thus, the rightmost column can be located and updated in O(log k) time.

Complexity analysis: There are O(log(n/k)) iterations. In each iteration, we construct a matrix of size O(k^2) in time O(k^2 + nk). The matrix is modified for each period size in O(k log k) time. Overall, each iteration takes O(nk log k) time, and the algorithm has time complexity O(nk log k log(n/k)). The space complexity is O(n + k^2).
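
To make Speedup #2 concrete, here is a minimal, self-contained sketch of a k-bounded edit distance computed diagonal by diagonal, in the spirit of Ukkonen and Landau–Vishkin (class and method names are mine; the table below plays the role of the paper's L[d, h]):

    using System;

    static class DiagonalEditDistance
    {
        // Returns the edit distance between a and b if it is <= k, else -1.
        // L[d + k, h] = furthest row i reachable on diagonal d (i - j = d)
        // with h errors; -1 means unreachable.
        public static int AtMostK(string a, string b, int k)
        {
            int n = a.Length, m = b.Length;
            if (Math.Abs(n - m) > k) return -1;      // length gap alone exceeds k

            var L = new int[2 * k + 1, k + 1];
            for (int x = 0; x < 2 * k + 1; x++)
                for (int h = 0; h <= k; h++)
                    L[x, h] = -1;

            for (int h = 0; h <= k; h++)
            {
                for (int d = -Math.Min(h, k); d <= Math.Min(h, k); d++)
                {
                    int best = -1, c;
                    if (h == 0)
                    {
                        if (d == 0) best = 0;        // start at cell (0, 0)
                    }
                    else
                    {
                        c = L[d + k, h - 1];          // substitution: (i,j) -> (i+1, j+1)
                        if (c >= 0 && c + 1 <= n && c + 1 - d <= m) best = c + 1;
                        if (d > -k && (c = L[d - 1 + k, h - 1]) >= 0 && c + 1 <= n)
                            best = Math.Max(best, c + 1);   // deletion: (i,j) -> (i+1, j)
                        if (d < k && (c = L[d + 1 + k, h - 1]) >= 0 && c - d <= m)
                            best = Math.Max(best, c);       // insertion: (i,j) -> (i, j+1)
                    }
                    if (best < 0) continue;
                    // Slide down the diagonal over matching characters (free moves).
                    while (best < n && best - d >= 0 && best - d < m && a[best] == b[best - d])
                        best++;
                    L[d + k, h] = best;
                    if (d == n - m && best >= n) return h;  // reached cell (n, m)
                }
            }
            return -1;                                // distance exceeds k
        }
    }
    // e.g. DiagonalEditDistance.AtMostK("GUMBO", "GAMBOL", 4) returns 2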

2.3 Implementation

We have implemented the program in C++, including the first two speedups. The third speedup is complicated, and it is our guess that in practice it will not significantly affect the program. This is based on the observation that in practice we almost never compute the full n × 2k matrix, as we stop calculating each of the diagonals after k + 1 errors are found. Our program can process a string of several thousand characters in a fraction of a second. Currently, the running time of the program is proportional to the time it takes to print the output.

An inherent flexibility exists in the definition of approximate repeats, with the goal of allowing for mutations. This flexibility poses an issue when the goal is to detect all substrings of the input string that satisfy the definition. The issue is that in practice there is too much output. Each repeat that is reported satisfies the mathematical definition; however, many reported repeats differ only slightly one from another. Hence, our challenge was to modify our algorithm in a way that it reports a meaningful and significant subset of the found repeats. To this end, we filter out all repeats that prove to be redundant. The filtering is incorporated into the algorithm itself, which is much more efficient than doing a post-processing phase to filter the output.

We deal with the issue of filtering by augmenting the different phases of our algorithm. The first filtering technique filters repeats found within an individual iteration, while the second technique filters repeats found in different iterations.

Filtering a given iteration: 1. Choose the Optimal Period using the Local Maximum. The first point of comparison is anchored at the center position of the string, but the second point varies, and only by one character at a time. Suppose there exists an alignment for some p with h errors. Then, this alignment can be converted into an alignment for p + 1 with at most h + 1 errors. Thus, we do not want to report every possible k-edit repeat. We are only interested in the value p that produces the best matching of the surrounding string segments. This leads to a local maximum idea. As the first point is anchored, and the second point moves forward, the comparison produces better matches as the second point nears a corresponding position in a different part of the repeat. The comparison gives worse results as the second point moves away from that ‘optimal’ position, until it begins nearing the corresponding position of another repeat.

2. Combine several neighboring repeats. Once a local maximum, p, is found, we have potentially k different repeats with period p, each shifted over several characters. This is because we consider k1 errors to the left and k2 errors to the right, for all values of k1, k2 such that k1 + k2 = k. For purposes of reporting the repeats, it is much clearer to see the shifting repeats as a single repeat. Hence, in this case, we combine all repeats with period p into a single repeat. We note that the combined repeat has at most 2k errors.

Filtering between iterations: The idea for filtering between iterations is a simple one. If a repeat reaches either end of the string segment being examined by the current iteration, it will also be detected by a different iteration at a higher level in the recursion and thus need not be reported in the current iteration.

3 RESULTS

The table on the left in Figure 3 contains a summary of the repeats that our program detected in the first 14 000 bp of human chromosome 18 (May 2004). The exact filtering criteria are still being improved, and thus we manually filtered the repeats to clarify the results. The criterion for this run allows up to k = 40 errors and requires a minimum length of at least 100 characters more than the number of errors. This particular criterion was tested on pseudorandom strings created using the C++ function rand(), which produces statistically random output in a single run. The function rand() was initialized using the system clock time, ensuring a different set of statistically random strings for each execution. It was tested on 20 strings, each 20 000 characters in length, and returned a single repeat in 6 of them, and no repeats in the rest. The table on the right of Figure 3 shows the results of TRF (Benson, 1999) on the same sequence.

Fig. 3

The repeats found by our program are compared and contrasted with the repeats found by TRF (Benson, 1999).

In Figure 4 we show the alignments of the copies of the two new repeats found by our program, beginning at positions 795 and 10 690.

Fig. 4

Alignments of the copies of two new repeats found by our program.

4 DISCUSSION

Many of the repeats that are found by our program are also found by TRF (Benson, 1999), which uses a non-evolutive definition. This is due to the fact that the consensus type repeat often satisfies the evolutive definition as well. Sometimes, our program will detect more errors than TRF, because in the case of a deviation in a single period, we count each deviation twice: once from the period before to the period with the deviation and once again from the period with the deviation to the period after. However, more often, changes will occur and be repeated for several periods before they change again. It is in these situations that the evolutive definition is most applicable. A non-evolutive definition causes an error to be reported for each period in which the deviation occurs, whereas an evolutive definition counts it as only one error until it is changed again.

We conclude that our novel definition and efficient program provide a useful tool for analyzing whole genomes. It is our hope that with its widespread use new scientific discoveries and inferences will be achieved.

[출처 : http://www.codeproject.com/KB/recipes/Levenshtein.aspx]


Introduction

The Levenshtein distance measures the difference between two strings: the minimum number of insertions, deletions and substitutions needed to transform one into the other. I use it in a web crawler application to compare the new and old versions of a web page. If it has changed enough, I update it in my database.

Description

The original algorithm creates a matrix, where the size is StrLen1*StrLen2. If both strings are 1000 chars long, the resulting matrix is 1M elements; if the strings are 10,000 chars, the matrix will be 100M elements. If the elements are integers, it will be 4*100M == 400MB. Ouch!

This version of the algorithm uses only 2*StrLen elements, so the latter example would give 2*10,000*4 = 80 KB. The result is that not only does it use less memory, it is also faster, because the memory allocation takes less time. When both strings are about 1K in length, the new version is more than twice as fast.

Example

The original version would create a matrix[6+1,5+1]; my version creates two vectors[6+1]. In both versions, the order of the strings is irrelevant, that is, it could be matrix[5+1,6+1] and two vectors[5+1].

The new algorithm

Steps

1. Set n to be the length of s ("GUMBO"). Set m to be the length of t ("GAMBOL"). If n = 0, return m and exit. If m = 0, return n and exit. Construct two vectors, v0[m+1] and v1[m+1], containing 0..m elements.

2. Initialize v0 to 0..m.

3. Examine each character of s (i from 1 to n).

4. Examine each character of t (j from 1 to m).

5. If s[i] equals t[j], the cost is 0. If s[i] is not equal to t[j], the cost is 1.

6. Set cell v1[j] equal to the minimum of:
   a. The cell immediately above plus 1: v1[j-1] + 1.
   b. The cell immediately to the left plus 1: v0[j] + 1.
   c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost.

7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell v1[m].
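
A minimal C# sketch of the two-vector scheme described in the steps above (the article's own downloadable code is not reproduced here, so this is a reconstruction):

    using System;

    static class Levenshtein
    {
        // Two-vector Levenshtein distance: v0 holds the previous column of
        // the conceptual matrix, v1 the one being built (steps 1-7 above).
        public static int Distance(string s, string t)
        {
            int n = s.Length, m = t.Length;               // step 1
            if (n == 0) return m;
            if (m == 0) return n;

            var v0 = new int[m + 1];
            var v1 = new int[m + 1];
            for (int j = 0; j <= m; j++) v0[j] = j;       // step 2

            for (int i = 1; i <= n; i++)                  // step 3
            {
                v1[0] = i;                                // "set v1[0] to the column number"
                for (int j = 1; j <= m; j++)              // step 4
                {
                    int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;   // step 5
                    v1[j] = Math.Min(Math.Min(
                        v1[j - 1] + 1,        // step 6a: cell immediately above, plus 1
                        v0[j] + 1),           // step 6b: cell immediately to the left, plus 1
                        v0[j - 1] + cost);    // step 6c: diagonal cell, plus cost
                }
                var tmp = v0; v0 = v1; v1 = tmp;          // SWAP(v0, v1): swap references only
            }
            return v0[m];    // step 7 (the final swap left the result in v0)
        }
    }
    // Example: Levenshtein.Distance("GUMBO", "GAMBOL") returns 2.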

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL":

Steps 1 and 2

  v0 v1        
    G U M B O
  0 1 2 3 4 5
G 1          
A 2          
M 3          
B 4          
O 5          
L 6          

Steps 3 to 6, when i = 1

  v0 v1        
    G U M B O
  0 1 2 3 4 5
G 1 0        
A 2 1        
M 3 2        
B 4 3        
O 5 4        
L 6 5        

Steps 3 to 6, when i = 2

SWAP(v0,v1): if you look in the code, you will see that I don't swap the contents of the vectors; I just swap the references to them.

Set v1[0] to the column number, e.g. 2.

    v0 v1      
    G U M B O
  0 1 2 3 4 5
G 1 0 1      
A 2 1 1      
M 3 2 2      
B 4 3 3      
O 5 4 4      
L 6 5 5      

Steps 3 to 6, when i = 3

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 3.

      v0 v1    
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2    
A 2 1 1 2    
M 3 2 2 1    
B 4 3 3 2    
O 5 4 4 3    
L 6 5 5 4    

Steps 3 to 6, when i = 4

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 4.

        v0 v1  
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3  
A 2 1 1 2 3  
M 3 2 2 1 2  
B 4 3 3 2 1  
O 5 4 4 3 2  
L 6 5 5 4 3  

Steps 3 to 6, when i = 5

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 5.

          v0 v1
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3 4
A 2 1 1 2 3 4
M 3 2 2 1 2 3
B 4 3 3 2 1 2
O 5 4 4 3 2 1
L 6 5 5 4 3 2

Step 7

The distance is in the lower right hand corner of the matrix, v1[m] == 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = two changes).

Improvements

If you are sure that your strings will never be longer than 2^16 chars, you could use ushort instead of int; if the strings are shorter than 2^8 chars, you could use byte. I guess the algorithm would be even faster if we used unmanaged code, but I have not tried it.

[출처 : http://kerbtier.ch/2008/12/30/levenshtein-to-slow-how-to-speed-it-up]


you don’t need 40′000 levenshtein calls

One attempt to make it faster is to rule out some strings by comparing lengths first. If you say your maximum distance for two strings to match is 10, then you can discard strings whose character counts differ by more than 10. That is already much faster for the 40′000 strings, and for my usage almost fast enough, but the faster the better.

In this case I still have to check one string's length against 40′000 other strings' lengths. This can be sped up by putting the sets of strings with the same length into an array where the index is the length of the strings in the set, so I don't have to compare against all strings. If l is the length of the string I want to test, then I only need to test it against the strings in the sets at indexes l-10 to l+10.

I could probably rule out even more strings with this technique, for example by counting the number of words in the strings instead of the number of characters, or the number of vowels. These approaches could be combined, and that would probably speed things up another bit, but statistically the result would probably be about the same as when using the total length of the string.
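
A sketch of the bucketing idea in C# (names mine; Levenshtein.Distance stands for any implementation, e.g. the two-vector one shown earlier):

    using System;
    using System.Collections.Generic;

    class LengthBuckets
    {
        // Buckets of strings keyed by their length, so a query of length l
        // is only compared against buckets l - maxDist .. l + maxDist.
        private readonly Dictionary<int, List<string>> buckets =
            new Dictionary<int, List<string>>();

        public void Add(string s)
        {
            if (!buckets.TryGetValue(s.Length, out var list))
                buckets[s.Length] = list = new List<string>();
            list.Add(s);
        }

        public IEnumerable<string> Matches(string query, int maxDist)
        {
            for (int len = Math.Max(0, query.Length - maxDist);
                 len <= query.Length + maxDist; len++)
            {
                if (!buckets.TryGetValue(len, out var list)) continue;
                foreach (var candidate in list)
                    if (Levenshtein.Distance(query, candidate) <= maxDist)
                        yield return candidate;
            }
        }
    }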
[출처 : http://pro.cug.kr/tag/Levenshtein%20Distance]


URL: http://www.merriampark.com/ld.htm

Levenshtein Distance, implemented in three commonly used languages

Author: Michael Gilleland, Merriam Park Software

The purpose of this short essay is to explain the Levenshtein distance algorithm and to show how it can be implemented in each of three programming languages.

What is Levenshtein Distance?
Demo
The Algorithm
Source code in three languages
References
Implementations in other languages

What is Levenshtein Distance?

Levenshtein Distance (LD) is a measure of the similarity between two strings, which we will refer to here as the source string (s) and the target string (t). The distance is the number of deletions, insertions or substitutions required to transform s into t. For example,

* If s is "test" and t is "test", then LD(s,t) = 0, because the strings are already identical and no transformations are needed.

* If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (changing "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.

The Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. Because the name Levenshtein is hard to spell and pronounce, the measure is often called edit distance.

The Levenshtein distance algorithm has been used in:

* Spell checking
* Speech recognition
* DNA analysis
* Plagiarism detection

Demo

You can try the simple Java applet below to find the Levenshtein distance between two strings.



The Algorithm

Steps

1. Set n to the length of s. Set m to the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns.

2. Initialize the first row to 0..n. Initialize the first column to 0..m.

3. Examine each character of s (i from 1 to n).

4. Examine each character of t (j from 1 to m).

5. If s[i] equals t[j], the cost is 0. If s[i] is not equal to t[j], the cost is 1.

6. Set cell d[i,j] of the matrix equal to the minimum of:
   a. The cell immediately above plus 1: d[i-1,j] + 1.
   b. The cell immediately to the left plus 1: d[i,j-1] + 1.
   c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.

7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

Example

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Steps 1 and 2

Steps 3 to 6, when i = 1

Steps 3 to 6, when i = 2

Steps 3 to 6, when i = 3

Steps 3 to 6, when i = 4

Steps 3 to 6, when i = 5

Step 7

The distance is in the bottom right corner of the matrix (here, 2). This matches our intuitive understanding that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one addition = two changes).

Source code in three languages

Religious wars sometimes break out among engineers debating the differences between programming languages. A typical example is Allen Holub's claim in a JavaWorld article (July 1999): "Visual Basic, for example, isn't at all object-oriented. Neither Microsoft Foundation Classes (MFC) nor most of the other Microsoft technologies can claim to be object-oriented."

A rebuttal from the other camp came in Simson Garfinkel's article published in Salon (Jan. 8, 2001). That article goes by the title "Java: slow, ugly and irrelevant language"; put plainly, it amounts to "I hate Java".

We take a naturally cautious middle position in these religious wars. As a learning exercise, a problem that can be solved in one programming language can usually be solved in another just as well. A good programmer can move from one language to another with relative ease and without great difficulty, even if it means learning a completely new language. A programming language is a means to an end, not an end in itself.

From this middle-of-the-road position, we present below the source code for the Levenshtein distance algorithm implemented in the following programming languages:

* Java
* C++
* Visual Basic

Source code listings (blah blah)

References

Other discussions of the Levenshtein distance can be found at the following links:

* http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit.html (Lloyd Allison)
* http://www.cut-the-knot.com/do_you_know/Strings.html (Alex Bogomolny)
* http://www-igm.univ-mlv.fr/~lecroq/seqcomp/node2.html (Thierry Lecroq)

Implementations in other languages

The people listed below have kindly granted permission for their implementations of the Levenshtein distance algorithm in various languages to be used here:

* Eli Bendersky provided a Perl implementation.
* Barbara Boehmer provided an Oracle PL/SQL implementation.
* Rick Bourner provided an Objective-C implementation.
* Joseph Gama provided a TSQL implementation, as part of a package of TSQL functions at Planet Source Code.
* Anders Sewerin Johansen provided a C++ implementation that is closer to the spirit of C++, more optimized and more elegant than mine.
* Lasse Johansen provided a C# implementation.
* Alvaro Jeria Madariaga provided a Delphi implementation.
* Lorenzo Seidenari provided a C implementation.
* Steve Southwell provided a Progress 4GL implementation.

Other implementations outside this page:

* Art Taylor's Emacs Lisp implementation.
* Magnus Lie Hetland's Python implementation.
* Richard Suchenwirth's Tcl implementation (thanks to Stefan Seidler for reporting the broken link).
[출처 : http://markos.gaivo.net/blog/?p=211]


Simon proposes using a dictionary to match mistyped URLs with real ones. I'd probably like the idea better if I actually understood it. Still, using Levenshtein can be a bit easier than using a spell checker for this purpose.

Easier, but slow. I used an existing implementation by Magnus Lie Hetland and ran a test with 245 blog titles using a simple script. 100 iterations on an aging PowerBook produced:

1.766s, 29.152s, 9.399s (min, max, avg)

The average time to calculate the distance between a randomly chosen title and the rest of them is 9.4 seconds, which is far too much to be useful. And there are not even 250 of them.

There are two obvious ways to speed things up. Since the distance is at least the difference in length of the two strings, there is no point in calculating it when the length difference already exceeds the limit we chose.

The other trick takes into account that Levenshtein's algorithm on two strings of comparable length has complexity O(n^2), and that my blog titles form quite a sparse space. If strings are longer than a certain limit (I arbitrarily chose 10 letters), first calculate the edit distance on a small, sparse substring, and then calculate the real distance only if the first one was close enough.
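
A sketch of the two checks in C# (the thresholds follow the post; the 10-letter prefilter is a heuristic that can miss matches, and the exact "sparse substring" sampling is not spelled out in the post, so a plain prefix is used here; Levenshtein.Distance stands for any implementation):

    using System;

    static class QuickFilter
    {
        // Returns true if a and b are within `limit` edits, applying the two
        // cheap rejections described above before the full computation.
        public static bool CloseEnough(string a, string b, int limit)
        {
            // Check 1: the distance is at least the difference in lengths.
            if (Math.Abs(a.Length - b.Length) > limit)
                return false;

            // Check 2 (heuristic): for long strings, compare a short substring
            // first and skip the full calculation if it is already too far off.
            if (a.Length > 10 && b.Length > 10 &&
                Levenshtein.Distance(a.Substring(0, 10), b.Substring(0, 10)) > limit)
                return false;

            return Levenshtein.Distance(a, b) <= limit;
        }
    }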

When I ran 1000 iterations with randomly chosen titles using only the first check, I got the following results:

0.003s, 0.284s, 0.167s (min, max, avg)

However, if I used both checks, the same 1000-iteration test got me:

0.003s, 0.162s, 0.027s (min, max, avg)

So, two simple checks can speed up the search by up to 350 times. Not bad, but I'm not happy. This is certainly fast enough for a personal website, or for sites where the site structure effectively divides searching into relatively small sets of possible hits, but there are plenty of sites where it would still be too slow.
