[Source: http://markos.gaivo.net/blog/?p=211]


Simon proposes using a dictionary to match mistyped URLs with real ones. I’d probably like the idea better if I actually understood it. Still, using Levenshtein distance can be a bit easier than wiring up a spell checker for this.

Easier, but slow. I used an existing implementation by Magnus Lie Hetland and ran a test with 245 blog titles using a simple script. 100 iterations on an aging PowerBook produced:

1.766s, 29.152s, 9.399s (min, max, avg)

The average time to calculate the distance between a randomly chosen title and all the rest is 9.4 seconds, which is far too slow to be useful. And there aren’t even 250 of them.
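For context, the baseline test is essentially this: pick a random title, compute its Levenshtein distance to every other title, and time the pass. Here is a minimal sketch, assuming Hetland’s plain dynamic-programming levenshtein() and a titles list (both names are placeholders, not the actual script):

```python
import random
import time

def levenshtein(a, b):
    # Plain dynamic-programming edit distance, in the spirit of Hetland's version.
    n, m = len(a), len(b)
    if n > m:
        a, b, n, m = b, a, m, n   # keep the shorter string in a to save space
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1] + (a[j - 1] != b[i - 1])
            current[j] = min(add, delete, change)
    return current[n]

def time_one_pass(titles):
    # Compare one randomly chosen title against all the others and time it.
    title = random.choice(titles)
    start = time.time()
    for other in titles:
        if other is not title:
            levenshtein(title, other)
    return time.time() - start
```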

There are two obvious ways to speed things up. Since the distance is at least as large as the difference in length between the two strings, there’s no point in calculating it once that difference already exceeds the limit we chose.
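As a rough sketch, that first check is just a guard in front of the expensive call (the limit parameter and function name here are illustrative, not the post’s actual code):

```python
def distance_if_plausible(a, b, limit):
    # The edit distance can never be smaller than the difference in lengths,
    # so if that difference already exceeds the limit, skip the O(n*m) work.
    if abs(len(a) - len(b)) > limit:
        return None
    return levenshtein(a, b)
```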

The other trick takes into account that Levenshtein’s algorithm on two strings of comparable length has O(n²) complexity, and that my blog titles form quite a sparse space. If the strings are longer than a certain limit (I arbitrarily chose 10 letters), first calculate the edit distance on a short substring of each, and only calculate the real distance if that first one was close enough.
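One way to read that second trick is to compare short prefixes of the two strings first, which is cheap, and only run the full calculation when that rough distance stays within the limit. A sketch under that assumption (the 10-letter threshold is from the text; the prefix length of 5 and the names are mine):

```python
def guarded_distance(a, b, limit, long_string=10, probe=5):
    # First trick: the length difference is a hard lower bound on the distance.
    if abs(len(a) - len(b)) > limit:
        return None
    # Second trick: for long strings, a cheap edit distance on short prefixes
    # acts as a heuristic filter before the full O(n*m) calculation.
    if len(a) > long_string and len(b) > long_string:
        if levenshtein(a[:probe], b[:probe]) > limit:
            return None
    return levenshtein(a, b)
```

Note that the prefix comparison is only a heuristic, so it can in principle reject a pair whose full distance is still within the limit; the sparseness of the titles is what makes that trade-off acceptable.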

When I ran 1000 iterations on randomly chosen titles using only the first check, I got the following results:

0.003s, 0.284s, 0.167s (min, max, avg)

However, if I used both checks, the same 1000-iteration test got me:

0.003s, 0.162s, 0.027s (min, max, avg)

So, two simple checks can speed up the search by up to 350 times. Not bad, but I’m not happy. This is certainly fast enough for a personal website, or for sites whose structure effectively divides the search into relatively small sets of possible hits, but there are plenty of sites where it would still be too slow.
