Fuzzy string search
A fuzzy string search is a form of approximate string matching that compare two strings to show how similar or different they are. It is used for finding typos, matching DNA sequences, etc.
In today’s post we will look at Damerau-Levenshtein Edit Distance algorithm, and its variation, and afterwards will benchmark them and discuss preferred variation for various scenarios.
The Damerau-Levenshtein distance between two strings defined as:
And here is a straightforward implementation:
Let’s test it performance on small subset of our vocabulary, and test a hundred string pairs on MacBook Pro 2019: 10 sec.. 30 sec… 1min… 3min ..?? ..6 min! 3.6 seconds for one comparison is way too much. We have to improve it!
So, we have a recursive method that we want to speed up. Hmm, looks like perfect case for memoization! So, let’s try it:
Testing performance: 7.5ms. Using memoization we have improved the performance by ~45000 times! So, for one pair, it will be ~0.075ms. Great result, but can we do better?
For typo discovering, we do not need to count the real distance between two strings, we just want to know if this distance is small enough. Let’s say, that we want at most 2 errors between two words to count as typo. Our new method should return 0, 1, 2 or Infinity if there are more than 2 errors between two words.
Let’s run it with limit from 1 to 10, to see the influence of the limit on performance:
We can see, that for distance ≤ 2, using the limited version, we can further improve the times from 7.5ms to ~0.18ms, 41 times faster. This is a significant improvement. Yet, what if we memoize the limited version? Let’s try it out.
Now for the interesting part of comparing performance:
Oh, this is kind of disappointing. The memoized version of limited algorithm is actually performing worse for small distances, which we are mainly interested in. Probably, the overhead of managing and accessing the cache is not worth it in this case, since we do not have a lot of reappearances of same conditions, since now we need to take limit into account as well, as if the result is bigger than the limit, we should return infinity instead of what was calculated for larger limit.
All in all, after tweaking the algorithm for our specific case, we were able to improve the initial performance from 6 minutes to mere 0.18 milliseconds for 100 comparisons.