I recently came across a situation where I needed fuzzy string matching functionality for a command line application I built. After googling a way to do this in Ruby, I found a Stack Overflow post telling me to install a ruby gem called levenshtein-ffi and be on my way. However, my curiosity got the best of me, and I went down an internet rabbit hole of trying to understand what is really going on behind the scenes. Ultimately, I landed on the Levenshtein Distance Wikipedia page, where I saw this:
What?
I have minimal experience with matrices and have never taken Linear Algebra, so initially, I was bewildered. Eventually however, I was able to piece together an elementary understanding of what is happening, which I?ll attempt to explain below. This explanation is meant for beginners ? anyone who is confused by the equation above or has never taken higher level mathematics courses. Warning: if you don?t fall into the above camp, you?ll probably find the explanation below needlessly tedious.
Introduction
The Levenshtein distance is a number that tells you how different two strings are. The higher the number, the more different the two strings are.
For example, the Levenshtein distance between ?kitten? and ?sitting? is 3 since, at a minimum, 3 edits are required to change one into the other.
- kitten ? sitten (substitution of ?s? for ?k?)
- sitten ? sittin (substitution of ?i? for ?e?)
- sittin ? sitting (insertion of ?g? at the end).
An ?edit? is defined by either an insertion of a character, a deletion of a character, or a replacement of a character.
Functions
A quick refresher if you haven?t looked at functions recently? The first thing to understand is that functions are a set of instructions that take a given input, follow a set of instructions, and yield an output. You probably saw lots of basic functions in your high school math courses.
Piecewise Functions
Piecewise functions are more complex functions. In a piecewise function, there are multiple sets of instructions. You choose one set over another based on a certain condition. Consider the example below:
In the above example, we use different sets of instructions based on what the input is. Piecewise function are denoted by the brace { symbol.
With that in mind, the Levenshtein Distance equation should look a little more readable.
OriginalIn other words?
What do a, b, i, and j stand for?
a = string #1
b = string #2
i = the terminal character position of string #1
j = the terminal character position of string #2.
The positions are 1-indexed. Consider the below example where we compare string?cat? with string ?cap?:
The conditional (a? ?b?)
a? refers to the character of string a at position i
b? refers to the character of string b at position j
We want to check that these are not equal, because if they are equal, no edit is needed, so we should not add 1. Conversely, if they are not equal, we want to add 1 to account for a necessary edit.
Solving Using a Matrix
The Levenshtein distance for strings A and B can be calculated by using a matrix. It is initialized in the following way:
From here, our goal is to fill out the entire matrix starting from the upper-left corner. Afterwards, the bottom-right corner will yield the Levenshtein distance.
Let?s fill out the matrix by following the piecewise function.
Now we can fill out the rest of the matrix using the same piecewise function for all the spots in the matrixes.
One more example:
If you feel comfortable with the equation at this point, try to fill out the rest of the matrix. The result is posted below:
Since the lower-right corner is 3, we know the Levenshtein distance of ?kitten? and ?sitting? is 3. This is what we expected since we already knew the Levenshtein distance was 3 (as explained in the introduction). This is also reflected in the matrix shown on the Levenshtein Distance Wikipedia page.
Conclusion
Oh, I see!
Hopefully, the above looks less intimidating to you now. Even better, I hope you understand what it means!
Vladimir Levenshtein, inventor of the Levenshtein algorithm
Now that you understand to a certain degree what is happening under the hood, you can really appreciate all that?s happening when you download a gem like levenshtein-ffi and run a method like Levenshtein.distance(?kitten?, ?sitting?) and get a return value of 3. All the hard work has been abstracted away so that you get a simple numerical representation of how different two strings are. We should all thank Vladimir Levenshtein, who came up with his algorithm in 1965. The algorithm hasn?t been improved in over 50 years and for good reason. According to MIT, it may very well be that Levenshtein?s algorithm is the best that we?ll ever get in terms of efficiency.
Works Used / Cited