Understanding the Levenshtein Distance Equation for Beginners

Understanding the Levenshtein Distance Equation for Beginners

I recently came across a situation where I needed fuzzy string matching functionality for a command line application I built. After googling a way to do this in Ruby, I found a Stack Overflow post telling me to install a ruby gem called levenshtein-ffi and be on my way. However, my curiosity got the best of me, and I went down an internet rabbit hole of trying to understand what is really going on behind the scenes. Ultimately, I landed on the Levenshtein Distance Wikipedia page, where I saw this:

Image for postWhat?

I have minimal experience with matrices and have never taken Linear Algebra, so initially, I was bewildered. Eventually however, I was able to piece together an elementary understanding of what is happening, which I?ll attempt to explain below. This explanation is meant for beginners ? anyone who is confused by the equation above or has never taken higher level mathematics courses. Warning: if you don?t fall into the above camp, you?ll probably find the explanation below needlessly tedious.

Introduction

The Levenshtein distance is a number that tells you how different two strings are. The higher the number, the more different the two strings are.

For example, the Levenshtein distance between ?kitten? and ?sitting? is 3 since, at a minimum, 3 edits are required to change one into the other.

  1. kitten ? sitten (substitution of ?s? for ?k?)
  2. sitten ? sittin (substitution of ?i? for ?e?)
  3. sittin ? sitting (insertion of ?g? at the end).

An ?edit? is defined by either an insertion of a character, a deletion of a character, or a replacement of a character.

Functions

A quick refresher if you haven?t looked at functions recently? The first thing to understand is that functions are a set of instructions that take a given input, follow a set of instructions, and yield an output. You probably saw lots of basic functions in your high school math courses.

Image for post

Piecewise Functions

Piecewise functions are more complex functions. In a piecewise function, there are multiple sets of instructions. You choose one set over another based on a certain condition. Consider the example below:

Image for post

In the above example, we use different sets of instructions based on what the input is. Piecewise function are denoted by the brace { symbol.

With that in mind, the Levenshtein Distance equation should look a little more readable.

Image for postOriginalImage for postIn other words?

What do a, b, i, and j stand for?

a = string #1

b = string #2

i = the terminal character position of string #1

j = the terminal character position of string #2.

The positions are 1-indexed. Consider the below example where we compare string?cat? with string ?cap?:

Image for post

The conditional (a? ?b?)

a? refers to the character of string a at position i

b? refers to the character of string b at position j

We want to check that these are not equal, because if they are equal, no edit is needed, so we should not add 1. Conversely, if they are not equal, we want to add 1 to account for a necessary edit.

Image for post

Solving Using a Matrix

The Levenshtein distance for strings A and B can be calculated by using a matrix. It is initialized in the following way:

Image for post

From here, our goal is to fill out the entire matrix starting from the upper-left corner. Afterwards, the bottom-right corner will yield the Levenshtein distance.

Let?s fill out the matrix by following the piecewise function.

Image for postImage for post

Now we can fill out the rest of the matrix using the same piecewise function for all the spots in the matrixes.

One more example:

Image for postImage for post

If you feel comfortable with the equation at this point, try to fill out the rest of the matrix. The result is posted below:

Image for post

Since the lower-right corner is 3, we know the Levenshtein distance of ?kitten? and ?sitting? is 3. This is what we expected since we already knew the Levenshtein distance was 3 (as explained in the introduction). This is also reflected in the matrix shown on the Levenshtein Distance Wikipedia page.

Image for post

Conclusion

Image for postOh, I see!

Hopefully, the above looks less intimidating to you now. Even better, I hope you understand what it means!

Image for postVladimir Levenshtein, inventor of the Levenshtein algorithm

Now that you understand to a certain degree what is happening under the hood, you can really appreciate all that?s happening when you download a gem like levenshtein-ffi and run a method like Levenshtein.distance(?kitten?, ?sitting?) and get a return value of 3. All the hard work has been abstracted away so that you get a simple numerical representation of how different two strings are. We should all thank Vladimir Levenshtein, who came up with his algorithm in 1965. The algorithm hasn?t been improved in over 50 years and for good reason. According to MIT, it may very well be that Levenshtein?s algorithm is the best that we?ll ever get in terms of efficiency.

Works Used / Cited

The Levenshtein Algorithm

The Levenshtein distance is a string metric for measuring difference between two sequences. Informally, the Levenshtein?

www.cuelogic.com

Levenshtein distance – Wikipedia

In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the?

en.wikipedia.org

Easy to understand Dynamic Programming – Edit distance

IMPORTANT: This blog has been moved to jlordiales.me. The content here might be outdated. To see this post on the new?

jlordiales.wordpress.com

12

No Responses

Write a response