How I Wrote A Lexer

Creating a Programming Language

My experience designing and writing a Lexical Analyzer

[Image by Willi Heidelbach from Pixabay]

In previous articles, I talked about my language (Ethereal), my journey with it, and how I architected it. In this article, I want to share how I developed the first component of the language: the lexical analyzer.

If you have any interest in developing a programming language, you've probably looked at lexers and how to develop them. But even if you haven't, there are more than enough examples and tutorials out there. So, I am going to focus on my experience writing a lexer. I'll explain the pitfalls I faced, the design decisions I made, and the techniques I used.

But first of all, let's understand what lexers are and what they do.

Understanding lexers

A lexical analyzer, more commonly referred to as a lexer, is a software component that takes a string and breaks it down into smaller units that are understandable by a language. These smaller units are called lexical tokens or lexemes. In other words, you can think of a lexer as a black box that takes a sentence as input and breaks it into smaller units: essentially, words.

A lexer, however, does more than that. Often it will also check whether the broken-down units actually mean something in the programming language. It's kind of like the spell checker we use in our word processing software. Do note that the lexer will not check the grammatical syntax of the language. (In other words, it doesn't know if you're using the right tokens in the wrong combination.)

Let?s look at an example of a lexer in action.

The input is this tiny piece of code:

[Image: Input to the lexer]

When you feed this code into the lexer I made, you get the following output:

[Image: Lexical output (tokens)]

In this output, the IDs are simply the indices of the tokens. The type identifies whether the token is an identifier (IDEN), a number (INT), an operator (+, +=), a separator (;, ()), and so on. Next comes the line number, with the column number in brackets, and finally the actual data for dynamic types (identifiers, numbers).
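To make that format concrete, here is a hand-written illustration (not my lexer's literal output, so the exact columns may differ). Tokenizing the statement x += 10; would produce something along these lines:

    ID: 0   Type: IDEN   Line: 1[1]   Data: x
    ID: 1   Type: +=     Line: 1[3]
    ID: 2   Type: INT    Line: 1[6]   Data: 10
    ID: 3   Type: ;      Line: 1[8]

Only the dynamic types carry data; the operator and the separator are fully described by their type alone.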

In terms of a programming language, the lexer's job is to break the source code down into smaller pieces (lexemes) so that they may be easily used in the next stage by the compiler/interpreter.

My first lexer implementation: the bad one

When I first started developing the lexical analyzer for my language, being a CS undergrad, my thoughts immediately raced towards Finite State Machines. And since I had researched building lexers, regular expressions also came to mind. Frankly, though, I dislike writing regular expressions (just like any sane person!). They are complex to write and understand, and, from a computer's perspective, they require rather hefty computation.

So, I started writing a lexer using the concepts of Finite State Machines and ended up with something like this:

[Image: FSM states of the lexer]

Using this model, I iterated through each character of the source code. I used current and previous variables (type and prevtype) to keep track of the type of data being tokenized. Here's the code that did the job:

[Image: FSM loop that finds a string's type based on the current and previous character type]

This loop identifies the type of each data string by moving between states, with each new state based on the previous one. For example, if the loop encounters multiple digits followed by a dot (.), prevtype would be NUM, and type would become FLOAT. (This makes sense, because a dot after one or more digits signifies a floating point value.)
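To give a feel for what that looked like, here is a boiled-down sketch of the idea in C++ (illustrative code written for this article, not Ethereal's actual source; the State names and the lex_fsm function are made up):

    #include <cctype>
    #include <string>
    #include <vector>

    enum class State { NONE, IDEN, NUM, FLOAT, OP };

    // Pick the next state from the current character and the previous state.
    static State next_state(char c, State prev) {
        if (std::isdigit((unsigned char)c))
            return prev == State::FLOAT ? State::FLOAT : State::NUM;
        if (c == '.' && prev == State::NUM)
            return State::FLOAT; // digits followed by '.' => floating point
        if (std::isalpha((unsigned char)c) || c == '_')
            return State::IDEN;
        return State::OP;
    }

    std::vector<std::string> lex_fsm(const std::string &src) {
        std::vector<std::string> toks;
        std::string cur;
        State prevtype = State::NONE;
        for (char c : src) {
            if (std::isspace((unsigned char)c)) {
                if (!cur.empty()) { toks.push_back(cur); cur.clear(); }
                prevtype = State::NONE;
                continue;
            }
            State type = next_state(c, prevtype);
            // A state change ends the current token, except NUM -> FLOAT,
            // which continues the same numeric token.
            if (prevtype != State::NONE && type != prevtype &&
                !(prevtype == State::NUM && type == State::FLOAT)) {
                toks.push_back(cur);
                cur.clear();
            }
            cur += c;
            prevtype = type;
        }
        if (!cur.empty()) toks.push_back(cur);
        return toks;
    }

Even this toy version shows the cracks: an identifier like a1 gets split at the digit, and fixing each such case means bolting yet another special rule onto the state transitions.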

To be honest, when I first made this, I was proud of myself. I felt accomplished.

Well… for a little while, at least.

Soon enough, I realized this was a very trivial way to go about things. My code would not be able to handle complex tokens easily (not without a lot of fiddling, at least). As a matter of fact, it couldn't even parse the usual identifier sequences (underscore, a-z, A-Z, 0-9), because when an underscore was the first character, it didn't know whether the string should be categorized as an operator (invalid or otherwise) or as the beginning of an identifier. My pride came crashing down.

Overall, this implementation was a total bust. It could parse simple tokens without a problem, but making it parse even a slightly more complex sequence of characters would have been a nightmare.

But, as is often the case, you learn more from making big mistakes than from getting your code right the first time.

The moment of enlightenment

One fine morning, after some thinking and reading through a couple of other language implementations, it hit me!

I had been stuck on Finite State Machines and regular expressions, but at the end of the day, a lexer just goes through the characters sequentially and cuts them into groups every time it finds a break point (an invalid character, a space, an operator, etc.). With that thought, I started implementing the lexer using simple loops and switch cases to identify the token types.

[Image: A small part of the implementation of my lexer]

For example, if the first character in the loop is an operator, the get_operator function is called. Then, the code tests several cases (shown above) to see if the operator makes sense. If not, it errors out and returns FAILED.
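Here is roughly what that shape looks like (a simplified sketch with only a couple of operator cases, not Ethereal's actual source; the Result enum is made up for illustration):

    #include <string>

    enum Result { SUCCESS, FAILED };

    // Consume one operator starting at src[i]; on success, advance i past it.
    Result get_operator(const std::string &src, size_t &i, std::string &op) {
        char next = i + 1 < src.size() ? src[i + 1] : '\0';
        switch (src[i]) {
        case '+':
            // '+' may extend to '+=' or '++'
            if (next == '=' || next == '+') { op = src.substr(i, 2); i += 2; }
            else                            { op = "+";              i += 1; }
            return SUCCESS;
        case '=':
            // '=' may extend to '=='
            if (next == '=') { op = "=="; i += 2; }
            else             { op = "=";  i += 1; }
            return SUCCESS;
        case ';': case '(': case ')': // separators are always one character
            op = std::string(1, src[i]);
            i += 1;
            return SUCCESS;
        default:
            return FAILED; // no case matched: not a valid operator
        }
    }

The main loop simply switches on the class of the current character and calls the matching routine; the moment one of them returns FAILED, lexing stops with an error.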

This implementation technique improved the code's readability and extensibility quite a lot.

The whole source is looped through and tokenized this way. Even the code for tokenizing an identifier is very small and easy to understand:

[Image: Tokenizing an identifier]

Once the identifier string is returned, it's compared with the set of language keywords. This allows the lexer to decide if the identifier is a keyword. If it turns out to be a keyword, the lexer sets the type accordingly and the loop moves on.
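In sketch form, the two steps look something like this (again illustrative; the keyword list here is made up, and Ethereal's real one is different):

    #include <cctype>
    #include <set>
    #include <string>

    // Consume an identifier: an underscore or letter first, then any
    // mix of letters, digits, and underscores.
    std::string get_identifier(const std::string &src, size_t &i) {
        std::string iden;
        if (i < src.size() &&
            (std::isalpha((unsigned char)src[i]) || src[i] == '_')) {
            while (i < src.size() &&
                   (std::isalnum((unsigned char)src[i]) || src[i] == '_'))
                iden += src[i++];
        }
        return iden;
    }

    // Compare the returned string against the language's keyword set.
    bool is_keyword(const std::string &iden) {
        static const std::set<std::string> keywords = { "if", "else", "fn", "return" };
        return keywords.count(iden) > 0;
    }

If is_keyword returns true, the token's type is set to the corresponding keyword type instead of IDEN.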

And that is basically how I implemented the lexer for my language. It probably has some issues and edge cases where it will fail, but I have yet to find them.

To be honest, depending on language syntax, the lexer is usually the easiest component to code. The lexer I wrote using the technique above is about 600 lines in the source file. The entire source of my lexer is available here.

Conclusion

At the end of the day, there are a plethora of ways one can write a lexer. Some techniques are easier to implement, while others provide greater extensibility. Of course, the approach you use also depends on the syntax of your language.

For those of you interested in writing a language, or even just a lexer, I hope this article is helpful. Feel free to check out my lexer and learn from it, or just tell me what I did wrong!

Anyway, thank you for reading! Feedback is always appreciated. I'll be writing about building the parser next, which is, in my experience, much more sophisticated than the lexer (and equally exciting to write).

Until next time!
