I have always been fascinated by everything about compilers, interpreters, parsers, and lexers. Recently I wondered how I could implement such things in Python, so I decided to write a simple lexer for algebraic expressions that converts a sequence of characters into a sequence of tokens. Until now I had only implemented such things in C or C++, so I started the simplest way, as I had learned from reading the Dragon Book.
First of all, I created a simple enum type that lets me define enums as I am used to from languages like C++ and Java. Then I implemented a Token class that holds all the information about a token and lets me print it easily via Python's magic `__str__` method. Finally, I added a Lexer class that processes the input character by character and returns the tokens.
This resulted in the following code:
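The original listing is not shown here, but a minimal sketch of that hand-written approach might look like the following. I use the standard `enum` module in place of a hand-rolled enum type, and the token names and the `Lexer` interface are my own assumptions, not the original code:

```python
from enum import Enum


class TokenType(Enum):
    NUMBER = 0
    IDENT = 1
    PLUS = 2
    MINUS = 3
    MUL = 4
    DIV = 5
    LPAREN = 6
    RPAREN = 7


class Token:
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

    def __str__(self):
        # Magic method so that print(token) produces readable output.
        return f"Token({self.type.name}, {self.value!r})"


class Lexer:
    # Single-character tokens mapped directly to their type.
    OPERATORS = {'+': TokenType.PLUS, '-': TokenType.MINUS,
                 '*': TokenType.MUL, '/': TokenType.DIV,
                 '(': TokenType.LPAREN, ')': TokenType.RPAREN}

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def tokens(self):
        # Walk the input character by character, yielding tokens.
        while self.pos < len(self.text):
            ch = self.text[self.pos]
            if ch.isspace():
                self.pos += 1
            elif ch.isdigit():
                yield self._number()
            elif ch.isalpha():
                yield self._ident()
            elif ch in self.OPERATORS:
                self.pos += 1
                yield Token(self.OPERATORS[ch], ch)
            else:
                raise ValueError(f"Unexpected character {ch!r} at {self.pos}")

    def _number(self):
        # Consume a run of digits and return an integer token.
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos].isdigit():
            self.pos += 1
        return Token(TokenType.NUMBER, int(self.text[start:self.pos]))

    def _ident(self):
        # Consume a letter followed by letters/digits.
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos].isalnum():
            self.pos += 1
        return Token(TokenType.IDENT, self.text[start:self.pos])
```

A quick run over `"3 + x*(42-7)"` yields the expected NUMBER, PLUS, IDENT, MUL, … sequence.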
But this code is quite long, and it took some time until it worked. Fortunately, I stumbled over the documentation of Python's `re` module, where I found a simple tokenizer based on regular expressions. So I wrote a new script that uses this technique to tokenize algebraic expressions. This resulted in a much smaller script:
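A sketch of such a regex-based tokenizer, modeled on the tokenizer example in the `re` module documentation; the token names and patterns are my own choices, not necessarily those of the original script:

```python
import re

# Each token class gets a named group; order matters, since the regex
# engine tries the alternatives from left to right.
TOKEN_SPEC = [
    ('NUMBER', r'\d+(?:\.\d+)?'),   # integer or decimal number
    ('IDENT',  r'[A-Za-z_]\w*'),    # variable names
    ('OP',     r'[+\-*/]'),         # arithmetic operators
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('SKIP',   r'\s+'),             # whitespace, discarded
    ('ERROR',  r'.'),               # anything else is an error
]
TOKEN_RE = re.compile('|'.join(
    f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC))


def tokenize(text):
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup   # name of the group that matched
        value = match.group()
        if kind == 'SKIP':
            continue
        if kind == 'ERROR':
            raise ValueError(f'Unexpected character {value!r}')
        yield kind, value
```

The whole state machine of the hand-written lexer collapses into one alternation of named groups plus a loop over `finditer`.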
But as life goes, once you have a running solution, a friend tells you that there is a far simpler and more elegant way to implement it. You guessed it: René told me about the pyparsing library. I couldn't resist, so I implemented the same process using yet another approach, all just for fun. This script was again a bit shorter than the regular-expression-based solution and brings another benefit: not only can you implement the lexical analysis in just a few lines, you could also do the parsing with this library by adding just a few more.
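A sketch of what such a pyparsing variant might look like; the grammar below is my own assumption, not the original script:

```python
from pyparsing import OneOrMore, Word, alphanums, alphas, nums, oneOf

# Building blocks: numbers, identifiers, and single-character operators.
number = Word(nums)
identifier = Word(alphas, alphanums + '_')
operator = oneOf('+ - * / ( )')

# A token stream is just one or more of the above; pyparsing skips
# whitespace between tokens automatically.
token = number | identifier | operator
expression = OneOrMore(token)

tokens = list(expression.parseString('3 + x * (42 - 7)'))
```

Because the same combinators also describe grammars, turning this tokenizer into a full expression parser is mostly a matter of replacing `OneOrMore(token)` with a real grammar, for example built with pyparsing's `infixNotation` helper.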
For me it was fascinating to see how many possibilities Python offers for lexical analysis and how powerful regular expressions are. What started as a day of fun became a day of many insights. Maybe I will delve deeper into the topic when I have a bit more time.
I hope you enjoyed this short overview of the different ways Python offers to tokenize a sequence of characters. I am looking forward to seeing your implementations and reading your recommendations. Until next time, happy coding.
Phidelux is a Computer Science MSc. interested in hardware hacking, embedded Linux, compilers, etc.