Generators are on-demand iterators that produce values only when asked for them; in short, they give you lazy evaluation. One great example of this shiny feature is lexers. A lexer is a program, a function, that takes raw source code as input and outputs pieces of it, identified according to some pre-defined categories.
Input: 'print(2+2)'
-------------------------------------------
1,0-1,5: NAME 'print'
1,5-1,6: OP '('
1,6-1,7: NUMBER '2'
1,7-1,8: OP '+'
1,8-1,9: NUMBER '2'
1,9-1,10: OP ')'
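Incidentally, this is the kind of listing you get from Python's own tokenize module (for instance via python -m tokenize), which is itself built around generators. If you want to poke at a real lazy tokenizer, something along these lines should work, give or take a few bookkeeping tokens such as NEWLINE and ENDMARKER:

import io
import tokenize

# generate_tokens is a generator: it produces one token at a time,
# pulling source lines through the readline callable only as needed.
for token in tokenize.generate_tokens(io.StringIO('print(2+2)').readline):
    print(token)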
The example above shows the tokens of a source string with no whitespace between any of the character groups. The lexer has certain rules that describe what a token looks like; a rule might be a regex, or a handwritten routine that scans characters. The lexer above starts with the 'p' character, sees that it is a valid start for an identifier, and keeps consuming the following characters until it reaches something that is not alphanumeric (the '('). After that, it switches to the 'OP' rule, which knows certain operator forms like '(' or '>='. When it gets a match there, it switches to a different rule again, and so on...
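To make the rule idea a bit more concrete, here is a hypothetical character-level scanner (the rule names, regexes and the scan function are made up for illustration, not taken from any real lexer) that behaves the way described above, and it is, of course, a generator:

import re

# Hypothetical rules: NAME is an identifier, NUMBER is a run of digits,
# OP is one of a few fixed operator forms.
RULES = [
    ('NAME', re.compile(r'[A-Za-z_][A-Za-z0-9_]*')),
    ('NUMBER', re.compile(r'[0-9]+')),
    ('OP', re.compile(r'>=|[()+\-]')),
]

def scan(source):
    pos = 0
    while pos < len(source):
        for kind, rule in RULES:
            match = rule.match(source, pos)
            if match:
                # A rule matched: emit the token and continue right after it.
                yield kind, match.group()
                pos = match.end()
                break
        else:
            raise ValueError(f'Unknown character: {source[pos]!r}')

>>> list(scan('print(2+2)'))
[('NAME', 'print'), ('OP', '('), ('NUMBER', '2'), ('OP', '+'), ('NUMBER', '2'), ('OP', ')')]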
Certain parsing methodologies require seeing all tokens before parsing starts, but that is not always the case. What if you have a very simple parser which doesn't need to see upcoming tokens unless it can't parse the current state? That is where a lazy tokenizer comes in.
Let's imagine a simple tokenizer which only recognizes digits and the '+'/'-' operators, plus a syntactical verifier that ensures there is always an operator between two digits, and so on. When we input '2 + 2', it returns True, and when we input '2 2 +' or '2 + + 2' it raises an error. And if we give it '$ 2 + 2', verification won't even start; an error is going to be raised from the tokenizer side. But what if we input something that has multiple defects, something like '2 + + 2 + 3 $ 4'? Where should the error be raised? As I said before, it depends, but for a simple parser you want to show your users the errors gradually so they can solve them one by one. So we should first give them the parsing error, and only after they have handled it should we raise the tokenization error, so they fix that one too, incrementally. And there is no point in keeping '3' in memory when there is a problem that happens before it.
Let's draft an example in Python. It is going to be very simple.
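First, a bit of scaffolding the post doesn't spell out: the lexer below yields members of a Tokens enum and raises a LexerError. Minimal definitions consistent with the REPL output further down would look like this:

import enum

class Tokens(enum.Enum):
    DIGIT = 1
    OPERATOR = 2

class LexerError(Exception):
    pass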
def lexer(source):
    for word in source.split():
        print(f'Lexing new word: {word}')
        if word.isdigit():
            yield Tokens.DIGIT
        elif word in {"+", "-"}:
            yield Tokens.OPERATOR
        else:
            raise LexerError(f"Unknown token: '{word}'")
We go through the words (split on whitespace) and, if we encounter a digit or an operator, we yield its type. If not, we raise a LexerError.
>>> producer = lexer('2 + + 2 + 3 $ 4')
>>> next(producer)
Lexing new word: 2
<Tokens.DIGIT: 1>
>>> next(producer)
Lexing new word: +
<Tokens.OPERATOR: 2>
>>> next(producer)
Lexing new word: +
<Tokens.OPERATOR: 2>
>>> next(producer)
Lexing new word: 2
<Tokens.DIGIT: 1>
>>> next(producer)
Lexing new word: +
<Tokens.OPERATOR: 2>
>>> next(producer)
Lexing new word: 3
<Tokens.DIGIT: 1>
>>> next(producer)
Lexing new word: $
__main__.LexerError: Unknown token: '$'
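A small generator detail worth noting: once a generator raises, it is finished. Asking the same producer for another token would not re-raise the LexerError; it would just end with a bare StopIteration, so the error has to be handled where it first surfaces:

>>> next(producer)
StopIteration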
Now let's draft the verifier, which will use this producer:
def verifier(producer):
    expected_type = Tokens.DIGIT
    for token_type in producer:
        if token_type is not expected_type:
            raise ParserError(f'Expecting {expected_type} but got {token_type}')
        # Flip the expectation: a digit must be followed by an operator, and vice versa.
        if expected_type is Tokens.DIGIT:
            expected_type = Tokens.OPERATOR
        elif expected_type is Tokens.OPERATOR:
            expected_type = Tokens.DIGIT
        else:
            raise UnexpectedStateError
    else:
        # The for-else only runs when the producer is exhausted without an error.
        return True
Since we have only one way to go, we can use a single-state system. We have 'expected_type', which points to the next thing that we need to see if we want to stay syntactically correct. Our starting state is 'DIGIT'; we then go through the producer and try to match each eaten token against our single state ('expected_type'). After every match, we decide the next expected state by looking at the previous one. Let's give it a try:
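One more bit of glue the session below assumes but the post doesn't show: it calls verify with a source string, so presumably there is a thin wrapper that ties the lexer and the verifier together, plus the two exception classes the verifier raises. Something like:

class ParserError(Exception):
    pass

class UnexpectedStateError(Exception):
    pass

def verify(source):
    # Feed the lazily produced tokens straight into the verifier.
    return verifier(lexer(source))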
>>> verify('2 + 2')
Lexing new word: 2
Lexing new word: +
Lexing new word: 2
True
>>> verify('2 + + 2 + 3 $')
Lexing new word: 2
Lexing new word: +
Lexing new word: +
__main__.ParserError: Expecting Tokens.DIGIT but got Tokens.OPERATOR
As you can see, it worked as we imagined. It didn't try to lex the '$' sign, because the parser failed before asking the lexer to eat that character. It didn't even try to lex '3', so it failed early and saved some time. And if we fix this error:
>>> verify('2 + 2 + 3 $')
Lexing new word: 2
Lexing new word: +
Lexing new word: 2
Lexing new word: +
Lexing new word: 3
Lexing new word: $
__main__.LexerError: Unknown token: '$'
We get the lexer error. That was it! Thanks for reading; I know this was a relatively short blog post, but I wanted to explain why token producers are generally expressed as generators and what the advantages are (a hint about the upcoming blog post: it involves 'backtracking').