Generators are on-demand iterators that produce values only when asked for; in short, they give us lazy evaluation. One great example of this shiny feature is lexers. A lexer is a program (or a function) that takes raw source code as input and emits pieces of it, each classified according to some pre-defined categories.
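Before diving into lexers, a tiny aside to make the 'asked for' part concrete. Here is a throwaway countdown generator (my own illustration, unrelated to the lexer below); its body does not run at all until we request a value:

def countdown(n):
  while n > 0:
    print(f'producing {n}')
    yield n
    n -= 1

>>> gen = countdown(3)  # nothing happens yet, the body has not started
>>> next(gen)           # the body runs only now, up to the first yield
producing 3
3

Now, back to lexers.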

Input: 'print(2+2)'
-------------------------------------------
1,0-1,5:            NAME           'print'
1,5-1,6:            OP             '('
1,6-1,7:            NUMBER         '2'
1,7-1,8:            OP             '+'
1,8-1,9:            NUMBER         '2'
1,9-1,10:           OP             ')'

The example above shows the tokens of a single-word source code (no space between any character group). The lexer has certain rules that describe what a token looks like; a rule might be a regex, or a handwritten routine for scanning characters. In the example above, the lexer starts with the p character, sees that it is a valid identifier start, and keeps eating the following characters until it reaches something that is not alphanumeric (the ( character). After that, it switches to the 'OP' rule, which knows certain operator forms like ( or >=. When that rule matches, it switches again to a different rule, and so on...
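By the way, the listing above looks like the output of Python's own 'python -m tokenize' (with the trailing NEWLINE and ENDMARKER tokens trimmed), and fittingly, the standard library tokenizer is itself a generator. You can drive it by hand like this:

import io
import tokenize

# generate_tokens is a generator: it produces one TokenInfo at a time,
# pulling more source through the readline callable only as needed.
for tok in tokenize.generate_tokens(io.StringIO('print(2+2)').readline):
  print(tok)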

Certain parsing methodologies require seeing all the tokens before parsing starts, but that is not always the case. What if you have a very simple parser that doesn't need to look at upcoming tokens unless it can't parse the current state? This is where a lazy tokenizer comes in.

Let's imagine a simple tokenizer that only recognizes digits and the '+'/'-' operators, plus a syntactical verifier on top of it that ensures there is always an operator between two digits, alternating like that all the way through. When we input '2 + 2', it returns True; when we input '2 2 +' or '2 + + 2', it raises an error. And if we give it '$ 2 + 2', it won't even start: an error is raised from the tokenizer side. But what if we input something with multiple defects, like '2 + + 2 + 3 $ 4'? Where should it raise an error? As I said before, it depends, but with a simple parser you want to show your users the errors gradually so they can solve them one by one. So we should first give them the parsing error, and only after they have handled it should we raise the tokenization error, so they can fix that too, incrementally. And there is no point in keeping '3' in memory when there is a problem that happens before it.

Let's draft an example in Python. It is going to be very simple.
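The code below refers to a Tokens enum and a few exception classes that I won't dwell on; here is one minimal way to define them (the member values match the enum reprs you will see in the REPL output further down):

import enum

class Tokens(enum.Enum):
  DIGIT = 1
  OPERATOR = 2

class LexerError(Exception):
  ...

class ParserError(Exception):
  ...

class UnexpectedStateError(Exception):
  ...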

def lexer(source):
  for word in source.split():
    print(f'Lexing new word: {word}')
    if word.isdigit():
      yield Tokens.DIGIT
    elif word in {"+", "-"}:
      yield Tokens.OPERATOR
    else:
      raise LexerError(f"Unknown token: '{word}'")

We go through the words (split on whitespace) and, whenever we encounter a digit or an operator, we yield its token type. If not, we raise a LexerError.

>>> producer = lexer('2 + + 2 + 3 $ 4')
>>> next(producer)
Lexing new word: 2
<Tokens.DIGIT: 1>
>>> next(producer)
Lexing new word: +
<Tokens.OPERATOR: 2>
>>> next(producer)
Lexing new word: +
<Tokens.OPERATOR: 2>
>>> next(producer)
Lexing new word: 2
<Tokens.DIGIT: 1>
>>> next(producer)
Lexing new word: +
<Tokens.OPERATOR: 2>
>>> next(producer)
Lexing new word: 3
<Tokens.DIGIT: 1>
>>> next(producer)
Lexing new word: $
__main__.LexerError: Unknown token: '$'

Now let's draft the verifier, which will consume this producer:

def verifier(producer):
  expected_type = Tokens.DIGIT
  for token_type in producer:
    if token_type is not expected_type:
      raise ParserError(f'Expecting {expected_type} but got {token_type}')

    # Decide what to expect next by looking at what we just consumed.
    if expected_type is Tokens.DIGIT:
      expected_type = Tokens.OPERATOR
    elif expected_type is Tokens.OPERATOR:
      expected_type = Tokens.DIGIT
    else:
      raise UnexpectedStateError
  return True

def verify(source):
  # Small convenience wrapper used in the REPL examples below.
  return verifier(lexer(source))

Since there is only one way to go, we can get away with a single-state system. 'expected_type' points to the next thing we need to see if we want to stay syntactically correct. Our starting state is a DIGIT; we then go through the producer and try to match each consumed token against our single state ('expected_type'). After every match, we decide the next expected state by looking at the previous one. Let's give it a try:

>>> verify('2 + 2')
Lexing new word: 2
Lexing new word: +
Lexing new word: 2
True

>>> verify('2 + + 2 + 3 $')
Lexing new word: 2
Lexing new word: +
Lexing new word: +
__main__.ParserError: Expecting Tokens.DIGIT but got Tokens.OPERATOR

As you can see, it worked as we imagined. It didn't try to lex the '$' sign, because the parser failed before asking the lexer to eat that character. It didn't even try to lex '3', so it failed early and saved some time. And if we try to fix this error:

>>> verify('2 + 2 + 3 $')
Lexing new word: 2
Lexing new word: +
Lexing new word: 2
Lexing new word: +
Lexing new word: 3
Lexing new word: $
__main__.LexerError: Unknown token: '$'

We get the lexer error. That was it! Thanks for reading. I know this was a relatively short blog post, but I wanted to explain why token producers are generally expressed as generators and what the advantages are (a hint about the upcoming blog post: it brings in 'backtracking').
