Tree-like data is a common consumption target of applications that are empowered by visitor pattern. A couple of use cases of this pattern includes code generators (a.k.a compilers), static code analysis tools (linters), and code formatters. Code generators take advantage of working with heterogeneous data. On the other side, linters make use of working with-in a state to process previously learned information about that codebase.

class Emitter(Visitor):

    def visit_Module(self, node):
        with self.create_module() as module:
            self.generic_visit(node)
        return module

    def visit_BinOp(self, node):
        self.visit(node.left)
        self.visit(node.right)

        opcode = type(node.op).__name__.upper()
        self.module.emit(Instruction(f"OPERATION_{opcode}"))

    def visit_Constant(self, node):
        self.module.emit(Instruction("LITERAL", node.value))

As seen in the upper example, implementing a simple AST to some imaginary bytecode compiler isn't hard at all. It only requires a visitor class and implementation of how an individual node should act when seen. This kind of Visitors actually take the whole responsibility of visiting the tree, in a way that can be used recursively.

What I mean by "recursive" is that in common languages, fields on certain nodes contain heterogeneous data (sums). For example, rhs or lhs of a binary operation can be a Name, a Call, or simply another binary operation. Instead of having certain conditions for all of them, you can simply pass it to the Visitor and it will do the job of finding the right place to emit code for that node.

int visit_constant(Generator* gen, Node* node) {
    return EMIT("LITERAL", node->v.Constant.value);
}

int visit_binop(Generator* gen, Node* node) {
    CHECK(visit_expression(node->v.BinOp.left));
    CHECK(visit_expression(node->v.BinOp.right));
    return EMIT(_operator_name(node->v.BinOp.operator));
}

int visit_expression(Generator* gen, Node* node) {
    switch (node->kind) {
        case Name_kind:
            return visit_name(gen, node);
        case BinOp_kind:
            return visit_binop(gen, node);
        case Constant_kind:
            return visit_constant(gen, node);
    }
}

In the example above, the visit_binop function doesn't need to know what is the type of lhs / rhs. It just knows it is an expression and passes it to a top-level visit_expression function which then handles the rest. Okay, we are good up to know, but what about some context dependant actions.

Context Dependent Actions

So, imagine you have a language that has certain features, but using two different esoteric functionality together should be forbidden. Grammars are not suitable for such actions because while the construction, no one knows what context you are in or who are your neighbors (you can actually record about it, but it would be a big mess). Since you are already using the visitor pattern to compile the AST, you can think that while compiling one of that features, you can peek around and find if it is good to use or not. For this example, I'll forbid using control flow elements inside of a finally.

Let's see what we should handle

        For(
            target=Name(id='x', ctx=Store()),
            iter=Name(id='y', ctx=Load()),
            body=[
                Try(
                    body=[Pass()],
                    handlers=[],
                    orelse=[],
                    finalbody=[Continue()], # <=======
                ),
            ],
            orelse=[],
            type_comment=None,
        )

What we are looking for a Continue (or Break() etc.) inside of a finally inside of a for loop. Let's try to draft an implemention of it.

import ast

class Compiler(ast.NodeVisitor):

    def visit_For(self, node):
        self.inside_for = True
        jumper = self.emit("FOR_LOOP", node.target, node.iter)
        self.emit_body(node.body, bind=jumper)
        self.inside_for = False

Okay, from the initial point, we got ourselves a "state" (self.inside_for), it is no big deal. The only downside we can think of, for now, is the code looks kinda messy. We can probably write a context manager to set an instance variable, but I don't think that would be a pythonic idea.

    def visit_Try(self, node):
        with self._try_guard(node) as guard:
        [...]
        self.inside_finally = True
        [...]
        self.inside_finally = False

    def visit_Continue(self, node):
        if self.inside_for and self.inside_finally:
            raise ValueError

It started to have more states (self.inside_finally), more and more. Guess we can have a context manager to at least reduce this hand managing burden, but it won't solve the problem of thousands of different states that you may require or not. While the codebase grows, some of them will become obsolete, useless; and no one might just notice it. And the actual problem is that, this check is wrong, for cases like the code below.

for x in y:
    try:
        pass
    finally:
        for y in z:
            continue

This should successfully run because the continue belongs a different for loop, and that has nothing to do with the upper try/finally clause. We can probably solve this with more and more states but I won't advise it because it would make the code more confusing, complex and awful.

As the title may suggest, the solution to this problem might be, the ancestral chains. It is some kind of backtracking to the original tree node from the child.

>>> tree = ast.parse("2+2", mode="eval")
>>> lhs = tree.body.left
>>> lhs
<ast.Constant object at 0x7f39270f9af0>
>>> lhs.parent
<ast.BinOp object at 0x7f39270f97d0>
>>> lhs.parent.parent
<ast.Expression object at 0x7f39270f9730>

Having such a chain attached to node objects would allow us to search in that context and check if we have the right conditions to raise an error.

Implementation

Let's implement a simple relate(tree: AST) -> None function that would take the tree, and parent field to all nodes by traversing it.

def relate(tree):
    tree.parent = None

The first thing is setting the parent of root node. It is important because it will signal consumers to where they should stop.

    for parent in ast.walk(tree):

ast.walk will walk in every node and leaf, which they all be parents (leaves might not, but that won't matter since the function below doesn't return anything on that case)

        for child in ast.iter_child_nodes(parent):
            child.parent = parent

iter_child_nodes will yield all the first-level child nodes that belong to the given node. And then we can safely set the parent. It might be a cool idea to use weakref.ref here, but I don't think the parents will go away while the children are still living.

Also, it might be good to have a flatten function that will iterate all parents, something like this

def get_parents(node):
    parent = node
    while parent := parent.parent:
        yield parent

This way we can do some kind of search among the parents

>>> tuple(get_parents(lhs))
(<ast.BinOp object at 0x7f46b7e3d190>, <ast.Expression object at 0x7f46b7ed6c80>)
>>> ast.Expression in tuple(map(type, get_parents(lhs)))
True

This was it, all we need is that searching for parents. If we want to repeat the Compiler example with the same the bug we had, it would be as simple and as pythonic as this;

class AncestorChain(set):
    def __init__(self, *args):
        return super().__init__(args)

    def __contains__(self):
        return self.issubset(map(type, get_parents(node)))

class Compiler(ast.NodeVisitor):
    [...]

    FORBIDDEN_SET = AncestorChain(ast.For, ast.Try)

    def visit_Continue(self, node):
        if node in FORIBDDEN_SET:
            raise ValueError

It is this simple, no states at all. Although we made the code simple, we want to also solve the bug. One thing that came to my mind for such cases is a utility function that returns the first occurrence of a type in that chain. This will help us to find the first occurrence of the try clause and then we can check its parent for the given for.

def first_occurrence(node, *ancestors):
    for parent in get_parents(node):
        if type(parent) in ancestors:
            return parent
    else:
        return False

class Compiler(ast.NodeVisitor):
    [...]

    def visit_Continue(self, node):
        if node in FORIBDDEN_SET:
            for_clause = first_occurrence(node, ast.For)
            try_clause = first_occurrence(node, ast.Try)
            if is_parented_by(try_clause, for_clause):
                raise ValueError

What it does is that it finds the first try/finally and the first for clause and then it checks that if that continue's bound for is also the parent of that try/finally clause, if so, it raises the value error, but cases like this;

for item in items:
    try:
        pass
    finally:
        for child in children:
            continue

it will be silenced.

So, instead of having millions of states, we solved the problem with 3 lines of extra code to the compiler. Having an advantage of working in a dynamic language, will certainly helped the simpler implementation of this ancestral chains, but I dont think there will be a decent difference on languages like C about the implementation's look. And by the way, thanks for reading. If you want to contact to me, please check out the social section in the footer, or mail me directly from isidentical [at] tree.science.

Appendix 1: Full Code

import ast

def relate(tree):
    tree.parent = None
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            child.parent = parent

def get_parents(node):
    parent = node
    while parent := parent.parent:
        yield parent

class AncestorChain(set):
    def __init__(self, *args):
        return super().__init__(args)

    def __contains__(self, node):
        return self.issubset(map(type, get_parents(node)))

def first_occurrence(node, *ancestors):
    for parent in get_parents(node):
        if type(parent) in ancestors:
            return parent
    else:
        return False

def is_parented_by(node, *ancestors):
    for parent in get_parents(node):
        if parent in ancestors:
            return True
    else:
        return False

Share on: TwitterFacebookEmail


Published

Category

Abstract Syntax Trees

Tags

Contact