Tree-like data is a common consumption target of applications that are empowered by visitor pattern. A couple of use cases of this pattern includes code generators (a.k.a compilers), static code analysis tools (linters), and code formatters. Code generators take advantage of working with heterogeneous data. On the other side, linters make use of working with-in a state to process previously learned information about that codebase.
class Emitter(Visitor):
def visit_Module(self, node):
with self.create_module() as module:
self.generic_visit(node)
return module
def visit_BinOp(self, node):
self.visit(node.left)
self.visit(node.right)
opcode = type(node.op).__name__.upper()
self.module.emit(Instruction(f"OPERATION_{opcode}"))
def visit_Constant(self, node):
self.module.emit(Instruction("LITERAL", node.value))
As seen in the upper example, implementing a simple AST to some imaginary
bytecode compiler isn't hard at all. It only requires a visitor class and
implementation of how an individual node should act when seen. This kind of
Visitor
s actually take the whole responsibility of visiting the tree, in
a way that can be used recursively.
What I mean by "recursive" is that in common languages, fields on certain nodes
contain heterogeneous data (sums). For example, rhs or lhs of a binary operation
can be a Name
, a Call
, or simply another binary operation. Instead of having
certain conditions for all of them, you can simply pass it to the Visitor
and it
will do the job of finding the right place to emit code for that node.
int visit_constant(Generator* gen, Node* node) {
return EMIT("LITERAL", node->v.Constant.value);
}
int visit_binop(Generator* gen, Node* node) {
CHECK(visit_expression(node->v.BinOp.left));
CHECK(visit_expression(node->v.BinOp.right));
return EMIT(_operator_name(node->v.BinOp.operator));
}
int visit_expression(Generator* gen, Node* node) {
switch (node->kind) {
case Name_kind:
return visit_name(gen, node);
case BinOp_kind:
return visit_binop(gen, node);
case Constant_kind:
return visit_constant(gen, node);
}
}
In the example above, the visit_binop
function doesn't need to know what is
the type of lhs / rhs. It just knows it is an expression and passes it to a top-level
visit_expression
function which then handles the rest. Okay, we are good up to know,
but what about some context dependant actions.
Context Dependent Actions
So, imagine you have a language that has certain features, but using two different
esoteric functionality together should be forbidden. Grammars are not suitable for
such actions because while the construction, no one knows what context you are in
or who are your neighbors (you can actually record about it, but it would be a big mess).
Since you are already using the visitor pattern to compile the AST, you can think that while
compiling one of that features, you can peek around and find if it is good to use
or not. For this example, I'll forbid using control flow elements inside of a finally
.
Let's see what we should handle
For(
target=Name(id='x', ctx=Store()),
iter=Name(id='y', ctx=Load()),
body=[
Try(
body=[Pass()],
handlers=[],
orelse=[],
finalbody=[Continue()], # <=======
),
],
orelse=[],
type_comment=None,
)
What we are looking for a Continue
(or Break()
etc.) inside of a finally inside of a
for loop. Let's try to draft an implemention of it.
import ast
class Compiler(ast.NodeVisitor):
def visit_For(self, node):
self.inside_for = True
jumper = self.emit("FOR_LOOP", node.target, node.iter)
self.emit_body(node.body, bind=jumper)
self.inside_for = False
Okay, from the initial point, we got ourselves a "state" (self.inside_for
), it is no big
deal. The only downside we can think of, for now, is the code looks kinda messy. We can probably
write a context manager to set an instance variable, but I don't think that would be a pythonic idea.
def visit_Try(self, node):
with self._try_guard(node) as guard:
[...]
self.inside_finally = True
[...]
self.inside_finally = False
def visit_Continue(self, node):
if self.inside_for and self.inside_finally:
raise ValueError
It started to have more states (self.inside_finally
), more and more. Guess we can have a context manager
to at least reduce this hand managing burden, but it won't solve the problem of thousands of different
states that you may require or not. While the codebase grows, some of them will become obsolete, useless;
and no one might just notice it. And the actual problem is that, this check is wrong, for cases like the code
below.
for x in y:
try:
pass
finally:
for y in z:
continue
This should successfully run because the continue
belongs a different for
loop, and that has nothing
to do with the upper try/finally
clause. We can probably solve this with more and more states but I won't
advise it because it would make the code more confusing, complex and awful.
As the title may suggest, the solution to this problem might be, the ancestral chains. It is some kind of backtracking to the original tree node from the child.
>>> tree = ast.parse("2+2", mode="eval")
>>> lhs = tree.body.left
>>> lhs
<ast.Constant object at 0x7f39270f9af0>
>>> lhs.parent
<ast.BinOp object at 0x7f39270f97d0>
>>> lhs.parent.parent
<ast.Expression object at 0x7f39270f9730>
Having such a chain attached to node objects would allow us to search in that context and check if we have the right conditions to raise an error.
Implementation
Let's implement a simple relate(tree: AST) -> None
function
that would take the tree, and parent
field to all nodes by traversing it.
def relate(tree):
tree.parent = None
The first thing is setting the parent of root node. It is important because it will signal consumers to where they should stop.
for parent in ast.walk(tree):
ast.walk
will walk in every node and leaf, which they all be parents (leaves might not, but that won't matter
since the function below doesn't return anything on that case)
for child in ast.iter_child_nodes(parent):
child.parent = parent
iter_child_nodes
will yield all the first-level child nodes that belong to the given node. And then we can safely
set the parent. It might be a cool idea to use weakref.ref
here, but I don't think the parents will go away
while the children are still living.
Also, it might be good to have a flatten function that will iterate all parents
, something like this
def get_parents(node):
parent = node
while parent := parent.parent:
yield parent
This way we can do some kind of search among the parents
>>> tuple(get_parents(lhs))
(<ast.BinOp object at 0x7f46b7e3d190>, <ast.Expression object at 0x7f46b7ed6c80>)
>>> ast.Expression in tuple(map(type, get_parents(lhs)))
True
This was it, all we need is that searching for parents. If we want to repeat the Compiler
example with the same
the bug we had, it would be as simple and as pythonic as this;
class AncestorChain(set):
def __init__(self, *args):
return super().__init__(args)
def __contains__(self):
return self.issubset(map(type, get_parents(node)))
class Compiler(ast.NodeVisitor):
[...]
FORBIDDEN_SET = AncestorChain(ast.For, ast.Try)
def visit_Continue(self, node):
if node in FORIBDDEN_SET:
raise ValueError
It is this simple, no states at all. Although we made the code simple, we want to also solve the bug. One thing that came to my mind for such cases is a utility function that returns the first occurrence of a type in that chain. This will help us to find the first occurrence of the try clause and then we can check its parent for the given for.
def first_occurrence(node, *ancestors):
for parent in get_parents(node):
if type(parent) in ancestors:
return parent
else:
return False
class Compiler(ast.NodeVisitor):
[...]
def visit_Continue(self, node):
if node in FORIBDDEN_SET:
for_clause = first_occurrence(node, ast.For)
try_clause = first_occurrence(node, ast.Try)
if is_parented_by(try_clause, for_clause):
raise ValueError
What it does is that it finds the first try/finally and the first for clause and then
it checks that if that continue
's bound for is also the parent of that try/finally
clause, if so, it raises the value error, but cases like this;
for item in items:
try:
pass
finally:
for child in children:
continue
it will be silenced.
So, instead of having millions of states, we solved the problem with 3 lines of extra code to the compiler.
Having an advantage of working in a dynamic language, will certainly helped the simpler implementation
of this ancestral chains, but I dont think there will be a decent difference on languages like C about
the implementation's look. And by the way, thanks for reading. If you want to contact to me, please check
out the social section in the footer, or mail me directly from isidentical [at] tree.science
.
Appendix 1: Full Code
import ast
def relate(tree):
tree.parent = None
for parent in ast.walk(tree):
for child in ast.iter_child_nodes(parent):
child.parent = parent
def get_parents(node):
parent = node
while parent := parent.parent:
yield parent
class AncestorChain(set):
def __init__(self, *args):
return super().__init__(args)
def __contains__(self, node):
return self.issubset(map(type, get_parents(node)))
def first_occurrence(node, *ancestors):
for parent in get_parents(node):
if type(parent) in ancestors:
return parent
else:
return False
def is_parented_by(node, *ancestors):
for parent in get_parents(node):
if parent in ancestors:
return True
else:
return False