The Zephyr Abstract Syntax Description Language, or ASDL for short, is a well-known, mature (released around 1997) description language for defining ASTs (nodes and leaves) and other tree-like data structures. When it was released (and even one or two decades later) it was more than capable of serving major compilers, including CPython's bytecode compiler (since around version 2.5). But after gaining quite a bit of experience with it, I started to realize that it can be improved in a few ways. This post contains my own humble opinions; please feel free to share yours with me through the contact information at the end. I'm currently working on a demo project that implements all the ideas mentioned below, and it would be really nice to hear from you about them.

Type System

Every field declaration in ASDL takes the form [type][qualifier]? [name], where [type] is either something defined in the current spec or a built-in (such as identifier), and [qualifier] is an optional, mutually exclusive qualifier for the given type. There are two kinds of qualifiers, [qualifier] = {?, *}. A question mark (?) means the field can either be empty or hold something that belongs to [type]. A star (*), on the other hand, means the field is a sequence of zero or more elements of the given [type]. In theory these two qualifiers might seem like enough, but a basic example proves otherwise. Let's imagine a simple AST of a Python function.

function = Function(identifier name, expr? returns, stmt* body)

It has a name, an optional return annotation, and a list of statements as its body. Which looks very accurate, right?

def foo() -> int:
    pass

def bar():
    pass
    pass

If we try to express these functions in the AST form we defined earlier, they will look something like this:

Function("foo", Expr(int), [PassStmt()])
Function("bar", None, [PassStmt(), PassStmt()])

And there is nothing that looks wrong with this form, as long as it is generated by some kind of parser. But Python allows users to craft and compile arbitrary ASTs. As you might know, function bodies in Python have to contain at least one statement, but the ASDL implies they might have zero (since * means zero or more). Here lies the conflict: the AST Function("baz", None, []) is valid according to the ASDL spec, but later on it might crash the interpreter, or it might not pass validation at all. For CPython, there is a custom AST validator, which comes with a maintenance burden, just for ensuring that user-crafted ASTs won't crash the compiler.

>>> import ast
>>> foo_mod = ast.parse("def foo(): pass")
>>> foo_mod.body[0].body.clear()
>>> compile(foo_mod, "<test>", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: empty body on FunctionDef <= custom error
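
The custom error above comes from CPython's hand-written AST validator, which looks roughly like this:
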
static int
validate_nonempty_seq(asdl_seq *seq, const char *what, const char *owner)
{
    if (asdl_seq_LEN(seq))
        return 1;
    PyErr_Format(PyExc_ValueError, "empty %s on %s", what, owner);
    return 0;
}
static int
validate_body(asdl_seq *body, const char *owner)
{
    return validate_nonempty_seq(body, "body", owner);
}
...
    case FunctionDef_kind:
        return validate_body(stmt->v.FunctionDef.body, "FunctionDef");

Also, this is not the only case where the AST doesn't comply with the ASDL. For example, the AST of a dictionary is defined as Dict(expr* keys, expr* values), which means it has two lists of expressions named keys and values. That makes sense, since the AST of {'a': 'b'} is just Dict([Constant('a')], [Constant('b')]). But when it comes to unpacking one dict inside another with the double-star operator, the AST looks like this:

Input: {**a, b:c}
Output:
Dict(
    keys=[
        None,
        Name(id='b', ctx=Load()),
    ],
    values=[
        Name(id='a', ctx=Load()),
        Name(id='c', ctx=Load()),
    ],
)
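
Notice the outlier among the keys: the None, which stands in for the **a unpacking. Every tool that walks Dict nodes has to special-case it; here is a minimal Python illustration (written just for this post):

import ast

tree = ast.parse("{**a, b: c}", mode="eval")
dict_node = tree.body

# Pairs whose key is None are **-unpackings; everything else is a regular
# key/value pair.
for key, value in zip(dict_node.keys, dict_node.values):
    if key is None:
        print("unpacking:", ast.dump(value))
    else:
        print(ast.dump(key), "=>", ast.dump(value))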

This None placeholder exists because field qualifiers are mutually exclusive, and you can't chain them. Things like this make ASDL context-dependent and, in some cases, increase the maintenance burden (such as external verifiers, which I'll also address in the next section). The solution could be as simple as extending the current qualifiers and making them chainable. There are two designs I have in mind. The first one introduces new qualifiers in the same form:

* => zero or more sequence
+ => one or more sequence
? => optional
[<field type>] => also optional but chainable

FunctionDef(identifier name, expr? returns, stmt+ body)
Dict([expr]* keys, expr* values)

The second one is a bit different; it might be harder to process in big ASDLs, but it is more explicit:

ZeroOrMore[<field type>] => same as *
OneOrMore[<field type>] => same as +
Opt/Optional[<field type>] => same as ? or [<field type>]

FunctionDef(identifier name, Opt[expr] returns, OneOrMore[stmt] body)
Dict(ZeroOrMore[Opt[expr]] keys, ZeroOrMore[expr] values)
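
Whichever design wins, the point is the same: a generator that understands the new qualifiers could derive checks such as the non-emptiness of a function body directly from the spec, instead of relying on hand-written validators. A rough Python sketch (hypothetical generated code, not the output of any existing tool) of what it might emit for stmt+ body:

# Hypothetical generated validator for "stmt+ body"; shown only to
# illustrate the idea, the real output may differ.
def validate_FunctionDef(node):
    if len(node.body) < 1:
        raise ValueError("empty body on FunctionDef")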

ASDL actions

Integrating source code into grammars isn't a new idea; a recent example would be Python's new parser generator and the grammar it consumes. I believe this can be integrated into ASDL itself fairly quickly, with a new but not unorthodox syntax. The purpose of these actions would be both verification and transitions (though not limited to those). It might also open the way to language extensions.

Bringing in such actions would require a metadata format for ASDL modules. The best form I can think of is something similar to Python decorators that would annotate the ASDL modules (namespaces, which are not part of the original paper).

@<key> <value>
module <name> {}

@actions C
@version 3.8
module Example {}

The action syntax will depend on the action's purpose:

{verify, transition} $left::$right [where $condition] {
    [ACTION]
}

Verifier Actions

As I mentioned earlier, languages that allow users to create external ASTs require a custom validation step. Type checking will help in most cases, but there will still be some esoteric ones left. It might be controversial, since some people might not want to host their source code inside a text spec (I don't know, maybe for their linters / formatters, or other purposes), but it would ensure that the validation process is public, and the clients of this AST would know which nodes will be validated and what kind of methods will be used for their validation.

verify $nodes::$fields [where $condition] {
    [ACTION]
}

verify Dict::(keys, values) {
    return len(keys) == len(values)
}

verify ImportFrom::level {
    return level >= 0
}

verify Try::(handlers, finalbody, orelse) where len(handlers) == 0 {
    if len(finalbody) == 0 and len(orelse) > 0:
        return False
    ...
}
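
Assuming the generated code targets Python, the first two actions above could come out roughly like this (the function names and error messages are made up for illustration):

# Roughly what a generator could emit from the verify actions above.
def verify_Dict(node):
    if not (len(node.keys) == len(node.values)):
        raise ValueError("Dict: keys and values must have the same length")

def verify_ImportFrom(node):
    if not (node.level >= 0):
        raise ValueError("ImportFrom: level must be non-negative")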

Transitioning

Here comes the other big problem, and another use case for actions: AST changes. If you are writing some kind of tool that consumes the AST (e.g. a linter), it is not uncommon for it to break on every release. The reason is that the AST is also an internal format: things might change for purely internal reasons, and no one guarantees that they won't change again. So you have to test your tool on every major release and make the breakages go away by creating tons of workarounds.
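
For a concrete example, the folding of ast.Num and ast.Str into ast.Constant in Python 3.8 forces every cross-version tool to carry a check along these lines:

import ast
import sys

def is_string_literal(node):
    # ast.parse() produces Constant nodes since 3.8, Str nodes before that.
    if sys.version_info >= (3, 8):
        return isinstance(node, ast.Constant) and isinstance(node.value, str)
    return isinstance(node, ast.Str)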

This is the case even for the simplest change, like renaming a node. The solution would be a simple layer of "compatibility". The way it should work is that old nodes keep the same structure as in the old ASTs, even though the name or the form of that node has changed. There are two ways to achieve this: keeping the ASDL of every version (which would definitely be a mess), or keeping only the generated code for those nodes as part of the "compatibility" layer. I'd personally go for the latter. Let's do an imaginary example with three different language versions:

@version 3.6
module Example {
    number = NumOrFloat(object value)
           | Complex(object value)
}

@version 3.7
module Example {
    number = Num(object value)
           | Float(object value)
           | Complex(object value)
}

@version 3.8
module Example {
    number = Number(object value, str kind)
}

The 3.6 version has two nodes: NumOrFloat for integers and floats, and Complex for imaginary numbers. The 3.7 form splits NumOrFloat into two different nodes, Num and Float. And finally, the 3.8 version has only one constructor, Number, with an additional kind field. In theory, no matter which version you are on, in the "compatibility" layer you would only ever see the NumOrFloat and Complex nodes. To provide that, we need some kind of action to do the transition.

transition $source::$destination [where $condition] {
    [ACTION]
}

transition Number::(Num, Float, Complex) {
    if (strcmp(origin->kind, "integer") == 0)
        return Num(origin->value);
    else if (strcmp(origin->kind, "float") == 0)
        return Float(origin->value);
    else if (strcmp(origin->kind, "complex") == 0)
        return Complex(origin->value);
}

The example above takes a 3.8 Number node and outputs a 3.7 node (Num, Float, or Complex). The same goes for 3.7: it will take a 3.7 AST and output the 3.6 version. So in the end, all ASTs end up the same in the imaginary "compatibility" layer.

@version 3.7
module Example {
    number = Num(object value)
           | Float(object value)
           | Complex(object value)

    transition Float::NumOrFloat {
        return NumOrFloat(origin->value);
    }
}
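
And if the generated code targeted Python instead of C, the Number transition from earlier could come out roughly like this (the classes below are minimal stand-ins for the generated node types, not real ast classes):

from dataclasses import dataclass

# Minimal stand-ins for the nodes of the imaginary ASDL above.
@dataclass
class Number:        # 3.8 shape
    value: object
    kind: str

@dataclass
class Num:           # 3.7 shape
    value: object

@dataclass
class Float:         # 3.7 shape
    value: object

@dataclass
class Complex:       # shared shape
    value: object

def transition_Number(origin: Number):
    # Roughly what the generated code for the Number transition could do.
    if origin.kind == "integer":
        return Num(origin.value)
    elif origin.kind == "float":
        return Float(origin.value)
    elif origin.kind == "complex":
        return Complex(origin.value)
    raise ValueError(f"unknown kind: {origin.kind!r}")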

Thanks

I guess this is all. Thanks for reading, and if you have any extra thoughts about this, I would really like to hear them. Please contact me through isidentical [at] tree.science or Twitter/Telegram/Discord (@isidentical).
