Array Language Part 1: Representations and Parsing

09.16.2024

Previously I have written about compiler technologies with respect to Deep Learning frameworks. That entry was more of an exploratory exercise in understanding something that, I can say with certainty, is perceived by many of my peers graduating as DL practitioners as a kind of black magic. I wish there were more efforts to disambiguate and simplify the space[1] right now, and that’s part of why I’m writing this: I think programming languages can lift that mental burden from those learning. Looking back on that old post, it is not the most comfortable read, but it certainly did clear up many of the gray areas in my mind about how neural networks get to the hardware. A few months after that I discovered the joy of APL[2], and like many others who have used the language, I began considering what my own implementation would look like. So I dug up the scrapped Lox interpreter that I had used to teach myself Rust a few years back and dusted it off, refactoring it into a different sort of fishy-named language to keep myself occupied through the bores of finishing up my final undergraduate requirements. It has developed quite a way, slowly but thoroughly, and now I would like to take this space to reflect on and document what I have done so far.

Getting Started: Data Structures

When I started the project in 2022, it was an interpreter for Lox based on the book Crafting Interpreters by Robert Nystrom. The book’s content was a great starting place for understanding how a language could be implemented as an interpreter, but since it is about Lox, an imperative, dynamically typed programming language, the lexer and parser needed to be reworked slightly to accommodate the array language that we[3] are hoping to build with this project. One of the first things I did in revisiting this project was to create a new AST generator. Since we are using Rust, these important data structures need to be explicitly defined and enumerated, so I chose to do that with these macros:

#[macro_export]
macro_rules! define_enum {
    ($base_name:ident {
         $($variant:ident {
             $($childname:ident: $childtype:ty),*
         }),+
    }
    ) => {
        #[derive(Debug, PartialEq)]
        pub enum $base_name {
            $(
                $variant(NodeRef<paste::paste!{[<$variant Node>]}>)
            ),+
        }
    }
}

#[macro_export]
macro_rules! define_nodes {
    ($base_name:ident {
         $($variant:ident {
             $($childname:ident: $childtype:ty),*
         }),+
    }
    ) => {
        $(
        paste::paste!{
        #[derive(Debug, PartialEq)]
        pub struct [<$variant Node>] {$(pub $childname: $childtype),*}
        }
        )+
    };
}

What’s going on here, if you can look past the line noise of the input matching, is that we are building a tree data structure out of Rust’s enums. define_enum defines an enum node type with a variant for each part of the grammar. Each variant points (through an indirection) to a unique data structure, which is defined by define_nodes. Then we can wrap each of these in a top-level macro, which will create the whole AST data structure for, say, an Expr:

#[macro_export]
macro_rules! define_ast {
    ($(
        $base_name:ident {
         $($variant:ident {
             $($childname:ident: $childtype:ty),*
         })|+
    })*
    ) => {
        $(
        // Enums define the parse tree from grammar
        define_enum!($base_name {
         $($variant {
             $($childname: $childtype),*
         }),+});

        // data structures held by nodes
        define_nodes!($base_name {
         $($variant {
             $($childname: $childtype),*
         }),+});
        )*
    }
}

Finally, we can instantiate our grammar with the following macro invocation:

// AST productions for the language, in the format:
// Symbol { 
//      production { attributes... } 
//      ...
//      } 

define_ast! {
    Expr {
          UnaryOperation { operator: UnaryOpType, right: Expr }
        | BinaryOperation { left: Expr, operator: BinaryOpType, right: Expr }
        | FnDef { signature: CallSignature, body: Expr }
        | FnCall { callee: Expr, args: Vec<Expr> }
        | Grouping { expression: Expr }
        | Frame { expression_list: Vec<Expr> }
        | Literal { value: DataType }
        | Let { name: Name, initializer: Expr, region: Option<Expr> }
        | Term { name: Name }
        | AtomicCast { value: DataType }
        | ShapeCast { value: DataType }
    }
}

I don’t know about you, but I think that’s a pretty satisfying result. One thing to note is that nodes are heap allocated, since type NodeRef<T> = Box<T>. Boxing the payload of each enum variant means the type is Sized, and we can implement traits both over Expr and over an individual node type like LiteralNode.
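
To make that concrete, here is roughly what the macros expand to for a single variant. This is a hand-written sketch rather than actual macro output, using the NodeRef alias mentioned above:

// the indirection used by every enum variant
type NodeRef<T> = Box<T>;

#[derive(Debug, PartialEq)]
pub enum Expr {
    // ... one variant per grammar production ...
    Literal(NodeRef<LiteralNode>),
}

// the payload struct generated by define_nodes for the Literal production
#[derive(Debug, PartialEq)]
pub struct LiteralNode {
    pub value: DataType,
}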

Parsing

Now for the parser. I happen to have written a lexer[4] for this as well, but I think the parser can be understood without an explanation of it. We will start out, again, with some helpful macros:

macro_rules! match_next {
    ($tokens:ident, $types:pat) => {
        if let Some(TokenData { token: $types, .. }) = $tokens.peek() {
            $tokens.next()
        } else {
            None
        }
    };
}
macro_rules! expect_next {
    ($tokens:ident, $types:pat) => {
        match_next!($tokens, $types).ok_or(parse_failure!(
            $tokens,
            format!("expected {}", stringify!($types))
        ))
    };
}

These allow us to match against token patterns and advance the parser on the success or failure of those patterns in a concise fashion. With them, we can easily parse syntax elements such as this ML-style let expression with a body:

pub fn expression(tokens: &mut Lexer) -> Result<Expr> {
    letexpr(tokens)
}

fn letexpr(tokens: &mut Lexer) -> Result<Expr> {
    if match_next!(tokens, LET).is_some() {
        let name = expect_next!(tokens, IDENTIFIER(..))?;
        expect_next!(tokens, EQUAL)?;
        Ok(Expr::Let(
            LetNode {
                name: name.try_into()?,
                initializer: value(tokens)?,
                region: {
                    expect_next!(tokens, IN | SEMICOLON)?;
                    Some(expression(tokens)?)
                },
            }
            .into(),
        ))
    } else {
        value(tokens)
    }
}
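
Assuming the lexer spells the LET, EQUAL, IN, and SEMICOLON tokens in the obvious way (the lexer is not shown here, so the concrete spelling is an assumption), this accepts something like let x = 5 in x + x, or equivalently let x = 5; x + x, where everything after the in or ; becomes the region in which the binding is visible.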

As an aside, you may cringe at the fact that we are passing a mutable reference to the Lexer into these functions, and there may well be a “cleaner” solution, but in practice this does not cause much trouble with the borrow checker because of the serial nature of the algorithm. As you can see, it’s fairly easy to implement much of the syntax this way, including operator-precedence infix notation. We could easily create a shorthand for writing Expr::Let(LetNode {...}.into()) each time, but it is left this way for clarity. You can also see from this function that value(tokens), while not yet explicit in the data structure, covers all of the other expression forms that bind more tightly than the let-expression, which includes things like function calls and operations on literals and terms. As a more complicated example, here is how a closure can be parsed this way[5]:

fn closure(tokens: &mut Lexer) -> Result<Expr> {
    if match_next!(tokens, VBAR).is_some() {
        let args: Vec<(Name, ShapeVec)> = (0..)
            .map_while(|_| {
                if let Some(name) = match_next!(tokens, IDENTIFIER(_)) {
                    Some((|| -> Result<(Name, ShapeVec)> {
                        Ok((name.try_into()?, shapedef(tokens)?))
                    })())
                } else {
                    None
                }
            })
            .collect::<Result<Vec<(Name, ShapeVec)>>>()?;
        let ret = if match_next!(tokens, ARROW).is_some() {
            Some(shapedef(tokens)?)
        } else {
            None
        };
        expect_next!(tokens, VBAR)?;
        let body = expression(tokens)?;
        Ok(Expr::FnDef(
            FnDefNode {
                signature: CallSignature { args, ret },
                body,
            }
            .into(),
        ))
    } else {
        expression(tokens)
    }
}
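
As footnote 5 mentions, the immediately invoked closure used to merge the two fallible values inside map_while is a bit of a hack. One possible alternative, sketched here rather than what the code above actually does, is to let Option::map produce the Option<Result<_>> that map_while wants:

let args: Vec<(Name, ShapeVec)> = (0..)
    .map_while(|_| {
        // a fallible closure passed to Option::map yields
        // Option<Result<(Name, ShapeVec)>> directly
        match_next!(tokens, IDENTIFIER(_)).map(|name| -> Result<(Name, ShapeVec)> {
            Ok((name.try_into()?, shapedef(tokens)?))
        })
    })
    .collect::<Result<Vec<(Name, ShapeVec)>>>()?;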

Footnotes

[1] I love graphics programming and machine learning/statistics and array languages, but each of these things is very pigeonholed by its own separate software frameworks, university courses, and online communities. It’s interesting to see some of this stuff coming together now around the excitement for AI.

[2] Thanks to the C++ podcast ADSP.

[3] The royal we, you know, the editorial, dude.

[4] I know there is proc_macro::TokenStream and the syn crate, I just wanted to do this myself I guess.

[5] Some things about this code I find a bit unpleasing, like the hack where I am using a closure which is invoked immediately to “merge” the two result values of name.try_into()? and shapedef(tokens)?