Array Language part 3: Semantic actions, ANF

02.21.2025

Previously¹, I described an AST data structure where we have the sum-type Expr where each variant of the type (Expr::Let(node), Expr::Term(node), Expr::Literal(node), etc) holds a single field: a unique node struct type generated by our macros. Each field is actually a NodeRef<LetNode> where noderef is an alias for some contianer type. This is really not the best way to go about it, and it sort of defeats the whole purpose of using the sum type in the first place, and makes it harder to reason about the fields of each variant. I’ve known this for a while, but I’ve put off improving it, perhaps for my own stubborness and perhaps for my own laziness. In any case, we now have a good reason to change it, because, in order to aid further compiler passes, we need some way to “annotate” our AST with semantic information. I can move the field of each variant into the enum structure, and move the indirection outside of the enum, creating a more generic ExprNode struct that will hold an Expr as well as some additional information.

Here is the refactored AST defining macro:

pub type NodeRef<Node> = Box<Node>;

#[macro_export]
macro_rules! ast {
    ($(
        $base_name:ident {
         $($variant:ident {
             $($childname:ident: $childtype:ty),*
         })|+
    })*
    ) => {


        $(
        paste::paste!{
        #[derive(Debug, Clone, PartialEq)]
        pub struct [< $base_name Node >]<T> where T: PartialEq {
            pub data: $base_name<T>,
            pub note: T
        }

        pub type [< $base_name Ref >]<T:PartialEq> = NodeRef<[< $base_name Node >]<T>>;
        }

        // Enums define the parse tree from grammar
        define_enum!($base_name {
         $($variant {
             $($childname: $childtype),*
         }),+});

        define_constructors!($base_name {
         $($variant {
             $($childname: $childtype),*
         }),+});
        )*
    }
}
#[macro_export]
macro_rules! define_enum {
    ($base_name:ident {
         $($variant:ident {
             $($childname:ident: $childtype:ty),*
         }),+
    }
    ) => {
        #[derive(Debug, Clone, PartialEq)]
        pub enum $base_name<T> where T: PartialEq {
            $(
                $variant { $($childname: $childtype),* }
            ),+
        }
    }
}

#[macro_export]
macro_rules! define_constructors {
    ($base_name:ident {
         $($variant:ident {
             $($childname:ident: $childtype:ty),*
         }),+
    }
    ) => {
        $(
            paste::paste!{
                // Gives a shorthand syntax for constructing node
                pub fn [< $variant:snake:lower _node >]<T:Default + PartialEq>($($childname : $childtype),*) -> [<$base_name Node>]<T> {
                    [<$base_name Node>] {
                        data: $base_name::<T>::$variant{ $($childname,)* },
                        note: T::default()
                    }
                }
            }
        )+
    };
}
pub use ast;
pub use define_constructors;
pub use define_enum;

// --------------------------------------------------------
// AST productions for the lang of format:
// meta-symbol {
//      production { attributes... }
//      ...
//      }
ast! {
    Stmt {
          Expr { expression: NodeRef<ExprNode<T>> }
        | Print { expression: NodeRef<ExprNode<T>> }
        | Unit {}
    }
    Expr {
          FnDef { signature: FuncSignature, body: NodeRef<ExprNode<T>> }
        | FnCall { callee: Callee<T>, args: Vec<NodeRef<ExprNode<T>>> }
        | Frame { expression_list: Vec<NodeRef<ExprNode<T>>> }
        | AtomLiteral { value: Literal }
        | ArrayLiteral { value: Vec<Literal> }
        | Let { name:Name, initializer: NodeRef<ExprNode<T>>, region: Option<NodeRef<ExprNode<T>>> }
        | Term { name: Name }
        // | Condition
    }
}

Our Expr and Stmt meta-symbols are now generic with some type T. This type will store information such as the source code location of expressions, and later on it will store type analysis data. Initially, it will store just a Location. Previously, I was only storing a Location on identifiers and literals. This is a lot better.

#[derive(derive_more::Display, Debug, Eq, PartialEq, Copy, Clone)]
pub enum Location {
    Unknown,
    Line(usize),
    #[display("{_0}:{_1}")]
    LineCol(usize, usize)
}

impl Default for Location {
    fn default() -> Self {
        Self::Unknown
    }
}

I needed to refactor the parser as well. Since I created the constructor macro, it’s not such a tough job, just changing the function signatures to return ExprNode<Location>. Locations can be attached to the AST during parsing now as well:

impl ExprNode<Location> {
    pub fn with_location(mut self, loc: Location) -> ExprNode<Location> {
        self.note = loc;
        return self;
    }
}

A-Normal Form

One of the great benefits designing a functional programming language is that we can achieve greater simplicity by making simple transformations on our AST. A-Normal Form will allow us to reduce the complexity of the abstract syntax and put it into a form that is more amenable to sequential computation. It means decomposing compound expression so that each operation is bound separately to a unique binding, with operations explicitly ordered by precedence (which is already specified by syntax). In this way it is more similar to the structure of SSA form code, while preserving the more information about the scope and structure of the code.²

Since the goal of our language is to allow for uncomplicated expression of parallelism, moving to this form will be useful because it will aid in identifying the places where parallelism is applicable. Rather, it will allow us to more easily express the sequential data-flow dimension of computation across which our array analysis can be applied. That’s the idea, at least.

Some examples in the language:

// Source
fn diff_square(x, y) {
    x*x - y*y
}
// ANF
let diff_square = |x,y| {
    let tmp1 = x*x in
    let tmp2 = y*y in 
    tmp1*tmp2
}

// source
fn foo(x, y) { (x + y) * [2 2] }
// ANF
let foo = |x, y| {
    let tmp1 = x+y in 
    tmp1 * [2 2]
}

//source
f(g)(h(x),3)
//ANF
let tmp1 = f(g) in 
let tmp2 = h(x) in
tmp1(tmp2, 3)

One last example to demonstrate:

let x = 12.4
f(3.8, [g(x), h(x-7.1)])/10.0
//ANF
let x = 12.4 in
let tmp1 = g(x) in
let tmp2 = x - 7.1 in
let tmp3 = h(tmp2) in
let tmp4 = f(3.8, [tmp1, tmp3]) in 
tmp4/10.0

Notice how, in the last example, I am considering the frame expression [tmp1, tmp3] to be an immediate value in A-Normal form. This is possible because we consider the framing (or rather, re-framing) of the two values to be type information about the layout of data in memory, rather than an operation. It may well be a interpreted as a data-copy operation, but it should be possible to such a copy this with further compiler passes.

Tagging

The first step to implement this is to add another piece of data to the tree which represents a unique identifier that can be used as need-be for when we rearrange the ordering of operations to conform with ANF. We will use these to ensure that the created symbols are This requires writing a compiler pass which increments a counter on each node. This, as well could be implemented in-place as a part of parsing (and it would be more efficient, too) but I give it here in the functional style³.

impl<T> ExprNode<T> where T: PartialEq {
    // A simple compiler pass which adds unique numeric tags to each node, used in later passes
    pub fn tag(self) -> ExprNode<(T, NodeTag)> {
        fn help<T:PartialEq>(e: ExprNode<T>, cur: NodeTag) -> (ExprNode<(T, NodeTag)>, NodeTag) {
            match e.data {
                ..
                Expr::FnCall { callee, args } => {
                    let (callee, mut next_tag) = match callee {
                        Callee::Primitive(pft) => (Callee::Primitive(pft), cur),
                        Callee::Expression(expr) => {
                            let (call_e, next_tag) = help(*expr, cur);
                            (Callee::Expression(call_e.into()), next_tag)
                        }
                    };
                    let args = args
                        .into_iter()
                        .map(|a| {
                            let (arg_e, tag) = help(*a, next_tag);
                            next_tag = tag;
                            Box::new(arg_e)
                        })
                        .collect_vec();
                    (
                        ExprNode {
                            data: Expr::FnCall { callee, args },
                            note: (e.note, next_tag),
                        },
                        next_tag + 1,
                    )
                }
        }
        let (tag_e, _) = help(self, 0);
        return tag_e;
    }
}

Checking the form

Before write the pass, I’m writing the pass to verify wether sub-expressions are in ANF or not. In our case, we will define the “immediate” as any expression that is either a literal or identifier³ as well as function definitions. We include the function definition, because it we can consider a function definition as representing an address in code, and thus an immediate value with no side effects.

trait IsImmediate {
    fn is_immediate(&self) -> bool;
}

impl<T> IsImmediate for ast::Expr<T>
where
    T: PartialEq,
{
    fn is_immediate(&self) -> bool {
        matches!(
            self,
            Self::Term { .. }
                | Self::AtomLiteral { .. }
                | Self::ArrayLiteral { .. }
                | Self::FnDef { .. }
        )
    }
}

trait IsANF {
    fn is_anf(&self) -> bool;
}

impl<T> IsANF for ast::Expr<T>
where
    T: PartialEq,
{
    fn is_anf(&self) -> bool {
        match self {
            Self::Let {
                name: _,
                initializer,
                region,
            } => initializer.data.is_anf() && region.as_ref().is_some_and(|e| e.data.is_anf()),
            Self::FnCall { callee: _, args } => args.iter().all(|a| a.data.is_anf()),
            Self::Frame { expression_list } => expression_list.iter().all(|e| e.data.is_anf()),
            _ => self.is_immediate(),
        }
    }
}

Implementing the Pass

As stated earlier, the purpose of this pass is to to move compound operations into separate bindings. For each computation and each abstraction, one binding. That is the core function of the pass so I wrote this simple helper function that does that thing.

// Converts a list of binding pairs to a nested let-expression
fn expression_in_assignments<T, I>(e: ast::ExprNode<T>, bindings: I) -> ast::ExprNode<T>
where
    I: DoubleEndedIterator<Item = (Name, ExprNode<T>)>,
    T: PartialEq + Default,
{
    bindings
        .into_iter()
        .rev()
        .fold(e, |acc, (arg_name, arg_val)| {
            ast::let_node(arg_name, arg_val.into(), Some(acc.into()))
        })
}

This way, we can produce an iterator of bindings, which will be much easier than trying to juggle the nested structure of the let-Exprs from the inside out (trust me, I tried it this way first.)

There is really only one case that is important to consider, and that is the FnCall nodes. You can think of it like this: one new binding per FnCall. The first step is to create this function which pulls the assignments out of the nested structure accordingly.

// Returns the expression along with all the context needed to for the expression. 
// TODO: preserve location information for error reporting
fn assignment_list<'a>(
    e: ExprNode<(Location, NodeTag)>,
) -> (
    ExprNode<(Location, NodeTag)>,
    Box<dyn DoubleEndedIterator<Item = (Name, ExprNode<(Location, NodeTag)>)> + 'a>,
) {
    match e.data {
        Expr::FnCall { callee, args } => {
            let (callee, context): (_, Box<dyn DoubleEndedIterator<Item = _>>) = match callee {
                ast::Callee::Primitive(_) => (callee, Box::new(std::iter::empty())),
                ast::Callee::Expression(e) => {
                    let (e, context) = assignment_list(*e);
                    (
                        ast::Callee::Expression(e.into()),
                        Box::new(context.into_iter()),
                    )
                }
            };
            let (args, args_context): (Vec<_>, Vec<_>) = args
                .into_iter()
                .map(|e| {
                    let (e, context) = assignment_list(*e);
                    (Box::new(e), context)
                })
                .unzip();
            let tmp = Name::from(&*format!("f{}", e.extra.1));
            (
                ast::term_node(tmp.clone()),
                Box::new(
                    context
                        .chain(args_context.into_iter().flatten())
                        .chain(std::iter::once((tmp, fn_call_node(callee, args)))),
                ),
            )
        }
        ..
        immediate!() => (e, Box::new(std::iter::empty())),
    }
}

What you see above is the main outline for the pass. The expression_in_assignments now can be used to convert this to a single nested expession in A-Normal form.

Previously, I had been working on and writing about the implementation of the interpreter. I’m moving on from that for now, because I am interested in developing the next compiler passes necessary to generate code, rather than delving into the hair-pulling details of trying to get closures to work in the interpreter (it’s quite difficult with Rust’s memory model)

I have been calling these identifiers in the language by the name Term, which makes sense to me. However, do not confuse this with the type theory notion of term, which represents a thing-to-which-type-judgement-may-be-applied — that is, every expression result.

That’s the beauty of Rust, we can write our compiler passes with immutability, and come back later to optimize the code by modifying in-place where possible.