What I learned while creating a language parser and an AST explorer

2023-09-08 . 3 min read

Introduction

The original language is called Lox by Bob Nystrom which he wrote a book about. I ported that to TypeScript along with the idea of creating a UI similar to Astexplorer to visualize the Syntax Tree. The fun part about it is that I made the language keywords flexible which means that the user can define their own keywords. You can try it out here. In this blog, I want to summarize the things that I learned in this process.

Language Parsing

Parsing in general can be thought of taking in something raw and trying to make sense of it. We do it all the time when we talk. Even now your brain is parsing and trying to make sense of the words and sentences you are reading. Every language has a set of vocabulary and grammar that creates a set of rules to use that vocabulary just like how a simple sentence in the English language is of the format Sub + Verb + Obj. When we write a program, we write a sequence of words that abide by the language's grammar and later feed it into the interpreter or compiler. For anything other than the compiler it is just a text file or a sequence of strings just like how our brain is an interpreter of our native language and when we hear an unknown foreign language we can't make sense of it. The same is true for a parser, when an HTML parser sees a " < " it knows that it is the starting of a tag and expects a " > ", slly. when a JS parser sees the word "let" it understands that is the starting of a variable declaration and expects a declaration statement after that. The compiler first identifies the words and defines them as a token. A token is a richer version of the individual word that the compiler recognizes. In tox, when we write a simple function:

fun sum(a, b) {
	return a + b;
}

Here's how each token looks.

Then the parsing phase starts where the program tries to make sense of the sequence of the tokens. If someone came to you and asked "What time is it?", we would look at our watch and tell the person the time but what if the person asked "it what ? time is" we would probably get confused for a while. The parsing technique used to create the language is famously known as Recursive Descent parsing. It is a top-down parsing technique and before talking about what it means, let's look at a simple syntax tree for a variable declaration statement var sum = 4 + 5.

When a top-down parser creates the syntax tree, it sees the 'var' token and creates the root var node, and searches for the variable name and initializer to create the leaf node, so in a way, the parser is creating the tree top-down. Also, the tree shown is only 2 levels deep, but it can go down many levels, to achieve this the parser recursively moves down the tree respecting the precedence and associativity.

Visitor Pattern

In the project, the Interpreter is a visitor. Visitor pattern separates the data from its usage. The syntax tree is the data structure holding the information about the program and the visitor uses the tree. The syntax tree can be used in many ways; we can interpret the program, transpile it to a completely different language, change part of the program, do a static analysis, etc etc. This is the structure that tools like Babel, Eslint, and prettier work with. Babel on top of it provides us with a plugin system where we can create our own visitor and manipulate the program. To implement the visitor pattern first we create a behavior on the data itself so that it can accept a visitor. Each node has a accept method attached to it which accepts a visitor and calls the required method in it.

Accep_Method_On_BinaryExpression_Node

accept<T>(visitor: ExprVisitor<T>): T {
    return visitor.visitBinaryExpr(this)
}

ExpressionVisitorInterface

interface ExprVisitor<T> {
    visitBinaryExpr(expr: Binary): T
    visitUnaryExpr(expr: Unary): T
    visitGroupingExpr(expr: Grouping): T
    visitLiteralExpr(expr: Literal): T
    visitLogicalExpr(expr: Logical): T
    visitTernaryExpression(expr: Ternary): T
    visitVariableExpression(expr: variable): T
    visitAssignmentExpression(expr: Assignment): T
    visitCallExpression(expr: CallExpr): T
}

Then we create the visitor that implements the interface.

Visitor

class Visitor implements ExprVisitor<unknown> {
    constructor() {}

    visitBinaryExpr(stmt: Binary): unknown{ ...implementation... }
    visitLogicalExpr(expr: Logical): unknown { ...implementation... }
    visitGroupingExpr(expr: Grouping): unknown {...implementation...}
    visitLiteralExpr(expr: Literal): unknown { ...implementation... }
    visitUnaryExpr(expr: Unary): unknown { ...implementation... }
    visitVariableExpression(expr: variable) { ...implementation... }
    visitCallExpression(expr: CallExpr): unknown { ...implementation... }
    visitTernaryExpression(expr: Ternary): unknown { ...implementation... }
    visitAssignmentExpression(expr: Assignment): unknown { ...implementation... }

}

Closure Implementation

Think of environments, it's where all the declarations reside and each block has its own environment. Now, how can we access the variables in the outer scope? Just store a reference to the outer scope in the inner scope and you got it. Take the following example.

Block_And_Environments

// Global Environment
let age = 10
let dummy = 1
// New Block - Environment A
{
   // This block has its own environment
   let dummy = 2 // local dummy
   console.log(dummy, age)
   {
      // New Block - Environment B
      let dummy = 3 // Inner local dummy
      console.log(dummy, age)
   }
}

Here's how the environment chain would look like

For block scopes, accessing the variables in the outer environment is fairly straightforward but a function is declared in one place and called in another. For them to be able to access the outer scope they need to store their closing environment when the function is created and use that environment when called. We need a definition for a Function's structure. So, let's take the following example.

Closure

// In Module A
const ten = 10

export function add10(num) {
    return num + ten
}

// In Module B

add10(5)

The function structure internally holds the reference to the enclosing environment and even when a function is called elsewhere when the interpreter finds a call expression, a new environment is created for the function block which references the declaration's enclosing environment, and wallah you get closure.

Recursive rendering

I had to visualize the tree that the parser had generated. I was using react and my first naive approach was to create a component for each node type but that approach would have been very ugly and very inefficient to write. Since, we're working with trees and have been recursing our way down it till now, why not render recursively? The idea is inspired by the implementation of the tree visualization of Astexplorer. Each node of the tree is either another node or a primitive. With this in mind, we can create a recursive pattern of visualization and all I had to do was use the component once by passing the syntax tree to it without any worry about the node type and its contents which means even if the shape if the tree node changes, the UI wouldn't break.

TreeVisualizer

<Block name={tree.type} node={tree} openstate="open">
    <Element node={tree} />
</Block>

Resources

Crafting Interpreters

Hey I assume you finished reading, I would love to know your feedback or if found any error or mistake in this blog post, please do not hesitate to reach out to me.