How to write a linter using tree-sitter in an hour

Tree-sitter queries
Writing the linter
Final thoughts and challenges

This article was discussed on Hacker News.

This is a continuation of my last post on how to write a tree-sitter grammar in an afternoon. Building on the grammar we wrote, now we’re going to write a linter for Imp, and it’s even easier! The final result clocks in less than 60 SLOC and can be found here.

Recall that tree-sitter is an incremental parser generator. That is, you give it a description of the grammar of your programming language and it spits out a parser in C that creates a syntax tree based on the rules you specified. What’s notable about tree sitter is that it is resilient in the presence of syntax errors, and it being incremental means the parser is fast enough to reparse the file on every keystroke, only changing the parts of the tree as needed.

Specifically, we’ll write a program that suggests simplification of assignments and some conditional constructs. First I’ll describe the tree-sitter query language with some examples, then show how a little bit of JavaScript can let us manipulate the results programmatically. You can get the code in this post here. Ready? Set? Go!

Note: There are many language bindings that let you work with tree-sitter parsers using the respective language’s FFI. I’ve used only two to date, the Rust and the JavaScript bindings, and from my brief experience, the JavaScript bindings are much more usable. When using the Rust bindings the lifetime and mutability restrictions make abstraction more difficult, especially for a non-critical program such as a linter.

Tree-sitter queries

Tree-sitter has a built-in query language that lets you write queries to match parts of the AST of interest. Think of it as pattern matching, but you don’t need to handle every case of a syntactical construct.

Query syntax

Tree-sitter queries are written as a series of one or more patterns in an S-expression syntax. We first match on a node’s type (corresponding to a name of a node in the grammar file), then possibly the types of the children of the node as well. After each pattern, write @m (or any other valid variable name) so you can refer to the matched node later.

Our running example will be some Python code.

def factorial(n):
  return 1 if n == 0 else (n * (1 * 1)) * factorial(n - 1)

Let’s match all expressions involving binary operators.

(binary_operator) @m

def factorial(n):
  return 1 if n == 0 else (n * (1 * 1)) * factorial(n - 1)

Tree-sitter lets us specify what the children should be. So we can match all binary expressions involving at least one integer:

(binary_operator (integer)) @m

def factorial(n):
  return 1 if n == 0 else (n * (1 * 1)) * factorial(n - 1)

Or match all binary expressions involving two integers:

(binary_operator (integer) (integer)) @m

def factorial(n):
  return 1 if n == 0 else (n * (1 * 1)) * factorial(n - 1)

Try playing around with queries in the playground.

Capturing nodes

You can also assign capture names to nodes that you match, letting you refer to them later by name. This is useful because in the running example, suppose we wanted to capture the left and right integer arguments to a binary operator, labeling them a and b respectively. Then our query would look like this, and tree-sitter would highlight the matches accordingly.

(binary_operator (integer) @a (integer) @b) @m

def factorial(n):
  return 1 if n == 0 else (n * (1 * 1)) * factorial(n - 1)

Predicates

The tree-sitter query language also lets you specify additional constraints on matches. For instance, we can match on binary expressions where the left-hand side is n, which now gets highlighted in blue. The underscore _ lets us match any node.

((binary_operator _ @a _ @b) (#eq? @a n)) @m

def factorial(n):
  return 1 if n == 0 else (n * (1 * 1)) * factorial(n - 1)

Writing the linter

Now we have the basic parts out of the way, we can get to writing a linter! Instead of Python, we’ll continue working with Imp. Note that it’s easy to adapt this linter for any language with a tree-sitter grammar. Imp also has a much simpler semantics than Python so we can just focus on “obviously correct” lints rather than worry about suggestions changing program behavior.

We can start with a basic package.json:

{
  "name": "imp-lint",
  "type": "module",
  "version": "1.0.0",
  "description": "Linter for Imp",
  "main": "index.js",
  "scripts": {
    "lint": "node index.js"
  },
  "author": "Ben Siraphob",
  "license": "MIT",
  "devDependencies": {
    "tree-sitter": "^0.20.0",
    "tree-sitter-imp": "github:siraben/tree-sitter-imp"
  }
}

Then npm install to install the dependencies. We’ll write our code in index.js then we can call our linter by running npm run lint <file>.

Imports and setup

Nothing fancy here, just the Parser class from the tree-sitter library and our language definition Imp (discussed in my last blog post), and a library to read from the filesystem.

import Parser from "tree-sitter";
import Imp from "tree-sitter-imp";
import { readFileSync } from "fs";

const { Query } = Parser;
const parser = new Parser();
parser.setLanguage(Imp);

const args = process.argv.slice(2);

if (args.length != 1) {
  console.error("Usage: npm run lint <file to lint>");
  process.exit(1);
}

// Load the file passed as an argument
const sourceCode = readFileSync(args[0], "utf8");

Creating the parse tree

We then create the parser, set the language to Imp and run the parser on our source code to get out a syntax tree.

const parser = new Parser();
parser.setLanguage(Imp);

// Load the file passed as an argument
const tree = parser.parse(sourceCode);

If we have the following file:

x := x + 1

The corresponding output from console.log(tree.rootNode.toString()) would be:

(program (stmt (asgn name: (id) (plus (id) (num)))))

Identifying queries of interest

That was some preliminary work. Now let’s see what queries would be interesting to run over more realistic Imp programs. Say we have:

z := x;
y := 1;
y := y;
while ~(z = 0) do
  y := y * z;
  z := z - 1;
  x := x;
end;
x := x;
if x = y then x := 1 else x := 1 end

There’s some redundancies for sure! We can tell the user about assignments such as x := x which are a no-op, and that last if statement certainly looks redundant since both branches are the same statement.

It’s simple to create a Query object in JavaScript and run it over the root node.

const redundantQuery = new Query(
  Imp,
  "((asgn name: (id) @left _ @right) (#eq? @left @right)) @redundantAsgn"
);

console.log(redundantQuery.captures(tree.rootNode));

This is what we get:

[
  {
    name: 'redundantAsgn',
    node: AsgnNode {
      type: asgn,
      startPosition: {row: 2, column: 0},
      endPosition: {row: 2, column: 6},
      childCount: 3,
    }
  },
  {
    name: 'left',
    node: IdNode {
      type: id,
      startPosition: {row: 2, column: 0},
      endPosition: {row: 2, column: 1},
      childCount: 0,
    }
  },
  // etc...
]

Ok, that’s a lot of detail! Notice that every capture name was reported along with what type of node matched and the start and end of the match. Some tools might want this information, but for us it’s enough to report only the start of the match and the text that the match corresponded to:

// Given a raw list of captures, extract the row, column and text.
function formatCaptures(tree, captures) {
  return captures.map((c) => {
    const node = c.node;
    delete c.node;
    c.text = tree.getText(node);
    c.row = node.startPosition.row;
    c.column = node.startPosition.column;
    return c;
  });
}

Now we get something more concise:

[
  { name: 'redundantAsgn', text: 'y := y', row: 2, column: 0 },
  { name: 'left', text: 'y', row: 2, column: 0 },
  { name: 'right', text: 'y', row: 2, column: 5 },
  { name: 'redundantAsgn', text: 'x := x', row: 6, column: 2 },
  { name: 'left', text: 'x', row: 6, column: 2 },
  { name: 'right', text: 'x', row: 6, column: 7 },
  { name: 'redundantAsgn', text: 'x := x', row: 8, column: 0 },
  { name: 'left', text: 'x', row: 8, column: 0 },
  { name: 'right', text: 'x', row: 8, column: 5 }
]

And of course, it’s trivial to filter out the captures corresponding to a given name:

// Get the captures corresponding to a capture name
function capturesByName(tree, query, name) {
  return formatCaptures(
    tree,
    query.captures(tree.rootNode).filter((x) => x.name == name)
  ).map((x) => {
    delete x.name;
    return x;
  });
}

Passing tree, redundantQuery and "redundantAsgn" to capturesByName, we get:

[
  { text: 'y := y', row: 2, column: 0 },
  { text: 'x := x', row: 6, column: 2 },
  { text: 'x := x', row: 8, column: 0 }
]

Now you can process these objects however you like. Note that tree-sitter uses zero-based indexing for the rows and columns, and you might want to offset it by one so users can locate it in their text editor. Here’s a simple approach:

// Lint the tree with a given message, query and match name
function lint(tree, msg, query, name) {
  console.log(msg);
  console.log(capturesByName(tree, query, name));
}

lint(tree, "Redundant assignments:", redundantQuery, "redundantAsgn");

We get the output:

Redundant assignments:
[
  { text: 'y := y', row: 2, column: 0 },
  { text: 'x := x', row: 6, column: 2 },
  { text: 'x := x', row: 8, column: 0 }
]

As a bonus, we can reuse our existing code for new queries! Here’s a couple:

Redundant if

((if condition: _ @c consequent: _ @l alternative: _ @r)
 (#eq? @l @r)) @redundantIf

Addition with 0
```
((plus (num) @n) (#eq? @n 0)) @addzero
```

Here are some exercises to try:

Recognize while loops containing only a skip statement
Recognize constant arithmetic and boolean expressions
Recognize unused variables (those that only appear on the left-hand side of an assignment)

Final thoughts and challenges

To appreciate it more, think about what we would have done had we not used tree-sitter. The process might have gone something like this:

Write the parser and generate an AST in a given language or parser generator
Annotate the AST with location information from the source file
Traverse the AST looking for matches of interest
Report them to the user

Note that there are several steps were things could go wrong or block us later. If we wrote the parser, say in Haskell using megaparsec, we would have not been able to recover the rows and columns of the syntax elements (or painfully write an abstract data type with annotations). And even worse, what happens when the user supplies syntactically invalid input? Some parser generators based on GLR parsing such as Bison allow for error recovery, but then we’d need to define a custom error token and come up with ad-hoc logic for dealing with it.

Tree-sitter separates these design choices into orthogonal ones. A tree-sitter grammar is easy to write and reusable in any language with a C FFI. The error recovery logic is pervasive yet unwritten, and the resulting AST is annotated with locations and can be easily pattern-matched over with queries.

Should we throw tree-sitter at every problem involving parsing? No! There are certainly some areas where we need syntax trees without error nodes, and sometimes the incremental parsing is not necessary. For instance, if we’re working with a build farm, we don’t want to build package definitions with syntax errors!

Beyond linting, tree-sitter has also found applications in GitHub’s search-based code navigation which also makes use of the query language to annotate the AST with tags.