Lessons learned when writing an IntelliJ plugin grammar

October 21, 2019

Though there is a decent amount of documentation around writing an IntelliJ custom language plugin, there are a lot of subtleties in writing a grammar for it that you may only run into after you've been working on it for a while. This post just tries to lay out some of the things I ran into, in the hope that it might help somebody else in future.

Background

Though I do most things in Python, I do think that ReasonML (hereafter referred to as 'Reason' - that 'ML' is a bit unnecessary to type out) is a very nice language (see also: my previous post). It has good support for compiling to Javascript (in more than one way!) as well as native code and has a good type system that I think just hits the sweet spot in terms of productivity, the only thing really missing being modular implicits in the vein of Scala.

At the time of writing there are still a few holes in the grammar, namely around parsing some of the more exotic type annotations and matching multiline strings. A lot of the code presented in this article might even change further down the line after I run into some problem with the way I've currently implemented it.

This article is just going to go over some of the issues I had to do with getting the (nearly-finished) grammar working in a form that was suitable for consumption from within the plugin code itself (eg, the annotators, reference finder, etc.). There are also some more complex things like fake rules which aren't covered in this article.

Once I have more functionality working, in a future article I will try to go more in depth into how to get the actual IDE features like error annotations, code highlighting, and refactoring working.

Editor plugins

There are currently 3 real options for editing Reason code:

(vscode) Reason language server plugin

There is a language server for Reason (written in Reason) which provides decent language coverage. This is used by the reason-vscode plugin, which is the recommended way of editing Reason code.

This is very good but I feel like it has a few shortcomings - there are some bugs, it fails on the first error in the file, syntax error messages are unhelpful, syntax highlighting can be broken by some fairly basic language constructs, etc.

There are other language server plugins (+syntax highlighting) for other editors like vim, sublime, etc. that can consume this language server.

OCaml vscode plugin

The OCaml support provided by the OCaml and Reason IDE is (in my opinion) occasionally superior to the above plugin - mainly in terms of jumping to definitions, showing type signatures, etc. These are fairly minor things, but this plugin does provide a better experience in my opinion. It is unmaintained in favour of the reason-vscode plugin, however.

Existing Jetbrains IDE plugin

There is an existing Reason/OCaml plugin but I feel like this also falls short in quite a few ways - for example, it only has highlighting for Some and None, not generic variants/polymorphic variants, it seems designed very much to be bucklescript-first and doesn't support native projects, etc.

Another plugin?

So why make another plugin if there are already 3 (really 2, excluding the unmaintained vscode plugin)?

Though the vscode plugin is the 'recommended' way of developing Reason code, I personally don't really like it, mainly because it feels like a slower Sublime Text while not providing much extra power in terms of the editor really 'understanding' the language. As an aside: the language server protocol stuff is very nice, and works much better when used with tabnine. However, fully implementing a language server which supports the really useful things like workspace-wide renaming means writing a plugin which has full semantic knowledge of the language. That can be incredibly difficult, and it is something which the Jetbrains SDK lets you do much more easily with its grammar-kit plugin.

There are a few reasons (no pun intended) why I wanted to start writing a new plugin rather than modifying the existing Intellij plugin:

Wanting to learn the process of writing the plugin from scratch. Not having done language engineering for a few years, I wanted to try writing a plugin which would be written from the ground up, from a grammar file for the language. The existing plugin does not do this and has its own parser implementation.
Learning Kotlin. I had the chance to do some Kotlin on a previous work project and I quite liked it, so being able to do a personal project in it was a good chance to really learn how it works.
Better support for language constructs. As mentioned previously, the existing plugin doesn't provide highlighting options for things like variants.
Cross-language support. One of the goals with this plugin was to be able to transparently understand that Reason is essentially just syntactic sugar over OCaml, and as well as things like refactoring and moving files/variables between files of different types, it should also ideally be able to convert between the two.
First class support for Dune/esy. As well as Bucklescript, it should ideally be able to use these different build systems. Bucklescript would ideally be provided as a 'facet' for existing npm projects as well.
PPX support. The vscode plugin lets you see what a file is like after running a ppx on it - this would be useful if it was like markdown editors in vscode/intellij are where you can see a preview of the ppx-ed code in another pane as you type. Ideally it would also have a 'hover' action over a ppx tag to show what code would be autogenerated from it.

Some other but less important aims for this plugin:

Viewing bytecode and/or native code. Viewing and debugging native code would likely mean using CLion - I'm not an expert, but from what I've read it seems like most of the assembly-level tooling is available in CLion but not in the other IDEs like Intellij.
Going to a definition to/from Javascript when using Bucklescript. This would be very useful, as well as something like being able to autogenerate typescript stub definitions for a Reason library.

Some non-goals:

Reimplement type checking. This will be done just by calling the command line using whatever build system is specified for the project, and parsing any errors which will then be annotated inline in the file.

First steps

Starting the grammar

Essentially, the first thing you have to do for any plugin is to write the BNF grammar for your language. Maybe you get lucky and there is an existing one, but for the purpose of this exercise I decided to start from scratch. Even if you do have an existing one, it will likely require significant modifications to get it to work nicely with the plugin code itself (eg, name refactoring and finding references, or even just pure parsing speed).

There really is no shortcut for this - I found that it was difficult just to get to the point where it could parse something relatively simple like assigning a variable in a way that 'seemed' sensibly laid out so that any autogenerated code would be easy to consume, but once I got over the initial hump adding more complex things like inline JSX expressions or first class module arguments was relatively easy. Consider for example just assigning a variable in Reason:

// Assign a simple value
let greeting = "hello!";

// Assign two values
let newScore = 5 and anotherscore = 20

// Assign a value, with an explicit type annotation
let bl: f  = 5;

// Create a new custom operator <$>
let (<$>) = (x, y) => fmap(x, y);
// Assign this operator to a variable, which will then act like a function
let fmap = (<$>);

// From a function call, destructure a tuple into two variables
let (a,b) = get_tuple();

// A tuple
let a_tuple = (a, b);

// Destructuring, type annotations, etc
let {name: (n: string), age: (a: int)} = somePerson;

There are obviously lots more of these (Reason is an expression based language, so almost anything can be on the right side of the equals sign) but this is just to show some things which made it hard to think of how to write the grammar. For example, the difference between destructuring a tuple or creating a new custom operator, type annotations which depend on whether some brackets are present or not, etc.

There's not much I can write here which isn't covered by the official howto document, but my advice is to look at other language plugins and see how they do things. The one for Haskell is fairly simple but might provide some pointers, a much better example is the one for Rust which is a fairly complicated language with a fairly complicated grammar to go along with it. If you have a similarly complicated language, expect your grammar to be at least 500 lines long. At the time of writing, the Reason grammar is just over 800.

One thing annoying about the documentation is that it doesn't really explain a lot - it provides a brief overview of the EBNF extensions, but not really how to use them. There are also some omissions - for example the howto briefly explains left rules, but if you look in the Rust grammar you will notice that it also uses the upper rule modifier, which seems not to be documented anywhere.

Once you do have the grammar for your whole language and you can generate the parser code from it, you then have to go and look at the generated code and tweak the grammar so it generates something which is usable.

Example 1

For example, this is a fairly obvious way of writing the rule which matches what comes after the switch keyword:

switchAnalyse ::= (
    LPAREN Expr RPAREN
    | constantExpr
    | valIdentifier
    | tupleLiteralExpr
)

It can be something like:

switch (func_call()) { ... }
switch 2 { ... } (admittedly a bit useless, but possible)
switch an_identifier { ... }
switch (a, b) { ... }

There is also the whole case that any expression can be wrapped in parentheses, but then it might be a custom operator depending on what is inside those parentheses, etc. but we'll just ignore that for now (and it is mostly handled by the Expr rule).

This generates this code:

public interface ReasonSwitchAnalyse extends PsiElement {

  @Nullable
  ReasonExpr getExpr();

  @Nullable
  PsiElement getIdentifier();

  @Nullable
  PsiElement getLparen();

  @Nullable
  PsiElement getRparen();

  @Nullable
  PsiElement getUnit();

}

The generated code presents all possible choices, as well as the parentheses, as methods. Unit is also there, which must have come from one of the rules used by the switchAnalyse rule, and which is also technically valid in a switch statement (as it is a constant expression). If you did want to implement type checking for this, you would need to see which one was non-null (Expr, Identifier, or Unit). If it is the Expr that is non-null, it then depends on how your Expr rule is written - you would probably need a big if/else chain to see what kind of Expr it was, then try to figure out what type it was and whether all choices in the switch body were valid, and so on.
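
To make the null checking concrete, here is a minimal Java sketch of what consuming this interface might look like. The PsiElement, ReasonExpr and ReasonSwitchAnalyse types below are cut-down stand-ins so that the snippet compiles without the Intellij SDK (the parenthesis getters are left out), and SwitchSubjects, Kind, kindOf and fake are hypothetical names for illustration, not part of the plugin:

```java
// Stand-ins for the SDK/generated types, so this compiles on its own.
interface PsiElement { String getText(); }
interface ReasonExpr extends PsiElement {}

// Mirrors the nullable getters of the generated ReasonSwitchAnalyse interface.
interface ReasonSwitchAnalyse {
    ReasonExpr getExpr();        // null unless the parenthesised Expr branch matched
    PsiElement getIdentifier();  // null unless an identifier matched
    PsiElement getUnit();        // null unless unit matched
}

final class SwitchSubjects {
    enum Kind { EXPR, IDENTIFIER, UNIT, UNKNOWN }

    // Hypothetical helper: report which alternative of the rule matched.
    static Kind kindOf(ReasonSwitchAnalyse node) {
        if (node.getExpr() != null) return Kind.EXPR;
        if (node.getIdentifier() != null) return Kind.IDENTIFIER;
        if (node.getUnit() != null) return Kind.UNIT;
        return Kind.UNKNOWN; // e.g. an incomplete parse
    }

    // Builds a fake node for the demo; real nodes come from the parser.
    static ReasonSwitchAnalyse fake(ReasonExpr expr, PsiElement id, PsiElement unit) {
        return new ReasonSwitchAnalyse() {
            public ReasonExpr getExpr() { return expr; }
            public PsiElement getIdentifier() { return id; }
            public PsiElement getUnit() { return unit; }
        };
    }

    public static void main(String[] args) {
        PsiElement id = () -> "an_identifier";
        ReasonExpr call = () -> "func_call()";
        System.out.println(kindOf(fake(null, id, null)));   // IDENTIFIER
        System.out.println(kindOf(fake(call, null, null))); // EXPR
    }
}
```

Even in this toy form the caller has to know the order in which to test the getters, and the EXPR case still says nothing about what kind of expression was found.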

This is one reason I decided to only run the most basic syntax/semantics scans over the code and leave the heavy lifting to the existing OCaml toolchain - correctly structuring the grammar would take a lot longer.

Example 2

Another example - modules can be locally opened for one expression, for example to generate a record of a type defined in another module, or just to use a type defined in another module as part of an array. eg:

// A record is defined like 'type t = {key: int, value: string}' in AnotherModule
let record = AnotherModule.{key: 123, value: "abc"}

// A value is defined like 'let a = 1' in AnotherModule
let array = AnotherModule.[a, 2, 3]

// A function is defined like `let a_func = (x) => x + 1` in AnotherModule
let result = AnotherModule.(a_func(2))

Again it makes sense to do it like this:

localModuleOpenExpr ::= moduleAccessor (localOpenRecordExpr | localOpenScopeExpr | localOpenArrayExpr)

localOpenScopeExpr ::= LPAREN Expr RPAREN {pin=1}
localOpenRecordExpr ::= LBRACE createRecordExpr RBRACE {pin=1}
localOpenArrayExpr ::= listLiteralExpr

This generates the code

public interface ReasonLocalModuleOpenExpr extends ReasonExpr {

  @NotNull
  ReasonExpr getExpr();

  @NotNull
  ReasonModuleAccessor getModuleAccessor();

}

Though we know that the Expr is always going to be one of the local open types, it still means we need to have type checks in our IDE code to introspect which type it is, and it still means it's passing around a generic ReasonExpr interface with no information attached to it. Opening a module by itself is not an expression though, so simply by making sure they don't inherit from the root expression rule we can make it a bit nicer:

localModuleOpenExpr ::= moduleAccessor (localScopeOpen | localRecordOpen | localArrayOpen)

localScopeOpen ::= LPAREN Expr RPAREN {pin=1}
localRecordOpen ::= LBRACE createRecordExpr RBRACE {pin=1}
localArrayOpen ::= listLiteralExpr

Which generates:

public interface ReasonLocalModuleOpenExpr extends ReasonExpr {

  @Nullable
  ReasonLocalArrayOpen getLocalArrayOpen();

  @Nullable
  ReasonLocalRecordOpen getLocalRecordOpen();

  @Nullable
  ReasonLocalScopeOpen getLocalScopeOpen();

  @NotNull
  ReasonModuleAccessor getModuleAccessor();

}

But now we have to check which one is not null in some big if/else statement. We could also introduce another interface to disambiguate it a bit:

anyLocalOpen ::= (localRecordOpen | localScopeOpen | localArrayOpen)

// And in the top block:
//    extends(".*Open")=anyLocalOpen

Which maps to:

public interface ReasonLocalModuleOpenExpr extends ReasonExpr {

  @NotNull
  ReasonAnyLocalOpen getAnyLocalOpen();

  @NotNull
  ReasonModuleAccessor getModuleAccessor();

}

So now we have something similar to the first example, a common interface corresponding to any valid local module open, which is a bit more type safe than our first example, but we still need to do checking to see which kind of open it actually is.

Any of these 3 options is perfectly valid, and choosing between them is entirely up to you.
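
As a rough illustration of what consuming the third option might look like, here is a minimal Java sketch. It assumes that the three concrete interfaces end up extending ReasonAnyLocalOpen as a result of the extends attribute. The interfaces below are empty stand-ins rather than the real generated code, and LocalOpens and describe are hypothetical names:

```java
// Empty stand-ins for the generated interfaces (assumed hierarchy).
interface ReasonAnyLocalOpen {}
interface ReasonLocalScopeOpen extends ReasonAnyLocalOpen {}
interface ReasonLocalRecordOpen extends ReasonAnyLocalOpen {}
interface ReasonLocalArrayOpen extends ReasonAnyLocalOpen {}

final class LocalOpens {
    // Hypothetical helper: dispatch on the concrete kind of local open.
    static String describe(ReasonAnyLocalOpen open) {
        if (open instanceof ReasonLocalScopeOpen) return "scope open: Module.(expr)";
        if (open instanceof ReasonLocalRecordOpen) return "record open: Module.{...}";
        if (open instanceof ReasonLocalArrayOpen) return "array open: Module.[...]";
        throw new IllegalArgumentException("unexpected local open: " + open);
    }

    public static void main(String[] args) {
        ReasonAnyLocalOpen open = new ReasonLocalRecordOpen() {};
        System.out.println(describe(open)); // record open: Module.{...}
    }
}
```

The type checks are still there, but they are now confined to values that are known to be some kind of local open rather than an arbitrary ReasonExpr.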

Example 3

These 2 examples are actually fairly simple - there are much more irritating things when you have 'one or more' rules.

try/catch

A try catch in Reason is like this:

try (a_func()){
| Not_found => print("not found")
| Another_exception => print("some other exception")
}

A bad implementation of a try/catch:

tryExpr ::= TRY LPAREN Expr RPAREN LBRACE catchBodyLineExpr+ RBRACE {pin=1}

This corresponds to:

public interface ReasonTryExpr extends ReasonExpr {

  @NotNull
  List<ReasonExpr> getExprList();

  @Nullable
  PsiElement getLbrace();

  @Nullable
  PsiElement getLparen();

  @Nullable
  PsiElement getRbrace();

  @Nullable
  PsiElement getRparen();

  @NotNull
  PsiElement getTry();

}

Everything but the 'exprlist' in this interface is completely useless - and how do you differentiate the expression being 'tried' from the actual lines in the body? This needs some serious rework, or some external helper methods defined to actually let you extract which expression is being 'tried' and what is in the body.
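
For illustration, the kind of external helper this would need might look like the following Java sketch. It relies on the assumption that getExprList() returns the child expressions in source order, so that the first entry is the expression being 'tried' and the rest are the lines of the body. ReasonExpr and ReasonTryExpr here are cut-down stand-ins, and TryExprs, triedExpr and catchLines are hypothetical names:

```java
import java.util.Arrays;
import java.util.List;

// Cut-down stand-ins for the generated interfaces.
interface ReasonExpr { String getText(); }
interface ReasonTryExpr { List<ReasonExpr> getExprList(); }

final class TryExprs {
    // ASSUMPTION: the list is in source order, so index 0 is the tried expression.
    static ReasonExpr triedExpr(ReasonTryExpr tryExpr) {
        List<ReasonExpr> all = tryExpr.getExprList();
        return all.isEmpty() ? null : all.get(0);
    }

    // Everything after the first expression is treated as a line of the body.
    static List<ReasonExpr> catchLines(ReasonTryExpr tryExpr) {
        List<ReasonExpr> all = tryExpr.getExprList();
        return all.isEmpty() ? all : all.subList(1, all.size());
    }

    // Demo-only factory for a fake expression node.
    static ReasonExpr expr(String text) { return () -> text; }

    public static void main(String[] args) {
        ReasonTryExpr tryExpr = () -> Arrays.asList(
                expr("a_func()"),
                expr("| Not_found => print(\"not found\")"),
                expr("| Another_exception => print(\"some other exception\")"));
        System.out.println(triedExpr(tryExpr).getText()); // a_func()
        System.out.println(catchLines(tryExpr).size());   // 2
    }
}
```

Positional helpers like this are fragile: if the tried expression is missing from a half-typed file, the first body line would be mistaken for it, which is a good argument for restructuring the rule instead.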

Class/object members

A type is defined like this:

type tesla = {
  .
  color: string,
};

let obj: tesla = {
  val red = "Red";
  pub color = red;
};

This is how the 'object contents' is defined, with some specific uses of 'private' rules to manipulate the grammar tree created:

// The last one doesn't have to have a semicolon at the end
objContents ::= (createObjMember)+ classOrObjMember?
private createObjMember ::= classOrObjMember SEMI {recoverWhile=memberRecoverRule}


private classOrObjMember ::= (valMember | pubMember | priMember)
valMember ::= VAL ( MUTABLE )? Expr {pin=1}
pubMember ::= PUB Expr {pin=1}
priMember ::= PRI Expr {pin=1}

private memberRecoverRule ::= !(SEMI|RBRACE|PUB|PRI|VAL)

This produces:

public interface ReasonObjContents extends PsiElement {

  @NotNull
  List<ReasonPriMember> getPriMemberList();

  @NotNull
  List<ReasonPubMember> getPubMemberList();

  @NotNull
  List<ReasonValMember> getValMemberList();

}

This is a very easy to use interface - it provides a non null list of public members, private members, and values for the object. It also has a recovery rule to make sure that missing semicolons will be handled gracefully (one of the more irritating problems with the reason language server and Merlin based editor plugins is that syntax errors are often incredibly vague or even point to the wrong line, so this is very useful).

Language specific concerns

There are some language-specific things which will always be a bit of a pain as well. For example, in Reason each file is a module, but also you can define a module inside a file which can be referenced and imported as if it was a file in its own right. There can also be anonymous modules defined inline in functions, and modules can include other modules to bring things into scope (like an 'include' statement in C).

It's also taught me a fair bit about what you can actually do in the language, even if it makes no sense. For example you can just have a raise statement in the top level of a module which will instantly raise an exception, or just define an inline module that is assigned to nothing. Statements which seemingly do not make sense to have on the right side of the 'equals' sign will just return 'unit', eg:

// type of g: ref(int), i.e. a mutable reference to an int
let g = ref(4);

// type of f: unit
let f = g := 5

Custom operators are also a big one - Reason lets you define custom operators with a variety of symbols (!$%&*+\-./:<=>?@^|~) which lets you construct familiar operators like >>=. The != operator is a builtin operator, but in Reason it can be thought of as essentially a custom binary infix operator (there are more complex rules to do with polymorphic equality in Reason but that's outside the scope of this article). We can even redefine it like this:

// Type of (!=) is "('a, 'a) => bool"
let (!=) = (lhs, rhs) => !(lhs == rhs)
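
As a rough sketch of what this means for a lexer or annotator, the following Java snippet checks whether a token is made up purely of the operator symbols listed above. It is a deliberate simplification: it ignores the finer rules about which symbols may start an operator and which combinations are reserved, and CustomOperators and looksLikeOperator are hypothetical names rather than anything from the plugin:

```java
final class CustomOperators {
    // The operator symbols listed above, written without the regex escape.
    private static final String OPERATOR_CHARS = "!$%&*+-./:<=>?@^|~";

    // True if every character of the token is an operator symbol.
    static boolean looksLikeOperator(String token) {
        if (token.isEmpty()) return false;
        for (int i = 0; i < token.length(); i++) {
            if (OPERATOR_CHARS.indexOf(token.charAt(i)) < 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeOperator(">>=")); // true
        System.out.println(looksLikeOperator("<$>")); // true
        System.out.println(looksLikeOperator("fmap")); // false
    }
}
```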

Though doing all the correct type inference/checking/etc is not in scope for this plugin, being able to correctly understand the different ways modules can be used so that renaming/moving/other refactoring can be done on them safely requires structuring the grammar to present this information to the other parts of the plugin in an easy-to-understand manner.

From working on this grammar for the past however long, I would say that the only real way to do this properly is to write the grammar in a way that 'makes sense', check the autogenerated code to make sure it sort of makes sense, and then move onto actually integrating it with the IDE. After you try to actually work out renaming and you realise the PSI tree isn't being generated in a nice way, then go back and edit it some more. Trying to do the stuff with flattening the PSI tree etc. doesn't really work early on because you won't know enough about how you're going to be consuming it or how the IDE even uses the interfaces you give it.

Checking generated PSI tree

Even if you have your grammar fully parsing some example code, it might be generating tokens in a way that isn't easy to use. Take this snippet that parses let expressions:

letExpr ::= (ppxMarkerExpr)? LET innerLetContents (AND innerLetContents)* {pin='LET'}
innerLetContents ::= (REC)? (UNIT | assignableWithType) EQ Expr
private assignableWithType ::= (letLhsParenExpr | useExistingValueExpr | destructuredRecord) (typeHint)?

Then use it to parse this:

let newScore = 5 and anotherscore = 20

And we get:

    ReasonModuleLevelDefinitionImpl(MODULE_LEVEL_DEFINITION)(26,64)
      ReasonLetExprImpl(LET_EXPR)(26,64)
        PsiElement(ReasonTokenType.ReasonTokenType.let)('let')(26,29)
        PsiWhiteSpace(' ')(29,30)
        ReasonUseExistingValueExprImpl(USE_EXISTING_VALUE_EXPR)(30,38)
          PsiElement(ReasonTokenType.ReasonTokenType.IDENTIFIER)('newScore')(30,38)
        PsiWhiteSpace(' ')(38,39)
        PsiElement(ReasonTokenType.ReasonTokenType.=)('=')(39,40)
        PsiWhiteSpace(' ')(40,41)
        ReasonConstantExprImpl(CONSTANT_EXPR)(41,42)
          PsiElement(ReasonTokenType.ReasonTokenType.number)('5')(41,42)
        PsiWhiteSpace(' ')(42,43)
        PsiElement(ReasonTokenType.ReasonTokenType.and)('and')(43,46)
        PsiWhiteSpace(' ')(46,47)
        ReasonUseExistingValueExprImpl(USE_EXISTING_VALUE_EXPR)(47,59)
          PsiElement(ReasonTokenType.ReasonTokenType.IDENTIFIER)('anotherscore')(47,59)
        PsiWhiteSpace(' ')(59,60)
        PsiElement(ReasonTokenType.ReasonTokenType.=)('=')(60,61)
        PsiWhiteSpace(' ')(61,62)
        ReasonConstantExprImpl(CONSTANT_EXPR)(62,64)
          PsiElement(ReasonTokenType.ReasonTokenType.number)('20')(62,64)

Note that the two definitions are smushed together - there is no separation between the newScore = 5 and anotherscore = 20 in the generated tree! This would be a nightmare to try and consume. This is partially because it misses the point that you can give a type hint to UNIT (hint: it is of type 'unit'), and mixing this with the private assignableWithType is generating a very ugly PSI tree. Fixing the bnf (and giving some things better names):

letExpr ::= (ppxMarkerExpr)? LET letAssignment (AND letAssignment)* {pin='LET'}
letAssignment ::= (REC)? letAssignedTo EQ Expr
letAssignedTo ::= (UNIT | letLhsParenExpr | valIdentifier | destructuredRecord) (typeHint)?

Now our PSI tree is:

    ReasonModuleLevelDefinitionImpl(MODULE_LEVEL_DEFINITION)(26,64)
      ReasonLetExprImpl(LET_EXPR)(26,64)
        PsiElement(ReasonTokenType.ReasonTokenType.let)('let')(26,29)
        PsiWhiteSpace(' ')(29,30)
        ReasonLetAssignmentImpl(LET_ASSIGNMENT)(30,42)
          ReasonLetAssignedToImpl(LET_ASSIGNED_TO)(30,38)
            PsiElement(ReasonTokenType.ReasonTokenType.IDENTIFIER)('newScore')(30,38)
          PsiWhiteSpace(' ')(38,39)
          PsiElement(ReasonTokenType.ReasonTokenType.=)('=')(39,40)
          PsiWhiteSpace(' ')(40,41)
          ReasonConstantExprImpl(CONSTANT_EXPR)(41,42)
            PsiElement(ReasonTokenType.ReasonTokenType.number)('5')(41,42)
        PsiWhiteSpace(' ')(42,43)
        PsiElement(ReasonTokenType.ReasonTokenType.and)('and')(43,46)
        PsiWhiteSpace(' ')(46,47)
        ReasonLetAssignmentImpl(LET_ASSIGNMENT)(47,64)
          ReasonLetAssignedToImpl(LET_ASSIGNED_TO)(47,59)
            PsiElement(ReasonTokenType.ReasonTokenType.IDENTIFIER)('anotherscore')(47,59)
          PsiWhiteSpace(' ')(59,60)
          PsiElement(ReasonTokenType.ReasonTokenType.=)('=')(60,61)
          PsiWhiteSpace(' ')(61,62)
          ReasonConstantExprImpl(CONSTANT_EXPR)(62,64)
            PsiElement(ReasonTokenType.ReasonTokenType.number)('20')(62,64)

Now we have the 'let expr', with 2 'let assignments' inside it, as we would expect. This does of course make the tree deeper, but in this case it is far easier to use and understand.
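
To show why this shape is easier to consume, here is a minimal Java sketch that walks the assignments of a let expression. It assumes the generated interfaces would expose getLetAssignmentList(), getLetAssignedTo() and getExpr(), following the naming pattern of the generated code shown earlier. The interfaces below are stand-ins so the snippet compiles on its own, and LetBindings, describe and assignment are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-ins for the SDK/generated types (getter names are assumptions).
interface PsiElement { String getText(); }
interface ReasonLetAssignment {
    PsiElement getLetAssignedTo();
    PsiElement getExpr();
}
interface ReasonLetExpr { List<ReasonLetAssignment> getLetAssignmentList(); }

final class LetBindings {
    // One "lhs = rhs" string per assignment, with no positional guessing needed.
    static List<String> describe(ReasonLetExpr letExpr) {
        List<String> out = new ArrayList<>();
        for (ReasonLetAssignment a : letExpr.getLetAssignmentList()) {
            out.add(a.getLetAssignedTo().getText() + " = " + a.getExpr().getText());
        }
        return out;
    }

    // Demo-only factory for a fake assignment node.
    static ReasonLetAssignment assignment(String lhs, String rhs) {
        return new ReasonLetAssignment() {
            public PsiElement getLetAssignedTo() { return () -> lhs; }
            public PsiElement getExpr() { return () -> rhs; }
        };
    }

    public static void main(String[] args) {
        ReasonLetExpr let = () -> Arrays.asList(
                assignment("newScore", "5"),
                assignment("anotherscore", "20"));
        System.out.println(describe(let)); // [newScore = 5, anotherscore = 20]
    }
}
```

With the original flat tree, pairing each name with its value would have meant counting tokens between the 'and' keywords.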

One thing that should be obvious from all these examples is the importance of knowing exactly when and where to use (or not to use) private rules to get the PSI tree to look how you want it to.

Next steps

As mentioned in the introductory section, the grammar itself still can't parse everything in the language, so just getting 100% language coverage is the first thing to do.
Integrating with the existing build systems (dune, bucklescript) and parsing errors to show to the user in a nice way.
Copying + modifying the grammar to parse OCaml in a way that reuses the same basic building blocks and Java interfaces so they can be used in a shared-source project (or just being able to go to an OCaml definition from Reason).
Integration with debugger - JS debugger and/or OCaml native debugger?
Integration with other languages - embedded JS blocks in bucklescript, going to typescript definition from Reason, etc.