Lessons learned when writing an Intellij plugin lexer

October 29, 2019

Michael Boulton

The Intellij Grammar-Kit plugin allows you to autogenerate a rudimentary lexer from your grammar, but once you've done that, the lexer needs some minor modifications before it can handle the more complicated structures of the language.

Background

For this, I'm assuming that the grammar from the first post is all finished and working. Running the "Generate JFlex Lexer" action from inside Intellij with this will generate a simple lexer which can match simple tokens and return the correct element types, but it can't do things like matching block comments.

Some useful background reading on this:

  • https://github.com/JetBrains/intellij-deps-jflex/blob/master/docs/md/example.md
  • https://jflex.de/manual.html

The simple lexer

When you first generate the lexer it will have some Java imports at the top, some token regexes (if you defined any in the grammar), and the matching rules for the 'initial' state of the lexer:


%%
<YYINITIAL> {
  {WHITE_SPACE}              { return WHITE_SPACE; }

  "and"                      { return AND; }
  "as"                       { return AS; }
  "class"                    { return CLASS_KW; }
  "constraint"               { return CONSTRAINT; }
  "downto"                   { return DOWNTO; }
  ...
  {FLOAT}                    { return FLOAT; }
  {CHAR}                     { return CHAR; }
  {STRING}                   { return STRING; }
}

This will match most of the language, but there are some complications in Reason to take into account.

Improving the lexer

Block comments and docblocks

Like Java, Reason has documentation blocks (delimited by /** and */) and block comments (the same, but starting with /*). The problem is that these cannot be matched in the parser - the parser only defines the language constructs, and in between these comment tags we want to allow anything.

In the grammar, there are a couple of extra 'dummy' tokens defined just for use in the lexer (these could also be defined outside of the grammar, but keeping all the token types in one place seemed easiest):

// At the top of Reason.bnf
{
    ...

    tokens = [
        ...
        BLOCK_COMMENT="block_comment"
        DOC_BLOCK="doc_block"
    ]
}

Taking the block comments as an example, in the lexer we need to look for an instance of /* and enter a new state where we don't care about anything we're reading. This new state is just defined in the lexer at the top:

// present already
%public
%class _ReasonLexer
%implements FlexLexer
%function advance
%type IElementType
%unicode

// our new state
%s IN_BLOCK_COMMENT

Then, when we find a /* in the YYINITIAL state, jump into this new state:

%%
<YYINITIAL> {
  ...

  {CHAR}                     { return CHAR; }
  {STRING}                   { return STRING; }

  "/*"                       { yybegin(IN_BLOCK_COMMENT);
                               yypushback(2); }
}

The yypushback call 'rewinds' the lexer two characters, back to the beginning of the comment (including the /*), so everything including the tag itself is handled inside this new state block.

As mentioned before, we want to allow anything inside this new state, so the matching rules are different, and much smaller:

<IN_BLOCK_COMMENT>{
  "*/" |
  <<EOF>>   { yybegin(YYINITIAL);
              return BLOCK_COMMENT; }

  [^]     { }
}

This just waits for either an EOF or the first */ (some languages allow nested block comments - for the purposes of this exercise, we don't). Once it finds one, the lexer returns to the initial state and returns the dummy token BLOCK_COMMENT. This dummy token is then used inside the IDE code when initialising the language plugin as an indicator that any tokens of this type are comments, which can then be used for folding block comments, etc.
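
As a sketch of what that registration might look like (the ReasonParserDefinition and ReasonTypes names here are assumptions - they depend on how the plugin classes and the generated element types are named):

import com.intellij.lang.ParserDefinition;
import com.intellij.psi.tree.TokenSet;

public class ReasonParserDefinition implements ParserDefinition {
    // Any token type in this set is treated as a comment by the IDE,
    // enabling things like comment folding and spellchecking in comments
    private static final TokenSet COMMENTS =
            TokenSet.create(ReasonTypes.BLOCK_COMMENT, ReasonTypes.DOC_BLOCK);

    @Override
    public TokenSet getCommentTokens() {
        return COMMENTS;
    }

    // ... the other ParserDefinition methods are omitted here
}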

Documentation blocks are done in the same manner, but the 'start' sequence is /**.
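
A sketch of what that might look like, mirroring the block comment rules above (this assumes a second %s IN_DOC_BLOCK state has been declared; JFlex prefers the longest match, so "/**" wins over "/*" where both apply):

%%
<YYINITIAL> {
  ...

  "/**"                      { yybegin(IN_DOC_BLOCK);
                               yypushback(3); }

  "/*"                       { yybegin(IN_BLOCK_COMMENT);
                               yypushback(2); }
}

<IN_DOC_BLOCK>{
  "*/" |
  <<EOF>>   { yybegin(YYINITIAL);
              return DOC_BLOCK; }

  [^]     { }
}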

Multiline strings

An interesting feature of Reason/OCaml is that generic multiline 'raw' strings can be defined with tags that start with {| and end with |}. Anything can appear between the curly brace and the pipe, but the start and end tags must match. An example:

let multiline = {| hello!! 
...
hello from here too|};

// Note: the 'u' does not mean anything, it is just a marker
let unicode = {u|😁|u};

// bad! tags do not match! 
// let badly_matched = {start| string contents |end};

As with block comments, we want the whole string to be one token. Because the tag can be anything, and must match, this requires writing some extra code to check that the start and end tags actually match. Though this could be done by importing a class/method at the top of the file, keeping it all in the same flex file makes sure it doesn't fall out of sync.

Because we need the flex lexer to know when to start checking for a multiline string, we define a couple more 'dummy' tokens, this time just inside the lexer - they will be available for use inside the lexer but not from the parser etc.:

RAW_STRING_BEGIN=(\{([a-zA-Z]+)?\|)
RAW_STRING_END=(\|([a-zA-Z]+)?\})

Then we need a variable to keep track of the tag we're currently trying to match. We're also going to be doing some regex matching ourselves, so we precompile the patterns to save processing time. This just goes at the top of the file under the lexer class definition, and these instance variables will be added to the autogenerated Java code:

%{
  /**
   * Matching multiline string tag
   */
  private String multilineMatchTag = null;

  /**
   * The same regexes as RAW_STRING_BEGIN/RAW_STRING_END, precompiled for
   * extracting the tag groups (java.util.regex.Pattern and Matcher need to
   * be imported at the top of the flex file)
   */
  private Pattern rawStringBeginPattern = Pattern.compile("\\{([a-zA-Z]+)?\\|");
  private Pattern rawStringEndPattern = Pattern.compile("\\|([a-zA-Z]+)?}");
%}

Assuming we've then defined another state for 'matching a raw string', we use the RAW_STRING_BEGIN token to find out when to enter this state:

%%
<YYINITIAL> {
  ...

  {CHAR}                     { return CHAR; }
  {STRING}                   { return STRING; }

  "/*"                       { yybegin(IN_BLOCK_COMMENT);
                               yypushback(2); }
                               
  {RAW_STRING_BEGIN}         { yybegin(IN_RAW_STRING);
                               setTagMatch(); }
}

Then, in our new state, we want to read until the matching end tag is found:

<IN_RAW_STRING>{
  // Read until we find a _possible_ match, then check it
  {RAW_STRING_END}    { if (checkTagMatch()) {
                            yybegin(YYINITIAL);
                            return STRING;
                        }
                      }

  <<EOF>>   { yybegin(YYINITIAL);
              return BAD_CHARACTER; }

  [^]     { }
}

The setTagMatch method saves the tag of the multiline string - e.g. for {mytag| it needs to save "mytag". checkTagMatch then checks whether a possible candidate actually matches that tag - for example, {start| fk |end} is invalid, but {mytag| fdfd |} |mytag} is valid, because the inner |} doesn't match the saved tag and so doesn't terminate the string.

This code again goes at the top of the file, just underneath the definition of the instance variables:

%{
  void setTagMatch() {
      String toMatch = yytext().toString();

      // This is expected to have been reset
//      assert multilineMatchTag == null;

      Matcher matcher = rawStringBeginPattern.matcher(toMatch);
      // This should only be called when the lexer has already matched RAW_STRING_BEGIN
      boolean found = matcher.find();
      assert found;

      // group(1) is null when no tag was given, ie for a plain '{|'
      if (matcher.group(1) != null) {
          multilineMatchTag = matcher.group(1);
      } else {
          // ie, "don't expect any tag"
          multilineMatchTag = "";
      }
  }

  boolean checkTagMatch() {
      String toMatch = yytext().toString();

      // Should always be set
      assert multilineMatchTag != null;

      Matcher matcher = rawStringEndPattern.matcher(toMatch);
      boolean found = matcher.find();
      assert found;

      if (matcher.group(1) != null) {
          // Some extra tag found

          if (multilineMatchTag.equals(matcher.group(1))) {
              // Same tag found - good
              multilineMatchTag = null;
              return true;
          }
          // else ignore and continue
      } else {
          // '|}' found

          if (multilineMatchTag.isEmpty()) {
              // No extra tag expected - good
              multilineMatchTag = null;
              return true;
          }
          // else ignore and continue
      }

      return false;
  }
%}
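
To see what the tag extraction is doing in isolation, here is a standalone sketch (plain Java, outside the lexer, with a hypothetical class name) of the same matching logic - note that for an untagged {| string, group(1) is null rather than an empty string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMatchDemo {
    private static final Pattern BEGIN = Pattern.compile("\\{([a-zA-Z]+)?\\|");
    private static final Pattern END = Pattern.compile("\\|([a-zA-Z]+)?}");

    public static void main(String[] args) {
        // "{mytag|" captures "mytag"; for a plain "{|" group(1) would be null
        Matcher begin = BEGIN.matcher("{mytag|");
        begin.find();
        String startTag = begin.group(1) == null ? "" : begin.group(1);

        // "|mytag}" captures "mytag"; a bare "|}" would capture nothing
        Matcher end = END.matcher("|mytag}");
        end.find();
        String endTag = end.group(1) == null ? "" : end.group(1);

        // true - the tags match, so this end would terminate the string
        System.out.println(startTag.equals(endTag));
    }
}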

Next steps

The next thing to do to the lexer is to modify the tokens that match numbers so that the different kinds of numeric literal return different token types. For example, a long should be considered a different token type from a float (though they are both kinds of constant).
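
A sketch of the direction this might take in the flex file (these macros and token names are assumptions, not what the grammar currently defines; OCaml-style boxed integer literals use l/L/n suffixes):

INT=[0-9][0-9_]*
LONG=[0-9][0-9_]*[lLn]
FLOAT=[0-9][0-9_]*\.[0-9_]*

%%
<YYINITIAL> {
  // JFlex picks the longest match, so "123L" lexes as LONG rather than INT
  {LONG}                     { return LONG; }
  {FLOAT}                    { return FLOAT; }
  {INT}                      { return INT; }
}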

Reason/OCaml also allow very generic things inside PPX tags, such as embedding other languages, including GraphQL and Javascript. Whether this is something that needs lexer support, I'm not sure yet - hopefully it will be possible to do inside the IDE with 'nested PSI files'.
