Lessons learned when writing an Intellij plugin lexer
The Intellij Grammar-kit plugin allows you to autogenerate a rudimentary lexer from your grammar, but once that's done the lexer needs some modifications before it can handle the more complicated structures of the language.
Background
For this, I'm assuming that the grammar from the first post is finished and working. Running the "Generate JFlex Lexer" action from inside Intellij will generate a simple lexer which can parse simple tokens and return the correct element types, but it can't do things like matching block comments.
Some useful background reading on this:
- https://github.com/JetBrains/intellij-deps-jflex/blob/master/docs/md/example.md
- https://jflex.de/manual.html
The simple lexer
When you first generate the lexer it will have some Java imports at the top, some token regexes (if you defined any in the grammar), and the matching rules for the 'initial' state of the lexer:
%%
<YYINITIAL> {
  {WHITE_SPACE}   { return WHITE_SPACE; }
  "and"           { return AND; }
  "as"            { return AS; }
  "class"         { return CLASS_KW; }
  "constraint"    { return CONSTRAINT; }
  "downto"        { return DOWNTO; }
  ...
  {FLOAT}         { return FLOAT; }
  {CHAR}          { return CHAR; }
  {STRING}        { return STRING; }
}
This will match most of the language, but there are some complications in Reason to take into account.
Improving the lexer
Block comments and docblocks
Like Java, Reason has documentation blocks (delimited with /** and */) and block comments (the same, but starting with /*). The problem is that these cannot be matched in the parser - there we have only defined the language constructs, and in between these comment tags we want to allow anything.
In the grammar, there are a couple of extra 'dummy' tokens defined just for use in the lexer (these can also be defined outside of the grammar, but keeping all the token types in one place seemed easiest):
// At the top of Reason.bnf
{
  ...
  tokens = [
    ...
    BLOCK_COMMENT="block_comment"
    DOC_BLOCK="doc_block"
  ]
}
Taking the block comments as an example, in the lexer we need to look for an instance of /* and enter a new state where we don't care about anything we're reading. This new state is just defined at the top of the lexer:
// present already
%public
%class _ReasonLexer
%implements FlexLexer
%function advance
%type IElementType
%unicode
// our new state
%s IN_BLOCK_COMMENT
Then when we find a /* while in the YYINITIAL state, we jump into this new state:
%%
<YYINITIAL> {
  ...
  {CHAR}     { return CHAR; }
  {STRING}   { return STRING; }
  "/*"       { yybegin(IN_BLOCK_COMMENT); yypushback(2); }
}
The yypushback 'rewinds' the lexer to the beginning of the comment, so everything including the /* tag itself is handled inside this new state block.
As mentioned before, we want to allow anything inside this new state, so the matching rules are different, and much smaller:
<IN_BLOCK_COMMENT> {
  "*/" |
  <<EOF>>    { yybegin(YYINITIAL); return BLOCK_COMMENT; }
  [^]        { }
}
This just waits until either an EOF or the first */ (some languages allow nested block comments - for the purposes of this exercise, we don't). Once it finds one, it returns back to the initial state and returns the dummy token BLOCK_COMMENT. This dummy token is then used inside the IDE code when initialising the language plugin as an indicator that any tokens of this type are comments. These can then be used for folding block comments, etc.
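As a rough sketch of how that wiring can look (the ReasonParserDefinition and ReasonTypes names here are assumptions - Grammar-Kit generates a types class named after the grammar):

import com.intellij.lang.ParserDefinition;
import com.intellij.psi.tree.TokenSet;
import org.jetbrains.annotations.NotNull;

public class ReasonParserDefinition implements ParserDefinition {
    // Both dummy tokens from the grammar are marked as comments
    private static final TokenSet COMMENTS =
            TokenSet.create(ReasonTypes.BLOCK_COMMENT, ReasonTypes.DOC_BLOCK);

    @NotNull
    @Override
    public TokenSet getCommentTokens() {
        // Anything in this set is treated as a comment by the IDE -
        // comment folding, 'comment out' actions, TODO highlighting, etc.
        return COMMENTS;
    }

    // ... the remaining ParserDefinition methods are elided here
}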
Documentation blocks are done in the same manner, but the 'start' marker is /**.
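For completeness, a sketch of what that might look like (the IN_DOC_BLOCK state name is my own; it is declared with %s just like IN_BLOCK_COMMENT). Because JFlex always prefers the longest match, the "/**" rule wins over "/*" whenever both could apply:

%s IN_DOC_BLOCK

%%
<YYINITIAL> {
  "/**"      { yybegin(IN_DOC_BLOCK); yypushback(3); }
}

<IN_DOC_BLOCK> {
  "*/" |
  <<EOF>>    { yybegin(YYINITIAL); return DOC_BLOCK; }
  [^]        { }
}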
Multiline strings
An interesting feature of Reason/OCaml is that generic multiline 'raw' strings can be defined with matching tags that start with {| and end with |}. Anything can go in between the curly brace and the pipe, but the start and end tags must match. An example:
let multiline = {| hello!!
...
hello from here too|};
// Note: the 'u' does not mean anything, it is just a marker
let unicode = {u|😁|u};
// bad! tags do not match!
// let badly_matched = {start| string contents |end};
As with block comments, we want to consider the whole string to be one token. Because the tag can be anything but must match, this requires writing some extra code to check that the start and end tags actually match. This code could be kept in a separate class and imported at the top of the file, but keeping it all in the same flex file makes sure it doesn't fall out of sync.
Because we need the flex lexer to know when to start checking for a multiline string, we define a couple more 'dummy' tokens, this time just inside the lexer - they will be available for use inside the lexer but not from the parser etc.
RAW_STRING_BEGIN=(\{([a-zA-Z]+)?\|)
RAW_STRING_END=(\|([a-zA-Z]+)?\})
Then, we need a variable to keep track of the tag we're currently trying to match. We're also going to be doing some regex matching ourselves, so precompile those patterns to save processing time (java.util.regex.Pattern and Matcher will need adding to the imports at the top of the file). This goes at the top of the file under the lexer class definition, and these instance variables will be added to the autogenerated Java code:
%{
  /**
   * The tag of the multiline string we're currently matching
   */
  private String multilineMatchTag = null;

  /**
   * The same regexes as RAW_STRING_BEGIN/RAW_STRING_END, precompiled
   * so we can extract the tag groups ourselves
   */
  private Pattern rawStringBeginPattern = Pattern.compile("\\{([a-zA-Z]+)?\\|");
  private Pattern rawStringEndPattern = Pattern.compile("\\|([a-zA-Z]+)?}");
%}
We then define another state, IN_RAW_STRING, in the same way as before (%s IN_RAW_STRING at the top), and use the RAW_STRING_BEGIN token to find out when to enter it:
%%
<YYINITIAL> {
  ...
  {CHAR}               { return CHAR; }
  {STRING}             { return STRING; }
  "/*"                 { yybegin(IN_BLOCK_COMMENT); yypushback(2); }
  {RAW_STRING_BEGIN}   { yybegin(IN_RAW_STRING); setTagMatch(); }
}
Then in our new state we want to read until the matching end tag is found:
<IN_RAW_STRING> {
  // Read until we find a _possible_ end tag, then check it
  {RAW_STRING_END}   { if (checkTagMatch()) {
                         yybegin(YYINITIAL);
                         return STRING;
                       }
                     }
  <<EOF>>            { yybegin(YYINITIAL);
                       // Reset so the next raw string starts cleanly
                       multilineMatchTag = null;
                       return BAD_CHARACTER; }
  [^]                { }
}
The setTagMatch method saves the tag for the multiline string - eg, for {mytag| it needs to save "mytag". checkTagMatch then checks whether a possible candidate actually matches that tag - for example, {start| fk |end} is invalid, but {mytag| fdfd |} |mytag} is valid, because the inner |} carries no matching tag and so does not terminate the string.
This code again goes at the top of the file, just underneath the definition of the instance variables:
%{
  void setTagMatch() {
    String toMatch = yytext().toString();
    Matcher matcher = rawStringBeginPattern.matcher(toMatch);
    // This should only be called on a RAW_STRING_BEGIN match
    boolean found = matcher.find();
    assert found;
    // group(1) is null when there is no tag, ie for a plain '{|' -
    // normalise that to "" so we can compare with .equals() later
    String tag = matcher.group(1);
    multilineMatchTag = tag == null ? "" : tag;
  }

  boolean checkTagMatch() {
    // Should always have been set by setTagMatch first
    assert multilineMatchTag != null;
    String toMatch = yytext().toString();
    Matcher matcher = rawStringEndPattern.matcher(toMatch);
    boolean found = matcher.find();
    assert found;
    String tag = matcher.group(1);
    String endTag = tag == null ? "" : tag;
    if (multilineMatchTag.equals(endTag)) {
      // Same tag found - good; reset for the next string
      multilineMatchTag = null;
      return true;
    }
    // Tags differ - this is just part of the string's contents, keep reading
    return false;
  }
%}
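One detail worth calling out: when the optional tag group doesn't participate in a match, group(1) returns null rather than the empty string, which is why the helpers above normalise it before comparing. A quick standalone check of that behaviour (plain Java, outside the lexer):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagGroupDemo {
  public static void main(String[] args) {
    Pattern end = Pattern.compile("\\|([a-zA-Z]+)?}");

    Matcher tagged = end.matcher("|mytag}");
    tagged.find();
    System.out.println(tagged.group(1)); // prints "mytag"

    Matcher bare = end.matcher("|}");
    bare.find();
    System.out.println(bare.group(1)); // prints "null" - no tag present
  }
}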
Next steps
The next thing to do to the lexer is to modify the tokens that match
numbers/floats/integers etc. to return different types of tokens. For example, a
long
should be considered a different token type to a float
(though they
both are of the constant
token type).
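As a very rough sketch of where that could go (the INT/INT64 macro and token names are illustrative and the suffix rules simplified - OCaml's actual literal syntax has more forms, and the new token types would need defining in the grammar's tokens block first):

INT=[0-9][0-9_]*
INT64=[0-9][0-9_]*L
FLOAT=[0-9][0-9_]*\.[0-9_]*

%%
<YYINITIAL> {
  // JFlex picks the longest match, so '123L' lexes as INT64
  // and '1.5' as FLOAT rather than INT
  {FLOAT}    { return FLOAT; }
  {INT64}    { return INT64; }
  {INT}      { return INT; }
}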
Reason/OCaml also allow very generic things inside PPX tags, such as embedding other languages including GraphQL and JavaScript. Whether this is something that needs lexer support, I'm not sure yet - hopefully it should be possible to do inside the IDE with 'nested PSI files'.