This is a demo for a library about doing fun things with grammars.
To get started, create one function per non-terminal, decorated with
the rule
decorator, and one instance of a
Terminal
object for each terminal. Then tie it all together
with an instance of a Grammar object.
Here’s an example:
from parser import *
@rule
def expression():
return seq(expression, PLUS, term) | term
@rule
def term():
return seq(LPAREN, expression, RPAREN) | ID
PLUS = Terminal('PLUS', '+')
LPAREN = Terminal('LPAREN', '(')
RPAREN = Terminal('RPAREN', ')')
ID = Terminal(
'ID',
Re.seq(
Re.set(("a", "z"), ("A", "Z"), "_"),
Re.set(("a", "z"), ("A", "Z"), ("0", "9"), "_").star(),
),
)
SimpleGrammar = Grammar(
name="Simple",
start=expression,
)
Terminal patterns can be plain strings or regular expressions
constructed with the Re
object. (Ironically, I guess this
library is not clever enough to parse a regular expression string into
one of these structures. If you want to build one, go nuts! It’s just
Python, you can do whatever you want so long as the result is an
Re
object.)
Productions can be built out of terminals and non-terminals,
concatenated with the seq
function or the +
operator. Alternatives can be expressed with the alt
function or the |
operator. These things can be freely
nested, as desired.
You can make lists in the classic context-free grammar way:
@rule
def list():
return NUMBER | (list + COMMA + NUMBER)
NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')
NumberList = Grammar(
name="NumberList",
start=list,
)
(Unlike with PEGs, you can write grammars with left or right-recursion, without restriction, either is fine.)
When used to generate a parser, the grammar describes a concrete
syntax tree. Unfortunately, that means that the list example above will
generate a very awkward tree for 1,2,3
:
list
list
list
NUMBER ("1")
COMMA
NUMBER ("2")
COMMA
NUMBER ("3")
In order to make this a little cleaner, rules can be “transparent”, which means they don’t generate nodes in the tree and just dump their contents into the parent node instead.
@rule
def list():
# The starting rule can't be transparent: there has to be something to
# hold on to!
return transparent_list
@rule(transparent=True)
def transparent_list() -> Rule:
return NUMBER | (transparent_list + COMMA + NUMBER)
NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')
NumberList = Grammar(
name="NumberList",
start=list,
)
This grammar will generate the far more useful tree:
list
NUMBER ("1")
COMMA
NUMBER ("2")
COMMA
NUMBER ("3")
Rules that start with _
are also interpreted as
transparent, following the lead set by tree-sitter, and so the grammar
above is probably better-written as:
@rule
def list():
# The starting rule can't be transparent: there has to be something to
# hold on to!
return transparent_list
@rule
def _list() -> Rule:
return NUMBER | (_list + COMMA + NUMBER)
NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')
NumberList = Grammar(
name="NumberList",
start=list,
)
That will generate the same tree, but a little more succinctly.
Of course, it’s a lot of work to write these transparent recursive rules by hand all the time, so there are helpers that do it for you:
@rule
def list():
return zero_or_more(NUMBER, COMMA) + NUMBER
NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')
NumberList = Grammar(
name="NumberList",
start=list,
)
Much better.
Most folks that want to parse something want to skip blanks when they do it. Our grammars don’t say anything about that by default (sorry), so you probably want to be explicit about such things.
To allow (and ignore) spaces, newlines, tabs, and carriage-returns in our number lists, we would modify the grammar as follows:
@rule
def list():
return zero_or_more(NUMBER, COMMA) + NUMBER
NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')
BLANKS = Terminal(Re.set(" ", "\t", "\r", "\n").plus())
# ^ and add a new terminal to describe what we're ignoring...
NumberList = Grammar(
name="NumberList",
start=list,
trivia=[BLANKS],
)
Now we can parse a list with spaces! “1 , 2, 3” will parse happily into:
list
NUMBER ("1")
COMMA
NUMBER ("2")
COMMA
NUMBER ("3")
In order to get good error recovery, you have to… do nothing.
The parser runtime we’re using here uses a non-interactive version of CPCT+.
I find that it actually works quite well! If you’re skeptical that a machine-generated parser can do well enough for, say, an LSP, give your favorite examples a try here. You might be surprised.
(Go ahead, give it some of your favorite examples of resilient parsing and see how it does. I would love to see examples of where the recovery went fully off the rails!)
You can annotate the terminals and nonterminals to generate syntax highlighting but the dingus doesn’t have it wired into the editors yet.
You can annotate the grammar with rules for pretty printing but the dingus doesn’t expose it yet.