A packrat parser is simply a recursive descent parser that memoizes.
I would like to remark on the extreme simplicity of implementing a
Parsing Expression
Grammar (PEG)
parser. PEG rules are formed from the following basic constructs:
x y - Try to parse x and y in sequence.
x | y - Try to parse x. If it fails, try y.
!x - Try to parse x. If it succeeds, fail; if it fails, succeed. In
either case, consume no input.
There are other operations sometimes considered part of the definition
that can be derived from these.
x* - Parse zero or more occurrences of x in sequence.
x+ - Parse one or more occurrences of x.
x? - Either parse x, or do nothing.
&x - Parse x, but don’t consume any input.
The unusual operators here are ! and &. You often do not get these
operators in other parser generators because they cannot handle them
efficiently.
A PEG parser generator can generate parsers that have running time and
memory consumption at most O(N * M) for inputs of size N and grammars
of size M. The idea is simple - keep a cache of the result for each
input position and for each subexpression. Then each of the operators
can be implemented in constant amortized time. To see this, realize
that the time taken will be proportional to the number of expressions
computed, that there are by definition M expressions, and that each
expression will be computed at most once for a given input position.
The syntax the library will use is the following,
* x <&> y - sequence
* do { x; y; } - sequence
* do { a <- x; y a; } - sequence, capturing a result
* x <|> y - choice
* neg x - negation (!x)
* rule x - declare a point of memoization
Here is how the other operators can be derived:
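As a self-contained sketch (this is not the article's actual library; the type `P` and the primitive names below are mine for illustration), the derived operators fall out of the three basic constructs, assuming each successful parse of x consumes input so that x* terminates:

```haskell
-- A minimal PEG parser: a function from input to failure, or a result
-- paired with the remaining input.
newtype P a = P { runP :: String -> Maybe (a, String) }

unit :: a -> P a                        -- succeed, consuming nothing
unit a = P $ \s -> Just (a, s)

bind :: P a -> (a -> P b) -> P b        -- sequence (x y), passing the result along
bind x f = P $ \s -> runP x s >>= \(a, s') -> runP (f a) s'

(|||) :: P a -> P a -> P a              -- ordered choice (x | y)
x ||| y = P $ \s -> maybe (runP y s) Just (runP x s)

neg :: P a -> P ()                      -- negation (!x)
neg x = P $ \s -> case runP x s of
  Nothing -> Just ((), s)
  Just _  -> Nothing

char :: Char -> P Char                  -- a sample primitive, for testing
char c = P $ \s -> case s of
  (t:ts) | t == c -> Just (c, ts)
  _               -> Nothing

star :: P a -> P [a]                    -- x* = (x x*) | ε
star x = (x `bind` \a -> star x `bind` \as -> unit (a : as)) ||| unit []

plus :: P a -> P [a]                    -- x+ = x x*
plus x = x `bind` \a -> star x `bind` \as -> unit (a : as)

opt :: P a -> P (Maybe a)               -- x? = x | ε
opt x = (x `bind` (unit . Just)) ||| unit Nothing

peek :: P a -> P ()                     -- &x = !!x
peek x = neg (neg x)
```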
Streams of tokens
Before talking about parsing in general, it would be good to have a
way to efficiently deal with streams of tokens, as an input to the
parser. We’ll define the type Stream tok to mean a stream of
tokens of type tok, and provide the following operations,
stream - convert a list into a stream
match - match the first token in the stream against a predicate
matches - match the first few tokens of the stream against a list
eof - check if you are at the end of the stream
You’ll notice that the above operations can all be easily and
efficiently implemented on lists. The only advantage of our streams is
that they can be compared in constant time with other streams from the
same source.
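One way to get that constant-time comparison (a sketch, not the article's actual module) is to tag each stream with its offset from the start of its source; two streams of the same source are then equal exactly when their offsets are:

```haskell
-- A stream is a list tagged with its offset from the start of the source.
data Stream tok = Stream !Int [tok]

instance Eq (Stream tok) where
  Stream i _ == Stream j _ = i == j     -- O(1); only meaningful within one source

stream :: [tok] -> Stream tok           -- convert a list into a stream
stream = Stream 0

-- match the first token against a predicate
match :: (tok -> Bool) -> Stream tok -> Maybe (tok, Stream tok)
match p (Stream i (t:ts)) | p t = Just (t, Stream (i + 1) ts)
match _ _                       = Nothing

-- match the first few tokens against a list
matches :: Eq tok => [tok] -> Stream tok -> Maybe (Stream tok)
matches []     s = Just s
matches (x:xs) (Stream i (t:ts))
  | x == t       = matches xs (Stream (i + 1) ts)
matches _      _ = Nothing

eof :: Stream tok -> Bool               -- are we at the end of the stream?
eof (Stream _ ts) = null ts
```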
PEG Parsers
Implementing the operators is completely straightforward. For sequence
expressions, try parsing the left expression first. If it fails,
fail. If it succeeds, parse the right expression starting from the
stream position where the left one stopped. For choice expressions, try
parsing the left expression first. If it succeeds, succeed with the
same result and stream position. If it fails, try the right expression
from the original position. Etc.
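In code, the two cases described above look roughly like this (a sketch that ignores memoization; the names are mine, not the library's):

```haskell
-- A parser maps an input to failure, or a result and the remaining input.
newtype Parser a = Parser { run :: String -> Maybe (a, String) }

-- Sequence: parse the left expression; if it fails, fail; if it
-- succeeds, parse the right expression from where the left one stopped.
andThen :: Parser a -> Parser b -> Parser (a, b)
andThen x y = Parser $ \s -> do
  (a, s')  <- run x s
  (b, s'') <- run y s'
  Just ((a, b), s'')

-- Choice: parse the left expression; if it succeeds, keep its result;
-- if it fails, retry the right expression from the ORIGINAL position.
orElse :: Parser a -> Parser a -> Parser a
orElse x y = Parser $ \s -> maybe (run y s) Just (run x s)

sym :: Char -> Parser Char              -- a sample primitive, for testing
sym c = Parser $ \s -> case s of
  (t:ts) | t == c -> Just (c, ts)
  _               -> Nothing
```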
The slightly harder part is dealing with the state that arises. This
is harder in Haskell than in most other languages because Haskell
forces us to acknowledge precisely what it is we are doing. There are two
sources of state. First, a cache has to be maintained for each
expression that will be memoized using STRefs (it cannot be a global
cache, because there would be no way to make that type-safe). Second,
rules are allowed to refer to each other, but each reference to a rule
should refer to the same rule, with the same cache, not copies of
it with two different caches. This also requires state.
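Both sources of state can be sketched in a few lines (the names below are mine, not the library's; a rule body is taken to be a function from input position to a result, run in ST so it can consult caches held in STRefs):

```haskell
import Control.Monad.ST
import Data.STRef
import qualified Data.Map as Map

type Result a = Maybe (a, Int)                  -- value and next position
type Cache s a = STRef s (Map.Map Int (Result a))

-- First source of state: a per-rule cache, so that each
-- (rule, position) pair is parsed at most once.
memo :: Cache s a -> (Int -> ST s (Result a)) -> Int -> ST s (Result a)
memo ref parse pos = do
  cached <- Map.lookup pos <$> readSTRef ref
  case cached of
    Just r  -> return r                          -- cache hit: no recomputation
    Nothing -> do
      r <- parse pos
      modifySTRef ref (Map.insert pos r)
      return r

-- Second source of state: declaring a rule allocates exactly one cache,
-- and every reference to the rule shares that same STRef.
rule :: (Int -> ST s (Result a)) -> ST s (Int -> ST s (Result a))
rule parse = do
  cache <- newSTRef Map.empty
  return (memo cache parse)
```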
Without further ado, here is the complete memoizing PEG parser (that
is, packrat parser), generic over the type of its tokens. It supports
actions and do-notation, like
Parsec.
Testing
Here are some tests to ensure it works, and as a demonstration of how
to use it.
To compile all of these, use
ghc6 --make stream.hs parser.hs sugar.hs test.hs
Future Work
But there is more to do! Here are some limitations:
There is no support for left-recursive rules. One approach is
described in Alessandro Warth’s thesis on
Ometa.
However, this, and every other seed-based approach I have seen, suffers
from the problem of parsing E -> E + E | N on “1 + 2 + 3” as (1 +
(2 + 3)), which is incorrect.
There is no support for stateful actions, and with good cause. As it
is, actions are performed as they are encountered, which means they
could be computed many times. What should happen is that actions are
gathered up in a list as the parser progresses, and only performed at
the end. From the user’s perspective, it would then appear that the
parser always chose the right path at each choice operator.
Parsers have to be specified all at once. Getting around this will
require abstracting the grammar further away from the parser, and
having rules stored under some sort of key.
[Done!] There is no check until runtime that a rule has been
defined. This should happen when compiling a grammar into a parser.
[Done!] Is it possible to remove the necessity for wrapping rule
references with ref?
Is it possible to make separate streams incomparable?