A packrat parser is simply a recursive decent parser that memoizes.
I would like to remark on the extreme simplicity of implementing a Parsing Expression Grammar (PEG) parser. PEG rules are formed from the following basic constructs:
x y
- Try to parse x
and y
in sequence.x | y
- Try to parse x
. If it fails, try y
.!x
- Try to parse x
. If it succeeds, fail, if it fails, succeed,
and in either case consume no input.There are other operations sometimes considered part of the definition that can be derived in terms of these.
x*
- Parse zero or more occurrences of x
in sequence.x+
- Parse one or more occurrences of x
.x?
- Either parse x
, or do nothing.&x
- Parse x
, but don’t consume any input.The unusual operators here are !
and &
. You often do not get these
operators in other parser generators because they cannot handle them
efficiently.
A PEG parser generator can generate parsers that have running time and memory consumption at most O(N * M) for inputs of size N and grammars of size M. The idea is simple - keep a cache of the result for each input position and for each subexpression. Then each of the operators can be implemented in constant amortized time. To see this, realize that the time taken will be proportional to the number of expressions computed, that there are by definition M expressions, and that each expression will be computed at most once for a given input position.
The syntax the library will use will be the following,
* x <&> y
- sequence
* do { x; y; }
- sequence
* do { a <- x; y a; }
- sequence, capturing a result
* x <|> y
- choice
* neg x
- negation (x!
)
* rule x
- declare a point of memoization
Here is how the other operators can be derived:
module Sugar where
import Parser
many e = some e <|> return []
some e = do
x <- e;
xs <- many e;
return (x : xs)
maybe e = e <|> return ()
ignore = neg . neg
Before talking about parsing in general, it would be good to have a
way to efficiently deal with streams of tokens, as an input to the
parser. We’ll define the type Stream tok
to mean a stream of
tokens of type tok
, and provide the following operations,
stream
- convert a list into a streammatch
- match the first token in the stream against a predicatematches
- match the first few tokens of the stream against a listeof
- check if you are at the end of the streamYou’ll notice that the above operations can all be easily and efficiently implemented on lists. The only advantage of our streams is that they can be compared in constant time with other streams of the same source.
module Stream (Stream, stream, match, matches, eof) where
data Stream tok = Stream [tok] Int
instance Eq (Stream tok) where
(Stream _ x) == (Stream _ y) = x == y
instance Ord (Stream tok) where
compare (Stream _ x) (Stream _ y) = compare x y
stream :: [tok] -> Stream tok
stream list = Stream list 0
match :: (tok -> Bool) -> Stream tok -> Maybe (tok, Stream tok)
match p (Stream [] _) = Nothing
match p (Stream (x:xs) n) =
if p x
then Just (x, Stream xs (n + 1))
else Nothing
matches :: Eq tok => [tok] -> Stream tok -> Maybe (Stream tok)
matches [] s = Just s
matches (x:xs) (Stream [] _) = Nothing
matches (x:xs) (Stream (y:ys) n) =
if x == y
then matches xs (Stream ys (n + 1))
else Nothing
eof :: Stream tok -> Bool
eof (Stream [] _) = True
eof _ = False
Implementing the operators is completely straightforward. For sequence expressions, try parsing the left expression first. If it fails, fail. If it succeeds, succeed with the same result and stream position. For choice expressions, try parsing the left expression first. If it succeeds, succeed with the same result and stream position. If it fails, try the right. Etc.
The slightly harder part is dealing with the state that arises. This
is harder in Haskell than most other languages because Haskell forces
us to be acknowledge precisely what it is we are doing. There are two
sources of state. First, a cache has to be maintained for each
expression that will be memoized using STRef
s (it cannot be a global
cache, because there would be no way to make that type-safe). Second,
rules are allowed to refer to each other, but each reference to a rule
should refer to the same rule, with the same cache, not copies of
it with two different caches. This also requires state.
Without further ado, here is the comlete memoizing PEG parser (that is, packrat parser), generic over the type of its tokens. It supports actions and do-notation, like Parsec.
{-# LANGUAGE Rank2Types, GADTs #-}
module Parser (Grammar, parse, rule,
eof, fail, token, tokens, (<.>), (<|>), neg) where
import qualified Debug.Trace as Trace
import Control.Monad.ST
import Control.Monad.State
import Data.Map
import Data.STRef
import qualified Stream as Stream
type Memo t a = Map (Stream.Stream t) (Result t a)
type Grammar t a = forall s. ST s (Parser s t a)
data Parser s t a where
Succeed :: a -> Parser s t a
Eof :: Parser s t ()
Match :: [t] -> Parser s t ()
Pred :: (t -> Bool) -> Parser s t t
Seq :: Parser s t a -> (a -> Parser s t b) -> Parser s t b
Choice :: Parser s t a -> Parser s t a -> Parser s t a
Not :: Parser s t a -> Parser s t ()
Memo :: Parser s t a -> STRef s (Memo t a) -> Parser s t a
data Result t a = Failure
| Success (Stream.Stream t) a
useMemo = True
rule :: Parser s t a -> ST s (Parser s t a)
rule rule = do
if useMemo
then do
memo <- newSTRef empty
return (Memo rule memo)
else return rule
infixl 4 <|>
infixl 5 <.>
(<.>) :: Parser s t a -> Parser s t b -> Parser s t b
(<.>) = (>>)
(<|>) = Choice
neg = Not
eof = Eof
token = Pred
tokens = Match
instance Monad (Parser s t) where
return = Succeed
(>>=) = Seq
parse :: Eq t => Grammar t a -> [t] -> Maybe a
parse grammar toks =
let result = runST $ do
parser <- grammar
run parser (Stream.stream toks) in
case result of
Failure -> Nothing
Success _ x -> Just x
run :: Eq t => Parser s t a -> Stream.Stream t -> ST s (Result t a)
run (Succeed x) s = return $ Success s x
run Eof s =
if Stream.eof s
then return (Success s ())
else return Failure
run (Match toks) s =
case Stream.matches toks s of
Nothing -> return Failure
Just s' -> return (Success s' ())
run (Pred p) s =
case Stream.match p s of
Nothing -> return Failure
Just (x, s') -> return (Success s' x)
run (Seq e f) s = do
result <- run e s
case result of
Failure -> return Failure
Success s' x -> run (f x) s'
run (Choice e e') s = do
result <- run e s
case result of
Failure -> run e' s
Success s' x -> return (Success s' x)
run (Not e) s = do
result <- run e s
case result of
Failure -> return (Success s ())
Success _ _ -> return Failure
run (Memo e ref) s = do
memo <- readSTRef ref
case Data.Map.lookup s memo of
Nothing -> do
result <- run e s
writeSTRef ref (insert s result memo)
return result
Just result -> return result
Here are some tests to ensure it works, and as a demonstration of how to use it.
{-# LANGUAGE DoRec, Rank2Types #-}
module Main where
import Data.Char (isDigit)
import Control.Monad
import Parser
import Sugar
import Data.List
import qualified Data.Map as Map
import Data.STRef
import Data.Char
runTests :: (Eq t, Show t, Eq a, Show a) =>
String -> Grammar t a -> [([[t]], Maybe a)] -> IO ()
runTests name grammar tests =
let flattenGroup (inputs, output) = map (\i -> (i, output)) inputs
testPairs = concat $ map flattenGroup tests in
mapM_ (runTest name grammar) testPairs
runTest :: (Eq t, Show t, Eq a, Show a) =>
String -> Grammar t a -> ([t], Maybe a) -> IO ()
runTest testName grammar (input, expectedResult) = do
let result = parse grammar input
if result == expectedResult
then return ()
else do
putStrLn $ "Test " ++ testName ++ " failed!"
++ " On input " ++ (show input)
++ ", expected result " ++ (show expectedResult)
++ ", but found " ++ (show result) ++ "."
main = do
-- See the Catalan numbers?
runTests "Matching Parens" matchingParens
[(["()"], Just 1),
(["(())"], Just 2),
(["((()))", "(()())"], Just 3),
(["(((())))", "((()()))", "((())())", "(()(()))", "(()()())"], Just 4),
(["(", ")",
"((", ")(", "))",
"(()", "())", "()()", "(()())()", "(((())())"], Nothing)]
runTests "Non-context-free" nonContextFree
[([[1, 2, 3], [1, 1, 2, 2, 3, 3], [1, 1, 1, 2, 2, 2, 3, 3, 3]],
Just ()),
([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3],
[1, 1, 2, 3], [1, 2, 2, 3], [1, 2, 3, 3],
[1, 1, 2, 2, 3], [1, 2, 2, 3, 3], [1, 1, 2, 3, 3], [1, 2, 1, 2, 3]],
Nothing)]
runTests "ML Comments" comments
[(["(**)", "(*(**)*)", "(*(*(**)*)*)", "(*(**)(**)*)", "(*(**)(*(**)*)*)",
"(*x*)", "(*comment*)", "(*X(*Y*)Z*)",
"(*X (* Y*) Z *)", "(*(*X*)X(*(**)*)X*)",
"(*(x*)", "(***)", "(*)*)", "(*(*)*)((**)**)",
"(**(*)*)**(*()*(*)*))*)*)", "(*)(***)()(**(**))*)*)"],
Just ()),
(["*", "()", "(*", "(*(**))", "(**)*)", "(*(**)(**)",
"x(**)", "(**)x", "((**)", "(**)*",
"(*(**)(**)*)*)", "(*(**)*))", "(*(*((*(**))*)*)"],
Nothing)]
putStrLn "Running memo test..."
runTests "Memoization" memoTest
[([concat ["((((((((((((((((((((((((((((((",
"((((((((((((((((((((((((((((((",
" o ",
")))]))]]]])))]]]))])))])))])]]",
"))])]]]])]))))]])))]]]]]]]])))"]], Just ())]
putStrLn "ok"
-- I think this takes exponential time without memoization.
memoTest = do
rec
s <- rule $ r <.> eof
r <- rule $ tokens " o "
<|> tokens "(" <.> r <.> tokens ")"
<|> tokens "(" <.> r <.> tokens "]"
return s
-- Matches {1^n 2^n 3^n}; taken from Wikipedia
nonContextFree = do
rec
abc <- rule $ ignore (ab <.> tokens [3]) <.> as <.> bc <.> eof
as <- rule $ many (tokens [1])
ab <- rule $ tokens [1] <.> ab <.> tokens [2] <|> tokens [1, 2]
bc <- rule $ tokens [2] <.> bc <.> tokens [3] <|> tokens [2, 3]
return abc
-- Ensure that parens match, and count the maximum depth.
matchingParens = do
rec
parens <- rule $ do
tokens "("
n <- expr
tokens ")"
return (n + 1)
expr <- rule $ exprBody <|> return 0
exprBody <- rule $ do
x <- parens
y <- expr
return (x + y)
start <- rule $ do
n <- parens
eof
return n
return start
comments = do
rec
start <- rule $ comment <.> eof
comment <- rule $ tokens "(*" <.> body <.> tokens "*)"
body <- rule $ comment <.> body <|> char <.> body <|> return ()
text <- rule $ char <.> text <|> return ()
char <- rule $ neg (tokens "(*") <.> neg (tokens "*)") <.> token (\_ -> True)
return start
To compile all of these, use
ghc6 --make stream.hs parser.hs sugar.hs test.hs
But there is more to do! Here are some limitations:
E -> E + E | N
on “1 + 2 + 3” as (1 +
(2 + 3)), which is incorrect.ref
?