A packrat parser is simply a recursive descent parser that memoizes.
I would like to remark on the extreme simplicity of implementing a
Parsing Expression
Grammar (PEG)
parser. PEG rules are formed from the following basic constructs:
x y - Try to parse x and y in sequence.
x | y - Try to parse x. If it fails, try y.
!x - Try to parse x. If it succeeds, fail; if it fails, succeed. In
either case, consume no input.
There are other operations sometimes considered part of the definition
that can be derived from these.
x* - Parse zero or more occurrences of x in sequence.
x+ - Parse one or more occurrences of x.
x? - Either parse x, or do nothing.
&x - Parse x, but don’t consume any input.
The unusual operators here are ! and &. You often do not get these
operators in other parser generators because they cannot handle them
efficiently.
A PEG parser generator can generate parsers that have running time and
memory consumption at most O(N * M) for inputs of size N and grammars
of size M. The idea is simple - keep a cache of the result for each
input position and for each subexpression. Then each of the operators
can be implemented in constant amortized time. To see this, realize
that the time taken will be proportional to the number of expressions
computed, that there are by definition M expressions, and that each
expression will be computed at most once for a given input position.
The syntax the library will use is the following,
* x <&> y - sequence
* do { x; y; } - sequence
* do { a <- x; y a; } - sequence, capturing a result
* x <|> y - choice
* neg x - negation (!x)
* rule x - declare a point of memoization
Here is how the other operators can be derived:
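As a self-contained sketch (this is not the article's actual library; the type `P` and the primitive names below are mine for illustration), the derived operators fall out of the three basic constructs, assuming each successful parse of x consumes input so that x* terminates:

```haskell
-- A minimal PEG parser: a function from input to failure, or a result
-- paired with the remaining input.
newtype P a = P { runP :: String -> Maybe (a, String) }

unit :: a -> P a                        -- succeed, consuming nothing
unit a = P $ \s -> Just (a, s)

bind :: P a -> (a -> P b) -> P b        -- sequence (x y), passing the result along
bind x f = P $ \s -> runP x s >>= \(a, s') -> runP (f a) s'

(|||) :: P a -> P a -> P a              -- ordered choice (x | y)
x ||| y = P $ \s -> maybe (runP y s) Just (runP x s)

neg :: P a -> P ()                      -- negation (!x)
neg x = P $ \s -> case runP x s of
  Nothing -> Just ((), s)
  Just _  -> Nothing

char :: Char -> P Char                  -- a sample primitive, for testing
char c = P $ \s -> case s of
  (t:ts) | t == c -> Just (c, ts)
  _               -> Nothing

star :: P a -> P [a]                    -- x* = (x x*) | ε
star x = (x `bind` \a -> star x `bind` \as -> unit (a : as)) ||| unit []

plus :: P a -> P [a]                    -- x+ = x x*
plus x = x `bind` \a -> star x `bind` \as -> unit (a : as)

opt :: P a -> P (Maybe a)               -- x? = x | ε
opt x = (x `bind` (unit . Just)) ||| unit Nothing

peek :: P a -> P ()                     -- &x = !!x
peek x = neg (neg x)
```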
Streams of tokens
Before talking about parsing in general, it would be good to have a
way to efficiently deal with streams of tokens, as an input to the
parser. We’ll define the type Stream tok to mean a stream of
tokens of type tok, and provide the following operations,
stream - convert a list into a stream
match - match the first token in the stream against a predicate
matches - match the first few tokens of the stream against a list
eof - check if you are at the end of the stream
You’ll notice that the above operations can all be easily and
efficiently implemented on lists. The only advantage of our streams is
that they can be compared in constant time with other streams from the
same source.
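One way to get that constant-time comparison (a sketch, not the article's actual module) is to tag each stream with its offset from the start of its source; two streams of the same source are then equal exactly when their offsets are:

```haskell
-- A stream is a list tagged with its offset from the start of the source.
data Stream tok = Stream !Int [tok]

instance Eq (Stream tok) where
  Stream i _ == Stream j _ = i == j     -- O(1); only meaningful within one source

stream :: [tok] -> Stream tok           -- convert a list into a stream
stream = Stream 0

-- match the first token against a predicate
match :: (tok -> Bool) -> Stream tok -> Maybe (tok, Stream tok)
match p (Stream i (t:ts)) | p t = Just (t, Stream (i + 1) ts)
match _ _                       = Nothing

-- match the first few tokens against a list
matches :: Eq tok => [tok] -> Stream tok -> Maybe (Stream tok)
matches []     s = Just s
matches (x:xs) (Stream i (t:ts))
  | x == t       = matches xs (Stream (i + 1) ts)
matches _      _ = Nothing

eof :: Stream tok -> Bool               -- are we at the end of the stream?
eof (Stream _ ts) = null ts
```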
PEG Parsers
Implementing the operators is completely straightforward. For sequence
expressions, try parsing the left expression first. If it fails,
fail. If it succeeds, parse the right expression starting from the
stream position where the left one stopped. For choice expressions, try
parsing the left expression first. If it succeeds, succeed with the
same result and stream position. If it fails, try the right expression
from the original position. Etc.
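In code, the two cases described above look roughly like this (a sketch that ignores memoization; the names are mine, not the library's):

```haskell
-- A parser maps an input to failure, or a result and the remaining input.
newtype Parser a = Parser { run :: String -> Maybe (a, String) }

-- Sequence: parse the left expression; if it fails, fail; if it
-- succeeds, parse the right expression from where the left one stopped.
andThen :: Parser a -> Parser b -> Parser (a, b)
andThen x y = Parser $ \s -> do
  (a, s')  <- run x s
  (b, s'') <- run y s'
  Just ((a, b), s'')

-- Choice: parse the left expression; if it succeeds, keep its result;
-- if it fails, retry the right expression from the ORIGINAL position.
orElse :: Parser a -> Parser a -> Parser a
orElse x y = Parser $ \s -> maybe (run y s) Just (run x s)

sym :: Char -> Parser Char              -- a sample primitive, for testing
sym c = Parser $ \s -> case s of
  (t:ts) | t == c -> Just (c, ts)
  _               -> Nothing
```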
The slightly harder part is dealing with the state that arises. This
is harder in Haskell than in most other languages because Haskell
forces us to acknowledge precisely what it is we are doing. There are two
sources of state. First, a cache has to be maintained for each
expression that will be memoized using STRefs (it cannot be a global
cache, because there would be no way to make that type-safe). Second,
rules are allowed to refer to each other, but each reference to a rule
should refer to the same rule, with the same cache, not copies of
it with two different caches. This also requires state.
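Both sources of state can be sketched in a few lines (the names below are mine, not the library's; a rule body is taken to be a function from input position to a result, run in ST so it can consult caches held in STRefs):

```haskell
import Control.Monad.ST
import Data.STRef
import qualified Data.Map as Map

type Result a = Maybe (a, Int)                  -- value and next position
type Cache s a = STRef s (Map.Map Int (Result a))

-- First source of state: a per-rule cache, so that each
-- (rule, position) pair is parsed at most once.
memo :: Cache s a -> (Int -> ST s (Result a)) -> Int -> ST s (Result a)
memo ref parse pos = do
  cached <- Map.lookup pos <$> readSTRef ref
  case cached of
    Just r  -> return r                          -- cache hit: no recomputation
    Nothing -> do
      r <- parse pos
      modifySTRef ref (Map.insert pos r)
      return r

-- Second source of state: declaring a rule allocates exactly one cache,
-- and every reference to the rule shares that same STRef.
rule :: (Int -> ST s (Result a)) -> ST s (Int -> ST s (Result a))
rule parse = do
  cache <- newSTRef Map.empty
  return (memo cache parse)
```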
Without further ado, here is the complete memoizing PEG parser (that
is, packrat parser), generic over the type of its tokens. It supports
actions and do-notation, like
Parsec.
Testing
Here are some tests to ensure it works, and as a demonstration of how
to use it.
To compile all of these, use
ghc6 --make stream.hs parser.hs sugar.hs test.hs
Future Work
But there is more to do! Here are some limitations:
There is no support for left-recursive rules. One approach is
described in Alessandro Warth’s thesis on
Ometa.
However, this, and every other seed-based approach I have seen, suffers
from the problem of parsing E -> E + E | N on “1 + 2 + 3” as (1 +
(2 + 3)), which is incorrect.
There is no support for stateful actions, and with good cause. As it
is, actions are performed as they are encountered, which means they
could be computed many times. What should happen is that actions are
gathered up in a list as the parser progresses, and only performed at
the end. From the user’s perspective, it would then appear that the
parser always chose the right path at each choice operator.
Parsers have to be specified all at once. Getting around this will
require abstracting the grammar further away from the parser, and
having rules stored under some sort of key.
[Done!] There is no check until runtime that a rule has been
defined. This should happen when compiling a grammar into a parser.
[Done!] Is it possible to remove the necessity for wrapping rule
references with ref?
Is it possible to make separate streams incomparable?