Skip to content
Parsing
TwitterGitHub

PEG

PEG deserves separate page. It was proposed by Bryan Ford in 2004.

It is different from other “classical” approaches, such as Chomsky grammar, regular expression (at least it was then). I consider it to be quite innovative. Key features (differences):

  • focuses on parsing rather, than generation of languages
  • doesn’t support left-recursion
    • There is research on how to add support for left-recursion to PEG
  • unambigious, because doesn’t have unordered choice
  • linear complexity (O(n)), which is quite impressive taking into grammars it can handle
  • is not contained in CFG, but can handle some CSG
    • This is research subject to understand how PEG relates to Chomsky hierarchy

For decades we have been using Chomsky’s generative system of grammars, particularly context-free grammars(CFGs)and regular expressions(REs), to express the syntax of programming languages and protocols. The power of generative grammars to express ambiguity is crucial to their original purpose of modeling natural languages, but this very power makes it unnecessarily difficult both to express and to parse machine-oriented languages using CFGs. Parsing Expression Grammars(PEGs) provide an alternative recognition-based formal foundation for describing machine-oriented syntax, which solves the ambiguity problem by not introducing ambiguity in the first place Where CFG express nondeterministic choice between alternatives, PEGs instead use prioritized choice. PEGs address frequently felt expressiveness limitations of CFGs and REs, simplifying syntax definitions and making it unnecessary to separate their lexical and hierarchical components. A linear-time parser can be built for any PEG, avoiding both the complexity and fickleness of LR parsers and the inefficiency of generalized CFG parsing. While PEGs provide a rich set of operators for constructing grammars, they are reducible to two minimal recognition schemas developed around 1970, TS/TDPL and gTS/GTDPL, which are here proven equivalent ineffective recognition power.

Abstract. Parsing Expression Grammars (PEGs) are specifications of unambiguous recursive-descent style parsers. PEGs incorporate both lexing and parsing phases and have valuable properties, such as being closed under composition. In common with most recursive-descent systems, raw PEGs cannot handle left-recursion; traditional approaches to left-recursion elimination lead to incorrect parses. In this paper, I show how the approach proposed for direct left-recursive Packrat parsing by Warth et al. can be adapted for ‘pure’ PEGs. I then demonstrate that this approach results in incorrect parses for some PEGs, before outlining a restrictive subset of left-recursive PEGs which can safely work with this algorithm. Finally I suggest an alteration to Warth et al.’s algorithm that can correctly parse a less restrictive subset of directly recursive PEGs.

Packrat parsing is a linear-time implementation method of recursive descent parsers. The trick is a memoization mechanism, where all parsing results are memorized to avoid redundant parsing in cases of backtracking. An arising problem is extremely huge heap consumption in memoization, resulting in the fact that the cost of memoization is likely to outweigh its benefits. In many cases, developers need to make a difficult choice to abandon packrat parsing despite the possible exponential time parsing. Elastic packrat parsing is developed in order to avoid such a difficult choice. The heap consumption is upper-bounded since memorized results are stored on a sliding window buffer. In addition, the buffer capacity is adjusted by tracing each of nonterminal backtracking activities at runtime. Elastic packrat parsing is implemented in a part of our Nez parser. We demonstrate that the elastic packrat parsing achieves stable and robust performance against a variety of inputs with different backtracking activities.

A recursive descent parser is built from a set of mutually-recursive functions, where each function directly implements one of the nonterminals of a grammar. A packrat parser uses memoization to reduce the time complexity for recursive descent parsing from exponential to linear in the length of the input. Recursive descent parsers are extremely simple to write, but suffer from two significant problems: (i) left-recursive grammars cause the parser to get stuck in infinite recursion, and (ii) it can be difficult or impossible to optimally recover the parse state and continue parsing after a syntax error. Both problems are solved by the pika parser, a novel reformulation of packrat parsing as a dynamic programming algorithm, which requires parsing the input in reverse: bottom-up and right to left, rather than top-down and left to right. This reversed parsing order enables pika parsers to handle grammars that use either direct or indirect left recursion to achieve left associativity, simplifying grammar writing, and also enables optimal recovery from syntax errors, which is a crucial property for IDEs and compilers. Pika parsing maintains the linear-time performance characteristics of packrat parsing as a function of input length. The pika parser was benchmarked against the widely-used Parboiled2 and ANTLR4 parsing libraries, and the pika parser performed significantly better than the other parsers for an expression grammar, although for a complex grammar implementing the Java language specification, a large constant performance impact was incurred per input character for the pika parser, which allowed Parboiled2 and ANTLR4 to perform significantly better than the pika parser for this grammar (in spite of ANTLR4’s parsing time scaling between quadratically and cubically in the length of the input with the Java grammar). Therefore, if performance is important, pika parsing is best applied to simple to moderate-sized grammars, or to very large inputs, if other parsing alternatives do not scale linearly in the length of the input. Several new insights into precedence, associativity, and left recursion are presented.

We show how one way of removing non-determinism from this formalism yields a formalism with the semantics of PEGs. We also prove, based on these new formalisms, how LL(1) grammars define the same language whether interpreted as CFGs or as PEGs, and also show how strong-LL(k), right-linear, and LL-regular grammars have simple language-preserving translations from CFGs to PEGs

On the relation between context-free grammars and parsing expression grammars