PEG deserves separate page. It was proposed by Bryan Ford in 2004.
It is different from other “classical” approaches, such as Chomsky grammar, regular expression (at least it was then). I consider it to be quite innovative. Key features (differences):
- focuses on parsing rather, than generation of languages
- doesn’t support left-recursion
- There is research on how to add support for left-recursion to PEG
- unambigious, because doesn’t have unordered choice
- linear complexity (
O(n)), which is quite impressive taking into grammars it can handle
- is not contained in CFG, but can handle some CSG
- This is research subject to understand how PEG relates to Chomsky hierarchy
TS(The tmg recognition schema, Alexander Birman, 1970)
For decades we have been using Chomsky’s generative system of grammars, particularly context-free grammars(CFGs)and regular expressions(REs), to express the syntax of programming languages and protocols. The power of generative grammars to express ambiguity is crucial to their original purpose of modeling natural languages, but this very power makes it unnecessarily difficult both to express and to parse machine-oriented languages using CFGs. Parsing Expression Grammars(
PEGs) provide an alternative recognition-based formal foundation for describing machine-oriented syntax, which solves the ambiguity problem by not introducing ambiguity in the first place Where CFG express nondeterministic choice between alternatives,
PEGs instead use prioritized choice.
PEGs address frequently felt expressiveness limitations of CFGs and REs, simplifying syntax definitions and making it unnecessary to separate their lexical and hierarchical components. A linear-time parser can be built for any PEG, avoiding both the complexity and fickleness of
LRparsers and the inefficiency of generalized CFG parsing. While PEGs provide a rich set of operators for constructing grammars, they are reducible to two minimal recognition schemas developed around 1970,
gTS/GTDPL, which are here proven equivalent ineffective recognition power.
Abstract. Parsing Expression Grammars (
PEGs) are specifications of unambiguous recursive-descent style parsers. PEGs incorporate both lexing and parsing phases and have valuable properties, such as being closed under composition. In common with most recursive-descent systems, raw
PEGs cannot handle left-recursion; traditional approaches to left-recursion elimination lead to incorrect parses. In this paper, I show how the approach proposed for direct left-recursive Packrat parsing by Warth et al. can be adapted for ‘pure’
PEGs. I then demonstrate that this approach results in incorrect parses for some
PEGs, before outlining a restrictive subset of left-recursive
PEGs which can safely work with this algorithm. Finally I suggest an alteration to Warth et al.’s algorithm that can correctly parse a less restrictive subset of directly recursive
Packrat parsing is a linear-time implementation method of recursive descent parsers. The trick is a memoization mechanism, where all parsing results are memorized to avoid redundant parsing in cases of backtracking. An arising problem is extremely huge heap consumption in memoization, resulting in the fact that the cost of memoization is likely to outweigh its benefits. In many cases, developers need to make a difficult choice to abandon packrat parsing despite the possible exponential time parsing. Elastic packrat parsing is developed in order to avoid such a difficult choice. The heap consumption is upper-bounded since memorized results are stored on a sliding window buffer. In addition, the buffer capacity is adjusted by tracing each of nonterminal backtracking activities at runtime. Elastic packrat parsing is implemented in a part of our Nez parser. We demonstrate that the elastic packrat parsing achieves stable and robust performance against a variety of inputs with different backtracking activities.
Pika(Pika parsing: reformulating packrat parsing as a dynamic programming algorithm solves the left recursion and error recovery problems, 2020)
A recursive descent parser is built from a set of mutually-recursive functions, where each function directly implements one of the nonterminals of a grammar. A
packratparser uses memoization to reduce the time complexity for recursive descent parsing from exponential to linear in the length of the input. Recursive descent parsers are extremely simple to write, but suffer from two significant problems: (i) left-recursive grammars cause the parser to get stuck in infinite recursion, and (ii) it can be difficult or impossible to optimally recover the parse state and continue parsing after a syntax error. Both problems are solved by the
pikaparser, a novel reformulation of
packratparsing as a dynamic programming algorithm, which requires parsing the input in reverse: bottom-up and right to left, rather than top-down and left to right. This reversed parsing order enables
pikaparsers to handle grammars that use either direct or indirect left recursion to achieve left associativity, simplifying grammar writing, and also enables optimal recovery from syntax errors, which is a crucial property for IDEs and compilers.
Pikaparsing maintains the linear-time performance characteristics of
packratparsing as a function of input length. The
parserwas benchmarked against the widely-used Parboiled2 and ANTLR4 parsing libraries, and the pika
parserperformed significantly better than the other
parsersfor an expression grammar, although for a complex grammar implementing the Java language specification, a large constant performance impact was incurred per input character for the
pikaparser, which allowed Parboiled2 and ANTLR4 to perform significantly better than the
pikaparser for this grammar (in spite of ANTLR4’s parsing time scaling between quadratically and cubically in the length of the input with the Java grammar). Therefore, if performance is important,
pikaparsing is best applied to simple to moderate-sized grammars, or to very large inputs, if other parsing alternatives do not scale linearly in the length of the input. Several new insights into precedence, associativity, and left recursion are presented.
We show how one way of removing non-determinism from this formalism yields a formalism with the semantics of PEGs. We also prove, based on these new formalisms, how
LL(1)grammars define the same language whether interpreted as CFGs or as
PEGs, and also show how strong-
LL(k), right-linear, and LL-regular grammars have simple language-preserving translations from CFGs to