PEG
PEG deserves separate page. It was proposed by Bryan Ford in 2004 ↗.
It is different from other “classical” approaches, such as Chomsky grammar, regular expression (at least it was then). I consider it to be quite innovative. Key features (differences):
- focuses on parsing rather, than generation of languages
- cannot handle Left recursion
- There is research on how to add support for left-recursion to PEG
- unambigious, because doesn’t have unordered choice
- linear complexity (), which is quite impressive taking into account grammars it can handle
- is not contained in CFG, but can handle some CSG
- This is research subject to understand how PEG relates to Chomsky hierarchy
- recursive-descent
Links
Abstract. Parsing Expression Grammars (
PEG
s) are specifications of unambiguous recursive-descent style parsers. PEGs incorporate both lexing and parsing phases and have valuable properties, such as being closed under composition. In common with most recursive-descent systems, rawPEG
s cannot handle left-recursion; traditional approaches to left-recursion elimination lead to incorrect parses. In this paper, I show how the approach proposed for direct left-recursive Packrat parsing by Warth et al. can be adapted for ‘pure’PEG
s. I then demonstrate that this approach results in incorrect parses for somePEG
s, before outlining a restrictive subset of left-recursivePEG
s which can safely work with this algorithm. Finally I suggest an alteration to Warth et al.’s algorithm that can correctly parse a less restrictive subset of directly recursivePEG
s.— Direct Left-Recursive Parsing Expression Grammars, Laurence Tratt, 2010 ↗
Packrat parsing is a linear-time implementation method of recursive descent parsers. The trick is a memoization mechanism, where all parsing results are memorized to avoid redundant parsing in cases of backtracking. An arising problem is extremely huge heap consumption in memoization, resulting in the fact that the cost of memoization is likely to outweigh its benefits. In many cases, developers need to make a difficult choice to abandon packrat parsing despite the possible exponential time parsing. Elastic packrat parsing is developed in order to avoid such a difficult choice. The heap consumption is upper-bounded since memorized results are stored on a sliding window buffer. In addition, the buffer capacity is adjusted by tracing each of nonterminal backtracking activities at runtime. Elastic packrat parsing is implemented in a part of our Nez parser. We demonstrate that the elastic packrat parsing achieves stable and robust performance against a variety of inputs with different backtracking activities.
— Packrat Parsing with Elastic Sliding Window Kimio Kuramitsu, 2015 ↗
A recursive descent parser is built from a set of mutually-recursive functions, where each function directly implements one of the nonterminals of a grammar. A
packrat
parser uses memoization to reduce the time complexity for recursive descent parsing from exponential to linear in the length of the input. Recursive descent parsers are extremely simple to write, but suffer from two significant problems: (i) left-recursive grammars cause the parser to get stuck in infinite recursion, and (ii) it can be difficult or impossible to optimally recover the parse state and continue parsing after a syntax error. Both problems are solved by thepika
parser, a novel reformulation ofpackrat
parsing as a dynamic programming algorithm, which requires parsing the input in reverse: bottom-up and right to left, rather than top-down and left to right. This reversed parsing order enablespika
parsers to handle grammars that use either direct or indirect left recursion to achieve left associativity, simplifying grammar writing, and also enables optimal recovery from syntax errors, which is a crucial property for IDEs and compilers.Pika
parsing maintains the linear-time performance characteristics ofpackrat
parsing as a function of input length. Thepika
parser
was benchmarked against the widely-used Parboiled2 and ANTLR4 parsing libraries, and the pikaparser
performed significantly better than the otherparsers
for an expression grammar, although for a complex grammar implementing the Java language specification, a large constant performance impact was incurred per input character for thepika
parser, which allowed Parboiled2 and ANTLR4 to perform significantly better than thepika
parser for this grammar (in spite of ANTLR4’s parsing time scaling between quadratically and cubically in the length of the input with the Java grammar). Therefore, if performance is important,pika
parsing is best applied to simple to moderate-sized grammars, or to very large inputs, if other parsing alternatives do not scale linearly in the length of the input. Several new insights into precedence, associativity, and left recursion are presented.
We show how one way of removing non-determinism from this formalism yields a formalism with the semantics of PEGs. We also prove, based on these new formalisms, how
LL(1)
grammars define the same language whether interpreted as CFGs or asPEG
s, and also show how strong-LL(k)
, right-linear, and LL-regular grammars have simple language-preserving translations from CFGs toPEG
s— On the relation between context-free grammars and parsing expression grammars ↗