Parse tree
aka derivation tree, syntax tree, abstract syntax tree, concrete syntax tree.
Formal languages (branch of computer science and mathematics) studies diefferent problems. Two main problems (tasks) are:
- recognition - answers if a given string belongs to the language or not (yes or no). Aka membership problem
- parsing - solves recognition task, but it focuses more on structural analysis (producing a parse tree).
Recogintion vs Parsing
Ruzzo (1979) showed that any recognizer can be transformed to parser with the overhead of at most.
Recognition not bothered with stuctural anlysis:
- allows to work with with Regular Expressions (which typically given without explanation of how to construct parse tree)
- allows any sound transformation of grammar, like Chomsky Normal Form, Greibach Normal Form, etc.
Parsing more focused on stuctural anlysis:
- sound transformation of grammar changes the structure, so it is better to avoid them
- ambiguity - there can be more than one way to produce parse tree
- left-recursion - can be problematic for some parsers
Constructing parse tree (informal)
Let’s start with the simple example (of parse tree for CFG):
And the string that we need to parse: (n + n) * n
.
Top-down parsing
Top-down, Leftmost
We start with start symbol ( in this case) and apply production rules to the leftmost non-terminal until we get the final string:
Sentence | Rule applied | Current tree level |
---|---|---|
E | start symbol | 1 |
E O E | 2 | |
(E) O E | 3 | |
(E O E) O E | 4 | |
(n O E) O E | 5 | |
(n + E) O E | 5 | |
(n + n) O E | 5 | |
(n + n) O E | 5 | |
(n + n) * E | 3 | |
(n + n) * n | 3 |
And parse tree would be graphical representation of derivation steps (TODO replace with Graphviz diagram):
1 _____ E
/ | \
2 __ E __ O E
/ | \ | |
3 ( E ) * n
/ | \
4 E O E
| | |
5 n + n
Note: this process is similar to depth-first traversal (search)
Top-down, Rightmost parsing
We start with start symbol ( in this case) and apply production rules to the rightmost non-terminal until we get the final string:
Sentence | Rule applied | Current tree level |
---|---|---|
E | start symbol | 1 |
E O E | 2 | |
E O n | 3 | |
E * n | 3 | |
(E) * n | 3 | |
(E O E) * n | 4 | |
(E O n) * n | 5 | |
(E + n) * n | 5 | |
(n + n) * n | 5 |
Bottom-up parsing
TODO this is vague explanation I need to find more precise rules. It can happen that formally it can be defined only in terms of shift-reduce.
Bottom-up, Leftmost
We start with the string and apply production rules (in reverse) until we get to the start symbol ( in this case)
Sentence | Rule applied | Current tree level |
---|---|---|
(n + n) * n | 5 | |
(E + n) * n | 4 | |
(E O n) * n | 4 | |
(E O E) * n | 4 | |
(E) * n | 3 | |
E * n | 2 | |
E O n | 2 | |
E O E | 2 | |
E | 1 |
5 ( n + n ) * n
| | | | | | |
4 | E O E | | |
| \ | / | | |
3 | E | | |
\___ | ___/ | |
2 E O E
\____ | /
1 E
Bottom-up, Rightmost parsing
Sentence | Rule applied | Current tree level |
---|---|---|
(n + n) * n | 5 | |
(n + n) * E | 4 | |
(n + n) O E | 4 | |
(n + E) O E | 4 | |
(n O E) O E | 4 | |
(E O E) O E | 4 | |
(E) O E | 3 | |
E O E | 2 | |
E | 1 |
Lexing and scanerless parsers
Parsing can be divided into two phase (but it’s not required):
- Lexical analysis
- converts sentence (string) into tokens
- typically done with regular languages (Type 3 grammar, Regular expressions)
- doesn’t require tree construction, which it means it can be done faster
- but may cause ambiguity problems - it may not know how to distinguish symbols without “context”. For example, that is why some languages have reserved words, which can’t be used as variable names
- Syntactical analysis (aka hierarchical)
- converts list (stream) of tokens into tree
Parsers that don’t need two separate phases are called scanerless:
- Some scanerless parsers still choose to differentiate lexical and syntactical grammars, for example, Spoofax
- When parsing done in one iteration it means that “lexical part” is not limited to regular languages and has access to the “context”
Constructing parse tree (formal)
TODO: Most of papers and books give quite hand-wavy explanation of parse tree construction (insert ”Draw the rest of owl” meme here). I need to find formal explanation of the process.
So far I found papers which actually talk about parse trees:
- Masaru Tomita. 1987. An efficient augmented-context-free parsing algorithm. Comput. Linguist. 13, 1–2 (January-June 1987), 31–46.
- Lang, B. 1991. Towards a Uniform Formal Framework for Parsing. In: Tomita, M. (eds) Current Issues in Parsing Technology. The Springer International Series in Engineering and Computer Science, vol 126. Springer, Boston, MA.
- M.-J. Nederhof and G. Satta. 2004. Tabular Parsing. In C. Martin-Vide, V. Mitrana, and G. Paun, editors, Formal Languages and Applications, Studies in Fuzziness and Soft Computing 148, pages 529-549.
- Kuramitsu, K. 2015. Fast, Flexible, and Declarative Construction of Abstract Syntax Trees with PEGs. J. Inf. Process., 24, 123-131.
- Yamaguchi, Daisuke, and Kimio Kuramitsu. 2019. CPEG: A Typed Tree Construction from Parsing Expression Grammars with Regex-Like Captures Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing.
Ambiguity
Grammar is called ambigious if:
- In terms of generative grammars: if there is more than one way to derive the string (sentence) in the grammar
- In terms of parsing: if there is more than one parse tree for the string (sentence) in the grammar
Ways to deal with ambiguity:
- embrace it and return all parse trees or parse-forest
- rewrite grammar. This works perfectly for recognizers, but for parsers it changes the structure of parse trees
- disambiguation filters. Produce all parse trees, but filter them based on some conditions to left only one. Example Spoofax
TODO: ambiguity desrves a separate page
Examples
TODO:
- arithmetic (associativity and precedence)
- if/else
Parse tree, AST, CST
Parse tree is more theoretical concept. AST and CST essntially are parse trees, but more practical - in the sense that it would have specification (as data structure), nodes would have types and positions, etc.
To produce AST some unssential details are erased (for example, spaces, tabs etc.). Typically used in compilers, interpreters
To produce CST all details are preserved. Typically used in pretty-printers, linters
Regular Expressions operators
Procedure for parse tree construction typically given for some variation of Chomsky grammar. All Regular Expressions extensions (think of BNF and similar) are ignored - they changed (rewritten, desugared) to similar analogues in the variation of Chomsky grammar. For example, in PEG:
- changed to
- changed to
- treated as (see “Eliminating Repetition Operators” in Bford original paper)