Parse tree
aka derivation tree, syntax tree, abstract syntax tree, concrete syntax tree.
Formal languages (branch of computer science and mathematics) studies diefferent problems. Two main problems (tasks) are:
- recognition - answers if a given string belongs to the language or not (yes or no). Aka membership problem
- parsing - solves recognition task, but it focuses more on structural analysis (producing a parse tree).
Recogintion vs Parsing
Ruzzo (1979) showed ↗ that any recognizer can be transformed to parser with the overhead of at most.
Recognition not bothered with stuctural anlysis:
- allows to work with with Regular Expressions (which typically given without explanation of how to construct parse tree)
- allows any sound transformation of grammar, like Chomsky Normal Form, Greibach Normal Form, etc.
Parsing more focused on stuctural anlysis:
- sound transformation of grammar changes the structure, so it is better to avoid them
- Ambiguity - there can be more than one way to produce parse tree
- Left recursion - can be problematic for some parsers
Constructing parse tree (formal)
TODO: Most of papers and books give quite hand-wavy explanation of parse tree construction (insert “Draw the rest of owl” meme here). I need to find a formal explanation of the process.
So far I found papers which actually talk about parse trees:
- M.-J. Nederhof and G. Satta. 2004. Tabular Parsing. ↗ In C. Martin-Vide, V. Mitrana, and G. Paun, editors, Formal Languages and Applications, Studies in Fuzziness and Soft Computing 148, pages 529-549.
- Kuramitsu, K. 2015. Fast, Flexible, and Declarative Construction of Abstract Syntax Trees with PEGs ↗. J. Inf. Process., 24, 123-131.
- Yamaguchi, Daisuke, and Kimio Kuramitsu. 2019. CPEG: A Typed Tree Construction from Parsing Expression Grammars with Regex-Like Captures ↗ Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing.
Examples
Naive mathematical expression
e := e "+" e | e "-" e | e "*" e | e "/" e | numnum := "1"
For string 1+1*1+1
there are 5 parse trees and only one is correct ↗.
Adding left associativity
e := e "+" num | e "-" num | e "*" num | e "/" num | numnum := "1"
For string 1+1*1+1
there is only 1 parse tree, but incorrect ↗
Adding precedence
e := add_submul_div := mul_div "*" mul_div | mul_div "/" mul_div | numadd_sub := add_sub "+" add_sub | add_sub "-" add_sub | mul_divnum := "1"
For string 1+1*1+1
there are 2 parse trees and only one is correct ↗.
Adding precedence and left associativity
e := add_sub
mul_div := mul | div | nummul := mul_div "*" numdiv := mul_div "/" num
add_sub := add | sub | mul_divadd := add_sub "+" mul_divsub := add_sub "-" mul_div
num := "1"
For string 1+1*1+1
there is only 1 parse and it’s correct ↗
It is harder to read (understand) grammar, and tree changed it’s shape drastically.
Removing intermediate nodes
Note: <>
is the special feature which allows to hide nodes, it is rarely supported. Given here for comparison.
e := add_sub
<add_sub> := add | sub | mul_divadd := add_sub <"+"> mul_divsub := add_sub <"-"> mul_div
<mul_div> := mul | div | nummul := mul_div <"*"> numdiv := mul_div <"/"> num
num := "1"
One more example with brackets to show how precedence works ↗.