Parse tree

aka derivation tree, syntax tree, abstract syntax tree, concrete syntax tree.

Formal languages (branch of computer science and mathematics) studies diefferent problems. Two main problems (tasks) are:

recognition - answers if a given string belongs to the language or not (yes or no). Aka membership problem
parsing - solves recognition task, but it focuses more on structural analysis (producing a parse tree).

Recogintion vs Parsing

Ruzzo (1979) showed ↗ that any recognizer can be transformed to parser with the overhead of $O(\log n)$ at most.

Recognition not bothered with stuctural anlysis:

allows to work with with Regular Expressions (which typically given without explanation of how to construct parse tree)
allows any sound transformation of grammar, like Chomsky Normal Form, Greibach Normal Form, etc.

Parsing more focused on stuctural anlysis:

sound transformation of grammar changes the structure, so it is better to avoid them
Ambiguity - there can be more than one way to produce parse tree
Left recursion - can be problematic for some parsers

Constructing parse tree (formal)

TODO: Most of papers and books give quite hand-wavy explanation of parse tree construction (insert “Draw the rest of owl” meme here). I need to find a formal explanation of the process.

So far I found papers which actually talk about parse trees:

M.-J. Nederhof and G. Satta. 2004. Tabular Parsing. ↗ In C. Martin-Vide, V. Mitrana, and G. Paun, editors, Formal Languages and Applications, Studies in Fuzziness and Soft Computing 148, pages 529-549.
Kuramitsu, K. 2015. Fast, Flexible, and Declarative Construction of Abstract Syntax Trees with PEGs ↗. J. Inf. Process., 24, 123-131.
Yamaguchi, Daisuke, and Kimio Kuramitsu. 2019. CPEG: A Typed Tree Construction from Parsing Expression Grammars with Regex-Like Captures ↗ Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing.

Examples

Naive mathematical expression

e := e "+" e | e "-" e | e "*" e | e "/" e | num
num := "1"

For string 1+1*1+1 there are 5 parse trees and only one is correct ↗.

Adding left associativity

e := e "+" num | e "-" num | e "*" num | e "/" num | num
num := "1"

For string 1+1*1+1 there is only 1 parse tree, but incorrect ↗

Adding precedence

e := add_sub
mul_div := mul_div "*" mul_div | mul_div "/" mul_div | num
add_sub := add_sub "+" add_sub | add_sub "-" add_sub | mul_div
num := "1"

For string 1+1*1+1 there are 2 parse trees and only one is correct ↗.

Adding precedence and left associativity

e := add_sub

mul_div := mul | div | num
mul := mul_div "*" num
div := mul_div "/" num

add_sub := add | sub | mul_div
add := add_sub "+" mul_div
sub := add_sub "-" mul_div

num := "1"

For string 1+1*1+1 there is only 1 parse and it’s correct ↗

It is harder to read (understand) grammar, and tree changed it’s shape drastically.

Removing intermediate nodes

Note: <> is the special feature which allows to hide nodes, it is rarely supported. Given here for comparison.

e := add_sub

<add_sub> := add | sub | mul_div
add := add_sub <"+"> mul_div
sub := add_sub <"-"> mul_div

<mul_div> := mul | div | num
mul := mul_div <"*"> num
div := mul_div <"/"> num

num := "1"

Final result ↗

One more example with brackets to show how precedence works ↗.