
# Parse tree

Aka derivation tree or syntax tree. Related (see below): abstract syntax tree (AST), concrete syntax tree (CST).

Formal language theory (a branch of computer science and mathematics) studies different problems. Two main problems (tasks) are:

- recognition - answers whether a given string belongs to the language or not (yes or no). Aka the membership problem
- parsing - solves the recognition task, but focuses more on structural analysis (producing a parse tree)

## Recognition vs Parsing

Ruzzo (1979) showed that any recognizer can be transformed into a parser with an overhead of at most $O(\log n)$.

Recognition is not concerned with structural analysis:

- it allows working with regular expressions (which are typically given without an explanation of how to construct a parse tree)
- it allows any sound transformation of the grammar, such as Chomsky Normal Form, Greibach Normal Form, etc.

Parsing is more focused on structural analysis:

- sound transformations of the grammar change the structure, so it is better to avoid them
- ambiguity - there can be more than one way to produce a parse tree
- left recursion - can be problematic for some parsers
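
The distinction can be made concrete with a toy example. A minimal sketch (the function names and tree encoding are my own) for the textbook language $a^nb^n$ with grammar $S \rightarrow a \: S \: b \: \vert \: ab$:

```python
def recognize(s: str) -> bool:
    """Recognition (membership): does s belong to { a^n b^n | n >= 1 }?"""
    n = len(s) // 2
    return n > 0 and s == "a" * n + "b" * n

def parse(s: str):
    """Same test, but also builds a parse tree for S -> a S b | a b.

    None means "not in the language"; a tuple is a tree node.
    """
    if s == "ab":
        return ("S", "a", "b")
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        subtree = parse(s[1:-1])
        return ("S", "a", subtree, "b") if subtree else None
    return None

print(recognize("aabb"))  # True
print(parse("aabb"))      # ('S', 'a', ('S', 'a', 'b'), 'b')
```

The recognizer only answers yes/no; the parser does strictly more work and returns the structure.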

## Constructing parse tree (informal)

$E \rightarrow E \: O \: E \: \vert \: (E) \: \vert \: n \\ O \rightarrow + \: \vert \: *$

And the string that we need to parse: `(n + n) * n`.

### Top-down parsing

#### Top-down, Leftmost parsing

We start with the start symbol ($E$ in this case) and apply production rules to the leftmost non-terminal until we get the final string:

| Sentence | Rule applied | Current tree level |
| --- | --- | --- |
| E | start symbol | 1 |
| E O E | $E \rightarrow E \: O \: E$ | 2 |
| (E) O E | $E \rightarrow (E)$ | 3 |
| (E O E) O E | $E \rightarrow E \: O \: E$ | 4 |
| (n O E) O E | $E \rightarrow n$ | 5 |
| (n + E) O E | $O \rightarrow +$ | 5 |
| (n + n) O E | $E \rightarrow n$ | 5 |
| (n + n) * E | $O \rightarrow *$ | 3 |
| (n + n) * n | $E \rightarrow n$ | 3 |

And the parse tree is a graphical representation of the derivation steps (TODO: replace with a Graphviz diagram):

```
1        _____ E _____
        /      |      \
2    __ E __   O       E
    /   |   \  |       |
3  (    E    ) *       n
     /  |  \
4   E   O   E
    |   |   |
5   n   +   n
```

Note: this process is similar to a depth-first traversal (search).
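
The derivation above can be replayed mechanically. A minimal sketch (the encoding is my own): a sentential form is a list of symbols, and each step rewrites the leftmost non-terminal:

```python
NONTERMINALS = {"E", "O"}

def apply_leftmost(form, lhs, rhs):
    """Rewrite the leftmost non-terminal (which must be lhs) into rhs."""
    for i, sym in enumerate(form):
        if sym in NONTERMINALS:
            assert sym == lhs, f"leftmost non-terminal is {sym}, not {lhs}"
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

# The rule sequence from the table above
steps = [
    ("E", ["E", "O", "E"]),
    ("E", ["(", "E", ")"]),
    ("E", ["E", "O", "E"]),
    ("E", ["n"]),
    ("O", ["+"]),
    ("E", ["n"]),
    ("O", ["*"]),
    ("E", ["n"]),
]

form = ["E"]
for lhs, rhs in steps:
    form = apply_leftmost(form, lhs, rhs)
print(" ".join(form))  # ( n + n ) * n
```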

#### Top-down, Rightmost parsing

We start with the start symbol ($E$ in this case) and apply production rules to the rightmost non-terminal until we get the final string:

| Sentence | Rule applied | Current tree level |
| --- | --- | --- |
| E | start symbol | 1 |
| E O E | $E \rightarrow E \: O \: E$ | 2 |
| E O n | $E \rightarrow n$ | 3 |
| E * n | $O \rightarrow *$ | 3 |
| (E) * n | $E \rightarrow (E)$ | 3 |
| (E O E) * n | $E \rightarrow E \: O \: E$ | 4 |
| (E O n) * n | $E \rightarrow n$ | 5 |
| (E + n) * n | $O \rightarrow +$ | 5 |
| (n + n) * n | $E \rightarrow n$ | 5 |
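
The same sketch works for the rightmost variant; only the scan direction when choosing which non-terminal to rewrite changes (again, the encoding is my own):

```python
NONTERMINALS = {"E", "O"}

def apply_rightmost(form, lhs, rhs):
    """Rewrite the rightmost non-terminal (which must be lhs) into rhs."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NONTERMINALS:
            assert form[i] == lhs, f"rightmost non-terminal is {form[i]}, not {lhs}"
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

# The rule sequence from the table above
steps = [
    ("E", ["E", "O", "E"]),
    ("E", ["n"]),
    ("O", ["*"]),
    ("E", ["(", "E", ")"]),
    ("E", ["E", "O", "E"]),
    ("E", ["n"]),
    ("O", ["+"]),
    ("E", ["n"]),
]

form = ["E"]
for lhs, rhs in steps:
    form = apply_rightmost(form, lhs, rhs)
print(" ".join(form))  # ( n + n ) * n
```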

### Bottom-up parsing

TODO: this is a vague explanation; I need to find more precise rules. It may be that, formally, it can only be defined in terms of shift-reduce parsing.

#### Bottom-up, Leftmost parsing

We start with the string and apply production rules in reverse until we get to the start symbol ($E$ in this case):

| Sentence | Rule applied | Current tree level |
| --- | --- | --- |
| (n + n) * n | input string | 5 |
| (E + n) * n | $E \rightarrow n$ | 4 |
| (E O n) * n | $O \rightarrow +$ | 4 |
| (E O E) * n | $E \rightarrow n$ | 4 |
| (E) * n | $E \rightarrow E \: O \: E$ | 3 |
| E * n | $E \rightarrow (E)$ | 2 |
| E O n | $O \rightarrow *$ | 2 |
| E O E | $E \rightarrow n$ | 2 |
| E | $E \rightarrow E \: O \: E$ | 1 |


```
5  ( n  +  n ) *  n
   | |  |  | | |  |
4  | E  O  E | |  |
   |  \ | /  | |  |
3  |    E    | |  |
    \___|___/  |  |
2       E      O  E
         \_____|_/
1              E
```
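
The reduction sequence above can be reproduced with a naive sketch (my own simplification - not a real shift-reduce parser, there are no parser states and no lookahead): repeatedly find the leftmost substring matching some rule's right-hand side and replace it with the left-hand side:

```python
RULES = [
    ("E", ["E", "O", "E"]),
    ("E", ["(", "E", ")"]),
    ("E", ["n"]),
    ("O", ["+"]),
    ("O", ["*"]),
]

def reduce_leftmost(form):
    """Reduce the leftmost match of any rule's RHS, or return None."""
    for i in range(len(form)):
        for lhs, rhs in RULES:
            if form[i:i + len(rhs)] == rhs:
                return form[:i] + [lhs] + form[i + len(rhs):]
    return None

form = list("(n+n)*n")  # one symbol per character
while (reduced := reduce_leftmost(form)) is not None:
    form = reduced
print(form)  # ['E']
```

Note that reading this leftmost-reduction sequence bottom-to-top gives exactly the top-down rightmost derivation from the earlier table; this "rightmost derivation in reverse" view is the one shift-reduce (LR) parsers are built on.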

#### Bottom-up, Rightmost parsing

| Sentence | Rule applied | Current tree level |
| --- | --- | --- |
| (n + n) * n | input string | 5 |
| (n + n) * E | $E \rightarrow n$ | 4 |
| (n + n) O E | $O \rightarrow *$ | 4 |
| (n + E) O E | $E \rightarrow n$ | 4 |
| (n O E) O E | $O \rightarrow +$ | 4 |
| (E O E) O E | $E \rightarrow n$ | 4 |
| (E) O E | $E \rightarrow E \: O \: E$ | 3 |
| E O E | $E \rightarrow (E)$ | 2 |
| E | $E \rightarrow E \: O \: E$ | 1 |

### Lexing and scannerless parsers

Parsing can be divided into two phases (but this is not required):

- Lexical analysis
  - converts a sentence (string) into tokens
  - typically done with regular languages (Type 3 grammars, regular expressions)
  - doesn't require tree construction, which means it can be done faster
  - but it may cause ambiguity problems - the lexer may not know how to distinguish symbols without "context". For example, this is why some languages have reserved words, which can't be used as variable names
- Syntactical analysis (aka hierarchical)
  - converts a list (stream) of tokens into a tree

Parsers that don't need two separate phases are called scannerless:

- some scannerless parsers still choose to differentiate lexical and syntactical grammars, for example, Spoofax
- when parsing is done in one pass, the "lexical part" is not limited to regular languages and has access to the "context"
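
For illustration, a separate lexical phase for the example string could look like this minimal sketch (the token names and the `lex` helper are my own, not a standard API):

```python
import re

TOKEN_SPEC = [
    ("NUM",    r"n"),
    ("OP",     r"[+*]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),  # whitespace is dropped by the lexer, not the grammar
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(text):
    """Convert a sentence (string) into a list of (type, value) tokens."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "SKIP"]

print(lex("(n + n) * n"))
# [('LPAREN', '('), ('NUM', 'n'), ('OP', '+'), ('NUM', 'n'),
#  ('RPAREN', ')'), ('OP', '*'), ('NUM', 'n')]
```

The syntactical phase would then consume this token stream instead of raw characters.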

## Constructing parse tree (formal)

TODO: most papers and books give a rather hand-wavy explanation of parse tree construction (insert "draw the rest of the owl" meme here). I need to find a formal explanation of the process.

So far I have found the following papers which actually talk about parse trees:

## Ambiguity

A grammar is called ambiguous if:

- in terms of generative grammars: there is more than one leftmost (equivalently, rightmost) derivation of some string (sentence) in the grammar
- in terms of parsing: there is more than one parse tree for some string (sentence) in the grammar

Ways to deal with ambiguity:

- embrace it and return all parse trees (a parse forest)
- rewrite the grammar. This works perfectly for recognizers, but for parsers it changes the structure of the parse trees
- disambiguation filters. Produce all parse trees, but filter them based on some conditions to leave only one. Example: Spoofax

TODO: ambiguity deserves a separate page

### Examples

TODO:

• arithmetic (associativity and precedence)
• if/else
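
The arithmetic case can already be shown with the grammar from this page: the string n + n * n has two distinct leftmost derivations (one groups + at the root, the other *), so the grammar is ambiguous. A sketch (the encoding is my own):

```python
NONTERMINALS = {"E", "O"}

def derive(steps):
    """Replay a rule sequence, always rewriting the leftmost non-terminal."""
    form = ["E"]
    for lhs, rhs in steps:
        i = next(j for j, sym in enumerate(form) if sym in NONTERMINALS)
        assert form[i] == lhs
        form = form[:i] + rhs + form[i + 1:]
    return form

plus_at_root = [("E", ["E", "O", "E"]), ("E", ["n"]), ("O", ["+"]),
                ("E", ["E", "O", "E"]), ("E", ["n"]), ("O", ["*"]), ("E", ["n"])]
star_at_root = [("E", ["E", "O", "E"]), ("E", ["E", "O", "E"]), ("E", ["n"]),
                ("O", ["+"]), ("E", ["n"]), ("O", ["*"]), ("E", ["n"])]

# Different derivations (hence different parse trees), same string
assert derive(plus_at_root) == derive(star_at_root) == list("n+n*n")
```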

## Parse tree, AST, CST

A parse tree is more of a theoretical concept. ASTs and CSTs are essentially parse trees, but more practical - in the sense that they have a specification (as a data structure), nodes have types and positions, etc.

To produce an AST, some inessential details are erased (for example, spaces, tabs, etc.). ASTs are typically used in compilers and interpreters.

To produce a CST, all details are preserved. CSTs are typically used in pretty-printers and linters.
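
For `(n + n) * n` the difference can be sketched like this (the node shapes are my own choosing): the CST keeps every token, including the parentheses, while the AST keeps only the operator structure:

```python
# CST: a direct encoding of the parse tree, every token preserved
cst = ("E",
       ("E", "(", ("E", ("E", "n"), ("O", "+"), ("E", "n")), ")"),
       ("O", "*"),
       ("E", "n"))

def to_ast(node):
    """Collapse a CST node into an AST node, erasing the parentheses."""
    _tag, *children = node
    if children == ["n"]:                          # E -> n
        return "n"
    if len(children) == 3 and children[0] == "(":  # E -> ( E )
        return to_ast(children[1])
    left, op, right = children                     # E -> E O E
    return (op[1], to_ast(left), to_ast(right))

print(to_ast(cst))  # ('*', ('+', 'n', 'n'), 'n')
```

A pretty-printer needs the `cst` (to reproduce the parentheses exactly); an interpreter only needs the resulting AST.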

## Regular expression operators

The procedure for parse tree construction is typically given for some variation of a Chomsky grammar. All regular-expression-style extensions (think of EBNF and similar) are ignored - they are changed (rewritten, desugared) into analogues in that variation of the grammar. For example, in PEG:

- $e^?$ is changed to $e / \epsilon$
- $e^+$ is changed to $e \: e^*$
- $e^*$ is treated as $A \leftarrow e \: A / \epsilon$ (see "Eliminating Repetition Operators" in Ford's original paper)
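
These rewrites can be expressed as a small desugaring pass. A sketch with a tuple encoding of my own (`"choice"` stands for PEG's ordered choice $/$, and a fresh non-terminal is generated for each $e^*$):

```python
import itertools

fresh = itertools.count()  # supplies fresh non-terminal names A0, A1, ...

def desugar(e, rules):
    """Rewrite opt/plus/star into core PEG, adding new rules for star."""
    op = e[0] if isinstance(e, tuple) else None
    if op == "opt":                        # e?  =>  e / epsilon
        return ("choice", desugar(e[1], rules), ("epsilon",))
    if op == "plus":                       # e+  =>  e e*
        return ("seq", desugar(e[1], rules), desugar(("star", e[1]), rules))
    if op == "star":                       # e*  =>  A <- e A / epsilon
        name = f"A{next(fresh)}"
        rules[name] = ("choice",
                       ("seq", desugar(e[1], rules), ("ref", name)),
                       ("epsilon",))
        return ("ref", name)
    return e                               # terminals etc. stay as-is

rules = {}
print(desugar(("opt", "a"), rules))  # ('choice', 'a', ('epsilon',))
print(desugar(("plus", "b"), rules))
print(rules)
```

After this pass, only the core constructs remain, so a parse tree construction procedure defined for the core grammar applies unchanged (at the cost of extra non-terminals appearing in the tree).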