Skip to content

Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpahParse tree for a
n6323348069246328 S n8563473828016634 a n6323348069246328->n8563473828016634
n882714077774376 S n28820986338986665 a n882714077774376->n28820986338986665

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpahParse tree for ab
n36553348519931794 S n7407211999024093 a n36553348519931794->n7407211999024093 n9933975378990967 b n36553348519931794->n9933975378990967
n06201368751777614 S n2535917662744698 a n06201368751777614->n2535917662744698 n6606616329955182 b n06201368751777614->n6606616329955182

Unordered choice

S -> "a" | "b".

Grammar grpahParse tree for aParse tree for b
n8364278286160858 S n8801520155465694 a n8364278286160858->n8801520155465694 n10886501954671357 b n8364278286160858->n10886501954671357
n6249983713878668 S n4115783506595643 a n6249983713878668->n4115783506595643
n841857785758372 S n559706050764649 b n841857785758372->n559706050764649

General recursion

Classical example S -> "" | S "a";.

Grammar treeParse tree for aa
n820474374266768 S n5006716812016372 ϵ n820474374266768->n5006716812016372 n7048608958408185 n820474374266768->n7048608958408185 n27989360924243134 S n820474374266768->n27989360924243134 n7048608958408185->n27989360924243134 n24405149580038232 a n7048608958408185->n24405149580038232
n8310708584763107 S n3427606713420315 n8310708584763107->n3427606713420315 n3099281140683674 S n3427606713420315->n3099281140683674 n6876711599795684 a n3427606713420315->n6876711599795684 n6501403984370053 n3099281140683674->n6501403984370053 n12787837449659145 S n6501403984370053->n12787837449659145 n5346993573656638 a n6501403984370053->n5346993573656638 n9982376913593161 ϵ n12787837449659145->n9982376913593161

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

  1. S -> "" | S "a";
  2. S -> "" | "a" S;
  3. S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for aa2. Parse tree for aa3. Parse tree for aa
n7213957627488439 S n2157375664395631 n7213957627488439->n2157375664395631 n08721275392713168 S n2157375664395631->n08721275392713168 n36044847833086213 a n2157375664395631->n36044847833086213 n2966873028412833 n08721275392713168->n2966873028412833 n05571356339419875 S n2966873028412833->n05571356339419875 n003194369610927339 a n2966873028412833->n003194369610927339 n9865879934868214 ϵ n05571356339419875->n9865879934868214
n07747227386053956 S n5611615763393996 n07747227386053956->n5611615763393996 n8579873941179996 a n5611615763393996->n8579873941179996 n11553604898540248 S n5611615763393996->n11553604898540248 n635502273644972 n11553604898540248->n635502273644972 n566188179702015 a n635502273644972->n566188179702015 n024362198226117515 S n635502273644972->n024362198226117515 n16567949184876007 ϵ n024362198226117515->n16567949184876007
n5154152251858624 S n7894107979961298 a n5154152251858624->n7894107979961298 n7449031893341114 a n5154152251858624->n7449031893341114

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

  • by merging all nameless nodes to the closest named parent (“vertical” compaction)
  • by removing all empty strings ("ϵ\epsilon") and empty nodes (“horizontal” compaction)
Original parse tree”Vertical” compaction”Horizontal” compaction
n6573827339883429 S n19062785026384388 n6573827339883429->n19062785026384388 n81436104227287 S n19062785026384388->n81436104227287 n43147376977589746 a n19062785026384388->n43147376977589746 n19140606524093506 n81436104227287->n19140606524093506 n05031691419850892 S n19140606524093506->n05031691419850892 n878562563265878 a n19140606524093506->n878562563265878 n6111762861692338 ϵ n05031691419850892->n6111762861692338
n04640904568478588 S n5250298848018167 S n04640904568478588->n5250298848018167 n837692868947757 a n04640904568478588->n837692868947757 n26778610987094287 S n5250298848018167->n26778610987094287 n75284381321378 a n5250298848018167->n75284381321378 n866187199278678 ϵ n26778610987094287->n866187199278678
n6321769488471667 S n3670967823783633 n6321769488471667->n3670967823783633 n16963542482066374 S n3670967823783633->n16963542482066374 n5384269546134988 a n3670967823783633->n5384269546134988 n5145477749936398 n16963542482066374->n5145477749936398 n2969279119660593 a n5145477749936398->n2969279119660593

And if we apply both strategies:

Original parse treeCompact parse tree
n7622543177652255 S n9434495363256696 n7622543177652255->n9434495363256696 n4731492984575154 S n9434495363256696->n4731492984575154 n18828842446522986 a n9434495363256696->n18828842446522986 n4394867852262021 n4731492984575154->n4394867852262021 n7945270933218691 S n4394867852262021->n7945270933218691 n2652858590488729 a n4394867852262021->n2652858590488729 n378601421256072 ϵ n7945270933218691->n378601421256072
n5290202671333728 S n13907440916203373 S n5290202671333728->n13907440916203373 n9486841878384598 a n5290202671333728->n9486841878384598 n5989012228013908 a n13907440916203373->n5989012228013908

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

  1. Add -> [0-9] "+" [0-9];
  2. Add -> [0-9] ign("+") [0-9];
1 parse tree without ign2 parse tree with ign2 compact parse tree with ign
n13081623480673388 Add n49381712862234917 1 n13081623480673388->n49381712862234917 n039000429968365724 + n13081623480673388->n039000429968365724 n42228520583406537 2 n13081623480673388->n42228520583406537
n6558481400990062 Add n7443170590170687 1 n6558481400990062->n7443170590170687 n31538234613117333 ϵ n6558481400990062->n31538234613117333 n2608603823855433 2 n6558481400990062->n2608603823855433
n305522224909736 Add n6239539150418889 1 n305522224909736->n6239539150418889 n2174903553252321 2 n305522224909736->n2174903553252321

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scanerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

  1. Str -> "\"" [^"]* "\"";
  2. Str -> "\"" lex([^"]*) "\"";
  3. Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex2 parse tree with lex3 parse tree with lex and ign
n9116109167131599 Str n6188793968768607 " n9116109167131599->n6188793968768607 n10451348795059445 a n9116109167131599->n10451348795059445 n9254983226067193 b n9116109167131599->n9254983226067193 n09623137925377434 " n9116109167131599->n09623137925377434
n2541555959908415 Str n6538950702854032 " n2541555959908415->n6538950702854032 n45618629825026225 ab n2541555959908415->n45618629825026225 n5322931692778174 " n2541555959908415->n5322931692778174
n4284294860654341 Str n5966762654692828 ab n4284294860654341->n5966762654692828

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

  1. Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
  2. _Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;
1 parse tree for "ab":"cd"2 parse tree for "ab":"cd"
n617682390885838 S n637790499557761 Str n617682390885838->n637790499557761 n36780760085102293 : n617682390885838->n36780760085102293 n5445590316468862 Str n617682390885838->n5445590316468862 n2956137582632561 ab n637790499557761->n2956137582632561 n20656747812227305 cd n5445590316468862->n20656747812227305
n01523502410585098 S n47012398331746774 ab n01523502410585098->n47012398331746774 n9846224295687682 : n01523502410585098->n9846224295687682 n4450793119402958 cd n01523502410585098->n4450793119402958