Skip to content

Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Related: Parse tree

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpahParse tree for a
n8921194238452024 S n6722614495857564 a n8921194238452024->n6722614495857564
n9409787990248148 S n5009857902889363 a n9409787990248148->n5009857902889363

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpahParse tree for ab
n7753889280461777 S n6520488001328095 a n7753889280461777->n6520488001328095 n6377343997124383 b n7753889280461777->n6377343997124383
n01625732282975756 S n5183372153763972 a n01625732282975756->n5183372153763972 n757247657925864 b n01625732282975756->n757247657925864

Unordered choice

S -> "a" | "b".

Grammar grpahParse tree for aParse tree for b
n9749093414597456 S n785066164099053 a n9749093414597456->n785066164099053 n48621922861316835 b n9749093414597456->n48621922861316835
n5876172382671483 S n585874772873304 a n5876172382671483->n585874772873304
n14893308120730775 S n2054348351222437 b n14893308120730775->n2054348351222437

General recursion

Classical example S -> "" | S "a";.

Grammar treeParse tree for aa
n4023620896395641 S n24359841443456043 ϵ n4023620896395641->n24359841443456043 n5051023128229557 n4023620896395641->n5051023128229557 n9088705941669604 S n4023620896395641->n9088705941669604 n5051023128229557->n9088705941669604 n3605846400157917 a n5051023128229557->n3605846400157917
n8789749261598685 S n8072869791613682 n8789749261598685->n8072869791613682 n11810794234162847 S n8072869791613682->n11810794234162847 n5965845831048773 a n8072869791613682->n5965845831048773 n8816904592508856 n11810794234162847->n8816904592508856 n8545491219381744 S n8816904592508856->n8545491219381744 n2201097471029092 a n8816904592508856->n2201097471029092 n7765938310274567 ϵ n8545491219381744->n7765938310274567

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

  1. S -> "" | S "a";
  2. S -> "" | "a" S;
  3. S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for aa2. Parse tree for aa3. Parse tree for aa
n9789699617707655 S n2110751837351108 n9789699617707655->n2110751837351108 n41271900638751524 S n2110751837351108->n41271900638751524 n49727249004663854 a n2110751837351108->n49727249004663854 n36736185875866 n41271900638751524->n36736185875866 n4404902349304183 S n36736185875866->n4404902349304183 n2529196266460796 a n36736185875866->n2529196266460796 n20181498563454148 ϵ n4404902349304183->n20181498563454148
n7272336886330892 S n6614112039917168 n7272336886330892->n6614112039917168 n5594089220641345 a n6614112039917168->n5594089220641345 n5186981486460451 S n6614112039917168->n5186981486460451 n9764203625726551 n5186981486460451->n9764203625726551 n6728279650450191 a n9764203625726551->n6728279650450191 n7865349451069221 S n9764203625726551->n7865349451069221 n12672702075041675 ϵ n7865349451069221->n12672702075041675
n9241631116918485 S n6241467749791665 a n9241631116918485->n6241467749791665 n3501647355743469 a n9241631116918485->n3501647355743469

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

  • by merging all nameless nodes to the closest named parent (“vertical” compaction)
  • by removing all empty strings ("ϵ\epsilon") and empty nodes (“horizontal” compaction)
Original parse tree”Vertical” compaction”Horizontal” compaction
n6214643319269018 S n3277956833278355 n6214643319269018->n3277956833278355 n38355016356462945 S n3277956833278355->n38355016356462945 n024401963671305982 a n3277956833278355->n024401963671305982 n6305865883049886 n38355016356462945->n6305865883049886 n8237167717771845 S n6305865883049886->n8237167717771845 n1865270139186781 a n6305865883049886->n1865270139186781 n12602805571607312 ϵ n8237167717771845->n12602805571607312
n22802056386744396 S n7445802541895081 S n22802056386744396->n7445802541895081 n8552610277274622 a n22802056386744396->n8552610277274622 n11738793922871094 S n7445802541895081->n11738793922871094 n45920705203412115 a n7445802541895081->n45920705203412115 n6362528699519208 ϵ n11738793922871094->n6362528699519208
n34794126426892813 S n1448368755723608 n34794126426892813->n1448368755723608 n26512643961189286 S n1448368755723608->n26512643961189286 n1255833267787485 a n1448368755723608->n1255833267787485 n7210813386459334 n26512643961189286->n7210813386459334 n6010045402976483 a n7210813386459334->n6010045402976483

And if we apply both strategies:

Original parse treeCompact parse tree
n8070527310872975 S n4784525686540082 n8070527310872975->n4784525686540082 n966096666793516 S n4784525686540082->n966096666793516 n690083262591987 a n4784525686540082->n690083262591987 n6555134655438022 n966096666793516->n6555134655438022 n8739251882078682 S n6555134655438022->n8739251882078682 n751228938991255 a n6555134655438022->n751228938991255 n3445965716189905 ϵ n8739251882078682->n3445965716189905
n6970941049988819 S n5540590197086712 S n6970941049988819->n5540590197086712 n5681708294828784 a n6970941049988819->n5681708294828784 n5882950615430576 a n5540590197086712->n5882950615430576

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

  1. Add -> [0-9] "+" [0-9];
  2. Add -> [0-9] ign("+") [0-9];
1 parse tree without ign2 parse tree with ign2 compact parse tree with ign
n25812967999216263 Add n5494311264763534 1 n25812967999216263->n5494311264763534 n20597663573275438 + n25812967999216263->n20597663573275438 n22070750902304614 2 n25812967999216263->n22070750902304614
n36150253724359804 Add n51613416580455 1 n36150253724359804->n51613416580455 n03203371380932252 ϵ n36150253724359804->n03203371380932252 n7258649308619742 2 n36150253724359804->n7258649308619742
n9688367076031488 Add n9427207846169645 1 n9688367076031488->n9427207846169645 n9601677062768688 2 n9688367076031488->n9601677062768688

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scannerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

  1. Str -> "\"" [^"]* "\"";
  2. Str -> "\"" lex([^"]*) "\"";
  3. Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex2 parse tree with lex3 parse tree with lex and ign
n41935406127796493 Str n7061975659321986 " n41935406127796493->n7061975659321986 n8159112573550307 a n41935406127796493->n8159112573550307 n9145111961429471 b n41935406127796493->n9145111961429471 n5649337548389732 " n41935406127796493->n5649337548389732
n8008954404586213 Str n022242120556423828 " n8008954404586213->n022242120556423828 n26577364069482035 ab n8008954404586213->n26577364069482035 n39122779432727883 " n8008954404586213->n39122779432727883
n561847286724704 Str n6225871259226778 ab n561847286724704->n6225871259226778

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

  1. Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
  2. _Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;
1 parse tree for "ab":"cd"2 parse tree for "ab":"cd"
n8536000512006274 S n504530107355831 Str n8536000512006274->n504530107355831 n8326410708378968 : n8536000512006274->n8326410708378968 n08652476444961743 Str n8536000512006274->n08652476444961743 n5255261280624735 ab n504530107355831->n5255261280624735 n7863760201383769 cd n08652476444961743->n7863760201383769
n15108600409209605 S n5876639461385522 ab n15108600409209605->n5876639461385522 n9920432959688164 : n15108600409209605->n9920432959688164 n059086258924613944 cd n15108600409209605->n059086258924613944