Skip to content

Parse tree

aka: abstract syntax tree (AST), concrete syntax tree (CST).

Related: Parse tree

Parsing is the process of convertion of text (string) to a tree according to the given grammar.

Let’s explore examples to understand the concept.

Token

The simplest possible grammar: S -> "a";.

Grammar grpahParse tree for a
n29352115858446237 S n3734949636217657 a n29352115858446237->n3734949636217657
n4507133893513733 S n6772703963448443 a n4507133893513733->n6772703963448443

Concatenation

Concatenation of two tokens S -> "a" "b";.

Grammar grpahParse tree for ab
n44545901981734204 S n2222954384446656 a n44545901981734204->n2222954384446656 n3204438014478599 b n44545901981734204->n3204438014478599
n7839562424109137 S n37958477625 a n7839562424109137->n37958477625 n0024960709700601047 b n7839562424109137->n0024960709700601047

Unordered choice

S -> "a" | "b".

Grammar grpahParse tree for aParse tree for b
n6117309496745222 S n5040781274580137 a n6117309496745222->n5040781274580137 n24596702595061215 b n6117309496745222->n24596702595061215
n34516104389222324 S n3401717302431446 a n34516104389222324->n3401717302431446
n2169453577978151 S n4866827066425983 b n2169453577978151->n4866827066425983

General recursion

Classical example S -> "" | S "a";.

Grammar treeParse tree for aa
n44349053476548295 S n26242610632522734 ϵ n44349053476548295->n26242610632522734 n05403871509239755 n44349053476548295->n05403871509239755 n3986694437915812 S n44349053476548295->n3986694437915812 n05403871509239755->n3986694437915812 n07252518363197358 a n05403871509239755->n07252518363197358
n9124351807634346 S n4225995635823754 n9124351807634346->n4225995635823754 n14643009035975085 S n4225995635823754->n14643009035975085 n8301006359536043 a n4225995635823754->n8301006359536043 n021841262302323416 n14643009035975085->n021841262302323416 n42734486267834715 S n021841262302323416->n42734486267834715 n9378739597785346 a n021841262302323416->n9378739597785346 n44905802499752556 ϵ n42734486267834715->n44905802499752556

Parse tree in this case looks differently compared to grammar, but we can see how it served as pattern.

Kleene star

There are 3 different grammars which correspond to the same language:

  1. S -> "" | S "a";
  2. S -> "" | "a" S;
  3. S -> "a"*;

First two are general recursion, but the last one is “iteration” (Kleene star). So I thought it would make sense that Kleene star would be closer “concatenation” tree. Note: this feature is specific to my implementaion of PwZ. And we have following result:

1. Parse tree for aa2. Parse tree for aa3. Parse tree for aa
n2980097653363021 S n012390948441347804 n2980097653363021->n012390948441347804 n9447397903808712 S n012390948441347804->n9447397903808712 n4510643965592436 a n012390948441347804->n4510643965592436 n9120345686064655 n9447397903808712->n9120345686064655 n7288690232196533 S n9120345686064655->n7288690232196533 n459457711446265 a n9120345686064655->n459457711446265 n18980196307255492 ϵ n7288690232196533->n18980196307255492
n9860926963175212 S n023555867436992894 n9860926963175212->n023555867436992894 n47055067700205355 a n023555867436992894->n47055067700205355 n08610336019190146 S n023555867436992894->n08610336019190146 n2618746423632641 n08610336019190146->n2618746423632641 n7512070202696182 a n2618746423632641->n7512070202696182 n5986246770732997 S n2618746423632641->n5986246770732997 n8196828275150323 ϵ n5986246770732997->n8196828275150323
n8653303625778348 S n5990879255479271 a n8653303625778348->n5990879255479271 n23625114742583553 a n8653303625778348->n23625114742583553

Controlling tree structure

Note: this feature is specific to my implementaion of PwZ.

Compact tree

We can make parse tree more compact:

  • by merging all nameless nodes to the closest named parent (“vertical” compaction)
  • by removing all empty strings ("ϵ\epsilon") and empty nodes (“horizontal” compaction)
Original parse tree”Vertical” compaction”Horizontal” compaction
n7283935072824546 S n8543454581002619 n7283935072824546->n8543454581002619 n2769060188632677 S n8543454581002619->n2769060188632677 n46422432351120846 a n8543454581002619->n46422432351120846 n7874736595119161 n2769060188632677->n7874736595119161 n006984082807952374 S n7874736595119161->n006984082807952374 n14745158946824732 a n7874736595119161->n14745158946824732 n4896919916429927 ϵ n006984082807952374->n4896919916429927
n8206059774468968 S n5337170761270793 S n8206059774468968->n5337170761270793 n8301253577091392 a n8206059774468968->n8301253577091392 n16635070604988078 S n5337170761270793->n16635070604988078 n5959031294296511 a n5337170761270793->n5959031294296511 n9587498275621493 ϵ n16635070604988078->n9587498275621493
n5563884359931193 S n6777429759350084 n5563884359931193->n6777429759350084 n768863567485569 S n6777429759350084->n768863567485569 n29836451823914745 a n6777429759350084->n29836451823914745 n5762800475490126 n768863567485569->n5762800475490126 n11487990682845317 a n5762800475490126->n11487990682845317

And if we apply both strategies:

Original parse treeCompact parse tree
n07445195000921001 S n5966586078244429 n07445195000921001->n5966586078244429 n3880104378914726 S n5966586078244429->n3880104378914726 n003920414560866892 a n5966586078244429->n003920414560866892 n9031973001190916 n3880104378914726->n9031973001190916 n11688484448160863 S n9031973001190916->n11688484448160863 n5112644689976906 a n9031973001190916->n5112644689976906 n8943006631882511 ϵ n11688484448160863->n8943006631882511
n5003500515143227 S n5154538627154512 S n5003500515143227->n5154538627154512 n07663608769320507 a n5003500515143227->n07663608769320507 n8802109120602 a n5154538627154512->n8802109120602

You can think of it as turning CST to AST.

Ignored nodes

There are cases when symbol is required to distinguish something, but it useless in the final tree (AST). For example imagine function call syntax name(arg1, arg2). What is important: name of the function, arguments. What is not important: (, ), ,, space, tabs, etc.

For this case you can use ignored nodes - they will be parsed as other nodes, but after matching they will return empty string instead. For example:

  1. Add -> [0-9] "+" [0-9];
  2. Add -> [0-9] ign("+") [0-9];
1 parse tree without ign2 parse tree with ign2 compact parse tree with ign
n5171194310298322 Add n7991609836382687 1 n5171194310298322->n7991609836382687 n742591970976032 + n5171194310298322->n742591970976032 n9142169624031182 2 n5171194310298322->n9142169624031182
n4160793961782705 Add n402003194635455 1 n4160793961782705->n402003194635455 n9677290583121958 ϵ n4160793961782705->n9677290583121958 n5256870612142195 2 n4160793961782705->n5256870612142195
n24238054185861335 Add n5244306892582191 1 n24238054185861335->n5244306892582191 n9375006686549714 2 n24238054185861335->n9375006686549714

Lexical nodes

Some parsers (especially old-school ones) require separate step before parsing - tokenization. It requires separate article why I don’t like it. But other parsers doesn’t require this step. It’s called scannerless parser. Instead you can provide: lexical and syntactical grammar and parser (or parser generator) will figure out everything.

In my implementation of PwZ you can use lex for this. For example:

  1. Str -> "\"" [^"]* "\"";
  2. Str -> "\"" lex([^"]*) "\"";
  3. Str -> ign("\"") lex([^"]*) ign("\"");
1 parse tree without lex2 parse tree with lex3 parse tree with lex and ign
n9565821510414039 Str n9872159863199057 " n9565821510414039->n9872159863199057 n9150772381170151 a n9565821510414039->n9150772381170151 n48521087641895067 b n9565821510414039->n48521087641895067 n8860299636860016 " n9565821510414039->n8860299636860016
n6200088161431854 Str n32856183632756575 " n6200088161431854->n32856183632756575 n8269467321915811 ab n6200088161431854->n8269467321915811 n8365578534002487 " n6200088161431854->n8365578534002487
n6329869966057664 Str n1865680174423332 ab n6329869966057664->n1865680174423332

Nameless nodes

Sometime you want to reuse some part of grammar, in order to do this you would use rule (aka symbol or variable). But if you do - it will introduce new node in the tree. If you want to reuse grammar, but don’t want for it to have named node you can do this with special syntax for nameless nodes.

  1. Str -> ign("\"") lex([^"]*) ign("\""); S -> Str ":" Str;
  2. _Str -> ign("\"") lex([^"]*) ign("\""); S -> _Str ":" _Str;
1 parse tree for "ab":"cd"2 parse tree for "ab":"cd"
n9512568135990043 S n18366868438980966 Str n9512568135990043->n18366868438980966 n007915098782365959 : n9512568135990043->n007915098782365959 n26171448248986606 Str n9512568135990043->n26171448248986606 n12013811161465981 ab n18366868438980966->n12013811161465981 n7634523231762931 cd n26171448248986606->n7634523231762931
n40091806428772725 S n398930619576533 ab n40091806428772725->n398930619576533 n24487977824853835 : n40091806428772725->n24487977824853835 n29270374199990057 cd n40091806428772725->n29270374199990057