Parsing text using a formal grammar - Reply 1

Is there a way to annotate the parse tree with source position information? (From a quick look at the code and documentation, it seems that source position is available for error messages, although I don’t know how precise it is)

A good suggestion, and it would be quite easy to implement. However, the ptree is the fundamental data structure defined by pPEG, so this change would ripple into other code and documentation - it needs some thought.

I’m thinking the primary purpose would be to let downstream processing of the ptree reference the input text for things like error messages. Do you have any other use cases in mind?

The error location info may not be precise. The problem is akin to finding the clause at fault after a predicate fails - if there are several alternatives, which one do you pick? pPEG uses a heuristic based on the high-water mark of the “cursor position”, which seems to work most of the time, but it is just a heuristic.
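
For the curious, the bookkeeping amounts to remembering the furthest input position any alternative reached before failing and reporting the error there. A minimal sketch of the idea in Prolog (not pPEG’s actual code; note_failure/1 and report_error/1 are made-up names):

```prolog
:- dynamic high_water/1.

% Record a failed match attempt, keeping only the furthest ("high water")
% input offset reached so far.
note_failure(Pos) :-
    (   high_water(HW), HW >= Pos
    ->  true                          % an earlier attempt got at least as far
    ;   retractall(high_water(_)),
        assertz(high_water(Pos))
    ).

% Report the error at the high-water mark, showing the unparsed tail.
report_error(Input) :-
    high_water(Pos),
    sub_string(Input, Pos, _, 0, Rest),
    format("syntax error at offset ~d, before: ~w~n", [Pos, Rest]).
```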

My use cases are tools for analyzing and manipulating source - e.g., prettyprinting, type inferencing, refactoring. The functionality would be similar to read_term/2 option subterm_positions, but for arbitrary parsed input and not just Prolog terms (I think that subterm_positions was originally for an IBM Prolog to Quintus Prolog source translator).
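
For reference, subterm_positions reports a From-To character span for each subterm (term_position/5 terms for compounds), along these lines (illustrative, from memory):

```prolog
?- read_term_from_atom('foo(bar, 42)', Term, [subterm_positions(Pos)]).
Term = foo(bar, 42),
Pos = term_position(0, 12, 0, 3, [4-7, 9-11]).
```

Here 0-12 spans the whole term, 0-3 the functor name, and 4-7 / 9-11 the two arguments.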

My specific case is Python - its new PEG parser has obsoleted the old lib2to3 parser, which generated an AST (or CST) with source information (the tree could then be used to recreate the source exactly; it was originally designed for simple refactoring, such as changing print x,y,z statements to print(x,y,z) function calls).

Thanks. Just wanted to confirm: is a simple annotation specifying the source starting position sufficient for your purposes?

I think we’d want both source start and end position (the From-To that subterm_positions uses) … from that, the whitespace can also be derived.

Only terminal symbols need to have from-to information; for non-terminals, it’s easy to derive from-to from the children. (Whitespace information is a bit more difficult to derive; it’s probably best to do a separate pass over the tree to create a new tree with “prefix” whitespace information)

Terminal nodes already contain the matching text so the span is easy to calculate.
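
To illustrate, here’s a minimal Prolog sketch. It assumes a hypothetical annotated ptree in which each terminal node carries a From-To span alongside its matched text (my own invented representation, not pPEG’s); a non-terminal’s span then falls out of its first and last child:

```prolog
:- use_module(library(lists)).              % for last/2

% Hypothetical annotated ptree: terminal nodes carry a From-To span along
% with the matched text, e.g. num(0-1,"1"); non-terminals just hold children.
% A non-terminal's span is derived from its first and last child.

node_span(Node, From-To) :-
    Node =.. [_Name, From-To, _Text],       % annotated terminal node
    !.
node_span(Node, From-To) :-
    Node =.. [_Name, Children],             % non-terminal node
    Children = [First|_],
    last(Children, Last),
    node_span(First, From-_),
    node_span(Last, _-To).

% ?- node_span(exp([num(0-1,"1"), op(1-2,"+"), num(2-4,"42")]), Span).
% Span = 0-4.
```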

Note that the ptree result isn’t necessarily a complete syntax tree. Usually the desired result is some trimmed version, somewhat like the groups in a regular expression.

For example, literals don’t appear in the ptree, and neither does the implied whitespace matched by the _space_ rule. But all that can be controlled from the grammar definition, in particular by the rule naming conventions.

Only if the tokenizer preserves the source form exactly and doesn’t convert it (e.g., 0x10 vs 16, or, in Python or C, "abc" "def" vs "abcdef").

pPEG doesn’t really have a separate tokenizer and doesn’t convert anything - all terminal nodes are matching source strings. Any “conversion” must be done in a subsequent “tree walker” (there are no semantic actions in the grammar). The intent is to keep the grammar “pure” and portable.

Another way of thinking about it is that the node names (i.e., rule names) are semantic labels, and the node values are either the matching string (terminal nodes) or a list of child nodes (non-terminal nodes).
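
To make that concrete, here is a hand-written example (illustrative only, not actual pPEG output) for the input "1+42" under a toy arithmetic grammar, together with a trivial tree walker that does the conversion afterwards:

```prolog
% Illustrative ptree for the input "1+42": rule names are the semantic
% labels; terminal values are the matched source strings, non-terminal
% values are lists of child nodes.
%
%   exp([num("1"), op("+"), num("42")])
%
% Any conversion happens in a separate tree walker, e.g. an evaluator:

eval(num(S), N) :-                      % terminal: convert the source text
    number_string(N, S).
eval(exp([L, op("+"), R]), N) :-        % non-terminal: combine child results
    eval(L, NL),
    eval(R, NR),
    N is NL + NR.

% ?- eval(exp([num("1"), op("+"), num("42")]), N).
% N = 43.
```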