Term rewriting and layout

I’m rethinking source code rewriting as currently provided by term_expansion/2 and goal_expansion/2 (conditional compilation and program transformation). One of the challenges is how to deal with the layout information that is needed by the source level debugger, as well as for reporting positions from code analysis based on prolog_walk_code/1 (the Prolog code walker).

This source info is provided by the subterm_positions option of read_term/2,3 (inherited from Quintus and extended over the years), which produces a term that describes the exact layout of the term that was read. This term is possibly converted using the above hooks and subsequently asserted (compiled). The compiler may do some additional transformations, in particular when optimization is enabled.
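To make this concrete, here is what such a layout term looks like for a small example (output annotated by hand; see the read_term/2,3 documentation for the full set of position shapes):

```prolog
?- term_string(T, "foo(bar, baz(1))", [subterm_positions(Pos)]).
T = foo(bar, baz(1)),
Pos = term_position(0, 16,    % the whole term spans characters 0..16
                    0, 3,     % the functor foo spans 0..3
                    [ 4-7,                                   % bar
                      term_position(9, 15, 9, 12, [13-14])   % baz(1)
                    ]).
```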

Relating back to the source is facilitated by term_expansion/4 and goal_expansion/4, which translate the layout information along with the term (clause). In addition, clause_info/5 has a large number of fuzzy matching rules that try to re-establish the layout, compensating for compiler transformations.
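For illustration, a hedged sketch of what such a layout-aware hook looks like (the rewrite itself is a made-up example; the point is that the hook must return a layout term matching the rewritten clause):

```prolog
%  Hypothetical layout-aware expansion: rename the head of
%  f/1 clauses while keeping the layout term in sync.
term_expansion((f(X) :- Body), Layout0,
               (f_checked(X) :- Body), Layout) :-
    %  Layout0 describes (Head :- Body).  We reuse the body
    %  layout unchanged; the head layout is now only
    %  approximate, because f was replaced by f_checked.
    Layout0 = term_position(F, T, FF, FT, [HeadPos, BodyPos]),
    Layout  = term_position(F, T, FF, FT, [HeadPos, BodyPos]).
```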

This works for maybe 90% of the cases :frowning: It seems SICStus and ECLiPSe are doing something similar; at least they also provide term/macro expansion alternatives that may translate the source layout in addition to the term itself.

If you look at boot/expand.pl (GitHub Source) and boot/dcg.pl (GitHub Source) you will probably get pretty unhappy: the code dealing with updating the layout complicates the term translation enormously. Getting rid of that simplifies the system and provides good source info for user expansions, without users having to go through the same nightmare.

So, I think we must find something better. One of the options I see is to find a mapping from subterms in the input to subterms in the output, looking for a non-overlapping set of mapping relations that covers as large a part of the input and output terms as possible. With this in place we can migrate the layout information. This is a heuristic process, but we can make the result more precise by using explicit mappings provided by the user, i.e., by providing an API to the term translator that says “I have translated this subterm into that”. Some of this we can collect automatically: goal_expansion/2 takes a subterm from the input and produces a subterm in the output, so we know that relation.
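A minimal sketch of such a heuristic, under my own assumptions (the predicate names are hypothetical; only the From-To and term_position/5 layout shapes are handled): reuse the layout of any output subterm that occurs verbatim in the input, and leave everything else with an unknown position.

```prolog
:- use_module(library(lists)).

%  migrate_layout(+In, +LayoutIn, +Out, -LayoutOut)
%  Copy layout from In to the rewritten term Out where subterms
%  match; unmatched subterms get an unbound (unknown) position.
migrate_layout(In, LayoutIn, Out, LayoutOut) :-
    (   find_subterm(In, LayoutIn, Out, Found)
    ->  LayoutOut = Found
    ;   compound(Out)
    ->  compound_name_arity(Out, _, Arity),
        findall(AL,
                ( between(1, Arity, I),
                  arg(I, Out, Arg),
                  migrate_layout(In, LayoutIn, Arg, AL)
                ),
                ArgLayouts),
        LayoutOut = term_position(_, _, _, _, ArgLayouts)
    ;   true                          % atomic; position unknown
    ).

%  find_subterm(+In, +Layout, +Out, -Found)
%  Succeeds if Out occurs verbatim in In, returning the layout
%  of the first such occurrence.
find_subterm(In, Layout, Out, Layout) :-
    In == Out, !.
find_subterm(In, term_position(_, _, _, _, ArgLayouts), Out, Found) :-
    compound(In),
    arg(I, In, Arg),                  % SWI enumerates I here
    nth1(I, ArgLayouts, ArgLayout),
    find_subterm(Arg, ArgLayout, Out, Found).
```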

Does this make sense? And if so, does this relate to prior art from where we can steal (part of) an implementation? Any better ideas?

1 Like

Partly.

For myself to better understand this I would have to dig in and do some work with the various predicates, to see what they do and also understand why they do it. I do have experience with parsing and pretty printing. At the simplest level, where you just remove the white space and later reintroduce it based on a set of rules, this is easy. When it gets more complicated, like simplifying an algebra equation, rewrite rules come into play; but before those rules are applied, the input is converted to an AST (Abstract Syntax Tree), and the rules operate on the AST rather than on the input stream. The AST can then be converted back into syntax and output as a stream with position information.

So one thought is that during the transformations of the AST the position information must remain intact. Now, if during term rewriting a branch of the tree is removed and a new one added, the question becomes: how does that affect the position information when the AST is converted back to a stream? Do the rewrite rules also insert position information as a variable, with a later step recalculating all of the positions for the entire AST, and not just the modified branches? (See the sketch at the end of this post.)

Or does one forgo the position information and instead rely on attributes of the data to reconstruct it? E.g., does a given operator get a newline after the operator itself, or after the statement that follows it?

I don’t have experience with either of these ideas for rewriting the AST while preserving the formatting; I’m just thinking out loud so that others know you don’t have to be an expert in this area to add some feedback.
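To make the first alternative concrete, here is a minimal, hypothetical sketch (a Prolog term already is the AST): rewrite rules leave positions unbound, and a single later pass recomputes every position from the printed width of each subterm. Only canonical f(A,B) syntax is handled; operators, spacing and comments are ignored.

```prolog
%  assign_pos(+Term, +From, -Pos, -End)
%  Recompute positions for a whole (rewritten) term.  Pos mirrors
%  the shapes used by subterm_positions: From-To for atomic terms
%  and variables, term_position/5 for compound terms.
assign_pos(Term, From, From-To, To) :-
    ( atomic(Term) ; var(Term) ), !,
    token_length(Term, Len),
    To is From + Len.
assign_pos(Term, From, term_position(From, To, From, FT, ArgPos), To) :-
    compound_name_arguments(Term, Name, Args),
    token_length(Name, NLen),
    FT is From + NLen,
    Open is FT + 1,                   % skip '('
    assign_args(Args, Open, ArgPos, AfterArgs),
    To is AfterArgs + 1.              % skip ')'

assign_args([], Here, [], Here).
assign_args([A], Here, [P], End) :- !,
    assign_pos(A, Here, P, End).
assign_args([A|As], Here, [P|Ps], End) :-
    assign_pos(A, Here, P, AfterA),
    Next is AfterA + 1,               % skip ','
    assign_args(As, Next, Ps, End).

token_length(T, Len) :-
    term_to_atom(T, A),
    atom_length(A, Len).
```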

Thanks. I’m not really trying to maintain the layout to be able to emit the rewritten AST nicely, although that would be an interesting use case as well. For now I want the source level debugger to show the proper goal location, and I want code analysis that has something to say about some term to be able to pinpoint the exact position of that term.

My input is a term and a matching layout representation. Then some magic happens to the term (term/goal expansion, the compiler removing or modifying subgoals) and I want a layout term that matches the structure of the rewritten term as accurately as possible and carries as much of the layout information as possible. That is, of course, not always possible. Notably, if something is added to the term it can be hard to represent this properly. A nice example is the functional rewrite, which rewrites e.g. g(A.B) into .(A,B,C), g(C). In that case (as is now done explicitly) we would like to associate .(A,B,C) with the location of A.B and g(C) with the location of g(A.B).
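Spelled out with hypothetical character offsets for the input text g(A.B):

```prolog
%      g(A.B)
%      012345       (0-based character offsets; end offset 6)
%
%  Input layout:
%    g(A.B)  spans 0..6, with functor g at 0..1
%    A.B     spans 2..5
%
%  Desired layout after the functional rewrite:
%    .(A,B,C)  gets the position of A.B     (2..5)
%    g(C)      gets the position of g(A.B)  (0..6)
```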

Possibly the above can work using general matching; possibly it really requires additional heuristics.

2 Likes

Thanks. Now I have a clearer idea of what to think about.

There is some “prior art” from Python, originally designed for transforming Python2 to Python3. It consists of an augmented AST and pattern matching rules against the AST. (The pattern matching is, of course, an attempt to reinvent Prolog, and badly.)

The augmented AST has position information for each node, either as a token or derived from the tokens “contained” by a node. In addition, each token has a “prefix”, which is the whitespace (including comment text) preceding it – a non-token node can derive its prefix from the left-most token within it. The source can be trivially synthesized from the augmented AST (the token position information is redundant and is ignored when synthesizing the source from a modified AST).

One difference between Python’s AST and term_position and friends: things like parentheses and commas show up as tokens in the Python AST, whereas they are implied by term_position.

My guess is that the Python augmented AST isn’t useful as a design, but it might give you some more ideas.

(Other prior art: Richard O’Keefe invented something for translating IBM Prolog to Quintus Prolog, but I’ve forgotten the details … the only thing I remember is that he used one-way unification for matching in his transformations.)

This is (again) primarily using the layout for refactoring. For refactoring @edison created the refactor pack and we extended the original Quintus info to find embraced terms. As is, you can use SWI-Prolog’s layout information to make exact copies of the input, although you need the original input to turn the reported ranges into source text. For example, it will tell you 0'a is 97 and three characters long, but to get the 0'a back there is little choice but fetching the indicated character range from the input. These days that is easy and fast: just load the input into a string and use sub_string/5. You still may need to do some searching to find the full-stop, etc. SWI-Prolog’s read_term/3 can return comments and their positions.
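For instance (the file name and the character range for the 0'a example are made up):

```prolog
%  Fetch the verbatim text of a subterm, given the character
%  range reported by subterm_positions.
?- read_file_to_string('input.pl', Source, []),
   From = 12, To = 15,            % hypothetical range of 0'a
   Len is To - From,
   sub_string(Source, From, Len, _, Text).
Text = "0'a".
```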

Keeping the layout separate from the AST is rather impractical. Sometimes I would love to have the option to attach hidden information to any term …

As I explained to @EricGT though, this isn’t really what I am after (although it would be nice). I want a fair layout term that represents a compiled clause. For the debugger it is good enough to have the head and subgoal locations. For code linting more detailed information can be nice, but is often not really necessary. Just look at boot/dcg.pl to understand why translating the position along with the DCG is not a good idea :frowning: Besides, SWI-Prolog does not store the layout info except for the start position of the term. The debugger re-reads the term and re-runs the transformations to establish the full source information. This has been criticized, but if what I’m after can be proven to work, I think the advantage becomes even bigger (less space, faster compilation and no need to decide in advance between compiling for source level debugging or not).

Hi,
If somebody wants to have a look: in my refactor package (packages) I addressed some of the issues regarding a more exact representation of a term, since for refactoring that is crucial; otherwise you will cut/paste terms at the wrong positions. Just have a look at the module fix_termpos.pl, which does part of the job: it considers comments and fixes some minor position misplacements.

3 Likes

That’s the technique Turbo-Pascal used back in the days of 8MHz 8-bit computers and it worked fine on source files with 1000s of lines. :wink:
Why do people criticize this design? It seems quite elegant to me.

I know it’s not one of your design requirements, but if you extend the comments(...) option to include all whitespace, then it would be fairly straightforward to regenerate the source from the results of the Term, subterm_positions, and comments. (Details such as 0'a would need to be worked out, with some small extensions to subterm_positions, I think.)
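A minimal sketch of that regeneration, under the stated assumption (Gaps is the hypothetical extended comments data: a list of From-String pairs covering every non-token piece of the input, i.e. whitespace, comments, parentheses and commas; lexical variants such as 0'a are not handled):

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).
:- use_module(library(pairs)).

%  regenerate_source(+Term, +Layout, +Gaps, -Text)
%  Merge the synthesized token texts with the recorded gaps,
%  ordered by start offset, and concatenate.
regenerate_source(Term, Layout, Gaps, Text) :-
    fragments(Term, Layout, TokenFrags),
    append(Gaps, TokenFrags, Frags0),
    keysort(Frags0, Frags),
    pairs_values(Frags, Parts),
    atomics_to_string(Parts, Text).

%  fragments(+Term, +Layout, -Pairs)
%  From-TokenText for every token the layout describes (only the
%  From-To and term_position/5 shapes in this sketch).
fragments(Term, From-_To, [From-Atom]) :-
    term_to_atom(Term, Atom).
fragments(Term, term_position(_, _, FFrom, _, ArgLayouts), Frags) :-
    compound_name_arguments(Term, Name, Args),
    maplist(fragments, Args, ArgLayouts, ArgFragsList),
    append([[FFrom-Name]|ArgFragsList], Frags).
```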

One other piece of “prior art” is the Kythe system, which was designed at Google to allow large-scale source browsing, analysis, and refactoring. (It’s a second-generation system: an extension and regularization of the earlier Grok system.) It also seems to be commercialized by source{d} and Sourcegraph.

For “macros”, the Kythe schema has ref/expands and ref/expands/transitive edges. This allows, for example, a C++ macro substitution or template expansion to be properly processed at a call site.

I’m not very familiar with this part of the system; if it seems useful from a data modelling point of view, I can dig into it a bit more and give you some details and examples. I can also write up a brief summary of the Kythe data model.

(None of this should be interpreted to mean that I think the Kythe data model is a good one; but it has been used for some large-scale tools.)

2 Likes

Comments do include the white space inside the comment. Using subterm_positions, variable_names and comments you can fully recreate the term, including layout, except for lexical variations at the token level, such as different ways to write the same number. As you have the character range, you just fetch that from the original document. So far that has proved to be good enough.