When using a DCG on a binary file can the position be changed?

Using SWI-Prolog 8.x on Windows.

Can a DCG position be changed within an active DCG clause for a binary file? i.e. can one use seek/4 or similar?

In researching this I find that to change the position seek/4 is needed. seek/4 only works with streams and to work with DCGs with streams phrase_from_stream/2 is needed. However phrase_from_stream/2 is not aware of changes to position using seek/4 and thus the answer to the question is no.

Is this correct that this cannot be currently done?


I am aware that it is possible to use multiple DCGs and to pass the selected data into each DCG as needed but it would be nice just to change the position within the DCG.

The exact need for this is to parse Portable Executable (PE) files and to jump ahead to the export section once the offset is known.

I was going to suggest using a “skip” DCG like skip(N) --> { length(Skip, N) }, Skip. but I suppose that would only work for a relative offset, not jumping to an absolute position. For that to work then, maybe one could use an eDCG (or updating a global) plus some term-expansion magic to keep a count of the number of bytes read thus far, so the skip//1 technique would work. phrase_from_stream/2 makes a lazy list, so perhaps lazy_list_location//1 could make that easier…

Thanks. :slightly_smiling_face:

In the early parts of parsing of a PE file using DCGs was working because the data was laid out sequentially. Latter the data has pointers to null terminated structures (think null terminated strings) which requires either keeping multiple offsets (think the value for seek/4) and jumping forward and backward in the data so lazy list (think phrase_from_file/3) are out. Then to keep track of the different offsets for each DCG is making the code harder to understand for a novice.

One option I did not try but I think would work would be to push the data such as offset onto a stack and when accessing data at a new offset would push a new context so that the DCG works as expected. For those who are comfortable with such the code would make sense but as I want to reserve the option to release the code for public use would rather make it much simpler for the everyday programmer and setting the offset with a seek function/method is common knowledge in parsing binary data so for now will not use DCGs and instead base it on

You could use an alternative DCG implementation, based on file pointers.
Use a DCG that generates ‘C’/3 calls for terminals. Usually ‘C’/3 is implemented
as follows, and you usually start parsing a non-terminal s by calling s("text", ""):

'C'(X, [X|Y], Y).

But then you implement ‘C’/3 as follows, and you would start parsing
a non-terminal s by calling s(0, Len) where Len is the file length:

'C'(Pos, Byte, Pos2)  :- 
      set_stream_position(mystream, Pos),   /* ISO Core Standard */
      get_byte(mystream, Byte),             /* ISO Core Standard */
      Pos2 is Pos+1.

It shouldn’t be that slow, if set_stream_position/2 is optimized, to
do no work, when we are already at Pos. And get_byte/2 could be
also quick, if the stream is for example buffered.

You need to write your grammar with as less as possible choice points,
so some DCG idioms that use backtracking, and that are sometimes
seen in the wild, are not recommended . But with the above you can

freely seek in the file at any moment of time, backward and forward.

Edit 25.09.2022
You could implement these DCG helpers seek_relative//1 and seek_absolute//1:

seek_relative(Delta, Pos, Pos2) :-
     Pos2 is Pos+Delta.

seek_absolute(Pos, _, Pos).

Don’t we need to revert the seek on backtracking?

Anyway, SWI-Prolog’s DCG translation does not involve a ‘C’/3 predicate, and thus this doesn’t work in SWI-Prolog. Of course you can use an alternative DCG translation.

For PE files for the case in question the short answer is no, reverting the file position (seek) on backtracking is not needed.

For my case, which is to inspect the import and export functions/methods/predicates it is essentially just starting with the first structure at the start of file to get a pointer to the location to the next structure(s) and continuing the same until I get to the needed information. A new file position is typically given to the predicate for a structure and for the parts of a structure just keeping track of file position using the number of bytes read. Rather boring but being off by just one byte can lead to days of frustration if one does not know how to decode the file by hand.

While most of the ideas of parsing do apply to what needs to be done, it is not what I would really consider parsing in the sense of parsing code. It is really reading lots of structures given a file offset and structure to read. Then using the info in a structure to calculate a new file position and repeat. Many of the offsets in the PE file are not relative to the file but relative to the memory location after the data is mapped into memory. If one Googles for RVA with the word executable you will find such code by everyone who can write a blog on the subject.

You sure, how do you know? Already innocous DCG like:

s(foo) --> "foo", !.
s(bar) --> "bar", !.
s(baz) --> "baz".

Involves backtracking, and thus position revert. But its irrelevant

for the proposed ‘C’/3, since this can deal with position revert,
I am only currious what makes you think your grammar has
no position revert? You didn’t post the grammar, so we don’t know.

Also the above has small delta of the position revert, so if
your streams have a kind of sliding window buffering, do not
forget everything while advancing next buffer, could be of advantage.

Edit 25.09.2022
For example the PE format has many multi-byte keys:

This specification describes the structure of executable (image) files and object files under the Windows family of operating systems.
https://learn.microsoft.com/en-us/windows/win32/debug/pe-format

To avoid backtracking, you would need to do it like this:

s(X) --> [A,B,C], s2([A,B,C],X).

s2("foo", foo) :- !.
s2("bar", bar) :- !.
s2("baz", baz).

this is indeed a good approach, and a well known pattern for removing backtracking in DCGs.

A nitpick, I don’t think this works as is, at least not in SWI-Prolog. There is first a syntax problem, you need to either make s2 a DCG rule or wrap it in braces:

s(X) --> [A,B,C], s2([A,B,C],X).

s2("foo", foo) --> !.
s2("bar", bar) --> !.
s2("baz", baz).

This of course still doesn’t work for SWI-Prolog in particular because "foo" is a string and not a list of codes. You’d have to write it as:

s(X) --> [A,B,C], s2([A,B,C],X).

s2(`foo`, foo) --> !.
s2(`bar`, bar) --> !.
s2(`baz`, baz).

I personally prefer to avoid the cuts completely. One way is to make the atom explicitly; it sometimes works, depending on the exact situation. Here, you could do:

s(X) --> [A,B,C],
    {   atom_codes(X, [A,B,C]),
        s2(X)
    }.

s2(foo).
s2(bar).
s2(baz).

… but I admit this starts to look a bit hacky at some point. Avoiding backtracking in DCGs is an uphill battle and I don’t know if it is worth it. Maybe in isolated cases it is OK.

You can also read it, that in the present case, DCGs are useless.

You could also do:

s(X) :- get_byte(mystream, A),
        get_byte(mystream, B),
        get_byte(mystream, C), s2([A,B,C],X).

Also the Pos2 is Pos+1 in my ‘C’/3 suggestion isn’t really
needed, since a stream can have:

?- stream_property(mystream, position(Pos)).

But I am not used to the position approach in SWI-Prolog,
it seems that Pos is not simply a smallint or bigint, something
more complex. Which is unfortunately allowed by the ISO

core standard (7.10.2.8), and gives an additional challenge
of portability, and realizing helpers such as seek_relative//1
and seek_absolute//1, unless you have seek/4 as in SWI-Prolog,

but I don’t find this in the ISO core standard.

This comes from the Quintus/SICStus tradition. The idea is one can get and restore a stream position including line, line position and character position information. Compliant with SICStus there is stream_position_data/3 to get plain numbers.