Best way to weave list processing predicates?

I’m getting stymied by a task that I’d find simple in other languages, but I just don’t know what the best way is to implement it in Prolog. Specifically, I want to perform several processing steps on a stream of data, each of which may or may not change the cardinality of the dataset, and which may or may not be deterministic, and I’d like them to operate without having to hold the entire stream (or, worse, multiple copies of the stream) in memory. Basically, I want to do the following:

process_stream(List0, Processed) :-
    transformation1(List0, List1), % no cut
    transformation2(List1, List2), % no cut
    transformation3(List2, Processed).

only, without storing the whole list in between. If all the streams had the same cardinality, I could just do:

process_stream([H0|T0], [HP|TP]) :-
    transform1(H0, H1),
    transform2(H1, H2),
    transform3(H2, HP),
    process_stream(T0, TP).
process_stream([], []).

but that only works if the lists remain in lockstep, and each can process a single element at a time, without context.

In, say, Python or C#, I could just use generators/iterators, with each transformer consuming the elements it needs before yielding whatever values it can produce to its output. I’ve glanced at library(lazy_lists), but that really seems geared only towards deterministic processes, and especially those that can opportunistically pull in a whole block of data at once. Basically, for reading from file/network. And honestly, constraint programming in general seems pretty ill-suited to this, because unless you are dealing with a det/semidet source and you use extra-logical storage for stream blocks, it only ever works in post-hoc mode. Basically, if you had this definition for transform2:

transform2(foo, bar).
transform2(apple, banana).
transform2(bear, eel).

and you called it with a delayed/attributed variable in the first argument that would eventually resolve to ‘bear’, Prolog would have no choice but to try unifying it with foo, calling the unification hook, failing to unify foo = bear, undo the H1 = foo unification, try it again with H1 = apple, and so on. Not precisely the most practical or efficient way to go about things.

So, how do I do this in Prolog? And, to extend it a little, how do I do it with DCGs, which are even more limited in their calling convention?

1 Like

Pehaps Coroutining using Prolog engines.

HTH

1 Like

Yeah, I was thinking about that, though I wasn’t sure whether the context-switching would impair efficiency too much. Also, would there be any way to feed input to a DCG from a pengine yield? Also, how would you avoid extra backtracking (if the upstream predicate produces more stream elements and choicepoints than necessary) or overenthusiastic consuming (if the downstream transformers end up instantiating variables that haven’t been populated by upstream yet)?

If they don’t have the same size, just use a DCG transducer.
Example replace 3 a’s by one a, and replace one b by two b’s.
Otherwise just carry over the data:

transduce,[a]  --> [a,a,a], transduce.
transduce,[b,b] --> [b], transduce.
transduce,[X] --> [X], transduce.
transduce --> [].

Works smoothly:

?- transduce([c,a,a,a,a,b,b,c], X).
X = [c, a, a, b, b, b, b, c] .

Edit 21.07.2021:
You can also use listing to see how the DCG push back (=semi context
in the ISO DCG draft) is translated, in case you want to hand
craft a transducer:

?- listing(transduce/2).
transduce([a, a, a|A], B) :-
    transduce(A, C),
    B=[a|C].
transduce([b|A], B) :-
    transduce(A, C),
    B=[b, b|C].
transduce([A|B], C) :-
    transduce(B, D),
    C=[A|D].
transduce(A, A).

The (=)/2 can be eliminated, thats just a DCG artefact. If you
move the (=)/2 into head, it gets tail recursive.

1 Like

!! I’ve never seen this use of pushback lists! How does it avoid double-processing the expansions and getting into an infinite recursion?

There are two rules for DCG rule transformation:

Rule 1: H, P --> B
Rule 2: H --> B

Just check the ISO DCG draft reference implementation
curated by Ulrich Neumerkel:
https://www.complang.tuwien.ac.at/ulrich/iso-prolog/dcgs/dcgs_exp.pl

…wait, that’s not actually a difference list, is it? That’s using the DCG List\Tail syntax but with two completely unrelated lists, isn’t it? Mind blown. :astonished: :exploding_head:

The difference list view is also helpful, you can factor out the
transduce step as follows:

transduce2(L, R) :-
    transduce2_step(L, H, R, J),
    transduce2(H, J).
transduce2(A, A).

transduce2_step([a, a, a|A], A, [a|C], C).
transduce2_step([b|A], A, [b, b|C], C).
transduce2_step([X|A], A, [X|C], C).

transduce2_step/4 basically takes two difference lists. Works also:

?- transduce2([c,a,a,a,a,b,b,c], X).
X = [c, a, a, b, b, b, b, c] .

I did not manage to do exactly this with DCG. But there are also special transducer DCGs around. Or you can do it differently with DCGs as well, without push back but non-terminal arguments works also.

Some such stuff is/was used in natural language processing:

2.3 Morphological Analysis with Finite State Transducers
https://cs.union.edu/~striegnk/courses/nlp-with-prolog/html/node14.html#l2.morphology

I have not used this ever, so I won’t be giving any other info. :neutral_face:

DCG have a marketing problem. If they had a media coverage like Haskell monads have, more people would use it monadic style. Monad is anyway a rubber word, that covers practically anything.

So how does it feel that you saw an alternative use of the hidden DCG arguments?
Ok I see mind blow. Yes indeed. BTW: I asked the same already @grossdan .

To learn more read the section named " Implicitly passing states around" in this tutorial

See, that part I understand (and I used that tutorial myself, way back when I was learning DCGs), and I’ve actually used that syntax myself from time to time. The problem is that you can’t do that and do list-processing of any significant complexity (you know, the thing DCGs are ostensibly used for when they’re not being used to emulate procedural languages) in the same DCG unless you’re willing to jump through some pretty significant syntactic hoops and lose some runtime efficiency, because the states get put into the same list as the data.

To illustrate, here’s what happens to that tree example if, instead of counting leaves in a compound term, we wanted to count newline characters in a list of charcodes:

state(S), [S] --> [S].
state(S0, S), [S] --> [S0].

count_newlines(Codes, Newlines) :-
    phrase(nlcount, [0 | Codes], [Newlines]).

nlcount --> checknl, nlcount.
nlcount --> [], state(N), {format('Found ~p newlines~n', [N])}.

checknl --> state(N0, N), `\n`, {N #= N0 + 1}.
checknl --> [NonNewline], {NonNewline \= 0'\n}.

That’s not gonna work nearly as well as you might want it to, because the `\n` isn’t going to match a character code in the Codes list; it’s going to match whatever the number was that was generated by state(N0, N). Basically, if you’re mixing state variables and data in a DCG, then every nonterminal that you use has to understand that there’s gonna be a state that comes first before the data.

In any case, though, this is kinda beside point when it comes to the original question I wanted help with. I’m not trying to maintain a state through a single complex DCG, I’m trying to get multiple list-processing predicates (which may or may not be DCGs) to work well together.

1 Like

I see, you may want to try the EDCG package maintained by @peter.ludemann, it solves exactly this problem as it allows you to have DCGs handling separate lists.

This is something of an unsolved problem for me too. By unsolved I mean that I don’t really know how to do it efficiently, with clean obvious code, entirely from Prolog. But my understanding is that you want something equivalent to Unix pipes, without giving away multiple solutions, is that right? Like,

would be actually (in some unix shell):

< List0 transformation1 \
        | transformation2 \
        | transformation3 \
        > Processed

… but you also want to keep the non-determinism? Meaning that any of the transformations can succeed more than once, with different lists pushed down to the next transformation, and the whole pipeline would succeed more than once with different solutions?

Do you want the transformations to be unaware of each other? In other words, that you can write two parts of the pipeline ignorant of each other, but you can chain them if they happen to produce/consume compatible elements?

EDIT: I strongly suspect that Prolog Engines as suggested by @EricGT above is what will get you the closest to what I described above. Not sure yet if this is what you are after, though. Also haven’t myself tried to see how this combines with non-determinism. I have this suspicion that long lists + non-determinism somewhere along the list requires buffering and can blow up really fast.

That is not really true. Lazy lists (and pure input, where it all started) allow processing external data (typically using a DCG) as it comes in. GC discards all already processed data before the oldest point to where we can still backtrack. If your DCG is fully nondet there is little choice but holding the entire list in memory.

Otherwise it is an interesting problem. Your comparison to Python and C# seems a little unfair to me as it is precisely the required non-determinism that makes this problem hard to deal with. If we do not have non-determinism that needs to be communicated between the layers we have at least one theoretical option: use a thread for each layer, make one layer send the elements to the message queue of the next thread and there use a lazy list to process the input using a DCG. That is pretty much the shell analogy mentioned in this thread.

If you need non-determinism between the steps it gets nasty. The trivial solution is to merge the steps into one big DCG. Often not so elegant :frowning: If there is another solution (I don’t know) I’d suspect it to use delimited continuations or, as mentioned, engines.

Here is yet another solution of the transducer, without push back this time.
This is possibly the most modular solution, since could define transduce3_step//2
also via DCG. Its a variation of the idea of atom//1 in library(http/dcg_basic):

transduce3(I, O) -->
    transduce3_step(I, H),
    transduce3(H, O).
transduce3(I, I) --> [].

transduce3_step([a|C], C) --> [a, a, a].
transduce3_step([b, b|C], C) --> [b].
transduce3_step([X|C], C) --> [X].

Works as expected:

?- transduce3(X, [], [c,a,a,a,a,b,b,c], []).
X = [c, a, a, b, b, b, b, c] 

The variation is that atom//1 has only one argument, whereby here we have
two arguments, a difference list, which is threaded parallel to the
hidden DCG arguments.

Here is a solution with Prolog Engines, the enum//1 is a Prolog Engine. It only works with ground data, if the transformed elements contain interacting variables, these binds get lost during findall/3.

So Prolog Engines are for example not suitable for WAM compilers etc… that use holes. This is unlike Python, Java, etc… where iterators do not require a copy_term/2 for their result.

But thread queues in Prolog require a copy, so this is what is simulated:

transduce4(L, R) :-
   findall(X, enum(X, L, _), H),
   flat(H, R).

enum(Y) --> enum_step(X), !, ({Y=X}; enum(Y)).

enum_step([a]) --> [a, a, a].
enum_step([b,b]) --> [b].
enum_step([X]) --> [X].

flat([X|Y], L) :-
   flat_list(X, L, R),
   flat(Y, R).
flat([], []).

flat_list([X|Y], [X|L], R) :-
   flat_list(Y, L, R).
flat_list([], L, L).

You can also do the flat/2 with an extra argument for the Prolog Engine handle, and then tail recursively repeatedly ask the Prolog Engine for answers. But the copy problem will be still there.

Works as expected:

?- transduce4([c,a,a,a,a,b,b,c], X).
X = [c, a, a, b, b, b, b, c].

This has been my go to solution. I wonder what is not elegant about it. With good enough building blocks, as available in library(dcg/basics) for example, the “one big DCG” is not so big after all.

Lately I have been doubting the idea that everything must be composable using a particular pattern. In my experience, too often this creates too much overhead in terms of design, initially, and then programming the adapter code. It ends up more confining than liberating.

1 Like

I do not recommend using DCG with non-determinism, like in the Protobuf example. You can easily measure the negative impact of spurious choice points. Take this naive reverse with spurious choice points:

rev([], []).
rev([X|Rest], Ans) :- rev(Rest, L), concatenate(L, [X], Ans).
rev(_, _) :- fail.

concatenate([], L, L).
concatenate([X|L1], L2, [X|L3]) :- concatenate(L1, L2, L3).
concatenate(_, _, _) :- fail.

Benchmarking shows, spurious choice points slow down nrev by a factor >2x:

/* without spurious choice points */
user:nrev	in 93	(0 gc) ms

/* with spurious choice points */
user:nrev	in 234	(0 gc) ms

So the kind of library(dcg/basics) should be quaranteened. Or put into a poison cabinet, and the key then thrown away. Or write tutorial that show to write DCG without any spurious choice points.

Languages that are for example LL(1) do not need any choice point.

This is really unnecessary. There isn’t any apparent unnecessary non-determinism in this library, and if you find it, you should report it or suggest an improvement.

This has been written. There is a chapter about this in “The Craft of Prolog”. Maybe we can kindly ask Richard O’Keefe if we can use his chapter and make a tutorial based on it available on the internet.