Deterministically parsing with library(dcg/high_order)

My question boils down to: if we do have sequence//2 (and also sequence//3 and sequence//5), how can we get the corresponding det_sequence//2? The naive implementation for the 2-argument version:

det_sequence(G, [H|T]) -->
    call(G),
    !,
    det_sequence(G, T).
det_sequence(_, []) --> [].

This seems to be a generalization of blanks//0, nonblanks//1, whites//0, digits//1, xdigits//1 from library(dcg/basics). Given that this is not available at the moment, I have been trying to explain it away with “it isn’t necessary” or “it has a fundamental defect”. Could someone who understands better try to elaborate a bit?

PS: What I think I understand now but did miss initially is that the motivation behind the meta-predicates in library(dcg/high_order) was to make it easier to generate from a fully instantiated list, and not parse. For parsing, one can still just cut after sequence//2, but with long enough input we might run out of memory before that. I am still not sure if this is a problem. One argument is that even if you don’t run out of memory because of the choice points you eventually run out of memory because of the parsed list?

Try this making the closure deterministic:

once_fun_dcg(F,X,I,O) :- call(F,X,I,O), !.

?- sequence(once_fun_dcg(F), L).

Here is an example:

foo(1) --> [a].
foo(2) --> [b].

?- length(L,3), phrase(sequence(foo,L), R).
L = [1, 1, 1],
R = [a, a, a] ;
L = [1, 1, 2],
R = [a, a, b] ;
L = [1, 2, 1],
R = [a, b, a] ;
L = [1, 2, 2],
R = [a, b, b] Etc...

?- length(L,3), phrase(sequence(once_fun_dcg(foo),L), R).
L = [1, 1, 1],
R = [a, a, a].

You can also use a deterministic foo2//1 in the first place, maybe easier:

foo2(X) --> foo(X), !.

?- length(L,3), phrase(sequence(foo2,L), R).
L = [1, 1, 1],
R = [a, a, a].

Not sure about that. It isn’t the cut in foo2//1 that made this deterministic, it is the nonvar L that did it. Check out the definition of sequence//2 and you will see.

You looked up the wrong sequence implementation.

There is no nonvar/1 anywhere:

sequence(OnElem, List) -->
    sequence_(List, OnElem).

sequence_([H|T], P) -->
    call(P, H),
    sequence_(T, P).
sequence_([], _) -->
    [].

You can check yourself:

https://www.swi-prolog.org/pldoc/doc/SWI/library/dcg/high_order.pl?show=raw

You looked up:
sequence(:Start, :Element, :Sep, :End, ?List)// is semidet.
But I used:
sequence(:Element, ?List)// is nondet.

I guess you are aware, that inside SWI-Prolog docu, the “//” means add 2 to the arity.

I am still not sure. There isn’t nonvar in the definition, but you made it nonvar with your length(L, 3). I did already check everything, while I am not convinced that you did, since you reply so incredibly fast. Although I do admit I am at least an order of magnitude slower than you. Here for example:

?- [user].
|: foo(1) --> [a].
|: foo(2) --> [b].
|: foo2(X) --> foo(X), !.
|: ^D% user://1 compiled 0.01 sec, 3 clauses
true.

?- use_module(library(dcg/high_order)).
true.

?- phrase(sequence(foo2, L), [a, a, a]).
L = [1, 1, 1] ;

You can also try this:

?- length(R,3), phrase(sequence(foo,L), R).
R = [a, a, a],
L = [1, 1, 1] ;
R = [a, a, b],
L = [1, 1, 2] ;
R = [a, b, a],
L = [1, 2, 1] Etc..

?- length(R,3), phrase(sequence(once_fun_dcg(foo),L), R).
R = [a, a, a],
L = [1, 1, 1] ;
false.

?- length(R,3), phrase(sequence(foo2,L), R).
R = [a, a, a],
L = [1, 1, 1] ;
false.

Now L is not anymore nonvar. Same result. Only problem
there is now a spurious choice point. I don’t know where this
spurious choice point comes from…

See my edited reply above. You are now showing the same thing I guess, here:

?- length(R,3), phrase(sequence(foo2,L), R).
R = [a, a, a],
L = [1, 1, 1] ;
false.

The choice point comes from the same place where all the choice points come from. The lack of choice points when the second argument of sequence//2 is a proper list is because of the special-case indexing for [_|_] vs [] in the first argument. This is why the arguments are reordered in the definition of sequence//2, which I will now paste here so that there is no confusion to those of us who are slower:

sequence(OnElem, List) -->
    sequence_(List, OnElem).

sequence_([H|T], P) -->
    call(P, H),
    sequence_(T, P).
sequence_([], _) -->
    [].

Thank you, now you understand what I asked in the first place. I apologize that I am so bad at explaining.

That is right. As a serializer, the non-determinism is not a problem. For parsing one typically has to be keen using cuts to get a realistic parser. Without, a parse failure typically will cause the program not to terminate in a realistic time and even a syntactically valid input may cause resource overflows. The sequence//3 use the Separator to prune choice points. Even that is dubious as using non-determinism to add a separator to an item may be desired in some cases.

1 Like

Well the lack of example was the source of the confusion. From this other thread, then the query in this post shows the problem with the default settings.

With the implementation by @brebs, here their source in full:

:- use_module(library(dcg/basics)).

split_list(Len, Lst, LstSplit) :-
    must_be(positive_integer, Len),
    phrase(split(LstSplit, Len), Lst).

split([H|T], Len) --> list_length(H, Len), !, split(T, Len).
% Terminate at end of list
split([H], _) --> [H], !.
split([], _) --> [].

list_length(L, Len) --> { length(L, Len) }, string(L).

Then this query takes a while but spits out an answer:

?- numlist(1, 10000000, L), time(split_list(3, L, S)).
% 30,000,134 inferences, 3.076 CPU in 3.341 seconds (92% CPU, 9753696 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
S = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20|...], [22|...], [...|...]|...].

If you try to use sequence//2 like I suggested in this other post:

?- numlist(1, 10000000, L), time(phrase(sequence(length_list(3), S), L)).
% 42,111 inferences, 2.201 CPU in 3.319 seconds (66% CPU, 19131 Lips)
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 0.8Mb, global: 0.8Gb, trail: 33Kb
ERROR:   Stack depth: 4,232, last-call: 0%, Choice points: 4,217
ERROR:   In:
ERROR:     [4,232] user:length_list(0, [length:1|_60001076], [length:9,987,370], _60001070)
ERROR:     [4,228] dcg_high_order:sequence_([length:1|_60001118], <compound (:)/2>, [length:9,987,373], [])
ERROR:     [4,227] dcg_high_order:sequence_([length:2|_60001166], <compound (:)/2>, [length:9,987,376], [])
ERROR:     [4,226] dcg_high_order:sequence_([length:3|_60001214], <compound (:)/2>, [length:9,987,379], [])
ERROR:     [4,225] dcg_high_order:sequence_([length:4|_60001262], <compound (:)/2>, [length:9,987,382], [])
ERROR: 
ERROR: Use the --stack_limit=size[KMG] command line option or
ERROR: ?- set_prolog_flag(stack_limit, 2_147_483_648). to double the limit.

Thank you for your answer. Let me ask it differently: given that there are several predicates in dcg/basics that follow the same pattern (whites//0, digits//1 etc) is there a point in providing a meta-predicate specifically for deterministic parsing? And would it rather belong to dcg/basics since its motivation is

We need a comprehensive set of generally useful DCG primitives.

In my slowly growing personal collection of “generally useful code” I have exactly det_sequence//2,3,5 with the naive implementation I showed above. It is useful like once a high noon.

I dug it out, apparently I only have it for 2 and 3 arguments, it looks like this:

$ cat detparse.pl
:- module(detparse, [det_sequence//2,
                     det_sequence//3]).
:- meta_predicate det_sequence(3, -, ?, ?).
:- meta_predicate det_sequence(3, //, -, ?, ?).

det_sequence(Goal, [X|Xs]) -->
    call(Goal, X),
    !,
    det_sequence(Goal, Xs).
det_sequence(_, []) --> [].

det_sequence(Goal, Sep, [X|Xs]) -->
    call(Goal, X),
    det_sequence_1(Goal, Sep, Xs).

det_sequence_1(Goal, Sep, [X|Xs]) -->
    Sep,
    call(Goal, X),
    !,
    det_sequence_1(Goal, Sep, Xs).
det_sequence_1(_, _, []) --> [].

I am not sure about the names nor about the implementation.

That is exactly my worry :slight_smile:

1 Like

Point taken :slight_smile:

This is indeed the way to use sequence//2 for parsing; just cut after the call. In practice it will work fine for realistic input. If the input is endless/unbounded (not sure about the terminology) then we have library(pure_input) and library(intercept). Those two together work really well.

1 Like

What is unrealistic input, never heard of that? You can try yourself:
A regex /s.*o/ and some realistic input.

Should this match:

From 2008 on stackoverflow helps

Or should this match:

From 2008 on stackoverflow helps

What would sequence//2 do? How would you even translate
these two matching problems to Prolog? With what library exactly?

The resulting list of the parse grows with the size of the input. I simply meant that you can guess when you write your code using sequence//2 that it will fit in memory. If you cannot make that guess then you shouldn’t be putting it in a list like this.

It should obviously match greedily. I don’t understand the rest of your questions to be honest.

I am unlimmited motivated to explain the matter. But my motivation
is no guarantee that anything I write makes sense. How can I help?

I forgot to add to my question:

What would det_sequence//2 do? How can it be used
to realize the regex /s.*o/. Is this even possible?

My ability to comprehend is probably the bottleneck.

I shared a possible implementation above. As far as I can tell, you cannot use it for matching this regex. Instead, you’d need sequence//2 as defined at the moment, and a cut. So:

any(X) --> [X].

'/s.*o/' --> "s", sequence(any, _), "o", !.

and then:

?- phrase('/s.*o/', `stackoverflow`, Rest).

And would you be so kind to stop deleting so many of your posts. It makes it even more difficult to follow what it is that we are talking about.

What about, any//2 will not work with det_sequence//2, but maybe this here will do:

any_but_o(X) --> \+ "o", [X].

'/s.*o/_lazy' --> "s", det_sequence(any_but_o, _), "o", !.

Interestingly this should give the lazy parsing and not the greedy parsing.

Edit 18.04.2022:
Yes indeed, its the lazy one, stops at the first “o” and not at the last “o”:

?- set_prolog_flag(double_quotes, codes).
true.

?- phrase((sequence(any,L), '/s.*o/_lazy', sequence(any,R)), 
"From 2008 on stackoverflow helps"), atom_codes(A, L), atom_codes(B, R).
A = 'From 2008 on ',
B = 'verflow helps' .

I think that any_but_o is already implemented as string_without//2. But this is indeed “lazy”, using again the examples from the link that you shared then deleted. Like this:

?- portray_text(true).
true.

?- phrase('/s.*o/_lazy', `stackoverflow`, Rest).
Rest = `verflow`.

?- use_module(library(dcg/basics)).
true.

?- phrase(("s", string_without("o", _), "o"), `stackoverflow`, Rest).
Rest = `verflow`.

But we got lost again. sequence//2 as it is defined at the moment is does, exactly as advertised in the docs, greedy matching, and then on backtracking “goes back”; so it does more than a regex can do:

?- phrase(("s", sequence(any, L), "o"), `stackoverflow`, Rest).
L = `tackoverfl`,
Rest = [119] ; % hehe
L = `tack`,
Rest = `verflow` ;
false.