Deterministically parsing with library(dcg/high_order)

Boris · April 18, 2022, 8:51am

My question boils down to: if we do have sequence//2 (and also sequence//3 and sequence//5), how can we get the corresponding det_sequence//2? The naive implementation for the 2-argument version:

det_sequence(G, [H|T]) -->
    call(G),
    !,
    det_sequence(G, T).
det_sequence(_, []) --> [].

This seems to be a generalization of blanks//0, nonblanks//1, whites//0, digits//1, xdigits//1 from library(dcg/basics). Given that this is not available at the moment, I have been trying to explain it away with “it isn’t necessary” or “it has a fundamental defect”. Could someone who understands better try to elaborate a bit?

PS: What I think I understand now but did miss initially is that the motivation behind the meta-predicates in library(dcg/high_order) was to make it easier to generate from a fully instantiated list, and not parse. For parsing, one can still just cut after sequence//2, but with long enough input we might run out of memory before that. I am still not sure if this is a problem. One argument is that even if you don’t run out of memory because of the choice points you eventually run out of memory because of the parsed list?

j4n_bur53 · April 18, 2022, 9:37am

Try this making the closure deterministic:

once_fun_dcg(F,X,I,O) :- call(F,X,I,O), !.

?- sequence(once_fun_dcg(F), L).

Here is an example:

foo(1) --> [a].
foo(2) --> [b].

?- length(L,3), phrase(sequence(foo,L), R).
L = [1, 1, 1],
R = [a, a, a] ;
L = [1, 1, 2],
R = [a, a, b] ;
L = [1, 2, 1],
R = [a, b, a] ;
L = [1, 2, 2],
R = [a, b, b] Etc...

?- length(L,3), phrase(sequence(once_fun_dcg(foo),L), R).
L = [1, 1, 1],
R = [a, a, a].

You can also use a deterministic foo2//1 in the first place, maybe easier:

foo2(X) --> foo(X), !.

?- length(L,3), phrase(sequence(foo2,L), R).
L = [1, 1, 1],
R = [a, a, a].

Boris · April 18, 2022, 9:39am

Not sure about that. It isn’t the cut in foo2//1 that made this deterministic, it is the nonvar L that did it. Check out the definition of sequence//2 and you will see.

j4n_bur53 · April 18, 2022, 9:41am

You looked up the wrong sequence implementation.

There is no nonvar/1 anywhere:

sequence(OnElem, List) -->
    sequence_(List, OnElem).

sequence_([H|T], P) -->
    call(P, H),
    sequence_(T, P).
sequence_([], _) -->
    [].

You can check yourself:

https://www.swi-prolog.org/pldoc/doc/SWI/library/dcg/high_order.pl?show=raw

You looked up:
sequence(:Start, :Element, :Sep, :End, ?List)// is semidet.
But I used:
sequence(:Element, ?List)// is nondet.

I guess you are aware, that inside SWI-Prolog docu, the “//” means add 2 to the arity.

Boris · April 18, 2022, 9:44am

I am still not sure. There isn’t nonvar in the definition, but you made it nonvar with your length(L, 3). I did already check everything, while I am not convinced that you did, since you reply so incredibly fast. Although I do admit I am at least an order of magnitude slower than you. Here for example:

?- [user].
|: foo(1) --> [a].
|: foo(2) --> [b].
|: foo2(X) --> foo(X), !.
|: ^D% user://1 compiled 0.01 sec, 3 clauses
true.

?- use_module(library(dcg/high_order)).
true.

?- phrase(sequence(foo2, L), [a, a, a]).
L = [1, 1, 1] ;

j4n_bur53 · April 18, 2022, 9:46am

You can also try this:

?- length(R,3), phrase(sequence(foo,L), R).
R = [a, a, a],
L = [1, 1, 1] ;
R = [a, a, b],
L = [1, 1, 2] ;
R = [a, b, a],
L = [1, 2, 1] Etc..

?- length(R,3), phrase(sequence(once_fun_dcg(foo),L), R).
R = [a, a, a],
L = [1, 1, 1] ;
false.

?- length(R,3), phrase(sequence(foo2,L), R).
R = [a, a, a],
L = [1, 1, 1] ;
false.

Now L is not anymore nonvar. Same result. Only problem
there is now a spurious choice point. I don’t know where this
spurious choice point comes from…

Boris · April 18, 2022, 9:51am

See my edited reply above. You are now showing the same thing I guess, here:

?- length(R,3), phrase(sequence(foo2,L), R).
R = [a, a, a],
L = [1, 1, 1] ;
false.

The choice point comes from the same place where all the choice points come from. The lack of choice points when the second argument of sequence//2 is a proper list is because of the special-case indexing for [_|_] vs [] in the first argument. This is why the arguments are reordered in the definition of sequence//2, which I will now paste here so that there is no confusion to those of us who are slower:

sequence(OnElem, List) -->
    sequence_(List, OnElem).

sequence_([H|T], P) -->
    call(P, H),
    sequence_(T, P).
sequence_([], _) -->
    [].

Boris · April 18, 2022, 9:52am

Thank you, now you understand what I asked in the first place. I apologize that I am so bad at explaining.

jan · April 18, 2022, 10:00am

That is right. As a serializer, the non-determinism is not a problem. For parsing one typically has to be keen using cuts to get a realistic parser. Without, a parse failure typically will cause the program not to terminate in a realistic time and even a syntactically valid input may cause resource overflows. The sequence//3 use the Separator to prune choice points. Even that is dubious as using non-determinism to add a separator to an item may be desired in some cases.

Boris · April 18, 2022, 10:09am

Well the lack of example was the source of the confusion. From this other thread, then the query in this post shows the problem with the default settings.

With the implementation by @brebs, here their source in full:

:- use_module(library(dcg/basics)).

split_list(Len, Lst, LstSplit) :-
    must_be(positive_integer, Len),
    phrase(split(LstSplit, Len), Lst).

split([H|T], Len) --> list_length(H, Len), !, split(T, Len).
% Terminate at end of list
split([H], _) --> [H], !.
split([], _) --> [].

list_length(L, Len) --> { length(L, Len) }, string(L).

Then this query takes a while but spits out an answer:

?- numlist(1, 10000000, L), time(split_list(3, L, S)).
% 30,000,134 inferences, 3.076 CPU in 3.341 seconds (92% CPU, 9753696 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
S = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20|...], [22|...], [...|...]|...].

If you try to use sequence//2 like I suggested in this other post:

?- numlist(1, 10000000, L), time(phrase(sequence(length_list(3), S), L)).
% 42,111 inferences, 2.201 CPU in 3.319 seconds (66% CPU, 19131 Lips)
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 0.8Mb, global: 0.8Gb, trail: 33Kb
ERROR:   Stack depth: 4,232, last-call: 0%, Choice points: 4,217
ERROR:   In:
ERROR:     [4,232] user:length_list(0, [length:1|_60001076], [length:9,987,370], _60001070)
ERROR:     [4,228] dcg_high_order:sequence_([length:1|_60001118], <compound (:)/2>, [length:9,987,373], [])
ERROR:     [4,227] dcg_high_order:sequence_([length:2|_60001166], <compound (:)/2>, [length:9,987,376], [])
ERROR:     [4,226] dcg_high_order:sequence_([length:3|_60001214], <compound (:)/2>, [length:9,987,379], [])
ERROR:     [4,225] dcg_high_order:sequence_([length:4|_60001262], <compound (:)/2>, [length:9,987,382], [])
ERROR: 
ERROR: Use the --stack_limit=size[KMG] command line option or
ERROR: ?- set_prolog_flag(stack_limit, 2_147_483_648). to double the limit.

Boris · April 18, 2022, 10:13am

Thank you for your answer. Let me ask it differently: given that there are several predicates in dcg/basics that follow the same pattern (whites//0, digits//1 etc) is there a point in providing a meta-predicate specifically for deterministic parsing? And would it rather belong to dcg/basics since its motivation is

We need a comprehensive set of generally useful DCG primitives.

In my slowly growing personal collection of “generally useful code” I have exactly det_sequence//2,3,5 with the naive implementation I showed above. It is useful like once a high noon.

I dug it out, apparently I only have it for 2 and 3 arguments, it looks like this:

$ cat detparse.pl
:- module(detparse, [det_sequence//2,
                     det_sequence//3]).
:- meta_predicate det_sequence(3, -, ?, ?).
:- meta_predicate det_sequence(3, //, -, ?, ?).

det_sequence(Goal, [X|Xs]) -->
    call(Goal, X),
    !,
    det_sequence(Goal, Xs).
det_sequence(_, []) --> [].

det_sequence(Goal, Sep, [X|Xs]) -->
    call(Goal, X),
    det_sequence_1(Goal, Sep, Xs).

det_sequence_1(Goal, Sep, [X|Xs]) -->
    Sep,
    call(Goal, X),
    !,
    det_sequence_1(Goal, Sep, Xs).
det_sequence_1(_, _, []) --> [].

I am not sure about the names nor about the implementation.

jan · April 18, 2022, 10:25am

That is exactly my worry

Boris · April 18, 2022, 1:25pm

Point taken

j4n_bur53:

But things get funky here, both foo//1 and bar//1 are deterministic, nevertheless:

foo(1) --> [a].
bar(2) --> [a].

?- phrase((sequence(foo,L), sequence(bar,R)), [a,a,a]), !.
L = [1, 1, 1],
R = [].

?- phrase((sequence2(foo,L), sequence2(bar,R)), [a,a,a]), !.
L = [],
R = [2, 2, 2].

This is indeed the way to use sequence//2 for parsing; just cut after the call. In practice it will work fine for realistic input. If the input is endless/unbounded (not sure about the terminology) then we have library(pure_input) and library(intercept). Those two together work really well.

j4n_bur53 · April 18, 2022, 2:30pm

What is unrealistic input, never heard of that? You can try yourself:
A regex /s.*o/ and some realistic input.

Should this match:

From 2008 on stackoverflow helps

Or should this match:

From 2008 on stackoverflow helps

What would sequence//2 do? How would you even translate
these two matching problems to Prolog? With what library exactly?

Boris · April 18, 2022, 2:57pm

The resulting list of the parse grows with the size of the input. I simply meant that you can guess when you write your code using sequence//2 that it will fit in memory. If you cannot make that guess then you shouldn’t be putting it in a list like this.

It should obviously match greedily. I don’t understand the rest of your questions to be honest.

j4n_bur53 · April 18, 2022, 4:00pm

I am unlimmited motivated to explain the matter. But my motivation
is no guarantee that anything I write makes sense. How can I help?

I forgot to add to my question:

What would det_sequence//2 do? How can it be used
to realize the regex /s.*o/. Is this even possible?

Boris · April 18, 2022, 4:24pm

My ability to comprehend is probably the bottleneck.

I shared a possible implementation above. As far as I can tell, you cannot use it for matching this regex. Instead, you’d need sequence//2 as defined at the moment, and a cut. So:

any(X) --> [X].

'/s.*o/' --> "s", sequence(any, _), "o", !.

and then:

?- phrase('/s.*o/', `stackoverflow`, Rest).

And would you be so kind to stop deleting so many of your posts. It makes it even more difficult to follow what it is that we are talking about.

j4n_bur53 · April 18, 2022, 4:29pm

What about, any//2 will not work with det_sequence//2, but maybe this here will do:

any_but_o(X) --> \+ "o", [X].

'/s.*o/_lazy' --> "s", det_sequence(any_but_o, _), "o", !.

Interestingly this should give the lazy parsing and not the greedy parsing.

Edit 18.04.2022:
Yes indeed, its the lazy one, stops at the first “o” and not at the last “o”:

?- set_prolog_flag(double_quotes, codes).
true.

?- phrase((sequence(any,L), '/s.*o/_lazy', sequence(any,R)), 
"From 2008 on stackoverflow helps"), atom_codes(A, L), atom_codes(B, R).
A = 'From 2008 on ',
B = 'verflow helps' .

Boris · April 18, 2022, 4:38pm

I think that any_but_o is already implemented as string_without//2. But this is indeed “lazy”, using again the examples from the link that you shared then deleted. Like this:

?- portray_text(true).
true.

?- phrase('/s.*o/_lazy', `stackoverflow`, Rest).
Rest = `verflow`.

?- use_module(library(dcg/basics)).
true.

?- phrase(("s", string_without("o", _), "o"), `stackoverflow`, Rest).
Rest = `verflow`.

Boris · April 18, 2022, 4:43pm

But we got lost again. sequence//2 as it is defined at the moment is does, exactly as advertised in the docs, greedy matching, and then on backtracking “goes back”; so it does more than a regex can do:

?- phrase(("s", sequence(any, L), "o"), `stackoverflow`, Rest).
L = `tackoverfl`,
Rest = [119] ; % hehe
L = `tack`,
Rest = `verflow` ;
false.

Topic		Replies	Views
Confusing behavior of sequence//3 and sequence//5 from dcg/high_order Help! how-to	18	1366	December 25, 2020
Controlling backtracking in a DCG grammar? Help!	10	1187	July 11, 2019
Series of "procedural" predicates -- removing choice points Help!	33	2382	June 4, 2019
DCG look ahead, how its done? General	47	297	August 13, 2024
Sequence/3: problematic interaction with dcg/basics General	7	199	December 10, 2023

Deterministically parsing with library(dcg/high_order)

Related topics