Deterministically parsing with library(dcg/high_order)

My question boils down to: if we do have sequence//2 (and also sequence//3 and sequence//5), how can we get the corresponding det_sequence//2? The naive implementation for the 2-argument version:

det_sequence(G, [H|T]) -->
    call(G),
    !,
    det_sequence(G, T).
det_sequence(_, []) --> [].

This seems to be a generalization of blanks//0, nonblanks//1, whites//0, digits//1, xdigits//1 from library(dcg/basics). Given that this is not available at the moment, I have been trying to explain it away with “it isn’t necessary” or “it has a fundamental defect”. Could someone who understands better try to elaborate a bit?

PS: What I think I understand now but did miss initially is that the motivation behind the meta-predicates in library(dcg/high_order) was to make it easier to generate from a fully instantiated list, and not parse. For parsing, one can still just cut after sequence//2, but with long enough input we might run out of memory before that. I am still not sure if this is a problem. One argument is that even if you don’t run out of memory because of the choice points you eventually run out of memory because of the parsed list?

Not sure about that. It isn’t the cut in foo2//1 that made this deterministic, it is the nonvar L that did it. Check out the definition of sequence//2 and you will see.

I am still not sure. There isn’t nonvar in the definition, but you made it nonvar with your length(L, 3). I did already check everything, while I am not convinced that you did, since you reply so incredibly fast. Although I do admit I am at least an order of magnitude slower than you. Here for example:

?- [user].
|: foo(1) --> [a].
|: foo(2) --> [b].
|: foo2(X) --> foo(X), !.
|: ^D% user://1 compiled 0.01 sec, 3 clauses
true.

?- use_module(library(dcg/high_order)).
true.

?- phrase(sequence(foo2, L), [a, a, a]).
L = [1, 1, 1] ;

See my edited reply above. You are now showing the same thing I guess, here:

?- length(R,3), phrase(sequence(foo2,L), R).
R = [a, a, a],
L = [1, 1, 1] ;
false.

The choice point comes from the same place where all the choice points come from. The lack of choice points when the second argument of sequence//2 is a proper list is because of the special-case indexing for [_|_] vs [] in the first argument. This is why the arguments are reordered in the definition of sequence//2, which I will now paste here so that there is no confusion to those of us who are slower:

sequence(OnElem, List) -->
    sequence_(List, OnElem).

sequence_([H|T], P) -->
    call(P, H),
    sequence_(T, P).
sequence_([], _) -->
    [].

Thank you, now you understand what I asked in the first place. I apologize that I am so bad at explaining.

That is right. As a serializer, the non-determinism is not a problem. For parsing one typically has to be keen using cuts to get a realistic parser. Without, a parse failure typically will cause the program not to terminate in a realistic time and even a syntactically valid input may cause resource overflows. The sequence//3 use the Separator to prune choice points. Even that is dubious as using non-determinism to add a separator to an item may be desired in some cases.

1 Like

Well the lack of example was the source of the confusion. From this other thread, then the query in this post shows the problem with the default settings.

With the implementation by @brebs, here their source in full:

:- use_module(library(dcg/basics)).

split_list(Len, Lst, LstSplit) :-
    must_be(positive_integer, Len),
    phrase(split(LstSplit, Len), Lst).

split([H|T], Len) --> list_length(H, Len), !, split(T, Len).
% Terminate at end of list
split([H], _) --> [H], !.
split([], _) --> [].

list_length(L, Len) --> { length(L, Len) }, string(L).

Then this query takes a while but spits out an answer:

?- numlist(1, 10000000, L), time(split_list(3, L, S)).
% 30,000,134 inferences, 3.076 CPU in 3.341 seconds (92% CPU, 9753696 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
S = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20|...], [22|...], [...|...]|...].

If you try to use sequence//2 like I suggested in this other post:

?- numlist(1, 10000000, L), time(phrase(sequence(length_list(3), S), L)).
% 42,111 inferences, 2.201 CPU in 3.319 seconds (66% CPU, 19131 Lips)
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 0.8Mb, global: 0.8Gb, trail: 33Kb
ERROR:   Stack depth: 4,232, last-call: 0%, Choice points: 4,217
ERROR:   In:
ERROR:     [4,232] user:length_list(0, [length:1|_60001076], [length:9,987,370], _60001070)
ERROR:     [4,228] dcg_high_order:sequence_([length:1|_60001118], <compound (:)/2>, [length:9,987,373], [])
ERROR:     [4,227] dcg_high_order:sequence_([length:2|_60001166], <compound (:)/2>, [length:9,987,376], [])
ERROR:     [4,226] dcg_high_order:sequence_([length:3|_60001214], <compound (:)/2>, [length:9,987,379], [])
ERROR:     [4,225] dcg_high_order:sequence_([length:4|_60001262], <compound (:)/2>, [length:9,987,382], [])
ERROR: 
ERROR: Use the --stack_limit=size[KMG] command line option or
ERROR: ?- set_prolog_flag(stack_limit, 2_147_483_648). to double the limit.

Thank you for your answer. Let me ask it differently: given that there are several predicates in dcg/basics that follow the same pattern (whites//0, digits//1 etc) is there a point in providing a meta-predicate specifically for deterministic parsing? And would it rather belong to dcg/basics since its motivation is

We need a comprehensive set of generally useful DCG primitives.

In my slowly growing personal collection of “generally useful code” I have exactly det_sequence//2,3,5 with the naive implementation I showed above. It is useful like once a high noon.

I dug it out, apparently I only have it for 2 and 3 arguments, it looks like this:

$ cat detparse.pl
:- module(detparse, [det_sequence//2,
                     det_sequence//3]).
:- meta_predicate det_sequence(3, -, ?, ?).
:- meta_predicate det_sequence(3, //, -, ?, ?).

det_sequence(Goal, [X|Xs]) -->
    call(Goal, X),
    !,
    det_sequence(Goal, Xs).
det_sequence(_, []) --> [].

det_sequence(Goal, Sep, [X|Xs]) -->
    call(Goal, X),
    det_sequence_1(Goal, Sep, Xs).

det_sequence_1(Goal, Sep, [X|Xs]) -->
    Sep,
    call(Goal, X),
    !,
    det_sequence_1(Goal, Sep, Xs).
det_sequence_1(_, _, []) --> [].

I am not sure about the names nor about the implementation.

That is exactly my worry :slight_smile:

1 Like

Point taken :slight_smile:

This is indeed the way to use sequence//2 for parsing; just cut after the call. In practice it will work fine for realistic input. If the input is endless/unbounded (not sure about the terminology) then we have library(pure_input) and library(intercept). Those two together work really well.

1 Like

The resulting list of the parse grows with the size of the input. I simply meant that you can guess when you write your code using sequence//2 that it will fit in memory. If you cannot make that guess then you shouldn’t be putting it in a list like this.

It should obviously match greedily. I don’t understand the rest of your questions to be honest.

My ability to comprehend is probably the bottleneck.

I shared a possible implementation above. As far as I can tell, you cannot use it for matching this regex. Instead, you’d need sequence//2 as defined at the moment, and a cut. So:

any(X) --> [X].

'/s.*o/' --> "s", sequence(any, _), "o", !.

and then:

?- phrase('/s.*o/', `stackoverflow`, Rest).

And would you be so kind to stop deleting so many of your posts. It makes it even more difficult to follow what it is that we are talking about.

I think that any_but_o is already implemented as string_without//2. But this is indeed “lazy”, using again the examples from the link that you shared then deleted. Like this:

?- portray_text(true).
true.

?- phrase('/s.*o/_lazy', `stackoverflow`, Rest).
Rest = `verflow`.

?- use_module(library(dcg/basics)).
true.

?- phrase(("s", string_without("o", _), "o"), `stackoverflow`, Rest).
Rest = `verflow`.

But we got lost again. sequence//2 as it is defined at the moment is does, exactly as advertised in the docs, greedy matching, and then on backtracking “goes back”; so it does more than a regex can do:

?- phrase(("s", sequence(any, L), "o"), `stackoverflow`, Rest).
L = `tackoverfl`,
Rest = [119] ; % hehe
L = `tack`,
Rest = `verflow` ;
false.

Well not so sure how useful it is. It sometimes (no hard numbers) useful for semi-structured input of limited size. For example reading lines of some arbitrary log format into memory. The exact same thing can be achieved with intercept_all/4.