Confusing behavior of sequence//3 and sequence//5 from dcg/high_order

I’m using: SWI-Prolog version 8.1.17

I want the code to use sequence//3 or maybe sequence//5 from library(dcg/high_order) to parse a string of integers separated by commas, like this: “1,2,3”.

First, the three-argument version, sequence(:Element, :Sep, ?List), documented as:

Match or generate a sequence of Element where each pair of elements is separated by Sep. When parsing , a matched Sep commits . The final element is not committed.

I’d expect:

?- use_module(library(dcg/basics)), use_module(library(dcg/high_order)).
true.

?- phrase(sequence(integer, ",", L), `1,2,3`).
L = [1, 2, 3].

But what I’m getting is:

?- phrase(sequence(integer, ",", L), `1,2,3`).
L = [1, 2, 3] ;
false.

I’d also expect (note the trailing comma in the input!):

?- phrase(sequence(integer, ",", L), `1,2,3,`, Rest).
L = [1, 2, 3],
Rest = [44].

But what I am getting is actually what I would have expected when the trailing comma is not there:

?- phrase(sequence(integer, ",", L), `1,2,3,`, Rest).
L = [1, 2, 3],
Rest = [].

No choice point left behind, either.

And when we then take sequence(:Start, :Element, :Sep, :End, ?List), documented as:

Match or generate a sequence of Element enclosed by Start end End, where each pair of elements is separated by Sep. More formally, it matches the following sequence:

Start (Element,Sep)*, End

So I would for example expect that this does what sequence//3 did in the last example above.

Expected:

?- phrase(sequence("", integer, ",", ",", L), `1,2,3,`).
L = [1, 2, 3].

but instead, I get:

?- phrase(sequence("", integer, ",", ",", L), `1,2,3,`).
false.

It follows the same logic as sequence//3 (so at least the two are consistent!):

?- phrase(sequence("", integer, ",", "", L), `1,2,3,`).
L = [1, 2, 3].

But then I am not sure if it is consistent with:

?- phrase(sequence("", integer, ",", "\n", L), `1,2,3\n`).
L = [1, 2, 3].

To summarize: the separator is also used as an “End” of the sequence, and then the predicate does not leave a choice point. In the 5-argument version, if the “End” is the same as the separator, we need to leave the End argument empty, but if it is different, we need to match it.

I am missing something relevant?

And a second issue: nesting of sequence. It kinda works for me, but I am not sure how to make it “do what I mean”. I would like to have:

?- phrase(<something magical with a sequence inside a sequence to L>,
          `1,2\n3,4\n5,6\n`).
L = [[1,2], [3,4], [5,6]].

The best I could do was:

?- phrase(sequence(sequence(integer, ","), "\n", L), `1,2\n3,4\n5,6\n`).
L = [[1, 2], [3, 4], [5, 6], []] ;
L = [[1, 2], [3, 4], [5, 6]].

How am I supposed to write it?

Since a matched separator commits, I can see why it returns false when matching at

phrase(sequence("", integer, ",", ",", L), `1,2,3,`).

I must admit to having issues building up more complex sequences, but using sequence and coding around its foibles usually means more unstandable code.

Regarding your final example, my initial version is:

:- use_module(library(dcg/basics)).
:- use_module(library(dcg/high_order)).

inner(Xs), "\n" --> sequence("",integer,",","\n",Xs).
outer(Ls) --> sequence("",inner,"\n","",Ls).

start(L) :-
    phrase(outer(L),`1,2\n3,4\n5,6\n`).
?- start(L).
L = [[1, 2], [3, 4], [5, 6]].

My next version looks like:

inner_items(Xs) --> sequence(integer,",",Xs).
inner(Xs), "\n" --> inner_items(Xs), "\n".
outer(Ls) --> sequence(inner,"\n",Ls).

which slightly decouples matching the inner sequence from the outer separator.

I’ve pushed a few clarifications to the docs and some enhancements to the implementation. The most troublesome one is

?- phrase(sequence(integer, ",", L), `1,2,3,`, Rest).

As the “,” commits, it not only commits the element before it, but also consuming the separator. So, this now (consistently) fails. I’ve clarified the specs as matching

Elem? (Sep elem)*

Which was what was intended in the first place. It would be nice if this succeeded without consuming the separator, but I do not see how to achieve that without leaving the matched elements uncommitted or matching the separator twice (using look-ahead). Leaving the matches uncommitted is in practice not acceptable as parsing longer inputs will quickly exhaust memory and if subsequent failure forces exploring the choicepoints you typically have to wait really long for it to say false. I do not really fancy matching the separators twice either.

If someone wants to write a good set of unit tests for this library, please do. That should make all these cases clear and makes sure we do not mess up if we want further changes.

For nesting, I agree with @Ian. If you want to allow for empty sequences, your simple nesting will not work. We could extend the sequence predicates with something that specifies minimum and/or maximum number of arguments matched. That would also fix this.

Thanks for the input.

@Ian Thank you for the tips. The point that I didn’t explicitly make is that there is a silent assumption (or maybe just a secret hope) that meta-predicates compose (also on the syntactic level!). At this point they become even more useful.

The correct and logical way to do it turned out to be:

?- phrase(sequence(sequence("", integer, ",", "\n"), Lines), `1,2\n3,4\n`).
Lines = [[1, 2], [3, 4]] ;
false.

This follows the (unix) convention for text files, where a “line” always has and end-of-line (even if it is the last line of the file. It also makes sense:

The file is a sequence of lines. The line is a sequence of integers separated by commas, and terminated by an end-of-line.

@jan

Thank you for taking the time to look into this.

I was trying to find the original discussion that resulted in creating library(dcg/high_order) and I couldn’t find it. Could you link to it? The point is, I am even more confused now that I thought about it carefully (so probably I shouldn’t think too much)

If this is indeed what it is meant to match, why not commit on a [Sep, Elem]? Why commit on the separator, before the element?

Note, if I interpret this as a regex, with e for element and s for separator:

e?(se)*

and if I take ? to mean “zero or one” and we take * to mean “0 or more”, then sese is also a valid string, and I guess that is not intended. The current implementation at least does not match it. Should it have been,

(e(se)*)?

?

I would write some tests if I knew what is really expected (this is why I asked for the original discussion). To me it seems like my expectations are off so maybe getting some context would help.

It originated from me needing it AFAIK. I looked at some existing practice and quickly hacked something together, assuming users will pick it up and it will be settled. Seems it will be :slight_smile:

Because it is not uncommon for Elem to be non-deterministic. Using a conservative Elem, this typically results in the desired behavior.

Right. Would you like me to fix it or do you prefer submitting a PR?

Test cases are a good way to iron out the specification and turn this into a mature library. There are quite a few cases of obviously desirable behavior, but notable determinism handling is rather complicated and I’m not sure the current behavior is the sweet spot.

Not sure how to interpret this question. Is it more like “please make a PR so that I don’t have to” or is it “would you like a PR so that your name is on it”? I honestly don’t care either way, whatever is less effort. I will not be scandalized even if I am not attributed by name.

BTW, I once even found my name in a commit that had nothing to do with me. It made me feel good though :wink:

I might try to make a PR with tests in library/dcg/high_order.pl in the coming couple of weeks.

Some people prefer submitting PRs for the credits. I have some preference to PRs as it is proof of contributions. Work-wise the difference is neglectable in this case. So I’ll fix it this time :slight_smile:

Much appreciated. Please start a new file in src/Tests/library along the structure of the files already there (well, those using PlUnit, some older ones do not). Called test_*.pl it will automatically be picked up by the regression tests.

1 Like

Yes, I know I am reviving a topic.

How do you suggest to parse a line-oriented text file, where each line must end with a newline?

Here are different options I tried. I strongly suspect I am making a very basic mistake somewhere…

First, I thought that I can use sequence//5 with an empty Start, and both Sep and End a newline, like this:

phrase(sequence("", line_dcg, "\n", "\n", List), Input)

This does not work for me. The problem seems to be that the Sep and the End are the same. Using * and ‘|’ so that it is easier to read:

?- phrase(sequence("", integer, "*", "*", L), `1*2*`).
false.

?- phrase(sequence("", integer, "|", "*", L), `1|2*`).
L = [1, 2].

I guess sequence//3 suffers from the same problem?

?- phrase(( sequence(integer, "|", L), "*" ), `1|2*`).
L = [1, 2].

?- phrase(( sequence(integer, "*", L), "*" ), `1*2*`).
false.

Finally, using sequence//2 with an “element DCG” that consumes the end-of-line:

line(X) --> integer(X), "*".

and then:

?- phrase(sequence(line, L), `1*2*`).
L = [1, 2] ;
false.

… and I guess that for parsing, I should cut after phrase?

?- phrase(sequence(line, L), `1*2*`), !.
L = [1, 2].

Are you asking for a response from anyone or just Jan W.?

Anyone who has advice on the topic :slight_smile: I thought this issue is closed from my side, but in the last year, this has been bothering me just a little bit. I have in the meanwhile started falling back to this for parsing (not generating!) “lines”:

lines([L|Lines]) -->
    line(L), "\n", !,
    lines(Lines).
lines([]) --> [].

I am not sure that this is exactly the same as:

sequence(line, Lines)

because of the choice points left behind. I somehow thought that there will be only one choice point left after:

phrase(sequence(line, Lines), Input)

but in fact, when I traced it, I found one choice point per element. Maybe I am not understanding the tracer? Maybe should read the new tutorial.

When it comes to parsing input, which includes text, and the input is complex enough to require a DCG then I prefer avoiding higher order predicates. My preference is to think of the DCG in the same way one would use BNF to parse the input.

I know that many who use Prolog with DCGs prefer to use higher order predicates with DCGs which puts me either in the minority or one of the few that will openly state that I prefer not to use higher order predicates with DCGs.

Also when writing DCGs I prefer to use open list as opposed to closed list so that I can accumulate multiple parts together instead of having to use append/3 with closed list.

As this is probably not what you want to hear, this is what I came to realize after spending a few years learning to parse large and complex files with DCGs.

I still don’t have all the parsing with DCGs working like I would like, in particular the creating of libraries as a hierarchy but getting closer as the days proceed. Seems my path to a solution is through the land of meta_predicate/1.

HTH

Maybe you can use ("\n", \+ eos) as sep? The general problem with these high order predicate is handling non-determinism. Allowing full non-determinism is asking for (practically) non-terminating parses. It is hard to find the right places to restrict determinism though.

@Boris Do you already have some code in flight? Could you maybe share it? I’d love to take a look.

TL;DR: I wrote some code, find it below. I am not sure what I am doing. EDIT: also look at the code in the next message, maybe it is cleaner that way.


Hi @pkoch, if you are asking how I solved the issues from this post above.

I ended up defining my own “sequence” meta-predicates that are only good for deterministic parsing. I will demonstrate the problem I was trying to solve again, maybe I made a mistake somewhere along the way.

With library(dcg/high_order), you can do like this:

$ swipl
Welcome to SWI-Prolog (threaded, 64 bits, version 8.3.15-9-gb74ddfb9c)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

    CMake built from "swipl-devel/build"

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- use_module(library(dcg/basics)).
true.

?- use_module(library(dcg/high_order)).
true.

?- portray_text(true).
true.

?- phrase(sequence(atom, [a,b,c]), Text).
Text = `abc`.

?- phrase(sequence(integer, "-", [1,2,3]), Text).
Text = `1-2-3`.

?- phrase(sequence("|", integer, "|", "|", [1, 2]), Text).
Text = `|1|2|`.

However, the opposite doesn’t work:

?- phrase(sequence("|", integer, "|", "|", L), `|1|2|`).
false.

… and as discussed in length above, the one problem is that the separator and the end are the same. If they aren’t the same it does what you mean:

?- phrase(sequence("#", integer, ",", "*", L), `#1,2,3*`).
L = [1, 2, 3].

This was now demonstrated with sequence//5 but sequence//3 behaves the same.

In addition, the sequence DCGs do not commit on parsing an element. Try with this helper predicate (uses integer//1 from library(dcg/basics)) and this query and look at the debugger to see what I mean:

integer_line(N) --> integer(N), "\n".

The query:

phrase(sequence(integer_line, Ns), `1\n2\n3\n`).

After this very long intro, here is what I wrote to completely circumvent these issues when all I want to do is parse text greedily.

$ cat detparse.pl 
:- module(detparse, [det_sequence//2,
                     det_sequence//3,
                     det_sequence//5]).
:- meta_predicate det_sequence(3, -, ?, ?).
:- meta_predicate det_sequence(3, //, -, ?, ?).
:- meta_predicate det_sequence(//, 3, //, //, -, ?, ?).

det_sequence(Goal, [X|Xs]) -->
    call(Goal, X),
    !,
    det_sequence(Goal, Xs).
det_sequence(_, []) --> [].

det_sequence(Goal, Sep, [X|Xs]) -->
    call(Goal, X),
    det_sequence_1(Goal, Sep, Xs).

det_sequence_1(Goal, Sep, [X|Xs]) -->
    Sep,
    call(Goal, X),
    !,
    det_sequence_1(Goal, Sep, Xs).
det_sequence_1(_, _, []) --> [].

det_sequence(Start, Goal, Sep, End, Xs) -->
    Start,
    det_sequence(Goal, Sep, Xs),
    End.

With these, I get what I wanted:

?- phrase(det_sequence("|", integer, "|", "|", L), `|1|2|`).
L = [1, 2].

?- phrase(det_sequence(integer_line, Ns), `1\n2\n3\n`).
Ns = [1, 2, 3].

Those det_sequence//2,3,5 predicates that I defined in module detparse will not work for many use cases that are covered by the predicates in library(dcg/high_order). I haven’t even bothered to compare them, really. But I know that my implementation is greedy and will commit immediately on a separator, which might be not what you want at all. And about the names (detparse, det_sequence) I am not sure either. There are probably better names for what they do.

Another thought: As I have currently defined det_sequence//2,3,5, they are good for loading all parsed content to memory. If you’d want to parse in constant memory, you might want to have a version that uses maybe library(intercept) instead of making a list.

This is a very long way to say that I am still not sure what I am doing.

Now that I had to try and motivate my choices, I changed my mind. Greedy parsing is probably more useful if we do not generate a list, by default. Maybe the functionality in my half-baked module detparse is meant to be used with library(intercept). You can still collect all elements to a list using intercept_all/4 and nb_intercept_all/4.

$ cat greedyparse.pl 
:- module(greedyparse, [greedy_sequence//1,
                        greedy_sequence//2,
                        greedy_sequence//4]).
:- meta_predicate greedy_sequence(3, ?, ?).
:- meta_predicate greedy_sequence(3, //, ?, ?).
:- meta_predicate greedy_sequence(//, 3, //, //, ?, ?).

:- use_module(library(intercept)).

greedy_sequence(Goal) -->
    call(Goal, X),
    { send_signal(X) },
    !,
    greedy_sequence(Goal).
greedy_sequence(_) --> [].

greedy_sequence(Goal, Sep) -->
    call(Goal, X),
    { send_signal(X) },
    greedy_sequence_1(Goal, Sep).

greedy_sequence_1(Goal, Sep) -->
    Sep,
    call(Goal, X),
    { send_signal(X) },
    !,
    greedy_sequence_1(Goal, Sep).
greedy_sequence_1(_, _) --> [].

greedy_sequence(Start, Goal, Sep, End) -->
    Start,
    greedy_sequence(Goal, Sep),
    End.

With this, I can still make a list if I wanted to:

?- intercept_all(N, phrase(greedy_sequence("|", integer, "|", "|"), `|1|2|3|`), N, L).
L = [1, 2, 3].

… but I can also parse a large file in constant memory, intercepting only those elements that are interesting to me.

@jan Do you think there is any merit to doing this? How about adding those predicates to a library? Where should they go and what should they be called and so on? Are they useful and is the interface good?

This is a very long way to say that I am still not sure what I am doing.

I think Jan is the only one with any solid bearings in this issue, worry not. :joy:

My main goal was more to take a look into any existing test suite. Since there seems to be none so far, I’ll take a try at what Jan was proposing.

Maybe the functionality in my half-baked module detparse is meant to be used with library(intercept).

I don’t think this coupling is desired. You can achieve this within the goals that you pass into sequence iff that’s desired. (Cool trick tho, thanks for sharing :slight_smile: )

I was wrong, there’s already something there. I’ll just add to it.

Tried my hand at using the following definition (building on Boris’s examples and my intuition of how this should work):

sequence_pkoch(OnElem, OnSep, L) --> sequence_pkoch_1(OnElem, OnSep, L), !.
sequence_pkoch(_, _, []) --> []. % To support empty sequences.

sequence_pkoch_1(OnElem, OnSep, [H|T]) -->
    call(OnElem, H), !,
    (
        (OnSep, sequence_pkoch_1(OnElem, OnSep, T))
        ;{T=[]}
    ).

This worked nicely, but I’m bumping into one test failure that I’d like to discuss here (instead of fragmenting this discussion into a PR).

Here’s what’s failing for me:

test('sequence//3 nondet element', nondet) :-
    phrase(sequence(string, sep, L), `b,b`, R),
    L == [`b`, `b`],
    R == [].

I feel like nondet is hiding a bit too much detail for me, so I enumerated all the cases (with what’s on master):

test('sequence//3 nondet element', all(L-R == [
    [``]-`b,b`,
    [`b`,``]-`b`,
    [`b`,`b`]-[]
])) :-
    phrase(sequence(string, sep, L), `b,b`, R).

Notice how this doesn’t quite follow the intuition of “cut on all separators”, since we have a bunch choice points.

This fails with my version of the code, since we’re cutting after string//0 decides on the empty string so I only get the first result. I think that using string//0 here – combined with the old cutting behaviour – made us accidentally emulate using string_until(Sep). If we’re to change this to another nondet predicate, things would make more sense. Let’s rewrite the test to use b//1:

test('sequence//3 nondet element', all(L-R == [
    [b1,b1]-[],
    [b1,b2]-[]
])) :-
    phrase(sequence(b, sep, L), `b,b`, R).

This now seems to follow suit with what’s intended. However, notice that we’re potentially breaking a lot of people’s code that also accidentally used the first behaviour. Not sequence//5 tho, since encountering End already cut things anyway. I’m ok with this, since this is an experimental module, but I think it’s something that we need to be conscious of.

My implementation is pretty gluttonous. On both cases, it’ll only present the first longest choice ([`b`,`b`] and [b1, b1]). This seems to be what Jan was trying to target, so I feel pretty confident this is the right path.

If this change in behaviour is desired, I’ve already put a draft PR up but let’s not fragment the discussion until we’re set on a path.

Happy holidays, everyone! :sparkles:

Thanks a lot for looking into this. I have been using these predicates and the current behavior has been vexing me on some occasions. I just circumvented it until now.

Can you explain very briefly which use cases you are trying to fix or add?

Keep in mind that the test suite was rudimentary at best. It is definitely not properly testing “generating from a list”. I did not yet understand the why when I wrote those. My understanding now is that generating text from a list is one of the motivations behind this library. Like this example from above:

?- phrase(sequence(integer, "-", [1,2,3]), Text).
Text = `1-2-3`.