Dcg/high_order sequence/3 nested with the same separator

Hi,

I’m trying to parse the Advent of Code 2022 Day 1 input file, which looks like this:

1000
2000
3000

4000

5000
6000

7000
8000
9000

10000

For the purposes of this challenge, this should parse to [[1000,2000,3000], [4000], [5000,6000], [7000,8000,9000], [10000]]. From reading an earlier topic, it seems like this should work:

phrase(sequence(sequence(integer, "\n"), "\n", Integers), Input).

However, this does not work, whereas changing the input so that the separators are different does. I suspect this has something to do with lookahead in the sequence parsing.

I don’t understand enough about how sequence//3 and friends are implemented to fix this, and anyway, it seems like a long shot. But I am wondering if anyone has a clever workaround for this.

Thanks!

1 Like

This is just a guess, I have not tired this in code.

I suspect the problem is with the last integer.
If I understand your code correctly the inner sequence/3 expects it to have \n but it does not and so fails. However it does have an eof delimiter so the inner rule needs to accept either

  • integer,‘\n’
  • integer, eof

Then the outer sequence/3 will fail for the last integer as the sequence does not end with \n. However one can not use the eof for the outer sequence as the inner one ate it. But if one thinks of recursive cases and base cases one so often sees one of the base cases for the empty input and that is what is needed here. So the outer sequence/3 also needs to accept empty but only if it fails for integer,\n

I don’t think lookahead is needed to solve this problem but then as I noted I have not tried solving this with working code.

I also suspect the rules could be modified so that the eof is eaten by the outer sequence and the inner sequence just accepts integer without \n only after failing for integer with \n leaving the eof for the outer sequence/3.

When I did this two years ago(!), I made my own combined version of dcg basics + high order, as the libraries sequence commits greedily, and makes it harder to parse this type of ‘ambiguous’ input. (I also wrote a version for Scryer prolog to contrast and compare hence the naming! :slight_smile: ) .I’ve pasted my version below, and the answers to day1 using it.

:- module(dcg_basics_swi, [
    digits//1,
    digit//1,
    blank//0,
    blank//1,
    blanks//0,
    blanks_to_nl//0,
    nonblank//1,
    hex_digit//1,
    integer//1,
    word//1,
    sequence//3,
    sequence//5,
    string_upto//2,
    string_without//2,
    white//0,
    whites//0,
    eos//0
    ]).

word([S|Ss]) --> alphanum(S), word(Ss).
word([]) --> [].

alphanum(S) --> [S], {
   char_type(S,alnum)
}.

digits([D|Ds]) --> digit(D), !, digits(Ds).
digits([]) --> [].

digit(D) --> [D], {
  D >= 0'0, D =< 0'9
}.

hex_digit(D) --> [D], {
  (D >= 0'0, D =< 0'9) ; (D >= 0'a, D =< 0'f )
}.

integer(N) --> "-", digits(Ds), {
  Ds = [_|_],
  number_chars(PosN,Ds),
  N is -PosN
}, !.

integer(N) --> ("+" ; []), digits(Ds), {
  Ds = [_|_],
  number_chars(PosN,Ds),
  N is PosN
}, !.

blank --> [C], {
  char_type(C,white)
}.

nonblank(C) --> [C], {
  \+ char_type(C,white)
}.

blanks --> blank, !, blanks.
blanks --> [].

blanks_to_nl, "\n" --> whites, "\n".

white --> [C], {
  C \= 0'\n,
  char_type(C, white)
}.

whites --> white, !, whites.
whites --> [].

string_without(Chars, [S|Ss])  --> [S], { \+ memberchk(S, Chars) },!, string_without(Chars,Ss).
string_without(Chars, []), [S] --> [S], { memberchk(S, Chars) }, !.
string_without(_, []) --> eos.

string_upto(Guard, S) --> { length(S,_)}, S, Guard, !.

eos --> [].

%%%%%%% Generation   %%%%%%%

white(" ") --> " ". % generating whitespace
blank(" ") --> " ".

%%%%%%% Higher Order %%%%%%%
sequence(Start, Element, Sep, End, Es) --> Start, sequence(Element, Sep, Es), End.

sequence(Element, Sep, [E|Es]) --> call(Element,E), Sep, sequence(Element, Sep, Es).
sequence(Element, _ , [E])     --> call(Element,E).
sequence(_      , _ , [])      --> [].

day 1

:-use_module(dcg_basics_swi).

l(L)-->sequence((sequence(integer,"\n")),"\n",L).

part1:-
    phrase_from_file(l(L),'./day.1.input'),
    maplist(sumlist,L,T),
    sort(0,@>,T,[Max|_]),
    writeln(Max).

part2:-
    phrase_from_file(l(L),'./day.1.input'),
    maplist(sumlist,L,T),
    sort(0,@>,T,[A,B,C|_]),
    format('~d~n',A+B+C).

The straight-forward solution would be to define “integer on a line” separately. Like this:

int_line(N) --> integer(N), "\n".

and then you can do:

phrase(sequence(int_line, "\n", Ints), ...)

But you should be using phrase_from_file/2 maybe. And sometimes there are choice points left over after using sequence/2 for parsing (instead of generating).

Here is the difference:

?- phrase(sequence(digit, Ds), `123`).
Ds = [49, 50, 51] ;
false.

?- phrase(sequence(atom, [1,2,3]), Out).
Out = [49, 50, 51].