Phrase_from_file vs phrase

EricGT · September 12, 2020, 5:52pm

Here is digits//1 from library(dcg/basics)

digits([H|T]) -->
	digit(H), !,
	digits(T).
digits([]) -->
	[].

digit(C) -->
	[C],
	{ code_type(C, digit)
	}.

Here is the typical usage of digits//1.

Details

File name: example_01.pl

digits([H|T]) -->
	digit(H), !,
	digits(T).
digits([]) -->
	[].

digit(C) -->
	[C],
	{ code_type(C, digit)
	}.

:- begin_tests(digits).

:- set_prolog_flag(double_quotes, codes).

case(01,success,  "0",  0).
case(02,success,"123",123).

case(01,fail,"a").

case(01,error,"").

test(01,[forall(case(_,success,Digits,Number))]) :-
    DCG = digits(Digits),
    phrase(DCG,Codes,[]),
    number_codes(Number,Codes).

test(02,[fail,forall(case(_,fail,Digits))]) :-
    DCG = digits(Digits),
    phrase(DCG,Codes,[]),
    number_codes(_,Codes).

test(03,[forall(case(_,error,Digits)), error(syntax_error(illegal_number),_)]) :-
    DCG = digits(Digits),
    phrase(DCG,Codes,[]),
    number_codes(_,Codes).

:- end_tests(digits).

Example run:

Welcome to SWI-Prolog (threaded, 64 bits, version 8.3.7)

?- working_directory(_,'C:/Users/Groot/Documents').
true.

?- ['example_01'].
true.

?- run_tests.
% PL-Unit: digits .... done
% All 4 tests passed
true.

Here is the way I do digits//2 using an open list.

Details

File name: example_01.pl

digits(T0,T) -->
	digit(T0,T1), !,
	digits(T1,T).
digits(T,T) --> [].

digit([C|T],T) -->
	[C],
	{ code_type(C, digit) }.

:- begin_tests(digits).

:- set_prolog_flag(double_quotes, codes).

case(01,success,  "0",  0).
case(02,success,"123",123).

case(01,fail,"a").

case(01,error,"").

test(01,[forall(case(_,success,Digits,Number))]) :-
    DCG = digits(Digits,T),
    phrase(DCG,Codes,[]),
    T = [],
    number_codes(Number,Codes).

test(02,[fail,forall(case(_,fail,Digits))]) :-
    DCG = digits(Digits,T),
    phrase(DCG,Codes,[]),
    T = [],
    number_codes(_,Codes).

test(03,[forall(case(_,error,Digits)), error(syntax_error(illegal_number),_)]) :-
    DCG = digits(Digits,T),
    phrase(DCG,Codes,[]),
    T = [],
    number_codes(_,Codes).

:- end_tests(digits).

Example run:

Welcome to SWI-Prolog (threaded, 64 bits, version 8.3.7)

?- working_directory(_,'C:/Users/Groot/Documents').
true.

?- ['example_02'].
true.

?- run_tests.
% PL-Unit: digits .... done
% All 4 tests passed
true.

Notice that the code is very similar, but that when digits//2 is done it is still an open list. The open list is converted to a closed list with T = [].

Here are some more difference list DCGs that use digits//2

'digit+'(T0,T) -->
    digit(T0,T1), !,
    'digit*'(T1,T).

'digit*'(T0,T) -->
    digit(T0,T1), !,
    'digit*'(T1,T).
'digit*'(T,T) --> [].

‘digit+’//2 is for one or more digits.
‘digit*’//2 is for zero or more digits.

If you are thinking regular expressions after reading the description of those two DCGs you are correct. Even though I don’t like using RE for parsing, those patterns are so common in parsing that they show up in RE and parsing in general.

So what has this bought me. At first sight this looks like nothing gained but one of the first things you need when writing large parser is that the DCG works as a recognizer. This means that the DCG can read the input without any errors. A recognizer does not return and AST, codes, or anything. A recognizer is just a predicate that returns true or false.

Here is an example recognizer I use for recognizing ABNF files.

recognize(Abnf_path) :-
    setup_call_cleanup(
        open(Abnf_path,read,Abnf_stream),
        (
            set_stream(Abnf_stream, newline(posix)),
            read_stream_to_codes(Abnf_stream,Codes),
            DCG = rulelist(_,_),
            phrase(DCG,Codes,[])
        ),
        close(Abnf_stream)
    ).

Once I have the DCG working as a recognizer the next step I typically take is to get the DCG to reach a fixed point. This means the input will get converted to some format then the format will be processed again. Usually after one or two iterations the output is at a fixed point. At the lowest level the fixed point takes apart the input code and builds a list of output code from scratch. The input codes should be identical to the output codes. Sounds easy, but try doing this with code containing white space or that doesn’t have a well designed BNF, or that has a BNF but the BNF was designed for LL or LR parsers which don’t always make for working nicely with DCGs.

Here is an example fixed point check I use with ABNF files.

fixed_point_codes(Abnf_path) :-
    setup_call_cleanup(
        open(Abnf_path,read,Abnf_stream),
        (
            set_stream(Abnf_stream, newline(posix)),
            read_stream_to_codes(Abnf_stream,Codes_1),
            DCG1 = rulelist(Codes_2,[]),
            phrase(DCG1,Codes_1,[])
        ),
        close(Abnf_stream)
    ),
    assertion( Codes_1 == Codes_2 ).

Notice that the check Codes_1 == Codes_2. The output must be reconstructed exactly the same as the input.

Next I start to build the parsers that will return things such as AST, the output that preserves the input with whitespace, etc.

One of the first predicates that I create are recognizers for common non-terminals, e.g.

digits -->
    digits(_,[]).

Notice that it does not return anything. It just recongizes that the input is a valid input of digits. It is based on the difference list DCG, e.g. digits//2 and uses [] to convert the difference list to a closed list.

Here is an example that converted digit//2 into a term for use in an AST

digit(dec(digits(String))) -->
    digit(Codes,[]),
    { string_codes(String,Codes) }, !.

I know this has only given you some small examples but keeps eluding to much larger things. I haven’t posted the much larger examples yet because I ran into a pattern with rfc for parsing URLs and HTTP request that are based on ABNF using other ABNF. While the code runs correctly, I did not like the tricks I had to use to get the code to work correctly while still being in multiple modules matching the different rfc. The tricks were not for the DCG but for the calling of the DCG in modules which started the discussion on Prolog Many Worlds.

So the point of all of this is that if you start writing lots of parsers using DCGs and keep refactoring the code down and do more and more advanced things with parsing I think you will find that starting at the lowest layer with the DCGs returning differnece list instead of closed list will gain you some major benefits in code reuse, understanding of how to make the DCG efficient, and reaching fixed points.

Other post related to this:
Learned the hard way
Wiki discussion on DCGs.

EDIT

Here is a larger code example I found laying around of some code I started to parse JSON files. I did not test this but it shouldn’t take much to make it usable if it is not already.

Example JSON parser using difference list

/*

   Copyright (c) 2020 Eric G. Taucher

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are met:

   1. Redistributions of source code must retain the above copyright notice, this
      list of conditions and the following disclaimer.
   2. Redistributions in binary form must reproduce the above copyright notice,
      this list of conditions and the following disclaimer in the documentation
      and/or other materials provided with the distribution.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
   ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
   WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
   DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
   ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
   (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
   LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
   ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/

:- module('JSON DCG parser',[
      array/4,
      % character/4,
      % characters/4,
      digit/4,
      'digit*'/4,
      'digit+'/4,
      element/4,
      % element_first/4,
      % element_rest/4,
      % element_separator/4,
      % escape_code/4,
      % exponent_indicator/4,
      false/4,
      fixed_point_codes/1,
      % hex_digit/4,
      % hex_digits/5,
      json/4,
      member/4,
      % member_first/4,
      % member_rest/3,
      % member_rest/4,
      % member_separator/4,
      % name_value_separator/4,
      null/4,
      number/4,
      % number_exponent/4,
      % number_fraction/4,
      % number_whole/4,
      object/4,
      % optional_negative_sign/4,
      % optional_sign/4,
      recognize/1,
      % start_digit/4,
      string/4,
      true/4,
      % value/4,
      ws//2,
      'ws*'//2,
      'ws+'//2
   ]).

/** <module> JSON DCG parser
These are basic DCG which return difference list.
Difference list are easily converted to standard list (closed list) by unifying the tail with [].

These predicates are not intended to be used directly but instead via predicates in modules that
rely on these predicates, e.g.
modules:
JSON_format_preserving
JSON_structure_preserving

@author Eric G Taucher
@license Simplified BSD License
*/

% -----------------------------------------------------------------------------

% Based on JSON BNF at https://www.json.org/json-en.html

% -----------------------------------------------------------------------------

:- set_prolog_flag(double_quotes,codes).

% -----------------------------------------------------------------------------

json(T0,T) -->
   element(T0,T).

value(T0,T) --> object(T0,T).
value(T0,T) --> array(T0,T).
value(T0,T) --> string(T0,T).
value(T0,T) --> number(T0,T).
value(T0,T) --> true(T0,T).
value(T0,T) --> false(T0,T).
value(T0,T) --> null(T0,T).

object([0'{|T1],T) -->
   "{",
   (
      'ws*'(T1,[0'}|T])
   ;
      'member_first'(T1,[0'}|T])
   ).

member_first(T0,T) -->
   member(T0,T1),
   member_rest(T1,T).

member_rest(T0,T) -->
   member_separator(T0,T1),
   member(T1,T2),
   member_rest(T2,T).
member_rest([]) --> [].

member_separator([0',|T],T) --> ",".

member(T0,T) -->
   'ws*'(T0,T1),
   string(T1,T2),
   'ws*'(T2,T3),
   name_value_separator(T3,T4),
   element(T4,T).

name_value_separator([0':|T],T) --> ":".

array([0'[|T0],T) -->
   "[",
   (
      'ws*'(T0,[0']|T])
   ;
      element_first(T0,[0']|T])
   ),
   "]".

element_first(T0,T) -->
   element(T0,T1),
   element_rest(T1,T).

element_rest(T0,T) -->
   element_separator(T0,T1),
   element(T1,T2),
   element_rest(T2,T).
element_rest(T,T) --> [].

element_separator([0',|T],T) --> ",".

element(T0,T) -->
   'ws*'(T0,T1),
   value(T1,T2),
   'ws*'(T2,T).

string([0'\"|T0],T) -->
   characters(T0,T).

characters(T0,T) -->
   character(T0,T1), !,
   characters(T1,T).
characters([0'"|T],T) --> "\"".

character([Code|T],T) -->
   [Code],
   {
       Code \= 0'"
   ;
       Code \= 0'\
   ;
       \+ ( Code >= 0x00, Code =< 0x0F )
   ;
       Code \= 0x7F
   }.
character(T0,T) --> escape_code(T0,T).
character([0'\\,0'u|T0],T) -->
   [0'\\,0'u],
   hex_digits(4,T0,T).

escape_code([0'\\,0'"|T],T) --> "\"".
escape_code([0'\\,0'\\|T],T) --> "\\".
escape_code([0'\\,0'/|T],T) --> "\\/".
escape_code([0'\\,0'\b|T],T) --> "\\\b".
escape_code([0'\\,0'\f|T],T) --> "\\\f".
escape_code([0'\\,0'\n|T],T) --> "\\\n".
escape_code([0'\\,0'\r|T],T) --> "\\\r".
escape_code([0'\\,0'\t|T],T) --> "\\\t".

hex_digits(0,T,T) --> [].
hex_digits(N0,T0,T) -->
   hex_digit(T0,T1),
   { N is N0 - 1 },
   hex_digits(N,T1,T).

hex_digit([Code|T],T) -->
   [Code],
   { Code >= 0'0, Code =< 0'9 }.
hex_digit([Code|T],T) -->
   [Code],
   { Code >= 0'a, Code =< 0'f }.
hex_digit([Code|T],T) -->
   [Code],
   { Code >= 0'A, Code =< 0'F }.

number(T0,T) -->
   optional_negative_sign(T0,T1),
   number_whole(T1,T2),
   number_fraction(T2,T3),
   number_exponent(T3,T).

optional_negative_sign([0'-|T],T) --> "-".
optional_negative_sign(T,T) --> [].

number_whole([0'0|T],T) --> "0".
number_whole(T0,T) -->
   start_digit(T0,T1),
   'digit*'(T1,T).

number_fraction([0'.|T0],T) -->
   ".",
   'digit+'(T0,T).
number_fraction(T,T) --> [].

number_exponent(T0,T) -->
   exponent_indicator(T0,T1),
   optional_sign(T1,T2),
   'digit+'(T2,T).
number_exponent(T,T) --> [].

exponent_indicator([0'e|T],T) --> "e".
exponent_indicator([0'E|T],T) --> "E".

optional_sign([0'-|T],T) --> "-".
optional_sign([0'+|T],T) --> "+".
optional_sign(T,T) --> [].

start_digit([C|T],T) -->
   [C],
   { between(0'1,0'9,C) }.

'digit+'(T0,T) -->
   digit(T0,T1),
   'digit+'(T1,T).

'digit*'(T0,T) -->
   digit(T0,T1),
   'digit*'(T1,T).
'digit*'(T,T) --> [].

digit([C|T],T) -->
   [C],
   { between(0'0,0'9,C) }.

true([0't,0'r,0'u,0'e|T],T) -->
   "true".

false([0'f,0'a,0'l,0's,0'e|T],T) -->
   "false".

null([0'n,0'u,0'l,0'l|T],T) -->
   "null".

'ws+'(T0,T) -->
   ws(T0,T1), !,
   'ws*'(T1,T).
'ws*'(T0,T) -->
   ws(T0,T1), !,
   'ws*'(T1,T).
'ws*'(T,T) --> [].

ws([0'\s|T],T) --> " ".
ws([0'\n|T],T) --> "\n".
ws([0'\r|T],T) --> "\r".
ws([0'\t|T],T) --> "\t".

% -----------------------------------------------------------------------------

%! recognize(+Path) is semidet.
%
% Recognizes a grammar. Silently fails.
recognize(Path) :-
   setup_call_cleanup(
       open(Path,read,Stream),
       (
           set_stream(Stream, newline(posix)),
           read_stream_to_codes(Stream,Codes),
           DCG = json(_,_),
           phrase(DCG,Codes,[])
       ),
       close(Stream)
   ).

% ?- recognize('xyz.json').

% A parser at this level would be pointless as all it would return is the input
% as a list of character codes which is equivalent to string_codes/2.

%! fixed_point_codes(+Path) is semidet.
%
% Compares the input codes to the output codes which should be identical.
% If they are not the predicate silently fails which indicates a problem with
% the ABNF file or the Prolog Code. <br>
% No specific errors are returned.
fixed_point_codes(Path) :-
   setup_call_cleanup(
       open(Path,read,Stream),
       (
           set_stream(Stream, newline(posix)),
           read_stream_to_codes(Stream,Codes_1),
           DCG1 = json(Codes_2,[]),
           phrase(DCG1,Codes_1,[])
       ),
       close(Stream)
   ),
   assertion( Codes_1 == Codes_2).

% ?- fixed_point_codes('xyz.json').

% -----------------------------------------------------------------------------

SWI-Prolog also has a JSON parser in library(http/json)

GitHub source code

Topic		Replies	Views
What's the idiomatic way of developing DCGs? Help!	8	673	December 17, 2020
Simple example of phrase_from_file/2? Help!	13	954	March 29, 2021
Double_quotes flag and DCGs Discussion dcg	13	694	June 24, 2025
Using phrase_from_file Help!	9	1229	February 19, 2020
Most efficient DCG for text parsing? Algorithm	4	1891	July 5, 2021

Phrase_from_file vs phrase

Related topics