Rtools42, Unit test for sgml

Typewriter land again…

test_sgml.
(...)
c:/rtools42/home/matthias/swipl-devel/packages/sgml/Test/estag.sgml-c:/rtools42/home/matthias/swipl-devel/packages/sgml/Test/ok/estag.ok
WRONG
OK:
[element(oops,[],['<snag attr=2>'])]
ANSWER:
[element(oops,[],['<snag attr=2>\n'])]

I can silence this message by changing load_sgml_file/2 to load_sgml/3, which is recommended anyway, and passing [type(text)] as an option.

Like this:

load_file(File, Term) :-
    load_pred(Ext, Pred),
    file_name_extension(_, Ext, File),
    !,
    retractall(error(_,_,_)),
    call(Pred, File, Term, [type(text)]). %%%%%% here
load_file(Base, Term) :-
    load_pred(Ext, Pred),
    file_name_extension(Base, Ext, File),
    exists_file(File),
    !,
    retractall(error(_,_,_)),
    call(Pred, File, Term, []).


load_pred(sgml, load_sgml).
load_pred(xml,  load_xml).
load_pred(html, load_html).

Moreover, I get quite a few more errors whenever UTF-8 is involved, e.g. for utf8.xml. If I understand this correctly, a test can fail for either of two reasons, or for both at once:

  • load_sgml returns the wrong result
  • loading of the “ok-file” returns the wrong result
  • both

The problem is further complicated by the terminal that may have its own rules for rendering the code points.
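
One way to take the terminal out of the picture is to compare code points instead of rendered text. The following is only a sketch; show_codes/1 is a name made up for this post and is not part of test_sgml.pl.

```prolog
% show_codes(+Atom): print the code points of Atom as a list of
% integers, so the result does not depend on how the terminal
% renders non-ASCII characters.
% (Hypothetical helper, not part of the sgml test suite.)
show_codes(Atom) :-
    atom_codes(Atom, Codes),
    format("~q~n", [Codes]).

% A correctly decoded u-umlaut is the single code point 252:
%   ?- show_codes('D\u00FCrst').
%   [68,252,114,115,116]
%
% UTF-8 bytes misread as ISO-8859-1 show up as two code points:
%   ?- show_codes('D\u00C3\u00BCrst').
%   [68,195,188,114,115,116]
```

If the OK term and the ANSWER term give the same list, the difference is only in the rendering.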

The output below is from the “normal” swipl-win.exe (not the one built on Rtools42)

WRONG
OK:
[element(utf8,[],['\n',element(name,[],['Dürst']),'\n',element(name,[name='Dürst'],[]),'\n'])]
ANSWER:
[element(utf8,[],['\n',element(name,[],['Dürst']),'\n',element(name,[name='Dürst'],[]),'\n'])]

In Rtools42, the opposite pattern is shown:

64: WRONG
64: OK:
64: [ element(utf8,
64:       [],
64:       [ '\n',
64:         element(name,[],['D�rst']),
64:         '\n',
64:         element(name,[name='D�rst'],[]),
64:         '\n'
64:       ])
64: ]
64: ANSWER:
64: [ element(utf8,
64:       [],
64:       [ '\n',
64:         element(name,[],['Dürst']),
64:         '\n',
64:         element(name,[name='Dürst'],[]),
64:         '\n'
64:       ])

I’m a bit confused. The SGML tests run fine for me on Windows, both under Wine and in real Windows. All UTF-8 conversion is done by SWI-Prolog’s own code, mostly in the low-level I/O code. In what sense is the Rtools42 version different from the MSYS/MinGW compiled version? Does the system not talk to a normal Windows console?

Note that package installation may also try to convert the (test) files to the local encoding?

Ok, never mind, don’t stop the machines because of this. It’s probably a trivial thing on my local system.

Ok, I think I have found the problem. It’s harmless: In packages/sgml.pl, line 400, we have

sgml_open_options(Options, OpenOptions, SGMLOptions) :-
    Options = M:Plain,
    (   select_option(encoding(Encoding), Plain, NoEnc)
    ->  (   sgml_encoding(Encoding)
        ->  merge_options(NoEnc, [type(binary)], OpenOptions),
            SGMLOptions = Options
        ;   OpenOptions = Plain,
            SGMLOptions = M:NoEnc
        )
    ;   merge_options(Plain, [type(binary)], OpenOptions),
        SGMLOptions = Options
    ).

based on the following rationale:

%   The  encoding(+Encoding)  option   is    treated   special   for
%   compatibility reasons:
%
%     - If `Encoding` is one of =iso-8859-1=, =us-ascii= or =utf-8=,
%       the stream is opened in binary mode and the option is passed
%       to the SGML parser.
%     - If `Encoding` is present, but not one of the above, the
%       stream is opened in text mode using the given encoding.
%     - Otherwise (no `Encoding`), the stream is opened in binary
%       mode and doing the correct decoding is left to the parser.

If we change the options in the “otherwise” case from [type(binary)] to [type(text), encoding(utf8)], the tests pass under Windows without me having to run dos2unix on estag.sgml. I think that this default is acceptable if it is limited to Windows. I’ll make a PR for your kind consideration.
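
To make the idea concrete, here is a rough sketch of a Windows-only default. It is not the text of the PR: default_open_options/2 is a hypothetical helper that would stand in for the merge_options/3 call in the “otherwise” branch, and it assumes the windows Prolog flag is the right way to detect the platform.

```prolog
:- use_module(library(option)).

% default_open_options(+Plain, -OpenOptions)
% Hypothetical helper, not the actual code in sgml.pl.
% On Windows, default to a UTF-8 text stream; elsewhere keep the
% existing binary default. Options already present in Plain take
% precedence, because merge_options/3 lets its first argument win.
default_open_options(Plain, OpenOptions) :-
    (   current_prolog_flag(windows, true)
    ->  Defaults = [type(text), encoding(utf8)]
    ;   Defaults = [type(binary)]
    ),
    merge_options(Plain, Defaults, OpenOptions).
```

An explicit type(...) option passed by the caller still overrides the default on either platform.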

I guess that you did not see this problem on your Windows because, for some reason, your estag.sgml has Unix line endings. At least that’s what I get if I unpack the swipl sources using tar xvzf. If I work in my local git clone, estag.sgml has Windows line endings, and I get an error in the test.
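
To check which variant of a file is on disk, something like the following can be used. It is a minimal sketch; has_crlf/1 is a made-up name, and it reads the file in binary mode so that no newline translation hides the CR bytes.

```prolog
:- use_module(library(readutil)).
:- use_module(library(lists)).

% has_crlf(+File): succeeds if File contains at least one
% CR LF (13, 10) byte pair.
% (Hypothetical helper for diagnosing the test data.)
has_crlf(File) :-
    read_file_to_codes(File, Codes, [type(binary)]),
    append(_, [13,10|_], Codes),
    !.
```

Calling has_crlf/1 on estag.sgml in the unpacked tarball and in the git clone should tell the two apart.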

Ah, thanks. That was the explanation missing in the PR :) I do consider this an issue for the tests only and would prefer to see a patch to test_sgml.pl instead, possibly passing additional options?

I have added steps to reproduce the problem to the PR.

Moreover, I tried a few options in test_sgml.pl:

load_file(File, Term) :-
    load_pred(Ext, Pred),
    file_name_extension(_, Ext, File),
    !,
    retractall(error(_,_,_)),
    call(Pred, File, Term, [type(text), encoding('utf-8')]).
load_file(Base, Term) :-
    load_pred(Ext, Pred),
    file_name_extension(Base, Ext, File),
    exists_file(File),
    !,
    retractall(error(_,_,_)),
    call(Pred, File, Term, [type(text), encoding('utf-8')]).

load_pred(sgml, load_sgml).
load_pred(xml,  load_xml).
load_pred(html, load_html).

This fixes the problem in estag.sgml, but raises new issues in utf8-cent.xml (well, strictly speaking, in utf8-cent.ok, but I am unsure).

WRONG
OK:
[ element(testdoc,
          [id='t7-20020923',resp='MSM'],
          [ '\n',
            element(names,[],['From Espa▒ola -- a ▒test▒ for you.']),
            '\n',
            element(nums,[],['From Espa▒ola -- a ▒test▒ for you.']),
            '\n',
            element(names,[],['From Espa▒ola -- a ▒test▒ for you.']),
            '\n',
            element(nums,[],['From Espa▒ola -- a ▒test▒ for you.']),
            '\n'
          ])
]
ANSWER:
[ element(testdoc,
          [id='t7-20020923',resp='MSM'],
          [ '\n',
            element(names,
                    [],
                    ['From Española -- a ‘test’ for you.']),
            '\n',
            element(nums,
                    [],
                    ['From Española -- a ‘test’ for you.']),
            '\n',
            element(names,
                    [],
                    ['From Española -- a ‘test’ for you.']),
            '\n',
            element(nums,
                    [],
                    ['From Española -- a ‘test’ for you.']),
            '\n'
          ])

I’ve pushed a number of enhancements, the core one being the addition of newline(detect) to the open/4 options (this feature was already available through set_stream/2). For me, the sgml package tests now pass on both Linux and Windows, with the test data using either POSIX or DOS line-ending conventions. All file handling now also explicitly specifies UTF-8 encoding. Hopefully no system recodes that … Please give it a try …

Auto-detecting newline conventions has been used for quite a while in some of the development tools, such as the editor. It isn’t ideal, but it seems the most practical way to deal with mixed environments.
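
For anyone who wants to try this outside the test suite, the following sketch opens a file with both options. It assumes a SWI-Prolog build that already includes the newline(detect) option for open/4; read_text_codes/2 is a made-up name, not something from the sgml package.

```prolog
:- use_module(library(readutil)).

% read_text_codes(+File, -Codes)
% Read File as UTF-8 text and let the stream detect whether the
% file uses POSIX (LF) or DOS (CR LF) line endings.
% (Hypothetical helper; needs a version that supports
% newline(detect) in open/4.)
read_text_codes(File, Codes) :-
    setup_call_cleanup(
        open(File, read, In, [encoding(utf8), newline(detect)]),
        read_stream_to_codes(In, Codes),
        close(In)).
```

With this, the POSIX and DOS variants of the same test file should yield the same code list.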
