Is there a preferred way to check for valid characters when using DCG?


#1

For this example case I am using DCGs to parse a scientific name. The name is followed by ; as a delimiter and then the end of the line with \n. However the name can also contain a ; and thus when a ; is found the use of the ; has to be qualified to determine if it is part of the name or ends the name. Obviously with two characters of lookahead this is easy; if the stream is ;\n it is a delimiter and if it is not then it is part of the name.

For this I have two different and equally valid predicates that work; they have been tested under real world conditions and work as expected. Also the time difference is almost inconsequential. Now my code has many other predicates for other cases looking for delimiters that have exceptions, so before committing to one way for the many times needed in the code I would like to know if there is a preferred standard used for Prolog?

First:

This one uses one clause with ->/2 and for other parts of the data in the code the conditionals are nested down to 4 levels. See further down for example.

item_code_1(C) -->
    (
        ";\n"
    ->
        { false }
    ;
        [C]
    ).

Second

This one uses multiple clauses instead of one clause.

item_code_2(_) -->
    ";\n",
    { false }, !.
item_code_2(C) -->
    [C], !.

Real code with -> nested 4 levels.

host_name_code(C) -->
    (
        "."
    ->
        (
            "\n"
        ->
            { false }
        ;
            { C = 0'. } % '
        )
    ;
        (
            " "
        ->
            (
                "("
            ->
                { false }  
            ;
                { C = 0'\s } % '
            )
        ;
            [C]
        )
    ).

#2

This all looks a bit strange. Can you provide at least one example of what this code is supposed to parse and how the result should look?


#3
item_code_2(_) -->
    ";\n",
    { false }, !.
item_code_2(C) -->
    [C], !.

This would be my preferred style. But, the first clause is a no-op. Any clause with false before a cut or side-effects is just wasting CPU time. Guess you wanted the cut before the {false}. For the second clause, a ! at the end of the last clause is always suspicious. In this case too it is not needed.

Note that I would expect “;” inside an item code name to be rare, so the first clause rarely does something. A possible better style is

item_code_2(C) :-
    \+ ";\n",
    [C].

#4

Can you provide at least one example of what this code is supposed to parse and how the result should look?

Example:


:- set_prolog_flag(double_quotes,codes).

item_1(Item) -->
    item_codes_1(Codes),
    { string_codes(Item,Codes) },
    ";",
    "\n".

item_codes_1([C|Cs]) -->
    item_code_1(C),
    item_codes_1(Cs).
item_codes_1([]) --> [].

item_code_1(C) -->
    (
        ";\n"
    ->
        { false }
    ;
        [C]
    ).

item_2(Item) -->
    item_codes_2(Codes),
    { string_codes(Item,Codes) },
    ";",
    "\n".

item_codes_2([C|Cs]) -->
    item_code_2(C),
    item_codes_2(Cs).
item_codes_2([]) --> [].

item_code_2(_) -->
    ";\n",
    { false }, !.
item_code_2(C) -->
    [C], !.

:- begin_tests(delimiter_1).

delimiter_test_case("A;\n"   ,"A").
delimiter_test_case("Ab;\n"  ,"Ab").
delimiter_test_case("Ab;c;\n","Ab;c").

test(1,[forall(delimiter_test_case(Input,Expected_Item)),nondet]) :-
    string_codes(Input,Codes),
    phrase(item_1(Item),Codes, Rest),
    assertion( Item == Expected_Item ),
    assertion( Rest == [] ).

:- end_tests(delimiter_1).

:- begin_tests(delimiter_2).

delimiter_test_case("A;\n"   ,"A").
delimiter_test_case("Ab;\n"  ,"Ab").
delimiter_test_case("Ab;c;\n","Ab;c").

test(1,[forall(delimiter_test_case(Input,Expected_Item)),nondet]) :-
    string_codes(Input,Codes),
    phrase(item_2(Item),Codes, Rest),
    assertion( Item == Expected_Item ),
    assertion( Rest == [] ).

:- end_tests(delimiter_2).

The actual data file is from UniProt in this directory (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/) (FTP links don’t seem to work here) and named: uniport_sprot.data.gz and when uncompressed is ~3.1GB. The manual contains the details.

I know there is also an XML version in the same directory, but haven’t looked at it in detail as I was finding that parsing this file with DCG is proving to be quite a valuable learning experience because of all the data design rules for parsing it breaks (as noted in this question), e.g. names with semicolons in them, i.e. AMT1;1


#5

Thanks. Much appreciated.


#6

Hello Eric,

without claiming that I understand the full depth of the problem you are solving, here is one attempt at writing a predicate that passes the tests you showed:

item_3(Name) -->
    string(String), ";\n", !,
    { string_codes(Name, String) }.

I used string//1 from library(dcg/basics), but as you can see for yourself, there is absolutely nothing special about it.

It would seem that it might be doing less work that either of the two definitions you provided, and it does not leave behind choice points (do you need those? Not immediately obvious to me). Here is what I get:

?- time( phrase(item_1(Item), `ab;c;\n`) ).
% 35 inferences, 0.000 CPU in 0.000 seconds (89% CPU, 850237 Lips)
Item = "ab;c" ;
% 17 inferences, 0.000 CPU in 0.000 seconds (92% CPU, 395119 Lips)
false.

?- time( phrase(item_2(Item), `ab;c;\n`) ).
% 38 inferences, 0.000 CPU in 0.000 seconds (90% CPU, 845290 Lips)
Item = "ab;c" ;
% 17 inferences, 0.000 CPU in 0.000 seconds (81% CPU, 359431 Lips)
false.

?- time( phrase(item_3(Item), `ab;c;\n`) ).
% 22 inferences, 0.000 CPU in 0.000 seconds (85% CPU, 1055156 Lips)
Item = "ab;c".

Can you show how you run a realistic test to time this particular predicate in isolation? The data you linked to has too many other fields.


#7

Boris, Thanks.

The data set in the link is released only once a month and is currently about 63 million lines so there is no way this could be parsed in a few seconds or even few minutes and thus I don’t try and write it to be as fast as possible but instead concentrate on the parsing to be accurate, complete, and relatively easy to follow for someone who understands DCGs.

The main reason behind the question was that if I release the this code or other DCG code on GitHub I would hate for it have an improper way of doing something with DCGs and then the code gets copied, pasted and modified and becomes the starting point for a cascade of code that works but is done wrong.

Normally I would give the code you asked for in a second, but for just the part you asked for it would be a few hundred lines of code that is hard to read and follow as it is only proof of concept.

Before I asked this question, the parser was able to parse the entire file accurately and completely, but the code was written as a proof of concept. Now I am going back over the code and refactoring it for real world use thus this question and probably more to come.

Eric