Most efficient DCG for text parsing?

So, sometimes I actually write Prolog, not just C that implements Prolog! :joy:

I’m wondering what the most efficient way is to write a DCG for text parsing, like when using phrase_from_file/2. Specifically, there’s a task that I know would be easy to represent efficiently in an imperative language, but I’m not sure how to do it best/properly with a DCG.

I’m trying to parse C source with a DCG. C is particularly well-behaved in this respect, because when it comes to tokenization, you can in almost all cases tell what type of token you’re about to consume just from the first character. (“Almost” because a slash can lead a block comment, a line comment, or an operator, which are wildly different token types where parsing is concerned.) Read a letter or underscore? The next token is an identifier (or keyword, which has the same parsing rules). Backslash? The only valid place to use that, in root parsing context, is just before a newline, as a line continuation.

Now, in an imperative implementation, I’d just make a jump table with 127 entries, since only bare ASCII is allowed in root parsing context, and key it off the next character in the stream. Short of making 127 DCG heads, though, I’m not sure how to implement this method in Prolog. The parse types with a unique, distinct initial character are easy:

token(dqstring(S)) --> `"`, escaped_string(S), `"`.

and that obviously keys on the first character in the text, because the [0'"] list is the first thing in the body. For something based on character class, though, it’s harder:

token(number(N)) --> [C0], {is_digit(C0)}, number_codes(C0, CT), {parsed_number(N, [C0|CT])}.

since in this case, is_digit isn’t tested until we hit the clause body. And, sure enough, when I run this with just explicit character checks (and a fallback of token(C) --> [C]. in place) it can parse a whole source file in under 100 ms, but once I start adding character class checks, it bogs down by a factor of three or more. I tried making an explicit character class check like so:

token(T) --> peek_cclass(Class), token(Class, T).
peek_cclass(Class), [C] --> [C], {is_digit(C) -> Class = digit}.
% etc...
token(digit, number(N)) --> number_codes(L), {parsed_number(N, L)}.

but that slows things down hugely, like, factor of 10 or more. What’s the best way to do this? Also, what’s the best way to write pure DCGs that can be used for input or output when you need to do some transformation on the value, like escaped_string or parsed_number above? When parsing, you want the list production first, so that the arguments are instantiated when you hand off to the transformation predicate, but when generating, you want the transformation first, so you know how many codes are going into the list production. For that matter, it could be as simple as atom_codes on the representation, which won’t work unless at least one of the arguments is fully instantiated. But that’s a separate matter, anyway.

So how should this be done? I mean I suppose I could make 127 predicates, programmatically speaking, or I could make it dynamic and asserta some production rules as I encounter each new leading character, but at the very least it does feel like there should be a better way.

I’ve got some other questions too but I’ll save them for now, largely so I can go take the nap my body has been craving :sweat_smile:

You probably get a slowdown because your code is not deterministic.
Note that the if-then-else (->)/2, the way you used it, doesn’t cut your
clause. You still have backtracking into the other clauses of peek_cclass//1.

The DCG draft allows the cut !/0 in DCG bodies. So when you want to
classify, it is probably wiser to do the following:

peek_cclass(Class), [C] --> [C], {is_digit(C)}, !, {Class = digit}.

Alternatively, in case your DCG implementation doesn’t support the
cut !/0, you could do:

peek_cclass(Class), [C] --> [C], {is_digit(C), !, Class = digit}.
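Extending that idea to more than one class, one way to keep the classification deterministic is to push the commitment into an ordinary helper predicate. A sketch only: cclass/2 and the class names here are invented, and code_type/2 is SWI-Prolog’s character-class test.

```prolog
% Single deterministic pushback clause: classify the next character
% without consuming it.
peek_cclass(Class), [C] --> [C], { cclass(C, Class) }.

% One clause per class, tried in order and committed with a cut,
% so no choice points are left behind.
cclass(C, digit) :- code_type(C, digit), !.
cclass(C, csym)  :- code_type(C, csymf), !.   % letter or underscore
cclass(C, space) :- code_type(C, space), !.
cclass(_, other).
```

With this shape, peek_cclass//1 has a single clause and cclass/2 is deterministic, so no cut is needed at the DCG level.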

Edit 04.07.2021:
I am currently using a tokenizer and parser for Prolog that is written
entirely in Prolog. It is open source. I have not yet done any measurements,
and it still has some corners where improvements could be made, but it works.

I will do some measurements once I have finished optimizing the Prolog
system itself. The tokenizer and parser do not use phrase_from_file/2;
they are instead implemented in a “monadic” style.

This means the DCG is used to pass around a lot of state. See also here.

Jan already implemented quite a useful C parser in the ffi pack; maybe you can use that, or follow it as an example.


Ha, I should have known better than to shorthand when I was typing that up. The actual version I wrote does in fact use cut operators (and also an extra predicate), and I’ve checked with A in the tracer to make sure I’m not leaving behind lots of choice points. I don’t have my laptop in front of me just atm, but iirc I wrote a code_cclass(+C, -Class) with SSU =>/2 rules, and peek_cclass//1 has just a single det clause and thus doesn’t need a cut. I’ve also tried adding tabling to code_cclass/2, but it doesn’t make much difference; then again, I’m not sure tabling really gains much here, since this isn’t an “expensive computation” kind of predicate.

A good option is to use term_expansion/2 to generate the 127 rules for you. Typing 127 rules is a bit boring. From the implementation point of view they are most likely the best solution though.
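For illustration, such an expansion might look roughly like this. This is only a sketch of the idea, not a worked-out implementation: token_tail//3, dispatch_class/2 and the marker term are invented names, while dcg_translate_rule/2 is SWI-Prolog’s DCG-to-clause translator.

```prolog
% When the loader reads the marker term below, replace it with one
% token//1 clause per ASCII code, so clause indexing can dispatch on
% the next character without re-testing character classes at run time.
term_expansion(token_dispatch_rules, Clauses) :-
    findall(Clause,
            ( between(0, 126, C),
              dispatch_class(C, Class),
              dcg_translate_rule(
                  (token(T) --> [C], token_tail(Class, C, T)),
                  Clause)
            ),
            Clauses).

dispatch_class(C, digit) :- code_type(C, digit), !.
dispatch_class(C, csym)  :- code_type(C, csymf), !.   % letter or '_'
dispatch_class(_, other).

token_dispatch_rules.   % marker term, replaced by the generated clauses
```

The generated rules each consume a concrete character code, which is exactly the “127 heads” shape, without typing them by hand.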

Opinions vary. Roughly, you have three options:

  • Use var/1 (nonvar/1) checks in the rules.
  • Use delays (when/2, freeze/2)
  • Write two DCGs

All three have their merits. The advantage of the first two is that they keep the code for parsing and generation together, which makes it easier to maintain consistency if you change the rules over time. That is particularly true for the delay version. Delays are relatively slow, though, and you have to be careful if you also want to use cuts (or if->then;else) to make sure all relevant delayed calls have materialized. Committing is more or less obligatory in real-world DCGs, in particular for artificial languages, as not doing so typically leads to practically infinite backtracking in case of a syntax error. You also need to be sure not to leave any residual goals behind. An explicit var/nonvar split is typically more efficient, but uglier.

Two distinct DCGs avoid all the var/nonvar tests and delays, but make it harder to keep the two in sync. On the other hand, the two DCGs are typically different when it comes to comments, layout handling and other input ambiguities (escaped strings, floating point numbers, etc.). The parsing one often simply skips all that, while the generating one may wish to keep track of nesting to emit line breaks and indentation. Trying to combine all of that in one implementation is not always a good idea.
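A minimal sketch of the first option, for a bidirectional number//1 (the helper nonterminals are invented for the example): test nonvar/1 to pick a direction, converting first when generating and consuming codes first when parsing.

```prolog
% Generating: N is bound, so convert it to codes first, then emit them.
number(N) --> { nonvar(N) }, !, { number_codes(N, Codes) }, Codes.
% Parsing: N is unbound, so consume digit codes first, then convert.
number(N) --> digits(Codes), { number_codes(N, Codes) }.

digits([D|Ds]) --> digit(D), digits_tail(Ds).       % at least one digit
digits_tail([D|Ds]) --> digit(D), !, digits_tail(Ds).
digits_tail([]) --> [].
digit(D) --> [D], { code_type(D, digit) }.
```

Calling phrase(number(N), `123`) parses N = 123, while phrase(number(123), Codes) generates the code list back; the nonvar/1 test is the one ugly spot the post above mentions.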

… it depends on the use case, requirements and preferences …


About “Write two DCGs”: the original Modula compiler was multi-pass. There was a scanner which wrote a file of tokens; then a parser took these tokens, and the next pass started. That lasted until, over Christmas, Wirth and Gutknecht(*) wrote a totally new Modula compiler that did everything together.

Yes, two DCGs are the way to go. Jan W.’s C parser collects a list of tokens first:

c99_tokens(List) -->
    c99_token(H), !,                /* calls token */
    (   {skip_token(H)}
    ->  c99_tokens(List)
    ;   {List = [H|T]},
        c99_tokens(T)
    ).

token(T) --> keyword(T), \+ identifier_cont_char(_).
token(T) --> identifier(T).
token(T) --> constant(T).
% etc.

In a “monadic” parser you don’t have these two passes, but you nevertheless have two DCGs. The passes are interwoven and no token list is materialized. One can imagine that the DCGs have a ‘C’/3 implementation which calls get_code/2 in the case of the tokenizer and get_token/2 in the case of the parser.
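A minimal sketch of that idea, assuming a get_token/2 entry point for the tokenizer (it is not a built-in): the ‘C’/3 analogues simply thread the stream unchanged and read on demand.

```prolog
% 'C'/3 analogue for the tokenizer: the DCG "state" is the stream itself,
% and each terminal reads one code lazily instead of walking a list.
code_c(Code, Stream, Stream) :-
    get_code(Stream, Code).

% 'C'/3 analogue for the parser; get_token/2 is an assumed entry point
% into the tokenizer, which in turn reads codes with code_c/3.
token_c(Token, Stream, Stream) :-
    get_token(Stream, Token).
```

Since a stream carries its own position, such a parser cannot backtrack over input it has already consumed, which is another reason the committed, cut-early style discussed earlier in the thread matters.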

The result is that you can process huge files with little or no memory.

(*) For their Lilith computer in the 1980s.


A long time ago, I re-implemented cdecl in Prolog. I no longer have the source, just the first draft of a tutorial I wrote. Anyway, the code was completely bi-directional – it used the DCTG formalism, which more-or-less created an AST with additional information. I think that the key to being able to use the code for either input or output was the intermediate AST.

I also collected some information about DCGs from the old Prolog Digest (circa 1988; the archives seem to have disappeared from the Internet) – if you’re interested, I can send it to you (it’s an appendix of my tutorial, which I’d rather not publish until I’ve re-read it and decided I wouldn’t be embarrassed by it).
