So, sometimes I actually write Prolog, not just C that implements Prolog!
I’m wondering what the most efficient way is to write a DCG for text parsing, like when using phrase_from_file/2. Specifically, there’s a task that I know would be easy to represent efficiently in an imperative language, but I’m not sure how to do it best/properly with a DCG.
I’m trying to parse C source with a DCG. C is particularly well-behaved in this respect, because when it comes to tokenization, you can in almost all cases tell what type of token you’re about to consume just from its first character. (“Almost” because a slash can lead a block comment, a line comment, or an operator, which are wildly different token types where parsing is concerned.) Read a letter or underscore? The next token is an identifier (or keyword, which has the same parsing rules). A backslash? The only valid place for that, in root parsing context, is just before a newline, as a line continuation.
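Concretely, the table I have in my head looks something like this. It’s just a sketch of the shape, nowhere near complete; leading_class/2 is a name I’m making up here, and I’m leaning on SWI-Prolog’s code_type/2 for the character classes:

% Hypothetical "what kind of token starts here?" table for C source.
leading_class(0'",  dqstring).      % double quote: string literal
leading_class(0'\\, continuation).  % backslash: only valid just before a newline
leading_class(0'/,  slash).         % block comment, line comment, or operator
leading_class(C, identifier) :- code_type(C, csymf).  % letter or underscore
leading_class(C, number)     :- code_type(C, digit).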
Now, in an imperative implementation, I’d just make a jump table with 127 entries, since only bare ASCII is allowed in root parsing context, and key it off the next character in the stream. Short of making 127 DCG heads, though, I’m not sure how to implement this method in Prolog. The token types with a unique initial character are easy:
token(dqstring(S)) --> `"`, escaped_string(S), `"`.
and that obviously keys on the first character in the text, because the [0'"] list is the first thing in the body. For something based on character class, though, it’s harder:
token(number(N)) --> [C0], {is_digit(C0)}, number_codes(C0, CT), {parsed_number(N, [C0|CT])}.
since in this case, is_digit isn’t tested until we hit the clause body. And, sure enough, when I run this with just explicit character checks (and a fallback of token(C) --> [C]. in place), it can parse a whole source file in under 100 ms, but once I start adding character-class checks it bogs down by a factor of three or more. I tried making an explicit character-class check like so:
token(T) --> peek_cclass(Class), token(Class, T).
% Pushback rule: consume C, then push it back, so the character is only peeked at.
peek_cclass(Class), [C] --> [C], {is_digit(C) -> Class = digit}.
% etc...
token(digit, number(N)) --> number_codes(L), {parsed_number(N, L)}.
but that slows things down hugely, like, a factor of 10 or more. What’s the best way to do this?

Also, what’s the best way to write pure DCGs that can be used for input or output when you need to do some transformation on the value, like escaped_string or parsed_number above? When parsing, you want the list production first, so that its arguments are instantiated when you hand off to the transformation predicate, but when generating, you want the transformation first, so you know how many codes are going into the list production. For that matter, it could be as simple as atom_codes on the representation, which won’t work unless at least one of the arguments is fully instantiated. But that’s a separate matter, anyway.
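To make that second question concrete, here is the same conversion written twice, once per direction (digits//1 is a throwaway helper just for this example, and both rule names are made up):

% Parsing direction: the list production runs first, so Ds is bound
% by the time number_codes/2 sees it.
number_in(N)  --> digits(Ds), { number_codes(N, Ds) }.

% Generating direction: convert first, so the list production knows
% which codes to emit.
number_out(N) --> { number_codes(N, Ds) }, digits(Ds).

% One or more digit codes.
digits([D|Ds]) --> [D], { code_type(D, digit) }, digits(Ds).
digits([D])    --> [D], { code_type(D, digit) }.

?- phrase(number_in(N), `42`). and ?- phrase(number_out(42), Cs). each work, but only in their own direction, and I’d rather not write two near-duplicate rules for every transformation like this.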
So, back to the dispatch question: how should this be done? I mean, I suppose I could generate the 127 clauses programmatically, or I could make it dynamic and asserta some production rules as I encounter each new leading character, but at the very least it does feel like there should be a better way.
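For what it’s worth, the generated-table version I’m picturing is roughly the leading_class/2 sketch from above, built in a loop at load time instead of written out by hand, so that first-argument indexing on the character code plays the role of the jump table. All the names here are made up, and I haven’t convinced myself it’s actually a good idea:

% Build the dispatch table once at load time.
:- dynamic leading_class/2.
:- initialization(build_dispatch_table).

build_dispatch_table :-
    forall(between(0, 127, C),
           ( classify(C, Class),
             assertz(leading_class(C, Class)) )).

% classify/2 is still written by hand, but once, not per token rule.
classify(0'",  dqstring)     :- !.
classify(0'\\, continuation) :- !.
classify(0'/,  slash)        :- !.
classify(C, identifier) :- code_type(C, csymf), !.
classify(C, number)     :- code_type(C, digit), !.
classify(_, other).

% The grammar then dispatches through one indexed lookup; the consumed
% code C is passed along because, e.g., the number rule still needs it.
token(T) --> [C], { leading_class(C, Class) }, token(Class, C, T).

But that still feels like I’m working around the DCG rather than with it, hence the question.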
I’ve got some other questions too, but I’ll save them for now, largely so I can go take the nap my body has been craving.