DCG look ahead, how its done?

Hello,

I have been working on creating a DCG gammar for protobuf syntax (BNFs). and want to implement a look ahead rule, where a grammar rule looks to identify a keyword on the current line (i.e. up until the next semi colon).

I noticed that some seem to use phrase but its unclear to me how phrase can access the current unconsumed tail, also what head to use in phrase is not so clear.

E.g. i want to look ahead if a line includes the keyword “syntax” or “package” …

all comments are much appreciated,

Dan

2 ways spring to mind:

  • Grab a line, then parse within that line
  • Parse from left to right as a normal DCG

What’s stopping this from being a normal DCG? Just define/declare the possibilities for a valid line - what are the valid word combinations, what are the significant words and what are the ignorable non-line-ending characters.

Thanks –

The grab a line wasn’t so clear to me, although I then thought to drop out of DCG such as so:

look_ahead(Keyword, Rest, Rest):-
          find_keyword(Keyword, Rest).

The Rest would then hold the rest of the string …

Just use a normal DCG:

my_line --> sentence_start, filler, thing_that_appears_later, end_of_line.

There’s no magical lookahead needed for thing_that_appears_later - it’s just included in the definition of a line.

Yes, I know …

But, i want to save backtracking until the token is found in the middle of the line, so, I want to first look ahead, and if a token is there, to do the processing until the token.

Its to be more efficient …

Edit:

Now i came up with this, note that Codes holds the contents of a whole protobuf file, so i guess the string_codes isn’t very efficient …

I wonder: is there a way to do sub_string on codes ?

find_keyword_until_semicolon(Keyword, Codes) :-
	string_codes(String, Codes),
	sub_string(String, Before,_, _, ";"),		% first, find the next semicolon
        !,
sub_string(String, 0, Before, _, Prefix),		% find the prefix string until the semicolon
	!, 
       sub_string(Prefix, _, _, _, Keyword).           % find in prefix the keyword.

Btw, the predicate is used in DCG as follows:

declaration --> look_ahead("syntax"), ...

look_ahead(Keyword, Codes, Codes) :-
    find_keyword_until_semicolon(Keyword, Codes).
... --> [] | [_], ... .

substr(Sub) --> ..., seq(Sub), ... .

seq([]) --> [].
seq([H|T]) --> [H], seq(T).

Edit: Had missed a - char.

that looks fun — thanks!

Great, some old school magic :slight_smile:

The docs of phrase_from_file/2 contain an example of finding substrings:

Below is an example that counts the number of times a string appears in a file. The library dcg/basics provides string//1 matching an arbitrary string and remainder//1 which matches the remainder of the input without parsing.

:- use_module(library(dcg/basics)).

file_contains(File, Pattern) :-
        phrase_from_file(match(Pattern), File).

match(Pattern) -->
        string(_),
        string(Pattern),
        remainder(_).

match_count(File, Pattern, Count) :-
        aggregate_all(count, file_contains(File, Pattern), Count).

This can be called as (note that the pattern must be a string (code list)):

?- match_count('pure_input.pl', `file`, Count).

How many times does abcabc occur in abcabcabc?

Should be once, not twice.

All usual Prolog tools say twice. I prefer the example “ana” in “banana” or “ananas”, depending if you are religious.

?- sub_atom(banana, Before, Len, After, ana).
Before = 1,
Len = 3,
After = 2 ;
Before = Len, Len = 3,
After = 0.

?- append(Front, Back, `banana`), append(`ana`, _, Back).
Front = [98],
Back = [97, 110, 97, 110, 97] ;
Front = [98, 97, 110],
Back = [97, 110, 97] ;
false.

Even your example from above, when you fix the typo, works like this:

?- [user].
|: ... --> [] | [_], ... .
|: ^D% user://1 compiled 0.01 sec, 1 clauses
true.

?- phrase((..., `ana`, ...), `banana`).
true ;
true ;
false.

If i understand correctly “looking ahead”, then it should be possible to do with push-back lists. See this discussion in the old SWI-Prolog mailing list:>

7.14.3 Push-back lists
A push-back list is a proper list of terminals on the left-hand side of a grammar rule (see 3.14). A push-back list contains terminals that would be asserted in the input terminal list after the terminals consumed by the successful application of the grammar rule.

7.14.3.1 Examples
For example, assume that we need rules to look-ahead one or two tokens that would be consumed next. This could be easily accomplished by the following two grammar rules:

look_ahead(X), [ X] → [ X].
look_ahead(X, Y), [X,Y] → [X,Y].

When used for parsing, procedurally, these grammar rules can be interpreted as, respectively, consuming, and then restoring, one or two terminals.

https://swi-prolog.iai.uni-bonn.narkive.com/QBwIiHBC/how-to-hide-implied-tokens-in-dcg-rules#post16

The exciting thing here, @grossdan , is that EBNFs translate quite nicely to a DCG. You can start manually translating the production rules for Protocol Buffers to DCG rules in correct Prolog. This seems like a mechanical task, meaning it is straight-forward to do manually; and probably possible to write a program that generates the parser from the provided spec. If you want to look into that, maybe start with the EBNF grammar spec in EBNF?

One interesting exercise would be to figure out how the ... (range), the | (alternation), [] (option), and {} (repetition) translate to a Prolog DCG.

In theory this could work very nicely.

In practice when you write the grammar manually you notice unexpected issues – such as how end of line comments are handled when the next declaration has comments as well.

Protobuf grammar (and compiler) also captures comments …

Also, typically the BNF / DCG is interspersed with prolog since we want to do something with the parse – such as generate an AST or simply reformat the file read, and/or other kinds of analyses.

But yes, i am striving to create a close and clean DCG grammer description that mirrors the BNF – so the reader can verfiy that the DCG indeed conforms to the BNF

Dan

I would expect that the most efficient is a DCG that keeps some state to know what to expect next.

DCGs don’t have to be performing enormous amounts of backtracking - there’s different styles.

I’m curious why you’re making a parser for protobufs – there’s already one, namely the protobuf compiler protoc, which outputs all the information as protobufs to a “plugin”. If you want more details, send me a message (or look at the plugin packages/protobufs/bootstrap/protoc-gen-swipl). See also Other Languages | Protocol Buffers Documentation and protobuf/src/google/protobuf/compiler/plugin.proto at 903c3f15b04d99ab88cee53e4cec9464ef292bce · protocolbuffers/protobuf · GitHub

The main reason I decided to write a parse in prolog is because I am mainly interested in the comments and want to see them annotate the AST in specific ways.

Also, the plugin i found for the google protobuf compiler that helps generate, say, JSON, didn’t work – i.e the protobuf files i am parsing failed.

I am not the author of these files so i can’t change them.

I did try to communicate with the authors of the plugin but turn around is too uncertain to be useful.

So, I decided to give it an own try.

I notice that dealing with comments has its own complexity and even a kind of semantics (e.g. is this a comment of a declaration or a group of declaration) …

Dan

Example:

occurs_count(S, C) --> occurs_count(S, 0, C).

occurs_count(_, C, C) --> [].
occurs_count(S, U, C) --> S, !, { succ(U, U1) }, occurs_count(S, U1, C).
occurs_count(S, U, C) --> [_], occurs_count(S, U, C).
?- phrase(occurs_count(`abcabc`, C), `abcabcabc`).
C = 1 ;
false.

?- phrase(occurs_count(`abcabc`, C), `abcabcabciruheirugheriabcabc`).
C = 2 ;
false.

As an alternative to BNF, there’s PEG support in Prolog: GitHub - ridgeworks/pPEGpl: pPEG grammar support for SWI-Prolog

Is that GitHub - chrusty/protoc-gen-jsonschema: Protobuf to JSON-Schema compiler or GitHub - GoogleCloudPlatform/protoc-gen-bq-schema: protoc-gen-bq-schema helps you to send your Protocol Buffer messages to BigQuery. ?

Yes – and it turns out to be surprisingly messy to annotate an AST with source coordinates (it’s also surprisingly messy to produce good error messages). I think that using edcg instead of DCG can be helpful, although there are ways of threading multiple (and custom) accumulators with regular DCGs. Or, there’s this: Dealing with state - #10 by jan

The grammar for protobufs seems to be here: Protocol Buffers Version 3 Language Specification | Protocol Buffers Documentation (there’s also a grammar here; I don’t know if they’re the same: Language Specification | The Protobuf Language)

You should be able to turn that into a DCG directly, although any left-recursion needs to be taken care of by either tabling or by converting to some kind of “sequence” that avoids the left-recursion. There’s no need for any “look-ahead” - the normal backtracking takes care of that. However, you typically need to add cuts to prevent infinite backtracking on errors.

The usual technique is to make two grammars: one for the “tokens” (keywords, identifiers, punctuation, skipping whitespace and comments), which produces a list which is then processed by a context-free grammar to produce a syntax tree (depending on what information you capture in the tokens, this can also have references to the program source). The syntax tree is then walked for processing the semantics.
[In yacc/flex, the tokenizer is done using regular expressions; these are easily done by DCGs, so no need to use library(pcre)]

I had this weird case where there was a comment right at the end of the file and just before a final “}”.in the file.

I had to find a way to recognize that there is no more declaration after the comment block while the “}” can’t be consumed since it needed to end the outer declaration

So, the solutoin was to look ahead for a closing “}” and just for the case of a comment at the end of a declaration block.