DCG look ahead, how is it done?

Sounds like you could, instead, define e.g. optional_comment and add that to the DCG, to eat the comment as legitimate, optional text to ignore.
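For instance, something along these lines (a minimal sketch – optional_comment and the "//" comment syntax are my assumptions, not code from this thread):

:- use_module(library(dcg/basics)).   % for string_without//2

% Eat a "//" line comment up to (not including) the newline, if present;
% otherwise match the empty sequence.
optional_comment --> "//", string_without("\n", _).
optional_comment --> [].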

Yes, that's what I do, but that doesn't solve this issue …

Basic example:

opt_stuff --> "opt".
req_stuff --> "req".

my_stuff --> ( opt_stuff | [] ), req_stuff.

?- phrase(my_stuff, _L), atom_codes(A, _L).
A = optreq ;
A = req.

A cut will break the code as a generator. Example:

opt_stuff --> "opt".
req_stuff --> "req".

my_stuff --> ( opt_stuff | [] ), !, req_stuff.

… which gives an incomplete answer:

?- phrase(my_stuff, _L), atom_codes(A, _L).
A = optreq.

So no, I don’t want a cut, and the backtracking is intentional.

Ah yes, true. It was a good example of the minefield that cuts cause :grinning:

I, btw, decided to add cuts wherever a declaration has been identified.

Instead of returning the parsed structure as arguments in the grammar, I wanted to keep a clean view of the grammar at the top level; further below I “accumulate” identified clauses in a (backtrackable) queue and then, at appropriate points, call Prolog code to assert the identified clauses.
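For illustration, such a backtrackable queue could be sketched with SWI-Prolog's backtrackable global variables (b_setval/2, b_getval/2); the predicate names here are mine, not the actual implementation:

% Items enqueued via b_setval/2 are automatically un-enqueued again
% when the parser backtracks over the enqueue.
init_queue :-
    b_setval(queue, []).

enqueue(Item) :-
    b_getval(queue, Q0),
    append(Q0, [Item], Q),
    b_setval(queue, Q).

dequeue(Item) :-
    b_getval(queue, [Item|Q]),
    b_setval(queue, Q).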

Seems to work well so far … (as I implement this).

If you show some code, this discussion can be less abstract :wink:

Sure,

gen_syntax_declaration dequeues the quoted string and side comment, and asserts them as part of an AST.

protobuf_element -->
    syntax_keyword, equal, quoted_string,
    end_of_statement,
    optional_side_comment,
    { !, gen_syntax_declaration }.


quoted_string --> quoted_string(String), { enqueue(String) }.

quoted_string(String) -->
    skip_whites, "\"", string_without("\"", Codes), "\"",
    { string_codes(String, Codes) }.
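To make the flow concrete, gen_syntax_declaration might look roughly like this (a sketch reusing the dequeue/1 idea from above – the dequeue order and the syntax_decl/2 AST term are my assumptions):

% Pop the pieces enqueued while parsing the declaration and record them.
gen_syntax_declaration :-
    dequeue(QuotedString),
    dequeue(SideComment),
    assertz(syntax_decl(QuotedString, SideComment)).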

That approach forces cuts/determinism, rather than convenient backtracking.

Yes, it includes determinism – in the form of domain knowledge.

It uses the cut to cut away further unfruitful backtracking.

Once a complete declaration has been identified, there is no longer any need for further backtracking – and the declaration can be asserted.

There is no need to wait until the whole grammar has been parsed.

This also allows for less memory usage.

In one version, where I did keep state in the DCG, memory usage went up to 12 GB … presumably because this grammar includes comments as well.

(I was parsing a whole folder structure of protobuf files in one foreach loop.)

When using phrase_from_file/2, it makes sense to wrap the call within once/1 – this allows for more garbage collection.
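E.g. (a sketch – parse_file/1 and the protobuf_file//0 start symbol are illustrative names):

:- use_module(library(pure_input)).

% once/1 commits to the first parse, so choice points (and the input
% they keep reachable) can be garbage collected.
parse_file(File) :-
    once(phrase_from_file(protobuf_file, File)).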

  • The JSON parser (2007) pre-dates library(pure_input) (2008).
  • The initial version of pure_input could only handle repositionable streams (e.g., files). One often wants to parse JSON from data arriving on a socket. This was only resolved in 2016.
  • The JSON library uses some low-level (foreign) helpers for fast reading of numbers and a few other things from a stream.

Overall, library(pure_input) is quite efficient. Yes, the constraint mechanism adds serious overhead, but we read the input in blocks rather than character by character. There is quite some work getting a character from a stream (get_code/2): validate the stream, lock it, get the character, unlock it.

So, does my code sample eliminate tail-recursion optimization?

Also, I guess this only applies to DCG rules that call their own head in their last clause.

I guess my cut would only appear where there is no further tail-recursive call in the clause.

Do I understand this correctly?
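For reference, the distinction being asked about (my illustration, not from the thread): last-call optimization applies when the recursive call is the final goal of the clause body.

% The recursive call is the last goal: the current frame can be reused.
items --> item, items.
items --> [].

% Here the {...} goal runs after the recursive call, so the recursive
% call is no longer in tail position (gen_declaration is a placeholder).
items_asserting --> item, items_asserting, { gen_declaration }.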

How is a quote character represented inside the quoted string? Probably by doubling up the quotes, which this DCG does not cater for.
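A version that does cater for doubled-up quotes could look like this (my sketch, not the poster's code):

quoted_string_dq(String) -->
    "\"", dq_codes(Codes), "\"",
    { string_codes(String, Codes) }.

% Inside the string, "" stands for a single " character.
dq_codes([0'\"|Cs]) --> "\"\"", !, dq_codes(Cs).
dq_codes([C|Cs])    --> [C], { C \== 0'\" }, dq_codes(Cs).
dq_codes([])        --> [].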

Perhaps skip_whites is best put in the “parent” DCG, for elegance.

re: quoted string

Yes, I realized that – but, luckily, there was no such usage in the proto files I parsed.

re: skip_whites

If I understand your comment correctly …

I decided to push them down as far as possible, so that they are not (needlessly) repeated and the top-level grammar stays with language syntax elements only.

I had such constructions earlier on, with skips and tokens, but it wasn't legible relative to the BNF descriptions.

I am a bit unsure what challenges you are pointing at with DCGs – so far I was able to construct a DCG for protobuf; at least the hundred-odd protobuf files seem to have parsed without error and with adequate performance.

For now the performance and memory usage are fine – they don't stop the processing I have in mind.

If grammar files are thrown at me where the current inference overhead becomes a bottleneck, then I will need to reevaluate the design of the grammar and, if necessary, the use of Prolog for the parsing job – which would be a pity from my personal preference point of view.

Thank you.

I need to review more carefully why I felt that a look ahead for identifying a closing “}” was necessary at the end of the file – i.e., when it was the last symbol in the file.

For now, the following rule helped me complete the parse:

lookahead_closing_brace, [X] --> [X], { X =:= 125 }.  % 125 is the code of '}'
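The same pushback technique generalises to a one-symbol lookahead (a common idiom, shown here for illustration):

% Peek at the next code without consuming it: match C, then push it back.
lookahead(C), [C] --> [C].

With that, the rule above is just lookahead(0'}), since 0'} =:= 125.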

I have been loosely following the BNF of Google's protobuf – but at the same time I modified it as needed, as I found issues in the protobuf files I was parsing.

For example, many had an extra “;” where Google's BNF spec didn't have one – so I allowed for a looser syntax, since the AST would not be affected and I am not checking for syntax errors.

Indeed, I wasn't precise – it's apparently EBNF, since I am handling repetition and optionals.

The patterns are rather simple:

Something along these lines:

items --> item, items.
items --> [].

item --> ...

How would you improve on this …