Sounds like you could, instead, define e.g. optional_comment
and add that to the DCG, to eat the comment as legitimate, optional text to ignore.
Yes, that's what I do, but that doesn't solve this issue …
Basic example:
opt_stuff --> "opt".
req_stuff --> "req".
my_stuff --> ( opt_stuff | [] ), req_stuff.
?- phrase(my_stuff, _L), atom_codes(A, _L).
A = optreq ;
A = req.
A cut will break the code as a generator. Example:
opt_stuff --> "opt".
req_stuff --> "req".
my_stuff --> ( opt_stuff | [] ), !, req_stuff.
… which gives an incomplete answer:
?- phrase(my_stuff, _L), atom_codes(A, _L).
A = optreq.
So no, I don’t want a cut, and the backtracking is intentional.
Ah yes, true. It was a good example of the minefield that cuts cause.
I, btw, decided to add cuts wherever a declaration has been identified.
Instead of returning the parsed structure as arguments in the grammar, I wanted a clean view of the grammar at the top level; further below I "accumulate" identified clauses in a (backtrackable) queue and then, at appropriate points, call Prolog code to assert the identified clauses.
Seems to work well so far … (as I implement this).
If you show some code, this discussion can be less abstract.
Sure,
gen_syntax_declaration dequeues the quoted string and side comment, and asserts them as part of an AST.
protobuf_element -->
    syntax_keyword, equal, quoted_string,
    end_of_statement,
    optional_side_comment,
    { !, gen_syntax_declaration }.

quoted_string --> quoted_string(String), { enqueue(String) }.

quoted_string(String) -->
    skip_whites, "\"", string_without("\"", Codes), "\"",
    { string_codes(String, Codes) }.
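The enqueue/1 helper is not shown above; a minimal sketch of how the "backtrackable queue" described earlier could be kept in a backtrackable global variable (b_setval/2 and b_getval/2 in SWI-Prolog), with names of my own choosing:

% Sketch only: a queue in a backtrackable global variable, so that
% enqueued items disappear again when the parser backtracks.
init_queue :-
    b_setval(parse_queue, []).

enqueue(Item) :-
    b_getval(parse_queue, Q),
    b_setval(parse_queue, [Item|Q]).   % undone on backtracking

dequeue(Item) :-
    b_getval(parse_queue, Q),
    append(Front, [Item], Q),          % oldest item leaves first (FIFO)
    b_setval(parse_queue, Front).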
That approach forces cuts/determinism, rather than convenient backtracking.
Yes, it includes determinism – in the form of domain knowledge.
It uses the cut to cut away further unfruitful backtracking.
Once a complete declaration has been identified, there is no need any more for further backtracking – and the declaration can be asserted.
There is no need to wait until the whole grammar has been parsed.
This also reduces memory usage.
In one version where I kept state in the DCG, memory usage went up to 12 GB … presumably because this grammar includes comments as well.
(I was parsing a whole folder structure of protobuf files in one foreach loop.)
When using phrase_from_file/2, it makes sense to wrap the call within once/1, which allows for more garbage collection.
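For example (a sketch; my_stuff stands in for the real grammar):

:- use_module(library(pure_input)).

% Committing with once/1 drops the choice points that would otherwise
% force the buffered input to be retained, so it can be garbage collected.
parse_file(File) :-
    once(phrase_from_file(my_stuff, File)).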
- The JSON parser (2007) pre-dates library(pure_input) (2008).
- The initial version of pure_input could only handle repositionable streams (e.g., files). One often wants to parse JSON from data arriving on a socket. This was only resolved in 2016.
- The JSON library uses some low-level (foreign) helpers for fast reading of numbers and a few other things from a stream.
Overall, library(pure_input) is quite efficient. Yes, the constraint mechanism adds serious overhead, but we read the input in blocks rather than character by character. There is quite some work involved in getting a character from a stream (get_code/2): validate the stream, lock it, get the character, unlock it.
So, does my code sample eliminate tail-recursion optimization?
Also, I guess this only applies to DCG rules whose last clause ends with a recursive call.
I guess my cut would only appear when there is no more tail-recursive call in the clause.
Do I understand this correctly?
How is a quote character represented inside the quoted string? Probably by doubling up the quotes, which this DCG does not cater for.
Perhaps skip_whites is best put in the “parent” DCG, for elegance.
re: quoted string
Yes, I realized that, but luckily there was no such usage in the proto files I parsed.
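For reference, a minimal sketch of a quoted_string//1 variant that treats a doubled quote as an escaped quote (quoted_codes//1 is a name of my own; skip_whites is the helper used above):

% Sketch: "" inside the string stands for a single ".
quoted_string(String) -->
    skip_whites, "\"", quoted_codes(Codes), "\"",
    { string_codes(String, Codes) }.

quoted_codes([0'\"|Cs]) --> "\"\"", quoted_codes(Cs).              % doubled quote
quoted_codes([C|Cs])    --> [C], { C =\= 0'\" }, quoted_codes(Cs).
quoted_codes([])        --> [].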
re: skip_whites
If I understand your comment correctly …
I decided to push them down as far as possible, so that they are not (needlessly) repeated and the top-level grammar stays with language syntax elements only; see the sketch below.
I had such constructions earlier on, with skips and tokens, but the result wasn't legible relative to the BNF descriptions.
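To illustrate the idea (a sketch; these token definitions are my own assumptions, following the names used above):

% Each token rule absorbs its own leading whitespace, so the
% higher-level rules stay close to the BNF.
syntax_keyword   --> skip_whites, "syntax".
equal            --> skip_whites, "=".
end_of_statement --> skip_whites, ";".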
I am a bit unsure what the challenges are that you are pointing at with DCGs; so far I have been able to construct a DCG for protobuf, and at least the hundred or so protobuf files I tried parsed without error and with adequate performance.
For now the performance and memory usage are fine; they don't block the processing I have in mind.
If grammar files are thrown at me where the current inference overhead becomes a bottleneck, then I will need to reevaluate the design of the grammar and, if necessary, the use of Prolog for the parsing job, which would be a pity from my personal preference point of view.
Thank you.
I need to review more carefully why I felt that a lookahead for identifying a closing “}” was necessary at the end of the file, i.e. when it was the last symbol in the file.
For now, the following rule helped me complete the parse:
% Peek at the next code without consuming it; 125 is the code for '}' (0'}).
lookahead_closing_brace, [X] --> [X], { X =:= 125 }.
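A quick toplevel check shows that the pushback leaves the brace in the stream (backquoted strings are code lists in SWI-Prolog):

?- phrase(lookahead_closing_brace, `}`, Rest).
Rest = [125].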
I have been loosely following the BNFs of Google's protobuf, but at the same time I modified them as needed, as I found issues in the protobuf files I was parsing.
For example, many files had an extra “;” where Google's BNF spec didn't have one, so I allowed for a looser syntax, since the AST would not be affected and I am not checking for syntax errors.
Indeed, I wasn't precise: it's apparently EBNF, since I am handling repetition and optionals.
The patterns are rather simple; something along these lines:
items --> item, items.
items --> [].
item --> ...
How would you improve on this …
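One option, if collecting the parsed items in a list is acceptable, is sequence//2 from SWI-Prolog's library(dcg/high_order), which captures exactly this recursion (a sketch; item//1 now carries the parsed element as an argument):

:- use_module(library(dcg/high_order)).

% sequence//2 applies item//1 zero or more times, like the
% hand-written pair of items clauses above.
items(Items) --> sequence(item, Items).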