Actually I was considering something like your suggestion. But then I opted in favour of [Sep] because I did not want to introduce just one smallish rule. In fact, my DCGs will end up with a lot of “business related” rules, so this separator is a very low-level implementation detail.
At the origin of my post was the frustration that my DCG was working differently when using phrase vs. phrase_from_file.
But this conversation led me to a better understanding of this matter, and I acknowledge that I should go with the codes rather than the chars. I will adapt my DCGs accordingly.
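For what it is worth, here is a minimal sketch of what "going with the codes" could look like. The names `sep//0`, `fields//1` and `field//1` are made up for illustration and are not from the original code. Backquoted text is read as a list of character codes in SWI-Prolog 7, so the grammar does not depend on the double_quotes flag:

```prolog
:- use_module(library(dcg/basics)).

% Hypothetical separator rule, written as a code list (backquotes).
sep --> `,`.

% A comma-separated line of fields, each returned as an atom.
fields([F|Fs]) -->
    field(F),
    (   sep
    ->  fields(Fs)
    ;   { Fs = [] }
    ).

% A field is everything up to the next comma or newline.
field(F) -->
    string_without(`,\n`, Cs),
    { atom_codes(F, Cs) }.
```

Because the terminals are codes, the same grammar should behave the same under phrase/2 on a code list and under phrase_from_file/2, e.g. `?- phrase(fields(Fs), `a,b,c`).`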
Thank you so much for your insights.
By the way, I looked up the code of phrase_from_file/3 in module pure_input.
As I understand the code, the actual reading of the file is done in a clause attr_unify_hook_ndebug/2 attached as a handler to an attributed var.
The relevant code there is
Besides read_pending_codes/3, which is used in this clause, SWI-Prolog also has read_pending_chars/3.
So changing the behaviour of phrase_from_file, should one consider to do so, would be to pass an additional option from phrase_from_file/3 down the line to the attribute hook.
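Short of changing the library, one can get a chars-based variant without touching the attribute hook, at the price of losing laziness. The sketch below is only that: `phrase_from_file_chars/2` is a hypothetical helper name, not something library(pure_input) provides, and it reads the whole file eagerly:

```prolog
:- use_module(library(readutil)).

:- meta_predicate phrase_from_file_chars(//, +).

% Hypothetical helper (NOT part of library(pure_input)):
% read the whole file, convert it to a list of chars, then parse.
phrase_from_file_chars(Grammar, File) :-
    read_file_to_string(File, String, []),
    string_chars(String, Chars),
    phrase(Grammar, Chars).
```

This keeps a chars-based grammar usable on files, but it gives up both the lazy reading and the position tracking that phrase_from_file/2 offers.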
I studied your code, and this has been of great inspiration to me.
But may I ask: why are you using the combination “read_stream_to_codes(…), phrase(…)” rather than phrase_from_file?
However in looking at the JSON spec, it seems that the CR is not required so I will have to see if my parser is following the correct spec.
But you are correct that in most places I would simply use phrase_from_file/2,3 and you could use it in your code.
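To make the two options concrete, here is a small sketch of both styles side by side. The wrapper names `parse_eager/2` and `parse_lazy/2` are invented for this post; the library predicates they call are the standard ones:

```prolog
:- use_module(library(readutil)).
:- use_module(library(pure_input)).
:- use_module(library(dcg/basics)).

:- meta_predicate parse_eager(//, +), parse_lazy(//, +).

% Eager: read the complete file into a code list, then parse it.
parse_eager(Grammar, File) :-
    read_file_to_codes(File, Codes, []),
    phrase(Grammar, Codes).

% Lazy: let library(pure_input) feed the grammar block by block.
parse_lazy(Grammar, File) :-
    phrase_from_file(Grammar, File).
```

For a codes-based grammar such as `integer(X)` from library(dcg/basics), both calls should give the same answer; the difference is in memory use and in what happens when the grammar inspects the list with anything other than unification.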
In an earlier version of this post I noted optimizations as the reason for not using phrase_from_file/2,3, but I have since rewritten that part. I am leaving this part in because it is still of use with regard to DCGs in general.
If you read my example JSON parser (post) and notes you will see that my example code has some optimizations that I should not have done during development per my rules of development.
Just curious, for what kind of JSON parsing did you need this code? How is the JSON support that comes with SWI-Prolog not fitting your use case? Is it something about how the JSON maps to Prolog terms?
When I wrote the parser it was to better understand the intricacies of what is valid JSON for use with Cytoscape.js (ref), but then I found JSONLint (ref).
Now that I know the JSON needed for Cytoscape.js, the JSON should pass not just a simple JSONLint check (syntax), but also a more complex set of rules (semantics).
While JSON provides a syntactic framework for data interchange, unambiguous data interchange also requires agreement between producer and consumer on the semantics of a specific use of the JSON syntax. (Spec)
Specifically for Cytoscape.js this implies that there should be a JSON recognizer (semantics) for the elements, style and layout. (ref) But the JSON recognizer will not recognize the syntax of JSON, but the semantics of each JSON file.
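As a rough illustration of what such a semantic check might look like once the syntax is already handled by the JSON library, here is a sketch. The rule it enforces (every element is an object with a `data` object that has a string `id`) is a simplified assumption for this example, not the full Cytoscape.js specification, and the predicate names are hypothetical:

```prolog
:- use_module(library(http/json)).

% Assumed, simplified rule: an element is a dict with a `data`
% dict that carries a string `id`.
valid_element(E) :-
    is_dict(E),
    get_dict(data, E, Data),
    is_dict(Data),
    get_dict(id, Data, Id),
    string(Id).

% The elements section is a list in which every member is valid.
valid_elements(Es) :-
    is_list(Es),
    forall(member(E, Es), valid_element(E)).
```

Usage would be along the lines of `atom_json_dict(Text, Es, []), valid_elements(Es)`: the library decides whether the text is JSON at all, and the extra predicates decide whether it is the JSON this application expects.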
SWI-Prolog can check the syntax of JSON, but AFAIK there is no predicate to check the semantics. Such a check would probably need something like a DTD. I did find JSON Schema but have not looked at it beyond finding it. I would not be surprised if Jan W. tells me how it is done, but it was not apparent to me when I searched. While I could ask, I find that taking the time to search and to understand the predicates I find makes it much easier to understand the responses from others when I do ask. For example, before asking “Is there a way to go from HTML to the Prolog representation of the HTML?” I was researching quasiquotations (ref); I just did not see the connection.
No, as noted it is about the difference between syntax versus semantics and the need for something like a DTD for the semantics.
For something similar see the ini files used with ODBC connections. ini file type specifies the syntax but not the semantics. (ref)
We observe that in many programs, most strings are only handled as a single unit during their lifetime. Examining real code tells us that double quoted strings typically appear in one of the following roles:
A DCG literal
Although a list of codes is the correct representation for handling in DCGs, the DCG translator can recognise the literal and convert it to the proper representation. Such code need not be modified.
is not valid syntax for read/1. It is just used by library(portray_text) to indicate that a list of code points ends in the variable Tail. We could of course add it to the syntax, though using a backslash for disambiguation.
`hello world\|Tail`
I think this makes sense but I think it is not a big enough improvement to justify the incompatibility and difficulties handling this correctly in (IDE) tools. We have partial evaluation of phrase("hello world", List, Tail) as an acceptable and portable alternative.
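A small sketch of that alternative, for readers who have not seen it: the wrapper name `hello_prefix/2` is invented here, and the example assumes the default SWI-Prolog 7 setting in which a double-quoted literal given to phrase/3 is matched as a sequence of character codes:

```prolog
% List starts with the codes of "hello world"; Tail is what follows.
hello_prefix(List, Tail) :-
    phrase("hello world", List, Tail).
```

Called with an unbound List this builds a partial list of codes ending in Tail, and called with a bound List it checks the prefix and returns the remainder.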
phrase_from_file loads lazily. It has an attributed variable; when you try to unify it, it fires off a goal that loads more text. This kinda-sorta works.
@jan - would it make sense to have a non-lazy phrase_from_file that supported the current line/char reporting? I am usually using phrase_from_file not because it’s lazy but because it supports line tracking, which is pretty much always needed - you will eventually have to deal with invalid files, and ‘not valid’ is rarely the right error message.
I did some more research on that topic and found an excellent tutorial on DCGs from Markus Triska.
In the section about reading from files, he more or less recommends setting double_quotes to chars, which will not work with phrase_from_file in SWI-Prolog, as we learned in this thread.
In the same section, he references a “pure io” library from Ulrich Neumerkel (http://www.complang.tuwien.ac.at/ulrich/Prolog-inedit/sicstus/pio.pl).
Interestingly Neumerkel’s library “auto-adjusts” to codes vs chars:
Annie, what’s wrong with laziness? Actually I consider this a quality.
Shouldn’t we all be lazy, and aren’t we using Prolog because we want to be lazy?
I was about to bring this up; I thought it isn’t relevant, but apparently it is.
There was some ideological struggle about this at some point of time. I definitely did not understand what it is about, exactly, but the take-home message for me was that it was indeed about ideology and not technology only.
Up until about three years ago (when I stopped wasting time on Stackoverflow) there was also a clique of high-rep prolific answerers on the [prolog] tag; they would always recommend to use chars and set certain global flags and so on. Those would fix some “defects” and break other things at the same time, thus creating some confusion that still persists.
Just to give you some context.
PS: this all was somehow entangled with discussions about purity and ISO standardization and compliance. It was strange.
Well, the only thing that works is unifying the lazy list with either [_|_] or []. This is what DCGs do as long as you do not start dirty hacking, so there is not much of a problem. Things go wrong if you write e.g.
at_eof(End,End) :- End == [].
or the application expects the 3rd argument of phrase/3 to be a list and calls e.g. length/2 on it.
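To spell out the difference, here is a sketch of the fragile test next to a unification-based one. The names `at_eof_fragile//0` and `at_eof//0` are chosen for this example; the robust version has the same shape as eos//0 in library(dcg/basics):

```prolog
% Fragile on a lazy list: ==/2 only compares, so when End is still
% the attributed variable no reading is triggered and the test fails
% even if the file is exhausted.
at_eof_fragile(End, End) :-
    End == [].

% Robust: unifying with [] fires the attribute hook, which can then
% decide whether the input has really ended.
at_eof([], []).
```

On an ordinary, fully materialised list both versions behave the same, which is why the problem tends to show up only after switching to phrase_from_file.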
Not that much. The way it works is a little tricky. When asked for the position it scans the list forward to the attributed variable. There it finds the position info for the start of the current block that enables it to compute the position for the current point in the list. This works fairly well as the amount to scan is just one block. Doing this without the lazy stuff simply means do the position calculation from the beginning of the list. If the list is long that gets pretty expensive …
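As an illustration of using that position information for error messages, here is a sketch built on syntax_error//1 from library(pure_input), which throws a syntax error annotated with the current location in the lazy list. The grammar `digits_only//0` and the error term `expected_digit` are invented for this example:

```prolog
:- use_module(library(pure_input)).
:- use_module(library(dcg/basics)).

% Accept a file that consists of digits only; anything else raises
% a syntax error that carries the position in the file.
digits_only --> digit(_), !, digits_only.
digits_only --> eos, !.
digits_only --> syntax_error(expected_digit).
```

Run as `phrase_from_file(digits_only, File)`, a bad file should give an error term of the form `error(syntax_error(expected_digit), Location)` rather than a plain failure. Note that this relies on the input being a lazy list, so it is not meant for use with plain phrase/2.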
That makes some sense, but unlike SICStus, most of SWI-Prolog syntax changes are module local. This holds for operators, but also the double_quotes flag. So, this can work but handling the module context correctly gets pretty hairy if you want to put DCGs in modules. The module-local syntax for double_quotes is intended to allow for using (typically ugly) code that relies all over the place on “” to be a list of character codes (chars).
One of the nasty issues with codes vs. chars is that your entire application has to agree on the representation if text is passed as lists between components. As most existing components are written with codes in mind, I think it is best to stick with codes and improve the environment to make working with them as pleasant as possible. With the development of SWI-7 I considered adding code as a primary type. That would have solved this issue, but it became too complicated and expensive and I dropped the idea.