In the section Syntax of the manual, in 2.15.1.3 Character Escape Syntax in particular:
I don’t understand how the \<NEWLINE> escape syntax works, or what exactly is meant by “unescaped newlines (which cannot appear in ISO mode)”, since I cannot figure out how the <NEWLINE> gets recognised.
Here are just two hypotheses on how the <NEWLINE> gets recognised:
Lexing is sensitive to the current operating system (maybe also overridable by some runtime flag?), such that, e.g. in Windows a <NEWLINE> is literally the sequence “\r\n”, while “\r” and “\n” alone are not recognised as a <NEWLINE>; while in Unix only “\n” is a <NEWLINE>; etc.
Any sequence “\r\n” gets consumed eagerly, then also any remaining “\r” and “\n” alone, and they get all recognised as a <NEWLINE>, independently of the OS.
The first solution seems unreasonable, especially considering that a file might have line endings that have nothing to do with the system on which the lexing is happening: unless there is an implicit expectation that line endings in a file get normalized before lexing (somehow/to some standard).
The second solution seems more reasonable and workable, and, especially for the escape sequence \<NEWLINE>, it does not conflict with strict ISO mode as for “unescaped newlines” since it fully consumes any pair “\r\n” as a single <NEWLINE>. Sort of “line ending insensitive”.
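To make the second hypothesis concrete, here is a minimal sketch in Python (chosen only for illustration; the function name `newline_length` is made up and this is not taken from any Prolog system's source). It tries the pair “\r\n” first, so that the pair counts as one <NEWLINE>, and only then a lone “\r” or “\n”:

```python
def newline_length(text: str, pos: int) -> int:
    # Hypothetical helper: length of the newline sequence starting at
    # pos, or 0 if there is none. The pair CR LF is tried first, so it
    # is consumed eagerly as a single <NEWLINE>.
    if text.startswith("\r\n", pos):
        return 2
    if text.startswith(("\r", "\n"), pos):
        return 1
    return 0


# Three different line endings, each recognised as one newline.
print(newline_length("a\r\nb", 1))
print(newline_length("a\rb", 1))
print(newline_length("a\nb", 1))
```

Under this scheme the result does not depend on the OS, only on the characters actually present in the input.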
Anyway, can anyone tell how that actually works? (Otherwise maybe point me to the relevant source file?)
which says it is “fully compatible with SWI-Prolog version 5”.
And, in 6 Character classification, we find:
char_table('\n', eol,'\n'). % new line mark
i.e. where <NEWLINE> is just and only ‘\n’.
But that just doesn’t seem right, unless inputs get pre-processed in order to normalise any newlines (any (\r\n|\r|\n)?) to just ‘\n’. Indeed, on the SWI side, I have found 12.9.6 Foreign stream line endings, which is a hint to that pre-processing.
(Maybe, you can even say which file is that? Though, I don’t look into the code if I know where to look into the docs.)
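For comparison, the pre-processing route could look roughly like the following. This is a sketch in Python of the general idea only, assuming the whole input is available as a string; `normalize_newlines` is a hypothetical name and this is not how SWI-Prolog or Tau Prolog actually implement it:

```python
import re

# "\r\n" is listed first so the pair is replaced as a unit, not as two
# separate newlines.
ANY_NEWLINE = re.compile(r"\r\n|\r|\n")


def normalize_newlines(source: str) -> str:
    # Hypothetical pre-processing pass: rewrite every line ending to a
    # single "\n" before the lexer ever sees the text.
    return ANY_NEWLINE.sub("\n", source)


print(repr(normalize_newlines("a.\r\nb.\rc.\n")))
```

The cost of this approach is that character offsets and the original line endings are no longer those of the file on disk.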
But that’s about streams, while I am still working at the lexer and having to decide what to do there: and the idea of fixing the lexer to only recognise \n as a newline (which is the SWI/Tau/ISO way, IIUC), and delegating normalization to a pre-processing of the input stream, does not convince me. It’s the easiest to implement (still talking about the lexer), but the lexer is already full of options, so I’d rather prefer a more flexible/controllable approach (see below).
BTW, to compare notes, here are the only three places where I am finding the need to make that decision:
(unescaped) newlines in quoted text;
newlines after a backslash for line-break escapes in quoted text (the \<NEWLINE> escape sequence I was talking about in the OP);
newlines closing an end-of-line comment.
I was indeed thinking of these options: either fixed to one of /\r\n/, /\r/, or /\n/, or any of those, i.e. /\r\n|\r|\n/. But whether translated or untranslated I am still not sure, as I haven’t yet got to the printing/listing/debugging side: is there any value in keeping the original source code around past the compilation phase?
P.S. I have meanwhile decided for a simple boolean flag: either just \n, as in ISO and apparently most Prolog systems, or universal, i.e. /\r\n|\r|\n/. And all untranslated as far as lexing (i.e. the source text of the tokens) goes.
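As a rough illustration of that flag, here is a sketch in Python for the end-of-line comment case. The names `scan_line_comment` and `universal` are made up for this example, and the returned newline text is left untranslated, i.e. exactly as found in the input:

```python
import re

NEWLINE_ONLY_LF = re.compile(r"\n")
NEWLINE_UNIVERSAL = re.compile(r"\r\n|\r|\n")


def scan_line_comment(text, pos, universal=False):
    # Hypothetical scanner: pos points at the '%' opening the comment.
    # Returns (comment_text, newline_text, next_pos), with newline_text
    # being the original characters, not a normalised "\n".
    newline = NEWLINE_UNIVERSAL if universal else NEWLINE_ONLY_LF
    m = newline.search(text, pos)
    if m is None:
        # The comment runs to the end of the input.
        return text[pos:], "", len(text)
    return text[pos:m.start()], m.group(), m.end()


src = "% hi\r\nfoo."
print(scan_line_comment(src, 0, universal=True))
print(scan_line_comment(src, 0, universal=False))
```

With the flag off, the “\r” of a “\r\n” ending stays inside the comment text, and a file using “\r” alone would make the comment swallow the rest of the input.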
You are misquoting: that was offered by @Boris upthread.
And again, my question is about the lexer, not the streams: you snip too much, indeed you have also dropped all the reasons that may go with that choice.
In most places it boils down to just that, but in some places, e.g. with the \<NEWLINE> escape, we must consume the whole newline sequence, or tokenization is broken… But I am still unsure about the details, as I am working on it (the lexer) as we speak.
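A small sketch in Python of what “consume the whole newline sequence” means for the \<NEWLINE> escape. The helper `skip_line_continuation` and its `universal` parameter are hypothetical, not taken from any existing lexer:

```python
def skip_line_continuation(text, pos, universal=True):
    # Hypothetical helper: pos points at a backslash inside quoted text.
    # Returns the position just after a backslash-newline escape, or
    # None if the backslash does not start such an escape.
    if not text.startswith("\\", pos):
        return None
    after = pos + 1
    if universal and text.startswith("\r\n", after):
        return after + 2  # the CR LF pair is consumed as a whole
    if universal and text.startswith("\r", after):
        return after + 1
    if text.startswith("\n", after):
        return after + 1
    return None


quoted = "ab\\\r\ncd"  # a, b, backslash, CR, LF, c, d
print(skip_line_continuation(quoted, 2, universal=True))
print(skip_line_continuation(quoted, 2, universal=False))
```

If only the “\r” were consumed after the backslash, the remaining “\n” would be seen as an unescaped newline inside the quoted text, which is exactly the breakage described above.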
Anyway, the issue of which chars and char sequences to consume is preliminary to, and independent of, any char or char sequence classification: in fact, tokens are “untranslated” and, in my implementation, all “interpretation” is also left to the parsing phase, e.g. my “numbers” at the lexing stage are rather “numerals”.
SWI-Prolog did not run on MacOS before MacOS moved to a BSD core :). Such files are now pretty much non-existent. If you have one, recode it first using external tools. SWI-Prolog will compile it fine, but report all issues as appearing in line 1.
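The “line 1” effect follows from any line counter that only looks at “\n”. A tiny sketch in Python of that general mechanism (the function `line_of` is hypothetical and is not SWI-Prolog's actual code):

```python
def line_of(text, offset):
    # Hypothetical line counter: 1-based line number of offset,
    # counting only "\n" characters as line breaks.
    return text.count("\n", 0, offset) + 1


cr_only = "a.\rb.\rc."  # old-style CR line endings
lf_only = "a.\nb.\nc."
print(line_of(cr_only, 6))
print(line_of(lf_only, 6))
```

For the CR-only text every offset maps to line 1, since no “\n” is ever found.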
Well if the stream doesn’t deliver CR you cannot do much in the lexer.
Then fix the streams, still not the lexer! Distinguishing distinct things helps, not the other way round.
Let me close this one explicitly: newlines in ISO lexing are implementation dependent, while SWI only handles \n (which is a reasonable choice, though supporting “universal” newlines is not much harder).
Which I’d still call messing up on purpose: I say ISO, and you repeat ISO but rather quote a SWI-Prolog session… [Meanwhile edited.]
That said, I don’t particularly care about the phrasing, indeed mis-phrasing rather belongs to the bugs: the newline is a sequence of chars, not just a single character.