How does Prolog lexing of the `<NEWLINE>` work?

In the section Syntax of the manual, in 2.15.1.3 Character Escape Syntax in particular:

I do not understand how the \<NEWLINE> escape syntax works, nor what exactly is meant by “unescaped newlines (which cannot appear in ISO mode)”, since I cannot figure out how exactly the <NEWLINE> gets recognised.

Here are just two hypotheses on how the <NEWLINE> gets recognised:

  1. Lexing is sensitive to the current operating system (maybe also overridable by some runtime flag?), such that, e.g. in Windows a <NEWLINE> is literally the sequence “\r\n”, while “\r” and “\n” alone are not recognised as a <NEWLINE>; while in Unix only “\n” is a <NEWLINE>; etc.

  2. Any sequence “\r\n” gets consumed eagerly, then also any remaining “\r” and “\n” alone, and they get all recognised as a <NEWLINE>, independently of the OS.

The first solution seems unreasonable, especially considering that a file might have line endings that have nothing to do with the system on which the lexing is happening: unless there is an implicit expectation that line endings in a file get normalized before lexing (somehow/to some standard).

The second solution seems more reasonable and workable, and, especially for the escape sequence \<NEWLINE>, it does not conflict with strict ISO mode as for “unescaped newlines” since it fully consumes any pair “\r\n” as a single <NEWLINE>. Sort of “line ending insensitive”.
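To make the second hypothesis concrete, here is a minimal Python sketch of "line ending insensitive" recognition. It only illustrates the idea and is not what SWI-Prolog or any other system is known to do; the names `UNIVERSAL_NEWLINE`, `newline_length_at` and `count_newlines` are made up for this example.

```python
import re

# Alternation order matters: "\r\n" is tried first, so a CR LF pair is
# consumed as ONE newline; a lone "\r" or "\n" is matched only after that.
UNIVERSAL_NEWLINE = re.compile(r"\r\n|\r|\n")


def newline_length_at(text: str, pos: int) -> int:
    """Return the length of the newline sequence starting at pos, or 0 if none."""
    m = UNIVERSAL_NEWLINE.match(text, pos)
    return m.end() - m.start() if m else 0


def count_newlines(text: str) -> int:
    """Count newline sequences, treating "\r\n" as a single newline."""
    return sum(1 for _ in UNIVERSAL_NEWLINE.finditer(text))
```

With this rule a file with mixed line endings gets the same line count on any operating system, which is the property the first hypothesis lacks.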

Anyway, can anyone tell how that actually works? (Otherwise maybe point me to the relevant source file?)

Thanks in advance.

I have tried looking into SWI’s code but I cannot find it.

But meanwhile I have found this one:

M.A. Covington,
ET: an Efficient Tokenizer in ISO Prolog (2003)
https://ai1.ai.uga.edu/mc/ET/et.pdf

which says it is “fully compatible with SWI-Prolog version 5”.

And, in 6 Character classification, we find:

char_table('\n', eol,'\n'). % new line mark

i.e. where <NEWLINE> is just and only ‘\n’.

But that just doesn’t seem right, unless input gets pre-processed in order to normalise any newlines (any (\r\n|\r|\n)?) to just ‘\n’. Indeed, on the SWI side, I have found 12.9.6 Foreign stream line endings, which hints at that pre-processing.
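For illustration, the kind of pre-processing I have in mind would look roughly like the Python sketch below. This is an assumption about the general technique, not a description of SWI-Prolog's stream code; `normalize_newlines` is a hypothetical helper.

```python
import re


def normalize_newlines(text: str) -> str:
    """Rewrite every "\r\n" pair and every lone "\r" as a single "\n"."""
    # "\r\n" is listed first so a CR LF pair becomes one "\n", not two.
    return re.sub(r"\r\n|\r", "\n", text)
```

After such a pass, a tokenizer that classifies only ‘\n’ as end of line, as in the `char_table` fact above, would see every line ending.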

Check out this post and the whole thread maybe? I found it by searching for “newline” in this forum. There are some hints through the thread about the location of the code. Maybe somewhere here?

That’s it, thank you! – Sorry, I had searched but I have managed to miss it.

It’s across several files: never mind, that thread is enough.

(Maybe you can even say which file that is? Though I don’t look into the code if I know where to look in the docs.)

But that’s about streams, while I am still working at the lexer and having to decide what to do there: and the idea of fixing the lexer to only recognise \n as a newline (which is the SWI/Tau/ISO way, IIUC), and delegating normalization to a pre-processing of the input stream, does not convince me. It’s the easiest to implement (still talking about the lexer), but the lexer is already full of options, so I’d rather prefer a more flexible/controllable approach (see below).

BTW, to compare notes, here are the only three places where I am finding the need to make that decision:

  1. (unescaped) newlines in quoted text;
  2. newlines after a backslash for line-break escapes in quoted text (the \<NEWLINE> escape sequence I was talking about in the OP);
  3. newlines closing an end-of-line comment.

I was indeed thinking of these options: either fixed to one of /\r\n/, /\r/, or /\n/, or any of those, i.e. /\r\n|\r|\n/. But whether translated or untranslated I am still not sure, as I haven’t yet got to the printing/listing/debugging side: is there any value in keeping the original source code around past the compilation phase?

P.S. I have meanwhile decided for a simple boolean flag: either just \n, as in ISO and apparently most Prolog systems, or universal, i.e. /\r\n|\r|\n/. And all untranslated as far as lexing goes, i.e. as far as the source text of tokens is concerned.
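As a rough sketch of that flag, here it is applied to case 3 above (a newline closing an end-of-line comment), written in Python for illustration. The function name, the `universal` parameter, and the choice to include the closing newline in the token text are all assumptions of this example.

```python
import re
from typing import Tuple

NEWLINE_LF_ONLY = re.compile(r"\n")
NEWLINE_UNIVERSAL = re.compile(r"\r\n|\r|\n")


def scan_line_comment(text: str, pos: int, universal: bool) -> Tuple[str, int]:
    """Scan a '%' comment starting at pos; return (source_text, next_pos).

    The returned source text is untranslated: it is the exact slice of the
    input, including the closing newline sequence if there is one.
    """
    newline = NEWLINE_UNIVERSAL if universal else NEWLINE_LF_ONLY
    m = newline.search(text, pos)
    end = m.end() if m else len(text)  # a comment may also end at end of input
    return text[pos:end], end
```

With the flag off, a lone "\r" is just another character inside the comment, so a file using only "\r" line endings would lex as one long comment up to the next "\n" or the end of input.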

You are misquoting: that was offered by @Boris upthread.

And again, my question is about the lexer, not the streams: you snip too much, indeed you have also dropped all the reasons that may go with that choice.

In most places it boils down to just that, but in some places, e.g. with the \<NEWLINE> escape, we must consume the whole newline sequence, or tokenization is broken… But I am still unsure about the details, as I am working on it (the lexer) as we speak.
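To show what "consume the whole newline sequence" means for the \<NEWLINE> escape, here is a small Python sketch under the universal-newline assumption. `line_continuation_end` is a hypothetical helper that only reports where the escape ends and translates nothing.

```python
import re
from typing import Optional

NEWLINE_UNIVERSAL = re.compile(r"\r\n|\r|\n")


def line_continuation_end(text: str, pos: int) -> Optional[int]:
    """If a backslash-newline escape starts at pos, return the index just past it.

    Returns None when text[pos] is not a backslash or is not followed by a
    newline sequence (i.e. it is some other escape or an ordinary character).
    """
    if text.startswith("\\", pos):
        m = NEWLINE_UNIVERSAL.match(text, pos + 1)
        if m:
            return m.end()
    return None
```

If the lexer consumed only the "\r" of a "\r\n" pair here, the remaining "\n" would be left inside the quoted text as an unescaped newline, which is exactly what strict ISO mode disallows.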

Anyway, the issue of which chars and char sequences to consume is preliminary to, and independent of, any char or char sequence classification: in fact, lexing is “untranslated” and, in my implementation, it also leaves all “interpretation” to the parsing phase, e.g. my “numbers” at the lexing stage are rather “numerals”.

SWI-Prolog did not run on MacOS before MacOS moved to a BSD core :). Such files are now pretty much non-existent. If you have one, recode it first using external tools. SWI-Prolog will compile it fine, but report all issues as appearing in line 1.

Well if the stream doesn’t deliver CR you cannot do much in the lexer.

Then fix the streams, still not the lexer! Distinguishing distinct things helps, not the other way round.

Let me close this one explicitly: newlines in ISO lexing are implementation dependent, while SWI only handles \n (which is a reasonable choice, though supporting “universal” newlines is not much harder).

Thanks for the help.

That is off-topic, especially inappropriate coming from you who cry off-topic when somebody even dares comment on your “discourses”.

Please, do open a separate thread, about streams and bugs with streams.

And I have also marked your last post as abusive: your continued disturbance here, whichever the reasons, is patently on purpose.

[But meanwhile you keep writing, of course: so now let me read what at least looks like a genuine reply to the point…]

A single character sequence, not character.

Anyway, thanks: that doesn’t say what SWI does, but at least now we know about ISO.

Which I’d still call messing up on purpose: I say ISO, and you repeat ISO but rather quote a SWI-Prolog session… [Meanwhile edited.]

That said, I don’t particularly care about the phrasing, indeed mis-phrasing rather belongs to the bugs: the newline is a sequence of chars, not just a single character.

Mis-phrasing, in ISO as anywhere else, rather belongs to the bugs.

Mis-phrasing or plain bugs, of course: I don’t remember the details, but the ISO standard grammar is indeed not devoid of problems…

Stop injecting noise. If the SWI lexer does not just handle \n as I have surmised, please (anybody) say so, otherwise this topic is solved.

EOD.

The minefield up-thread is all and only your doing.

I hope the moderators do something about you.