Can I use read_line_to_string/2 on a Mac?

Based on this list I ran a few test cases:

  • A newline (line feed) character (‘\n’),
  • A carriage-return character followed immediately by a newline character (“\r\n”),
  • A standalone carriage-return character (‘\r’),

A.2. Line terminators
https://www.mpi.nl/corpus/html/elan/apas02.html

I guess the first two queries are ok, but the last query somehow goes wrong:

/* SWI-Prolog 9.1.21 */
?- open_string('abc\ndef', S), read_line_to_string(S, T).
S = <stream>(0000027f1cb3da90),
T = "abc".

?- open_string('abc\r\ndef', S), read_line_to_string(S, T).
S = <stream>(0000027f1cb3dba0),
T = "abc".

?- open_string('abc\rdef', S), read_line_to_string(S, T).
S = <stream>(0000027f1cb3d430),
T = "abc\rdef".

Some way to improve that?

You could maybe leave out “on a Mac”? The behavior you show is most probably the same across platforms. A cursory search tells me that \r was used as the newline on “Classic Mac OS”, which shipped until 2001.

And could you use read_string/5 to give it your own “end of line”?

?- open_string('abc\rdef', S), read_string(S, "\r", "", _, T).
S = <stream>(0x5baa5dba2600),
T = "abc".

The current definition of read_line_to_string/2 specifically doesn’t use \r as an end of line.

Historical note: I read an interesting explanation on the origins of the use of two end-of-line characters (\r\n) in some OSs (I thought it was Windows though). It has to do with how teletypes worked and how signals were transmitted at the time.

The explanation seemed to make sense, but of course I took it for granted :sweat_smile: don’t know if anyone has ever heard/read something similar

On a teletype, there was a movable printhead (also called a “carriage”) … when the printhead/carriage reached the end of line, the “return” code would send it back to the beginning of the line. The “newline” code would advance the roller to the next line. (So, if you wanted to overtype something, you could either use backspace, or you could use “return” and forward space)

The carriage return took some time, so if you immediately followed it with more text, you’d get “random” overtyping. To handle this, you would send some “null” characters that did nothing (I think that the “null” moved the carriage slightly up and down, as if a space was being typed). “Null” was different from “space” in that it didn’t move the carriage one position to the right.

So, at end of line, you’d send \r to start the carriage back to the beginning of the line, then \n to advance the roller, then some number of \0s.

Here’s a video: https://youtu.be/S81GyMKH7zw?t=240

(There were some additional complexities if you had a human operator rather than punched paper tape, at least on the oldest teletype machines … the operator had to type at a steady speed, which most typists couldn’t do. My father told me that in his office there was one person who had been a teletype operator and when he first sat down at a typewriter, the other typists all got up and gathered around in amazement at the regular sound of his typing)

The old teletypes also explain some other codes such as tab, vertical-tab, form-feed, bell, shift-out, shift-in, etc.

Thanx a lot, Peter

read_line_to_string/2 is implemented as follows:

read_line_to_string(Stream, String) :-
    read_string(Stream, '\n', '\r', Sep, String0),
    ...

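So it uses '\n' in the 2nd argument of read_string/5, which covers this part:

  • A newline (line feed) character (‘\n’),
  • A carriage-return character followed immediately by a newline character (“\r\n”),

Your code covers this part:

  • A standalone carriage-return character (‘\r’),

How would one implement a read_line_to_string/2 predicate that
covers all 3 cases, i.e. 100% of the 3 line separators and not only 66%?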

Yes, I really love open source software, you can link it, you can read it.

Either way, \r has not been an end of line for over 20 years. If you wanted it to be one, you can just add it to the Sep argument?

?- open_string('abc\ndef', S), read_string(S, "\r\n", "\r", _, T).
S = <stream>(0x5baa5de82c00),
T = "abc".

?- open_string('abc\r\ndef', S), read_string(S, "\r\n", "\r", _, T).
S = <stream>(0x5baa5dba2a00),
T = "abc".

?- open_string('abc\rdef', S), read_string(S, "\r\n", "\r", _, T).
S = <stream>(0x5baa5df2b300),
T = "abc".

?? Maybe I am overthinking it

If I define an alternative read_line_to_string2/2 I get:

read_line_to_string2(S, T) :- read_string(S, "\r\n", "\r", _, T).

But then this query gives a wrong B:

?- open_string('abc\r\rdef', S), 
   read_line_to_string2(S, A), 
   read_line_to_string2(S, B).
S = <stream>(000001ba6e80ac80),
A = "abc",
B = "def".

B should be an empty line; the second \r is stripped as padding instead. You can even try it on Windows 10:

?- open('foo.txt',write,S), write(S, 'abc\r\rdef'), close(S).

And when you open the file in Notepad, even Windows 10 detects Macintosh (CR) line endings:

image
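
For reference, a minimal sketch of a line reader that treats \n, \r\n and a lone \r each as one terminator, using a one-character lookahead after \r. The names read_line_universal/2 and line_codes/3 are made up for this example and are not part of SWI-Prolog. It assumes peek_code/2 can look ahead on the stream, so on an interactive terminal it would wait for the next character after a lone \r.

```prolog
% read_line_universal(+Stream, -Line)
% Hypothetical helper, not part of SWI-Prolog.
% Line is a string without its terminator, or end_of_file at end of input.
read_line_universal(Stream, Line) :-
    get_code(Stream, C),
    (   C == -1
    ->  Line = end_of_file
    ;   line_codes(C, Stream, Codes),
        string_codes(Line, Codes)
    ).

% line_codes(+Code, +Stream, -Codes): collect codes up to the terminator.
line_codes(-1, _, []) :- !.             % end of input closes the last line
line_codes(0'\n, _, []) :- !.           % LF
line_codes(0'\r, Stream, []) :- !,      % CR, optionally followed by LF
    (   peek_code(Stream, 0'\n)
    ->  get_code(Stream, _)             % consume the LF of a CR LF pair
    ;   true
    ).
line_codes(C, Stream, [C|Cs]) :-
    get_code(Stream, C1),
    line_codes(C1, Stream, Cs).
```

With this sketch, reading 'abc\r\rdef' line by line gives "abc", then "", then "def".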

I don’t think the OS matters at all, does it? It seems now that you would like to support multiple mutually exclusive end-of-line markers? I don’t see how you can make \r\n be a single end of line but at the same time both \r and \n be individually also end of line.

You’d need some heuristic that decides what kind of EOL it should expect in this file. I am certain this is what your Notepad is doing.

What happens on your Windows 10 if you give it a file like this:

abc\r\ndef\rghi\nx

??

You should probably also try out the different orders in which the three linebreaks appear in your file.

It should give the same as the SWI-Prolog line_count/2 property
during SWI-Prolog parsing. How is line count implemented?

image

Edit 30.01.2023
But currently '\r' is not counted, see file pl-stream.c:

switch(c)
  { case '\n':
      p->lineno++;
      p->linepos = 0;
      s->flags &= ~SIO_NOLINEPOS;
      break;
    case '\r':
      p->linepos = 0;
      s->flags &= ~SIO_NOLINEPOS;
      break;

But when you have flags in the stream it is easy to count
'\r' as well. This counts all 3 cases correctly:

        if (ch === 13 || (ch === 10 && 
                 (stream.flags & MASK_SRC_SKIP) === 0))
            stream.lineno++;
        if (ch === 13) {
            stream.flags |= MASK_SRC_SKIP;
        } else {
            stream.flags &= ~MASK_SRC_SKIP;
        }

Credit goes to the Java class LineNumberReader, which has done
this since release 1.1 via a skipLF boolean stream state.

Admittedly if you have unget() it might get more complicated.
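
The same skip-LF idea, as a sketch in Prolog over a list of character codes rather than a stream. count_terminators/2 is a hypothetical name chosen for this example; its flag argument plays the role of MASK_SRC_SKIP in the snippet above.

```prolog
% count_terminators(+Codes, -N)
% Hypothetical helper: N is the number of line terminators in Codes,
% where \n, \r\n and a lone \r each count once.
count_terminators(Codes, N) :-
    count_terminators(Codes, false, 0, N).

% The second argument is true when the previous code was \r,
% so that a directly following \n is not counted again.
count_terminators([], _, N, N).
count_terminators([C|Cs], SkipLF, N0, N) :-
    (   C =:= 0'\r
    ->  N1 is N0 + 1, Skip = true
    ;   C =:= 0'\n, SkipLF == false
    ->  N1 is N0 + 1, Skip = false
    ;   N1 = N0, Skip = false
    ),
    count_terminators(Cs, Skip, N1, N).
```

For the codes of 'abc\r\ndef\rghi\nx' this counts 3 terminators, and for 'a\r\r\nb' it counts 2.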

But I wonder whether in practice this option is important:

If UNIX_LINES mode is activated, then the only line terminators
recognized are newline characters.

A.2. Line terminators
https://www.mpi.nl/corpus/html/elan/apas02.html

For example Python has quite some options:

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If newline is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If newline has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

class io.TextIOWrapper(buffer , encoding=None, errors=None,
newline=None, line_buffering=False, write_through=False)
https://docs.python.org/3/library/io.html#io.TextIOWrapper

Internally, text in SWI-Prolog is supposed to be Unicode where lines are separated by single \n. Translation on I/O is done by streams using the encoding and newline properties. As is, it supports posix, which does no translation and dos which simply removes \r and emits newlines as \r\n. \r terminated lines as (I think) existed on old MacOS are not supported. That is probably not perfect, but it seems good enough in practice. Notably blindly removing \r is a little dubious, but it “fixes” common cases where some lines use \r\n and others just \n while some have \r\r\n, etc.
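
A small sketch of the dos mode described here, assuming a file opened in text mode and the newline(dos) property of set_stream/2. read_file_dos/2 is a hypothetical helper name. Since dos mode removes \r on input, a file containing the bytes a \r \r \n b should come back as the codes of 'a\nb', independent of the platform default.

```prolog
:- use_module(library(readutil)).

% read_file_dos(+File, -Codes)
% Hypothetical helper: read File as text with the stream's newline
% mode forced to dos, so that \r characters are dropped on input.
read_file_dos(File, Codes) :-
    setup_call_cleanup(
        open(File, read, S),
        (   set_stream(S, newline(dos)),
            read_stream_to_codes(S, Codes)
        ),
        close(S)).
```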

A possibly useless observation: one difference between read_string/5 and most other “read line” offers is that read_string/5 takes a set of characters in the “end of line” argument (SepChars). Others (Python, awk, getline/getdelim in glibc, and so on…) either use a char, or a string. For a “set of characters” as in read_string/5 you need to use a higher level mechanism like a regular expression.

True. If we are talking a single separator though, we can use atomics_to_string/3 or the atom-based atomic_list_concat/3:

?- atomics_to_string(L, "\r\n", "aap\r\nnoot").
L = [aap, noot].

This mode from Python is especially appealing for certain use cases:

newline is None, universal newlines mode is enabled:
Lines in the input can end in '\n', '\r', or '\r\n', and these
are translated into '\n' before being returned to the caller.

Mark Reinhold calls this “compression”, which is possibly
justified by the observation that the two characters of '\r\n'
are compressed into the single character '\n'. The JavaDoc of
the class LineNumberReader says:

Line terminators are compressed into single newline (‘\n’) characters.

Class LineNumberReader
https://docs.oracle.com/javase/8/docs/api/java/io/LineNumberReader.html#read--

The application that uses lines produced by this method can
now assume that lines always end with a single ‘\n’, irrespective
of whether they come from Mac, Unix or Windows. And one
can do, for example, using the old IF-Prolog atom_split/3 predicate:

?- atom_split(Input, '\n', Lines).

Even removal of some “padding” is not needed anymore, since compression
does that already. It also solves some problems when generating HTML.
But I would need some more time to pin down what usually can
go wrong in HTML output.
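
As a sketch of what this could look like in SWI-Prolog, assuming the whole text is already in memory: compress the terminators on a code list, then split on '\n' with split_string/4 in place of atom_split/3. compress_newlines/2 and text_lines/2 are names invented for this example.

```prolog
% compress_newlines(+Codes0, -Codes)
% Hypothetical helper: replace each \r\n pair and each lone \r by \n.
compress_newlines([], []).
compress_newlines([0'\r, 0'\n|T0], [0'\n|T]) :- !,
    compress_newlines(T0, T).
compress_newlines([0'\r|T0], [0'\n|T]) :- !,
    compress_newlines(T0, T).
compress_newlines([C|T0], [C|T]) :-
    compress_newlines(T0, T).

% text_lines(+Text, -Lines)
% Hypothetical helper: Lines is the list of lines of Text as strings,
% whichever of the three terminators Text uses.
text_lines(Text, Lines) :-
    string_codes(Text, Codes0),
    compress_newlines(Codes0, Codes),
    string_codes(Normalized, Codes),
    split_string(Normalized, "\n", "", Lines).
```

For example, text_lines("abc\r\ndef\rghi\nx", Lines) gives Lines = ["abc", "def", "ghi", "x"].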

I don’t find this compression in SWI-Prolog:

/* SWI-Prolog 9.1.21 */
?- open_string('a\r\nb', S), get_code(S,C1), 
   get_code(S,C2), get_code(S,C3).
S = <stream>(000001e4e2901f80),
C1 = 97,
C2 = 13,
C3 = 10.

In Dogelog Player release 1.1.5 I had compression
enabled by default. Currently, for release 1.1.6, I have
disabled it, and I am experimenting with a separate predicate
get_compress/2 besides get_code/2. The predicate
get_compress/2 is meant to support some legacy code that assumes
compression, especially stateful compression based on a stream
state. The new uncompressed behaviour of get_code/2 is
based on the observation that SWI-Prolog can live with it, so
Dogelog Player should also be able to live with it. I am now exploring
the corners that break when using uncompressed get_code/2.

If you have a string, I/O has done its job, i.e., strings are already supposed to be in canonical form. Translation holds for text streams created from files and a few other cases, such as process pipes. In other places (e.g., sockets), you have to set the encoding and newline mode yourself as these are typically dictated by some OS independent standard. I guess you can also set the newline mode for a stream that originates from a string. Probably you cannot change the encoding as this is assumed to be fixed.

And, as is, translation is only enabled by default on Windows.

How is it implemented? It swallows lines! It tells me only one line:

/* SWI-Prolog 9.1.21 */
?- open('foo.txt', write, S), write(S, 'a\r\r\nb'), close(S).
S = <stream>(000001c6a0b3b820).

?- open('foo.txt', read, S), get_code(S,C1), get_code(S,C2), get_code(S,C3).
S = <stream>(000001c6a0b3e4c0),
C1 = 97,
C2 = 10,
C3 = 98.

But when I use Notepad on Windows 10 for the generated file, it shows
me two lines, which is the correct interpretation of '\r\r\n' as one line
separator '\r' followed by one line separator '\r\n':

image

It's a finite state machine with two states,
provided get_code/2 gives the raw file codes:

/* compression algorithm as Prolog */
sys_get_file(S, L) :-
  get_code(S, C),
  sys_get_file(C, L, S).

/* state 1, otherwise */
sys_get_file(-1, [], _) :- !.
sys_get_file(0'\r, [0'\n|L], S) :- !,
   get_code(S, H),
   sys_get_file2(H, L, S).
sys_get_file(C, [C|L], S) :-
   get_code(S, H),
   sys_get_file(H, L, S).

/* state 2, last char was 0'\r we can skip 0'\n */
sys_get_file2(0'\n, L, S) :- !,
   get_code(S, H),
   sys_get_file(H, L, S).
sys_get_file2(C, L, S) :- 
   sys_get_file(C, L, S).

Seems to work:

/* SWI-Prolog 9.1.21 */
?- open_string('a\r\r\nb', S), sys_get_file(S, L), atom_codes(A, L).
S = <stream>(000001a459ed3260),
L = [97, 10, 10, 98],
A = 'a\n\nb'.

I guess it is a matter of taste. \r\r\n is the same as \r\n on the good old teletype :slight_smile: Bill Gates is old enough to know that :slight_smile: As I told you, the dos newline mode simply removes \r.