Library(strings) - dedent_string, indent_string, split_lines

In Quasi-quotations, again, @Jan announced a new library(strings), defining dedent_string/3, interpolate_string/4, string/4. I decided to port the Python test cases (because there are a lot of corner cases) and in the process discovered some subtle issues.

(Prologue: I’m proposing we follow Python’s semantics for some things because there’s a lot of practical experience there, and there’s a lot of discussion over any new feature.)

Let’s start with splitlines, which isn’t yet in the library, but probably should be (and should be spelled split_lines). The related Python function is str.splitlines([keepends]), which allows a much larger list of line separators than just "\n". In addition, with the keepends argument, the caller can control whether or not to keep the end-of-line characters.

The Python splitlines method returns an empty list for the empty string, and a terminal line break does not result in an extra line:

>>> "".splitlines()
[]
>>> "\n".splitlines()
['']
>>> "One line\n".splitlines()
['One line']

On the other hand, split("\n") – which is essentially the same as SWI-Prolog’s split_string(Str, "\n", "", Lines) – gives:

>>> "".split("\n")
['']
>>> "\n".split("\n")
['', '']
>>> "Two lines\n".split("\n")
['Two lines', '']

Because dedent_string and indent_string split the string into lines, it’s important that we decide how to do this splitting. I propose that we add split_lines/2 and split_lines_with_ends/2 predicates and use those in dedent_string and indent_string.

There is an additional item with dedent/indent: how to handle empty lines (that is, lines with only whitespace). I propose following the Python semantics: ignore empty lines when determining the dedent, and do not indent them (the latter can be changed by specifying a predicate that controls which lines are indented). For dedent, my experience is that this works nicely with multi-line strings in source code.
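
To make the Python behaviour concrete, here is a small sketch using the standard textwrap module. The sample text is made up for illustration; it only shows the Python semantics being proposed, not what library(strings) currently does.

```python
import textwrap

# Sample text (made up): lines 2 and 4 are empty or whitespace-only.
text = "    first\n\n      second\n   \n    third\n"

# dedent: whitespace-only lines do not take part in computing the common
# margin, and they are normalised to a bare newline in the result.
dedented = textwrap.dedent(text)

# indent: by default, whitespace-only lines are left unindented ...
indented_default = textwrap.indent("a\n\nb\n", "> ")

# ... unless a predicate says otherwise.
indented_all = textwrap.indent("a\n\nb\n", "> ", predicate=lambda line: True)

print(repr(dedented))
print(repr(indented_default))
print(repr(indented_all))
```

The predicate receives each line including its line ending, so it can also be used to select lines by content.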

Yes, you can use atomic_list_concat/3, but it doesn’t behave “nicely” for things like a trailing newline (and it returns atoms instead of strings):

?- atomic_list_concat(X, "\n", "a\nb\n").
X = [a, b, ''].
?- atomic_list_concat(X, "\n", "a\nb").
X = [a, b].

Internally, lines are always separated by a single "\n". Adding and removing "\r" is done by IO.

Yes. This considers a string as a list of records where each record is terminated by something (a newline in this case). You get that by splitting and then removing the last element, which for lines is probably best done only when that element is empty. The join is also different: it has to join like atomics_to_string/3 while also adding a final terminator.
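
A minimal sketch of that "terminated records" view, written in Python for comparison with the splitlines examples above. The names split_lines and join_lines are hypothetical and only illustrate the idea; this is not the proposed Prolog implementation.

```python
def split_lines(text):
    """Split text into records terminated by "\\n" (hypothetical helper)."""
    parts = text.split("\n")
    # A trailing "\n" terminates the last record, so split() leaves one
    # empty string at the end; drop it only when it is empty.
    if parts[-1] == "":
        parts.pop()
    return parts


def join_lines(lines):
    """Inverse direction: every record gets a terminator, including the last."""
    return "".join(line + "\n" for line in lines)


print(split_lines(""))             # no records at all
print(split_lines("One line\n"))   # one record, no spurious empty one
print(join_lines(["a", "b"]))      # ends with a final newline
```

With these definitions, splitting and re-joining is the identity for any text that ends in a newline, which is what makes a two-way relation plausible. Text without a final newline splits the same way as its terminated form, so that direction is lossy.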

If we add this to Prolog, I guess it should be as string_lines/2, working both ways. It is a nice logical relation, so there is no reason to add verbs to the name. It should probably be added as a built-in: with the current split/join primitives you end up copying the list in one case and copying the string in the other. A small abstraction at the C level can probably realise these primitives quite easily at very small overhead.

It might make sense to generalize the record/line separator to be a string.

As the ends in this situation are fixed, I do not see much reason for having a version that preserves them. I don’t think I ever needed that, and in the rare situation where you do need it, it is a simple one-liner.

Note that SWI-Prolog’s text model is fundamentally different from Python as far as I understand that (please correct if I’m wrong). Python strings seem to be primarily byte sequences that may have some encoding, line endings, etc. SWI-Prolog strings (and atoms) are always sequences of Unicode code points using "\n" as line break. I/O deals with encodings, line breaks, etc. as the file system, network protocol, etc. wants to represent text.

I guess it all depends on your viewpoint. In the good old teletype days \r made the print head go to the start of the line and \n shifted the paper (or head) to the next line. Then we got computers where this all doesn’t matter too much anymore. Systems started using \r\n, which does what the good old teletypes expected. \n\r does exactly the same in physical space, so one can easily claim they are locally the same.

The \r\n habit continued in DOS/Windows. Unix went to single \n and does the mapping when addressing a teletype. MacOS went for a single \r (printing all text on the same line :slight_smile:). SWI-Prolog follows the Unix perspective: internally all is single \n. The ISO syntax allows creating an atom/string with a \r, but you should normally not do so. Only if you exchange text as byte sequences must you start thinking about that, but that is normally only done when talking to some external device.

I guess any tool will do its own thing on what is in the end illegal. A text file in DOS has lines ending in \r\n. On POSIX it is \n. If you feed it something else you can more or less expect anything, depending on the application. That is similar to illegal UTF-8 (or whatever multi-byte encoding). Some tools signal an error, some handle it as the reserved Unicode illegal character, etc.

Let us concentrate on more interesting stuff, such as good long string support.

The code for indent_lines uses this, to preserve the line endings from the original (in the Python version, things like \v and \f are treated as line endings, so even if we can safely use \n as newline internally and change it to the appropriate value on IO, we would still want to preserve the line endings).

This is sort of true in Python 2, but Python 3 uses Unicode for strings; byte strings are a separate beast, and encode/decode must be done to convert between strings and byte strings. (In some cases, both strings and byte strings have the same method/function, for example splitlines, but in others only one exists, for example dedent.) This change in string semantics has set Python development back by years; on the other hand, the old semantics were even more error-prone than C++ string semantics.

SWI-Prolog’s use of \n seems better than Python’s conventions. I’ll have to look more carefully at Python’s universal newlines conventions to see whether split_lines_with_ends is needed or not (but see above, which allows things such as \v to mark end-of-line).

Thanks for the Python 2/3 update. Glad SWI-Prolog got that right from the beginning :slight_smile: I’m not sure we should worry about \v and \f. Except for doing some fun stuff on a teletype there is not much use for them these days that I can see. Discarding them will probably never be noticed by 99.9% of the users, while it allows for a relation string_lines/2 and doesn’t need the endings stuff. My current judgement is that less-is-better prevails. After all, it is not the case that something becomes impossible. One can still use lower level primitives to deal with text holding these characters.

Here’s the full list of line separators that Python uses. I agree that less-is-better, but just want confirmation that we don’t care about any of these, including the two Unicode ones (Line Separator, Paragraph Separator):

| Representation | Description |
| --- | --- |
| \n | Line Feed |
| \r | Carriage Return |
| \r\n | Carriage Return + Line Feed |
| \v or \x0b | Line Tabulation |
| \f or \x0c | Form Feed |
| \x1c | File Separator |
| \x1d | Group Separator |
| \x1e | Record Separator |
| \x85 | Next Line (C1 Control Code) |
| \u2028 | Line Separator |
| \u2029 | Paragraph Separator |

2 Likes
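
For anyone who wants to check the list above against their own Python 3 installation, here is a quick sketch. The variable names are just for illustration.

```python
# The separators from the table above, as Python 3 string literals.
SEPARATORS = [
    "\n", "\r", "\r\n", "\v", "\f",
    "\x1c", "\x1d", "\x1e", "\x85", "\u2028", "\u2029",
]

# str.splitlines() treats every one of them as a line boundary ...
by_splitlines = {sep: ("a" + sep + "b").splitlines() for sep in SEPARATORS}

# ... whereas splitting on "\n" alone only reacts to "\n" (and leaves the
# "\r" of a "\r\n" pair attached to the preceding line).
by_newline = {sep: ("a" + sep + "b").split("\n") for sep in SEPARATORS}

for sep in SEPARATORS:
    print(repr(sep), by_splitlines[sep], by_newline[sep])

# "\n\r" (as opposed to "\r\n") counts as two line ends.
print("a\n\rb".splitlines())
```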

The like is notably a vote to ignore all this. As I already claimed, that doesn’t mean you can’t use this stuff. It merely means you need to do the work yourself if you want this.

Agreed. I’ll remove split_lines_with_ends and the \r\n test cases.

FWIW, Python test cases for indent/dedent specifically don’t handle \n\r as a single line-end: they’re treated as 2 line-ends. (I’m not saying that Python always gets such decisions right, but there’s a lot of accumulated experience in Python, so it’s worth carefully considering their choices.)

Anyway, as @jan mentioned somewhere, SWI-Prolog treats the end-line conversion as an external issue, handled by the IO; internally "\n" is used. This greatly simplifies things, so it seems a reasonable design choice. I’ll make the appropriate changes in my PR and in follow-up PRs.

Prolog got lucky by treating strings as lists of integers, even though it felt a bit hacky (e.g., portray/1 was needed to display things nicely, and some people were concerned about performance, although I recall Richard O’Keefe made a defence of both the performance and space characteristics). Integers can easily handle Unicode, and everything else can be relegated to I/O, thereby avoiding a lot of unpleasant encode/decode book-keeping.

And SWI-Prolog’s strings are a pleasant extension to list-of-integers, although there are a few places where they don’t work quite as desired (e.g., atomic_list_concat/3 does a split into atoms and not strings; but that’s easily wrapped using maplist([S,A]>>atom_string(A,S), ...)).

I don’t think compatibility with Python is a worthwhile goal, especially when it adds to code complexity. Learning from Python (including its mistakes) is worthwhile.

As it is, Python and SWI-Prolog have slightly different models for external and internal representations. The Prolog model is simpler, by avoiding the need to explicitly do encode and decode. (Python has various convenience options that help avoiding encode and decode for common cases, but the documentation is still cluttered with details about things like DOS line endings, which aren’t needed in Prolog.)

I said for some things. Basically, I want to take advantage of Python’s wide range of experience, but not slavishly follow it where it doesn’t make sense or where it overly complicates things.

But in the case of newlines, (a) SWI-Prolog already has a solution (always use \n internally and translate to/from the external form (\n, \r\n, \r, whatever) during IO, and IO can be captured with with_output_to/2) and (b) I think that SWI-Prolog’s solution is simpler and easier to use than Python’s. So, for this, I’d follow SWI-Prolog and ignore Python.
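
For comparison only: Python can be made to work in much the same "translate at the IO boundary" style through the newline argument of its text streams. A small sketch with made-up sample bytes, using in-memory streams in place of real files:

```python
import io

# Bytes as they might arrive from files written on different systems.
raw = b"dos\r\nold mac\rposix\n"

# Reading with newline=None enables universal newlines: "\r\n", "\r" and
# "\n" all arrive in the program as a single "\n".
reader = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None)
text = reader.read()

# Writing with newline="\r\n" translates every "\n" back to the DOS form.
out = io.BytesIO()
writer = io.TextIOWrapper(out, encoding="ascii", newline="\r\n")
writer.write(text)
writer.flush()

print(repr(text))
print(out.getvalue())
```

Once the text is inside the program, code that splits on "\n" alone is sufficient, which is the simplification being argued for here.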

(My main intent with following Python’s semantics was in how to handle lines with all whitespace, which takes care of problems like text with or without a final or initial newline.)

BTW, in Python, \v, \f, etc. are only for strings; for byte strings (which @j4n_bur53 used as an example), they’re not recognized as line separators:

$ python3.9
Python 3.9.0rc1 (default, Aug 11 2020, 23:07:22) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b'abc\vdef\fnghi'.splitlines()
[b'abc\x0bdef\x0cnghi']
>>> u'abc\vdef\fnghi'.splitlines()  # same, but a string, not a byte-string
['abc', 'def', 'nghi']

PS: CRLF was chosen over LFCR because of timing considerations on some mechanical devices – it took longer for the print head to move from right-to-left than for the carriage to advance one line. Some devices required extra NULs after the CRLF because the carriage return took so long that the sender had to wait – you can see this behaviour in old movies that show teletypes (50 baud?), where for short lines there’s a pause while the head moves up and down at the left margin.

TBH, I don’t care about CRLF and similar. There’s a set_stream/2 newline property that can be either posix or dos, and I presume it works as advertised. Assuming that lines are separated by "\n"s is good enough; if something fancier is needed, someone who needs that can write the additional code.

As for compatibility with Python: I’m mainly interested in similar handling of empty lines (note: the current code does not do this yet, nor does my latest PR), and mainly because that turns out to be convenient for a number of use cases. I don’t expect to see "\r" in strings (because IO should have taken care of that), so no special handling for that, resulting in significant code simplification.

I don’t think it’s the job of split_lines (or indent_lines or dedent_lines) to deal with files that have been copied from another operating system without being transformed by a tool like dos2unix.

If you need to deal with such a situation, you can read/write the raw bytes (encoding(octet)). That’s a better way to handle the problem, if dos2unix can’t be used.

To put it another way: https://youtu.be/_FdvXl5GAes?t=8 :wink: … someone else can write the additional code and test cases, if it’s important to them.
