Re_replace behaviour

This question on StackOverflow highlights a problematic behaviour of re_replace.

In PHP I get

php > echo preg_replace("#_#", "\\\\_", "a_b_c");
a\_b\_c

so the question seems well posed.
In SWI (ver. 8.5.8), I can confirm the described behaviour. I thought of using a double replacement, but

?- re_replace("_"/g, "#_", "a_b_c", S1), re_replace("#"/g, "\\", S1, Str).
S1 = "a#_b#_c",
Str = "a\\_b\\_c".

So, at the very least, the documentation seems wrong about the need to write \\\\ to get a single backslash in the replacement. But there is no way to write a single backslash, since it escapes the string delimiter and causes the toplevel to misread the input.

I think there are a few different things going on here. First, regarding the original post on Stack Overflow - I can’t reproduce the failure in the second example. For me, using SWI-Prolog 8.5.10, both calls work as expected:

?- use_module(library(pcre)).
true.
?- re_replace("_"/g, "QW", "a_b_c", Str).
Str = "aQWbQWc".
?- re_replace("_"/g, "\\\\_", "a_b_c", Str).
Str = "a\\_b\\_c". % no failure!

Unfortunately the SO user didn’t post any version information, so it’s hard to diagnose this. Perhaps the previous development version had some bugs because of the recent migration to pcre2?

Second, about the result of re_replace when using backslashes: keep in mind that (SWI-)Prolog by default displays atoms, strings, etc. in quoted form and escapes any special characters inside, including backslashes themselves. So the string "a\\_b\\_c" doesn’t actually contain any double backslashes - it’s just the quoted representation of the text a\_b\_c.
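
To make the difference between the quoted form and the actual text concrete, here is a minimal sketch. The predicate name show_both/1 is made up for this illustration; it only uses the standard format/2 directives ~w (raw) and ~q (quoted) and string_length/2, and assumes the default SWI-Prolog 7+ setting where double-quoted text is a string.

```prolog
% show_both(+Text): print Text unquoted, quoted, and its length.
% (show_both/1 is a hypothetical helper, not part of any library.)
show_both(Text) :-
    string_length(Text, Len),
    format("raw:    ~w~n", [Text]),   % what the text really contains
    format("quoted: ~q~n", [Text]),   % what the toplevel shows in answers
    format("length: ~d~n", [Len]).

% ?- show_both("a\\_b\\_c").
% raw:    a\_b\_c
% quoted: "a\\_b\\_c"
% length: 7
```

The length of 7 (a, \, _, b, \, _, c) shows that each \\ in the quoted form stands for exactly one backslash character.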

You can reproduce this with your PHP example as well - you just have to explicitly ask PHP to quote the string:

php > var_export(preg_replace("#_#", "\\\\_", "a_b_c"));
'a\\_b\\_c'

Finally, about the note in the re_replace/4 documentation:

Because \ is an escape character inside strings, you need to write “\\\\” to get a single backslash.

It is correct that you need to write four backslashes to get a single backslash, but the explanation is incomplete. In addition to Prolog’s use of \ for escaping in strings, PCRE itself also uses backslashes for escaping and character classes (inside regexes) and for capture group references (inside replacement patterns). That is why you need four backslashes - the string "\\\\" is unescaped by Prolog to the text \\, and that is again unescaped by PCRE to the literal character \.
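
The two layers can be looked at separately. This is only a sketch: the predicate names source_layer/1 and replacement_layer/1 are invented for the illustration, and the second one assumes a recent development version (8.5.10 or later) where the replacement handling behaves as in the examples in this thread.

```prolog
:- use_module(library(pcre)).

% Layer 1: the Prolog reader turns the four backslashes in the source
% text into an atom that holds just two backslash characters.
source_layer(Chars) :-
    atom_chars('\\\\', Chars).

% Layer 2: the replacement processing of re_replace/4 reads those two
% characters as one literal backslash (assumption: same behaviour as
% the '\\\\1' example reported for 8.5.10).
replacement_layer(Chars) :-
    re_replace('^.', '\\\\', 'abc', Result),
    string_chars(Result, Chars).
```

Calling source_layer(Chars) gives a list of two backslash characters; replacement_layer(Chars) should then give a list of three characters, starting with a single backslash, in line with the string_chars/2 result shown elsewhere in this thread.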

My version is 8.2.1. OS: Linux Slackware 14.2.

What do you get from swipl --version?

There were a number of bugs in the PCRE1 version of re_replace/4. In particular, there were some problems with how “$” and “\” were handled (I don’t recall the details). I did the migration to PCRE2 in three steps: first, fixing the various bugs in predicates such as re_replace/4 (and adding quite a few test cases); second, refactoring some code; and third, the migration to PCRE2 itself. So, there was a point where the development version was using PCRE1 but with an improved re_replace/4.

You might want to look at the test cases that have “\” or “$”, such as replace_escape-backslash1 et seq. https://github.com/SWI-Prolog/packages-pcre/blob/b9dabbc47559b2d4f4b936d5813fb83d43a12dc6/test_pcre.pl#L737=

Currently, library(pcre) doesn’t use pcre2_substitute(), which is a new feature in PCRE2 that wasn’t in PCRE1. Instead, it emulates pcre2_substitute(). I believe PHP does the same. It is possible that the emulation doesn’t 100% match either what PHP, JavaScript, Python, etc. do or what PCRE2 does.

If the text in the documentation is confusing, please suggest something better. There are two issues:

  1. Escaping the single characters “$” and “\” in the replacement (the With argument).
  2. SWI-Prolog’s escaping of “\” in strings.

Here are some examples:

?- re_replace('^.', '$1', 'abc', Result).
ERROR: key `1' does not exist in re_match{0:0-1}

?- re_replace('^.', '$$1', 'abc', Result).
Result = "$1bc".

?- re_replace('^.', '\1', 'abc', Result).
Result = "\u0001bc".

?- re_replace('^.', '\\1', 'abc', Result).
ERROR: key `1' does not exist in re_match{0:0-1}
?- re_replace('^.', '\\\\1', 'abc', Result).
Result = "\\1bc".

?- re_replace('^.', '\\\\1', 'abc', Result), string_chars(Result, ResultChars).
Result = "\\1bc",
ResultChars = [\, '1', b, c].

SWI-Prolog version 8.2.1 for i686-linux

I also see the failure with SWI 8.5.3 but the correct behaviour with 8.5.10, so presumably it’s something @peter.ludemann fixed with his recent efforts.

That’s quite an old version, and uses PCRE1. I recommend you upgrade to the latest swipl-devel version (8.5.10) to fix the problem. (If you can’t upgrade swipl, you could try using “$” instead of “\” for substitution numbers/names.)
The fixes I made to library(pcre) - for either PCRE1 or PCRE2 - will not be backported to the “stable” version. If you can’t use the “devel” version, I’m afraid you’ll have to wait until the next major release of the “stable” version.
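
As a rough sketch of the “$” style of reference mentioned above: the predicate wrap_digits/2 is a made-up name for this example, and I am assuming a recent development version in which $1 refers to the first capture group (I have not checked how older releases such as 8.2.1 treat it).

```prolog
:- use_module(library(pcre)).

% wrap_digits(+In, -Out): put angle brackets around every run of digits.
% $1 refers to capture group 1; no backslash is involved, so Prolog's
% string escaping does not come into play.
% (wrap_digits/2 is a hypothetical helper for this illustration.)
wrap_digits(In, Out) :-
    re_replace("([0-9]+)"/g, "<$1>", In, Out).
```

With this, a query such as wrap_digits("a1b22", Out) would be expected to bind Out to the text a<1>b<22>.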

One of the original reasons why I started using a rolling release linux distribution (and as a result, regularly rebuild SWI-Prolog from the latest devel branch) is that I spent a few days fighting an obscure bug in a very popular command line tool. The bug had been fixed, the docs had moved on, and of course I was reading the “latest stable” docs on the internet but using some old buggy version of the tool.

Never again.

This, by the way, happens often enough with SWI-Prolog: people use the ridiculously outdated version that comes with the package manager of their distro, but of course they read the “latest” docs on the SWI-Prolog website. Is there any technical solution to this? Or do we just keep on saying “please upgrade”?

PS: from the point of view of a software developer, building SWI-Prolog from source is fairly straightforward. Re-downloading the latest installer/binary for Windows and macOS is even less time-consuming but somehow feels like effort. Why that is, I don’t know.

Thanks to all, and especially @peter.ludemann for taking care of so many details of our beloved language.
Once upgraded to ver. 8.5.10 (sorry, I missed ver. 8.5.9), I get the correct behaviour:

?- re_replace("_"/g, "\\\\_", "a_b_c", Str), write(Str).
a\_b\_c
Str = "a\\_b\\_c".

I don’t know. Suggestions? Interactive end-user applications typically just tell their users to upgrade to the one and only new version. For us this is a bit harder. Sometimes we do (need to) make incompatible changes. Some people really do not want an upgrade of the software they use to break their application and force them to start debugging. That also makes sense.

In the end we are dealing with an enormous set of dependent software components that evolve largely independently. As a result developers must behave responsibly and be as flexible as possible wrt. their dependencies and as predictable as possible to those that depend on them. At the same time we cannot just stop evolving. All that creates a delicate balancing act. I’m always impressed that it works as well as it does.

On my Chromebook, I’m more-or-less forced to rebuild SWI-Prolog and Python because I (so far) haven’t been able to add the PPAs (there’s no “Release file”, whatever that means); but I am able to upgrade Bazel because it uses a different method of setting its PPA. I don’t know enough about apt-secure(8) to figure out what’s going on (this is from the cryptic message I get when I try to use ppa:swi-prolog/devel on Debian, and also http://ppa.launchpad.net/deadsnakes/ppa/ubuntu jammy Release ).

My experience is the polar opposite. Unless you’re immersed in the whole ecosystem of building C apps, and all that that entails, it’s just so much easier to download a binary. At least for the macOS bundle, it just seems to work.