Careful with sub_atom/5 and feature request last_sub_atom/5

Hi,

I guess sub_atom/5 can be improved. This leaves a spurious choice point:

/* SWI-Prolog (threaded, 64 bits, version 8.5.11) */
?- sub_atom('baaab', X, Y, Z, 'aa').
X = 1,
Y = Z, Z = 2 ;
X = Y, Y = 2,
Z = 1 ;
false.

?- sub_atom('abracadabra', X, Y, Z, 'abra').
X = 0,
Y = 4,
Z = 7 ;
X = 7,
Y = 4,
Z = 0 ;
false.

My old system couldn’t do it either, but my new system
does eliminate the choice point for the second test case:

?- sub_atom('baaab', X, Y, Z, 'aa').
X = 1, Y = 2, Z = 2;
X = 2, Y = 2, Z = 1;
fail.

?- sub_atom('abracadabra', X, Y, Z, 'abra').
X = 0, Y = 4, Z = 7;
X = 7, Y = 4, Z = 0.

In the second test case for the second solution
we have Z=0, so no choice point anymore needed.

Also a kind of feature request. I am not sure whether something
like this exists in SWI-Prolog? But many programming languages
have pairs like indexOf/lastIndexOf in JavaScript or find/rfind in Python.

How do I do a sub_atom/5 that searches backwards? My proposal
would be last_sub_atom/5. Here is what it would do, also with
choice point elimination now:

?- last_sub_atom('baaab', X, Y, Z, 'aa').
X = 2, Y = 2, Z = 1;
X = 1, Y = 2, Z = 2;
fail.

?- last_sub_atom('abracadabra', X, Y, Z, 'abra').
X = 7, Y = 4, Z = 0;
X = 0, Y = 4, Z = 7.

Makes sense. Fixed by a7321dda56275bd210c5fde2fb0f1ef77c49b5cd contributed by Eshel Yaron. Note that I consider this a small patch. Typically mode (+.-,-,-,+) is nondet, and callers need to be aware of that. This only imrpoves one corner case. The implementation could also search for an alternative and succeed deterministically if that does not exist, only leaving a choice point if there exists an alternative match. This may make sense, although there are of course cases where the haystack is huge and you only want the first match.

I think this is a fair request. I have not needed this often but occasionally, yes. I wonder whether there is more established practice regarding the naming. Not that I have something better, but I’m not very enthusiastic about this name either.

1 Like

Thanks to the contributor!

Yes, there are still some spurious choice points, its the same
improvement and residual problem like here:

?- member(X, [1,2,3]).
X = 1 ;
X = 2 ;
X = 3.

?- member(2, [1,2,3]).
true ;
false.

Here is a use case for last_sub_atom/5. Its useful to make an OS polyglott
Prolog system, at least polyglott among Unix, Windows and Mac. You
can get rid of Logtalks internal_os_path/2, which I think is a little

design flaw in my old Prolog system, which my new Prolog system tries
to avoid. One can do the following like here, maybe there are also other
agorithms based on last_sub_atom/5, but this is easiest to understand:

sys_file_parent(Path, Dir) :-
   (last_sub_atom(Path, Pos1, _, _, '/') -> true; Pos1 = -1),
   (last_sub_atom(Path, Pos2, _, _, '\\') -> true; Pos2 = -1),
   Pos is max(Pos1, Pos2), Pos \== -1, !,
   Pos3 is Pos+1,
   sub_atom(Path, 0, Pos3, _, Dir).
sys_file_parent(_, '').

sys_file_name(Path, Name) :-
   (last_sub_atom(Path, Pos1, _, _, '/') -> true; Pos1 = -1),
   (last_sub_atom(Path, Pos2, _, _, '\\') -> true; Pos2 = -1),
   Pos is max(Pos1, Pos2), Pos \== -1, !,
   Pos3 is Pos+1,
   atom_length(Path, Len),
   Len2 is Len-Pos3,
   sub_atom(Path, Pos3, Len2, _, Name).
sys_file_name(Path, Path).

Maybe regex could do the same. But can regex search backwards?
You would need a kind of greedy regex pattern, that nevertheless
has an alternative in it. And a regex library is more heavier, than some

last_sub_atom/5 which is still relative lightweight to provide, only
a variant of sub_atom/5. Here are some example runs:

?- sys_file_parent('/mnt/c/foo/bar/baz.p', X).
X = '/mnt/c/foo/bar/'.

?- sys_file_name('C:/foo/bar/baz.p', X).
X = 'baz.p'.

?- sys_file_parent('C:\\foo\\bar\\baz.p', X).
X = 'C:\\foo\\bar\\'.

?- sys_file_name('C:\\foo\\bar\\baz.p', X).
X = 'baz.p'.

Its not the same as decompose_file_name/3 from Logtalk, which is possibly
also not OS polyglott. Its not the same since it leaves the directory delemiter,
either '/' or '\\' at the end of the parent intact. So the reverse function

of this decomposition is simply atom_concat/3:

?- atom_concat('/mnt/c/foo/bar/', 'baz.p', X).
X = '/mnt/c/foo/bar/baz.p'.

?- atom_concat('C:\\foo\\bar\\', 'baz.p', X).
X = 'C:\\foo\\bar\\baz.p'.

This seems to do what you want (handling backslash is left out for simplicity):

?- re_matchsub('^(.*)/([^/]*)$', '/mnt/c/foo/bar/baz.p', X).
X = re_match{0:"/mnt/c/foo/bar/baz.p", 1:"/mnt/c/foo/bar", 2:"baz.p"}.

(last_)sub_atom/5 is still faster, was simulating last_sub_atom/5 via reverse input,
but I guess I was cheating, since I didn’t extract the components:

?- time((between(1,1000000,_), re_matchsub('^(.*)/([^/]*)$', 
'/mnt/c/foo/bar/baz.p', X), fail; true)).
% 13,000,000 inferences, 2.656 CPU in 2.655 seconds (100% CPU, 4894118 Lips)
true.

?- time((between(1,1000000,_), once(sub_atom(
|    'p.zab/rab/oof/c/tnm/', X, _, _, '/')), fail; true)).
% 3,000,000 inferences, 0.422 CPU in 0.413 seconds (102% CPU, 7111111 Lips)
true.

Maybe if you pre-compile the regex? Don’t know, can SWI-Prolog do that?
Or when the alternative '\\' to '/' is also present, then regex is faster anyway?

Edit 24.05.2022:
Test results without cheating:

?- [user].
decompose(X, Y, Z) :- 
    sub_atom(X, A, _, C, '/'), 
    sub_atom(X, 0, A, _, Y),
    sub_atom(X, _, C, 0, Z), !.

?- decompose('p.zab/rab/oof/c/tnm/', X, Y).
X = 'p.zab',
Y = 'rab/oof/c/tnm/' .

And the numbers are now, sub_atom/5 is still faster:

?- time((between(1,1000000,_), re_matchsub('^(.*)/([^/]*)$', 
'/mnt/c/foo/bar/baz.p', X), fail; true)).
% 13,000,000 inferences, 2.656 CPU in 2.655 seconds (100% CPU, 4894118 Lips)
true.

?- time((between(1,1000000,_), decompose(
'p.zab/rab/oof/c/tnm/', X, Y), fail; true)).
% 5,000,000 inferences, 0.828 CPU in 0.813 seconds (102% CPU, 6037736 Lips)
true.

The regex solution is a bit faster with precompilation of the regex (re_matchsub/3 uses a cache (tabling), so the difference in performance is the cache lookup).

?- time((between(1,1000000,_), sub_atom('p.zab/rab/oof/c/tnm/', X, _, _, '/'), fail; true)).
% 7,000,003 inferences, 0.712 CPU in 0.713 seconds (100% CPU, 9827276 Lips)
true.

?- re_compile('^(.*)/([^/]*)$', Re, []), time((between(1,1000000,_), re_matchsub(Re, 'p.zab/rab/oof/c/tnm/',
X), fail; true)).
% 7,000,001 inferences, 1.504 CPU in 1.505 seconds (100% CPU, 4654184 Lips)
Re = <regex>(0x57e29af40e70, /^(.*)/([^/]*)$/).

?- time((between(1,1000000,_), re_matchsub('^(.*)/([^/]*)$', 'p.zab/rab/oof/c/tnm/', X), fail; true)).
% 13,019,759 inferences, 2.192 CPU in 2.193 seconds (100% CPU, 5939458 Lips)
true.

There’s also a JIT in the underlying PCRE2 library, which gets a bit more performance:

?- re_compile('^(.*)/([^/]*)$', Re, [optimise(true)]), time((between(1,1000000,_), re_matchsub(Re, 'p.zab/rab/oof/c/tnm/', X), fail; true)).
% 7,000,000 inferences, 1.479 CPU in 1.480 seconds (100% CPU, 4732542 Lips)
Re = <regex>(0x57e29af416c0, /^(.*)/([^/]*)$/).

It appears that constructing the matches takes quite a bit of time; switching to re_match/2, which succeeds if the pattern matches but doesn’t construct the dict of results is quite a bit faster:

?- re_compile('^(.*)/([^/]*)$', Re, [optimise(true)]), time((between(1,1000000,_), re_match(Re, 'p.zab/rab/oof/c/tnm/'
), fail; true)).
% 6,000,000 inferences, 0.777 CPU in 0.778 seconds (100% CPU, 7720696 Lips)
Re = <regex>(0x57e29af41730, /^(.*)/([^/]*)$/).

I made an error, one needs to use a once/1 around sub_atom/5 to compare
the right thing. I already edited my old post. The result is now:

Summary is now as follows, using the data from @peter.ludemann:

regex JIT no-extract: 0.778 seconds
sub_atom no-extract: 0.413 seconds
regex compiled extract: 1.505 seconds
sub_atom extract: 0.813 seconds

Dear Jan,

The name going towards sub_atom_rev(erse) or _backwards seem more appropriate.
Would n’t be though easy enough to just reverse the atom and apply the predicate ?
The example use case does not seem too convincing SWI has directory_file_path/3 and
pack(os_lib) has os_path/3.

The argument about the extra backtrack points don’t look convincing either.
Prolog implements depth first search in order to trade time for space.
Performing look-ahead is just expanding the code base and reduces the elegance of
the operational understanding- a programmer would need to really read the doc of each
predicate to know if look ahead is performed and to what extend.

Regards,

Nicos Angelopoulos

https://stoics.org.uk/~nicos

1 Like

Maybe there is a misunderstanding, but the idea was never to
also change the residual problem. residual = wont fix, and
for example introduce a look-ahead. I hinted checking Before=0

before maintaining a choice point. No look-ahead here:

You still get a spurious choice point in member(2,[1,2,3]). Only
listing as in member(X,[1,2,3]) does leave one less choice point.
Here is testing of the daily build of SWI-Prolog:

/* SWI-Prolog 8.5.11 */
?- sub_atom('123', _, 1, _, X).
X = '1' ;
X = '2' ;
X = '3'.

?- sub_atom('123', _, 1, _, '2').
true ;
false.

Seems to be everything fine now. The spurious choice point
in the first query is gone. The spurious choice point in the second
query is won’t fix, still there. Was testing version swipl-w64-2022-05-29.exe.

Edit 29.05.2022:
I just see SWIPL 7 could already do this case:

/* SWI-Prolog 7.2.0 */
?- sub_atom('123', _, 1, _, X).
X = '1' ;
X = '2' ;
X = '3'.

So I gues it this analogue, the Gertjan van Noord trick
applied to sub_atom/5. For member/2 we have nowadays,
no spurious choice point at the end:

?- member(3, [1,2,3]).
true.

Lets try sub_atom/5 old:

/* SWI-Prolog 7.2.0 */
?- sub_atom('123', _, 1, _, '3').
true ;
false.

And sub_atom/5 new:

/* SWI-Prolog 8.5.11 */
?- sub_atom('123', _, 1, _, '3').
true.

Oki Doki. Thats the idea, yes!!! Just the status quo of member/2,
which doesn’t do look ahead only the Gertjan van Noord trick,
applied as well to sub_atom/5.

I introduced this case also into the discussion, although
its not a problem for SWI-Prolog, worked already in SWIPL 7
and wasn’t broken in SWIPL 8:

The SWIPL 7 and SWIPL 8 behaviour are a further instance of the
Gertjan van Noord trick analogue from member/2 to sub_atom/5,
no look ahead, only economic handling of choice points.

I guess I introduced this case into the discussion, because other
Prolog systems behave differently, like for example.

/* Trealla Prolog v1.29.14 */
?- sub_atom('123', _, 1, _, X).
   X = '1'
;  X = '2'
;  X = '3'
;  false.

And also for example:

/* Scryer Prolog v0.9.0-159 */
?- sub_atom('123', _, 1, _, X).
   X = '1'
;  X = '2'
;  X = '3'
;  false.

rev seems to be the common term. I think it is indeed better than last, in particular for Prolog as it does not only return the last match (as it does for most other languages). You can of course reverse the atom and reverse the result again, but that is rather expensive in terms of created atoms. As this is just some numerical logic it shouldn’t be too hard to abstract the implementation a bit further.

I don’t know what the best is wrt the backtrack point. The small enhancement was practically free and but merely fixes a single corner case. Look-ahead might not be a bad idea. When the haystack is not too big it won’t take a lot of time and if it makes the first result deterministic you gain the need to preserve the context needed for retrying. I would not be surprised if it turns out to be faster in most except extreme cases (only need the first result on a very long haystack).

Rather a pretty bad idea without any benefit, since the
intended use case of non-det last_sub_atom/5, is det lastIndexOf(_),
namely like here, slightly optimized version now with only one clause:

So there is a (->)/2, which cuts away last_sub_atom/5. So if
you do some extra work in last_sub_atom/5, you will:

  • Not eliminate the choice point. For example if the file name is
    "foo/bar/baz.p", the look ahead will find another “/”, and you will
    get a choice point.

  • Only generate overhead. Since you will internally search the first “/”
    from left, and then the second “/” from left.

The only thing that removes the choice point is the local cut in (->)/2.
And one way to avoid some extra work is to not have look-ahead.

Edit 30.05.2022:
BTW, to avoid getting into this trap:

“Complaining about a problem without posing a solution is called whining.”
― Teddy Roosevelt
https://www.goodreads.com/quotes/8011953

Here a hint. Its utterly trivial to implement last_sub_atom/5. Just copy
paste the implementation of sub_atom/5, and change ++ into --,
<= text.length into >= 0, etc…

Extremly straight forward. Costs you less than a day. Costs
you maybe more than a day if you also do test cases. And what
did cost me some extra time is surrogate pairs,

which is still less annoying than if you would have UTF-8 encoded
strings. In the later case the problem can get really nasty.

Obviously I am not an expert on the matter, but even allowing for the “extreme cases” to exist opens the door to malicious attacks. The only counter is then to decide at run time, based on the actual inputs, what algorithm to apply, what optimizations can be safely allowed and so on. Just a thought.

As the implementation is now, you are right as it allocates the data that allows reentrance of the implementation as soon as it detects that it might be nondet. This should probably be improved. Most foreign predicates delay creating this data structure to the moment they detect the first call is actually nondet.

Duplicating a 253 line C function for changing a couple of tiny bits seems wrong :slight_smile:

No pain, no gain.

Also I have heard some clever people have implemented
sub_atom/6, so that for example one can call:

?- sub_atom('abba', _, _, X, 'B', [reverse(true), ignore_case(true)]).
X = 1

The rest are then stubs:

sub_atom(X, Y, Z, T, U) :- sub_atom(X, Y, Z, T, U, []).
sub_atom_icasechk(X, Y, Z) :- sub_atom(X,Y, _, _, Z, [ignore_case(half)]), !.
last_sub_atom(X, Y, Z, T, U) :- sub_atom(X, Y, Z, T, U, [reverse(true)]).
Etc...

Edit 30.05.2022:
But sub_atom/5 is quite far away from optimal. What would be better
some regionMatches() as in Java. One use case in sub_atom/5 is missing,
search from a particular start point. I cannot search ‘abra’ in ‘abracadabra’

starting from 5-th character, this here involves backtracking,
is a little nonsensical:

?- sub_atom('abracadabra', X, _, _, 'abra'), X >= 5.
X = 7.

And this here involves copying an atom:

?- atom_length('abracadabra', L), H is L-5, 
sub_atom('abracadabra', 5, H, _, J), 
sub_atom(J, Y, _, _, 'abra'), X is Y+5.
X = 7.

So regionMatches() is missing in the ISO core standard.
Not only backward search.

Dear Jan,

Would it make sense to have a sub_atom(Full,Sub,Opts)
where the current and pre, post and length arguments become options and rev(erse) is a new option ?
That also gives room for future additional options.

Regards,

Nicos Angelopoulos

https://stoics.org.uk/~nicos

Possibly. That is also the idea behind the sub_atom/6 quoted by @j4n_bur53. I think I like the idea of options better than merely adding a reverse version. As we see that leads to new demands. We could also argue that extracting based on positions is the task of this predicate and searching is the task of something else, notably the regular expression library. In that view we are pretty much ok. Yes, the regular expression library is slower, but there are options to improve on that.

Dear Jan,

I would prefer an /3 version as it is often the case that not all arguments of the /5 version are needed. pack(stoics_lib) has an /2 version and an /4 version (no length).

Regards,

Nicos Angelopoulos

https://stoics.org.uk/~nicos