Safety / Performance: findall/3 without call_cleanup/2

I came up with a findall/3 that doesn’t use call_cleanup/2.
The findall2/3 realization uses nb_linkarg/3 and duplicate_term/2.
Its not as fast as the bundled findall/3. Lets take these test cases:

test :-
   between(1,1000,_), findall(Y-X-Y, between(1,1000,X), _), fail; true.

test2 :-
   between(1,1000,_), findall2(Y-X-Y, between(1,1000,X), _), fail; true.

I get these benchmark results:

/* SWI-Prolog 9.3.14 */
?- time(test).
% 2,008,999 inferences, 0.125 CPU in 0.172 seconds (73% CPU, 6071992 Lips)
true.

?- time(test2).
% 6,007,999 inferences, 0.422 CPU in 0.427 seconds (99% CPU, 14241183 Lips)
true.

But I wonder was it safe to use nb_linkarg/3 and duplicate_term/2 ?
And can such a findall2/3 be made faster somehow? The source

code, just a single linked list with a known end element. The end
gets updated and the known end itself changes as well:

findall2(Template, Goal, List) :-
   sys_find2_init(State),
   (Goal, sys_find2_next(Template, State), fail; true),
   sys_find2_fini(State, List).

sys_find2_init(X) :-
   X = v(_,_),
   C = [-|_],
   nb_linkarg(1, X, C),
   nb_linkarg(2, X, C).

sys_find2_next(T, X) :-
   C = [_|_],
   duplicate_term(T, H),
   nb_linkarg(1, C, H),
   arg(1, X, J),              %%% known end cell
   nb_linkarg(2, J, C),
   nb_linkarg(1, X, C).

sys_find2_fini(v(C,D), L) :-
   nb_linkarg(2, C, []),
   arg(2, D, L).

Interestingly the findall2/3 is not impressed by calling
garbage_collect/0 or trim_stacks/0:

/* SWI-Prolog 9.3.14 */
?- findall2(Y-X-Y, (between(1,10,X);garbage_collect,
    trim_stacks, between(11,20,X)), L), write(L).
[_212-1-_212,_230-2-_230,_248-3-_248,_266-4-_266,
_284-5-_284,_302-6-_302,_320-7-_320,_338-8-_338,
_356-9-_356,_374-10-_374,_396-11-_396,_418-12-_418,
_440-13-_440,_462-14-_462,_484-15-_484,_506-16-_506,
_528-17-_528,_550-18-_550,_572-19-_572,_594-20-_594]

Memory consumption is also the same:

?- between(4,8,K), N is 10^K, space(findall(X, 
between(1,N,X),_)), fail; true.
% Memory 0.229 mB
% Memory 4.162 mB
% Memory 33.522 mB
% Memory 268.403 mB
ERROR: Stack limit (1.0Gb) exceeded

?- between(4,8,K), N is 10^K, space(findall2(X, 
between(1,N,X),_)), fail; true.
% Memory 0.393 mB
% Memory 4.063 mB
% Memory 33.423 mB
% Memory 268.304 mB
ERROR: Stack limit (1.0Gb) exceeded

But I still do not 100% believe that it works…

I have looked at this as well. It seems attractive to get rid of the global data structure that needs protection to handle exceptions, etc. It can be implemented correctly, you can get rid of some of the nb_linkarg/3 calls, but there still seems a serious performance gap. I could get it a little faster if there are few answers involved.

I’ve not studied it in detail. One would assume that a duplicate_term/2 can be cheaper than a recorda/recorded (which is more or less what happens, be it at a lower level) of findall/3. Building the list is more efficient in findall/3 as we know how long it is, so we can do the allocation beforehand and simply create the list structure.

I can speed it up, if I use nb_setarg/3 instead of the explicit
duplicate_term/2 and nb_linkarg/3 combo. The change is as follows:

/* Before */
sys_find2_next(T, X) :-
   C = [_|_],
   duplicate_term(T, H),
   nb_linkarg(1, C, H),
   arg(1, X, J),
   nb_linkarg(2, J, C),
   nb_linkarg(1, X, C)

/* After */
sys_find2_next(T, X) :-
   C = [_|_],
   nb_setarg(1, C, T),
   arg(1, X, J),
   nb_linkarg(2, J, C),
   nb_linkarg(1, X, C).

Now the timings are better:

/* Before */
?- time(test2).
% 6,007,999 inferences, 0.422 CPU in 0.427 seconds (99% CPU, 14241183 Lips)
true.

/* After */
?- time(test2).
% 5,007,999 inferences, 0.297 CPU in 0.298 seconds (100% CPU, 16869049 Lips)
true.

A wooping more than 100 milliseconds are gone!

Edit 09.11.2024
With this little change it already beats all the call_cleanup/2 based
findall/3 implementations around from some novell Prolog systems:

/* Trealla Prolog 2.59.15 */
?- time(test).
% Time elapsed 0.356s, 3004004 Inferences, 8.436 MLips
   true.

/* Scryer Prolog 0.9.4-201 */
?- time(test).
   % CPU time: 0.464s, 5_062_044 inferences
   true.