Sorry, I should have done the lookup myself instead of just giving you the search. I am working on the extended list for my own use and don't plan to publish it here unless others are interested.
bagof/3 is a complicated beast. There is probably an explanation for why this case is so much harder, and the profiler and source code are there to figure it out.
A bit of searching reveals that bind_bagof_keys/3, and notably keysort/2, cause the big slowdown. Profiling under valgrind shows that almost all time goes into deRef(), the macro that follows a variable reference chain. The bind_bagof_keys/3 trick ensures consistent variable bindings and makes sure this test passes:
sets(bagof-1) :-
    List = [_,_,_],
    bagof(X, member(X, List), Xs),
    Xs == List.
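The grouping behaviour this protects is visible at the top level: the free variables of the goal act as a witness, and bagof/3 enumerates one bag per distinct witness binding. An illustrative query (answer formatting may differ between systems):

```prolog
% bagof/3 groups answers by the bindings of the goal's free variables.
% Here Y is free, so we get one bag per value of Y:
?- bagof(X, member(X-Y, [a-1, b-1, c-2]), Xs).
%   Y = 1, Xs = [a, b] ;
%   Y = 2, Xs = [c].
```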
The way this was written created a 1,000-long variable reference chain from the shared variable dictionary through each of the answers. I changed this by creating the shared variable dictionary list before the findall/3 call, so each answer creates (possibly) a reference pointer into the dictionary. This has a quite drastic effect. Times on an AMD 3950X, Ubuntu 19.10, normal (no PGO) optimization:
?- time((between(1,1000,_), bagof(T, pr(X,1,1,T), L), fail; true)).
% 6,322,999 inferences, 0.609 CPU in 0.609 seconds (100% CPU, 10377325 Lips)
?- time((between(1,1000,_), bagof(Z-T, pr(X,Y,Z,T), L), fail; true)).
% 6,324,007 inferences, 0.678 CPU in 0.679 seconds (100% CPU, 9321305 Lips)
And for pr2/4:
?- time((between(1,1000,_), bagof(Z-T, pr2(X,Y,Z,T), L), fail; true)).
% 5,099,007 inferences, 0.797 CPU in 0.797 seconds (100% CPU, 6395849 Lips)
?- time((between(1,1000,_), setof(Z-T, pr2(X,Y,Z,T), L), fail; true)).
% 4,923,001 inferences, 0.934 CPU in 0.935 seconds (100% CPU, 5268815 Lips)
Ratio: 117% (0.934s for setof/3 vs. 0.797s for bagof/3).
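As an aside, when grouping by the free variables is not wanted, existential quantification with (^)/2 avoids the per-witness grouping (and with it the keysort/2 work). An illustrative query, formatting again system-dependent:

```prolog
% Y^Goal tells bagof/3 not to treat Y as a witness, so all answers
% land in a single bag:
?- bagof(X, Y^member(X-Y, [a-1, b-1, c-2]), Xs).
%   Xs = [a, b, c].
```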
I thought long variable reference chains were hardly ever a problem, but apparently it is possible to find counterexamples in real life. Learned something new. Thanks.