Arithmetic performance - 9.2.3 vs. 9.2.4

I was tracking down some performance anomalies in my code when I discovered the following on macOS (Intel):

Pereira benchmarks, 9.2.3:

?- benches.
Correct to: "bench_peirera:benches"? yes
tail_call_atom_atom    took 0.51-0.10=0.40/145946 =      2.8 usec/iter.
binary_call_atom_atom  took 0.50-0.06=0.44/94737 =       4.6 usec/iter.
cons_list              took 0.38-0.06=0.32/91525 =       3.5 usec/iter.
walk_list              took 0.49-0.08=0.40/122727 =      3.3 usec/iter.
walk_list_rec          took 0.37-0.09=0.29/125581 =      2.3 usec/iter.
args(1)                took 0.36-0.08=0.27/120000 =      2.3 usec/iter.
args(2)                took 0.34-0.06=0.28/81818 =       3.4 usec/iter.
args(4)                took 0.48-0.04=0.44/54545 =       8.1 usec/iter.
args(8)                took 0.42-0.02=0.39/33333 =      11.8 usec/iter.
args(16)               took 0.47-0.01=0.46/19355 =      23.6 usec/iter.
cons_term              took 0.39-0.06=0.33/84375 =       3.9 usec/iter.
walk_term              took 0.44-0.07=0.36/110204 =      3.3 usec/iter.
walk_term_rec          took 0.47-0.08=0.39/122727 =      3.2 usec/iter.
shallow_backtracking   took 0.32-0.29=0.02/415385 =      0.1 usec/iter.
deep_backtracking      took 0.27-0.04=0.23/59341 =       3.8 usec/iter.
choice_point           took 0.70-0.07=0.64/94737 =       6.7 usec/iter.
trail_variables        took 0.64-0.06=0.58/87097 =       6.6 usec/iter.
medium_unify           took 0.85-0.55=0.30/771429 =      0.4 usec/iter.
deep_unify             took 0.74-0.16=0.58/235161 =      2.5 usec/iter.
integer_add            took 0.66-0.03=0.63/49091 =      12.8 usec/iter.
floating_add           took 0.62-0.03=0.59/40909 =      14.4 usec/iter.
arg(1)                 took 0.17-0.03=0.14/40000 =       3.5 usec/iter.
arg(2)                 took 0.17-0.03=0.14/40909 =       3.5 usec/iter.
arg(4)                 took 0.16-0.03=0.13/37500 =       3.5 usec/iter.
arg(8)                 took 0.16-0.03=0.13/38217 =       3.5 usec/iter.
arg(16)                took 0.16-0.03=0.13/38298 =       3.5 usec/iter.
index                  took 0.49-0.07=0.42/100000 =      4.2 usec/iter.
assert_unit            took 1.65-0.00=1.65/1525 =     1080.0 usec/iter.
access_unit            took 0.28-0.02=0.26/26471 =       9.8 usec/iter.
slow_access_unit       took 0.02-0.00=0.02/1607 =       10.0 usec/iter.
setof                  took 0.16-0.01=0.16/7692 =       20.5 usec/iter.
pair_setof             took 0.21-0.01=0.20/6522 =       31.1 usec/iter.
double_setof           took 0.30-0.00=0.30/1837 =      163.9 usec/iter.
bagof                  took 0.17-0.01=0.16/10112 =      16.0 usec/iter.
34 benchmarks took 12.20 seconds
true.

Pereira benchmarks, 9.2.4:

?- benches.
Correct to: "bench_peirera:benches"? yes
tail_call_atom_atom    took 0.66-0.26=0.40/145946 =      2.8 usec/iter.
binary_call_atom_atom  took 0.59-0.17=0.43/94737 =       4.5 usec/iter.
cons_list              took 0.48-0.16=0.32/91525 =       3.5 usec/iter.
walk_list              took 0.62-0.21=0.41/122727 =      3.3 usec/iter.
walk_list_rec          took 0.50-0.22=0.28/125581 =      2.2 usec/iter.
args(1)                took 0.48-0.21=0.28/120000 =      2.3 usec/iter.
args(2)                took 0.43-0.14=0.29/81818 =       3.5 usec/iter.
args(4)                took 0.49-0.10=0.40/54545 =       7.3 usec/iter.
args(8)                took 0.45-0.06=0.39/33333 =      11.6 usec/iter.
args(16)               took 0.47-0.03=0.44/19355 =      22.7 usec/iter.
cons_term              took 0.48-0.15=0.33/84375 =       4.0 usec/iter.
walk_term              took 0.57-0.19=0.37/110204 =      3.4 usec/iter.
walk_term_rec          took 0.62-0.22=0.41/122727 =      3.3 usec/iter.
shallow_backtracking   took 0.75-0.73=0.03/415385 =      0.1 usec/iter.
deep_backtracking      took 0.33-0.10=0.23/59341 =       3.9 usec/iter.
choice_point           took 0.80-0.17=0.64/94737 =       6.7 usec/iter.
trail_variables        took 0.72-0.15=0.57/87097 =       6.5 usec/iter.
medium_unify           took 1.69-1.38=0.31/771429 =      0.4 usec/iter.
deep_unify             took 1.02-0.42=0.60/235161 =      2.6 usec/iter.
integer_add            took 3.26-0.09=3.17/49091 =      64.7 usec/iter.
floating_add           took 2.79-0.07=2.72/40909 =      66.4 usec/iter.
arg(1)                 took 0.21-0.07=0.14/40000 =       3.4 usec/iter.
arg(2)                 took 0.21-0.07=0.14/40909 =       3.5 usec/iter.
arg(4)                 took 0.20-0.07=0.13/37500 =       3.5 usec/iter.
arg(8)                 took 0.20-0.07=0.13/38217 =       3.5 usec/iter.
arg(16)                took 0.20-0.07=0.13/38298 =       3.5 usec/iter.
index                  took 0.61-0.18=0.43/100000 =      4.3 usec/iter.
assert_unit            took 1.66-0.00=1.66/1525 =     1087.9 usec/iter.
access_unit            took 0.31-0.05=0.27/26471 =      10.0 usec/iter.
slow_access_unit       took 0.02-0.00=0.01/1607 =        9.3 usec/iter.
setof                  took 0.17-0.01=0.16/7692 =       20.5 usec/iter.
pair_setof             took 0.21-0.01=0.20/6522 =       31.1 usec/iter.
double_setof           took 0.31-0.00=0.30/1837 =      164.9 usec/iter.
bagof                  took 0.18-0.02=0.17/10112 =      16.4 usec/iter.
34 benchmarks took 16.87 seconds
true.

All pretty consistent, except for integer_add and floating_add, which are about 5 times slower after moving to 9.2.4 (and this is still the case in 9.2.7).

I’m guessing that you ran this without the -O command-line option, which gives me a ~2.5x speedup for integer_add.

I couldn’t find the sources for 9.2.3, so I tried 9.2.0. It got similar numbers to the latest “dev” release (9.3.11) on Linux with gcc 12.2.0. I noticed that for mine the ratio of integer_add to deep_unify is about 30:1, which corresponds to your 9.2.4 results, whereas your 9.2.3 ratio is about 6:1.

I’ve put my outputs here:

(I used these sources - is there a better one? examples/benchmarks/pereira/pereira.pl - prolog-cafe - Git at Google)

I use them from my-prolog-lib/peirera.pl at 3864d4252aa89f82b00b8f255ec6b26d16ec8b2c · JanWielemaker/my-prolog-lib · GitHub. Probably the loop calibration differs, and there are 36 rather than 34 benchmarks. The deep_unify/integer_add ratio is 1:6 without -O and 1:3 with it (using current git source, gcc-14 on Apple Silicon).

I’d suspect that -O is involved somehow. Otherwise there are only minor patches in the stable series, and the 9.2.x releases for Intel are all compiled by the same compiler on the same system. I cannot think of any reason for serious differences.

In both cases I ran the benchmarks immediately after starting up Prolog, and in both cases the optimise flag is set to false.

When I set the optimise flag to true, the *_add benchmarks for both versions take ~6 usec/iter, i.e. twice as fast as 9.2.3 unoptimised and a factor of 10 better than 9.2.4 unoptimised. So AFAICT the slowdown is somewhere in unoptimised arithmetic in 9.2.4.

The changelog shows:

FIXED: Clang handling of max_integer_size. Clang cannot handle setjmp()/longjmp() in PL_next_solution(), but it can in the simple evalExpression() function. This does not have measurable impact on performance, so we enabled this as default. This fixes limiting the integer size for WASM and Clang builds on Linux. This problem did not surface on the MacOS version.

(18d5d539af0807808ba6fee6ec66394aaec589f1 on swipl-devel.git)

This could be involved. I was afraid of a slowdown, but could not measure that on gcc/Linux. Possibly setjmp()/longjmp() on MacOS are a lot slower than on Linux?

Edit: comparing with gcc on Fedora Linux on an Apple M1 gives 5.8 vs 6.5 usec. Looks like it is worthwhile to make this change only on Clang. It seems not all Clang versions are affected either :frowning: