MacOS performance degradation

@ridgeworks discovered that 8.1.26 is up to nearly twice as fast as the current version on MacOS. I first kind of didn’t believe it, but in the end it proves true. What changed is that I installed the latest (10.15.4) MacOS version on the build machine and I got two Xcode updates since. Current version using clang 11.0.3. I don’t know the other versions

See this on the old but still useful Pereira benchmark suite below.

Anyone has a clue what causes this? Did Apple really mess up so badly or did they change some default flag/added additional security/… that causes this? Hints on how to get back to good old performance are welcome.

8.1.26 from download page

tail_call_atom_atom    took 0.38-0.09=0.29/145946 =      2.0 usec/iter.
binary_call_atom_atom  took 0.41-0.06=0.35/94737 =       3.7 usec/iter.
cons_list              took 0.29-0.06=0.24/91525 =       2.6 usec/iter.
walk_list              took 0.36-0.07=0.28/122727 =      2.3 usec/iter.
walk_list_rec          took 0.34-0.08=0.26/125581 =      2.1 usec/iter.
args(1)                took 0.34-0.07=0.26/120000 =      2.2 usec/iter.
args(2)                took 0.29-0.05=0.24/81818 =       3.0 usec/iter.
args(4)                took 0.28-0.03=0.25/54545 =       4.5 usec/iter.
args(8)                took 0.31-0.02=0.29/33333 =       8.7 usec/iter.
args(16)               took 0.32-0.01=0.31/19355 =      15.8 usec/iter.
cons_term              took 0.30-0.05=0.24/84375 =       2.9 usec/iter.
walk_term              took 0.33-0.07=0.26/110204 =      2.4 usec/iter.
walk_term_rec          took 0.48-0.08=0.41/122727 =      3.3 usec/iter.
shallow_backtracking   took 0.29-0.27=0.02/415385 =      0.0 usec/iter.
deep_backtracking      took 0.23-0.04=0.20/59341 =       3.3 usec/iter.
choice_point           took 0.66-0.06=0.60/94737 =       6.3 usec/iter.
trail_variables        took 0.56-0.06=0.51/87097 =       5.8 usec/iter.
medium_unify           took 0.79-0.51=0.29/771429 =      0.4 usec/iter.
deep_unify             took 0.70-0.15=0.55/235161 =      2.3 usec/iter.
integer_add            took 0.63-0.03=0.60/49091 =      12.2 usec/iter.
floating_add           took 0.53-0.03=0.49/40909 =      12.1 usec/iter.
arg(1)                 took 0.30-0.03=0.27/40000 =       6.8 usec/iter.
arg(2)                 took 0.31-0.03=0.28/40909 =       6.9 usec/iter.
arg(4)                 took 0.28-0.03=0.26/37500 =       6.9 usec/iter.
arg(8)                 took 0.28-0.03=0.26/38217 =       6.8 usec/iter.
arg(16)                took 0.28-0.03=0.26/38298 =       6.8 usec/iter.
index                  took 0.57-0.07=0.50/100000 =      5.0 usec/iter.
assert_unit            took 1.03-0.00=1.03/1525 =      676.1 usec/iter.
access_unit            took 0.32-0.02=0.30/26471 =      11.4 usec/iter.
slow_access_unit       took 0.02-0.00=0.02/1607 =       11.2 usec/iter.
setof                  took 0.20-0.01=0.20/7692 =       25.4 usec/iter.
pair_setof             took 0.23-0.01=0.22/6522 =       34.3 usec/iter.
double_setof           took 0.46-0.00=0.46/1837 =      251.5 usec/iter.
bagof                  took 0.17-0.01=0.16/10112 =      16.2 usec/iter.
34 benchmarks took 11.16 seconds

8.1.26 rebuilt using latest clang

tail_call_atom_atom    took 0.75-0.13=0.62/145946 =      4.2 usec/iter.
binary_call_atom_atom  took 0.73-0.08=0.65/94737 =       6.9 usec/iter.
cons_list              took 0.68-0.08=0.60/91525 =       6.6 usec/iter.
walk_list              took 0.88-0.11=0.77/122727 =      6.2 usec/iter.
walk_list_rec          took 0.76-0.12=0.64/125581 =      5.1 usec/iter.
args(1)                took 0.69-0.11=0.58/120000 =      4.8 usec/iter.
args(2)                took 0.65-0.07=0.58/81818 =       7.1 usec/iter.
args(4)                took 0.69-0.05=0.64/54545 =      11.7 usec/iter.
args(8)                took 0.76-0.03=0.73/33333 =      22.0 usec/iter.
args(16)               took 0.86-0.02=0.84/19355 =      43.6 usec/iter.
cons_term              took 0.73-0.09=0.64/84375 =       7.6 usec/iter.
walk_term              took 0.89-0.12=0.77/110204 =      7.0 usec/iter.
walk_term_rec          took 1.08-0.11=0.97/122727 =      7.9 usec/iter.
shallow_backtracking   took 0.42-0.38=0.04/415385 =      0.1 usec/iter.
deep_backtracking      took 0.44-0.05=0.39/59341 =       6.5 usec/iter.
choice_point           took 1.10-0.09=1.01/94737 =      10.7 usec/iter.
trail_variables        took 0.97-0.08=0.89/87097 =      10.3 usec/iter.
medium_unify           took 0.98-0.72=0.26/771429 =      0.3 usec/iter.
deep_unify             took 0.74-0.22=0.52/235161 =      2.2 usec/iter.
integer_add            took 0.85-0.04=0.81/49091 =      16.4 usec/iter.
floating_add           took 0.74-0.04=0.70/40909 =      17.2 usec/iter.
arg(1)                 took 0.49-0.04=0.46/40000 =      11.5 usec/iter.
arg(2)                 took 0.51-0.04=0.47/40909 =      11.5 usec/iter.
arg(4)                 took 0.47-0.03=0.43/37500 =      11.5 usec/iter.
arg(8)                 took 0.48-0.04=0.44/38217 =      11.6 usec/iter.
arg(16)                took 0.48-0.04=0.45/38298 =      11.7 usec/iter.
index                  took 0.95-0.09=0.86/100000 =      8.6 usec/iter.
assert_unit            took 0.81-0.00=0.81/1525 =      531.1 usec/iter.
access_unit            took 0.57-0.02=0.54/26471 =      20.5 usec/iter.
slow_access_unit       took 0.03-0.00=0.03/1607 =       19.9 usec/iter.
setof                  took 0.24-0.01=0.24/7692 =       30.7 usec/iter.
pair_setof             took 0.27-0.01=0.26/6522 =       40.2 usec/iter.
double_setof           took 0.54-0.00=0.54/1837 =      292.9 usec/iter.
bagof                  took 0.24-0.01=0.23/10112 =      22.5 usec/iter.
34 benchmarks took 19.42 seconds

Hi Jan,

Just in case it helps, Darwin19/Xcode-11 broke compilation for Ciao (and other projects). Our last resort was using the -fno-stack-check option, which we found by looking at how other systems were dealing with the same problem.

Cheers,
–Jose

Hi Jose,

Thanks. It was worth the try, but adding -fno-stack-check seems to have no significant impact on the performance. SWI-Prolog compiles and runs fine using Xcode-11, just at almost half the speed :frowning: If you say “broke” I guess you mean something different than Ciao being just slower?

Using gcc 9 from Macports as follows

sudo port install gcc9

And next build using

CC=gcc-mp-9 cmake -G Ninja ..
../script/pgo-compile.sh
ninja

We get the timings below. Without PGO the total time is 12.75, so we have

compiler time
old Xcode 11.16
Xcode 11 19.42
gcc 9 12.75
gcc 9 + PGO 10.05

Paulo and me never got PGO to work correctly with Apple clang. When we finally got it to compile without warnings the result became slower, so we gave op.

The whole Mac bundle build works fine using gcc 9 (and PGO), so I guess we should use that from now on.

6 ?- bench_peirera.
tail_call_atom_atom    took 0.37-0.07=0.30/145946 =      2.1 usec/iter.
binary_call_atom_atom  took 0.32-0.04=0.27/94737 =       2.9 usec/iter.
cons_list              took 0.28-0.04=0.24/91525 =       2.6 usec/iter.
walk_list              took 0.35-0.06=0.30/122727 =      2.4 usec/iter.
walk_list_rec          took 0.34-0.06=0.28/125581 =      2.2 usec/iter.
args(1)                took 0.32-0.06=0.27/120000 =      2.2 usec/iter.
args(2)                took 0.31-0.04=0.27/81818 =       3.3 usec/iter.
args(4)                took 0.34-0.03=0.31/54545 =       5.7 usec/iter.
args(8)                took 0.36-0.01=0.35/33333 =      10.4 usec/iter.
args(16)               took 0.43-0.01=0.43/19355 =      22.0 usec/iter.
cons_term              took 0.28-0.04=0.24/84375 =       2.9 usec/iter.
walk_term              took 0.33-0.05=0.28/110204 =      2.5 usec/iter.
walk_term_rec          took 0.40-0.06=0.34/122727 =      2.8 usec/iter.
shallow_backtracking   took 0.24-0.21=0.03/415385 =      0.1 usec/iter.
deep_backtracking      took 0.17-0.03=0.14/59341 =       2.4 usec/iter.
choice_point           took 0.47-0.04=0.43/94737 =       4.5 usec/iter.
trail_variables        took 0.41-0.04=0.38/87097 =       4.3 usec/iter.
medium_unify           took 0.65-0.37=0.28/771429 =      0.4 usec/iter.
deep_unify             took 0.57-0.11=0.46/235161 =      2.0 usec/iter.
integer_add            took 0.55-0.02=0.53/49091 =      10.7 usec/iter.
floating_add           took 0.56-0.02=0.54/40909 =      13.2 usec/iter.
arg(1)                 took 0.27-0.02=0.25/40000 =       6.2 usec/iter.
arg(2)                 took 0.24-0.02=0.22/40909 =       5.3 usec/iter.
arg(4)                 took 0.21-0.02=0.20/37500 =       5.3 usec/iter.
arg(8)                 took 0.22-0.02=0.20/38217 =       5.3 usec/iter.
arg(16)                took 0.22-0.02=0.20/38298 =       5.2 usec/iter.
index                  took 0.42-0.05=0.38/100000 =      3.8 usec/iter.
assert_unit            took 0.83-0.00=0.83/1525 =      547.5 usec/iter.
access_unit            took 0.23-0.01=0.22/26471 =       8.3 usec/iter.
slow_access_unit       took 0.01-0.00=0.01/1607 =        8.1 usec/iter.
setof                  took 0.15-0.00=0.15/7692 =       19.0 usec/iter.
pair_setof             took 0.18-0.00=0.18/6522 =       27.6 usec/iter.
double_setof           took 0.44-0.00=0.44/1837 =      237.9 usec/iter.
bagof                  took 0.14-0.01=0.14/10112 =      13.4 usec/iter.
34 benchmarks took 10.05 seconds

Sorry I don’t have a Mac to profile the different runtimes, so the following will probably be irrelevant. In case, forgive for the noise…

Just noticed that medium_unify and deep_unify did slightly better in the new runtime.
My guess is that clauses timestamps could be involved and could cause degradation.
If I understood correctly the docs, they have a fundamental role in the ‘logical update view’ of Prolog database, that should be irrelevant when doing unification (apart attributed variables).

Yes, “broke” means that Xcode was generating incorrect code. We saw no performance degradation in new Xcode but we’ll keep an eye on it.

Odd. I have a impression the degradation is somewhere in the VM. Do you use gcc’s threaded code extension (get addresses for labels and have the VM code simply use the label address as instruction so you can do goto *PC++ to jump to the next VM instruction?) Old clang also has severe performance issues using this extensions, Maybe they re-introduced slow handling of this …

Anyway, for now I’ll stick with gcc. It does give a lot of additional build dependencies for Macports. Otherwise it is an easy 10% improvement over the old Xcode and almost double performance for the new one.

That is a good point. We have optional (experimental) threaded bytecode, but it is disabled by default. We’ll let you know if we see a similar performance degradation.

Using the threaded bytecode support of GCC which is also supported by clang seems to be the problem. By explicitly disabling this the bench_peirera test completes in 9.06 seconds, which is the fastest we have! Pity they messed up this feature (again, it behaved really poorly when Apple moved to clang and got better later).

Initially threaded bytecode shows a performance improvement of nearly 10% on GCC as far as I recall. I now retested this finding only about 3% for gcc-9. Could be that the 10% was really long ago, before some common sequences of VM instructions were replaced by mega instructions.

Anyway, it seems wise to disable this extension for llvm/clang again and switch back to clang for SWI-Prolog on MacOS.

edit

Seems I made a mistake, by accident running the gcc version and my machine being in a fast mood :slight_smile: (there is quite some variance in the timing). Disabling threaded bytecode does help for clang, but changed the time from 19.4 to 14.8 sec. So it is still useful to disable, but it seems wise to keep building the binary bundles with gcc.

1 Like