Mac fat binaries slow on Intel

I’ve referred to this issue a few times before in other threads but don’t think I ever saw an explanation/resolution.

The fat Mac binaries supporting Apple silicon run significantly slower on Intel Macs than the old Intel-only binaries. Using the standard benchmark suite in the release on my Intel Mac:

?- version.
Welcome to SWI-Prolog (threaded, 64 bits, version 8.4.1)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

true.

?- bench.
'Program'            Time     GC
================================
boyer               1.040  0.070
browse              1.043  0.000
chat_parser         1.282  0.000
crypt               1.312  0.000
derive              1.505  0.000
fast_mu             1.329  0.000
flatten             1.261  0.000
log10               1.190  0.000
meta_qsort          1.285  0.000
mu                  1.112  0.000
nand                1.252  0.000
nreverse            1.160  0.000
ops8                1.395  0.000
perfect             1.210  0.000
poly_10             1.309  0.000
prover              1.403  0.000
qsort               1.052  0.000
queens_8            1.106  0.000
query               1.229  0.000
reducer             1.288  0.000
sendmore            1.230  0.000
serialise           1.124  0.000
simple_analyzer     1.274  0.000
tak                 1.134  0.000
times10             1.435  0.000
unify               1.283  0.000
zebra               1.115  0.000
sieve               2.306  0.000
queens_clpfd        1.649  0.000
pingpong            1.550  0.000
fib                 3.830  0.000
moded_path          3.809  0.000
det                 1.236  0.000
           average  1.447  0.002
true.

vs.

?- version.
Welcome to SWI-Prolog (threaded, 64 bits, version 8.5.20-DIRTY)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

true.

?- bench.
'Program'            Time     GC
================================
boyer               1.923  0.077
browse              1.898  0.000
chat_parser         1.642  0.000
crypt               1.882  0.000
derive              2.090  0.000
fast_mu             2.071  0.000
flatten             1.928  0.000
log10               1.797  0.000
meta_qsort          1.967  0.000
mu                  1.666  0.000
nand                1.892  0.000
nreverse            1.892  0.000
ops8                2.180  0.000
perfect             1.844  0.000
poly_10             2.076  0.000
prover              2.133  0.000
qsort               1.785  0.000
queens_8            1.841  0.000
query               1.801  0.000
reducer             1.825  0.000
sendmore            1.679  0.000
serialise           1.852  0.000
simple_analyzer     1.794  0.000
tak                 1.719  0.000
times10             2.120  0.000
unify               1.900  0.000
zebra               1.555  0.000
sieve               2.957  0.000
queens_clpfd        2.173  0.000
pingpong            2.379  0.000
fib                 4.676  0.000
moded_path          4.809  0.000
det                 1.772  0.000
           average  2.107  0.002
true.

So that is an average of 1.45 with the Intel-only binary vs. 2.1 with the fat binary. If I set the optimise flag to `true`, it’s 1.23 vs. 1.82.
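To put a number on the slowdown, here is a quick sketch that divides the fat-binary average by the Intel-only average, using the figures from the two runs above. The helper name `slowdown` is made up for illustration.

```shell
#!/bin/sh
# Ratio of fat-binary time to Intel-only time (higher means slower).
slowdown() {
    # $1 = Intel-only average in seconds, $2 = fat-binary average in seconds
    awk -v base="$1" -v fat="$2" 'BEGIN { printf "%.2f\n", fat / base }'
}

slowdown 1.447 2.107   # default settings
slowdown 1.23 1.82     # optimise flag set to true
```

That works out to about 1.46 and 1.48, so the fat binary takes roughly 45-50% longer on this machine either way.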

So any prospect of getting the new Mac fat binary releases to perform as well as old ones on Mac Intel?

The explanation is rather simple: the fat binaries are compiled using Apple Clang (on an M1 machine) rather than GCC. While Clang is in general only a little slower than GCC, it does a rather poor job compiling the giant switch that realizes SWI-Prolog’s virtual machine and, unlike with GCC, using PGO (Profile Guided Optimization) does not help.

Build the system yourself using GCC. At some point we can possibly replace Clang with GCC for building the fat binaries. When I tried it, GCC was not able to produce a correctly working M1 executable (it compiled, but failed on several tests).

Is Apple aware of the problem?

Is this process documented anywhere for the naive user (such as myself)? Including the bundled app? My problem is that I don’t have a lot of confidence in anything I build locally. I just blindly follow the build instructions at SWI-Prolog -- Installation on Linux, *BSD (Unix)

Perhaps at least the stable version could be built with GCC (for Intel only use) for performance comparison with past versions?

Maybe bugs in Rosetta (running Intel binaries on AS)?

Probably at some level. Since I sent a totally trivial bug report and it took them a year to respond that they acknowledged the bug (without a fix), I won’t try :frowning: It is first of all a difference between Clang and GCC. Both have their merits. On most programs the performance difference is small. On the coding style used by the VM (a giant routine with a big switch and quite a bit of jumping from one instruction to another), GCC is clearly superior. Possibly Clang has other tricks to get better performance. The main development platform has always been GCC.

Just building the command-line versions with the X11 tools is pretty easy. Building the bundle is harder. See doc/Release.md for a lot of info, but unfortunately not all details are there and some might be outdated.

No. The issue was with the M1 (AS) version. That could well have been improved with more recent GCC versions.

Which makes this all the more frustrating. But I do sympathize with your point in getting Apple to do anything, having been down that road before with different Apple software.

So doc/Release.md is pretty much incomprehensible to me.

I do have a version of gcc:

$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

As an increment on the standard SWI-Prolog – Installation on Linux, *BSD (Unix) what do I need to do to build a performant version for Intel only? Perhaps the pre-fat binary build scripts can be retrofitted to work with current source?

??

If we can get Windows development releases built nightly, why can’t we get at least stable Mac GCC/Intel only binaries built every year or so?

No. It says Apple LLVM version 10.0.0 (clang-1000.10.44.4) and is simply clang called gcc. You can install real gcc using Macports or Homebrew.
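A minimal sketch of how to tell the two apart from the `--version` text. The function name `classify_compiler` is hypothetical, and the patterns are only a heuristic: Clang mentions "clang" or "LLVM", while GNU GCC prints a Free Software Foundation copyright line.

```shell
#!/bin/sh
# Classify a compiler from the text printed by `<compiler> --version`.
# Reads the version text on stdin and prints "clang", "gcc" or "unknown".
classify_compiler() {
    version_text=$(cat)
    case "$version_text" in
        *clang*|*LLVM*) echo clang ;;
        *"Free Software Foundation"*) echo gcc ;;
        *) echo unknown ;;
    esac
}

# Check whatever `gcc` resolves to on this machine, if it exists at all.
if command -v gcc >/dev/null 2>&1; then
    gcc --version 2>&1 | classify_compiler
fi
```

Fed the output quoted above, this should report `clang`. The same check can be pointed at a MacPorts or Homebrew compiler by piping its `--version` output into the function.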

Just download the source. Installing dependencies is described at Building SWI-Prolog on MacOSX. In Macports, gcc is called e.g. gcc-mp-11, and the build is done using these basic options, other options at your choice (location to install, etc.)

CC=gcc-mp-11 cmake -DCMAKE_BUILD_TYPE=PGO -G Ninja ..
ninja

I don’t want to confuse the downloads even further. Eventually I hope gcc will work to make fast universal binaries. Possibly that is already possible. As for nightly builds, we use a Linux cloud server for that using cross-compilation. Running a MacOS M1 machine in the cloud is remarkably expensive :frowning: Possibly, cross-compilation is also an option. If someone sorts it out, I’m happy to use that.

And this is where I go off the rails. Trying to install or update anything with Macports immediately generates a permissions error. After some fiddling with permissions, I eventually got to:

$ port -v selfupdate
--->  Updating MacPorts base sources using rsync

Willkommen auf dem RSYNC-server auf ftp.fau.de.
Nicht all unsere Mirror sind per rsync verfuegbar.

Welcome to the RSYNC daemon on ftp.fau.de.
Not all of our mirrors are available through rsync.


receiving file list ... done
rsync: mkstemp "/opt/local/var/macports/sources/rsync.macports.org/macports/release/tarballs/.base.tar.xDSfTF" failed: Permission denied (13)
inflate returned -3 (37 bytes)
rsync error: error in rsync protocol data stream (code 12) at /BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-52.0.1/rsync/token.c(419) [receiver=2.6.9]
rsync: connection unexpectedly closed (31 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at /BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-52.0.1/rsync/io.c(453) [generator=2.6.9]
Command failed: /usr/bin/rsync -rtzvl --delete-after rsync://rsync.macports.org/macports/release/tarballs/base.tar /opt/local/var/macports/sources/rsync.macports.org/macports/release/tarballs
Exit code: 12
Error: Error synchronizing MacPorts sources: command execution failed
Error: Follow https://guide.macports.org/#project.tickets if you believe there is a bug.
Error: /opt/local/bin/port: port selfupdate failed: Error synchronizing MacPorts sources: command execution failed

What am I to make of this??

I’m quite capable of blindly copying commands to the terminal but have no idea what’s happening under the covers. That’s all on me, but don’t expect it to change. And I think you’re asking a lot for non-C professionals to start building an application as complex as SWIP from source.

Since I’m incapable of building binaries I can trust, all I see is a significant performance regression on a supposedly supported platform that’s persisted for a year. I was very happy with a performant, bundled SWIP application that meant I didn’t have to touch the terminal app, but I’m not all that satisfied with the status quo. At the very least there should be an open issue so this doesn’t get buried by other development activities.

Macports typically requires some maintenance after MacOS upgrades, etc. That is something to figure out on the Macports forum. If you have left it as is for years I’d consider removing and reinstalling.

For now, I’m afraid I am too busy to figure out whether a GCC based fat binary is now possible. I wonder whether PGO optimization goes together with fat binaries anyway? It would be great if some Mac user would volunteer to figure this out. FYI, the release build instructions are in docs/Release.md and the scripts to do the final job are in scripts/make-distribution.

I have pushed a number of fixes that allow compiling using gcc-12 on the M1. Here the advantage of GNU GCC over Apple Clang (and Apple’s gcc command) is a lot smaller than on Intel, about 7%. Passes the tests. As there were several problems I’d not trust the result too much though.

The disadvantage is that GNU GCC cannot create “fat” binaries. This means we must create arm64 and x86_64 binaries separately and create fat binaries using the lipo tool. That is still fine, but GNU GCC for M1 cannot target x86_64, so we need (to build?) a cross-compiler. If that is done we must hope PGO is supported when cross-compiling (might be as the M1 can run x86_64 binaries). Looks pretty complicated :frowning:
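The lipo step itself would look roughly like the sketch below. The function name `make_fat`, the `DRY_RUN` switch and the build paths are all made up for illustration, and presumably every Mach-O file in the distribution (executable, shared library, foreign modules) would need the same treatment.

```shell
#!/bin/sh
# Sketch: merge two single-architecture builds into one fat binary.
# Set DRY_RUN=1 to print the lipo command instead of running it.
make_fat() {
    # $1 = arm64 binary, $2 = x86_64 binary, $3 = output file
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "lipo -create -output $3 $1 $2"
    else
        # -info lists the architectures that ended up in the result
        lipo -create -output "$3" "$1" "$2" && lipo -info "$3"
    fi
}

# Hypothetical paths; the real build tree layout will differ.
DRY_RUN=1
make_fat build.arm64/src/swipl build.x86_64/src/swipl swipl-fat
```

This only covers the merge. Getting a GCC-built x86_64 binary to feed into it is the hard part described above.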

If anyone has a better route, please share your thoughts.

Thanks for pushing on this a bit.

The biggest stumbling block seems to be building a GCC x86_64 version of SWIP on an M1 Mac. This requires a cross-compiling GCC (to x86_64) and an additional step (lipo) to build a fat binary. And PGO requires running the x86_64 build on the M1 architecture.

The cross compiling (and PGO?) could be replaced by building on an Intel Mac but that may not be feasible.

Hard to simplify things when you’re trying to compile for and run two different architectures using a single piece of hardware. (Apple provides a solution, just not the optimal one.)

Another approach is to simplify/fool-proof the build process of a standard configuration to the extent that dummies (like me) can successfully use it. I think I can manage the basic clone/cmake/ninja build process fine, but setting up the tools/prerequisites gets a bit much if anything goes wrong. Even a script which checked for the availability of required software, including compilers etc., in standard places and provided useful diagnostics would help. (It would appear(?) cmake does some of this but the results are buried in hundreds of lines of output so it’s hard to know what’s relevant.)
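Something along these lines is what I have in mind, as a rough sketch only. The function name `check_tools` is made up, and the tool list is just an example based on the commands mentioned earlier in this thread, not the real list of prerequisites.

```shell
#!/bin/sh
# Report which of the given commands are on PATH.
# Returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool" 2>/dev/null); then
            echo "ok       $tool ($path)"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example list only; gcc-mp-11 is the MacPorts compiler name mentioned above.
check_tools cmake ninja gcc-mp-11 || echo "Install the missing tools before running cmake."
```

A one-line verdict per tool is much easier to act on than scanning the cmake output for what went wrong.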

And the status quo isn’t terrible, but I wanted to give this issue some visibility.

Just wanted to add a data point - one of my benchmarks using a PEG grammar to parse a 1000 line JSON file:

SWIP 8.4.1 (Intel only) - 4 ms., 10 MLips
SWIP 9.1.8 (fat binary, “big switch” WAM) - 6 ms., 6.6 MLips
SWIP 9.1.10 (fat binary, “function based” WAM?) - 6.4 ms., 6.2 MLips

MacOS 10.13.6

Seems that on Apple Silicon it gets faster and on Intel it gets slower :frowning: I don’t see a good solution to this. One option might be for someone to volunteer and set up building a fast version for Mac/Intel. Currently this (AFAIK) requires a Mac/Intel machine. It might be possible to set up the build in Docker using Darling.