Windows vs Linux performance

After configuring Windows 11 in a VirtualBox VM I ran some tests, including the benchmarks. It turns out the Windows version runs about 20% slower than the Linux version. I wonder why. Both are compiled with GCC 11.2 using PGO. This difference shows on bench/run.pl from the sources as well as on another program. I wonder whether this difference is real or whether it is due to running inside a VM. Sure, VMs slow a lot of things down, but I thought not raw CPU work, and running these benchmarks involves very little OS interaction.

Anyone with both Windows and Linux on the same hardware who can confirm this? If virtualization is the problem, we should see the reverse when comparing the native Windows version against the Linux version under WSL2. A dual-boot system, or two identical systems running Linux and Windows, is probably the only safe comparison.

1 Like

Might be Microsoft Defender.

Performance tip: Due to a variety of factors (examples listed below), Microsoft Defender Antivirus, like other antivirus software, can cause performance issues on endpoint devices.

If you use a .bat file for benchmarking, be careful to give it high priority:

start /high <your benchmarking>

start launches a program with Normal priority. start /high launches
it with High priority in the Windows scheduler. When you use /high,
your program is less likely to be preempted by background tasks.
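
For example, a minimal sketch of wrapping the benchmark run this way (the installation path and the extra /wait and /b flags are illustrative; adjust them to your setup):

rem "" is the (empty) window title start expects; /wait blocks until the run finishes, /b reuses the console
start "" /high /b /wait "C:\Program Files\swipl\bin\swipl.exe" -O bench\run.pl -s 10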

I have already seen a 50% degradation when I didn’t use /high. /high might reveal hardware behaviour that Normal priority fails to unlock, and this can show up in a Linux vs. Windows comparison on the same machine, since the Linux scheduler assumes “serious work” by default.

Thanks. Defender is running. start /high makes little difference. I’d expect neither to have much impact: Defender, after all, monitors files and networking I suppose, not raw CPU behaviour? The machine has 16 (virtual) cores and pretty much nothing to do. Practically nothing is installed or running on it, and the task manager claims somewhere between 0 and 2% CPU usage.

The result is rather hard to interpret though. Times are the averages as reported by swipl -O bench/run.pl -s 10.

Platform                                       Time
===================================================
swipl under Linux                             0.060
swipl-win under Linux                         0.060
swipl.exe using Wine                          0.065
swipl-win.exe using Wine                      0.073
swipl.exe using Windows 11 in VirtualBox      0.079
swipl-win.exe using Windows 11 in VirtualBox  0.082

These numbers are rough averages. VirtualBox is a rather hungry application, causing the benchmarks to reproduce rather poorly. The rough picture though seems to be that swipl and swipl-win perform the same on Linux, while swipl-win seems to perform worse on Windows. That could be because Prolog execution does not run on the main thread in swipl-win. Wine seems to do a better job than Windows under VirtualBox.

If you use an ARM processor and run x86 code, it is usually about 50% slower. This has less to do with the operating system and more with how the object files are executed by the CPU. If you have a .exe compiled for ARM, the picture might look different.

But you didn’t tell us which architectures and object file types were involved, so this is just speculation. Also, if you intend to use ARM, you might directly use a Windows ARM machine:

Windows 11 (specifically 24H2 and later, popular in 2025) runs
x86/x64 apps on ARM processors (like Snapdragon X Elite) via Prism.
https://learn.microsoft.com/de-de/windows/arm/apps-on-arm-x86-emulation

But I am not sure what real benchmarks say about Prism: whether it is a pure marketing gag or really gives some bang. Even if a certain CPU has specific Prism support, it might still be weak.

I could possibly run the same tests on a native Windows 11 build using CMake with MSVC (not in a virtualized environment). Currently, the CTest suite passes at about 98% on SWI-Prolog 10.x.y, and I am still working through the remaining issues.

Could you share the test details so I can review them?

This is comparing executables for x86_64, where the Linux version is compiled using GCC 15.2 and the Windows version using MinGW GCC 15.2. Compilation uses the same flags: -O3 and Profile Guided Optimization (PGO).

Everything runs on the same system, an AMD3950X CPU with plenty of memory (128 GB). The system is practically idle. One would assume that a CPU-bound application (none of the tests access files, networking, etc.) performs the same. Timing was done on the bare system for Linux (Fedora 43), and for the Windows version both running the application under Wine and under Windows 11 running inside a VirtualBox VM on the Linux host.

I can see at least some possible reasons for small differences:

  • C function calling convention. Are these the same?
  • AFAIK, PIC code as used in Linux uses an additional register. This should favour Windows though.
  • Different memory allocation library. Though only a couple of the benchmarks use significant malloc()/free().

:clap:

That won’t come out great. I don’t recall the exact figure, but I think the MSVC compiled version didn’t even provide half the performance of the MinGW/GCC version. It might be wise to test configured with both -DVMI_FUNCTIONS=ON and -DVMI_FUNCTIONS=OFF. Typically -DVMI_FUNCTIONS=OFF is the faster one, but on some platforms/compilers it is not.
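
For instance, a minimal configure sketch for trying both settings, assuming a plain Release build in two separate build directories (the build type here is an assumption, not the exact set of options used for the official releases):

rem In an empty build directory build-vmi-on:
cmake .. -DCMAKE_BUILD_TYPE=Release -DVMI_FUNCTIONS=ON
cmake --build .
rem ... and in a second directory build-vmi-off:
cmake .. -DCMAKE_BUILD_TYPE=Release -DVMI_FUNCTIONS=OFF
cmake --build .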

Simply run

swipl[-win].exe -O bench/run.pl -s 10

That writes

'Program'            Time     GC
================================
boyer               0.071  0.006
browse              0.061  0.000
chat_parser         0.077  0.000
...
eval                0.079  0.000
           average  0.061  0.000

My table gave the average number. The -s 10 means “speedup 10 times” (i.e., run 1/10th of the normal iterations per benchmark). Each benchmark has a calibrated number of iterations that made it take about 1 second without -O on the same hardware I still use (last calibration in 2021). Without -O I now get 0.070, so we gained about 30% in 5 years :slight_smile:

1 Like

Could you provide the specific CMake configuration options used for the build so that we can make a fair, apples-to-apples comparison?

I realize you run these builds many times a day, but I don’t have enough context to know which options are appropriate for this test.

For example, I am currently building a Debug configuration so I can debug in the Visual Studio IDE, and I am enabling AddressSanitizer (ASan) to help track down some particularly stubborn bugs.

cmake .. -G "Visual Studio 18 2026" -A x64 -DCMAKE_TOOLCHAIN_FILE=C:/dev/vcpkg/scripts/buildsystems/vcpkg.cmake -DCMAKE_C_FLAGS="/fsanitize=address" -DCMAKE_CXX_FLAGS="/fsanitize=address" -DINSTALL_DOCUMENTATION=OFF -DSWIPL_PACKAGES_ARCHIVE=OFF -DSWIPL_PACKAGES_YAML=OFF -DSWIPL_PACKAGES_PYTHON=OFF -DSWIPL_PACKAGES_PCRE=OFF

cmake --build . --config Debug  --verbose

EDIT

If you need me to do a GCC build, that will have to wait until I finish finding and fixing the issues using MSVC. I don’t currently have GCC installed on this machine, nor am I set up with a working GCC toolchain.

The build should go for maximum performance. That implies a release and definitely no sanitizers. For GCC this uses -O3 at the moment. It also uses Profile Guided Optimization. This implies the core is compiled with instrumentation, the benchmarks are executed and the core is compiled again using the gathered statistics. This should improve branch prediction and hot/cold code. It helps significantly for GCC targeting x86_64. It does not help at all for Clang. Also the gain for GCC targeting arm64 is small. To make it work for MSVC you have to do a little hacking to the CMake configuration.
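
For those unfamiliar with the mechanism, here is a hedged sketch of the underlying GCC flags (file names are illustrative only; the SWI-Prolog CMake build automates these steps):

rem 1. Build with profiling instrumentation
gcc -O3 -fprofile-generate -o prog prog.c
rem 2. Run the training workload (here: the benchmarks); this writes *.gcda profile data
prog
rem 3. Rebuild using the gathered statistics for better branch prediction and hot/cold layout
gcc -O3 -fprofile-use -o prog prog.c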

You can just fetch the release binaries. They are cross-compiled from Linux using MinGW/GCC 15.2. The code for creating the build docker is in GitHub - SWI-Prolog/docker-swipl-build-mingw: Docker to cross-compile SWI-Prolog for Windows

1 Like

I fetched and installed the release binaries, one after the other, for both the stable and dev versions, and neither had the bench directory with run.pl. After finding the bench directory in the GitHub repository, to make sure of what I was seeking, I looked at the version I am building with MSVC and did see the bench directory, e.g.

C:\dev\swipl-devel\bench\run.pl

My take is that I will have to build the GCC version, run the benchmarks for PGO training (a new experience for me), rebuild so PGO can optimize the build, then run the tests again and report.

If there is a way to do the PGO steps using the binary downloads, details would help, maybe I missed something. :person_shrugging:

? The binary downloads are compiled as described (-O3, PGO build). True, the bench directory is not part of the binary distribution. So, get a shell, navigate to the bench directory in the source distribution you have and run (adjust paths):

c:\program files\swipl\bin\swipl.exe -O run.pl -s 10

1 Like

C:\Users\Eric>"C:\Program Files\swipl\bin\swipl" --version
SWI-Prolog version 10.1.1 for x64-win64

C:\Users\Eric>"C:\Program Files\swipl\bin\swipl" -O "C:\dev\swipl-devel\bench\run.pl" -s 10
'Program'            Time     GC
================================
boyer               0.031  0.000
browse              0.047  0.000
chat_parser         0.047  0.000
crypt               0.031  0.000
derive              0.063  0.000
fast_mu             0.031  0.000
flatten             0.047  0.000
log10               0.047  0.000
meta_qsort          0.047  0.000
mu                  0.031  0.000
nand                0.031  0.000
nreverse            0.047  0.000
ops8                0.062  0.000
perfect             0.031  0.000
poly_10             0.031  0.000
prover              0.047  0.000
qsort               0.047  0.000
queens_8            0.016  0.000
query               0.031  0.000
reducer             0.047  0.000
sendmore            0.016  0.000
serialise           0.031  0.000
simple_analyzer     0.047  0.000
tak                 0.047  0.000
times10             0.047  0.000
divide10            0.047  0.000
unify               0.031  0.000
zebra               0.062  0.000
sieve               0.047  0.000
queens_clpfd        0.047  0.000
pingpong            0.078  0.000
fib                 0.125  0.000
moded_path          0.094  0.000
det                 0.047  0.000
eval                0.031  0.000
           average  0.046  0.000

This is the first run.

Sorry, I have no idea if I am supposed to do something for PGO, such as build with GCC, or whether I should just run the test again.

The binary release is built using PGO optimization. There is nothing you need to do to use that. It is a compilation technique that gives the C compiler statistical data about the execution so it can make more informed judgements about the best way to compile each statement.

In this case it is a little dubious as the code from bench is both used for the training and for the testing. This implies that the performance improvement on arbitrary code is most likely smaller than what is measured by the benchmark. On Linux, we gain about 15% on the benchmark compared to e.g., about 9% on CHAT80. Better coverage of the benchmarks should make this difference smaller.

Your hardware is quite a bit faster than mine …

I wouldn’t jump to that conclusion just yet.

As I don’t know the details of the bench code and initially used the directory from my bug fixing, that directory might have contained changes that invalidated the test, such as dropping a count from 5000 to 1300 for a SEGFAULT bug so that the test would run without the SEGFAULT to check for another bug; this has happened on one occasion.

I pulled the bench directory down from the GitHub repository then ran the test.

C:\Users\Eric>"C:\Program Files\swipl\bin\swipl" -O "C:\Program Files\swipl\bench\run.pl" -s 10
'Program'            Time     GC
================================
boyer               0.031  0.016
browse              0.047  0.000
chat_parser         0.047  0.000
crypt               0.031  0.000
derive              0.062  0.000
fast_mu             0.031  0.000
flatten             0.047  0.000
log10               0.047  0.000
meta_qsort          0.047  0.000
mu                  0.047  0.000
nand                0.031  0.000
nreverse            0.063  0.000
ops8                0.031  0.000
perfect             0.016  0.000
poly_10             0.047  0.000
prover              0.047  0.000
qsort               0.031  0.000
queens_8            0.016  0.000
query               0.031  0.000
reducer             0.047  0.000
sendmore            0.031  0.000
serialise           0.031  0.000
simple_analyzer     0.062  0.000
tak                 0.031  0.000
times10             0.047  0.000
divide10            0.078  0.000
unify               0.047  0.000
zebra               0.047  0.000
sieve               0.062  0.000
queens_clpfd        0.062  0.000
pingpong            0.078  0.000
fib                 0.109  0.000
moded_path          0.094  0.000
det                 0.078  0.000
eval                0.047  0.000
           average  0.049  0.000


1 Like

Interestingly, Moore’s Law continues: a 2025 laptop is twice as fast as the Zen 2 AMD Ryzen 3950X from 2019, and uses about half the energy, ca. 45 W versus 105 W:

           average  0.034  0.000

This can be obtained for example with an AMD Ryzen AI 7 350, or with an Intel Core Ultra 7 258V. One key factor seems to be faster RAM, e.g. DDR5X at 8533 MT/s.

P.S.: Results without -s 10, using Windows 11. For example, crypt, query and sendmore stand out, and some others as well, compared to the Intel i7-1195G7 from 2021. Only the AMD Ryzen AI 7 350 results are shown:

'Program'            Time     GC
================================
boyer               0.375  0.016
browse              0.328  0.000
chat_parser         0.375  0.000
crypt               0.172  0.000
derive              0.359  0.000
fast_mu             0.250  0.000
flatten             0.312  0.000
log10               0.359  0.000
meta_qsort          0.344  0.000
mu                  0.344  0.000
nand                0.312  0.000
nreverse            0.344  0.000
ops8                0.297  0.000
perfect             0.172  0.000
poly_10             0.281  0.000
prover              0.375  0.000
qsort               0.234  0.000
queens_8            0.172  0.000
query               0.172  0.000
reducer             0.359  0.000
sendmore            0.156  0.000
serialise           0.281  0.000
simple_analyzer     0.344  0.000
tak                 0.250  0.000
times10             0.406  0.000
divide10            0.453  0.000
unify               0.266  0.000
zebra               0.406  0.000
sieve               0.391  0.000
queens_clpfd        0.359  0.000
pingpong            0.453  0.000
fib                 0.859  0.000
moded_path          0.750  0.000
det                 0.484  0.000
eval                0.344  0.000
           average  0.347  0.000

Thanks. Note that the benchmarks that stand out also stand out on my old hardware. As said, the number of iterations for each benchmark was calibrated in 2021. Part of the improvement is the result of some speedups to arithmetic and clause indexing that affect some benchmarks more than others. Below are the full results for the AMD3950X on Fedora 43. They are also a bit faster than the -s 10 results. That is probably scheduling, as the results are comparable to the -s 10 results after switching the system from the balanced to the performance power profile.

> swipl -O ../bench/run.pl 
Program              Time     GC
――――――――――――――――――――――――――――――――
boyer               0.651  0.051
browse              0.656  0.000
chat_parser         0.743  0.000
crypt               0.330  0.000
derive              0.688  0.000
fast_mu             0.456  0.000
flatten             0.515  0.000
log10               0.644  0.000
meta_qsort          0.681  0.000
mu                  0.585  0.000
nand                0.591  0.000
nreverse            0.602  0.000
ops8                0.635  0.000
perfect             0.291  0.000
poly_10             0.508  0.000
prover              0.704  0.000
qsort               0.418  0.000
queens_8            0.293  0.000
query               0.339  0.000
reducer             0.653  0.000
sendmore            0.240  0.000
serialise           0.466  0.000
simple_analyzer     0.586  0.000
tak                 0.396  0.000
times10             0.692  0.000
divide10            0.745  0.000
unify               0.446  0.000
zebra               0.586  0.000
sieve               0.528  0.000
queens_clpfd        0.694  0.000
pingpong            0.711  0.000
fib                 0.806  0.000
moded_path          0.771  0.000
det                 0.543  0.000
eval                0.724  0.000
           average  0.569  0.001

Yes, Intel CPUs (Eric’s) and AMD Ryzen CPUs (yours and mine) have quite different dynamics. Maybe if you tested samples from, say, 100 different machines, i.e. made a score board like Geekbench has, but for SWI-Prolog, you would find certain patterns. But already obtaining the bench folder requires quite some acrobatics, since it is not part of the distribution.

If you take the average, these results get washed out; for example, my Intel and my AMD have nearly the same average in your benchmark, i.e. 0.033 and 0.034. Why these differences? For example, instruction latency differences, see also https://uops.info/:

Integer Division (64-bit DIV/IDIV)
AMD Zen 5 (Big/Dense Core):      ~13-20c lat (est.)
Intel Lion Cove:                 ~35-60c lat (est.)
Etc.

I wonder whether this implies one should do Intel and AMD PGO separately? I don’t really know, but there are academic papers discussing the matter, like 2507.16649v1. A new complication is the emergence of Windows ARM and Prism.

We only produce a generic binary. If you build on your local machine you can optimize for the target CPU. I once tried that for the AMD3950X, but I was not impressed (I forgot the exact gain).
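
If you want to try that, a minimal sketch, assuming GCC/MinGW and that the flag is simply passed through CMAKE_C_FLAGS (this is not the full set of options the release builds use):

rem Tune the generated code for the CPU of the build machine instead of generic x86_64
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-O3 -march=native"
cmake --build .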

Every CPU/compiler combination comes with its own characteristics and optimal settings …

For fun I used my old performance chart generator with your info (purple), @EricGT’s (green, multiplied by 10 to compensate for -s 10) and Wine on the AMD3950X (blue). You can indeed see that both AMD chips have a comparable distribution while the Intel behaves differently.

The differences are only marginal, since the Prolog test cases do more than benchmark a single CPU instruction, so producing a non-generic binary could be overkill.

These 2025 laptops using AMD or Intel are rather traditional machines when it comes to the CPU. What is interesting are the GPU and the NPU. The Intel machine can do 18 tokens/second, fully locally, using LM Studio with Qwen3-Coder-30b and only 32 GB memory. No Claude, OpenAI, etc. subscription needed: