Time discrepancies when using concurrent_maplist

Hi!

The work described here was done in collaboration with Alfredo Capozucca from University of Luxembourg.

We are seeing time discrepancies when running the same goals with concurrent_maplist compared to running them sequentially (no parallelization).

Part of the code we use is as follows (the full program can be found here):

countSdk4(S) :-
    open('log.txt', write, Log),
    findall(_,                                          % failure-driven loop over all batches
            ( findnsols(S, Y, between(1, 100, Y), L1),  % next batch of S indices
              get_time(Ti),
              concurrent_maplist(sudoku4234(4), L1),    % solve the batch in parallel
              get_time(Tf),
              T is Tf - Ti,                             % wall time for this batch
              write(Log, T),
              write(Log, '\n')
            ),
            _),
    close(Log).

where sudoku4234/2 is a computationally hard goal that finds all the solutions of a 4x4 sudoku.

So, if we call countSdk4(1), each findnsols call returns only one solution, L1 is a single-element list, and concurrent_maplist(sudoku4234(4),L1) behaves essentially like maplist, so only one processor is used. Through get_time we then measure the time needed to solve each individual call to sudoku4234.

In a second experiment we call countSdk4(8), which means that batches of 8 calls to sudoku4234 are run in parallel on 8 processors. In this case we measure the time needed to solve each batch of 8 calls.
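For reference, here is a minimal sketch of how the per-call times could also be recorded inside the parallel batch (the wrapper name time_one/1 is ours, purely illustrative):

time_one(Goal) :-                  % time a single goal and report it
    get_time(T0),
    call(Goal),
    get_time(T1),
    T is T1 - T0,
    format(user_error, '~w took ~3f s~n', [Goal, T]).

% used e.g. as: concurrent_maplist([X]>>time_one(sudoku4234(4, X)), L1)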

On machines with more than 8 processors we expected the time needed to solve a batch of 8 calls to be close to the maximum of the times needed to solve those same 8 calls sequentially (i.e. when countSdk4(1) is called). Instead, we consistently measure 20-30% more time than that maximum. For instance, these are the times (in seconds) needed to solve the first 8 calls to sudoku4234 on a Mac laptop with 12 processors:

33.0132069587707
33.0883748531341
32.6344561576843
32.7681801319122
32.6260249614715
32.5810809135437
26.2362399101257
31.1244869232177

while the same goals run in parallel took 42.5555839538574 seconds. As you can see, the maximum of these 8 times is 33.0883748531341, meaning that running them in parallel took about 28% more time than that maximum. The full data set can be found here.
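Spelling out the comparison:

    42.5555839538574 / 33.0883748531341 ≈ 1.286

i.e. the parallel batch takes roughly 28.6% more time than the slowest of the sequential calls.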

Are we doing anything wrong? Is our expectation wrong? Can anyone tell us why this is happening?

Although we post the sudoku-based experiments here, we intend to use the same technique with {log} (setlog) (www.clpset.unipr.it/setlog.Home.html), a satisfiability solver for set theory and set relation algebra implemented in Prolog. The goals to be parallelized in {log} take far more time than solving 4x4 sudokus, and the discrepancies are even higher than those shown here. This behavior would make the parallel approach much less useful.

Thanks!

Regards,
Maxi

Your theory is fine, but as usual, practice is sometimes a little different. There are various factors that can contribute to less-than-expected performance. To name a few:

  • The CPU throttles down to prevent overheating.
  • Related, when using a single core, most systems will boost the clock for this single core and move the task regularly to a different core to prevent overheating.
  • Also related, some systems have fast and (slower) energy saving cores.
  • Memory bandwidth may become a limiting factor.
  • Finally, there may be some amount of synchronization between the cores. That depends on the Prolog program; it should be rather minimal on nice pure Prolog code. (One way to check this is sketched below.)
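One way to get more insight is to compare CPU time and wall time per goal inside the parallel run. This is only a sketch (cpu_wall/1 is an illustrative name, not an existing predicate): if CPU time and wall time both grow by a similar amount compared to the sequential run, the cores themselves got slower (throttling, shared hyperthreads, memory stalls); if wall time grows much more than CPU time, the threads are blocked or descheduled, which points at locking or an oversubscribed machine.

cpu_wall(Goal) :-
    statistics(thread_cputime, C0),    % CPU time of the calling thread (seconds)
    get_time(W0),                      % wall clock time (seconds)
    call(Goal),
    statistics(thread_cputime, C1),
    get_time(W1),
    CPU is C1 - C0,
    Wall is W1 - W0,
    format(user_error, 'cpu=~3f wall=~3f~n', [CPU, Wall]).

% e.g. concurrent_maplist([X]>>cpu_wall(sudoku4234(4, X)), L1)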

Thanks a lot for the quick answer!

Maxi

My rule-of-thumb for doing anything in parallel is to expect 10-30% overhead. E.g., if I do a “parallel make” with 8 tasks on an 8-CPU machine, where every task takes 10 seconds, I would expect the elapsed time to be ~12 seconds. (And I often see timing discrepancies of 10-20% when re-running benchmarks, so a difference of even 30-40% wouldn’t surprise me.)
I first observed such behaviours back in the days of dinosaurs (mainframes were the only multi-core/CPU devices at the time); if anything, the variability of timings has increased over the years.


There is a large YMMV factor here. With parallel builds we should be free of memory synchronization issues, assuming each step is a single-threaded process. Early versions of SWI-Prolog had a lot of synchronization, typically slowing down rather than speeding up beyond 4-8 threads. That is a lot better now and I’ve measured up to 80 times speedup on 128 cores/threads (PPC hardware).

But nowadays, and in particular on consumer-grade CPUs, we see boosting of single-core workloads, thermal throttling, hyperthreading, fast and low-energy cores, etc. All of that favors performance on tasks with fewer active threads.

Also the OS and compiler play a role. Linux seems to do a good job here.

I’ve seen one system with hyperthreading (4 CPUs or 8 hyperthreads), where my benchmarks showed that turning off hyperthreading was actually faster - that is, hyperthreading introduced 100% overhead.

It used to be that if you ran a CPU-bound job in the background on a single-CPU machine, Windows UI would be very sluggish; whereas Linux (or Solaris) would barely show any difference in responsiveness.
Execution models can matter a lot also - the original WhatsApp was written in Erlang and got extremely good scaling. I’ve seen improvements of ~10x on a single multi-core machine by changing the code from multi-threading to cooperative multi-tasking.

(On the other end, I’ve seen enormous slow-downs by being too aggressive with multi-threading - Strand being an example.)

That should be caused by too much memory synchronization. I guess there are cases where this is unavoidable. In other cases the choice of data structures and the design of the synchronization are important. Old SWI-Prolog used a lot of reference counting on data shared between threads. The use of reference counts is now limited to places where the reference count doesn’t change often. That was a huge improvement.

In this particular case, the slow-down with threading seems to have been due to the costs of context switching, probably TLB flushes, but possibly data cache misses also. Probably some tuning of the kernel parameters could have improved things somewhat; but cooperative coroutining was a definite winner. (This can often lead people to use ugly async code [the existing Python and Javascript libraries encourage this] … the Erlang style of passing around state with the event handlers seems to give high performance and easy-to-understand code.)

I’ve also sped up some Java code that had been written by a programmer who seemed to believe that sprinkling threads throughout the code would magically improve performance (Java synchronized methods can lead novices into making such mistakes) – I removed some threads and got close to an order of magnitude performance boost (similar to how @jan sped up the old SWI-Prolog by reducing the needs for memory synchronization). This performance tuning was challenging because the code had to perform well both for the situation of having many connections with low data rates and also for the situation of having a few connections with high data rates. (In addition, I got some significant performance improvements by rewriting some C++ code in Java - and this was in the days before JITs.)

Performance tuning parallel code is even more challenging than optimizing regular code - there are very few profiling tools, and fixing performance problems can often require significant design changes. On top of that, the bottlenecks in CPUs, memory architectures, network, etc. change every few years.