@Wisermans posted an interesting challenge. Since he deleted it and it will be removed automatically, I want to preserve his challenge here.
@Wisermans wrote: Brute performance can be analyzed by looking at the generated machine code when it is a question of ticks and cache, where the processor matters at a low level … BUT on bigger projects what matters is the way the programmer does the job / the readability / the upgradability of the code in the long run …
Moreover, when you manage huge amounts of data, the architecture is most of the time more important than the processor itself … memory allocation, caching and indexing are often even more important than bandwidth itself … knowing that most “young” programmers live in luxury and most of the time optimize nothing, spending most of their time using libraries written by others without even understanding how they really work.
As for brute performance, in the 90s I was thinking of getting Forth encapsulated into some Prolog, as it is close to assembly, easily portable, and I prefer RPN / stacks / etc. to all those brackets that you see in most languages to do silly maths. Looking at what Jan did with dicts, they both share a dictionary approach.
Compared to Forth, C / C++ / Java are blabla languages already “far” from pure processor performance. Moreover, a funny thing about modern processors is that they are like “full computers” on a chip. With the modern processors we have now, you could fit a Forth system into one core.
Looking at Prolog the same way, there could certainly be a core-level optimization too to gain some ticks, but then why not also think about using GPUs? And take into account the fact that you can upload code into graphics cards, where memory is MUCH quicker too (same principle as mining systems) … which means making a dedicated core-level optimization … and with such an approach your Prolog system would be beating any other system by much more than just a few ticks …
Once again … architecture matters … by the way, I am curious to see what the new Tensor processors from Google are going to implement. Another funny thing would be to look at the underused GPU features of Intel processors … when everything is on one chip, it always works much quicker.
I was around when the Fifth Generation project was going on, and also similar projects in the UK and Europe. The various attempts to improve performance either had overheads that they hadn’t considered (Strand comes to mind, where the costs of setting up the parallelism often outweighed the performance increase – similar to how concurrent_maplist/3 is great sometimes but blind use leads to performance degradation), or made significant modifications to the language (Erlang has probably done a better job of this; but it’s not Prolog).
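To make the concurrent_maplist/3 point concrete, here is a minimal sketch (square/2 and compare_maplist/1 are made-up names, not from anyone's code): when the mapped goal is very cheap, the cost of creating and joining threads tends to dominate, so the concurrent version can come out slower than plain maplist/3.

```prolog
:- use_module(library(apply)).   % maplist/3
:- use_module(library(thread)).  % concurrent_maplist/3

% Deliberately cheap goal: the parallel version pays thread setup/teardown
% for work that takes almost no time per element.
square(X, Y) :- Y is X * X.

compare_maplist(N) :-
    numlist(1, N, Xs),
    time(maplist(square, Xs, _Seq)),             % sequential
    time(concurrent_maplist(square, Xs, _Par)).  % parallel, often slower here
```

Running something like `?- compare_maplist(100000).` typically shows the concurrent version doing more work for little or no gain; it only pays off when each goal does substantial computation.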
Aquarius had a significant compiler written in Prolog, which is how it got good performance on some benchmarks … more work in that direction, including type inferencing, might yield the best results. But this also leads to a more complex VM, because the best performance is attained by compiling down to untagged/unboxed items. (BTW, the early SPARC designs had some tag manipulation instructions - intended for LISP implementations - but I’m not aware of anyone actually using them.)
- existing = multitasking: you need to supervise it, so the question is whether multitasking was really needed and whether the need is not just about pure speed (higher frequency, register optimization, caching, throughput), which is also about how SWI-Prolog could work even quicker
- not existing = GPUs: useful for ML, NLP, signal processing, sound and image processing, etc. (or the color palette example)
- not existing = distributed architecture: can be useful for crawling, scraping, etc., and any work that can be processed in parts / on different systems or externalized, and that could make different SWI-Prolog systems interact in an easy way (one existing route is sketched below)
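On that last point, SWI-Prolog already ships with library(pengines) for letting Prolog systems talk to each other over HTTP. A minimal sketch of what that interaction could look like (the URL and crawl_page/2 are placeholders; the remote server would have to define and expose such a predicate):

```prolog
:- use_module(library(pengines)).

% Run a goal on a remote SWI-Prolog instance and backtrack over its answers.
% 'http://remote-host:3030' and crawl_page/2 are placeholders, not real
% endpoints; the remote pengine server must provide crawl_page/2.
remote_crawl(Page, Result) :-
    pengine_rpc('http://remote-host:3030',
                crawl_page(Page, Result),
                [chunk(100)]).
```

Since pengine_rpc/3 behaves much like a local call, several SWI-Prolog instances can be composed with ordinary Prolog control constructs, which covers a fair amount of the "process by parts on different systems" idea.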