Something like PyTorch for SWI-Prolog?

@Wisermans posted an interesting challenge. Before it gets automatically deleted, since he deleted it, I want to preserve his challenge.

@Wisermans wrote: Brute performance can be analyzed by looking at the generated machine code when it is a question of ticks and cache, where the processor matters at a low level … BUT on bigger projects what matters is the way the programmer does the job / the readability / the upgradability of the code in the long run …

Moreover, when you manage huge amounts of data, the architecture is most of the time more important than the processor itself … memory allocation, caching and indexing are often even more important than bandwidth itself … knowing that most “young” programmers live in luxury and most of the time optimize nothing, spending most of their time using libraries written by others without even understanding how they really work.

As for brute performance, in the 90s I was thinking of getting Forth encapsulated into some Prolog, as it is close to assembly and easily portable, and I prefer RPN / stack / etc. to all those brackets that you see in most languages to do silly maths. Looking at what Jan did with dicts, they both share a dictionary approach.

Compared to Forth, C / C++ / Java are blabla languages already “far” from the processor's pure performance. Moreover, a funny thing about modern processors is that they are like “full computers” on a chip. With the modern processors we have now you could fit a Forth system into one core :slight_smile:

Looking at Prolog the same way, there could certainly be a core level optimization too, to gain some ticks, but then why not also think about using GPUs ??? and take into account the fact that you can upload code into graphics cards where memory is MUCH quicker too (same principle as mining systems) … which means making a dedicated core level optimization … and with such an approach your Prolog system would beat any other system by much more than just a few ticks …

Once again … architecture matters … by the way, I am curious to see what the new Tensor processors from Google are going to implement. Another fun thing could be to look at the underused GPU features of Intel processors … when everything is on one chip it always works much quicker.

I was around when the Fifth Generation project was going on, and also similar projects in the UK and Europe. The various attempts to improve performance either had overheads that they hadn’t considered (Strand comes to mind, where the costs of setting up the parallelism often outweighed the performance increase – similar to how concurrent_maplist/3 is great sometimes but blind use leads to performance degradation), and/or made significant modifications to the language (Erlang has probably done a better job of this; but it’s not Prolog).

Aquarius had a significant compiler written in Prolog, which is how it got good performance on some benchmarks … more work in that direction, including type inferencing, might yield the best results. But this also leads to a more complex VM, because the best performance is attained by compiling down to untagged/unboxed items. (BTW, the early SPARC designs had some tag manipulation instructions - intended for LISP implementations - but I’m not aware of anyone actually using them.)

Concerning SWI-Prolog, let’s split it in 3 parts:

  1. existing = multitasking = you need to supervise it, so the question is whether multitasking was needed, and whether the need is not just about pure speed (higher frequency, register optimization, caching, throughput), which is also about how SWI-Prolog could work even quicker
  2. not existing = GPUs = useful for ML, NLP, signal processing, sound and image processing, etc. (or the color palette example)
  3. not existing = distributed architecture = can be useful for crawling, scraping etc. and any work that can be processed in parts / on different systems or externalized, and that could make different SWI-Prolog systems interact in an easy way

I keep coming back to something closer to system-level programming, which requires more control over memory allocation and use/access patterns.

This is not the typical big data use case, but it does open up Prolog to a range of (embedded) real-time systems as well …

You could hook into (=..)/2 as follows. Let's say the normal behaviour is:

  • (=..)/2 needs to find the number of arguments in order to create a compound of that size; this is the first pass. You usually loop over the elements, starting with size=0 and then doing size+=1 for each element.
  • (=..)/2, once it has the size, allocates the compound and, in a second pass, fills in the elements of the compound.
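To help visualize the two passes, here is a minimal Python sketch. It is only a model of the idea, not SWI-Prolog's actual implementation: the names `Compound` and `univ_two_pass` are made up for illustration, and a Python list stands in for the Prolog list `[Functor, Arg1, ..., ArgN]`.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Compound:
    """Hypothetical stand-in for a Prolog compound term."""
    name: Any
    args: List[Any]


def univ_two_pass(items: List[Any]) -> Compound:
    """Model of Term =.. [Name, Arg1, ..., ArgN] done in two passes."""
    # Pass 1: walk the argument list just to count the elements.
    size = 0
    for _ in items[1:]:
        size += 1

    # Allocate the argument cells now that the size is known.
    cells: List[Any] = [None] * size

    # Pass 2: walk the list again and fill the cells.
    for i, elem in enumerate(items[1:]):
        cells[i] = elem

    return Compound(items[0], cells)
```

For example, `univ_two_pass(["point", 1, 2, 3])` yields a compound named `point` with three argument cells.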

Then enhance the behaviour as follows:

  • In the first pass, also determine the minimal type. Let's say you have the types byte (8-bit), short (16-bit), int (32-bit) and any. You start with type="byte" and then do type=max_type(elem_type,type) for each element.
  • When allocating the compound, you also take into account that type, not only the size. So there would be CompoundByte, CompoundShort, CompoundInt and CompoundAny.
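A rough Python sketch of this enhanced first pass, again only as an illustration. It assumes signed integer ranges for byte/short/int and uses the standard `array` module as a stand-in for CompoundByte, CompoundShort and CompoundInt; a plain list plays the role of CompoundAny. The names `elem_type`, `max_type` and `univ_typed` are hypothetical.

```python
from array import array
from typing import Any, List, Tuple

# Narrowest to widest; "any" means ordinary boxed cells.
TYPE_ORDER = ["byte", "short", "int", "any"]
RANGES = {
    "byte": (-2**7, 2**7 - 1),
    "short": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
}
# Assumption: 'i' is 32 bits wide, which holds on mainstream platforms.
TYPECODES = {"byte": "b", "short": "h", "int": "i"}


def elem_type(x: Any) -> str:
    """Smallest type name that can hold x."""
    if isinstance(x, int) and not isinstance(x, bool):
        for name in ("byte", "short", "int"):
            lo, hi = RANGES[name]
            if lo <= x <= hi:
                return name
    return "any"


def max_type(a: str, b: str) -> str:
    """The wider of two type names."""
    return a if TYPE_ORDER.index(a) >= TYPE_ORDER.index(b) else b


def univ_typed(items: List[Any]) -> Tuple[Any, str, Any]:
    """Return (name, storage kind, argument storage)."""
    # Pass 1: count the arguments and track the minimal type.
    size, kind = 0, "byte"
    for elem in items[1:]:
        size += 1
        kind = max_type(elem_type(elem), kind)

    # Allocate according to both size and type.
    if kind == "any":
        storage: Any = [None] * size
    else:
        storage = array(TYPECODES[kind], [0] * size)

    # Pass 2: fill the elements.
    for i, elem in enumerate(items[1:]):
        storage[i] = elem
    return items[0], kind, storage
```

With this, `univ_typed(["f", 1, 2, 3])` picks byte storage, while a single atom or large integer among the arguments pushes the whole compound to a wider kind.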

What are the pros and cons of this?

  • One con is that it makes compound access more difficult, which is needed in (=)/2, (==)/2, arg/3, etc… all of Prolog execution. You need possibly to make a compound a data view, as found in Python or JavaScript in their builtin Array type. So that you can access different internal types of compounds transparently.
  • A further con is representing variables in these data views. If you create the compound with functor/3 instead of (=..)/2, your compound will initially be populated with variables. But what if you create a CompoundByte with size=1000? How do you reference 1000 variables if you have only one byte of storage in each element? JavaScript has some tricks for compact data views and null values in its builtin Array type. Maybe some data view overlay could do the job for Prolog variables. Python might also provide something, not sure.
  • A final con is how to support change_arg/3 and friends. If you change a byte to a byte in a CompoundByte, you can go with a mutable data view. But what if you change the element into something larger, or go from something larger to something smaller in many places, or even change something into a variable? JavaScript also has some tricks here in its builtin Array type. Python might also provide something, not sure.
  • One pro is that you are automatically closer to GPUs or single board computers like the Raspberry Pi etc… You could try mapping data to these data views, which are now understood by your Prolog system.
  • A further pro is that you would no longer need libraries such as the ffi pack to bring more compact arrays to an application and to get more control over memory allocation and use/access patterns. The problem is shifted from explicit control to implicit optimization by the Prolog system.
  • A final pro, the data views for (=..)/2 and functor/3 would be a rather dynamic approach to some optimization by the Prolog system. The Prolog system could of course also do static optimizations, optimizing the static clauses of a Prolog text.
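The change_arg/3 concern can be sketched in Python as storage that widens on demand. This is only one possible strategy under stated assumptions: signed byte/short/int ranges, the standard `array` module as compact storage, and a made-up class name `CompactArgs`. It never narrows again, and it does not address the problem of representing unbound variables.

```python
from array import array
from typing import Any, List, Optional

# (typecode, lowest value, highest value), narrowest first.
# Assumption: 'i' is 32 bits wide, which holds on mainstream platforms.
WIDTHS = [
    ("b", -2**7, 2**7 - 1),
    ("h", -2**15, 2**15 - 1),
    ("i", -2**31, 2**31 - 1),
]
CODES = [w[0] for w in WIDTHS]


def needed_code(value: Any) -> Optional[str]:
    """Narrowest typecode that holds value, or None if it needs a boxed cell."""
    if isinstance(value, int) and not isinstance(value, bool):
        for code, lo, hi in WIDTHS:
            if lo <= value <= hi:
                return code
    return None


class CompactArgs:
    """Argument cells that start as bytes and widen when required."""

    def __init__(self, size: int) -> None:
        self.storage: Any = array("b", [0] * size)

    def kind(self) -> str:
        return "any" if isinstance(self.storage, list) else self.storage.typecode

    def set_arg(self, index: int, value: Any) -> None:
        """Destructive update with a 1-based index, loosely like change_arg/3."""
        code = needed_code(value)
        if not isinstance(self.storage, list):
            if code is None:
                # Fall back to boxed cells for anything that is not a small int.
                self.storage = self.storage.tolist()
            elif CODES.index(code) > CODES.index(self.storage.typecode):
                # Copy everything into wider cells.
                self.storage = array(code, self.storage.tolist())
        self.storage[index - 1] = value

    def values(self) -> List[Any]:
        return list(self.storage)
```

Each widening step copies the whole argument vector, which is exactly the cost the con above is worried about when many updates of mixed size occur.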

Edit 15.08.2021:
I am using the term data view in a narrow sense. In the context of PyTorch there is a tensor data view. A tensor is an n-dimensional array; a 3D tensor, for example, can be seen as an array of matrices. The broader sense of data view is to also have derived views and not only compact storage patterns:

How does the “view” method work in PyTorch?
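As a standard-library analogue of such a derived view (not PyTorch itself), Python's `memoryview.cast` can present the same compact byte buffer as a 2D grid without copying. The helper name `make_view` is made up for this sketch.

```python
from array import array


def make_view(data: array, rows: int, cols: int) -> memoryview:
    """Return a rows x cols view onto the bytes of data, without copying."""
    flat = memoryview(data)
    # cast() reinterprets the same buffer with a new shape;
    # rows * cols must equal the number of elements.
    return flat.cast("b", shape=[rows, cols])


data = array("b", range(12))   # compact storage, one byte per element
grid = make_view(data, 3, 4)   # derived 2D view onto the same memory
grid[1, 2] = 99                # writing through the view ...
# ... changes the underlying storage at flat index 1*4 + 2 = 6.
```

The point is that the view only adds shape information on top of the existing storage, which is the "derived view" sense as opposed to the "compact storage" sense.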

But many GPUs can work with small tensors, also called 3D data. They run their many cores over such a structure, so you can access not only thread.x and thread.y, but also thread.z. Disclaimer: I did not try anything beyond a little playing around with this, mainly in 2D:

Shadertoy BETA

You can make mini projects, enter and execute your GPU code directly from the browser.

Interesting …

Can you show an example? I can’t visualize it: the size and the two passes …

Do you mean to enhance the SWI-Prolog code implementing the =.. operator, or do you mean a “hook”, i.e. some kind of Prolog-based enhancement …