Pure QOI (Quite OK Image) format implementation

Hello,

I’m back again with another completely pointless endeavour: implementing a pure DCG of the Quite OK Image format: qoi-prolog/qoi_pure.pl at 4a82f1b5493522fb6fc133638a00b98eed39ce31 · kwon-young/qoi-prolog · GitHub

For context, I previously made a post about a Prolog decoder/encoder written to be as fast as possible, and found that I was around 100 times slower than the reference C implementation.
Then, I couldn’t stop asking myself:

  1. Is it possible to write a fully logically pure implementation of the format?
  2. How slow would it be?

By the way, here is my (totally made up on the spot) definition of a “pure implementation”:

the exact same Prolog code can be used with any instantiation pattern to do the same logical task in forward, backward, generative or test mode.

As helper libraries, I use clpfd (of course) and the reified unification predicate =/3 that I ripped out of library(reif).
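
For reference, the reified =/3 I ripped out is essentially the following (a sketch of the library(reif) definition written from memory, so check the library itself for the authoritative version):

% Reified unification: T is true if X and Y unify, false otherwise.
% The last two alternatives keep the predicate pure when neither
% X == Y nor X \= Y can be decided yet.
=(X, Y, T) :-
   (   X == Y -> T = true
   ;   X \= Y -> T = false
   ;   T = true, X = Y
   ;   T = false, dif(X, Y)
   ).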

Here are some thoughts and comments on the experience of writing this:

  • Writing pure Prolog code is much, much more complicated than writing the imperative-style decoder/encoder. If I had not written the encoder/decoder separately beforehand, I would not have been able to write the pure version. I think the saying “the whole is greater than the sum of its parts” is true here and applies to the notion of complexity ^^
  • This endeavour shows that it is possible to write a logically pure version of this type of program, although I think it also shows why we should not do it :slight_smile:
  • The pure implementation is roughly 1000 times slower than the non-pure Prolog implementation, which in turn means that it is 100_000 times slower than the reference C implementation! I measured a whopping 5 kilopixels/sec decoding speed and 2 kilopixels/sec encoding speed :joy:
  • As expected from naive profiling, clpfd’s various arithmetic and reification predicates are the bottleneck of the grammar.
  • The code can be used to decode/encode, but also to generate an image (leave the raw data and encoded bytes uninstantiated) and of course to test an image against its encoded version, except you only have to test a single implementation!
  • One of the big factors that increased complexity was not leaving any superfluous choicepoints in encoding/decoding modes, since leaving a choicepoint per pixel is essentially synonymous with a memory leak (see the toy example after this list). It also breaks tail call optimisation, which means stacks will overflow even for really small images.
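
Here is a toy illustration of that last point (hypothetical predicate, not taken from qoi_pure.pl): both recursive clauses match any non-empty pixel list, so classifying a run pixel leaves the literal clause as an untried alternative, which means one choicepoint per pixel and no last call optimisation.

% Classify each pixel as part of a run (equal to the previous pixel P)
% or as a literal. Taking the run clause still leaves the lit clause
% pending, so every run pixel costs a choicepoint.
pixels([], _, []).
pixels([X|Xs], X, [run(X)|Ys]) :- pixels(Xs, X, Ys).
pixels([X|Xs], P, [lit(X)|Ys]) :- dif(X, P), pixels(Xs, X, Ys).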

Here is one question I have in mind now:
We often say that a pure Prolog program is like a logical specification of a problem, with citations like “Algorithm = Logic + Control”.
Would it be possible to take such a pure Prolog implementation and derive encoder/decoder programs optimised for speed depending on the instantiation pattern of the query?
So, something like parser generators (bison, flex, yacc…) but for Prolog code? Maybe even do it on the fly?
I know that clpfd does some compile-time expansion to optimise the arithmetic a bit when input variables are ground, but I’m talking about something like runtime expansion based on the instantiation pattern of the query.


Thanks for the writeup. It mostly confirms my expectations. As for the performance, this can probably be cured by a sufficiently sophisticated partial evaluation, i.e., specialize the program for a given mode/input. That is not readily available though.

The multi-moded nature of Prolog is useful, but in my opinion often overestimated, certainly when inherently non-pure predicates are involved, such as arithmetic, atom_codes/2, etc.

Yes, that was what I was thinking at the end of my writeup.

What do you mean by this? Has this never been tried before?
Or do you know of some existing projects that are not readily available?
How would you even google to find these kinds of projects?

Given the simplicity of the idea (I’m not saying that it is simple to implement ^^), I’m kind of flabbergasted that nobody has tried it before…

There have been several papers on partial evaluation in the academic logic programming community. I’m not aware of anything sophisticated being in actual use in any major Prolog implementation. I also do not know whether work has been done on partial evaluation with clp(fd) (or constraints in general). I guess that should be possible. Mode inferencing should be enough to move many clp(fd) expressions around to a place where they can be replaced by simple arithmetic.
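
To make that last point concrete, here is a hand-done sketch of the kind of specialization meant here (no existing tool implied):

:- use_module(library(clpfd)).

% Pure, mode-independent goal using clp(fd):
plus1(X, Y) :- X #= Y + 1.

% If mode inferencing can prove that Y is always a ground integer at the
% call site, a specializer could in principle emit plain arithmetic:
plus1_ground(X, Y) :- X is Y + 1.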


Thank you very much for the term “partial evaluation”. I think this was the term I was searching for to google this :slight_smile:

I think work has also been published as “full program optimization”.

For example:

qoi_uint32(N) -->
   [W1, W2, W3, W4],
   {  [W1, W2, W3, W4] ins 0..255,
      N #= W1 << 24 + W2 << 16 + W3 << 8 + W4
   }.

This should be faster:

qoi_uint32(N) -->
   [W1, W2, W3, W4],
   {  maplist(between(0, 255), [W1, W2, W3, W4]),
      N #= W1 << 24 + W2 << 16 + W3 << 8 + W4
   }.

Evidence:

?- time(maplist(between(0, 255), [W1, W2, W3, W4])).
% 8 inferences, 0.000 CPU in 0.000 seconds (89% CPU, 505051 Lips)

?- time([W1, W2, W3, W4] ins 0..255).
% 288 inferences, 0.000 CPU in 0.000 seconds (84% CPU, 4960386 Lips)

Thanks for the suggestion, but I think you are missing the point of the write up.

The predicate qoi_uint32//1 is only used in the declaration of the header, whose runtime is microscopic compared to the encoding/decoding of pixels.

And by the way, your suggestion changes the constraint approach into a generate-and-test approach, but your evidence does not test the labelling part of the qoi_uint32//1 predicate…

So, I’m not saying that your suggestion is bad, only that your test does not prove your claim that qoi_uint32//1 would be faster with your change.
I also believe there is going to be a lot of thrashing when trying to unify with a big number N when encoding an image.
And finally, your solution will always leave an unneeded choicepoint, which will ultimately harm the overall grammar much more.

No - putting the maplist before the [W1, W2, W3, W4] would be generate-and-test.

Where is the choicepoint? Example:

?- numlist(1, 4, Ws), maplist(between(0, 255), Ws).
Ws = [1, 2, 3, 4].
<No choicepoint remains>

I believe you are missing the point of qoi_uint32//1, which is to unify a number N with its 4-byte representation [W1, W2, W3, W4].

I believe this proves my point:

?- N = 9999999, time(([W1, W2, W3, W4] = Ws,
                      maplist(between(0, 255), Ws),
                      N #= W1 << 24 + W2 << 16 + W3 << 8 + W4)).
% 80,078,434 inferences, 2.816 CPU in 2.823 seconds (100% CPU, 28434041 Lips)
N = 9999999,
W1 = 0,
W2 = 152,
W3 = 150,
W4 = 127,
Ws = [0, 152, 150, 127] ;
^CAction (h for help) ? abort
% 38,896,363 inferences, 1.356 CPU in 1.952 seconds (69% CPU, 28691279 Lips)
% Execution Aborted
<choicepoint remains!>
?- N = 9999999, time(([W1, W2, W3, W4] ins 0..255,
                      N #= W1 << 24 + W2 << 16 + W3 << 8 + W4)).
% 5,068 inferences, 0.001 CPU in 0.001 seconds (97% CPU, 8591505 Lips)
N = 9999999,
W1 = 0,
W2 = 152,
W3 = 150,
W4 = 127.
<no choicepoint remains>
?- 

Ah, OK.

clpfd’s inference ability isn’t really being taken advantage of. clpBNR is an alternative:

qoi_uint32(N) -->
   [W1, W2, W3, W4],
   {  [W1, W2, W3, W4] ins 0..255,
      N #= W1 << 24 + W2 << 16 + W3 << 8 + W4
   }.
   
qoi_uint32_bnr(N) -->
    [W1, W2, W3, W4],
    {   [W1, W2, W3, W4]::integer(0, 255),
        { N == (W1 * 256^3) + (W2 * 256^2) + (W3 * 256) + W4 }
    }.  

Performance:

?- time(forall(between(1, 10_000, N), phrase(qoi_uint32(N), Ws))).
% 51,738,371 inferences, 3.486 CPU in 3.503 seconds (100% CPU, 14839962 Lips)
true.

?- time(forall(between(1, 10_000, N), phrase(qoi_uint32_bnr(N), Ws))).
% 17,256,955 inferences, 1.999 CPU in 2.003 seconds (100% CPU, 8633390 Lips)
true.

What do you mean by “clpfd’s inference ability isn’t really being taken advantage of”?

Wow, I did not expect that. Do you know why clpBNR is faster here?
Maybe @ridgeworks will know?

You shouldn’t have shown me this, I now want to rewrite the code using clpBNR :sob:

The code is using reification in clpfd (e.g. #<==> IsDiff)… and then immediately forcing unification with 0 or 1, rather than letting the reification logic accumulate so that clpfd can find a clever path through it.

Ah yes, I used that pattern to implement pure code while avoiding unneeded choicepoints.

If you try the code, you will find that when encoding/decoding an image, IsDiff/IsRun and the other similar variables will all be ground before the call to the next predicate, therefore taking advantage of first-argument indexing and pruning unneeded choicepoints.

So the pattern of using clpfd reification operators is a pure replacement for the non-pure if-then-else (_ -> _ ; _) construct.
It is the same idea as using the zcompare/3 predicate as in the factorial example of the documentation.
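
For readers following along, here is a minimal sketch of the pattern (hypothetical predicate names, not the actual qoi_pure.pl code):

:- use_module(library(clpfd)).

% Reify the comparison, then dispatch on the 0/1 flag. In encoding mode
% P0 and P are ground, so IsSame is already ground when pixel_chunk_/3 is
% called and first-argument indexing selects a single clause without
% leaving a choicepoint behind.
pixel_chunk(P0, P, Chunk) :-
   (P0 #= P) #<==> IsSame,
   pixel_chunk_(IsSame, P, Chunk).

pixel_chunk_(1, _, run).         % same pixel: contributes to a run chunk
pixel_chunk_(0, P, literal(P)).  % different pixel: emitted as a literal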

I was somewhat surprised by this and as I don’t know much about the internals of clpfd, I had to do a little digging.

My first observation is that clpfd performs better with a different but equivalent constraint. Instead of:

qoi_uint32(N) -->
   [W1, W2, W3, W4],
   {  [W1, W2, W3, W4] ins 0..255,
      N #= W1 << 24 + W2 << 16 + W3 << 8 + W4
   }.

write:

qoi_uint32P(N) -->
   [W1, W2, W3, W4],
   {  [W1, W2, W3, W4] ins 0..255,
      N #= W1 * 0x1000000 + W2 * 0x10000 + W3 * 0x100 + W4
   }.

Now:

?- N = 9999999,time((between(1,1000,_),phrase(qoi_uint32(N),_),fail)).
% 5,075,999 inferences, 0.576 CPU in 0.630 seconds (92% CPU, 8807208 Lips)
false.

?- N = 9999999,time((between(1,1000,_),phrase(qoi_uint32P(N),_),fail)).
% 1,238,999 inferences, 0.172 CPU in 0.191 seconds (90% CPU, 7196621 Lips)
false.

So a factor of 4 improvement with this simple change. A quick scan of profiler data suggests that the optimized form is recognized by clpfd as a linear equation subject to special treatment, whereas shift isn’t really a mathematical operation at all and requires a general approach (appears to be more of a generate and test approach internal to clpfd).

As an aside, clpfd does a significant amount of analysis and rewriting of constraints at load time; try listing/1 on qoi_uint32//1 to see it. It primarily does this to minimize the overhead when the instantiation pattern is equivalent to standard functional arithmetic so that the argument can be made (with some justification IMO) that clpfd can be used as a replacement for standard functional integer arithmetic.

Now this more optimal clpfd version actually performs better than clpBNR for a couple of reasons (at least). clpBNR performs a runtime analysis of any constraint to see whether some optimizations can be done, particularly to avoid the interval arithmetic dependency issue, e.g. (with flag clpBNR_verbose=true):

?- [X,Y]::real,{Y==X+X}.
X::real(-5.0e+15, 5.0e+15),
(Y::real(-1.0e+16, 1.0e+16), {Y==2*X}).

This analysis adds a runtime cost for any added constraint even if it achieves nothing. This is usually justified because the cost of building the constraint network for the vast majority of clpBNR programs I’ve encountered is a small fraction of the cost of constraint consistency checking. This example is an exception; 40% of the runtime is initializing the constraint network.

A second reason is that clpBNR is actually a constraint sub-language over the continuous domain of reals so all domain bounds calculations are done assuming correctly rounded floating point arithmetic functions. This incurs a cost even when no floating point arithmetic is done, as in the case of this example.

Thirdly, clpBNR simply “compiles” constraints to a set of primitive narrowing operations. There is no attempt to consider optimized treatment of the constraint as a whole, such as clpfd does for this particular case. That is considered to be meta-programming with respect to clpBNR and left to the application (or third party library) to sort out. (library(clpBNR_toolkit) has examples of this, e.g., using library(simplex) to solve systems of linear equations more efficiently.)

Bottom line: if you have a constraint problem over the integer domain, it’s probably better to focus on clpfd. If the problem is over the real domain or is mixed, clpBNR may be better suited.

And if the problem is simple enough, a moded (non-logical) solution using standard functional arithmetic will perform the best because it doesn’t involve any constraint consistency checking and propagation over attributed variables. (And will perform even better when flag optimise=true.)
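
For concreteness, a sketch of such a moded rule (hypothetical qoi_uint32_moded//1, dispatching on integer/1 and using only functional arithmetic; not code from this thread):

% Encoding direction: N is a known integer, compute its four bytes.
qoi_uint32_moded(N) -->
    { integer(N) },
    !,
    { W1 is (N >> 24) /\ 0xff,
      W2 is (N >> 16) /\ 0xff,
      W3 is (N >>  8) /\ 0xff,
      W4 is  N        /\ 0xff },
    [W1, W2, W3, W4].
% Decoding direction: the four bytes come from the list, compute N.
qoi_uint32_moded(N) -->
    [W1, W2, W3, W4],
    { N is (W1 << 24) \/ (W2 << 16) \/ (W3 << 8) \/ W4 }.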

A final big picture comment. I haven’t looked at any detail on the “QOI” translation problem but it seems to be mainly about performing arithmetic on mutable array values. That seems to be right in C’s wheelhouse - it can do these types of operations (array access and arithmetic on predefined numeric values) inline in a few machine instructions. None of that plays into Prolog’s strengths, so you have to implement the equivalent of arrays (compound terms probably the most space/time efficient) and access values through builtin primitives (think expensive function calls). Now add a whole (WAM) virtual machine layer (at least for SWIP) for everything else. Using a CLP (just for logical arithmetic?) adds the equivalent of yet another VM layer, this time one written in Prolog itself. So it comes as no surprise to me that a SWI-Prolog implementation is orders of magnitude slower than C for this problem - horses for courses. I must assume that you’re not purely interested in a performant solution.


Regarding performance - perhaps there is a gap for an integer library which allows clp-style deferral (e.g. X = Y + Z, with actual values postponed) but is not trying to be clever in other ways :grinning:

@jan – is there a way of doing the equivalent of freeze/2 or when/2 in a C extension? Or is C code limited to the equivalent of var/1, nonvar/1, ground/1, etc.?

(I’m asking because library(protobufs) uses C code for low-level conversions between numbers and wire format (presumably because of performance, although I’ve never measured it); but for the “pure” bi-directional code, there are wrappers that use freeze/2 or when/2, and it would simplify things (plus better performance?) if the freeze/2 or when/2 could be moved into the C code.)

@ridgeworks Thank you so much for the extensive and interesting response!

Well, I’m more trying to learn new things through these experiments.
For example, while writing these grammars, I’ve learned that:

  • you can use tail-call optimised predicates to split your grammar into smaller logical parts
  • I have learned the subtle difference between the different reification operators in clpfd, like #<==> and #<==
  • I have learned that there is a three-order-of-magnitude difference between the speed of raw Prolog arithmetic and pure clp arithmetic

It’s also because everybody says that you should not do this kind of thing ^^

Some write poetry, I prefer this kind of writing :slight_smile:
I think @jan called it a mild case of Stockholm syndrome :joy:


No. It would be really hard to maintain the non-determinism of the wakeup. And anyway, it would lead to possibly deeply nested Prolog → C → Prolog call chains that are hard to control.

Perhaps. If your objective is just to defer execution until, say, 2 of the 3 vars in X is Y + Z are instantiated, that shouldn’t be too difficult - see when/2. This approach could also be used to make “logical” versions of other non-arithmetic builtins, e.g., functor/3, arg/3, or char_code/2, to avoid immediate instantiation errors by deferring. This would be quite a different style of logic programming.
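
Something along these lines would be a starting point (a sketch with a hypothetical predicate name, built on SWI-Prolog’s when/2):

% Defer "X is Y + Z" (solved for whichever argument is missing) until at
% least two of the three arguments are ground.
plus_lazy(X, Y, Z) :-
    when(( ground(Y-Z) ; ground(X-Z) ; ground(X-Y) ),
         plus_now(X, Y, Z)).

plus_now(X, Y, Z) :-
    (   ground(Y-Z) -> X is Y + Z
    ;   ground(X-Z) -> Y is X - Z
    ;   Z is X - Y
    ).

For example, plus_lazy(X, 2, Z) suspends, and a later Z = 3 wakes it up and binds X to 5.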