Quite OK Image (QOI) format challenge

Hello everybody,
Some time ago, I stumbled on the QOI image format.
The interesting thing about this image format is that you can read the specification in about 5 minutes and write a decoder in about the same time ^^
It is also a format that is designed to be very fast while achieving quite OK compression.

My first thought after reading this was to write a decoder in Prolog, so I wanted to share my attempt here: qoi-prolog/qoi.pl at 3a29b9d4dff8a300e14d359a7bc00a9b22152f53 · kwon-young/qoi-prolog · GitHub

One thing to note is that I have “tried” to optimize the decoding for speed while keeping the code readable, so here are my design decisions:

  • use a DCG for readable code
  • pack the decoded pixels into a large data/N compound whose arity N is the number of pixels, where each argument is a compound color(R, G, B, A)
  • use tail calls for both the tail call optimisation and code readability
  • the QOI format requires a mutable array of 64 colors; I use a compound colors/64 together with setarg/3 to mutate the array throughout the decoding (see the sketch after this list)
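
To make the last two points concrete, here is a minimal sketch of the mutable color index and the pixel packing (the predicate names are illustrative, not the repo’s exact code):

% Build the colors/64 index once; the initial fill uses unification
% rather than setarg/3, so it cannot be undone by backtracking.
init_color_index(Colors) :-
    length(Slots, 64),
    maplist(=(color(0, 0, 0, 0)), Slots),
    Colors =.. [colors|Slots].

% Store a pixel at its index position; the hash
% (r*3 + g*5 + b*7 + a*11) mod 64 comes from the QOI specification,
% and setarg/3 / arg/3 use 1-based argument positions.
index_store(Colors, color(R, G, B, A)) :-
    I is (R*3 + G*5 + B*7 + A*11) mod 64 + 1,
    setarg(I, Colors, color(R, G, B, A)).

% Fetch the color named by a QOI_OP_INDEX byte (Hash in 0..63).
index_fetch(Colors, Hash, Color) :-
    I is Hash + 1,
    arg(I, Colors, Color).

% The decoded image is packed the same way: a data/N compound with
% one argument per pixel, filled in as decoding proceeds.
init_pixels(NPixels, Data) :-
    functor(Data, data, NPixels).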

In terms of speed, I am ~100x slower than the reference implementation in C when run with the swipl -O optimization flag… which is not great, but not too bad either?
After profiling the code, I don’t really know how to optimize it further…

I have included some test images in the GitHub repo:

?- time(qoi_decode(_, "qoi_test_images/IMGP5493_seamless_2.qoi")).
% 6,245,272 inferences, 0.859 CPU in 0.870 seconds (99% CPU, 7267876 Lips)
true.

So, around 0.8 seconds to decode a 1024x1024 image.

I would be very interested if anyone could come up with more tricks to speed up the decoding, or propose a completely new approach!

Here are some questions I have:

  • I have heard that in C you can use tail call optimisation to completely remove the overhead of calling a function; does this hold true in Prolog too when tail-calling a predicate?
  • Does the code size of a predicate matter? Again, I have heard that in C it is easier to optimize small functions than big ones.
  • Should I use another representation to pack my pixels?

Any comments are welcome 🙂

PS: here is a very interesting blog related to parsing protobuf using tail calls: Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C

Sounds great. I’m no expert here, but when you say you profiled the code, did you check I/O performance? I have a hunch that this is the bottleneck.

In my experience, code like this is difficult to improve by profiling.

The protobufs library has some low-level predicates for things like converting 4 bytes into an integer (e.g., protobufs:int32_codes/2, used in payload//2). I don’t know how much this speeds things up; I inherited that code and then extended it. Anyway, there are a number of places in your code with extensive arithmetic (e.g., qoi_decode_op_diff) where a foreign predicate might improve things.
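
For reference, the arithmetic in question looks roughly like this (a sketch of QOI_OP_DIFF based on the QOI specification, not the repo’s exact code); it is this kind of bit-fiddling that a foreign predicate could collapse into a single call:

% QOI_OP_DIFF: one byte 0b01drdgdb packs three 2-bit channel diffs,
% each biased by 2; additions wrap modulo 256 (per the QOI spec).
op_diff(Byte, color(R0, G0, B0, A), color(R, G, B, A)) :-
    R is (R0 + ((Byte >> 4) /\ 0x3) - 2) /\ 0xff,
    G is (G0 + ((Byte >> 2) /\ 0x3) - 2) /\ 0xff,
    B is (B0 + ( Byte       /\ 0x3) - 2) /\ 0xff.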

I noticed a few places in your code where you use if-then-else; possibly it would be faster to use an auxiliary predicate. E.g.:

qoi_decode_op_short(Data, Counter, Arity, Colors, Prev, Op) -->
   { StartOp is Op >> 6 },
   qoi_decode_op_short(StartOp, Data, Counter, Arity, Colors, Prev, Op).

qoi_decode_op_short(0b11, Data, Counter, Arity, Colors, Prev, Op) -->
    { RunLength is Op /\ 0b00111111 },
    qoi_decode_op_runlength(Data, Counter, Arity, Colors, Prev, RunLength).
qoi_decode_op_short(0b00, Data, Counter, Arity, Colors, _Prev, Op) -->
    qoi_decode_op_index(Data, Counter, Arity, Colors, Op).
qoi_decode_op_short(0b01, Data, Counter, Arity, Colors, Prev, Op) -->
    qoi_decode_op_diff(Data, Counter, Arity, Colors, Prev, Op).
qoi_decode_op_short(0b10, Data, Counter, Arity, Colors, Prev, Op) -->
    qoi_decode_op_luma(Data, Counter, Arity, Colors, Prev, Op).

Also, I think you could speed things up by changing qoi_decode/2 to use read_file_to_codes/3 and phrase/2.
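
Something along these lines (a sketch; qoi_image//1 stands in for whatever the repo’s top-level nonterminal actually is):

qoi_decode(Image, Filename) :-
    read_file_to_codes(Filename, Codes, [encoding(octet)]),
    phrase(qoi_image(Image), Codes).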

I second this. Without any further setup, phrase_from_stream/2 will buffer at most 4096 bytes. That means a lot of tiny reads are going to happen over the course of parsing this. Much better to just read in the whole thing at once.

Alternatively, you can set the buffer to be a bit larger with set_stream/2, like so:

set_stream(Stream, buffer_size(1048576))

This should set up the stream to read in chunks of 1 MB.

Thanks for all your replies!

Going on a hunch, I assumed that Prolog arithmetic would be so slow that I/O would never be the bottleneck.
I’ll report back on the influence of I/O on parsing speed.

Is that true? I thought that both approaches (if-then-else and first-argument indexing (FAI) of predicates) would be fully equivalent.
I guess it is possible that if-then-else constructs are implemented by creating a choice point and then cutting it?
I’ll convert the code to use FAI and report back if there are any improvements.

About if-then-else vs. an auxiliary predicate with first-argument indexing: the trade-offs are similar to those between if-then-else and switch in C. There is one more thing: the call to the auxiliary predicate requires shifting the parameters, so that overhead might make things worse.

Sometimes it helps to look at the generated code; use vm_list/1 for that.
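
For example, the DCG wrapper quoted earlier compiles to an ordinary predicate with two extra difference-list arguments (so six arguments become eight), and its VM code can be listed with:

?- vm_list(qoi_decode_op_short/8).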

As to my comment about this kind of code being difficult to optimize by profiling: that’s because some design decisions are spread across all the code (e.g., using a list with DCGs, or using setarg/3 rather than dicts or RB-trees). BTW, nb_setarg/3 might be faster than setarg/3, if you know that there’s no backtracking. Also, you might add det/1 directives to ensure that choice points aren’t being created.
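
For example (again using the arity of the compiled DCG wrapper):

:- det(qoi_decode_op_short/8).

With this directive, SWI-Prolog raises an error if the predicate fails or leaves a choice point behind.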

(BTW, I’ve never looked into the performance of library(protobufs) compared with C++ implementations because nobody has complained about the performance (even though some of the things I did probably slowed it down a bit). Perhaps the original author (@JeffRose ) has some insights into performance because he used C code for some of the arithmetic operations.)

Do you have evidence? In my experience the read overhead does not depend much on the buffer size if you do not make it really small (1 KB or less). Of course, it depends on how expensive the grammar is.

Intuition based on I/O in other languages, which turns out to be wrong here.

I just tried it out with a simple DCG that reads the entire file one code at a time, doing nothing else. If anything, the larger file buffer makes things slightly slower. Apologies for the misinformation.

Also, it looks like the linked file was already loading everything into memory using read_file_to_string/3, not reading directly from the file on disk, so my point was moot anyway.
Also also, it actually looks like reading from disk without first reading into a string is faster.

Test program:

:- use_module(library(pure_input)).

% Walk the whole input one code at a time, doing nothing else.
walk_file -->
    [_],
    { ! },   % commit to this clause so no choice points pile up
    walk_file.
walk_file --> {true}.

% Either preload the whole file into a string and open that as a
% stream, or open the file on disk directly.
acquire_stream(preload, Stream) :-
    read_file_to_string('/tmp/bigfile', String, [encoding(octet)]),
    open_string(String, Stream).
acquire_stream(disk, Stream) :-
    open('/tmp/bigfile', read, Stream, [encoding(octet)]).

acquire_buf_stream(Type, BufSize, Stream) :-
    acquire_stream(Type, Stream),
    set_stream(Stream, buffer_size(BufSize)).

walk_file_buf(Type, BufSize) :-
    acquire_buf_stream(Type, BufSize, Stream),
    phrase_from_stream(walk_file, Stream).

try_multiple_read_styles :-
    Types = [preload, disk],
    Sizes = [1024, 4096, 1048576],
    forall((
               member(Type, Types),
               member(Size, Sizes)
           ),
           (
               format("~q ~q~n", [Type, Size]),
               time(walk_file_buf(Type, Size))
           )).

Execution:

> dd if=/dev/urandom bs=1k count=100k of=/tmp/bigfile     
102400+0 records in
102400+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.541145 s, 194 MB/s
> swipl -O -f bigfile.pl -g try_multiple_read_styles -t halt
preload 1024
% 106,472,742 inferences, 5.033 CPU in 5.046 seconds (100% CPU, 21153373 Lips)
preload 4096
% 105,241,756 inferences, 4.683 CPU in 4.695 seconds (100% CPU, 22475083 Lips)
preload 1048576
% 105,241,756 inferences, 5.064 CPU in 5.078 seconds (100% CPU, 20782012 Lips)
disk 1024
% 106,393,638 inferences, 3.165 CPU in 3.174 seconds (100% CPU, 33613267 Lips)
disk 4096
% 105,241,638 inferences, 2.949 CPU in 2.958 seconds (100% CPU, 35682309 Lips)
disk 1048576
% 105,241,638 inferences, 3.083 CPU in 3.091 seconds (100% CPU, 34139606 Lips)

So the 4 KB buffer beats both the 1 KB and the 1 MB buffers. And streaming from disk beats preloading. Fun times.

Also, this turns out to actually be the worst option. It’s extremely memory-hungry, which in turn (I think) requires lots of garbage collection during execution. I’ve seen runtimes of up to 26 seconds for a 100 MB file depending on how I set up my stacks, vs. 3 seconds for phrase_from_file/3.

So, read_file_to_string, open_string, phrase_from_stream is faster than phrase_from_file? That surprised me, but shows how unintuitive it can be to speed up code.

No, phrase_from_file/2 is the fastest. Pre-reading the file into a string, then opening that string as a stream and using phrase_from_stream/2, is slightly slower. Slowest of all is read_file_to_codes/3 followed by phrase/2.

Hello everybody,

I’m back after working a little bit on the encoding side of things.
I have updated the repo with a working encoder, some tests and a benchmark function:

101 ?- benchmark.
decoding
qoi_test_images/IMGP5493_seamless_2.qoi
...
qoi_test_images/testcard.qoi
decoding speed: 1.3536433420448817 megapixel/sec
encoding
qoi_test_images/IMGP5493_seamless_2.qoi
...
qoi_test_images/testcard.qoi
encoding speed: 1.140793931136414 megapixel/sec
true.

If you compare with the official QOI benchmarks, we are ~100x slower than the C implementation.

About the impact of I/O, my findings are consistent with the previous comments.
I could not beat phrase_from_file/2, either by loading the whole file into a string or by using get_byte/2 directly in the grammar.

On first-argument indexing vs. if-then-else:

  • removing all if-then-else constructs and converting all the code to use FAI, by adding compare/3 calls before auxiliary predicates, made the decoder slower, even after adding dummy arguments to keep static arguments from shifting.
  • converting a single if-then-else (the one for qoi_decode_op_short) to use FAI made no discernible difference.

On using nb_setarg/3 instead of setarg/3, I found nb_setarg/3 to be slower than setarg/3.

Question: we have phrase_from_file/2, but why is there no phrase_to_file/2?

The only solution I found to dump a list of bytes was to do:

   open(Filename, write, Stream, [encoding(octet)]),
   maplist(put_byte(Stream), Bytes),
   close(Stream).

Is there a better solution?
