How to build a table in parallel

Is it possible to build a large table for a tabled predicate using multiple threads?

For a definitive answer, you’d have to ask @jan; but in general, SWI-Prolog is thread-safe, except where explicitly mentioned. Some things are per-thread, so you need to check the documentation.

But: what do you mean by “large table for a tabled predicate”? I can think of multiple interpretations – could you explain in a bit more detail?

I haven’t built a large predicate using multiple threads, but I have built the indexes for predicates with many ground facts, in parallel, for a big speed-up. I used concurrent_maplist/2 to do this, using fake queries that trigger generating the just-in-time indexes.
There is overhead in using concurrent_maplist/2, and when you’re doing things multi-threaded, there is thread-locking overhead, so the speed-up can be less than you would hope for.
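
A minimal sketch of that warm-up trick, with two made-up fact predicates (fact1/2, fact2/2) standing in for the real database: each “fake” query with a bound argument nudges SWI-Prolog into building the just-in-time index for that argument, and concurrent_maplist/2 runs the warm-ups on worker threads.

```prolog
:- use_module(library(apply)).   % concurrent_maplist/2

% Hypothetical fact predicates standing in for the real database.
fact1(a, 1).
fact1(b, 2).
fact2(x, a).
fact2(y, b).

% One "fake" query per predicate/argument whose index we want built.
% The query is expected to fail; its only purpose is to trigger JITI.
warm(fact1_arg2) :- ( fact1(_, some_unlikely_value) -> true ; true ).
warm(fact2_arg2) :- ( fact2(_, some_unlikely_value) -> true ; true ).

% Build the JIT indexes in parallel, one warm-up goal per worker.
warm_indexes :-
    concurrent_maplist(warm, [fact1_arg2, fact2_arg2]).
```

concurrent_maplist/2 uses roughly one thread per CPU core, so with enough predicates the warm-ups overlap; whether that pays off depends on how expensive each individual index build is.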

I’m using tabling to store unique answers to a query over a large db (~70Mb), and I have a machine with 32 cores. Using one thread takes some hours, so I was wondering whether I can exploit multiple threads to speed up the process.

That would also be my idea but I’m not sure how to do it.

70Mb isn’t very big … can’t you simply represent the database as Prolog facts and keep it in memory? JITI (just-in-time indexing) will take care of the indexes - no need for tabling. I “pre-compute” the indexes so that my first few queries don’t take a long time, but if it’s OK for initial queries to be slow, there’s no need for this:

My raw .pl file is 193Mb (500K facts); I compile it to a 71Mb .qlf file.

The problem is I’m tabling queries to the db, not just the db.
@peter.ludemann your approach gave me an idea. Suppose you want to compute all paths in a graph from two source nodes, a and b. Is it faster to do

```prolog
:- table path/2.
  <do something with a and b>
```

or

```prolog
:- table path/2 as shared.
  <do something with a and b>
```


For now, the answer is “no” :frowning: SWI-Prolog has shared tabling, but that implies that threads are using each other’s tables. If a table is missing, one thread will create it and other threads that need it wait for the creating thread to finish its job. In theory, the delimited control based tabling can be executed concurrently. Roughly, tabling results in a graph of call variants (the tables) and continuations that provide the code that connects one variant to another. That is, if the source variant gets a new answer we need to push it through the continuation, which may (or not) come up with an answer for the destination variant, etc., continuing until fixed point is reached.

In theory, this second phase of tabling (completion) can be done concurrently as both the tables and the agenda with continuations are global data structures. Most of these data structures are already thread safe. Getting this to work probably takes a fair amount of effort though. I’m afraid this has to wait until someone is willing to invest in it, either in person or by paying for it. That effort could in part also be spent on improving the tabling performance in general, something which is not really hard, but also a lot of work. If you can make that part of a research proposal, I’m in :slight_smile:

Thanks @jan for the information. If I can make it part of a research proposal I will surely contact you.

In your examples, you’re tabling simple queries to the database. For these, you’d probably see a big performance improvement by converting the database to Prolog facts and letting Prolog’s indexing speed things up. Other possibilities exist, such as a higher performance key-value store (there are “packs” for rocksdb and RDF-HDT).

If you’re doing something more general with finding paths, tabling might or might not help - there’s an overhead in tabling that can be higher than the performance gain, in my experience. :frowning: Also, I haven’t found a way of doing tabling with queries that store path information.

My objective in using tabling was to remove duplicates from the answers to a conjunctive query over a large db. So a key-value store would do as well but I’m afraid of the communication costs.
An alternative approach I tried was to write all answers to a file, one answer per row, and then use

```shell
sort -u --parallel=32 infile >outfile
```

infile would be several Gb and outfile tens of Mb. This approach seems to be faster than using tabling.

Doesn’t a SELECT DISTINCT query do what you want? (Depending on your query and the indexes, this might do the equivalent of sort -u, or it might read rows in sequence by some index, which could be faster … the EXPLAIN keyword might help you speed this up.)

“Several GB” isn’t very big in my books - doesn’t it fit into main memory as Prolog facts? :slight_smile:

I did some tests on duplicate removal:

  1. with tabling, it takes 7 minutes and 22 secs
  2. by materializing the set of answers in a file and using Unix sort, it takes 16 minutes for generating the file (36Gb) and 6 minutes for sorting (output 82Mb)
  3. with setof, without tabling, it takes 7 minutes and 26 secs

So tabling and setof are faster.

And what happens if you generate the answers and add them to a dynamic predicate if they are not already there? If you can make multiple threads generate different subsets of the answers, this may give you some of the concurrency, although you’d have to lock the lookup-and-assert-if-not-found step.
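
A hedged sketch of that lookup-and-assert-if-not-found step; seen/1 and the mutex name dedup are my own, and this assumes the answers are ground terms:

```prolog
:- dynamic seen/1.

% Add Term to the answer set unless an identical term is already there.
% with_mutex/2 makes the test-then-assert step atomic across threads.
add_if_new(Term) :-
    with_mutex(dedup,
               (   seen(Term)
               ->  true
               ;   assertz(seen(Term))
               )).
```

Each worker thread can call add_if_new/1 on its own subset of answers; the mutex serialises only the duplicate check, not the answer generation.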

Note that rather than using tabling, your problem seems so simple that just using the trie API is probably simpler and, again, opens options for concurrency.
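
A minimal sketch of what that could look like, assuming SWI-Prolog’s trie API (trie_new/1, trie_gen/2, and trie_insert/2, which fails when the key is already present); dedup/2 is a name I made up:

```prolog
% Deduplicate a list of terms using a trie; two terms count as
% duplicates when they are variants of each other.
dedup(Terms, Unique) :-
    trie_new(Trie),
    forall(member(T, Terms),
           ( trie_insert(Trie, T) -> true ; true )),  % fails on duplicates
    findall(T, trie_gen(Trie, T), Unique),
    trie_destroy(Trie).
```

The enumeration order of trie_gen/2 is not the input order, so sort the result afterwards if order matters.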

With tries and 32 threads I get 1 minute and 56 secs using this code

```prolog
  ...
  LenChunks is round(Len/N),
  ...
  N1 is N-1,
  ...
```

where g3/4 is the predicate computing the answers to be deduplicated.

With dynamic predicates it took 51 minutes with this code

```prolog
  ...
  (\+ q(R,R1,R2,R3),
  ...
```
With horizontal partitioning you could use a lockless data structure in each thread, for example thread-local facts. But I don’t know about the extra cost of copying the “shards” back into a locked data structure, if that is required. When the result is a file, you could of course dump each thread into the file by opening it in “append” mode. But sometimes people use “shards” as their primary storage mechanism, so that queries would later also be evaluated against these “shards” and executed in parallel. What is your query language later? See also:

Revisiting Data Partitioning for Scalable RDF Graph Processing
Jorge Galicia - January 2021
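
A rough sketch of the thread-local “shard” idea, assuming SWI-Prolog’s thread_local/1; collect_shard/2 is a hypothetical helper a worker would run over its own partition. Thread-local clauses are invisible to other threads, so each worker must hand its shard back itself (here, as a list).

```prolog
:- thread_local shard/1.   % each thread gets its own, lockless clause set

% Run Goal (one extra argument per solution) in the calling thread,
% storing answers in the thread-local shard/1, then collect and clear.
collect_shard(Goal, Answers) :-
    forall(call(Goal, A), assertz(shard(A))),
    findall(A, shard(A), Answers),
    retractall(shard(_)).
```

Combining the per-thread lists (or appending each to a file) is then the only step that needs coordination.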

If you can deduplicate the key list in the first place, for example by using

```prolog
const(C0), sort(C0, C),
```

you not only get a shorter list; a simple consideration also tells us that the lock is no longer needed. Because R comes from the deduplicated keys, it is not possible that two threads try to add the same tuple to the dynamic database, so there can be no race condition. You could therefore remove the with_mutex/2 call, which might be the bottleneck that was responsible for turning a parallel algorithm into a basically sequential one, i.e. just use this:

```prolog
  (\+ q(R,R1,R2,R3),
  ...
```

But I don’t know how you could remove the lock around the trie. Maybe create multiple trie roots? Or is trie_insert/2 re-entrant? If I remember well, the dynamic database of SWI-Prolog already has a kind of lockless implementation, so removing the with_mutex/2 for the dynamic database variant of your test will possibly already show some speed-up.

On the other hand, I don’t know right now how to do it with a trie, except for the suggestion to use multiple trie roots to be on the safe side. The only lock you would then need would be during the dumping of each thread, so that two threads don’t write into the file at the same time, which might produce garbled output.

Yes, the list was deduplicated.

You’re right, with dynamic predicates I can remove the mutex and the time goes down to 27 minutes, still much higher than with tries. For tries, I believe the mutex must remain, as only with a global data structure can I make sure that no duplicates are added.

I don’t think that is necessary. AFAIK, tries are thread-safe. Better, they use lock-free instructions to add new nodes and can thus add nodes fully concurrently in different parts of the trie.
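
Under that assumption, mutex-free insertion into one shared trie could look like this sketch; dedup_concurrent/2 is a made-up name, and the chunking of the input into sublists is assumed to happen elsewhere:

```prolog
:- use_module(library(apply)).   % concurrent_maplist/2

% Insert every term of every chunk into a single shared trie,
% one worker per chunk, with no explicit mutex: trie_insert/2
% simply fails when the key is already present.
dedup_concurrent(Chunks, Trie) :-
    trie_new(Trie),
    concurrent_maplist(insert_chunk(Trie), Chunks).

insert_chunk(Trie, Terms) :-
    forall(member(T, Terms),
           ( trie_insert(Trie, T) -> true ; true )).
```

Afterwards trie_gen/2 enumerates the deduplicated answers from the shared trie.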

You’re right. Without the mutex it takes 51 secs.

From 1 min 56. Not bad :slight_smile: Might be wise to test with some variation in the number of threads. Sometimes elapsed time increases with more threads :frowning: The predicate mutex_statistics/0 may give hints about the bottlenecks.

So, tries are a good candidate for deduplication of terms if two terms are considered equal when they are variants.

By the way, @jan, the manual explicitly says that tries are not thread-safe (that’s why I used the mutex).