How to build a table in parallel

Is it possible to build a large table for a tabled predicate using multiple threads?

For a definitive answer, you’d have to ask @jan; but in general, SWI-Prolog is thread-safe, except where explicitly mentioned. Some things are per-thread, so you need to check the documentation.

But: what do you mean by “large table for a tabled predicate”? I can think of multiple interpretations – could you explain in a bit more detail?

I haven’t built a large predicate using multiple threads, but I have built the indexes for predicates with many ground facts, in parallel, for a big speed-up. I used concurrent_maplist/2 to do this, using fake queries that trigger generating the just-in-time indexes.
There is overhead in using concurrent_maplist/2, and when you’re doing things multi-thread, there is thread-locking overhead, so the speed-up can be less than you would hope for.
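
A minimal sketch of that trick, assuming a hypothetical fact base edge/2 for which we want indexes on both arguments; the “fake” queries exist only so the JIT indexer builds the corresponding indexes:

:- use_module(library(thread)).  % concurrent_maplist/2

warm_indexes:-
  concurrent_maplist(ignore,
    [ edge(no_such_node,_),    % triggers the JIT index on argument 1
      edge(_,no_such_node)     % triggers the JIT index on argument 2
    ]).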

I’m using tabling to store unique answers to a query to a large db (~70Mb) and I have a machine with 32 cores. Using 1 thread takes some hours, I was wondering if I can exploit multiple threads to speed up the process.

That would also be my idea but I’m not sure how to do it.

70Mb isn’t very big … can’t you simply represent the database as Prolog facts and keep it in memory? JITI (just-in-time indexing) will take care of the indexes - no need for tabling. I “pre-compute” the indexes so that my first few queries don’t take a long time, but if it’s OK for initial queries to be slow, there’s no need for this.

My raw .pl file is 193Mb (500K facts); I compile it to a 71Mb .qlf file.
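
A minimal sketch of that workflow, with hypothetical file names:

?- qcompile('facts.pl').        % one-off: writes facts.qlf next to facts.pl
?- ensure_loaded('facts.qlf').  % later runs load the compiled file instead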

The problem is I’m tabling queries to the db, not just the db.
@peter.ludemann your approach gave me an idea. Suppose you want to compute all paths in a graph from two source nodes, a and b. Is it faster to do

:- table path/2.
main:-
  path(a,A), 
  path(b,B),
  <do something with a and b>

or

:- table path/2 as shared.
main:-
  table_path,
  path(a,A), 
  path(b,B),
  <do something with a and b>
  ....
table_path:-
  concurrent_maplist(path,[a,b]).

path(X):-
  path(X,_).
....

?

For now, the answer is “no” :frowning: SWI-Prolog has shared tabling, but that implies that threads are using each other’s tables. If a table is missing, one thread will create it and other threads that need it wait for the creating thread to finish its job. In theory, the delimited-control-based tabling can be executed concurrently. Roughly, tabling results in a graph of call variants (the tables) and continuations that provide the code that connects one variant to another. That is, if the source variant gets a new answer we need to push it through the continuation, which may (or may not) come up with an answer for the destination variant, and so on, until a fixed point is reached.

In theory, this second phase of tabling (completion) can be done concurrently as both the tables and the agenda with continuations are global data structures. Most of these data structures are already thread safe. Getting this to work probably takes a fair amount of effort though. I’m afraid this has to wait until someone is willing to invest in it, either in person or by paying for it. That effort could in part also be spent on improving the tabling performance in general, something which is not really hard, but also a lot of work. If you can make that part of a research proposal, I’m in :slight_smile:

Thanks @jan for the information. If I can make it part of a research proposal I will surely contact you.

In your examples, you’re tabling simple queries to the database. For these, you’d probably see a big performance improvement by converting the database to Prolog facts and letting Prolog’s indexing speed things up. Other possibilities exist, such as a higher performance key-value store (there are “packs” for rocksdb and RDF-HDT).
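
For the key-value route, a minimal sketch with the rocksdb pack might look as follows (rocks_open/3, rocks_put/3, rocks_get/3 and rocks_close/1 are the pack’s predicates; the option list and the 'answers.db' directory are assumptions to check against the pack’s documentation):

:- use_module(library(rocksdb)).  % pack_install(rocksdb)

demo:-
  rocks_open('answers.db', DB, [key(term), value(term)]),
  rocks_put(DB, q(a,b,c,d), true),
  rocks_get(DB, q(a,b,c,d), Value),
  writeln(Value),
  rocks_close(DB).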

If you’re doing something more general with finding paths, tabling might or might not help - there’s an overhead in tabling that can be higher than the performance gain, in my experience. :frowning: Also, I haven’t found a way of doing tabling with queries that store path information.

My objective in using tabling was to remove duplicates from the answers to a conjunctive query over a large db. So a key-value store would do as well but I’m afraid of the communication costs.
An alternative approach I tried was to write all answers to a file, one answer per row, and then use

sort -u --parallel=32 infile >outfile

infile would be several Gb and outfile tens of Mb. This approach seems to be faster than using tabling.
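
A minimal sketch of that pipeline, assuming an answer generator such as the g3/4 predicate used further down in this thread:

dump_answers(File):-
  setup_call_cleanup(
    open(File, write, Out),
    forall(g3(R,R1,R2,R3),
           ( writeq(Out, q(R,R1,R2,R3)), nl(Out) )),
    close(Out)).

% afterwards, outside Prolog:
%   sort -u --parallel=32 infile >outfile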

SELECT DISTINCT ... WHERE ... [ORDER BY ...]
doesn’t do what you want? (Depending on your query and the indexes, this might do the equivalent of sort -u, or it might read rows in sequence by some index, which could be faster … the EXPLAIN keyword might help you speed this up)

“Several GB” isn’t very big in my books - doesn’t it fit into main memory as Prolog facts? :slight_smile:

I did some tests on duplicate removal:

  1. with tabling, it takes 7 minutes and 22 secs
  2. by materializing the set of answers in a file and using Unix sort, it takes 16 minutes to generate the file (36Gb) and 6 minutes to sort it (output 82Mb)
  3. with setof, without tabling, it takes 7 minutes and 26 secs (sketched below)

So tabling and setof are faster.
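
A minimal sketch of option 3, again assuming the g3/4 answer generator from the code further down:

dedup_with_setof(Answers):-
  setof(q(R,R1,R2,R3), g3(R,R1,R2,R3), Answers).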

And what happens if you generate the answers and add them to a dynamic predicate if they are not already there? If you can have multiple threads generate different subsets of the answers, this may give you some concurrency, although you’d have to lock the lookup-and-assert-if-not-found.

Note that rather than using tabling, your problem seems so simple that just using the trie API is probably simpler and -again- opens options for concurrency.

With tries and 32 threads I get 1 minute and 56 secs using this code

main:-
  tell('outtrie.pl'),
  const(C),
  chunks(C,32,Chunks),
  trie_new(Tr),
  concurrent_maplist(gen(Tr),Chunks),
  trie_gen(Tr,q(R,R1,R2,R3)),
  writeln(q(R,R1,R2,R3)),
  fail.

main:-
   told.

gen(Tr,C):-
  member(R,C),
  g3(R,R1,R2,R3),
  with_mutex(qstore,
    trie_insert(Tr,q(R,R1,R2,R3))),
  fail.

gen(_,_).

chunks(L,N,Chunks):-
  length(L,Len),
  LenChunks is round(Len/N),
  split_list(L,N,LenChunks,Chunks).

split_list(L,1,_,[L]):-!.

split_list(L0,N,NL,[H|L]):-
  N>1,
  N1 is N-1,
  length(H,NL),
  append(H,T,L0),
  split_list(T,N1,NL,L).

where g3/4 is the predicate computing the answers to be deduplicated.

With dynamic predicates it took 51 minutes with this code

:- dynamic q/4.

main:-
  tell('outdyn.pl'),
  const(C),
  chunks(C,32,Chunks),
  concurrent_maplist(gen,Chunks),
  q(R,R1,R2,R3),
  writeln(q(R,R1,R2,R3)),
  fail.

main:-
   told.

gen(C):-
  member(R,C),
  g3(R,R1,R2,R3),
  with_mutex(qstore,
    ( \+ q(R,R1,R2,R3),
      asserta(q(R,R1,R2,R3))
    )),
  fail.

gen(_).

Yes, the list was deduplicated.

You’re right, with dynamic predicates I can remove the mutex and the time goes down to 27 minutes, still much higher than with tries. For tries, I believe the mutex must remain, as only with a global data structure can I make sure that no duplicates are added.

I don’t think that is necessary. AFAIK, tries are thread-safe. Better, they use lock-free instructions to add new nodes and can thus add nodes fully concurrently in different parts of the trie.

You’re right. Without the mutex it takes 51 secs.
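
For reference, a sketch of the mutex-free gen/2 (same const/1, chunks/3 and g3/4 helpers as above); whether a duplicate insert succeeds or fails, the failure-driven loop simply backtracks into member/2 and g3/4:

gen(Tr,C):-
  member(R,C),
  g3(R,R1,R2,R3),
  trie_insert(Tr,q(R,R1,R2,R3)),
  fail.

gen(_,_).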

From 1 min 56. Not bad :slight_smile: Might be wise to test with some variation in the number of threads. Sometimes elapsed time increases with more threads :frowning: The predicate mutex_statistics/0 may give hints about the bottlenecks.

So, tries are a good candidate for deduplication of terms if two terms are considered equal when they are variants.

By the way, @jan, the manual (https://www.swi-prolog.org/pldoc/man?section=trie) explicitly says that tries are not thread-safe (that’s why I used the mutex).

Right. A quick scan says this is probably true for various operations one may wish to perform concurrently on tries. Multiple threads inserting things is probably safe, although I’m not 100% sure. Surely, do not run deletes or enumerations concurrently with the inserts. Concurrent lookup (without enumeration) is probably fine too, as long as the inserts do not modify the associated value.

You could generate bouquets totally lock-free and then merge them. But this only makes sense if the bouquets are relatively small compared to the shards you are processing. This is for example the case if you have a lot of duplicates: each parallel task will then already do quite some reduction, and the merging will do a final combined reduction.

The merging phase can also be done concurrently, where each shard task is a producer and the merger turns the shard-task multi-producers into a single result producer. You can also use this approach for distributed search engines.
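
One way to read this in terms of the trie code above (a sketch only, reusing the const/1, chunks/3 and g3/4 helpers from the earlier posts): each worker fills its own private trie without any locking, and the partial tries are merged into one result trie at the end.

dedup_sharded(N, Final):-
  const(C),
  chunks(C, N, Chunks),
  length(Chunks, Len),
  length(Tries, Len),
  maplist(trie_new, Tries),
  concurrent_maplist(fill, Tries, Chunks),  % one private trie per shard, no locks
  trie_new(Final),
  maplist(merge_into(Final), Tries).        % final combined reduction

fill(Trie, Chunk):-
  forall(( member(R, Chunk), g3(R,R1,R2,R3) ),
         ignore(trie_insert(Trie, q(R,R1,R2,R3)))).  % ignore/1: duplicate inserts may fail

merge_into(Final, Trie):-
  forall(trie_gen(Trie, Term),
         ignore(trie_insert(Final, Term))).

The merge here is done sequentially; the concurrent producer/consumer merging described above would replace the final maplist/2.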