Indexing on Genomic Positions for Range Queries in SWI-Prolog

xabush · July 22, 2024, 10:41am

I’ve imported a large genomic database into SWI-Prolog so that I can run queries over it. Among other things, the database contains genomic coordinates for biological entities. For example:

snp(rs367896724).
chr(snp(rs367896724), chr1).
start(snp(rs367896724), 10177).
end(snp(rs367896724), 10177).

%..................

snp(rs768515087).
chr(snp(rs768515087), chry).
start(snp(rs768515087), 26624609).
end(snp(rs768515087), 26624609).

One query I executed on this database is to see if two genomic entities overlap.

overlaps_with(A, B) :-
    chr(A, ChrA),
    chr(B, ChrB),
    start(A, StartA),
    start(B, StartB),
    end(A, EndA),
    end(B, EndB),
    ChrA = ChrB,
    StartB < StartA,
    EndA < EndB.

Executing the goal overlaps_with(snp(rs367896724), X) takes a rather long time, especially the first time it is run. This is understandable, as Prolog has to search over the entire genome and there are billions of such entities.

I was wondering if there are suggestions from the SWI-Prolog community on how to index the position info so that queries like the above will run faster?

jan · July 22, 2024, 11:17am

The slow first run is due to just-in-time indexing. You can (probably) make this a lot faster by reordering. Also, chr(A, ChrA), chr(B, ChrB), ... ChrA = ChrB can be written as chr(A, Chr), chr(B, Chr), ... You should move the constraints as early as you can. Thus, StartB < StartA should come right after the two start/2 calls. Next you need to think about the ordering of these chr/start/end calls. As a simple rule of thumb, put the call with the smallest number of expected solutions first.
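
A minimal sketch of what this reordering might look like, assuming the query is always called with A bound (as in overlaps_with(snp(rs367896724), X)). The gene(example_gene) entity and its coordinates are hypothetical and only there to give the query something to find. The comparisons are kept exactly as in the original clause, so this still tests that A lies strictly inside B.

```prolog
% Tiny sample database: the two SNPs from the question plus one
% hypothetical entity, gene(example_gene), invented for illustration.
chr(snp(rs367896724), chr1).
chr(gene(example_gene), chr1).
chr(snp(rs768515087), chry).

start(snp(rs367896724), 10177).
start(gene(example_gene), 10000).
start(snp(rs768515087), 26624609).

end(snp(rs367896724), 10177).
end(gene(example_gene), 11000).
end(snp(rs768515087), 26624609).

% Reordered version: look up everything about the bound entity A first
% (one solution each), share the Chr variable instead of unifying
% afterwards, and test each comparison as soon as its arguments are known.
overlaps_with(A, B) :-
    chr(A, Chr),
    start(A, StartA),
    end(A, EndA),
    chr(B, Chr),            % only candidates on the same chromosome
    start(B, StartB),
    StartB < StartA,        % prune before fetching end(B, _)
    end(B, EndB),
    EndA < EndB.
```

With these sample facts, ?- overlaps_with(snp(rs367896724), X). gives X = gene(example_gene). If B is the bound argument instead, the lookups for B would presumably need to come first.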

This reminds me a little of optimizing conjunctions of RDF statements. There we estimate the runtime of each ordering based on properties of each of the calls. Possibly we should do something similar for basic Prolog code …
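
As a rough illustration of the rule of thumb above, the sketch below counts the solutions of each goal and sorts the goals so that the one with the fewest solutions comes first. This is not how the RDF optimiser works: the predicate names solution_count/2 and order_by_solutions/2 are made up here, each goal is counted in isolation (ignoring bindings made by earlier goals), and counting is itself a full scan, so on a large database it would only make sense on a sample.

```prolog
:- use_module(library(aggregate)).
:- use_module(library(apply)).
:- use_module(library(pairs)).

% Sample facts taken from the question.
chr(snp(rs367896724), chr1).
chr(snp(rs768515087), chry).

start(snp(rs367896724), 10177).
start(snp(rs768515087), 26624609).

% solution_count(+Goal, -Pair): Pair is Count-Goal, where Count is the
% number of solutions of Goal. aggregate_all/3 leaves Goal unbound.
solution_count(Goal, Count-Goal) :-
    aggregate_all(count, Goal, Count).

% order_by_solutions(+Goals, -Ordered): Ordered holds the same goals,
% fewest solutions first. maplist/3 keeps variable sharing between goals.
order_by_solutions(Goals, Ordered) :-
    maplist(solution_count, Goals, Pairs),
    keysort(Pairs, Sorted),
    pairs_values(Sorted, Ordered).
```

For example, ?- order_by_solutions([start(S, P), chr(B, chr1)], Ordered). puts chr(B, chr1) first, because it has one solution here while start(S, P) has two.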

peter.ludemann · July 22, 2024, 4:48pm

I have a few tricks for large files of facts:

- Compile to .qlf form (15x speed-up in loading) - you can trigger this by having an empty .qlf file in the same directory as the .pl file.
- Warm up the index by running a set of concurrent queries with instantiated fields that I want indexed (e.g.: pykythe/browser/src_browser.pl at 7a96675f78a8c09e57ea34f45d1e3000266e76f5 · kamahen/pykythe · GitHub)
- Flatten the facts. For example, I have triples that contain compound terms; I flatten them like this:

edge(vname(Signature1, Corpus1, Root1, Path1, Language1),
     Edge,
     vname(Signature2, Corpus2, Root2, Path2, Language2)) :-
    edge(Signature1, Corpus1, Root1, Path1, Language1,
         Edge,
         Signature2, Corpus2, Root2, Path2, Language2).
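
Applied to the genomic data from the question, the flattening and warm-up tricks might look roughly like the sketch below. The snp_pos/4 layout, the wrapper clauses and warm_index/0 are assumptions made for illustration, not code from pykythe. On a database this small SWI-Prolog will not actually build extra indexes, so the warm-up only shows the shape of the calls.

```prolog
:- use_module(library(thread)).   % concurrent_maplist/2

% Hypothetical flat layout: snp_pos(Id, Chr, Start, End).
% One fact per SNP instead of separate chr/2, start/2 and end/2 facts
% keyed on a nested snp(Id) term.
snp_pos(rs367896724, chr1, 10177,    10177).
snp_pos(rs768515087, chry, 26624609, 26624609).

% Thin wrappers that rebuild the original view on top of the flat facts.
chr(snp(Id), Chr)     :- snp_pos(Id, Chr, _, _).
start(snp(Id), Start) :- snp_pos(Id, _, Start, _).
end(snp(Id), End)     :- snp_pos(Id, _, _, End).

% Run one query per argument pattern that should be indexed, so that the
% just-in-time index is built before the first real query arrives.
warm_index :-
    Goals = [ snp_pos(rs367896724, _, _, _),   % lookup by identifier
              snp_pos(_, chr1, _, _)           % lookup by chromosome
            ],
    concurrent_maplist(warm, Goals).

% Succeed whether or not the goal has a solution; only the call matters.
warm(Goal) :-
    ignore(once(Goal)).
```

Calling ?- warm_index. once after loading would then take the indexing cost up front. The .qlf file from the first tip can also be produced explicitly with qcompile/1.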
