Indexing on Genomic Positions for Range Queries in SWI-Prolog

I’ve imported a large genomic database into SWI Prolog to run queries over the imported db. Among other things, the database contains genomic coordinates for bio entities. For example.

snp(rs367896724).
chr(snp(rs367896724), chr1).
start(snp(rs367896724), 10177).
end(snp(rs367896724), 10177).

%..................

snp(rs768515087).
chr(snp(rs768515087), chry).
start(snp(rs768515087), 26624609).
end(snp(rs768515087), 26624609).

One query I executed on this database is to see if two genomic entities overlap.

overlaps_with(A, B) :-
    chr(A, ChrA),
    chr(B, ChrB),
    start(A, StartA),
    start(B, StartB),
    end(A, EndA),
    end(B, EndB),
    ChrA = ChrB,
    StartB < StartA,
    EndA < EndB.

Executing the above goal overlaps_with(snp(rs367896724), X) takes a rather long time, especially the first time running it.. This is understandable as prolog as to search over the entire genome and there are billions of such entities.

I was wondering if there are suggestions from the SWI-Prolog community on how to index the position info so that queries like the above will run faster?

1 Like

The slow first run is due to just in time indexing. You can (probably) make this a lot faster by reordering. Also, chr(A, ChrA), chr(B, ChrB), ... ChrA = ChrB can be
chr(A, Chr), chr(B, Chr), ... You should move the constraints as early as you can. Thus, StartB < StartA should be right after the two start/2 calls. Next you need to think about the ordering of these chs/start/end calls. As a simple rule of thumb, move the one with the smallest number of expected solutions first.

This reminds me a little of optimizing conjunctions of RDF statements. There we estimate the runtime of each ordering based on properties of each of the calls. Possibly we should do something similar for basic Prolog code …

3 Likes

I have a few tricks for large files of facts:

edge(vname(Signature1, Corpus1, Root1, Path1, Language1),
           Edge,
           vname(Signature2, Corpus2, Root2, Path2, Language2)) :-
    edge(Signature1, Corpus1, Root1, Path1, Language1,
               Edge,
               Signature2, Corpus2, Root2, Path2, Language2).
4 Likes