I’ve imported a large genomic database into SWI-Prolog so that I can run queries over it. Among other things, the database contains genomic coordinates for biological entities. For example:
snp(rs367896724).
chr(snp(rs367896724), chr1).
start(snp(rs367896724), 10177).
end(snp(rs367896724), 10177).
%..................
snp(rs768515087).
chr(snp(rs768515087), chry).
start(snp(rs768515087), 26624609).
end(snp(rs768515087), 26624609).
One query I run on this database checks whether two genomic entities overlap:
% Succeeds when A lies strictly inside B on the same chromosome.
overlaps_with(A, B) :-
    chr(A, ChrA),
    chr(B, ChrB),
    start(A, StartA),
    start(B, StartB),
    end(A, EndA),
    end(B, EndB),
    ChrA = ChrB,
    StartB < StartA,
    EndA < EndB.
Executing the above goal overlaps_with(snp(rs367896724), X) takes a rather long time, especially the first time it is run. This is understandable, as Prolog has to search over the entire genome and there are billions of such entities.
I was wondering if there are suggestions from the SWI-Prolog community on how to index the position info so that queries like the above will run faster?
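To make the question concrete, here is a minimal sketch of the kind of position index I have in mind: coordinates are grouped into fixed-size bins so that a query only inspects entities sharing a bin with A. Everything beyond the chr/2, start/2 and end/2 facts is an assumption made for illustration: bin/3, bin_size/1, build_bins/0, overlaps_with_binned/2 and the gene(g1) entry are hypothetical names, and the bin size of 100000 is arbitrary. It may well not be the best approach for billions of entities, since the bin table itself has to fit in memory.

```prolog
% Toy facts in the same shape as the database above.
% gene(g1) is a hypothetical entity, added only so the query has an answer.
chr(snp(rs367896724), chr1).
chr(gene(g1), chr1).
start(snp(rs367896724), 10177).
start(gene(g1), 10000).
end(snp(rs367896724), 10177).
end(gene(g1), 12000).

:- dynamic bin/3.

% Assumed tuning parameter: width of one bin in base pairs.
bin_size(100000).

% Assert bin(BinNo, Chr, Entity) for every bin an entity touches.
% The bin number comes first so that plain first-argument indexing
% already narrows the candidates.
build_bins :-
    retractall(bin(_, _, _)),
    bin_size(Size),
    forall(( chr(E, Chr), start(E, S), end(E, En) ),
           ( Lo is S // Size,
             Hi is En // Size,
             forall(between(Lo, Hi, BinNo),
                    assertz(bin(BinNo, Chr, E)))
           )).

% Same test as overlaps_with/2, but the candidates for B come only
% from the bins that A touches on A's chromosome.
overlaps_with_binned(A, B) :-
    chr(A, Chr),
    start(A, StartA),
    end(A, EndA),
    bin_size(Size),
    Lo is StartA // Size,
    Hi is EndA // Size,
    findall(B0, ( between(Lo, Hi, BinNo), bin(BinNo, Chr, B0) ), Bs0),
    sort(Bs0, Bs),              % drop duplicates from entities spanning bins
    member(B, Bs),
    start(B, StartB),
    StartB < StartA,
    end(B, EndB),
    EndA < EndB.
```

With the toy facts above, running build_bins once and then querying overlaps_with_binned(snp(rs367896724), X) gives X = gene(g1) as the only solution. Because any B that contains A must cover every bin A touches, restricting the search to A's bins should not lose answers for this particular test.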