Making a large Prolog-based knowledge base for bioinformatics/epidemiology

I have read the recent paper:

Advances in Big Data Bio Analytics https://arxiv.org/abs/1909.08254 by @nicos and @jan

and

I now work for ALSPAC (http://www.bristol.ac.uk/alspac/), which is a large longitudinal cohort study that has collected a lot of both 'omic and phenotypic data.

I would like to build an internal tool that could later be opened up to the web in an access-controlled manner. I am imagining a SWISH instance where people could query the data, run some basic analyses, and possibly submit jobs to our compute resource. A typical analysis would be running a Genome Wide Association Study (GWAS). This just means testing millions of associations between genetic SNP variables and a phenotype variable. The construction of the phenotype variable is where Prolog/SWISH could be very useful.

On the 'omics side, most of our data is stored in either the “Oxford” file formats:
https://www.well.ox.ac.uk/~gav/bgen_format/
https://web.archive.org/web/20181010160322/http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html
or the “PLINK” formats (https://www.cog-genomics.org/plink/1.9/formats).

The phenotype data, on the other hand, is mostly stored in PDFs and Stata files.
We have an existing R-based tool to search for phenotype variables (http://variables.alspac.bris.ac.uk/), which is accessible to anyone on the web. There is also an R package that people can use internally to extract the data for given variables (https://github.com/explodecomputer/alspac).
At the moment, if a researcher wanted to perform a GWAS on “hearing”, they would use the search tool to find any variables that have been recorded on hearing and then create a script following some rules for constructing their specific phenotype variable of interest, e.g. thresholding on one ear or the other in some test at some age. Then they would use software such as PLINK or SNPTEST (https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html) to run the GWAS.
We do not have any equivalent of the variable search tool for searching the 'omics data, but it would be useful to be able to query it easily in Prolog. For example: what are the major and minor alleles of a specific SNP, who has which allele for a set of SNPs, or which people carry minor alleles of SNPs in the region of a given gene X.

So I am looking for advice on starting this project on two fronts. One is how to read the large 'omic files for querying in a SWISH instance. Do I need enough RAM to load the files in full, or can I have something that queries the files on disk? I noticed this paper:

Lazy Stream Programming in Prolog https://arxiv.org/abs/1907.11354 by Paul Tarau, @jan and Tom Schrijvers

But I am not sure if it is relevant. I guess I want something as simple as possible that still allows flexible querying. Either way I need to be able to parse the binary file formats, which is not something I am familiar with.

The second piece of advice I am after is how to arrange things so that the data works well with the approach taken by Nicos and Jan in “Big Data Bio Analytics”. I think that loading the bio_db data sets would enhance the ability to query the 'omics data.

Thanks in advance for any thoughts and pointers! It would be really great to collaborate on this idea if anyone would like to. It could potentially be high impact, as ALSPAC is a well regarded and well used resource (https://scholar.google.co.uk/scholar?start=0&q=alspac&hl=en&as_sdt=0,5). If successful, the ideas could be expanded to cover other cohorts such as UK Biobank.

Sam

Hi Sam,

Long time no see.

At one point I was going to add a note here that the OP should talk to you, but then noticed that you were the OP. That caught me off guard, as I thought you would be the one to go to for info.

A few questions first so I understand exactly what you mean. I know enough about what you are talking about on both the Prolog side and the problem domain side to not run away in horror, but not enough that I can read through what you wrote in one reading and have it all make sense.

When you ask how to read, are you asking how Prolog should access the data? It seems that it could be a large amount of data and you are concerned about loading it all into memory, is that correct? If you
gave the total size of the data, that would allow more specific feedback. Are you aware of Quick Load Files? See: Quick Load Files. If you read the details in my post you will notice that in those examples, gigabytes of data are being accessed, and the data is stored on a USB flash drive that you can pick up at the grocery store checkout lane. :smiley:

Also see: Scaling to billions of facts?

If you go the route of storing the data as facts, but want them to be updatable, then you should also look at Solving two consecutive dependent goals from command line, and in particular this post.

As you know, DCGs are your friends. Again, if you look at the example in Quick Load Files you will notice the data is from UniProt; while there is an XML version of the data, I used DCGs to parse the flat file format, which is much closer to what you would need for parsing a binary format.

The one thing that I was surprised to see you not ask was how to parse PDF and PostScript files for data curation. Did you ever notice my post about using DCGs on PostScript?

Feel free to private message me if that works better for you.

Also it feels like we have been walking along similar paths for the last several months and they just crossed again. :smiley:

Hi Sam! The lazy stream processing mostly provides you with a fairly elegant, though in places fairly slow, way to process an infinite stream of data. Notably the older phrase_from_file/3 allows running a DCG on a very large file (or infinite stream) in limited memory. It doesn’t solve the problem of storing it (as clauses). If you have more clauses than your computer can hold, my first choice is typically either a proper database or one of the key-value noSQL stores (BerkeleyDB is packaged, RocksDB is available as a pack).
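For example, a minimal (untested) sketch of that pattern. The grammar here just counts zero bytes, which stands in for real parsing; the point is only that the file is opened in binary mode and processed without loading it as a whole. The file name is made up.

:- use_module(library(pio)).    % phrase_from_file/3

% Toy grammar: count the zero bytes in the input.
count_zero_bytes(N) --> count_zero_bytes(0, N).

count_zero_bytes(N0, N) -->
    [B], !,
    { ( B =:= 0 -> N1 is N0 + 1 ; N1 = N0 ) },
    count_zero_bytes(N1, N).
count_zero_bytes(N, N) --> [].

% ?- phrase_from_file(count_zero_bytes(N), 'my_chr1.bgen', [type(binary)]).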

This is pretty much what SWISH is designed for. Packaging it for SWISH, if you use login and trust your users, mainly implies putting things in modules and making sure it is thread-safe. For public usage it becomes harder, as you need to deal with the sandboxing.

Thanks for the responses, Eric, Jan.

I thought you would be an expert on this by this time.

Ha - I wish!

It seems that it could a large amount of data and you are concerned about loading it all into memory, is that correct?

Yes that is part of it, but it is also the low level detail of reading binary file formats.

Let's maybe concentrate on reading SNP data stored in the Oxford .BGEN format as a first problem to solve (ref: https://www.biorxiv.org/content/biorxiv/early/2018/05/02/308296.full.pdf).

The easy and maybe cheating solution is to use the Rserve integration together with the rbgen R library, but I think I need something that scans through the .BGEN file to answer a query without loading the whole file into memory every time. I have an example .BGEN file which is 3 gigabytes in size, so not crazy big, but I have a few hundred .BGEN files. Alternatively I could write out the BGEN files as plain-text .GEN files and read those, but that takes up a lot more storage space.

Let’s stick to Prolog for the time being.

A BGEN file conceptually stores three values for each SNP (aka variant) for each sample. These are probabilities. Typically we would have one BGEN file per chromosome. There would be between 10,000 and 2,000,000 SNPs in one file, and let's say 1,000-50,000 samples (rarely up to a million).
SNPs/variants are identified by their location.

According to the docs, a BGEN file is organised into a number of blocks. First a header block, which contains some metadata such as the number of samples and SNPs, as well as some other fields. Then a ‘sample identifier block’, which can optionally contain information on the samples. Following this we have “variant data blocks”. Each variant data block contains info on the variant (e.g. its identifier, genomic position, and alleles), followed by the genotype probability data for that variant. I understand that the genotype probability data is encoded in a packed bit representation, which is compressed.

I do not understand how to open binary files and to apply a DCG to them to convert the data to terms to query.

I want to be able to write a predicate that I could query:

% bgen_snp_sample_value(BgenFile, SnpLocation, SampleId, Values)
?- bgen_snp_sample_value("my_chr1.bgen", 10000, 5, probs(P1,P2,P3)).
P1 = 0.05,
P2 = 0.15,
P3 = 0.8.

I have the following skeleton:

bgen_snp_sample_value(BgenFile, Snp, Sample, Value) :-
    open_bgen(BgenFile, snp_sample_value(Snp, Sample, Value), Value).

open_bgen(Filename, Query, Result) :-
    setup_call_cleanup(
        open(Filename, read, StreamIn, [type(binary)]),
        stream_query_result(StreamIn, Query, Result),
        close(StreamIn)).

stream_query_result(_StreamIn, _Query, _Result) :-
    /* How do we define this?
       We want to maybe use a DCG to parse the binary data. */
    true.

/* DCG skeleton */
bgen([HB, SB, VBs]) -->
    headerblock(HB),
    sample_block(SB),
    variantblocks(VBs).

variantblocks([]) --> [].
variantblocks([VB|VBs]) -->
    variantblock(VB),
    variantblocks(VBs).

I guess I would use phrase_from_stream/2. Would I define my DCG over character codes of 1s and 0s? Could someone give an example of opening a basic binary file and running a DCG over it with phrase_from_file?
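To make the question concrete, here is the kind of thing I am trying to write for the header block, cribbed from my reading of the spec (the field names are my own guesses and none of this is tested). My understanding is that with type(binary) the DCG would see byte values 0..255 rather than '0'/'1' characters.

% Little-endian unsigned 32-bit integer from four bytes.
uint32(V) -->
    [B0, B1, B2, B3],
    { V is B0 + (B1 << 8) + (B2 << 16) + (B3 << 24) }.

% Header block as I read the spec: offset of the first variant block,
% header length, number of variant blocks, number of samples,
% magic bytes, free data area, flags.
bgen_header(header(Offset, NVariants, NSamples, Flags)) -->
    uint32(Offset),
    uint32(HeaderLen),
    uint32(NVariants),
    uint32(NSamples),
    [_, _, _, _],                 % magic, "bgen" or four zero bytes
    { FreeLen is HeaderLen - 20 },
    skip_bytes(FreeLen),          % free data area
    uint32(Flags).

skip_bytes(0) --> !.
skip_bytes(N) --> [_], { N1 is N - 1 }, skip_bytes(N1).

% Read the first N bytes of a binary stream into a list of byte codes.
read_n_bytes(_, 0, []) :- !.
read_n_bytes(S, N, [B|Bs]) :-
    get_byte(S, B),
    B >= 0,                       % -1 would mean end of file
    N1 is N - 1,
    read_n_bytes(S, N1, Bs).

% Parse just the header from a prefix of the file (assuming the header
% fits in the first 1024 bytes):
% ?- open('my_chr1.bgen', read, S, [type(binary)]),
%    read_n_bytes(S, 1024, Bytes),
%    close(S),
%    phrase(bgen_header(H), Bytes, _Rest).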

I don’t rule out using a database such as RocksDB, but I think there are many advantages to keeping the data in the .bgen format only and writing Prolog that queries it directly, if that turns out to be possible.


Your approach is different from the approach I have been taking.

Correct me if this is wrong.

In your approach you want to keep the data stored as a BGEN file and then, at query time:
a. Parse the file
b. Convert the output of the parse into something usable by the query.


In my approach
Parse phase
a. Parse the file
b. Store the file as facts

Query phase
a. Load the facts one time and leave in memory
b. Query the facts like any normal query.
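
A rough sketch of that flow (the predicate and file names here are made up, not from my actual code): parse once, write the records out as plain facts, then qcompile/1 them so later loads are quick:

% Parse phase: write one fact per parsed record to a plain Prolog
% source file, then compile that file into a Quick Load File (.qlf).
save_as_facts(FactsFile, Records) :-
    setup_call_cleanup(
        open(FactsFile, write, Out),
        forall(member(snp(Id, Pos, A1, A2), Records),
               format(Out, 'snp(~q,~q,~q,~q).~n', [Id, Pos, A1, A2])),
        close(Out)),
    qcompile(FactsFile).

% Query phase: load the facts once (as I recall, the .qlf is picked up
% automatically when it is newer than the .pl, which is much faster than
% re-consulting the text) and then query them like any other predicate.
% ?- ensure_loaded('snp_facts').
% ?- snp(Id, Pos, A1, A2).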

Now, as I noted in one post, I expect that by the time I am done parsing all of these files there might be several terabytes of data, and storing it all in memory will not be feasible, which is why I looked at using Neo4j, as it can store the data as files and has fast access. The problem with Neo4j is that the query language is not as powerful as Prolog, so I am considering adding a Prolog driver to Neo4j, which would have to be written from scratch.

In the world of bioinformatics databases I find that the data is typically made available in three formats.

  1. A custom app on a web site.
  2. REST query
  3. File dump (flat or XML) released with more than a week between updates.

The reasons I chose to go with parsing the flat files are:

  1. I can put all of the data into facts and once there query it very fast.
  2. The data is not updated often so no sense in re-parsing it multiple times
  3. I can check the data as I parse it. I have actually found a few problems in some of the files, which I reported and which were then corrected. The problems were not on the biology side but on the data organization side. It seems the bio people are great in the lab and not so good with computers.
  4. I don’t need a public interface as it is only for my own use at present.

So what I don’t understand is: is there a requirement that the files be parsed each time a query is run? If so, can you share it?


Sent you a private e-mail, not a private message, with Prolog source code for parsing binary files and the test file that is used by the test cases.

The code is like a library of predicates and is used as such.

get_string_helper(Stream,Block_type,Bytes0,Block_data_size0,Address0,Address,String) :-
    Block_type \== 16'FFFF,
    get_word(Stream,Address0,Address1,_String_size_or_end),         % 16'0000 or 16'FFFF
    get_bytes(Stream,Address1,Address2,Block_data_size0,Bytes0,Bytes1),
    get_word(Stream,Address2,Address3,16'9000),
    get_word(Stream,Address3,Address4,Block_data_size),
    get_word(Stream,Address4,Address5,Continue_or_end_block),
    get_string_helper(Stream,Continue_or_end_block,Bytes1,Block_data_size,Address5,Address,String).

The predicates that return usable information need to pass along the address as it is incremented after each read.

HTH.


I have to go now, but will take a look at this in much more detail later.

You also have another option:

  • Write a small foreign package in C (or C++) that exposes the rbgen library API that you mentioned as Prolog foreign predicates
  • The C shim library would be as small as possible, its only objective being to expose the rbgen API.

This has been done many times to integrate SWI-Prolog with external libraries.

Pros

  • Very fast
  • Probably would require little maintenance if rbgen doesn’t change often
  • Bug fixes and improvements to the rbgen library would be easily available in Prolog
  • Makes the rbgen API available in Prolog, so you don’t have to re-write the rbgen functionality in Prolog
  • The SWI-Prolog foreign C interface is very stable, mature and
    used in many many projects and packages
  • This way you create a Prolog ‘rbgen’ package that can be reused in many different projects.

Cons

  • You need to write C/C++ code (but only a shim that connects to the rbgen library)
  • Debugging problems with foreign C code is harder
  • Deploying to many different (possibly unknown) machines is a little bit more difficult than with plain Prolog code

Given the size of your files and data, and the fact that rbgen is available, I would suggest this approach if you know how to code in C/C++.

NOTE: the documentation for the SWI-Prolog foreign library interface is here: https://www.swi-prolog.org/pldoc/man?section=foreign
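
On the Prolog side the shim would then be loaded and used roughly like this (all names here are made up for illustration; bgen4pl would be the compiled C/C++ shim built against the rbgen library):

:- module(rbgen, [bgen_open/2, bgen_variant/3, bgen_close/1]).

% Load the compiled shim (bgen4pl.so / bgen4pl.dll) from the foreign
% library search path.  The shim's install function registers the
% predicates below with PL_register_foreign().
:- use_foreign_library(foreign(bgen4pl)).

% bgen_open(+File, -Handle)                 open a .bgen file, return an opaque handle
% bgen_variant(+Handle, +Position, -Probs)  genotype probabilities for a variant
% bgen_close(+Handle)                       release the handle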


And, if there is a shared object/dll and C .h files, you might be able to get it connected using the ffi pack without writing a line of C.

Dear Sam,

Looks like a nice application. Hope it all goes well for you (and Prolog).
Keep in mind that there is nothing here that cannot be programmed in other ways,
but Prolog does have advantages (in my opinion).

These are actually two pieces of advice you are seeking above (let's say 1A and 1B).

1A. Use of SWISH.
Jan is the best placed person to answer this.
In my experience SWISH is not ready for public access to resources (yet).
If you need control, it is likely to save you a lot of grief to just write
a web server in SWI-Prolog. And of course you know all about Pengines…
If everyone accessing it is within a single network, then SWISH
becomes more appealing.

1B. No, you don’t need to load everything into memory.

i) User “swi” already indicated a way to do that:
write a C wrapper. You can look into the source code
of Real for an example of a similar tool.

You can avoid the C wrapper if you can use a command-line interface
via process_create/3, but I wouldn’t recommend this.

Btw, if the underlying tools are accessible via R, you can also use Real to
make a much simpler wrapper.

ii) Use the bio_db approach.
Map your data into Prolog predicates,
then convert the predicates to one of the backends: RocksDB, SQLite, Berkeley DB.
You can then dispose of the textual representation of the data.

bio_db will do all the wrapping for you. Which backend you choose is irrelevant
to how the data are represented within Prolog.
Once you have your data tables and typical queries, you can experiment with which
backend gives you the best performance (time and space).

The order I gave above was the best in terms of retrieval for the experiments reported
in the PPDP ’17 paper, “Accessing biological data as Prolog facts”.

https://dl.acm.org/citation.cfm?id=3131857
(let me know if you need a pdf version of the paper).

bio_db has become very modular and flexible. You can use its infrastructure
easily once you are familiar with it. Currently it has data from human and mouse dbs;
you can create a new sub-directory cell/gwas or cell/smp or cell/alspac
and have a local version of bio_db that takes advantage of the infrastructure.

I have to warn you, loading of code is taken to the limit of what is acceptable…
bio_db uses pack(lib) (and was the driving force for some of its features).

Yes.
And as I mentioned above, bio_db is (very) extensible.
If you identify bio databases that augment your data,
you can easily hack the scripts (in directory auxil/build_repo) to
incorporate them into your local version of bio_db.

A good exercise to that end is to try to build the whole of bio_db_repo
(the data portion of bio_db)
by running auxil/build_repo/std_repo.pl.

Hope it helps.
Happy to discuss further.
Unfortunately, I am tied up with teaching till December, but in theory I should be
more research-active in January.

Good luck,

Nicos


Nicos Angelopoulos

I’m not really sure what makes you think SWISH isn’t ready to expose resources. I think it very much is. The experience connecting bio_db wasn’t great, though; the two are largely incompatible. SWISH requires code that is thread-safe and compatible with the sandboxing. bio_db depends on Real rather than Rserve, and although the two have similar APIs, there are subtle differences. SWISH can’t be compatible with Real because Real talks directly to R, and R cannot manage multiple sessions and isn’t thread-safe. Rserve fixes that, but comes with some other (I think manageable) limitations. bio_db also tends to take control of the Prolog environment, downloading stuff on demand and using a config system that reads files and flags in several places and has a lot of hooks and indirections that make it rather hard to work with in SWISH. But then, these limitations come from Pengines, and Sam knows that part :slight_smile:

That probably won’t work, because the bgen library is written in C++:

BGEN reference implementation
This repository contains a reference implementation of the BGEN format, written in C++. The library can be used as the basis for BGEN support in other software, or as a reference for developers writing their own implementations of the BGEN format.

Indeed. The ffi package doesn’t talk to C++. Some good examples of C++ embedding are the rocksdb and HDT packages. The good news is that the C++ interface typically requires less typing, as it deals largely automatically with data conversion, cleanup and forwarding exceptions from C++ to Prolog.


I found the file format for BGEN, but then found out that the data is not free. I was going to write a DCG to parse the data, but I need some examples to test on; can you provide or link to example data? :smiley:

Thanks for the suggestions so far by everyone. I will try and digest.

@EricGT I think there are examples in the C++ code here: https://bitbucket.org/gavinband/bgen/src/default/example/
If you want to explore.

There is an example file that uses the API to convert BGEN to VCF: https://bitbucket.org/gavinband/bgen/src/default/example/bgen_to_vcf.cpp

I have not really done any programming in C or C++ since intro to programming 15 years ago, so it’s a little intimidating for me to understand.

PS: thanks for the email as well, I will have a look at that too.


I am in the same boat.

Annoyingly that link has been down for a while, but you can use the Wayback Machine: https://web.archive.org/web/20181010160322/http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html

It might be worth your while to fly to Amsterdam and have a hack session with Jan.

Btw, there is an R package for bgen (the rbgen library you mentioned above),

so you might be able to use Rserve/Real.

At the very least you can have a look at their code and see what a C(++) interface
would look like for bgen.

Nicos Angelopoulos

So what I don’t understand is: is there a requirement that the files be parsed each time a query is run? If so, can you share it?

You are right that there are two ways to approach this: parse once, or parse for each query. My thinking so far is that if it is fast enough to parse and search through the files for typical queries, then that might make sense, as I would not have to keep another copy of the data either as Prolog facts or in a database. It could also be a useful standalone tool for people to query a BGEN file with Prolog. But I am also thinking that fully understanding the BGEN format and writing the Prolog parsing code might be spending effort on something that is not the most efficient use of my brain space for the whole project! If I just want to parse once, I could use a program that converts a BGEN file into a plain-text .GEN file and then simply parse the plain text. One drawback is that these files are typically 10 times larger.
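
If I do go the convert-to-text route, parsing a .GEN line at least looks simple. A quick sketch (untested, and assuming the usual layout of SNP id, rsid, position, the two alleles, then three probabilities per sample):

% Parse one line of a text .GEN file into a term
% gen(SnpId, RsId, Pos, AlleleA, AlleleB, Probs), where Probs is a
% list of prob(AA, AB, BB) triples, one per sample.
gen_line(Line, gen(SnpId, RsId, Pos, A, B, Probs)) :-
    split_string(Line, " ", " ", [SnpId, RsId, PosS, A, B | Rest]),
    number_string(Pos, PosS),
    maplist(string_to_number, Rest, Numbers),
    triples(Numbers, Probs).

string_to_number(S, N) :- number_string(N, S).

triples([], []).
triples([P1, P2, P3 | Ns], [prob(P1, P2, P3) | Ps]) :-
    triples(Ns, Ps).

% ?- gen_line("SNP1 rs1 10000 A G 0.05 0.15 0.8", G).
% G = gen("SNP1", "rs1", 10000, "A", "G", [prob(0.05, 0.15, 0.8)]).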

Is this the entire format for BGEN?

If so, then I can knock it out soon, as I have been doing this kind of parsing for the last several months. I will be busy with a large family function for the next few days, so screen time will be very limited, and thus it will take several days instead of a day or so.

Also, if you do want to use this, then when a problem with a specific file comes up, would you be able to send me the specific file, or at least the section that is causing the problems, so that I can correct the DCG code? :smiley:

I think so, but there might also be a version 2; I will try and find out.
I will also find a “real” example .BGEN that I can share. You can also make .BGEN files with qctool (https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/converting.html) by first making a .GEN file and converting it.

There are some more examples here, where a Python wrapper is made: https://github.com/lemieuxl/pybgen