I have read the recent paper:
Advances in Big Data Bio Analytics https://arxiv.org/abs/1909.08254 by @nicos and @jan
I now work for ALSPAC http://www.bristol.ac.uk/alspac/ which is a large longitudinal cohort study that has collected a lot of both 'omic and phenotypic data.
I would like to build an internal tool that could later be opened up to the web in an access-controlled manner. I am imagining a SWISH instance where people could query the data, run some basic analyses and possibly submit jobs to our compute resource. A typical analysis would be running a Genome Wide Association Study (GWAS). This just means testing millions of associations between genetic SNP variables and a phenotype variable. The construction of the phenotype variable is where Prolog/SWISH could be very useful.
On the 'omics side, most of our data is stored in either "Oxford" file formats:
https://www.well.ox.ac.uk/~gav/bgen_format/
https://web.archive.org/web/20181010160322/http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html
or "plink" (https://www.cog-genomics.org/plink/1.9/formats) formats.
The phenotype data, by contrast, is mostly stored in PDFs and Stata files.
We have an existing R-based tool to search for phenotype variables, http://variables.alspac.bris.ac.uk/, which is accessible to anyone on the web. There is also an R package that people can use internally to extract the data for given variables: https://github.com/explodecomputer/alspac.
At the moment, if a researcher wanted to perform a GWAS on "hearing", they would use the search tool to find any variables that have been recorded on hearing and then write a script that follows some rules for constructing their specific phenotype variable of interest, e.g. thresholding on one ear or the other in some test at some age. Then they would use software such as plink or snptest (https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html) to run the GWAS.
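To make the phenotype construction step concrete, here is a minimal Python sketch of the kind of rule I mean. Everything in it is made up for illustration: the 20 dB threshold, the function name `hearing_case`, the participant IDs and the measurements are not real ALSPAC variables or definitions.

```python
from typing import Optional

# Hypothetical threshold; a real definition would come from the study protocol.
THRESHOLD_DB = 20.0


def hearing_case(left_db: Optional[float], right_db: Optional[float]) -> Optional[int]:
    """Return 1 if any measured ear exceeds the threshold, 0 if every
    measured ear is at or below it, and None if neither ear was measured."""
    measured = [v for v in (left_db, right_db) if v is not None]
    if not measured:
        return None
    return int(any(v > THRESHOLD_DB for v in measured))


# Made-up records: participant id -> (left ear dB, right ear dB).
records = {
    "p1": (25.0, 10.0),
    "p2": (10.0, None),
    "p3": (None, None),
}

# The derived phenotype variable that would be handed to plink or snptest.
phenotype = {pid: hearing_case(left, right) for pid, (left, right) in records.items()}
```

Treating a single measured ear as sufficient is just one possible choice; the point is that each researcher encodes rules like this by hand today.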
We do not have any equivalent of the variable search tool for searching the 'omics data, but it would be useful to be able to query it easily in Prolog. For example: what are the major and minor alleles of a specific SNP, who has which allele for a set of SNPs, or which people have minor alleles of SNPs in the region of some gene X?
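As a rough illustration of those queries, here is a small Python sketch over toy in-memory tables. The SNP IDs, positions, alleles, person IDs and the function names (`alleles`, `minor_carriers`) are all invented for the example; in Prolog I would expect the two tables to become facts and the functions to become simple rules.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    chrom: str
    pos: int
    major: str
    minor: str


# Made-up SNP annotations keyed by SNP id.
SNPS = {
    "rs_a": Snp("1", 1000, "A", "G"),
    "rs_b": Snp("1", 1500, "C", "T"),
    "rs_c": Snp("2", 500, "G", "A"),
}

# Made-up genotypes: (person, snp) -> number of minor alleles carried (0, 1 or 2).
MINOR_COUNT = {
    ("p1", "rs_a"): 0, ("p1", "rs_b"): 1, ("p1", "rs_c"): 2,
    ("p2", "rs_a"): 1, ("p2", "rs_b"): 0, ("p2", "rs_c"): 0,
    ("p3", "rs_a"): 0, ("p3", "rs_b"): 0, ("p3", "rs_c"): 1,
}


def alleles(snp_id: str) -> tuple:
    """Major and minor allele of one SNP."""
    snp = SNPS[snp_id]
    return (snp.major, snp.minor)


def minor_carriers(chrom: str, start: int, end: int) -> list:
    """People carrying at least one minor allele of any SNP in the region."""
    in_region = {
        sid for sid, snp in SNPS.items()
        if snp.chrom == chrom and start <= snp.pos <= end
    }
    return sorted({
        person for (person, sid), count in MINOR_COUNT.items()
        if sid in in_region and count > 0
    })
```

A gene-based query would then only need a lookup from gene name to chromosome and coordinates, which is the sort of thing I hope bio_db could supply.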
So I am looking for advice on starting this project on two fronts. The first is how to read the large 'omic files for querying in a SWISH instance. Do I need to have so much RAM that I can load the files in, or can I have something that queries the files on disk? I noticed this paper:
Lazy Stream Programming in Prolog https://arxiv.org/abs/1907.11354 by Paul Tarau, @jan and Tom Schrijvers
But I am not sure if it is relevant. I guess I want something that is as simple as possible but still allows flexible querying. Either way, I need to be able to parse the binary file formats, which is not something I am familiar with.
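For what it is worth, here is my current understanding of on-disk access for the simplest of these formats, PLINK 1 `.bed`, as a Python sketch. As far as I can tell from the PLINK format documentation, the file is a 3-byte header followed by fixed-size blocks of 2-bit genotype codes, one block per variant, so a single variant can be fetched with a seek rather than loading the whole file. The function name `read_variant` and the tiny demo file are my own inventions, and the number of samples would have to come from the accompanying `.fam` file.

```python
import os
import tempfile

# Header of a PLINK 1 .bed file in variant-major mode.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# 2-bit genotype code -> copies of the first (.bim A1) allele; None = missing.
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def read_variant(path, variant_index, n_samples):
    """Read one variant's genotypes by seeking straight to its block."""
    bytes_per_variant = (n_samples + 3) // 4  # four samples per byte
    with open(path, "rb") as f:
        if f.read(3) != BED_MAGIC:
            raise ValueError("not a variant-major PLINK 1 .bed file")
        f.seek(3 + variant_index * bytes_per_variant)
        block = f.read(bytes_per_variant)
    if len(block) != bytes_per_variant:
        raise ValueError("variant index out of range")
    counts = []
    for i in range(n_samples):
        # Samples are packed from the low-order bits of each byte upwards.
        code = (block[i // 4] >> (2 * (i % 4))) & 0b11
        counts.append(CODE_TO_A1_COUNT[code])
    return counts


# Tiny hand-built demo file: 5 samples, 2 variants (values are made up).
with tempfile.NamedTemporaryFile(suffix=".bed", delete=False) as tmp:
    tmp.write(BED_MAGIC + bytes([0x78, 0x00, 0x8F, 0x01]))
demo_first = read_variant(tmp.name, 0, 5)
demo_second = read_variant(tmp.name, 1, 5)
os.remove(tmp.name)
```

The bgen format is compressed and more involved, so I assume it would need a proper library or a foreign interface rather than a few lines like this.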
The second is how to structure things so that the data works well with the approach by Nicos and Jan in "Big Data Bio Analytics". I think that loading the bio_db data sets would enhance the ability to query the 'omics data.
Thanks in advance for any thoughts and pointers! It would be really great to collaborate on this idea if anyone would like to. It could potentially be high impact, as ALSPAC is a well regarded and well used resource (https://scholar.google.co.uk/scholar?start=0&q=alspac&hl=en&as_sdt=0,5). If successful, the ideas could be expanded to cover other cohorts of data such as UK Biobank.
Sam