I’m using: SWI-Prolog version 8.1.19 for x86_64-linux
I want the code to: filter an arbitrary amount of data, from standard input to standard output, in constant memory, using DCGs.
But what I’m getting is: the code I wrote seems clumsy. Is there a cleaner way?
My code looks like this: sorry, some background/motivation first. You don’t need to read it, feel free to jump to the end, where it says “The client code” for the Prolog code relevant to my question.
Background and motivation
Prolog is not the popular choice for this kind of task. As I have ranted before, Python (and previously Perl) is what is popular. This is why I feel all this detail is relevant.
Why constant memory?
In “bioinformatics” or “computational biology” or whatever you want to call it, we end up with a lot of data. Tens and hundreds of GB seem to be the new normal. See, for example, gnomAD. Check out the file sizes under the gnomAD v3 tab.
gnomAD v3 is an exceptionally large database but still, things regularly don’t fit in main memory.
Why DCG?
The data itself is such that using DCGs is quite nice. Here is a simplified (made up) version of a real task I had:
Take as input an arbitrary number of fasta records and give me the reverse complements, in the same format.
I am taking this example because it requires reading a typical bioinformatics “structured text” format and writing the same format.
The “FASTA format” is a line starting with “>”, followed by an identifier; then, a sequence of either nucleotides or amino acids, usually with line breaks at 70 or 80 columns. You can have many records one after the other, so for example (warning, fake data, trimmed!):
>NR_118889.1 Amycolatopsis azurea strain NRRL 11412 16S ribosomal RNA, partial sequence
GGTCTNATACCGGATATAACAACTCATGGCATGGTTGGTAGTGGAAAGCTCCGGCGGTACGGGATGAGCCCGCGGCCTAT
CAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGACCGGCNACACTGGGACTG
AGACACGGCCNAGACTCCTACNGGAGGNAGCAGTGGGGAATATTGCNCAATGGGCGAAAGCCTNATGCAGCGACGCCGCG
TGAGGGATGACGGCTTCGGGTTGTA
>NR_118899.1 Actinomyces bovis strain DSM 43014 16S ribosomal RNA, partial sequence
GGGTGAGTAACACGTGAGTAACCTGCCCCNNACTTCTGGATAACCGCTTGAAAGGGTNGCTAATACGGGATATTTTGGCC
TGCTCGCATGGGTGGGTTTGGAAAGGTTCTTTTTCTGGTTGGGGATGGGCTCGCGGCCTATCAGCTTGTTGGTGGGGTGA
TGGCCTACCAAGGCTTTGACGGGTAGCCGGCCTGAGAGGGTGGA
Why Prolog?
So, to reverse-complement it, I’d have to read each “sequence”, split over multiple lines, then reverse and complement it, then print it again, line-breaking at 80 columns. It is straightforward and it can definitely be done using “standard command-line tools” like Awk, but it gets very tiresome. It is usually not as simple anyway.
Why not use an existing tool?
There are many tools out there. However:
- Sometimes they do a thousand things, but not exactly the one thing you need. This happens regularly.
- Often the data is dirty. Sometimes the tools that produce it are quirky, sometimes the format itself is too permissive.
- The tools are often written in Python (sometimes Python 2, sometimes Python 3) by bioinformaticians. They work fine for their authors’ data, but on large data they are sometimes prohibitively slow, sometimes too memory-hungry. The tools have their own quirks. Code quality varies wildly (Samtools is an outlier; it is actually great software).
My solution
I have split it into two parts. The “clean” library code is obvious and straightforward (I think). This is not what the question is about.
Library code
The fasta reading and writing:
$ cat fasta.pl
:- module(fasta, [
        fasta_record//1,
        generate_fasta_record//1
   ]).

:- use_module(library(dcg/basics)).

fasta_record(fasta(Descr, Seq)) -->
    ">", string_without("\n", D), "\n",
    { atom_codes(Descr, D)
    },
    fasta_sequence(Seq).

fasta_sequence(Seq) -->
    string_without(">", Seq0),
    { exclude(==(0'\n), Seq0, Seq)
    }.

generate_fasta_record(R) -->
    generate_fasta_record(R, 80).

generate_fasta_record(fasta(Descr, Seq), N) -->
    ">", atom(Descr), "\n",
    generate_fasta_sequence(Seq, N, N).

generate_fasta_sequence([], _, _) --> "\n".
generate_fasta_sequence([X|Xs], N1, N) -->
    (   { succ(N0, N1) }
    ->  [X],
        generate_fasta_sequence(Xs, N0, N)
    ;   "\n",
        generate_fasta_sequence([X|Xs], N, N)
    ).
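For what it’s worth, here is a throwaway round-trip check I can run from the toplevel (not part of the library; `roundtrip/0` and the record contents are made up for illustration; backquotes denote code lists at SWI-Prolog’s default flags):

```prolog
:- use_module(fasta).

% Parse a two-line record, then regenerate it with lines broken at
% 4 columns; the output should reproduce the input exactly.
roundtrip :-
    Input = `>id1 fake record\nGGTC\nAACC\n`,
    phrase(fasta_record(R), Input),
    phrase(generate_fasta_record(R, 4), Output),
    Input == Output.
```

The column width 4 is chosen so that the tiny test sequence exercises the line-breaking branch of generate_fasta_sequence//3.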
and the code for complementing the “IUPAC ambiguity codes” as found here:
$ cat iupac.pl
:- module(iupac, [iupac_meaning/2, iupac_complement/2]).
/* snip */
iupac_complement(0'A, 0'T).
iupac_complement(0'C, 0'G).
iupac_complement(0'G, 0'C).
iupac_complement(0'T, 0'A).
iupac_complement(0'M, 0'K).
iupac_complement(0'R, 0'Y).
iupac_complement(0'W, 0'W).
iupac_complement(0'S, 0'S).
iupac_complement(0'Y, 0'R).
iupac_complement(0'K, 0'M).
iupac_complement(0'V, 0'B).
iupac_complement(0'H, 0'D).
iupac_complement(0'D, 0'H).
iupac_complement(0'B, 0'V).
iupac_complement(0'N, 0'N).
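As a quick sanity check of the table (a throwaway goal; `GGTC` is made-up data, and again backquotes are code lists):

```prolog
:- use_module(iupac).

% The complement of GGTC is CCAG; read backwards, the reverse
% complement is GACC.
revcomp_check :-
    reverse(`GGTC`, Rev),
    maplist(iupac_complement, Rev, `GACC`).
```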
The client code
My main issue is that the reading/parsing and generating/writing are completely entangled:
$ cat revcomp.pl
:- use_module(fasta).
:- use_module(iupac).

main(_) :-
    phrase_from_stream(fasta_revcomp, current_input).

fasta_revcomp -->
    fasta_record(fasta(Descr, Seq)),
    { reverse(Seq, Rev),
      maplist(iupac_complement, Rev, RevCompl),
      phrase(generate_fasta_record(fasta(Descr, RevCompl)), Codes),
      format(current_output, "~s", [Codes])
    },
    !,
    fasta_revcomp.
fasta_revcomp --> [].
I have defined the following:
swipipe() {
    swipl \
        -g "current_prolog_flag(argv,Args),main(Args)" \
        -g halt \
        "$1" -- "${@:2}"
}
… and this is how I call it:
swipipe revcomp.pl < input.fasta > output.fasta
Some observations:
- I need the cut. Without it, on a ~30MB file, with ~20K records, I run out of stack.
- It takes less than a minute on my laptop; I am happy with the performance.
I have been using variations of this setup many times now: the bash function + the main/1 predicate + entangled reading and writing.
The Question(s)
- is there something obviously wrong with the approach?
- is there an obvious improvement?
- is it worth codifying this somehow, or is this a really niche use case?
An orthogonal concern is that unintended choice points are effectively errors. I know how to catch them early during development, but a “Prolog script as a shell pipe” setup would have to include that check, too.