Using DCGs for reading and writing in shell "pipes"

Boris · December 26, 2019, 2:31pm

I’m using: SWI-Prolog version 8.1.19 for x86_64-linux

I want the code to: filter an arbitrary amount of data, from standard input to standard output, in constant memory, using DCGs.

But what I’m getting is: the code I wrote seems clumsy. Is there a cleaner way?

My code looks like this: sorry, some background/motivation first. You don’t need to read it, feel free to jump to the end, where it says “The client code” for the Prolog code relevant to my question.

Background and motivation

Prolog is not the popular choice for this kind of tasks. As I have ranted before, Python (and previously Perl) is what is popular. This is why I feel all this detail is relevant.

Why constant memory?

In “bioinformatics” or “computational biology” or whatever you want to call it, we end up with a lot of data. Tens and hundreds of GB is the new normal it seems. See, for example, gnomAD. Check out the file sizes under the gnomAD v3 tab.

gnomAD v3 is an exceptionally large database but still, things regularly don’t fit in main memory.

Why DCG?

The data itself is such that using DCGs is quite nice. Here is a simplified (made up) version of a real task I had:

Take as input an arbitrary number of fasta records and give me the reverse complements, in the same format.

I am taking this example because it requires reading a typical bioinformatics “structured text” format and writing the same format.

The “FASTA format” is a line starting with “>”, followed by an identifier; then, a sequence of either nucleotides or aminoacids, usually with line breaks at 70 or 80 columns. You can have many records one after the other, so for example (warning, fake data, trimmed!):

>NR_118889.1 Amycolatopsis azurea strain NRRL 11412 16S ribosomal RNA, partial sequence
GGTCTNATACCGGATATAACAACTCATGGCATGGTTGGTAGTGGAAAGCTCCGGCGGTACGGGATGAGCCCGCGGCCTAT
CAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGTGACCGGCNACACTGGGACTG
AGACACGGCCNAGACTCCTACNGGAGGNAGCAGTGGGGAATATTGCNCAATGGGCGAAAGCCTNATGCAGCGACGCCGCG
TGAGGGATGACGGCTTCGGGTTGTA
>NR_118899.1 Actinomyces bovis strain DSM 43014 16S ribosomal RNA, partial sequence
GGGTGAGTAACACGTGAGTAACCTGCCCCNNACTTCTGGATAACCGCTTGAAAGGGTNGCTAATACGGGATATTTTGGCC
TGCTCGCATGGGTGGGTTTGGAAAGGTTCTTTTTCTGGTTGGGGATGGGCTCGCGGCCTATCAGCTTGTTGGTGGGGTGA
TGGCCTACCAAGGCTTTGACGGGTAGCCGGCCTGAGAGGGTGGA

Why Prolog?

So, to reverse-complement it, I’d have to read each “sequence”, split over multiple lines, then reverse it and complement it, then again print it, line-breaking at 80 again. It is straight-forward and it definitely can be done using “standard command-line tools” like Awk, but it gets very tiresome. It is usually not as simple anyway.

Why not use an existing tool?

There are many tools out there. However:

Sometimes they do thousand things, but not exactly the one thing what you need. This happens regularly.
Often the data is dirty. Sometimes the tools that produce it are quirky, sometimes the format itself is too permissive.
The tools are often written in Python (sometimes Python2, sometimes Python3) by bioinformaticians. Works fine for their data but on large data, it is sometimes prohibitively slow, sometimes too memory-hungry. The tools have their own quirks. Code quality varies wildly (Samtools is an outlier, it is actually great software).

My solution

I have split it in two parts. The “clean” library code is obvious and straight-forward (I think). This is not what the question is about.

Library code

The fasta reading and writing:

$ cat fasta.pl 
:- module(fasta, [
        fasta_record//1,
        generate_fasta_record//1 ]).
:- use_module(library(dcg/basics)).

fasta_record(fasta(Descr, Seq)) -->
    ">", string_without("\n", D), "\n",
    {   atom_codes(Descr, D)
    },
    fasta_sequence(Seq).

fasta_sequence(Seq) -->
    string_without(">", Seq0),
    {   exclude(==(0'\n), Seq0, Seq)
    }.

generate_fasta_record(R) -->
    generate_fasta_record(R, 80).

generate_fasta_record(fasta(Descr, Seq), N) -->
    ">", atom(Descr), "\n",
    generate_fasta_sequence(Seq, N, N).

generate_fasta_sequence([], _, _) --> "\n".
generate_fasta_sequence([X|Xs], N1, N) -->
    (   { succ(N0, N1) }
    ->  [X],
        generate_fasta_sequence(Xs, N0, N)
    ;   "\n",
        generate_fasta_sequence([X|Xs], N, N)
    ).

and the code for complementing the “IUPAC ambiguity codes” as found here:

$ cat iupac.pl
:- module(iupac, [iupac_meaning/2, iupac_complement/2]).

/* snip */

iupac_complement(0'A, 0'T).
iupac_complement(0'C, 0'G).
iupac_complement(0'G, 0'C).
iupac_complement(0'T, 0'A).
iupac_complement(0'M, 0'K).
iupac_complement(0'R, 0'Y).
iupac_complement(0'W, 0'W).
iupac_complement(0'S, 0'S).
iupac_complement(0'Y, 0'R).
iupac_complement(0'K, 0'M).
iupac_complement(0'V, 0'B).
iupac_complement(0'H, 0'D).
iupac_complement(0'D, 0'H).
iupac_complement(0'B, 0'V).
iupac_complement(0'N, 0'N).

The client code

My main issue is that the reading/parsing and generating/writing are completely entangled:

$ cat revcomp.pl 
:- use_module(fasta).
:- use_module(iupac).

main(_) :-
    phrase_from_stream(fasta_revcomp, current_input).

fasta_revcomp -->
    fasta_record(fasta(Descr, Seq)),
    {   reverse(Seq, Rev),
        maplist(iupac_complement, Rev, RevCompl),
        phrase(generate_fasta_record(fasta(Descr, RevCompl)), Codes),
        format(current_output, "~s", [Codes])
    },
    !,
    fasta_revcomp.
fasta_revcomp --> [].

I have defined the following:

swipipe() {
    swipl \
        -g "current_prolog_flag(argv,Args),main(Args)" \
        -g halt \
        "$1" -- "${@:2}"
}

… and this is how I call it:

swipipe revcomp.pl < input.fasta > output.fasta

Some observations:

I need the cut. Without it, on a ~30MB file, with ~20K records, I run out of stack.
It takes less than a minute on my laptop; I am happy with the performance.

I have been using variations of this setup many times now: the bash function + the main/1 predicate + entangled reading and writing.

The Question(s)

is there something obviously wrong with the approach?
is there an obvious improvement?
is it worth codifying this somehow, or is this a really niche use case?

An orthogonal concern is that unintended choice points are actually errors. I know how to catch them early during development, but “Prolog script as a shell pipe” would have to include that check too.

EricGT · December 26, 2019, 3:17pm

Yes, I think others here who work with Bioinformatics data would gladly like a Prolog library for such data, I know I would and am working toward the same. Some of this work can currently be found in the SWI-Prolog packages. However one of the problems I find is that there are so many different aspects of the data that specific labs focus on, that I have yet to find one comprehensive library that can access all of the data formats (databases), interchange the data formats as necessary, and allow for querying and using the data as needed. It seems that Representational state transfer is one of the few standards in this area that is emerging.

No. I don’t personally see it that way.

Yes I see that often with my DCGs. If the data format is deterministic then the DCG should be deterministic and is often one of the main reasons my parsing runs are not as fast as I expect.

While I am aware of Runtime non-failure checker for SWI-Prolog I have not had the time to use it with my DCGs.

I don’t find that odd in Prolog since many predicates strive to work in both directions. I am actually surprised to see both fasta_record//1 and generate_fasta_record//1. I would have expected one predicate with 2 arguments and modes (+,-) and (-,+) for the fasta_sequence and perhaps another similar predicate that converts the text data to a fasta_sequence and a fasta_sequence to a text data.

Anyway, that is my personal view with no ill-will, fell free to ignore any and all of it.

jan · December 26, 2019, 3:50pm

Below is my version. This doesn’t need any shell functions. Surely, as fasta_record//1 seems something reusable, I’d write it as a multi-moded predicate and put it in a library. My experience writing real multi-moded DCGs are not great. They are typically clumsy and highly non-deterministic, which is great for small stuff, but typically not for processing large data sets.

Note that use less high level stuff. It doesn’t make it any better in this case IMO. Also note the use of initialization/2 using main as class. You can use the below code as a complete script without the need for a shell. If you want to debug it, run swipl -l script.pl and it will only load the code such that you can run it under the debugger.

Would be nice to have the equivalent of phrase_from_stream for output, but I never found a satisfactory way to implement that. Only some highly dubious routes that require low level support hacked into the Prolog garbage collector. You can realise direct DCG output if the generating DCG is fully deterministic. If not, you need to figure out which part of the output list can no longer change which implies you need to figure out the location into the list accessible from the oldest choice point. That is hard

not tested!

#!/usr/bin/env swipl

:- use_module(library(dcg/basics)).
:- use_module(library(pure_input)).

:- initialization(main, main).

fasta_record(Record) -->
    { ground(Record) },
    !,
    gen_fasta_record(Record).
fasta_record(fasta(Descr, Seq)) -->
    ">", string(D), "\n", !,
    { atom_codes(Descr, D) },
    fasta_seq(Seq).

fasta_seq([]), ">" --> ">", !.
fasta_seq(S) --> "\n", !, fasta_seq(S).
fasta_seq([H|T]) --> [H], fasta_seq(T).

gen_fasta_record(fasta(Descr, Seq)) -->
    ">", atom(Descr), "\n",
    gen_fasta_seq(Seq, 0, 80).

gen_fasta_seq([], _, _) -->
    !, "\n".
gen_fasta_seq(List, Max, Max) -->
    !, "\n",
    gen_fasta_seq(List, 0, Max).
gen_fasta_seq([H|T], N, Max) -->
    [H],
    { N1 is N+1 },
    gen_fasta_seq(T, N1, Max).

main(_Argv) :-
    phrase_from_stream(fasta_revcomp, current_input).

fasta_revcomp -->
    fasta_record(fasta(Descr, Seq)),
    { reverse(Seq, Rev),
      maplist(iupac_complement, Rev, RevCompl),
      phrase(fasta_record(fasta(Descr, RevCompl)), Codes),
      format(current_output, "~s", [Codes])
    },
    fasta_revcomp.

Boris · December 27, 2019, 10:17am

@EricGT Thank you very much for your thoughtful comments.

As you have also noticed, bioinformatics tools and formats are in constant flux, with new kinds of data trying to fit into existing formats, new tecnologies producing huge amounts of data, and so on. It is not an easy task to make anything truly useful (to others) and I don’t even feel like trying at the moment it is just too much work and I don’t have time for it.

To put it differently, I can maybe write some quick code to do something, and it will take me 30min or an hour. If I think a bit about it a bit I could turn it into a Prolog module that I can reuse. But turning it into something useful to others is an order of magnitude more work.

Either way, I only “handroll” code because I know at the time that it would take me a day or two at least to figure out which existing tool would do the same job and figure out the incantation and then confirm that it did what I thought it would do. Success is not guaranteed, while with my own code I can actually predict how long it should take me to write it and run it.

Along the same train of thought, life’s too short to write DCGs that parse/generate with the same code. It is however straight-forward and predictable (for me) to write a DCG that parses, and another one that generates.

Boris · December 27, 2019, 10:29am

@jan Thank you for the example code. I will take the time to rethink the idea of scripts with a Prolog shebang. I have very vague memories from at least 5 years ago that I tried something and somehow failed to use it properly, or maybe it exhibited some behavior I did not understand.

What provoked me to write my question (and your example has the exact same structure!) is that there is a DCG that then has embedded another { phrase(...), format(...) }. I thought I am missing the big picture.

I need to digest your faste_seq//1 and gen_fasta_seq//3 too.

Boris · December 27, 2019, 5:57pm

So I dug out the code for checking for choice points. It looked like this (and if memory serves me, it was provided by @jan after I asked for it):

goal_is_det(Goal) :-
    setup_call_cleanup(true, Goal, Det = true),
    (   Det == true
    ->  true
    ;   !,
        throw(error(mode_error(notdet),_))
    ).

The idea was that if you want to make sure that your Goal succeeds without choice points, you call it as goal_is_det(Goal) and if there is a choice point you get an exception. Not sure if this is enough code for its own module. It really was only intended to help me avoid mistakes in my code that was meant to be deterministic (always succeeds exactly once without any choice points). This in a way was an attempt to artificially limit the expressiveness of Prolog.

EricGT · December 28, 2019, 4:07pm

Personal notes put here for others that might find them of value.

Click triangle to expand

Genetics

Upstream and downstream (DNA) (Wikipedia)
The Genetic Code
Transcription (biology) (Wikipedia)
Primary transcript (Wikipedia)

Data format

Sequence alignment (Wikipedia)
FASTA format (Wikipedia)
Sequence Locations and Identifiers

Databases

GenBank
European Molecular Biology Laboratory (EMBL)
Protein Information Resource (PIR)
UniProt (Swiss-Prot)
Unites States Patent and Trademark Office (uspto)
NCBI Reference Sequence Database (RefSeq)
DNA Data Bank of Japan (DDBJ)
Protein Research Foundation (PRF)
Protein Databank (PDB)
UniProt (TrEMBL)

Data exchange and query

Federated search (Wikipedia)
REpresentational State Transfer (REST) (Wikipedia)

Topic		Replies	Views
Read_stream_to_codes Predicate	15	1585	March 25, 2020
What's the idiomatic way of developing DCGs? Help!	8	689	December 17, 2020
Are DCG's just "syntactic sugar"? Help!	18	1475	July 9, 2019
Phrase_from_file vs phrase Help!	29	2946	September 14, 2020
Making a large prolog based knowledge base for bioinformatics/epidemiology Request For Comments swish , discussion , how-to	23	4864	December 16, 2019