(I hope I didn’t misunderstand the conversation)
The aggregate predicate family is useful, but at the moment it is not generic enough. The solution to single-pass aggregation (very large files, streams) that seems obvious to me is as follows:
- Create a predicate that backtracks over the solutions;
- Use a non-backtrackable data structure for aggregation.
For the second one, I have used library(nb_rbtrees) (on a tree created with library(rbtrees)) with this tiny bit of code:
```prolog
%!  nbdict_apply(+Tree, +Key, :Pred, +Init)
%
%   Destructively update the value under Key in a non-backtrackable
%   red-black tree: apply Pred to the current value, or insert Init
%   if Key is not yet present.
nbdict_apply(X, Key, Pred, Init) :-
    (   nb_rb_get_node(X, Key, Node)
    ->  nb_rb_node_value(Node, Val0),
        call(Pred, Val0, Val1),
        nb_rb_set_node_value(Node, Val1)   % in-place, survives backtracking
    ;   nb_rb_insert(X, Key, Init)
    ).
```
This will either insert the default Init value or apply Pred to the existing value associated with Key. This post by Jan W discusses the computational complexity. It also suggests an obvious and better way to count word frequencies, one that stops working once your input is big enough. I guess your question refers in part to that post?
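I am reconstructing from memory, but the "obvious" way presumably looks something like the sketch below: materialize every word with findall/3, sort, and run-length encode. (word_freqs/2 is my own name for it, and clumped/2 needs a reasonably recent SWI-Prolog.) The problem is plain to see: findall/3 builds the complete list of words in memory before any counting happens.

```prolog
% Sketch of the findall-based approach (my reconstruction, not a
% quote from that post):
word_freqs(File, Freqs) :-
    findall(Word, file_word(File, Word), Words),  % all words, in memory
    msort(Words, Sorted),                         % sort, keep duplicates
    clumped(Sorted, Freqs).                       % Freqs = [Word-Count, ...]
```

The non-backtrackable tree avoids that intermediate list altogether.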
If you do it with the non-backtrackable tree, you can simply use forall/2. If we assume you have defined a file_word/2 predicate that succeeds once for every word in a file (a possible definition is sketched after the code), you would do:
```prolog
rb_empty(Freqs),
forall(file_word(File, Word),
       nbdict_apply(Freqs, Word, succ, 1)),   % insert 1 or increment
forall(rb_in(Key, Val, Freqs),
       format("~w ~w~n", [Key, Val]))
```
But this is still not optimal: you need to hand-roll both the backtracking predicate and the non-backtrackable accumulator for anything non-trivial. Do you have a better idea?