Pandas Aggregation in Prolog

I am trying to replicate this example:


https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html

Besides various limitations of library(aggregate) in Prolog,
such that it doesn’t allow function pointers like np.median or
that it doesn’t spread an aggregate along multiple columns,

i.e. data1 and data2 in the above example. I also found that already
a min and max pair has some limitations:

?- aggregate_all(min(X), between(1,100000000,X), Y).
Y = 1.

?- aggregate_all(max(X), between(1,100000000,X), Y).
Y = 100000000.

?- aggregate_all((min(X), max(X)), between(1,100000000,X), Y).
ERROR: Stack limit (1.0Gb) exceeded

In another Prolog system the 3rd query works as expected:

?- aggregate_all((min(X), max(X)), between(1,100000000,X), Y).
Y = (1, 100000000).

Disclaimer: np.median is probably a bad example anyway, since
it is not streaming friendly such as min and max. Unless you dare
to run the aggregate goal between(1,100000000,X) twice?

Go ahead and create a PR for dealing with complex templates that allow for streaming processing. As is, Ebrahim Azarisooreh contributed this for simple templates.

I am afraid I have to pass writing a SWI-Prolog Pull Requests (PRs).
This would require coding a solution in the C programming language,
and making it accessible via GitHub. My C is a little rusty, but I

like thinking about the problem in the spirit of a PIP. What is also
an interesting non-functional test case, i.e. a test case that tests
either memory or speed or both, is this one:

/* merily a functional test case */
?- aggregate(min(X), (between(1000,2000,X), Y is X mod 2), R).
Y = 0,
R = 1000 ;
Y = 1,
R = 1001.

/* a test case with a non-functional aspect*/
?- aggregate(min(X), (between(100000000,200000000,X), Y is X mod 2), R).
Y = 0,
R = 1000 ;
Y = 1,
R = 1001.

Currently in SWI-Prolog:
ERROR: Stack limit (1.0Gb) exceeded

Would be swell if this could be solved. I had a Prototype in
formerly Jekejeke Prolog using some bits and pieces from
mode directed tabling, but the bag/1 and set/1 aggregator

behaviour had a bug. Need to solve that as well sometime.

Why not in Prolog?

I think a trie would do what you want. It allows associating a variant to a value. So, you split the variable set in those involved in the aggregation and those not (also accessible as primitive, see current aggregate), and you create a trie using bindings of the free variables. See trie_update/3.

This could possibly be a good alternative to setof/3 as well, i.e., building on top of a trie rather than findall/3.

Its very difficult to get right. But I suspect it is not impossible.
Here is are some test cases:

p(A, _, A).
p(_, A, A).
p(A, _, A).

/* SWI-Prolog 9.3.11 */
?- aggregate(set(Z), p(X,Y,Z), L).
L = [X, Y].

I guess the above corresponds to this tabling:

:- table q/3.
q(A, _, A).
q(_, A, A).
q(A, _, A).

/* SWI-Prolog 9.3.11 */
?- q(X,Y,Z).
X = Z ;
Y = Z.

Somehow showing that a trie or whatever underlies tabling,
could be a suitable building block.

Edit 15.10.2024
For example to begin with, Scryer Prolog can even not
do the tabling correctly. It uses bag/1 and not set/1:

/* Scryer Prolog 0.9.4-193 */
?- q(X,Y,Z).
   X = Z
;  Y = Z
;  X = Z
;  false.

I don’t think it is particularly hard. Tries provide all you need as the same things are needed for answer subsumption (moded) tabling. Such an implementation can make aggregate/3 and setof/3 run in bounded space when possible.

As findall/3 is highly optimized when it comes to memory management, copying results back to the stacks and dealing with asynchronous atom garbage collection, such an implementation probably looses on performance in those scenarios where the memory costs of the findall/3 route are bearable. Tries are not optimized in any of these respects.

Probably one can easily beat tabling by just using the tries.

I tested your new trie based distinct/1 realization. Quite
amazing how it beats hash table from library(nb_set):

/* SWI-Prolog 9.3.14 */
?- time(test).
% 1,389,089 inferences, 2.078 CPU in 2.094 seconds (99% CPU, 668434 Lips)
true.

/* SWI-Prolog 9.3.14, nightly from 09. Nov 2024 */
?- time(test).
% 720,310 inferences, 0.578 CPU in 0.592 seconds (98% CPU, 1245942 Lips)
true.

Its the same test case as from the bagof/3 choker.
So mostlikely putting aggregate/3, bagof/3, setof/3 on a trie
footing will not only avoid choking but also give speed?!

test :-
   distinct(gen(_)), fail; true.

gen(L) :-
   between(1,300,_),
   between(1,300,N),
   length(L,N).

Probably. A complete implementation is not really trivial though. I did a partial implementation that showed similar performance as the current implementation. It was not completely functional though, notably no cyclic terms or attribute support. Possibly it is acceptable to use a trick similar to distinct/2 to rewrite such terms first.

I don’ have tries in my system. But I recently introduced Okasaki tree,
but it turned out that they are totally useless. What works better in
my case was hash+keysort, as strange as it may sound.

I believe it depends on the grouping factor. If there is a high
grouping factor there is higher precentage of read from the
grouping datastructure and than write into the grouping datastructure.

So if read is only O(1) as in hash, you have better performance
than if read is O(log N) as in tree. And you can also afford a later
keysort. Here is the bagof/3 choker with hash+keysort compared to tree:

/* Dogelog Player 1.2.5, tree */
?- time(test3).
% Zeit 6073 ms, GC 1 ms, Lips 5182931, Uhr 10.11.2024 00:58
true.

/* Dogelog Player 1.2.5, hash+keysort */
?- time(test4).
% Zeit 3070 ms, GC 2 ms, Lips 9808736, Uhr 10.11.2024 00:58
true.

Quite amazing, if we compare to the traditional bagof/3:

/* Trealla Prolog 2.59.17, keysort */
?- time(test).
% Time elapsed 13.280s, 16158649 Inferences, 1.217 MLips
   true.

Edit 10.11.2024
How did I solve the variant key problem, which is present in the
bagof/3 choker? I went with numbervars/3 and unnumbervars/3.
But most likely if tries are used for tabling, they can already do the

same. I was looking a little bit at your trie_insert() C implementation
used in distinct/1. It seems distinct/1 can solve the variant key problem
by means of trie_insert/2 alone, since for ordinary terms it uses a no-op:

trieable(Term, ForTrie) :-
    acyclic_term(Term),
    term_attvars(Term, []),
    !,
    ForTrie = t(Term).

Not sure what is needed to enumerate variant key, turn them
back into ordinary terms together with the associated values.
I am also using the nb_linkarg/3 trick for findall/3 inside bagof/3.

So the values are basically these single linked lists. The same
approach for bagof/3, is also suitable for aggregate/3.