More efficient group_by/4 for SWI-Prolog?

j4n_bur53 · February 10, 2025, 9:40am

Just noticed that group_by/4 calculates variables and then
delegates to bagof/3. But the later predicate calculates also
varables, so I suspect quite an overhead:

/* SWI-Prolog 9.3.19 */
group_by(By, Template, Goal, Bag) :-
    ordered_term_variables(Goal, GVars),
    ordered_term_variables(By+Template, UVars),
    ord_subtract(GVars, UVars, ExVars),
    bagof(Template, ExVars^Goal, Bag).

I went with another soluton. First I provided a variant of aggregate/3
by the name aggregate_by/4 where one can offload the internal
term_variables/2 calculation. Then use this bootstrapping:

/* Dogelog Player 1.3.0 */
group_by(Witness, Template, Goal, List) :-
   aggregate_by(Witness, bag(Template), Goal, List).

Here is some testing:

/* SWI-Prolog 9.3.19 */
?- length(_H,4000), time((between(1,2000,_), 
        group_by(X,Y,(nonvar(_H),between(1,10,Y),between(1,10,X)),L), 
        fail; true)).
% 1,153,998 inferences, 0.562 CPU in 0.568 seconds (99% CPU, 2051552 Lips)
true.
?- length(_H,8000), time((between(1,2000,_), 
        group_by(X,Y,(nonvar(_H),between(1,10,Y),between(1,10,X)),L), 
        fail; true)).
% 1,153,998 inferences, 1.047 CPU in 1.060 seconds (99% CPU, 1102326 Lips)
true.

/* Dogelog Player 1.3.0 */
?- length(_H,4000), time((between(1,2000,_), 
        group_by(X,Y,(nonvar(_H),between(1,10,Y),between(1,10,X)),L), 
        fail; true)).
% Zeit 399 ms, GC 0 ms, Lips 16987636, Uhr 10.02.2025 10:49
true.
?- length(_H,8000), time((between(1,2000,_), 
        group_by(X,Y,(nonvar(_H),between(1,10,Y),between(1,10,X)),L), 
        fail; true)).
% Zeit 400 ms, GC 1 ms, Lips 16945167, Uhr 10.02.2025 10:50
true.

The old version suffers from some term_variables/2 dependency
whereas the new version is totally immune on the size of the
given goal, since any internal term_variables/2 has been offloaded.

I couldn’t name aggregate_by/4 as aggregate/4, since the later
already exists in SWI-Prolog and SICStus Prolog and has a different
semantics, it is not the analog of distinct/2, where one can specify Witnesses.

j4n_bur53 · February 16, 2025, 10:02am

I was exploring group_by/4 respectively a new aggregate_by/4 for
some machine learning statistics. But the slowdown is not that
aggravated if the extra parameter _H is ground. The cost is smaller

factor then. So I had a change of mind in favor of more declarative
aggregate/3. Too much new predicates isn’t healthy so I dismissed
the idea of supporting distinct/2, group_by/4 and a new aggregate_by/4.

But somehow I fell in love with the idea of a new firstof/2 predicate,
instead of distinct/2, it could be bootstrapped as follows:

firstof(X, Q) :-
   bagof(X, Q, L),
   L = [X|_].

Except it can be implemented like distinct/2 more eagerly
fitting to some other predicates from the library(sequence):

p(1,a).
p(1,b).
p(2,c).
p(2,d).
p(2,e).

?- firstof(Y,p(X,Y)).
Y = a, X = 1;
Y = c, X = 2;
fail.

Boris · February 17, 2025, 8:34am

You are slowly moving in the general direction of SQL window functions. The syntax in SQL is absurd but the features are useful.

j4n_bur53 · February 17, 2025, 9:14am

SQL window functions are sometimes a short hand
for subexpressions that are SQL queries. For example
this here with SQL window functions:

SELECT depname, empno, salary, avg(salary) OVER 
(PARTITION BY depname) FROM empsalary;

Is the same as this here without SQL window functions:

SELECT depname, empno, salary, 
       (SELECT AVG(salary) FROM empsalary e2 WHERE 
e2.depname = e1.depname) AS avg_salary
FROM empsalary e1;

There is no GROUP BY in the SQL solution, and a SWI-Prolog
translation does neither need group_by/4, you can do it with the
de facto library(aggregate) in Prolog:

?- empsalary(Depname, Empno, Salary),
    aggregate_all((count,sum(_Salary)), 
        empsalary(Depname, _, _Salary),
                      (_Count, _Sum)),
    Avg_Salary is _Sum/_Count.

So I rolled back my group_by/4 introduction, and instead
I introduced something else in my Prolog system. Namely
marked variables in queries and in Prolog text consulting.

So that one can emphasize temporary variables such as
_Count and _Sum, and they don’t clutter answer substitutions.
Marked variables do not yet work in Scryer Prolog ,

but they work in Trealla Prolog as well.

j4n_bur53 · February 17, 2025, 9:21am

Things are totally different in SQL than in Prolog systems, since
SQL has sorted indexes, already implied in CREATE INDEX, you
have to opt-out via USING hash . But back to Prolog systems,

you can try this test data:

empsalary(develop, 11, 5200).
empsalary(develop, 7, 4200).
empsalary(develop, 9, 4500).
empsalary(develop, 8, 6000).
empsalary(develop, 10, 5200).
empsalary(personnel, 5, 3500).
empsalary(personnel, 2, 3900).
empsalary(sales, 3, 4800).
empsalary(sales, 1, 5000).
empsalary(sales, 4, 4800).

You can check the query:

j4n_bur53:

?- empsalary(Depname, Empno, Salary),
    aggregate_all((count,sum(_Salary)), 
        empsalary(Depname, _, _Salary),
                      (_Count, _Sum)),
    Avg_Salary is _Sum/_Count.

But you can also write another solution where you first do a
aggregate/3 and then empsalary/3, like for example. But a Prolog
system does usually not go into the pain of doing the aggregate/3

eagerly window style, like using some Depname first argument
indexing. Eagerly window style does already not work since by
definition aggregate/3 is bootstrapped from bagof/3 so it also

involves keysorting and will also sort Depname. But Prolog first
argument indexing is hardly sorted in most Prolog systems. So
aggregate/3 will have all Depnames in memory and not one by one:

?- aggregate((count,sum(_Salary)), 
        _Empno^empsalary(Depname, _Empno, _Salary),
                      (_Count, _Sum)),
    Avg_Salary is _Sum/_Count,
    empsalary(Depname, Empno, Salary).

I really start likeing marked variables, but I am little bit
overdoing it. If the top-level supresses unbound variables,
you don’t need to mark _Salary and _Empno. But _Empno

is quantified by (^)/2, which you cannot do with an annoynmous
variable, so marked variables are quite handy here. The
alternative solution with my suggested aggregate_by/4 can

be easily derived, but it doesn’t give something more readable,
except that the 1st argument aggregate_by/4 could be read off
from a corresponding PARTITION BY SQL clause.

But even with aggregate_by/4 the 2nd empsalary/3 call would
still be outside the aggregate_by/4 call, since the conceived
aggregate_by/4 DOES NOT give a full OVER SQL clause.

Topic		Replies	Views
Trying to understand solution_sequences:group_by/4 Predicate	3	830	March 20, 2020
The bagof/3 choker in SWI-Prolog General	0	54	November 9, 2024
Nuances and usage tips on findall/4? Help!	2	875	June 26, 2019
Library(aggregate): Weird stuff! Help! how-to	3	2755	April 26, 2020
Simple Question Help!	7	999	June 27, 2019

More efficient group_by/4 for SWI-Prolog?

Related topics