No avg(X) for aggregate_all?

What would be the best way to translate the SQL query

select avg(GPA)
from Student;

into Prolog?

I’ve come up with two ways:

findall(_GPA, student(_, _, _GPA, _), _GPAs),
sum_list(_GPAs, _T),
length(_GPAs, _L),
Average is _T/_L.

aggregate_all(sum(_GPA), student(_, _, _GPA, _), _T),
aggregate_all(count, student(_, _, _GPA, _), _L),
Average is _T/_L.

It seems weird to me that there’s no

aggregate_all(avg(_GPA), student(_, _, _GPA, _), Average).

So I’m not sure if there is something like that, and I can’t find it in the documentation.
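For what it’s worth, the two-pass approach can be packaged into a small helper so client code doesn’t repeat itself. This is just a sketch; avg_all/3 is a hypothetical name, not part of library(aggregate):

```prolog
% Hypothetical helper: average of Expr over solutions of Goal.
% Runs Goal twice, once per aggregate_all/3 call.
avg_all(Expr, Goal, Avg) :-
    aggregate_all(sum(Expr), Goal, Sum),
    aggregate_all(count, Goal, N),
    N > 0,                 % fail rather than divide by zero
    Avg is Sum / N.
```

Usage would then mirror the SQL fairly closely: `avg_all(GPA, student(_, _, GPA, _), Average)`.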

As mentioned previously, I’ve been working through Stanford’s database MooC, doing the problems in both Prolog and Datalog (and finding it very educational, my notes are at SWISH -- SWI-Prolog for SHaring).


Note that with the current implementation, doing it “in one go” does a findall. It takes twice as long as going over the backtrackable predicate twice.

Here is how I tested it:

?- forall(between(1, 1_000_000, _), ( random(X), assertz(x(X)) )).
true.

?- time(( aggregate_all(sum(X), x(X), Sum), aggregate_all(count, x(_), N), Avg is Sum/N )).
% 4,000,008 inferences, 0.470 CPU in 0.472 seconds (100% CPU, 8509189 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time(( aggregate_all(sum(X), x(X), Sum), aggregate_all(count, x(_), N), Avg is Sum/N )).
% 4,000,008 inferences, 0.431 CPU in 0.433 seconds (100% CPU, 9284337 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time(( aggregate_all(sum(X), x(X), Sum), aggregate_all(count, x(_), N), Avg is Sum/N )).
% 4,000,008 inferences, 0.421 CPU in 0.422 seconds (100% CPU, 9506445 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time( ( aggregate_all((sum(X), count), x(X), (Sum, N)), Avg is Sum/N )).
% 14,000,057 inferences, 0.962 CPU in 1.000 seconds (96% CPU, 14546829 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time( ( aggregate_all((sum(X), count), x(X), (Sum, N)), Avg is Sum/N )).
% 14,000,057 inferences, 0.981 CPU in 1.051 seconds (93% CPU, 14265451 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time( ( aggregate_all((sum(X), count), x(X), (Sum, N)), Avg is Sum/N )).
% 14,000,057 inferences, 0.979 CPU in 0.992 seconds (99% CPU, 14293196 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

I didn’t profile it (lazy again), but it is probably the list that findall creates that consumes the extra memory and time.

EDIT: Now that I think about it, in the case of floats specifically there might also be an accumulating error if you do the average the obvious way. But I don’t really know enough to comment on that in detail.

It does get faster across runs, but the “one go” version seems altogether slower. Without seeing the implementation, it is difficult to judge the reason.

I was about to suggest a PR for that but I got bogged down in detail. For example, this discussion:

and also this: https://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/

In addition to the mean, this also finds the standard deviation, as a bonus.
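Welford’s method can be written as a single-pass fold in Prolog. This is a sketch following the algorithm from the linked article; the predicate names are made up here:

```prolog
:- use_module(library(apply)).   % foldl/4

% One Welford update step. State is w(Count, Mean, M2), where M2
% accumulates the sum of squared deviations from the running mean.
welford_step(X, w(N0, Mean0, M2_0), w(N, Mean, M2)) :-
    N is N0 + 1,
    Delta is X - Mean0,
    Mean is Mean0 + Delta / N,
    Delta2 is X - Mean,
    M2 is M2_0 + Delta * Delta2.

% Mean and (sample) variance of a list in one numerically stable pass.
mean_variance(List, Mean, Variance) :-
    foldl(welford_step, List, w(0, 0.0, 0.0), w(N, Mean, M2)),
    N > 1,
    Variance is M2 / (N - 1).
```

So a hypothetical avg(X) template in aggregate_all/3 could keep just a count and a running mean as its state, avoiding both the findall list and most of the float accumulation error.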

All this information brought me to a mental block.

Yes, I did check the code. A naive avg() addition to aggregate_all/3 would just do the sum and the count separately, as a way to avoid duplication in the client code. As for the “proper” solution, I can’t judge how to do it right now, so I will leave it for my next vacation.

Yeah, probably. But weekends are for taking a break from programming :slight_smile:

The functor is not limited to (',')/2 … you can have an arbitrary template term and evaluate it:

aggregate_all(sum(_GPA) / count, student(_, _, _GPA, _), AvgExpr),
Average is AvgExpr.

Oracle has had LISTAGG for a very long time. It is something you can’t do without: it is group_concat() in SQLite and string_agg() (or array_agg(), for arrays) in Postgres. But basically you must have this functionality.
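A LISTAGG/group_concat-style aggregate can be approximated in SWI-Prolog today with bag/1 plus atomic_list_concat/3. A sketch (list_agg/4 is a hypothetical name):

```prolog
% Concatenate the solutions of Expr over Goal with separator Sep,
% in the style of SQL's LISTAGG / group_concat.
list_agg(Expr, Goal, Sep, Atom) :-
    aggregate_all(bag(Expr), Goal, Items),
    atomic_list_concat(Items, Sep, Atom).
```

For example, `list_agg(Name, student(Name, _, _, _), ',', Names)` would bind Names to a comma-separated atom of student names.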

This was on a Monday.