No avg(X) for aggregate_all?

What would be the best way to translate the SQL query

select avg(GPA)
from Student;

into Prolog?

I’ve come up with two ways:

findall(_GPA, student(_, _, _GPA, _), _GPAs),
sum_list(_GPAs, _T),
length(_GPAs, _L),
Average is _T/_L.

aggregate_all(sum(_GPA), student(_, _, _GPA, _), _T),
aggregate_all(count, student(_, _, _GPA, _), _L),
Average is _T/_L.

It seems weird to me that there’s no

aggregate_all(avg(_GPA), student(_, _, _GPA, _), Average).

So I’m not sure if there is something like that, and I can’t find it in the documentation.
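For what it’s worth, the two-pass approach can be packaged into a small helper so client code doesn’t repeat itself. This is just a sketch; avg_all/3 is a hypothetical name, not part of library(aggregate):

```prolog
% Hypothetical helper: average of Expr over solutions of Goal.
% Runs Goal twice, once per aggregate_all/3 call.
avg_all(Expr, Goal, Avg) :-
    aggregate_all(sum(Expr), Goal, Sum),
    aggregate_all(count, Goal, N),
    N > 0,                 % fail rather than divide by zero
    Avg is Sum / N.
```

Usage would then mirror the SQL fairly closely: `avg_all(GPA, student(_, _, GPA, _), Average)`.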

As mentioned previously, I’ve been working through Stanford’s database MooC, doing the problems in both Prolog and Datalog (and finding it very educational, my notes are at SWISH -- SWI-Prolog for SHaring).


Note that with the current implementation, doing it “in one go” does a findall. It takes twice as long as going over the backtrackable predicate twice.

Here is how I tested it:

?- forall(between(1, 1_000_000, _), ( random(X), assertz(x(X)) )).
true.

?- time(( aggregate_all(sum(X), x(X), Sum), aggregate_all(count, x(_), N), Avg is Sum/N )).
% 4,000,008 inferences, 0.470 CPU in 0.472 seconds (100% CPU, 8509189 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time(( aggregate_all(sum(X), x(X), Sum), aggregate_all(count, x(_), N), Avg is Sum/N )).
% 4,000,008 inferences, 0.431 CPU in 0.433 seconds (100% CPU, 9284337 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time(( aggregate_all(sum(X), x(X), Sum), aggregate_all(count, x(_), N), Avg is Sum/N )).
% 4,000,008 inferences, 0.421 CPU in 0.422 seconds (100% CPU, 9506445 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time( ( aggregate_all((sum(X), count), x(X), (Sum, N)), Avg is Sum/N )).
% 14,000,057 inferences, 0.962 CPU in 1.000 seconds (96% CPU, 14546829 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time( ( aggregate_all((sum(X), count), x(X), (Sum, N)), Avg is Sum/N )).
% 14,000,057 inferences, 0.981 CPU in 1.051 seconds (93% CPU, 14265451 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

?- time( ( aggregate_all((sum(X), count), x(X), (Sum, N)), Avg is Sum/N )).
% 14,000,057 inferences, 0.979 CPU in 0.992 seconds (99% CPU, 14293196 Lips)
Sum = 500477.74828369956,
N = 1000000,
Avg = 0.5004777482836995.

I didn’t profile it (lazy again), but it is probably the list that findall creates that consumes the extra memory and time.

EDIT: Now that I think about it, in the case of floats specifically there might also be an accumulating error if you do the average the obvious way. But I don’t really know enough to comment on that in detail.

It does get faster across runs, but the “one go” version seems altogether slower. Without seeing the implementation, it is difficult to judge the reason.

I was about to suggest a PR for that but I got bogged down in detail. For example, this discussion:

and also this: https://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/

In addition to the mean, this also finds the standard deviation, as a bonus.
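Welford’s method can be written as a single-pass fold in Prolog. This is a sketch following the algorithm from the linked article; the predicate names are made up here:

```prolog
:- use_module(library(apply)).   % foldl/4

% One Welford update step. State is w(Count, Mean, M2), where M2
% accumulates the sum of squared deviations from the running mean.
welford_step(X, w(N0, Mean0, M2_0), w(N, Mean, M2)) :-
    N is N0 + 1,
    Delta is X - Mean0,
    Mean is Mean0 + Delta / N,
    Delta2 is X - Mean,
    M2 is M2_0 + Delta * Delta2.

% Mean and (sample) variance of a list in one numerically stable pass.
mean_variance(List, Mean, Variance) :-
    foldl(welford_step, List, w(0, 0.0, 0.0), w(N, Mean, M2)),
    N > 1,
    Variance is M2 / (N - 1).
```

So a hypothetical avg(X) template in aggregate_all/3 could keep just a count and a running mean as its state, avoiding both the findall list and most of the float accumulation error.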

All this information brought me to a mental block.

Yes, I did check the code. A naive avg() addition to aggregate_all/3 would just do the sum and the count separately, as a way to avoid duplication in the client code. As for the “proper” solution, I can’t judge how to do it right now, so I will leave it for my next vacation.

Yeah, probably. But weekends are for taking a break from programming :slight_smile:

The functor is not limited to (',')/2 … you can have an arbitrary template term and evaluate it:

aggregate_all(sum(_GPA) / count, student(_, _, _GPA, _), AvgExpr),
Average is AvgExpr.

Oracle has had LISTAGG for a very long time. It is something you can’t do without: it is group_concat() in SQLite and string_agg() (or array_agg(), for arrays) in Postgres. But basically you must have this functionality.
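A LISTAGG/group_concat-style aggregate can be approximated in SWI-Prolog today with bag/1 plus atomic_list_concat/3. A sketch (list_agg/4 is a hypothetical name):

```prolog
% Concatenate the solutions of Expr over Goal with separator Sep,
% in the style of SQL's LISTAGG / group_concat.
list_agg(Expr, Goal, Sep, Atom) :-
    aggregate_all(bag(Expr), Goal, Items),
    atomic_list_concat(Items, Sep, Atom).
```

For example, `list_agg(Name, student(Name, _, _, _), ',', Names)` would bind Names to a comma-separated atom of student names.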

This was on a Monday.