% 57,215,239 inferences, 24.604 CPU in 24.604 seconds (100% CPU, 2325469 Lips)
Without -O this gives
% 114,281,864 inferences, 33.533 CPU in 33.534 seconds (100% CPU, 3408033 Lips)
What is really strange is that the Windows version is indeed much slower (95 sec), while both the Linux and Windows versions are compiled with GCC 11 using PGO optimization, and you’d assume the OS doesn’t matter as there is (or should be) no OS interaction.
GNU-Prolog gives 10603 ms, YAP 18.6 seconds, SICStus 14.7 seconds. Part of the explanation is that SWI-Prolog has overflow checking and deals with unbounded integers. Native code generation probably helps a lot, in particular if you omit the overflow checking. Both GNU-Prolog and YAP give
?- A is 2^100.
A = 0.
Whereas SWI-Prolog gives
?- A is 2^100.
A = 1267650600228229401496703205376.
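As a hedged illustration of the difference (plain C, not code from any of the systems above): with wrap-around 64-bit machine arithmetic, repeated doubling up to 2^100 silently collapses to 0, matching the A = 0 answer, whereas an unbounded-integer system has to detect the overflow and switch to bignums.

#include <stdio.h>
#include <stdint.h>

int main(void)
{   uint64_t r = 1;
    for (int i = 0; i < 100; i++)
        r *= 2;                               /* unsigned overflow wraps around */
    printf("%llu\n", (unsigned long long)r);  /* prints 0 */
    return 0;
}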
C gives 38 milliseconds. For simple things, simple languages are better.
There is probably something wrong with (^)/2, because the timings
for the AMD 3950X on Ubuntu are also not that good. With Java,
which is usually slower than SWI-Prolog, I get:
/* Jekejeke Prolog 1.5.4 */
?- time((exact(385, X, Y, Z), fail; true)).
% Threads 18,969 ms, GC 32 ms, Up 19,020 ms (Current 09/20/22 10:56:49)
true.
But if I read my (^)/2 code, I have a fast path for smallints. Here is
the decision made for the fast path: since |y|^x < 2^(bitlength(|y|)*x),
keeping that product below 63 guarantees the result fits in a signed 64-bit long.
The < 63 is a little bit defensive; maybe =< 63 would also work, I don’t know:
/* smallint ^ smallint case */
if (bitlength(Math.abs(y)) * x < 63) {
    return TermAtomic.normBigInteger(pow(y, x));
} else {
    return TermAtomic.normBigInteger(
        BigInteger.valueOf(y).pow(x));
}
And the fast path is rather trivial:
/* int = 32-bit signed, long = 64-bit signed */
/* square-and-multiply: consume the exponent n bit by bit, squaring m each round */
private static long pow(long m, int n) {
    long r = 1;
    while (n != 0) {
        if ((n & 1) != 0)   /* current exponent bit is set: multiply into the result */
            r *= m;
        n >>= 1;
        if (n != 0)         /* skip the final, unused squaring */
            m *= m;
    }
    return r;
}
Or maybe SWI-Prolog has a fast path, but its smallints are different? My
smallints are 32-bit, but the above calculates a 64-bit result, and
then, through normBigInteger(), it goes back into a smallint or bigint.
On 64-bit CPUs, 64-bit integers are quite fast. So whenever possible
I compute up to 64-bit results, even if the input is a 32-bit smallint,
for example in multiplication or, as here, in integer exponentiation.
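As a minimal sketch of that idea in C (my own illustration with made-up names, not Jekejeke or SWI-Prolog source): multiply two 32-bit smallints into a 64-bit intermediate and only keep the wide representation when the result no longer fits.

#include <stdint.h>

/* Illustrative tagged result: either a 32-bit smallint, or a 64-bit value
   that a real system would box as a bigint. */
typedef struct { int is_small; int32_t small; int64_t wide; } num;

static num mul_smallint(int32_t a, int32_t b)
{   num r;
    int64_t w = (int64_t)a * (int64_t)b;      /* cannot overflow 64 bits */
    if (w >= INT32_MIN && w <= INT32_MAX)
    {   r.is_small = 1;                       /* normalize back to smallint */
        r.small = (int32_t)w;
    } else
    {   r.is_small = 0;                       /* keep the 64-bit result */
        r.wide = w;
    }
    return r;
}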
The ar_pow() function in pl-arith.c ultimately does the work. Its current shape is mostly due to @ridgeworks; it deals with integers, rationals and floats, with many weird special cases. Ultimately it seems to use GMP for integers, which means allocation, freeing, copying, etc. I’m surprised it isn’t any slower. This could use some opportunistic paths to deal with simple cases that can be done fast. It’s open source.
The trick to implementing a “smallint” fast path is to predetermine when an overflow will not occur. SWIP has three integer sub-types: tagged (57 bits = 64 - tag size), 64-bit, and GMP bigints, so the subtype alone is insufficient information to detect potential overflow. I suppose you could do a floating point calculation, using e.g. pow or log, to estimate the size of the result, but that doesn’t come for free. Other alternatives?
I think the current implementation of punting integer exponentiation to GMP is a pragmatic solution but definitely isn’t optimal for a common exponentiation use case. The question is when and how much does it matter?
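One way to flesh out the floating-point estimate mentioned above, as a rough sketch (hypothetical names, not anything in pl-arith.c): estimate the bit size of base^exp via log2 of the base and only take a native 64-bit path when the estimate stays comfortably below 63 bits.

#include <math.h>
#include <stdint.h>
#include <stdbool.h>

static bool probably_fits_int64(int64_t base, uint64_t exp)
{   if (base >= -1 && base <= 1)
        return true;                          /* -1, 0, 1: result stays tiny */
    double bits = (double)exp * log2(fabs((double)base));
    return bits < 62.0;                       /* margin for estimation error */
}

The price is a log2() call per evaluation, which is the “doesn’t come for free” part.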
You have everything in your code to estimate whether an overflow will
happen when using machine registers or not. You already compute the bit length of
the base being raised to a power, i.e. the first argument:
/* from ar_pow() routine */
{ case V_INTEGER:
op1_bits = MSB64(n1->value.i);
break;
Now observe this upper bound for positive op1 and op2. You use it also
for &r_bits, or maybe somebody else wrote that code. Am
I talking to a brick wall?
bitlength(op1^op2) =< bitlength(op1)*op2
For example, bitlength(5) = 3 and 5^4 = 625 has bitlength 10 =< 3*4 = 12.
But I don’t understand why no absolute value is taken
of n1->value.i? See for example my Java code above: I first compute
Math.abs() before computing bitlength().
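Putting the two observations together, such a guard could look roughly like this sketch (hypothetical helper names, not the actual ar_pow() code): take the absolute value first, compute the bit length, and only take a native 64-bit path when bitlength(|op1|) * op2 stays below 63.

#include <stdint.h>

static int bitlength64(uint64_t v)            /* number of significant bits */
{   int b = 0;
    while (v)
    {   v >>= 1;
        b++;
    }
    return b;
}

/* Since |op1|^op2 < 2^(bitlength(|op1|)*op2), a product below 63 means
   the result, including its sign, fits in a signed 64-bit integer. */
static int fits_in_int64(int64_t op1, uint64_t op2)
{   uint64_t mag = op1 < 0 ? -(uint64_t)op1 : (uint64_t)op1;
    if (mag <= 1)
        return 1;                             /* 0, 1, -1: always fits */
    if (op2 >= 63)
        return 0;                             /* |op1| >= 2, so this overflows */
    return (uint64_t)bitlength64(mag) * op2 < 63;
}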
Maybe there are more problems lurking in ar_pow()? It wouldn’t astonish
me; it was only introduced later in the process by the ISO core standard,
so it may not have stood the test of time. It was made official in 2012.
Not my code, but you’re right. In fact I think the number of bits in the result has already been calculated in r_bits, except that it’s out of scope when needed. I’ll do some experimenting to see what can be easily done.
That is my contribution. The problem is/was that when calling GMP to compute the power of too large operands, it would simply take down the process. Normally GMP allocates memory through the Prolog hooks, which check that the numbers do not get too large. When too large allocations are requested, the hooks do a longjmp() out of the arithmetic evaluation, clean up and throw a resource error. Calling the power function with large enough operands caused (causes?) GMP not to call the allocation hook but to abort immediately. That is why we need to do a rough estimation of the size of the result.
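For readers unfamiliar with that mechanism, a stripped-down sketch of such guarded GMP allocation hooks might look as follows (an illustration of the idea, not the actual SWI-Prolog hooks; the limit and names are made up):

#include <gmp.h>
#include <setjmp.h>
#include <stdlib.h>

#define MAX_GMP_ALLOC (64 * 1024 * 1024)      /* illustrative limit */

/* The evaluator wraps arithmetic in if (setjmp(gmp_too_large)) ...,
   cleans up and throws a resource error when we jump here. */
static jmp_buf gmp_too_large;

static void *guarded_alloc(size_t size)
{   if (size > MAX_GMP_ALLOC)
        longjmp(gmp_too_large, 1);            /* unwind out of the evaluation */
    return malloc(size);
}

static void *guarded_realloc(void *p, size_t old, size_t size)
{   (void)old;
    if (size > MAX_GMP_ALLOC)
        longjmp(gmp_too_large, 1);
    return realloc(p, size);
}

static void guarded_free(void *p, size_t size)
{   (void)size;
    free(p);
}

void install_gmp_guard(void)
{   mp_set_memory_functions(guarded_alloc, guarded_realloc, guarded_free);
}

As the post explains, the catch is that for extreme operands GMP may abort before it ever reaches these hooks, which is why the operand sizes also have to be estimated up front.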
Bypassing GMP for small integers will probably help a lot. GMP is great for really large numbers, but it is rather hard to embed safely. We need to guard it on the outside, as in this example, as well as on the “inside” by replacing the memory allocation. Possibly there are options for a more lightweight allocation mechanism to deal with common function evaluations that only involve a handful of small GMP objects? Below is a fragment of the valgrind analysis of the exact/4 call (interrupted after some time). The function below ar_pow() is part of mpz_ui_pow_ui(), if I interpret the source annotation of valgrind correctly.
But maybe the above figures look better with the new (**)/2? In many
of the cases above the result doesn’t fit in a smallint. Not
sure why it’s slower; possibly the new smallint fast path for (**)/2
gives no speed-up since the result doesn’t fit. But the slogan could be:
by allowing an error of less than 1‰ (per mille) and eradicating the 50%
discrepancy, you get double the speed. Or you eradicate only
the 50% discrepancy and don’t get a speed-up, i.e. only fix (**)/2 somehow but do not switch to Math.pow().
I’ve generated a PR to optimize the “small” integer case. Seems fairly localized and the existing arithmetic tests all pass.
The new optimization does help somewhat on your new “float” example. ** with floats does use the C pow function but there’s some extra work around it to ensure proper rounding under the different modes. Not sure that explains it all but it will have a detrimental effect.
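The “extra work” around the C pow() call could, for instance, mean evaluating it under the currently selected IEEE rounding mode; a hedged sketch of that pattern (purely illustrative, not the actual SWI-Prolog code):

#include <fenv.h>
#include <math.h>

/* Evaluate pow() under a given rounding mode and restore the old mode.
   mode is one of FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO. */
double pow_with_rounding(double x, double y, int mode)
{   int old = fegetround();
    fesetround(mode);
    double r = pow(x, y);
    fesetround(old);
    return r;
}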
On my processor config, non-optimized arithmetic:
?- time((between(1,1000000,N), float(N)**10.0 =\= float(N**10), fail; true)).
% 2,000,001 inferences, 0.676 CPU in 0.676 seconds (100% CPU, 2959457 Lips)
true.
Applied after some more enhancements. Notably, integer overflow checking was replaced by the GCC/Clang __builtin_mul_overflow() builtin, which is also used to catch overflow in the 64-bit implementation. We now also use integer exponentiation, when possible, if there is no unbounded integer support.
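For context, a checked 64-bit power loop built on that builtin could look roughly like this (a sketch of the idea, not the actual patch):

#include <stdint.h>
#include <stdbool.h>

/* Square-and-multiply with overflow detection: stores base^exp in *r and
   returns true when it fits in an int64_t, false as soon as a
   multiplication overflows (the caller can then fall back to GMP). */
static bool pow64_checked(int64_t base, uint64_t exp, int64_t *r)
{   int64_t acc = 1;
    while (exp)
    {   if ((exp & 1) && __builtin_mul_overflow(acc, base, &acc))
            return false;
        exp >>= 1;
        if (exp && __builtin_mul_overflow(base, base, &base))
            return false;
    }
    *r = acc;
    return true;
}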
The original exact/4 test now runs in 7.6 seconds (was 24). That beats GNU-Prolog, YAP and SICStus. This also makes the Windows version perform about the same as the Linux version, rather than almost 4 times slower. So either GMP or malloc() is much slower on Windows.
That’s still slower than Jekejeke Prolog, which had 466 ms. But I don’t
know how fast your machine is. I still wonder whether there are any
rational arguments against providing separate ar_pow() and ar_powint()?
ar_pow() and ar_powint() could share code in the end phase,
when it is about bigints, rational numbers etc. But the point would be
that ar_pow() would use the intrinsic Math.pow(double, double) from the
underlying runtime system (possibly compiler and platform
dependent). And it would be documented that ar_pow(), i.e. (**)/2,
would be based on Math.pow(double, double) and thus not have the
same precision as (^)/2. ar_pow() would in general be faster than ar_powint(). I then wonder whether the benchmark runs faster or not.
But you possibly need to get a grip on the precision problem first.
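To make that precision concern concrete, here is a small, self-contained C comparison (illustration only, not tied to any of the systems above; the low digits of the pow() result depend on the platform’s libm):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void)
{   uint64_t exact = 1;
    for (int i = 0; i < 38; i++)              /* exact 3^38 fits in 64 bits */
        exact *= 3;
    double approx = pow(3.0, 38.0);           /* rounded to a 53-bit mantissa */
    printf("exact integer power: %llu\n", (unsigned long long)exact);
    printf("double pow()       : %.0f\n", approx);
    return 0;
}

3^38 needs 61 significant bits, more than a double’s 53-bit mantissa, so the two printed values will generally differ in the last digits. That is exactly the (**)/2 versus (^)/2 trade-off under discussion.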