Integer arithmetic hangs or is slow

I wonder what’s going on here; I killed it after a minute:

/* SWI-Prolog 8.5.17 */
?- time((exact(385, X, Y, Z), fail; true)).
Action (h for help) ? abort
% 64,175,095 inferences, 74.922 CPU in 75.873 seconds (99% CPU, 856560 Lips)
% Execution Aborted

On the other hand, in GNU Prolog I get this quickly:

/* GNU Prolog 1.5.0 */
?- exact(385, X, Y, Z), fail; true.
(15515 ms) yes

Another Prolog system that also has bigints showed the same speed.

The Prolog text was simply:

exact(M, X, Y, Z) :-
   N is M-1,
   between(0, N, X),
   between(0, N, Y),
   between(0, N, Z),
   X^3+Y^3+9-Z^3 =:= 0.

Using swipl -O on Ubuntu (AMD 3950X) I get

 % 57,215,239 inferences, 24.604 CPU in 24.604 seconds (100% CPU, 2325469 Lips)

Without -O this gets

% 114,281,864 inferences, 33.533 CPU in 33.534 seconds (100% CPU, 3408033 Lips)

What is really strange is that the Windows version is indeed much slower (95 sec), while both the Linux and Windows versions are compiled with GCC 11 using PGO optimization, and you’d assume the OS doesn’t matter as there is (or should not be) any OS interaction.

GNU-Prolog gives 10603 ms, YAP 18.6 seconds, SICStus 14.7 seconds. Part of the explanation is that SWI-Prolog has overflow checking and deals with unbounded integers. Native code generation probably helps a lot, in particular if you omit the overflow checking. Both GNU-Prolog and YAP give

?- A is 2^100.
A = 0.

Whereas SWI-Prolog gives

?- A is 2^100.
A = 1267650600228229401496703205376.

C gives 38 milliseconds. For simple things, simple languages are better :slight_smile:
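
For reference, roughly the kind of straightforward C program being compared against here; a sketch only, not necessarily the exact code that produced the 38 ms figure:

    #include <stdio.h>

    /* Brute-force search for x^3 + y^3 + 9 - z^3 = 0 with 0 =< x,y,z < 385,
     * using plain 64-bit integers and no overflow checking. */
    int main(void)
    { const long long m = 385;
      long long count = 0;

      for(long long x = 0; x < m; x++)
        for(long long y = 0; y < m; y++)
          for(long long z = 0; z < m; z++)
            if ( x*x*x + y*y*y + 9 - z*z*z == 0 )
              count++;

      printf("%lld solutions\n", count);
      return 0;
    }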

There is probably something wrong with (^)/2, because the timings
for the AMD 3950X on Ubuntu are also not that good. With Java,
which is usually slower than SWI-Prolog, I get:

/* Jekejeke Prolog 1.5.4 */
?- time((exact(385, X, Y, Z), fail; true)).
% Threads 18,969 ms, GC 32 ms, Up 19,020 ms (Current 09/20/22 10:56:49)
true.

But if I read my (^)/2 code, I have a fast path for smallints. Here is
the decision made for the fast path. The < 63 is a little bit defensive;
maybe =< 63 would also work, I don’t know:

    /* smallint ^ smallint case: y is the base, x the exponent */
    if (bitlength(Math.abs(y)) * x < 63) {
         return TermAtomic.normBigInteger(pow(y, x));
    } else {
         return TermAtomic.normBigInteger(
                        BigInteger.valueOf(y).pow(x));
    }

And the fastpath is rather trivial:

    /* exponentiation by squaring; int = 32-bit signed, long = 64-bit signed */
    private static long pow(long m, int n) {
        long r = 1;
        while (n != 0) {
            if ((n & 1) != 0)
                r *= m;
            n >>= 1;
            if (n != 0)
                m *= m;
        }
        return r;
    }

Or maybe SWI-Prolog has a fast path, but its smallints are different? My
smallints are 32-bit, but the above calculates a 64-bit result, and
then, through normBigInteger(), it goes back into a smallint or bigint.

On 64-bit CPUs 64-bit integers are quite fast. So whenever possible
I compute up to 64-bit results, even if the input is a 32-bit smallint,
for example in multiplication or, here, in integer exponentiation.
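
A minimal illustration of that idea in C (my own sketch, nothing to do with either system’s actual source): widening two 32-bit operands to 64 bits before multiplying can never overflow, so no check is needed on such a path.

    #include <stdint.h>

    /* The product of two 32-bit values always fits in 64 bits, so a
     * smallint * smallint fast path needs no overflow check at all. */
    static int64_t mul32(int32_t a, int32_t b)
    { return (int64_t)a * (int64_t)b;
    }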

The ar_pow() function in pl-arith.c ultimately does the work. In its current shape it is mostly done by @ridgeworks, dealing with integers, rationals and floats with many weird special cases. Ultimately it seems to be using GMP for integers, which means allocation, freeing, copying, etc. I’m surprised it isn’t any slower :slight_smile: This could use some opportunistic paths to deal with simple cases that can be done fast. It’s open source :slight_smile:

ar_pow() is slow because it uses mpz_init() and mpz_ui_pow_ui()
for its smallint fast path, but that is not really a fast path.
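
To illustrate the point, here is a simplified sketch (not the actual ar_pow() code) of what going through GMP costs even when base and exponent are tiny: an mpz_t is allocated, possibly grown, converted back, and freed again.

    #include <gmp.h>
    #include <stdint.h>

    /* Simplified sketch of the GMP route for a small base and exponent. */
    static int64_t pow_via_gmp(unsigned long base, unsigned long exp)
    { mpz_t r;
      int64_t result;

      mpz_init(r);                  /* allocation */
      mpz_ui_pow_ui(r, base, exp);  /* may allocate again while r grows */
      result = mpz_get_si(r);       /* assume the result fits, for this sketch */
      mpz_clear(r);                 /* free */
      return result;
    }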

The trick to implementing a “smallint” fast path is to predetermine when an overflow will not occur. SWIP has three integer sub-types: tagged (57 bits = 64 - tag_size), 64-bit, and GMP bigints, so the sub-type alone is insufficient information to detect potential overflow. I suppose you could do a floating point calculation using, e.g., pow or log, to estimate the size of the result, but that doesn’t come for free. Other alternatives?

I think the current implementation of punting integer exponentiation to GMP is a pragmatic solution but definitely isn’t optimal for a common exponentiation use case. The question is when and how much does it matter?


You have everything in your code to estimate whether an overflow will
happen or whether the computation can stay in registers. You already compute
the bit length of the base of the power, i.e. the first argument:

    /* from ar_pow() routine */
    { case V_INTEGER:
	op1_bits = MSB64(n1->value.i);
        break;

Now observe this bound for positive op1 and op2; you use it also
for r_bits (or maybe somebody else wrote the code and I am talking
to a brick wall?). Since op1 < 2^bitlength(op1), we have
op1^op2 < 2^(bitlength(op1)*op2), i.e.

     bitlength(op1^op2) =< bitlength(op1)*op2

But I don’t understand why no absolute value is taken
of n1->value.i? See for example my Java code: I first compute
Math.abs() before computing the bit length.
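
Putting the two observations together, a precheck along these lines would tell whether the result is guaranteed to fit in a signed 64-bit word. This is only a sketch of the idea in C, not the actual ar_pow() code, and pow_fits_in_int64() is a made-up name:

    #include <stdint.h>

    /* Sketch of an overflow precheck for base^exp.  Since
     * bitlength(base^exp) =< bitlength(|base|)*exp, a product below 63
     * guarantees the magnitude of the result fits in a signed 64-bit
     * word, so a native fast path can be taken. */
    static int pow_fits_in_int64(int64_t base, unsigned exp)
    { uint64_t b = base < 0 ? -(uint64_t)base : (uint64_t)base;
      unsigned bits = b ? 64 - (unsigned)__builtin_clzll(b) : 1;

      return exp < 64 && bits*exp < 63;   /* < 63 is defensive, as in the Java version */
    }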

Maybe there are more problems lurking in ar_pow()? It wouldn’t astonish
me; (^)/2 was only introduced later in the process into the ISO core standard,
so it might not have stood the test of time yet. It was made official in 2012:

9.3.10 (^)/2 – integer power
https://www.complang.tuwien.ac.at/ulrich/iso-prolog/dtc2#pow

For example, it’s really strange how the sign affects the performance:

/* SWI-Prolog 8.5.17 */
?- time((between(1,1000000,_), _ is 33^10, fail; true)).
% 2,000,000 inferences, 0.516 CPU in 0.515 seconds (100% CPU, 3878788 Lips)
true.

?- time((between(1,1000000,_), _ is (-33)^10, fail; true)).
% 2,000,000 inferences, 0.656 CPU in 0.645 seconds (102% CPU, 3047619 Lips)
true.

I cannot reproduce the same effect:

/* Jekejeke Prolog 1.5.4 */
?- time((between(1,1000000,_), _ is 33^10, fail; true)).
% Threads 141 ms, GC 2 ms, Up 163 ms (Current 09/20/22 20:32:52)
true.

?- time((between(1,1000000,_), _ is (-33)^10, fail; true)).
% Threads 157 ms, GC 3 ms, Up 165 ms (Current 09/20/22 20:32:56)
true.

Not my code, but you’re right. In fact I think the number of bits in the result has already been calculated in r_bits, except that it’s out of scope when needed. I’ll do some experimenting to see what can be easily done.

Seems (**)/2 also has some brakes attached. I get:

?- time((between(1,1000000,_), _ is 33^10, fail; true)).
% 2,000,000 inferences, 0.500 CPU in 0.504 seconds (99% CPU, 4000000 Lips)
true.

?- time((between(1,1000000,_), _ is 33**10, fail; true)).
% 2,000,000 inferences, 0.500 CPU in 0.502 seconds (100% CPU, 4000000 Lips)
true.

On the other hand I have:

/* Jekejeke Prolog 1.5.4 */
?- time((between(1,1000000,_), _ is 33^10, fail; true)).
% Threads 187 ms, GC 0 ms, Up 195 ms (Current 09/21/22 09:59:29)
true.

?- time((between(1,1000000,_), _ is integer(33**10), fail; true)).
% Threads 203 ms, GC 2 ms, Up 206 ms (Current 09/21/22 10:07:18)
true.

That is my contribution :slight_smile: The problem is/was that calling GMP to compute the power of too large operands would simply take down the process. Normally GMP allocates memory through Prolog hooks that check that the numbers do not get too large. When too large allocations are requested, the hooks do a longjmp() out of the arithmetic evaluation, clean up and throw a resource error. Calling the power function with large enough operands caused (causes?) GMP not to call the allocation hook but to abort immediately. That is why we need to do a rough estimation of the size of the result :frowning:
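
For readers unfamiliar with the mechanism: GMP lets an embedder replace its allocator via mp_set_memory_functions(). The guard described above could look roughly like this; a sketch with made-up names and an arbitrary limit, not SWI-Prolog’s actual hooks:

    #include <gmp.h>
    #include <setjmp.h>
    #include <stdlib.h>

    #define MAX_GMP_ALLOC (64*1024*1024)    /* arbitrary limit for this sketch */

    static jmp_buf gmp_abort_env;           /* set up around arithmetic evaluation */

    static void *guarded_alloc(size_t size)
    { if ( size > MAX_GMP_ALLOC )
        longjmp(gmp_abort_env, 1);          /* escape; caller cleans up and throws */
      return malloc(size);
    }

    static void *guarded_realloc(void *p, size_t oldsize, size_t newsize)
    { if ( newsize > MAX_GMP_ALLOC )
        longjmp(gmp_abort_env, 1);
      return realloc(p, newsize);
    }

    static void guarded_free(void *p, size_t size)
    { free(p);
    }

    static void install_gmp_guard(void)     /* install once at startup */
    { mp_set_memory_functions(guarded_alloc, guarded_realloc, guarded_free);
    }

As the post explains, this guard only helps when GMP actually calls the allocator; a function that aborts before allocating cannot be caught this way, hence the need to estimate the result size up front.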

Bypassing GMP for small integers will probably help a lot. GMP is great for really large numbers, but it is rather hard to embed it safely. We need to guard it on the outside, as in this example, as well as on the “inside” by replacing the memory allocation. Possibly there are options for a more lightweight allocation mechanism to deal with the common function evaluations that only involve a handful of small GMP objects? Below is a fragment of the valgrind analysis of the exact/4 call (interrupted after some time). The function below ar_pow() is part of mpz_ui_pow_ui(), if I interpret the source annotation of valgrind correctly.


What’s the fast path for multiplication (*)/2 in the case of
smallint arguments? How did you do that? Is there even one?

Raising to a power is really, really slow. Normally X^3 should be faster
than X*X*X. I got these measurements for SWI-Prolog:

/* X^3 */
exact/4   % 27,763,684 inferences, 15.141 CPU in 15.140 seconds (100% CPU, 1833721 Lips)
modular/4 % 3,816,483 inferences, 1.859 CPU in 1.843 seconds (101% CPU, 2052562 Lips)

/* X*X*X */
exact/4   % 27,763,683 inferences, 4.172 CPU in 4.162 seconds (100% CPU, 6654965 Lips)
modular/4 % 3,816,483 inferences, 0.797 CPU in 0.788 seconds (101% CPU, 4789312 Lips)

In SWIP I believe these two operators are synonymous.

Using the new experimental version based on your C implementation of pow, I now see (macOS, 3 GHz 2013 Mac Pro):

?- time((between(1,1000000,_), _ is 33^10, fail; true)).
% 2,000,002 inferences, 0.195 CPU in 0.195 seconds (100% CPU, 10242921 Lips)
true.

?- time((between(1,1000000,_), _ is 33**10, fail; true)).
% 2,000,000 inferences, 0.195 CPU in 0.195 seconds (100% CPU, 10239659 Lips)
true.

?- time((between(1,1000000,_), _ is (-33)^10, fail; true)).
% 2,000,000 inferences, 0.195 CPU in 0.195 seconds (100% CPU, 10251311 Lips)
true.

Good job, it’s speedier now. My code was Java, not C,
but they look the same. It’s basically this algorithm:

exp_by_squaring_iterative
https://en.wikipedia.org/wiki/Exponentiation_by_squaring

If you can handle the rounding correctly you
could let (**)/2 behave as floating point pow();
you would see a difference to (^)/2 in less
than 1‰ (per mille) of the cases. But currently there is
a systematic error somewhere which creates a
discrepancy in more than 50% of the cases.

See also here:

Rounding error in (**)/2
https://swi-prolog.discourse.group/t/rounding-error-in-2/5795

And you could gain some speed, because
SWI-Prolog’s (**)/2 is not really Math.pow(); it is
slower than a Math.pow()-based (**)/2:

/* SWI-Prolog 8.5.17 */
?- time((between(1,1000000,N), float(N)**10.0 =\= float(N**10), fail; true)).
% 2,000,000 inferences, 0.953 CPU in 0.954 seconds (100% CPU, 2098361 Lips)
true.

/* Jekejeke Prolog 1.5.4 */
?- time((between(1,1000000,N), float(N)**10 =\= float(N^10), fail; true)).
% Threads 453 ms, GC 3 ms, Up 466 ms (Current 09/21/22 19:14:43)
true.

But maybe the above figures look better with the new (**)/2?
In many of the above cases the result doesn’t fit in a smallint. Not
sure why it’s slower; possibly there is no speed-up from the new smallint
fast path for (**)/2 since the result doesn’t fit. But the slogan could be:
by allowing a discrepancy in less than 1‰ (per mille) of the cases and
eradicating the 50% discrepancy, you get double the speed. Or you eradicate
only the 50% discrepancy and don’t do a speed-up, i.e., only fix
(**)/2 somehow but do not switch to Math.pow().

I’ve generated a PR to optimize the “small” integer case. Seems fairly localized and the existing arithmetic tests all pass.

The new optimization does help somewhat on your new “float” example. ** with floats does use the C pow function, but there is some extra work around it to ensure proper rounding under the different rounding modes. Not sure that explains it all, but it will have a detrimental effect.
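
To give an idea of the extra work meant here, a hypothetical sketch (not SWI-Prolog’s actual code): evaluating pow() under a non-default IEEE rounding mode means saving the FPU rounding mode, switching it, and restoring it afterwards.

    #include <fenv.h>
    #include <math.h>

    /* Evaluate pow() under a given rounding mode (FE_TONEAREST, FE_UPWARD,
     * FE_DOWNWARD or FE_TOWARDZERO) and restore the previous mode. */
    static double pow_rounded(double x, double y, int mode)
    { int saved = fegetround();
      double r;

      fesetround(mode);
      r = pow(x, y);
      fesetround(saved);
      return r;
    }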

On my processor config, non-optimized arithmetic:

?- time((between(1,1000000,N), float(N)**10.0 =\= float(N**10), fail; true)).
% 2,000,001 inferences, 0.676 CPU in 0.676 seconds (100% CPU, 2959457 Lips)
true.

Applied after some more enhancements. Notably, the integer overflow checking was replaced by the GCC/Clang __builtin_mul_overflow() builtin, which is also used to catch overflow in the 64-bit implementation. When possible, we now also use integer exponentiation if there is no unbounded integer support.
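
For illustration, a self-contained sketch of the idea (my own; the actual implementation surely differs in details): 64-bit exponentiation by squaring that reports failure via __builtin_mul_overflow() as soon as a product no longer fits, at which point the caller can fall back to bignum arithmetic.

    #include <stdint.h>
    #include <stdbool.h>

    /* Compute base^exp in 64-bit arithmetic by squaring; return false on
     * overflow so the caller can switch to the unbounded-integer path. */
    static bool pow_int64(int64_t base, unsigned exp, int64_t *result)
    { int64_t r = 1;

      while ( exp != 0 )
      { if ( exp & 1 )
        { if ( __builtin_mul_overflow(r, base, &r) )
            return false;
        }
        exp >>= 1;
        if ( exp != 0 )
        { if ( __builtin_mul_overflow(base, base, &base) )
            return false;
        }
      }

      *result = r;
      return true;
    }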

Original exact/4 test now runs in 7.6 seconds (was 24). That beats GNU-Prolog, YAP and SICStus :slight_smile: This also makes the Windows version perform about the same as the Linux version rather than almost 4 times slower. So either GMP or malloc() is much slower on Windows :frowning:

Nice improvements.


That’s still slower than Jekejeke Prolog, which had 466 ms. But I don’t
know how fast your machine is. I still wonder whether there are any
rational arguments against providing separate ar_pow() and ar_powint()?

ar_pow() and ar_powint() could share code in the end phase,
when it’s about bigints, rational numbers, etc. But the point would be
that ar_pow() would use the intrinsic Math.pow(double, double) from the
underlying runtime system (possibly compiler and platform dependent).
And it would be documented that ar_pow(), i.e. (**)/2,
is based on Math.pow(double, double) and thus does not have the
same precision as (^)/2.

ar_pow() would in general be faster than ar_powint().
I then wonder whether the benchmark runs faster or not.
But you possibly need to get a grip on the precision problem first.