LibBF binding

Perhaps you’ve found it by now, but it looks like a simple bug in the test:

cmp_bits_at(Bit, Num, Neg) :-
    bit(Bit, Num, NumBit), 
    bit(Bit, Neg, NegBit),  % was bit(Neg, Num, NegBit),
    cmp_bits(Bit, NumBit, NegBit).

Is it possible to profile the C code to pinpoint the bottleneck? It is probably an algorithm issue, but I want to rule out the lower-level functions as the root cause.


:slight_smile:

I normally use valgrind for this task. It seems to have a MacPorts port. Hopefully that comes with kcachegrind, a very nice tool for visualizing the results. If you can get it installed:

valgrind --tool=callgrind swipl ...
<normal interaction>
?- halt.

This creates a file callgrind.out.<pid>, which you can analyze using kcachegrind callgrind.out.<pid>.

Valgrind has various “tools”. The default one checks for memory errors; callgrind does the profiling. The price is that your program runs about 20 times slower. Because the profiling is based on event counting rather than statistical sampling, you do not need long run times though.

This doesn’t sound too bad. Fabrice Bellard (the author of LibBF) shows that you can gain significantly with some gcc optimization options on the inner loops that dictate behavior on larger numbers. I have to sort out whether that affects running the binaries on arbitrary Intel/AMD CPUs. I also have to sort out the implications for the remainder of the system, or whether we should just compile this one file that way.
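For what it’s worth, gcc can scope such options to a single file or function without touching the global build flags. A minimal sketch, where the concrete options (-O3, loop unrolling) and the function are just placeholders, not a recommendation:

#pragma GCC optimize ("O3", "unroll-loops")   /* applies to the rest of this file */

__attribute__((optimize("O3")))               /* or to a single hot function */
static void hot_inner_loop(unsigned long *limbs, int n)
{
    for (int i = 0; i < n; i++)
        limbs[i] += 1;                        /* placeholder body */
}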

Overhead for small numbers can definitely be improved if we avoid memory allocation for these by including the limbs for small numbers in the type, and by actually starting to use the error return codes rather than setting up an environment and doing setjmp()/longjmp() to recover from allocation failures. The latter requires that we define a new interface for bignums that we use from Prolog, with two implementations (GMP/LibBF), where the “implementation” should not be more than some small inline functions.
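A minimal sketch of that idea, using hypothetical names (bn, bn_init, bn_grow) rather than the actual SWI-Prolog or LibBF types: keep a couple of limbs inside the struct so small numbers never touch the allocator, and report allocation failure through a return code instead of longjmp():

#include <stdint.h>
#include <stdlib.h>

#define BN_INLINE_LIMBS 2              /* enough for 128-bit values */

typedef struct bn {
    uint32_t  size;                    /* limbs in use */
    uint32_t  allocated;               /* limbs available */
    int       sign;
    uint64_t  inline_limbs[BN_INLINE_LIMBS];
    uint64_t *limbs;                   /* points to inline_limbs or heap memory */
} bn;

static void bn_init(bn *n)
{
    n->size = 0;
    n->sign = 0;
    n->allocated = BN_INLINE_LIMBS;
    n->limbs = n->inline_limbs;        /* small numbers: no malloc() */
}

static int bn_grow(bn *n, uint32_t limbs)   /* -1 on allocation failure, */
{                                           /* instead of longjmp()      */
    if (limbs <= n->allocated)
        return 0;
    if (n->limbs == n->inline_limbs) {
        uint64_t *p = malloc(limbs * sizeof(*p));
        if (!p)
            return -1;
        for (uint32_t i = 0; i < n->size; i++)
            p[i] = n->inline_limbs[i]; /* first growth: copy out of the struct */
        n->limbs = p;
    } else {
        uint64_t *p = realloc(n->limbs, limbs * sizeof(*p));
        if (!p)
            return -1;
        n->limbs = p;
    }
    n->allocated = limbs;
    return 0;
}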

But anyway, using GMP will probably always be faster on large integers. There is a reason why it is considered the fastest bignum library :slight_smile:

Possibly useful - clang has some interesting extensions: https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions

Multiprecision Arithmetic Builtins (the documentation is scanty, but they seem to do unsigned arithmetic with carry)

Checked Arithmetic Builtins (I don’t know how SWI-Prolog does checked arithmetic, but the clang builtin might be faster)
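Roughly what the two families look like in use; a hedged sketch only, assuming __builtin_addcll (clang’s add-with-carry builtin) and __builtin_mul_overflow (checked multiplication, also available in gcc):

#include <stdbool.h>
#include <stdint.h>

/* 128-bit addition from 64-bit limbs using the carry builtin (clang) */
static void add128(unsigned long long alo, unsigned long long ahi,
                   unsigned long long blo, unsigned long long bhi,
                   unsigned long long *rlo, unsigned long long *rhi)
{
    unsigned long long carry;

    *rlo = __builtin_addcll(alo, blo, 0, &carry);
    *rhi = __builtin_addcll(ahi, bhi, carry, &carry);
}

/* Checked multiplication (gcc and clang): the builtin returns nonzero
   iff the product overflowed; the wrapped result is stored in *r. */
static bool mul64_checked(int64_t a, int64_t b, int64_t *r)
{
    return !__builtin_mul_overflow(a, b, r);
}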

It does have more builtins than GCC. I hope the two keep an eye on each other in that sense. I surely do not want to give up on gcc as it is the default compiler on Linux and, last time I checked, produces substantially (~20%) faster binaries for Prolog. SWI-Prolog arithmetic uses some of the gcc/clang builtins. There may well be more opportunities. The most important is an overflow-checking multiplication. That is used.

Pushed an update that, like GMP, keeps the allocated size in the structure. As the bf_t structs are only used during computation I also round up the allocation requests and do not reduce them. That saves a lot of realloc() calls and doubles the performance of gcd :slight_smile: The gap is now reduced to about 5 times.
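A sketch of that policy, with assumed names rather than the actual patch: keep the allocated size next to the limb array, round requests up, and never shrink while a computation is in progress:

#include <stdlib.h>

static int reserve_limbs(void **tab, size_t *allocated,
                         size_t needed, size_t limb_size)
{
    size_t newsize;
    void *p;

    if (needed <= *allocated)
        return 0;                      /* big enough: no realloc() */

    newsize = *allocated ? *allocated : 8;
    while (newsize < needed)
        newsize *= 2;                  /* round up, amortizing growth */

    if (!(p = realloc(*tab, newsize * limb_size)))
        return -1;
    *tab = p;
    *allocated = newsize;
    return 0;
}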

That’s good because I’ve run out of ideas. Was a little surprised at the difference, but given that the candidate gcd values are always getting smaller, it makes sense.

I’m about to generate a final PR for mpz_gcd which is mainly a code cleanup, minimizing the direct dependencies on libbf (it uses mpz_ functions instead). I can’t measure any performance impact.


I was also a little surprised, notably because valgrind rated allocation at around 30%. The allocation hooks, however, use thread-local storage. I suspect that valgrind doesn’t estimate the overhead of that correctly. We can probably gain more on the allocation and error handling. That requires some redesign of the interfaces though.

One thing I’m wondering is whether the extended Euclidean algorithm might do any good? I think it is claimed to be superior for coprimes. A quick check on our benchmark suggests about 2/3 of the test cases concern coprimes. What is your impression?

Another observation is that about half of the time is spent on “renormalization” of the floating-point representation after the subtract. I wonder whether we can simply avoid that and, overall, ignore most of the higher-level operations associated with bf_sub()?

I’ve read this article but admit I don’t fully understand it. It appears to be just the Euclidean algorithm augmented by calculation of a couple of additional values (the Bézout coefficients). Their use is unclear to me, and I think there’s an associated overhead, so I haven’t investigated further.
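For reference, a sketch of what the extension amounts to on plain machine integers (not the bignum code): the ordinary Euclidean loop plus bookkeeping for the Bézout coefficients x and y with a*x + b*y = gcd(a, b), which is exactly the extra overhead in question:

#include <stdint.h>

static int64_t ext_gcd(int64_t a, int64_t b, int64_t *x, int64_t *y)
{
    int64_t x0 = 1, y0 = 0;            /* coefficients for the original a */
    int64_t x1 = 0, y1 = 1;            /* coefficients for the original b */

    while (b != 0) {
        int64_t q = a / b;
        int64_t t;

        t = a - q * b;   a = b;   b = t;   /* the plain Euclidean step */
        t = x0 - q * x1; x0 = x1; x1 = t;
        t = y0 - q * y1; y0 = y1; y1 = t;
    }
    *x = x0;
    *y = y0;
    return a;                          /* the extra work only yields x and y */
}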

I think normalization, including “removal” of trailing zeros, has some value in later steps, but rounding an “infinite precision” value makes little sense. There doesn’t appear to be a “fast path” for this case. At one point I added one but it didn’t seem to make much difference; with the latest optimizations, it may make sense to revisit this.

On the other “higher level operations” in the bf_sub path, I’ve resisted meddling in libBF code I didn’t understand as I didn’t want to create an ongoing maintenance headache. (Maybe valgrind performance analysis of bf_sub would justify further effort in this area?)

At one point I also investigated an optimization of the 64-bit integer gcd using Stein’s algorithm (binary GCD) and a builtin. At the time it also didn’t seem to make much difference, but I’ll re-evaluate.
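A sketch of that variant, assuming the count-trailing-zeros builtin __builtin_ctzll; this only illustrates the idea, it is not the code that was benchmarked:

#include <stdint.h>

static uint64_t gcd64(uint64_t a, uint64_t b)
{
    int shift;

    if (a == 0) return b;
    if (b == 0) return a;

    shift = __builtin_ctzll(a | b);    /* common factors of two */
    a >>= __builtin_ctzll(a);          /* make a odd */

    do {
        b >>= __builtin_ctzll(b);      /* make b odd */
        if (a > b) {
            uint64_t t = a; a = b; b = t;
        }
        b -= a;                        /* odd minus odd is even */
    } while (b != 0);

    return a << shift;
}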

The combination of the two optimizations mentioned above now appears to gain another 15-20%. Shall I generate a PR?


15-20% is not a tiny bit. If you have it around anyway, please make a PR, so I can check.

If we have a good case I’ll try to get it merged upstream. I also intend to do that with the allocation change in due time.

At two places in libbf.h, realloc() seems to be called with a valid ptr and size = 0. glibc’s realloc() hands this over to free().

static inline void bf_free(bf_context_t *s, void *ptr)
{
    /* must test ptr otherwise equivalent to malloc(0) */
    if (ptr)
        bf_realloc(s, ptr, 0);
}

static inline void bf_delete(bf_t *r)
{
    bf_context_t *s = r->ctx;
    /* we accept to delete a zeroed bf_t structure */
    if (s && r->tab) {
        bf_realloc(s, r->tab, 0);
    }
}

The man page for realloc says that the behavior for realloc(ptr, 0) is, in fact, nonportable, and valgrind 3.21 now complains about it. If I replace the calls to bf_realloc by free(ptr) and free(r->tab), the warnings go away.

A proper fix probably requires an additional pointer to a free() function in the bf_context.
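A hypothetical sketch of such a fix: a free hook next to the existing realloc hook, so bf_free() and bf_delete() no longer rely on realloc(ptr, 0). Type and field names are guesses for illustration, not the real libbf declarations:

#include <stddef.h>

typedef void *bf_realloc_func_t(void *opaque, void *ptr, size_t size);
typedef void  bf_free_func_t(void *opaque, void *ptr);

typedef struct bf_context_t {
    void              *realloc_opaque;
    bf_realloc_func_t *realloc_func;
    bf_free_func_t    *free_func;      /* new: explicit deallocation hook */
    /* ... remaining fields ... */
} bf_context_t;

static inline void bf_free(bf_context_t *s, void *ptr)
{
    if (ptr)
        s->free_func(s->realloc_opaque, ptr);
}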

Thanks. As I recall from the work I did on LibBF to reduce the number of reallocations, this use of realloc() is quite deep in the system. It might need some study to fix it cleanly.

I made a PR. It did not appear too complicated, except that the wrapper function for free() seems to require a size argument. The standard free() doesn’t, but of course other functions may need it. I was a bit unsure what to do with it (and that may invalidate the entire PR).


How can I limit the precision of a big number, e.g. to eight digits after the decimal point?

Big numbers are integers, so there is no decimal point to limit. format/2 has a lot of options for printing floats at different precision, including printing integers as fixed-point decimals. As is, floats in SWI-Prolog are C doubles, i.e. 64-bit IEEE floats on almost all modern hardware. LibBF is at the moment used as a replacement for GMP for big integers and rationals only.
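For example (illustrative queries, not from the thread): ~Nf rounds a float to N digits, and ~Nd prints an integer with a decimal point inserted N digits from the right:

?- format("~4f~n", [3.14159265]).
3.1416
true.

?- format("~8d~n", [314159265]).
3.14159265
true.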
