LibBF binding

I took some rainy Sunday afternoon time to have a closer look at LibBF as alternative for LibGMP. Summarizing the potential benefits:

  • Cleaner embedding interface can simplify and speed up the binding to Prolog.
  • Uniform data representation for big integers, floats and decimal floats means we need to do the tricky work of converting the data structures to/from Prolog only once.
  • Provides arbitrary precision as well as decimal floats.
  • Claims correct rounding with many stateless rounding modes.
  • Small
  • Simple plain C code
  • MIT license. Together with small, this means there is no reason anymore to make bignum support a build-time option.
  • Written by Fabrice Bellard. Search for his name; he has quite a track record :slight_smile:

There are also some problems:

  • Not as widespread as LibGMP. That may imply maintenance problems. On the other hand, the code is plain C, and both C and the underlying math principles are a rock-solid basis.
  • Lacks quite a few functions, notably rationals.
  • Slower than LibGMP.

I wanted to understand the options, so I’ve written a partial LibGMP emulation using LibBF. You can find that in a branch libbf of swipl-devel.git.

So far, the status is:

  • Identified the 92 functions we use from LibGMP and put them in bf_gmp.h
  • Implemented 52 of them. Except for 3, all are short inline functions
  • Implemented exchange to the Prolog stacks and recorded database (used for toplevel variable reuse).

The basic big integer (+,-,*,/,^) stuff runs. Whenever you touch a function that is not implemented it will tell you (and often crash). Findings:

  • Full emulation seems not very hard. Quite a few of the 40 remaining functions are simple one-liners, some require some work. I wouldn’t expect completing the emulation to take more than a week.
  • Performance of ^ is poor. It seems to be using the general float approach here (LibBF is primarily a floating-point library). There seem to be integer-related functions in the non-public part of the library; otherwise it is not too hard to write ourselves.
  • LibBF only stores the bits between lsb and msb. That means it can deal more efficiently with bitmaps than LibGMP. It lacks most functions to do so, but these are easy to implement.
  • Performance of basic bigint +,-,*,/ is about twice as slow as LibGMP. As the core distribution indicates, one can gain about 40% on Intel/AMD using some platform dependent optimization flags. If we could leverage the better API we are likely to win on bigints of just a few limbs.

A possible line to continue is

  • Complete the emulation. This would set up the system to use LibBF (probably by default) and LibGMP as a build option. The result would be somewhat slower, but would actually realize permissive licensing for the complete system. Right now, anyone who cannot handle the LGPL license needs to drop big integers.
  • Abstract several of the operations we need and implement them independently on LibGMP and LibBF, to base them on the optimal combination of primitives.
  • Abstract the interface such that we can actually profit from LibBF’s better embedding interface. As we are now emulating LibGMP’s void functions and use a complicated setjmp/longjmp and allocation wrapper, we cannot profit.

After that we could consider if/how to support arbitrary precision and/or decimal floats.

Comments? Volunteers? Commercial support to make this happen?


And I still don’t really see the point. If the power function tries to use the exact integer/rational path when possible, you can always force it to use the float path by casting one of its arguments (or both, but the other will follow by implicit conversion anyway). So, ISO X**Y is simply the same as SWI-Prolog’s float(X)**Y. Closely related: if we want a float from multiplying two large integers, we can write R is float(X*Y) or R is float(X)*Y.

LibBF and this are only remotely related. LibBF mainly can provide a route out of the license limitations of LibGMP, while at the same time allowing us to extend towards arbitrary precision and decimal floats with very limited effort, because all three are based on the same LibBF type. LibBF provides correctly rounded versions of some float functions, including pow(), by computing them at higher precision before rounding. The price is performance, so if we go that way we need some way to choose. An oddity of LibBF is that, unlike LibGMP, float values have no set precision; instead, computations have. I.e., you tell sin() that you want N bits of precision. The existence of these two views may affect how we should provide access to this from Prolog.

The first choice though is whether to invest in LibBF, some other library or do nothing.

I have attended a lot of ISO meetings, including when this was done. The core reason why ** cannot handle integers is that it is standardized as it is. Changing that would imply a modification to the standard, which seems next to impossible. I don’t buy that: maintaining backward compatibility is something we must aim for, but doing it at all cost creates an ugly language with a lot of near-duplicate primitives just because you cannot extend what is there.

Math should have been allowed to provide more precise results. As is, it is even hard to fit rational numbers in, as / is supposed to produce floats. We already have // for truncated integer division (ok), so how do we allow, in a natural way, saying A is 1/3 to yield A = 1r3? Now we have (also in ECLiPSe) a flag prefer_rationals that dictates the outcome. I have it set to true in my default init file. Works fine :slight_smile:

So, in SWI-Prolog ** and ^ are identical. It is a pity ^ had to be added :frowning:

?- X is 1889** -29.
X = 1/102500687830512064253064716070601123218909116358654017979384519544555980744900395407329380298209.

A couple of 10,000 ft. questions:

Does this mean the current float representation (C doubles) is replaced by bf_t’s? That would imply either using libBF software for all simple arithmetic functions (+,-,*,/) or converting to/from C doubles for each operation. Either way, it would seem to be a significant performance hit on all float arithmetic.

How would rationals be done? Re-implementing GMP rationals would appear to be a far from trivial task.

This proposal would seem to largely benefit those who would like to use GMP functionality but can’t due to licensing/build issues, so hope we hear from them. And I suppose there’s some advantage in streamlining the SWIP configuration (?), but that’s not something most users (as opposed to developers) see.

Bottom line: I’m struggling a bit with net benefits (my own interest in cr_* functions aside).

Probably not. I guess that we will use a dual representation for floats, either as double or as bf_t, unless the transformation is sufficiently fast not to impact performance significantly. That may actually be the case. Creating a double from a bf_t is merely grabbing the mantissa, exponent and sign and sticking them into the right place. As we already need to fiddle with bits to get the double from the Prolog stack, this might not make things much worse. It could be interesting to replace the triple implementation for integers with just two. Now we have (on 64-bit platforms) tagged for up to 54 bits, indirect for 55-64 bits and GMP for > 64 bits.

It is not a lot that we use. We need to write gcd to normalize; the rest is close to trivial. Given divrem (division with remainder), which is already there, gcd is not a big deal.

I wouldn’t call this minor :slight_smile: The next step can bring arbitrary precision floats (no clue how popular these are) and decimal floats. These are interesting for talking to databases, particularly in the finance world.

For you, it would only provide slow correct functions for the limited set that LibBF supports.

There doesn’t appear to be any other takers, so I’d be willing to push this agenda a bit further based on the existing libbf fork of swipl-devel.

Based on the current state, here’s a suggestion for the next step:

  1. Complete the emulation by filling in implementations for all the bf_not_implemented() cases.

  2. Treat random number generation as a separate topic. There already exists a non-GMP implementation, so just use that if GMP isn’t available, rather than trying to replicate this GMP functionality using libBF. This eliminates the mpf_ functions as well as the mpz random functions from this phase. (If necessary, I suspect there are a few useful C implementations of random number generation that could be used in place of the current GMP-based implementation.)

I would focus on the libbf folder code, leaving configuration and build issues to those who know what they’re doing.

I’d also suggest that an issue be opened on github so we can get the inevitable discussions of technical details off the forum.

Based on the level of my C programming skills, it will likely take me longer than that, so if you have a better offer, you should take it.

Thanks for picking this up. I agree with the plan. In due time we’ll need a replacement for mpz random numbers, but there is no hurry. I assume there is a free implementation of the Mersenne Twister somewhere.

I’ve pushed some further progress to the libbf branch. Good and bad …

  • Made the elementary handling of rationals work (reading, printing, copying to/from the stacks and the recorded database).
  • Implemented mpz_gcd() and mpq_canonicalize()
  • Implemented rational number +, -, *, /, currently except for mpq_cmp() and some of the support functions that do things such as adding small integers to rationals.

That is the good news. The bad news is that mpz_gcd() is really slow. I use the Euclidean algorithm, but the remainder of integer division seems slow: now about 30 times slower than GMP. There are some remarks in LibBF about Euclidean division, and libbf.c contains some internal functions that may help us out. If we cannot seriously improve this, the option to use LibBF as the default replacement for GMP is no longer valid.

The current approach is to implement all functions, sometimes in a somewhat naive way, i.e., using (too) high-level primitives.

Anyway, contributions are welcome. Let us keep in touch to avoid double work.

I just finished cloning the libbf branch. Ran into a couple of build issues which had relatively easy fixes:

/Volumes/Mac Pro HD/Developer/swipl-devel/swipl-devel-libbf/src/libbf/bf_gmp.h:456:26: error: parameter name omitted
mpq_set_d(mpq_t r, double)
                         ^
1 error generated.
ninja: build stopped: subcommand failed.

==> mpq_set_d(mpq_t r, double d)

typedef unsigned long mp_bitcnt_t;
                      ^
/Volumes/Mac Pro HD/Developer/swipl-devel/swipl-devel-libbf/src/libbf/bf_gmp_types.h:12:16: note: previous definition is here
typedef limb_t mp_bitcnt_t;
               ^
1 error generated.

In bf_gmp_types.h added:

#define HAVE_MP_BITCNT_T

With these two fixes, I built an executable, albeit with several warnings (“suggest braces around initialization of subobject” and “warning: absolute value function 'abs' given an argument of type 'slimb_t' (aka 'long long') but has parameter of type 'int' which may cause truncation of value [-Wabsolute-value]”) and the following:

ERROR: /Volumes/Mac Pro HD/Developer/swipl-devel/swipl-devel-libbf/build/home/xpce/prolog/lib/emacs/prolog_mode.pl:92:
ERROR:    Unknown procedure: rational/1
ERROR:      However, there are definitions for:
ERROR:            rational/3

It looks like you’ve already done a fair amount of what I was considering so maybe I should focus on:

Any particular test case for performance?

Agreed. Any other words of wisdom? libBF doesn’t have much in the way of documentation or examples so I’m mainly using your current code for educational purposes.

Fixed these. Didn’t look at the warnings. Please send PRs :slight_smile: Wow, you tried to build the whole system? I only build the core and the libedit package for now.

I forgot the exact tests. Just try some arbitrary calls like the one below; make sure at least one of A or B is a bigint (> 64 bits). X^Y is a good way to create big integers; preferably do this outside the loop to measure only the gcd. Compare to the normal GMP-based version.

 ?- forall(between(1, 100_000, _), _ is gcd(A,B)).

Docs are not brilliant :slight_smile: Funny thing is that some of the functions do have some docs in the libbf.c file (rather than in libbf.h). Using valgrind to get an idea where the time is spent, it appears to be doing bf_divrem() using a float approach. Possibly there is some low-level stuff in libbf.c that can help us, and there is an intriguing BF_DIVREM_EUCLIDIAN rounding mode …
You may also download the original source, that comes with some tests and demo programs.


Pushed some more emulation: most of the bit operations, fixes and completion to the various mpz integer division functions and some other low hanging fruit.

We are getting close to a complete implementation. The missing pieces I could use some help with are notably mpz_root(), mpz_rootrem() and mpz_powm(). Most of the other remaining ones should be short. Tests and benchmarks are also needed.

I also got a simple idea to avoid most of the allocation overhead. Both LibGMP and LibBF represent numbers using a struct with a pointer to the limbs. When bignums are on the stack, they are serialized, where the serialization captures the limbs and some additional data such as the sign and, for LibBF, the exponent. I.e., they only live briefly in their normal struct representation, and we only have a few around during arithmetic evaluation. As we just compile LibBF into the SWI-Prolog library, we can add a fixed-size limb buffer that will be good enough for “small bigints”, say 10 limbs (640 bits).

mpz_tstbit numbers bits using the convention that bit 0 is the least significant bit.

?- X is getbit(2^65+5,0).
X = 1.

?- X is getbit(2^65+5,1).
X = 0.

?- X is getbit(2^65+5,2).
X = 1.

bf_tstbit doesn’t seem to follow that convention:

?- B is getbit(2^65+5,0).
n= =0x0.80000000000000014000000000000000p66
B = 0.

?- B is getbit(2^65+5,1).
n= =0x0.80000000000000014000000000000000p66
B = 0.

?- B is getbit(2^65+5,2).
n= =0x0.80000000000000014000000000000000p66
B = 0.

Can you shed any light on how to get from the mpz semantics to that used in libBF?

Other than the obvious reason, I’ve been looking at an alternative gcd algorithm that depends on bit test, but maybe you’ve already found a solution to the slow gcd.

mpz_tstbit() was one of the first things I tried to add internally to LibBF, finding some bit function in the set of private functions. Seems I misunderstood the private function.

Anyway, this is easy to implement. Pushed a working version.

Nope. I leave the gcd performance to you. If you could look at mpz_root() and mpz_rootrem(), that would be great too.

Thanks.

I’ll look at the root functions if I can get acceptable performance on gcd. At first glance, they’ll have to be implemented from scratch using more basic arithmetic. Performance may not be great but that probably doesn’t matter too much.

Here’s a first cut at gcd using the binary algorithm - just shifts and subtractions (ref: Binary GCD (GNU MP 6.2.1) and Greatest common divisor - Wikipedia). GMP appears to recommend its use for a smallish number of limbs.

void
mpz_gcd(mpz_t r, const mpz_t n1, const mpz_t n2)
{ bf_t a, b;

  if ( bf_is_zero(n1) )
  { mpz_abs(r, n2);
    return;
  }
  if ( bf_is_zero(n2) )
  { mpz_abs(r, n1);
    return;
  }

  mpz_init(&a);
  mpz_init(&b);
  // gcd is always positive
  mpz_abs(&a, n1);
  mpz_abs(&b, n2);
  int d = 0;

  while (mpz_cmp(&a, &b) != 0)
    switch (mpz_tstbit(&a, 0)*2 + mpz_tstbit(&b, 0))
    { case 0: // both even
        mul_2exp(&a, -1);
        mul_2exp(&b, -1);
        d++;
        break;
      case 1: // a even, b odd
        mul_2exp(&a, -1);
        break;
      case 2: // a odd, b even
        mul_2exp(&b, -1);
        break;
      case 3: // both odd, use r as temporary for swapping
        mpz_sub(r, &a, &b);
        if (r->sign)
        { mpz_set(&b, &a);  // a < b --> a0    -> b1
          bf_neg(r);
          mpz_set(&a, r);   //           b0-a0 -> a1
        } else {
          mpz_set(&a, &b);  // a > b --> b0    -> a1
          mpz_set(&b, r);   //           a0-b0 -> b1
        }
        break;      
    }  
  
  mul_2exp(&a, d);   // a==b, so we're done, a*2^d -> r
  mpz_set(r, &a);
  mpz_clear(&a);
  mpz_clear(&b);
  return;

}

/* multiply 'r' by 2^e */
// shift in place (copied from libbf.c with rounding removed)
void
mul_2exp(bf_t *r, slimb_t e)
{
    slimb_t e_max;
    if (r->len != 0)
    { e_max = ((limb_t)1 << BF_EXT_EXP_BITS_MAX) - 1;
      e = bf_max(e, -e_max);
      e = bf_min(e, e_max);
      r->expn += e;
    }
    return;
}

Seems to do a little better than Euclidean (division), but I suspect it may be data dependent. Still a ways to go before matching GMP, though.

Thanks. On two random equally sized integers we now lose by a bit over 10 times, which is already a lot better than 30 times. On equally sized integers both implementations seem pretty much linear in the number of bits of the operands. On unequal sizes, as @j4n_bur53 suggested, the old algorithm wins, so we probably need to switch as well.

I’m still a bit puzzled though. The only time-consuming operation in the Stein algorithm seems to be the subtract, and that is only 10% slower on two random 400-bit integers. So why do they win 10 times?

Here is the benchmark code I now use. Call for example using

?- bench(gcd, 1000, 200, 300).

to perform 100,000 gcds using 1,000 different pairs of integers, where the first is a random number between 0 and 1<<200 and the second between 0 and 1<<300.

:- use_module(library(apply_macros), []).

bench(Func, N, M1, M2) :-                                    
    call_time(bench0(Func, N, M1, M2), Time),                
    call_time(bench0(nop, N, M1, M2), Nop),                  
    Used is Time.cpu - Nop.cpu,                              
    format('~3f sec (overhead was ~3f sec)~n',               
           [Used, Nop.cpu]).

bench0(Func, N, M1, M2) :-                                   
    Max1 is 1<<M1,                                           
    Max2 is 1<<M2,                                           
    (   between(1, N, _),                                    
        R1 is random(Max1),                                  
        R2 is random(Max2),                                  
        forall(between(1, 100, _),                           
               func(Func, R1, R2)),                          
        fail                                                 
    ;   true                                                 
    ).

func(gcd, A, B) :- _ is gcd(A,B).
func(add, A, B) :- _ is A + B.
func(sub, A, B) :- _ is A - B.
func(mul, A, B) :- _ is A * B.
func(div, A, B) :- _ is A / B.
func(nop, _, _).

This evaluation is on a Mac M1 using Apple Clang. As the basic operations are not too far off, it seems it should be possible to get close. Note that you are not allowed to copy from GMP for copyright reasons. AFAIK it should be fine to get your ideas from Wikipedia and many other places on the internet, though.

On the M1 the performance gap seems quite low :slight_smile: LibGMP claims it can be twice as fast on modern X86_64 with a good choice of gcc flags as with the simple -O2 we use now.

Merged your version into the current branch using #ifdef as we probably want to get a merged version (or something else).

Good suggestion, thank you. This will replace a test_bit(0) with a scan for the LS 1 bit, which I think should be comparable in performance at the level I’m interested in.

I seem to be getting confusing results on my Mac_intel. Using the Euclidean algorithm, 5 successive executions:

?- bench(gcd, 1000, 200, 300).
91.995 sec (overhead was 0.017 sec)
true.

?- bench(gcd, 1000, 200, 300).
58.871 sec (overhead was 0.017 sec)
true.

?- bench(gcd, 1000, 200, 300).
47.732 sec (overhead was 0.019 sec)
true.

?- bench(gcd, 1000, 200, 300).
49.749 sec (overhead was 0.017 sec)
true.

?- bench(gcd, 1000, 200, 300).
122.296 sec (overhead was 0.017 sec)
true.

I’ve temporarily replaced function mpz_lcm with Stein’s algorithm for gcd so I can test both with the same build. Running Stein’s 5 times:

?- bench(lcm, 1000, 200, 300).
3.394 sec (overhead was 0.017 sec)
true.

?- bench(lcm, 1000, 200, 300).
3.397 sec (overhead was 0.017 sec)
true.

?- bench(lcm, 1000, 200, 300).
3.393 sec (overhead was 0.017 sec)
true.

?- bench(lcm, 1000, 200, 300).
3.415 sec (overhead was 0.017 sec)
true.

?- bench(lcm, 1000, 200, 300).
3.410 sec (overhead was 0.017 sec)
true.

Stein’s is over an order of magnitude faster and much more consistent (on this benchmark). Not sure which compiler I’m using, but:

?- current_prolog_flag(c_cc,C), current_prolog_flag(c_cflags,CF).
C = cc,
CF = '-fPIC -pthread'.

Standard build instructions ( cmake, ninja, ninja install ).

Surely something is not right.

I guess I was expecting some variability due to the different random number sets, but I wasn’t expecting:

  • so much variability using Euclidean vs. Stein’s
  • over an order of magnitude difference between the two, given that Jan is only seeing a difference of a factor of 3

Bottom line: I hesitate to make any design decisions based on benchmark results in my environment.

We also have a problem with the mpz_scan1 function (lsb). Using GMP:

?- between(0,16,N), X is lsb(2^64+N), format("~d ",[X]),fail.
64 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0 4 
false.

Using libBF emulation:

?- between(0,16,N), X is lsb(2^64+N), format("~d ",[X]),fail.
64 0 1 -63 2 -62 1 -63 3 -61 1 -63 2 -62 1 -63 4 
false.

The numbers you show repeat on Linux on an AMD 3950X with gcc 11, be it that I get about 3 times faster results. Variation is more or less the same though. We probably need a bit more sophisticated code to evaluate :slight_smile: My measurements were on the M1, with a poorly configured LibBF (using 32-bit limbs). I guess we can also get some speedup by switching to native machine integers when our numbers fit.

Pushed a fix for that. Also added mpq_set_d() and fixed mpz_com().

So here’s a version that uses mpz_scan1 and does multiple bit shifts as per @j4n_bur53’s suggestion. It seems to run about 10% faster than the previous version on the current benchmark.

void
mpz_gcd(mpz_t r, const mpz_t n1, const mpz_t n2)
{ bf_t a, b;
  mp_bitcnt_t als1, bls1, k;

  if ( bf_is_zero(n1) )
  { mpz_abs(r, n2);
    return;
  }
  if ( bf_is_zero(n2) )
  { mpz_abs(r, n1);
    return;
  }

  mpz_init(&a);
  mpz_init(&b);
  // gcd is always positive
  mpz_abs(&a, n1);
  mpz_abs(&b, n2);
  int d = 0;
  while (mpz_cmp(&a, &b) != 0)
  { als1 = mpz_scan1(&a, 0);
    bls1 = mpz_scan1(&b, 0);
    switch ((als1 == 0)*2 + (bls1 == 0))
    { case 0: // both even
        k = -bf_min(als1,bls1);
        mul_2exp(&a, k);
        mul_2exp(&b, k);
        d = d-k;  // add offset (k is negative)
        break;
      case 1: // a even, b odd
        mul_2exp(&a, -als1);
        break;
      case 2: // a odd, b even
        mul_2exp(&b, -bls1);
        break;
      case 3: // both odd, use r as temporary for swapping
        mpz_sub(r, &a, &b);
        if (r->sign)
        { mpz_set(&b, &a);  // a < b --> a0    -> b1
          bf_neg(r);
          mpz_set(&a, r);   //           b0-a0 -> a1
        } else {
          mpz_set(&a, &b);  // a > b --> b0    -> a1
          mpz_set(&b, r);   //           a0-b0 -> b1
        }
        break;      
    }
  } 
  
  mul_2exp(&a, d);   // a==b, so we're done, a*2^d -> r
  mpz_set(r, &a);
  mpz_clear(&a);
  mpz_clear(&b);
  return;
}

Substituting this for lcm:

?- bench(lcm,1000,200,300).
3.106 sec (overhead was 0.019 sec)
true.

I’m going to look into the root functions before I try any more tuning.

Great. I’ll try to merge these (did some cleanup, notably using mpz_* types and functions; as all is inlined, this comes at zero overhead). Could you use PRs instead? That makes it a little easier to manage …

Agreed. Gcd isn’t perfect yet, but I’m now fairly convinced we can get it fast enough. That was also the approach with the rest of the code: it didn’t need to be perfect, not even correct. It just needs to convince us it is not going to be a show-stopper. The root functions seem to be the only serious thing lacking.

edit Merged and pushed to the libbf branch. It is indeed about 10% faster.