Integer benchmarks

For each benchmarked library, the timings in the bar charts are split into three parts:

  • the init bar, which accounts for the time needed for the initialisation of the benchmark data,

  • the operation bar, which represents the time needed to actually perform the operation being benchmarked,

  • the total bar, which represents the total runtime of the benchmark.

The intent is to provide a breakdown which measures both the performance of the operation being benchmarked and the effect of the small value optimisation implemented in mp++ (the latter tends to have a noticeable impact in the initialisation phase of the benchmarks, which usually involves the creation of a large number of integer objects).

Dot product

This benchmark measures the time needed to compute the dot product of two integer vectors of size \(3\times 10^7\) containing randomly-generated values. The multiply-add primitives of each library are used for the accumulation of the dot product. The benchmark is run in a variety of different setups.

1-limb unsigned integers

In this setup, the integer vectors are initialised with small non-negative values, and the final dot product is less than \(2^{64}\). mp++ integers with 1 limb of static size are employed. For this benchmark, we also include timings for 64-bit (std::uint64_t) and 128-bit (__uint128_t) integers.

_images/integer1_dot_product_unsigned.png

It can be immediately seen how the initialisation cost for the mpz_int class is much higher than for the other integer types. This is due to the fact that the GMP API always uses dynamically-allocated memory, even for small values. The other multiprecision integer types all employ a small-value optimisation, and thus avoid the performance cost of heap allocation.

In this particular benchmark, mp++ is about 3 times faster than GMP (as measured via the mpz_int wrapper) in the operation portion of the benchmark. mp++ is also faster than cpp_int and FLINT, albeit by a smaller margin. With respect to the 64/128-bit hardware integer types, there are x4/x2 slowdowns in the operation portion of the benchmark.

1-limb signed integers

This setup is almost identical to the previous one, but this time the vectors are initialised with both positive and negative values. For this benchmark, we also include timings for 64-bit (std::int64_t) and 128-bit (__int128_t) integers.

_images/integer1_dot_product_signed.png

The presence of both positive and negative values has a noticeable performance impact with respect to the previous test for all libraries, both during the initialisation of the vectors and in the operation part of the benchmark. This is due to the fact that sign handling in multiprecision computations is typically implemented with branches, and when positive and negative values are equally likely the effectiveness of the CPU’s branch predictor is much reduced.

The performance advantage of mp++ with respect to the other libraries is retained, albeit by a smaller margin. With respect to the 64/128-bit hardware integer types, there are x10/x4 slowdowns in the operation portion of the benchmark. The worse slowdowns (with respect to the previous unsigned test) are due to the high number of mispredicted branches in this particular test.

2-limb unsigned integers

This setup is the 2-limb version of the 1-limb unsigned benchmark: the vectors are initialised with larger non-negative values, the final result is less than \(2^{128}\), and mp++ integers with 2 limbs of static size are now employed.

_images/integer2_dot_product_unsigned.png

The benchmark shows that mp++’s specialised arithmetic functions pay off in terms of raw performance in this scenario.

2-limb signed integers

This setup is the signed version of the previous benchmark.

_images/integer2_dot_product_signed.png

As explained earlier, arithmetic with mixed positive and negative values is more expensive than arithmetic with only non-negative values.

Vector multiply-add

In this benchmark, we first compute the element-wise multiplication of two random vectors of size \(3\times 10^7\), followed by the addition to another random vector of the same size. This workload allows to measure the efficiency of the multiplication and addition operations (whereas in the dot product benchmark the multiply-add primitives were employed), and it also increases the pressure on the memory subsystem (due to the need to write into large vectors instead of accumulating the result into a single scalar).

1-limb unsigned integers

In this setup, the integer vectors are initialised with small non-negative values. mp++ integers with 1 limb of static size are employed. For this benchmark, we also include timings for 64-bit (std::uint64_t) and 128-bit (__uint128_t) integers.

_images/integer1_vec_mul_unsigned.png

This time mp++ is more than 5 times faster than GMP in the operation portion of the benchmark, while still maintaining a performance advantage over cpp_int and FLINT. The slowdown in the operation part of the benchmark with respect to 64-bit integers is less than x3, while mp++ performs similarly to 128-bit integers for this particular test.

1-limb signed integers

In this setup, the vectors are initialised with both positive and negative values. For this benchmark, we also include timings for 64-bit (std::int64_t) and 128-bit (__int128_t) integers.

_images/integer1_vec_mul_signed.png

We can see again how the introduction of mixed positive and negative values impacts performance negatively with respect to the unsigned setup. With respect to the 64/128-bit hardware integer types, there are x4/x2 slowdowns in the operation portion of the benchmark.

2-limb unsigned integers

This setup is the 2-limb version of the 1-limb unsigned benchmark: the vectors are initialised with larger non-negative values, and mp++ integers with 2 limbs of static size are now employed.

_images/integer2_vec_mul_unsigned.png

The benchmark shows again that mp++’s specialised arithmetic functions deliver strong performance.

2-limb signed integers

This setup is the signed version of the previous benchmark.

_images/integer2_vec_mul_signed.png

Vector division

In this benchmark we compute the element-wise division with remainder of two randomly-generated vectors of size \(3\times 10^7\), followed by the sum of all the values in the quotient vector.

1-limb unsigned integers

In this setup, the integer vectors are initialised with small non-negative values. mp++ integers with 1 limb of static size are employed.

_images/integer1_vec_div_unsigned.png

mp++ and FLINT perform well on this test, and they are about 5 times faster than GMP.

1-limb signed integers

In this setup, the vectors are initialised with both positive and negative values.

_images/integer1_vec_div_signed.png

Here we can see FLINT pulling ahead of mp++. FLINT uses a signed integer representation for small values that allows to avoid branching based on the signs of the operands. Coupled to the fact that there’s no need to do overflow checking during division, FLINT’s implementation has a distinct performance advantage in this specific test.

2-limb unsigned integers

This setup is the 2-limb version of the 1-limb unsigned benchmark: the vectors are initialised with larger non-negative values, and mp++ integers with 2 limbs of static size are now employed.

_images/integer2_vec_div_unsigned.png

2-limb signed integers

This setup is the signed version of the previous benchmark.

_images/integer2_vec_div_signed.png

Sorting

This benchmark consists of the sorting (via std::sort()) of a randomly-generated vector of \(3\times 10^7\) integers.

1-limb unsigned integers

In this setup, the integer vector is initialised with small non-negative values. mp++ integers with 1 limb of static size are employed.

_images/integer1_sort_unsigned.png

Here again it can be seen how the small-value optimisation implemented in mp++, cpp_int and FLINT pays off on large datasets with respect to plain GMP integers. mp++ shows a modest performance increase with respect to cpp_int and FLINT.

1-limb signed integers

In this setup, the vector is initialised with both positive and negative values.

_images/integer1_sort_signed.png

2-limb unsigned integers

In this setup, the integer vector is initialised with non-negative values in the \(\left[2^{64},2^{128}\right)\) range. mp++ integers with 2 limbs of static size are employed.

_images/integer2_sort_unsigned.png

2-limb signed integers

In this setup, the vector is initialised with both positive and negative values in the \(\left(-2^{128},2^{128}\right)\) range. mp++ integers with 2 limbs of static size are employed.

_images/integer2_sort_signed.png

Left bit shift

This benchmark consists of the element-wise left bit shifting of a randomly-generated vector of \(3\times 10^7\) integers.

1-limb unsigned integers

In this setup, the integer vector is initialised with small non-negative values, which are then left bit shifted by a small random amount (so that the result always fits into a machine word). mp++ integers with 1 limb of static size are employed.

_images/integer1_vec_lshift_unsigned.png

1-limb signed integers

In this setup, the integer vector is initialised with small positive and negative values, which are then left bit shifted by a small random amount (so that the result always fits into a machine word). mp++ integers with 1 limb of static size are employed.

_images/integer1_vec_lshift_signed.png

2-limb unsigned integers

In this setup, the integer vector is initialised with non-negative values in the \(\left[2^{64},2^{128}\right)\) range, which are then left bit shifted by a small random amount (so that the result always fits into 2 machine words). mp++ integers with 2 limbs of static size are employed.

_images/integer2_vec_lshift_unsigned.png

2-limb signed integers

In this setup, the integer vector is initialised with positive and negative values in the \(\left(-2^{128},2^{128}\right)\) range, which are then left bit shifted by a small random amount (so that the result always fits into 2 machine words). mp++ integers with 2 limbs of static size are employed.

_images/integer2_vec_lshift_signed.png

GCD

In this benchmark we compute the element-wise GCD of two randomly-generated vectors of size \(3\times 10^7\), followed by the sum of all the values in the GCD vector.

1-limb signed integers

In this setup, the vectors are initialised with small positive and negative values with an average size, in bits, of approximately 49. All the computations fit within a single 64-bit word, and mp++ integers with 1 limb of static size are employed.

Note

Due to the current lack of an optimised assembly implementation of GCD primitives for Ryzen processors in GMP, this test was run on an Intel Xeon E5-2698 v4 CPU instead.

_images/integer1_vec_gcd_signed.png