Parallel mode#

Added in version 0.18.0.

Starting from version 0.18.0, heyoka can automatically parallelise the integration of a single ODE system using multiple threads of execution. This parallelisation mode is fine-grained, i.e., it acts at the level of an individual integration step, and it thus serves a fundamentally different purpose from the coarse-grained parallelisation approach of ensemble propagations.
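To make the contrast concrete, here is a rough, self-contained sketch of the coarse-grained approach (this is not heyoka's ensemble propagation API, which is covered in its own tutorial, but a plain illustration using TBB directly): several independent integrators, each advanced serially in its own task. Parallel mode, instead, uses multiple threads to carry out a single step of a single integrator.

// Rough sketch of coarse-grained parallelism, for contrast only: several
// independent integrators, each advanced serially in its own TBB task.
#include <cstddef>
#include <vector>

#include <oneapi/tbb/parallel_for.h>

#include <heyoka/heyoka.hpp>

using namespace heyoka;

int main()
{
    // A small two-body system, used here only as a stand-in ODE system.
    auto sys = model::nbody(2, kw::masses = std::vector{1., 0.}, kw::Gconst = 1.);

    // A template integrator: the second body starts on a circular orbit.
    taylor_adaptive<double> ta{sys, {0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.}};

    // One independent copy of the integrator per task.
    std::vector<taylor_adaptive<double>> tas(8u, ta);

    // Each task propagates its own integrator; no individual step is parallelised.
    oneapi::tbb::parallel_for(std::size_t(0), tas.size(),
                              [&tas](std::size_t i) { tas[i].propagate_for(10.); });
}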

In order to be effective, parallel mode needs large ODE systems, that is, systems with a large number of variables and/or large expressions on the right-hand side. When used on small ODE systems, parallel mode will likely introduce a noticeable slowdown due to the multithreading overhead.

Note that, because Taylor integrators are memory intensive, performance for large ODE systems is bottlenecked by RAM speed due to the memory wall. In practice this means that, at least for double-precision computations, the performance of parallel mode will not scale linearly with the number of cores. For extended-precision computations, on the other hand, the scaling will be much better, because arithmetic operations on extended-precision operands are computationally much heavier than on double-precision operands, which reduces the relative impact of the memory bottleneck.

With these caveats out of the way, let us see an example of parallel mode in action.

Parallel planetary embryos#

In order to illustrate the effectiveness of parallel mode, we will set up an N-body system consisting of a large number (\(N=400\)) of protoplanets in orbit around a Sun-like star. The protoplanets interact with the star and with each other according to Newtonian gravity, and they are initially placed on circular orbits.

Let us begin by defining a run_benchmark() function that will set up and integrate the N-body system. The function is parametrised over the floating-point type T that will be used for the integration (so that we can easily run the same test in both double and quadruple precision). The input arguments are the final time and a boolean flag specifying whether or not to use parallel mode. The return value is the total wall-clock time (in milliseconds) taken by the integration:

// Generic function for running the benchmark with floating-point type T.
template <typename T>
double run_benchmark(T final_time, bool parallel_mode)
{
    // The number of protoplanets.
    const unsigned nplanets = 400;

    // G constant, in terms of solar masses, AUs and years.
    const auto G = T(0.01720209895) * T(0.01720209895) * 365 * 365;

    // Init the masses vector with the solar mass.
    std::vector masses{T(1)};

    // Add the planets' masses.
    for (auto i = 0u; i < nplanets; ++i) {
        masses.push_back((T(1) / 333000) / ((i + 1u) * (i + 1u)));
    }

    // Create the nbody system.
    auto sys = model::nbody(nplanets + 1u, kw::masses = masses, kw::Gconst = G);

    // The initial state (zeroed out, we will change it later).
    std::vector<T> init_state((nplanets + 1u) * 6u);

    // Create the integrator.
    // NOTE: compact_mode is *required* when using parallel mode.
    taylor_adaptive<T> ta{std::move(sys), std::move(init_state), kw::compact_mode = true,
                          kw::parallel_mode = parallel_mode};

    // Create xtensor views for ease of indexing.
    auto s_array = xt::adapt(ta.get_state_data(), {nplanets + 1u, 6u});
    auto m_array = xt::adapt(masses.data(), {nplanets + 1u});

    // Set the initial positions at regular intervals on the x axis
    // on circular orbits. The Sun is already in the origin with zero
    // velocity.
    for (auto i = 0u; i < nplanets; ++i) {
        using std::sqrt;

        s_array(i + 1u, 0) = i + 1u;
        s_array(i + 1u, 4) = sqrt(G / (i + 1u));
    }

    // Take the current time.
    auto start = std::chrono::high_resolution_clock::now();

    // Integrate.
    ta.propagate_for(final_time);

    // Return the elapsed time.
    return static_cast<double>(
        std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start)
            .count());
}

Parallel mode can be enabled when constructing the integrator object:

    // Create the integrator.
    // NOTE: compact_mode is *required* when using parallel mode.
    taylor_adaptive<T> ta{std::move(sys), std::move(init_state), kw::compact_mode = true,
                          kw::parallel_mode = parallel_mode};

Note that parallel mode requires compact mode: if you try to construct a parallel-mode integrator without enabling compact mode, an exception will be thrown.
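As a quick illustration of this requirement, here is a minimal, self-contained sketch (not part of the benchmark code) that attempts to construct a parallel-mode integrator without compact mode. The exact exception type is not important for our purposes, so the sketch simply catches std::exception and prints the error message:

// Minimal sketch: constructing a parallel-mode integrator without
// enabling compact mode is expected to throw.
#include <exception>
#include <iostream>
#include <utility>
#include <vector>

#include <heyoka/heyoka.hpp>

using namespace heyoka;

int main()
{
    // A small two-body system, used here only as a stand-in ODE system.
    auto sys = model::nbody(2, kw::masses = std::vector{1., 1e-3});

    try {
        // parallel_mode is enabled, but compact_mode is left at its
        // default value of false.
        taylor_adaptive<double> ta{std::move(sys), std::vector<double>(12u, 0.),
                                   kw::parallel_mode = true};
    } catch (const std::exception &e) {
        std::cerr << "Construction failed: " << e.what() << '\n';
    }
}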

Before running the benchmarks, we will limit the number of threads available to heyoka to 8. Since heyoka internally uses the TBB library for multithreading, we will use the TBB API to achieve this:

int main()
{
    // Limit to 8 threads of execution.
    oneapi::tbb::global_control gc(oneapi::tbb::global_control::parameter::max_allowed_parallelism, 8);

Let us now run the benchmark in double precision, comparing the timings with and without parallel mode:

    // Run the serial vs parallel comparison in double precision.
    auto serial_time_dbl = run_benchmark<double>(10, false);
    std::cout << "Serial time (double): " << serial_time_dbl << "ms\n";

    auto parallel_time_dbl = run_benchmark<double>(10, true);
    std::cout << "Parallel time (double): " << parallel_time_dbl << "ms\n";
Serial time (double): 21107ms
Parallel time (double): 5887ms

We can see that parallel mode resulted in a \(\times 3.6\) speedup with respect to the serial integration. While this falls short of the theoretical maximum of \(\times 8\) (recall that we limited heyoka to 8 threads), these timings show that parallel mode can still provide an easy way of boosting the integrator's performance for large ODE systems.

Let us now repeat the same test in quadruple precision:

    // Run the serial vs parallel comparison in quadruple precision.
    auto serial_time_f128 = run_benchmark<mppp::real128>(1, false);
    std::cout << "Serial time (real128): " << serial_time_f128 << "ms\n";

    auto parallel_time_f128 = run_benchmark<mppp::real128>(1, true);
    std::cout << "Parallel time (real128): " << parallel_time_f128 << "ms\n";
Serial time (real128): 210398ms
Parallel time (real128): 29392ms

In quadruple precision, the speedup is now \(\times 7.2\), which is quite close to optimal.

These results show that parallel mode can provide an easy way of boosting the performance of heyoka's integrators for large ODE systems. For double-precision computations the speedup is suboptimal because the integration is bottlenecked by memory speed; for quadruple-precision computations, where arithmetic dominates, the speedup is close to optimal.
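If you are curious about where the parallel scaling flattens out on your machine, one simple option is to re-run the benchmark under different thread limits, creating a new TBB global_control object for each run. The following hypothetical snippet (not part of the original benchmark driver) could be dropped into main() in place of the fixed 8-thread limit:

    // Hypothetical variation: re-run the double-precision benchmark under
    // increasing thread limits to see how the speedup scales.
    for (unsigned nthreads : {1u, 2u, 4u, 8u}) {
        // The limit is active only while 'gc' is alive.
        oneapi::tbb::global_control gc(oneapi::tbb::global_control::parameter::max_allowed_parallelism, nthreads);

        std::cout << nthreads << " thread(s) (double): " << run_benchmark<double>(10, true) << "ms\n";
    }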

Full code listing#

#include <chrono>
#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

#include <oneapi/tbb/global_control.h>

#include <xtensor/xadapt.hpp>
#include <xtensor/xio.hpp>
#include <xtensor/xmath.hpp>
#include <xtensor/xview.hpp>

#include <heyoka/heyoka.hpp>

#if defined(HEYOKA_HAVE_REAL128)

#include <mp++/real128.hpp>

#endif

using namespace heyoka;

// Generic function for running the benchmark with floating-point type T.
template <typename T>
double run_benchmark(T final_time, bool parallel_mode)
{
    // The number of protoplanets.
    const unsigned nplanets = 400;

    // G constant, in terms of solar masses, AUs and years.
    const auto G = T(0.01720209895) * T(0.01720209895) * 365 * 365;

    // Init the masses vector with the solar mass.
    std::vector masses{T(1)};

    // Add the planets' masses.
    for (auto i = 0u; i < nplanets; ++i) {
        masses.push_back((T(1) / 333000) / ((i + 1u) * (i + 1u)));
    }

    // Create the nbody system.
    auto sys = model::nbody(nplanets + 1u, kw::masses = masses, kw::Gconst = G);

    // The initial state (zeroed out, we will change it later).
    std::vector<T> init_state((nplanets + 1u) * 6u);

    // Create the integrator.
    // NOTE: compact_mode is *required* when using parallel mode.
    taylor_adaptive<T> ta{std::move(sys), std::move(init_state), kw::compact_mode = true,
                          kw::parallel_mode = parallel_mode};

    // Create xtensor views for ease of indexing.
    auto s_array = xt::adapt(ta.get_state_data(), {nplanets + 1u, 6u});
    auto m_array = xt::adapt(masses.data(), {nplanets + 1u});

    // Set the initial positions at regular intervals on the x axis
    // on circular orbits. The Sun is already in the origin with zero
    // velocity.
    for (auto i = 0u; i < nplanets; ++i) {
        using std::sqrt;

        s_array(i + 1u, 0) = i + 1u;
        s_array(i + 1u, 4) = sqrt(G / (i + 1u));
    }

    // Take the current time.
    auto start = std::chrono::high_resolution_clock::now();

    // Integrate.
    ta.propagate_for(final_time);

    // Return the elapsed time.
    return static_cast<double>(
        std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start)
            .count());
}

int main()
{
    // Limit to 8 threads of execution.
    oneapi::tbb::global_control gc(oneapi::tbb::global_control::parameter::max_allowed_parallelism, 8);

    // Run the serial vs parallel comparison in double precision.
    auto serial_time_dbl = run_benchmark<double>(10, false);
    std::cout << "Serial time (double): " << serial_time_dbl << "ms\n";

    auto parallel_time_dbl = run_benchmark<double>(10, true);
    std::cout << "Parallel time (double): " << parallel_time_dbl << "ms\n";

#if defined(HEYOKA_HAVE_REAL128)

    // Run the serial vs parallel comparison in quadruple precision.
    auto serial_time_f128 = run_benchmark<mppp::real128>(1, false);
    std::cout << "Serial time (real128): " << serial_time_f128 << "ms\n";

    auto parallel_time_f128 = run_benchmark<mppp::real128>(1, true);
    std::cout << "Parallel time (real128): " << parallel_time_f128 << "ms\n";

#endif
}