Parallel mode

Starting from version 0.18.0, heyoka.py can automatically parallelise the integration of a single ODE system using multiple threads of execution. This parallelisation mode is fine-grained, i.e., it acts at the level of an individual integration step, and it thus serves a fundamentally different purpose from the coarse-grained parallelisation approach of ensemble propagations.

In order to be effective, parallel mode needs large ODE systems, that is, systems with a large number of variables and/or large expressions on the right-hand side. When used on small ODE systems, parallel mode will likely introduce a noticeable slowdown due to the multithreading overhead.

Note that, because Taylor integrators are memory intensive, performance for large ODE systems is bottlenecked by RAM speed due to the memory wall. In practice this means that, at least for double-precision computations, the performance of parallel mode will not scale linearly with the number of cores. For extended-precision computations, on the other hand, the scaling will be more efficient, because arithmetic operations on extended-precision operands are computationally much heavier than those on double-precision operands, which shifts the bottleneck from memory bandwidth back to arithmetic.
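
The effect of the memory wall can be illustrated with a simple roofline-style model. The numbers below are purely illustrative (a hypothetical machine, not measurements from heyoka): the point is only that a kernel with low arithmetic intensity (flops per byte of memory traffic) saturates memory bandwidth before it saturates all cores, while a high-intensity kernel scales almost linearly:

```python
def attainable(ncores, core_flops, bandwidth, intensity):
    # Roofline model: throughput is capped either by aggregate compute
    # (ncores * core_flops) or by memory traffic (intensity * bandwidth),
    # where intensity is the kernel's flops-per-byte ratio.
    return min(ncores * core_flops, intensity * bandwidth)

def parallel_speedup(ncores, core_flops, bandwidth, intensity):
    # Speedup of the n-core run over the single-core run.
    return attainable(ncores, core_flops, bandwidth, intensity) / attainable(
        1, core_flops, bandwidth, intensity
    )

# Hypothetical machine: 10 Gflop/s per core, 20 GB/s of memory bandwidth.
# A low-intensity kernel (e.g. double precision, ~1 flop/byte) hits the
# memory wall long before 8 cores are saturated...
print(parallel_speedup(8, 10e9, 20e9, 1.0))   # -> 2.0
# ...while a high-intensity kernel (e.g. extended precision, many flops
# per byte moved) scales linearly.
print(parallel_speedup(8, 10e9, 20e9, 16.0))  # -> 8.0
```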

With these caveats out of the way, let us see an example of parallel mode in action.

Parallel planetary embryos

In order to illustrate the effectiveness of parallel mode, we will set up an N-body system consisting of a large number (\(N=400\)) of protoplanets in orbit around a Sun-like star. The protoplanets interact with the star and with each other according to Newtonian gravity, and they are initially placed on circular orbits.

Let us begin by defining a run_benchmark() function that will set up and integrate the N-body system. The function is parametrised over the floating-point type fp_t that will be used for the integration (so that we can easily run the same test in both double and extended precision). The other input arguments are the final time and a boolean flag specifying whether or not to use parallel mode. The return value is the total wall clock time taken by the integration:

import heyoka as hy
import numpy as np

def run_benchmark(fp_t, final_time, parallel_mode):
    import time    

    # The number of protoplanets.
    nplanets = 400

    # G constant, in terms of solar masses, AUs and years.
    G = fp_t(0.01720209895 * 0.01720209895 * 365 * 365)

    # Init the mass vector.
    masses = [fp_t(1)] + list(fp_t(1) / 333000 / ((i + 1) * (i + 1)) for i in range(nplanets))

    # Create the nbody system.
    sys = hy.model.nbody(nplanets + 1, masses = masses, Gconst = G)
    
    # The initial state (zeroed out, we will change it later).
    init_state = np.zeros(((nplanets + 1) * 6,), dtype=fp_t)
    
    # Create the integrator.
    # NOTE: compact_mode is *required* when using parallel mode.
    ta = hy.taylor_adaptive(sys, init_state, compact_mode = True,
                            parallel_mode = parallel_mode, fp_type=fp_t)

    # Reshape the state vector for ease of indexing.
    st = ta.state.reshape((nplanets + 1, 6))
    
    # Set the initial positions at regular intervals on the x axis
    # on circular orbits. The Sun is already in the origin with zero
    # velocity.
    for i in range(nplanets):
        st[i + 1, 0] = i + 1
        st[i + 1, 4] = np.sqrt(G / (i + 1))
    
    # Take the current time.
    t = time.monotonic_ns()
    
    # Integrate.
    ta.propagate_for(fp_t(final_time))

    # Return the elapsed time in seconds.
    return (time.monotonic_ns() - t) / 1e9
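
As a quick sanity check on the initial conditions set in the loop above, we can verify with plain NumPy (independently of heyoka) that the assigned velocities do correspond to circular orbits around a unit-mass central body, for which \(v = \sqrt{GM/r}\) with \(M=1\):

```python
import numpy as np

# Same G as in run_benchmark(): the Gaussian gravitational constant
# squared, rescaled from days to years.
G = 0.01720209895**2 * 365**2

# Orbital radii (in AU) and the circular velocities assigned above.
r = np.arange(1, 401, dtype=float)
v = np.sqrt(G / r)

# For a circular orbit around a unit solar mass, v**2 * r == G.
assert np.allclose(v**2 * r, G)

# The innermost protoplanet at 1 AU moves at close to 2*pi AU/year.
print(v[0])  # ~6.28
```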

Note that parallel mode requires compact mode: if you try to construct a parallel mode integrator without enabling compact mode, an exception will be thrown.

Before running the benchmarks, we will limit the number of threads available for use by heyoka to 8:

# Limit to 8 threads of execution.
hy.set_nthreads(8)

Let us now run the benchmark in double precision, and let us compare the timings with and without parallel mode:

print("Serial time (double): {}".format(run_benchmark(float, 1, False)))
print("Parallel time (double): {}".format(run_benchmark(float, 1, True)))
Serial time (double): 0.975656957
Parallel time (double): 0.299708726

We can see how parallel mode resulted in a \(\times 3.3\) speedup with respect to the serial integration. While this speedup is suboptimal with respect to the maximum theoretically achievable speedup of \(\times 8\), these timings show that parallel mode can still provide an easy way of boosting the integrator’s performance for large ODE systems.
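
For reference, the speedup and parallel efficiency implied by the timings above can be computed directly. The helper below is purely illustrative (not part of the heyoka API), and plugs in the numbers reported in this particular run:

```python
def speedup_and_efficiency(t_serial, t_parallel, nthreads):
    # Speedup is the ratio of serial to parallel wall-clock time;
    # parallel efficiency normalises it by the number of threads.
    s = t_serial / t_parallel
    return s, s / nthreads

# Timings from the double-precision run above, with 8 threads.
s, e = speedup_and_efficiency(0.975656957, 0.299708726, 8)
print(f"speedup: {s:.2f}, efficiency: {e:.0%}")  # speedup: 3.26, efficiency: 41%
```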

Let us now repeat the same test in quadruple precision:

print("Serial time (real128): {}".format(run_benchmark(hy.real128, 1, False)))
print("Parallel time (real128): {}".format(run_benchmark(hy.real128, 1, True)))
Serial time (real128): 176.803970754
Parallel time (real128): 24.239678635

In quadruple precision, the speedup is now \(\times 7.3\).
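
Comparing the parallel efficiency (speedup divided by the number of threads) across the two precisions makes the difference in scaling explicit. The numbers below are taken from the timings reported in this run:

```python
nthreads = 8

# Parallel efficiency for the double-precision and real128 runs.
eff_double = (0.975656957 / 0.299708726) / nthreads
eff_quad = (176.803970754 / 24.239678635) / nthreads

print(f"double: {eff_double:.0%}, real128: {eff_quad:.0%}")  # double: 41%, real128: 91%
```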

These results show that parallel mode can provide an easy way of boosting the performance of heyoka’s integrators for large ODE systems, and that the speedup of parallel mode is most efficient when operating in extended precision.