Evaluating the performance of ensemble & batch mode#

In this short tutorial, we will run a couple of benchmarks with the goal of evaluating how features such as batch mode and ensemble propagation can lead to substantial speedups in the solution of multiple independent ODE systems.

As an illustrative example, we will be using a toy model of the outer Solar System consisting of the Sun, Jupiter, Saturn, Uranus, Neptune and Pluto, with all bodies represented as point masses attracting each other according to Newtonian gravity. This dynamical system is described in detail in another tutorial.

All timings were taken on a Ryzen 5950x CPU.

The scalar, serial baseline#

To begin with, we set up a single, scalar numerical integration of the system for \(10^6\) years. Let us start by introducing a few physical constants:

import numpy as np

# Masses, from Sun to Pluto, in units of solar masses
# (the Sun's value also lumps in the mass of the inner planets).
masses = np.array([1.00000597682, 1 / 1047.355, 1 / 3501.6, 1 / 22869., 1 / 19314., 7.4074074e-09])

# The gravitational constant, i.e., the Gaussian gravitational constant
# squared and rescaled so that time is measured in years of 365 days
# (distances are in AU).
G = 0.01720209895 * 0.01720209895 * 365 * 365

Next, we introduce a set of initial conditions taken from this paper:

ic = np.array([# Sun.
      -4.06428567034226e-3, -6.08813756435987e-3, -1.66162304225834e-6, +6.69048890636161e-6 * 365,
      -6.33922479583593e-6 * 365, -3.13202145590767e-9 * 365,
      # Jupiter.
      +3.40546614227466e+0, +3.62978190075864e+0, +3.42386261766577e-2, -5.59797969310664e-3 * 365,
      +5.51815399480116e-3 * 365, -2.66711392865591e-6 * 365,
      # Saturn.
      +6.60801554403466e+0, +6.38084674585064e+0, -1.36145963724542e-1, -4.17354020307064e-3 * 365,
      +3.99723751748116e-3 * 365, +1.67206320571441e-5 * 365,
      # Uranus.
      +1.11636331405597e+1, +1.60373479057256e+1, +3.61783279369958e-1, -3.25884806151064e-3 * 365,
      +2.06438412905916e-3 * 365, -2.17699042180559e-5 * 365,
      # Neptune.
      -3.01777243405203e+1, +1.91155314998064e+0, -1.53887595621042e-1, -2.17471785045538e-4 * 365,
      -3.11361111025884e-3 * 365, +3.58344705491441e-5 * 365,
      # Pluto.
      -2.13858977531573e+1, +3.20719104739886e+1, +2.49245689556096e+0, -1.76936577252484e-3 * 365,
      -2.06720938381724e-3 * 365, +6.58091931493844e-4 * 365])

We can now proceed to set up the ODEs and create a scalar integrator object:

import heyoka as hy

# The ODEs.
sys = hy.model.nbody(6, masses = masses, Gconst = G)

# The integrator.
ta = hy.taylor_adaptive(sys, ic, high_accuracy = True, tol = 1e-18)

We are now ready to run and time the numerical integration:

# Integrate for 1 million years.
%time ret = ta.propagate_until(1e6)
CPU times: user 14.4 s, sys: 1.34 ms, total: 14.4 s
Wall time: 14.4 s

The scalar serial integration took about 14 seconds.
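Before moving on, we can run a quick sanity check on the result. The following is a minimal sketch (it assumes the per-body state layout [x, y, z, vx, vy, vz] implied by the initial conditions above) which verifies that the total energy of the system is well conserved by the high-accuracy integration:

# Total energy (kinetic + potential) of a state vector, assuming the
# per-body layout [x, y, z, vx, vy, vz] used by the initial conditions.
def tot_energy(s):
    s = s.reshape((6, 6))
    pos, vel = s[:, :3], s[:, 3:]
    kin = 0.5 * np.sum(masses * np.sum(vel * vel, axis=1))
    pot = 0.
    for i in range(6):
        for j in range(i + 1, 6):
            pot -= G * masses[i] * masses[j] / np.linalg.norm(pos[i] - pos[j])
    return kin + pot

# Relative energy error between the final and the initial state.
print(abs((tot_energy(ta.state) - tot_energy(ic)) / tot_energy(ic)))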

Parallelisation with ensemble propagation#

We are now going to use ensemble propagation to integrate several instances of our dynamical system in parallel, using multiple threads of execution. In each instance of the ensemble, we will slightly and randomly alter the original initial conditions:

# The generator for ensemble propagation. It is invoked once per ensemble
# instance, with a copy of the template integrator and the index of the
# instance (unused here, hence the underscore).
def gen(ta_copy, _):
    # Reset the time coordinate.
    ta_copy.time = 0.
    # Randomly alter the initial conditions.
    ta_copy.state[:] += np.random.uniform(-1e-12, 1e-12, ta_copy.state.shape)

    return ta_copy

Let us now launch an ensemble propagation consisting of 8 instances running in parallel:

%time ret = hy.ensemble_propagate_until(ta, 1e6, 8, gen)
CPU times: user 2min, sys: 6.3 ms, total: 2min
Wall time: 15.5 s

We can see how, thanks to ensemble parallelisation, we were able to integrate 8 instances of the ODE system in roughly the same time it took to integrate a single instance in serial mode (15.5 s vs 14.4 s).
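The results of the individual instances can be retrieved from the return value of the ensemble propagation. Here is a minimal sketch, assuming that ret is a list with one entry per instance and that the first element of each entry is the integrator copy used in that instance:

# Inspect the final time and the Sun's final position for each instance
# (assuming the first element of each entry in ret is the integrator copy).
for idx, res in enumerate(ret):
    ta_i = res[0]
    print(idx, ta_i.time, ta_i.state[:3])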

Note that, on modern desktop CPUs, parallel speedup is rarely 100% efficient, because the CPU can sustain higher boost clock frequencies when only a single core is active than when all cores are busy.

Vectorisation with batch mode#

As the last step, we are now going to activate batch mode in order to take full advantage of SIMD instructions in modern CPUs. In this example, we will be using a batch size of 4 (which is the SIMD vector width for double precision on most contemporary x86 CPUs). This means that each integrator in the ensemble will be propagating 4 different trajectories at once.

We begin with the definition of a template batch integrator:

ta = hy.taylor_adaptive_batch(sys, ic.repeat(4).reshape(-1, 4), high_accuracy = True, tol = 1e-18)

Note how the original (scalar) initial conditions were replicated into a 2D array with 4 identical columns, so that each batch element starts from the same state. Next, we define a new ensemble generator accounting for batch mode:

# The generator for ensemble propagation in batch mode.
def gen(ta_copy, _):
    # Reset the time coordinate of all batch elements.
    ta_copy.set_time(0.)
    # Randomly alter the initial conditions.
    ta_copy.state[:] += np.random.uniform(-1e-12, 1e-12, ta_copy.state.shape)

    return ta_copy

We can now run the ensemble batch propagation, using again 8 instances:

%time ret = hy.ensemble_propagate_until(ta, 1e6, 8, gen)
CPU times: user 2min 32s, sys: 9.01 ms, total: 2min 32s
Wall time: 19.8 s

We can see how, with respect to the scalar ensemble propagation, we increased the number of integrated trajectories by a factor of 4 at the cost of only a moderate increase in runtime (from 15.5 s to 19.8 s).
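Each integrator in the batch ensemble now carries 4 trajectories, stored column-wise in its 2D state array. A minimal sketch of how the individual trajectories could be extracted, under the same assumptions on the layout of ret as above:

# Each batch integrator stores one trajectory per column of its state array.
for idx, res in enumerate(ret):
    ta_i = res[0]
    print(idx, ta_i.state.shape)  # (36, 4) for this system
    print(ta_i.state[:3, 0])      # the Sun's final position in batch element 0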

Conclusions#

Thanks to the use of batch mode and ensemble propagation, we were able to increase the computational throughput of our simulations with respect to the serial scalar baseline by a factor of \(\sim 24\) using 8 CPU cores. These results show how batch mode and ensemble propagation can be very effective in accelerating Monte Carlo simulations and parameter searches.
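The throughput factor can be estimated with a quick back-of-the-envelope calculation based on the wall-clock timings reported above (the exact value will vary slightly from run to run):

# 8 instances x 4 batch elements = 32 trajectories in ~19.8 s, vs.
# 1 trajectory in ~14.4 s for the scalar serial baseline.
batch_throughput = 32 / 19.8    # trajectories per second
scalar_throughput = 1 / 14.4
print(batch_throughput / scalar_throughput)  # ~23x with these timings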