Implementation details#

In this section, we will delve into specific implementation details of heyoka.py’s expression system. Our main goal is to explain certain pitfalls that can occur in innocent-looking code and have deleterious consequences on performance, and to teach you, dear user, how to avoid them.

Reference semantics and shared (sub)expressions#

The first and most important thing to understand about heyoka.py’s expressions is that they implement so-called reference semantics. That is, an expression is essentially a handle to an underlying object, and copying the expression will not perform an actual copy, rather it will return a new reference to the same underlying object.

Before you ask “isn’t this how all Python objects work?”, let me immediately point out that heyoka.py’s expressions are exposed from C++ and that reference semantics is implemented all the way down into the C++ layer. As a concrete example of what this means, consider the following simple expression:

import heyoka as hy

# Create a couple of variables.
x, y = hy.make_vars("x", "y")

# Create a simple expression.
ex = x + y

If we attempt to copy ex via the standard copy() function, we will nominally get a new Python object, as we can see by querying the id():

from copy import copy

# Make a "copy" of ex.
ex_copy = copy(ex)

# Print the ids.
print(f'Original id: {id(ex)}')
print(f'Copy id    : {id(ex_copy)}')
Original id: 140032089312448
Copy id    : 140032618181584

However, both ex and ex_copy are in reality pointing to the same underlying C++ object, which is shared between the two Python objects.
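If it helps to build intuition, here is a minimal Python analogy (a purely illustrative sketch, not heyoka.py’s actual implementation): a handle class whose copies are distinct Python objects sharing the same underlying payload.

from copy import copy

# A toy illustration of reference semantics -- NOT heyoka.py's actual
# implementation. Copying the handle creates a new Python object, but
# the underlying payload is shared rather than duplicated.
class Handle:
    def __init__(self, payload):
        self._payload = payload

    def __copy__(self):
        return Handle(self._payload)

h1 = Handle([1, 2, 3])
h2 = copy(h1)
print(id(h1) == id(h2))            # False: distinct Python objects.
print(h1._payload is h2._payload)  # True: shared underlying payload.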

We can use ex as a building block to create more complicated expressions, e.g.:

a = hy.sin(ex) + hy.cos(ex)
a
(sin((x + y)) + cos((x + y)))

Because of the use of reference semantics, this expression will not contain two separate copies of \(x + y\). Rather, it will contain two references to the original expression ex.

If, on the other hand, we do not re-use ex and write instead

b = hy.sin(x + y) + hy.cos(x + y)
b
(sin((x + y)) + cos((x + y)))

we get an expression b which is mathematically equivalent to a but which contains two separate copies of \(x + y\), rather than two references to ex. This leads to a couple of very important consequences.

First of all, the memory footprint of b will be larger than a’s because it is (wastefully) storing two copies of the same subexpression \(x + y\) (rather than storing two references to the same underlying expression).
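Note, incidentally, that this difference in memory footprint is not visible via len(), which we will use later in this tutorial to measure the size of an expression: len() counts the nodes of the expression tree with multiplicity, so a and b report the same size (you can verify that both values printed below are equal):

# The node count treats the graph as a tree: a and b report the same
# size, even though a shares the x + y subexpression and b does not.
print(len(a), len(b))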

Secondly, heyoka.py’s symbolic manipulation routines are coded to keep track of shared subexpressions with the goal of avoiding redundant computations. For instance, let us say we want to replace \(x\) with \(x^2 - 1\) in the expression a via the subs() function:

hy.subs(a, {x: x**2 - 1.})
(sin(((x**2.0000000000000000 - 1.0000000000000000) + y)) + cos(((x**2.0000000000000000 - 1.0000000000000000) + y)))

In order to perform the substitution, the subs() function needs to traverse the expression tree of a. When it encounters the ex subexpression for the first time, it will:

  1. perform the substitution, producing as a result \(x^2-1+y\),

  2. record in an internal bookkeeping structure that performing the substitution on the subexpression ex produced the result \(x^2-1+y\).

Crucially, the second time ex is encountered during the traversal of the expression tree, the subs() function will query the bookkeeping structure, detect that the result of the substitution on ex has already been computed, and fetch the cached result instead of (wastefully) performing the same computation again. Thus, not only did we avoid a redundant calculation, but the two \(x^2-1+y\) subexpressions appearing in the final result also point to the same underlying object (rather than being two separate copies of identical subexpressions).

On the other hand, when we apply the same substitution on b we get:

hy.subs(b, {x: x**2 - 1.})
(sin(((x**2.0000000000000000 - 1.0000000000000000) + y)) + cos(((x**2.0000000000000000 - 1.0000000000000000) + y)))

That is, the result is mathematically identical (obviously), but, because there is no internal subexpression sharing, the substitution \(x \rightarrow x^2 - 1\) had to be performed twice (rather than once) and the two \(x^2-1+y\) subexpressions appearing in the final result are two separate copies of identical subexpressions.
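To make the role of the bookkeeping structure more concrete, here is a purely illustrative Python sketch of identity-based caching during a tree traversal. This toy code models expressions as nested tuples and is in no way heyoka.py’s actual implementation:

# A toy model of subs()-style caching -- NOT heyoka.py's actual code.
# Expressions are nested tuples; the cache is keyed on object identity,
# mirroring the internal bookkeeping structure described above.
def toy_subs(node, smap, cache=None):
    if cache is None:
        cache = {}
    # If this exact node (by identity) was already processed,
    # return the cached result instead of recomputing it.
    if id(node) in cache:
        return cache[id(node)]
    if isinstance(node, tuple):
        # Internal node: recurse into the children.
        res = tuple(toy_subs(c, smap, cache) for c in node)
    else:
        # Leaf node: apply the substitution map, if applicable.
        res = smap.get(node, node)
    cache[id(node)] = res
    return res

# The same tuple object appears twice, modelling a shared subexpression.
shared = ("+", "x", "y")
expr = ("+", ("sin", shared), ("cos", shared))
new_expr = toy_subs(expr, {"x": ("-", ("**", "x", 2), 1)})
print(new_expr[1][1] is new_expr[2][1])  # True: the result is shared too.

Because the cache is keyed on object identity, sharing in the input translates into both fewer computations and sharing in the output, mirroring the behaviour described above.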

As a final piece of information, it is important to emphasise that subexpression sharing is not limited to single expressions: it also happens across multiple expressions. For instance, consider the following vector expression consisting of the two components \(\sin\left( x + y \right) + \cos\left( x + y \right)\) and \(1 + \mathrm{e}^{x+y}\):

vec_ex = [hy.sin(ex) + hy.cos(ex), 1. + hy.exp(ex)]
vec_ex
[(sin((x + y)) + cos((x + y))), (1.0000000000000000 + exp((x + y)))]

Here the subexpression ex is shared among the two components of vec_ex, which both contain references to ex (rather than storing their own separate copies of ex). When we invoke the subs() function on vec_ex, the internal bookkeeping structure is populated during the traversal of the first component of vec_ex and persists for the traversal of the second component:

hy.subs(vec_ex, {x: x**2 - 1.})
[(sin(((x**2.0000000000000000 - 1.0000000000000000) + y)) + cos(((x**2.0000000000000000 - 1.0000000000000000) + y))),
 (1.0000000000000000 + exp(((x**2.0000000000000000 - 1.0000000000000000) + y)))]

Thus, the second component of the result of the substitution contains a reference to the expression \(x^2-1+y\) which is shared with the first component of the result.

This is a point which is very important and worth repeating: the fact that heyoka.py’s symbolic manipulation functions (such as subs()) can accept both scalar and vector expressions as input is not only a convenience that reduces typing, it is also crucial to ensure maximal subexpression sharing in the result. If you can, you should always apply a symbolic manipulation function directly to a vector expression, rather than operating on one component at a time:

# Every time you do this, a kitten dies.
# Please **don't** do this.
res = [hy.subs(e, {x: x**2 - 1.}) for e in vec_ex]

# Do this instead.
res = hy.subs(vec_ex, {x: x**2 - 1.})

# Also, **don't** do this (here s is a vector expression,
# e.g., a 6-component state vector).
r = s[:3]
v = s[3:]
new_r = hy.subs(r, {x: x**2 - 1.})
new_v = hy.subs(v, {x: x**2 - 1.})

# Do this instead.
new_s = hy.subs(s, {x: x**2 - 1.})
new_r = new_s[:3]
new_v = new_s[3:]

Consequences for large computational graphs#

These details are, most of the time, of little consequence, and they may just result in small, hardly detectable inefficiencies. The situation changes, however, when creating large computational graphs, especially if they are constructed by iteratively feeding expressions as arguments to other expressions.

As a motivating example, we can consider the computational graphs of feed-forward neural networks (FFNNs). In an FFNN, the result of the inference in each layer is fed to the next layer, giving rise to a computational graph that explodes in exponential fashion as the number of layers increases. We can begin with a tiny network consisting of two inputs, no hidden layers and two outputs:

hy.model.ffnn(inputs = [x, y], nn_hidden = [], n_out = 2, activations = [hy.tanh])
[tanh(((p0 * x) + (p1 * y) + p4)), tanh(((p2 * x) + (p3 * y) + p5))]

As we increase the number of layers, we can see that the computational graphs quickly grow more complicated:

hy.model.ffnn(inputs = [x, y], nn_hidden = [2], n_out = 2,
              activations = [hy.tanh, hy.tanh])
[tanh(((p4 * tanh(((p0 * x) + (p1 * y) + p8))) + (p5 * tanh(((p2 * x) + (p3 * y) + p9))) + p10)),
 tanh(((p6 * tanh(((p0 * x) + (p1 * y) + p8))) + (p7 * tanh(((p2 * x) + (p3 * y) + p9))) + p11))]
hy.model.ffnn(inputs = [x, y], nn_hidden = [2, 2], n_out = 2,
              activations = [hy.tanh, hy.tanh, hy.tanh])
[tanh(((p8 * tanh(((p4 * tanh(((p0 * x) + (p1 * y) + p12))) + (p5 * tanh(((p2 * x) + (p3 * y) + p13))) + p14))) + (p9 * tanh(((p6 * tanh(((p0 * x) + (p1 * y) + p12))) + (p7 * tanh(((p2 * x) + (p3 * y) + p13))) + p15))) + p16)),
 tanh(((p10 * tanh(((p4 * tanh(((p0 * x) + (p1 * y) + p12))) + (p5 * tanh(((p2 * x) + (p3 * y) + p13))) + p14))) + (p11 * tanh(((p6 * tanh(((p0 * x) + (p1 * y) + p12))) + (p7 * tanh(((p2 * x) + (p3 * y) + p13))) + p15))) + p17))]

In the last example we can recognise subexpressions that occur multiple times in the computational graphs, such as (p2 * x) + (p3 * y) + p13. These are the outputs of one layer being fed to the next one.

We can quantify the growth in complexity by looking at how the size (i.e., the number of nodes) of the computational graphs evolves with the number of layers:

for n_hlayers in range(0, 6):
    ffnn = hy.model.ffnn(inputs = [x, y], nn_hidden = [2] * n_hlayers, n_out = 2,
                         activations = [hy.tanh] * (n_hlayers + 1))
    print(f'Size of the outputs for {n_hlayers} hidden layers: {[len(_) for _ in ffnn]}')
Size of the outputs for 0 hidden layers: [9, 9]
Size of the outputs for 1 hidden layers: [25, 25]
Size of the outputs for 2 hidden layers: [57, 57]
Size of the outputs for 3 hidden layers: [121, 121]
Size of the outputs for 4 hidden layers: [249, 249]
Size of the outputs for 5 hidden layers: [505, 505]

The size of the outputs roughly doubles every time we add a new layer. Let us see what happens with 32 hidden layers:

n_hlayers = 32
ffnn = hy.model.ffnn(inputs = [x, y], nn_hidden = [2] * n_hlayers, n_out = 2,
                     activations = [hy.tanh] * (n_hlayers + 1))
print(f'Size of the outputs for {n_hlayers} hidden layers: {[len(_) for _ in ffnn]}')
Size of the outputs for 32 hidden layers: [68719476729, 68719476729]

The graphs have now exploded to a size of \(\sim 10^{10}\) nodes. Yet they were created almost instantly and, if you check, you will see that their memory footprint isn’t anywhere near 10GB.

This efficiency is due to the fact that the FFNN constructor is careful to compute the output for each neuron only once and then re-use it in the computation of the output of the next layer, so that subexpression sharing is maximised. Because of this high degree of subexpression sharing, operations on FFNNs are efficient despite the large size of the computational graphs.
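As a quick (machine-dependent) sanity check, we can time a substitution on the 32-hidden-layer network created above. Thanks to the caching mechanism described earlier, subs() only needs to visit each unique node once, so this should complete almost instantly despite the \(\sim 10^{10}\)-node graphs:

import time

# Time a substitution on the 32-hidden-layer FFNN from the previous cell.
# Subexpression sharing means subs() visits each unique node only once,
# so this is fast despite the huge nominal graph size.
start = time.perf_counter()
new_ffnn = hy.subs(ffnn, {x: x**2 - 1.})
print(f'Substitution performed in {time.perf_counter() - start:.3f}s')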

For instance, if we want to perform inference, we can compile an FFNN. Subexpression sharing allows for an efficient process of common subexpression elimination (CSE), and we can show that, as expected, the computational complexity of inference ends up scaling linearly with the number of layers:

for n_hlayers in range(0, 10):
    ffnn = hy.model.ffnn(inputs = [x, y], nn_hidden = [2] * n_hlayers, n_out = 2,
                         activations = [hy.tanh] * (n_hlayers + 1))
    # Create a compiled function from the FFNN.
    cf = hy.cfunc(ffnn, [x, y])
    print(f'Size of the decomposition for {n_hlayers} hidden layers: {len(cf.dc) - 4}')
Size of the decomposition for 0 hidden layers: 8
Size of the decomposition for 1 hidden layers: 16
Size of the decomposition for 2 hidden layers: 24
Size of the decomposition for 3 hidden layers: 32
Size of the decomposition for 4 hidden layers: 40
Size of the decomposition for 5 hidden layers: 48
Size of the decomposition for 6 hidden layers: 56
Size of the decomposition for 7 hidden layers: 64
Size of the decomposition for 8 hidden layers: 72
Size of the decomposition for 9 hidden layers: 80

As a reminder, the decomposition of a compiled function is the sequence of elementary operations the original function has been broken into.
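To get a feel for what a decomposition contains, we can peek at the first few entries for the last compiled function created in the loop above (the exact entries depend on heyoka.py’s internals and may vary between versions):

# Print the first few entries of the decomposition of the compiled
# function cf from the previous loop.
print(cf.dc[:6])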