Ensembles of neural networks help us understand learner variability.
From the perspective of Bayesian updating, the training dataset does not uniquely identify a single trained neural network; rather, it constrains the high-dimensional parameter space. Many approaches to Bayesian neural networks adapt the neural architecture in an attempt to approximate a distribution over trained neural networks, but among the most straightforward approaches – and thus appealing, albeit costly – is the ensemble.
As is common in simulation-based inference, we will consider a network that consists of
a summary network – an embedding learner that produces fixed-length vectors by compressing a more complex data modality, and
an inference network – a conditional generative network that takes the embeddings as conditioning input and produces samples from a target distribution.
Specifically, we will learn the posterior over model parameters given the observations as conditions, also called amortized neural posterior estimation (NPE).
The idea I want to explore in this post is to have a shared summary network for an ensemble of inference networks. Sharing the summary network is more efficient than learning multiple ones, in the same way that learning just a single inference network would be cheaper than learning multiple. But the reasonable hope is that many of the benefits of ensembles are already unlocked by duplicating just the final part of the network. The argument for this kind of “last-part-of-network ensemble” parallels that of last-layer Bayesian neural networks.
So in this blog post I am sharing an implementation of an ensemble of conditional generative models (just below), and we will try it out on a classic Bayesian parameter estimation task: Lotka-Volterra predator-prey dynamics.
Show general imports
```python
%load_ext autoreload
%autoreload 2

import os

os.environ["KERAS_BACKEND"] = "jax"

import numpy as np
import keras
import bayesflow as bf
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy.integrate import odeint
```
This class is designed to be a drop-in replacement for inference networks in BayesFlow. To construct an instance, we pass a dictionary of BayesFlow inference networks, which will be the ensemble members. The class channels input and output to and from each of the essential inference network methods to allow parallel training and estimation:

* compute_metrics: includes loss computation for training with stochastic gradient descent
* sample & log_prob: sample from and evaluate the learnt conditional distribution
As you can see in the code, it works by looping over ensemble members and delegating to their respective methods.
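To make the delegation pattern concrete, here is a minimal sketch of what such a wrapper can look like. It is illustrative, not the actual implementation: the method signatures are simplified, the member losses are simply summed into one training loss, and for brevity it subclasses keras.layers.Layer instead of BayesFlow's inference network base class.

```python
import keras


class InferenceNetworkEnsemble(keras.layers.Layer):
    """Minimal sketch: wraps a dict of BayesFlow inference networks and
    delegates calls to every ensemble member. A real drop-in replacement
    would subclass BayesFlow's inference network base class instead."""

    def __init__(self, networks: dict, **kwargs):
        super().__init__(**kwargs)
        self.networks = networks

    def compute_metrics(self, *args, **kwargs):
        # Prefix each member's metrics with its name so they stay
        # distinguishable in the training history, then sum the individual
        # losses into one joint training loss.
        metrics = {}
        for name, network in self.networks.items():
            member_metrics = network.compute_metrics(*args, **kwargs)
            metrics.update({f"{name}/{key}": value for key, value in member_metrics.items()})
        metrics["loss"] = sum(value for key, value in metrics.items() if key.endswith("loss"))
        return metrics

    def sample(self, *args, **kwargs):
        # One set of draws per ensemble member, keyed by member name.
        return {name: network.sample(*args, **kwargs) for name, network in self.networks.items()}

    def log_prob(self, *args, **kwargs):
        # One log-density evaluation per ensemble member, keyed by member name.
        return {name: network.log_prob(*args, **kwargs) for name, network in self.networks.items()}
```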
The ensemble above wraps inference networks, but is in fact itself a drop-in replacement for a conventional BayesFlow inference network. Instead of using just one coupling flow, we get two that are identical except for their initialization, plus a flow matching model on top of that.
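For concreteness, a hypothetical construction matching this description could look as follows; the member names mirror the keys that show up in the outputs further below.

```python
# Two coupling flows that differ only in their random initialization,
# plus one flow matching model (default settings shown for brevity).
inference_network_ensemble = InferenceNetworkEnsemble(
    networks={
        "cf1": bf.networks.CouplingFlow(),
        "cf2": bf.networks.CouplingFlow(),
        "fm1": bf.networks.FlowMatching(),
    }
)
```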
It will be interesting to see how differently the three models behave, so let us give them a spin on Lotka-Volterra.
Example: Lotka-Volterra parameter estimation from time series
Suppose we measured population counts of two species over time. One of them preys on the other, so we might assume that the dynamics are governed by the classic Lotka-Volterra system. In dimensionless form, with prey population \(x\) and predator population \(y\), the nonlinear differential equation is

\[
\begin{aligned}
\frac{dx}{dt} &= \alpha x - \beta x y, \\
\frac{dy}{dt} &= -\gamma y + \delta x y.
\end{aligned}
\]
As always, this model entails a number of assumptions that can only approximate reality. In brief: on their own, prey counts grow exponentially at rate \(\alpha\), while predator counts decay at rate \(\gamma\). Interesting dynamics become possible when both predators and prey are present: predators reduce the prey count proportionally at rate \(\beta\), and the available prey increases the predator count proportionally at rate \(\delta\).
We can measure population time series, but never the parameters directly, so this is a scientifically relevant inverse problem.
Our simulator will consist of three parts:
First, we choose a prior distribution over parameters that reflects our beliefs about parameters before observing data.
Building on parameters sampled from the prior, we solve the parameterized Lotka-Volterra equation starting from some initial conditions.
Finally, we hypothesize that we will make some counting errors when observing the populations, introducing a Gaussian error on the true populations.
Show implementation of Lotka-Volterra observation model
```python
rng = np.random.default_rng(seed=1234)


def prior():
    x = rng.normal(size=4)
    theta = 1 / (1 + np.exp(-x)) * 3.9 + 0.1  # logit normal distribution scaled to range from 0.1 to 4
    return dict(
        alpha=theta[0],
        beta=theta[1],
        gamma=theta[2],
        delta=theta[3],
    )


def lotka_volterra_equations(state, t, alpha, beta, gamma, delta):
    x, y = state
    dxdt = alpha * x - beta * x * y
    dydt = -gamma * y + delta * x * y
    return [dxdt, dydt]


def ecology_model(alpha, beta, gamma, delta, t_span=[0, 5], t_steps=100, initial_state=[1, 1]):
    t = np.linspace(t_span[0], t_span[1], t_steps)
    state = odeint(lotka_volterra_equations, initial_state, t, args=(alpha, beta, gamma, delta))
    x, y = state.T  # Transpose to get x and y arrays
    return dict(
        x=x,  # Prey time series
        y=y,  # Predator time series
        t=t,  # time
    )


def observation_model(x, y, t, subsample=10, obs_prob=1.0, noise_scale=0.1):
    t_steps = x.shape[0]
    # Add Gaussian noise to observations
    noisy_x = rng.normal(x, noise_scale)
    noisy_y = rng.normal(y, noise_scale)
    # Determine which time steps are observed
    step_indices = np.arange(0, t_steps, subsample)
    num_observed = int(obs_prob * len(step_indices))
    observed_indices = np.sort(rng.choice(step_indices, num_observed, replace=False))
    return {
        "observed_x": noisy_x[observed_indices],
        "observed_y": noisy_y[observed_indices],
        "observed_t": t[observed_indices],
    }
```
Adapter for preprocessing simulated data for neural processing
```python
learned_sumstat_adapter = (
    bf.adapters.Adapter()
    # convert any non-arrays to numpy arrays
    .to_array()
    # convert from numpy's default float64 to deep learning friendly float32
    .convert_dtype("float64", "float32")
    # drop unobserved full trajectories
    .drop(["x", "y", "t"])
    # add a trailing dimension of 1
    .as_time_series(["observed_x", "observed_y", "observed_t"])
    # rename the variables to match the required approximator inputs
    .concatenate(["alpha", "beta", "gamma", "delta"], into="inference_variables")
    .concatenate(["observed_x", "observed_y", "observed_t"], into="summary_variables")
)
learned_sumstat_adapter
```
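For completeness, here is one way the simulator parts and the adapter could be wired together into training and validation data. The helpers bf.make_simulator and bf.OfflineDataset with these argument names, as well as the sample sizes, batch size, and epoch count, are assumptions for illustration and not necessarily the settings used for the run shown below.

```python
# Compose prior, ODE solver, and observation model into one simulator.
simulator = bf.make_simulator([prior, ecology_model, observation_model])

# Simulate an offline training set and a held-out validation set (sizes illustrative).
train_sims = simulator.sample(5000)
val_sims = simulator.sample(200)

# Wrap the training simulations into an offline dataset for fitting (settings illustrative).
epochs = 100
dataset = bf.OfflineDataset(data=train_sims, batch_size=64, adapter=learned_sumstat_adapter)
num_batches = len(dataset)
```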
… and define the full pipeline of the neural approximation to the conditional distribution of the parameters given the observations.
The heart of this pipeline is the inference_network_ensemble object that we defined earlier. Even though it is not a single network but an ensemble of them, it can conveniently be plugged in as the inference network.
```python
approximator = bf.PointApproximator(
    adapter=learned_sumstat_adapter,
    summary_network=bf.networks.TimeSeriesNetwork(),
    inference_network=inference_network_ensemble,  # THIS IS AN ENSEMBLE OF INFERENCE NETWORKS
)

total_steps = int(epochs * num_batches)
warmup_steps = int(0.05 * epochs * num_batches)
decay_steps = total_steps - warmup_steps

learning_rate = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1 * 5e-4,
    warmup_target=5e-4,
    warmup_steps=warmup_steps,
    decay_steps=decay_steps,
    alpha=0,
)
optimizer = keras.optimizers.Adam(learning_rate, clipnorm=1.5)
approximator.compile(optimizer=optimizer, metrics=None)
```
```python
history = approximator.fit(
    dataset=dataset,
    epochs=epochs,
    verbose=0,
    validation_data=learned_sumstat_adapter(val_sims),
)
```
INFO:bayesflow:Fitting on dataset instance of OfflineDataset.
INFO:bayesflow:Building on a test batch.
```python
title_args = dict(y=1.02, size=15)

for k, v in inference_network_ensemble.networks.items():
    f = bf.diagnostics.loss(
        history,
        train_key=f"{k}/loss/inference_{k}/loss",
        val_key=f"val_{k}/loss/inference_{k}/loss",
    )
    f.gca().set_title(f"Loss Trajectory - ensemble member {k.upper()}", **title_args)
```
The loss curves are how we like them: monotonically decreasing and without a significant difference in performance between the training dataset (solid line) and the validation dataset (dashed line).
The first interesting thing we observe is that the two Coupling Flows (CF1 & CF2) show quite similar loss fluctuations. Remember, they differ only in their initialization but always receive the same training batches throughout training. The same batches produce similar losses in both members; the differing initializations barely manifest as differences in loss. Both networks see the same loss landscape and likely stay close for that reason. Still, they start at different positions and stay at different positions.
From a dynamical systems perspective, we can think of training a neural network for one epoch as an iterated map and study whether that map is chaotic.1 If we take the loss on the same data as an indicator of similarity/discrepancy and infer some notion of distance from it, what we are seeing suggests that the training is not chaotic here.
NOTE: I spoke to the discrepancy of the two coupling flows but not to the third member, Flow Matching, which has a different architecture. This is for good reason: its loss measures something different and cannot be straightforwardly compared.
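To back the visual impression with a number, one could look at the per-epoch difference between the two coupling flows' training losses. The history keys used below are an assumption, mirroring the pattern passed to the loss plots above.

```python
# Per-epoch training loss difference between the two coupling flows.
# Key names are assumed to follow the pattern used in the loss plots above.
cf1_loss = np.asarray(history.history["cf1/loss/inference_cf1/loss"])
cf2_loss = np.asarray(history.history["cf2/loss/inference_cf2/loss"])

plt.plot(cf1_loss - cf2_loss)
plt.xlabel("Epoch")
plt.ylabel("CF1 loss - CF2 loss")
plt.title("Training loss difference between the coupling flows")
plt.show()
```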
We will now assess the discrepancy of the learnt distributions. First, we can sample from the trained approximator object just like any other BayesFlow object.
```python
%%time

# Set the number of posterior draws you want to get
num_samples = 500

# Obtain posterior draws with the sample method
post_draws = approximator.sample(conditions=val_sims, num_samples=num_samples)

# post_draws is a nested dictionary with one entry per ensemble member
post_draws.keys()
```
CPU times: user 8min 54s, sys: 14.5 s, total: 9min 8s
Wall time: 1min 15s
dict_keys(['cf1', 'cf2', 'fm1'])
The only difference is that the posterior draws come as a nested dictionary whose first-level key-value pairs correspond to the samples from the different ensemble members (inference networks).
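For illustration, the access pattern for this nested structure would look like this (the second-level keys and the shapes are assumptions based on the diagnostics below):

```python
# First level: ensemble member, second level: named parameters.
print(post_draws["cf1"].keys())          # expected: alpha, beta, gamma, delta
print(post_draws["cf1"]["alpha"].shape)  # expected: (num_validation_datasets, num_samples)
```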
Let’s move on and look at the recovery, calibration, and an exemplary posterior distribution for each ensemble member.
par_names = [r"$\alpha$", r"$\beta$", r"$\gamma$", r"$\delta$"]dataset_id =0for k,v in post_draws.items(): f = bf.diagnostics.recovery(v, val_sims, variable_names=par_names) f.suptitle(f"Recovery - ensemble member {k.upper()}", **title_args)for k,v in post_draws.items(): f = bf.diagnostics.calibration_ecdf(v, val_sims, variable_names=par_names, difference=True) f.suptitle(f"Calibration - ensemble member {k.upper()}", **title_args)for k,v in post_draws.items(): g = bf.diagnostics.plots.pairs_posterior( estimates=post_draws[k], targets=val_sims, dataset_id=dataset_id, variable_names=par_names, ) g.fig.suptitle(f"Posterior draws - ensemble member {k.upper()}", **title_args)
Calibration and recovery look really similar, although not identical, across the board. What stands out most is that Flow Matching posteriors are significantly less contracted.
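To put a rough number on that impression, we could compare the average posterior standard deviation per parameter across ensemble members – a quick sketch, assuming draws of shape (num_datasets, num_samples) per parameter:

```python
# Average posterior standard deviation per parameter and ensemble member
# (larger values = less contracted posteriors). Samples assumed on axis 1.
for k, v in post_draws.items():
    avg_stds = {par: float(np.std(v[par], axis=1).mean()) for par in ["alpha", "beta", "gamma", "delta"]}
    print(k, avg_stds)
```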
Finally, any good Bayesian inference experiment should check the posterior predictive distributions that the (here: learnt) posteriors imply. We will plot the observation that was used as the condition for an exemplary posterior together with the forecasts of the estimated model – again for each ensemble member.
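The snippet below is a sketch of that idea rather than the original plotting code: it re-simulates trajectories from a subset of posterior draws for the exemplary dataset and overlays them on the noisy observations. The shapes of post_draws and val_sims are assumed, as above.

```python
# Posterior predictive sketch: push a subset of posterior draws back through
# the ODE model and compare against the observed counts of one dataset.
fig, axes = plt.subplots(1, len(post_draws), figsize=(12, 3), sharey=True)
for ax, (k, v) in zip(axes, post_draws.items()):
    for i in rng.choice(num_samples, size=50, replace=False):
        sim = ecology_model(
            alpha=float(v["alpha"][dataset_id, i]),
            beta=float(v["beta"][dataset_id, i]),
            gamma=float(v["gamma"][dataset_id, i]),
            delta=float(v["delta"][dataset_id, i]),
        )
        ax.plot(sim["t"], sim["x"], color="C0", alpha=0.05)  # prey forecasts
        ax.plot(sim["t"], sim["y"], color="C1", alpha=0.05)  # predator forecasts
    # Noisy observations used as the condition for this exemplary posterior.
    ax.scatter(val_sims["observed_t"][dataset_id], val_sims["observed_x"][dataset_id], color="C0", s=10)
    ax.scatter(val_sims["observed_t"][dataset_id], val_sims["observed_y"][dataset_id], color="C1", s=10)
    ax.set_title(f"Posterior predictive - {k.upper()}")
    ax.set_xlabel("t")
fig.tight_layout()
```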
Even the tiniest difference in initial conditions would eventually blow up if the map were chaotic. Check out the Lyapunov exponent for a precise conceptualization of chaos.↩︎