# Probabilistic programming in Python: Pyro versus PyMC3

Thu, Jun 28, 2018*This post was sparked by a question in the lab
where I did my master’s thesis. I had sent a link introducing
Pyro to the lab chat, and the PI wondered about
differences and limitations compared to
PyMC3, the ‘classic’ tool for statistical
modelling in Python. When should you use Pyro, PyMC3, or something else still?*

## Part 1: Goals

First, let’s make sure we’re on the same page on what we want to do. PyMC3,
Pyro, and other probabilistic programming packages such as Stan, Edward, and
BUGS, perform so called **approximate inference**.

*Inference* means calculating probabilities. It means working with the joint
probability distribution $p(\boldsymbol{x})$ underlying a data set
{$\boldsymbol{x}$}.

For example, $\boldsymbol{x}$ might consist of two variables: “wind speed”, and “cloudiness”. You have gathered a great many data points { (3 km/h, 82%), (23 km/h, 15%,), … }. The joint probability distribution $p(\boldsymbol{x})$ then gives you a feel for the density in this windiness-cloudiness space. I.e. which values are common? And which combinations occur together often?

You might want to:

- Do a ‘lookup’ in the probabilty distribution, i.e. calculate how likely a given datapoint is;
- Marginalise (= summate) the joint probability distribution over the variables you’re not interested in, so you can make a nice 1D or 2D plot of the resulting marginal distribution. (Symbolically: $p(b) = \sum_a p(a,b)$);
- Combine marginalisation and lookup to answer conditional questions: given the value for this variable, how likely is the value of some other variable? (Symbolically: $p(a|b) = \frac{p(a,b)}{p(b)}$)
- Find the most likely set of data for this distribution, i.e. calculate the mode, $\text{arg max}\ p(a,b)$. (This can be used in Bayesian learning of a parametric model. The distribution in question is then a joint probability distribution over model parameters and data variables. You can then answer: given the data, what are the most likely parameters of the model?)

We have to resort to *approximate* inference when we do not have closed,
analytical formulas for the above calculations.

There are generally two approaches to approximate inference:

- Sampling
- Variational inference

In **sampling**, you use an algorithm (called a Monte Carlo method) that draws
samples from the probability distribution that you are performing inference on
– or at least from a good approximation to it. You then perform your desired
inference calculation on the samples. For example: mode of the probability
distribution? –> Just find the most common sample. One class of sampling
methods are the Markov Chain Monte Carlo (MCMC) methods, of which
Hamiltonian/Hybrid Monte Carlo (HMC) and No-U-Turn Sampling (NUTS) are
refinements.

**Variational inference** (VI) is an approach to approximate inference that does
not need samples. It transforms the inference problem into an optimisation
problem, where we need to maximise some target function.

The optimisation procedure in VI (which is gradient descent, or a second order
derivative method) requires derivatives of this target function. This is where
**automatic differentiation** (AD) comes in. AD can calculate accurate values
for the derivatives of a function that is specified by a computer program. You
can thus use VI even when you don’t have explicit formulas for your derivatives.

Both AD and VI, and their combination, ADVI, have recently become popular in machine learning.

Further reading:

VI: Wainwright and Jordan (2008).

*Graphical Models, Exponential Families, and Variational Inference*;AD: Blogpost by Justin Domke (2009) – Short, recommended read.

*Automatic Differentiation: The most criminally underused tool in the potential machine learning toolbox?*;ADVI: Kucukelbir et al. (2017).

*Automatic Differentiation Variational Inference*;

## Part 2: Backends

Now over from theory to practice. There seem to be three main, pure-Python
libraries for performing approximate inference: PyMC3,
Pyro, and Edward. They all
use a ‘backend’ library that does the heavy lifting of their computations. PyMC3
uses *Theano*, Pyro uses *PyTorch*, and Edward uses *TensorFlow*.

Theano, PyTorch, and TensorFlow are all very similar. They all expose a Python API to underlying C / C++ / Cuda code that performs efficient numeric computations on N-dimensional arrays (scalars, vectors, matrices, or in general: “tensors”). The computations can optionally be performed on a GPU instead of the CPU, for even more efficiency. In this respect, these three frameworks do the same thing as NumPy.

Additionally however, they also offer automatic differentiation (which they often call “autograd”):

They expose a whole library of functions on tensors, that you can compose with
`+`

, `-`

, `*`

, `/`

, tensor concatenation, etc. The result is called a
*computational graph*. This computational graph is your ‘function’, or your
model. For example:

```
def model(x, y):
s = framework.sin(x)
m = 3 * framework.max(x, 6*y)
return s + m
```

Such computational graphs can be used to build (generalised) linear models,
logistic models, neural network models, … almost any model really. In plain
Theano, PyTorch, and TensorFlow, the parameters are just tensors of actual
numbers. For example, `x = framework.tensor([5.4, 8.1, 7.7])`

. In the extensions
PyMC3, Pyro, and Edward, the parameters can also be stochastic variables, that
you have to give a unique name, and that represent probability distributions.
That is why, for these libraries, the computational graph is a probabilistic
model.

The automatic differentiation part of the Theano, PyTorch, or TensorFlow
frameworks can now compute **exact derivatives of the output of your function
with respect to its parameters** (i.e. $\frac{\partial \ \text{model}}{\partial
x}$ and $\frac{\partial \ \text{model}}{\partial y}$ in the example).

As an aside, this is why these three frameworks are (foremost) used for
specifying and fitting neural network models (“deep learning”): the main
innovation that made fitting large neural networks feasible, *backpropagation*,
is nothing more or less than automatic differentiation (specifically: first
order, reverse mode automatic differentiation).

The three “NumPy + AD” frameworks are thus very similar, but they also have individual characteristics:

Theano: the original framework. Sadly, the creators announced that they will stop development.

PyTorch: using this one feels most like ‘normal’
Python development, according to their marketing and to their design goals. In
my experience, this is true. In Theano and TensorFlow, you build a (static)
computational graph as above, and then ‘compile’ it. In PyTorch, there is no
separate compilation step. Commands are executed immediately. (If you execute ```
a
= sqrt(16)
```

, then `a`

will contain `4`

^{[1]}. Not so in Theano or
TensorFlow). This means that debugging is easier: you can for example insert
`print`

statements in the `def model`

example above. This is not possible in the
other two frameworks. It also means that models can be more expressive: PyTorch
can auto-differentiate functions that contain plain Python loops, `if`

s, and
function calls (including recursion and closures). Also, like Theano but unlike
TensorFlow, PyTorch tries to make its tensor API as similar to NumPy’s as
possible.

TensorFlow: the most famous one. Apparently has a clunky API. In October 2017, the developers added an option (termed ‘eager execution’) to use immediate execution / dynamic computational graphs in the style of PyTorch.

^{[1]} This is pseudocode. Real PyTorch code:

```
import torch
a = torch.sqrt(torch.Tensor([16]))
print(a)
```

Output:

```
tensor([ 4.])
```

## Part 3: Showdown

With this backround, we can finally discuss the differences between PyMC3, Pyro and other probabilistic programming packages.

**PyMC3** has an extended history. Therefore there is a lot of good documentation
and content on it. It started out with just approximation by sampling, hence the
‘MC’ in its name. By now, it also supports variational inference, with automatic
differentiation (ADVI). For MCMC sampling, it offers the NUTS algorithm. NUTS is
easy for the end user: no manual tuning of sampling parameters is needed.

**Pyro** came out November 2017. Not much documentation yet. It was built with
large scale ADVI problems in mind. Beginning of this year, support for
approximate inference was added, with both the NUTS and the HMC algorithms.

**Edward** is also relatively new (February 2016). It offers both approximate
inference by sampling and variational inference. I don’t know much about it,
other than that its documentation has style. For MCMC, it has the HMC algorithm
(in which sampling parameters are not automatically updated, but should rather
be carefully set by the user), but not the NUTS algorithm.

Also a mention for probably the most used probabilistic programming language of
all (written in C++): **Stan**. It also offers both
sampling (HMC and NUTS) and variatonal inference. It has bindings for different
languages, including Python. Models are not specified in Python, but in some
specific Stan syntax.

*Discussion:*

The depreciation of its dependency Theano might be a disadvantage for PyMC3 in the long term. It doesn’t really matter right now. Here the PyMC3 devs discuss a possible new backend. The relatively large amount of learning resources on PyMC3 and the maturity of the framework are obvious advantages.

The advantage of Pyro is the expressiveness and debuggability of the underlying PyTorch framework. So if I want to build a complex model, I would use Pyro. I find this comment by ‘joh4n’, who implemented NUTS in PyTorch without much effort telling. The immaturity of Pyro is a rather big disadvantage at the moment.

As to when you should use sampling and when variational inference: I don’t have enough experience with approximate inference to make claims; from this StackExchange question however:

Thus, variational inference is suited to large data sets and scenarios where we want to quickly explore many models; MCMC is suited to smaller data sets and scenarios where we happily pay a heavier computational cost for more precise samples. For example, we might use MCMC in a setting where we spent 20 years collecting a small but expensive data set, where we are confident that our model is appropriate, and where we require precise inferences. We might use variational inference when fitting a probabilistic model of text to one billion text documents and where the inferences will be used to serve search results to a large population of users. In this scenario, we can use distributed computation and stochastic optimization to scale and speed up inference, and we can easily explore many different models of the data. – Sean Easter

I think VI can also be useful for ‘small data’, when you want to fit a model with many parameters / hidden variables. (Of course making sure good regularisation is applied). That is, you are not sure what a good model would be; The final model that you find can then be described in simpler terms.

So the conclusion seems to be: the classics PyMC3 and Stan still come out as the winners at the moment – unless you want to experiment with fancy probabilistic models.