C.1 Tutorial: Advanced Tutorial on ctypes, f2py, Cython, and pybind11 (C and Fortran Use)

The final state of the files we used in this demo is available on GitHub.


Introduction

Python, together with C/C++ and Fortran, is one of the most popular languages used by scientists. The easy-to-use interface and a large number of packages make it an attractive choice for prototyping. However, Python suffers from performance issues (such as the GIL). For this reason, (pure) Python applications are rarely used in a performance-critical setting. While Python may not be the fastest language, it is possible to offload performance-critical functionality to extensions. These extensions are written in C/C++/Fortran and can be imported as packages. One of the most commonly used packages that employs this technique is Numpy. As another example, TensorFlow, one of the most widely used deep-learning frameworks, exposes a Python interface while having a high-performance runtime written in C++.

The capability of running C code or using libraries developed in C/C++ is not unique to Python. Fortran also allows calling C functions from its subroutines. This is done through a binding interface that translates the C types into Fortran types.

To work with this tutorial, we recommend using the Windows Subsystem for Linux 2 (WSL 2) and Anaconda as a Python distribution.

To install Anaconda, refer to https://www.anaconda.com/products/individual.

To prepare the Python environment after installing Anaconda, create a new conda environment and activate it so that we can isolate our activities.

$ conda create -n dd2358 python==3.8
$ conda activate dd2358
$ conda install numpy
$ conda install cython
$ conda install pybind11

Objective

The goal of this tutorial is to give a broad and non-technical overview of the most common approaches to cross-language development with Fortran, Python, and C/C++.

For Fortran, we will go through the steps of developing a subroutine that calls a C function. We illustrate the use of a binding interface, and how to wrap it as a subroutine, using a small example.

For Python, we will go through a number of methods for developing Python extensions using C/C++ and Fortran, followed by a practical step-by-step example of building a simple math kernel extension that computes AXPY on Numpy arrays.

By the end of this tutorial, you should have a basic understanding of cross-language development. You will also be able to develop a minimal extension that operates on Numpy arrays in C++ or Fortran, as well as run C code in a Fortran program.


Running C code in Fortran

In this section, we learn how to call C functions from Fortran. While Fortran is a popular language in the scientific community, a large number of third-party libraries are written in C. Fortunately, it is possible to call C functions from Fortran code through the ISO C binding, and vice versa. One challenge of using C code in Fortran is the translation of data types (besides the memory layout ordering of arrays). Furthermore, function symbols are handled differently (e.g., Fortran function symbols implicitly end with a _ suffix). In order to call C functions from Fortran, we need to define an interface that translates the C types to Fortran types. Additionally, we can wrap it in a subroutine to expose as few C details as possible.

Example

Consider a C function that computes the product of two double-precision floating-point numbers, stored in a file product.c.

double product(double, double);

double product(double a, double b)
{
  return a * b;
}

Consider a Fortran program main.f90 that does two things: 1) adds two numbers; 2) computes the product of two numbers, where the product function is written in C.

module math
  interface
    real(c_double) function c_product(a, b) bind(c, name="product")
      use iso_c_binding
      implicit none
      real(kind=c_double), value :: a
      real(kind=c_double), value :: b
    end function
  end interface
end module

subroutine times(c, a, b)
  use, intrinsic :: iso_c_binding, only: c_double 
  use math
  use iso_fortran_env, only : real64
  real(kind=real64), intent(in)  :: a
  real(kind=real64), intent(in)  :: b
  real(kind=real64), intent(out) :: c
  c = c_product(a, b)
end subroutine

subroutine add(c, a, b)
  use iso_fortran_env, only : real64
  real(kind=real64), intent(in)  :: a
  real(kind=real64), intent(in)  :: b
  real(kind=real64), intent(out) :: c
  c = a + b
end subroutine

program main
  use iso_fortran_env, only : real64
  real(kind=real64) :: a
  real(kind=real64) :: b
  real(kind=real64) :: c
  real(kind=real64) :: d
  
  a = 0.5
  b = 0.5
  
  call add(c, a, b)
  call times(d, a, b)
  
  print *, a, " + ",  b, " = ", c
  print *, a, " * ",  b, " = ", d
end program

In this code, we define a module called math, with an interface containing a function called c_product that accepts two arguments and returns a value in double precision. We specify that this function binds to a C function called product(). Next, we state that we use the ISO standard C binding. Finally, we declare the two variables in the signature, a and b, which are passed by value.

Now that we have an interface, we build a subroutine that calls this C function. Inside the subroutine, we state that we use the ISO C binding with the C double data type. We also specify that we want to use the math module we defined. Additionally, we list the arguments of the subroutine and whether they are input or output variables. Finally, we call the C function and update the output variable.

We compile the C function into object code using a C compiler.

$ gcc -c product.c -o product.o

After that, we compile the Fortran program.

$ gfortran -c main.f90 -o main.o

Finally, we use the Fortran linker (that understands Fortran symbols) to link the objects into an executable.

$ gfortran product.o main.o -o main.out
$ ./main.out
  0.50000000000000000       +   0.50000000000000000       =    1.0000000000000000     
  0.50000000000000000       *   0.50000000000000000       =   0.25000000000000000

We leave calling a Fortran subroutine from C code as an exercise.


Developing Python Extensions

In this section, we learn four ways to write Python extensions in C/C++ and Fortran:

  • ctypes
  • f2py
  • Cython
  • Pybind11

Finally, we will create an AXPY kernel for Numpy arrays using Pybind11.

ctypes

ctypes is the simplest way to call a foreign function in a shared library from Python. A third-party library of functions can be called directly. Consider the following function in a file sum_src.c.

#include <stdio.h>
#include "sum.h"

double sum(double a, double b)
{
  printf("hello world from sum()\n");
  return a + b;
}

With a header file sum.h:

#ifndef __SUM_H__
#define __SUM_H__
double sum(double, double);
#endif

Compiled into a shared library libsum.so:

$ gcc -fPIC -shared -o libsum.so sum_src.c

We can run this function through Python directly:

Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ctypes
>>> _libsum = ctypes.CDLL('libsum.so')
>>> _libsum.sum.argtypes = [ctypes.c_double, ctypes.c_double]
>>> _libsum.sum.restype = ctypes.c_double
>>> _libsum.sum(ctypes.c_double(3.1), ctypes.c_double(3.3))
hello world from sum()
6.4
>>> 

The first step when using a shared library through ctypes is to load the shared library. After that, the function signature must be defined so that ctypes can translate the types. Finally, the return type must also be specified. With all this information set, the function can be called directly, as if the shared library were a package. One important issue is that all the arguments must be translated into the corresponding primitive ctypes data types. A feature of ctypes is that callback functions can be defined, allowing Python functions to be called by a C program (see the sketch below). For more information, refer to the documentation.
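As a minimal sketch of that callback mechanism (hypothetical: it assumes a library libops.so that exports a C function double apply(double (*op)(double, double), double, double); this function is not part of the libsum.so we built above), the Python side could look as follows:

import ctypes

# Callback type: a C function pointer that returns a double and takes two doubles
BINOP = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double, ctypes.c_double)

def py_mul(a, b):     # an ordinary Python function ...
    return a * b      # ... that the C code will call back into

_libops = ctypes.CDLL('./libops.so')                                 # hypothetical library
_libops.apply.argtypes = [BINOP, ctypes.c_double, ctypes.c_double]   # hypothetical C function
_libops.apply.restype = ctypes.c_double
print(_libops.apply(BINOP(py_mul), 2.0, 3.0))   # C calls py_mul(2.0, 3.0); prints 6.0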

f2py to run Fortran code in Python

While ctypes enables us to call functions in C shared libraries, f2py enables us to compile a set of Fortran subroutines into a Python module. Consider the following program, which contains a module kernels with a subroutine called axpy.

module kernels
implicit none
contains
  subroutine axpy(a, x, y)
    real(kind=8),               intent(in)    :: a
    real(kind=8), dimension(:), intent(in)    :: x
    real(kind=8), dimension(:), intent(inout) :: y
  
    y = a * x + y
  end subroutine
end module kernels

program main
  use, intrinsic :: iso_fortran_env, only : int32, real64
  use kernels
  implicit none
  real(kind=real64)                            :: a
  real(kind=real64), allocatable, dimension(:) :: x
  real(kind=real64), allocatable, dimension(:) :: y

  integer(kind=int32) :: length = 10
  integer(kind=int32) :: i = 10
  allocate(x(length))
  allocate(y(length))

  a = 0.8

  do i = 1, length
    x(i) = 0.5 * i
    y(i) = 0.1 * i
  end do

  do i = 1, length
    print *, a * x(i) + y(i)
  end do
  call axpy(a, x, y)
  print *, y(:)

  deallocate(x)
  deallocate(y)
end program

We can compile and run this code as-is. But what if we want to expose axpy() as a Python module? We create a file called math_kernels.f90 and put only the module that we want to expose into this file.

module kernels
implicit none
contains
  subroutine axpy(a, x, y)
    real(kind=8),               intent(in)    :: a
    real(kind=8), dimension(:), intent(in)    :: x
    real(kind=8), dimension(:), intent(inout) :: y
  
    y = a * x + y
  end subroutine
end module kernels

We use f2py to compile this into a module.

$ f2py -c math_kernels.f90 -m math_kernels

After that, we will have a module called math_kernels.cpython-38-x86_64-linux-gnu.so. We can use it in Python directly by importing the module it contains. For example:

$ python
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from math_kernels import kernels
>>> import numpy as np
>>> x = np.random.random(5)
>>> y = np.random.random(5)
>>> a = 0.2
>>> z = a * x + y
>>> kernels.axpy(a, x, y)
>>> z
array([0.98274619, 0.25840269, 0.87331128, 0.84778364, 0.32389878])
>>> y
array([0.98274619, 0.25840269, 0.87331128, 0.84778364, 0.32389878])
>>> z - y
array([0., 0., 0., 0., 0.])
>>> 

One thing to notice is that we used real(kind=8) instead of real(kind=real64). The reason is that f2py only accepts literal integer constants as kind parameters.
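If you are curious about exactly what interface f2py generated (the exact text depends on your Numpy/f2py version), each wrapped routine carries a docstring that can be inspected from Python:

>>> from math_kernels import kernels
>>> print(kernels.axpy.__doc__)   # shows the generated signature and expected argument types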

Cython

Cython is a source-to-source compiler that Cythonizes "Python-like" source code into C code, which in turn can be compiled by a C compiler. The major addition of Cython is C-like static typing. Continuing with the previous summation example, we can write this program in Cython and call it cython_sum.pyx:

cdef double cython_sum(double a, double b): # define a callable function within Cython with cdef
  return a + b

cpdef double sum(double a, double b):       # define a wrapper function callable by Python with cpdef
  return cython_sum(a, b)                   

We can directly Cythonize this program using cython to generate a C program. However, a convenient approach when developing a module as part of a larger Python package is to use setuptools. We can create a build script setup.py as follows:

from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

ext = Extension(name="sum", sources=["cython_sum.pyx"])
setup(ext_modules=cythonize(ext))

We compile this module using:

$ python setup.py build_ext --inplace
Compiling cython_sum.pyx because it changed.
[1/1] Cythonizing cython_sum.pyx
/home/steven/anaconda3/lib/python3.7/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/cython_sum.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build_ext
building 'sum' extension
creating build
creating build/temp.linux-x86_64-3.7
gcc -pthread -B /home/steven/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/steven/anaconda3/include/python3.7m -c cython_sum.c -o build/temp.linux-x86_64-3.7/cython_sum.o
creating build/lib.linux-x86_64-3.7
gcc -pthread -shared -B /home/steven/anaconda3/compiler_compat -L/home/steven/anaconda3/lib -Wl,-rpath=/home/steven/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/cython_sum.o -o build/lib.linux-x86_64-3.7/sum.cpython-37m-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.7/sum.cpython-37m-x86_64-linux-gnu.so -> 

After that, a module will be generated, called sum.cpython-37m-x86_64-linux-gnu.so. We can use this module directly:

$ python
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sum
>>> sum.sum(10.1, 5.2)
15.3
>>> 

However, what is the practical benefit of this, apart from generating fast C code and performing type checking? One major benefit is that it is possible to generate code that calls other shared libraries, effectively making the module a wrapper. Consider a more complicated case where we reuse the libsum.so library that we created for the ctypes example. We can call that library from Cython, effectively creating a wrapper.

However, we first need to set up a header file sum.h and a corresponding .pxd file, libsum.pxd, so that function signatures can be resolved. It is noteworthy that Cython comes with a large number of predefined .pxd files for commonly used libraries, such as libc, so that one does not need to create such a definition for every library used. Consider libsum.pxd, which uses sum.h:

cdef extern from "sum.h":
  double sum(double, double)

Now that the function definitions are in place, we can use cimport in the Cython code to import this library, and cython_sum.pyx becomes:

cimport libsum                        # use cimport to import libsum.so

cpdef double sum(double a, double b): # Now that we depend on an external library, we only need the Python wrapper
  return libsum.sum(a, b)             # call the sum function

Similarly, we rebuild the Cython module using setup.py. However, since we now depend on an external library, it has to be linked. This can be specified with the libraries argument, so setup.py becomes:

from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

ext = Extension(name="sum", sources=["cython_sum.pyx"], libraries=["sum"])
setup(ext_modules=cythonize(ext))

We rebuild the module with the new configuration:

$ python setup.py build_ext --inplace
Compiling cython_sum.pyx because it changed.
[1/1] Cythonizing cython_sum.pyx
/home/steven/anaconda3/lib/python3.7/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/cython_sum.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build_ext
building 'sum' extension
gcc -pthread -B /home/steven/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/home/steven/anaconda3/include/python3.7m -c cython_sum.c -o build/temp.linux-x86_64-3.7/cython_sum.o
gcc -pthread -shared -B /home/steven/anaconda3/compiler_compat -L/home/steven/anaconda3/lib -Wl,-rpath=/home/steven/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/cython_sum.o -lsum -o build/lib.linux-x86_64-3.7/sum.cpython-37m-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.7/sum.cpython-37m-x86_64-linux-gnu.so -> 

We will get the same Python module as before, and we can run it the same way as before:

$ python
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sum
>>> sum.sum(10.1, 5.2)
hello world from sum()
15.3
>>> 

The printout "hello world from sum()" is a clear indication that our shared library is being used by the Cython module.

PyBind11

In all the previous examples, we used independent shared libraries written in C and provided Python bindings to turn them into modules. PyBind11 takes the opposite approach: it provides a C++ API for binding C++ classes and functions and for generating Python modules. Pybind11 is C++ only and requires at least C++11. In fact, TensorFlow also uses Pybind11 to expose functionalities in its Python interface. One added advantage of using Pybind11 is that data and types coming from Python (such as Numpy arrays) can be handled easily. Consider our previous example sum_src.c: we create a new version called sum_cpp.cc with the exact same content. To generate a Python module using Pybind11, we modify the code to the following:

#include "sum.h"
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>     // Include the Pybind11 header

double sum(double a, double b)
{
  printf("hello world from sum()\n");
  return a + b;
}

/* name of module-   ---a variable of type py::module_ used to create the binding */
/*               |   |                                                             */
/*               v   v                                                             */
PYBIND11_MODULE(sum, m) {                            // Create a module using the PYBIND11_MODULE macro
    m.doc() = "pybind11 sum module";
    m.def("sum", &sum, "sum two numbers in double"); // calls module::def() to create generate binding code that exposes sum()
}

Compile the code with:

$ g++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` sum_cpp.cc -o sum`python3-config --extension-suffix`

We will have a module called sum.cpython-37m-x86_64-linux-gnu.so. This module can be directly used just like before:

$ python
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sum
>>> sum.sum(10.6, 5.4)
hello world from sum()
16.0
>>> 

That's it! No extra binding is needed.


A Simple Math Kernel Extension for Python

In this section, we will learn how to write a simple AXPY Python extension using PyBind11. AXPY is a widely used math operation in scientific applications where a vector x is scaled by a scalar α and added to a vector y. The operation takes the form:

y ← α·x + y

There are two variants of AXPY, SAXPY and DAXPY, where S and D refer to single precision (32-bit float) and double precision (64-bit double), respectively.

Basic Kernel

We are going to implement a simple SAXPY that takes its input from Python as Numpy arrays. We implement the basic kernel in math_kernels.cc:

void axpy(size_t n, float a, float *x, float *y)
{
  for (size_t i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

Nothing fancy here. Now that we have our kernel implemented, we need to create a module. However, instead of directly using PYBIND11_MODULE to create the module, we first need to create a wrapper that handles the Numpy arrays.

Handling Numpy arrays

Since Numpy arrays are objects managed by Numpy, we cannot access them as if they were C/C++ arrays. Fortunately, Pybind11 already provides the necessary bindings for us to inspect and unpack Numpy arrays. For this, we include the header <pybind11/numpy.h> and define the namespace alias namespace py = pybind11. The main class for handling Numpy arrays is py::array_t<T>, where T is the datatype. Python supports a very general and convenient way of exchanging data between libraries through the buffer protocol. In this sense, py::array_t<T> can be considered a buffer object that is restricted to Numpy arrays. The information about the buffer, such as its dimensions and size, is stored in an information object called py::buffer_info. To create a function that accepts Numpy arrays in C++, we can do the following:

void axpy_wrapper(float a, py::array_t<float> py_x, py::array_t<float> py_y)
{
  // Request for buffer information
  py::buffer_info x_buffer = py_x.request();
  py::buffer_info y_buffer = py_y.request();

  // check dim
  if (x_buffer.ndim != 1 || y_buffer.ndim != 1) {
    throw std::runtime_error("Error: dimension of vector should be 1");
  }

  // check shape match
  if (x_buffer.shape[0] != y_buffer.shape[0]) {
    throw std::runtime_error("Error: size of X and Y do not match");
  }

  // extract raw pointer
  float *x = (float*)x_buffer.ptr;
  float *y = (float*)y_buffer.ptr;
  
  return axpy(x_buffer.shape[0], a, x, y);
}

In this example, we see how to extract a raw pointer from a Numpy array and obtain information such as the length of dimension zero. Unless we are certain of how this function will be called, we should perform simple sanity checks, for example on dimensionality and sizes. Furthermore, we should ensure that the array we are handling is a dense (contiguous) array, not a strided view. For more information on format checking and performance optimization, refer to the Pybind11 documentation.
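To see on the Python side why density matters, note that a strided view of a Numpy array is not contiguous. The following is a small sketch in plain Numpy (nothing Pybind11-specific):

import numpy as np

y = np.zeros(10, dtype=np.float32)
view = y[::2]                          # every other element: a strided, non-dense view
print(view.flags['C_CONTIGUOUS'])      # False: its buffer cannot be walked with unit stride
dense = np.ascontiguousarray(view)     # makes a dense copy; note that writes to the copy
print(dense.flags['C_CONTIGUOUS'])     # True   do not propagate back to y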

Putting them all together

Now that we have a kernel and a wrapper to handle incoming Numpy arrays, we can put them together:

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>     // Include header that supports Numpy arrays

namespace py = pybind11;

void axpy(size_t n, float a, float *x, float *y)
{
  for (size_t i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

void axpy_wrapper(float a, py::array_t<float> py_x, py::array_t<float> py_y)
{
  // Request for buffer information
  py::buffer_info x_buffer = py_x.request();
  py::buffer_info y_buffer = py_y.request();

  // check dim
  if (x_buffer.ndim != 1 || y_buffer.ndim != 1) {
    throw std::runtime_error("Error: dimension of vector should be 1");
  }

  // check shape match
  if (x_buffer.shape[0] != y_buffer.shape[0]) {
    throw std::runtime_error("Error: size of X and Y do not match");
  }

  // extract raw pointer
  float *x = (float*)x_buffer.ptr;
  float *y = (float*)y_buffer.ptr;

  return axpy(x_buffer.shape[0], a, x, y);
}

PYBIND11_MODULE(math_kernels, m)
{
  m.def("axpy", &axpy_wrapper);
}

We can compile with the following:

$ g++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` math_kernels.cc -o math_kernels`python3-config --extension-suffix`

After that we can use the module in Python:

$ python
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import math_kernels
>>> x = np.random.random(5)
>>> y = np.random.random(5)
>>> a = 0.5
>>> z = a * x + y
>>> math_kernels.axpy(a, x, y)
>>> z
array([0.72109651, 0.80642238, 1.3581158 , 0.68677673, 0.47000476])
>>> y
array([0.67480229, 0.67799412, 0.97670182, 0.20975148, 0.35173144])
>>> z - y
array([0.04629421, 0.12842827, 0.38141398, 0.47702525, 0.11827332])
>>> 

This is completely wrong. What has happened?

Type Check

Recall that we are implementing SAXPY. However, by default, Numpy arrays have the type np.float64, i.e., double precision. The error is due to Pybind11's implicit conversion between datatypes: the double-precision arrays are converted into temporary single-precision copies, so the kernel updates the copies and the original y is never modified. The easiest way to enforce typing is to specify the list of arguments and whether each argument may be converted automatically. In this case, we change our module creation code to:

PYBIND11_MODULE(math_kernels, m)
{
  m.def("axpy", &axpy_wrapper, py::arg("a"), py::arg("x").noconvert(true), py::arg("y").noconvert(true));
}

Here we specify that the types of the two vectors must match exactly. We compile the code again and try an example where one vector is in single precision while the other is in double precision:

$ python
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import math_kernels
>>> x = np.random.random(5).astype(np.float32)
>>> y = np.random.random(5)
>>> math_kernels.axpy(0.5, x, y)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: axpy(): incompatible function arguments. The following argument types are supported:
    1. (a: float, x: numpy.ndarray[float32], y: numpy.ndarray[float32]) -> None

Invoked with: 0.5, array([0.583261  , 0.4329178 , 0.1882612 , 0.94220173, 0.56544894],
      dtype=float32), array([0.65019744, 0.12314781, 0.04364232, 0.54704202, 0.18679054])
>>> 

We can see that Python now throws an error.

Overloading

In case we want to support both double precision and single precision, what can we do? In C++, we can easily support this using templates, and Pybind11 supports binding the resulting overloads. We rewrite our code as follows:

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>    // Include header that supports Numpy arrays

namespace py = pybind11;

template <typename T>
void axpy(size_t n, T a, T *x, T *y)
{
  for (size_t i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

template <typename T>
void axpy_wrapper(T a, py::array_t<T> py_x, py::array_t<T> py_y)
{
  // Request for buffer information
  py::buffer_info x_buffer = py_x.request();
  py::buffer_info y_buffer = py_y.request();

  // check dim
  if (x_buffer.ndim != 1 || y_buffer.ndim != 1) {
    throw std::runtime_error("Error: dimension of vector should be 1");
  }

  // check shape match
  if (x_buffer.shape[0] != y_buffer.shape[0]) {
    throw std::runtime_error("Error: size of X and Y not match");
  }

  // extract raw pointer
  T *x = (T*)x_buffer.ptr;
  T *y = (T*)y_buffer.ptr;
  
  return axpy<T>(x_buffer.shape[0], a, x, y);
}

PYBIND11_MODULE(math_kernels, m)
{
  m.def("axpy", &axpy_wrapper<float>,  py::arg("a"), py::arg("x").noconvert(true), py::arg("y").noconvert(true));
  m.def("axpy", &axpy_wrapper<double>, py::arg("a"), py::arg("x").noconvert(true), py::arg("y").noconvert(true));
}

We now have a module that supports both single precision and double precision (and only those). We compile the code and run the module in Python:

$ python
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import math_kernels
>>> x = np.random.random(5).astype(np.float32)
>>> y = np.random.random(5).astype(np.float32)
>>> a = 0.5
>>> z = a * x + y
>>> math_kernels.axpy(a, x, y)
>>> y - z
array([0., 0., 0., 0., 0.], dtype=float32)
>>> x = np.random.random(5).astype(np.float64)
>>> y = np.random.random(5).astype(np.float64)
>>> z = a * x + y
>>> math_kernels.axpy(a, x, y)
>>> y - z
array([0., 0., 0., 0., 0.])
>>>

It now works with both precisions!

Build system

While we can run the compilation command directly, it is common practice to put the wrapper and module creation code in separate locations. Since Pybind11 is a header-only library, it is convenient to embed it into the project you are developing. The build system supported by Pybind11 is CMake. Refer to https://pybind11.readthedocs.io/en/stable/compiling.html#building-with-cmake for more information.

We have created a simple CMakeLists.txt for this example. One can obtain Pybind11 with the following commands and, optionally, check out a specific release:

$ git clone https://github.com/pybind/pybind11.git
$ cd pybind11
$ git checkout tags/v2.6.2
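As an alternative to CMake, recent Pybind11 releases (2.6 and later) also ship setuptools helpers. A minimal setup.py sketch for the math_kernels module from the previous section could look like the following (this assumes pybind11 is installed in the build environment, e.g. through conda as above):

from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, build_ext

# Pybind11Extension adds the pybind11 include paths and sets the C++ standard flag
ext = Pybind11Extension("math_kernels", ["math_kernels.cc"], cxx_std=11)
setup(name="math_kernels",
      ext_modules=[ext],
      cmdclass={"build_ext": build_ext})

The module is then built the same way as the Cython example, with python setup.py build_ext --inplace.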

(Optional) More advanced examples

In this section, we take a look at two more advanced real-world examples.

StreamBrain

In this example, we look at one of our research projects, StreamBrain, a framework for exploring how emerging neuromorphic learning algorithms can make use of HPC hardware in a parallel environment. StreamBrain is a Python framework that implements the BCPNN algorithm, where important steps of the algorithm (the kernels) are offloaded to different backends, such as GPUs, FPGAs, vectorized OpenMP, and MPI for distributed computing.

Take a look at how these kernels are tied together through a Pybind11 binding, a wrapper, and a concrete implementation: https://github.com/KTH-HPC/StreamBrain/blob/mpi%2Bomp/BCPNN/backend/_kernels/kernels_openmp_internal.cc

Take a look at how Pybind11's source code is integrated into the build system: https://github.com/KTH-HPC/StreamBrain/blob/mpi%2Bomp/CMakeLists.txt

TensorFlow

TensorFlow is perhaps one of the most famous projects that make use of Pybind11 to expose functionalities of its runtime. Take a look at the TensorFlow Python profiler interface and see how Pybind11 is used to expose a TraceMe wrapper (you do not need to understand the source code, but try to understand the binding). There are Pybind11 functions that are not covered in this tutorial; try to look them up in the documentation and on the internet.

If you are interested in the TensorFlow profiler (which is out of the scope of this course), read this paper.


References

This tutorial is inspired by: