
nanobind — Seamless operability between C++17 and Python

nanobind is a small binding library that exposes C++ types in Python and vice versa. It is reminiscent of Boost.Python and pybind11 and uses near-identical syntax. In contrast to these existing tools, nanobind is more efficient: bindings compile in a shorter amount of time, producing smaller binaries with better runtime performance.

Why yet another binding library?

I started the pybind11 project back in 2015 to generate better and more efficient C++/Python bindings. Thanks to many amazing contributions by others, pybind11 has become a core dependency of software across the world, including flagship projects like PyTorch and TensorFlow. Every day, the repository is cloned more than 100,000 times. Hundreds of contributed extensions and generalizations address the use cases of this diverse audience. However, all of this success also came with costs: the complexity of the library grew tremendously, which had a negative impact on efficiency.

Ironically, the situation today feels like 2015 all over again: binding generation with existing tools (Boost.Python, pybind11) is slow and produces enormous binaries with overheads on runtime performance. At the same time, key improvements in C++17 and Python 3.8 provide opportunities for drastic simplifications. Therefore, I am starting another binding project. This time, the scope is intentionally limited so that this doesn't turn into an endless cycle.

Performance

TL;DR: nanobind bindings compile ~2-3× faster, produce ~3× smaller binaries, and have up to ~8× lower runtime overheads (compared to pybind11 with -Os size optimizations).

The following experiments analyze the performance of a very large function-heavy (func) and class-heavy (class) binding microbenchmark compiled using Boost.Python, pybind11, and nanobind in both debug and size-optimized (opt) modes. A comparison with cppyy (which uses dynamic compilation) is also shown later. Details on the experimental setup can be found here.

The first plot contrasts the compilation time, where "number ×" annotations denote the amount of time spent relative to nanobind. As shown below, nanobind achieves a consistent ~2-3× improvement compared to pybind11.

[Plot: Compilation time benchmark]

nanobind also greatly reduces the binary size of the compiled bindings. There is a roughly 3× improvement compared to pybind11 and an 8-9× improvement compared to Boost.Python (both with size optimizations).

[Plot: Binary size benchmark]

The last experiment compares the runtime performance overheads by calling one of the bound functions many times in a loop. Here, it is also interesting to compare against cppyy (gray bar) and a pure Python implementation that runs bytecode without binding overheads (hatched red bar).

[Plot: Runtime performance benchmark]

This data shows that the overhead of calling a nanobind function is lower than that of an equivalent function call done within CPython. The functions benchmarked here don't perform CPU-intensive work, so this mainly measures the overheads of performing a function call, boxing/unboxing arguments and return values, etc.

The difference compared to pybind11 is significant: a ~2× improvement for simple functions, and an ~8× improvement when classes are being passed around. Complexities in pybind11 related to overload resolution, multiple inheritance, and holders are the main reasons for this difference. Those features were either simplified or completely removed in nanobind.

Finally, there is a ~1.4× improvement in both experiments compared to cppyy. (Please ignore the two [debug] columns: I did not feel comfortable adjusting the JIT compilation flags, so all cppyy bindings are optimized.)

What are the technical differences between nanobind and cppyy?

cppyy is based on dynamic parsing of C++ code and just-in-time (JIT) compilation of bindings via the LLVM compiler infrastructure. The authors of cppyy report that their tool produces bindings with much lower overheads compared to pybind11, and the above plots show that this is indeed true. However, nanobind retakes the performance lead in these experiments.

With speed gone as the main differentiating factor, other qualitative differences make these two tools appropriate to different audiences: cppyy has its origin in CERN's ROOT mega-project and must be highly dynamic to work with that codebase: it can parse header files to generate bindings as needed. cppyy works particularly well together with PyPy and can avoid boxing/unboxing overheads with this combination. The main downside of cppyy is that it depends on big and complex machinery (Cling/Clang/LLVM) that must be deployed on the user's side and then run there. There isn't a way of pre-generating bindings and then shipping just the output of this process.

nanobind is relatively static in comparison: you must tell it which functions to expose via binding declarations. These declarations offer a high degree of flexibility that users will typically use to create bindings that feel pythonic. At compile-time, those declarations turn into a sequence of CPython API calls, which produces self-contained bindings that are easy to redistribute via PyPI or elsewhere. Tools like cibuildwheel and scikit-build can fully automate the process of generating Python wheels for each target platform. A minimal example project shows how to do this automatically via GitHub Actions.
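
To make this concrete, here is a minimal sketch of a nanobind binding declaration (the module name my_ext and the bound function are placeholders):

#include <nanobind/nanobind.h>

namespace nb = nanobind;

NB_MODULE(my_ext, m) {
    // Expose a C++ lambda to Python as my_ext.add()
    m.def("add", [](int a, int b) { return a + b; });
}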

What are the technical differences between nanobind and pybind11?

nanobind and pybind11 are the most similar of all of the binding tools compared above.

The main difference between them is philosophical: pybind11 must deal with all of C++ to bind complex legacy codebases, while nanobind targets a smaller C++ subset. The codebase has to adapt to the binding tool and not the other way around! This change of perspective allows nanobind to be simpler and faster. Pull requests with extensions and generalizations were welcomed in pybind11, but they will likely be rejected in this project.

Removed features

A number of pybind11 features are unavailable in nanobind. The list below uses the following symbols:

  • ○: This removal is a design choice. Use pybind11 if you need this feature.
  • ●: This feature will likely be added at some point; your help is welcomed.
  • ◑: Unclear, to be discussed.

Removed features include:

  • ○ Multiple inheritance was a persistent source of complexity in pybind11, and it is one of the main casualties in creating nanobind. C++ classes involving multiple inheritance cannot be mapped onto an equivalent Python class hierarchy.
  • nanobind instances co-locate instance data with a Python object instead of accessing it via a holder type. This is a major difference compared to pybind11, which has implications on object ownership. Shared/unique pointers are still supported with some restrictions, see below for details.
  • ○ Binding does not support C++ classes with overloaded or deleted operator new / operator delete.
  • ○ The ability to run several independent Python interpreters in the same process is unsupported. (This would require TLS lookups for nanobind data structures, which is undesirable.)
  • kw_only / pos_only argument annotations were removed.
  • ○ The options class for customizing docstring generation was removed.
  • ○ Workarounds for old/buggy/non-standard-compliant compilers are gone and will not be reintroduced.
  • ○ Module-local types and exceptions are unsupported.
  • ○ Custom metaclasses are unsupported.
  • ● Many STL type casters have not yet been ported.
  • ● PyPy support is gone. (PyPy requires many workarounds in pybind11 that complicate its internals. Making PyPy interoperate with nanobind will likely require changes to the PyPy CPython emulation layer.)
  • ◑ NumPy integration was replaced by a more general nb::tensor<> integration that supports CPU/GPU tensors produced by various frameworks (NumPy, PyTorch, TensorFlow, JAX, ..).
  • ◑ Eigen integration was removed.
  • ◑ Buffer protocol functionality was removed.
  • ◑ Nested exceptions are not supported.
  • ◑ Features to facilitate pickling and unpickling were removed.
  • ◑ Support for embedding the interpreter and evaluating Python code strings was removed.

Bullet points marked with ● or ◑ may be reintroduced eventually, but this will need to be done in a careful opt-in manner that does not affect the code complexity, binary size, or compilation/runtime performance of basic bindings that don't depend on these features.

Optimizations

Besides removing features, the rewrite was an opportunity to address long-standing performance issues in pybind11:

  • C++ objects are now co-located with the Python object whenever possible (less pointer chasing compared to pybind11). The per-instance overhead for wrapping a C++ type into a Python object shrinks by 2.3×. (pybind11: 56 bytes, nanobind: 24 bytes.)
  • C++ function binding information is now co-located with the Python function object (less pointer chasing).
  • C++ type binding information is now co-located with the Python type object (less pointer chasing, fewer hashtable lookups).
  • nanobind internally replaces std::unordered_map with a more efficient hash table (tsl::robin_map, which is included as a git submodule).
  • Function calls from/to Python are realized using PEP 590 vector calls, which gives a nice speed boost. The main function dispatch loop no longer allocates heap memory.
  • pybind11 was designed as a header-only library, which is generally a good thing because it simplifies the compilation workflow. However, one major downside of this is that a large amount of redundant code has to be compiled in each binding file (e.g., the function dispatch loop and all of the related internal data structures). nanobind compiles a separate shared or static support library (libnanobind) and links it against the binding code to avoid redundant compilation. When using the CMake nanobind_add_module() function, this all happens transparently.
  • #include <pybind11/pybind11.h> pulls in a large portion of the STL (about 2.1 MiB of headers with Clang and libc++). nanobind minimizes STL usage to avoid this problem. Type casters even for basic types like std::string require an explicit opt-in by including an extra header file (e.g. #include <nanobind/stl/string.h>); a short sketch of this opt-in follows this list.
  • pybind11 is dependent on link time optimization (LTO) to produce reasonably-sized bindings, which makes linking a build time bottleneck. With nanobind's split into a precompiled core library and minimal metatemplating, LTO is no longer important.
  • nanobind maintains efficient internal data structures for lifetime management (needed for nb::keep_alive, nb::rv_policy::reference_internal, the std::shared_ptr interface, etc.). With these changes, it is no longer necessary that bound types are weak-referenceable, which saves a pointer per instance.
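
The following minimal sketch illustrates the opt-in STL type casters mentioned above (the module name my_ext is a placeholder). Without the extra include directive, binding a function that returns std::string would fail to compile:

#include <string>

#include <nanobind/nanobind.h>
#include <nanobind/stl/string.h> // opt-in type caster for std::string

namespace nb = nanobind;

NB_MODULE(my_ext, m) {
    // Returns a Python 'str'; the conversion is performed by the
    // caster from nanobind/stl/string.h
    m.def("greet", [](const std::string &name) { return "Hello, " + name; });
}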

Other improvements

Besides performance improvements, nanobind includes several quality-of-life improvements for developers:

  • When the Python interpreter shuts down, nanobind reports instance, type, and function leaks related to bindings, which is useful for tracking down reference counting issues.

  • nanobind deletes its internal data structures when the Python interpreter terminates, which avoids memory leak reports in tools like valgrind.

  • In pybind11, function docstrings are pre-rendered while the binding code runs (.def(...)). This can create confusing signatures containing C++ types when the binding code of those C++ types hasn't yet run. nanobind does not pre-render function docstrings; they are created on the fly when queried.

  • nanobind has greatly improved support for exchanging tensor data structures with modern array programming frameworks.

Dependencies

nanobind depends on recent versions of everything:

  • C++17: The if constexpr feature was crucial to simplify the internal meta-templating of this library.

  • Python 3.8+: nanobind heavily relies on PEP 590 vector calls that were introduced in version 3.8.

  • CMake 3.17+: Recent CMake versions include important improvements to FindPython that this project depends on.

  • Supported compilers: Clang 7, GCC 8, MSVC2019 (or newer) are officially supported.

    Other compilers like MinGW, Intel (icpc, oneAPI), NVIDIA (PGI, nvcc) may or may not work but aren't officially supported. Pull requests to work around bugs in these compilers will not be accepted, as similar changes introduced significant complexity in pybind11. Instead, please file bugs with the vendors so that they will fix their compilers.

CMake interface

Note: for your convenience, a minimal example of a project with C++ bindings compiled using nanobind and scikit-build is available in the nanobind_example repository. To set up a build system manually, read on:

nanobind provides a CMake convenience function that automates the process of building a Python extension module. This works analogously to pybind11. Example:

add_subdirectory(.. path to nanobind directory ..)
nanobind_add_module(my_ext common.h source_1.cpp source_2.cpp)

The defaults chosen by this function are somewhat opinionated. In particular, it performs the following steps to produce efficient bindings.

  • In non-debug modes, it compiles with size optimizations (i.e., -Os). This is generally the mode that you will want to use for C++/Python bindings. Switching to -O3 would enable further optimizations like vectorization, loop unrolling, etc., but these all increase compilation time and binary size with no real benefit for bindings.

    If your project contains portions that benefit from -O3-level optimizations, then it's better to run two separate compilation steps. An example is shown below:

    # Compile project code with current optimization mode configured in CMake
    add_library(example_lib STATIC source_1.cpp source_2.cpp)
    # Need position independent code (-fPIC) to link into 'example_ext' below
    set_target_properties(example_lib PROPERTIES POSITION_INDEPENDENT_CODE ON)
    
    # Compile extension module with size optimization and add 'example_lib'
    nanobind_add_module(example_ext common.h source_1.cpp source_2.cpp)
    target_link_libraries(example_ext PRIVATE example_lib)

    Size optimizations can be disabled by specifying the optional NOMINSIZE argument, though doing so is not recommended.

  • nanobind_add_module() also disables stack-smashing protections (i.e., it specifies -fno-stack-protector to Clang/GCC). Protecting against such vulnerabilities in a Python VM seems futile, and it adds non-negligible extra cost (+8% binary size in my benchmarks). This behavior can be disabled by specifying the optional PROTECT_STACK flag. Either way, it is not recommended that you use nanobind in a setting where it presents an attack surface.

  • In non-debug compilation modes, it strips internal symbol names from the resulting binary, which leads to a substantial size reduction. This behavior can be disabled using the optional NOSTRIP argument.

  • Link-time optimization (LTO) is not active by default; benefits compared to pybind11 are relatively low, and this tends to make linking a build bottleneck. That said, the optional LTO argument can be specified to enable LTO in non-debug modes.

  • The function also sets the target to C++17 mode (it's fine to manually increase this later on, e.g., to C++20).

  • It appends the library suffix (e.g., .cpython-39-darwin.so) based on information provided by CMake's FindPython module.

  • It statically or dynamically links against libnanobind depending on the value of the NB_SHARED parameter of the CMake project. Note that NB_SHARED is not an input of the nanobind_add_module() function. Rather, it should be specified before including the nanobind CMake project:

    set(NB_SHARED OFF CACHE INTERNAL "") # Request static compilation of libnanobind
    add_subdirectory(.. path to nanobind directory ..)
    nanobind_add_module(...)

API differences

nanobind mostly follows the pybind11 API, hence the pybind11 documentation is the main source of documentation for this project. A number of simplifications and noteworthy changes are detailed below.

  • Namespace. nanobind types and functions are located in the nanobind namespace. The namespace nb = nanobind; shorthand alias is recommended.

  • Macros. The PYBIND11_* macros (e.g., PYBIND11_OVERRIDE(..)) were renamed to NB_* (e.g., NB_OVERRIDE(..)).

  • Shared pointers and holders. nanobind removes the concept of a holder type, which caused inefficiencies and introduced complexity in pybind11. This has implications on object ownership, shared ownership, and interactions with C++ shared/unique pointers. Please see the following separate document for the nitty-gritty details.

    The gist is that use of shared/unique pointers requires one or both of the following optional header files:

    • #include <nanobind/stl/shared_ptr.h>
    • #include <nanobind/stl/unique_ptr.h>

    Binding functions that take std::unique_ptr<T> arguments involves some limitations that can be avoided by changing their signatures to use std::unique_ptr<T, nb::deleter<T>> instead. Usage of std::enable_shared_from_this<T> is prohibited and will raise a compile-time assertion. This is consistent with the philosophy of this library: the codebase has to adapt to the binding tool and not the other way around.
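
    As a hedged illustration of this signature change (the type Data and the module name my_ext are placeholders):

    #include <memory>

    #include <nanobind/nanobind.h>
    #include <nanobind/stl/unique_ptr.h>

    namespace nb = nanobind;

    struct Data { int value = 0; };

    NB_MODULE(my_ext, m) {
        nb::class_<Data>(m, "Data").def(nb::init<>());

        // Using nb::deleter<Data> in the signature avoids the limitations
        // of plain std::unique_ptr<Data> arguments mentioned above
        m.def("consume", [](std::unique_ptr<Data, nb::deleter<Data>> data) {
            // ownership of 'data' ends when this function returns
        });
    }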

    It is no longer necessary to specify holder types in the type declaration:

    pybind11:

    py::class_<MyType, std::shared_ptr<MyType>>(m, "MyType")
      ...

    nanobind:

    nb::class_<MyType>(m, "MyType")
      ...
  • Implicit type conversions. In pybind11, implicit conversions were specified using a follow-up function call. In nanobind, they are specified within the constructor declarations:

    pybind11:

    py::class_<MyType>(m, "MyType")
        .def(py::init<MyOtherType>());
    
    py::implicitly_convertible<MyOtherType, MyType>();

    nanobind:

    nb::class_<MyType>(m, "MyType")
        .def(nb::init_implicit<MyOtherType>());
  • Trampoline classes. Trampolines, i.e., polymorphic class implementations that forward virtual function calls to Python, now require an extra NB_TRAMPOLINE(parent, size) declaration, where parent refers to the parent class and size is at least as big as the number of NB_OVERRIDE_*() calls. nanobind caches information to enable efficient function dispatch, for which it must know the number of trampoline "slots". Example:

    struct PyAnimal : Animal {
        NB_TRAMPOLINE(Animal, 1);
    
        std::string name() const override {
            NB_OVERRIDE(std::string, Animal, name);
        }
    };

    Trampoline declarations with an insufficient size may eventually trigger a Python RuntimeError exception with a descriptive label, e.g. nanobind::detail::get_trampoline('PyAnimal::what()'): the trampoline ran out of slots (you will need to increase the value provided to the NB_TRAMPOLINE() macro)!.

  • Type casters. The API of custom type casters has changed significantly. In a nutshell, the following changes are needed:

    • load() was renamed to from_python(). The function now takes an extra uint8_t flags argument (instead of bool convert, which is now represented by the flag nanobind::detail::cast_flags::convert). A cleanup_list * pointer keeps track of Python temporaries that are created by the conversion and that need to be deallocated after a function call has taken place. flags and cleanup should be passed to any recursive usage of type_caster::from_python().

    • cast() was renamed to from_cpp(). The function takes a return value policy (as before) and a cleanup_list * pointer.

    Both functions must be marked as noexcept. In contrast to pybind11, errors during type casting are only propagated using status codes. If a severe error condition arises that should be reported, use Python warning API calls for this, e.g. PyErr_WarnFormat().

    Note that the cleanup list is only available when from_python() or from_cpp() are called as part of function dispatch, while usage by nanobind::cast() sets cleanup to nullptr. This case should be handled gracefully by refusing the conversion if the cleanup list is absolutely required.

    The std::pair type caster may be useful as a reference for these changes.

  • The following types and functions were renamed:

    pybind11              nanobind
    ------------------    -------------
    error_already_set     python_error
    type::of<T>           type<T>
    type                  type_object
    reinterpret_borrow    borrow
    reinterpret_steal     steal
    custom_type_setup     type_callback
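
    For instance, the renamed reference counting helpers are used as follows (a minimal sketch; raw is a freshly created CPython reference):

    PyObject *raw = PyLong_FromLong(42);  // new reference from the C API
    nb::object a = nb::steal(raw);        // take ownership without incref
    nb::object b = nb::borrow(a.ptr());   // copy the reference with incref
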
  • New features.

    • Unified DLPack/Buffer protocol integration: nanobind can retrieve and return tensors using two standard protocols: DLPack and the buffer protocol. This enables zero-copy data exchange of CPU and GPU tensors with array programming frameworks including NumPy, PyTorch, TensorFlow, JAX, etc.

      Details on using this feature can be found here.
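
      A hedged sketch of what such a binding can look like (the header name and template parameters follow the nb::tensor<> API; my_ext is a placeholder module name and details should be checked against the linked documentation):

      #include <nanobind/nanobind.h>
      #include <nanobind/tensor.h>

      namespace nb = nanobind;

      NB_MODULE(my_ext, m) {
          // Accept any 2D float32 CPU tensor (NumPy, PyTorch, ...) zero-copy
          m.def("sum_2d", [](nb::tensor<float, nb::shape<nb::any, nb::any>,
                                        nb::device::cpu> t) {
              float accum = 0.f;
              for (size_t i = 0; i < t.shape(0); ++i)
                  for (size_t j = 0; j < t.shape(1); ++j)
                      accum += t(i, j);
              return accum;
          });
      }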

    • Supplemental type data: nanobind can store supplemental data along with registered types. This information is co-located with the Python type object. An example use of this fairly advanced feature is a library that registers large numbers of different types (e.g. flavors of tensors). A single generically implemented function can then query this supplemental information to handle each type slightly differently.

      struct Supplement {
          ... // should be a POD (plain old data) type
      };
      
      // Register a new type Test, and reserve space for sizeof(Supplement)
      nb::class_<Test> cls(m, "Test", nb::supplement<Supplement>());
      
      /// Mutable reference to 'Supplement' portion in Python type object
      Supplement &supplement = nb::type_supplement<Supplement>(cls);
    • Low-level interface: nanobind exposes a low-level interface to provide fine-grained control over the sequence of steps that instantiates a Python object wrapping a C++ instance. Like the above point, this is useful when writing generic binding code that manipulates nanobind-based objects of various types.

      Details on using this feature can be found here.
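
      A hedged sketch of these steps, assuming a previously bound class MyType (function names follow nanobind's low-level interface):

      nb::handle py_type = nb::type<MyType>();   // look up the Python type object
      nb::object obj = nb::inst_alloc(py_type);  // allocate an uninitialized instance
      MyType *ptr = nb::inst_ptr<MyType>(obj);   // pointer to the C++ payload
      new (ptr) MyType();                        // placement-construct the C++ object
      nb::inst_mark_ready(obj);                  // mark the instance as constructed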

    • Python type wrappers: The nb::handle_of<T> type behaves just like the nb::handle class and wraps a PyObject * pointer. However, when binding a function that takes such an argument, nanobind will only call the associated function overload when the underlying Python object wraps a C++ instance of type T.
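
      A small sketch, assuming MyType was previously bound via nb::class_<MyType> and has a hypothetical member value:

      m.def("process", [](nb::handle_of<MyType> h) {
          // Only invoked when 'h' wraps a C++ instance of MyType
          MyType &t = nb::cast<MyType &>(h);
          t.value += 1;
      });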

    • Raw docstrings: In cases where absolute control over docstrings is required (for example, so that complex cases can be parsed by a tool like Sphinx), the nb::raw_doc attribute can be specified to functions. In this case, nanobind will skip generation of a combined docstring that enumerates overloads along with type information.

      Example:

      m.def("identity", [](float arg) { return arg; });
      m.def("identity", [](int arg) { return arg; },
            nb::raw_doc(
                "identity(arg)\n"
                "An identity function for integers and floats\n"
                "\n"
                "Args:\n"
                "    arg (float | int): Input value\n"
                "\n"
                "Returns:\n"
                "    float | int: Result of the identity operation"));

      Writing detailed docstrings in this way is rather tedious. In practice, they would usually be extracted from C++ headers using a tool like pybind11_mkdoc.

How to cite this project?

Please use the following BibTeX template to cite nanobind in scientific discourse:

@misc{nanobind,
   author = {Wenzel Jakob},
   year = {2022},
   note = {https://github.com/wjakob/nanobind},
   title = {nanobind -- Seamless operability between C++17 and Python}
}
