
nx's Introduction

This repository currently holds the following projects:

  • Nx - Multi-dimensional arrays (tensors) and numerical definitions for Elixir

  • EXLA - Google's XLA (Accelerated Linear Algebra) compiler/backend for Nx

  • Torchx - LibTorch backend for Nx

Each has its own README, which you can access above to learn more. They will be extracted to their own repositories in the future. Examples and benchmarks are available in the EXLA project.

Check our organization page for a general introduction to Machine Learning in Elixir.

nx's People

Contributors

adamt, alexjuca, benjamin-philip, cocoa-xu, dannote, elbow-jason, floatn, grzuy, howard0su, ian-gl, jan-swiatek, john-sonz, jonatanklosko, josevalim, kenken-neko, krstopro, kshalot, laursisask, linjunpop, msluszniak, polvalente, seanmor5, t-rutten, thefirstavenger, thiagopromano, tiagodavi, versilov, wfvining, wojtekmach, zacky1972


nx's Issues

Fix gradient for interior padding

With #134 we now support interior padding, but unfortunately the gradient is broken for Nx.pad. In order to fix it, we will need to introduce Nx.slice.

Convolution causes segmentation fault

I just built without Docker; all tests pass except for convolution, which causes a segmentation fault. I have a feeling it's a version issue, because there was a warning along these lines:

Failed to determine best cudnn convolution algorithm. Falling back to default. Performance may be sub optimal.

For reference:

  • OS: Ubuntu 20.04, kernel 5.8.0-38-generic
  • NVidia Driver Version: 460.32.03
  • CUDA 11.0
  • libcudnn-8.0.4

Nx roadmap

  • Random generators
  • Unary ops
  • Element-wise binary ops
  • Element-wise comparison ops
  • Other operations provided by XLA as needed
  • FP16 types
  • Complex types
  • Handling of NaN and Infinity (?)

Introduce defn =

I would like to collect your opinion on introducing a defn variant that uses the equal sign, so instead of this:

defn tanh(_shape, y, _x), do: 1.0 - y * y

we can write this:

defn tanh(_shape, y, _x) = 1.0 - y * y

The pro is that it looks closer to mathematical definitions and it is less verbose. The con is that it looks less like regular Elixir code, although that can also be a plus, since it helps find defn functions in a module with mixed definitions.

If we decide to go down this route, we have two options:

  1. Require the = syntax everywhere; a multiline definition looks like this:
 defn gradient({w1, b1, w2, b2}, batch_images, batch_labels) =
        (
          grad_b1 = grad_b1(batch_labels)
          grad_b2 = grad_b2(w2, batch_labels)
          grad_w1 = grad_w1(w2, batch_images, batch_labels)
          grad_w2 = grad_w2(w1, b1, batch_images, batch_labels)
          {grad_w1, grad_b1, grad_w2, grad_b2}
        )

  2. Allow both do/end and =

What are your thoughts?

Handling of NaN and infinity

Today Nx operations fail if they find a NaN and/or Infinity (although defn behaviour will be compiler independent). Do we need to implement handling of NaN and infinity within Nx? What are the use cases?

Introduce Nx.Device

Proposal: we will change the :data field of Nx.Tensor to be a
{device_module, term}. The default is {Nx.BinaryDevice, binary}.

All functions in Nx will expect the device to be the binary device.
If the device is elsewhere (think GPU), then it needs to be either
read or transferred.

# Transfers data to the given device.
#
#     Nx.device_transfer(tensor)
#     Nx.device_transfer(tensor, Exla.NxDevice, device: {:cuda, 1})
#
# If a device is not given, Nx.BinaryDevice is used, which means
# the data is read into an Elixir binary. If the device is already
# Nx.BinaryDevice, it returns the tensor as is.
#
# Once transfer is done, the data is deallocated from the given
# tensor.
#
# If the device has already been deallocated, it raises.
Nx.device_transfer(tensor, device \\ Nx.BinaryDevice, device_opts \\ [])

# Read data that is on the device.
#
# It returns a tensor where the device is Nx.BinaryDevice.
# The data is not deallocated from the current device. If the
# device is already Nx.BinaryDevice, it returns the tensor as is.
#
# If the device has already been deallocated, it raises.
Nx.device_read(tensor)

# Deallocates data from device. Returns either
# :ok or :already_deallocated.
Nx.device_deallocate(tensor)

The Nx.Device behaviour

defmodule Nx.Device do
  @type state :: term

  @callback transfer(source :: {module, term}, type :: term, dims :: tuple, opts :: keyword) :: state
  @callback read(term) :: bitstring
  @callback deallocate(term) :: :ok | :already_deallocated
end
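
For illustration, a hedged sketch of how the default Nx.BinaryDevice could satisfy the behaviour above. This is only a sketch of the proposal, not an actual implementation; the "device state" for the binary device is simply the binary itself.

    defmodule Nx.BinaryDevice do
      @behaviour Nx.Device

      # Transferring to the binary device means reading the data out of the
      # source device into an Elixir binary.
      @impl true
      def transfer({from_mod, from_state}, _type, _dims, _opts) do
        from_mod.read(from_state)
      end

      # Reading is a no-op: the state is already a binary.
      @impl true
      def read(binary) when is_bitstring(binary), do: binary

      # Binaries are managed by the VM, so there is nothing to free explicitly.
      @impl true
      def deallocate(_binary), do: :ok
    end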

Fix memory leak on cpu platform with non-zero copied binaries

Today exla_client.cc only zero-copies binaries under certain conditions, but run treats it as if zero-copy always happens.

My suggestion is to break this function in two: one for zero-copy allocation and one for device allocation. Then, in run, we will try to zero-copy and, if we cannot, we call the device allocation function, storing whether or not each input was zero-copied in a separate vector.

Then, after running, we traverse the inputs again, choosing to either release or zero copy them based on the results.

defn roadmap

Constructs:

  • Math operators
  • Bit operators
  • Logical operators
  • Slices (access + put_in/update_in)
  • Conditionals (if/cond)
  • Tuples (with pattern matching)
  • Loops (for? while?)
  • Support random functions
  • Default arguments

Passes:

  • Autograd
  • vmap
  • pmap

Reading wrong binary from device

If you wrap the contents of this test in a for:

https://github.com/elixir-nx/exla/blob/main/test/exla/nx_device_test.exs#L4

Like this:

  test "transfers data from nx<->device" do
    for _ <- 1..10000 do
    t = Nx.tensor([1, 2, 3, 4])

Then it fails with 100% certainty for me. I think we are doing one of the following:

  1. When sending data to the device, we are using zero-copy (which we shouldn't; we should only zero-copy on run, since we know the binary can't be GCed in the meantime)

  2. When reading the data for a CPU device, we are pointing to a place in memory instead of allocating an Erlang binary (unlikely?)

Make Nx.dot/4 aware of batch dimensions

This is related to #174 so perhaps it's more appropriate to move this into the discussion in that issue.

Right now, the dot product is only aware of contracting dimensions. One example of something we can't support right now is a bilinear transformation like in: https://pytorch.org/docs/stable/nn.functional.html#bilinear

Given input shapes {a, b}, {z, b, c}, and {a, c}, for a bilinear transformation the output shape should be {a, z}, given by:

input1
|> Nx.dot(weight)
|> Nx.dot(input2)

However, even if we configure the contracting dimensions correctly, we'll end up with shape {a, z, a}. Instead, we need to treat the first dimension as a batch dimension, so the final dot product between shapes {a, z, c} and {a, c} is really a batched dot product between shapes {z, c} and {c}, resulting in an output of {a, z}.
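
For concreteness, a hedged sketch of the shape mismatch with small illustrative shapes (a = 2, b = 3, z = 4, c = 5), assuming the Nx.dot/4 form that takes the contracting axes explicitly:

    input1 = Nx.iota({2, 3}, type: {:f, 32})     # {a, b}
    weight = Nx.iota({4, 3, 5}, type: {:f, 32})  # {z, b, c}
    input2 = Nx.iota({2, 5}, type: {:f, 32})     # {a, c}

    result =
      input1
      |> Nx.dot([1], weight, [1])   # contract b: {a, b} . {z, b, c} -> {a, z, c}
      |> Nx.dot([2], input2, [1])   # contract c: {a, z, c} . {a, c} -> {a, z, a}

    Nx.shape(result)                # {2, 4, 2}, i.e. {a, z, a} instead of {a, z}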

Introduce while loops

I would like to add a higher-level looping construct to defn. In Elixir, we can use for+:reduce, but I am afraid it will be too verbose and foreign for new users. Let's imagine we want to translate this Python code:

    def _smooth(x):
      out = np.empty_like(x)
      for i in range(1, x.shape[0] - 1):
          for j in range(1, x.shape[1] - 1):
              out[i, j] = (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                           x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                           x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) // 9
      return out

With for+:reduce, we would write it as:

    def smooth(x) do
      for i <- 1..elem(x.shape, 0)-1, j <- 1..elem(x.shape, 1)-1, reduce: x do
        x ->
          put_in x[i, j], (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                           x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                           x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) / 9
      end
    end

I propose we introduce a loop construct, inspired by futhark that looks like this:

loop tuple_or_var [= expr], [pattern <- expr]+ do

end

Rewriting the above to this loop construct, we have:

    def smooth(x) do
      loop x,
           i <- 1..elem(x.shape, 0)-1,
           j <- 1..elem(x.shape, 1)-1 do
        put_in x[i, j], (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                         x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                         x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) / 9
      end
    end

There is one downside with this approach: the only form of loop we have in XLA is the while loop, which is sequential. Other languages, such as taichi, can optimize these loops to run in parallel. For this reason, we may want to introduce higher-level constructs for manipulating tensors. In particular, I believe we should introduce functions such as map, map_with_index, reduce and reduce_with_index. I have some thoughts on how we can implement said functions so they also work with batching out of the box, but I am waiting for some feedback on this issue before moving forward.

Implement Infeed/Outfeed Ops

I have been reading a bit about XLA's Infeed/Outfeed ops and how they can be used to send data between host and device during a computation, which presents an opportunity for what I think would be significant performance increases.

JAX currently holds the MLPerf records training on a large TPU cluster. Looking at their ResNet implementation you can see how their training loop makes use of XLA Ops (loops and infeeds) to speed up training. You can use Infeeds to perform multiple steps per batch without having to rerun the computation. Infeeds accept input shapes and tokens which are used for enforcing an order between operations across replicas/partitions. Adding this feature would allow us to do something similar for whatever NN library we decide to implement, and would also give users the flexibility to speed up their own custom training loops.

In the same sense that you can pass data to a device during a computation, you can receive data from a still-running computation using outfeeds. An outfeed accepts a shape as well, and then an outfeed receiver handles data received from the Outfeed. The Python XLA client implements an outfeed receiver in C++, but reading the implementation notes, it seems like Elixir is a perfect fit for handling everything the Python Outfeed Receiver is trying to do in C++.

Reading about infeeds/outfeeds in the context of TPUs, it seems that they are almost always infeed/outfeed bound, so taking advantage of TPUs in "coreless" mode is really important for performance. A TPU running in coreless mode is basically just using the TPU's CPU, which has 300GB of memory that can be used for preprocessing/transformations in the data pipeline. It seems the most efficient way to train with a TPU would be to:

  1. Write a training loop that makes use of Infeeds/Outfeeds for actually processing the neural network
  2. Implement an input pipeline that takes advantage of the TPU's CPU to do transformations. These transformations can also be defn-compiled functions when needed, or just plain Elixir for IO stuff. An additional advantage we have is that it should be very straightforward to do these transformations in parallel to pass to multiple TPU cores. A single TPU Pod has 2048 TPU cores, so they are massively parallel, and Elixir is the perfect language for handling this.
  3. In the same respect as above, we need an outfeed receiver that handles a massively parallel training pipeline.

One of the big questions is how we implement something like this so it's backend agnostic. Having an Nx.infeed or an Nx.outfeed wouldn't really make sense; I think these probably tie in best with Nx.Device.

Allow on device references

  • keep_on_device should default to false
  • when keep_on_device is true, it returns a ref (both for CPU and GPU)
  • the ref will have two functions: read and deallocate - deallocating sets the buffer it holds to null
  • trying to read a deallocated ref will raise a proper error. deallocating a deallocated ref just works
  • whenever a ref is given to run, it will be deallocated at the end of the run (if we give a deallocated ref, it raises)
  • add binary_to_device which returns a ref (it is a way to allocate without having to run something)

Maybe we can call those binary_to_device_mem, read_device_mem and deallocate_device_mem?
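
Roughly, the proposed lifecycle would look like this. All names below follow the bullets above and are part of the proposal, not an existing API; client, shape and executable are placeholders:

    # Allocate on the device without running anything.
    ref = Exla.binary_to_device(client, <<1::32-native, 2::32-native>>, shape)

    # Passing a ref to run deallocates it once the run finishes; with
    # keep_on_device: true the result comes back as a ref as well.
    out_ref = Exla.Executable.run(executable, [ref], keep_on_device: true)

    binary = Exla.read(out_ref)  # copies the data back without deallocating
    Exla.deallocate(out_ref)     # frees the underlying buffer
    Exla.deallocate(out_ref)     # deallocating twice just works
    Exla.read(out_ref)           # reading a deallocated ref raises a proper error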

Named tensors

This is a discussion that showed up internally and apparently there is a similar movement in PyTorch: http://nlp.seas.harvard.edu/NamedTensor

Some questions are:

  1. Should we provide some default names?

  2. Should we support constant names? i.e. have :i, :j, :k, etc work regardless of the names in the tensor. This can allow people to create generic algorithms without hardcoding dimensions.

  3. Should we allow specifying which dimensions are batch dimensions (so we get vmap but without making it a transform)? Similar to the privacy section in the document above but already generalized for batching.

Do not allow closures in computations

  • Validate that the function given to reduce is not really a closure (i.e., it does not reference the parent's expressions); see the sketch below

  • Make sure functions become expressions when handling Expr for grads (instead of being invoked on the compiler)
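
For example, a sketch of the kind of closure the validation should reject (the reduce call itself is just illustrative):

    defmodule ClosureExample do
      import Nx.Defn

      # The anonymous function closes over `y` from the parent defn scope,
      # so it is a closure and should be rejected.
      defn bad_reduce(t, y) do
        Nx.reduce(t, 0, fn x, acc -> x + acc + y end)
      end
    end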

Add types to Nx.Defn.Expr

Then we make Exla work with a superset of the given types, keeping its internal representation as much as possible.

This means we can implement quantization, type replacement, and similar with Expr instead of implementing it for each compiler.

Add to_heatmap

We should introduce a to_heatmap conversion, similar to how it is done in Matrex.

The Nx.Util module seems to be the best candidate for this function. Currently, Nx.Util is a handful of conversion functions from tensors to flat lists and scalars. However, we could also argue that heatmap belongs in Nx because in practice it is two operations:

  1. Normalizing the data to heatmap values based on its min and max

  2. Converting the values above to an ansi value we can print

The first part could be implemented in Nx, which means it could be called inside defn and performed more efficiently. At the same time, step 1 is done with two traversals of a binary, which is relatively fast anyway.

To achieve step 2, my suggestion is for to_heatmap to return an Nx.Heatmap struct, which wraps the tensor and implements the Inspect protocol so it prints the heatmap. The advantage of making it a struct is that we can also implement this protocol for LiveBook and so on.
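
As a rough sketch of step 1, assuming min/max aggregations such as reduce_min/reduce_max are available (they are tracked in the aggregate ops issue below); step 2 would then map each normalized value to one of the available ANSI intensities when inspecting the Nx.Heatmap struct:

    defmodule HeatmapSketch do
      import Nx.Defn

      # Step 1: scale every value into [0, 1] based on the tensor's min and max.
      defn normalize(t) do
        min = Nx.reduce_min(t)
        max = Nx.reduce_max(t)
        (t - min) / (max - min)
      end
    end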

The heatmap will apply to the 2 lowest dimensions. If there are higher dimensions, they will be wrapped in lists. For example, a 14x14 tensor would look like this:

#Nx.Heatmap<
  s64[14][14]
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
>

However, if the dimensions are {2, 14, 14}, we would get:

#Nx.Heatmap<
  s64[2][14][14]
  [
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............,

    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
  ]
>

Default arguments do not work with only 1 argument

As an example, this does not compile:

  defn uniform(opts \\ []) do
    shape = transform(opts, &Keyword.fetch!(&1, :shape))
    opts = keyword!(opts, type: {:f, 32}, shape: {}, scale: 1.0e-2)
    Nx.random_uniform(shape, type: opts[:type]) * opts[:scale]
  end

But this does:

  defn uniform(_t, opts \\ []) do
    shape = transform(opts, &Keyword.fetch!(&1, :shape))
    opts = keyword!(opts, type: {:f, 32}, shape: {}, scale: 1.0e-2)
    Nx.random_uniform(shape, type: opts[:type]) * opts[:scale]
  end

Even though t is unused.

Support groups in convolution

Grouped Convolutions are common enough that we should support them. Here's some discussion and papers on grouped convolutions: https://paperswithcode.com/method/grouped-convolution

XLA supports grouped convolutions with its feature_group_count option (right now we always set it to 1). We will just need to add support for groups in the binary implementation of Nx.conv. This involves an additional iteration in the binary implementation as well as some additional shape checks.

XLA also supports batch_group_count for grouping parts of the batch. I see less value in adding this option as I haven't been able to find any applications that are all that common.
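
For reference, a hedged sketch of the extra shape checks, following XLA's feature_group_count semantics (the function and its arguments are illustrative, not an existing Nx API):

    defmodule GroupedConvCheck do
      # For feature_group_count = groups:
      #   * the input channel count must be divisible by groups
      #   * the kernel input channel dimension must equal input_channels / groups
      #   * the output channel count must be divisible by groups
      def validate!(input_channels, kernel_input_channels, output_channels, groups) do
        cond do
          rem(input_channels, groups) != 0 ->
            raise ArgumentError, "input channels must be divisible by groups"

          kernel_input_channels != div(input_channels, groups) ->
            raise ArgumentError, "kernel input channels must equal input_channels / groups"

          rem(output_channels, groups) != 0 ->
            raise ArgumentError, "output channels must be divisible by groups"

          true ->
            :ok
        end
      end
    end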

Add missing gradients

Currently missing gradients for these functions:

  • Product
  • Scatter Window Max
  • Scatter Window Min
  • Scatter Add
  • LU Decomposition
  • Triangular Solve
  • SVD

Make Nx a meta module

I have been reading more into sparse matrices and one of the issues in scikit-learn is that, when you have a sparse matrix, it is strongly discouraged to use np functions on it, because they lose their sparse properties. Instead, you should use the sparse-specific versions.

While this is a fine recommendation (and generally suitable to Elixir), it introduces an issue in that defn functions only allow function calls to the Nx module. Given we most likely want to support sparse specific compilers in defn in the future, this puts us in a rough spot:

  1. If we want sparse matrices in defn, we need to call Nx with sparse matrices
  2. Outside of defn, you should not call Nx with sparse matrices

I believe we can address this with (surprise, surprise!) one extra level of indirection. We should make the Nx module a meta-module that doesn't really implement the operations but just defines their API. For example, the exp implementation would rather look like this:

def exp(tensor) do
  tensor = tensor(tensor)
  tensor.data.__struct__.exp(tensor)
end

This has a couple benefits:

  1. Now people can bring sparse matrices and use Nx module as is
  2. We can unify Nx and Nx.Defn.Expr because it can be implemented using the same API as above

While I would be generally worried about this approach since the indirection can affect performance, the cost of an extra function call is irrelevant when working with tensors, so we are fine.
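
To make the indirection concrete, here is a toy sketch of two hypothetical data structs plugging into that dispatch; neither is a real Nx module, and Nx.exp/1 would pick the implementation purely from tensor.data.__struct__:

    defmodule MyDenseData do
      # Hypothetical eager backend: stores a flat list and applies exp directly.
      defstruct [:list]

      def exp(%Nx.Tensor{data: %__MODULE__{list: list} = data} = t) do
        %{t | data: %{data | list: Enum.map(list, &:math.exp/1)}}
      end
    end

    defmodule MyExprData do
      # Hypothetical symbolic backend: records the operation instead of running it.
      defstruct [:op, :args]

      def exp(%Nx.Tensor{} = t) do
        %{t | data: %MyExprData{op: :exp, args: [t]}}
      end
    end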

GPU transfer from NIF on run causes OOM

Currently, we explicitly transfer buffers to the device before calling the run NIF from within Exla.Executable.run when the platform is a GPU like this:

    outside_cpu = client.platform == :cuda || client.platform == :rocm
    keep_on_device_int = if keep_on_device || outside_cpu, do: 1, else: 0

    device_id = device_assignment_to_device_id(executable, {replica, partition})

    inputs =
      Enum.map(arguments, fn
        %Buffer{ref: {ref, _}, data: nil} ->
          ref

        buffer = %Buffer{data: data, shape: shape, ref: nil} ->
          if outside_cpu do
            %Buffer{ref: {ref, _}} = Buffer.place_on_device(buffer, client, device_id)
            ref
          else
            {data, shape.ref}
          end
      end)

Originally, this was done to decouple what could have made the run NIF IO bound or CPU bound. After determining the GPU case is always closer to IO bound, we no longer need this logic; however, removing it leads to OOM errors on the MNIST example.

Build Failure: No module named 'numpy'

I checked out the project for the first time and encountered an error during the first build attempt when running mix test:

➜  exla git:(main) mix test

... <successes omitted> ...

Repository rule local_python_configure defined at:
  /home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl:275:26: in <toplevel>
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (13 packages loaded, 14 targets configured)
ERROR: An error occurred during the fetch of repository 'local_execution_config_python':
   Traceback (most recent call last):
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 213
		_get_numpy_include(<2 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 187, in _get_numpy_include
		execute(repository_ctx, <3 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/remote_config/common.bzl", line 217, in execute
		fail(<1 more arguments>)
Problem getting numpy include path.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
Is numpy installed?
INFO: Repository go_sdk instantiated at:
  no stack (--record_rule_instantiation_callstack not enabled)
Repository rule _go_download_sdk defined at:
  /home/elbow-jason/.cache/bazel/_bazel_elbow-jason/bbfe9bcff2dc48e9f808ab24728cb493/external/io_bazel_rules_go/go/private/sdk.bzl:52:20: in <toplevel>
ERROR: Analysis of target '//tensorflow/compiler/xla/exla:libexla.so' failed; build aborted: Traceback (most recent call last):
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 213
		_get_numpy_include(<2 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 187, in _get_numpy_include
		execute(repository_ctx, <3 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/remote_config/common.bzl", line 217, in execute
		fail(<1 more arguments>)
Problem getting numpy include path.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
Is numpy installed?
INFO: Elapsed time: 21.495s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (13 packages loaded, 14 targets configured)
FAILED: Build did NOT complete successfully (13 packages loaded, 14 targets configured)
make: *** [Makefile:25: all] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

Not sure what system info is needed for diagnosis. Some basic info:

➜  exla git:(main) uname -a
Linux elbow-at-home 5.8.0-38-generic #43~20.04.1-Ubuntu SMP Tue Jan 12 16:39:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

➜  exla git:(main) gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

➜  exla git:(main) make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

➜  exla git:(main) asdf current erlang
erlang          23.1.5          /home/elbow-jason/.tool-versions
➜  exla git:(main) asdf current elixir
elixir          1.11.2-otp-23   /home/elbow-jason/.tool-versions
➜  exla git:(main) asdf current python
python          3.6.12          /home/elbow-jason/.tool-versions

Pin EXLA to a TF Stable Release

Right now, we're working off a recent TF master revision. We'll want to pin to a more stable release instead, and then work out a process for periodically updating the TF version.

Add streaming device allocation

For example, say we need to load 2GB of data. Today we first have to load it into memory and then load it into the GPU. We would like to do that without having to load it all onto the CPU first.

The biggest question is what would be the API on the Elixir side. We could have a Stream based one (immutable) or a process-based one (mutable). Perhaps both?
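
As a strawman for the Stream-based option, loading could look something like this. Exla.DeviceStream and all of its functions are hypothetical names invented for this sketch; client and device_id are placeholders:

    stream = Exla.DeviceStream.open(client, device_id)

    "training_data.bin"
    |> File.stream!([], 64 * 1024 * 1024)     # stream the file in 64MB chunks
    |> Enum.reduce(stream, fn chunk, stream ->
      Exla.DeviceStream.push(stream, chunk)   # append the chunk directly on the device
    end)
    |> Exla.DeviceStream.close()              # returns an on-device buffer reference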

Support common LinAlg primitives

For now, at least:

  • QR decomposition
    • (EXLA implementation missing)
  • LU decomposition
  • Cholesky decomposition
  • SVD
  • Eigen decomposition (hermitian)
  • Eigen decomposition (general)
  • Triangular Solve
  • Norms (L0, L1, L2)
    • ord: -2 for 2D missing
    • nuclear missing

Support common mathematical functions

Leaving this here for tracking/discussion on what should and shouldn't be included. Looking at XLA/JAX/Numpy, these are some common ones we are missing:

  • acos
  • asin
  • atan
  • tan
  • acosh
  • asinh
  • atanh
  • cosh
  • sinh
  • erf
  • erfc
  • erf_inv

There are a few others as well that we should look into, but these seemed to be the most common.

Tag dirty NIFs

We will want to tag long-running NIFs as dirty. Probably compile (a dirty CPU NIF) and run (which should have two versions: the CPU one is a dirty CPU NIF and the GPU one is a dirty IO NIF).

elixir_make retired and doesn't have permissions to mix.exs

Following the README:

➜  exla git:(main) mix deps.get
Resolving Hex dependencies...
Dependency resolution completed:
Unchanged:
  benchee 1.0.1
  deep_merge 1.0.0
  elixir_make 0.6.1 RETIRED!
    (invalid) Wrong permissions on the mix.exs file
All dependencies are up to date

device roadmap

  • Streamline and document device ordinal handling
  • Add an option to return supported platforms
  • Allocate data on multiple devices by a given dimension
  • Stream data into device
  • Stream data into multiple devices (for pmap)
  • Read data from multiple devices into a single binary (for pmap)

Add basic operator fusion to Nx.Defn

While the EXLA compiler already performs operator fusion, we should also be able to do some operator fusion for the built-in compiler written in Elixir. We can fuse most unary operators, as well as binary operators with constants, to reduce the number of traversals. It still won't hold a candle to the compiled modes, but it should be an overall positive for performance.
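
As a toy illustration of the idea, independent of Nx.Defn.Expr internals: fusing unary operators means performing a single traversal with the composed function instead of one traversal per operator.

    data = Enum.to_list(1..5)

    # Two traversals: negate, then exp.
    two_pass = data |> Enum.map(&(-&1)) |> Enum.map(&:math.exp/1)

    # One traversal with the fused function.
    one_pass = Enum.map(data, fn x -> :math.exp(-x) end)

    true = one_pass == two_pass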

Broadcasting a constant does not produce an Expr

This test will not compile:

defn zeros(opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {})
  Nx.as_type(Nx.broadcast(0, shape), opts[:type])
end

test "zeros with default args" do
  assert zeros(shape: {}) == Nx.tensor(0.0, type: {:f, 32})
end

This test will compile but raise (at least on Nx.Defn backend):

defn zeros(_t, opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {})
  Nx.as_type(Nx.broadcast(0, shape), opts[:type])
end

test "zeros with default args" do
  assert zeros(Nx.tensor([1, 2, 3]), shape: {}) == Nx.tensor(0.0, type: {:f, 32})
end
** (ArgumentError) defn must return an expression tensor or a tuple, got: #Nx.Tensor<
       f32
       0.0
     >

This workaround will also raise:

    defmodule Zeros do
      def zeros(shape, opts \\ []) do
        type = opts[:type] || {:f, 32}
        constant(0, [shape: shape, type: type])
      end

      defnp constant(t, opts) do
        opts = keyword!(opts, type: {:f, 32}, shape: {})
        Nx.as_type(Nx.broadcast(t, opts[:shape]), opts[:type])
      end
    end
** (ArgumentError) defn functions expects either numbers or tensors as arguments. Got: [shape: [shape: {}], type: {:f, 32}]

Support :max_float_type

max_float_type will rewrite all floats with equal or higher size to the given one.

Compiler option

@defn_compiler {EXLA, max_float_type: {:bf, 16}}
  • Pros: most efficient
  • Cons: each compiler has to implement it which can be error prone

Transformer (like grad)

max_float_type(expr, {:bf, 16})
  • Pros: one implementation for all compilers
  • Pros: easy to implement thanks to the new Expr.traverse_args
  • Cons: has to traverse the expression after the expression is built

If we go the transformer route, we could make the transformer module public, which means we can support both APIs (either via a transform or via an option). Implementing the traversal is easy and requires rewriting the types of the appropriate nodes. Only a handful of nodes require special attention:

  1. Parameters will have to be cast down as an explicit conversion step
  2. Functions will have to be recomputed
  3. Tensors will have to be recast
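
A small sketch of the rewrite rule itself (the Expr traversal is omitted): floats whose size is greater than or equal to the target size are replaced by the target type, everything else is left alone.

    defmodule MaxFloatType do
      def cap({:f, size}, {_kind, target_size} = target) when size >= target_size, do: target
      def cap({:bf, size}, {_kind, target_size} = target) when size >= target_size, do: target
      def cap(type, _target), do: type
    end

    MaxFloatType.cap({:f, 64}, {:bf, 16})  #=> {:bf, 16}
    MaxFloatType.cap({:f, 32}, {:bf, 16})  #=> {:bf, 16}
    MaxFloatType.cap({:s, 64}, {:bf, 16})  #=> {:s, 64} (integers are untouched)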

Add common aggregate ops

So far we have mean and sum, but we are missing:

  • reduce_max
  • reduce_min
  • reduce_prod

As well as variants for reduce_window:

  • reduce_window_sum
  • reduce_window_mean
  • reduce_window_max
  • reduce_window_min
  • reduce_window_prod

We could also probably tackle the cumulative ops. We can also consider different names for these functions. Maybe rather than reduce_max and reduce_min, just a max/1 and min/1.

Add `keep_dims` option to aggregate functions

Most NumPy aggregate functions have a keepdims option that will keep the reduced dimensions as size 1 so the rank of the tensor stays the same. This is very easily implemented with some shape changes from Nx and an additional reshape in EXLA.

It's very common, especially in neural network ops. Numerous examples in here: https://github.com/google/flax/blob/master/flax/core/nn/normalization.py

Implementing something similar in defn would, I believe, involve a transform to calculate the reshape, which would be slightly annoying to have to do every time.
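
A hedged sketch of emulating keep_dims today with an explicit reshape, assuming Nx.sum accepts an :axes option (axis handling is simplified to a single known axis):

    t = Nx.iota({2, 3}, type: {:f, 32})

    summed = Nx.sum(t, axes: [1])        # shape {2}: the reduced axis is dropped
    kept   = Nx.reshape(summed, {2, 1})  # shape {2, 1}: rank preserved, reduced axis kept as size 1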
