
nx's Introduction

This repository currently holds the following projects:

  • Nx - Multi-dimensional arrays (tensors) and numerical definitions for Elixir

  • EXLA - Google's XLA (Accelerated Linear Algebra) compiler/backend for Nx

  • Torchx - LibTorch backend for Nx

Each has its own README, which you can access above to learn more. They will be extracted to their own repositories in the future. Examples and benchmarks are available in the EXLA project.

Check our organization page for a general introduction to Machine Learning in Elixir.

nx's People

Contributors

adamt, alexjuca, benjamin-philip, cocoa-xu, dannote, elbow-jason, floatn, grzuy, howard0su, ian-gl, jan-swiatek, john-sonz, jonatanklosko, josevalim, kenken-neko, krstopro, kshalot, laursisask, linjunpop, msluszniak, polvalente, seanmor5, t-rutten, thefirstavenger, thiagopromano, tiagodavi, versilov, wfvining, wojtekmach, zacky1972


nx's Issues

Fix gradient for interior padding

With #134 we now support interior padding, but unfortunately the gradient is broken for Nx.pad. In order to fix it, we will need to introduce Nx.slice.

Convolution causes segmentation fault

I just built without Docker; all tests pass except for convolution, which causes a segmentation fault. I have a feeling it's a version issue, because there was a warning along these lines:

Failed to determine best cudnn convolution algorithm. Falling back to default. Performance may be sub optimal.

For reference:

  • OS: Ubuntu 20.04, kernel 5.8.0-38-generic
  • NVidia Driver Version: 460.32.03
  • CUDA 11.0
  • libcudnn-8.0.4

Nx roadmap

  • Random generators
  • Unary ops
  • Element-wise binary ops
  • Element-wise comparison ops
  • Other operations provided by XLA as needed
  • FP16 types
  • Complex types
  • Handling of NaN and Infinity (?)

Introduce defn =

I would like to collect your opinion on introducing a defn variant that uses the equal sign, so instead of this:

defn tanh(_shape, y, _x), do: 1.0 - y * y

we can write this:

defn tanh(_shape, y, _x) = 1.0 - y * y

The pro is that it looks closer to mathematical definitions and it is less verbose. The con is that it looks less like regular Elixir code, although that can also be a plus, since it helps find defn functions in a module with mixed definitions.

If we decide to go down this route, we have two options:

  1. Require the = syntax everywhere; a multiline definition looks like this:
 defn gradient({w1, b1, w2, b2}, batch_images, batch_labels) =
        (
          grad_b1 = grad_b1(batch_labels)
          grad_b2 = grad_b2(w2, batch_labels)
          grad_w1 = grad_w1(w2, batch_images, batch_labels)
          grad_w2 = grad_w2(w1, b1, batch_images, batch_labels)
          {grad_w1, grad_b1, grad_w2, grad_b2}
        )

  2. Allow both do/end and =

What are your thoughts?

Handling of NaN and infinity

Today Nx operations fail if they find a NaN and/or Infinity (although defn behaviour will be compiler independent). Do we need to implement handling of NaN and infinity within Nx? What are the use cases?

Introduce Nx.Device

Proposal: we will change the :data field of Nx.Tensor to be a
{device_module, term}. The default is {Nx.BinaryDevice, binary}.

All functions in Nx will expect the device to be the binary device.
If the device is elsewhere (think GPU), then it needs to be either
read or transferred.

# Transfers data to the given device.
#
#     Nx.device_transfer(tensor)
#     Nx.device_transfer(tensor, Exla.NxDevice, device: {:cuda, 1})
#
# If a device is not given, Nx.BinaryDevice is used, which means
# the data is read into an Elixir binary. If the device is already
# Nx.BinaryDevice, it returns the tensor as is.
#
# Once transfer is done, the data is deallocated from the given
# tensor.
#
# If the device has already been deallocated, it raises.
Nx.device_transfer(tensor, device \\ Nx.BinaryDevice, device_opts \\ [])

# Read data that is on the device.
#
# It returns a tensor where the device is Nx.BinaryDevice.
# The data is not deallocated from the current device. If the
# device is already Nx.BinaryDevice, it returns the tensor as is.
#
# If the device has already been deallocated, it raises.
Nx.device_read(tensor)

# Deallocates data from device. Returns either
# :ok or :already_deallocated.
Nx.device_deallocate(tensor)

The Nx.Device behaviour

defmodule Nx.Device do
  @type state :: term

  @callback transfer(source :: {module, term}, type :: term, dims :: tuple, opts :: keyword) :: state
  @callback read(term) :: bitstring
  @callback deallocate(term) :: :ok | :already_deallocated
end
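
For illustration, a hedged sketch of how the default Nx.BinaryDevice could satisfy the behaviour above. This is only a sketch of the proposal, not an actual implementation; the "device state" for the binary device is simply the binary itself.

    defmodule Nx.BinaryDevice do
      @behaviour Nx.Device

      # Transferring to the binary device means reading the data out of the
      # source device into an Elixir binary.
      @impl true
      def transfer({from_mod, from_state}, _type, _dims, _opts) do
        from_mod.read(from_state)
      end

      # Reading is a no-op: the state is already a binary.
      @impl true
      def read(binary) when is_bitstring(binary), do: binary

      # Binaries are managed by the VM, so there is nothing to free explicitly.
      @impl true
      def deallocate(_binary), do: :ok
    end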

Fix memory leak on cpu platform with non-zero copied binaries

Today exla_client.cc only zero-copies binaries under certain conditions, but run treats it as if zero-copy always happens.

My suggestion is to break this function in two: one for zero-copy allocation and one for device allocation. Then, in run, we will try to zero-copy and, if we cannot, we call the device allocation function, storing whether or not each input was zero-copied in a separate vector.

Then, after running, we traverse the inputs again, choosing to either release or zero copy them based on the results.

defn roadmap

Constructs:

  • Math operators
  • Bit operators
  • Logical operators
  • Slices (access + put_in/update_in)
  • Conditionals (if/cond)
  • Tuples (with pattern matching)
  • Loops (for? while?)
  • Support random functions
  • Default arguments

Passes:

  • Autograd
  • vmap
  • pmap

Reading wrong binary from device

If you wrap the contents of this test in a for:

https://github.com/elixir-nx/exla/blob/main/test/exla/nx_device_test.exs#L4

Like this:

  test "transfers data from nx<->device" do
    for _ <- 1..10000 do
    t = Nx.tensor([1, 2, 3, 4])

Then it fails with 100% certainty for me. I think we are doing one of the following:

  1. When sending data to the device, we are using zero-copy (which we shouldn't; we should only zero-copy on run, since we know the binary can't be GCed in the meantime)

  2. When reading the data for a CPU device, we are pointing to a place in memory instead of allocating an Erlang binary (unlikely?)

Make Nx.dot/4 aware of batch dimensions

This is related to #174 so perhaps it's more appropriate to move this into the discussion in that issue.

Right now, the dot product is only aware of contracting dimensions. One example of something we can't support right now is a bilinear transformation like in: https://pytorch.org/docs/stable/nn.functional.html#bilinear

Given input shapes {a, b}, {z, b, c}, and {a, c}, for a bilinear transformation the output shape should be {a, z}, given by:

input1
|> Nx.dot(weight)
|> Nx.dot(input2)

However, even if we configure the contracting dimensions correctly, we'll end up with shape {a, z, a}. Instead, we need to treat the first dimension as a batch dimension, so the final dot product between shapes {a, z, c} and {a, c} is really a batched dot product between shapes {z, c} and {c}, resulting in an output of {a, z}.
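
For concreteness, a hedged sketch of the shape mismatch with small illustrative shapes (a = 2, b = 3, z = 4, c = 5), assuming the Nx.dot/4 form that takes the contracting axes explicitly:

    input1 = Nx.iota({2, 3}, type: {:f, 32})     # {a, b}
    weight = Nx.iota({4, 3, 5}, type: {:f, 32})  # {z, b, c}
    input2 = Nx.iota({2, 5}, type: {:f, 32})     # {a, c}

    result =
      input1
      |> Nx.dot([1], weight, [1])   # contract b: {a, b} . {z, b, c} -> {a, z, c}
      |> Nx.dot([2], input2, [1])   # contract c: {a, z, c} . {a, c} -> {a, z, a}

    Nx.shape(result)                # {2, 4, 2}, i.e. {a, z, a} instead of {a, z}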

Introduce while loops

I would like to add a higher-level looping construct to defn. In Elixir, we can use for+:reduce, but I am afraid it will be too verbose and foreign for new users. Let's imagine we want to translate this Python code:

    def _smooth(x):
      out = np.empty_like(x)
      for i in range(1, x.shape[0] - 1):
          for j in range(1, x.shape[1] - 1):
              out[i, j] = (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                           x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                           x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) // 9
      return out

With for+:reduce, we would write it as:

    def smooth(x) do
      for i <- 1..elem(x.shape, 0)-1, j <- 1..elem(x.shape, 1)-1, reduce: x do
        x ->
          put_in x[i, j], (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                           x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                           x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) / 9
      end
    end

I propose we introduce a loop construct, inspired by futhark that looks like this:

loop tuple_or_var [= expr], [pattern <- expr]+ do

end

Rewriting the above to this loop construct, we have:

    def smooth(x) do
      loop x,
           i <- 1..elem(x.shape, 0)-1,
           j <- 1..elem(x.shape, 1)-1 do
        put_in x[i, j], (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                         x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                         x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) / 9
      end
    end

There is one downside with this approach: the only form of loop we have in XLA is the while loop, which is sequential. Other languages, such as taichi, can optimize these loops to run in parallel. For this reason, we may want to introduce higher-level constructs for manipulating tensors. In particular, I believe we should introduce functions such as map, map_with_index, reduce and reduce_with_index. I have some thoughts on how we can implement said functions so they also work with batching out of the box, but I am waiting for some feedback on this issue before moving forward.

Implement Infeed/Outfeed Ops

I have been reading a bit about XLA's Infeed/Outfeed ops and how they can be used to send data between host and device during a computation, which presents an opportunity for what I think would be significant performance increases.

JAX currently holds the MLPerf records training on a large TPU cluster. Looking at their ResNet implementation you can see how their training loop makes use of XLA Ops (loops and infeeds) to speed up training. You can use Infeeds to perform multiple steps per batch without having to rerun the computation. Infeeds accept input shapes and tokens which are used for enforcing an order between operations across replicas/partitions. Adding this feature would allow us to do something similar for whatever NN library we decide to implement, and would also give users the flexibility to speed up their own custom training loops.

In the same sense that you can pass data to a device during a computation, you can receive data from a still-running computation using outfeeds. An outfeed accepts a shape as well, and then an outfeed receiver handles data received from the Outfeed. The Python XLA client implements an outfeed receiver in C++, but reading the implementation notes, it seems like Elixir is a perfect fit for handling everything the Python Outfeed Receiver is trying to do in C++.

Reading about infeeds/outfeeds in the context of TPUs, it seems that they are almost always infeed/outfeed bound, so taking advantage of TPUs in "coreless" mode is really important for performance. A TPU running in coreless mode is basically just using the TPU's CPU, which has 300GB of memory that can be used for preprocessing/transformations in the data pipeline. It seems the most efficient way to train with a TPU would be to:

  1. Write a training loop that makes use of Infeeds/Outfeeds for actually processing the neural network
  2. Implement an input pipeline that takes advantage of the TPU's CPU to do transformations. These transformations can also be defn-compiled functions when needed, or just plain Elixir for IO stuff. An additional advantage we have is that it should be very straightforward to do these transformations in parallel to pass to multiple TPU cores. A single TPU Pod has 2048 TPU cores, so they are massively parallel, and Elixir is the perfect language for handling this.
  3. In the same respect as above, we need an outfeed receiver that handles a massively parallel training pipeline.

One of the big questions is how we implement something like this so it's backend agnostic. Having an Nx.infeed or an Nx.outfeed wouldn't really make sense; I think these probably tie in best with Nx.Device.

Allow on device references

  • keep_on_device should default to false
  • when keep_on_device is true, it returns a ref (both for CPU and GPU)
  • the ref will have two functions: read and deallocate - deallocating sets the buffer it holds to null
  • trying to read a deallocated ref will raise a proper error. deallocating a deallocated ref just works
  • whenever a ref is given to run, it will be deallocated at the end of the run (if we give a deallocated ref, it raises)
  • add binary_to_device which returns a ref (it is a way to allocate without having to run something)

Maybe we can call those binary_to_device_mem, read_device_mem and deallocate_device_mem?
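
Roughly, the proposed lifecycle would look like this. All names below follow the bullets above and are part of the proposal, not an existing API; client, shape and executable are placeholders:

    # Allocate on the device without running anything.
    ref = Exla.binary_to_device(client, <<1::32-native, 2::32-native>>, shape)

    # Passing a ref to run deallocates it once the run finishes; with
    # keep_on_device: true the result comes back as a ref as well.
    out_ref = Exla.Executable.run(executable, [ref], keep_on_device: true)

    binary = Exla.read(out_ref)  # copies the data back without deallocating
    Exla.deallocate(out_ref)     # frees the underlying buffer
    Exla.deallocate(out_ref)     # deallocating twice just works
    Exla.read(out_ref)           # reading a deallocated ref raises a proper error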

Named tensors

This is a discussion that showed up internally and apparently there is a similar movement in PyTorch: http://nlp.seas.harvard.edu/NamedTensor

Some questions are:

  1. Should we provide some default names?

  2. Should we support constant names? i.e. have :i, :j, :k, etc work regardless of the names in the tensor. This can allow people to create generic algorithms without hardcoding dimensions.

  3. Should we allow specifying which dimensions are batch dimensions (so we get vmap but without making it a transform)? Similar to the privacy section in the document above but already generalized for batching.

Do not allow closures in computations

  • Validate that the function given to reduce is not really a closure (i.e., it does not reference the parent's expressions); see the sketch below

  • Make sure functions become expressions when handling Expr for grads (instead of being invoked on the compiler)
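
For example, a sketch of the kind of closure the validation should reject (the reduce call itself is just illustrative):

    defmodule ClosureExample do
      import Nx.Defn

      # The anonymous function closes over `y` from the parent defn scope,
      # so it is a closure and should be rejected.
      defn bad_reduce(t, y) do
        Nx.reduce(t, 0, fn x, acc -> x + acc + y end)
      end
    end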

Add types to Nx.Defn.Expr

Then we make Exla work with a superset of the given types, keeping its internal representation as much as possible.

This means we can implement quantization, type replacement, and similar with Expr instead of implementing it for each compiler.

Add to_heatmap

We should introduce a to_heatmap conversion, similar to how it is done in Matrex.

The Nx.Util module seems to be the best candidate for this function. Currently, Nx.Util is a handful of conversion functions from tensors to flat lists and scalars. However, we could also argue that heatmap belongs in Nx because in practice it is two operations:

  1. Normalizing the data to heatmap values based on its min and max

  2. Converting the values above to an ansi value we can print

The first part could be implemented in Nx, which means it could be called inside defn and performed more efficiently. At the same time, step 1 is done with two traversals of a binary, which is relatively fast anyway.

To achieve step 2, my suggestion is for to_heatmap to return an Nx.Heatmap struct, which wraps the tensor and implements the Inspect protocol so it prints the heatmap. The advantage of making it a struct is that we can also implement this protocol for LiveBook and so on.
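
As a rough sketch of step 1, assuming min/max aggregations such as reduce_min/reduce_max are available (they are tracked in the aggregate ops issue below); step 2 would then map each normalized value to one of the available ANSI intensities when inspecting the Nx.Heatmap struct:

    defmodule HeatmapSketch do
      import Nx.Defn

      # Step 1: scale every value into [0, 1] based on the tensor's min and max.
      defn normalize(t) do
        min = Nx.reduce_min(t)
        max = Nx.reduce_max(t)
        (t - min) / (max - min)
      end
    end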

The heatmap will apply to the 2 lowest dimensions. If there are higher dimensions, they will be wrapped in lists. For example, a 14x14 tensor would look like this:

#Nx.Heatmap<
  s64[14][14]
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
  ..............
>

However, if the dimensions are {2, 14, 14}, we would get:

#Nx.Heatmap<
  s64[2][14][14]
  [
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............,

    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
    ..............
  ]
>

Default arguments do not work with only 1 argument

As an example, this does not compile:

  defn uniform(opts \\ []) do
    shape = transform(opts, &Keyword.fetch!(&1, :shape))
    opts = keyword!(opts, type: {:f, 32}, shape: {}, scale: 1.0e-2)
    Nx.random_uniform(shape, type: opts[:type]) * opts[:scale]
  end

But this does:

  defn uniform(_t, opts \\ []) do
    shape = transform(opts, &Keyword.fetch!(&1, :shape))
    opts = keyword!(opts, type: {:f, 32}, shape: {}, scale: 1.0e-2)
    Nx.random_uniform(shape, type: opts[:type]) * opts[:scale]
  end

Even though t is unused.

Support groups in convolution

Grouped Convolutions are common enough that we should support them. Here's some discussion and papers on grouped convolutions: https://paperswithcode.com/method/grouped-convolution

XLA supports grouped convolutions with its feature_group_count option (right now we always set it to 1). We will just need to add support for groups in the binary implementation of Nx.conv. This involves an additional iteration in the binary implementation as well as some additional shape checks.

XLA also supports batch_group_count for grouping parts of the batch. I see less value in adding this option as I haven't been able to find any applications that are all that common.
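
For reference, a hedged sketch of the extra shape checks, following XLA's feature_group_count semantics (the function and its arguments are illustrative, not an existing Nx API):

    defmodule GroupedConvCheck do
      # For feature_group_count = groups:
      #   * the input channel count must be divisible by groups
      #   * the kernel input channel dimension must equal input_channels / groups
      #   * the output channel count must be divisible by groups
      def validate!(input_channels, kernel_input_channels, output_channels, groups) do
        cond do
          rem(input_channels, groups) != 0 ->
            raise ArgumentError, "input channels must be divisible by groups"

          kernel_input_channels != div(input_channels, groups) ->
            raise ArgumentError, "kernel input channels must equal input_channels / groups"

          rem(output_channels, groups) != 0 ->
            raise ArgumentError, "output channels must be divisible by groups"

          true ->
            :ok
        end
      end
    end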

Add missing gradients

Currently missing gradients for these functions:

  • Product
  • Scatter Window Max
  • Scatter Window Min
  • Scatter Add
  • LU Decomposition
  • Triangular Solve
  • SVD

Make Nx a meta module

I have been reading more into sparse matrices and one of the issues in scikit-learn is that, when you have a sparse matrix, it is strongly discouraged to use np functions on it, because they lose their sparse properties. Instead, you should use the sparse-specific versions.

While this is a fine recommendation (and generally suitable to Elixir), it introduces an issue in that defn functions only allow function calls to the Nx module. Given we most likely want to support sparse specific compilers in defn in the future, this puts us in a rough spot:

  1. If we want sparse matrices in defn, we need to call Nx with sparse matrices
  2. Outside of defn, you should not call Nx with sparse matrices

I believe we can address this with (surprise, surprise!) one extra level of indirection. We should make the Nx module a meta-module that doesn't really implement the operations but just defines their API. For example, the exp implementation would rather look like this:

def exp(tensor) do
  tensor = tensor(tensor)
  tensor.data.__struct__.exp(tensor)
end

This has a couple benefits:

  1. Now people can bring sparse matrices and use Nx module as is
  2. We can unify Nx and Nx.Defn.Expr because it can be implemented using the same API as above

While I would be generally worried about this approach since the indirection can affect performance, the cost of an extra function call is irrelevant when working with tensors, so we are fine.
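
To make the indirection concrete, here is a toy sketch of two hypothetical data structs plugging into that dispatch; neither is a real Nx module, and Nx.exp/1 would pick the implementation purely from tensor.data.__struct__:

    defmodule MyDenseData do
      # Hypothetical eager backend: stores a flat list and applies exp directly.
      defstruct [:list]

      def exp(%Nx.Tensor{data: %__MODULE__{list: list} = data} = t) do
        %{t | data: %{data | list: Enum.map(list, &:math.exp/1)}}
      end
    end

    defmodule MyExprData do
      # Hypothetical symbolic backend: records the operation instead of running it.
      defstruct [:op, :args]

      def exp(%Nx.Tensor{} = t) do
        %{t | data: %MyExprData{op: :exp, args: [t]}}
      end
    end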

GPU transfer from NIF on run causes OOM

Currently, we explicitly transfer buffers to the device before calling the run NIF from within Exla.Executable.run when the platform is a GPU like this:

    outside_cpu = client.platform == :cuda || client.platform == :rocm
    keep_on_device_int = if keep_on_device || outside_cpu, do: 1, else: 0

    device_id = device_assignment_to_device_id(executable, {replica, partition})

    inputs =
      Enum.map(arguments, fn
        %Buffer{ref: {ref, _}, data: nil} ->
          ref

        buffer = %Buffer{data: data, shape: shape, ref: nil} ->
          if outside_cpu do
            %Buffer{ref: {ref, _}} = Buffer.place_on_device(buffer, client, device_id)
            ref
          else
            {data, shape.ref}
          end
      end)

Originally, this was done to decouple what could have made the run NIF IO bound or CPU bound. After determining the GPU case is always closer to IO bound, we no longer need this logic; however, removing it leads to OOM errors on the MNIST example.

Build Failure: No module named 'numpy'

I checked out the project for the first time and encountered an error during the first build attempt when running mix test:

➜  exla git:(main) mix test

... <successes omitted> ...

Repository rule local_python_configure defined at:
  /home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl:275:26: in <toplevel>
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (13 packages loaded, 14 targets configured)
ERROR: An error occurred during the fetch of repository 'local_execution_config_python':
   Traceback (most recent call last):
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 213
		_get_numpy_include(<2 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 187, in _get_numpy_include
		execute(repository_ctx, <3 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/remote_config/common.bzl", line 217, in execute
		fail(<1 more arguments>)
Problem getting numpy include path.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
Is numpy installed?
INFO: Repository go_sdk instantiated at:
  no stack (--record_rule_instantiation_callstack not enabled)
Repository rule _go_download_sdk defined at:
  /home/elbow-jason/.cache/bazel/_bazel_elbow-jason/bbfe9bcff2dc48e9f808ab24728cb493/external/io_bazel_rules_go/go/private/sdk.bzl:52:20: in <toplevel>
ERROR: Analysis of target '//tensorflow/compiler/xla/exla:libexla.so' failed; build aborted: Traceback (most recent call last):
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 213
		_get_numpy_include(<2 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 187, in _get_numpy_include
		execute(repository_ctx, <3 more arguments>)
	File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/remote_config/common.bzl", line 217, in execute
		fail(<1 more arguments>)
Problem getting numpy include path.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
Is numpy installed?
INFO: Elapsed time: 21.495s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (13 packages loaded, 14 targets configured)
FAILED: Build did NOT complete successfully (13 packages loaded, 14 targets configured)
make: *** [Makefile:25: all] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

Not sure what system info is needed for diagnosis. Some basic info:

➜  exla git:(main) uname -a
Linux elbow-at-home 5.8.0-38-generic #43~20.04.1-Ubuntu SMP Tue Jan 12 16:39:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

➜  exla git:(main) gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

➜  exla git:(main) make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

➜  exla git:(main) asdf current erlang
erlang          23.1.5          /home/elbow-jason/.tool-versions
➜  exla git:(main) asdf current elixir
elixir          1.11.2-otp-23   /home/elbow-jason/.tool-versions
➜  exla git:(main) asdf current python
python          3.6.12          /home/elbow-jason/.tool-versions

Pin EXLA to a TF Stable Release

Right now, we're working off a recent TF master revision. We'll want to pin to a more stable release instead, and then work out a process for periodically updating the TF version.

Add streaming device allocation

For example, say we need to load 2GB of data. Today we first have to load it into memory and then load it into the GPU. We would like to do that without having to load it all onto the CPU first.

The biggest question is what would be the API on the Elixir side. We could have a Stream based one (immutable) or a process-based one (mutable). Perhaps both?
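
As a strawman for the Stream-based option, loading could look something like this. Exla.DeviceStream and all of its functions are hypothetical names invented for this sketch; client and device_id are placeholders:

    stream = Exla.DeviceStream.open(client, device_id)

    "training_data.bin"
    |> File.stream!([], 64 * 1024 * 1024)     # stream the file in 64MB chunks
    |> Enum.reduce(stream, fn chunk, stream ->
      Exla.DeviceStream.push(stream, chunk)   # append the chunk directly on the device
    end)
    |> Exla.DeviceStream.close()              # returns an on-device buffer reference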

Support common LinAlg primitives

For now, at least:

  • QR decomposition
    • (EXLA implementation missing)
  • LU decomposition
  • Cholesky decomposition
  • SVD
  • Eigen decomposition (hermitian)
  • Eigen decomposition (general)
  • Triangular Solve
  • Norms (L0, L1, L2)
    • ord: -2 for 2D missing
    • nuclear missing

Support common mathematical functions

Leaving this here for tracking/discussion on what should and shouldn't be included. Looking at XLA/JAX/Numpy, these are some common ones we are missing:

  • acos
  • asin
  • atan
  • tan
  • acosh
  • asinh
  • atanh
  • cosh
  • sinh
  • erf
  • erfc
  • erf_inv

There are a few others as well that we should look into, but these seemed to be the most common.

Tag dirty NIFs

We will want to tag long-running NIFs as dirty. Probably compile (a dirty CPU NIF) and run (which should have two versions: the CPU one is a dirty CPU NIF and the GPU one is a dirty IO NIF).

elixir_make retired and doesn't have permissions to mix.exs

Following the README:

➜  exla git:(main) mix deps.get
Resolving Hex dependencies...
Dependency resolution completed:
Unchanged:
  benchee 1.0.1
  deep_merge 1.0.0
  elixir_make 0.6.1 RETIRED!
    (invalid) Wrong permissions on the mix.exs file
All dependencies are up to date

device roadmap

  • Streamline and document device ordinal handling
  • Add an option to return supported platforms
  • Allocate data on multiple devices by a given dimension
  • Stream data into device
  • Stream data into multiple devices (for pmap)
  • Read data from multiple devices into a single binary (for pmap)

Add basic operator fusion to Nx.Defn

While the EXLA compiler already performs operator fusion, we should also be able to do some operator fusion for the built-in compiler written in Elixir. We can fuse most unary operators, as well as binary operators with constants, to reduce the number of traversals. It still won't hold a candle to the compiled modes, but it should be an overall positive for performance.
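
As a toy illustration of the idea, independent of Nx.Defn.Expr internals: fusing unary operators means performing a single traversal with the composed function instead of one traversal per operator.

    data = Enum.to_list(1..5)

    # Two traversals: negate, then exp.
    two_pass = data |> Enum.map(&(-&1)) |> Enum.map(&:math.exp/1)

    # One traversal with the fused function.
    one_pass = Enum.map(data, fn x -> :math.exp(-x) end)

    true = one_pass == two_pass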

Broadcasting a constant does not produce an Expr

This test will not compile:

defn zeros(opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {})
  Nx.as_type(Nx.broadcast(0, shape), opts[:type])
end

test "zeros with default args" do
  assert zeros(shape: {}) == Nx.tensor(0.0, type: {:f, 32})
end

This test will compile but raise (at least on Nx.Defn backend):

defn zeros(_t, opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {})
  Nx.as_type(Nx.broadcast(0, shape), opts[:type])
end

test "zeros with default args" do
  assert zeros(Nx.tensor([1, 2, 3]), shape: {}) == Nx.tensor(0.0, type: {:f, 32})
end
** (ArgumentError) defn must return an expression tensor or a tuple, got: #Nx.Tensor<
       f32
       0.0
     >

This workaround will also raise:

    defmodule Zeros do
      def zeros(shape, opts \\ []) do
        type = opts[:type] || {:f, 32}
        constant(0, [shape: shape, type: type])
      end

      defnp constant(t, opts) do
        opts = keyword!(opts, type: {:f, 32}, shape: {})
        Nx.as_type(Nx.broadcast(t, opts[:shape]), opts[:type])
      end
    end
** (ArgumentError) defn functions expects either numbers or tensors as arguments. Got: [shape: [shape: {}], type: {:f, 32}]

Support :max_float_type

max_float_type will rewrite all floats with equal or higher size to the given one.

Compiler option

@defn_compiler {EXLA, max_float_type: {:bf, 16}}
  • Pros: most efficient
  • Cons: each compiler has to implement it which can be error prone

Transformer (like grad)

max_float_type(expr, {:bf, 16})
  • Pros: one implementation for all compilers
  • Pros: easy to implement thanks to the new Expr.traverse_args
  • Cons: has to traverse the expression after the expression is built

If we go the transformer route, we could make the transformer module public, which means we can support both APIs (either via a transform or via an option). Implementing the traversal is easy and requires rewriting the types of the appropriate nodes. Only a handful of nodes require special attention:

  1. Parameters will have to be cast down as an explicit conversion step
  2. Functions will have to be recomputed
  3. Tensors will have to be recast
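
A small sketch of the rewrite rule itself (the Expr traversal is omitted): floats whose size is greater than or equal to the target size are replaced by the target type, everything else is left alone.

    defmodule MaxFloatType do
      def cap({:f, size}, {_kind, target_size} = target) when size >= target_size, do: target
      def cap({:bf, size}, {_kind, target_size} = target) when size >= target_size, do: target
      def cap(type, _target), do: type
    end

    MaxFloatType.cap({:f, 64}, {:bf, 16})  #=> {:bf, 16}
    MaxFloatType.cap({:f, 32}, {:bf, 16})  #=> {:bf, 16}
    MaxFloatType.cap({:s, 64}, {:bf, 16})  #=> {:s, 64} (integers are untouched)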

Add common aggregate ops

So far we have mean and sum, but we are missing:

  • reduce_max
  • reduce_min
  • reduce_prod

As well as variants for reduce_window:

  • reduce_window_sum
  • reduce_window_mean
  • reduce_window_max
  • reduce_window_min
  • reduce_window_prod

We could also probably tackle the cumulative ops. We can also consider different names for these functions. Maybe rather than reduce_max and reduce_min, just a max/1 and min/1.

Add `keep_dims` option to aggregate functions

Most NumPy aggregate functions have a keepdims option that will keep the reduced dimensions as size 1 so the rank of the tensor stays the same. This is very easily implemented with some shape changes from Nx and an additional reshape in EXLA.

It's very common, especially in neural network ops. Numerous examples in here: https://github.com/google/flax/blob/master/flax/core/nn/normalization.py

Implementing something similar in defn would, I believe, involve a transform to calculate the reshape, which would be slightly annoying to have to do every time.
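
A hedged sketch of emulating keep_dims today with an explicit reshape, assuming Nx.sum accepts an :axes option (axis handling is simplified to a single known axis):

    t = Nx.iota({2, 3}, type: {:f, 32})

    summed = Nx.sum(t, axes: [1])        # shape {2}: the reduced axis is dropped
    kept   = Nx.reshape(summed, {2, 1})  # shape {2, 1}: rank preserved, reduced axis kept as size 1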
