
Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

License: Apache License 2.0

axon elixir nx hugging-face pre-trained machine-learning transformer

bumblebee's Introduction

Bumblebee


Bumblebee provides pre-trained Neural Network models on top of Axon. It includes integration with 🤗 Models, allowing anyone to download and perform Machine Learning tasks with a few lines of code.

Numbat and Bumblebees

Getting started

The best way to get started with Bumblebee is with Livebook. Our announcement video shows how to use Livebook's Smart Cells to perform different Neural Network tasks with a few clicks. You can then tweak the code and deploy it.

We also provide single-file examples of running Neural Networks inside your Phoenix (+ LiveView) apps in the examples/phoenix folder.

You may also check our official docs, which include notebooks and our API reference. The "Tasks" section in the sidebar covers high-level APIs for using Bumblebee. The remaining modules in the sidebar list all supported architectures.

Installation

First add Bumblebee and EXLA as dependencies in your mix.exs. EXLA is an optional dependency, but an important one, as it allows you to compile models just-in-time and run them on CPU/GPU:

def deps do
  [
    {:bumblebee, "~> 0.5.3"},
    {:exla, ">= 0.0.0"}
  ]
end

Then configure Nx to use the EXLA backend by default in your config/config.exs file:

import Config

config :nx, default_backend: EXLA.Backend

To use GPUs, you must set the XLA_TARGET environment variable accordingly.

In notebooks and scripts, use the following Mix.install/2 call to both install and configure dependencies:

Mix.install(
  [
    {:bumblebee, "~> 0.5.3"},
    {:exla, ">= 0.0.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)
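
If you plan to use a GPU from a notebook, the XLA_TARGET environment variable mentioned above has to be set before EXLA is compiled. A minimal sketch, assuming a CUDA 11.8 setup (adjust the value to your installation; Mix.install/2 accepts a :system_env option for this):

Mix.install(
  [
    {:bumblebee, "~> 0.5.3"},
    {:exla, ">= 0.0.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]],
  # Illustrative target value; pick the one matching your CUDA version
  system_env: [{"XLA_TARGET", "cuda118"}]
)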

Usage

To get a sense of what Bumblebee does, look at this example:

{:ok, model_info} = Bumblebee.load_model({:hf, "google-bert/bert-base-uncased"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

serving = Bumblebee.Text.fill_mask(model_info, tokenizer)
Nx.Serving.run(serving, "The capital of [MASK] is Paris.")
#=> %{
#=>   predictions: [
#=>     %{score: 0.9279842972755432, token: "france"},
#=>     %{score: 0.008412551134824753, token: "brittany"},
#=>     %{score: 0.007433671969920397, token: "algeria"},
#=>     %{score: 0.004957548808306456, token: "department"},
#=>     %{score: 0.004369721747934818, token: "reunion"}
#=>   ]
#=> }

We load the BERT model from the Hugging Face Hub, plug it into an end-to-end pipeline in the form of a "serving", and finally use the serving to get our task done. For more details check out the documentation.
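
In a long-running application (for example a Phoenix app), a common pattern is to start the serving under your supervision tree instead of calling Nx.Serving.run/2 directly. A minimal sketch, where the MyApp.BertServing name and the batch options are illustrative choices rather than something Bumblebee prescribes:

# In your application supervisor
children = [
  {Nx.Serving,
   serving: serving,
   name: MyApp.BertServing,
   batch_size: 8,
   batch_timeout: 100}
]

# Later, from any process in the application
Nx.Serving.batched_run(MyApp.BertServing, "The capital of [MASK] is Paris.")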

HuggingFace Hub

HuggingFace Hub is a platform hosting models, datasets and demo apps (Spaces), all using Git repositories (with Git LFS for large files). For further information check out the Hub documentation and explore the model repositories.

Models

Model repositories are regular Git repositories, therefore they can store arbitrary files. However, most repositories store models saved using the Python Transformers library. Bumblebee is an Elixir counterpart of Transformers and allows for importing those models, as long as they are implemented in Bumblebee.

A repository in the Transformers format does not store an actual model, only the trained parameters and a configuration file. The configuration file specifies the model type (e.g. BERT) and high-level properties, such as the number of layers and their size. The model implementation lives in the library code (both Transformers and Bumblebee). When loading a model, the library fetches the configuration and builds a matching model, then it fetches the trained parameters to pair them with the model. The key takeaway is that in order to use any given model, it needs to have an implementation in Bumblebee.
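
To make this flow concrete, here is a minimal sketch using Bumblebee's lower-level functions; Bumblebee.load_model/1 effectively wraps these steps for you:

# Fetch the configuration and build a matching Axon model (no parameters yet)
{:ok, spec} = Bumblebee.load_spec({:hf, "google-bert/bert-base-uncased"})
model = Bumblebee.build_model(spec)

# load_model/1 does the above and additionally fetches the trained parameters
{:ok, %{model: model, params: params, spec: spec}} =
  Bumblebee.load_model({:hf, "google-bert/bert-base-uncased"})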

Model repository

Here is a list of files commonly found in a repository following the Transformers format; a sketch of how Bumblebee loads them follows the list.

  • config.json - model configuration, specifies the model type and model-specific options. You can think of this as a blueprint for how the model should be constructed

  • pytorch_model.bin - raw model parameters (tensors) serialized from a PyTorch model using PyTorch format (supported by Bumblebee)

  • model.safetensors - raw model parameters (tensors) serialized from a PyTorch model using Safetensors (supported by Bumblebee)

  • flax_model.msgpack, tf_model.h5 - raw model parameters (tensors) serialized from Flax and Tensorflow models respectively (not supported by Bumblebee)

  • tokenizer.json, tokenizer_config.json - tokenizer configuration, describes how to convert text input to model inputs (tensors). See Tokenizer support

  • preprocessor_config.json - featurizer configuration, describes how to convert real-world input (image, audio) to model inputs (tensors)

  • generation_config.json - a set of configuration options specific to text generation, such as token sampling strategy and various constraints
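
Roughly speaking, each of these files is handled by a dedicated loading function in Bumblebee. A hedged sketch (which files a given repository actually contains varies, so some of these calls may not apply to every repository):

repo = {:hf, "google-bert/bert-base-uncased"}

# config.json + pytorch_model.bin / model.safetensors
{:ok, model_info} = Bumblebee.load_model(repo)

# tokenizer.json (and tokenizer_config.json)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

# preprocessor_config.json, for image/audio models such as "microsoft/resnet-50"
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "microsoft/resnet-50"})

# generation_config.json, for generative models such as "gpt2"
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "gpt2"})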

Model support

As pointed out above, in order to load a model, the given model type must be implemented in Bumblebee. To find out whether a model is supported, you can call Bumblebee.load_model({:hf, "model-repo"}) or use this tool to run a number of checks against the repository.

If you prefer to poke around the code, open the config.json file in the model repository and copy the class name under "architectures". Next, search the Bumblebee codebase for that keyword. If you find a match, the model is supported.

Also note that certain repositories include multiple models in separate subdirectories, for example stabilityai/stable-diffusion-2. In such cases, use Bumblebee.load_model({:hf, "model-repo", subdir: "..."}).
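
For instance, with Stable Diffusion each component lives in its own subdirectory and is loaded separately. A sketch (the subdirectory names below are the ones conventionally used by Stable Diffusion repositories):

repository_id = "stabilityai/stable-diffusion-2"

{:ok, text_encoder} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"})
{:ok, unet} = Bumblebee.load_model({:hf, repository_id, subdir: "unet"})
{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})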

Tokenizer support

The Transformers library distinguishes two types of tokenizer implementations:

  • "slow tokenizer" - a tokenizer implemented in Python and stored as tokenizer_config.json and a couple extra files

  • "fast tokenizer" - a tokenizer implemented in Rust and stored in a single file - tokenizer.json

Bumblebee relies on the Rust implementations (through bindings to Tokenizers) and therefore always requires the tokenizer.json file. Many repositories only include files for a "slow tokenizer". When you stumble upon such a repository, there are two options you can try.

First, if the repository is clearly a fine-tuned version of another model, you can look for tokenizer.json in the original model repository. For example, textattack/bert-base-uncased-yelp-polarity only includes tokenizer_config.json, but it is a fine-tuned version of bert-base-uncased, which does include tokenizer.json. Consequently, you can safely load the model from textattack/bert-base-uncased-yelp-polarity and the tokenizer from bert-base-uncased.
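
In code, that mix-and-match is just two separate loads (a sketch of the exact scenario above):

# Model weights from the fine-tuned repository...
{:ok, model_info} = Bumblebee.load_model({:hf, "textattack/bert-base-uncased-yelp-polarity"})

# ...tokenizer from the original base repository
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})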

Otherwise, the Transformers library includes conversion rules to load a "slow tokenizer" and convert it to a corresponding "fast tokenizer", which is possible in most cases. You can generate the tokenizer.json file using this tool. Once successful, you can follow the steps to submit a PR adding tokenizer.json to the model repository. Note that you do not have to wait for the PR to be merged; instead, you can copy the commit SHA from the PR and load the tokenizer with Bumblebee.load_tokenizer({:hf, "model-repo", revision: "..."}).

License

Copyright (c) 2022 Dashbit

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

bumblebee's People

Contributors

benjamin-philip, blackeuler, bunste, cmeon, coderrg, connorrigby, dbernheisel, edennis, fhunleth, grzuy, haavars, ityonemo, jlxq0, joelpaulkoch, jonatanklosko, josevalim, kianmeng, linusdm, msluszniak, nyo16, preciz, rajrajhans, seanmor5, sitch, thiagopromano, wtedw


bumblebee's Issues

Got OOM message with RTX 3060

I've been trying to run Stable Diffusion on the GPU.

But it failed and I got an OOM message.

Is this error message due to insufficient GPU memory?
Is it possible to make it work by adjusting some parameters?
Stable Diffusion 1.4 runs on this GPU in a TensorFlow environment, so it would be nice if it worked with Bumblebee too.

It's working fine with :host. It's amazing how easy it is to use neural networks with Livebook!

OS: Ubuntu 22.04 on WSL2
GPU: RTX 3060 (12GB)
Livebook v0.8.0
Elixir v1.14.2
XLA_TARGET=cuda111
CUDA Version: 11.7

05:32:56.019 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

05:32:56.023 [info] XLA service 0x7fb39437dac0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

05:32:56.023 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6

05:32:56.023 [info] Using BFC allocator.

05:32:56.023 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.

05:32:58.662 [info] Start cannot spawn child process: No such file or directory
05:34:00.234 [info] total_region_allocated_bytes_: 10641368576 memory_limit_: 10641368678 available bytes: 102 curr_region_allocation_bytes_: 21282737664

05:34:00.234 [info] Stats: 
Limit:                     10641368678
InUse:                      5530766592
MaxInUse:                   7566778624
NumAllocs:                        3199
MaxAllocSize:                399769600
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

05:34:00.234 [warn] **********___***********************************************************____________________________

05:34:00.234 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 3546709984 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    3.84GiB
              constant allocation:       144B
        maybe_live_out allocation:   768.0KiB
     preallocated temp allocation:    3.30GiB
  preallocated temp fragmentation:       304B (0.00%)
                 total allocation:    7.15GiB
              total fragmentation:   821.0KiB (0.01%)

The whole log is in oommessage.log.

Stable Diffusion, load previously downloaded model

Is it possible to load an already downloaded SD model using a file path, rather than downloading it from Huggingface using the repo name?

I think it would be a nice addition, especially since some people might already have a working Stable Diffusion setup and want to skip downloading the models again.

Support object detection

Note: I'm posting this issue at Sean Moriarity's (emailed) suggestion. However, I'm not at all sure what needs to be done here, let alone how. So, I'll just summarize the use case.

As discussed here, I'd like there to be a way to scan videos of technical presentations, extract the text and layout of slides, and generate corresponding Markdown files. Aside from making it possible to search the slides, this could help to make the presentations more accessible to blind and visually impaired users.

Let's assume that a video has been downloaded from a web site (e.g., via VLC media player) and that we can extract individual images from the resulting file (e.g., via Membrane).

On edited videos, these images will often contain regions showing the presenter, a slide, and assorted fill. Before we can process the slide (e.g., via Tesseract OCR), we need to extract it from the surrounding image. And, before we can do that, we need to determine its boundaries. According to Sean:

This is an object segmentation task. It's a task available in pre-trained models on HuggingFace like DETR -- which means we can certainly build the same functionality into Bumblebee. Object segmentation outlines the boundary region of an image as you describe, and then you can use that boundary region to do whatever you want.

As a side note, various related tasks will need to be addressed. For example, a production system should identify and handle duplicate images, dynamic content, embedded graphics, etc. It would also be nice to generate and incorporate transcriptions from the audio track (and a pony...).

Issue with .load_model/1 matching on :zip.unzip (bad_central_directory)

On Arch Linux, calling:

Bumblebee.load_model({:hf, "stanford-crfm/pubmedgpt"})

Throws

** (MatchError) no match of right hand side value: {:error, :bad_central_directory}                                              
    (bumblebee 0.1.2) lib/bumblebee/conversion/pytorch/loader.ex:29: Bumblebee.Conversion.PyTorch.Loader.load_zip!/1
    (bumblebee 0.1.2) lib/bumblebee/conversion/pytorch.ex:24: Bumblebee.Conversion.PyTorch.load_params!/4
    (bumblebee 0.1.2) lib/bumblebee.ex:399: Bumblebee.load_params/4
    (bumblebee 0.1.2) lib/bumblebee.ex:378: Bumblebee.load_model/2

The problem is with https://github.com/elixir-nx/bumblebee/blob/main/lib/bumblebee/conversion/pytorch/loader.ex#L29

Seems to be some sort of erlang decoding issue in: https://github.com/erlang/otp/blob/master/lib/stdlib/src/zip.erl

The file isn't corrupt, as I am able to unzip it using the Linux unzip binary. Also, the file size is 10GB, and I currently have >50GB of free RAM.

Unify model inputs

Currently there are some inputs applicable to most models (input embeds, head mask, position ids), but not all models accept them. We should add the missing inputs to align the models as much as possible.

CLIP models

We currently have ClipText; we should also add ClipVision. Then we can add a Clip model that combines both and make sure that loading parameters works across them.

Nx.LazyContainer not implemented for %Bumblebee.Diffusion.PndmScheduler

Hi!
I'm trying to run Stable Diffusion in Livebook without using the pre-built Nx.Serving setup (aka Bumblebee.Diffusion.StableDiffusion.text_to_image).

I'm getting stuck though because this call keeps failing:

{_state, new_latents} = Bumblebee.Diffusion.PndmScheduler.step(scheduler, scheduler_state, latents, noise_pred) 

where scheduler comes from {:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

The error I'm getting is:

** (Protocol.UndefinedError) protocol Nx.LazyContainer not implemented for %Bumblebee.Diffusion.PndmScheduler{num_train_steps: 1000, beta_schedule: :quadratic, beta_start: 8.5e-4, beta_end: 0.012, alpha_clip_strategy: :alpha_zero, timesteps_offset: 1, reduce_warmup: true} of type Bumblebee.Diffusion.PndmScheduler (a struct), data-structures given to defn/Nx must implement either Nx.LazyContainer or Nx.Container. This protocol is implemented for the following type(s): Any, Atom, Complex, Float, Integer, List, Map, Nx.Batch, Nx.Tensor, Tuple

I'm not sure why this doesn't work since it seems to be the same call that the text_to_image function uses internally. Is it because it's within Livebook? The failing notebook is here (last cell)

Support BigBird

I'm happy to give this a shot but if it's not far off something that exists and isn't too much effort... ;)

Validate model configuration

Since #55, we now have model configuration/docs as a data structure, so we can expand on that and use NimbleOptions for validation. One thing to keep in mind is that in our case the configuration is incremental: we want to re-configure rather than pass all the options, so we just need to merge things accordingly.

Perhaps we could generate most of the hf/transformers converters based on option types.

Support additional Stable Diffusion modes

As discussed in #111, Stable Diffusion supports a number of different modes.

Currently, only text-to-image is supported, but the other modes are considered in scope (verified by seanmor5).

Currently Supported

Currently Unsupported

As #111 was closed when the general support for Stable Diffusion 2 was added, it seemed appropriate to track these separately.

Huggingface model keeps downloading after stopping cell evaluation

I found that stopping the evaluation of a cell doesn't stop the model download from Huggingface, for a statement like:

Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
  log_params_diff: false
)

The network activity stops only when I completely shutdown the Livebook server.

Cannot set temperature, top-k, etc for GPT-2 models

Text generation tasks are very susceptible to repetition for anything longer than the shortest outputs. How can we set temperature, top-k or other parameters that are normally used to avoid this in GPT-2 output?


Unable to configure XLA_TARGET=cuda118 to use GPU

I've been trying to use XLA_TARGET=cuda118 within a livebook app.

I'm running from the Livebook GitHub repository (https://github.com/livebook-dev/livebook.git, my HEAD is 361455cd4eb1b527e6fd04d5c51f2901cbb4ed90).

I start it with XLA_TARGET=cuda118 MIX_ENV=prod mix phx.server.

I also tried setting XLA_TARGET=cuda118 inside the environment variables of the livebook settings.

No matter what I do, when I run an example Neural Network, it always prints out:

14:30:19.336 [info] TfrtCpuClient created.

If I'm targeting the GPU, I assume it would print out a different client, right?

Notebook dependencies and setup are:

Mix.install(
  [
    {:kino_bumblebee, "~> 0.1.0"},
    {:exla, "~> 0.4.1"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

Bumblebee seems to be the correct version:

Application.spec(:kino_bumblebee, :vsn)
'0.1.0'

Also:

Livebook v0.8.0
Elixir v1.14.2

I believe I have a supported version of the NVIDIA driver on the host:

$ nvidia-smi 
Fri Dec  9 14:37:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   37C    P8     6W / 120W |     15MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14189      G   ...xorg-server-1.20.14/bin/X        9MiB |
|    0   N/A  N/A     14217      G   ...hell-43.1/bin/gnome-shell        2MiB |
+-----------------------------------------------------------------------------+

Support Stable Diffusion 2

It looks like there are some minor things broken for Stable Diffusion 2.1:

** (RuntimeError) conversion failed, expected "attention_head_dim" to be a number, got: [5, 10, 20, 20]
    (bumblebee 0.1.0) lib/bumblebee/shared/converters.ex:20: anonymous fn/3 in Bumblebee.Shared.Converters.convert!/2
    (elixir 1.14.2) lib/enum.ex:2468: Enum."-reduce/3-lists^foldl/2-0-"/3
    (bumblebee 0.1.0) lib/bumblebee/shared/converters.ex:14: Bumblebee.Shared.Converters.convert!/2
    (bumblebee 0.1.0) lib/bumblebee/diffusion/unet_2d_conditional.ex:341: Bumblebee.HuggingFace.Transformers.Config.Bumblebee.Diffusion.UNet2DConditional.load/2
    (bumblebee 0.1.0) lib/bumblebee.ex:279: Bumblebee.load_spec/2
    (bumblebee 0.1.0) lib/bumblebee.ex:372: Bumblebee.load_model/2
    (stdlib 4.0.1) erl_eval.erl:744: :erl_eval.do_apply/7
    (stdlib 4.0.1) erl_eval.erl:492: :erl_eval.expr/6

Integrate tokenizers

We need a wrapper API for tokenizers to automatically handle pairs of sentences, batching, and generating the attention mask. Also, we should add an API for loading tokenizers, similarly to featurizers.

Simplify tokenizer modules

Currently each tokenizer module is the same, other than the default special tokens (see #141). Maybe we should kill the behaviour altogether, but we still need a place for the default special tokens.

This picture may change if we have a tokenizer that doesn't follow the current scheme exactly, such as whisper (#107).

Error compiling XLA

==> xla
Compiling 2 files (.ex)
Generated xla app
rm -f /home/livebook/.cache/xla_extension/tf-d5b57ca93e506df258271ea00fc29cf98383a374/tensorflow/compiler/xla/extension && \
	ln -s "/home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/1afae0bfefe756b720b6a2ccf0818979/deps/xla/extension" /home/livebook/.cache/xla_extension/tf-d5b57ca93e506df258271ea00fc29cf98383a374/tensorflow/compiler/xla/extension && \
	cd /home/livebook/.cache/xla_extension/tf-d5b57ca93e506df258271ea00fc29cf98383a374 && \
	bazel build --define "framework_shared_object=false" -c opt   --config=cuda //tensorflow/compiler/xla/extension:xla_extension && \
	mkdir -p /home/livebook/.cache/xla/0.4.2/cache/build/ && \
	cp -f /home/livebook/.cache/xla_extension/tf-d5b57ca93e506df258271ea00fc29cf98383a374/bazel-bin/tensorflow/compiler/xla/extension/xla_extension.tar.gz /home/livebook/.cache/xla/0.4.2/cache/build/xla_extension-x86_64-linux-gnu-cuda.tar.gz
ln: failed to create symbolic link '/home/livebook/.cache/xla_extension/tf-d5b57ca93e506df258271ea00fc29cf98383a374/tensorflow/compiler/xla/extension': No such file or directory
make: *** [Makefile:27: /home/livebook/.cache/xla/0.4.2/cache/build/xla_extension-x86_64-linux-gnu-cuda.tar.gz] Error 1
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"

Using the official Docker Livebook image with no modifications.

Tips for creating a `sentence-transformer` model?

First of all, awesome project! I'm looking forward to using many models from here.

One use case I'm excited to try out is semantic search over text and images based on SentenceTransformers. My first try was to export a model to ONNX and use it in Axon, but I ran into this issue mortont/axon_onnx#48. While I'm still trying to fix that I was wondering...

How do you actually create these models? Did you really implement them with the paper as a resource, or is there a tip on how to implement them? I'm a machine learning beginner. :-)

Support more text generation strategies

Currently Bumblebee.Text.Generation.generate/5 supports only a basic greedy strategy for selecting the next token. We should add options for more sophisticated strategies, in particular beam search and sampling.

Here is a good reference explaining the various options, and contrastive search is a more recent development.

Add support for HuggingFace Datasets

Datasets could be downloaded directly instead of via Bumblebee. However, the key need is support for Parquet. It is my understanding that Parquet is going to be the "standard" for HF datasets. I couldn't find an Elixir library that can explode Parquet.

Remap layer names

Ideally we should use any layer names we want and then have an explicit name/pattern mapping from hf/transformers names. This way we can keep the models consistent, and also share more parts of the transformer models (currently they often use different layer naming).

Support for stable diffusion >2.0

Stable Diffusion 1.5 works perfectly (might be worth updating the smart cell to point to 1.5 rather than 1.4), but 2.0 and 2.1 need some attention. (Looks like older models only supported one attention head dim and now the models have multiple?):

** (RuntimeError) conversion failed, expected "attention_head_dim" to be a number, got: [5, 10, 20, 20]
    (bumblebee 0.1.2) lib/bumblebee/shared/converters.ex:20: anonymous fn/3 in Bumblebee.Shared.Converters.convert!/2
    (elixir 1.14.2) lib/enum.ex:2468: Enum."-reduce/3-lists^foldl/2-0-"/3
    (bumblebee 0.1.2) lib/bumblebee/shared/converters.ex:14: Bumblebee.Shared.Converters.convert!/2
    (bumblebee 0.1.2) lib/bumblebee/diffusion/unet_2d_conditional.ex:341: Bumblebee.HuggingFace.Transformers.Config.Bumblebee.Diffusion.UNet2DConditional.load/2
    (bumblebee 0.1.2) lib/bumblebee.ex:279: Bumblebee.load_spec/2
    (bumblebee 0.1.2) lib/bumblebee.ex:372: Bumblebee.load_model/2
    #cell:25h4u7t3mfavivpqrylftjgs5u6ptd3t:10: (file)

To reproduce, just use the SD 1.4 smart cell and replace repository_id with "stabilityai/stable-diffusion-2-1".

`Bumblebee.Diffusion.VaeKl` could use a public `sample` method

When using Stable Diffusion to create an image-to-image model, the process is:
Image -> VAE encoder -> Posterior -> Sample from posterior to get latent -> Add noise to latent, etc.

Right now there's no public function to sample from the posterior that the VAE encoder outputs. It's not hard to write, but it would be nice to have, given that it's probably a common thing to do.

Side note - I'm probably wrong about this but is this correct? https://github.com/elixir-nx/bumblebee/blob/main/lib/bumblebee/diffusion/vae_kl.ex#L414

Shouldn't this be:

    posterior.std
    |> Axon.multiply(z)
    |> Axon.add(posterior.mean)

so that the mean isn't also being multiplied by z?

Unpickler load op bug

I have a bert-base model I fine-tuned on a token classification problem. I trained on the GPU, so I wonder if this is related to the MPS issue we had earlier. Anyway, loading the model I get:

** (FunctionClauseError) no function clause matching in Unpickler.load_op/2    
    
    The following arguments were given to Unpickler.load_op/2:
    
        # 1
        nil
    
        # 2
        %{
          memo: %{},
          metastack: [],
          object_resolver: #Function<2.26982889/1 in Bumblebee.Conversion.PyTorch.Loader.object_resolver>,
          persistent_id_resolver: #Function<5.26982889/1 in Bumblebee.Conversion.PyTorch.Loader.load_zip!/1>,
          refs: %{},
          stack: []
        }
    
    Attempted function clauses (showing 1 out of 1):
    
        defp load_op(<<opcode, rest::binary()>>, state)
    
    (unpickler 0.1.0) lib/unpickler.ex:236: Unpickler.load_op/2
    (bumblebee 0.1.0) lib/bumblebee/conversion/pytorch/loader.ex:37: Bumblebee.Conversion.PyTorch.Loader.load_zip!/1
    (bumblebee 0.1.0) lib/bumblebee/conversion/pytorch.ex:25: Bumblebee.Conversion.PyTorch.load_params!/4
    (bumblebee 0.1.0) lib/bumblebee.ex:318: Bumblebee.load_params/4
    (bumblebee 0.1.0) lib/bumblebee.ex:295: Bumblebee.load_model/2

Let me know if you want me to send you the models.

Load tokenizer special tokens

Currently our tokenizer implementations assume certain special tokens, like [PAD] for BERT; however, there are repos on the Hub that override them. Example:

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"})
Bumblebee.apply_tokenizer(tokenizer, "foo")

This fails, because the padding token is actually <pad>.

EXLA compile issue in Livebook

I'm messing around with the Stable Diffusion example, and I'm getting the following error in Livebook when trying to add the exla dependency:

could not compile dependency :exla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile exla", update it with "mix deps.update exla" or clean it with "mix deps.clean exla"

** (MatchError) no match of right hand side value: {"x86_64", "windows"}
...

Not sure what I could be doing wrong... any suggestions?

Document image format expectations for Bumblebee.Vision.ImageClassification

I would like to contribute some documentation that clarifies the expected image format to Bumblebee.Vision.image_classification. The type t:Bumblebee.Vision.image says:

@type image() :: Nx.Container.t()
A term representing an image.
Either Nx.Tensor in HWC order or a struct implementing Nx.Container and
resolving to such tensor.

However it does not clarify:

  • If the image should be resized first to the same size as that used to train the model (224 x 224 for the resnet models?)
  • Whether the image data should be {:u, 8} or some other type (some models suggest data should be in the range [0.0..1.0])
  • Whether the image can have an alpha layer (reading the code suggests yes, but perhaps that is model dependent)
  • Whether the image should be preprocessed (this Stack Overflow article suggests it should be)

If I can get some guidance I'll write a doc PR.

Unable to force build of Tokenizers

I'm attempting to get a Google Colab going that runs Livebook with Bumblebee to give developers easy access to a GPU. The crux is getting everything to work on Ubuntu 18.04 when all the precompiled binaries require a newer GLIBC.

Generally that hasn't been a problem for EXLA, but the Tokenizers library (compiled nif via Rustler) isn't behaving correctly.

While Rustler's force_build config option doesn't appear to work at all, setting the env variable TOKENIZERS_BUILD=true works perfectly to compile and launch Livebook, but when running within the livebook, it seems to revert to using the prebuilt binaries.

So, a call to Bumblebee.load_tokenizer results in:

** (UndefinedFunctionError) function Tokenizers.Native.from_file/1 is undefined (module Tokenizers.Native is not available)
    (tokenizers 0.2.0) Tokenizers.Native.from_file("/root/.cache/bumblebee/huggingface/hnu6qkd3fooybwwjvnddfafwua.ei2geojyhbrggy3dhfsggnlbmrqwgzbugazwgmbqmi2dombuhe3tmmjzgy2tiodghara")
    (bumblebee 0.1.2) lib/bumblebee/utils/tokenizers.ex:120: Bumblebee.Utils.Tokenizers.load!/1
    (bumblebee 0.1.2) lib/bumblebee/text/gpt2_tokenizer.ex:37: Bumblebee.HuggingFace.Transformers.Config.Bumblebee.Text.Gpt2Tokenizer.load/2
    (bumblebee 0.1.2) lib/bumblebee.ex:577: Bumblebee.load_tokenizer/2
    #cell:f6mspfr4bdyx57zlfg26jfl3c4fqc2gq:2: (file)

22:11:53.535 [warn] The on_load function for module Elixir.Tokenizers.Native returned:
{:error,
 {:load_failed,
  'Failed to load NIF library: \'/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28\' not found

Add Google Colab link?

Hey! I saw @josevalim's YouTube video announcing Bumblebee and noted that he didn't have easy access to a GPU for the demo.

I put together a Google Colab notebook that runs Livebook with Bumblebee and CUDA support here ->
https://github.com/lukegalea/LiveBook_GoogleColab/blob/main/Google_Colab_hosted_Elixir_LiveBook_%2B_BumbleeBee_on_GPU_(Stable_Diffusion_%2B_GPT_2)_v1_0.ipynb

If you've got Colab Pro+, you can assign a high-memory, high-performance GPU instance and have 52GB of RAM and 40GB of VRAM, enough to run something like GPT-J, etc.

Think it's worth linking to?

Error on Linux when attempting to load models from Hugging Face

When trying out image classification in Livebook, the models seem to fail to load. I am running an up-to-date install of Arch Linux with the latest versions of Erlang and Elixir installed through asdf.

hansihe:~/ $ asdf current
elixir          1.14.2-otp-25
erlang          25.1.2

I'm running the following full code snippet from Livebook:

Mix.install([
  {:bumblebee, "~> 0.1.0"},
  {:nx, "~> 0.4.1"},
  {:exla, "~> 0.4.1"},
  {:axon, "~> 0.3.1"},
  {:kino, "~> 0.8.0"}
])

Nx.global_default_backend(EXLA.Backend)

{:ok, resnet} = Bumblebee.load_model({:hf, "microsoft/resnet-50"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "microsoft/resnet-50"})

:ok

This errors:

** (File.RenameError) could not rename from "/tmp/bumblebee_yb7uihnimydvsi55mie2rgydcyih322y" to "/home/hansihe/.cache/bumblebee/huggingface/45jmafnchxcbm43dsoretzry4i.eiztamryhfrtsnzzgjstmnrymq3tgyzzheytqmrzmm4dqnbshe3tozjsmi4tanjthera": cross-domain link
    (elixir 1.14.2) lib/file.ex:766: File.rename!/2
    (bumblebee 0.1.0) lib/bumblebee/huggingface/hub.ex:63: Bumblebee.HuggingFace.Hub.cached_download/2
    (bumblebee 0.1.0) lib/bumblebee.ex:250: Bumblebee.load_spec/2
    (bumblebee 0.1.0) lib/bumblebee.ex:372: Bumblebee.load_model/2
    #cell:a3yfrzehbpz4mpgcbs7lpry7b3sia35g:1: (file)

This seems to happen because we are trying to move a file from /tmp to /home, which are mounted on different filesystems:

hansihe:~/ $ mount
/dev/nvme0n1p3 on / type ext4 (rw,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,nr_inodes=1048576,inode64)
[...]
