
mlx-data's Issues

features.audio.mfsc returns empty arrays

I just tried out the spectrogram feature extraction pipeline as seen here.

However, when running the following minimal example based on the provided sample code:

from mlx.data.datasets import load_librispeech
from mlx.data.features import mfsc

dset = (
    load_librispeech()
    .key_transform("audio", mfsc(80, 16000))
    .to_stream()
    .prefetch(16, 8)
    .batch(1)
    .prefetch(2, 1)
)

batch = next(dset)
print(batch['audio'].shape)  # prints (1, <x>, 0, 80)
print(batch['audio'])        # prints []

I get a shape of (1, <x>, 0, 80) and an empty array. I'm looking into it.

`key_transform` and `buffer_from_vector` should support mlx.core arrays instead of throwing `Array is not writeable`

Consider the following minimal example:

import mlx.core as mx
import mlx.data as dx

b = dx.buffer_from_vector([dict(i=i) for i in range(5)])
print(b.key_transform("i", lambda x: mx.ones(3), output_key="o"))

This raises ValueError: array is not writeable. It works with numpy arrays, though: lambda x: np.ones(3).

I am assuming that mx.array is missing some protocol to support this, so I am not sure if I should report this here or in ml-explore/mlx.

It is a bit awkward to be forced to create a numpy array only for it to be converted to mlx later on.

The same problem arises when passing an mx.array to buffer_from_vector: b = dx.buffer_from_vector([dict(i=mx.array(i)) for i in range(5)]).
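
In the meantime, a workaround sketch (an untested assumption on my part) based on the observation that numpy arrays work: materialize the mlx result as numpy inside the transform.

import numpy as np

# Workaround sketch: convert the mlx array to numpy before returning it
# from the transform, since numpy arrays are accepted.
print(b.key_transform("i", lambda x: np.array(mx.ones(3)), output_key="o"))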

Cannot import load_cifar10: `/usr/local/lib/libsndfile.1.dylib` is `x86_64` but `arm64` is needed

Here's the traceback:

Traceback (most recent call last):
  File "/Users/codegod/apple-mlx/mlx-examples/cifar/main.py", line 7, in <module>
    from dataset import get_cifar10
  File "/Users/codegod/apple-mlx/mlx-examples/cifar/dataset.py", line 2, in <module>
    from mlx.data.datasets import load_cifar10
  File "/Users/codegod/apple-mlx/venv/lib/python3.11/site-packages/mlx/data/__init__.py", line 3, in <module>
    from ._c import *
ImportError: dlopen(/Users/codegod/apple-mlx/venv/lib/python3.11/site-packages/mlx/data/_c.cpython-311-darwin.so, 0x0002): Library not loaded: /opt/homebrew/opt/libsndfile/lib/libsndfile.1.dylib
  Referenced from: <9F7752F7-AB7A-3025-A4B7-0EFABE6D194E> /Users/codegod/apple-mlx/venv/lib/python3.11/site-packages/mlx/data/_c.cpython-311-darwin.so
  Reason: tried: '/usr/local/lib/libsndfile.1.dylib' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/usr/local/lib/libsndfile.1.dylib' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/opt/homebrew/opt/libsndfile/lib/libsndfile.1.dylib' (no such file), '/opt/homebrew/opt/libsndfile/lib/libsndfile.1.dylib' (no such file), '/libsndfile.1.dylib' (no such file), '/opt/homebrew/opt/libsndfile/lib/libsndfile.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libsndfile/lib/libsndfile.1.dylib' (no such file), '/opt/homebrew/opt/libsndfile/lib/libsndfile.1.dylib' (no such file)

An efficient implementation of BytePairTokenizer

As suggested by @angeloskath's code review in ml-explore/mlx-examples#315 (comment), an implementation of BytePairTokenizer seems useful for many use cases, but it is currently missing in mlx-data. I did some research on byte pair tokenization in transformers, and I think the implementation there is somewhat slow. More precisely, every time a merge could be done, it iterates over all possible adjacent symbol pairs to determine the optimal pair to merge, which implies quadratic time complexity (a sketch of this naive loop is included at the end of this issue). The referenced paper, however, gives an elegant linearithmic-time algorithm. Since that implementation requires some pointer trickery, it seems we could (relatively) easily implement it in C++ and expose it to Python.

I would appreciate your thoughts on:

  1. Do we want an implementation of BytePairTokenizer in C++?
  2. Do we want the faster implementation of BytePairTokenizer in C++, referenced in the paper?

Paper: https://arxiv.org/pdf/2306.16837.pdf
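
For illustration, a sketch of the naive quadratic loop described above (not the transformers source, just the shape of the algorithm; merge_ranks maps a symbol pair to its merge priority, lower merging first):

def naive_bpe(symbols: list[str], merge_ranks: dict[tuple[str, str], int]) -> list[str]:
    # Each round rescans every adjacent pair to find the best merge,
    # so k merges over n symbols cost O(n * k): quadratic in the worst case.
    while True:
        candidates = [
            (merge_ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merge_ranks
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)  # best-priority pair, leftmost on ties
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

# e.g. naive_bpe(list("aaab"), {("a", "a"): 0, ("aa", "ab"): 1, ("a", "b"): 2})
# -> ["aaab"]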

PNG Image support

Wish I had bumped into this repo earlier; I created a similar lib and see that you support most of what I was working on.

I did notice that PNG support is missing, as far as I can tell. Is there interest in adding it? I was using libpng and can bring that work over if desired.

C++ API Usage Example

I am reaching out to express my interest in the MLX-Data project. I've been exploring the documentation and am impressed with the breadth of features.

As a C++ developer, I'm particularly interested in integrating MLX-Data into my projects. However, the documentation focuses primarily on Python examples. While those are incredibly helpful, a C++ API usage example would greatly improve my understanding and let me use MLX-Data more effectively from C++.

Could the team consider providing a basic example of using the MLX-Data API from C++? Such an example would be immensely valuable for developers who prefer or are required to use C++; it would aid smoother integration and help broaden MLX-Data's user base in the C++ community.

auto conversion of string to int8 arrays in `buffer_from_vector` makes debugging quite hard

I noted that buffer_from_vector converts every value to an array. This may be fine for actual data, but it makes debugging very hard if something goes wrong in a data loading pipeline.

Take the example from the docs:

from pathlib import Path

import mlx.data as dx

def list_files(root: Path):
    files = list(root.rglob("*.jpg"))
    classes = sorted(set(f.parent.name for f in files))
    classes = dict((v, i) for i, v in enumerate(classes))
    return [
      {"file": str(f.relative_to(root)).encode(), "label": classes[f.parent.name]}
      for f in files
    ]

root = Path("/path/to/image/dataset")
dset = (
  dx.buffer_from_vector(list_files(root))
  .load_image("file", prefix=str(root), output_key="image")
)

The resulting dset has entries whose "file" values, e.g. dset[0]["file"], look like:

array([ 73,  77,  71,  95,  54,  55,  51,  50,  46, 106, 112, 101, 103],
      dtype=int8)

When running into an error, it is very hard to tell which actual file causes the problem.

Is there a good way to at least transform the int8 array back into a readable string? How do you handle string encoding when converting back and forth?
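
For what it's worth, a minimal sketch of the reverse conversion, assuming the values were produced by str.encode() on UTF-8 paths as in the example above:

import numpy as np

arr = np.array([73, 77, 71, 95, 54, 55, 51, 50, 46, 106, 112, 101, 103],
               dtype=np.int8)
# Reinterpret the int8 values as raw bytes and decode back to a string.
filename = arr.astype(np.uint8).tobytes().decode("utf-8")
print(filename)  # IMG_6732.jpeg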

Segfault with prefetching and MLX arrays in key transform

The following code segfaults on my machine (M1 Max, OS 14.2)

Some observations:

  • Using NumPy in place of MLX works fine
  • Only segfaults with prefetching

import mlx.core as mx
from mlx.data.datasets import load_cifar10

def get_cifar10(batch_size, root=None):
    tr = load_cifar10(root=root)

    mean = mx.array([0.485, 0.456, 0.406]).reshape((1, 1, 3))
    std = mx.array([0.229, 0.224, 0.225]).reshape((1, 1, 3))

    def normalize(x):
        x = x.astype("float32") / 255.0
        return (x - mean) / std

    tr_iter = (
        tr.shuffle()
        .to_stream()
        .image_random_h_flip("image", prob=0.5)
        .pad("image", 0, 4, 4, 0.0)
        .pad("image", 1, 4, 4, 0.0)
        .image_random_crop("image", 32, 32)
        .key_transform("image", normalize)
        .batch(batch_size)
        .prefetch(4, 4)
    )

    return tr_iter

if __name__ == "__main__":
    tr_iter = get_cifar10(256)
    for batch_counter, batch in enumerate(tr_iter):
        print(batch)
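
For comparison, a sketch of the NumPy variant of normalize that, per the observations above, does not segfault (assumed numerically equivalent):

import numpy as np

# Same math as normalize, but with NumPy arrays instead of MLX arrays.
mean_np = np.array([0.485, 0.456, 0.406]).reshape((1, 1, 3))
std_np = np.array([0.229, 0.224, 0.225]).reshape((1, 1, 3))

def normalize_np(x):
    x = x.astype("float32") / 255.0
    return (x - mean_np) / std_np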

Feature Request: `stream.checkpoint(path)`

  • Checkpointing data workloads is critical when training on larger corpora of data.
  • Checkpointing enables recovery or continuing training at a later date.
  • Checkpointing is necessary to restore state and prevent sample repeats (a hypothetical usage sketch follows).
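
To make the request concrete, a hypothetical usage sketch (checkpoint and restore do not exist in mlx-data today; the names and signatures are illustrative only):

import mlx.data as dx

# Hypothetical API: checkpoint()/restore() are proposed names, not real calls.
dset = dx.buffer_from_vector(samples).to_stream().batch(32).prefetch(8, 4)

for step, batch in enumerate(dset):
    train_step(batch)  # assumed user-defined training function
    if step % 1000 == 0:
        dset.checkpoint("ckpt/stream.state")  # proposed: persist stream position

# After a crash: restore the position and continue without repeating samples.
# dset = dset.restore("ckpt/stream.state")  # proposed counterpart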

Tokenization is extremely slow; am I doing something wrong?

Under what circumstances is MLX supposed to provide a speedup over sentencepiece? In a naive test with the same SPM .model file, I can tokenize 1000 batches in 13 seconds with sentencepiece, while it takes over 5 minutes with MLX. Hardware is an M2 MacBook Pro with 64GB unified memory. Is the CharTrie tokenization only useful when paired with key_transform? Are there plans to add a "tokenize_batch" with better parallelization/concurrency?

Code for reference:

from pathlib import Path
from typing import Literal, Union

import awkward as ak
import numpy as np
import tqdm
from sentencepiece import SentencePieceProcessor

class Tokenizer:
    def __init__(
        self, 
        model_path: str, 
        use_mlx: bool = False,
        mlx_tokenize_shortest: bool = False,
    ):
        assert Path(model_path).exists(), model_path
        self.use_mlx = use_mlx
        self.mlx_tokenize_shortest = mlx_tokenize_shortest
        self.spm_model = SentencePieceProcessor(model_file=model_path)
        assert self.spm_model.vocab_size() == self.spm_model.get_piece_size()
        if self.use_mlx:
            try:
                from mlx.data.core import Tokenizer as MLXTokenizer
                from mlx.data.tokenizer_helpers import read_trie_from_spm
            except ImportError:
                raise ImportError("Please install MLX to use MLX Tokenizer")
            trie, weights = read_trie_from_spm(model_path)
            try:
                self.mlx_model = MLXTokenizer(trie, trie_key_scores=weights)
                print(f"Loaded trie with {trie.num_keys()} keys.")
                assert self.spm_model.vocab_size() == trie.num_keys()
            except Exception as e:
                print(f"Unable to load trie into MLX Tokenizer: {e}. Using SentencePiece instead.")
                self.use_mlx = False

    @property
    def vocab_size(self) -> int:
        return self.spm_model.vocab_size()
    
    @property
    def bos_id(self) -> int:
        return self.spm_model.bos_id()
    
    @property
    def eos_id(self) -> int:
        return self.spm_model.eos_id()

    @property
    def pad_id(self) -> int:
        return self.spm_model.pad_id()
    
    @property
    def mask_id(self) -> int:
        # unknown kind of spiritually makes sense as mask
        return self.spm_model.unk_id()
    
    def __call__(
        self, 
        t: Union[str, list[str]],
        max_length: int,
        pack: bool = False,
        use_bos: bool = False,
        use_eos: bool = True,
        format: Literal["np", "mx"] = "mx",
        tokens_only: bool = False
    ) -> np.ndarray:
        if isinstance(t, str):
            t = [t]
        # pass tokens_only through; `format` is currently unused downstream
        return self.encode_batch(t, max_length, pack, use_bos, use_eos, tokens_only)

    def _pad(self, tokens: ak.Array, max_length: int):
        tokens = ak.pad_none(tokens, target=max_length, axis=-1, clip=True)
        tokens = ak.fill_none(tokens, self.pad_id)
        return tokens
    
    def _unpad(self, tokens: ak.Array):
        tokens = ak.from_regular(tokens)
        is_pad = tokens == self.pad_id
        return tokens[~is_pad]

    def encode_single_mlx(
        self,
        text: str,
    ):
        if self.mlx_tokenize_shortest:
            tokens = self.mlx_model.tokenize_shortest(text)
        else:
            tokens = self.mlx_model.tokenize_rand(text)
        return tokens
    
    def encode_batch(
        self, 
        batch: list[str],
        max_length: int,
        pack: bool = False,
        use_bos: bool = False,
        use_eos: bool = True,
        tokens_only: bool = False
    ) -> np.ndarray:
        if self.use_mlx:
            tokens = ak.Array([self.encode_single_mlx(text) for text in batch])
        else:
            tokens = ak.Array(self.spm_model.encode(batch))
        if use_bos:
            tokens = ak.concatenate([ak.full_like(tokens[:, :1], self.bos_id), tokens], axis=1)
        if use_eos:
            tokens = ak.concatenate([tokens, ak.full_like(tokens[:, :1], self.eos_id)], axis=1)
        if pack:
            # pack into batches of max_length
            sequence_ids = ak.zeros_like(tokens) + np.arange(len(tokens))
            tokens = ak.flatten(tokens, axis=None)
            sequence_ids = ak.flatten(sequence_ids, axis=None)
            n_to_truncate = len(tokens) % max_length
            if n_to_truncate:  # guard: a [:-0] slice would drop everything
                tokens = tokens[:-n_to_truncate]
                sequence_ids = sequence_ids[:-n_to_truncate]
            tokens = np.asarray(tokens).reshape(-1, max_length)
            sequence_ids = np.asarray(sequence_ids).reshape(-1, max_length)
            if tokens_only:
                return tokens
            return {
                "input_ids": tokens,
                "sequence_ids": sequence_ids, # these are for block diagonal attention
                "attention_mask": np.zeros_like(tokens)
            }
        else:
            # keep sequences separate; pad or truncate to max_length
            tokens = np.array(self._pad(tokens, max_length))
            mask = np.where(tokens != self.pad_id, 0, float("-inf"))
            if tokens_only:
                return tokens
            return {
                "input_ids": tokens,
                "sequence_ids": np.zeros_like(tokens),
                "attention_mask": mask
            }
    

def test():
    tokenizer_file = "tokenizer.model"
    spm_tokenizer = Tokenizer(tokenizer_file, use_mlx=False)
    mlx_tokenizer = Tokenizer(tokenizer_file, use_mlx=True)

    random_texts = [
        """Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."""
    ] * 500
    import time
    start = time.time()
    for i in tqdm.tqdm(range(1_000)):
        spm_tokenizer(random_texts, 512, pack=True, use_bos=False, use_eos=True)
    print(f"spm: {time.time() - start:.3f}s")

    start = time.time()
    for i in tqdm.tqdm(range(1_000)):
        mlx_tokenizer(random_texts, 512, pack=True, use_bos=False, use_eos=True)
    print(f"mlx: {time.time() - start:.3f}s")


test()

RuntimeError: load_stbi: could not load <.

Here's a simple 7-line script, based on the example in the readme, that tries to load a single image from the CelebA dataset:

import mlx.data as dx
dset = (
    dx.buffer_from_vector([{"image": "./000001.jpg"}])
    .to_stream()
    .load_image("image")
)
print(next(dset))

Here's the 000001.jpg from the dataset: https://0x0.st/H3TG.jpg

$ brew install libsndfile libsamplerate ffmpeg jpeg-turbo zlib bzip2 xz aws-sdk-cpp
$ python3 -m pip install mlx-data
$ python3 broken.py

Expected result

{ "image": .... }

Actual result

Traceback (most recent call last):
  File "/Users/stjo/broken/broken.py", line 7, in <module>
    print(next(dset))
          ^^^^^^^^^^
RuntimeError: load_stbi: could not load <.

python version: Python 3.11.6
mlx-data version: 0.0.1
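
One untested guess, based on the docs example earlier on this page that encodes paths with str.encode(): pass the filename as bytes rather than str.

import mlx.data as dx

# Untested guess: bytes path instead of a plain str.
dset = (
    dx.buffer_from_vector([{"image": b"./000001.jpg"}])
    .to_stream()
    .load_image("image")
)
print(next(dset))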

How to load only one image?

I see there is Buffer.load_image(), but it seems tricky to use. I just want to load a single image named painting.jpg. How do I do it?
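
A minimal sketch based on the ImageNet-style example earlier on this page (note the path is passed as bytes, matching the docs' str.encode() convention):

import mlx.data as dx

# One-element buffer; load_image decodes the file into the "image" key.
dset = (
    dx.buffer_from_vector([{"file": b"painting.jpg"}])
    .load_image("file", output_key="image")
)
print(dset[0]["image"].shape)  # (H, W, C)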

403: Forbidden in MNIST

Getting a 403: Forbidden when running the MNIST example from the documentation.

The error occurs in this part

# Let's import MNIST loading
from mlx.data.datasets import load_mnist

# Loads a buffer with the MNIST images
mnist_train = load_mnist(train=True)

How to choose prefetch batch and thread counts?

Here's a code snippet from the "Buffers, Streams and Samples" page of the docs.

# We can define the rest of the processing pipeline using streams.
# 1. First shuffle the buffer
# 2. Make a stream
# 3. Batch and then prefetch
dset = (
    dset
    .shuffle()
    .to_stream()  # <-- making a stream from the shuffled buffer
    .batch(32)
    .prefetch(8, 4)  # <-- prefetch 8 batches using 4 threads
)

# Now we can iterate over dset
sample = next(dset)

I don't know how to choose the values passed to prefetch. Could someone suggest a general rule?

`image_random_h_flip` returns uint8 array filled with zeros if input is not uint8

I noticed that image_random_h_flip will always return a uint8 array filled with zeros if the input is not of dtype uint8. Here is a reproducer:

import numpy as np
import mlx.data as dx

img_uint8 = np.arange(4, dtype=np.uint8).reshape((2, 2, 1))
img_int64 = np.arange(4).reshape((2, 2, 1))
print(repr(img_uint8.squeeze(-1)))
print(repr(img_int64.squeeze(-1)))

buf = dx.buffer_from_vector([dict(image=img_uint8)]).image_random_h_flip("image", prob=1.0)
flipped = buf[0]["image"].squeeze(-1)
print("flipped uint8:")
print(repr(flipped))

buf = dx.buffer_from_vector([dict(image=img_int64)]).image_random_h_flip("image", prob=1.0)
flipped = buf[0]["image"].squeeze(-1)
print("flipped int64:")
print(repr(flipped))

output

array([[0, 1],
       [2, 3]], dtype=uint8)
array([[0, 1],
       [2, 3]])
flipped uint8:
array([[1, 0],
       [3, 2]], dtype=uint8)
flipped int64:
array([[0, 0],
       [0, 0]], dtype=uint8)

I'd suggest throwing an exception if implementing it for other dtypes is not feasible at the moment.
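
In the meantime, a workaround sketch implied by the reproducer above (an assumption on my part, and only valid when the values fit in uint8):

# Cast to uint8 before the flip, since the uint8 path behaves correctly.
buf = dx.buffer_from_vector([dict(image=img_int64.astype(np.uint8))])
buf = buf.image_random_h_flip("image", prob=1.0)
print(repr(buf[0]["image"].squeeze(-1)))  # expected: [[1, 0], [3, 2]], dtype=uint8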

Would an async API have performance gains?

Fantastic job! I'd like to ask whether there would be additional gains from an async implementation bound to Python via __anext__ and __aiter__, since data loading is heavily IO-bound.

load_mnist returned buffer

Great job, thank you. I would like to train a BERT classifier, but how do I create a buffer with several "types" of samples, like input_ids, attention_mask, and labels? Something similar to what load_mnist returns, for instance. Thanks in advance.
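
A minimal sketch of what I mean, assuming the tokenized samples are already available as numpy arrays (buffer_from_vector turns each dict key into a separate field, just like the "image" and "label" keys returned by load_mnist):

import numpy as np
import mlx.data as dx

# Hypothetical sample values, for illustration only.
samples = [
    {
        "input_ids": np.array([101, 2023, 102], dtype=np.int32),
        "attention_mask": np.ones(3, dtype=np.int32),
        "label": 1,
    },
    # ... more samples
]
dset = dx.buffer_from_vector(samples)
print(dset[0]["input_ids"], dset[0]["label"])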

Enable Discussions for this repo

Hello,

I would like to ask the team to enable Discussions for this repository. Datasets are of utmost importance in deep learning, and I (and probably others) have a lot of questions that are not about issues with the source code but about how to use it.

Thanks.

Feature request: Support negative indexing on buffers

Currently dx.Buffer does not seem to support Pythonic negative indexing:

b = dx.buffer_from_vector(list(dict(i=i) for i in range(5)))

b[-1] throws

RuntimeError: FromVector: index out of range

However, b[len(b) - 1] works as expected. It would be nice to support the Python convention of negative indices.
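
In the meantime, a small hypothetical helper wrapping the b[len(b) - 1] trick above:

# Hypothetical helper until negative indexing is supported natively.
def buffer_get(buf, idx):
    return buf[idx + len(buf)] if idx < 0 else buf[idx]

buffer_get(b, -1)  # equivalent to b[len(b) - 1]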

ZeroDivisionError in download speed calculation

I was trying to run the cvae example from mlx-examples, and I ran into the following error while downloading the dataset:

[...]
  File "mlx-examples/venv/lib/python3.11/site-packages/mlx/data/datasets/common.py", line 35, in message
    speed = self.current_size / (self.current_time - self.start_time)
            ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ZeroDivisionError: float division by zero

I'm already preparing a pull request for this.
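
For reference, a minimal sketch of the kind of guard I have in mind (the actual patch may differ):

# Avoid dividing by zero when no time has elapsed yet.
elapsed = self.current_time - self.start_time
speed = self.current_size / elapsed if elapsed > 0 else 0.0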

`mlx-data-0.0.1` does not install on Python 3.12

Running pip install mlx-data on Python 3.12 installs mlx-data-0.0.0 instead:

Collecting mlx-data
  Downloading mlx_data-0.0.0-py3-none-any.whl.metadata (170 bytes)
Downloading mlx_data-0.0.0-py3-none-any.whl (1.0 kB)
Installing collected packages: mlx-data
Successfully installed mlx-data-0.0.0
