
activeloopai / deeplake


Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Home Page: https://activeloop.ai

License: Mozilla Public License 2.0

Languages: Python 99.98%, Dockerfile 0.01%, Shell 0.01%
Topics: ai, computer-vision, cv, data-science, data-version-control, datalake, datasets, deep-learning, image-processing, langchain, large-language-models, llm, machine-learning, ml, mlops, python, pytorch, tensorflow, vector-database, vector-search

deeplake's People

Contributors

abhinavtuli, activesoull, adolkhan, artgish, as-engineer, benchislett, davidbuniat, dependabot-preview[bot], dgaloop, dhiganthrao, edogrigqv2, farizrahman4u, fayazrahman, haiyangdeperci, imshashank, istranic, khustup, khustup2, kristinagrig06, levongh, mikayelh, mynameisvinn, nvoxland, nvoxland-al, progerdav, sounakr, sparkingdark, tatevikh, thisiseshan, verbose-void


deeplake's Issues

Dataset Caching

Describe the feature

Zarr caching is per array, not shared. Please come up with shared caching; once the dataset is uploaded, the shared storage needs to have the option to write back to the arrays. The Dataset will have a .commit() function; once the cache gets an alert that the dataset is ready, call .commit().
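
Below is a rough sketch of the intended behavior, assuming a hypothetical SharedCache class (not the existing API): all arrays write through one cache, and .commit() flushes whatever is still cached to the underlying store.

from collections import OrderedDict

class SharedCache:
    """Hypothetical: one LRU cache shared by all arrays of a dataset."""

    def __init__(self, storage, max_bytes):
        self._storage = storage      # underlying store (e.g. an S3 / Zarr store)
        self._cache = OrderedDict()  # key -> bytes, shared across arrays
        self._max_bytes = max_bytes

    def __setitem__(self, key, value):
        self._cache[key] = value
        self._cache.move_to_end(key)
        while sum(len(v) for v in self._cache.values()) > self._max_bytes:
            old_key, old_value = self._cache.popitem(last=False)
            self._storage[old_key] = old_value  # spill the oldest entry

    def commit(self):
        # Flush everything still cached once the dataset is ready.
        for key, value in self._cache.items():
            self._storage[key] = value
        self._cache.clear()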

Additional notes

PR to release/v1.0

Slices views of datasets

Describe Feature

Implement virtual datasets

  1. Get and set data through a subview (slice) of a dataset, as sketched below.
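
A hedged sketch of what the user-facing behavior could look like; the slicing syntax (ds[10:20]) and the way views are indexed here are assumptions, not the current API.

import numpy as np
from hub import tensor, dataset

images = tensor.from_array(np.zeros((100, 32, 32), dtype="uint8"))
ds = dataset.from_tensors({"images": images})

# A virtual view over samples 10..19, without copying the underlying data.
view = ds[10:20]

# Reads and writes go through the view and land in the parent dataset.
sample = view["images"][0]
view["images"][0] = np.ones((32, 32), dtype="uint8")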

Additional Notes

PR to release/v1.0

From PyTorch dataset to Hub format

Describe your feature request

Create a converter that takes any PyTorch dataset and converts it into Hub format.

A simple test would be

from hub import datasets
import torch
import torchvision

imagenet = torchvision.datasets.ImageNet(path, split='train', transform=...)
ds = datasets.from_pytorch(imagenet)
ds = ds.store("/tmp/imagenet")
ds = ds.to_pytorch()

Please follow the Contribution guidelines and look into the dataset uploading guidelines for more details about meta information.

Errors in Dataset docs

I came across the following errors while referencing the docs:

  1. Under the Guidelines sub-heading, there is an enumeration error. Proposed solution: using an indentation of 4 spaces fixes the issue. To keep the page uniform, the same change is applied to every code block.
  2. Under the How to Upload a Dataset sub-heading, the links for the MNIST, CIFAR, and COCO examples are updated.

Store and Load Models

Describe Feature

We want to be able to store and load models, similar to datasets. A model has a computational graph and weights. Look into how PyTorch saves and loads models; check ONNX and TF later.

from hub import model

resnet = model.load("username/resnet")
model.store("username/resnet2", resnet)
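
For reference, this is how PyTorch itself saves and loads a model (weights via state_dict); the hub implementation could wrap this flow. This is a plain PyTorch sketch, not the proposed hub API.

import torch
import torchvision

resnet = torchvision.models.resnet18()

# PyTorch convention: persist the weights (state_dict) rather than the object.
torch.save(resnet.state_dict(), "/tmp/resnet18.pt")

# Loading means re-creating the architecture, then restoring the weights.
restored = torchvision.models.resnet18()
restored.load_state_dict(torch.load("/tmp/resnet18.pt"))
restored.eval()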

Notes

  1. Start with PyTorch.
  2. Implement TensorFlow support.
  3. Include ONNX and other formats.

optional dependency torch failing on import hub

Trying to use the package fails:

>>> import hub
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/__init__.py", line 2, in <module>
    from .creds import Base as Creds
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/creds.py", line 4, in <module>
    from .bucket import Bucket
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/bucket.py", line 7, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

This is a really bad first experience. Is there a way to use hub without torch being installed?
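
One possible fix is to defer the torch import until a torch-specific feature is actually used; the helper below is a minimal sketch with illustrative names, not the current hub code.

def _require_torch():
    try:
        import torch  # imported lazily, only when a torch feature is used
    except ImportError as exc:
        raise ImportError(
            "This feature requires the optional dependency 'torch'; "
            "install it with `pip install torch`."
        ) from exc
    return torch

def to_pytorch(array):
    torch = _require_torch()
    return torch.as_tensor(array)  # torch is only touched when this runs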

Add VIRAT Video Dataset

Describe the dataset

Add the VIRAT dataset to Hub, so that this would work:

import hub
ds = hub.load("username/VIRAT")

Here's a tutorial for uploading datasets using Hub.

Add CI/CD

Describe your feature request

Please add CI/CD for open source development

  • Automatically run tests, including:
    • Embedded AWS S3 credentials connection tests
    • Embedded GCS credentials tests
    • PyTorch/TensorFlow tests
  • Move docs here from dataflow and automatically deploy docs from this repo
  • On tag, build a package and deploy it to PyPI
  • Add a CI/CD badge to the README
  • Add a test coverage badge to the README

Add Pascal Dataset

Describe the dataset

Add the Pascal dataset to Hub, so that this would work:

import hub
ds = hub.load("username/pascal")

Here's a tutorial for uploading datasets using Hub.

Same values in Dataset

I have a Dataset of logs, which is defined like this:

logs = Dataset(dtype={"train_acc": float, "train_loss": float, "val_acc": float, "val_loss": float},
                         shape=(epochs,), url='./logs', mode='w')

I also have some average metrics stored in a dict, i.e.

metrics =  {'val_loss': AverageValue,   'val_acc': AverageValue, 'train_loss': .......}
metrics['val_loss'].avg   # tensor(1.2748, device='cuda:0')
metrics['val_acc'].avg    # tensor(0.5000, device='cuda:0')

To store those metrics in logs, I run:

for key, value in self.meters.items():
    self.logs[key][value.count - 1] = value.avg  #value.count is an index of a value starting from 1

But when I run

logs['val_acc'][0].numpy()
logs['val_loss'][0].numpy()
logs['train_acc'][0].numpy()
logs['train_loss'][0].numpy()

all these values are equal.

Datatypes to Zarr files

Describe the feature

Zarr needs max_shape, url, and other parameters. We need a component that parses the hierarchy of datatypes and lets the dataset create the Zarr arrays, similar to StorageTensor and DynamicTensor (it is an API). This is currently implemented in _flatten; we need to make it more robust.
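
A rough sketch of the kind of flattening meant here; the function name and the (path, dtype) output format are assumptions for illustration.

def flatten_dtype(dtype, prefix=""):
    """Walk a nested dtype spec and yield one (path, leaf_dtype) pair
    per Zarr array the dataset would need to create."""
    if isinstance(dtype, dict):
        for key, sub in dtype.items():
            yield from flatten_dtype(sub, f"{prefix}/{key}")
    else:
        yield prefix, dtype

spec = {"image": {"data": "uint8", "mask": "bool"}, "label": "int64"}
print(list(flatten_dtype(spec)))
# [('/image/data', 'uint8'), ('/image/mask', 'bool'), ('/label', 'int64')]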

Solution

  • Decide on API
  • Add max_shape
  • Decide on how to chunk automatically

Kinetics-700

Describe the dataset

Add the Kinetics-700 dataset to Hub, so that this would work:

import hub
ds = hub.load("username/kinetics-700")

Here's a tutorial for uploading datasets using Hub.

Available Datasets

I want to thank you first for the great work.
My question is: are you planning to provide pre-loaded datasets like the ImageNet example?
I tried to load the nuScenes dataset, but it is not working.

hub.array assignment not as robust as np.array

hub_array[0, :,:,:] = image
raises a MemoryError: unable to allocate 112GiB with shape ...

while
hub_array[0] = image
works fine as expected.

With a NumPy array, both versions work the same way.

It took us several hours to debug the issue, and in the end it was just a slight incompatibility with np.ndarray.
It would be nice to support the first assignment form as well, in case someone else tries it expecting NumPy behavior.
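
For reference, this is the NumPy behavior being compared against: both assignment forms write the same data.

import numpy as np

a = np.zeros((4, 64, 64, 3), dtype="uint8")
image = np.ones((64, 64, 3), dtype="uint8")

a[0] = image           # basic assignment
a[1, :, :, :] = image  # explicit-slice assignment
assert (a[0] == a[1]).all()  # identical results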

class labels access and labels shape

Hi,
If I understand it correctly, the shape of 'labels', which is (frame_count, 11, 400, 7), corresponds to the 7 box values (center coordinates, box dimensions, and heading) for a maximum of 400 obstacles in each of the images, lasers_camera_projection, and lasers_range_image.
I have two questions here,

  1. Where can I access the class labels? Eg. Pedestrian, Vehicle etc.
  2. Considering the number of images, lasers_camera_projection, lasers_range_image to be 5 each, should it be 15 in place of 11?

Dynamic Shape Handling

Describe Feature details

Implement dynamic shapes for tensors, with chunking handled accordingly; a sketch follows the list below.

  1. Automatic size extension/expansion.
  2. Boundary checking
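
A small illustrative sketch of the two points above (automatic extension plus a boundary check); the class and its behavior are assumptions, not the existing DynamicTensor.

import numpy as np

class GrowableTensor:
    """Illustrative only: grows along the first axis on out-of-range writes."""

    def __init__(self, shape, dtype="float32"):
        self._data = np.zeros(shape, dtype=dtype)

    def __setitem__(self, index, value):
        if index < 0:
            raise IndexError("negative indices are rejected")  # boundary check
        if index >= self._data.shape[0]:
            # Automatic size extension: grow the first dimension as needed.
            new_len = max(index + 1, 2 * self._data.shape[0])
            grown = np.zeros((new_len, *self._data.shape[1:]), self._data.dtype)
            grown[: self._data.shape[0]] = self._data
            self._data = grown
        self._data[index] = value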

Additional notes

PR to release/v1.0

Add Barcelona Dataset

Describe the dataset

Add the Barcelona dataset to Hub, so that this would work:

import hub
ds = hub.load("username/barcelona")

Here's a tutorial for uploading datasets using Hub.

s3 access via IAM-role doesn't seem to work

Trying to connect to S3 without explicit credentials, by running from an EC2 instance with an IAM role that allows access:

datahub = hub.s3('my-bucket').connect()

(boto3 seems to support this mode: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#iam-role)

But I get access denied:

Traceback (most recent call last):
  File "hub_load_s3.py", line 13, in <module>
    imagenet = datahub.open('imagenet/test:latest')
  File "/usr/local/lib/python3.6/dist-packages/hub/bucket.py", line 41, in open
    jsontext = self._storage.get_or_none(os.path.join(name, "info.json"))
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/retry_wrapper.py", line 30, in get_or_none
    return self._internal.get_or_none(path)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 55, in get_or_none
    raise err
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 48, in get_or_none
    Key=path,
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

I do manage to access S3 from this machine just fine with the AWS CLI, without configured credentials.

Is this mode of s3 authentication supported?
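
For reference, boto3 resolves credentials through its default chain (environment variables, config files, then the EC2 instance profile), so a client created without explicit keys should pick up the IAM role on that machine. A quick sanity check (bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")  # no keys passed; falls back to the instance's IAM role
print(boto3.Session().get_credentials().method)  # e.g. "iam-role" on EC2
s3.head_object(Bucket="my-bucket", Key="imagenet/test/info.json")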

Concerns related to provided guidelines.

Cannot find the Uploading MNIST, Uploading CIFAR, and Uploading COCO URLs provided in the Guidelines. It seems they do not exist.

Creation of the dataset seems to be successful:

from hub import tensor, dataset
import numpy as np

images = tensor.from_array(np.zeros((4, 512, 512)))
labels = tensor.from_array(np.zeros((4, 512, 512)))

ds = dataset.from_tensors({"images": images, "labels": labels})
ds.store("username/dataset")  # Upload

but I cannot see the uploaded dataset at https://app.activeloop.ai/datasets.

The subsequent issues stem from the lack of information mentioned above.

Fixes in to_tensorflow method

Observed a couple of problems while converting stored datasets to TensorFlow format that need some small fixes.

to_tensorflow fails when the meta information for a tensor includes dtype="object" (the "object" dtype has been used for images, area, id, and bbox in the COCO dataset: https://github.com/activeloopai/Hub/blob/master/examples/coco/upload_coco2017.py#L24).
A fix is to keep dtype="uint8" or something similar while uploading. The COCO example needs to be updated to reflect this.

to_tensorflow also fails when it gets shape=(1,) in the meta while the actual object has multiple dimensions, for example an image.
This can be fixed by commenting out this line (https://github.com/activeloopai/Hub/blob/master/hub/collections/dataset/core.py#L633), which will set output_shapes to None by default.

to_pytorch works fine in both the above cases.

Create a Colab demo

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Merging all together

Describe the feature

  • Combine all PRs together
  • Finalize the user API
  • Test backend
  • Add documentation

Notes

PR to release/v1.0

Create a tutorial on Colab

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Add code test coverage

Describe Task

Add code test coverage for the repository.

Notes

Feel free to use any online resource and connect it with CircleCI.

PermissionException on AWS

Facing issues with ds.store() on AWS while the same code works properly locally.
Error: hub.exceptions.PermissionException: No permision to store the dataset at s3://snark-hub/public/abhinav/ds

For now, got it working using sudo rm -rf /tmp/dask-worker-space/.
A proper fix is needed.

Support TFDS datasets

Describe your feature request

Create a converter that takes any TensorFlow Datasets (TFDS) dataset and converts it into Hub format.

from hub import datasets
import tensorflow_datasets as tfds

tf_ds = tfds.load('mnist', split='train', shuffle_files=True)  # avoid shadowing the tfds module
ds = datasets.from_tensorflow(tf_ds)
ds = ds.store("/tmp/mnist")
hub_tf_ds = ds.to_tensorflow()

# assert hub_tf_ds == tf_ds <- sample from both datasets and check if they are the same
hub_py_ds = ds.to_pytorch()

# assert hub_tf_ds == hub_py_ds <- sample from both datasets and check if they are the same

A more advanced test would require running the conversion for each dataset in parallel:

for name in tfds.list_builders():
    tf_ds = tfds.load(name, ...)
    ds = datasets.from_tensorflow(tf_ds)
    ds = ds.store(f"/tmp/{name}")

Notes

  • General advice: start with simple, small datasets (low-hanging fruit), commit often, maybe with intermediate PRs to master, then steadily generalize your converter.
  • TFDS has FeaturesDict for describing data archetypes; we need to rely on it.
  • This task will require significantly extending how Hub handles different data and data types (dtags).
  • At every step, think about how uploading a dataset could be simplified from the user's perspective.
  • While converting datasets, take compression into consideration.

Please follow the Contribution guidelines and look into the dataset uploading guidelines for more details about meta information.

Serialization of data types

Describe your feature

Implement serialization of the Dataset structure. Then implement Tensor derivatives such as Image, ClassLabel, Mask, Segmentation, Polygon, Bounding Box, Tabular, and other TFDS Features.

There are two subtasks:

  1. Serialize and deserialize the metadata into meta.json (see the sketch after this list).
  2. Implement Tensor derivatives.
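
A minimal sketch of subtask 1, assuming a simple dict-based schema; the meta.json layout shown here is an assumption, not the final format.

import json

def serialize(schema, path="meta.json"):
    # Write the dataset structure (name -> type/dtype/shape) to meta.json.
    with open(path, "w") as f:
        json.dump(schema, f, indent=2)

def deserialize(path="meta.json"):
    with open(path) as f:
        return json.load(f)

schema = {
    "image": {"type": "Image", "dtype": "uint8", "shape": [None, 512, 512, 3]},
    "label": {"type": "ClassLabel", "num_classes": 10},
}
serialize(schema)
assert deserialize() == schema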

Additional Notes

Have a call with @edogrigqv2 to get started.
Please open a pull request to release/v1.0 and ask for a review.
