
activeloopai / deeplake


Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Home Page: https://activeloop.ai

License: Mozilla Public License 2.0

Languages: Python 99.98%, Dockerfile 0.01%, Shell 0.01%
Topics: ai, computer-vision, cv, data-science, data-version-control, datalake, datasets, deep-learning, image-processing, langchain, large-language-models, llm, machine-learning, ml, mlops, python, pytorch, tensorflow, vector-database, vector-search

deeplake's People

Contributors

abhinavtuli, activesoull, adolkhan, artgish, as-engineer, benchislett, davidbuniat, dependabot-preview[bot], dgaloop, dhiganthrao, edogrigqv2, farizrahman4u, fayazrahman, haiyangdeperci, imshashank, istranic, khustup, khustup2, kristinagrig06, levongh, mikayelh, mynameisvinn, nvoxland, nvoxland-al, progerdav, sounakr, sparkingdark, tatevikh, thisiseshan, verbose-void


deeplake's Issues

Dataset Caching

Describe the feature

Zarr caching is per array, not shared. Please come up with shared caching; once the dataset is uploaded, the shared storage needs to have the option to write back to the arrays. The Dataset will have a .commit() function; once the cache gets an alert that the dataset is ready, call .commit().
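
Below is a rough sketch of the intended behavior, assuming a hypothetical SharedCache class (not the existing API): all arrays write through one cache, and .commit() flushes whatever is still cached to the underlying store.

from collections import OrderedDict

class SharedCache:
    """Hypothetical: one LRU cache shared by all arrays of a dataset."""

    def __init__(self, storage, max_bytes):
        self._storage = storage      # underlying store (e.g. an S3 / Zarr store)
        self._cache = OrderedDict()  # key -> bytes, shared across arrays
        self._max_bytes = max_bytes

    def __setitem__(self, key, value):
        self._cache[key] = value
        self._cache.move_to_end(key)
        while sum(len(v) for v in self._cache.values()) > self._max_bytes:
            old_key, old_value = self._cache.popitem(last=False)
            self._storage[old_key] = old_value  # spill the oldest entry

    def commit(self):
        # Flush everything still cached once the dataset is ready.
        for key, value in self._cache.items():
            self._storage[key] = value
        self._cache.clear()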

Additional notes

PR to release/v1.0

Slices views of datasets

Describe Feature

Implement virtual datasets

  1. Get and set data through a subview (slice) of a dataset, as sketched below.
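
A hedged sketch of what the user-facing behavior could look like; the slicing syntax (ds[10:20]) and the way views are indexed here are assumptions, not the current API.

import numpy as np
from hub import tensor, dataset

images = tensor.from_array(np.zeros((100, 32, 32), dtype="uint8"))
ds = dataset.from_tensors({"images": images})

# A virtual view over samples 10..19, without copying the underlying data.
view = ds[10:20]

# Reads and writes go through the view and land in the parent dataset.
sample = view["images"][0]
view["images"][0] = np.ones((32, 32), dtype="uint8")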

Additional Notes

PR to release/v1.0

From PyTorch dataset to Hub format

Describe your feature request

Create a converter that takes any PyTorch dataset and converts it into Hub format.

A simple test would be

from hub import datasets
import torch
import torchvision

imagenet = torchvision.datasets.ImageNet(path, split='train', transform=...)
ds = datasets.from_pytorch(imagenet)
ds = ds.store("/tmp/imagenet")
ds = ds.to_pytorch()

Please follow the Contribution guidelines and look into the dataset uploading guidelines for more details about meta information.

Errors in Dataset docs

I came across the following errors while referencing the docs:

  1. Under the Guidelines sub-heading, there is an enumeration error. Proposed solution: using an indentation of 4 spaces fixes the issue. To keep the page uniform, the same change is applied to every code block.
  2. Under the How to Upload a Dataset sub-heading, the links for the MNIST, CIFAR, and COCO examples are updated.

Store and Load Models

Describe Feature

We want to be able to store and load models, similar to datasets. A model has a computational graph and weights. Look into how PyTorch saves and loads models; check ONNX and TF later.

from hub import model

resnet = model.load("username/resnet")
model.store("username/resnet2", resnet)
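
For reference, this is how PyTorch itself saves and loads a model (weights via state_dict); the hub implementation could wrap this flow. This is a plain PyTorch sketch, not the proposed hub API.

import torch
import torchvision

resnet = torchvision.models.resnet18()

# PyTorch convention: persist the weights (state_dict) rather than the object.
torch.save(resnet.state_dict(), "/tmp/resnet18.pt")

# Loading means re-creating the architecture, then restoring the weights.
restored = torchvision.models.resnet18()
restored.load_state_dict(torch.load("/tmp/resnet18.pt"))
restored.eval()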

Notes

  1. Start with PyTorch.
  2. Implement TensorFlow support.
  3. Include ONNX and other formats.

optional dependency torch failing on import hub

Trying to use the package fails:

>>> import hub
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/__init__.py", line 2, in <module>
    from .creds import Base as Creds
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/creds.py", line 4, in <module>
    from .bucket import Bucket
  File "/Users/ram/.pyenv/versions/3.8.2/lib/python3.8/site-packages/hub/bucket.py", line 7, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

This is a really bad first experience. Is there a way to use hub without torch being installed?
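
One possible fix is to defer the torch import until a torch-specific feature is actually used; the helper below is a minimal sketch with illustrative names, not the current hub code.

def _require_torch():
    try:
        import torch  # imported lazily, only when a torch feature is used
    except ImportError as exc:
        raise ImportError(
            "This feature requires the optional dependency 'torch'; "
            "install it with `pip install torch`."
        ) from exc
    return torch

def to_pytorch(array):
    torch = _require_torch()
    return torch.as_tensor(array)  # torch is only touched when this runs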

Add VIRAT Video Dataset

Describe the dataset

Add the VIRAT dataset to Hub, so that this would work:

import hub
ds = hub.load("username/VIRAT")

Here's a tutorial for uploading datasets using Hub.

Add CI/CD

Describe your feature request

Please add CI/CD for open source development

  • Automatically run tests, including:
    • Embedded AWS S3 credentials connection tests
    • Embedded GCS credentials tests
    • PyTorch/TensorFlow tests
  • Move docs here from dataflow and automatically deploy docs from this repo
  • On tag, build a package and deploy it to PyPI
  • Add a CI/CD badge to the README
  • Add a test coverage badge to the README

Add Pascal Dataset

Describe the dataset

Add the Pascal dataset to Hub, so that this would work:

import hub
ds = hub.load("username/pascal")

Here's a tutorial for uploading datasets using Hub.

Same values in Dataset

I have a Dataset of logs, which is defined like this:

logs = Dataset(dtype={"train_acc": float, "train_loss": float, "val_acc": float, "val_loss": float},
                         shape=(epochs,), url='./logs', mode='w')

I also have some average metrics stored in a dict, i.e.

metrics =  {'val_loss': AverageValue,   'val_acc': AverageValue, 'train_loss': .......}
metrics['val_loss'].avg   # tensor(1.2748, device='cuda:0')
metrics['val_acc'].avg    # tensor(0.5000, device='cuda:0')

To store those metrics in logs, I run:

for key, value in self.meters.items():
    self.logs[key][value.count - 1] = value.avg  #value.count is an index of a value starting from 1

But when I run

logs['val_acc'][0].numpy()
logs['val_loss'][0].numpy()
logs['train_acc'][0].numpy()
logs['train_loss'][0].numpy()

all these values are equal.

Datatypes to Zarr files

Describe the feature

Zarr needs max_shape, url, and other parameters. We need a component that parses the hierarchy of datatypes and lets the dataset create the Zarr arrays, similar to StorageTensor and DynamicTensor (it is an API). This is currently implemented in _flatten; we need to make it more robust.
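
A rough sketch of the kind of flattening meant here; the function name and the (path, dtype) output format are assumptions for illustration.

def flatten_dtype(dtype, prefix=""):
    """Walk a nested dtype spec and yield one (path, leaf_dtype) pair
    per Zarr array the dataset would need to create."""
    if isinstance(dtype, dict):
        for key, sub in dtype.items():
            yield from flatten_dtype(sub, f"{prefix}/{key}")
    else:
        yield prefix, dtype

spec = {"image": {"data": "uint8", "mask": "bool"}, "label": "int64"}
print(list(flatten_dtype(spec)))
# [('/image/data', 'uint8'), ('/image/mask', 'bool'), ('/label', 'int64')]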

Solution

  • Decide on API
  • Add max_shape
  • Decide on how to chunk automatically

Kinetics-700

Describe the dataset

Add the Kinetics-700 dataset to Hub, so that this would work:

import hub
ds = hub.load("username/kinetics-700")

Here's a tutorial for uploading datasets using Hub.

Available Datasets

I want to thank you first for the great work.
My question is: are you planning to provide pre-loaded datasets like the ImageNet example?
I tried to load the nuScenes dataset, but it is not working.

hub.array assignment not as robust as np.array

hub_array[0, :,:,:] = image
raises a MemoryError: unable to allocate 112GiB with shape ...

while
hub_array[0] = image
works fine as expected.

With a NumPy array, both versions work the same way.

It took us several hours to debug the issue, and in the end it was just a slight incompatibility with np.ndarray.
It would be nice to support the first assignment form as well, in case someone else tries it expecting NumPy behavior.
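
For reference, this is the NumPy behavior being compared against: both assignment forms write the same data.

import numpy as np

a = np.zeros((4, 64, 64, 3), dtype="uint8")
image = np.ones((64, 64, 3), dtype="uint8")

a[0] = image           # basic assignment
a[1, :, :, :] = image  # explicit-slice assignment
assert (a[0] == a[1]).all()  # identical results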

class labels access and labels shape

Hi,
If I understand it correctly, the shape of 'labels', which is (frame_count, 11, 400, 7), corresponds to the 7 box values (center coordinates, box dimensions, and heading) for a maximum of 400 obstacles in each of the images, lasers_camera_projection, and lasers_range_image.
I have two questions here,

  1. Where can I access the class labels? Eg. Pedestrian, Vehicle etc.
  2. Considering the number of images, lasers_camera_projection, lasers_range_image to be 5 each, should it be 15 in place of 11?

Dynamic Shape Handling

Describe Feature details

Implement dynamic shapes for tensors, with chunking handled accordingly; a sketch follows the list below.

  1. Automatic size extension/expansion.
  2. Boundary checking
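
A small illustrative sketch of the two points above (automatic extension plus a boundary check); the class and its behavior are assumptions, not the existing DynamicTensor.

import numpy as np

class GrowableTensor:
    """Illustrative only: grows along the first axis on out-of-range writes."""

    def __init__(self, shape, dtype="float32"):
        self._data = np.zeros(shape, dtype=dtype)

    def __setitem__(self, index, value):
        if index < 0:
            raise IndexError("negative indices are rejected")  # boundary check
        if index >= self._data.shape[0]:
            # Automatic size extension: grow the first dimension as needed.
            new_len = max(index + 1, 2 * self._data.shape[0])
            grown = np.zeros((new_len, *self._data.shape[1:]), self._data.dtype)
            grown[: self._data.shape[0]] = self._data
            self._data = grown
        self._data[index] = value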

Additional notes

PR to release/v1.0

Add Barcelona Dataset

Describe the dataset

Add the Barcelona dataset to Hub, so that this would work:

import hub
ds = hub.load("username/barcelona")

Here's a tutorial for uploading datasets using Hub.

s3 access via IAM-role doesn't seem to work

Trying to connect to S3 without explicit credentials, by running from an EC2 instance with an IAM role that allows access:

datahub = hub.s3('my-bucket').connect()

(boto3 seems to support this mode: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#iam-role)

But I get access denied:

Traceback (most recent call last):
  File "hub_load_s3.py", line 13, in <module>
    imagenet = datahub.open('imagenet/test:latest')
  File "/usr/local/lib/python3.6/dist-packages/hub/bucket.py", line 41, in open
    jsontext = self._storage.get_or_none(os.path.join(name, "info.json"))
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/retry_wrapper.py", line 30, in get_or_none
    return self._internal.get_or_none(path)
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 55, in get_or_none
    raise err
  File "/usr/local/lib/python3.6/dist-packages/hub/storage/s3.py", line 48, in get_or_none
    Key=path,
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

I do manage to access S3 from this machine just fine with the AWS CLI, without configured credentials.

Is this mode of s3 authentication supported?
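
For reference, boto3 resolves credentials through its default chain (environment variables, config files, then the EC2 instance profile), so a client created without explicit keys should pick up the IAM role on that machine. A quick sanity check (bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")  # no keys passed; falls back to the instance's IAM role
print(boto3.Session().get_credentials().method)  # e.g. "iam-role" on EC2
s3.head_object(Bucket="my-bucket", Key="imagenet/test/info.json")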

Concerns related to provided guidelines.

Cannot find the Uploading MNIST, Uploading CIFAR, and Uploading COCO URLs provided in the Guidelines. It seems they do not exist.

Creation of the dataset seems to be successful:

from hub import tensor, dataset
import numpy as np

images = tensor.from_array(np.zeros((4, 512, 512)))
labels = tensor.from_array(np.zeros((4, 512, 512)))

ds = dataset.from_tensors({"images": images, "labels": labels})
ds.store("username/dataset")  # Upload

but I cannot see the uploaded dataset at https://app.activeloop.ai/datasets.

The subsequent issues stem from the lack of information mentioned above.

Fixes in to_tensorflow method

Observed a couple of problems while converting stored datasets to TensorFlow format that need some small fixes.

to_tensorflow fails when the meta information for a tensor includes dtype="object" (the "object" dtype has been used for images, area, id, and bbox in the COCO dataset: https://github.com/activeloopai/Hub/blob/master/examples/coco/upload_coco2017.py#L24).
A fix is to keep dtype="uint8" or something similar while uploading. The COCO example needs to be updated to reflect this.

to_tensorflow also fails when it gets shape=(1,) in the meta while the actual object has multiple dimensions, for example an image.
This can be fixed by commenting out this line (https://github.com/activeloopai/Hub/blob/master/hub/collections/dataset/core.py#L633), which will set output_shapes to None by default.

to_pytorch works fine in both the above cases.

Create a Colab demo

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Merging all together

Describe the feature

  • Combine all PRs together
  • Finalize the user API
  • Test backend
  • Add documentation

Notes

PR to release/v1.0

Create a tutorial on Colab

Create a tutorial on Colab

Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

Add code test coverage

Describe Task

Add code test coverage for the repository.

Notes

Feel free to use any online resource and connect it with CircleCI.

PermissionException on AWS

Facing issues with ds.store() on AWS while the same code works properly locally.
Error: hub.exceptions.PermissionException: No permision to store the dataset at s3://snark-hub/public/abhinav/ds

For now, got it working using sudo rm -rf /tmp/dask-worker-space/.
A proper fix is needed.

Support TFDS datasets

Describe your feature request

Create a converter that takes any TensorFlow Datasets (TFDS) dataset and converts it into Hub format.

from hub import datasets
import tensorflow_datasets as tfds

tf_ds = tfds.load('mnist', split='train', shuffle_files=True)  # avoid shadowing the tfds module
ds = datasets.from_tensorflow(tf_ds)
ds = ds.store("/tmp/mnist")
hub_tf_ds = ds.to_tensorflow()

# assert hub_tf_ds == tf_ds <- sample from both datasets and check if they are the same
hub_py_ds = ds.to_pytorch()

# assert hub_tf_ds == hub_py_ds <- sample from both datasets and check if they are the same

A more advanced test would require running the conversion for each dataset in parallel:

for name in tfds.list_builders():
    tf_ds = tfds.load(name, ...)
    ds = datasets.from_tensorflow(tf_ds)
    ds = ds.store(f"/tmp/{name}")

Notes

  • General advice: start with simple, small datasets (low-hanging fruit), commit often, maybe with intermediate PRs to master, then steadily generalize your converter.
  • TFDS has FeaturesDict for describing data archetypes; we need to rely on it.
  • This task will require significantly extending how Hub handles different data and data types (dtags).
  • At every step, think about how uploading a dataset could be simplified from the user's perspective.
  • While converting datasets, take compression into consideration.

Please follow the Contribution guidelines and look into the dataset uploading guidelines for more details about meta information.

Serialization of data types

Describe your feature

Implement serialization of the Dataset structure. Then implement Tensor derivatives such as Image, ClassLabel, Mask, Segmentation, Polygon, Bounding Box, Tabular, and other TFDS Features.

There are two subtasks:

  1. Serialize and deserialize the metadata into meta.json (see the sketch after this list).
  2. Implement Tensor derivatives.
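
A minimal sketch of subtask 1, assuming a simple dict-based schema; the meta.json layout shown here is an assumption, not the final format.

import json

def serialize(schema, path="meta.json"):
    # Write the dataset structure (name -> type/dtype/shape) to meta.json.
    with open(path, "w") as f:
        json.dump(schema, f, indent=2)

def deserialize(path="meta.json"):
    with open(path) as f:
        return json.load(f)

schema = {
    "image": {"type": "Image", "dtype": "uint8", "shape": [None, 512, 512, 3]},
    "label": {"type": "ClassLabel", "num_classes": 10},
}
serialize(schema)
assert deserialize() == schema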

Additional Notes

Have a call with @edogrigqv2 to get started.
Please open a pull request to release/v1.0 and ask for a review.
