
nvidia / aistore


AIStore: scalable storage for AI applications

Home Page: https://aiatscale.org

License: MIT License

Makefile 0.40% Go 82.27% Shell 4.89% Python 8.86% Dockerfile 0.15% HCL 0.05% Jupyter Notebook 3.37% Jinja 0.01%
object-storage distributed-shuffle multiple-backends deploy-anywhere etl-offload linear-scalability network-of-clusters small-file-datasets

aistore's Introduction

AIStore is a lightweight object storage system with the capability to linearly scale out with each added storage node and a special focus on petascale deep learning.

License Go Report Card

AIStore (AIS for short) is a lightweight storage stack built from scratch and tailored for AI apps. It's an elastic cluster that can grow and shrink at runtime and can be deployed ad hoc, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size.

AIS consistently shows balanced I/O distribution and linear scalability across arbitrary numbers of clustered nodes. The ability to scale linearly with each added disk was, and remains, one of the main incentives. Much of the initial design was also driven by the idea of offloading custom dataset transformations (often referred to as ETL). And finally, since AIS is a software system that aggregates Linux machines to provide storage for user data, reliability and data protection are requirement number one.

Features

  • Deploys anywhere. AIS clusters are immediately deployable on any commodity hardware, on any Linux machine(s).
  • Highly available control and data planes, end-to-end data protection, self-healing, n-way mirroring, erasure coding, and arbitrary number of extremely lightweight access points.
  • REST API. Comprehensive native HTTP-based API, as well as compliant Amazon S3 API to run unmodified S3 clients and apps.
  • Unified namespace across multiple remote backends including Amazon S3, Google Cloud, and Microsoft Azure.
  • Network of clusters. Any AIS cluster can attach any other AIS cluster, thus gaining immediate visibility and fast access to the respective hosted datasets.
  • Turn-key cache. Can be used as a standalone highly-available protected storage and/or LRU-based fast cache. Eviction watermarks, as well as numerous other management policies, are per-bucket configurable.
  • ETL offload. The capability to run I/O intensive custom data transformations close to data - offline (dataset to dataset) and inline (on-the-fly).
  • File datasets. AIS can be immediately populated from any file-based data source (local or remote, ad-hoc/on-demand or via asynchronous batch).
  • Read-after-write consistency. Reading and writing (as well as all other control and data plane operations) can be performed via any (random, selected, or load-balanced) AIS gateway (a.k.a. "proxy"). Once the first replica of an object is written and finalized, subsequent reads are guaranteed to view the same content. Additional copies and/or EC slices, if configured, are added asynchronously via put-copies and ec-put jobs, respectively.
  • Write-through. In presence of any remote backend, AIS executes remote write (e.g., using vendor's SDK) as part of the transaction that places and finalizes the first replica.
  • Small file datasets. To serialize small files and facilitate batch processing, AIS supports TAR, TAR.GZ (or TGZ), ZIP, and TAR.LZ4 formatted objects (often called shards). Resharding (for optimal sorting and sizing), listing contained files (samples), appending to existing shards, and generating new ones from existing objects and/or client-side files - is also fully supported.
  • Kubernetes. Provides for easy Kubernetes deployment via a separate GitHub repo and AIS/K8s Operator.
  • Access control. For security and fine-grained access control, AIS includes OAuth 2.0 compliant Authentication Server (AuthN). A single AuthN instance executes CLI requests over HTTPS and can serve multiple clusters.
  • Distributed shuffle extension for massively parallel resharding of very large datasets.
  • Batch jobs. APIs and CLI to start, stop, and monitor documented batch operations, such as prefetch, download, copy or transform datasets, and many more.
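Incidentally, the "shards" mentioned under small-file datasets are plain TAR archives, so they can be produced and inspected with standard tooling. A minimal sketch of packing small files into a shard, using only the Python standard library (illustrative code, not part of AIS):

```python
import io
import tarfile

def make_shard(samples):
    """Pack small files (name -> bytes) into an in-memory TAR shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_shard(blob):
    """List the contained files (samples) of a shard."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return tar.getnames()
```

AIS's archiving and resharding facilities operate on exactly this kind of formatted object, additionally supporting TGZ, ZIP, and TAR.LZ4.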

For easy usage, management, and monitoring, there's also:

  • Integrated and powerful CLI. As of early 2024, top-level CLI commands include:
$ ais

bucket        etl         help           log              create        dsort        stop         blob-download
object        job         advanced       performance      download      evict        cp           rmo
cluster       auth        storage        remote-cluster   prefetch      get          rmb          wait
config        show        archive        alias            put           ls           start        search

AIS runs natively on Kubernetes and features an open format - thus the freedom to copy or move your data out of AIS at any time using the familiar Linux tar(1), scp(1), rsync(1), and similar.

For developers and data scientists, there's also:

For the original AIStore white paper and design philosophy, an introduction to large-scale deep learning, and the most recently added features, please see AIStore Overview (where you can also find six alternative ways to work with existing datasets). Videos and animated presentations can be found at videos.

Finally, getting started with AIS takes only a few minutes.


Deployment options

AIS deployment options, as well as intended (development vs. production vs. first-time) usages, are all summarized here.

Since the prerequisites boil down to, essentially, having Linux with a disk, the deployment options range from an all-in-one container to a petascale bare-metal cluster of any size, and from a single VM to multiple racks of high-end servers. Practical use cases require, of course, further consideration and may include:

Option                               Objective
Local playground                     AIS developers and development; Linux or Mac OS
Minimal production-ready deployment  Utilizes a preinstalled Docker image; targets first-time users and researchers (who can immediately start training their models on smaller datasets)
Easy automated GCP/GKE deployment    Developers, first-time users, AI researchers
Large-scale production deployment    Requires Kubernetes; provided via a separate repository: ais-k8s

Further, there's the capability referred to as global namespace: given HTTP(S) connectivity, AIS clusters can be easily interconnected to "see" each other's datasets. Hence the idea: start "small", then gradually and incrementally build high-performance shared capacity.

For detailed discussion on supported deployments, please refer to Getting Started.

For performance tuning and preparing AIS nodes for bare-metal deployment, see performance.

Existing datasets

AIStore supports multiple ways to populate itself with existing datasets, including (but not limited to):

  • on demand, often during the first epoch;
  • copying an entire bucket or its selected virtual subdirectories;
  • copying multiple matching objects;
  • archiving multiple objects;
  • prefetching a remote bucket or parts thereof;
  • downloading raw HTTP(S)-addressable directories, including (but not limited to) Cloud storage;
  • promoting NFS or SMB shares accessible by one, multiple, or all AIS target nodes.

The on-demand "way" is perhaps the most popular: users simply start running their workloads against a remote bucket, with the AIS cluster positioned as an intermediate fast tier.
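Conceptually, this fast-tier behavior reduces to read-through caching; a toy model in plain Python (not AIS code) of cold vs. warm GETs:

```python
class ColdGetTier:
    """Toy model of a fast tier in front of a remote bucket:
    the first GET is a read-through 'cold' read; repeat GETs are 'warm'."""

    def __init__(self, remote):
        self.remote = remote   # stands in for, e.g., an S3 bucket
        self.cache = {}        # stands in for local AIS storage

    def get(self, key):
        if key in self.cache:
            return self.cache[key], "warm"
        data = self.remote[key]  # cold GET from the backend
        self.cache[key] = data   # populate the tier for subsequent epochs
        return data, "cold"
```

In the real system, eviction watermarks and other per-bucket policies decide when cached objects get evicted again.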

But there's more. In v3.22, we introduce the blob downloader, a special facility to download very large remote objects (BLOBs).

Installing from release binaries

Generally, AIStore (cluster) requires at least some sort of deployment procedure. There are standalone binaries, though, that can be built from source or, alternatively, installed directly from GitHub:

$ ./scripts/install_from_binaries.sh --help

The script installs aisloader and CLI from the most recent, or the previous, GitHub release. For CLI, it'll also enable auto-completions (which is strongly recommended).

PyTorch integration

AIS is one of the PyTorch Iterable Datapipes.

Specifically, the TorchData library provides AISFileLister and AISFileLoader to list and, respectively, load data from AIStore.

Further references and usage examples can be found in our technical blog at https://aiatscale.org/blog.

Since AIS natively supports a number of remote backends, you can also use (PyTorch + AIS) to iterate over Amazon S3 and Google Cloud buckets, and more.

Reuse

This repo includes SGL and Slab allocator intended to optimize memory usage, Streams and Stream Bundles to multiplex messages over long-lived HTTP connections, and a few other sub-packages providing rather generic functionality.

With a little effort, they all could be extracted and used outside.
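To illustrate the idea behind the slab allocator mentioned above, a toy sketch in Python (the real implementation is in Go and is more involved): fixed-size buffers are recycled through a free list instead of being reallocated.

```python
class Slab:
    """Toy free-list allocator: recycle fixed-size buffers instead of
    reallocating them (the idea behind the slab allocator, not its code)."""

    def __init__(self, size):
        self.size = size
        self.free = []

    def alloc(self):
        # reuse a released buffer when available, else allocate a new one
        return self.free.pop() if self.free else bytearray(self.size)

    def release(self, buf):
        buf[:] = b"\x00" * self.size  # scrub before reuse
        self.free.append(buf)
```

Recycling buffers this way amortizes allocation cost and reduces memory-pressure spikes under heavy I/O.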

Guides and References

License

MIT

Author

Alex Aizman (NVIDIA)

aistore's People

Contributors

aaronnw, alex-aizman, ambarsarkar, bforbesnvidia, cicovic-andrija, dhruvaalam, gaikwadabhishek, grmaltby, haochengg, hondhan, jhc1210, knopt, liangdrew, mjnovice, prytu, rkoo19, ruyangl, ryan-beisner, saiprashanth173, sasanap, satyatumati, shrirama, smkuls, soumyabk, soumyendra98, straill-nvidia, timoha, usmanong, virrages, vladimirmarkelov


aistore's Issues

Setting `backend_bck` with REST API

Is it possible to set the backend_bck of a bucket using the REST API?

Was hoping to reproduce this CLI command with the REST API.

$ ais bucket props set ais://bucket_name backend_bck=gcp://cloud_bucket

It seems it's only possible to set the other properties, not including `backend_bck`?

| Set [bucket properties](bucket.md#bucket-properties) (proxy) | PATCH {"action": "set-bprops"} /v1/buckets/bucket-name | `curl -i -X PATCH -H 'Content-Type: application/json' -d '{"action":"set-bprops", "value": {"checksum": {"type": "sha256"}, "mirror": {"enable": true}, "force": false}' 'http://G/v1/buckets/abc'` <sup id="a9">[9](#ft9)</sup> | `api.SetBucketProps` |
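For what it's worth, the payload below simply mirrors the shape in the quoted doc row; whether the server accepts backend_bck inside a set-bprops value is exactly the open question here, and parse_bck is a hypothetical helper:

```python
import json

def parse_bck(uri):
    """Split a bucket URI like 'gcp://cloud_bucket' into provider and name.
    (Hypothetical helper; field names are illustrative.)"""
    provider, name = uri.split("://", 1)
    return {"provider": provider, "name": name}

# PATCH /v1/buckets/bucket_name with an action message, per the docs row above
payload = {
    "action": "set-bprops",
    "value": {"backend_bck": parse_bck("gcp://cloud_bucket")},
}
body = json.dumps(payload)
```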

Can objects be cached in chunks by multiple nodes?

Trying to understand how this works compared to MongoDB's GridFS. I understand it's a cache for objects stored in cloud buckets like S3 et al., so my question is: if an object is larger than the disk space allocated for this on each node, will the object be stored using multiple nodes?

Where in the docs can I find a simple explanation of how this works?

When deploying AIS to a k8s cluster with a network policy, the AIS cluster never becomes ready

I use the Calico network policy, and it drops traffic to a service while it has no endpoints. So, in the bootstrap phase, curl tries to connect to the service and waits until it reaches the default timeout (5m). The readiness probe can't succeed because the pod restarts. The same applies to the ais_docker_start script - the primary proxy restarts because the targets aren't connected to the proxy (curl in the target pod waits). Please set a curl timeout in the ais_readiness.sh and ais_docker_start.sh scripts, like: --connect-timeout 3

Question on AIStore cluster metrics

Hi,

How do I collect the metrics of ais-node application that is deployed using ais-operator style? Is it similar to the helm chart way? If not, can we have a doc on that please.

Ability to pass a JSON object instead of SortedFile

Starting a dsort shuffle/sort with a custom order file becomes difficult when working within a VPC that doesn't support HTTPS. Currently, custom training jobs cannot easily request an ordering without having to manage the upload of an order file. Ideally, we wouldn't want to push to object storage and use HTTPS requests to pass it to the dsort API without creating different configurations for the aistore permissions.
It may be useful to allow this to be requested from the API/SDK with a JSON object that encodes this data! This would also greatly simplify custom batch balancing for resampling methods that are not easily configurable from content or file-name sorting (or the outlined order-file process).

Right now, this behavior can be achieved with individual xactions, but there can be a considerable performance benefit to marshalling the request directly in JSON and letting dsort manage the memory allocations, processing the action faster than a series of xactions (especially for a large dataset).

Is there an easy way to do the above within the current setup?

Query on aistore.pytorch.Dataset

Hi,

I'm following this doc to train on the ImageNet dataset: https://github.com/NVIDIA/aistore/blob/cc6e029721ef159f3df516ec9f8e3065ef6ac54d/docs/_posts/2021-10-22-ais-etl-2.md

I have a query specifically related to this part:

    train_loader = torch.utils.data.DataLoader(
        aistore.pytorch.Dataset(
            "http://aistore-sample-proxy:51080",  # AIS IP address or hostname
            Bck("imagenet"),
            prefix="train/",
            transform_id="my-first-etl",
            transform_filter=lambda object_name: object_name.endswith('.jpg'),
        ),
        batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=True,
    )

I see a type error when I try to use it as is in the training code:

    pydantic.main.BaseModel.__init__ TypeError: __init__() takes exactly 1 positional argument (2 given)

Do you have any insights on how to mitigate this error?

Also, I found that the implementation of aistore.pytorch.Dataset is present in one of the development branches (post-3).
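The TypeError in this report is characteristic of pydantic models, whose generated __init__ accepts keyword arguments only, so a positional Bck("imagenet") fails. A minimal stand-in (not the real aistore class) showing the failure mode and the keyword-argument fix:

```python
class Bck:
    """Stand-in for a pydantic-style model: keyword-only constructor."""

    def __init__(self, *, name, provider="ais"):
        self.name = name
        self.provider = provider

try:
    Bck("imagenet")            # positional argument: raises TypeError
except TypeError:
    pass

bck = Bck(name="imagenet")     # keyword argument: works
```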

Limiting cache size per node

Hi,

Let's say I wish to limit the amount of space used on the local disk per machine. Where do I configure that?

DNS name in DirectURL in smap

Hi everyone. What about changing the electable proxy manifest to a StatefulSet and using the DNS name of the pod instead of the IP in DirectURL in the Smap? IPs are an ephemeral resource in k8s.

Rebalance too slow

Hi! I'm trying to restore missing data using rebalance in an EC bucket (about 3,500,000 objects). This process takes too long: after 24 hours, AIS restored only about 100,000 objects. ec.batch_size is set to the max (128). How do I decrease the rebalance time?

ais version - 3.5

Damaged metafile after update from 3.5 to 3.6

After updating to version 3.6, I see this error in the target logs:
W 22:24:01.281820 ec.go:440 failed to load metadata from "/ais/sda/@ais/nvais/%mt/0025f/bJgGysbPLgSoXmVEsZVjMzcefFbWKebg": damaged metafile "/ais/sda/@ais/nvais/%mt/0025f/bJgGysbPLgSoXmVEsZVjMzcefFbWKebg": unsupported metadata format version 2065855337. Only 1 supported
How can I resolve this problem and convert to the new metadata format? Maybe there is updated documentation?

aisfs problems

I can't get aisfs to work.

  • it does not seem to respect configuration files in $HOME/.config/ais/bucket.aisfs.mount.json
  • there is no way of telling what configuration file it is actually loading
  • it ignores AIS_ENDPOINT

I would suggest the following:

  • aisfs should use the same defaults that are configured for ais (including ~/.config/ais and environment variables)
  • with those defaults, aisfs bucket dir should just work
  • there should be a -v option that logs the operations (reading config file, contacting server, mounting, etc.) to stderr
  • there should be a --configfile option that lets users provide a config file on the command line

I'd also suggest adding actual subcommands:

  • aisfs mount bucket dir
  • aisfs umount dir
  • aisfs stats dir
  • aisfs config -- displays the current config, including where it was loaded from

v1.0.5 AISFileLister ModuleNotFoundError

After installing v1.0.5, I get the following error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[43], line 7
      4 image_prefix = ["ais://cifar10/"]
      6 # Listing all files starting with these prefixes on AIStore
----> 7 dp_urls = AISFileLister(url="http://localhost:51080/", source_datapipe=image_prefix)
      9 # list obj urls
     10 print(list(dp_urls))

File ~/anaconda3/envs/pytorch2/lib/python3.8/site-packages/torchdata/datapipes/iter/load/aisio.py:82, in AISFileListerIterDataPipe.__init__(self, source_datapipe, url, length)
     81 def __init__(self, source_datapipe: IterDataPipe[str], url: str, length: int = -1) -> None:
---> 82     _assert_aistore()
     83     _assert_aistore_version()
     84     self.source_datapipe: IterDataPipe[str] = source_datapipe

File ~/anaconda3/envs/pytorch2/lib/python3.8/site-packages/torchdata/datapipes/iter/load/aisio.py:34, in _assert_aistore()
     32 def _assert_aistore() -> None:
     33     if not HAS_AIS:
---> 34         raise ModuleNotFoundError(
     35             "Package `aistore` (>=1.0.2) is required to be installed to use this datapipe."
     36             "Please run `pip install --upgrade aistore` or `conda install aistore` to install the package"
     37             "For more info visit: https://github.com/NVIDIA/aistore/blob/master/sdk/python/"
     38         )

ModuleNotFoundError: Package `aistore` (>=1.0.2) is required to be installed to use this datapipe.Please run `pip install --upgrade aistore` or `conda install aistore` to install the packageFor more info visit: https://github.com/NVIDIA/aistore/blob/master/sdk/python/

Encountered an issue while upgrading the aisnode version on k8s cluster(GCP)

Hi,

I was following the https://github.com/NVIDIA/ais-k8s Terraform style of deploying the AIS application onto a k8s cluster. I initially deployed the cluster with version 3.4, which was successful, and then when I tried to upgrade it to version 3.8 (using the same script), the pods went into CrashLoopBackOff status. I had to manually clean the state and config files to make it work for version 3.8. Any inputs on how to resolve this issue without having to clean the files manually?

Restore objects on crashed disk

Hi! I ran into a problem. One disk crashed on a target and was replaced with an empty one. Now some objects are not available through ais object get, although I use EC for the bucket. The objects are quite small and replicated (I found the file on a disk on another target). How can I restore such files?

version: 3.5

Ability to get last version from backend storage

I want to use AIS as a cache tier with another AIS cluster as the provider. How can I validate that an object in the cache is not obsolete? What about extending the GET object API to accept the object version as a query param? If that version is not found in the cache, initiate a cold GET from the backend.

Retrieval Cost for AIS with GCP / AWS

I am interested in using PT+AIS to iterate over Google Cloud buckets using the Iterable DataPipes API in PT 1.12, and was wondering if there were retrieval costs in using AIS. Aside from paying for the storage, how cost-friendly would it be to use AIS+Google Cloud (or other services like Webdataset) for training over long durations?

ETL offline question

I'm trying to get a better idea of the lifecycle of a bucket's contents with respect to an offline ETL. For example, if an ETL has been executed on the entire bucket and has completed, and then a new object is put into the bucket, will a subsequent execution of the ETL automatically skip over results that have already been computed, or will it materialize an entirely new set of outputs? What about if an object is deleted from the source bucket? Will the corresponding output from the ETL be deleted? If it does reuse results, is there a way to invalidate them, if necessary?

S3 compatibility not working with boto3

Hi! I'm trying to use boto3 and I've got a problem working with AIS as an S3 backend. More detail in this issue: boto/boto3#2708. Could you resolve this problem on the aistore side, or any idea how to use it with AIS? Boto3 is a popular library used in many projects (like TensorBoard).

Minimum cluster size with failover

Hi team,

Looking to create a small cluster, say, 3-4 nodes and with ability to grow.

It seems you only replicate small objects, with the replica count equal to the parity number. And a usual/popular EC scheme is at least 6+4, thus requiring 10 nodes. And the EC configuration can't be changed.

Is it possible to have only replication until the cluster grows enough to handle, say, 10+4 EC?

Or doing 10+4 EC from the start, with 4 nodes, setting ec.objsize_limit very high to cover all objects, and, when the cluster is bigger, lowering it so that big objects start being erasure-coded.

Does that make sense?

Regards
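For context on the space math implied by this question (illustrative arithmetic only): a d+p erasure-coding scheme needs at least d+p nodes and stores (d+p)/d times the raw data, while n-way replication stores n times.

```python
def ec_overhead(d, p):
    """Storage multiplier for a d+p erasure-coding scheme: (d+p)/d."""
    return (d + p) / d

def replication_overhead(n):
    """Storage multiplier for n-way replication."""
    return float(n)

# 6+4 EC: needs at least 10 nodes, ~1.67x raw data
# 10+4 EC: needs at least 14 nodes, 1.4x raw data
# 3-way replication: works on 3 nodes, but stores 3x raw data
```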

ETL documentation issue

Hi team,

I think ETL documentation needs to be updated. Specifically, I would like to learn more about how to write transform function. There is the following sentence in the doc, but I don't find the example.

You can write your own custom transform function (see example below) that takes input object bytes as a parameter and returns output bytes (the transformed object’s content).

Bucket Copy REST API Docs

Noticed that the bucket API docs might be a bit outdated. Managed to work out the command by looking at the Python SDK source.

| Copy [bucket](bucket.md) | POST {"action": "copy-bck"} /v1/buckets/from-name | `curl -i -X POST -H 'Content-Type: application/json' -d '{"action": "copy-bck", }}}' 'http://G/v1/buckets/from-name?bck=<bck>&bckto=<to-bck>'` | `api.CopyBucket` |

I think it should look something like

POST {"action": "copy-bck"} /v1/buckets/<bck> | `curl -i -X POST -H 'Content-Type: application/json' -d '{"action": "copy-bck"}}}' 'http://G/v1/buckets/from-name?provider=<from_provider>&bck_to=<to-provider>//<to-bck>/'`

Happy to put in PR later if preferred.

Also, is there a way to pass the --list parameter to the REST API like in the CLI? An &list= param didn't seem to do the trick.

$ ais bucket cp ais://bck1 ais://bck2 --list obj1.tar,obj1.info --wait

E 06:05:56.448375 target.go:258 FATAL ERROR: operation not supported

Hi,

When trying to spin up a container by the following command:

docker run -d \
           -p 51080:51080 \
           -v /nvme/disk0:/ais/disk0 \
           aistore/cluster-minimal:latest

the container automatically shuts down immediately. The error upon docker logs is:

E 06:05:56.448375 target.go:258 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported

However, if I use -v /disk0:/ais/disk0, the spin-up is successful.

The FSTYPE difference between /nvme/disk0 and /disk0 is:
/nvme/disk0 nfs
/disk0 ext4

Is it because something is not supported for the nfs type?

Thanks !

dutils_linux.go uses lsblk to get host device information

Using lsblk to get host device information is not supported in all environments because it leaks host information into the container. It happens to work under docker right now, but it doesn't work in podman (an improved drop-in replacement for Docker) and other container runtimes.

$ podman run -p 51080:51080 -v /shared/aistore:/ais/disk0 aistore/cluster-minimal:latest
E 06:28:40.198737 dutils_linux.go:80 No disks for /dev/sda1("sda1"):
[]
E 06:28:40.198996 vinit.go:42 FATAL ERROR: t[xbWPpOjy]: mp[/ais/disk0, fs=/dev/sda1] has no disks
FATAL ERROR: t[xbWPpOjy]: mp[/ais/disk0, fs=/dev/sda1] has no disks
$

(A workaround for podman is to add --security-opt unmask=/sys/dev/block, but not all runtimes or environments may allow that.)

Python API recursive PUT

Hi, what is the Python API analogous command to the following?

# put the downloaded dataset in the created AIS bucket
! ais object put -r -y <path_to_dataset> ais://caltech256/

Support custom s3 endpoint (minio, wasabi)

Hi!

Going through the codebase (with the s3 regex for getting the endpoint and a tightly coupled integration with the aws sdk), I assume that there is no way to setup a custom s3 backend at the moment (such as minio or using wasabi).

If the above assumption is correct, hereby a proposal to add the ability to set a custom s3 backend.

Sometimes failing to restart AIS in GKE: getting "lost or missing mountpath" fatal error

Hi,

We're using AIStore in a GKE cluster using the AIS K8S operator.

We have multiple mounts specified in the current spec we're using

    mounts:
      - path: "/ais1"
        size: 1000Gi
      - path: "/ais2"
        size: 1000Gi

However over time we notice that target pods can crash and then fail to start up. The startup message indicates the following error:

FATAL ERROR: t[rlNlzeeu]: [storage integrity error sie#50, for troubleshooting see https://github.com/NVIDIA/aistore/blob/master/docs/troubleshooting.md]: lost or missing mountpath "/ais1" ({Fs:/dev/sdb FsType:ext4 FsID:543639085,1533152499} vs {Path:/ais1 Fs:/dev/sdc FsType:ext4 FsID:543639085,1533152499 Ext:<nil> Enabled:true})

It seems that the 2 disks are getting mounted to different devices (/dev/sdc and /dev/sdb) after restarting which then fails the integrity check on the start up. I can resolve this by removing .ais.vmd files (unclear if this is causing data issues yet.)

Do you have any suggestions around this issue? Is there a way to enforce which mount gets mapped to which device?

Question on `aistore.pytorch.Dataset`

Hi,

Thank you for the excellent work!

A question

  • where can I see the implementation of the dataloader that is used in this?

Thank you,

    # Data loading code
    train_loader = torch.utils.data.DataLoader(
        aistore.pytorch.Dataset(
            "http://aistore-sample-proxy:51080", Bck("imagenet"),  # AIS IP address or hostname
            prefix="train/", transform_id="imagenet-train",
            transform_filter=lambda object_name: object_name.endswith('.jpg'),
        ),
        batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=True,
    )

https://aiatscale.org/examples/etl-imagenet-dataset/train_aistore.py

Private Azure bucket as backend

Hi,

I have my datasets stored in a private Azure blob storage, and I was wondering if I could use that as my backend bucket for the AIS bucket?

Thanks

OOM killed error of Operator Controller

Problem:
Hit an OOM kill of the operator controller on my k8s env

Cause:

  1. Controller Manager resource limit in mem:30Mi
    operator/config/manager/manager.yaml

        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 20Mi
  2. The memory usage averages around 55Mi in the cluster.

Solution:
Modify the default memory limit to 100Mi

S3 compatibility with AWS bucket backend

Hi,

Even if for now I'll pause my exploration of aistore, wanted to give some feedback on the issues I encountered when testing (using 9ec804e).

After activating the aws cloud backend, I started first with the direct access.
I was able to list the bucket content (after getting trolled a bit since I was too restrictive on my bucket policy and needed to add more than just List* for aistore to read bucket metadata I believe).

ais ls s3://aistore-test --props all
NAME		 SIZE		 CHECKSUM				 ATIME			 VERSION	 CACHED	 TARGET URL		 STATUS	 COPIES
Chart-88.png	 44.50KiB	 a8c44bbfa1d649872a05a3ef1cea7bc6	 			 		 no	 http://10.10.1.21:5354	 ok	 0
Chart-89.png	 44.50KiB	 ee901cdd0bea09c2			 08 Dec 20 09:53 UTC	 		 yes	 http://10.10.1.21:5354	 ok	 1

Two things that were worrisome are the missing ATIME when the file is not cached, and the different checksum depending on whether the file is cached or not (it's the same file with a different name).
But what I did not realize was that the bucket is not accessible through the S3 endpoint: when listing with s3cmd, the bucket is not visible, and if I try to list it, it fails with a 404 on ais://aistore-test.

So I switched to creating an ais bucket with an aws backend :

ais create bucket test
ais set props ais://test backend_bck=aws://aistore-test
ais set props ais://test checksum.type=md5

The first issue I had was with the default max page size; the target was outputting errors such as:
page size exceeds the maximum value (got: 10000, max expected: 1000). I believe it's due to the default for ais being higher than the one for aws. There is probably a clean way to do it, but I just changed it here: https://github.com/NVIDIA/aistore/blob/master/cmn/api_const.go#L280 :).

After this, I encountered the same missing ATIME when the file is not cached, which prevented s3cmd from listing the bucket. After loading each file into the cache, the ls returned, but all files were listed twice.

When trying to push a file using s3cmd, it failed after multiple retries due to an invalid checksum (the file is still uploaded). Setting the checksum.type prop appears to have no effect on the file checksum (I tried setting it before and after setting backend_bck; I did not check with tcpdump whether the ETag was present).

When fetching a file with s3cmd, if it's not cached it first fails due to the missing ATIME; then, after it's loaded into the cache, it downloads the file but outputs an error due to an invalid checksum.

Edit: Additionally, I did two upload tests on a standard ais bucket.

Thank you and have a nice day.

ais object get a specific version?

Hi folks, I'm trying to understand how object versioning works. As I run put object a number of times, ais object show --all shows an incremental version number. The question is, how do I retrieve a specific object version? I don't find a viable option with ais object get - am I missing something?

Aistore HTTP failed to json-unmarshal GET request

Hi,

I am running aistore as a single-docker deployment, plus the ais CLI on the host machine to monitor and play around.
Currently, I am trying to list all buckets in my deployment using an example from the HTTP API docs, but the command fails with an HTTP 400 error code:

luab@datastore:~$ curl -L -X GET 'http://localhost:8080/v1/buckets'
{"status":400,"message":"failed to json-unmarshal GET request, err: EOF [*cmn.ActionMsg](proxy.go, #384)","method":"GET","url_path":"/v1/buckets","remote_addr":"172.17.0.1:60062","caller":""}

Same for

luab@datastore:~$ curl -L -X GET 'http://localhost:8080/ais/medical'
{"status":400,"message":"invalid protocol scheme ''","method":"GET","url_path":"/ais/medical","remote_addr":"172.17.0.1:60070","caller":""}

At the same time ais cli commands seem to work just fine

luab@datastore:~$ ais bucket ls
AIS Buckets (2)
  ais://arxiv
  ais://medical

I am also able to retrieve the node config using this command:

luab@datastore:~$ curl -X GET http://localhost:8080/v1/daemon?what=config

Am I doing the requests wrong, or is something broken in aistore/my setup? Thanks!

HTTP API - dsort

The documentation here suggests that dsort can be accessed via v1/dsort.

However, it is instead accessed via v1/sort.

single docker deploy: has no disks

Following docs: deploy/prod/docker/single/README.md

$ docker run \
    -p 51080:51080 \
    -v $(mktemp -d):/ais/disk0 \
    aistore/cluster-minimal:latest
E 01:52:32.813230 vinit.go:42 FATAL ERROR: t[QIfDzhxP]: mp[/ais/disk0, fs=/dev/mapper/ubuntuvg-root] has no disks
FATAL ERROR: t[QIfDzhxP]: mp[/ais/disk0, fs=/dev/mapper/ubuntuvg-root] has no disks

S3 compatibility with AWS Cli and AIS bucket

Hi,

I started testing aistore recently. I had some issues with the quickstart, so I just reused the Docker image from the dev k8s templates with minor modifications (updated to Ubuntu 20.04 and cloned the 3.2.1 project version instead; cf. gist).

I then started a proxy and a target node on two different nodes using the host network; cluster initialization appears to be OK, and ais show cluster returns both nodes as healthy.

The first test I wanted to run was S3 compatibility, so I created a bucket using ais create bucket test and pushed a file into it. I then tried to interact with it using aws --endpoint-url http://10.10.1.21:5353/s3 s3 ls s3://., which returned the bucket list. But when I started interacting with the bucket itself, I hit multiple issues:

  • on aws --end... s3 ls s3://test/: Invalid timestamp "": String does not contain a date. Using tcpdump I can see that the LastModified field is empty.
  • on aws --end... s3 cp s3://test/Chart-88.png .: fatal error: Could not connect to the endpoint URL: http://s3.ais.amazonaws.com/s3/test/Chart-88.png (not sure where s3.ais.amazonaws.com comes from; it's not in my config, and with tcpdump I can see that the proxy returns a 307 to the target with the correct URL).
  • using s3cmd I was able to push a file, even though the command line returned MD5 Sums don't match! errors (I downloaded it afterward and it has the same sha256 hash).

I tested with awscli 1.18.186 and 2.1.1 (I used a random access key/secret but disabled auth).

Is this expected, or is there an issue with my setup?
Thank you and have a nice day :).

What is the housekeeper?

Hi AIStore team,

I am trying to run AIStore on an HPC system. I am using Go 1.17.3. After executing make deploy, at some point the output says the housekeeper is not running, and then it fails.

What is the housekeeper?

make deploy
Enter number of storage targets:
5
Enter number of proxies (gateways):
1
Number of local mountpaths (enter 0 for preconfigured filesystems):
2
Select backend providers:
Amazon S3: (y/n) ?
n
Google Cloud Storage: (y/n) ?
n
Azure: (y/n) ?
n
HDFS: (y/n) ?
n
Would you like to create loopback mount points: (y/n) ?
n
Building aisnode: version=1bea20d85 providers= tags= mono
done.
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais0/ais.json -local_config=/ccs/home/benjha/.ais0/ais_local.json -role=proxy -ntargets=5
housekeeper not running, cannot reg ".dflt.mm.gc"housekeeper not running, cannot reg ".dflt.mm.small.gc"+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais1/ais.json -local_config=/ccs/home/benjha/.ais1/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais2/ais.json -local_config=/ccs/home/benjha/.ais2/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais3/ais.json -local_config=/ccs/home/benjha/.ais3/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais4/ais.json -local_config=/ccs/home/benjha/.ais4/ais_local.json -role=target
+ /sw/summit/ums/gen119/aistore/src/bin/aisnode -config=/ccs/home/benjha/.ais5/ais.json -local_config=/ccs/home/benjha/.ais5/ais_local.json -role=target
E 14:55:57.012409 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.012480 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.012924 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.013381 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
E 14:55:57.013471 err.go:118 FATAL ERROR: operation not supported
FATAL ERROR: operation not supported
Done.

Thanks

About `fmt.Sprintf` or `+` as string concat optimization proposal

Hey folks,

I am currently working in HPC (high-performance computing),

and aistore is used in our project's storage deployment.

I found that the aistore source code makes heavy use of fmt.Sprintf and + for string concatenation, both of which are slower because each call allocates a new string.

A strings.Builder efficiently builds a string using Write methods and minimizes memory copying.

I would like to submit a PR for this improvement; would you accept it?

ModuleNotFoundError: No module named 'aistore.botocore_patch'

Hi, I was attempting to test the example here:
https://github.com/NVIDIA/aistore/blob/master/docs/s3compat.md#boto3-compatibility

I installed the library with %pip install aistore[botocore]; however, the import fails:

from aistore.botocore_patch import botocore

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 2
      1 import boto3
----> 2 from aistore.botocore_patch import botocore

ModuleNotFoundError: No module named 'aistore.botocore_patch'

Proxy failed to launch after Helm install #127

Env:
Kubernetes 1.23.1 on Ubuntu 20.04, x86_64, Helm 3

Problem:

Here is the launch error: daemon.go:142 FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64:

Logs:


kubectl logs ais-proxy-2
aisnode proxy container startup at Tue May  9 07:31:52 UTC 2023
'/var/ais_config/ais.json' -> '/etc/ais/ais.json'
'/var/ais_config/ais_local.json' -> '/etc/ais/ais_local.json'
'/var/statsd_config/statsd.json' -> '/opt/statsd/statsd.conf'
No cached .ais.smap
9 May 07:31:52 - [11] reading config file: /opt/statsd/statsd.conf
9 May 07:31:52 - server is up INFO
aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=proxy -alsologtostderr=true -stderrthreshold=1  -allow_shared_no_disks=false
E 07:31:54.234343 daemon.go:142 FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64: unexpected character: �, error found in #10 byte of ...|rol":   "",
      "p|..., bigger context ...|         "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""
  }
}|...
FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64: unexpected character: �, error found in #10 byte of ...|rol":   "",
      "p|..., bigger context ...|         "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""
  }
}|...

The ais_local.json file:

{
  "confdir": "/etc/ais",
  "log_dir": "/var/log/ais",
  "host_net": {
      "hostname":                 "${AIS_PUB_HOSTNAME}",
      "hostname_intra_control":   "${AIS_INTRA_HOSTNAME}",
      "hostname_intra_data":      "${AIS_DATA_HOSTNAME}",
      "port":                 "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""
  }
}

And the log after hardcoding the hostname:


'/var/ais_config/ais.json' -> '/etc/ais/ais.json'
'/var/ais_config/ais_local.json' -> '/etc/ais/ais_local.json'
'/var/statsd_config/statsd.json' -> '/opt/statsd/statsd.conf'
No cached .ais.smap
9 May 09:06:19 - [11] reading config file: /opt/statsd/statsd.conf
9 May 09:06:19 - server is up INFO
aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=proxy -alsologtostderr=true -stderrthreshold=1  -allow_shared_no_disks=false -ntargets=3
W 09:06:21.127947 config.go:1716 load initial global config "/etc/ais/ais.json"
E 09:06:21.135042 daemon.go:142 FATAL ERROR: failed to load initial global config "/etc/ais/ais.json": cmn.ClusterConfig.ReadObject: found unknown field: compression, error found in #10 byte of ...|mpression": {
      |..., bigger context ...| },
  "backend": {

  },
  "compression": {
          "block_size":   262144,
          "c|...
FATAL ERROR: failed to load initial global config "/etc/ais/ais.json": cmn.ClusterConfig.ReadObject: found unknown field: compression, error found in #10 byte of ...|mpression": {
      |..., bigger context ...| },
  "backend": {

  },
  "compression": {
          "block_size":   262144,
          "c|...
