ipfsspec's Introduction

ipfsspec

A readonly implementation of fsspec for IPFS.

Installation

You can install ipfsspec from PyPI with the following command:

pip install ipfsspec

Usage

This project is currently very rudimentary. It is not yet optimized for efficiency and is not yet feature complete. However, it should be enough to list directory contents and to retrieve files from ipfs:// resources via fsspec. A simple hello world would look like:

import fsspec

with fsspec.open("ipfs://QmZ4tDuvesekSs4qM5ZBKpXiZGun7S2CYtEZRB3DYXkjGx", "r") as f:
    print(f.read())

The current implementation uses an HTTP gateway to access the data. It tries to use a local one (which is expected to be found at http://127.0.0.1:8080) and falls back to ipfs.io if the local gateway is not available.

You can modify the list of gateways using the space-separated environment variable IPFSSPEC_GATEWAYS.
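As a rough illustration, the variable could be set from Python before ipfsspec is used. The two gateway URLs are taken from the defaults described above. Whether ipfsspec reads the variable at import time or when the filesystem is created is an assumption here, so setting it as early as possible is the safer choice.

```python
import os

# Gateways in order of preference; the list is joined with single spaces
# because IPFSSPEC_GATEWAYS is a space-separated list.
gateways = ["http://127.0.0.1:8080", "https://ipfs.io"]

# Assumption: this runs before ipfsspec is imported or used.
os.environ["IPFSSPEC_GATEWAYS"] = " ".join(gateways)
```

The same effect can be achieved by exporting the variable in the shell that starts the Python process.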

ipfsspec's People

Contributors

d70-t, davidgasquez, observingclouds, thewtex

ipfsspec's Issues

roadmap: what should ipfsspec do?

This issue is meant to discuss the purpose of the ipfsspec fsspec backend and to sharpen the overall design.

background

Because IPFS -> HTTP gateways are available, a specialized IPFS backend is not strictly required for fsspec-based read access: any CID can be opened with the http backend by accessing

http(s)://<gateway>/ipfs/<CID>

The downside of this approach is that it requires translating from content-based addressing to location-based addressing in user code. Using gateway-aware URLs in user code makes it harder

  • to use local gateways
  • to do automatic fallback between multiple gateways
  • to define a preferred gateway based on the local computing environment

To overcome these downsides, it seems beneficial to refer to IPFS resources via a gateway-unaware URL like

ipfs://<CID>

and to do the translation to HTTP or IPFS when accessing the resource, based on the local computing environment and settings. This was the initial idea of ipfsspec.
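The translation step described above can be sketched in a few lines. This is only an illustration of the idea, not ipfsspec's implementation; the helper name to_gateway_url is hypothetical.

```python
def to_gateway_url(url: str, gateway: str) -> str:
    """Translate a gateway-unaware ipfs:// URL into a gateway-specific HTTP URL.

    Hypothetical helper for illustration only.
    """
    prefix = "ipfs://"
    if not url.startswith(prefix):
        raise ValueError(f"not an ipfs:// URL: {url!r}")
    # Everything after the scheme is the CID, optionally followed by a path.
    cid_and_path = url[len(prefix):]
    return f"{gateway.rstrip('/')}/ipfs/{cid_and_path}"


local_url = to_gateway_url(
    "ipfs://QmZ4tDuvesekSs4qM5ZBKpXiZGun7S2CYtEZRB3DYXkjGx",
    "http://127.0.0.1:8080",
)
print(local_url)
```

User code keeps the ipfs:// form, and the choice of gateway stays a matter of local configuration.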

design questions

Is such a library useful at all?

Or should this translation be implemented on a different layer?

Should this library do automatic load balancing / fallback between multiple gateways?

  • Doing load balancing or fallback properly is not trivial to implement (especially with async).
  • If the library should just work without user configuration, a solution with fallback is likely required, as otherwise it is not possible to use public gateways and still prefer the local gateway if it is available.
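To make the fallback idea concrete, here is a deliberately simple, synchronous sketch: gateways are tried in order of preference and the first successful response wins. It does no load balancing and is not async, which are the hard parts mentioned above. The function name fetch_with_fallback and the injected fetch callable are hypothetical.

```python
from typing import Callable, Sequence


def fetch_with_fallback(
    path: str,
    gateways: Sequence[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try each gateway in order and return the first successful result.

    `fetch` is any callable that downloads a URL and raises OSError on failure.
    """
    errors = []
    for gateway in gateways:
        url = f"{gateway.rstrip('/')}/ipfs/{path}"
        try:
            return fetch(url)
        except OSError as exc:  # ConnectionError and TimeoutError are OSErrors
            errors.append((gateway, exc))
    raise OSError(f"all gateways failed for {path}: {errors}")
```

Listing the local gateway first means it is preferred whenever it is reachable, while a public gateway still serves as a backup.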

Should the library provide write support?

... and if yes, how?

IPFS is a content-addressable storage, thus one cannot choose the filename when adding content. Instead, the "filename" is computed from the stored content. As a result, the signature of a put function would rather look like

cid = put(content)

instead of

put(content, filename)

and thus wouldn't directly fit into fsspec.
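The mismatch can be illustrated with a toy content-addressed store. This is only a sketch: the class name ContentStore is hypothetical, and a plain SHA-256 hex digest stands in for a real CID, which is encoded differently.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (illustration only)."""

    def __init__(self):
        self._blocks = {}

    def put(self, content: bytes) -> str:
        # The key is derived from the content; the caller cannot choose it.
        key = hashlib.sha256(content).hexdigest()
        self._blocks[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]


store = ContentStore()
key = store.put(b"hello ipfs")
```

Storing the same bytes twice yields the same key, and there is no place for a caller-supplied filename, which is why such a put does not map directly onto a path-based filesystem interface.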

A way out might be to use the IPFS Mutable File System (MFS), which adds a local mutable overlay on top of the immutable filesystem. Using MFS, it would be possible to incrementally construct a local filesystem hierarchy and ask for a root CID after construction has finished. The downside of this approach is that it only works locally (or at least local to one gateway) and thus is probably not suited for larger datasets. So there is probably not much benefit compared to writing data into a local temporary folder and then running ipfs add -r -H on the entire folder.

A related option might be to pin data blocks one by one and keep the virtual directory in memory. After writing out a larger dataset this way, a root CID for remotely stored datasets could be created. An advantage of this approach might be that writing could be distributed to multiple remote gateways.

verified reads

Newer versions of kubo / go-ipfs support retrieving data as CAR, which generally allows responses to be verified for correctness, thus removing the need to trust gateway servers. This should be supported by ipfsspec.
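As a starting point, the sketch below only shows how a CAR response might be requested; the verification itself is not shown. It assumes the gateway honours the Accept: application/vnd.ipld.car media type from the trustless gateway specification. The helper name car_request is hypothetical, and the request is built but not sent.

```python
from urllib.request import Request


def car_request(gateway: str, cid: str) -> Request:
    """Build (but do not send) a request asking a gateway for a CAR response.

    Assumption: the gateway supports the application/vnd.ipld.car media type.
    """
    url = f"{gateway.rstrip('/')}/ipfs/{cid}"
    return Request(url, headers={"Accept": "application/vnd.ipld.car"})


req = car_request(
    "http://127.0.0.1:8080",
    "QmZ4tDuvesekSs4qM5ZBKpXiZGun7S2CYtEZRB3DYXkjGx",
)
```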

improve async scheduling

As noted in #12, the async implementation can be slower than the sync implementation in some cases. This should not be the case.

The MultiGateway sends requests to multiple gateways, trying to balance between them based on how fast they respond and whether they explicitly signal that requests are coming in too quickly (HTTP status 429). This serves two purposes: automatic fallback from broken to live gateways, and being nice to (especially public) gateways.

It seems that the current scheduling strategy is not yet very good: it leads to bad performance and should be updated.
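For discussion, one possible shape of such a strategy is sketched below. It is not the strategy ipfsspec currently uses. All names (GatewayState, pick_gateway) and the parameters (a 5 second backoff, a smoothing factor of 0.5) are assumptions chosen for illustration. Each gateway tracks a smoothed latency and is skipped for a while after answering with HTTP 429.

```python
from dataclasses import dataclass


@dataclass
class GatewayState:
    url: str
    avg_latency: float = 0.0    # exponentially weighted average, in seconds
    blocked_until: float = 0.0  # time before which the gateway is skipped

    def record(self, latency: float, status: int, now: float,
               backoff: float = 5.0, alpha: float = 0.5) -> None:
        # Smooth the latency so that a single slow response does not dominate.
        self.avg_latency = alpha * latency + (1 - alpha) * self.avg_latency
        if status == 429:
            # The gateway asked us to slow down: leave it alone for a while.
            self.blocked_until = now + backoff


def pick_gateway(states, now: float) -> GatewayState:
    available = [s for s in states if s.blocked_until <= now]
    if not available:
        # Everything is backing off: take the one that becomes free first.
        return min(states, key=lambda s: s.blocked_until)
    return min(available, key=lambda s: s.avg_latency)
```

Passing the current time in explicitly keeps the sketch deterministic and easy to test; a real implementation would also have to handle concurrent in-flight requests.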

Loading of zarr dataset fails due to missing "ETag" in server response.

What happened
While trying to open the zarr dataset bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu with

import xarray as xr
xr.open_dataset("ipfs://bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu", engine="zarr")

a KeyError is sometimes raised:

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
    798 
    799         filename_or_obj = _normalize_path(filename_or_obj)
--> 800         store = ZarrStore.open_group(
    801             filename_or_obj,
    802             group=group,

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    363                     stacklevel=stacklevel,
    364                 )
--> 365                 zarr_group = zarr.open_group(store, **open_kwargs)
    366         elif consolidated:
    367             # TODO: an option to pass the metadata_key keyword

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options)
   1165 
   1166     # handle polymorphic store arg
-> 1167     store = _normalize_store_arg(
   1168         store, storage_options=storage_options, mode=mode
   1169     )

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/hierarchy.py in _normalize_store_arg(store, storage_options, mode)
   1055     if store is None:
   1056         return MemoryStore()
-> 1057     return normalize_store_arg(store,
   1058                                storage_options=storage_options, mode=mode)
   1059 

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/storage.py in normalize_store_arg(store, storage_options, mode)
    112     if isinstance(store, str):
    113         if "://" in store or "::" in store:
--> 114             return FSStore(store, mode=mode, **(storage_options or {}))
    115         elif storage_options:
    116             raise ValueError("storage_options passed with non-fsspec path")

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/storage.py in __init__(self, url, normalize_keys, key_separator, mode, exceptions, dimension_separator, **storage_options)
   1138         # Pass attributes to array creation
   1139         self._dimension_separator = dimension_separator
-> 1140         if self.fs.exists(self.path) and not self.fs.isdir(self.path):
   1141             raise FSPathExistNotDir(url)
   1142 

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     84     def wrapper(*args, **kwargs):
     85         self = obj or args[0]
---> 86         return sync(self.loop, func, *args, **kwargs)
     87 
     88     return wrapper

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     64         raise FSTimeoutError from return_result
     65     elif isinstance(return_result, BaseException):
---> 66         raise return_result
     67     else:
     68         return return_result

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     24         coro = asyncio.wait_for(coro, timeout=timeout)
     25     try:
---> 26         result[0] = await coro
     27     except Exception as ex:
     28         result[0] = ex

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in _isdir(self, path)
    531     async def _isdir(self, path):
    532         try:
--> 533             return (await self._info(path))["type"] == "directory"
    534         except IOError:
    535             return False

KeyError: 'type'

Expected behaviour
The dataset is returned without any error.

Potential causes
Debugging the above call by inserting a few print statements into async_ipfs.py

    async def file_info(self, path, session):
        info = {"name": path}
        headers = {"Accept-Encoding": "identity"}  # this ensures correct file size
        res = await self.cid_head(path, session, headers=headers)

        async with res:
            self._raise_not_found_for_status(res, path)
            if res.status != 200:
                # TODO: maybe handle 301 here
                raise FileNotFoundError(path)
            if "Content-Length" in res.headers:
                info["size"] = int(res.headers["Content-Length"])
            elif "Content-Range" in res.headers:
                info["size"] = int(res.headers["Content-Range"].split("/")[1])

            if "ETag" in res.headers:
                etag = res.headers["ETag"].strip("\"")
                info["ETag"] = etag
                if etag.startswith("DirIndex"):
                    info["type"] = "directory"
                    info["CID"] = etag.split("-")[-1]
                else:
                    info["type"] = "file"
                    info["CID"] = etag

        print(f"Info: {info}", flush=True)  # debug print
        print(res.status)  # debug print
        print(res.headers)  # debug print
        return info

reveals that the "ETag" header is not always returned by the server. While the output looks like

Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 'ETag': 'DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 'type': 'directory', 'CID': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}
200
<CIMultiDictProxy('Server': 'openresty', 'Date': 'Sat, 18 Jun 2022 23:03:06 GMT', 'Content-Type': 'text/html', 
'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Access-Control-Allow-Methods': 'GET', 
'Etag': '"DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu"', 
'X-Ipfs-Gateway-Host': 'ipfs-bank6-fr2', 
'X-Ipfs-Path': '/ipfs/bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'X-Ipfs-Roots': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'X-IPFS-POP': 'ipfs-bank6-fr2', 'Access-Control-Allow-Origin': '*', 
'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
 'Access-Control-Allow-Headers': 'X-Requested-With, Range, Content-Range, X-Chunked-Output, X-Stream-Output', 
'Access-Control-Expose-Headers': 'Content-Range, X-Chunked-Output, X-Stream-Output', 
'X-IPFS-LB-POP': 'gateway-bank2-fr2',
 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Proxy-Cache': 'MISS')>
Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'ETag': 'DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'type': 'directory',
'CID': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}

for a successful request, the "ETag" header is missing when the call fails:

Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}
200
<CIMultiDictProxy('Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Sun, 19 Jun 2022 10:10:27 GMT', 'Content-Type': 'text/html', 
'Connection': 'keep-alive', 'Access-Control-Allow-Headers': 'Content-Type', 
'Access-Control-Allow-Headers': 'Range', 'Access-Control-Allow-Headers': 'User-Agent', 
'Access-Control-Allow-Headers': 'X-Requested-With', 
'Access-Control-Allow-Methods': 'GET', 'Access-Control-Allow-Methods': 'HEAD',
 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Range',
 'Access-Control-Expose-Headers': 'X-Chunked-Output', 
'Access-Control-Expose-Headers': 'X-Stream-Output', 
'X-Ipfs-Path': '/ipfs/bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu')>

Without the "ETag" header, the "type" key is not set:

if "ETag" in res.headers:
    etag = res.headers["ETag"].strip("\"")
    info["ETag"] = etag
    if etag.startswith("DirIndex"):
        info["type"] = "directory"
        info["CID"] = etag.split("-")[-1]
    else:
        info["type"] = "file"
        info["CID"] = etag

Does this mean that the success of the function call depends on which IPFS peer responds quickest?
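The failure mode can be reproduced without any network access by running the same ETag handling on plain dictionaries. This is a standalone sketch: the function name info_from_headers is hypothetical, and a regular dict (case-sensitive) stands in for the case-insensitive header object of the real response.

```python
def info_from_headers(path, headers):
    """Mirror of the ETag handling quoted above, applied to a plain dict."""
    info = {"name": path}
    if "ETag" in headers:
        etag = headers["ETag"].strip('"')
        info["ETag"] = etag
        if etag.startswith("DirIndex"):
            info["type"] = "directory"
            info["CID"] = etag.split("-")[-1]
        else:
            info["type"] = "file"
            info["CID"] = etag
    return info


cid = "bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu"

# Headers as sent by the gateway in the successful case.
with_etag = info_from_headers(cid, {"ETag": f'"DirIndex-2b567f6r5vvdg_CID-{cid}"'})

# Headers as sent by the gateway in the failing case: no ETag at all.
without_etag = info_from_headers(cid, {"X-Ipfs-Path": f"/ipfs/{cid}"})

print(with_etag["type"])       # directory
print("type" in without_etag)  # False
```

Any caller that indexes info["type"] on the second result raises the KeyError shown in the traceback.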

fewer gateway requests

Currently the info() method will often issue two calls to the gateway to determine the size of an object and whether it is a file or a directory. It should be possible to find a solution which does this in one go and which also works with public gateways (the local gateway API supports files/stat, but public gateways usually don't).

Gateway selection using IPIP-280

IPIP-280 specifies how automatic gateway configuration should work. While IPIP-280 is not yet formally merged, it seems to be sufficiently stable to be implemented (e.g. it is also implemented in curl).

Ls fails when a directory contains symlinks (or anything that is not a file or directory).

I ran into an issue where running ls broke because "Type 4" was not recognized.

Unfortunately, the ls REST docs are not helpful here:

http://docs.ipfs.tech.ipns.localhost:8080/reference/kubo/rpc/#api-v0-ls

But I was able to find these pages, which indicate what the different type codes mean:

https://ipfs-search.readthedocs.io/en/latest/ipfs_datatypes.html
https://github.com/ipfs/go-unixfs/blob/master/pb/unixfs.proto
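For reference, the type codes from the DataType enum in unixfs.proto can be written down as a small lookup. The helper describe_link is hypothetical, and it assumes each link returned by ls carries a numeric "Type" field. Unknown codes are reported rather than raising.

```python
from enum import IntEnum


class UnixFSType(IntEnum):
    """Type codes as listed in the DataType enum of unixfs.proto."""
    RAW = 0
    DIRECTORY = 1
    FILE = 2
    METADATA = 3
    SYMLINK = 4
    HAMT_SHARD = 5


def describe_link(link: dict) -> str:
    """Return a readable name for a link's type, or "unknown" for other codes."""
    try:
        return UnixFSType(link["Type"]).name.lower()
    except ValueError:
        return "unknown"


print(describe_link({"Name": "some-link", "Type": 4}))  # symlink
```

So "Type 4" is a symlink, which is neither a file nor a directory.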

MWE:

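import logging
import sys

import fsspec
import ipfsspec

# Log to stdout so that the gateway requests are visible.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# Check that the 'ipfs' protocol is registered with fsspec.
fsspec.get_filesystem_class('ipfs')

fs = ipfsspec.core.IPFSFileSystem(timeout=100)

cid = "bafybeief7tmoarwmd26b2petx7crtvdnz6ucccek5wpwxwdvfydanfukna"

# Inspect the raw links returned by the API for this directory.
res = fs._gw_apipost("ls", arg=cid)
links = res["Objects"][0]["Links"]

# Breaks because "Type 4" is not recognized.
fs.ls(cid)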

I will post up an MR with a fix shortly.

`/api/v0` is deprecated on the gateway port

Some functionality of ipfsspec (stat and ls) relies on accessing the (restricted) API through the gateway interface. This is no longer possible, because the /api/v0 endpoint has been removed. In the meantime, a lot of missing functionality has been added to the gateway, so it is now likely possible to implement ipfsspec on top of regular gateway functionality (e.g. using the trustless gateway spec).

async ipfsspec hangs

From #11:

The async ipfsspec implementation hangs when running:

git clone -b ipfsspec https://github.com/thewtex/spatial-image-multiscale
cd spatial-image-multiscale
pip install -e '.[test]'
pytest

CC @thewtex
