vsifile's Introduction

vsifile


Source Code: https://github.com/vincentsarago/vsifile


Description

An experiment using the Rasterio/GDAL Python file opener VSI plugin: https://github.com/rasterio/rasterio/pull/2898/files

A future version of rasterio will accept a custom dataset opener:

opener : callable, optional
        A custom dataset opener which can serve GDAL's virtual
        filesystem machinery via Python file-like objects. The
        underlying file-like object is obtained by calling *opener* with
        (*fp*, *mode*) or (*fp*, *mode* + "b") depending on the format
        driver's native mode. *opener* must return a Python file-like
        object that provides read, seek, tell, and close methods.

ref: https://github.com/rasterio/rasterio/blob/d966440c06f3324aca1fa761d490cc780a9f619c/rasterio/__init__.py#L185-L191
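
As a rough illustration of the contract quoted above, here is a minimal sketch of an opener: a callable that takes (fp, mode) and returns an object providing read, seek, tell, and close. The names `LoggingFile` and `logging_opener` are hypothetical and are not part of rasterio or vsifile; the local file is only a stand-in for whatever backend you would wrap.

```python
import io
import os
import tempfile


class LoggingFile:
    """Hypothetical file-like wrapper exposing read, seek, tell, and close."""

    def __init__(self, path, mode):
        # This sketch always opens the file in binary mode.
        self._f = open(path, mode if "b" in mode else mode + "b")
        self.calls = []  # record of (method, argument) pairs, for inspection

    def read(self, size=-1):
        self.calls.append(("read", size))
        return self._f.read(size)

    def seek(self, offset, whence=io.SEEK_SET):
        self.calls.append(("seek", offset))
        return self._f.seek(offset, whence)

    def tell(self):
        return self._f.tell()

    def close(self):
        self._f.close()


def logging_opener(fp, mode="rb"):
    """Hypothetical opener: called with (fp, mode), returns a file-like object."""
    return LoggingFile(fp, mode)


# Exercise the opener on a throwaway file (no rasterio needed for this part).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")

f = logging_opener(tmp.name, "rb")
first = f.read(4)
f.seek(8)
last = f.read(2)
position = f.tell()
f.close()
os.unlink(tmp.name)
```

With a rasterio version that includes the opener argument, a callable like this would presumably be passed as `rasterio.open(path, opener=logging_opener)`; the recorded `calls` list then shows which reads and seeks were issued.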

Install

You can install vsifile using pip:

python -m pip install -U pip
python -m pip install -U vsifile

or install from source:

git clone https://github.com/vincentsarago/vsifile.git
cd vsifile
python -m pip install -U pip
python -m pip install -e .

Usage

from vsifile import VSIFile, FileReader

src_path = "tests/fixtures/cog.tif"

with VSIFile(src_path, "rb") as f:
    assert isinstance(f, FileReader)
    assert hash(f)
    assert "FileReader" in str(f)

    assert not f.closed
    assert f._cache
    assert len(f._header) == 32768
    assert f.tell() == 0
    assert f.seekable

    b = f.read(100)
    assert len(b) == 100
    assert f._header[0:100] == b
    assert f.tell() == 100

    _ = f.seek(0)
    assert f.tell() == 0

    _ = f.seek(40000)
    assert f.tell() == 40000

    b = f.read(100)
    assert f.tell() == 40100

    # fetch the same block (should be from LRU cache)
    _ = f.seek(40000)
    b_cache = f.read(100)
    assert f.tell() == 40100
    assert b_cache == b

    b = f.read_multi_range(2, [100, 200], [10, 20])
    assert len(b) == 2
    assert len(b[0]) == 10
    assert len(b[1]) == 20
    assert f.tell() == 220

With Rasterio

import rasterio
from vsifile.rasterio import opener

with rasterio.open("tests/fixtures/cog.tif", opener=opener) as src:
    ...

Cache Configuration

vsifile uses DiskCache to create a persistent file header cache. By default the cache is cleaned up when the file handle is closed; you can change this behaviour by setting the VSIFILE_CACHE_DIRECTORY="{your temp directory}" environment variable.
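
A small sketch of setting the variable from Python rather than from the shell. The directory name `vsifile-cache` is a made-up example, and when exactly vsifile reads the variable (at import or at open time) is an assumption here, so the safest order is to set it before importing the package.

```python
import os
import tempfile

# Assumption: the variable must be set before vsifile opens a file
# (and possibly before it is imported); check the package settings to confirm.
cache_dir = os.path.join(tempfile.gettempdir(), "vsifile-cache")  # hypothetical path
os.makedirs(cache_dir, exist_ok=True)
os.environ["VSIFILE_CACHE_DIRECTORY"] = cache_dir

# from vsifile import VSIFile  # import only after the environment is configured
```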

Contribution & Development

See CONTRIBUTING.md

Changes

See CHANGES.md.

License

See LICENSE

vsifile's People

Contributors

vincentsarago

vsifile's Issues

implement http multiplexing and multirange

Right now our multi-range read method just loops over a _read_range method, while in theory some HTTP servers support multi-range requests (https://datatracker.ietf.org/doc/html/rfc9110#name-media-type-multipart-bytera) or multiplexing (https://www.python-httpx.org/http2/).

class HttpReader(BaseReader):

  • GDAL_HTTP_MULTIPLEX=[YES/NO]: Defaults to YES. Only applies on a HTTP/2 connection. If set to YES, HTTP/2 multiplexing can be used to download multiple ranges in parallel, during ReadMultiRange() requests that can be emitted by the GeoTIFF driver.
  • GDAL_HTTP_MULTIRANGE=[SINGLE_GET/SERIAL/YES]: Defaults to YES. Controls how ReadMultiRange() requests emitted by the GeoTIFF driver are satisfied. SINGLE_GET means that several ranges will be expressed in the Range header of a single GET requests, which is not supported by a majority of servers (including AWS S3 or Google GCS). SERIAL means that each range will be requested sequentially. YES means that each range will be requested in parallel, using HTTP/2 multiplexing or several HTTP connections.
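
To make the SINGLE_GET idea concrete, here is a minimal sketch of how several (offset, size) reads could be folded into one Range header value following the RFC 9110 byte-range syntax. `build_range_header` is a hypothetical helper, not an existing vsifile or GDAL function, and it only builds the header; parsing the multipart/byteranges response is left out.

```python
def build_range_header(offsets, sizes):
    """Build a multi-range HTTP Range header value (RFC 9110 byte ranges).

    Byte ranges are inclusive on both ends, so a read of `size` bytes at
    `offset` is written as `offset-(offset + size - 1)`.
    """
    if len(offsets) != len(sizes):
        raise ValueError("offsets and sizes must have the same length")
    if any(size <= 0 for size in sizes):
        raise ValueError("sizes must be positive")
    parts = [f"{off}-{off + size - 1}" for off, size in zip(offsets, sizes)]
    return "bytes=" + ",".join(parts)


header = build_range_header([100, 200], [10, 20])
print(header)  # bytes=100-109,200-219
```

As the GDAL note above points out, many servers (including AWS S3 and Google GCS) do not support this form, so a real implementation would still need a serial or parallel fallback.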

Can VSIFILE or GDAL merge requests?

In https://github.com/vincentsarago/vsifile/blob/main/vsifile_logs.ipynb I'm printing the logs to see how GDAL behaves when opening/reading the file using VSIOpener.

Header Cache

DEBUG:vsifile:Using /tmp/visfile-cache Cache directory
DEBUG:vsifile:Opening: tests/fixtures/cog.tif (mode: rb)
DEBUG:vsifile:Reading 0->32768 from Header cache
DEBUG:vsifile:Reading 0->8 from Header cache
DEBUG:vsifile:Reading 8->10 from Header cache
DEBUG:vsifile:Reading 10->226 from Header cache
DEBUG:vsifile:Reading 226->230 from Header cache
DEBUG:vsifile:Reading 1280->1304 from Header cache
DEBUG:vsifile:Reading 1304->1352 from Header cache
DEBUG:vsifile:Reading 1352->1416 from Header cache
DEBUG:vsifile:Reading 1416->1446 from Header cache
DEBUG:vsifile:Reading 1198->1279 from Header cache
DEBUG:vsifile:Reading 1446->1448 from Header cache
DEBUG:vsifile:Reading 1448->1616 from Header cache
DEBUG:vsifile:Reading 1616->1620 from Header cache

As we can see above ☝️, GDAL does no caching of its own and will request bytes again instead of reusing the ones it already fetched 😭

DEBUG:vsifile:Reading 0->32768 from Header cache
DEBUG:vsifile:Reading 0->8 from Header cache

This seems really weird: once GDAL asks to open the file, the first thing it does is fetch the first 32768 bytes (GDAL_INGESTED_BYTES_AT_OPEN=32768), but it doesn't make use of those bytes afterward. In vsifile we store this header in diskcache and in the Python object after opening, so no direct file request is needed for these reads, but the back-and-forth between GDAL and Python is still unnecessary.
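
The header-cache behaviour described here can be sketched as follows. This is not the vsifile implementation: `HeaderCachedReader` is a hypothetical class that keeps the first 32768 bytes in memory and only touches the underlying file for reads that go past them, with an in-memory `BytesIO` standing in for the real file.

```python
import io

HEADER_SIZE = 32768  # mirrors GDAL_INGESTED_BYTES_AT_OPEN mentioned above


class HeaderCachedReader:
    """Hypothetical reader that answers reads inside the header from memory."""

    def __init__(self, raw, header_size=HEADER_SIZE):
        self._raw = raw
        self._header = raw.read(header_size)  # single up-front fetch
        self._pos = 0
        self.backend_reads = 0  # reads that had to go past the cached header

    def seek(self, offset):
        self._pos = offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, size):
        end = self._pos + size
        if end <= len(self._header):
            data = self._header[self._pos:end]  # served from the cached header
        else:
            self._raw.seek(self._pos)
            data = self._raw.read(size)
            self.backend_reads += 1
        self._pos += len(data)
        return data


raw = io.BytesIO(bytes(range(256)) * 200)  # 51200 bytes of stand-in data
reader = HeaderCachedReader(raw)

# Replay a few of the small header reads from the log above.
for start, stop in [(0, 8), (8, 10), (10, 226), (1280, 1304)]:
    reader.seek(start)
    reader.read(stop - start)
header_only_reads = reader.backend_reads

# A read beyond the header has to go to the underlying file.
reader.seek(40000)
chunk = reader.read(100)
```

Every small read in the replay is answered from memory, yet each one is still a separate call from GDAL into Python, which is the overhead the paragraph above complains about.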

Block cache

DEBUG:vsifile:Fetching 86991->87046
DEBUG:vsifile:Fetching 87046->87101
DEBUG:vsifile:Fetching 87101->87156
DEBUG:vsifile:Fetching 87156->88480
DEBUG:vsifile:Fetching 88480->90780
DEBUG:vsifile:Fetching 90780->91309
DEBUG:vsifile:Fetching 91309->91364
DEBUG:vsifile:Fetching 91364->91419
DEBUG:vsifile:Fetching 91419->91474
DEBUG:vsifile:Fetching 91474->91529
DEBUG:vsifile:Fetching 91529->91688
DEBUG:vsifile:Fetching 91688->91743
DEBUG:vsifile:Fetching 91743->91798
DEBUG:vsifile:Fetching 91798->92160
DEBUG:vsifile:Fetching 92160->93792
DEBUG:vsifile:Fetching 93792->96827
DEBUG:vsifile:Fetching 96827->101030

GDAL does not merge requests, which results in a LOT of requests, making vsiopener (or this implementation) useless 😭
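
One possible mitigation on the Python side would be to coalesce adjacent ranges before fetching them. The sketch below assumes the reader gets to see a batch of (start, stop) ranges at once, which may not hold when GDAL issues reads one at a time; `merge_ranges` and its `max_gap` parameter are hypothetical names.

```python
def merge_ranges(ranges, max_gap=0):
    """Coalesce (start, stop) byte ranges whose gap is at most `max_gap` bytes."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Overlapping, touching, or close enough: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


# First few ranges from the log above, plus one distant (made-up) range.
requested = [
    (86991, 87046),
    (87046, 87101),
    (87101, 87156),
    (87156, 88480),
    (200000, 200100),
]
print(merge_ranges(requested))  # [(86991, 88480), (200000, 200100)]
```

All seventeen ranges in the log above are contiguous, so under this scheme they would collapse into a single 86991->101030 fetch.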
