geospatial-jeff / aiocogeo

Asynchronous cogeotiff reader

License: MIT License

Python 100.00%

aiocogeo's People

vincentsarago dmahr1 scottyhq kylebarron daguirreg jferencik nishadhka greenosoil ddohler rdenham

aiocogeo's Issues

Implement tile cache

It would be nice to have a cache for tiles which have already been fetched. This could be as simple as decorating get_tile with an alru_cache.

See GDAL's block cache
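
A minimal sketch of the idea, using only the standard library. The issue suggests alru_cache (from the async_lru package); the TileCache class, its get_or_fetch method, and the fake fetcher below are hypothetical stand-ins, not aiocogeo's API.

```python
import asyncio
from collections import OrderedDict


class TileCache:
    """Tiny LRU cache keyed by (x, y, z); a stand-in for an alru_cache decorator."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._tiles = OrderedDict()

    async def get_or_fetch(self, key, fetch):
        if key in self._tiles:
            self._tiles.move_to_end(key)  # mark as most recently used
            return self._tiles[key]
        tile = await fetch(*key)
        self._tiles[key] = tile
        if len(self._tiles) > self.maxsize:
            self._tiles.popitem(last=False)  # evict the least recently used tile
        return tile


async def demo():
    calls = []

    async def fake_get_tile(x, y, z):
        # Hypothetical fetcher; a real one would issue a range request.
        calls.append((x, y, z))
        return b"tile"

    cache = TileCache(maxsize=2)
    await cache.get_or_fetch((0, 0, 0), fake_get_tile)
    await cache.get_or_fetch((0, 0, 0), fake_get_tile)  # served from the cache
    return len(calls)
```

Running asyncio.run(demo()) should report a single fetch for the two identical requests.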

try out libdeflate

it might be faster than imagecodecs for deflate compression

https://github.com/dcwatson/deflate

Fix parsing of RESOLUTION tags

Parsing of these tags doesn't seem to work. Will fix this in a future PR, as these tags aren't that important for COG (esp. server side).

Originally posted by @geospatial-jeff in #78 (comment)

investigate chunk transfer encoding

GDAL and geotiff.js both use chunked transfer encoding, and aiohttp supports it (https://docs.aiohttp.org/en/stable/streams.html), so we should run a comparison to see which is better.

Support requester pays buckets

Read COG metadata (IFD/tags)

I'm thinking something like:

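from typing import Dict, List


class Tag:
    ...


class IFD:
    tags: Dict[str, Tag]


class COGTiff:
    def __init__(self, fpath: str):
        self.fpath = fpath
        self.ifds: List[IFD] = []

    async def __aenter__(self):
        # Request the first 16 KB, then parse the IFDs and their tags.
        self.ifds = []  # placeholder: the parsed IFDs (with tags) go here
        return self

    async def __aexit__(self, exc_type, exc, tb):
        ...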

Usage would be:

async with COGTiff('https://coolsat.com/cog.tif') as cog:
    await cog.read_tile()

We can reuse most of the code from COGDumper, but I'd like to make it more object-oriented to make the interface a little easier to use.

I like the AbstractReader used by COGDumper; for now let's focus on making it work for http files and then introduce pluggable readers.

Boundless reads

Confirm that boundless reads work (reading a map tile which isn't fully covered by image tiles).

Should just have to add exception handling here to catch TileNotFoundError, and create a mask for the missing portion of the map tile.

Also let the user define what value is used to fill empty pixels.

cc @vincentsarago

Read nodata values

Nodata is defined as a GDAL private tiff tag. We need to:

- Read the no data value from the header.
- Apply no data value as mask on tile reads.

Define explicit IFD attributes for supported tags

When the interface is finished we should define explicit IFD attributes for the supported tiff tags, a few reasons:

- Having a huge LUT containing a bunch of tags indicates that the library supports all of those tags, when we really only want to support the small subset of tags defined in the TIFF spec which are necessary for partial reads.
- As currently written, it's not explicitly defined which tiff tags are attached to an IFD. This makes the code much harder to understand and maintain. A user/developer should be able to look at the IFD class definition and know exactly which tiff tags it supports and how they are accessed.

Merge consecutive range requests

We have the offsets of each image tile and can calculate which tiles we need to read for a given partial read so this shouldn't be too difficult.

Ref: https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_HTTP_MERGE_CONSECUTIVE_RANGES
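
A rough sketch of the merging step, assuming each tile is described by a (start, end) byte range with an exclusive end. The merge_ranges name and the max_gap parameter are illustrative, not part of aiocogeo or GDAL.

```python
from typing import List, Tuple


def merge_ranges(ranges: List[Tuple[int, int]], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Merge (start, end) byte ranges that touch, overlap, or sit within max_gap bytes."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous range instead of issuing a new request.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, merge_ranges([(0, 100), (100, 200), (500, 600)]) collapses the first two tiles into one request and leaves the third alone.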

Improve IFD/tag composition

Tags are really just metadata about the IFD and it's annoying to access them like:

ifd.tag['TagName'].value

Would be easier to do:

ifd.TagName.value
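
One possible way to get there is __getattr__, sketched below. The Tag and IFD classes here are simplified stand-ins written for illustration, not the library's actual classes.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Tag:
    name: str
    value: Any


class IFD:
    def __init__(self, tags: Dict[str, Tag]):
        self._tags = tags

    def __getattr__(self, name: str) -> Tag:
        # Only called when normal attribute lookup fails, so regular
        # attributes and methods are unaffected.
        try:
            return self.__dict__["_tags"][name]
        except KeyError:
            raise AttributeError(name) from None
```

With this, IFD({"ImageWidth": Tag("ImageWidth", 512)}).ImageWidth.value works, and unknown names still raise AttributeError.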

Add rasterio profile

cog.profile property which mirrors rasterio's profile

https://github.com/geospatial-jeff/async-cog-reader/blob/master/tests/test_cog_reader.py#L22-L31

Add STAC filesystem

The STAC filesystem would search the item's assets for COGs and return potentially several http or s3 readers

Add filesystem exception handling

Reduce memory usage

Aiocogeo uses ~4x more memory than rio-tiler when reading a single tile:

Line #    Mem usage    Increment   Line Contents
================================================
    44    115.3 MiB    115.3 MiB   @profile
    45                             def main():
    46    125.7 MiB     10.4 MiB       asyncio.run(_aiocogeo())
    47    128.5 MiB      2.8 MiB       rio_tile()

The culprit is the call to skimage.transform.resize when resampling the image:

Line #    Mem usage    Increment   Line Contents
================================================
   292    118.9 MiB    118.9 MiB       @profile
   293                                 def _postprocess(
   294                                     self, arr: NpArrayType, img_tiles: TileMetadata, out_shape: Tuple[int, int]
   295                                 ) -> NpArrayType:
   296                                     """Wrapper around ``_clip_array`` and ``_resample`` to postprocess the partial read"""
   297    118.9 MiB      0.0 MiB           return self._resample(
   298    126.5 MiB      7.6 MiB               self._clip_array(arr, img_tiles), img_tiles=img_tiles, out_shape=out_shape
   299                                     )

Tag values are currently typed as Union[Any, Tuple[Any]]. This causes lots of downstream issues because the type is unclear. It would make the code much cleaner if we removed the Union and only used a single type for tag values. This would also let us add mypy to pre-commit.

Support http2

To match GDAL_HTTP_VERSION config option. Typically this is used in conjunction with GDAL_HTTP_MULTIPLEX (https://github.com/developmentseed/cogeo-tiler/blob/master/serverless.yml#L54-L55) but we are already multiplexing with asyncio.

Ref https://docs.aiohttp.org/en/stable/client_reference.html

Add CLI (Info and Grid)

I think it will be good to have a CLI with this tool.
Right now I'm thinking about:

Info: dump general info about the COG (https://github.com/blacha/cogeotiff#cogeotiff-info)
- blocks
- size
- min/max/mean block size per overview
- mask
- compression
Grid: return a TMS model of the internal structure of the COG (using morecantile tms model)

rio-tiler integration

Work began in #68 to support tiling with aiocogeo. The next step is to extend rio-tiler's BaseReader instead of defining our own class so aiocogeo can be (kind of) compatible with applications that already use rio-tiler.

Merge consecutive Requests

Linked to #21. GDAL merges consecutive requests (horizontal tiles, when band-interleaved, I think) up to 2 MB (configurable).

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).

Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems

Add config option to enable/disable block cache

Add config option to enable/disable block cache (GDAL has this as well). Also the cache is causing tests to fail, so it would be nice to disable caching during tests. That specific test case works by itself but fails during build because the same tiles are requested and cached in a different test case which throws off the number of requests.

Aiocache supports this through cache_read and cache_write kwargs injected to the cache key generator (https://aiocache.readthedocs.io/en/latest/decorators.html#cached)

Run cpu bound code in background

There is a lot of CPU-bound code (decompression, resampling, and numpy operations) which blocks the main thread at higher concurrency. We should look to use something like starlette.concurrency.run_in_threadpool or aiofiles.os.wrap, which both use asyncio.loop.run_in_executor to run code in the background without blocking the main thread.

It would be worth benchmarking the difference between a ProcessPoolExecutor and ThreadPoolExecutor (process would definitely be faster but by how much?).
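
A small sketch of the run_in_executor approach, using zlib.decompress as a stand-in for the tile decompression step. The decompress_tiles helper is hypothetical; passing a ProcessPoolExecutor instead of the default thread pool would be one way to run the benchmark mentioned above.

```python
import asyncio
import zlib


async def decompress_tiles(compressed_tiles, executor=None):
    """Decompress tiles off the event loop; executor=None uses the default thread pool."""
    loop = asyncio.get_running_loop()
    futures = [
        loop.run_in_executor(executor, zlib.decompress, tile)
        for tile in compressed_tiles
    ]
    # gather preserves the input order of the tiles.
    return await asyncio.gather(*futures)
```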

Add cog validator

aiocogeo supports a much smaller subset of COG types than gdal, so it would be good to have a way to validate if an image can be read.

support AVIF compression

imagecodecs (https://github.com/cgohlke/imagecodecs) is adding support for AVIF

Read COG tile

Once #2 is ready to go, we need a method to use IFD/tag metadata to read a given tile. COGDumper uses cogdumper.cog_tiles.COGTiff.read_tile, which appears to read just a single XYZ tile from the appropriate overview, based on the tile's coordinates relative to the (top left?) corner of the image.

As @vincentsarago pointed out, we could use pyproj to:

get geospatial info from the COG

fetch only the internal tile (and overview tiles) for a specific .read request.

I think it would be nice to implement something similar to rasterio.windows where we can use pyproj to map a particular bounding box to the corresponding XYZ tiles in the COG, but I'm definitely open to other ideas. This raises some questions:

How this will work with rio-tiler-v2 -- if at all. If the COGTiff class can implement a similar interface to rasterio.io.DatasetReader it could be passed in as the src_dst to rio_tiler.reader._read but I'm not sure if that is feasible.

Inconsistent indexing between one/multiband images

Read GeoKeyDirectoryTag into something pyproj compatible

Everything we need should be in https://github.com/OSGeo/libgeotiff/blob/master/libgeotiff

aiocogeo info fails when nodata is not an integer

aiocogeo/aiocogeo/ifd.py

Line 164 in 5a1d32c

return int(self.NoData.value[0]) if self.NoData else None
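
A possible fix, sketched under the assumption that the tag value arrives as a string (the GDAL nodata tag is stored as ASCII text). The parse_nodata helper is hypothetical and falls back to float when the text is not an integer.

```python
import math
from typing import Optional, Union


def parse_nodata(raw: Optional[str]) -> Optional[Union[int, float]]:
    """Parse a nodata string such as '-9999', '-3.4e+38' or 'nan'."""
    if raw is None:
        return None
    text = raw.rstrip("\x00").strip()  # drop the trailing NUL and whitespace
    try:
        return int(text)
    except ValueError:
        return float(text)  # also handles 'nan' and 'inf'
```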

Make header size configurable with environment variable

https://github.com/geospatial-jeff/async-cog-reader/blob/e3b613717291be7d247359480bd8e2f2cd2fe60a/async_cog_reader/constants.py#L3

GDAL docs:

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).

Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems
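
Something along these lines could work. The COG_HEADER_SIZE variable name is made up for illustration, and the 16 KB default is assumed to match the constant linked above.

```python
import os

DEFAULT_HEADER_SIZE = 16384  # 16 KB, assumed to match the current constant


def header_size(environ=os.environ) -> int:
    """Return the number of header bytes to request when opening a COG."""
    raw = environ.get("COG_HEADER_SIZE")  # hypothetical variable name
    if raw is None:
        return DEFAULT_HEADER_SIZE
    size = int(raw)
    if size <= 0:
        raise ValueError("header size must be a positive number of bytes")
    return size
```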

Large tag value offsets

The first 16KB of the header should contain all IFDs, but large tag values which don't fit in the 4-byte value field of each 12-byte IFD entry may be stored anywhere in the file (even after image data), in which case we'll need to do another range request into the file to fetch the tag value.
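
To make the distinction concrete, here is a sketch of parsing one classic (non-BigTIFF) IFD entry: 2 bytes of tag code, 2 bytes of field type, 4 bytes of count, then 4 bytes holding either the value itself or an offset to it. The parse_ifd_entry function and its return shape are illustrative only.

```python
import struct

# Size in bytes of each classic TIFF field type (type code -> size).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1, 7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def parse_ifd_entry(entry: bytes, little_endian: bool = True) -> dict:
    """Decide whether a 12-byte IFD entry holds its value inline or at an offset."""
    endian = "<" if little_endian else ">"
    code, field_type, count = struct.unpack(endian + "HHI", entry[:8])
    length = TYPE_SIZES[field_type] * count
    if length <= 4:
        # The value fits in the entry itself; no extra request is needed.
        return {"code": code, "inline": True, "data": entry[8:8 + length]}
    (offset,) = struct.unpack(endian + "I", entry[8:12])
    # The value lives elsewhere; if offset + length falls outside the bytes
    # already fetched, another range request is required.
    return {"code": code, "inline": False, "offset": offset, "length": length}
```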

Refactor partial read for internal masks

Doing a partial read when an internal mask is present is different enough from no mask to warrant refactoring the partial read into two methods. This should also make it easier to support internal masks when merging range requests (#29)

Read internal masks

remove `run_in_background`

Decompression and postprocessing will never really block the main thread, so it's causing more harm than good.

use moto for tests

Yeah should really use moto so we don't need to talk to S3 at all for tests

Originally posted by @geospatial-jeff in #97 (comment)

parallelize tag reads

With a slightly smarter reader we could easily read all tags for an ifd in parallel.

https://github.com/geospatial-jeff/aiocogeo/blob/master/aiocogeo/ifd.py#L42-L45

move tiling code to aiocogeo-tiler

It makes a lot of sense on many levels to move the gdal dependency into another repo to keep aiocogeo low level and lightweight. I think https://github.com/geospatial-jeff/aiocogeo-tiler would make a good home.

@vincentsarago

Cache only COG header

https://cogeotiff.slack.com/archives/C01DE57GLHE/p1603130953009500

Summary

Consider a case where N unique tile requests are made to a single COG. Despite the ENABLE_CACHE environment variable being enabled, all requests would be cache misses. Thus at least 2 * N range requests would need to be made to the COG. But if the COG header were cached separately, then only 1 + N range requests would need to be made.

Details

I plan to incorporate aiocogeo within a traditional tile server middleware that handles regular z/x/y.png requests. These currently read PNG tiles that are stored as a tile pyramid in bucket storage. This dated architecture is space inefficient but very performant. I'm hoping to achieve the space savings of COGs (via YCbCr JPEG compression + GDAL mask bands) without a meaningful increase in latency. One way to eliminate that latency is by caching the header in redis or another fast cache available to many servers. For example:

Client requests Mercator tile (z, x, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 1 (one of many COG servers).
- COG server 1 checks redis (or another cache) for cog.tif header, but it's not found. CACHE MISS.
- COG server 1 makes range request to cog.tif header.
- COG server 1 caches cog.tif header in redis.
- COG server 1 makes range request for (z, x, y) tile data.
- COG server 1 performs postprocessing and returns (z, x, y) tile data to client.
Client requests Mercator tile (z, x + 1, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 2.
- COG server 2 checks redis for cog.tif header and it's found. CACHE HIT.
- COG server 2 makes range request for (z, x + 1, y) tile data.
- COG server 2 performs postprocessing and returns (z, x + 1, y) tile data to client.

The header range request and the header caching step performed for the first request are not necessary for the second request.
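
The flow above could be sketched roughly as follows. DictCache stands in for redis, and read_tile, range_request, and locate_tile are hypothetical names rather than aiocogeo functions.

```python
import asyncio


class DictCache:
    """In-memory stand-in for redis; only get/set are needed here."""

    def __init__(self):
        self._store = {}

    async def get(self, key):
        return self._store.get(key)

    async def set(self, key, value):
        self._store[key] = value


async def read_tile(url, tile, cache, range_request, locate_tile, header_size=16384):
    header = await cache.get(url)
    if header is None:  # cache miss: fetch the header once and share it
        header = await range_request(url, 0, header_size)
        await cache.set(url, header)
    start, end = locate_tile(header, tile)
    return await range_request(url, start, end)


async def demo():
    requests = []

    async def fake_range_request(url, start, end):
        requests.append((start, end))
        return b"\x00" * (end - start)

    def fake_locate_tile(header, tile):
        # Made-up offsets; a real implementation would parse the header.
        z, x, y = tile
        return 20000 + 100 * x, 20100 + 100 * x

    cache = DictCache()
    for tile in [(5, 0, 0), (5, 1, 0)]:
        await read_tile("cog.tif", tile, cache, fake_range_request, fake_locate_tile)
    return len(requests)
```

With two tiles, the demo issues three range requests (1 + N) rather than four (2 * N).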

Checkout COGDumper

👋 @geospatial-jeff
The subject looks interesting 😄

Not sure what's your idea but if we want to go full async maybe we can use some of the code from https://github.com/mapbox/COGDumper to go GDAL Free ...

COGDumper is not smart and doesn't do any spatial stuff, but if we add pyproj we might be able to:

- get geospatial info from the COG
- fetch only the internal tile (and overview tiles) for a specific .read request.

Add rasterio/tiler extra

Add an extra which includes code to do dynamic tiling with aiocogeo (ex. pip install aiocogeo[tiler])

This would be an extra because rasterio is required for coordinate system logic, and I don't want to include it as a core dependency.
We should aim to implement a similar interface to rio_tiler.io.base.BaseReader.

Fix reading of byte formatted tags with text

Example:
Tag(code=305, name='Software', tag_type=TagType(format='c', size=1), count=21, length=21, value=(b'T', b'r', b'i', b'm', b'b', b'l', b'e', b' ', b'G', b'e', b'r', b'm', b'a', b'n', b'y', b' ', b'G', b'm', b'b', b'H', b'\x00'))
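
A minimal sketch of one fix, assuming the value arrives as a tuple of single-byte bytes objects as in the example above. The decode_ascii_tag name is hypothetical.

```python
def decode_ascii_tag(value) -> str:
    """Join per-character bytes into a string and drop the trailing NUL terminator."""
    raw = b"".join(value)
    return raw.rstrip(b"\x00").decode("ascii")
```

Applied to the Software tag above, this would yield the text 'Trimble Germany GmbH'.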

add logging

Would be really useful for debugging purposes to have more verbosity on reads.

I often use CPL_DEBUG and CPL_CURL_VERBOSE within GDAL to see how much data is transferred and how many GET/LIST/HEAD requests GDAL is making.

Side note: maybe having an internal variable to hold this could be cool:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/webp_cog.tif") as cog:
    x = y = z = 0
    tile = await cog.get_tile(x, y, z)

print(cog.requests)
{
    count: 3,   
    size: TotalSizeOfRequest
    get: [
       'offset1-offset2', sizeOfRequest1,
       'offset3-offset4',  sizeOfRequest2,
       'offset5-offset6',  sizeOfRequest3
   ]
}
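
One way such a tracker might look, as a sketch only. The RequestLog class, its fields, and the logger name are assumptions made for illustration.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Tuple

logger = logging.getLogger("aiocogeo.requests")  # logger name is an assumption


@dataclass
class RequestLog:
    ranges: List[Tuple[int, int]] = field(default_factory=list)

    def record(self, start: int, end: int) -> None:
        """Record one range request covering bytes [start, end)."""
        self.ranges.append((start, end))
        logger.debug("GET bytes=%d-%d (%d bytes)", start, end - 1, end - start)

    @property
    def count(self) -> int:
        return len(self.ranges)

    @property
    def size(self) -> int:
        return sum(end - start for start, end in self.ranges)
```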

Support BIGTIFF

Pretty self-explanatory: aiocogeo doesn't currently support BIGTIFF.

aiocogeo/aiocogeo/cog.py

Line 88 in 950ea55

raise NotImplementedError("BigTiff is not yet supported")
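
For reference, a sketch of how the two header layouts differ: classic TIFF uses version 42 with a 4-byte first-IFD offset, while BigTIFF uses version 43 with an offset-size field followed by an 8-byte first-IFD offset. The read_tiff_header function is illustrative and not the library's code.

```python
import struct


def read_tiff_header(header: bytes) -> dict:
    """Detect classic TIFF vs BigTIFF and return the offset of the first IFD."""
    byte_order = header[:2]
    if byte_order == b"II":
        endian = "<"  # little endian
    elif byte_order == b"MM":
        endian = ">"  # big endian
    else:
        raise ValueError("not a TIFF file")
    (version,) = struct.unpack(endian + "H", header[2:4])
    if version == 42:
        (first_ifd,) = struct.unpack(endian + "I", header[4:8])
        return {"bigtiff": False, "first_ifd": first_ifd}
    if version == 43:
        offset_size, _reserved, first_ifd = struct.unpack(endian + "HHQ", header[4:16])
        if offset_size != 8:
            raise ValueError("unexpected BigTIFF offset size")
        return {"bigtiff": True, "first_ifd": first_ifd}
    raise ValueError("unknown TIFF version")
```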

Define tags as semi-private

Made some improvements with #4 and #9 but I'm still not a huge fan.

I think it would be best to switch tags to semi-private attributes and expose important metadata through properties like here. At minimum we should have properties for the rasterio profile. A few reasons:

- I don't think most users care about the metadata contained on each Tag object (or even care about all of the defined tags)
- Keeping Tag defined on the IFD still resolves #9.
- Properties of course will be more user friendly to use (ifd.width instead of ifd.ImageWidth.value).
- Making Tag semi-private prevents confusion (ex. ifd.Compression vs ifd.compression is confusing)
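
A small sketch of what this could look like. The class below is a simplified stand-in that stores raw tag values in a semi-private dict; the property names are assumptions.

```python
class IFD:
    def __init__(self, tags: dict):
        # Semi-private: raw tag values keyed by TIFF tag name.
        self._tags = tags

    @property
    def width(self) -> int:
        return self._tags["ImageWidth"]

    @property
    def height(self) -> int:
        # The TIFF tag holding the image height is named ImageLength.
        return self._tags["ImageLength"]
```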

Implement GDAL 3.1 optimizations

Ref https://gdal.org/drivers/raster/cog.html#header-ghost-area

use numpy.testing for better array comparisons

add flake8, black, isort, pydocstring .... or use yapf

Let the war begin, @kylebarron! The first one to submit a PR wins!

Caching merged range requests

Ref #23

I think there are a few options which could work:

- Cache individual tiles after the ranged request. This has the benefit of caching the tile regardless of how it was requested (merged vs. unmerged), but adds complexity: before doing a merged request we need to check whether all of the tiles it covers are already cached, and if so skip the request and pull the tiles directly from the cache.
- Cache the range request itself, using start/end as the cache key. This is easier to implement but won't cache the same tile across merged and unmerged requests. Another downside is that we will only get a cache hit if the exact same range request is performed (ex. if you have two ranges A->D and B->D there will not be a cache hit even though 75% of the imagery is the same between the two requests).
- Another solution is to cache with some sort of range key so we never request the same byte from the image more than once. This would of course be useful for every range request we perform and would be implemented on the lower-level Filesystem, which is a nice design pattern, but I don't think aiocache has support for this.

There is also an argument to be made that choosing a caching strategy which works across both merged and unmerged requests is unnecessary, since (I think?) most users would be exclusively using either merged or unmerged range requests.

Support more compressions

It would be great to add support for other compressions. Cross referencing the compressions supported by imagecodecs to rio-cogeo profiles, we should support:

- lzma
- packbits
- lerc

We should also support no compression, although I don't think this is very common.