geospatial-jeff / aiocogeo
Asynchronous cogeotiff reader
License: MIT License
It would be nice to have a cache for tiles which have already been fetched. This could be as simple as decorating get_tile with an alru_cache.
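A minimal sketch of the caching idea. To keep it runnable with only the standard library, it uses a small hand-rolled LRU decorator for coroutines in place of alru_cache (which comes from the third-party async_lru package). FakeReader and its fetch counter are hypothetical stand-ins, not aiocogeo's API.

```python
import asyncio
import functools
from collections import OrderedDict


def async_lru(maxsize=128):
    """Tiny LRU cache for coroutine functions (stand-in for async_lru.alru_cache)."""
    def decorator(func):
        cache = OrderedDict()

        @functools.wraps(func)
        async def wrapper(*args):
            if args in cache:
                cache.move_to_end(args)      # mark as most recently used
                return cache[args]
            result = await func(*args)
            cache[args] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)    # evict the least recently used entry
            return result
        return wrapper
    return decorator


class FakeReader:
    """Hypothetical reader that counts how often a tile is really fetched."""

    def __init__(self):
        self.fetches = 0

    @async_lru(maxsize=2)
    async def get_tile(self, x, y, z):
        self.fetches += 1
        await asyncio.sleep(0)               # pretend to do a range request
        return f"tile-{x}-{y}-{z}"


async def demo():
    reader = FakeReader()
    await reader.get_tile(0, 0, 0)
    await reader.get_tile(0, 0, 0)           # second call is served from the cache
    return reader.fetches
```

Because the cache key includes the reader instance, two readers pointing at different files would not share entries in this sketch.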
It might be faster than imagecodecs for deflate compression.
Parsing of these tags doesn't seem to work. Will fix this in a future PR, as these tags aren't that important for COG (esp. server side).
Originally posted by @geospatial-jeff in #78 (comment)
GDAL and geotiff.js both use chunked transfer encoding, and aiohttp supports it (https://docs.aiohttp.org/en/stable/streams.html), so we should do a comparison to see which is better.
I'm thinking something like:
    from typing import Dict, List

    class Tag:
        ...

    class IFD:
        tags: Dict[str, Tag]

    class COGTiff:
        fpath: str
        ifds: List[IFD]

        def __init__(self, fpath: str):
            self.fpath = fpath

        async def __aenter__(self):
            # Request first 16kb, parse ifd and their tags.
            self.ifds = []  # placeholder for <ifd with tags>
            return self

        async def __aexit__(self, *exc):
            pass

Usage would be:

    async with COGTiff('https://coolsat.com/cog.tif') as cog:
        await cog.read_tile()
We can reuse most of the code from COGDumper but I'd like to make it more object oriented to make the interface a little easier to use.
I like the AbstractReader used by COGDumper; for now let's focus on making it work for HTTP files and then introduce pluggable readers.
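One way pluggable readers could look, as a rough sketch. The interface below (a single range_request coroutine) is an assumption made for illustration and is not COGDumper's actual AbstractReader; MemoryReader and read_header are hypothetical names.

```python
import abc
import asyncio


class AbstractReader(abc.ABC):
    """Hypothetical async reader interface: fetch a byte range from some backend."""

    @abc.abstractmethod
    async def range_request(self, start: int, length: int) -> bytes:
        ...


class MemoryReader(AbstractReader):
    """In-memory backend, handy for tests; an HTTP or S3 reader would have the same shape."""

    def __init__(self, data: bytes):
        self.data = data

    async def range_request(self, start: int, length: int) -> bytes:
        return self.data[start:start + length]


async def read_header(reader: AbstractReader, size: int = 16384) -> bytes:
    """Request the first 16 KB regardless of which backend is plugged in."""
    return await reader.range_request(0, size)
```

The COG-level code would only ever call range_request, so adding a new backend would not touch the parsing logic.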
Confirm that boundless reads work (reading a map tile which isn't fully covered by image tiles).
Should just have to add exception handling here to catch TileNotFoundError, and create a mask for the missing portion of the map tile.
Also let the user define what value is used to fill empty pixels.
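A rough sketch of that approach, using plain nested lists instead of numpy arrays so it runs on the standard library alone. The TileNotFoundError class, read_boundless, and fake_get_tile here are hypothetical and defined locally for illustration.

```python
class TileNotFoundError(Exception):
    """Hypothetical error raised when an image tile lies outside the COG."""


def read_boundless(get_tile, tile_indices, tile_size, fill_value=0):
    """Return {index: (pixels, mask)}; missing tiles are filled and masked out."""
    out = {}
    for idx in tile_indices:
        try:
            pixels = get_tile(*idx)
            mask = [[True] * tile_size for _ in range(tile_size)]
        except TileNotFoundError:
            # No image data here: use the user-defined fill and mark it invalid.
            pixels = [[fill_value] * tile_size for _ in range(tile_size)]
            mask = [[False] * tile_size for _ in range(tile_size)]
        out[idx] = (pixels, mask)
    return out


def fake_get_tile(x, y):
    """Pretend the image only has a single 2x2 tile at (0, 0)."""
    if (x, y) != (0, 0):
        raise TileNotFoundError((x, y))
    return [[7, 7], [7, 7]]
```

For example, read_boundless(fake_get_tile, [(0, 0), (1, 0)], 2, fill_value=255) returns real pixels for (0, 0) and a 255-filled, fully masked tile for (1, 0).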
Nodata is defined as a GDAL private tiff tag. We need to:
When the interface is finished we should define explicit IFD attributes for the supported tiff tags, a few reasons:
We have the offsets of each image tile and can calculate which tiles we need to read for a given partial read so this shouldn't be too difficult.
Ref: https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_HTTP_MERGE_CONSECUTIVE_RANGES
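A small sketch of merging consecutive ranges, given the (offset, length) of each tile. The function name and the 2 MB default cap are assumptions for illustration; this is not GDAL's implementation.

```python
def merge_ranges(ranges, max_size=2 * 1024 * 1024):
    """Merge (offset, length) byte ranges that are contiguous, capped at max_size bytes."""
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            last_offset, last_length = merged[-1]
            contiguous = last_offset + last_length == offset
            if contiguous and last_length + length <= max_size:
                # Extend the previous request instead of issuing a new one.
                merged[-1] = (last_offset, last_length + length)
                continue
        merged.append((offset, length))
    return merged
```

After fetching a merged range, each tile can be sliced back out using its original offset relative to the start of the merged request.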
Tags are really just metadata about the IFD and it's annoying to access them like:
ifd.tag['TagName'].value
Would be easier to do:
ifd.TagName.value
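One possible way to get ifd.TagName.value without declaring every tag up front is to fall back to the tag dictionary in __getattr__. This is only a sketch; the Tag and IFD classes below are simplified stand-ins, not aiocogeo's real classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Tag:
    name: str
    value: Any


@dataclass
class IFD:
    tags: Dict[str, Tag] = field(default_factory=dict)

    def __getattr__(self, name: str) -> Tag:
        # Only called when normal attribute lookup fails, so ``tags`` itself is unaffected.
        try:
            return self.__dict__["tags"][name]
        except KeyError:
            raise AttributeError(name) from None
```

With this, IFD(tags={"ImageWidth": Tag("ImageWidth", 512)}).ImageWidth.value evaluates to 512, and unknown names still raise AttributeError as usual.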
cog.profile property which mirrors rasterio's profile
https://github.com/geospatial-jeff/async-cog-reader/blob/master/tests/test_cog_reader.py#L22-L31
The STAC filesystem would search the item's assets for COGs and return potentially several http or s3 readers
Aiocogeo uses ~4x more memory than rio tiler when reading a single tile:
Line #    Mem usage    Increment   Line Contents
================================================
    44    115.3 MiB    115.3 MiB   @profile
    45                             def main():
    46    125.7 MiB     10.4 MiB       asyncio.run(_aiocogeo())
    47    128.5 MiB      2.8 MiB       rio_tile()
The culprit is the call to skimage.resize when resampling the image:
Line #    Mem usage    Increment   Line Contents
================================================
   292    118.9 MiB    118.9 MiB   @profile
   293                             def _postprocess(
   294                                 self, arr: NpArrayType, img_tiles: TileMetadata, out_shape: Tuple[int, int]
   295                             ) -> NpArrayType:
   296                                 """Wrapper around ``_clip_array`` and ``_resample`` to postprocess the partial read"""
   297    118.9 MiB      0.0 MiB       return self._resample(
   298    126.5 MiB      7.6 MiB           self._clip_array(arr, img_tiles), img_tiles=img_tiles, out_shape=out_shape
   299                                 )
Tag values are currently typed as Union[Any, Tuple[Any]]. This causes lots of downstream issues because the type is unclear. It would make the code much cleaner if we removed the Union and only used a single type for tag values. This would also let us add mypy to pre-commit.
To match GDAL_HTTP_VERSION config option. Typically this is used in conjunction with GDAL_HTTP_MULTIPLEX (https://github.com/developmentseed/cogeo-tiler/blob/master/serverless.yml#L54-L55) but we are already multiplexing with asyncio.
Ref https://docs.aiohttp.org/en/stable/client_reference.html
I think it will be good to have a CLI with this tool.
Right now I'm thinking about:
- info: dump general info about the COG (https://github.com/blacha/cogeotiff#cogeotiff-info)
- grid: return a TMS model of the internal structure of the COG (using morecantile tms model)
Work began in #68 to support tiling with aiocogeo. The next step is to extend rio-tiler's BaseReader instead of defining our own class so aiocogeo can be (kind of) compatible with applications that already use rio-tiler.
Linked to #21. GDAL merges consecutive requests (horizontal tiles, when band interleaved I think) up to 2 MB (configurable).
Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).
Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems
Add config option to enable/disable block cache (GDAL has this as well). Also the cache is causing tests to fail, so it would be nice to disable caching during tests. That specific test case works by itself but fails during build because the same tiles are requested and cached in a different test case which throws off the number of requests.
Aiocache supports this through cache_read and cache_write kwargs injected to the cache key generator (https://aiocache.readthedocs.io/en/latest/decorators.html#cached)
There is a lot of CPU-bound code which blocks the main thread at higher concurrencies, like decompression, resampling, and numpy operations. We should look to use something like starlette.concurrency.run_in_threadpool or aiofiles.os.wrap, which both use asyncio.loop.run_in_executor to run code in the background without blocking the main thread.
It would be worth benchmarking the difference between a ProcessPoolExecutor and ThreadPoolExecutor (process would definitely be faster but by how much?).
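A minimal sketch of moving CPU-bound work off the event loop with loop.run_in_executor. Here zlib stands in for whatever codec a tile actually uses, and decompress_tiles and demo are hypothetical names.

```python
import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor


def decompress(data: bytes) -> bytes:
    """CPU-bound work; zlib is only a stand-in for the real tile codec."""
    return zlib.decompress(data)


async def decompress_tiles(tiles, executor=None):
    """Decompress tiles in an executor so the event loop thread stays free."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, decompress, t) for t in tiles]
    return await asyncio.gather(*futures)


async def demo():
    tiles = [zlib.compress(bytes([i]) * 1024) for i in range(4)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        return await decompress_tiles(tiles, executor=pool)
```

Passing a ProcessPoolExecutor instead only requires that the function and its arguments are picklable, which makes the two easy to compare in a benchmark.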
aiocogeo supports a much smaller subset of COG types than gdal, so it would be good to have a way to validate if an image can be read.
imagecodecs (https://github.com/cgohlke/imagecodecs) is adding support for AVIF
Once #2 is ready to go, we need a method to use IFD/tag metadata to read a given tile. COGDumper uses cogdumper.cog_tiles.COGTiff.read_tile, which looks to read just a single XYZ tile based on the tile's coordinate with respect to the (top left?) of the image from the appropriate overview.
As @vincentsarago pointed out, we could use pyproj to:
- get geospatial info from the COG
- fetch only the internal tile (and overview tiles) for a specific .read request.
I think it would be nice to implement something similar to rasterio.windows where we can use pyproj to map a particular bounding box to the corresponding XYZ tiles in the COG, but I'm definitely open to other ideas. This brings some questions.
[...] rio-tiler-v2 -- if at all. If the COGTiff class can implement a similar interface to rasterio.io.DatasetReader it could be passed in as the src_dst to rio_tiler.reader._read but I'm not sure if that is feasible.
Everything we need should be in https://github.com/OSGeo/libgeotiff/blob/master/libgeotiff
GDAL docs:
Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).
Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems
The first 16KB of the header should contain all IFDs, but large tag values which don't fit in the 4-byte value field of each 12-byte IFD entry may be stored anywhere in the file (even after image data), in which case we'll need to do another range request into the file to fetch the tag value.
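To make the layout concrete, here is a sketch that parses the classic TIFF header and first IFD from the leading bytes of a file and flags which tag values would need a further range request. In TIFF 6.0 each IFD entry is 12 bytes (tag code, field type, count, and a 4-byte value-or-offset field). The function name and the returned dictionary shape are assumptions for illustration.

```python
import struct

# Byte size of each TIFF 6.0 field type (types 1 through 12).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1, 7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def parse_first_ifd(header: bytes):
    """Parse the TIFF header and the entries of the first IFD."""
    endian = {b"II": "<", b"MM": ">"}[header[:2]]
    magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF uses 43)")
    (count,) = struct.unpack_from(endian + "H", header, ifd_offset)
    entries = []
    for i in range(count):
        code, ftype, n, raw = struct.unpack_from(
            endian + "HHI4s", header, ifd_offset + 2 + 12 * i
        )
        size = TYPE_SIZES[ftype] * n
        inline = size <= 4                   # value fits in the 4-byte field
        entries.append({
            "code": code,
            "size": size,
            "inline": inline,
            # Otherwise the field holds the file offset of the value.
            "offset": None if inline else struct.unpack(endian + "I", raw)[0],
        })
    return entries
```

Any entry whose offset plus size falls beyond the bytes already downloaded would be a candidate for an extra range request.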
Doing a partial read when an internal mask is present is different enough from no mask to warrant refactoring the partial read into two methods. This should also make it easier to support internal masks when merging range requests (#29)
Decompression and postprocessing will never really block the main thread, so it's causing more harm than good.
Yeah should really use moto so we don't need to talk to S3 at all for tests
Originally posted by @geospatial-jeff in #97 (comment)
With a slightly smarter reader we could easily read all tags for an ifd in parallel.
https://github.com/geospatial-jeff/aiocogeo/blob/master/aiocogeo/ifd.py#L42-L45
It makes a lot of sense on many levels to move the gdal dependency into another repo to keep aiocogeo low level and lightweight. I think https://github.com/geospatial-jeff/aiocogeo-tiler would make a good home.
https://cogeotiff.slack.com/archives/C01DE57GLHE/p1603130953009500
Consider a case where N unique tile requests are made to a single COG. Despite the ENABLE_CACHE environment variable being enabled, all requests would be cache misses. Thus at least 2 * N range requests would need to be made to the COG. But if the COG header were cached separately, then only 1 + N range requests would need to be made.
I plan to incorporate aiocogeo within a traditional tile server middleware that handles regular z/x/y.png requests. These currently read PNG tiles that are stored as a tile pyramid in bucket storage. This dated architecture is space inefficient but very performant. I'm hoping to achieve the space savings of COGs (via YCbCr JPEG compression + GDAL mask bands) without a meaningful increase in latency. One way to eliminate that latency is by caching the header in redis or another fast cache available to many servers. For example:
- [...] (z, x, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 1 (one of many COG servers).
- [...] cog.tif header, but it's not found. CACHE MISS.
- [...] cog.tif header.
- [...] cog.tif header in redis.
- [...] (z, x, y) tile data.
- [...] (z, x, y) tile data to client.
- [...] (z, x + 1, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 2.
- [...] cog.tif header and it's found. CACHE HIT.
- [...] (z, x + 1, y) tile data.
- [...] (z, x + 1, y) tile data to client.
The two italicized operations for the first request (the third and fourth steps above) are not necessary for the second request.
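The request arithmetic can be checked with a toy model. In this sketch a plain dict stands in for redis, and CountingStore, serve_tile, and count_requests are hypothetical names rather than anything in aiocogeo.

```python
class CountingStore:
    """Hypothetical bucket client that counts range requests."""

    def __init__(self):
        self.range_requests = 0

    def get_range(self, key, kind):
        self.range_requests += 1
        return f"{kind} of {key}"


def serve_tile(store, header_cache, cog, tile):
    """Serve one tile, reading the header from a shared cache when possible."""
    header = header_cache.get(cog) if header_cache is not None else None
    if header is None:
        header = store.get_range(cog, "header")      # cache miss: fetch the header
        if header_cache is not None:
            header_cache[cog] = header               # a dict stands in for redis here
    return store.get_range(cog, f"tile {tile}")


def count_requests(n_tiles, shared_header_cache):
    """Count range requests for n unique tiles, with or without a header cache."""
    store = CountingStore()
    cache = {} if shared_header_cache else None
    for x in range(n_tiles):
        serve_tile(store, cache, "cog.tif", (0, x, 0))
    return store.range_requests
```

With 5 unique tiles this gives 10 range requests without the shared header cache and 6 with it, matching 2 * N and 1 + N.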
@geospatial-jeff
The subject looks interesting
Not sure what's your idea but if we want to go full async maybe we can use some of the code from https://github.com/mapbox/COGDumper to go GDAL Free ...
COGDumper is not smart and doesn't do any spatial stuff but if we add pyproj we might be able to:
- get geospatial info from the COG
- fetch only the internal tile (and overview tiles) for a specific .read request.
Add an extra which includes code to do dynamic tiling with aiocogeo (ex. pip install aiocogeo[tiler]).
- [...] rio_tiler.io.base.BaseReader.
Example:
Tag(code=305, name='Software', tag_type=TagType(format='c', size=1), count=21, length=21, value=(b'T', b'r', b'i', b'm', b'b', b'l', b'e', b' ', b'G', b'e', b'r', b'm', b'a', b'n', b'y', b' ', b'G', b'm', b'b', b'H', b'\x00'))
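If the goal is a plain string for ASCII tags like the one above, one possible conversion is sketched below. This is an assumption about the intent, not a description of aiocogeo's current behavior, and decode_ascii_tag is a hypothetical helper.

```python
def decode_ascii_tag(value):
    """Join a tuple of single bytes into a str, dropping the trailing NUL terminator."""
    return b"".join(value).rstrip(b"\x00").decode("ascii")


software = (
    b'T', b'r', b'i', b'm', b'b', b'l', b'e', b' ',
    b'G', b'e', b'r', b'm', b'a', b'n', b'y', b' ',
    b'G', b'm', b'b', b'H', b'\x00',
)
print(decode_ascii_tag(software))  # prints "Trimble Germany GmbH"
```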
Would be really useful for debugging purposes to have more verbosity on reads.
I often use CPL_DEBUG and CPL_CURL_VERBOSE within GDAL to see how much data and how many GET/LIST/HEAD requests GDAL is doing.
Side note: maybe having an internal variable to host this could be cool:
    async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/webp_cog.tif") as cog:
        x = y = z = 0
        tile = await cog.get_tile(x, y, z)
        print(cog.requests)

    {
        count: 3,
        size: TotalSizeOfRequest,
        get: [
            'offset1-offset2', sizeOfRequest1,
            'offset3-offset4', sizeOfRequest2,
            'offset5-offset6', sizeOfRequest3
        ]
    }
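A sketch of how such a request log could be collected. RequestLog is a hypothetical container; the reader would call record() each time it issues a range request and could expose summary() as something like cog.requests.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RequestLog:
    """Hypothetical container for per-reader request statistics."""
    ranges: List[Tuple[int, int]] = field(default_factory=list)

    def record(self, start: int, end: int) -> None:
        """Record an inclusive byte range, as in an HTTP Range header."""
        self.ranges.append((start, end))

    def summary(self) -> dict:
        """Summarize the number of requests, total bytes, and each range with its size."""
        return {
            "count": len(self.ranges),
            "size": sum(end - start + 1 for start, end in self.ranges),
            "get": [(f"{start}-{end}", end - start + 1) for start, end in self.ranges],
        }
```

Pairing each range with its size as a tuple keeps the "get" list unambiguous, compared to the flat alternating list in the mock-up above.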
Pretty self-explanatory: aiocogeo doesn't currently support BigTIFF.
Made some improvements with #4 and #9 but I'm still not a huge fan.
I think it would be best to switch tags to semi-private attributes and expose important metadata through properties like here. At minimum we should have properties for the rasterio profile. A few reasons:
- [...] Tag object (or even care about all of the defined tags)
- [...] Tag defined on the IFD still resolves #9.
- [...] ifd.width instead of ifd.ImageWidth.value).
- [...] Tag semi-private prevents confusion (ex. ifd.Compression vs ifd.compression is confusing)
Let the war begin @kylebarron! The first one to submit a PR wins!
Ref #23
I think there are a few options which could work:
- [...] Filesystem which is a nice design pattern, but I don't think aiocache has support for this.
There is also an argument to be made against choosing a caching strategy which works across both merged/unmerged requests, since (I think?) most users would be exclusively using either merged or unmerged range requests.
It would be great to add support for other compressions. Cross referencing the compressions supported by imagecodecs to rio-cogeo profiles, we should support:
We should also support no compression, although I don't think this is very common.