geofileops / geofileops

Python toolbox to process large vector files faster.

Home Page: https://geofileops.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Python 97.95% QML 1.89% Scheme 0.16%
geospatial-data geospatial-processing large-files geopackage geoprocessing python vector gis gdal geopandas

geofileops's Introduction

geofileops


Geofileops is a Python toolbox to process large vector files faster.

Most typical GIS operations are available: e.g. buffer, dissolve, erase/difference, intersection, union,...

The spatial operations are tested on geopackage and shapefile input files, but geopackage is recommended as it will give better performance. General layer and file operations can be used on the file formats supported by GDAL.

The full documentation is available on readthedocs.
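A minimal usage sketch (the file names are placeholders; see the documentation for the exact signatures and options):

    import geofileops as gfo

    # Buffer all geometries in a (large) geopackage; geofileops splits the work
    # into batches and processes them on all available CPUs.
    gfo.buffer(
        input_path="parcels.gpkg",
        output_path="parcels_buffered.gpkg",
        distance=10,
    )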

Different techniques are used under the hood to be able to process large files as fast as possible:

  • process data in batches
  • subdivide/merge complex geometries on the fly
  • process data in different passes
  • use all available CPUs

The following chart gives an impression of the speed improvement that can be expected when processing larger files. The benchmarks typically use input file(s) with 500K polygons, were run on a Windows PC with 12 cores, and include I/O.

Geo benchmark

geofileops's People

Contributors

dependabot[bot], joeribroeckx, kriway, richardscottoz, theroggy


geofileops's Issues

Add a method to detect the geometry types actually in a file with a "Geometry" geometry column

Some files have a column defined as "Geometry", but actually contain only one primitive type (e.g. multi + single polygons), so they could still get proper automatic treatment in geo operations.

For this type of file an automatic detection would be useful...

  • make a function to determine all geometry types in a layer (geofileops.get_layer_geometrytypes); a sketch follows after this list
  • use this function to automatically deal with this, if possible, in all geo operations
  • A special case of this is geofiles that don't contain any rows and have a geometry column of type "Geometry". For this situation the geometry type is not important at all for any spatial operations.
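A minimal sketch of such a detection, assuming fiona is available (geofileops.get_layer_geometrytypes is the proposed name, not an existing function):

    import fiona

    def get_layer_geometrytypes(path, layer=None):
        """Return the distinct geometry types actually present in a layer."""
        with fiona.open(path, layer=layer) as src:
            return sorted(
                {f["geometry"]["type"] for f in src if f["geometry"] is not None}
            )

    # A layer declared as "Geometry" might e.g. return ["MultiPolygon", "Polygon"].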

Use pyogrio for GeoDataFrame I/O

Gives a significant increase in performance for GeoDataFrame I/O, especially for writing: 5x faster.

Especially for larger files this can give significant gains: e.g. for a 400 MB gpkg the difference is more than a minute.
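A sketch of what the switch looks like on the pyogrio side (file and layer names are placeholders):

    import pyogrio

    # Read a layer into a GeoDataFrame and write it back, bypassing fiona.
    gdf = pyogrio.read_dataframe("input.gpkg", layer="parcels")
    pyogrio.write_dataframe(gdf, "output.gpkg", layer="parcels", driver="GPKG")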

Make all direct dependencies of geofileops explicit

Why?
Not all packages that are explicitly imported/used in geofileops are listed in the dependencies, because they are installed anyway as dependencies of other packages.
However, if one of those packages drops such a dependency, geofileops breaks... so it is better to add them explicitly as dependencies.

Where?
In setup.py, conda feedstock, CI envs...

ENH: reduce memory usage when applying geo operations on large input files

Avoid out-of-memory errors, regardless of the size of the input files, e.g. by:

  • limit batch size for geopandas based operations
  • give the user an option to specify batch size explicitly for cases where the heuristics don't work out
  • avoid using group_by clauses in queries, as this seems to use a lot of memory
    • erase/union/identity
    • export_by_location
    • clip
  • have a more intelligent determination of the size of batches in general (a rough sketch follows after this list)
  • check memory usage not only when starting an operation, but also e.g. before the start of processing each new batch
  • the sqlite cache size is now quite large; test if it can be reduced without impacting throughput (too much)
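A rough sketch of the kind of memory-aware batch sizing meant above; the helper name and the bytes-per-row estimate are illustrative only:

    import psutil

    def determine_batch_size(nb_rows, nb_parallel, max_fraction=0.25):
        """Pick a batch size so all parallel batches fit in a fraction of free RAM."""
        available = psutil.virtual_memory().available
        bytes_per_row = 20_000  # rough average footprint per row; tune per operation
        max_rows = int(available * max_fraction / (bytes_per_row * nb_parallel))
        return max(1, min(nb_rows, max_rows))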

Improve speed of creating index on geopackage

The time taken to create an index on a geopackage can be halved by letting sqlite use more memory.

So:

  1. Increase the cache size: the default cache size for sqlite is 1000 pages, or 4 MB, which is very conservative, so this default can be increased
  2. Add a parameter to gfo.create_spatial_index to specify the cache size (a sketch of the sqlite side follows below)
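For reference, a sketch of how the cache size can be raised before creating the index; the 128 MB value is only an example:

    import sqlite3

    conn = sqlite3.connect("data.gpkg")
    # A negative cache_size is interpreted by sqlite as KiB instead of pages,
    # so this asks for ~128 MB instead of the conservative default.
    conn.execute("PRAGMA cache_size=-131072")
    # ... create the spatial index on this connection here ...
    conn.close()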

Add operation apply

Add the generic function apply, to be able to run custom Python code on (typically) geometries...
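A possible usage sketch of the proposed function (the signature shown is illustrative, not an existing API):

    import geofileops as gfo

    # Proposed: run custom Python code on every geometry.
    gfo.apply(
        input_path="parcels.gpkg",
        output_path="parcels_simplified.gpkg",
        func=lambda geom: geom.simplify(1, preserve_topology=True),
    )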

Option to backup original primary key column (fid) in output when applying operations

Now the PK "fid" column that is present in eg. shapefile and geopackage files is ignored and thus lost in geofileops operations. However, in some input files there is no other unique ID field available, so for these cases it can be practical (necessary?) that the original fid values can be kept/added (as columns). This way it is possible to eg. join with the original files later on when doing GIS analysis.

There are a few options:

  • retain the original fid as fid in the output file: this could be an option for one-layer operations, or operations that output only columns of the input file. But nonetheless this can give issues:
    • if explodecollections=True this can/will lead to non-unique fids
    • shapefiles don't support fids that aren't sequential
  • add column(s) with the original fid(s) of the input file(s)

Conclusion: it seems that (having the option to) add a column with the original fid is the solution that works in all cases without significant disadvantages, so probably the best one.

The name of the fid column(s) to be added needs some extra thought though:

  • for most operations with multiple input files the column names can follow the existing prefix system (eg. l1_fid, l2_fid).
  • for single-layer and some 2 layer operations there is no prefix, so another solution must be used

Another question: add the fid column(s) by default or not?

  • if automatically added, the change is not backwards compatible; however, adding a column will typically not be a problem in real-life use cases, so probably not too problematic.
  • typically though, you don't need the original fid, because most of the time there will be a logical unique ID to use when eg. joining with the original file (if needed).

Conclusion: don't add them by default, but make it possible for all spatial operations. If it becomes possible to specify "fid" via the existing columns parameter, the feature can be added without creating additional parameters, which sounds nice.

Technical note: a complication is that the implementation is different for sql based operations vs. geopandas based ones.

  • for sql-based operations this should be quite straightforward to implement
  • in geopandas based operations the issue is that geopandas at the moment doesn't support reading the index. pyogrio 0.4 introduces support for it, but isn't released yet. So it might be best to wait till pyogrio 0.4 is released, add support for reading via pyogrio, then implement this for these operations.
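A sketch of how the columns-based proposal from the conclusion above could look from the user side (the "fid" handling shown here is the proposal, not current behaviour):

    import geofileops as gfo

    # Proposed: request the original fid next to the regular attribute columns.
    gfo.intersection(
        input1_path="parcels.gpkg",
        input2_path="zones.gpkg",
        output_path="parcels_x_zones.gpkg",
        input1_columns=["fid", "parcel_id"],
        input2_columns=["fid", "zone_name"],
    )
    # The output could then contain e.g. l1_fid and l2_fid columns to join back on.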

Improve geofileops benchmarks

  • simplify code
  • add chart reports to make interpretation of results easier
  • add benchmarks for simplify, convexhull
  • make code more flexible to be able to also benchmark against other libraries

Have option to create output path automatically + return it

When writing a GIS analysis with geofileops, an uninteresting part of the code is preparing output file names, while most of the time they are really "standard".

Having an automatic way of creating output file names + returning them would be more practical.

I like readable filenames, so ideally there is a logic in how the filenames are created; the only/main disadvantage is that the filenames can become (too!) long...
However, the user still needs to be able to specify a filename, so the name can always be overruled when it becomes too long.
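A sketch of the kind of naming logic meant here; the helper and the naming scheme are only illustrative:

    from pathlib import Path

    def default_output_path(input_path, operation, **params):
        """Derive a readable output file name from the input name and operation."""
        input_path = Path(input_path)
        suffix = "_".join(f"{key}{value}" for key, value in params.items())
        name = f"{input_path.stem}_{operation}" + (f"_{suffix}" if suffix else "")
        return input_path.with_name(name + input_path.suffix)

    # default_output_path("parcels.gpkg", "buffer", dist=10) -> parcels_buffer_dist10.gpkg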

Return a result value from gfo.execute_sql

Now nothing is returned, but it would be useful to do so for cases where the data doesn't need to be returned as structured data (= DataFrame), e.g. (a usage sketch follows after this list):

  • return value of update statements
  • queries on metadata tables
  • query to check if spatial index exists
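A usage sketch of what this would enable (the return behaviour shown is the proposal, not current behaviour):

    import geofileops as gfo

    # Proposed: get the plain query result back, e.g. to check whether a
    # spatial index exists, without needing a DataFrame.
    result = gfo.execute_sql(
        path="parcels.gpkg",
        sql_stmt="SELECT count(*) FROM gpkg_extensions "
                 "WHERE extension_name = 'gpkg_rtree_index'",
    )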

Improve performance of buffer, simplify,... operations by avoiding temp files

During buffer, simplify,... operations the results are first written to a temp file per processing batch, and then they are all merged again into a single file.

Since fiona version ? appending to geopackage/shapefile is possible, so the temp files could be avoided by writing directly to a common output file (using locking, because concurrent writing is not supported).
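A sketch of the idea, assuming the batches are GeoDataFrames written with geopandas and a simple lock guards the shared output file:

    import multiprocessing
    import geopandas as gpd

    write_lock = multiprocessing.Lock()

    def write_batch(batch_gdf: gpd.GeoDataFrame, output_path: str, layer: str):
        """Append a processed batch directly to the common output file."""
        with write_lock:  # concurrent writes to one geopackage are not supported
            batch_gdf.to_file(output_path, layer=layer, driver="GPKG", mode="a")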

New operation: join_nearest

Make a new operation to join the x nearest features in a layer to another layer.

The VirtualKNN indexing of spatialite seems like a good basis for this feature.
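A sketch of the spatialite query this could be built on; the KNN virtual table and its column names are written from memory and may differ per spatialite version:

    # Run against a spatialite-enabled connection that contains both layers;
    # joins each feature of layer1 to its 3 nearest features in layer2.
    knn_sql = """
        SELECT a.rowid AS l1_fid, k.fid AS l2_fid, k.distance
          FROM layer1 a
          JOIN knn k
            ON k.f_table_name = 'layer2'
           AND k.ref_geometry = a.geom
           AND k.max_items = 3
    """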

Some other projects for fast / parallel operations

Hi Pieter,

Sorry for opening an issue a bit "out of nowhere" here, but I noticed your repo (through the geopandas issue you commented on), and thought to share a few things.
I don't want to seem like a jerk "knowing better", I just genuinely thought you might be interested in those links.

First, really cool that you are using GeoPandas to build upon! ;) (or at least for parts of the repo)

Since you are focusing on "fast" operations and doing things in parallel, those projects and developments might be of interest to you:

  • PyGEOS: this is a new wrapper of GEOS, and is going to become "Shapely 2.0" (long story can be read here: https://github.com/shapely/shapely-rfc/pull/1/files). This blogpost gives a bit of background: https://caspervdw.github.io/Introducing-Pygeos/, but basically it provides all functionality of Shapely, but through faster, vectorized functions.
    And in the upcoming GeoPandas 0.8 release (for now you need to use master), this can already be used under the hood, if you have pygeos installed (see https://geopandas.readthedocs.io/en/latest/install.html#using-the-optional-pygeos-dependency), and should automatically give a speed-up in spatial operations.

  • For file reading, there is also https://github.com/brendan-ward/pyogrio, which is experimenting with a faster replacement of fiona (but this might be more experimental).

  • For running things in parallel, there is the experimental https://github.com/jsignell/dask-geopandas package to connect GeoPandas and Dask (a general parallel computing / task scheduling package in Python, specifically targeting data science use cases). The idea is that it divides the GeoDataFrame into chunks (partitions) and then operations are run in parallel on those partitions. But this is mostly done under the hood by dask, so for the user it gives a very similar interface to GeoPandas. For example, for a parallel buffer operation, the code could look like:

    import geopandas
    import dask_geopandas

    # Read with geopandas, then split the GeoDataFrame into 4 partitions
    # that dask can process in parallel.
    df = geopandas.read_file('...')
    ddf = dask_geopandas.from_geopandas(df, npartitions=4)
    ddf.geometry = ddf.buffer(...)

    and the buffer operation would be run in parallel (using multithreading by default, but you could also choose to use multiprocessing).
    I saw that you were parallelizing some operations like buffer, and for those the dask-geopandas project might be interesting (it won't be able to help with making ogr interactions parallel, though). It's a very young project, but contributions are always welcome ;)

Update all tests to use @pytest.mark.parametrize if applicable

In many tests the following construct is still used:

    for suffix in test_helper.get_test_suffix_list():
        for crs_epsg in test_helper.get_test_crs_epsg_list():
            # ... test body repeated for every suffix/crs combination ...

Using @pytest.mark.parametrize is quite a bit cleaner, and makes it possible to specify which combinations to test to make tests faster...
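A sketch of the same construct rewritten with parametrize (the suffix/epsg values are only examples):

    import pytest

    @pytest.mark.parametrize("suffix", [".gpkg", ".shp"])
    @pytest.mark.parametrize("crs_epsg", [31370, 4326])
    def test_geo_operation(suffix, crs_epsg):
        # Each suffix/crs combination becomes a separate, individually
        # selectable test instead of one big loop inside the test body.
        ...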

Add support to specify (any) gdal options in relevant fileops

Now only a very limited, specific list of options can be specified, e.g. to create a spatial index.

There should be a generic way to specify any option.

The specific options available now should probably be removed, so there is only one way to specify options and it is impossible to have contradictions.
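A sketch of one possible shape of such a generic parameter; the options parameter and the option key shown are hypothetical, and the function call only illustrates where the parameter would live:

    import geofileops as gfo

    # Hypothetical: a single dict in which any GDAL (layer) creation option
    # can be passed, instead of a separate keyword per option.
    gfo.convert(
        src="parcels.shp",
        dst="parcels.gpkg",
        options={"LAYER_CREATION.SPATIAL_INDEX": True},
    )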

Optimize performance of operations when nb_parallel = 1

If the output file is created using a single batch, the result can be written directly to the output file instead of being written to intermediate files that are then copied to the output file.

  • for single-file sql operations
  • for two-file sql operations
  • for geopandas operations?

Add missing "typical" spatial operation symmetric difference

Operations on an input layer involving a second layer. So basically the rows from the input layer are retained and filtered/cut/... by the second layer.

  • clip

"Pairwise" operations, where (conceptually) +- all rows in layer1 are cross joined with all rows in layer2, and then a spatial operation is executed so the result of one of the core spatial operations is retained. Hence, the columns of both layers are/can be retained.

These already exist:

  • erase = difference (only the columns of the input layer are retained, as no (pieces of) features that overlap with layer 2 are retained)
  • intersection (columns of both layers retained)
  • split = identity (columns of both layers retained)
  • union (columns of both layers retained)

Missing:

  • symmetric_difference (columns of both layers retained)
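A sketch of what the missing operation could look like, following the naming of the existing pairwise operations (proposed, not existing):

    import geofileops as gfo

    # Proposed: keep the (pieces of) features of both layers that do not overlap,
    # with the columns of both layers retained.
    gfo.symmetric_difference(
        input1_path="parcels.gpkg",
        input2_path="zones.gpkg",
        output_path="parcels_symdiff_zones.gpkg",
    )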

Dissolve issues

2 issues:

  • when agg_columns=json is used and the dissolve needs multiple passes, the json output is not correct.
  • when combining tiled output with explodecollections=False, the output is still ~exploded.

Improve performance of tests on Windows, now terribly slow

On Windows the tests are very slow (also the CI tests: 18 minutes)... it would be great if that could be improved.

Options:

  • For large files multiprocessing is faster than multithreading. Maybe this is different for small files/batches?
    • on Windows it seems faster when using threads, at least for files <= 100 rows (in 2 batches). It would be better to do extra tests to choose the threshold more correctly + check the impact of bigger and smaller batch sizes.
  • Make sure all tests use (min.) 2 batches so the multi-batch approach is really tested + run in 2 parallel threads/processes!
  • Check that tests don't run on useless combinations of eg. file type + epsg + other parameters
  • Try to use miniforge + mamba to speed up conda installation in CI tests

By default, don't list the attribute tables anymore in e.g. listlayers

If layer styles are added to a geopackage, the table created for that is treated as a real layer.

This isn't really convenient, as the file is no longer treated as single-layer, and so it becomes mandatory to always specify the layer name even though there is only one "spatial layer" in the file.

Solution:

  • by default, don't list the attribute tables anymore in e.g. listlayers, and don't count them in get_only_layer if there is more than one layer/table
  • add a parameter only_spatial_layers to control the behaviour, with default value True
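A usage sketch of the proposed behaviour (only_spatial_layers is the proposed parameter, not an existing one):

    import geofileops as gfo

    # Proposed default: attribute(-only) tables such as layer_styles are hidden.
    gfo.listlayers("parcels.gpkg")                             # -> ["parcels"]
    gfo.listlayers("parcels.gpkg", only_spatial_layers=False)  # -> ["parcels", "layer_styles"]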

Improve performance of geofileops_gpd geooperations with "sparse" files

Now the batch row limits are determined based on the number of rows in a file. If rows have been removed from the file though, this can result in significantly more rows needing to be processed in the last batch.

Determining the batch row limits using the min and max of the rowid of the input files, like already implemented in geofileops_ogr, gives a better result.
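A sketch of the rowid-based splitting; geofileops_ogr may implement it differently:

    import sqlite3

    def determine_rowid_batches(path, layer, nb_batches):
        """Split the rowid range in (roughly) equal parts instead of counting rows."""
        with sqlite3.connect(path) as conn:
            min_rowid, max_rowid = conn.execute(
                f'SELECT MIN(rowid), MAX(rowid) FROM "{layer}"'
            ).fetchone()
        step = (max_rowid - min_rowid + 1) / nb_batches
        return [
            (int(min_rowid + i * step), int(min_rowid + (i + 1) * step) - 1)
            for i in range(nb_batches)
        ]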

Flatten API

Using geofileops now needs two imports:

    from geofileops import geofileops
    from geofileops import geofile

Additionally, because of the geofileops module that needs to be imported, it is not possible to do a simple import geofileops, e.g. to get the version (geofileops.version).

Finally, some of the functions in geofile would actually be better placed in geofileops (e.g. execute_sql,...).

All in all, it would be a lot cleaner to just eliminate the hierarchy that mainly has historical reasons.
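A sketch of what the flattened API could look like (which functions end up at the top level is to be decided):

    # One import, everything reachable from the top level.
    import geofileops as gfo

    gfo.buffer(input_path="parcels.gpkg", output_path="buffered.gpkg", distance=10)
    gfo.execute_sql(path="parcels.gpkg", sql_stmt="VACUUM")
    print(gfo.__version__)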
