geofileops / geofileops

Python toolbox to process large vector files faster.

Home Page: https://geofileops.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Python 97.95% QML 1.89% Scheme 0.16%
geospatial-data geospatial-processing large-files geopackage geoprocessing python vector gis gdal geopandas

geofileops's Introduction

geofileops


Geofileops is a Python toolbox to process large vector files faster.

Most typical GIS operations are available: e.g. buffer, dissolve, erase/difference, intersection, union,...

The spatial operations are tested on geopackage and shapefile input files, but geopackage is recommended as it will give better performance. General layer and file operations can be used on the file formats supported by GDAL.

The full documentation is available on readthedocs.
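A minimal usage sketch (the file names are placeholders; see the documentation for the exact signatures and options):

    import geofileops as gfo

    # Buffer all geometries in a (large) geopackage; geofileops splits the work
    # into batches and processes them on all available CPUs.
    gfo.buffer(
        input_path="parcels.gpkg",
        output_path="parcels_buffered.gpkg",
        distance=10,
    )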

Different techniques are used under the hood to be able to process large files as fast as possible:

  • process data in batches
  • subdivide/merge complex geometries on the fly
  • process data in different passes
  • use all available CPUs

The following chart gives an impression of the speed improvement that can be expected when processing larger files. The benchmarks typically use input file(s) with 500K polygons, were run on a Windows PC with 12 cores, and include I/O.

Geo benchmark

geofileops's People

Contributors

dependabot[bot], joeribroeckx, kriway, richardscottoz, theroggy


geofileops's Issues

Add a method to detect the geometry types actually in a file with a "Geometry" geometry column

Some files have a column defined as "Geometry", but actually contain only one primitive type (e.g. multi + single polygons), so they could still get proper automatic treatment in geo operations.

For this type of file an automatic detection would be useful...

  • make a function to determine all geometry types in a layer (geofileops.get_layer_geometrytypes); a sketch follows after this list
  • use this function to automatically deal with this, if possible, in all geo operations
  • A special case of this is geofiles that don't contain any rows and have a geometry column of type "Geometry". For this situation the geometry type is not important at all for any spatial operations.
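A minimal sketch of such a detection, assuming fiona is available (geofileops.get_layer_geometrytypes is the proposed name, not an existing function):

    import fiona

    def get_layer_geometrytypes(path, layer=None):
        """Return the distinct geometry types actually present in a layer."""
        with fiona.open(path, layer=layer) as src:
            return sorted(
                {f["geometry"]["type"] for f in src if f["geometry"] is not None}
            )

    # A layer declared as "Geometry" might e.g. return ["MultiPolygon", "Polygon"].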

Use pyogrio for GeoDataFrame I/O

Gives a significant increase in performance for GeoDataFrame I/O, especially for writing: 5x faster.

Especially for larger files this can give significant gains: e.g. for a 400 MB gpkg the difference is more than a minute.
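A sketch of what the switch looks like on the pyogrio side (file and layer names are placeholders):

    import pyogrio

    # Read a layer into a GeoDataFrame and write it back, bypassing fiona.
    gdf = pyogrio.read_dataframe("input.gpkg", layer="parcels")
    pyogrio.write_dataframe(gdf, "output.gpkg", layer="parcels", driver="GPKG")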

Make all direct dependencies of geofileops explicit

Why?
Not all packages that are explicitly imported/used in geofileops are listed in the dependencies, because they are installed anyway as dependencies of other packages.
However, if one of those packages drops such a dependency, geofileops breaks... so it is better to add them explicitly as dependencies.

Where?
In setup.py, conda feedstock, CI envs...

ENH: reduce memory usage when applying geo operations on large input files

Avoid out-of-memory errors, regardless of the size of the input files, e.g. by:

  • limit batch size for geopandas based operations
  • give the user an option to specify batch size explicitly for cases where the heuristics don't work out
  • avoid using group_by clauses in queries, as this seems to use a lot of memory
    • erase/union/identity
    • export_by_location
    • clip
  • have a more intelligent determination of the size of batches in general (a rough sketch follows after this list)
  • check memory usage not only when starting an operation, but also e.g. before the start of processing each new batch
  • the sqlite cache size is now quite large; test if it can be reduced without impacting throughput (too much)
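A rough sketch of the kind of memory-aware batch sizing meant above; the helper name and the bytes-per-row estimate are illustrative only:

    import psutil

    def determine_batch_size(nb_rows, nb_parallel, max_fraction=0.25):
        """Pick a batch size so all parallel batches fit in a fraction of free RAM."""
        available = psutil.virtual_memory().available
        bytes_per_row = 20_000  # rough average footprint per row; tune per operation
        max_rows = int(available * max_fraction / (bytes_per_row * nb_parallel))
        return max(1, min(nb_rows, max_rows))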

Improve speed of creating index on geopackage

The time taken to create an index on a geopackage can be halved by letting sqlite use more memory.

So:

  1. Increase the cache size: the default cache size for sqlite is 1000 pages, or 4 MB, which is very conservative, so this default can be increased
  2. Add a parameter to gfo.create_spatial_index to specify the cache size (a sketch of the sqlite side follows below)
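For reference, a sketch of how the cache size can be raised before creating the index; the 128 MB value is only an example:

    import sqlite3

    conn = sqlite3.connect("data.gpkg")
    # A negative cache_size is interpreted by sqlite as KiB instead of pages,
    # so this asks for ~128 MB instead of the conservative default.
    conn.execute("PRAGMA cache_size=-131072")
    # ... create the spatial index on this connection here ...
    conn.close()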

Add operation apply

Add the generic function apply, to be able to run custom Python code on (typically) geometries...
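A possible usage sketch of the proposed function (the signature shown is illustrative, not an existing API):

    import geofileops as gfo

    # Proposed: run custom Python code on every geometry.
    gfo.apply(
        input_path="parcels.gpkg",
        output_path="parcels_simplified.gpkg",
        func=lambda geom: geom.simplify(1, preserve_topology=True),
    )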

Option to backup original primary key column (fid) in output when applying operations

Now the PK "fid" column that is present in eg. shapefile and geopackage files is ignored and thus lost in geofileops operations. However, in some input files there is no other unique ID field available, so for these cases it can be practical (necessary?) that the original fid values can be kept/added (as columns). This way it is possible to eg. join with the original files later on when doing GIS analysis.

There are a few options:

  • retain the original fid as fid in the output file: this could be an option for one-layer operations, or operations that output only columns of the input file. But nonetheless this can give issues:
    • if explodecollections=True this can/will lead to non-unique fids
    • shapefiles don't support fids that aren't sequential
  • add column(s) with the original fid(s) of the input file(s)

Conclusion: it seems that (having the option to) add a column with the original fid is the solution that works in all cases without significant disadvantages, so probably the best one.

The name of the fid column(s) to be added needs some extra thought though:

  • for most operations with multiple input files the column names can follow the existing prefix system (eg. l1_fid, l2_fid).
  • for single-layer and some 2 layer operations there is no prefix, so another solution must be used

Another question: add the fid column(s) by default or not?

  • if automatically added, the change is not backwards compatible; however, adding a column will typically not be a problem in real-life use cases, so probably not too problematic.
  • typically though, you don't need the original fid, because most of the time there will be a logical unique ID to use when eg. joining with the original file (if needed).

Conclusion: don't add them by default, but make it possible for all spatial operations. If it becomes possible to specify "fid" via the existing columns parameter, the feature can be added without creating additional parameters, which sounds nice.

Technical note: a complication is that the implementation is different for sql based operations vs. geopandas based ones.

  • for sql-based operations this should be quite straightforward to implement
  • in geopandas based operations the issue is that geopandas at the moment doesn't support reading the index. pyogrio 0.4 introduces support for it, but isn't released yet. So it might be best to wait till pyogrio 0.4 is released, add support for reading via pyogrio, then implement this for these operations.
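A sketch of how the columns-based proposal from the conclusion above could look from the user side (the "fid" handling shown here is the proposal, not current behaviour):

    import geofileops as gfo

    # Proposed: request the original fid next to the regular attribute columns.
    gfo.intersection(
        input1_path="parcels.gpkg",
        input2_path="zones.gpkg",
        output_path="parcels_x_zones.gpkg",
        input1_columns=["fid", "parcel_id"],
        input2_columns=["fid", "zone_name"],
    )
    # The output could then contain e.g. l1_fid and l2_fid columns to join back on.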

Improve geofileops benchmarks

  • simplify code
  • add chart reports to make interpretation of results easier
  • add benchmarks for simplify, convexhull
  • make code more flexible to be able to also benchmark against other libraries

Have option to create output path automatically + return it

When writing a GIS analysis with geofileops, an uninteresting part of the code is preparing output file names, while most of the time they are really "standard".

Having an automatic way of creating output file names + returning them would be more practical.

I like readable filenames, so ideally there is a logic in how the filenames are created; the only/main disadvantage is that the filenames can become (too!) long...
However, the user still needs to be able to specify a filename, so the name can always be overruled when it becomes too long.
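A sketch of the kind of naming logic meant here; the helper and the naming scheme are only illustrative:

    from pathlib import Path

    def default_output_path(input_path, operation, **params):
        """Derive a readable output file name from the input name and operation."""
        input_path = Path(input_path)
        suffix = "_".join(f"{key}{value}" for key, value in params.items())
        name = f"{input_path.stem}_{operation}" + (f"_{suffix}" if suffix else "")
        return input_path.with_name(name + input_path.suffix)

    # default_output_path("parcels.gpkg", "buffer", dist=10) -> parcels_buffer_dist10.gpkg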

Return a result value from gfo.execute_sql

Now nothing is returned, but it would be useful to do so for cases where the data doesn't need to be returned as structured data (= DataFrame), e.g. (a usage sketch follows after this list):

  • return value of update statements
  • queries on metadata tables
  • query to check if spatial index exists
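A usage sketch of what this would enable (the return behaviour shown is the proposal, not current behaviour):

    import geofileops as gfo

    # Proposed: get the plain query result back, e.g. to check whether a
    # spatial index exists, without needing a DataFrame.
    result = gfo.execute_sql(
        path="parcels.gpkg",
        sql_stmt="SELECT count(*) FROM gpkg_extensions "
                 "WHERE extension_name = 'gpkg_rtree_index'",
    )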

Improve performance of buffer, simplify,... operations by avoiding temp files

During buffer, simplify,... operations the results are first written to a temp file per processing batch, and then they are all merged again into a single file.

Since fiona version ? appending to geopackage/shapefile is possible, so the temp files could be avoided by writing directly to a common output file (using locking, because concurrent writing is not supported).
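A sketch of the idea, assuming the batches are GeoDataFrames written with geopandas and a simple lock guards the shared output file:

    import multiprocessing
    import geopandas as gpd

    write_lock = multiprocessing.Lock()

    def write_batch(batch_gdf: gpd.GeoDataFrame, output_path: str, layer: str):
        """Append a processed batch directly to the common output file."""
        with write_lock:  # concurrent writes to one geopackage are not supported
            batch_gdf.to_file(output_path, layer=layer, driver="GPKG", mode="a")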

New operation: join_nearest

Make a new operation to join the x nearest features in a layer to another layer.

The VirtualKNN indexing of spatialite seems like a good basis for this feature.
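A sketch of the spatialite query this could be built on; the KNN virtual table and its column names are written from memory and may differ per spatialite version:

    # Run against a spatialite-enabled connection that contains both layers;
    # joins each feature of layer1 to its 3 nearest features in layer2.
    knn_sql = """
        SELECT a.rowid AS l1_fid, k.fid AS l2_fid, k.distance
          FROM layer1 a
          JOIN knn k
            ON k.f_table_name = 'layer2'
           AND k.ref_geometry = a.geom
           AND k.max_items = 3
    """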

Some other projects for fast / parallel operations

Hi Pieter,

Sorry for opening an issue a bit "out of nowhere" here, but I noticed your repo (through the geopandas issue you commented on), and thought to share a few things.
I don't want to seem like a jerk "knowing better", I just genuinely thought you might be interested in those links.

First, really cool that you are using GeoPandas to build upon! ;) (or at least for parts of the repo)

Since you are focusing on "fast" operations and doing things in parallel, those projects and developments might be of interest to you:

  • PyGEOS: this is a new wrapper of GEOS, and is going to become "Shapely 2.0" (long story can be read here: https://github.com/shapely/shapely-rfc/pull/1/files). This blogpost gives a bit of background: https://caspervdw.github.io/Introducing-Pygeos/, but basically it provides all functionality of Shapely, but through faster, vectorized functions.
    And in the upcoming GeoPandas 0.8 release (for now you need to use master), this can already be used under the hood, if you have pygeos installed (see https://geopandas.readthedocs.io/en/latest/install.html#using-the-optional-pygeos-dependency), and should automatically give a speed-up in spatial operations.

  • For file reading, there is also https://github.com/brendan-ward/pyogrio, which is experimenting with a faster replacement of fiona (but this might be more experimental).

  • For running things in parallel, there is the experimental https://github.com/jsignell/dask-geopandas package to connect GeoPandas and Dask (a general parallel computing / task scheduling package in Python, specifically targeting data science use cases). The idea is that it divides the GeoDataFrame into chunks (partitions) and then operations are run in parallel on those partitions. But this is mostly done under the hood by dask, so for the user it gives a very similar interface to GeoPandas. For example, for a parallel buffer operation, the code could look like:

    import geopandas
    import dask_geopandas

    # Read with geopandas, then split the GeoDataFrame into 4 partitions
    # that dask can process in parallel.
    df = geopandas.read_file('...')
    ddf = dask_geopandas.from_geopandas(df, npartitions=4)
    ddf.geometry = ddf.buffer(...)

    and the buffer operation would be run in parallel (using multithreading by default, but you could also choose to use multiprocessing).
    I saw that you were parallelizing some operations like buffer, and for those the dask-geopandas project might be interesting (it won't be able to help with making ogr interactions parallel, though). It's a very young project, but contributions are always welcome ;)

Update all tests to use @pytest.mark.parametrize if applicable

In many tests the following construct is still used:

    for suffix in test_helper.get_test_suffix_list():
        for crs_epsg in test_helper.get_test_crs_epsg_list():
            # ... test body repeated for every suffix/crs combination ...

Using @pytest.mark.parametrize is quite a bit cleaner, and makes it possible to specify which combinations to test to make tests faster...
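A sketch of the same construct rewritten with parametrize (the suffix/epsg values are only examples):

    import pytest

    @pytest.mark.parametrize("suffix", [".gpkg", ".shp"])
    @pytest.mark.parametrize("crs_epsg", [31370, 4326])
    def test_geo_operation(suffix, crs_epsg):
        # Each suffix/crs combination becomes a separate, individually
        # selectable test instead of one big loop inside the test body.
        ...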

Add support to specify (any) gdal options in relevant fileops

Now only a very limited, specific list of options can be specified, e.g. to create a spatial index.

There should be a generic way to specify any option.

The specific options available now should probably be removed, so there is only one way to specify options and it is impossible to have contradictions.
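A sketch of one possible shape of such a generic parameter; the options parameter and the option key shown are hypothetical, and the function call only illustrates where the parameter would live:

    import geofileops as gfo

    # Hypothetical: a single dict in which any GDAL (layer) creation option
    # can be passed, instead of a separate keyword per option.
    gfo.convert(
        src="parcels.shp",
        dst="parcels.gpkg",
        options={"LAYER_CREATION.SPATIAL_INDEX": True},
    )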

Optimize performance of operations when nb_parallel = 1

If the output file is created using a single batch, the result can be written directly to the output file instead of being written to intermediate files that are then copied to the output file.

  • for single-file sql operations
  • for two-file sql operations
  • for geopandas operations?

Add missing "typical" spatial operation symmetric difference

Operations on an input layer involving a second layer. So basically the rows from the input layer are retained and filtered/cut/... by the second layer.

  • clip

"Pairwise" operations, where (conceptually) +- all rows in layer1 are cross joined with all rows in layer2, and then a spatial operation is executed so the result of one of the core spatial operations is retained. Hence, the columns of both layers are/can be retained.

These already exist:

  • erase = difference (only the columns of the input layer are retained, as no (pieces of) features that overlap with layer 2 are retained)
  • intersection (columns of both layers retained)
  • split = identity (columns of both layers retained)
  • union (columns of both layers retained)

Missing:

  • symmetric_difference (columns of both layers retained)
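A sketch of what the missing operation could look like, following the naming of the existing pairwise operations (proposed, not existing):

    import geofileops as gfo

    # Proposed: keep the (pieces of) features of both layers that do not overlap,
    # with the columns of both layers retained.
    gfo.symmetric_difference(
        input1_path="parcels.gpkg",
        input2_path="zones.gpkg",
        output_path="parcels_symdiff_zones.gpkg",
    )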

Dissolve issues

2 issues:

  • when agg_columns=json is used and the dissolve needs multiple passes, the json output is not correct.
  • when combining tiled output with explodecollections=False, the output is still ~exploded.

Improve performance of tests on Windows, now terribly slow

On Windows the tests are very slow (also the CI tests: 18 minutes)... it would be great if that could be improved.

Options:

  • For large files multiprocessing is faster than multithreading. Maybe this is different for small files/batches?
    • on Windows it seems faster when using threads, at least for files <= 100 rows (in 2 batches). It would be better to do extra tests to choose the threshold more correctly + check the impact of bigger and smaller batch sizes.
  • Make sure all tests use (min.) 2 batches so the multi-batch approach is really tested + run in 2 parallel threads/processes!
  • Check that tests don't run on useless combinations of eg. file type + epsg + other parameters
  • Try to use miniforge + mamba to speed up conda installation in CI tests

By default, don't list the attribute tables anymore in e.g. listlayers

If layer styles are added to a geopackage, the table created for that is treated as a real layer.

This isn't really convenient, as the file is no longer treated as single-layer, and so it becomes mandatory to always specify the layer name even though there is only one "spatial layer" in the file.

Solution:

  • by default, don't list the attribute tables anymore in e.g. listlayers, and don't count them in get_only_layer if there is more than one layer/table
  • add a parameter only_spatial_layers to control the behaviour, with default value True
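A usage sketch of the proposed behaviour (only_spatial_layers is the proposed parameter, not an existing one):

    import geofileops as gfo

    # Proposed default: attribute(-only) tables such as layer_styles are hidden.
    gfo.listlayers("parcels.gpkg")                             # -> ["parcels"]
    gfo.listlayers("parcels.gpkg", only_spatial_layers=False)  # -> ["parcels", "layer_styles"]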

Improve performance of geofileops_gpd geooperations with "sparse" files

Now the batch row limits are determined based on the number of rows in a file. If rows have been removed from the file though, this can result in significantly more rows needing to be processed in the last batch.

Determining the batch row limits using the min and max of the rowid of the input files, like already implemented in geofileops_ogr, gives a better result.
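A sketch of the rowid-based splitting; geofileops_ogr may implement it differently:

    import sqlite3

    def determine_rowid_batches(path, layer, nb_batches):
        """Split the rowid range in (roughly) equal parts instead of counting rows."""
        with sqlite3.connect(path) as conn:
            min_rowid, max_rowid = conn.execute(
                f'SELECT MIN(rowid), MAX(rowid) FROM "{layer}"'
            ).fetchone()
        step = (max_rowid - min_rowid + 1) / nb_batches
        return [
            (int(min_rowid + i * step), int(min_rowid + (i + 1) * step) - 1)
            for i in range(nb_batches)
        ]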

Flatten API

Using geofileops now needs two imports:

    from geofileops import geofileops
    from geofileops import geofile

Additionally, because of the geofileops module that needs to be imported, it is not possible to do a simple import geofileops, e.g. to get the version (geofileops.version).

Finally, some of the functions in geofile would actually be better placed in geofileops (e.g. execute_sql,...).

All in all, it would be a lot cleaner to just eliminate the hierarchy that mainly has historical reasons.
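A sketch of what the flattened API could look like (which functions end up at the top level is to be decided):

    # One import, everything reachable from the top level.
    import geofileops as gfo

    gfo.buffer(input_path="parcels.gpkg", output_path="buffered.gpkg", distance=10)
    gfo.execute_sql(path="parcels.gpkg", sql_stmt="VACUUM")
    print(gfo.__version__)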
