opengeos / open-buildings

Tools for working with open building datasets

Home Page: https://opengeos.github.io/open-buildings

License: Other

Languages: Python 98.70%, Shell 1.30%
Topics: buildings, geopython, geospatial, open-data, open-buildings, geoparquet

open-buildings's Introduction

open-buildings


Tools for working with open building datasets

Introduction

This repo is intended to be a set of useful scripts for getting and converting Open Building Datasets using Cloud Native Geospatial formats. Initially the focus is on Google's Open Buildings dataset and Overture's building dataset.

The main tool most people will be interested in is the get_buildings command, which lets you supply a GeoJSON file to a command-line interface; it downloads all buildings in the supplied area and outputs them in common GIS formats (GeoPackage, FlatGeobuf, Shapefile, GeoJSON and GeoParquet).

The tool works by leveraging partitioned GeoParquet files, using DuckDB to query exactly what is needed. This is done without any server - DuckDB on your computer queries, filters and downloads just the rows that you want. Right now you can query two datasets that live on Source Cooperative, see here for Google and here for Overture. The rest of the CLIs and scripts were used to create those datasets, with some additions for benchmarking performance.
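The core query pattern can be sketched as follows. This is a hedged illustration, not the package's actual code: the dataset URL, the partition column name, and the schema are assumptions, but it shows how a spatial filter plus a hive-partition filter lets DuckDB read only the rows that are needed.

```python
# Illustrative sketch: build a DuckDB SQL query that pushes both a spatial
# filter and a hive-partition filter down to remote GeoParquet files.
# The URL pattern and 'country_iso' partition column are assumptions about
# the dataset layout, not the exact Source Cooperative structure.

def build_query(dataset_url, wkt_aoi, country_iso=None):
    """Build a DuckDB SQL query over partitioned GeoParquet."""
    where = [f"ST_Intersects(geometry, ST_GeomFromText('{wkt_aoi}'))"]
    if country_iso:
        # Filtering on the hive-partition column lets DuckDB skip whole
        # files instead of scanning every partition - this is why the
        # country_iso hint speeds queries up so much.
        where.append(f"country_iso = '{country_iso}'")
    return (
        f"SELECT * FROM read_parquet('{dataset_url}/*/*.parquet', hive_partitioning=1) "
        f"WHERE {' AND '.join(where)}"
    )

sql = build_query(
    "s3://example-bucket/buildings",
    "POLYGON((30 -2, 30.1 -2, 30.1 -1.9, 30 -1.9, 30 -2))",
    country_iso="RW",
)
```

With the hint present, only the `country_iso=RW` partition is scanned; without it, the spatial predicate alone has to be evaluated against every partition.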

This is basically my first Python project, and certainly my first open source one. It is only possible due to ChatGPT, as I'm not a Python programmer, and not a great programmer in general (coded professionally for about 2 years, then shifted to doing lots of other stuff). So it's likely not great code, but it's been fun to iterate on, and it seems like it might be useful to others. And contributions are welcome! I'm working on making the issue tracker accessible, so anyone who wants to try out some open source coding can jump in.

Installation

Install with pip:

pip install open-buildings

This should add a CLI that you can then use. If it's working then:

ob

Will print out a help message. You can then run the CLI (after downloading 1.json):

ob tools get_buildings 1.json --dst my-buildings.geojson --country_iso RW

You can also stream the GeoJSON in directly, in one line:

curl https://data.source.coop/cholmes/aois/1.json | ob get_buildings - --dst my-buildings.geojson --country_iso RW

Functionality

get_buildings

The main tool for most people is get_buildings. It queries complete global building datasets for the GeoJSON provided, outputting results in common geospatial formats. The full options and explanation can be found with the --help option:

% ob get_buildings --help
Usage: ob get_buildings [OPTIONS] [GEOJSON_INPUT] [DST]

  Tool to extract buildings in common geospatial formats from large archives
  of GeoParquet data online. GeoJSON input can be provided as a file or piped
  in from stdin. If no GeoJSON input is provided, the tool will read from
  stdin.

  Right now the tool supports two sources of data: Google and Overture. The
  data comes from Cloud-Native Geospatial distributions on
  https://source.coop, that are partitioned by admin boundaries and use a
  quadkey for the spatial index. In time this tool will generalize to support
  any admin boundary partitioned GeoParquet data, but for now it is limited to
  the Google and Overture datasets.

  The default output is GeoJSON, in a file called buildings.json. Changing the
  suffix will change the output format - .shp for shapefile .gpkg for
  GeoPackage, .fgb for FlatGeobuf and .parquet for GeoParquet, and .json or
  .geojson for GeoJSON. If your query is all within one country it is strongly
  recommended to use country_iso to hint to the query engine which country to
  query, as this will speed up the query significantly (5-10x). Expect query
  times of 5-10 seconds for queries with country_iso and 30-60 seconds
  without country_iso.

  You can look up the country_iso for a country here:
  https://github.com/lukes/ISO-3166-Countries-with-Regional-
  Codes/blob/master/all/all.csv If you get the country wrong you will get zero
  results. Currently you can only query one country, so if your query crosses
  country boundaries you should not use country_iso. In future versions of
  this tool we hope to eliminate the need to hint with the country_iso.

Options:
  --dst TEXT                  The path to write the output to. Can be a
                              directory or file.
  --location TEXT             Use city or region name instead of providing an
                              AOI as file.
  --source [google|overture]  Dataset to query, defaults to Overture
  --country_iso TEXT          A 2 character country ISO code to filter the
                              data by.
  -s, --silent                Suppress all print outputs.
  --overwrite                 Overwrite the destination file if it already
                              exists.
  -v, --verbose               Print detailed logs with timestamps.
  --help                      Show this message and exit.

Note that the get_buildings operation is not very robust; there are likely a number of ways to break it. #13 tracks this, but if you have any problems please report them in the issue tracker to help guide how we improve it.

We do hope to eliminate the need to supply a country_iso for fast querying, see #29 for that tracking issue. We also hope to add more building datasets, starting with the Google-Microsoft Open Buildings by VIDA, see #26 for more info.

Google Buildings processing

In the google portion of the CLI there are two functions:

  • convert takes as input either a single CSV file or a directory of CSV files, downloaded locally from the Google Buildings dataset. It can write out as GeoParquet, FlatGeobuf, GeoPackage and Shapefile, and can process the data using DuckDB, GeoPandas or OGR.
  • benchmark runs the convert command against one or more different formats, and one or more different processes, and reports out how long each took.
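Conceptually, benchmark is just a timed loop over every (process, format) pair. A minimal sketch of that loop, where convert is a stand-in callable rather than the package's real function signature:

```python
import time
from itertools import product

# Minimal sketch of what benchmark does conceptually: time a conversion
# callable for each (process, format) pair and collect the results.
# The convert argument is a stand-in, not the package's actual API.

def benchmark(convert, processes, formats):
    results = {}
    for process, fmt in product(processes, formats):
        start = time.perf_counter()
        convert(process=process, fmt=fmt)  # run one conversion
        results[(process, fmt)] = time.perf_counter() - start
    return results

# Dummy convert so the sketch is runnable standalone.
timings = benchmark(lambda process, fmt: None,
                    ["duckdb", "ogr"], ["fgb", "parquet"])
```

The real command then renders the results dict as the table shown below (or as csv, json, or a chart).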

A sample output for benchmark, run on 219_buildings.csv, a 101 MB CSV file, is:

Table for file: 219_buildings.csv
╒═══════════╤═══════════╤═══════════╤═══════════╤═══════════╕
│ process   │ fgb       │ gpkg      │ parquet   │ shp       │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ duckdb    │ 00:02.330 │ 00:00.000 │ 00:01.866 │ 00:03.119 │
├───────────┼───────────┼───────────┼───────────┼───────────┤
│ ogr       │ 00:02.034 │ 00:07.456 │ 00:01.423 │ 00:02.491 │
├───────────┼───────────┼───────────┼───────────┼───────────┤
│ pandas    │ 00:18.184 │ 00:24.096 │ 00:02.710 │ 00:20.032 │
╘═══════════╧═══════════╧═══════════╧═══════════╧═══════════╛

The full options can be found with --help after each command, and I'll put them here for reference:

Usage: open_buildings convert [OPTIONS] INPUT_PATH OUTPUT_DIRECTORY

  Converts a CSV or a directory of CSV's to an alternate format. Input CSV's
  are assumed to be from Google's Open Buildings

Options:
  --format [fgb|parquet|gpkg|shp]
                                  The output format. The default is FlatGeobuf (fgb)
  --overwrite                     Whether to overwrite any existing output files.
  --process [duckdb|pandas|ogr]   The processing method to use. The default is 
                                  pandas.
  --skip-split-multis             Whether to keep multipolygons as they are
                                  without splitting into their component polygons.
  --verbose                       Whether to print detailed processing
                                  information.
  --help                          Show this message and exit.
Usage: open_buildings benchmark [OPTIONS] INPUT_PATH OUTPUT_DIRECTORY

  Runs the convert function on each of the supplied processes and formats,
  printing the timing of each as a table

Options:
  --processes TEXT      The processing methods to use. One or more of duckdb,
                        pandas or ogr, in a comma-separated list. Default is
                        duckdb,pandas,ogr.
  --formats TEXT        The output formats to benchmark. One or more of fgb,
                        parquet, shp or gpkg, in a comma-separated list.
                        Default is fgb,parquet,shp,gpkg.
  --skip-split-multis   Whether to keep multipolygons as they are without
                        splitting into their component polygons.
  --no-gpq              Disable GPQ conversion. Timing will be faster, but not
                        valid GeoParquet (until DuckDB adds support)
  --verbose             Whether to print detailed processing information.
  --output-format TEXT  The format of the output. Options: ascii, csv, json,
                        chart.
  --help                Show this message and exit.

Warning - note that --no-gpq doesn't actually work right now, see #4 to track. It is always set to true, so DuckDB times with Parquet will be inflated (you can change it via a global variable in the Python code). Note also that the ogr process does not work with --skip-split-multis, but will just report very minimal times since it skips doing anything; see #5 to track.

Format Notes

I'm mostly focused on GeoParquet and FlatGeobuf, as good cloud-native geo formats. I included GeoPackage and Shapefile mostly for benchmarking purposes. GeoPackage I think is a good option for Esri and other more legacy software that is slow to adopt new formats. Shapefile is total crap for this use case - it fails on files bigger than 4 gigabytes, and lots of the source S2 Google Buildings CSVs are bigger, so it's not useful for translating. The truncation of field names is also annoying, since the CSV file didn't try to make short names (nor should it, the limit is silly).

GeoPackage is particularly slow with DuckDB, it's likely got a bit of a bug in it. But it works well with Pandas and OGR.

Process Notes

When I was processing V2 of the Google Buildings dataset I did most of the initial work with GeoPandas, which was awesome, and has the best GeoParquet implementation. But the size of the data made its all-in-memory processing untenable. I ended up using PostGIS a decent bit, but near the end of that process I discovered DuckDB, and was blown away by its speed and ability to manage memory well. So for this tool I was mostly focused on those two.

Note also that currently DuckDB fgb, gpkg and shp output don't include projection information, so if you want to use the output then you'd need to run ogr2ogr on the output. It sounds like that may get fixed pretty soon, so I'm not going to add a step that includes the ogr conversion.

OGR was added later, and it does not yet do the key step of splitting multi-polygons, since it's just using ogr2ogr as a sub-process and I've yet to find a way to do that from the CLI (though knowing GDAL/OGR there probably is one - please let me know). To run the benchmark with it you need to pass --skip-split-multis or else the times on it will be 0 (except for Shapefile, since it doesn't differentiate between multipolygons and regular polygons). I hope to add that functionality and get it on par, which may mean using Fiona. But it seems like that may affect performance, since Fiona doesn't use the GDAL/OGR column-oriented API.
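For illustration, the splitting step amounts to exploding each MultiPolygon feature into one feature per component polygon. A dependency-free sketch over GeoJSON-like dicts (not the package's implementation):

```python
# Sketch of the "split multis" step: each MultiPolygon feature becomes
# several Polygon features, each keeping a copy of the properties.
# Works on plain GeoJSON-style dicts; no GDAL/shapely required.

def split_multipolygons(features):
    out = []
    for feat in features:
        geom = feat["geometry"]
        if geom["type"] == "MultiPolygon":
            # MultiPolygon coordinates are a list of polygon ring-sets.
            for rings in geom["coordinates"]:
                out.append({
                    "type": "Feature",
                    "properties": dict(feat.get("properties") or {}),
                    "geometry": {"type": "Polygon", "coordinates": rings},
                })
        else:
            out.append(feat)
    return out

mp = {"type": "Feature", "properties": {"id": 1}, "geometry": {
    "type": "MultiPolygon",
    "coordinates": [[[[0, 0], [1, 0], [1, 1], [0, 0]]],
                    [[[2, 2], [3, 2], [3, 3], [2, 2]]]]}}
parts = split_multipolygons([mp])  # two Polygon features
```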

Code customizations

There are 3 options that you can set as global variables in the Python code, but are not yet CLI options. These are:

  • RUN_GPQ_CONVERSION - whether the DuckDB GeoParquet path runs gpq on the DuckDB Parquet output, which adds a good chunk of processing time. This makes the DuckDB processing slower than it would be if DuckDB natively wrote GeoParquet metadata, which I believe is on their roadmap; once that lands, DuckDB will likely emerge as the fastest benchmark time. You can set RUN_GPQ_CONVERSION to false in the Python code to get a sense of it. In the above benchmark, running Parquet with DuckDB without the GPQ conversion at the end resulted in a time of 0.76 seconds.
  • PARQUET_COMPRESSION - which compression to use for Parquet encoding. Note that not all processes support all compression options, and also the OGR converter currently ignores this option.
  • SKIP_DUCK_GPKG - whether to skip the GeoPackage conversion option on DuckDB, since it takes a long time to run.
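As a sketch, these live as module-level globals. The names come from the list above; the values shown here are only the described or assumed defaults, so check the source before relying on them:

```python
# Module-level settings mirroring the three globals described above.
# Names match the README; values are illustrative assumptions.

RUN_GPQ_CONVERSION = True      # run gpq on DuckDB Parquet output so it is valid GeoParquet
PARQUET_COMPRESSION = "snappy" # Parquet codec; ignored by the OGR converter (assumed default)
SKIP_DUCK_GPKG = False         # skip the slow DuckDB -> GeoPackage conversion
```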

Contributing

All contributions are welcome, I love running open source projects. I'm clearly just learning to code Python, so there's no judgement about crappy code. And I'm super happy to learn from others about better code. Feel free to sound in on the issues, make new ones, grab one, or make a PR. There's lots of low hanging fruit of things to add. And if you're just starting out programming don't hesitate to ask even basic things in the discussions.

open-buildings's People

Contributors

cholmes, darrenwiens, felix-schott, giswqs, ingenieroariel, mtravis, theroggy


open-buildings's Issues

Warn users if their geojson is not in `iso_country`

Adding a wrong iso_country will result in 0 buildings coming down, since it'll just query the wrong country. It'd be good to warn users about that possibility.

I think this can be pretty simple - just check if there are 0 features that are downloaded and if the user provided a country_iso. If both of those are true then print out something like 'WARNING: You supplied country_iso BR and your geojson got 0 buildings. Check to be sure your GeoJSON is actually in the right country'. Note that if we do #29 then there will hopefully be no need for this warning.
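The check described above could be as simple as the following sketch (the function and message wording are hypothetical, just illustrating the issue's suggestion):

```python
# Hypothetical sketch of the proposed check: warn when a country_iso hint
# was supplied but the query returned zero buildings.

def check_empty_result(num_features, country_iso=None):
    """Return a warning string if a hinted query came back empty."""
    if num_features == 0 and country_iso:
        return (
            f"WARNING: You supplied country_iso {country_iso} and your "
            "geojson got 0 buildings. Check to be sure your GeoJSON is "
            "actually in the right country."
        )
    return ""

msg = check_empty_result(0, "BR")  # triggers the warning
```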

Organize packages better

With the 0.8 release we'll have functions to format overture buildings, and to download any buildings.

I don't really understand how python packages work, but I'd like to get it so the cli.py has 3 subcommands:
  • open_buildings google with convert and benchmark under it - the existing main commands.
  • open_buildings overture with commands to work with overture data.
  • open_buildings tools that has the 'get-buildings' command.

Should likely combine the two overture python files into one. Right now they all have click interfaces in them - the intent was to move all the click interfaces to the cli.py. But once I got into it, it seemed like it would be nice to have some of the commands that were more made for debugging / figuring out what's going on (like quad2json, wkt, etc) be in a CLI, but not in the 'main' CLI. So maybe it makes sense to have click packages in both?

I'm also more than open to other ideas on how to organize things. I do think it'll make sense to evolve more of the functions to be 'generic', but I'll make a separate ticket for that.

Add a geocoding option

Right now you have to input a geojson - it'd be much nicer for many users to just enter like a city, state or county name and get buildings for it.

Ideally we find a geocoder that returns polygons and doesn't cost too much. I could likely pay for it for a bit, but we'd probably evolve to making it a config option for people to put in their geocoder API key.

Better attribute requesting in `get-buildings`

Right now the get-buildings command just has some hard-coded attributes for testing, and they are just the ones the Overture buildings dataset has. It should default to getting all attributes, and also have a flag where a user can specify the attributes they want; ideally both include and exclude options.
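A minimal sketch of what include/exclude flags might do when building the query's column list (the column names here are hypothetical, not the actual Overture schema):

```python
# Sketch of the proposed include/exclude attribute flags: derive the SELECT
# column list from the full schema. Column names are hypothetical.

def select_columns(all_columns, include=None, exclude=None):
    """Return a comma-separated column list honoring include/exclude."""
    cols = list(include) if include else list(all_columns)
    if exclude:
        cols = [c for c in cols if c not in set(exclude)]
    return ", ".join(cols) if cols else "*"

schema = ["geometry", "height", "class", "confidence"]
```

So `--include geometry,height` would narrow the list, and `--exclude confidence` would drop one column from the default "everything" selection.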

Finish google buildings v3

The Google building data on source.coop only has a complete partitioned dataset based on Google's v2. The v3 version is started, but it hasn't actually been partitioned yet.

The idea was to try a few different tools to partition and write up comparisons, just like I did for convert - https://cholmes.medium.com/performance-explorations-of-geoparquet-and-duckdb-84c0185ed399

But overture came into the mix and bumped the priority. There are lots of learnings from there, and many of the tools built for overture could be made more generic (#14 is the ticket for that). It'd be nice to just 'finish' v3 of google buildings in at least one partition, with row groups, so it can be used in get-buildings (ticket #15), so that one doesn't just work with v2, which covers fewer countries.

Automatically calculate country codes per quadkey & remove country_iso flag

It seems like it should be possible to automatically include a country_iso to substantially speed up the query. The current method of having the user supply it is potentially error prone, and annoying.

The idea would be to calculate the list of country_iso values for every single quadkey. This would have to be a list, since quadkeys can cross countries, and big ones could have a hundred or more countries in them. But most should be one or a handful of countries, which will most always speed up the query.

There are 16 million quadkeys at level 12, but many are likely in the ocean. We could likely use quadkeys at level 10 or even 8, since the extra precision of the deeper levels probably isn't worth the additional partitions.

So I think the main thing would be to make a script that generates a list of country iso codes for every quadkey. Then store that as a parquet file, and if it's not too big we could likely just include it in the open_buildings package.

If we had this then we could remove the country_iso flag, as we'd be able to always use a hive partition.
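For reference, computing the level-12 quadkey for a point needs no external dependencies; this sketch shows the Web Mercator tile math that a quadkey-to-country script could build on (the lookup table itself is the part the issue proposes, and is not shown):

```python
import math

# Sketch: compute the level-12 quadkey for a lon/lat point. This would be
# the join key into the proposed quadkey -> country-list parquet table.

def quadkey(lon, lat, zoom=12):
    """Bing-style quadkey of the Web Mercator tile containing a point."""
    lat = max(min(lat, 85.05112878), -85.05112878)  # clamp to Mercator bounds
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    digits = []
    for z in range(zoom, 0, -1):
        # Each quadkey digit encodes one bit of x and one bit of y.
        digit = 0
        mask = 1 << (z - 1)
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

# e.g. countries = lookup_table[quadkey(30.06, -1.94)]  # hypothetical table
```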

Installation requirement files and dependencies

Environment Information

  • open_buildings version: 0.10.0
  • Python version: 3.9.5
  • Operating System: Windows OSGeo4W

Description

Describe what you were trying to get done.

PS: This is just an observation. The installation was successful (on a Windows PC).

After running the installation command pip install open-buildings, the list of dependencies that pip was installing in the background was suspiciously longer than what would have been expected from the requirements.txt file. Some packages would have been expected only when building documentation, for example.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sphinx 3.3.0 requires sphinxcontrib-applehelp, which is not installed.
sphinx 3.3.0 requires sphinxcontrib-devhelp, which is not installed.
sphinx 3.3.0 requires sphinxcontrib-htmlhelp, which is not installed.
sphinx 3.3.0 requires sphinxcontrib-jsmath, which is not installed.
sphinx 3.3.0 requires sphinxcontrib-qthelp, which is not installed.
sphinx 3.3.0 requires sphinxcontrib-serializinghtml, which is not installed.
pointpats 2.2.0 requires opencv-contrib-python>=4.2.0, which is not installed.
access 1.1.1 requires Sphinx==2.4.3, but you have sphinx 3.3.0 which is incompatible.
pysal 2.3.0 requires python-dateutil<=2.8.0, but you have python-dateutil 2.8.2 which is incompatible.
pysal 2.3.0 requires urllib3<1.25, but you have urllib3 1.25.11 which is incompatible.
pytest 6.1.2 requires pluggy<1.0,>=0.12, but you have pluggy 1.0.0 which is incompatible.
statsmodels 0.14.0 requires patsy>=0.5.2, but you have patsy 0.5.1 which is incompatible.

Based on this message log, I decided to check the dependency tree and which package requires these packages, i.e.:

List of all packages Installed

MarkupSafe-2.1.3
PySocks-1.7.1
attrs-23.1.0
boto3-1.28.80
botocore-1.31.80
bqplot-0.12.42
branca-0.7.0
charset-normalizer-3.3.2
click-8.1.7
cligj-0.7.2
colour-0.1.5
comm-0.2.0
duckdb-0.9.1
folium-0.15.0
gdown-4.7.1
geojson-3.1.0
ipyevents-2.0.2
ipyfilechooser-0.6.0
ipyleaflet-0.17.4
ipytree-0.2.2
ipywidgets-8.1.1
jmespath-1.0.1
jsonschema-4.19.2
jsonschema-specifications-2023.7.1
jupyterlab-widgets-3.0.9
leafmap-0.28.1
mercantile-1.2.1
open-buildings-0.10.0
openlocationcode-1.0.1
pyshp-2.3.1
pystac-1.9.0
pystac-client-0.7.5
python-box-7.1.1
python-dateutil-2.8.2
referencing-0.30.2
requests-2.31.0
rpds-py-0.12.0
s3transfer-0.7.0
scooby-0.9.2
tabulate-0.9.0
traittypes-0.2.1
whitebox-2.3.1
whiteboxgui-2.3.0
widgetsnbextension-4.0.9
xyzservices-2023.10.1

What I Did

Used pipdeptree tool to print out the dependency tree for open-buildings package/tool
Command

pipdeptree.exe --package open-buildings

Output

------------------------------------------------------------------------
open-buildings==0.10.0
├── boto3 [required: Any, installed: 1.28.80]
│   ├── botocore [required: >=1.31.80,<1.32.0, installed: 1.31.80]
│   │   ├── jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
│   │   ├── python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
│   │   │   └── six [required: >=1.5, installed: 1.15.0]
│   │   └── urllib3 [required: >=1.25.4,<1.27, installed: 1.25.11]
│   ├── jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
│   └── s3transfer [required: >=0.7.0,<0.8.0, installed: 0.7.0]
│       └── botocore [required: >=1.12.36,<2.0a.0, installed: 1.31.80]
│           ├── jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
│           ├── python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
│           │   └── six [required: >=1.5, installed: 1.15.0]
│           └── urllib3 [required: >=1.25.4,<1.27, installed: 1.25.11]
├── click [required: Any, installed: 8.1.7]
│   └── colorama [required: Any, installed: 0.4.4]
├── duckdb [required: Any, installed: 0.9.1]
├── geopandas [required: Any, installed: 0.13.2]
│   ├── Fiona [required: >=1.8.19, installed: 1.9.4.post1]
│   │   ├── attrs [required: >=19.2.0, installed: 23.1.0]
│   │   ├── certifi [required: Any, installed: 2020.6.20]
│   │   ├── click [required: ~=8.0, installed: 8.1.7]
│   │   │   └── colorama [required: Any, installed: 0.4.4]
│   │   ├── click-plugins [required: >=1.0, installed: 1.1.1]
│   │   │   └── click [required: >=4.0, installed: 8.1.7]
│   │   │       └── colorama [required: Any, installed: 0.4.4]
│   │   ├── cligj [required: >=0.5, installed: 0.7.2]
│   │   │   └── click [required: >=4.0, installed: 8.1.7]
│   │   │       └── colorama [required: Any, installed: 0.4.4]
│   │   ├── importlib-metadata [required: Any, installed: 2.0.0]
│   │   │   └── zipp [required: >=0.5, installed: 3.4.0]
│   │   └── six [required: Any, installed: 1.15.0]
│   ├── packaging [required: Any, installed: 23.0]
│   ├── pandas [required: >=1.1.0, installed: 2.0.2]
│   │   ├── numpy [required: >=1.20.3, installed: 1.24.1]
│   │   ├── python-dateutil [required: >=2.8.2, installed: 2.8.2]
│   │   │   └── six [required: >=1.5, installed: 1.15.0]
│   │   ├── pytz [required: >=2020.1, installed: 2023.3]
│   │   └── tzdata [required: >=2022.1, installed: 2023.3]
│   ├── pyproj [required: >=3.0.1, installed: 3.6.0]
│   │   └── certifi [required: Any, installed: 2020.6.20]
│   └── shapely [required: >=1.7.1, installed: 2.0.1]
│       └── numpy [required: >=1.14, installed: 1.24.1]
├── leafmap [required: Any, installed: 0.28.1]
│   ├── bqplot [required: Any, installed: 0.12.42]
│   │   ├── ipywidgets [required: >=7.5.0,<9, installed: 8.1.1]
│   │   │   ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │   │   │   └── traitlets [required: >=4, installed: 5.0.5]
│   │   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │   │   │   ├── backcall [required: Any, installed: 0.2.0]
│   │   │   │   ├── colorama [required: Any, installed: 0.4.4]
│   │   │   │   ├── decorator [required: Any, installed: 4.4.2]
│   │   │   │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │   │   │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │   │   │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │   │   │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │   │   │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │   │   │   ├── Pygments [required: Any, installed: 2.7.2]
│   │   │   │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │   │   │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │   │   ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │   │   │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   │   ├── numpy [required: >=1.10.4, installed: 1.24.1]
│   │   ├── pandas [required: >=1.0.0,<3.0.0, installed: 2.0.2]
│   │   │   ├── numpy [required: >=1.20.3, installed: 1.24.1]
│   │   │   ├── python-dateutil [required: >=2.8.2, installed: 2.8.2]
│   │   │   │   └── six [required: >=1.5, installed: 1.15.0]
│   │   │   ├── pytz [required: >=2020.1, installed: 2023.3]
│   │   │   └── tzdata [required: >=2022.1, installed: 2023.3]
│   │   ├── traitlets [required: >=4.3.0, installed: 5.0.5]
│   │   │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   └── traittypes [required: >=0.0.6, installed: 0.2.1]
│   │       └── traitlets [required: >=4.2.2, installed: 5.0.5]
│   │           └── ipython-genutils [required: Any, installed: 0.2.0]
│   ├── colour [required: Any, installed: 0.1.5]
│   ├── folium [required: Any, installed: 0.15.0]
│   │   ├── branca [required: >=0.6.0, installed: 0.7.0]
│   │   │   └── Jinja2 [required: Any, installed: 3.1.2]
│   │   │       └── MarkupSafe [required: >=2.0, installed: 2.1.3]
│   │   ├── Jinja2 [required: >=2.9, installed: 3.1.2]
│   │   │   └── MarkupSafe [required: >=2.0, installed: 2.1.3]
│   │   ├── numpy [required: Any, installed: 1.24.1]
│   │   └── requests [required: Any, installed: 2.31.0]
│   │       ├── certifi [required: >=2017.4.17, installed: 2020.6.20]
│   │       ├── charset-normalizer [required: >=2,<4, installed: 3.3.2]
│   │       ├── idna [required: >=2.5,<4, installed: 2.10]
│   │       └── urllib3 [required: >=1.21.1,<3, installed: 1.25.11]
│   ├── gdown [required: Any, installed: 4.7.1]
│   │   ├── beautifulsoup4 [required: Any, installed: 4.9.3]
│   │   │   └── soupsieve [required: >1.2, installed: 2.0.1]
│   │   ├── filelock [required: Any, installed: 3.0.12]
│   │   ├── requests [required: Any, installed: 2.31.0]
│   │   │   ├── certifi [required: >=2017.4.17, installed: 2020.6.20]
│   │   │   ├── charset-normalizer [required: >=2,<4, installed: 3.3.2]
│   │   │   ├── idna [required: >=2.5,<4, installed: 2.10]
│   │   │   └── urllib3 [required: >=1.21.1,<3, installed: 1.25.11]
│   │   ├── six [required: Any, installed: 1.15.0]
│   │   └── tqdm [required: Any, installed: 4.51.0]
│   ├── geojson [required: Any, installed: 3.1.0]
│   ├── ipyevents [required: Any, installed: 2.0.2]
│   │   └── ipywidgets [required: >=7.6.0, installed: 8.1.1]
│   │       ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │       │   └── traitlets [required: >=4, installed: 5.0.5]
│   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │       ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │       │   ├── backcall [required: Any, installed: 0.2.0]
│   │       │   ├── colorama [required: Any, installed: 0.4.4]
│   │       │   ├── decorator [required: Any, installed: 4.4.2]
│   │       │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │       │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │       │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │       │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │       │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │       │   ├── Pygments [required: Any, installed: 2.7.2]
│   │       │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │       │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │       ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │       ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │       │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │       └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   ├── ipyfilechooser [required: Any, installed: 0.6.0]
│   │   └── ipywidgets [required: Any, installed: 8.1.1]
│   │       ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │       │   └── traitlets [required: >=4, installed: 5.0.5]
│   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │       ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │       │   ├── backcall [required: Any, installed: 0.2.0]
│   │       │   ├── colorama [required: Any, installed: 0.4.4]
│   │       │   ├── decorator [required: Any, installed: 4.4.2]
│   │       │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │       │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │       │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │       │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │       │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │       │   ├── Pygments [required: Any, installed: 2.7.2]
│   │       │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │       │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │       ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │       ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │       │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │       └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   ├── ipyleaflet [required: Any, installed: 0.17.4]
│   │   ├── branca [required: >=0.5.0, installed: 0.7.0]
│   │   │   └── Jinja2 [required: Any, installed: 3.1.2]
│   │   │       └── MarkupSafe [required: >=2.0, installed: 2.1.3]
│   │   ├── ipywidgets [required: >=7.6.0,<9, installed: 8.1.1]
│   │   │   ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │   │   │   └── traitlets [required: >=4, installed: 5.0.5]
│   │   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │   │   │   ├── backcall [required: Any, installed: 0.2.0]
│   │   │   │   ├── colorama [required: Any, installed: 0.4.4]
│   │   │   │   ├── decorator [required: Any, installed: 4.4.2]
│   │   │   │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │   │   │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │   │   │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │   │   │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │   │   │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │   │   │   ├── Pygments [required: Any, installed: 2.7.2]
│   │   │   │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │   │   │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │   │   ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │   │   │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   │   ├── traittypes [required: >=0.2.1,<3, installed: 0.2.1]
│   │   │   └── traitlets [required: >=4.2.2, installed: 5.0.5]
│   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   └── xyzservices [required: >=2021.8.1, installed: 2023.10.1]
│   ├── ipywidgets [required: Any, installed: 8.1.1]
│   │   ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │   │   └── traitlets [required: >=4, installed: 5.0.5]
│   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │   │   ├── backcall [required: Any, installed: 0.2.0]
│   │   │   ├── colorama [required: Any, installed: 0.4.4]
│   │   │   ├── decorator [required: Any, installed: 4.4.2]
│   │   │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │   │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │   │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │   │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │   │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │   │   ├── Pygments [required: Any, installed: 2.7.2]
│   │   │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │   │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │   ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │   │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   ├── matplotlib [required: Any, installed: 3.5.1]
│   │   ├── cycler [required: >=0.10, installed: 0.10.0]
│   │   │   └── six [required: Any, installed: 1.15.0]
│   │   ├── fonttools [required: >=4.22.0, installed: 4.28.5]
│   │   ├── kiwisolver [required: >=1.0.1, installed: 1.2.0]
│   │   ├── numpy [required: >=1.17, installed: 1.24.1]
│   │   ├── packaging [required: >=20.0, installed: 23.0]
│   │   ├── Pillow [required: >=6.2.0, installed: 9.2.0]
│   │   ├── pyparsing [required: >=2.2.1, installed: 2.4.7]
│   │   └── python-dateutil [required: >=2.7, installed: 2.8.2]
│   │       └── six [required: >=1.5, installed: 1.15.0]
│   ├── numpy [required: Any, installed: 1.24.1]
│   ├── pandas [required: Any, installed: 2.0.2]
│   │   ├── numpy [required: >=1.20.3, installed: 1.24.1]
│   │   ├── python-dateutil [required: >=2.8.2, installed: 2.8.2]
│   │   │   └── six [required: >=1.5, installed: 1.15.0]
│   │   ├── pytz [required: >=2020.1, installed: 2023.3]
│   │   └── tzdata [required: >=2022.1, installed: 2023.3]
│   ├── pyshp [required: Any, installed: 2.3.1]
│   ├── pystac-client [required: Any, installed: 0.7.5]
│   │   ├── pystac [required: >=1.8.2, installed: 1.9.0]
│   │   │   └── python-dateutil [required: >=2.7.0, installed: 2.8.2]
│   │   │       └── six [required: >=1.5, installed: 1.15.0]
│   │   ├── python-dateutil [required: >=2.8.2, installed: 2.8.2]
│   │   │   └── six [required: >=1.5, installed: 1.15.0]
│   │   └── requests [required: >=2.28.2, installed: 2.31.0]
│   │       ├── certifi [required: >=2017.4.17, installed: 2020.6.20]
│   │       ├── charset-normalizer [required: >=2,<4, installed: 3.3.2]
│   │       ├── idna [required: >=2.5,<4, installed: 2.10]
│   │       └── urllib3 [required: >=1.21.1,<3, installed: 1.25.11]
│   ├── python-box [required: Any, installed: 7.1.1]
│   ├── scooby [required: Any, installed: 0.9.2]
│   ├── whiteboxgui [required: Any, installed: 2.3.0]
│   │   ├── ipyfilechooser [required: Any, installed: 0.6.0]
│   │   │   └── ipywidgets [required: Any, installed: 8.1.1]
│   │   │       ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │   │       │   └── traitlets [required: >=4, installed: 5.0.5]
│   │   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │       ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │   │       │   ├── backcall [required: Any, installed: 0.2.0]
│   │   │       │   ├── colorama [required: Any, installed: 0.4.4]
│   │   │       │   ├── decorator [required: Any, installed: 4.4.2]
│   │   │       │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │   │       │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │   │       │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │   │       │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │   │       │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │   │       │   ├── Pygments [required: Any, installed: 2.7.2]
│   │   │       │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │   │       │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │       ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │   │       ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │   │       │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │       └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   │   ├── ipytree [required: Any, installed: 0.2.2]
│   │   │   └── ipywidgets [required: >=7.5.0,<9, installed: 8.1.1]
│   │   │       ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │   │       │   └── traitlets [required: >=4, installed: 5.0.5]
│   │   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │       ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │   │       │   ├── backcall [required: Any, installed: 0.2.0]
│   │   │       │   ├── colorama [required: Any, installed: 0.4.4]
│   │   │       │   ├── decorator [required: Any, installed: 4.4.2]
│   │   │       │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │   │       │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │   │       │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │   │       │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │   │       │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │   │       │   ├── Pygments [required: Any, installed: 2.7.2]
│   │   │       │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │   │       │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │   │       │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │       ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │   │       ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │   │       │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │       └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   │   ├── ipywidgets [required: Any, installed: 8.1.1]
│   │   │   ├── comm [required: >=0.1.3, installed: 0.2.0]
│   │   │   │   └── traitlets [required: >=4, installed: 5.0.5]
│   │   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   ├── ipython [required: >=6.1.0, installed: 7.18.1]
│   │   │   │   ├── backcall [required: Any, installed: 0.2.0]
│   │   │   │   ├── colorama [required: Any, installed: 0.4.4]
│   │   │   │   ├── decorator [required: Any, installed: 4.4.2]
│   │   │   │   ├── jedi [required: >=0.10, installed: 0.17.2]
│   │   │   │   │   └── parso [required: >=0.7.0,<0.8.0, installed: 0.7.1]
│   │   │   │   ├── pickleshare [required: Any, installed: 0.7.5]
│   │   │   │   ├── prompt-toolkit [required: >=2.0.0,<3.1.0,!=3.0.1,!=3.0.0, installed: 3.0.8]
│   │   │   │   │   └── wcwidth [required: Any, installed: 0.2.5]
│   │   │   │   ├── Pygments [required: Any, installed: 2.7.2]
│   │   │   │   ├── setuptools [required: >=18.5, installed: 67.6.0]
│   │   │   │   └── traitlets [required: >=4.2, installed: 5.0.5]
│   │   │   │       └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   ├── jupyterlab-widgets [required: ~=3.0.9, installed: 3.0.9]
│   │   │   ├── traitlets [required: >=4.3.1, installed: 5.0.5]
│   │   │   │   └── ipython-genutils [required: Any, installed: 0.2.0]
│   │   │   └── widgetsnbextension [required: ~=4.0.9, installed: 4.0.9]
│   │   └── whitebox [required: Any, installed: 2.3.1]
│   │       └── click [required: >=6.0, installed: 8.1.7]
│   │           └── colorama [required: Any, installed: 0.4.4]
│   └── xyzservices [required: Any, installed: 2023.10.1]
├── mercantile [required: Any, installed: 1.2.1]
│   └── click [required: >=3.0, installed: 8.1.7]
│       └── colorama [required: Any, installed: 0.4.4]
├── openlocationcode [required: Any, installed: 1.0.1]
├── pandas [required: Any, installed: 2.0.2]
│   ├── numpy [required: >=1.20.3, installed: 1.24.1]
│   ├── python-dateutil [required: >=2.8.2, installed: 2.8.2]
│   │   └── six [required: >=1.5, installed: 1.15.0]
│   ├── pytz [required: >=2020.1, installed: 2023.3]
│   └── tzdata [required: >=2022.1, installed: 2023.3]
├── shapely [required: Any, installed: 2.0.1]
│   └── numpy [required: >=1.14, installed: 1.24.1]
└── tabulate [required: Any, installed: 0.9.0]

From the tree output, leafmap had the most dependencies, so I did a general inspection of the code base to find where the package is actually used, via the Visual Studio Code search (Ctrl + Shift + F).

It turns out to be used once, in the examples file download_buildings.ipynb, with no usage in the main package source code. Hence the question: why is it included in the main requirements.txt file instead of just the docs requirements?

Side note: any instructions on how to build the docs locally would be appreciated. Thanks for the awesome tool.

Tidy up requirements.txt

Description

The requirements.txt/requirements_dev.txt files need some tidying up, it seems. I see leafmap in there but don't think it's used anywhere in the code. In addition, all packages should ideally be pinned to specific versions - otherwise there is a risk that pip downloads the latest available version at a given point in time which can even break versions of the package that used to work for users.

Refactor overture stuff to be more generic

The main Overture commands could likely be made fairly generic for any large geospatial file. It'd be great to evolve them to at least be 'tools', and perhaps even be their own package that 'open_buildings' would call / depend on. The overall flow of how the data is formatted is:

  • Add country_iso and quadkey columns to a directory of parquet files.
  • Create a DuckDB database from all the files (this isn't actually a CLI / python script yet, as it's super easy - just create a table from reading in the whole directory).
  • Write out individual parquet files based on country_iso and quadkey, up to the maximum size, with the appropriate row group.

A more generic version of this would likely take input from more than parquet files (or at least have a command to convert to parquet files). And it would not be tied to the 'buildings' name.

Don't create empty files if there are 0 features

Right now if you put in a geojson of an area that has no buildings (like the middle of the ocean, or the middle of the desert) then the get_buildings command will write out a geospatial file with 0 rows. This is not ideal - instead it should warn the user and then not actually call the DuckDB command to write it out.

This should be pretty simple to do - somewhere in here just check if the count is 0, and if it is then print out to the user that 0 buildings were found and that no file was written, and then just return / skip the rest.
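A minimal sketch of that guard, assuming a small helper around the existing DuckDB connection (`count_query` and `copy_query` are hypothetical stand-ins for whatever SQL the tool already builds):

```python
# Hypothetical helper: run a COUNT(*) first, and only issue the COPY
# (the file write) when at least one building was found.
def write_if_nonempty(con, count_query, copy_query, verbose=True):
    count = con.execute(count_query).fetchone()[0]
    if count == 0:
        if verbose:
            print("0 buildings found - no file written.")
        return False
    con.execute(copy_query)
    return True
```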

Better estimates of how long a query might take

Right now the CLI informs the user that a query will take at least 5-10 seconds if they have a country_iso and 30-60 if they do not. But those are really just the minimum times, for small areas. It'd be better if we could provide more guidance - like if someone is trying to query a huge area then tell them it can take minutes or hours, or even longer on a slow connection. I just did a decent-sized area around Sao Paulo and it took 18 minutes to download 5.36 million buildings / 716.9 MB, and my connection is pretty fast.

This would likely need #32 to be sure that the user is actually requesting a large area if it's a large quadkey, since right now if it's a geojson that straddles a quadkey then it can look big but would still go pretty fast.

And ideally we'd do a good bit of testing to be able to give guidance - test really sparse areas in Australia and dense areas in, say, India, and then also on different connections.

This is related to #31 - though this is probably a bit easier, as it's just guidance based on the size of the request, not trying to actually report what's happening. Though if that one is easier then this one may not be needed.

Cannot save as Shapefile

Environment Information

  • open_buildings version: 0.10.0
  • Python version: 3.11
  • Operating System: Debian Bullseye (docker image)

Description

Trying to save the output as a shapefile fails, see command and traceback below.

What I Did

$ echo '{ "type": "Feature", "properties": {}, "geometry": {"coordinates": [[[-0.13085471468215815, 51.50945096318702], [-0.13085471468215815, 51.50612362847875], [-0.12508113856225123, 51.50612362847875], [-0.12508113856225123, 51.50945096318702], [-0.13085471468215815, 51.50945096318702]]], "type": "Polygon"}}' | ob get_buildings - buildings.shp --country_iso GB 
[2023-10-14 10:12:02] Querying and downloading data for quadkey 0313131311 in country GB...
[2023-10-14 10:12:02] Expect query times of at least 5-10 seconds
[2023-10-14 10:12:02] Installing DuckDB spatial extension...
[2023-10-14 10:12:56] Downloaded 65 features into DuckDB.
[2023-10-14 10:12:56] Writing to buildings.shp...
terminate called after throwing an instance of 'duckdb::IOException'
  what():  IO Error: Could not write file "buildings.dbf": Bad file descriptor
Aborted (core dumped)

Make get-buildings more robust

The get-buildings command seems to work decently well, but it does very little in the way of catching all the things a user might input things wrong. It'd be good to test out common ways that it wouldn't work right and give better warnings, etc.

It also appears to hang if you don't supply the file, I think since it's expecting stdin. I definitely want to keep the ability to do stdin, but ideally can warn users. In planet CLI we had '-' mean 'read from stdin' so that could be a good pattern to follow.
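That pattern could be sketched like this (the function name `read_aoi` is hypothetical, not the actual CLI code):

```python
import sys

# Hypothetical sketch: '-' means read from stdin, but warn instead of
# silently hanging when stdin is an interactive terminal with nothing piped.
def read_aoi(path):
    if path == "-":
        if sys.stdin.isatty():
            raise SystemExit(
                "No input piped to stdin - pass a GeoJSON file, "
                "or pipe one in and use '-'."
            )
        return sys.stdin.read()
    with open(path) as f:
        return f.read()
```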

Add tests

We have a nice CI system that checks multiple operating systems to be sure that all still works. But there's one big problem - there's no tests that it runs. It's been a one man project, but if the community grows it's essential to be able to automatically check if changes broke something unexpected (also good even when it is a one man project).

ChatGPT can likely help in the creation of tests. Would be good to have unit tests, and also some more integration tests that use the source.coop files to ensure it's all working.

Add vida google-microsoft buildings to `get-buildings`

The VIDA dataset on Source Cooperative combines the Google and Microsoft buildings, and should yield the most buildings of the different options. It should be relatively easy to add, but it doesn't use 'quadkey' for spatial partitioning - it's S2 instead. The one to add is https://beta.source.coop/vida/google-microsoft-open-buildings/geoparquet/by_country_s2 - as it's more partitioned and will likely perform much better (though it's worth trying both).

The main task for this is to support a different 'spatial' column - the current setup assumes quadkey, as that's what the first two datasets were done with. Ideally the download_buildings function would take an argument that would be either 'quadkey' or 's2', and we could add h3, geohash, etc. The get_buildings CLI should just have an option to use this dataset, and then it can pass the right arguments into download_buildings.

The quadkey is computed client side, and it's likely similarly easy to compute the s2 key, and then use that in the query.

Control over attributes requested in `get-buildings`

Right now the get-buildings call just requests all attributes. It'd be nice if we had --include and --exclude flags like tippecanoe does, to give users more control over the attributes they want.

To tackle this issue the get_buildings command in cli.py should have two more flags added, and then download_buildings should take them in. Then the select_values variable should be tweaked with the right logic. The SQL can leverage DuckDB's EXCLUDE clause for the exclude case, and the include would just pass the values in. It'd likely make sense to just not allow a user to use both include and exclude, as that'd be funky logic to get right.
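A sketch of that select-list logic (the helper is hypothetical; DuckDB's `SELECT * EXCLUDE (...)` syntax is real):

```python
# Hypothetical sketch of the select_values logic with include/exclude flags.
def build_select(include=None, exclude=None):
    if include and exclude:
        raise ValueError("Use --include or --exclude, not both.")
    if include:
        return ", ".join(include)
    if exclude:
        # DuckDB supports SELECT * EXCLUDE (col1, col2, ...)
        return "* EXCLUDE (" + ", ".join(exclude) + ")"
    return "*"
```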

Option to just get the 'count' of buildings, but not actually download the results.

It'd be nice to be able to quickly see how many buildings a request would result in, instead of downloading all the buildings. This can be done by just doing a SELECT COUNT(*) (instead of SELECT *) in DuckDB, and then printing that out without downloading anything.

To add this just start with --verbose to see what type of queries DuckDB will issue, and then try out a similar query that will just get the count and make sure it works. Then try just changing the core 'download' command to do a count and print that out. Once that is working then you just need to add the flag, to the cli.py and pass in the count flag to the download function in download_buildings.

If you want to take this on and have more questions feel free to comment here and I can explain more.
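As an illustration of the query change, here's a hypothetical helper that rewrites a download query into a count-only one:

```python
# Hypothetical sketch: turn 'SELECT * FROM ...' into 'SELECT COUNT(*) FROM ...'
# so DuckDB returns a single number instead of downloading rows.
def to_count_query(select_query):
    _, _, tail = select_query.partition("FROM")
    return "SELECT COUNT(*) FROM" + tail
```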

Performance testing of different partition options

In making the get-buildings command I went through a couple of iterations of trying out different formatting - definitely realizing that more row groups than gpq makes by default is better. And with the latest scripts I have a way to set the 'max number of rows' per file and also the number of row groups. But I have no idea if things could be lots faster if we increased or decreased the row group size, and/or increased or decreased the number of files. The 'defaults' I used were a max of 10 million rows per file and 20,000 rows per group. It'd be great to try out some variations on that.

Ideally we'd also experiment with the tradeoffs between 'legibility for download' (like using country then admin level 1, as the Google buildings data does) and 'balance of spatial size' (like using the quadkey max-size algorithm entirely, instead of country then quadkey, so we'd have far fewer files overall, but each file would be meaningless to users - they'd need to use the 'tool' to download).

The performance I was getting to was 20-30 seconds to download a small number of buildings. But it was just a handful of tests.

Ideally we'd have a command that would run a 'benchmark' that would have 20-30 locations globally and get the performance for each of them and report that out, so we can easily compare how tweaks to the data work.
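The benchmark runner could look roughly like this (a sketch; `fetch_fn` stands in for the real download routine and the location names are made up):

```python
import time

# Hypothetical sketch of a global benchmark runner: time the download
# routine over a set of named AOIs and report seconds per location.
def run_benchmark(locations, fetch_fn):
    results = {}
    for name, aoi in locations:
        start = time.perf_counter()
        fetch_fn(aoi)
        results[name] = time.perf_counter() - start
    return results
```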

Implement `split_multipolygons` for OGR process

Description

Add the splitting of multipolygons to the ogr process. I'm not sure if it's possible to do pure CLI call to do this operation, so it may need to make use of Fiona, but that may lose the speed of the column-oriented API. So if it just ends up being about the same speed as pandas (with fiona under its hood) then perhaps we just don't implement.

Structure of the package

Description

Two things that I'm unclear about:

  1. Not quite sure what the purpose of common.py is - can this be deleted?
  2. I would also recommend moving utility functions to a utils.py file and further putting all CLI handlers in the cli.py file, with a clear separation of CLI handler (command line interface) and library function (python interface). E.g. it's not quite clear why download_buildings also contains CLI handlers.

Better progress reporting in `get_buildings`

Right now if you do a get_buildings call on a big area it can take a long time, but you have no idea if it's working away or something has gone wrong. It'd be much better if it could report on how things are going. The ideal would be to report on what it's scanned remotely (maybe just in --verbose), and then to show some streaming progress as it downloads buildings.

This may not be possible, since it's DuckDB doing all the querying, but perhaps there's a way to hook into what it's doing and report out.

Make get building requests with smaller quadkeys

Right now the geojson_to_quadkey function in download_buildings keeps zooming out until it hits a quadkey that completely encompasses the area. This can lead to some very big quadkeys if the area to query straddles a big quadkey - I hit one area in Italy that was getting something like a level 4 quadkey.

It'd be much better if we didn't make big scans. It seems like one route to do this would be to allow for more than one quadkey. The function seems like you could adjust the cut off for number of tiles to be more than one:

for zoom in range(12, -1, -1):
    tiles = list(mercantile.tiles(min_lon, min_lat, max_lon, max_lat, zooms=zoom))
    if len(tiles) == 1:
        return mercantile.quadkey(tiles[0])

So this might be simple - just try to get a bigger list. I think we probably don't want to return a huge list, so maybe stop when the area is covered by 4 or 10 quadkeys. Probably worth some experimenting on query times with different combinations - the parquet partitioning might have increased overhead when querying a lot of different options, so maybe just looking for 2 makes sense. But even just 2 seems like it'd help in cases where the area straddles a huge quadkey.

There's perhaps some other technique that could be done here, to get the biggest one and then scale down.

`--skip-split-multis` doesn't work right with ogr process on benchmark

Description

Right now if you try to run --skip-split-multis on the ogr process in the benchmark command it will return a table of super fast responses. This is because the operation isn't actually running - with 'convert' it just informs you that the process doesn't yet work. This is fine for convert, but for benchmark we should probably at least print a WARNING that says the times aren't valid, and maybe even just leave it off the results. Or we could run it with skip-multis, since the timing is likely representative - the difference for duckdb and pandas isn't significant. We could also just try to implement it, but I'm not sure if it's possible to do with pure command-line calls.

Create QGIS plugin for get_buildings

This is technically outside the scope of this project, as a QGIS plugin should have its own repo. But I think this cloud-native geo querying of source.coop datasets could be much more accessible for users if there was a QGIS plugin that used the same technique.

I've never written one, but it would likely be a very cool first one to do, so if I have a chance I'll try it. But anyone else is welcome to try, and to reuse as much or as little code from here as desired.

Make parquet compression options a flag to pass in to benchmark

Description

There are a number of different parquet compression options (snappy, gzip, zstd, brotli, uncompressed, etc) that can make things faster/slower and smaller/bigger. It'd be nice to be able to benchmark / compare those. Right now there's a global variable that can be used to set this. If implemented it should raise appropriate errors on the process used, as each process supports a different set of compression options.

Get CLI working

Environment Information

  • open_buildings version: 0.2
  • Python version: 3.10?
  • Operating System: Mac OS

Description

Trying to use the CLI for open buildings. It seems like something is sorta there - it installs open_buildings on the path (which is further than I got on my own), but then it says to replace this message by 'putting your code in open_buildings.cli.main'.

It seems like it'd be nice to align the pip install open-buildings with the cli, like have both be open-buildings or both be open_buildings.

Originally reported in cholmes/google-buildings-tools#1

What I Did

% open-buildings
zsh: command not found: open-buildings
% open_buildings
Replace this message by putting your code into open_buildings.cli.main
See click documentation at https://click.palletsprojects.com/

Fix --no-gpq to actually work

Description

Right now there is a --no-gpq flag, but it was poorly implemented and doesn't actually work. You can modify the python code to set a global variable, but it doesn't do anything different if you set it from the CLI.

Nicer error reporting when geocoder fails

When I request a location with --location and it doesn't have results I get a barfed stack trace:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/bin/ob", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/open_buildings/cli.py", line 84, in get_buildings
    geojson_data = geocode(location)
                   ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/open_buildings/cli.py", line 40, in geocode
    location = osmnx.geocode_to_gdf(data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/osmnx/geocoder.py", line 137, in geocode_to_gdf
    gdf = pd.concat([gdf, _geocode_query_to_gdf(q, wr, by_osmid)])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/qgis/lib/python3.11/site-packages/osmnx/geocoder.py", line 187, in _geocode_query_to_gdf
    raise InsufficientResponseError(msg)
osmnx._errors.InsufficientResponseError: Nominatim geocoder returned 0 results for query 'adsfsad'

It'd be better to catch the error and just inform users that their location string didn't work - they can get the geojson on their own or try a more common string.
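For illustration, the catch could be factored like this (a hypothetical sketch; in the real cli.py, `geocode_fn` would be osmnx.geocode_to_gdf and `error_types` would include osmnx's InsufficientResponseError from the traceback above):

```python
# Hypothetical sketch: wrap the geocode call so a failed lookup produces a
# short message instead of a stack trace.
def safe_geocode(location, geocode_fn, error_types=(Exception,)):
    try:
        return geocode_fn(location)
    except error_types:
        print(f"No results for location '{location}' - try a more common "
              "place name, or supply a GeoJSON AOI instead.")
        return None
```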

Install spatial extension if not installed

Environment Information

  • open_buildings version: 0.0.9
  • Python version: 3.11.3
  • Operating System: macos 12.3

Description

Just getting started with a fresh duckdb install, get_buildings fails because we try to load the spatial extension even if it has not been previously installed.

What I Did

pip install -e .
ob tools get_buildings 1.json my-buildings.geojson --country_iso RW

To Replicate

  1. Remove the spatial extension by removing the relevant files within ~/duckdb/extensions/...
  2. Try running get_buildings
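The fix could be as small as always issuing INSTALL before LOAD - DuckDB treats installing an already-installed extension as a no-op, so it's safe to run on every startup. A sketch (the function name is hypothetical):

```python
# Hypothetical sketch: make sure the spatial extension is present before
# loading it. INSTALL is idempotent in DuckDB.
def ensure_spatial(con):
    con.execute("INSTALL spatial;")
    con.execute("LOAD spatial;")
```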

Python interface for functionality

Description

I think a proper python interface would be nice to have, in addition to the CLI.

The following things should probably be changed, just gathering my thoughts here:

  • find a more descriptive name for function download()
  • use python logger instead of click.echo() - this removes the need for a bunch of flags for the download function like silent, as this is all handled by the built-in python logger. Using the python logger allows users to register their own handlers more easily. The CLI can simply translate flags into logger settings.
  • ...

I might make a PR when I have time, just wondering if you have any thoughts on this.
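The flag-to-logger translation mentioned above might look like this (a sketch, not existing code):

```python
import logging

logger = logging.getLogger("open_buildings")

# Hypothetical sketch: the CLI translates --silent/--verbose flags into a
# logger level, instead of threading a `silent` flag through every function.
def configure_logging(silent=False, verbose=False):
    level = logging.ERROR if silent else (logging.DEBUG if verbose else logging.INFO)
    logger.setLevel(level)
    return level
```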

Accept more input formats

Description

Currently the ob get_buildings utility and the underlying download_buildings.download() function accept a GeoJSON AOI, either as a file or piped from stdin. As WKT is already used under the hood, it makes sense to also accept WKT as an alternative to GeoJSON. It's easy to distinguish between both even when piped in.

Other formats could also be supported, potentially the full range of formats that is supported as output formats.
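Distinguishing the two piped formats is indeed easy - GeoJSON always starts with '{', while WKT starts with a geometry keyword. A hypothetical sketch:

```python
import json

# Hypothetical sketch: classify a piped-in AOI as GeoJSON or WKT.
def detect_aoi_format(text):
    stripped = text.lstrip()
    if stripped.startswith("{"):
        json.loads(stripped)  # raises if it isn't valid JSON
        return "geojson"
    return "wkt"
```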

Add GeoJSON to google convert and benchmark?

I'm less into this idea, as it seems like crap for this goal of working with huge files, but could be interesting to show performance and size characteristics. I do love GeoJSON, it's one of the best formats, but this is not the use case for it.
