
thinkingmachines / geomancer

214 stars, 16 forks, 1.2 MB

Automated feature engineering for geospatial data

License: MIT License

Makefile 3.38% Python 96.00% Shell 0.62%
bigquery feature-engineering geospatial machine-learning openstreetmap

geomancer's People

Contributors

jtmiclat, ljvmiranda921, magtanggol03, marksteve, tm-ardie-orden



geomancer's Issues

Querying in spatialite returns NULL

The reason for this is that we're not converting the OSM POIs into a geometry type. However, if we do the conversion in distance_to_nearest, the BQ queries will fail (they're already of geography type). Here's how we'll solve it:

  • Re-upload Philippine OSM features in osm dataset with WKT as type STRING
  • Apply core.ST_GeoFromText to the osm features
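A minimal sketch of the fix, assuming geomancer templates SQL per backend (the dict layout and the wrap_wkt helper are illustrative, not the actual core module):

```python
# The GIS functions are real (SpatiaLite's ST_GeomFromText, BigQuery's
# ST_GEOGFROMTEXT); how geomancer templates them is an assumption.
WKT_TO_GEOM = {
    "spatialite": "ST_GeomFromText({wkt}, 4326)",  # STRING WKT -> geometry
    "bigquery": "ST_GEOGFROMTEXT({wkt})",          # STRING WKT -> geography
}

def wrap_wkt(backend: str, column: str = "WKT") -> str:
    # Wrap the uploaded WKT column so each backend sees a proper geo type.
    return WKT_TO_GEOM[backend].format(wkt=column)
```

This keeps the OSM upload as plain STRING WKT while each backend applies its own conversion at query time.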

Add unit tests

The main challenge here is mocking the BigQuery client. I'm still not sure how this will all work, but we should at least have good test coverage for this project.
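One way to approach this is to wrap the client behind a thin call and stub it with unittest.mock; the run_query helper below is a hypothetical stand-in for geomancer's internals, not its actual API:

```python
from unittest import mock

def run_query(client, sql):
    # Thin wrapper around the BigQuery client; in tests the client is a mock.
    return client.query(sql).result()

def test_run_query_with_mocked_client():
    client = mock.MagicMock()
    # Stub the chained call client.query(...).result()
    client.query.return_value.result.return_value = [{"dist": 42.0}]
    rows = run_query(client, "SELECT 1")
    assert rows == [{"dist": 42.0}]
    client.query.assert_called_once_with("SELECT 1")
```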

Optimize BQ uploads by reusing tables (cache)

We upload a dataframe into BQ (as a table) on every call, which is inefficient for larger datasets. There should be a better way to:

  • Check if the dataframe in question already exists in the BigQuery dataset
  • If yes, then just get that table, else, do the upload.
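A sketch of the cache check, assuming we key tables on a content hash of the dataframe (the FakeClient and its table_exists/upload interface are illustrative, not the real BigQuery client):

```python
import hashlib

def cache_table_name(df_bytes: bytes) -> str:
    # Deterministic name derived from the dataframe contents, so identical
    # uploads always map to the same BigQuery table.
    return "geomancer_cache_" + hashlib.sha1(df_bytes).hexdigest()[:16]

def get_or_upload(client, dataset: str, df_bytes: bytes) -> str:
    name = cache_table_name(df_bytes)
    if not client.table_exists(dataset, name):  # cache miss: upload once
        client.upload(dataset, name, df_bytes)
    return name                                 # hit or fresh upload

class FakeClient:
    """In-memory stand-in for a BigQuery client wrapper (illustrative)."""
    def __init__(self):
        self.tables, self.uploads = set(), 0
    def table_exists(self, dataset, name):
        return (dataset, name) in self.tables
    def upload(self, dataset, name, data):
        self.tables.add((dataset, name))
        self.uploads += 1
```

Casting the same dataframe twice should then trigger only one upload.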

Update README

Important sections:

  • Dependencies
  • Setup
  • Basic Usage

Handle DataFrame to BigQuery Table interaction

Things to implement:

  • pandas.DataFrame to BigQuery table
  • Delete the BigQuery table at a certain trigger, i.e., add an expiry to the BQ table

Optional for now (but would probably be important later on):

  • Show current upload job
  • Export pandas.DataFrame to Avro, then upload that to BigQuery table

Dependency of #1

Add Aggregation Spells

For example,

mean_price = AggregateOf("hotel", which="price", how="mean")
max_price = AggregateOf("hotel", which="price", how="max")
min_price = AggregateOf("hotel", which="price", how="min")
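These map naturally onto SQL aggregates; here's a sketch of how the how argument might translate (the query shape and table layout are assumptions, only the aggregate names are standard SQL):

```python
# Hypothetical translation of AggregateOf's `how` argument into SQL.
AGG = {"mean": "AVG", "max": "MAX", "min": "MIN"}

def aggregate_query(on: str, which: str, how: str, source_table: str = "pois") -> str:
    return (f"SELECT {AGG[how]}({which}) AS {how}_{which} "
            f"FROM {source_table} WHERE fclass = '{on}'")
```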

Update setup.py

Tests should work with tox, which depends on setup.py.
It would be nice if we can do something like this:

$ pip install geomancer # install all dependencies
$ pip install geomancer[bq] # only installs BQ-related dependencies
$ pip install geomancer[sqlite] # only installs Spatialite-related dependencies
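This maps onto setuptools extras_require; a sketch (the exact package lists are assumptions):

```python
# Passed as setup(..., extras_require=EXTRAS_REQUIRE) in setup.py to enable
# `pip install geomancer[bq]` and `pip install geomancer[sqlite]`.
EXTRAS_REQUIRE = {
    "bq": ["google-cloud-bigquery", "pybigquery"],
    "sqlite": [],  # Spatialite-related dependencies would be listed here
}
```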

How to optimize load time for cast?

Issue Description

Getting features takes around 25 seconds per column for a data frame of 56,761 rows.

Steps to reproduce the issue

df = pd.read_csv('sample.csv')
pois_book_instance = SpellBook(
    spells = [
        DistanceToNearest(
            'police',
            source_table = 'project.dataset_id.gis_osm_pois_free_1',
            dburl = 'bigquery://project',
            feature_name = 'pois_dist_police'),
        NumberOf(
            'police',
            within = 1000,
            source_table = 'project.dataset_id.gis_osm_pois_free_1',
            dburl = 'bigquery://project',
            feature_name = 'pois_num_1000_police')
])
pois_book = pois_book_instance.cast(df)

What's the expected result?

  • hopefully faster

What's the actual result?

  • CPU times: user 2.33 s, sys: 88 ms, total: 2.42 s
    Wall time: 46.7 s

Additional details / screenshot


Explore QueryBuilder using SQLAlchemy PostGIS

Let's remove the need for writing strings and formatting them. We need a more flexible API for building queries. Advantages:

  • Backend data warehouse is now pluggable (in the future we can opt to use BigQuery OR a PostGIS server).
  • SQL queries are now in a Python-ic DSL. Much better than writing string-formatted queries.

PR Requirements

  • Port the string-formatted query into an SQLAlchemy Query. The user should still be able to supply the fclass, source_table (OSM), and BigQuery options if possible
  • No need to pass the client (?); we just need to pass the database URI.
  • If a table is uploaded into the BQ Dataset, ensure that it has an expiration date (previously, we were creating a new table per call, which is not cost-efficient).

I suggest creating a query() method and making it an abstract class method. It must raise a NotImplementedError in geomancer.base.Spell. The geomancer.base.Spell.cast() method should ideally be inherited (all subclasses should just call super()).

Ideal scenario: when implementing a new Spell (i.e., subclassing the Spell base), I shouldn't need to think about implementing cast anymore (it should be inherited wholesale via super()). Instead, I should just implement the query() method using the SQLAlchemy dialect.
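The proposed split can be sketched with abc (backend execution is stubbed out; the class internals are assumptions about the refactor, not current geomancer code):

```python
from abc import ABC, abstractmethod

class Spell(ABC):
    @abstractmethod
    def query(self) -> str:
        # Subclasses must build their own query here.
        raise NotImplementedError

    def cast(self, df):
        # Inherited by every spell: build the query, hand it to the backend.
        return self._execute(self.query(), df)

    def _execute(self, q, df):
        # Placeholder for the engine call (create_engine(dburl), etc.).
        return q

class DistanceToNearest(Spell):
    def __init__(self, on: str):
        self.on = on

    def query(self) -> str:
        return f"... WHERE fclass = '{self.on}'"
```

Instantiating Spell directly fails with a TypeError, while subclasses get cast() for free.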

Add support for building feature extraction

Six different types of buildings: (1) residential, (2) damaged, (3) commercial, (4) industrial, (5) education, (6) health.

For each lat-lng and specified radius or bounding box, extract the following features:

  • total number of buildings
  • the total area of buildings
  • the mean area of buildings
  • proportion of area occupied by the buildings

Add support for road feature extraction

Five types of roads: (1) primary, (2) trunk, (3) paved, (4) unpaved, and (5) intersection.

For each lat-lng and specified radius or bounding box, extract the following features:

  • Distance to closest road
  • Total number of roads
  • Total road length

[Bug] "NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:bigquery" when calling the `cast` method from `DistanceToNearest`

After running the code below on Google Colab:

dist_spell = DistanceToNearest(
    "hospital",
    source_table="phcovid.gis_osm_pois_free_1",
    feature_name="dist_hospital",
    dburl="bigquery://ml-prototypes",
).cast(df)

The following error occurs:

2020-03-29 03:15:18.390 | ERROR | main::6 - An error has been caught in function '', process 'MainProcess' (119), thread 'MainThread' (140460848953216):
Traceback (most recent call last):
  ...
  File "/usr/local/lib/python3.6/dist-packages/geomancer/spells/base.py", line 184, in cast
    engine = core.get_engine()
  File "/usr/local/lib/python3.6/dist-packages/geomancer/backend/cores/base.py", line 108, in get_engine
    return create_engine(self.dburl)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/init.py", line 423, in create_engine
    return strategy.create(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/strategies.py", line 61, in create
    entrypoint = u._get_entrypoint()
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/url.py", line 172, in _get_entrypoint
    cls = registry.load(name)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/langhelpers.py", line 240, in load
    raise exc.NoSuchModuleError(
        "Can't load plugin: %s:%s" % (self.group, name)
    )

sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:bigquery
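The bigquery dialect is registered with SQLAlchemy by the pybigquery package, so the likely fix is installing it first (e.g. pip install pybigquery, or geomancer's BQ extras if available) before calling cast. A stdlib-only sanity check, as a sketch:

```python
import importlib.util

def bigquery_dialect_installed() -> bool:
    # pybigquery registers the "bigquery" entry point under
    # sqlalchemy.dialects; if it's absent, create_engine("bigquery://...")
    # raises the NoSuchModuleError shown above.
    return importlib.util.find_spec("pybigquery") is not None
```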

Add LengthOf function

The LengthOf function would compute the length of all line features inside a certain circular radius centered at a location with a certain lat and lon.
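In BigQuery-flavored SQL this might look like the following template (the table and column names are assumptions; ST_LENGTH, ST_DWITHIN, and ST_GEOGPOINT are real BigQuery GIS functions):

```python
# Hypothetical query template for LengthOf; {source_table}, {lon}, {lat},
# and {radius} would be filled in by the spell.
LENGTH_OF_SQL = """
SELECT SUM(ST_LENGTH(line.geom)) AS total_length
FROM {source_table} AS line
WHERE ST_DWITHIN(line.geom, ST_GEOGPOINT({lon}, {lat}), {radius})
"""
```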

Add CI/CD Pipeline

Some options:

Requirements

  1. Must be free for open-source projects
  2. Easy deployment (just a yaml file, no need for on-prem solutions)
  3. Should support Docker, build matrices, and cron jobs

Add NumberOf function

Add a function to get the NumberOf point features within a certain radius.

The NumberOf function should inherit from the Spell class and accept arguments for poi_type and radius.

Make sure SpellBook indices align

Right now, SpellBook relies on a simple ordered index from the input dataframe. If that's not the case, the output dataframe will be misaligned.
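One fix is to join features back onto the inputs by an explicit key rather than by position; a stdlib sketch (the __index__ key column is hypothetical):

```python
def align_features(rows, features, key="__index__"):
    # Join features onto the inputs by a carried-along key column instead of
    # by position, so shuffled or non-default indices stay aligned.
    lookup = {f[key]: f for f in features}
    return [{**row, **lookup[row[key]]} for row in rows]
```

With pandas this would amount to preserving df.index through the query and merging on it rather than concatenating positionally.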

Add functionality for multiple `on` parameters for spells

Right now, a user can only pass one `on` parameter to any of the Spells.

dist_primary = DistanceToNearest(
    on="primary",
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary")

The on parameter is equivalent to a WHERE clause in SQL.

WHERE fclass = 'primary'

There might be cases wherein multiple on parameters are needed for feature engineering and it'd be useful to have that option available out of the box.

dist_primary_or_secondary = DistanceToNearest(
    on=["primary", "secondary"],
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary_or_secondary")

WHERE fclass = 'primary' OR fclass = 'secondary'
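The list-to-filter translation is small; a sketch (build_where is a hypothetical helper, not geomancer's API):

```python
def build_where(on):
    # Accept a single fclass or a list of them and emit the equivalent filter.
    values = [on] if isinstance(on, str) else list(on)
    return "WHERE " + " OR ".join(f"fclass = '{v}'" for v in values)
```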

Can't extract features when the fclass count exceeds ~11,000

Issue Description

Unable to extract features for a data frame of 56,761 rows when the feature class has a count of around 11,000 in its dataset.

(It works when the feature class count is around 5,000; a count of around 7,000 works only sometimes.)

Steps to reproduce the issue

df = pd.read_csv('sample.csv')

instance_primary = DistanceToNearest(
    'primary',
    source_table = 'project.dataset_id.gis_osm_roads_free_1',
    feature_name = 'dist_primary'
)

test_primary = instance_primary.cast(df, dburl = 'bigquery://project')

What's the expected result?

  • no error

What's the actual result?

Additional details / screenshot


Add PostGIS DBCore

Check implementation of geomancer.backend.cores.DBCore and BigQueryCore

Add SpellBook

Usage ideas:

from geomancer import SpellBook

# When you want to register spells
my_spellbook = SpellBook([
  DistanceToNearest("embassy", within=10000, source_table="tm-geospatial.ph_osm.pois"), # From BQ
  DistanceToNearest("hospital", within=5000, source_table="pois"), # From Spatialite
])

# You can then do multiple casts
my_features = my_spellbook.cast(df, host=[bigquery.Client(), "tests/data/source.sqlite"])

# Saving the Spellbook
my_spellbook.author = "Lj Miranda" # optional 
my_spellbook.description = "Some cool features for other stuff" # optional
my_spellbook.to_json("path/to/my/own/features.json")

Some potential challenges:

  • It is possible to create a spellbook with spells coming from different warehouses (one feature from BQ, another from SQLite, etc.). However, setting the source_table and the host is decoupled (one during init, another during cast()).
  • Concatenating everything into a single dataframe (similar output column names, etc.). We should do some validation before concatenating.

Some preliminary tasks:

  • Write down all possible metadata to include: things that are automatically generated (date? unique ID? etc.) and those that are manually set.

Fix expiry date creation

The expiry should just be 3 hours from now, but because we're working at different offsets from UTC, the timedelta comes out different. This should be solved by using pytz.utc.

It should be something like this
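A minimal sketch using a timezone-aware "now" (datetime.timezone.utc is equivalent to pytz.utc for this purpose):

```python
from datetime import datetime, timedelta, timezone

def table_expiry(hours: int = 3):
    # Anchor the expiry to UTC so the 3-hour window doesn't drift with the
    # machine's local timezone.
    return datetime.now(timezone.utc) + timedelta(hours=hours)
```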

Overly-verbose error logging

I think we should stop using logger.catch, or at least catch common errors early rather than propagating them internally.
