
thinkingmachines / geomancer

214 stars, 16 forks, 1.2 MB

Automated feature engineering for geospatial data

License: MIT License

Makefile 3.38% Python 96.00% Shell 0.62%
bigquery feature-engineering geospatial machine-learning openstreetmap

geomancer's People

Contributors

jtmiclat, ljvmiranda921, magtanggol03, marksteve, tm-ardie-orden



geomancer's Issues

Querying in spatialite returns NULL

The reason for this is that we're not converting the OSM POIs into a geometry type. However, if we do the conversion in distance_to_nearest, the BQ queries will fail (they're already of geography type). Here's how we'll solve it:

  • Re-upload Philippine OSM features in osm dataset with WKT as type STRING
  • Apply core.ST_GeoFromText to the osm features
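A minimal sketch of the fix, assuming geomancer templates SQL per backend (the dict layout and the wrap_wkt helper are illustrative, not the actual core module):

```python
# The GIS functions are real (SpatiaLite's ST_GeomFromText, BigQuery's
# ST_GEOGFROMTEXT); how geomancer templates them is an assumption.
WKT_TO_GEOM = {
    "spatialite": "ST_GeomFromText({wkt}, 4326)",  # STRING WKT -> geometry
    "bigquery": "ST_GEOGFROMTEXT({wkt})",          # STRING WKT -> geography
}

def wrap_wkt(backend: str, column: str = "WKT") -> str:
    # Wrap the uploaded WKT column so each backend sees a proper geo type.
    return WKT_TO_GEOM[backend].format(wkt=column)
```

This keeps the OSM upload as plain STRING WKT while each backend applies its own conversion at query time.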

Add unit tests

The main challenge here is mocking the BigQuery client. I'm still not sure how this will all work, but we should at least have good test coverage for this project.
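One way to approach this is to wrap the client behind a thin call and stub it with unittest.mock; the run_query helper below is a hypothetical stand-in for geomancer's internals, not its actual API:

```python
from unittest import mock

def run_query(client, sql):
    # Thin wrapper around the BigQuery client; in tests the client is a mock.
    return client.query(sql).result()

def test_run_query_with_mocked_client():
    client = mock.MagicMock()
    # Stub the chained call client.query(...).result()
    client.query.return_value.result.return_value = [{"dist": 42.0}]
    rows = run_query(client, "SELECT 1")
    assert rows == [{"dist": 42.0}]
    client.query.assert_called_once_with("SELECT 1")
```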

Optimize BQ uploads by reusing tables (cache)

We upload a dataframe into BQ (as a table) on every call, which is inefficient for larger datasets. There should be a better way to:

  • Check if the dataframe in question already exists in the BigQuery dataset
  • If yes, then just get that table, else, do the upload.
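A sketch of the cache check, assuming we key tables on a content hash of the dataframe (the FakeClient and its table_exists/upload interface are illustrative, not the real BigQuery client):

```python
import hashlib

def cache_table_name(df_bytes: bytes) -> str:
    # Deterministic name derived from the dataframe contents, so identical
    # uploads always map to the same BigQuery table.
    return "geomancer_cache_" + hashlib.sha1(df_bytes).hexdigest()[:16]

def get_or_upload(client, dataset: str, df_bytes: bytes) -> str:
    name = cache_table_name(df_bytes)
    if not client.table_exists(dataset, name):  # cache miss: upload once
        client.upload(dataset, name, df_bytes)
    return name                                 # hit or fresh upload

class FakeClient:
    """In-memory stand-in for a BigQuery client wrapper (illustrative)."""
    def __init__(self):
        self.tables, self.uploads = set(), 0
    def table_exists(self, dataset, name):
        return (dataset, name) in self.tables
    def upload(self, dataset, name, data):
        self.tables.add((dataset, name))
        self.uploads += 1
```

Casting the same dataframe twice should then trigger only one upload.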

Update README

Important sections:

  • Dependencies
  • Setup
  • Basic Usage

Handle DataFrame to BigQuery Table interaction

Things to implement:

  • pandas.DataFrame to BigQuery table
  • Delete the BigQuery table at a certain trigger, i.e., add an expiry to the BQ table

Optional for now (but would probably be important later on):

  • Show current upload job
  • Export pandas.DataFrame to Avro, then upload that to BigQuery table

Dependency of #1

Add Aggregation Spells

For example,

mean_price = AggregateOf("hotel", which="price", how="mean")
max_price = AggregateOf("hotel", which="price", how="max")
min_price = AggregateOf("hotel", which="price", how="min")
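These map naturally onto SQL aggregates; here's a sketch of how the how argument might translate (the query shape and table layout are assumptions, only the aggregate names are standard SQL):

```python
# Hypothetical translation of AggregateOf's `how` argument into SQL.
AGG = {"mean": "AVG", "max": "MAX", "min": "MIN"}

def aggregate_query(on: str, which: str, how: str, source_table: str = "pois") -> str:
    return (f"SELECT {AGG[how]}({which}) AS {how}_{which} "
            f"FROM {source_table} WHERE fclass = '{on}'")
```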

Update setup.py

Tests should work with tox, which depends on setup.py.
It would be nice if we can do something like this:

$ pip install geomancer # install all dependencies
$ pip install geomancer[bq] # only installs BQ-related dependencies
$ pip install geomancer[sqlite] # only installs Spatialite-related dependencies
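This maps onto setuptools extras_require; a sketch (the exact package lists are assumptions):

```python
# Passed as setup(..., extras_require=EXTRAS_REQUIRE) in setup.py to enable
# `pip install geomancer[bq]` and `pip install geomancer[sqlite]`.
EXTRAS_REQUIRE = {
    "bq": ["google-cloud-bigquery", "pybigquery"],
    "sqlite": [],  # Spatialite-related dependencies would be listed here
}
```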

How to optimize load time for cast?

Issue Description

Getting features takes around 25 seconds per column for a data frame of 56,761 rows.

Steps to reproduce the issue

df = pd.read_csv('sample.csv')
pois_book_instance = SpellBook(
    spells = [
        DistanceToNearest(
            'police',
            source_table = 'project.dataset_id.gis_osm_pois_free_1',
            dburl = 'bigquery://project',
            feature_name = 'pois_dist_police'),
        NumberOf(
            'police',
            within = 1000,
            source_table = 'project.dataset_id.gis_osm_pois_free_1',
            dburl = 'bigquery://project',
            feature_name = 'pois_num_1000_police')
])
pois_book = pois_book_instance.cast(df)

What's the expected result?

  • hopefully faster

What's the actual result?

  • CPU times: user 2.33 s, sys: 88 ms, total: 2.42 s
    Wall time: 46.7 s

Additional details / screenshot


Explore QueryBuilder using SQLAlchemy PostGIS

Let's remove the need for writing strings and formatting them. We need a more flexible API for building queries. Advantages:

  • Backend data warehouse is now pluggable (in the future we can opt to use BigQuery OR a PostGIS server).
  • SQL queries are now in a Python-ic DSL. Much better than writing string-formatted queries.

PR Requirements

  • Port the string-formatted query into an SQLAlchemy Query. The user should still be able to supply the fclass, source_table (OSM), and BigQuery options if possible
  • No need to pass the client (?); we just need to pass the database URI.
  • If a table is uploaded into the BQ Dataset, ensure that it has an expiration date (previously, we were creating a new table per call, which is not cost-efficient).

I suggest creating a query() method and making it an abstract class method. It must raise a NotImplementedError in geomancer.base.Spell. The geomancer.base.Spell.cast() method should ideally be inherited (all subclasses should just call super()).

Ideal scenario: when implementing a new Spell (i.e., subclassing the Spell base), I shouldn't need to think about implementing cast anymore (it should be inherited wholesale via super()). Instead, I should just implement the query() method using the SQLAlchemy dialect.
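The proposed split can be sketched with abc (backend execution is stubbed out; the class internals are assumptions about the refactor, not current geomancer code):

```python
from abc import ABC, abstractmethod

class Spell(ABC):
    @abstractmethod
    def query(self) -> str:
        # Subclasses must build their own query here.
        raise NotImplementedError

    def cast(self, df):
        # Inherited by every spell: build the query, hand it to the backend.
        return self._execute(self.query(), df)

    def _execute(self, q, df):
        # Placeholder for the engine call (create_engine(dburl), etc.).
        return q

class DistanceToNearest(Spell):
    def __init__(self, on: str):
        self.on = on

    def query(self) -> str:
        return f"... WHERE fclass = '{self.on}'"
```

Instantiating Spell directly fails with a TypeError, while subclasses get cast() for free.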

Add support for building feature extraction

Six different types of buildings: (1) residential, (2) damaged, (3) commercial, (4) industrial, (5) education, (6) health.

For each lat-lng and specified radius or bounding box, extract the following features:

  • total number of buildings
  • the total area of buildings
  • the mean area of buildings
  • proportion of area occupied by the buildings

Add support for road feature extraction

Five types of roads: (1) primary, (2) trunk, (3) paved, (4) unpaved, and (5) intersection.

For each lat-lng and specified radius or bounding box, extract the following features:

  • Distance to closest road
  • Total number of roads
  • Total road length

[Bug] "NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:bigquery" when calling the `cast` method from `DistanceToNearest`

After running the code below on Google Colab:

dist_spell = DistanceToNearest(
    "hospital",
    source_table="phcovid.gis_osm_pois_free_1",
    feature_name="dist_hospital",
    dburl="bigquery://ml-prototypes",
).cast(df)

The following error occurs:

2020-03-29 03:15:18.390 | ERROR | main::6 - An error has been caught in function '', process 'MainProcess' (119), thread 'MainThread' (140460848953216):
Traceback (most recent call last):
  ...
  File "/usr/local/lib/python3.6/dist-packages/geomancer/spells/base.py", line 184, in cast
    engine = core.get_engine()
  File "/usr/local/lib/python3.6/dist-packages/geomancer/backend/cores/base.py", line 108, in get_engine
    return create_engine(self.dburl)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/init.py", line 423, in create_engine
    return strategy.create(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/strategies.py", line 61, in create
    entrypoint = u._get_entrypoint()
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/url.py", line 172, in _get_entrypoint
    cls = registry.load(name)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/langhelpers.py", line 240, in load
    raise exc.NoSuchModuleError(
        "Can't load plugin: %s:%s" % (self.group, name)
    )

sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:bigquery
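The bigquery dialect is registered with SQLAlchemy by the pybigquery package, so the likely fix is installing it first (e.g. pip install pybigquery, or geomancer's BQ extras if available) before calling cast. A stdlib-only sanity check, as a sketch:

```python
import importlib.util

def bigquery_dialect_installed() -> bool:
    # pybigquery registers the "bigquery" entry point under
    # sqlalchemy.dialects; if it's absent, create_engine("bigquery://...")
    # raises the NoSuchModuleError shown above.
    return importlib.util.find_spec("pybigquery") is not None
```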

Add LengthOf function

The LengthOf function would compute the length of all line features inside a certain circular radius centered at a location with a certain lat and lon.
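In BigQuery-flavored SQL this might look like the following template (the table and column names are assumptions; ST_LENGTH, ST_DWITHIN, and ST_GEOGPOINT are real BigQuery GIS functions):

```python
# Hypothetical query template for LengthOf; {source_table}, {lon}, {lat},
# and {radius} would be filled in by the spell.
LENGTH_OF_SQL = """
SELECT SUM(ST_LENGTH(line.geom)) AS total_length
FROM {source_table} AS line
WHERE ST_DWITHIN(line.geom, ST_GEOGPOINT({lon}, {lat}), {radius})
"""
```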

Add CI/CD Pipeline

Some options:

Requirements

  1. Must be free for open-source projects
  2. Easy deployment (just a yaml file, no need for on-prem solutions)
  3. Should support Docker, build matrices, and cron jobs

Add NumberOf function

Add a function to get the NumberOf point features within a certain radius.

The NumberOf function should inherit from the Spell class and accept arguments for poi_type and radius.

Make sure SpellBook indices align

Right now, SpellBook relies on a simple ordered index from the input dataframe. If that's not the case, the output dataframe will be misaligned.
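One fix is to join features back onto the inputs by an explicit key rather than by position; a stdlib sketch (the __index__ key column is hypothetical):

```python
def align_features(rows, features, key="__index__"):
    # Join features onto the inputs by a carried-along key column instead of
    # by position, so shuffled or non-default indices stay aligned.
    lookup = {f[key]: f for f in features}
    return [{**row, **lookup[row[key]]} for row in rows]
```

With pandas this would amount to preserving df.index through the query and merging on it rather than concatenating positionally.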

Add functionality for multiple `on` parameters for spells

Right now, a user can only pass one `on` parameter to any of the Spells.

dist_primary = DistanceToNearest(
    on="primary",
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary")

The on parameter is equivalent to a WHERE clause in SQL.

WHERE fclass = 'primary'

There might be cases wherein multiple on parameters are needed for feature engineering and it'd be useful to have that option available out of the box.

dist_primary_or_secondary = DistanceToNearest(
    on=["primary", "secondary"],
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary_or_secondary")

WHERE fclass = 'primary' OR fclass = 'secondary'
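The list-to-filter translation is small; a sketch (build_where is a hypothetical helper, not geomancer's API):

```python
def build_where(on):
    # Accept a single fclass or a list of them and emit the equivalent filter.
    values = [on] if isinstance(on, str) else list(on)
    return "WHERE " + " OR ".join(f"fclass = '{v}'" for v in values)
```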

Can't extract features when the fclass count exceeds ~11,000

Issue Description

Unable to extract features for a data frame of 56,761 rows when the feature class has a count of around 11,000 in its dataset.

(It works when the feature class count is around 5,000; a count of around 7,000 works only sometimes.)

Steps to reproduce the issue

df = pd.read_csv('sample.csv')

instance_primary = DistanceToNearest(
    'primary',
    source_table = 'project.dataset_id.gis_osm_roads_free_1',
    feature_name = 'dist_primary'
)

test_primary = instance_primary.cast(df, dburl = 'bigquery://project')

What's the expected result?

  • no error

What's the actual result?

Additional details / screenshot


Add PostGIS DBCore

Check implementation of geomancer.backend.cores.DBCore and BigQueryCore

Add SpellBook

Usage ideas:

from geomancer import SpellBook

# When you want to register spells
my_spellbook = SpellBook([
  DistanceToNearest("embassy", within=10000, source_table="tm-geospatial.ph_osm.pois"), # From BQ
  DistanceToNearest("hospital", within=5000, source_table="pois"), # From Spatialite
])

# You can then do multiple casts
my_features = my_spellbook.cast(df, host=[bigquery.Client(), "tests/data/source.sqlite"])

# Saving the Spellbook
my_spellbook.author = "Lj Miranda" # optional 
my_spellbook.description = "Some cool features for other stuff" # optional
my_spellbook.to_json("path/to/my/own/features.json")

Some potential challenges:

  • It is possible to create a spellbook with spells coming from different warehouses (one feature from BQ, another from SQLite, etc.). However, setting the source_table and the host is decoupled (one during init, another during cast()).
  • Concatenating everything into a single dataframe (similar output column names, etc.). We should do some validation before concatenating.

Some preliminary tasks:

  • Write down all possible metadata to include: things that are automatically generated (date? unique ID? etc.) and those that are manually set.

Fix expiry date creation

The expiry should just be 3 hours from now, but because we're working at different offsets from UTC, the timedelta comes out different. This should be solved by using pytz.utc.

It should be something like this
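A minimal sketch using a timezone-aware "now" (datetime.timezone.utc is equivalent to pytz.utc for this purpose):

```python
from datetime import datetime, timedelta, timezone

def table_expiry(hours: int = 3):
    # Anchor the expiry to UTC so the 3-hour window doesn't drift with the
    # machine's local timezone.
    return datetime.now(timezone.utc) + timedelta(hours=hours)
```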

Overly-verbose error logging

I think we should stop using logger.catch, or at least catch common errors early rather than propagating them internally.
