thinkingmachines / geomancer
Automated feature engineering for geospatial data
License: MIT License
base.Spell.cast(). It should only give me the source, target, and engine for the query to execute.
Let's preserve the positional accuracy of the input WKTs and output WKTs. 8 decimal places is a reasonable threshold: it resolves to about 1 mm on the ground, and 9 decimal places is already about 0.1 mm.
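The arithmetic behind that threshold can be checked with a short sketch. It assumes roughly 111,320 m per degree (the length of a degree at the equator, rounded); the function name is made up for illustration.

```python
# Approximate length of one degree of latitude (or of longitude at the
# equator), in metres. A rounded figure used only for illustration.
METERS_PER_DEGREE = 111_320


def resolution_mm(decimal_places: int) -> float:
    """Ground distance, in millimetres, of one unit in the last decimal place."""
    return METERS_PER_DEGREE * 10 ** -decimal_places * 1000


for places in (6, 7, 8, 9):
    print(f"{places} decimal places ~ {resolution_mm(places):.3f} mm")
```

Each extra decimal place shrinks the resolvable distance by a factor of ten, so going past 8 places mostly stores noise for typical GPS or OSM coordinates.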
We should first validate that the input columns exist before running the actual query.
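A minimal sketch of such a pre-flight check, assuming the required column is called WKT as in the examples elsewhere in this tracker. Both helper names are hypothetical and not part of geomancer; they accept any iterable of column names, so df.columns from a pandas DataFrame would work.

```python
from typing import Iterable, List


def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in the order given."""
    present = set(available)
    return [name for name in required if name not in present]


def validate_columns(available: Iterable[str], required: Iterable[str]) -> None:
    """Raise early, before any query is built or sent to the database."""
    missing = missing_columns(available, required)
    if missing:
        raise ValueError(
            f"Input is missing required column(s): {', '.join(missing)}"
        )


# With a pandas DataFrame this would be validate_columns(df.columns, ["WKT"]).
validate_columns(["WKT", "name"], ["WKT"])  # passes silently
```

Failing here gives the user a short, readable message instead of a backend error raised from deep inside the query.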
The reason for this is that we're not converting the OSM POIs into a geometry type. However, if we do it in distance_to_nearest, then the BQ queries will fail (because they're already in geography type). Here's how we will solve it:
osm dataset with WKT as type STRING
core.ST_GeoFromText to the osm features
The main challenge here is to mock the bigquery client. Still not sure how this will all work, but we should have at least good test coverage for this project.
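One possible approach, sketched with the standard library's unittest.mock: pass the client in as an argument and replace it with a MagicMock in tests. The count_rows helper and the table name are hypothetical. The only BigQuery calls assumed are Client.query() and the job's result(), which the mock imitates.

```python
from unittest import mock


def count_rows(client, table: str) -> int:
    """Hypothetical helper: run a COUNT query and return the single value."""
    job = client.query(f"SELECT COUNT(*) AS n FROM `{table}`")
    rows = list(job.result())
    return rows[0]["n"]


def test_count_rows_uses_client_query():
    # MagicMock stands in for google.cloud.bigquery.Client, so the test needs
    # neither network access nor credentials.
    client = mock.MagicMock()
    client.query.return_value.result.return_value = [{"n": 3}]

    assert count_rows(client, "project.dataset.pois") == 3
    client.query.assert_called_once()
    assert "project.dataset.pois" in client.query.call_args[0][0]


test_count_rows_uses_client_query()
```

This only verifies that the right SQL is sent and that results are unpacked correctly. Whether the SQL itself is valid still needs an integration test against a real backend.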
We upload a dataframe into BQ (as a table) for every call. This is inefficient given larger datasets. There should be a better way to:
This means that the df in cast(df) can be a DataFrame or a string
Important sections:
Things to implement:
pandas.DataFrame to BigQuery table
Optional for now (but would probably be important later on):
pandas.DataFrame to Avro, then upload that to BigQuery table
Dependency of #1
Depends on #30
Just follow this tutorial: https://romanvm.pythonanywhere.com/post/using-docker-travis-continuous-integration-25/
For example,
mean_price = AggregateOf("hotel", which="price", how="mean")
max_price = AggregateOf("hotel", which="price", how="max")
min_price = AggregateOf("hotel", which="price", how="min")
During tests, loading into the database takes precedence over downloading the source.sqlite. The best solution is to provide different paths for each.
Tests should work with tox, which depends on setup.py.
It would be nice if we can do something like this:
$ pip install geomancer # install all dependencies
$ pip install geomancer[bq] # only installs BQ-related dependencies
$ pip install geomancer[sqlite] # only installs Spatialite-related dependencies
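A sketch of how optional dependency groups like these are usually declared with setuptools. The package names are placeholders chosen for illustration, not the project's confirmed requirements.

```python
# Hypothetical dependency groups; the package names are placeholders, not the
# project's actual requirements.
extras_require = {
    "bq": ["pybigquery"],
    "sqlite": ["geoalchemy2"],
}

# "all" is the de-duplicated, sorted union of every optional group.
extras_require["all"] = sorted(
    {pkg for group in extras_require.values() for pkg in group}
)

# In setup.py this dict would be passed along as
# setuptools.setup(..., extras_require=extras_require).
print(extras_require)
```

Extras add packages on top of install_requires and cannot subtract from it. A plain pip install geomancer would therefore install only the core requirements, and installing everything would need an explicit extra such as geomancer[all].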
Getting features takes around 25 seconds per column for a data frame of 56,761 rows.
df = pd.read_csv('sample.csv')
pois_book_instance = SpellBook(
    spells=[
        DistanceToNearest(
            'police',
            source_table='project.dataset_id.gis_osm_pois_free_1',
            dburl='bigquery://project',
            feature_name='pois_dist_police'),
        NumberOf(
            'police',
            within=1000,
            source_table='project.dataset_id.gis_osm_pois_free_1',
            dburl='bigquery://project',
            feature_name='pois_num_1000_police')
    ])
pois_book = pois_book_instance.cast(df)
Let's remove the need for writing strings and formatting them. We need a more flexible API for building queries. Advantages:
fclass, source_table (OSM) and BigQuery options if possible
I suggest creating a query() method and making it an abstract method. It must raise a NotImplementedError in geomancer.base.Spell. The geomancer.base.Spell.cast() method should ideally be inherited (all subclasses should just call super()).
Ideal scenario: when implementing a new Spell (i.e., subclassing the Spell base), I don't need to think about implementing cast anymore (in fact, this whole thing should be inherited via super()). Instead, I should just implement the query() method using the SQLAlchemy dialect.
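The proposed split between an inherited cast() and a per-spell query() could look roughly like the sketch below. The classes are simplified stand-ins, not the real geomancer.base.Spell. To stay dependency-free, query() returns a plain tuple where a real implementation would return a SQLAlchemy selectable.

```python
class Spell:
    """Simplified stand-in for geomancer.base.Spell (not the real class)."""

    def __init__(self, on: str, source_table: str, feature_name: str):
        self.on = on
        self.source_table = source_table
        self.feature_name = feature_name

    def query(self, source, target, engine):
        """Build the backend query. Subclasses must override this."""
        raise NotImplementedError("Subclasses of Spell must implement query()")

    def cast(self, df, engine=None):
        """Shared pipeline that every subclass inherits unchanged."""
        source = self.source_table  # placeholder for the reflected source table
        target = df                 # placeholder for the uploaded target table
        return self.query(source, target, engine)


class DistanceToNearest(Spell):
    def query(self, source, target, engine):
        # A real implementation would return a SQLAlchemy selectable here.
        return ("distance_to_nearest", self.on, source)


spell = DistanceToNearest("hospital", source_table="pois", feature_name="dist")
print(spell.cast(df=None))
```

With this layout, adding a new spell means writing one query() method. Upload, execution, and result handling stay in the base class.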
Six different types of buildings: (1) residential, (2) damaged, (3) commercial, (4) industrial, (5) education, (6) health.
For each lat-lng and specified radius or bounding box, extract the following features:
Five types of roads: (1) primary, (2) trunk, (3) paved, (4) unpaved, and (5) intersection.
For each lat-lng and specified radius or bounding box, extract the following features:
After running the code below on Google Colab:
dist_spell = DistanceToNearest(
    "hospital",
    source_table="phcovid.gis_osm_pois_free_1",
    feature_name="dist_hospital",
    dburl="bigquery://ml-prototypes",
).cast(df)
The following error occurs:
2020-03-29 03:15:18.390 | ERROR | main::6 - An error has been caught in function '', process 'MainProcess' (119), thread 'MainThread' (140460848953216):
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
└ ModuleSpec(name='ipykernel_launcher', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7fbf94d60cf8>, origin='...
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
│ └ {'name': 'main', 'doc': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
└ <code object at 0x7fbf94db0660, file "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 5>
File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in
app.launch_new_instance()
│ └ <bound method Application.launch_instance of <class 'ipykernel.kernelapp.IPKernelApp'>>
└ <module 'ipykernel.kernelapp' from '/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py'>
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 664, in launch_instance
app.start()
│ └ <function IPKernelApp.start at 0x7fbf8f501ea0>
└ <ipykernel.kernelapp.IPKernelApp object at 0x7fbf94f0ac50>
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
│ │ └ <staticmethod object at 0x7fbf90622a20>
│ └ <class 'tornado.ioloop.IOLoop'>
└ <module 'tornado.ioloop' from '/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py'>
File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
handler_func(fd_obj, events)
│ │ └ 1
│ └ <zmq.sugar.socket.Socket object at 0x7fbf7f3f1660>
└ <function wrap..null_wrapper at 0x7fbf7f3f4730>
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
│ │ └ {}
│ └ (<zmq.sugar.socket.Socket object at 0x7fbf7f3f1660>, 1)
└ <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7fbf7f3deef0>>
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
│ └ <function ZMQStream._handle_recv at 0x7fbf8fa210d0>
└ <zmq.eventloop.zmqstream.ZMQStream object at 0x7fbf7f3deef0>
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
│ │ │ └ [<zmq.sugar.frame.Frame object at 0x7fbf6d6886c0>, <zmq.sugar.frame.Frame object at 0x7fbf6d688778>, <zmq.sugar.frame.Frame o...
│ │ └ <function wrap..null_wrapper at 0x7fbf72c45730>
│ └ <function ZMQStream._run_callback at 0x7fbf8fa1bf28>
└ <zmq.eventloop.zmqstream.ZMQStream object at 0x7fbf7f3deef0>
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
│ │ └ {}
│ └ ([<zmq.sugar.frame.Frame object at 0x7fbf6d6886c0>, <zmq.sugar.frame.Frame object at 0x7fbf6d688778>, <zmq.sugar.frame.Frame ...
└ <function wrap..null_wrapper at 0x7fbf72c45730>
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
│ │ └ {}
│ └ ([<zmq.sugar.frame.Frame object at 0x7fbf6d6886c0>, <zmq.sugar.frame.Frame object at 0x7fbf6d688778>, <zmq.sugar.frame.Frame ...
└ <function Kernel.start..make_dispatcher..dispatcher at 0x7fbf72c456a8>
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
│ │ │ └ [<zmq.sugar.frame.Frame object at 0x7fbf6d6886c0>, <zmq.sugar.frame.Frame object at 0x7fbf6d688778>, <zmq.sugar.frame.Frame o...
│ │ └ <zmq.eventloop.zmqstream.ZMQStream object at 0x7fbf7f3deef0>
│ └ <function Kernel.dispatch_shell at 0x7fbf8f56b1e0>
└ <google.colab._kernel.Kernel object at 0x7fbf7f183320>
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
│ │ │ └ {'header': {'username': 'username', 'msg_type': 'execute_request', 'msg_id': 'cf57ce7b1af942c3e5e49e6c3d24d01a', 'version': '...
│ │ └ [b'436ca7d2de314043e8098971c7ce30c9']
│ └ <zmq.eventloop.zmqstream.ZMQStream object at 0x7fbf7f3deef0>
└ <bound method Kernel.execute_request of <google.colab._kernel.Kernel object at 0x7fbf7f183320>>
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
│ └ True
└ {}
File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
│ │ │ │ └ False
│ │ │ └ True
│ │ └ 'dist_spell = DistanceToNearest(\n "hospital",\n source_table="phcovid.gis_osm_pois_free_1",\n feature_name="dist_ho...
│ └ <function ZMQInteractiveShell.run_cell at 0x7fbf8f4fe158>
└ <google.colab._shell.Shell object at 0x7fbf7f183240>
File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
│ │ │ └ {'store_history': True, 'silent': False}
│ │ └ ('dist_spell = DistanceToNearest(\n "hospital",\n source_table="phcovid.gis_osm_pois_free_1",\n feature_name="dist_h...
│ └ <google.colab._shell.Shell object at 0x7fbf7f183240>
└ <class 'ipykernel.zmqshell.ZMQInteractiveShell'>
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
│ │ └ <ExecutionResult object at 7fbf6f4772e8, execution_count=58 error_before_exec=None error_in_exec=None result=None>
│ └ <IPython.core.compilerop.CachingCompiler object at 0x7fbf7f3ded68>
└ 'last_expr'
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
if self.run_code(code, result):
│ │ │ └ <ExecutionResult object at 7fbf6f4772e8, execution_count=58 error_before_exec=None error_in_exec=None result=None>
│ │ └ <code object at 0x7fbf6e50f540, file "", line 1>
│ └ <function InteractiveShell.run_code at 0x7fbf929eb400>
└ <google.colab._shell.Shell object at 0x7fbf7f183240>
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
│ │ │ │ └ {'name': 'main', 'doc': 'Automatically created module for IPython interactive environment', 'package': None, ...
│ │ │ └ <google.colab._shell.Shell object at 0x7fbf7f183240>
│ │ └ <property object at 0x7fbf92f7a9a8>
│ └ <google.colab._shell.Shell object at 0x7fbf7f183240>
└ <code object at 0x7fbf6e50f540, file "", line 1>
File "", line 6, in
).cast(df)
└ WKT ... name
0 POINT (120.6202008 14.3854827) ...
File "/usr/local/lib/python3.6/dist-packages/geomancer/spells/base.py", line 184, in cast
engine = core.get_engine()
│ └ <function DBCore.get_engine at 0x7fbf6d68c0d0>
└ <geomancer.backend.cores.bq.BigQueryCore object at 0x7fbf6d3e1e48>
File "/usr/local/lib/python3.6/dist-packages/geomancer/backend/cores/base.py", line 108, in get_engine
return create_engine(self.dburl)
│ │ └ bigquery://ml-prototypes
│ └ <geomancer.backend.cores.bq.BigQueryCore object at 0x7fbf6d3e1e48>
└ <function create_engine at 0x7fbf6dc226a8>
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/init.py", line 423, in create_engine
return strategy.create(*args, **kwargs)
│ │ │ └ {}
│ │ └ (bigquery://ml-prototypes,)
│ └ <function DefaultEngineStrategy.create at 0x7fbf6d8fb620>
└ <sqlalchemy.engine.strategies.PlainEngineStrategy object at 0x7fbf6dc33cf8>
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/strategies.py", line 61, in create
entrypoint = u._get_entrypoint()
│ └ <function URL._get_entrypoint at 0x7fbf6db5a598>
└ bigquery://ml-prototypes
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/url.py", line 172, in _get_entrypoint
cls = registry.load(name)
│ │ └ 'bigquery'
│ └ <function PluginLoader.load at 0x7fbf6dfeda60>
└ <sqlalchemy.util.langhelpers.PluginLoader object at 0x7fbf6db59240>
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/langhelpers.py", line 240, in load
"Can't load plugin: %s:%s" % (self.group, name)
│ │ └ 'bigquery'
│ └ 'sqlalchemy.dialects'
└ <sqlalchemy.util.langhelpers.PluginLoader object at 0x7fbf6db59240>
NoSuchModuleError Traceback (most recent call last)
in ()
4 feature_name="dist_hospital",
5 dburl="bigquery://ml-prototypes",
----> 6 ).cast(df)
6 frames
/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/langhelpers.py in load(self, name)
238
239 raise exc.NoSuchModuleError(
--> 240 "Can't load plugin: %s:%s" % (self.group, name)
241 )
242
NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:bigquery
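The last line of the traceback means SQLAlchemy found no dialect registered under the name bigquery. Third-party dialects register themselves through the sqlalchemy.dialects entry-point group when their package is installed, so the error usually indicates that the BigQuery dialect package is missing from the environment. Below is a small sketch for checking this before creating an engine. It assumes Python 3.8+ (for importlib.metadata), and the helper name is hypothetical.

```python
from importlib.metadata import entry_points


def dialect_is_registered(name: str) -> bool:
    """Check whether an installed package registers a SQLAlchemy dialect `name`.

    Only third-party dialects appear here; SQLAlchemy's built-in dialects are
    not registered through entry points.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection.
        group = eps.select(group="sqlalchemy.dialects")
    else:
        # Python 3.8 and 3.9 return a dict keyed by group name.
        group = eps.get("sqlalchemy.dialects", [])
    return any(ep.name == name for ep in group)


if not dialect_is_registered("bigquery"):
    print("No 'bigquery' dialect is installed in this environment.")
```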
The LengthOf function would compute the length of all line features inside a certain circular radius centered at a location with a certain lat and lon.
Some options:
yaml file, no need for on-prem solutions)
Add a function to get the NumberOf point features within a certain radius.
The NumberOf function should inherit from the Spells class, and should allow arguments for poi_type and radius.
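To make the intended result concrete, here is a pure-Python illustration of what a NumberOf feature computes for one location. It is not how geomancer does it, since the library pushes the work to the database. It uses the haversine formula with a mean Earth radius, and the function names and sample coordinates are invented.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def number_of(lat: float, lon: float,
              pois: Iterable[Tuple[float, float]], radius: float) -> int:
    """Count the points of interest within `radius` metres of (lat, lon)."""
    return sum(
        1 for p_lat, p_lon in pois
        if haversine_m(lat, lon, p_lat, p_lon) <= radius
    )


# Invented sample points: two close to the query location, one far away.
sample_pois = [(14.5995, 120.9842), (14.6000, 120.9842), (15.0, 121.0)]
print(number_of(14.5995, 120.9842, sample_pois, radius=1000))
```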
Right now, SpellBook relies on a simple ordered index from the input dataframe. If that's not the case, the output dataframe will be misaligned.
needs to be ~> 3.2
For reference: CVE-2020-25659 (Moderate severity)
How about as a class? Then we should have a Spell abstraction?
class DistanceToNearest(Spell):
    def __call__(self):
        pass
Full error message:
You're not returning the query job, you're returning the output dataframe. Change your variable name.
Right now, a user can only input one on parameter to any of the Spells.
dist_primary = DistanceToNearest(
    on="primary",
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary")
The on parameter is equivalent to a WHERE clause in SQL.
WHERE fclass = 'primary'
There might be cases where multiple on parameters are needed for feature engineering, and it'd be useful to have that option available out of the box.
dist_primary_or_secondary = DistanceToNearest(
    on=["primary", "secondary"],
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary_or_secondary")
WHERE fclass = 'primary' OR fclass = 'secondary'
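A chain of OR comparisons on one column is equivalent to a single IN clause, which makes it straightforward to accept either a string or a list. The sketch below demonstrates this against an in-memory SQLite table using only the standard library. The roads table, its rows, and the helper names are invented for illustration and are not geomancer's API.

```python
import sqlite3
from typing import List, Sequence, Union


def normalize_on(on: Union[str, Sequence[str]]) -> List[str]:
    """Accept a single value or a list of values and always return a list."""
    return [on] if isinstance(on, str) else list(on)


def select_features(conn: sqlite3.Connection,
                    on: Union[str, Sequence[str]]) -> List[str]:
    """Return the names of rows whose fclass matches any of the `on` values."""
    values = normalize_on(on)
    # One "?" placeholder per value keeps the query parameterized.
    placeholders = ", ".join("?" for _ in values)
    sql = f"SELECT name FROM roads WHERE fclass IN ({placeholders}) ORDER BY name"
    return [row[0] for row in conn.execute(sql, values)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roads (name TEXT, fclass TEXT)")
conn.executemany(
    "INSERT INTO roads VALUES (?, ?)",
    [("A", "primary"), ("B", "secondary"), ("C", "trunk")],
)

print(select_features(conn, "primary"))
print(select_features(conn, ["primary", "secondary"]))
```

Normalizing the argument up front keeps the existing single-string call sites working unchanged.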
Unable to extract features for a data frame of 56,761 rows if the feature class has a count of around 11,000 in its dataset. (It works if the feature class has a count of around 5,000, and a count of around 7,000 works sometimes.)
df = pd.read_csv('sample.csv')
instance_primary = DistanceToNearest(
    'primary',
    source_table='project.dataset_id.gis_osm_roads_free_1',
    feature_name='dist_primary'
)
test_primary = instance_primary.cast(df, dburl='bigquery://project')
Check the implementation of geomancer.backend.cores.DBCore and BigQueryCore.
The "filter" function should not be limited to just the fclass column in OSM; it should generalize to any table.
Usage ideas:
from geomancer import SpellBook
# When you want to register spells
my_spellbook = SpellBook([
    DistanceToNearest("embassy", within=10000, source_table="tm-geospatial.ph_osm.pois"),  # From BQ
    DistanceToNearest("hospital", within=5000, source_table="pois"),  # From Spatialite
])
# You can then do multiple casts
my_features = my_spellbook.cast(df, host=[bigquery.Client(), "tests/data/source.sqlite"])
# Saving the Spellbook
my_spellbook.author = "Lj Miranda" # optional
my_spellbook.description = "Some cool features for other stuff" # optional
my_spellbook.to_json("path/to/my/own/features.json")
Some potential challenges:
source_table and the host are decoupled (one during init, another during cast()).
Some preliminary tasks:
It should just be 3 hours, but because we're working in different UTC offsets, the timedelta is different. This should be solved by using pytz.utc.
It should be something like this
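A minimal sketch of the idea, not the project's actual code. It uses the standard library's datetime.timezone.utc in place of pytz.utc so that it runs without extra dependencies, and the timestamps and the UTC+8 offset are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Two events three hours apart, recorded as wall-clock times in different
# offsets: 09:00 at UTC+8 and 04:00 at UTC+0.
utc_plus_8 = timezone(timedelta(hours=8))
start = datetime(2019, 3, 1, 9, 0, tzinfo=utc_plus_8)
end = datetime(2019, 3, 1, 4, 0, tzinfo=timezone.utc)

# Dropping the offsets compares the wall-clock readings, which is misleading.
naive_delta = end.replace(tzinfo=None) - start.replace(tzinfo=None)

# Converting both to UTC first gives the true elapsed time.
aware_delta = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)

print("naive:", naive_delta)
print("aware:", aware_delta)
```

Normalizing every timestamp to UTC before subtracting makes the result independent of the machine's local offset, which is what a test comparing against a fixed 3-hour delta needs.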
I think we should stop using logger.catch. Or let's catch common errors early on rather than propagating them internally.