Comments (5)
Fletcher and I worked a bit on this earlier in the week. We tried an
approach involving the centroid coordinates and the area of each parcel
(controlling the precision of each to ensure consistency across
machines): create a string with x, y, and area concatenated together (each
with defined precision), and then hash this string. This ended up
being unique for all parcels except one pair (and we can drop one of those
two parcels since they nearly completely overlap).
The next step for me (to be done today or tomorrow) is to run the data
regeneration on different machines and check that the resulting IDs are
the same. Our hypothesis is that controlling the precision will help
achieve the same IDs across machines.
We added area in because, after visually examining where centroids were
falling, there were examples of centroids being in essentially the same
place even though the parcel geometries differed (for example, in cases
where a parcel representing common area surrounds another parcel). Adding
in area differentiated the parcels in the examples that were visually
examined, and this was confirmed after looking at the resulting IDs.
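For illustration, the common-area case can be reproduced with a toy example (hypothetical coordinates, using a plain shoelace formula rather than the project's GIS stack): a parcel and the surrounding ring of common area share a centroid, so only area separates them.

```python
def area_and_centroid(ring):
    # Shoelace formula; ring is a closed list of (x, y) vertices
    # (first vertex repeated at the end), counter-clockwise order.
    a = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return a, (cx / (6.0 * a), cy / (6.0 * a))

# Hypothetical parcel surrounded by a ring of common area
inner = [(1, 1), (3, 1), (3, 3), (1, 3), (1, 1)]
outer = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]

inner_area, inner_centroid = area_and_centroid(inner)
outer_area, outer_centroid = area_and_centroid(outer)
ring_area = outer_area - inner_area  # area of the common-area "donut"

print(inner_centroid == outer_centroid)  # True: centroid alone collides
print(inner_area == ring_area)           # False: area differentiates
```

By symmetry the donut's centroid also sits at the same point, so a centroid-only key cannot tell these parcels apart, while an (x, y, area) key can.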
On Thu, Jul 2, 2015 at 8:10 AM, Tom Buckley [email protected]
wrote:
@fscottfoti https://github.com/fscottfoti @mkreilly
https://github.com/mkreilly @janowicz https://github.com/janowicz,
we've talked a lot about this feature, so it makes sense to put some notes
here.
@fscottfoti https://github.com/fscottfoti's current thinking is that
hashes of centroids for the parcels should be unique, and that because that
is the case, the user will always be able to say whether a parcel from a
given run is identical to a parcel from another run.
But I'm still a bit unclear on how to write the story for this feature,
so I don't know how we will say when it is complete. I think the story is:
as a person that is modeling the state of parcels over time, I would like
to be able to say whether any given parcel that I am describing is
identical to another parcel, over time.
It seems that one issue was that a user would assign attributes to a
parcel at some point in the modeling process, and would later try to apply
those attributes to another set of parcels and be unable to do so because
the parcel table had changed, and therefore the unique identifiers had
changed. @mkreilly https://github.com/mkreilly, could you clarify what
percentage difference or similarity would be acceptable when joining
parcels across tables? That might help us define what successful
completion of this story looks like.
One previous attempt at keeping a parcel's ID the same was to keep an ID
column on the table, with a unique name generated in some early process,
and then make sure that that ID column remained on the table in all cases
where parcels were used in the modeling process.
Another approach is to use the hash of the geometry column; for example,
https://github.com/synthicity/bayarea_urbansim/blob/a0cdcee377500198645d468e130541e32a08a3dd/data_regeneration/match_aggregate.py#L762-L771.
However, when we compared the geom_ids from @janowicz
https://github.com/janowicz's (Windows 7) laptop to those generated by
the MTC Windows Server 2012, only 1/3 of the parcels were exactly
identical. On the other hand, across Linux machines built in exactly the
same way, more than 95% of the geometry IDs are identical.
Other ideas for keeping a parcel's ID the same include using a geohash
http://postgis.net/docs/manual-dev/ST_GeoHash.html or similar.
There might be two notions of time that are relevant: parcel time and
database time. For example, let's assume that parcel A has an attribute
something=1 at time-1 in the parcel table. If we discover, at time-2,
that we were incorrect, and that at time-1 parcel A in fact had
something=2, do we revise the time-1 parcel table, or do we only revise
the time-2 table? This could be more complicated if something is actually
the geometry of the parcel, or if the parcel splits.
—
Reply to this email directly or view it on GitHub
#56.
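The geohash idea mentioned in the thread can be sketched without PostGIS. Below is a minimal pure-Python implementation of the standard geohash encoding (interleave longitude/latitude bisection bits, base32-encode five bits per character); it is illustrative only, not the ST_GeoHash implementation, and the function name is my own:

```python
# Standard geohash base32 alphabet (omits a, i, l, o)
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'

def geohash_encode(lat, lon, precision=9):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # the first bit bisects longitude
    while len(bits) < precision * 5:
        rng = lon_range if use_lon else lat_range
        val = lon if use_lon else lat
        mid = (rng[0] + rng[1]) / 2.0
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
    # Pack every 5 bits into one base32 character
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return ''.join(chars)

# Well-known test point from the geohash literature:
print(geohash_encode(57.64911, 10.40744, precision=11))  # u4pruydqqvj
```

A useful property for parcel IDs is that truncating a geohash yields a coarser, prefix-comparable bucket, so nearby centroids share a prefix even if their full hashes differ.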
from bayarea_urbansim.
The function to be tested:
import numpy as np

def generate_unique_id_from_geom(x, y, area, precision=5, hash_values=False):
    # Keep only non-null values
    x = x[~x.isnull()]
    y = y[~y.isnull()]
    area = area[area > 0]
    # Concatenate x, y, and area, each rounded to the given precision
    identifier = x.round(precision).astype('str') + ',' + \
                 y.round(precision).astype('str') + ',' + \
                 area.round(precision).astype('str')
    if len(identifier) > len(np.unique(identifier)):
        print('Non-unique id values present. %s rows and %s unique values.'
              % (len(identifier), len(np.unique(identifier))))
    if hash_values:
        # get_md5_hexdigest is a helper defined elsewhere in data_regeneration
        identifier = identifier.apply(get_md5_hexdigest)
    return identifier
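A quick sanity check of the precision hypothesis (synthetic coordinates; get_md5_hexdigest re-implemented here with hashlib as a stand-in for the repo helper of the same name): perturbing the inputs below the rounding precision should leave the IDs unchanged.

```python
import hashlib

import pandas as pd

def get_md5_hexdigest(s):
    # Stand-in for the project's helper of the same name
    return hashlib.md5(s.encode('utf-8')).hexdigest()

precision = 5
x = pd.Series([553210.123456789, 557031.987654321])
y = pd.Series([4182234.123456789, 4185555.987654321])
area = pd.Series([1234.5678901, 8765.4321098])

def make_ids(x, y, area):
    # Same scheme as generate_unique_id_from_geom: round, concatenate, hash
    ident = (x.round(precision).astype('str') + ',' +
             y.round(precision).astype('str') + ',' +
             area.round(precision).astype('str'))
    return ident.apply(get_md5_hexdigest)

ids = make_ids(x, y, area)
# Simulate another machine computing the same values with
# floating-point noise below the rounding precision
noisy_ids = make_ids(x + 1e-8, y + 1e-8, area + 1e-8)
print((ids == noisy_ids).all())  # True: rounding absorbs the noise
```

This only shows that sub-precision noise is absorbed; differences at or above the fifth decimal would still change the IDs, which is exactly what the cross-machine test runs should surface.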
Thanks @janowicz. I think that functionally that's very similar to this approach.
In any case, I think the key here will be writing tests.
Remarkable that centroids could be identical for parcels with different geometries.
This may be a hack of a solution, and may be vulnerable if precision indeed varies by system. It relies on PostGIS's own ST_AsEWKT,
which renders a geometry (not just the centroid here) as a string; I then pull that into pandas and hash it from there.
import hashlib

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:xxxxxx@localhost:5432/gisdb')
county_area = pd.read_sql('select gid, ST_AsEWKT(geom) from zones1454', engine)

def hashbrown(s):
    # SHA-1 operates on bytes, so encode the EWKT string first
    return hashlib.sha1(s.encode('utf-8')).hexdigest()

county_area.st_asewkt.apply(hashbrown).head()
Out[20]:
0 32d484fa10ca4ea9f91dd385759ef0e3e57524c2
1 cb69cf3774eb47edf949c113f38162a703e0f2ce
2 389f9b65bdb42dcba7a8955301c1969cb0fa0b27
3 90db87b0f2922f3407f7fcc93d2c06b2d060143a
4 b8d5ff99f8d97e678d4d4fa2d506ff3b2de409c4
Name: st_asewkt, dtype: object
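One caveat with hashing the raw EWKT string: any platform difference in how coordinates are printed changes the hash, which is consistent with the 1/3-identical result reported earlier in the thread. A small sketch (hypothetical SRID and coordinates) contrasting raw-string hashes with hashes of precision-controlled coordinates:

```python
import hashlib

def sha1(s):
    return hashlib.sha1(s.encode('utf-8')).hexdigest()

# Two EWKT renderings of effectively the same point, differing only
# in a trailing digit (hypothetical SRID and coordinates):
wkt_a = 'SRID=26910;POINT(553210.123456789 4182234.98765432)'
wkt_b = 'SRID=26910;POINT(553210.1234567892 4182234.98765432)'
print(sha1(wkt_a) == sha1(wkt_b))  # False: raw-string hashes diverge

def rounded_key(x, y, precision=5):
    # Fix the printed precision before hashing
    return sha1('%.*f,%.*f' % (precision, x, precision, y))

print(rounded_key(553210.123456789, 4182234.98765432) ==
      rounded_key(553210.1234567892, 4182234.98765432))  # True
```

This is the same precision-control idea as the centroid/area approach, applied at the string-formatting step rather than to the full geometry text.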
@akselx that's the approach taken here: https://github.com/synthicity/bayarea_urbansim/blob/master/data_regeneration/match_aggregate.py#L762-L768