stable identifier for parcels "across runs" about bayarea_urbansim HOT 5 CLOSED

tbuckl commented on July 22, 2024

stable identifier for parcels "across runs"

from bayarea_urbansim.

Comments (5)

janowicz commented on July 22, 2024

Fletcher and I worked a bit on this earlier in the week. We tried an
approach involving the centroid coordinates and the area of each parcel
(controlling the precision of each to ensure consistency across
machines): create a string with x, y, and area concatenated together (each
with a defined precision), and then hash this string. The result was
unique for all parcels except one pair (and we can drop one of these
two parcels since they overlap almost completely).

The next step for me (to be done today or tomorrow) is to run data
regeneration on different machines and check that the resulting
IDs are the same. Our hypothesis is that controlling the precision will
help to achieve the same IDs across machines.

We added area because, after visually examining where centroids were
falling, we found examples of centroids being essentially in the same
place even though the parcel geometries differed (for example, where a
parcel representing a common area surrounds another parcel). Adding area
differentiated the parcels in the examples that were visually examined, and
this was confirmed by looking at the resulting IDs.
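
To make the precision argument concrete, here is a minimal sketch (not the code used in data regeneration) of a fixed-precision key built from centroid and area. The name `parcel_key` and all sample coordinates are made up for illustration.

```python
def parcel_key(x, y, area, precision=5):
    """Join centroid x, y and area, each formatted with a fixed number of decimals."""
    return ",".join("%.*f" % (precision, v) for v in (x, y, area))

# Hypothetical values: the same parcel as computed on two machines, with float noise.
machine_1 = (6012345.1234561, 2109876.9876541, 5432.1000001)
machine_2 = (6012345.1234562, 2109876.9876543, 5432.1000002)

print(machine_1 == machine_2)                            # False: raw values differ
print(parcel_key(*machine_1) == parcel_key(*machine_2))  # True: keys agree

# Hypothetical common-area parcel surrounding a smaller one:
# same centroid, different area.
inner = parcel_key(6012345.12346, 2109876.98765, 5432.1)
outer = parcel_key(6012345.12346, 2109876.98765, 20764.8)
print(inner != outer)                                    # True: area breaks the tie
```

Rounding narrows the problem but does not remove it: two values that sit on opposite sides of a rounding boundary would still produce different keys.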

On Thu, Jul 2, 2015 at 8:10 AM, Tom Buckley [email protected]
wrote:

@fscottfoti https://github.com/fscottfoti @mkreilly
https://github.com/mkreilly @janowicz https://github.com/janowicz,
we've talked a lot about this feature, so it makes sense to put some notes
here.

@fscottfoti https://github.com/fscottfoti's current thinking is that
hashes of the parcel centroids should be unique and that, because of this,
the user will always be able to say whether a parcel from a
given run is identical to a parcel from another run.

But I'm still a bit unclear on how to write the story for this feature,
so I don't know how we will say when it is complete.

I think the story is: as a person who is modeling the state of parcels
over time, I would like to be able to say whether any given parcel that I
am describing is identical to another parcel, over time.

It seems that one issue was that a user would assign attributes to a
parcel at some point in the modeling process, and they would later try
to apply those attributes to another set of parcels and be unable to do so
because the parcel table had changed, and therefore the unique identifiers
had changed. @mkreilly https://github.com/mkreilly could you clarify
what percentage difference or similarity would be acceptable when joining
parcels across tables? That might help us define what the successful
completion of this story looks like.

One previous attempt at keeping a parcel's ID the same was to keep an ID
column on the table that had a unique name which was generated in some
early process, and then just make sure that that ID column remained on the
table in all cases where parcels were used in the modeling process.

Another approach is to use the hash of the geometry column. For example
https://github.com/synthicity/bayarea_urbansim/blob/a0cdcee377500198645d468e130541e32a08a3dd/data_regeneration/match_aggregate.py#L762-L771.
However, when we compared the geom_ids from @janowicz
https://github.com/janowicz's (Windows 7) laptop to those generated by
the MTC Windows Server 2012, only 1/3 of the parcels had exactly
identical IDs. On the other hand, across Linux machines built in exactly the
same way, more than 95% of the geometry IDs are identical.

Other ideas for keeping a parcel's ID the same include using a geohash
http://postgis.net/docs/manual-dev/ST_GeoHash.html or similar.
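
For reference, a geohash can also be computed outside the database. The sketch below is a plain-Python version of the standard geohash encoding, assuming centroids are available as latitude/longitude in degrees (if the parcel data is in a projected coordinate system, it would need reprojecting first). The function name and the sample point are illustrative only.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, length=9):
    """Encode a lat/lon pair (degrees) as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash(57.64911, 10.40744, length=5))  # u4pru
```

Like a centroid hash, a geohash of the centroid cannot tell apart two parcels whose centroids coincide, so it would probably need to be combined with another attribute such as area.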

There might be 2 notions of time that are relevant: parcel time and
database time. For example, let's assume that parcel A has an attribute
something=1 at time-1 in the parcel table. If we discover, at time-2,
that we were incorrect, and that at time-1 parcel A in fact had
something = 2, do we revise the time-1 parcel table? Or do we only
revise the time-2 table? This could be more complicated if something is
actually the geometry of the parcel, or if the parcel splits.
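
One way to picture the two notions of time is to store both on every row, so that a correction is appended rather than overwriting history. This is only a sketch of one possible option, not something the project has adopted; the `Fact` record and `value_as_of` function are hypothetical names.

```python
from collections import namedtuple

# valid_time:    the modeled time the value applies to ("parcel time")
# recorded_time: when the row was written ("database time")
Fact = namedtuple("Fact", "parcel_id attribute value valid_time recorded_time")

facts = [
    Fact("A", "something", 1, valid_time=1, recorded_time=1),
    # Correction discovered at time-2: the time-1 value was really 2.
    Fact("A", "something", 2, valid_time=1, recorded_time=2),
]

def value_as_of(facts, parcel_id, attribute, valid_time, known_at):
    """Value for valid_time, using only rows recorded at or before known_at."""
    rows = [f for f in facts
            if f.parcel_id == parcel_id and f.attribute == attribute
            and f.valid_time == valid_time and f.recorded_time <= known_at]
    if not rows:
        return None
    return max(rows, key=lambda f: f.recorded_time).value

print(value_as_of(facts, "A", "something", valid_time=1, known_at=1))  # 1
print(value_as_of(facts, "A", "something", valid_time=1, known_at=2))  # 2
```

With this layout both answers stay available: what we believed about time-1 when it was first recorded, and what we believe about it now.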



janowicz commented on July 22, 2024

The function to be tested:


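import numpy as np

def generate_unique_id_from_geom(x, y, area, precision=5, hash_values=False):
    # x, y: centroid coordinates; area: parcel area (pandas Series sharing one index).
    # Keep only rows where all three values are usable, so the Series stay aligned.
    valid = x.notnull() & y.notnull() & (area > 0)
    x, y, area = x[valid], y[valid], area[valid]

    # Build "x,y,area" strings, each value rounded to the given precision.
    identifier = x.round(precision).astype('str') + ',' + \
                 y.round(precision).astype('str') + ',' + \
                 area.round(precision).astype('str')

    # Warn if rounding makes two or more parcels share an id.
    n_unique = len(np.unique(identifier))
    if len(identifier) > n_unique:
        print('Non-unique id values present.  %s rows and %s unique values.' % (len(identifier), n_unique))

    if hash_values:
        # get_md5_hexdigest is a helper that is not shown in this comment.
        identifier = identifier.apply(get_md5_hexdigest)
    return identifier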
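
`get_md5_hexdigest` is not defined in the snippet above. A plausible minimal version, offered as an assumption rather than the repository's actual helper, could look like this:

```python
import hashlib

def get_md5_hexdigest(value):
    """Return the MD5 hex digest of a value's string form (hypothetical helper)."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()

print(get_md5_hexdigest("abc"))  # 900150983cd24fb0d6963f7d28e17f72
```

Encoding to UTF-8 before hashing keeps the digest the same regardless of the platform's default encoding, which matters when IDs are compared across machines.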


tbuckl commented on July 22, 2024

Thanks @janowicz. I think that functionally that's very similar to this approach.

In any case, I think the key here will be writing tests.
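
As a starting point for such tests, one property worth checking is uniqueness. The sketch below is an assumption about what a check could look like; `find_collisions` and the sample IDs are hypothetical.

```python
from collections import Counter

def find_collisions(ids):
    """Return the ids that occur more than once, mapped to their counts."""
    return {i: n for i, n in Counter(ids).items() if n > 1}

# Hypothetical ids; "b2" stands in for a pair of nearly overlapping parcels.
ids = ["a1", "b2", "c3", "b2"]
collisions = find_collisions(ids)
print(collisions)  # {'b2': 2}
```

A test could then assert that the set of collisions is empty, or no larger than the known overlapping pair, and a second test could compare the IDs produced on two machines for equality.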


akselx commented on July 22, 2024

Remarkable that centroids could be identical for parcels with different geometries.

This may be a hack, and it may be vulnerable if precision does indeed vary by system. It relies on PostGIS's own ST_AsEWKT, which renders a geometry (not just the centroid here) as a string; I then pull that into pandas and hash it there.

import hashlib
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:xxxxxx@localhost:5432/gisdb')
# ST_AsEWKT renders the full geometry (not just the centroid) as text
county_area = pd.read_sql('select gid,ST_AsEWKT(geom) from zones1454',engine)
def hashbrown(s):
    # SHA-1 hex digest of the EWKT string (encoded to bytes for hashlib)
    return hashlib.sha1(s.encode('utf-8')).hexdigest()

county_area.st_asewkt.apply(hashbrown).head()
Out[20]:
0    32d484fa10ca4ea9f91dd385759ef0e3e57524c2
1    cb69cf3774eb47edf949c113f38162a703e0f2ce
2    389f9b65bdb42dcba7a8955301c1969cb0fa0b27
3    90db87b0f2922f3407f7fcc93d2c06b2d060143a
4    b8d5ff99f8d97e678d4d4fa2d506ff3b2de409c4
Name: st_asewkt, dtype: object


tbuckl commented on July 22, 2024

@akselx that's the approach taken here: https://github.com/synthicity/bayarea_urbansim/blob/master/data_regeneration/match_aggregate.py#L762-L768
