Comments (5)

janowicz avatar janowicz commented on July 22, 2024

Fletcher and I worked a bit on this earlier in the week. We tried an
approach involving the centroid coordinates and the area of each parcel
(controlling the precision of each to ensure consistency across
machines): create a string with x, y, and area concatenated together (each
with a defined precision), and then hash this string. This ended up
being unique for all parcels except one pair (and we can drop one of these
two parcels since they almost completely overlap).

The next step for me (to be done today or tomorrow) is to run data
regeneration on different machines and check that the resulting
IDs are the same. Our hypothesis is that controlling the precision will
help to achieve the same IDs across machines.
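
A minimal sketch of the precision idea, in plain Python rather than pandas. The names `parcel_id` and `PRECISION` and the coordinate values are hypothetical and do not come from the project code; the sketch only illustrates that noise smaller than the rounding precision usually leaves the hashed ID unchanged.

```python
import hashlib

PRECISION = 5  # hypothetical: decimal places kept for x, y, and area


def parcel_id(x, y, area, precision=PRECISION):
    """Hash a fixed-precision 'x,y,area' string (illustrative helper)."""
    key = ','.join('{:.{p}f}'.format(v, p=precision) for v in (x, y, area))
    return hashlib.md5(key.encode('utf-8')).hexdigest()


# The same parcel as it might be computed on two machines,
# differing only by floating-point noise below the 5th decimal place.
id_a = parcel_id(6012345.123451, 2101234.567891, 5432.100001)
id_b = parcel_id(6012345.1234512, 2101234.5678909, 5432.1000012)
print(id_a == id_b)
```

A value that sits right on a rounding boundary can still round differently on two machines, so the cross-machine comparison remains necessary.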

We added area because, after visually examining where centroids were
falling, we found examples of centroids being essentially in the same
place even though the parcel geometries were different (for example, in
cases where a parcel representing a common area surrounds another parcel).
Adding area differentiated the parcels in the examples that were visually
examined, and this was confirmed by looking at the resulting IDs.
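
To illustrate why area helps, here is a small sketch with two hypothetical parcels: a small lot and a larger square around the same center (the hole in the surrounding parcel is ignored for simplicity). The function name and coordinates are made up for illustration; the centroid and area come from the standard shoelace formulas.

```python
def area_and_centroid(ring):
    """Shoelace area and centroid of a simple polygon given as (x, y) vertices."""
    a = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5  # signed area; the sign cancels in the centroid division
    return abs(a), (cx / (6.0 * a), cy / (6.0 * a))


# Hypothetical parcels sharing a center point
inner = [(4, 4), (6, 4), (6, 6), (4, 6)]
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]

inner_area, inner_centroid = area_and_centroid(inner)
outer_area, outer_centroid = area_and_centroid(outer)

print(inner_centroid == outer_centroid)  # centroids coincide
print(inner_area, outer_area)            # areas differ
```

An ID built from the centroid alone would collide for these two shapes, while one that also includes the area would not.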

On Thu, Jul 2, 2015 at 8:10 AM, Tom Buckley wrote:

@fscottfoti https://github.com/fscottfoti @mkreilly
https://github.com/mkreilly @janowicz https://github.com/janowicz,
we've talked a lot about this feature, so it makes sense to put some notes
here.

@fscottfoti https://github.com/fscottfoti's current thinking is that
hashes of centroids for the parcels should be unique. And that because that
is the case, the user will always be able to say whether a parcel from a
given run is identical to a parcel from another run.

But I'm still a bit unclear on how to write the story for this feature,
so I don't know how we will say when it is complete.

I think the story is: as a person who is modeling the state of parcels
over time, I would like to be able to say whether any given parcel that I
am describing is identical to another parcel, over time.

It seems that one issue was that a user would assign attributes to a
parcel at some point in the modeling process, and then they would later try
to apply those attributes to another set of parcels and be unable to do so
because the parcel table had changed, and therefore the unique identifiers
had changed. @mkreilly https://github.com/mkreilly could you clarify
what percentage difference or similarity would be acceptable when joining
parcels across tables? That might help us define what the successful
completion of this story looks like.

One previous attempt at keeping a parcel's ID the same was to keep an ID
column on the table that had a unique name which was generated in some
early process, and then just make sure that that ID column remained on the
table in all cases where parcels were used in the modeling process.

Another approach is to use the hash of the geometry column. For example
https://github.com/synthicity/bayarea_urbansim/blob/a0cdcee377500198645d468e130541e32a08a3dd/data_regeneration/match_aggregate.py#L762-L771.
However, when we compared the geom_id values from @janowicz
https://github.com/janowicz's (Windows 7) laptop to those generated by
the MTC Windows Server 2012, only 1/3 of the parcels' IDs were exactly
identical. On the other hand, across Linux machines built in exactly the
same way, more than 95% of the geometry IDs are identical.

Other ideas for keeping a parcel's ID the same include using a geohash
http://postgis.net/docs/manual-dev/ST_GeoHash.html or similar.
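
For reference, a rough pure-Python sketch of the standard geohash encoding, so the idea can be tried without PostGIS. The function name `geohash_encode` and the sample point are illustrative only. Geohash works on longitude/latitude in degrees, so parcel centroids in a projected coordinate system would need to be transformed first.

```python
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'  # geohash alphabet


def geohash_encode(lat, lon, length=9):
    """Encode a lat/lon point (degrees) as a geohash string of given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < length:
        rng, coord = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2.0
        value <<= 1
        if coord >= mid:
            value |= 1
            rng[0] = mid  # keep the upper half
        else:
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return ''.join(chars)


print(geohash_encode(57.64911, 10.40744, length=3))
```

A shorter geohash covers a larger cell, so two distinct parcels with nearby centroids can share a code unless the length is chosen generously.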

There might be 2 notions of time that are relevant: parcel time and
database time. For example, let's assume that parcel A has an attribute
something=1 at time-1 in the parcel table. If we discover, at time-2,
that we were incorrect, and that at time-1 parcel A in fact had
something=2, do we revise the time-1 parcel table? Or do we only
resolve the time-2 table? This could be more complicated if something is
actually the geometry of the parcel, or if the parcel splits.


Reply to this email directly or view it on GitHub
#56.

from bayarea_urbansim.

janowicz avatar janowicz commented on July 22, 2024

The function to be tested:


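def generate_unique_id_from_geom(x, y, area, precision=5, hash_values=False):
    # Keep only rows where x and y are non-null and area is positive.
    # One shared mask keeps the three Series aligned on the same index
    # (filtering each separately leaves NaN identifiers after concatenation).
    valid = x.notnull() & y.notnull() & (area > 0)
    x, y, area = x[valid], y[valid], area[valid]

    # Identifier string "x,y,area", each rounded to `precision` decimals
    identifier = (x.round(precision).astype('str') + ',' +
                  y.round(precision).astype('str') + ',' +
                  area.round(precision).astype('str'))

    n_unique = identifier.nunique()
    if len(identifier) > n_unique:
        print('Non-unique id values present.  %s rows and %s unique values.' % (len(identifier), n_unique))

    if hash_values:
        # get_md5_hexdigest is not defined in this snippet; it is assumed to be
        # a helper that returns the MD5 hex digest of a string.
        identifier = identifier.apply(get_md5_hexdigest)
    return identifier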


tbuckl avatar tbuckl commented on July 22, 2024

Thanks @janowicz. I think that functionally that's very similar to this approach.

In any case, I think the key here will be writing tests.


akselx avatar akselx commented on July 22, 2024

Remarkable that centroids could be identical for parcels with different geometries.

Here is a possibly hacky solution, which may be vulnerable if precision does indeed vary by system. It relies on PostGIS's own ST_AsEWKT, which renders a geometry (the full geometry, not just the centroid) as a string; I then pull that into pandas and hash it there.

import hashlib
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:xxxxxx@localhost:5432/gisdb')
county_area = pd.read_sql('select gid, ST_AsEWKT(geom) from zones1454', engine)

def hashbrown(s):
    # SHA-1 hex digest of the EWKT string; encode first so this also runs on Python 3
    return hashlib.sha1(s.encode('utf-8')).hexdigest()

county_area.st_asewkt.apply(hashbrown).head()
Out[20]:
0    32d484fa10ca4ea9f91dd385759ef0e3e57524c2
1    cb69cf3774eb47edf949c113f38162a703e0f2ce
2    389f9b65bdb42dcba7a8955301c1969cb0fa0b27
3    90db87b0f2922f3407f7fcc93d2c06b2d060143a
4    b8d5ff99f8d97e678d4d4fa2d506ff3b2de409c4
Name: st_asewkt, dtype: object


tbuckl avatar tbuckl commented on July 22, 2024

@akselx that's the approach taken here: https://github.com/synthicity/bayarea_urbansim/blob/master/data_regeneration/match_aggregate.py#L762-L768
