
snknitin / us-zipcode-distance



License: Apache License 2.0



Project - US-Zipcode-Distance

Code to create a pairwise zip code distance matrix and query it through an index lookup. This is a tiny utility I built for use in my work, but I figured it might be helpful to anyone else dealing with zip/postal code distances. It can save a lot of time if you're working with large data and would otherwise have to hit some zip/lat-long service API over and over.

The idea is to form a (41483, 41483) matrix whose cell values are the haversine distances between pairs of indexed zip codes. I save the index lookup and the matrix file as .npz, .json or .pkl so that I can reload and use them in different projects. I hope this saves you effort when you're working with data streams or pandas DataFrames and want to avoid iterating, calling a PyPI package, or writing a custom apply function to calculate the distance between two zip codes again and again.

Use this matrix and vectorize your operations!
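As a rough illustration of the idea, here is a minimal sketch that builds an index lookup and a small pairwise haversine matrix. It is not the repo's script: the project relies on mpu for the haversine calculation, while this sketch swaps in a plain NumPy formula so it stays self-contained. The function name `pairwise_haversine_km`, the Earth radius constant, and the three zip codes with their coordinates are all placeholders chosen for the example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def pairwise_haversine_km(lat_deg, lon_deg):
    """Return an (n, n) matrix of great-circle distances in kilometres."""
    lat = np.radians(np.asarray(lat_deg, dtype=np.float64))
    lon = np.radians(np.asarray(lon_deg, dtype=np.float64))
    # Broadcasting a column against a row yields every pairwise difference.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point overshoots outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


# Placeholder codes and coordinates, NOT real zip centroids.
zips = ["00001", "00002", "00003"]
lats = [0.0, 0.0, 1.0]
lons = [0.0, 1.0, 0.0]

# Index lookup: zip code string -> row/column position in the matrix.
zips_indexer = {z: i for i, z in enumerate(zips)}
zip_dist = pairwise_haversine_km(lats, lons).astype(np.float32)

print(zip_dist[zips_indexer["00001"], zips_indexer["00002"]])
```

Full broadcasting like this creates several n x n float64 temporaries, so at the real size of 41483 zip codes it would probably have to be computed in row chunks.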


Installation - Dependencies

Install the following Python packages:

pip install pandas
pip install numpy
pip install mpu

Running the script

Since the matrix is too large to upload to git, you can either run the .py script in the code folder or go through the Jupyter Notebook. The source file has been added to the Data folder. Running Zip_distance.py will generate the matrix and save it on your machine, and you can then load it using

zip_dist = np.load(os.path.join(walk_up_folder(os.getcwd(), 2), "zip_dist.npz"))['arr_0']
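The `['arr_0']` key in that call suggests the matrix was written with `np.savez` and a positional argument, since NumPy names positional arrays `arr_0`, `arr_1`, and so on. That is an inference from the load call, not something confirmed by the script. A minimal round-trip sketch with a tiny stand-in matrix and a temporary directory:

```python
import os
import tempfile

import numpy as np

# Tiny stand-in for the real distance matrix.
matrix = np.array([[0.0, 5.0], [5.0, 0.0]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "zip_dist.npz")
    # A positional array is stored under the automatic key "arr_0".
    np.savez(path, matrix)
    with np.load(path) as archive:
        keys = list(archive.files)
        loaded = archive["arr_0"]  # read into memory before the file closes

print(keys, loaded.dtype)
```

Passing the array as a keyword instead, for example `np.savez(path, zip_dist=matrix)`, would store it under that name rather than `arr_0`.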

Usage/Examples

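import json
import os

import numpy as np

# walk_up_folder is the same helper used in the load call above (not defined here);
# it is assumed to return a directory a given number of levels above the path.

# Load the saved objects
with open(os.path.join(walk_up_folder(os.getcwd(), 3), "Data/zips_indexer.json")) as f:
    zips_indexer = json.load(f)
zip_dist = np.load(os.path.join(walk_up_folder(os.getcwd(), 3), "Data/zip_dist.npz"))['arr_0']

a = '95035'  # Zip codes as strings
b = '94085'

# Extract that cell value from the distance matrix
distance = zip_dist[zips_indexer[a], zips_indexer[b]]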
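The same lookup extends to whole DataFrame columns with NumPy integer-array indexing, which is where the matrix pays off. The sketch below is only an illustration: the column names `origin` and `dest`, the third zip code, and every number in the matrix are placeholders, not real distances.

```python
import numpy as np
import pandas as pd

# Placeholder lookup and matrix; the values are NOT real distances.
zips_indexer = {"95035": 0, "94085": 1, "10001": 2}
zip_dist = np.array(
    [[0.0, 10.0, 2500.0],
     [10.0, 0.0, 2510.0],
     [2500.0, 2510.0, 0.0]],
    dtype=np.float32,
)

df = pd.DataFrame({
    "origin": ["95035", "10001", "94085"],
    "dest": ["94085", "95035", "94085"],
})

# Map each zip code to its matrix position, then index all pairs at once.
rows = df["origin"].map(zips_indexer).to_numpy()
cols = df["dest"].map(zips_indexer).to_numpy()
df["distance"] = zip_dist[rows, cols]

print(df)
```

A zip code missing from the lookup would map to NaN and break the indexing step, so real data likely needs a filter or a fallback before this line.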

Running Code

To run the code fresh or make any changes, you can follow the steps in the Notebook

Optimizations

What optimizations did I make in my code?

  • Converted the matrix from 'float64' to 'float32' to halve the disk space it needs
  • Tried numba's jit to speed up the iterrows and NumPy loops
  • mpu.haversine gives its result in km, so I convert to miles (1 km = 0.621371 miles)
  • Did not normalize the matrix before saving, since I need the raw distance values, and because the matrix is pairwise the maximum distance would be very large
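To put the first and third points in numbers, here is a small sketch of the size arithmetic for a (41483, 41483) matrix and of the unit conversion. The 2 x 2 matrix of kilometre values is a placeholder, and the sizes are raw array bytes, not the exact size of the saved file.

```python
import numpy as np

N_ZIPS = 41483
KM_TO_MILES = 0.621371

# Raw in-memory size of an N x N matrix for each dtype.
cells = N_ZIPS * N_ZIPS
bytes_f64 = cells * np.dtype(np.float64).itemsize
bytes_f32 = cells * np.dtype(np.float32).itemsize
print(f"float64: {bytes_f64 / 1e9:.2f} GB, float32: {bytes_f32 / 1e9:.2f} GB")
# prints "float64: 13.77 GB, float32: 6.88 GB"

# Convert a (placeholder) kilometre matrix to miles, then downcast.
dist_km = np.array([[0.0, 100.0], [100.0, 0.0]])
dist_miles = (dist_km * KM_TO_MILES).astype(np.float32)
print(dist_miles)
```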

Lessons Learned

  • numba:
    • Per the deprecation recommendations, it is quite plausible that code which doesn't compile with @jit(nopython=True) runs faster without the decorator at all.
    • The set of libraries that can be used with numba's jit in nopython mode is fairly limited (pretty much only NumPy arrays and certain Python built-in libraries).
  • Don't use Python's pickle, a database, or a big data system to store your data on disk if you can use np.save() and np.load() instead. These two functions are the fastest way I have found so far to move array data between disk and memory.
  • Using list(zip(a, b)) is much faster if you want to create a new pandas column that is a tuple of two other columns.
  • To make a symmetric pairwise matrix, initialize it with np.zeros so that you don't need to worry about the diagonal elements. Extract the upper triangular indices and fill in those cells. Then add the transpose to the matrix to fill the lower triangle as well, and set the diagonal to 0 again.
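The last two points can be sketched as follows. This is an illustration under assumptions, not the repo's code: the `upper_values` array stands in for computed distances, and the `lat`/`lon` column names and their values are made up.

```python
import numpy as np
import pandas as pd

n = 4

# Indices of the strict upper triangle (k=1 skips the diagonal).
iu, ju = np.triu_indices(n, k=1)
# Stand-in for the distances computed for each (i, j) pair with i < j.
upper_values = np.arange(1, len(iu) + 1, dtype=np.float32)

dist = np.zeros((n, n), dtype=np.float32)  # diagonal starts at 0
dist[iu, ju] = upper_values                # fill the upper triangle
dist = dist + dist.T                       # mirror into the lower triangle
np.fill_diagonal(dist, 0.0)                # set the diagonal to 0 again

# Tuple column built from two other columns with list(zip(...)).
df = pd.DataFrame({"lat": [1.5, 2.5], "lon": [10.0, 20.0]})
df["latlon"] = list(zip(df["lat"], df["lon"]))

print(dist)
print(df)
```

Computing only the upper triangle means each pair is evaluated once, which roughly halves the number of haversine calls compared with filling every cell.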

Authors

Bibtex

Please use the following BibTeX entry when you refer to this project:

@software{Samala_Pairwise_Zip-code_Haversine_2021,
author = {Samala, Nitin Kishore Sai},
month = {11},
title = {{Pairwise Zip-code Haversine distance lookup matrix}},
url = {https://github.com/snknitin/us-zipcode-distance},
year = {2021}
}

Acknowledgements

Related

Here are some related projects

pgeocode

FAQ

Does this only work for US zip codes?

No. You can swap the source file of zip details for another country's and run the same code.


Feedback

If you have any feedback, please reach out to us at [email protected]

License

Apache 2.0
