geodict's Introduction

geodict
-------

A simple Python library/command-line tool for pulling location information from unstructured text

Installing
----------

This library uses a large geo-dictionary of countries, regions and cities, all stored in a MySQL database. The source data required is included in this project. To get started:

- Enter the details of your MySQL server and account into geodict_config.py
- Install the MySQLdb module for Python ('easy_install MySQL-python' may do the trick)
- cd into the folder you've unpacked this to, and run ./populate_database.py

This may take several minutes, depending on your machine, since there are over 2 million cities.

Running
-------

Once you've done that, give the command-line tool a try:
./geodict.py < testinput.txt

That should produce something like this:
Spain
Italy
Bulgaria
New Zealand
Barcelona, Spain
Wellington New Zealand
Alabama
Wisconsin

Those are the actual strings that the tool picked out as locations. If you want more information
on each of them in a machine-readable format you can specify JSON or CSV:
./geodict.py -f json < testinput.txt
[{"found_tokens": [{"code": "ES", "matched_string": "Spain", "lon": -4.0, "end_index": 4, "lat": 40.0, "type": "COUNTRY", "start_index": 0}]}, {"found_tokens": [{"code": "IT", "matched_string": "Italy", "lon": 12.833299999999999, "end_index": 10, "lat": 42.833300000000001, "type": "COUNTRY", "start_index": 6}]}, {"found_tokens": [{"code": "BG", "matched_string": "Bulgaria", "lon": 25.0, "end_index": 19, "lat": 43.0, "type": "COUNTRY", "start_index": 12}]}, {"found_tokens": [{"code": "NZ", "matched_string": "New Zealand", "lon": 174.0, "end_index": 42, "lat": -41.0, "type": "COUNTRY", "start_index": 32}]}, {"found_tokens": [{"matched_string": "Barcelona", "lon": 2.1833300000000002, "end_index": 52, "lat": 41.383299999999998, "type": "CITY", "start_index": 44}, {"code": "ES", "matched_string": "Spain", "lon": -4.0, "end_index": 59, "lat": 40.0, "type": "COUNTRY", "start_index": 55}]}, {"found_tokens": [{"matched_string": "Wellington", "lon": 174.78299999999999, "end_index": 70, "lat": -41.299999999999997, "type": "CITY", "start_index": 61}, {"code": "NZ", "matched_string": "New Zealand", "lon": 174.0, "end_index": 82, "lat": -41.0, "type": "COUNTRY", "start_index": 72}]}, {"found_tokens": [{"code": "AL", "matched_string": "Alabama", "lon": -86.807299999999998, "end_index": 196, "lat": 32.798999999999999, "type": "REGION", "start_index": 190}]}, {"found_tokens": [{"code": "WI", "matched_string": "Wisconsin", "lon": -89.638499999999993, "end_index": 332, "lat": 44.256300000000003, "type": "REGION", "start_index": 324}]}]

./geodict.py -f csv < testinput.txt
location,type,lat,lon
Spain,country,40.0,-4.0
Italy,country,42.8333,12.8333
Bulgaria,country,43.0,25.0
New Zealand,country,-41.0,174.0
"Barcelona, Spain",city,41.3833,2.18333
Wellington New Zealand,city,-41.3,174.783
Alabama,region,32.799,-86.8073
Wisconsin,region,44.2563,-89.6385
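
If you capture the CSV output, it can be read back with Python's standard csv module. The following is only a minimal sketch: the sample text is copied from the output above, and parse_geodict_csv is a hypothetical helper name, not part of geodict.

```python
import csv
import io

# Sample rows copied from the CSV output shown above. In practice this would
# be the captured stdout of `./geodict.py -f csv`.
SAMPLE_CSV = """location,type,lat,lon
Spain,country,40.0,-4.0
"Barcelona, Spain",city,41.3833,2.18333
Alabama,region,32.799,-86.8073
"""


def parse_geodict_csv(csv_text):
    """Turn geodict-style CSV text into a list of dicts with float coordinates.

    Hypothetical helper for illustration; it assumes the four-column header
    (location,type,lat,lon) shown in the README.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "location": row["location"],
            "type": row["type"],
            "lat": float(row["lat"]),
            "lon": float(row["lon"]),
        })
    return rows


if __name__ == "__main__":
    for entry in parse_geodict_csv(SAMPLE_CSV):
        print(entry["location"], entry["type"], entry["lat"], entry["lon"])
```

The csv module takes care of quoted fields such as "Barcelona, Spain", which a plain split on commas would break apart.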

For more of a real-world test, try feeding in the front page of the New York Times:
curl -L "http://newyorktimes.com/" | ./geodict.py
Georgia
Brazil
United States
Iraq
China
Brazil
Pakistan
Afghanistan
Erlanger, Ky
Japan
China
India
India
Ecuador
Ireland
Washington
Iraq
Guatemala

The tool just treats its input as plain text, so in production you'd want to use something like
beautiful soup to strip the tags out of the HTML, but even with messy input like that it's able
to work reasonably well.
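
As a rough illustration of that tag-stripping step, here is a sketch that uses Python's built-in html.parser as a stand-in for Beautiful Soup. TextExtractor and html_to_text are hypothetical names made up for this example, and the code is not part of geodict.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside <script>/<style> blocks.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces and collapse runs of whitespace.
        # Note: this also puts a space wherever a tag split a word.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<p>Flooding in <b>Spain</b> &amp; Italy</p><script>var x = 1;</script>"
    print(html_to_text(page))
```

The cleaned string could then be passed to geodict_lib.find_locations_in_text, or written to a file and piped into ./geodict.py.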

Developers
----------

To use this from within your own Python code:
import geodict_lib

and then call:
locations = geodict_lib.find_locations_in_text(text)
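
The sketch below shows one way to walk the result. It assumes the returned list has the same shape as the -f json output shown earlier (a list of entries, each with a "found_tokens" list); that shape is inferred from the sample output rather than confirmed from the library code. summarize_locations is a hypothetical helper, and joining the token strings with ", " is just a convention chosen for this example.

```python
# Sample data shaped like the `-f json` output in the README (values trimmed).
# Assumption: find_locations_in_text() returns this same structure.
SAMPLE_LOCATIONS = [
    {"found_tokens": [
        {"code": "ES", "matched_string": "Spain", "lon": -4.0, "end_index": 4,
         "lat": 40.0, "type": "COUNTRY", "start_index": 0}]},
    {"found_tokens": [
        {"matched_string": "Barcelona", "lon": 2.18333, "end_index": 52,
         "lat": 41.3833, "type": "CITY", "start_index": 44},
        {"code": "ES", "matched_string": "Spain", "lon": -4.0, "end_index": 59,
         "lat": 40.0, "type": "COUNTRY", "start_index": 55}]},
]


def summarize_locations(locations):
    """Flatten each location into one dict (hypothetical helper).

    The first token supplies the type and coordinates, which matches the
    CSV sample where "Barcelona, Spain" is reported as a city.
    """
    summaries = []
    for location in locations:
        tokens = location["found_tokens"]
        first = tokens[0]
        summaries.append({
            "name": ", ".join(t["matched_string"] for t in tokens),
            "type": first["type"],
            "lat": first["lat"],
            "lon": first["lon"],
            # Span of the whole match within the input text.
            "start_index": min(t["start_index"] for t in tokens),
            "end_index": max(t["end_index"] for t in tokens),
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize_locations(SAMPLE_LOCATIONS):
        print(summary["name"], summary["type"], summary["lat"], summary["lon"])
```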

The code itself may be a bit non-idiomatic; I'm still getting up to speed with Python!

Credits
-------

© Pete Warden, 2010 <[email protected]> - http://www.openheatmap.com/

World cities data is from MaxMind: http://www.maxmind.com/app/worldcities

All code is licensed under the GPL V3. For more details on the license see the included gpl.txt
file or go to http://www.gnu.org/licenses/

geodict's Issues

Trouble with the cities table

I'm seeing something strange in the cities table; it looks as though a lot of cities that are in the source data are missing from the populated geodict database, possibly getting clobbered on import.

Take Brooklyn, for example. In worldcitiespop.csv, grep finds 49 entries for 'brooklyn' (42 of which are in the US); in the geodict database, there are five entries for 'brooklyn', only one of which is in the US (and the US entry is in Alabama). The same seems to be true of other US cities like Rochester and Boston, each of which is found only once in the US (and in an alphabetically early state like AL or CA). Are the others getting clobbered on import? Or am I maybe making a mistake in looking through the database (not much experience with MySQL here).

The SQL query I'm using is:

SELECT city, country, region_code, population, lat, lon FROM cities WHERE city = 'Brooklyn';

Other things that might be relevant:

  1. The populate_database.py script produces two errors when I run it:
    ./populate_database.py:49: Warning: Data truncated for column 'last_word' at row 1
    (city, country, region_code, population, lat, lon, last_word))

    ./populate_database.py:49: Warning: Data truncated for column 'city' at row 1
    (city, country, region_code, population, lat, lon, last_word))

  2. populate_database.py won't work at all unless I first create the geodict database by hand, even though it looks as though the script is meant to handle that.

  3. System info:

    uname -a

    Darwin wilkens-imac.wustl.edu 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386

    mysql --version

    mysql Ver 14.14 Distrib 5.1.56, for apple-darwin10.3.0 (i386) using readline 5.1

Any other info I can provide? Happy to do any kind of debugging that might help. Thanks!

Populate regions table

Hi Pete,

What interesting work! I wanted to try your tool with my own data, but I can't get past populating the database. Populating the cities and countries tables works fine, but the regions table stays empty. The populate_database.py script gives me just one warning: "./populate_database.py:42: Warning: Data truncated for column 'city' at row 1 (city, country, region_code, population, lat, lon))"

Do you have any idea what might be wrong?
Thank you.

Python Module

Could you add a blank __init__.py file into your folder? I'm importing geodict as a module and have to create the file manually.

Thanks for the awesome script!

Language Support

Hey Pete, I would like to use your geodict library with German text. Is there some way to get a language-specific .csv file?

regards, Mirko

Showing previous data

Hi

import geodict_lib
lis = ["I live in India with love", "DACH and the Netherlands Klarna Checkout purchases, Nordics Klarna Checkout purchases, and Klarna Checkout for In-App Purchases"]
for li in lis:
    # Print the locations found in each string separately
    print(geodict_lib.find_locations_in_text(li))

I tried the above code, but for the second element of the list it shows India as the country instead of the Netherlands.

Thanks,
Viswesh M
