OnTrack

Given a piece of text and a database of possible matches, OnTrack uses word vectorization and cosine matching to find the closest match despite different formatting conventions.
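To make the idea concrete, here is a minimal sketch of word vectorization followed by cosine matching, using only the Python standard library. It is not OnTrack's implementation: the helper names (`vectorize`, `cosine_similarity`, `closest_match`) and the sample records are assumptions made for illustration, and plain word counts stand in for whatever weighting the actual matcher classes apply.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words vector (word -> count), ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_match(query, candidates):
    """Return (best candidate, score) ranked by cosine similarity to the query."""
    query_vec = vectorize(query)
    scored = [(c, cosine_similarity(query_vec, vectorize(c))) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Illustrative records only; the formatting differs from the query on purpose.
candidates = [
    "Construction of school building, Laoag City",
    "Rehabilitation of gravel road - Bulacan North",
    "Repair of bridge in Bohol",
]
best, score = closest_match("REHABILITATION/gravel road, bulacan", candidates)
print(best, round(score, 3))
```

Because both texts are reduced to word counts before comparison, differences in case, punctuation, and word order do not affect the score.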

This repo contains a collection of text cleaner scripts and matcher classes.

Documentation

See the wiki for the full documentation.

Dependencies

  • numpy
  • pandas
  • nltk
  • scipy>=0.16.1
  • scikit-learn

Dependencies will be installed during setup.

Installation

pip install <git+http://this-repository.git>

Not sure what to do? More detailed instructions here.

Usage

This example will find records matching "Rehabilitation/Reconstruction/Removal of Gravel on Bulacan, Road. North km+1993-km+384" in the database path-to-file.csv.

1. Prepare your query.

OnTrack accepts the following as input:

  • list of strings
  • pandas DataFrame
  • csv filepath, accessed using read()

import pandas as pd

db_A1 = ['Laoag', 'Construction']  # two separate queries
db_A2 = pd.DataFrame([['Laoag', 'Construction', 'unhelpful column'],
                      ['Bohol', 'Rehabilitation', 'unhelpful column'],
                      ['Quezon City', 'Construction', 'unhelpful column']])  # three rows/queries

df_B = read('small.csv')  # the corpus to search, loaded with read()

2. Find records with exact, case-insensitive substring matches from a specific column in the corpus.

matches1 = find_exact(db_A1, df_B)  # results are saved to the default 'exact matches.csv'

matches2 = find_exact(db_A2, df_B, colA=[0,1], colB=['contract_desc', 'implementing_office'], fname='exact2')  # results are saved to 'exact2.csv'

For matches1, the results are written to a csv with the default name 'exact matches.csv'. For matches2, since fname was specified, the results are saved in 'exact2.csv'. Passing colA and colB restricts matching to specific columns: 0 and 1 are the column numbers of the helpful columns in db_A2, while contract_desc and implementing_office are the relevant headers in df_B ('small.csv').

If colA and colB are not specified, find_exact will use all columns.
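The matching rule described above (exact, case-insensitive substring matches, optionally limited to chosen columns) can be sketched as follows. This is a hypothetical stand-in written with plain lists and dicts, not the code behind `find_exact`: the function name `find_exact_sketch`, its `columns` parameter, and the sample rows are all assumptions, and it does not write a csv.

```python
def find_exact_sketch(queries, records, columns=None):
    """Hypothetical illustration of case-insensitive substring matching.

    queries: list of strings
    records: list of dicts, one dict per row
    columns: keys to search; None means search every column
    Returns a list of (query, row_index) pairs.
    """
    matches = []
    for query in queries:
        needle = query.lower()
        for row_index, row in enumerate(records):
            keys = columns if columns is not None else row.keys()
            # A row matches if the query appears inside any selected cell.
            if any(needle in str(row[key]).lower() for key in keys):
                matches.append((query, row_index))
    return matches


# Illustrative rows only, reusing the column headers from the example above.
records = [
    {"contract_desc": "CONSTRUCTION OF ROAD", "implementing_office": "Laoag"},
    {"contract_desc": "Rehabilitation of bridge", "implementing_office": "Bohol"},
]
all_columns = find_exact_sketch(["laoag", "construction"], records)
one_column = find_exact_sketch(["laoag"], records, columns=["contract_desc"])
print(all_columns)
print(one_column)
```

Restricting the search to `contract_desc` drops the "laoag" hit, which shows why selecting only the helpful columns changes the result.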

3. Find the records with the highest cosine similarity scores (top 5 by default).

matches1 = find_closest(db_A1, df_B)

matches2 = find_closest(db_A2, df_B, n=10, cleaner=None)

The result from the first line will be saved in 'closest match.csv'.

Since no file name was specified, 'closest match.csv' will be overwritten by the result of the second line. You can increase or decrease the number of best matches returned (default: 5) by changing n. Setting cleaner to None means that either the input databases were precleaned or you choose to keep them in their raw format.
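As a rough picture of what a cleaner does and what `cleaner=None` skips, consider the sketch below. The functions `simple_cleaner` and `prepare` are hypothetical and far simpler than the cleaner scripts in this repo; the sketch only assumes that a cleaner is a function applied to each text before matching.

```python
import re


def simple_cleaner(text):
    """Hypothetical cleaner: lowercase, turn runs of non-alphanumerics into one space, trim.

    Note: this keeps ASCII letters and digits only, which is enough for the example.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def prepare(texts, cleaner=simple_cleaner):
    """Apply the cleaner to every text; cleaner=None returns the texts unchanged."""
    if cleaner is None:
        return list(texts)
    return [cleaner(t) for t in texts]


# Two spellings of the same record with different formatting conventions.
raw = [
    "Rehabilitation/Reconstruction of Road, Bulacan.",
    "  REHABILITATION reconstruction of road - bulacan",
]
cleaned = prepare(raw)
untouched = prepare(raw, cleaner=None)
print(cleaned)
print(untouched)
```

After cleaning, both spellings collapse to the same string, whereas with `cleaner=None` the formatting differences are passed on to the matcher as they are.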

More detailed instructions are available in the wiki.

Motivation

OnTrack tackles the problem of huge numbers of unmatched, messy records across government data silos by automating the matching process. In half an hour, the current version of OnTrack is able to accurately shortlist 85% of the records manually matched in 15 work-months.

Read more about our pilot test here.

Contributors

  • Stephanie Sy
  • Ray Dino
  • Jose Araneta
  • Gilian Uy

Contact

This is a work in progress. Please send comments and questions to [email protected].
