
corporation-parser's Introduction

Porting Echelon Corporation Parser to Parslet

Sunlight Labs recently built a tool called Echelon to examine the companies named in US lobbying forms.

You can read the blog post here

The part that parses company names into normalised forms is a very small part of the code. It uses the Clojure library instaparse, which builds a parser from the following grammar:

beings = <whitespace>* name  (<whitespace>+ splitters <whitespace>+ name)* <whitespace>*
name   = token (<whitespace>+ token)* ["."*]

(*TODO: What to do about missing spaces. How smart can this parser actually get?*)
<token>   = simple | special

(*TODO: why does formers have to be included here? It feels odd, perhaps the two negative lookaheads in simple and then splitters cancel out and formers get lost somehow*)
<simple>  = !(special | splitters | formers | initials) #"[a-z0-9&!'/]+" ["."]

<special> = &(special-helper whitespace) special-helper
<special-helper> = and | corporates | domain | initials | north-america | number | saint | usa

and = "&" | "and" 

corporates = llc | pllc | llp | lp | incorporated | corporation | limited | company | international | association | foreign 
association = "association" | "assn." | "associations" | "association's" | "associations'"
international = "international"
llc  = "llc" | "lc" | "lcc" | "llc."
pllc = "pllc"
llp  = "llp" | "llp."
lp =   "lp" | "lp." | "l.p."
incorporated  = "incorporated" | "inc" | "inc."
corporation = "corps" | "corporations" | "corporation" | "corp" | "corp."
limited = "ltd" | "ltd." | "ltd.."
company = "company" | "companies" | "co."
foreign = "ltda." 

initials = !(usa | north-america | splitters | corporates) initial+ 
initial = #"[a-z]" "."
domain = #'[a-z0-9]+' <"."> ("com" | "org" | "us" | "net")
north-america = "north america" | "n.a." | "north american"
number = &(number-helper whitespace) number-helper
<number-helper> = <["no." | "#"]> [<whitespace>] some-digits
<some-digits> = (digit | two-digits | three-digits) {"," three-digits} {digit}
<two-digits> = digit digit
<three-digits> = digit digit digit 
<digit> = #"[0-9]"
saint = "st." | "saint" | "saints"
usa = &("u.s." whitespace) "u.s." | &("u.s.a" whitespace) "u.s.a" | &("u.s.a." whitespace) "u.s.a."


splitters = fka  | aka
fka = "fka" | "f.k.a." | "f/k/a/" |  simple-fka  | complex-fka
<simple-fka>  = !complex-fka formers
<complex-fka> = formers <whitespace> fka-verbs [<whitespace> "as"]
<formers> = "formerly" | "formelry" | "formarly" | "frmly" | "frly"
<fka-verbs> = "registered" | "filed" | "reported" | "known" | "know" | "field"

aka = "a/k/a" | "a.k.a." | "also known as"

whitespace = ' ' | ',' | '-' | '(' | ')' | ':' | #'$' | '\"' | '/' | '*' | '=' | '>' | '+' | '[' | ']' | '_' | '$'
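To make the normalisation idea concrete, here is a minimal sketch in plain Ruby of what the `corporates` rules achieve: many spellings collapse onto one canonical rule name. The variant lists are copied from the grammar above; the `tag_tokens` helper and its simplified splitting (spaces and commas only) are assumptions for illustration, not part of Echelon or of the Parslet port.

```ruby
# Canonical rule name => spelling variants, copied from the grammar above.
CORPORATES = {
  llc:           %w[llc lc lcc llc.],
  pllc:          %w[pllc],
  llp:           %w[llp llp.],
  lp:            %w[lp lp. l.p.],
  incorporated:  %w[incorporated inc inc.],
  corporation:   %w[corps corporations corporation corp corp.],
  limited:       %w[ltd ltd. ltd..],
  company:       %w[company companies co.],
  international: %w[international],
  association:   ["association", "assn.", "associations",
                  "association's", "associations'"],
  foreign:       %w[ltda.]
}.freeze

# Invert the table: variant spelling => canonical rule name.
VARIANT_TO_RULE = CORPORATES.flat_map { |rule, variants|
  variants.map { |variant| [variant, rule] }
}.to_h.freeze

# Hypothetical helper: lowercase the name, split on spaces and commas only
# (the real grammar treats many more characters as whitespace), then tag
# each token as a known corporate suffix or as a simple word.
def tag_tokens(name)
  name.downcase.split(/[ ,]+/).reject(&:empty?).map do |token|
    rule = VARIANT_TO_RULE[token]
    rule ? { corporates: rule } : { simple: token }
  end
end

p tag_tokens("SkyTerra Communications, Inc.")
```

For that input the last token, "inc.", is tagged as `:incorporated`, while the first two are tagged as simple words. A lookup table like this ignores the lookahead rules in the grammar (for example, `special` requiring trailing whitespace), so it is only a rough approximation of the parser's behaviour.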

Notes on Echelon

The parser works on short, specific strings that come from forms like this one:

https://github.com/influence-usa/lobbying_federal_domestic/wiki/House-Data-Dictionary

and on these fields in particular:

lobbying/client/name
lobbying/foreign-entity/name
lobbying/registrant/name
lobbying/affiliated-organization/name
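As a rough sketch of how those field paths might be used, the snippet below pulls the name strings out of a nested hash with `Hash#dig`. The record shape (one nested hash per path segment) and the sample values are assumptions for illustration; the real form data may be laid out differently, for example with lists of entities.

```ruby
# The four field paths listed above.
FIELD_PATHS = %w[
  lobbying/client/name
  lobbying/foreign-entity/name
  lobbying/registrant/name
  lobbying/affiliated-organization/name
].freeze

# Hypothetical record: both the nesting and the registrant name are made up.
SAMPLE_RECORD = {
  "lobbying" => {
    "client" => {
      "name" => "SkyTerra Communications, Inc., formerly Mobile Satellite Ventures"
    },
    "registrant" => { "name" => "Example Registrant LLC" }
  }
}.freeze

# Return a hash of path => name for every path that resolves to a string.
# Paths that are missing from the record are skipped.
def names_from(record, paths)
  paths.each_with_object({}) do |path, found|
    value = record.dig(*path.split("/"))
    found[path] = value if value.is_a?(String)
  end
end

names_from(SAMPLE_RECORD, FIELD_PATHS).each do |path, name|
  puts "#{path}: #{name}"
end
```

Each string collected this way is the kind of short input the grammar is designed for.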

Using Ruby's Parslet

I've had a go at porting this grammar to the Ruby library parslet. Some resources to learn parslet:

I haven't ported every aspect of the Echelon parser, so the port doesn't account for the 'usa' or 'saint' rules, or for grouping digits in names.

You can run the parser like so:

$ cd this/folder
$ bundle install
$ bundle exec ruby parser.rb

Running it shows that "SkyTerra Communications, Inc., formerly Mobile Satellite Ventures" is parsed into:

{:beings=>
  {:company=>
    [{:simple=>"skyterra"@0},
     {:simple=>"communications"@9},
     {:special=>{:corporates=>"inc."@25}}],
   :splitters=>{:fka=>"formerly"@31},
   :company_alt=>
    [{:simple=>"mobile"@40},
     {:simple=>"satellite"@47},
     {:simple=>"ventures"@57}]}}
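One way to consume a tree like this is to flatten each token list back into a name and note which splitter joined them. The sketch below is plain Ruby with ordinary strings standing in for the positioned values (such as "skyterra"@0) that Parslet returns; the helper names `token_text`, `name_from` and `summarise` are hypothetical and are not part of parser.rb.

```ruby
# The tree shown above, with plain strings in place of Parslet's
# positioned slices.
TREE = {
  beings: {
    company: [
      { simple: "skyterra" },
      { simple: "communications" },
      { special: { corporates: "inc." } }
    ],
    splitters: { fka: "formerly" },
    company_alt: [
      { simple: "mobile" },
      { simple: "satellite" },
      { simple: "ventures" }
    ]
  }
}.freeze

# Return the text of one token. A :special token nests one level deeper
# (for example {:special=>{:corporates=>"inc."}}), so recurse into hashes.
def token_text(token)
  value = token.values.first
  value.is_a?(Hash) ? token_text(value) : value.to_s
end

# Join the text of every token in a list into a single name.
def name_from(tokens)
  tokens.map { |token| token_text(token) }.join(" ")
end

# Summarise a parsed tree: the current name, the kind of splitter that
# was matched, and the alternative (here, former) name.
def summarise(tree)
  beings = tree[:beings]
  {
    current:  name_from(beings[:company]),
    relation: beings[:splitters].keys.first,
    former:   name_from(beings[:company_alt])
  }
end

p summarise(TREE)
```

For the tree above this gives "skyterra communications inc." as the current name, `:fka` as the relation, and "mobile satellite ventures" as the former name. The sketch assumes a splitter is always present; a name with no "formerly" part would need a guard around `:splitters` and `:company_alt`.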

corporation-parser's People

Contributors

xavriley
