openva / crump Goto Github PK

View Code? Open in Web Editor NEW

20.0 6.0 3.0 679 KB

A parser for the Virginia State Corporation Commission's business registration records.

Home Page: https://vabusinesses.org/

License: MIT License

Python 100.00%

virginia business open-data

crump's Introduction

Crump

A parser for the Virginia State Corporation Commission's business entity records, which are provided as a single, enormous fixed-width file. Named for Beverley T. Crump, the first member of the State Corporation Commission.

Crump retrieves the current SCC records (updated weekly) and turns them into CSV and JSON. Alternately, it can improve the quality of the data (formatting dates, ZIP codes, replacing internal status codes with human-readable translations, etc.), atomize the data into millions of individual JSON files, or create Elasticsearch-compatible bulk API data.

The most recent copy of the raw SCC data can be found at https://s3.amazonaws.com/virginia-business/current.zip.

Usage

usage: crump [-h] [-a] [-i file.txt] [-o output_dir] [-t] [-d] [-e] [-m]

optional arguments:
  -h, --help            show this help message and exit
  -a, --atomize         generate millions of per-record JSON files
  -i file.txt, --input file.txt
                        raw SCC data (default: cisbemon.txt)
  -o output_dir, --output output_dir
                        directory for JSON and CSV
  -t, --transform       format properly date, ZIP, etc. fields
  -d, --download        download the data file, if missing
  -e, --elasticsearch   create Elasticsearch bulk API data
  -m, --map             generate Elasticsearch index map

For general purposes, ./crump -td is probably the best way to invoke Crump. This will download the current data file and transform the data to make it adhere to basic data quality norms.

License

Released under the MIT License.

crump's People

Contributors

Stargazers

Watchers

Forkers

jacqui djeraseit

crump's Issues

Rename the executable

file_extractor is not a good name. Perhaps, simply, crump?

Format date fields per ISO-8601

At present, date fields are YYYYMMDD. This, of course, is not a thing. Reformat dates to comply with ISO-8601, as YYYY-MM-DD.

Do this optionally, per #45.

Establish an optional transformations stanza

This ticket is an answer to #34 ("At what point do we make changes?")

We make changes within Crump, but we do so in a manner that is optional. That means creating a stanza where all of our transformations are performed, and making that optional, under a command-line switch. I think this addresses the concern raised in #29, too.

The transformations should be represented within the YAML table maps (#44) whenever possible, to keep the logic out of the Python.

Table 8 isn't being exported at all

8_registered_names.txt is blank. I assume the line length that we're checking for is just wrong.

Change how we detect sub-files

We don't have to use line length—it turns out the first two characters of each line (e.g., 04) are the file number. Use that instead.

Eliminate REC-TYPE fields

Every file has REC-TYPE fields, indicating what file that the row belongs to. This is unnecessary once the file is broken up into multiple files. Drop this field.

Create JSON versions of each CSV file

csvkit can handle this.

Reformat field names

The field names can be made more friend, without making any serious changes.

Each file uses the same prefix for each field. For instance, 3_lp.csv has the following headers:

CORP-FORMED,CORP-ID,CORP-NAME,CORP-STATUS,CORP-STATUS-DATE,CORP-PER-DUR,CORP-INC-DATE,CORP-STATE-INC,CORP-IND-CODE,CORP-STREET1,CORP-STREET2,CORP-CITY,CORP-STATE,CORP-ZIP,CORP-PO-EFF-DATE,CORP-RA-NAME,CORP-RA-STREET1,CORP-RA-STREET2,CORP-RA-CITY,CORP-RA-STATE,CORP-RA-ZIP,CORP-RA-EFF-DATE,CORP-RA-STATUS,CORP-RA-LOC

There is no value to this when the files are broken out. (I'm not sure that there's any value to it when the files are combined, but that's neither here nor there.) Eliminate the uniform prefixes.

Also, there's no benefit to these being in all caps. If they're going to be of uniform case, they might as well be in lowercase.

Create a map of column positions and data fields

The SCC provides a PDF version of this, for which we have a crude conversion to ASCII. I've gotten started on creating a machine-readable mapping, which really just consists of turning the existing mapping into something properly structured.

Realistically, I think we'll want to break this down into a series of files, one per data file, and maybe represent it as YAML instead of CSV.

What does "RA" stand for?

I see this everywhere in field names. "CORP-RA-NAME", "CORP-RA-STREET," etc. Figure out what this is about.

Repair invalid JSON

All of the generated JSON files are broken. There's no comma separating each of the elements. The problem presumably lies in json.dump(line,json_file).

Add CSV column headers

The CSV files don't have any column headers. Add them. We have that data within the YAML, so it should be a straightforward addition.

Wrong count in LP yaml file

The second entry in https://github.com/openva/crump/blob/master/table_maps/3_lp.yaml is wrong -- it's not a separate field but part of the CORP-ID

Close gaps in fields

We're getting big gaps within text fields, like such:

{
  "res-exp-date": "2014-07-26",
  "res-zip": "L3520-9000",
  "res-street1": "2 METROPLEX DR STE 500",
  "res-street2": "",
  "res-name": "BEHAVIORAL HEALTH SYSTEMS, INC.",
  "res-city": "BIRMINGHAM",
  "res-type": "C",
  "res-number": "R155520",
  "res-state": "A",
  "res-requestor": "PATRICIA L FRIEDLEY                               BEHAVIORAL HEALTH SYSTEMS INC",
  "res-status": "60"
}

Figure out what the source of this problem is, and eliminate it.

Eliminate trailing commas

Our JSON files are being closed out with a },] instead of a }]. The trailing comma prevents the JSON from validating. I assume that there's an extra character that's preventing the find-and-replace from working properly, as we already have code that ostensibly fixes this problem.

Create AWK recipes to split up lines

For each file, it should be relatively easy to write an AWK recipe to split up each line into delimited components from its present fixed-width schema, using the map of column positions. But rather than writing a series of AWK scripts, what we should really do is automatically generate them, using the YAML table maps. That's assuming we can use YAML in Bash.

Produce individual JSON files for each record

In addition to creating the 9 JSON files, also create 1 JSON file for each record in each of those files. Or, rather, one JSON file for each record that includes a corporate ID (*-corp-id). Save each file as [corporate_id].json.

This will make it possible to join together all records from different files that pertain to a single corporation.

Converted file sizes are suspiciously small

2_corporate.txt is 227.6 MB. 2_corporate.csv is 11.5 MB. 5_officers.txt is 62.5 MB. 5_officers.csv is 882 KB. Unless I've accidentally implemented a brilliant compression algorithm, something is going missing.

Calculate the check digit in the script

Now that we know how to calculate the check digit (#46), implement that within the script. We'll have to:

add a function to calculate the check digit
know which fields are corporate IDs
figure out how best to store that (modify the corporate ID or add a new field?)

This should be done as per #45—as part of optional data-enhancement functionality.

Table 2 mapping is off

All of the columns are slightly off when table 2 is converted.

02,F000079,Phoenix Life Insurance Company,00,20110303,00000000,19040113,NY,30,ONE AMERICAN ROW,MERICAN ROW,,,,HA,"4701 COX ROAD, SUITE 285","X ROAD, SUITE 285",GLEN AL,LEN             VA23060,00,,45143S00,0,000,1,0000S0COMMO,N,,,,,,,,00000,00000000000,00000000,00000000000,00000000,00000000000,00000000,00000000000,00000000,000

I think that more than one field is wrong. Step through and debug it.

Add sanity checks to verify filesystem permissions

At present, we just barrel through the process of creating files and directories. We need to be using try/except to do this, to handle permissions problems gracefully.

Terminate each JSON files with a newline

Elasticsearch's bulk import API insists that each file end in a newline (\n), and there's no harm in ending other JSON files in a newline, too.

file_extractor.py not PEP-8 compliant

Formatting on the python script was not correct. Not a huge deal, but it will make it harder to work with python devs. The biggest problem is using tabs instead of spaces to indent which, personally, I think is the wrong way to do it, but I'm told there are Reasons. The biggest issue with the space/tab disparity is that, if you use one and everyone else uses another, copying and pasting blocks of code from, for example, Stack Overflow will cause strange errors that will be impossible to see if your editor/IDE doesn't point them out.

Represent transformations in the YAML table maps

There are some types of simple transformations that can be represented within the YAML table maps, rather than relying on logic within Python. This might be direct substitutions (e.g., F becomes foreign corporation), or it might be a pointer to a lookup table. Explore how to best implement these, and do so.

One example to look for and resolve is the "industry code," such as is found in 9_llc.yaml.

Fix YAML field names missing "RA"

When standardizing table field names in YAML (#21), I saw that "RA" seemed to be unnecessary in columns, since I found an instance when it was omitted from a foreign key reference. So I stripped it out from the field names that used it. Now that I see that "RA" stands for "Registered Agent" (#25), I see that it is clearly important. Revisit #21 to re-add "RA" to relevant field names.

What are tables and how do we use them?

Throughout the file layouts, there are references to "Table Codes." For example, in the ninth file, there are several fields that are "table codes," like "LLC Status" and "State of Formation."

I'm pretty sure that the lookup tables are all listed in the first file. Here's an excerpt:

01,00,00,TABLE NAMES
01,00,01,CORPORATE STATUS TABLE
01,00,02,STATE TABLE
01,00,03,INDUSTRY CODE TABLE
01,00,04,REGISTERED AGENT STATUS
01,00,05,COURT LOCALITY CODE
01,00,06,ANNUAL REPORT STATUS
01,00,07,ASSESSMENT INDICATOR
01,00,12,STOCK CLASS
01,00,19,AMENDMENT TYPES
01,00,23,L/P AMENDMENT & L/P CERT. OF CORRECTION TYPES
01,00,28,L/P REGISTERED AGENT STATUS
01,00,40,LLC REGISTERED AGENT STATUS
01,01,00,ACTIVE
01,01,01,FEE DELINQUENT
01,01,14,DISSOLVING
01,01,15,CANC(VOLUNTARY)  (LLC/LP/BT)
01,01,16,INVDISS/COURT
01,01,17,CANC/DOM         (CORP)
01,01,18,SURRENDERED      (CORP)
01,01,19,CANC (AUTO-FEE) LLC/LP/BT-FEES NOT PAID
01,01,20,MERGED
01,02,AK,ALASKA
01,02,AL,ALABAMA
01,02,AR,ARKANSAS
01,02,AS,AMERICAN SAMOA
01,02,AZ,ARIZONA
01,02,CA,CALIFORNIA

The first column—01—is just the file ID. We can ignore that. I'm pretty sure that the second one is the lookup table ID, which appears to resolve to the list of table IDs.

Basically, I think this is a metadata table with listing foreign keys.

Fix character encoding problems

When dealing with a couple of the CSV files, csvkit complains about character encoding problems. For example:

$ csvjson -e utf-8 5_officers.csv > 5_officers.json
Your file is not "utf-8" encoded. Please specify the correct encoding with the -e flag. Use the -v flag to see the complete error.

This is a problem with 2_corporate.csv, 5_officers.csv, 6_name.csv, and 9_llc.json.

My concern here isn't csvkit (I mean, it kind of is, but that's not the point of this issue), but instead that the provided data has character encoding problems that are providing to be a show-stopper for some purposes. Figure out how to ID and fix character encoding problems in Python.

Standardize table names in YAML

Expand tables.csv by adding a third column, like such:

id,label,key
01,Corp Status,Corporate Status

So we retain the official label, but we also get a field name we can actually do something with. This is to deal with the problems raised within #18.

Bad characters in source file

If you open up the sample source file, line 33737 (CorePoint Insurance Agency, Inc - you can search for F108854), the address has something not ASCII. I'm trying to move over to the unicode aware CVS Kit for the cvs output, but it does not like that line at all. The characters should clearly be spaces rather than whatever they are. I'm not sure how we want to deal with this. Probably worth checking that line from the non-csvkit output to see if it translated correctly.

Traceback (most recent call last):
File "/Applications/PyCharm.app/helpers/pydev/pydevd.py", line 1538, in
debugger.run(setup['file'], None, None)
File "/Applications/PyCharm.app/helpers/pydev/pydevd.py", line 1149, in run
pydev_imports.execfile(file, globals, locals) #execute the script
File "/Users/bgeiger/Envs/crump/crump/file_extractor.py", line 72, in
csv_writer.writerow(line)
File "/Users/bgeiger/Envs/crump/lib/python2.7/site-packages/csvkit/init.py", line 79, in writerow
unicsv.UnicodeCSVDictWriter.writerow(self, row)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "/Users/bgeiger/Envs/crump/lib/python2.7/site-packages/csvkit/unicsv.py", line 82, in writerow
self.writer.writerow([unicode(s if s != None else '').encode(self.encoding) for s in row])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa6 in position 3: ordinal not in range(128)

Format ZIP fields

Format ZIP fields per USPS Publication 28, as 12345-6789, or 12345. At present, they're all formatted as 123450000—no hyphen separating the ZIP and the +4, and rarely (never?) actually including the +4, but instead an erroneous 0000.

Do this optionally, per #45.

Add progress messages

After #15, no status messages are emitted anymore. Restore some verbosity.

Switching to Python and/or csvkit?

Although there's a lot to be said for the simplicity and universality of using Bash and AWK, getting to the point of reading YAML into Bash variables (#5) is perhaps the point of absurdity.

It may be time to switch to Python, instead of Bash, and use that to parse the YAML and generate system calls to awk.

In addition, it may make sense to move to csvkit for manipulating CSV data. Its in2csv tool turn fixed-width formats into CSV, although it requires its own—very simple—CSV file format to do so. (I imagine we'd need to convert our YAML the prescribed CSV format, which is a bit awkward, but not bad.) That would allow us to skip the pipe-delimited portion of the process, and go straight to CSV, and we could also use it to produce JSON.

How to negate a default-true flag?

We're using argparse.add_argument to deal with command-line switches, with a few that default to true. But how does one set those to false? I assume that this is handled natively by argparse is one way, but I don't know what it is.

Move "description" field in YAML

The present order of fields is confusing:

- name:        table-identifier
  type:        A
  start:       2
  length:      2
  description: ID for Specific Table
- name:        table-code
  type:        A
  start:       4
  length:      10
  description: Table code
- name:        table-desc
  type:        A
  start:       14
  length:      50
  description: Value for Table Code

Putting name first and description last makes it logical for a reader to group together descriptions and the following name, even though they're unrelated. Move the description field up to appear after name to prevent this confusion.

Rethink file structure

A lot of this is terribly thought out—the output directory, the table_maps directory, sticking with the cisbemon.txt filename, the random file_extractor.py filename—it's all a bit of a mess. Think this all through to establish a structure that's more sensible.

Expand status flags to be more verbose

There are a bunch of record types where a single letter is a code for something more verbose (e.g., "F = Incorporated outside VA; 1 Numeric = Incorporated in VA"). It's fine to retain this for a certain duration of improvements, but eventually we've got to find and replace all of these with more verbose information.

Stop faking the line count in the file extractor

We're faking our math in the file extractor:

FILE_LENGTH=`wc -l cisbemon.txt |cut -d " " -f 2`
HALF_LENGTH=891210

That's because I can't figure out how to round numbers in Bash. (If the file length has an odd number of lines, then half of that number would be invalid to pass to wc, which complains about decimals.) This is a particularly nasty bug, because everything would keep working for a very long time, until finally it would not.

Fix this.

Perform table 1 lookup code substitutions

There is a lot of shorthand data scattered throughout the files, and those fields are assigned a table_code value within the YAML. They can be looked up within 1_tables.csv and resolved. Seems like we ought to be performing that resolution. Perform the following steps:

Read the contents of 1_tables.csv into a dictionary
For any field for which we have a table_code value, replace the provided value with the translated value

Make sure that this transformation is performed within the transformation stanza, so that anybody who wants the raw data can have it.

Generate Elasticsearch-compatible files

Optionally, generate JSON that Elasticsearch's Bulk API can ingest. This should be a simple addition. I'm thinking that those files would be generated instead of the normal JSON files, instead of in addition to them—the program can simply be run twice if somebody wants both sets of files. But the only reason I prefer that is because I think it's going to be easier than generating two sets of files.

Add carriage returns to JSON output

At present, we don't employ carriage returns, which makes it non-trivial to divide up the 9 main JSON files for indexing by ElasticSearch. If we put each record on its own line, then we could divide up files and index them without having to even look at the content, requiring considerably less horsepower than would otherwise be necessary.

To be clear, this would be an alternative to generating literally millions of individual JSON files.

Only produce per-record JSON with a command-line switch

Producing literally millions of JSON files is a fairly demanding thing to do, and not always useful. Only do this if specifically requested, via a command-line switch.

Make output directory a configurable option

Perhaps via config file, perhaps via the command line. At present, everything goes in /output/, which is strikingly inelegant.

At what point do we make changes?

At some point, this stops being about moving SCC data into a better form, and starts being about actually improving the data and metadata. When is that? Do we really want to provide CSV that's just the fixed-width data as CSV? Or JSON that's just the fixed-width data as JSON? Do we move the cleanup process earlier in the pipeline (before CSV is generated at all), somewhere in the middle (creating raw CSV files and then improved ones), or at the end (after CSV, JSON, and whatever else is generated)?

Consider dropping .py suffix

I'm not sure that it's necessary.

Download data file if it isn't found

The most recent version can be found at https://s3.amazonaws.com/virginia-business/current.zip.

Make this a command-line switch, with a default value of true.

General extractor structure

Now that there's a basic program structure, I have some proposals for changes. I know you're still trying to get things working, and I'm absolutely against premature optimization. What I'm going to suggest uses a lot of the basic ideas but fixes some things that are listed as bugs / enhancements already and will also make things faster.

Right now we are reading/writing the really large file 10.5 times at least. Now that we have the YAML Structure definition, and we know that the lines start with which file they belong to, it makes a lot more sense just to go through the initial file once, look at each line, figure out which YAML definition goes with that line, and output the corresponding CSV line. (Eventually that will be "Put it in a data structure and export it in the format requested", but one step at a time.)

The type of parsing we're doing doesn't really require anything particularly fancy. It's "grab characters x through y and write it out to a file with a comma after / store them as item z in a dictionary", so I don't think we're gaining a lot by using optimized tools.

The things that look like will be handled or helped by the change are:
#11 is only really necessary because we're creating a bunch of intermediate files.
#10 will be hard to step through as you are shelling out to a script. Handling that in-program will make it easier to debug
#9 is handled during this change
#7 is pretty much the same situation as #10.

But it's a big change, so before I just dove in, I wanted to see what you thought. I'd hate to have two people put in a bunch of work on divergent paths if it's not what we're going for.

Create JSON subdirectories

Further to #39, we need to create one subdirectory for each file type. At the moment, the first attempt to write a file to one of the subdirectories (/output/2/) causes the program to fail. (I just created the directories—2 through 9 manually.)

Figure out what the A and N field length prefixes mean

In the record layouts, every field length is preceded with either an "A" or an "N", e.g.:

2 REC-TYPE        (A2)          Record Type = '01'

I have no idea what this means, but I assume it's important. I'm capturing this data in table_columns.csv, as its own column.

Fix the description of CORP-FORMED

3_lp.yaml has a duplicate description for the CORP-FORMED field. Devise a new description. (Identified in #29.)

Figure out how to calculate a check digit

The SCC sometimes provides a 7-digit ID, sometimes an 8-digit ID. It turns out that the eighth digit is a checksum. This is the SCC's code to calculate that checksum but, unfortunately, it's written in Natural. So we've got to figure out Natural sufficiently to recreate the code in Python.