datakind-dc / cares

This project forked from johnmccambridge/cares


US CARES Act Payment Protection Program data, cleaned for analysis

License: GNU General Public License v3.0

R 0.18% Dockerfile 0.01% Makefile 0.01% Jupyter Notebook 99.78% Python 0.02%

cares's People

Contributors

emigre459, johnmccambridge, kaogilvie, kathyxiong, kbmorales, mdshuey, rmcarder, zmousavi


cares's Issues

Cross check name and address information for consistency #7

posted by @JohnMcCambridge

  • Using Business Name, Address, City, State, and Zip, try to match each loan to a real, most-likely entry (perhaps via the Google Maps API)
  • For businesses fully validated in this way, also confirm that the NAICSCode and NonProfit fields are aligned with what public information shows

"Then some loan information is just wrong. The loan listed for Ford’s Hometown Services Inc has the company listed at 549 Grove St in Hartford, CT. But that address doesn’t exist. The company’s website says that it’s located 60 miles (100 km) away at 549 Grove St in Worcester, MA.

The Zip code listed for a loan for La Jolla Dentistry in San Diego is 91121, a Zip code that is for Pasadena, California, 110 miles away. The correct Zip code is 92121." via https://qz.com/1878225/heres-what-we-know-is-wrong-with-the-ppp-data/
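The cross-checking idea above can be sketched as a field-by-field comparison between a loan record and a geocoded reference record. This is a minimal sketch: the `geocode` callable stands in for a real service such as the Google Maps API, and the reference row and ZIP codes below are illustrative placeholders, not verified data.

```python
# Sketch: cross-check a loan's listed address against an external source.
# `geocode` is a stand-in for a real geocoding service (e.g. Google Maps API).

def check_address(loan, geocode):
    """Return the fields whose values disagree with the geocoded record,
    or None if the business could not be resolved at all."""
    listed = geocode(loan["BusinessName"])
    if listed is None:
        return None
    return [f for f in ("Address", "City", "State", "Zip")
            if loan[f].upper() != listed[f].upper()]

# Toy reference table illustrating the Ford's Hometown Services example;
# ZIP codes here are placeholders.
reference = {
    "FORDS HOMETOWN SERVICES INC": {
        "Address": "549 GROVE ST", "City": "WORCESTER",
        "State": "MA", "Zip": "01605",
    }
}
loan = {"BusinessName": "FORDS HOMETOWN SERVICES INC",
        "Address": "549 Grove St", "City": "Hartford",
        "State": "CT", "Zip": "06105"}

mismatches = check_address(loan, reference.get)
print(mismatches)  # ['City', 'State', 'Zip']
```

Fully validated businesses (an empty mismatch list) could then be checked for NAICSCode/NonProfit coherence in a second pass.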

repo restructure

Proposing a system like:

  • bin/ for in-production executable files (like the script that reads in the PPP data)
  • docs/ for references, data dictionaries, manuals, etc.
  • data/ for raw data files that scripts rely on or that others would find useful. I think tidied data files should be uploaded to Google Drive to make them easier for others to use. Please cite sources in the README!
  • code/ folder with individual project subfolders on the CARES act data. Enhancements people make can go here. Contributions should be documented in the README. For example:
    • code/NAICS/ is where scripts go for joining NAICS and PPP data
    • code/project_name/ for the next project, etc.
  • tests/ for each project's tests

but open to any organizational structure or documentation scheme!

All finalized code should be runnable on the output of the ppp data script.R file @JohnMcCambridge contributed, or on a dataframe produced by reading in a CSV file of that data.

evaluate data cleanliness of 0808 data, and attempt diff with prior adbs file

0808 is a fully refreshed dataset, from what I can see. According to the data sheet that comes with it, it excludes all cancelled or refunded loans. We should therefore work from this new dataset for all core analyses.

The old dataset does provide some interesting opportunities for diffing. However, some of the changes in the new dataset are simply fixes to the old one, and neither contains a unique identifying key for each loan, so a true loan-to-loan diff is not possible.

Duplicates

@kbmorales
I am worried this file has repeats of data already present in the individual states/territories folders.

@JohnMcCambridge commented 2 days ago
Is this a merge issue or a data issue on their part? I presume a data issue, because by definition the individual state/territory files should only contain loans up to $150k. Can you share some examples we can test?

@kbmorales commented 2 days ago
I think you're right. I found around 4k duplicates in the data but it doesn't seem to be from the 150k file.

@JohnMcCambridge commented 2 days ago
If we just use duplicated() I'm not convinced we would only be dropping true dupes, especially if applied to the <150k files. It seems entirely possible that an entry in there could be identical to another row yet still be legitimate (same loan amount, to a similar business in the same zip code, for example). Keeping in mind that many columns have NAs, it's quite easy for a false dupe to appear based on only the barest of detail (loan amount + zip code, basically). 3860cfa

@kbmorales commented 2 days ago
That's a good point, but it'd also be based on the other variables with high completeness (DateApproved, NAICSCode, Lender, and JobsRetained being the more information-rich ones), so I'm less concerned about them not being true duplicates.

While we can't confirm without the BusinessName, I personally feel fine about removing dupes with duplicated(), but I can retain them for now.

@JohnMcCambridge commented 2 days ago
For the baseline script I would suggest we stick with a minimalist/'do no harm' principle. We could also create an 'add-on' script which applies more maximalist interventions, like row drops (rather than flagging of problematic rows) and de-duping. That way users can decide which version to use when data diving.
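The flag-then-optionally-drop approach described above can be sketched in a few lines. The thread discusses R's `duplicated()`; this is a pandas equivalent, with made-up rows and a key-column list based on the high-completeness fields named above.

```python
import pandas as pd

# 'Do no harm' baseline: flag candidate duplicates; leave dropping to an
# add-on step. Column names follow the PPP data; the rows are invented.
df = pd.DataFrame({
    "LoanRange":    ["a", "a", "b"],
    "Zip":          ["60601", "60601", "60601"],
    "DateApproved": ["04/03/2020", "04/03/2020", "04/05/2020"],
    "NAICSCode":    ["722110", "722110", "541110"],
    "Lender":       ["Bank X", "Bank X", "Bank Y"],
    "JobsRetained": [10, 10, 3],
})

key_cols = ["LoanRange", "Zip", "DateApproved",
            "NAICSCode", "Lender", "JobsRetained"]

# Baseline: flag every member of a duplicate group, drop nothing
df["possible_dupe"] = df.duplicated(subset=key_cols, keep=False)

# Maximalist add-on: actually drop the repeats
deduped = df[~df.duplicated(subset=key_cols)]

print(df["possible_dupe"].tolist())  # [True, True, False]
```

Users data-diving can then choose the flagged baseline or the deduped add-on output.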

Shiny enhancements: add mouseover annotations

Add functionality to see detailed information when mousing over a given geography (something like this). Detailed info we might want to include on mouseover: the name of the geography, the numeric value of the loan amount, and the numeric value of the demographic variable. This SO thread might be a helpful starting point: https://stackoverflow.com/questions/30964020/popup-when-hover-with-leaflet-in-r

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app

Shiny enhancements: edit color bar legend

Need to edit color bar legend to display value ranges, instead of percentile ranges, that correspond to each color. Value ranges will need to be calculated from percentile ranges based on the demographic variable selected.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app
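The conversion the legend needs, from percentile breaks to value ranges, is a straight quantile lookup on whichever demographic variable is selected. A minimal sketch (the Shiny app itself is R; this illustrates only the calculation, on synthetic data):

```python
import numpy as np

# Percentile bin edges used for coloring -> value ranges for the legend.
# `values` stands in for the selected demographic variable.
values = np.random.default_rng(0).lognormal(mean=10, sigma=1, size=1000)

pct_edges = [0, 20, 40, 60, 80, 100]          # percentile breaks
val_edges = np.percentile(values, pct_edges)  # corresponding data values

labels = [f"{lo:,.0f} to {hi:,.0f}"
          for lo, hi in zip(val_edges, val_edges[1:])]
```

Recomputing `val_edges` whenever the user switches variables keeps the legend's value ranges in sync with the percentile coloring.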

Identify and attempt to correct data transposition/truncation issues

posted by @JohnMcCambridge

  • flag impacted fields
  • if possible, try and see if any of these issues spilled across rows: that would be Very Bad.

"There are 1,182 loans where numeric digits appear in the city field. Some of those are clearly spillovers or duplication from the address field. On 198 loan listings the city field contains an office suite number.

Quartz was able to identify 842 loans where what appears to be a name associated with the loan is listed in the city field. For 781 of those, the loaned amount was less than $150,000, which meant the recipient's identity was intended to be withheld by the SBA. This error appears 824 times on loans processed by Bank of America.

A loan listed under Morgan-Keller Inc. says the company is at 70 THOMAS JOHNSON DRIVE in the city of SUITE 200 FREDERICK, MD rather than 70 Thomas Johnson Drive, Suite 200 in the city of Frederick, MD, as their website indicates.

A loan listed under Volta Power Systems LLC has its location listed as SUPERIOR CT in the city of 12550 HOLLAND, MI. On what appears to be the company’s website, a contact address of 12550 Superior Ct. Holland, MI is listed.

For 600 loans the city field contains a five-digit number. For 519 loans, that number matches the listed Zip code. In the loans where those fields don’t match, there are clearly data errors. A loan given to an unnamed business with an address at JFK Airport in New York is listed as being in Michigan. Its zip code is listed as 48851. It’s certainly not a coincidence that the industry code for “Freight Transportation Arrangement” is 488510." via https://qz.com/1878225/heres-what-we-know-is-wrong-with-the-ppp-data/
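The first flagging pass described above is mostly pattern matching on the City field. A minimal pandas sketch, with illustrative rows (not real loans) modeled on the Quartz examples:

```python
import pandas as pd

# Flag city-field anomalies: any digits, a bare five-digit value, and
# five-digit city values that duplicate the Zip field.
df = pd.DataFrame({
    "City": ["FREDERICK", "SUITE 200 FREDERICK", "48851", "12550 HOLLAND"],
    "Zip":  ["21704",     "21704",               "48851", "49424"],
})

df["city_has_digit"]   = df["City"].str.contains(r"\d")
df["city_is_zip_like"] = df["City"].str.fullmatch(r"\d{5}")
df["city_matches_zip"] = df["city_is_zip_like"] & (df["City"] == df["Zip"])

print(df["city_has_digit"].tolist())    # [False, True, True, True]
print(df["city_matches_zip"].tolist())  # [False, False, True, False]
```

Rows flagged here are candidates for the harder second step: checking whether the spillover continued across row boundaries.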

Identify data 'fixes' versus new in 0808 dataset

Any entries in the 0808 dataset with record dates prior to the first dataset being circulated should be 'original' values, so we can do a like-for-like comparison of the two datasets for that specific period with much greater confidence.
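A sketch of that like-for-like comparison: restrict both files to records dated before the first release, then outer-join on the shared columns and keep rows present in only one file. The cutoff date and rows below are assumptions for illustration.

```python
import pandas as pd

first_release = pd.Timestamp("2020-07-06")  # assumed circulation date

old = pd.DataFrame({"DateApproved": ["04/03/2020", "08/01/2020"],
                    "Zip": ["60601", "60602"]})
new = pd.DataFrame({"DateApproved": ["04/03/2020", "08/05/2020"],
                    "Zip": ["60601", "60603"]})

for df in (old, new):
    df["DateApproved"] = pd.to_datetime(df["DateApproved"], format="%m/%d/%Y")

# Comparable period: records that predate the first release
old_pre = old[old["DateApproved"] < first_release]
new_pre = new[new["DateApproved"] < first_release]

# Rows appearing in only one file over that period are candidate 'fixes'
diff = (old_pre.merge(new_pre, how="outer", indicator=True)
               .query("_merge != 'both'"))
print(len(diff))  # 0
```

With no unique loan key, this comparison is only on whole-row equality over shared columns, so it flags candidates rather than proving a true diff.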

Smooth City Names

posted by @JohnMcCambridge

  • process all city names and align with best match to 'real' city (e.g., CHCAGO, CHIACAGO, CHIACGO, CHIAGO, CHICAAGO, CHICACO, CHICAFO, CHICAG, CHICAGO, CHICAGOI, CHICAGOL, CHICAGOO, CHICAGP, CHICAO, CHICARGO, CHICGAO, CHICSGO, CHIGAGO, CHIOCAGO, and CHOCAGO.)
  • validate against state and zip, to ensure we're not accidentally reassigning a real but obscure city name to a larger city elsewhere
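The first bullet is a fuzzy string-matching pass. A minimal sketch using the standard library's `difflib`; the candidate list is a tiny hypothetical stand-in for a real gazetteer, and a production pass would add the state/ZIP validation from the second bullet before accepting a match:

```python
from difflib import get_close_matches

known_cities = ["CHICAGO", "CICERO", "CHAMPAIGN"]  # placeholder gazetteer

def smooth_city(name, candidates=known_cities, cutoff=0.8):
    """Align a misspelled city name with its best-matching known city;
    leave names with no close match untouched."""
    match = get_close_matches(name.upper(), candidates, n=1, cutoff=cutoff)
    return match[0] if match else name

for raw in ["CHCAGO", "CHICARGO", "CICERO"]:
    print(raw, "->", smooth_city(raw))  # all resolve to CHICAGO / CICERO
```

The `cutoff` threshold is the main lever: too low and obscure real cities get absorbed into big ones, which is exactly what the state/ZIP check is meant to catch.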

Enhanced Zip Code Validation

posted by @JohnMcCambridge

  • check that zip codes are real, not just properly formatted
  • check that the zip code and state field are aligned
  • check that the zip code, state field, city name, source file name, and congressional district field are all in alignment

review relationship of source file and state variable

Check whether the source filename (e.g., "PPP Data up to 150k - AZ.csv") matches the values in the State variable (e.g., State == "AK")

I've added data_dictionary code to test this; the results are about as expected:

  • We see that the only source files with unmatched values are 150kplus and Other, which is coherent
  • Interestingly, entries with state value XX appear in both the 150kplus and the Other source files
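The check itself amounts to pulling the two-letter code out of each filename and comparing it with the State column. A minimal sketch, assuming the filename pattern quoted above; the rows are illustrative:

```python
import re

def file_state(filename):
    """Extract the two-letter state code from a per-state source filename;
    files like 150kplus/Other have no code and return None."""
    m = re.search(r"-\s*([A-Z]{2})\.csv$", filename)
    return m.group(1) if m else None

rows = [
    ("PPP Data up to 150k - AZ.csv", "AZ"),   # match
    ("PPP Data up to 150k - AK.csv", "AZ"),   # mismatch
    ("PPP Data 150k plus.csv",       "XX"),   # unmatched source file
]
for fname, state in rows:
    fs = file_state(fname)
    status = "unmatched" if fs is None else ("ok" if fs == state else "mismatch")
    print(fname, "->", status)
```

Grouping the mismatch and unmatched counts by source file reproduces the observation above that only 150kplus and Other lack matches, with XX entries in both.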

Improve NAICS matching

I am seeing evidence that some (all?) of the unmatched codes may be older; e.g., 722110 is a valid 2007 NAICS code. It's possible different lenders use different code-years, or perhaps everyone is using old codes. It might be best to iterate through: if a match fails, fall back to the next-newest code-year list?

Fully automated /bin scripts

As the project matures, it would be wise to fully automate the core scripts in /bin so that a user can run them without any intervention. This would ideally include pulling of the data, not just processing. This will likely require the addition of some non-R layer.

As of now, one simple issue that cannot be fully resolved is setting a valid working directory for the user. Attempts at using the here package fall short because it locks onto the working directory set when the library is first loaded (meaning that if a user is warned they need to change their working directory and they comply, later uses of the here() function will not behave as they might expect, since it will still be holding on to whatever the original working directory was). Other options, such as relying on .Rproj files, would introduce an RStudio dependency.

All of this can be resolved nicely by adding a simple non-R layer atop the final codebase.

For an example of the issue: https://hrdag.org/tech-notes/harmful.html#the-working-directory-thing

Data dictionary

Created a data dictionary that describes what is in the data, the degree of missingness, and a sense of the values within each variable

Shiny enhancements: add scatterplot

Add a scatterplot to provide another way of looking at the correlation between loan amount and demographic variables. On the scatterplot, one axis would represent the demographic variable, one axis would represent the loan amount, and the points would represent individual congressional districts/counties/states, etc.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app

Check company names against Business Type for coherence

posted by @JohnMcCambridge

"The field for business type contains checkable errors too. Excluding organizations listed as non-profits, there are 2,627 loans to organizations with “LLP” in their names which are not listed as a limited liability partnership under the business type. Similarly there are 21,287 loans to organizations with “LLC” in their names which aren’t listed as a limited liability company." via https://qz.com/1878225/heres-what-we-know-is-wrong-with-the-ppp-data/
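The Quartz check quoted above is straightforward to reproduce: flag loans whose business name contains "LLC" or "LLP" but whose business-type field disagrees. A minimal sketch with made-up rows; field values follow the PPP data's conventions as an assumption.

```python
import re

# Expected BusinessType value for each name suffix (assumed labels).
suffix_to_type = {
    "LLC": "Limited Liability Company",
    "LLP": "Limited Liability Partnership",
}

def type_mismatch(name, business_type):
    """Return a description of the incoherence, or None if name and
    business type agree (or the name carries no checkable suffix)."""
    for suffix, expected in suffix_to_type.items():
        if re.search(rf"\b{suffix}\b", name.upper()) and business_type != expected:
            return f"name says {suffix}, type says {business_type!r}"
    return None

print(type_mismatch("ACME CONSULTING LLP", "Corporation"))
print(type_mismatch("ACME CONSULTING LLC", "Limited Liability Company"))  # None
```

As in the Quartz analysis, non-profits should be excluded first, since their type field legitimately differs from what the name suffix suggests.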

Shiny enhancements: add circle legend

Need to add a legend for circle sizes (which represent loan amounts). This might be a bit tricky because the circle sizes change with zoom level. We can either fix circle sizes on the map, somehow update the legend's circle sizes as we zoom, or use another creative solution.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app

Shiny enhancements: fix default zoom level

Set map boundary box so that the desired geographic area (all U.S. or a specific state) takes up the maximum space on the map.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app
