datakind-dc / cares

This project forked from johnmccambridge/cares


US CARES Act Payment Protection Program data, cleaned for analysis

License: GNU General Public License v3.0

R 0.18% Dockerfile 0.01% Makefile 0.01% Jupyter Notebook 99.78% Python 0.02%

cares's People

Contributors

emigre459, johnmccambridge, kaogilvie, kathyxiong, kbmorales, mdshuey, rmcarder, zmousavi


cares's Issues

Cross check name and address information for consistency #7

posted by @JohnMcCambridge

  • Using Business Name, Address, City, State, and Zip, try to match each loan to a real, most-likely entry (perhaps via the Google Maps API)
  • For businesses fully validated in this way, also confirm that the NAICSCode and NonProfit fields are aligned with what public information shows

"Then some loan information is just wrong. The loan listed for Ford’s Hometown Services Inc has the company listed at 549 Grove St in Hartford, CT. But that address doesn’t exist. The company’s website says that it’s located 60 miles (100 km) away at 549 Grove St in Worcester, MA.

The Zip code listed for a loan for La Jolla Dentistry in San Diego is 91121, a Zip code that is for Pasadena, California, 110 miles away. The correct Zip code is 92121." via https://qz.com/1878225/heres-what-we-know-is-wrong-with-the-ppp-data/
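The cross-checking idea above can be sketched as a field-by-field comparison between a loan record and a geocoded reference record. This is a minimal sketch: the `geocode` callable stands in for a real service such as the Google Maps API, and the reference row and ZIP codes below are illustrative placeholders, not verified data.

```python
# Sketch: cross-check a loan's listed address against an external source.
# `geocode` is a stand-in for a real geocoding service (e.g. Google Maps API).

def check_address(loan, geocode):
    """Return the fields whose values disagree with the geocoded record,
    or None if the business could not be resolved at all."""
    listed = geocode(loan["BusinessName"])
    if listed is None:
        return None
    return [f for f in ("Address", "City", "State", "Zip")
            if loan[f].upper() != listed[f].upper()]

# Toy reference table illustrating the Ford's Hometown Services example;
# ZIP codes here are placeholders.
reference = {
    "FORDS HOMETOWN SERVICES INC": {
        "Address": "549 GROVE ST", "City": "WORCESTER",
        "State": "MA", "Zip": "01605",
    }
}
loan = {"BusinessName": "FORDS HOMETOWN SERVICES INC",
        "Address": "549 Grove St", "City": "Hartford",
        "State": "CT", "Zip": "06105"}

mismatches = check_address(loan, reference.get)
print(mismatches)  # ['City', 'State', 'Zip']
```

Fully validated businesses (an empty mismatch list) could then be checked for NAICSCode/NonProfit coherence in a second pass.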

repo restructure

Proposing a system like:

  • bin/ for in-production executable files (like the script that reads in the PPP data)
  • docs/ for references, data dictionaries, manuals, etc.
  • data/ for raw data files that scripts rely on or that others would find useful. I think tidied data files should be uploaded to Google Drive to make them easier for others to use. Please cite sources in the README!
  • code/ folder with individual project subfolders on the CARES act data. Enhancements people make can go here. Contributions should be documented in the README. For example:
    • code/NAICS/ is where scripts go for joining NAICS and PPP data
    • code/project_name/ for the next project, etc.
  • tests/ for each project's tests

but open to any organizational structure or documentation scheme!

All finalized code should be runnable on the output of the ppp data script.R file @JohnMcCambridge contributed, or on a dataframe produced by reading in a CSV file of that data.

evaluate data cleanliness of 0808 data, and attempt diff with prior adbs file

0808 is a fully refreshed dataset, from what I can see. According to the data sheet that comes with it, it excludes all cancelled or refunded loans. We should therefore work from this new dataset for all core analyses.

The old dataset does provide some interesting opportunities for diffing. However, some of the changes in the new dataset are simply fixes to the old one, and neither contains a unique identifying key for each loan, so a true loan-to-loan diff is not possible.

Duplicates

@kbmorales
I am worried this file has repeats of data already present in the individual states/territories folders.

@JohnMcCambridge commented 2 days ago
Is this a merge issue or a data issue on their part? I presume a data issue, because by definition the individual state/territory files should only contain loans up to $150k. Can you share some examples we can test?

@kbmorales commented 2 days ago
I think you're right. I found around 4k duplicates in the data but it doesn't seem to be from the 150k file.

@JohnMcCambridge commented 2 days ago
If we just use duplicated() I'm not convinced we would only be dropping true dupes, especially if applied to the <150k files. It seems entirely possible that an entry in there could be identical to another row yet still be legitimate (same loan amount, to a similar business in the same zip code, for example). Keeping in mind that many columns have NAs, it's quite easy for a false dupe to appear based on only the barest of detail (loan amount + zip code, basically). 3860cfa

@kbmorales commented 2 days ago
That's a good point, but it'd also be based on the other variables with high completeness (DateApproved, NAICSCode, Lender, and JobsRetained being the more information-rich ones), so I'm less concerned about them not being true duplicates.

While we can't confirm without the BusinessName, I personally feel fine about removing dupes with duplicated(), but I can retain them for now.

@JohnMcCambridge commented 2 days ago
For the baseline script I would suggest we stick with a minimalist/'do no harm' principle. We could also create an 'add-on' script which applies more maximalist interventions, like row drops (rather than flagging of problematic rows) and de-duping. That way users can decide which version to use when data diving.
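The flag-then-optionally-drop approach described above can be sketched in a few lines. The thread discusses R's `duplicated()`; this is a pandas equivalent, with made-up rows and a key-column list based on the high-completeness fields named above.

```python
import pandas as pd

# 'Do no harm' baseline: flag candidate duplicates; leave dropping to an
# add-on step. Column names follow the PPP data; the rows are invented.
df = pd.DataFrame({
    "LoanRange":    ["a", "a", "b"],
    "Zip":          ["60601", "60601", "60601"],
    "DateApproved": ["04/03/2020", "04/03/2020", "04/05/2020"],
    "NAICSCode":    ["722110", "722110", "541110"],
    "Lender":       ["Bank X", "Bank X", "Bank Y"],
    "JobsRetained": [10, 10, 3],
})

key_cols = ["LoanRange", "Zip", "DateApproved",
            "NAICSCode", "Lender", "JobsRetained"]

# Baseline: flag every member of a duplicate group, drop nothing
df["possible_dupe"] = df.duplicated(subset=key_cols, keep=False)

# Maximalist add-on: actually drop the repeats
deduped = df[~df.duplicated(subset=key_cols)]

print(df["possible_dupe"].tolist())  # [True, True, False]
```

Users data-diving can then choose the flagged baseline or the deduped add-on output.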

Shiny enhancements: add mouseover annotations

Add functionality to see detailed information when mousing over a given geography (something like this). Detailed info we might want to include on mouseover: the name of the geography, the numeric value of the loan amount, and the numeric value of the demographic variable. This SO thread might be a helpful starting point: https://stackoverflow.com/questions/30964020/popup-when-hover-with-leaflet-in-r

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app

Shiny enhancements: edit color bar legend

Need to edit color bar legend to display value ranges, instead of percentile ranges, that correspond to each color. Value ranges will need to be calculated from percentile ranges based on the demographic variable selected.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app
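The conversion the legend needs, from percentile breaks to value ranges, is a straight quantile lookup on whichever demographic variable is selected. A minimal sketch (the Shiny app itself is R; this illustrates only the calculation, on synthetic data):

```python
import numpy as np

# Percentile bin edges used for coloring -> value ranges for the legend.
# `values` stands in for the selected demographic variable.
values = np.random.default_rng(0).lognormal(mean=10, sigma=1, size=1000)

pct_edges = [0, 20, 40, 60, 80, 100]          # percentile breaks
val_edges = np.percentile(values, pct_edges)  # corresponding data values

labels = [f"{lo:,.0f} to {hi:,.0f}"
          for lo, hi in zip(val_edges, val_edges[1:])]
```

Recomputing `val_edges` whenever the user switches variables keeps the legend's value ranges in sync with the percentile coloring.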

Identify and attempt to correct data transposition/truncation issues

posted by @JohnMcCambridge

  • flag impacted fields
  • if possible, try and see if any of these issues spilled across rows: that would be Very Bad.

"There are 1,182 loans where numeric digits appear in the city field. Some of those are clearly spillovers or duplication from the address field. On 198 loan listings the city field contains an office suite number.

Quartz was able to identify 842 loans where what appears to be a name associated with the loan is listed in the city field. For 781 of those, the loaned amount was less than $150,000, which meant the recipient's identity was intended to be withheld by the SBA. This error appears 824 times on loans processed by Bank of America.

A loan listed under Morgan-Keller Inc. says the company is at 70 THOMAS JOHNSON DRIVE in the city of SUITE 200 FREDERICK, MD rather than 70 Thomas Johnson Drive, Suite 200 in the city of Frederick, MD, as their website indicates.

A loan listed under Volta Power Systems LLC has its location listed as SUPERIOR CT in the city of 12550 HOLLAND, MI. On what appears to be the company’s website, a contact address of 12550 Superior Ct. Holland, MI is listed.

For 600 loans the city field contains a five-digit number. For 519 loans, that number matches the listed Zip code. In the loans where those fields don’t match, there are clearly data errors. A loan given to an unnamed business with an address at JFK Airport in New York is listed as being in Michigan. Its zip code is listed as 48851. It’s certainly not a coincidence that the industry code for “Freight Transportation Arrangement” is 488510." via https://qz.com/1878225/heres-what-we-know-is-wrong-with-the-ppp-data/
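The first flagging pass described above is mostly pattern matching on the City field. A minimal pandas sketch, with illustrative rows (not real loans) modeled on the Quartz examples:

```python
import pandas as pd

# Flag city-field anomalies: any digits, a bare five-digit value, and
# five-digit city values that duplicate the Zip field.
df = pd.DataFrame({
    "City": ["FREDERICK", "SUITE 200 FREDERICK", "48851", "12550 HOLLAND"],
    "Zip":  ["21704",     "21704",               "48851", "49424"],
})

df["city_has_digit"]   = df["City"].str.contains(r"\d")
df["city_is_zip_like"] = df["City"].str.fullmatch(r"\d{5}")
df["city_matches_zip"] = df["city_is_zip_like"] & (df["City"] == df["Zip"])

print(df["city_has_digit"].tolist())    # [False, True, True, True]
print(df["city_matches_zip"].tolist())  # [False, False, True, False]
```

Rows flagged here are candidates for the harder second step: checking whether the spillover continued across row boundaries.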

Identify data 'fixes' versus new in 0808 dataset

Any entries in the 0808 dataset with record dates prior to the first dataset being circulated should be 'original' values, so we can do a like-for-like comparison of the two datasets for that specific period with much greater confidence.
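A sketch of that like-for-like comparison: restrict both files to records dated before the first release, then outer-join on the shared columns and keep rows present in only one file. The cutoff date and rows below are assumptions for illustration.

```python
import pandas as pd

first_release = pd.Timestamp("2020-07-06")  # assumed circulation date

old = pd.DataFrame({"DateApproved": ["04/03/2020", "08/01/2020"],
                    "Zip": ["60601", "60602"]})
new = pd.DataFrame({"DateApproved": ["04/03/2020", "08/05/2020"],
                    "Zip": ["60601", "60603"]})

for df in (old, new):
    df["DateApproved"] = pd.to_datetime(df["DateApproved"], format="%m/%d/%Y")

# Comparable period: records that predate the first release
old_pre = old[old["DateApproved"] < first_release]
new_pre = new[new["DateApproved"] < first_release]

# Rows appearing in only one file over that period are candidate 'fixes'
diff = (old_pre.merge(new_pre, how="outer", indicator=True)
               .query("_merge != 'both'"))
print(len(diff))  # 0
```

With no unique loan key, this comparison is only on whole-row equality over shared columns, so it flags candidates rather than proving a true diff.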

Smooth City Names

posted by @JohnMcCambridge

  • process all city names and align with best match to 'real' city (e.g., CHCAGO, CHIACAGO, CHIACGO, CHIAGO, CHICAAGO, CHICACO, CHICAFO, CHICAG, CHICAGO, CHICAGOI, CHICAGOL, CHICAGOO, CHICAGP, CHICAO, CHICARGO, CHICGAO, CHICSGO, CHIGAGO, CHIOCAGO, and CHOCAGO.)
  • validate against state and zip, to ensure we're not accidentally reassigning a real but obscure city name to a larger city elsewhere
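The first bullet is a fuzzy string-matching pass. A minimal sketch using the standard library's `difflib`; the candidate list is a tiny hypothetical stand-in for a real gazetteer, and a production pass would add the state/ZIP validation from the second bullet before accepting a match:

```python
from difflib import get_close_matches

known_cities = ["CHICAGO", "CICERO", "CHAMPAIGN"]  # placeholder gazetteer

def smooth_city(name, candidates=known_cities, cutoff=0.8):
    """Align a misspelled city name with its best-matching known city;
    leave names with no close match untouched."""
    match = get_close_matches(name.upper(), candidates, n=1, cutoff=cutoff)
    return match[0] if match else name

for raw in ["CHCAGO", "CHICARGO", "CICERO"]:
    print(raw, "->", smooth_city(raw))  # all resolve to CHICAGO / CICERO
```

The `cutoff` threshold is the main lever: too low and obscure real cities get absorbed into big ones, which is exactly what the state/ZIP check is meant to catch.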

Enhanced Zip Code Validation

posted by @JohnMcCambridge

  • check that zip codes are real, not just properly formatted
  • check that the zip code and state field are aligned
  • check that the zip code, state field, city name, source file name, and congressional district field are all in alignment

review relationship of source file and state variable

Check whether the source filename (e.g., "PPP Data up to 150k - AZ.csv") matches the values in the State variable (e.g., State == "AK")

I've added data_dictionary code to test this; the results are about as expected:

  • We see that the only source files with unmatched values are 150kplus and Other, which is coherent
  • Interestingly, entries with state value XX appear in both the 150kplus and the Other source files
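The check itself amounts to pulling the two-letter code out of each filename and comparing it with the State column. A minimal sketch, assuming the filename pattern quoted above; the rows are illustrative:

```python
import re

def file_state(filename):
    """Extract the two-letter state code from a per-state source filename;
    files like 150kplus/Other have no code and return None."""
    m = re.search(r"-\s*([A-Z]{2})\.csv$", filename)
    return m.group(1) if m else None

rows = [
    ("PPP Data up to 150k - AZ.csv", "AZ"),   # match
    ("PPP Data up to 150k - AK.csv", "AZ"),   # mismatch
    ("PPP Data 150k plus.csv",       "XX"),   # unmatched source file
]
for fname, state in rows:
    fs = file_state(fname)
    status = "unmatched" if fs is None else ("ok" if fs == state else "mismatch")
    print(fname, "->", status)
```

Grouping the mismatch and unmatched counts by source file reproduces the observation above that only 150kplus and Other lack matches, with XX entries in both.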

Improve NAICS matching

I am seeing evidence that some (all?) of the unmatched codes may be older; e.g., 722110 is a valid 2007 NAICS code. It's possible different lenders use different code-years, or perhaps everyone is using old codes. It might be best to iterate through: if a match fails, fall back to the next-newest code-year list?

Fully automated /bin scripts

As the project matures, it would be wise to fully automate the core scripts in /bin so that a user can run them without any intervention. This would ideally include pulling of the data, not just processing. This will likely require the addition of some non-R layer.

As of now, one simple issue that cannot be fully resolved is setting a valid working directory for the user. Attempts at using the here package fall short because it locks onto the working directory set when the library is first loaded (meaning that if a user is warned they need to change their working directory and they comply, later uses of the here() function will not behave as they might expect, since it will still be holding on to whatever the original working directory was). Other options, such as relying on .Rproj files, would introduce an RStudio dependency.

All of this can be resolved nicely by adding a simple non-R layer atop the final codebase.

For an example of the issue: https://hrdag.org/tech-notes/harmful.html#the-working-directory-thing

Data dictionary

Created a data dictionary that describes what is in the data, the degree of missingness, and a sense of the values within each variable

Shiny enhancements: add scatterplot

Add a scatterplot to provide another way of looking at the correlation between loan amount and demographic variables. On the scatterplot, one axis would represent the demographic variable, one axis would represent the loan amount, and the points would represent individual congressional districts/counties/states, etc.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app

Check company names against Business Type for coherence

posted by @JohnMcCambridge

"The field for business type contains checkable errors too. Excluding organizations listed as non-profits, there are 2,627 loans to organizations with “LLP” in their names which are not listed as a limited liability partnership under the business type. Similarly there are 21,287 loans to organizations with “LLC” in their names which aren’t listed as a limited liability company." via https://qz.com/1878225/heres-what-we-know-is-wrong-with-the-ppp-data/
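The Quartz check quoted above is straightforward to reproduce: flag loans whose business name contains "LLC" or "LLP" but whose business-type field disagrees. A minimal sketch with made-up rows; field values follow the PPP data's conventions as an assumption.

```python
import re

# Expected BusinessType value for each name suffix (assumed labels).
suffix_to_type = {
    "LLC": "Limited Liability Company",
    "LLP": "Limited Liability Partnership",
}

def type_mismatch(name, business_type):
    """Return a description of the incoherence, or None if name and
    business type agree (or the name carries no checkable suffix)."""
    for suffix, expected in suffix_to_type.items():
        if re.search(rf"\b{suffix}\b", name.upper()) and business_type != expected:
            return f"name says {suffix}, type says {business_type!r}"
    return None

print(type_mismatch("ACME CONSULTING LLP", "Corporation"))
print(type_mismatch("ACME CONSULTING LLC", "Limited Liability Company"))  # None
```

As in the Quartz analysis, non-profits should be excluded first, since their type field legitimately differs from what the name suffix suggests.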

Shiny enhancements: add circle legend

Need to add a legend for circle sizes (which represent loan amounts). This might be a bit tricky because the circle sizes change with zoom level. We can either fix circle sizes on the map, somehow update the legend's circle sizes as we zoom, or use another creative solution.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app

Shiny enhancements: fix default zoom level

Set map boundary box so that the desired geographic area (all U.S. or a specific state) takes up the maximum space on the map.

Related files:

  • code/maps/map_prep.R: prepares input datasets
  • code/maps/map_shiny.R: creates shiny app
