CARES

CARES Act data: PPP, EIDL and more.

Data files can be downloaded from the DataKind Google Drive.

TOC

  1. Contributing
  2. Directory Structure
  3. Data Sources
  4. PPP Data Dictionary
  5. Enhancements

Contributing

To contribute, please either fork the repo or create a development branch.

Please open a pull request and ask a fellow volunteer to review your code before merging into the master branch. If you don't know who to ask, you can always ask @JohnMcCambridge or @kbmorales.

Directory Structure

  • bin/ for production executable files
    • bin/pull_data.R authenticates with Google Drive, downloads PPP flat files
    • bin/ppp_data_merge.R reads in PPP flat files
  • code/ for individual project subfolders that work on the CARES Act data. Enhancements go here, grouped by "project". A project is any discrete enhancement to the data, such as adding NAICS code industry identifiers. All projects should be documented in the README. Project examples:
    • code/NAICS/ is where scripts go for joining NAICS and PPP data
    • code/census_mapping/ for US census joins, mapping, etc.
  • docs/ for references, data dictionaries, manuals, etc. Each project should have an accompanying docs/project_name/ folder
  • data/ for raw data files that scripts rely on or that others would find useful. Tidied data files should probably be uploaded to Google Drive instead, to make them easier for others to use. Please organize data files roughly by topic, and cite sources in the README.
  • tests/ for each project's unit tests; e.g., tests/NAICS/

All finalized code should run either on the output of the setup scripts in bin/ or on a dataset read in from a CSV file created by cleaning code.

PPP Data Dictionary

Structure

Rows: 4,885,388

Potential duplicate rows: ~4,353 (still investigating)
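A minimal sketch of one way to count potential duplicates, written in Python for illustration (the repo's own scripts are in R). It assumes a duplicate is a row whose every field matches an earlier row; the README does not state the definition actually used, and the function name and sample rows are hypothetical.

```python
from collections import Counter

def count_potential_duplicates(rows):
    """Count rows that exactly repeat another row (all fields equal)."""
    # Assumption: "duplicate" means identical values in every column.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    # A group of k identical rows contributes k - 1 potential duplicates.
    return sum(k - 1 for k in counts.values() if k > 1)

# Made-up sample rows, not real PPP records.
sample = [
    {"State": "FL", "Zip": "33101", "LoanAmount": "12000"},
    {"State": "FL", "Zip": "33101", "LoanAmount": "12000"},
    {"State": "NY", "Zip": "10001", "LoanAmount": "9000"},
]
print(count_potential_duplicates(sample))
```

Exact matching is the strictest definition; near-duplicates (for example, the same business with a differently formatted address) would need fuzzier matching.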

Variables:

Variable        n Missing   % Missing   Validation Notes
LoanRange       4224170     86.5        see notes
BusinessName    4224171     86.5        no values for loan amounts under 150K
Address         4224170     86.5        no values for loan amounts under 150K
City            1           0.0         see notes
State           0           0.0         see notes
Zip             224         0.0         see notes
NAICSCode       133527      2.7         validation pending
BusinessType    4723        0.1
RaceEthnicity   0           0.0         89.3% "Unanswered"
Gender          0           0.0         77.7% "Unanswered"
Veteran         0           0.0         84.7% "Unanswered"
NonProfit       4703708     96.3        see notes
JobsRetained    324122      6.6         see notes
DateApproved    0           0.0         earliest: 2020-04-03; latest: 2020-06-30
Lender          0           0.0
CD              0           0.0
LoanAmount      661218      13.5        no values for loan amounts over 150K
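The "n Missing" and "% Missing" columns can be reproduced with a per-column count. The following is a sketch in Python (for illustration; the repo's scripts are in R), assuming rows are read as dictionaries of strings and that empty strings and "NA" mark missing values. The function name, the set of missing markers, and the sample rows are assumptions, not taken from the repo.

```python
def missing_summary(rows, columns, na_values=("", "NA")):
    """Return {column: (n_missing, pct_missing)} for a list of dict rows."""
    total = len(rows)
    summary = {}
    for col in columns:
        # A value counts as missing if the key is absent, None,
        # or one of the assumed missing-value markers.
        n_missing = sum(
            1 for row in rows
            if row.get(col) is None or row.get(col) in na_values
        )
        pct = round(100 * n_missing / total, 1) if total else 0.0
        summary[col] = (n_missing, pct)
    return summary

# Made-up sample rows mimicking the two file types: rows from the
# state files carry LoanAmount, rows from the 150K-plus file carry LoanRange.
rows = [
    {"LoanRange": "NA", "LoanAmount": "12000", "State": "FL"},
    {"LoanRange": "NA", "LoanAmount": "800", "State": "NY"},
    {"LoanRange": "NA", "LoanAmount": "45000", "State": "TX"},
    {"LoanRange": "illustrative range label", "LoanAmount": "NA", "State": "CA"},
]
summary = missing_summary(rows, ["LoanRange", "LoanAmount", "State"])
for col, (n, pct) in summary.items():
    print(col, n, pct)
```

As a consistency check on the table itself, the LoanRange and LoanAmount missing counts (4,224,170 and 661,218) sum to the total row count of 4,885,388.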

LoanRange

LoanRange is missing from all state files, which accounts for the 86.5% missing figure; those files include the actual loan amount instead. To address this, we created a computed field, LoanRange_Unified, which assigns the precise numeric loan values from the 'Under 150K' state files to compatible groups. Within these groups, some values are improbably low, e.g.:

LoanRange        n       %
Less than Zero   1       0.0
Zero             71      0.0
Up to $10        217     0.0
$100 - $1000     26318   0.5

Additionally, we created numeric fields for other calculations, such as ranking and summing across groups: LoanRangeMin, LoanRangeMid, and LoanRangeMax.
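The grouping described above can be sketched as follows, in Python for illustration (the repo's scripts are in R). The bin edges and most labels here are hypothetical; only "Less than Zero", "Zero", "Up to $10" and "$100 - $1000" appear in this README. Treating LoanRangeMin, LoanRangeMid and LoanRangeMax as a bin's lower bound, midpoint and upper bound is likewise an assumption.

```python
# Hypothetical inclusive upper bounds and labels for positive amounts.
POSITIVE_BINS = [
    (10, "Up to $10"),
    (100, "$10 - $100"),
    (1_000, "$100 - $1000"),
    (150_000, "$1000 - $150K"),
]

def unify_loan_range(amount):
    """Return (label, range_min, range_mid, range_max) for one loan amount."""
    if amount < 0:
        # Improbable value: keep it visible, with no numeric range.
        return ("Less than Zero", None, None, None)
    if amount == 0:
        return ("Zero", 0.0, 0.0, 0.0)
    lower = 0.0
    for upper, label in POSITIVE_BINS:
        if amount <= upper:
            return (label, lower, (lower + upper) / 2, float(upper))
        lower = float(upper)
    # Amounts above the last edge fall outside the 'Under 150K' files.
    return ("Over $150K", 150_000.0, None, None)

print(unify_loan_range(500))
```

Keeping the improbable groups as their own labels, rather than dropping them, makes it easy to exclude or inspect them later.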

City

City is not a standardized field and contains open-text values, so it cannot be used as-is for any kind of geocoding or validation.

State

State contains a small number of odd values:

State   n     %     Notes
AE      1     0.0   zipcode suggests this is indeed a military address outside the US
FI      1     0.0   zipcode suggests this should be FL
XX      210   0.0
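Odd values like these can be surfaced by checking each code against the USPS two-letter abbreviations. This is a sketch in Python (for illustration; the repo's scripts are in R); the function name and category labels are hypothetical.

```python
# USPS two-letter codes: 50 states plus DC, territories, and military "states".
US_STATES = set(
    "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME "
    "MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI "
    "SC SD TN TX UT VT VA WA WV WI WY".split()
)
TERRITORIES = {"AS", "GU", "MP", "PR", "VI"}
MILITARY = {"AA", "AE", "AP"}

def classify_state(code):
    """Classify a State value as 'state', 'territory', 'military' or 'unknown'."""
    code = (code or "").strip().upper()
    if code in US_STATES:
        return "state"
    if code in TERRITORIES:
        return "territory"
    if code in MILITARY:
        return "military"
    return "unknown"

for value in ["FL", "AE", "FI", "XX"]:
    print(value, classify_state(value))
```

Under this check AE is recognized as a military code, while FI and XX are flagged as unknown and need manual review (for example, using the Zip value).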

Zip

All non-missing values are in a valid 5-digit format, but not all of them match real zip codes. Further validation pending. Note also that a valid zip code cannot necessarily be mapped to a ZCTA (ZIP Code Tabulation Area); PO Box zips are one example.
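The two levels of validation described here, format first and existence second, can be sketched in Python (for illustration; the repo's scripts are in R). The known_zips argument is a hypothetical reference set that would have to come from an external zip code list; the sample values in it are only placeholders.

```python
import re

ZIP5 = re.compile(r"\d{5}")

def check_zip(value, known_zips):
    """Return 'missing', 'bad_format', 'unknown' or 'ok' for a Zip value."""
    if value is None or value in ("", "NA"):
        return "missing"
    if ZIP5.fullmatch(value) is None:
        return "bad_format"
    # Well-formed, but it may still not be a real zip code.
    if value not in known_zips:
        return "unknown"
    return "ok"

# Placeholder reference set; a real check needs a full zip code list.
known_zips = {"33101", "10001"}
for value in ["33101", "00000", "3310", "NA"]:
    print(value, check_zip(value, known_zips))
```

Passing this check still does not guarantee a ZCTA match, so mapping work needs a separate zip-to-ZCTA crosswalk.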

NonProfit

NonProfit has only Y or NA values, so it can be assumed to be a required question in which NA stands for "no", implying actual missingness of 0%.

JobsRetained

JobsRetained contains some improbable values, and many values are zero:

JobsRetained     n        %
Less than Zero   7        0.0
Zero             554146   11.3
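A tally like the one above can be produced by grouping values before counting. This is a sketch in Python (for illustration; the repo's scripts are in R); the "Missing" and "Positive" labels and the sample values are assumptions added for completeness.

```python
from collections import Counter

def jobs_retained_group(value):
    """Assign a JobsRetained value to a coarse validation group."""
    if value is None:
        return "Missing"
    if value < 0:
        return "Less than Zero"
    if value == 0:
        return "Zero"
    return "Positive"

def tally_jobs_retained(values):
    """Return {group: (n, pct)} with percentages of all rows, to one decimal."""
    counts = Counter(jobs_retained_group(v) for v in values)
    total = len(values)
    return {g: (n, round(100 * n / total, 1)) for g, n in counts.items()}

# Made-up sample values, not real PPP records.
values = [12, 0, 0, -3, None, 45, 0, 7]
tally = tally_jobs_retained(values)
for group, (n, pct) in tally.items():
    print(group, n, pct)
```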

RaceEthnicity, Gender, and Veteran

Most values are "Unanswered" because these questions were optional.

Enhancements

NAICS Codes

Adds NAICS industry identifiers to the PPP data. See the notebook
