
An R Package that downloads Paycheck Protection Program loan data and reads it in as a tidy data set

License: GNU General Public License v3.0


ppp's Introduction

PPP

The goal of PPP is to download and assemble a unified, cleaned data set of Paycheck Protection Program loans issued in 2020.

The data is too large to share on GitHub, but this package will allow you to recreate the data locally.

The work for this package grew out of a project with DataKind’s DC chapter and the National Press Foundation.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("kbmorales/PPP")

Assembling PPP data

# Note: this function will download over 600MB of data to your local machine,
# and read a large dataframe into memory!
ppp_data <- ppp_assemble()

Data Sources

PPP Data Dictionary

Two versions of the PPP data have been released, and both are assembled into the same final structure, detailed below. The main structural difference between them is that the later release drops the JobsRetained variable and adds JobsReported. Whether this is only a renaming or reflects a change in what was measured is not yet known.

The final PPP dataset output by ppp_assemble() will contain the following columns:

| variable | in original data | notes |
|---|---|---|
| LoanRange | Yes | no values for loan amounts under 150K |
| BusinessName | Yes | no values for loan amounts under 150K |
| Address | Yes | no values for loan amounts under 150K |
| City | Yes | |
| State | Yes | |
| Zip | Yes | |
| NAICSCode | Yes | |
| BusinessType | Yes | |
| RaceEthnicity | Yes | 89.3% “Unanswered” |
| Gender | Yes | 77.7% “Unanswered” |
| Veteran | Yes | 84.7% “Unanswered” |
| NonProfit | Yes | |
| JobsRetained | Yes | only in 2020-07-06 release |
| JobsReported | Yes | only in 2020-08-08 release |
| DateApproved | Yes | |
| Lender | Yes | |
| CD | Yes | |
| LoanAmount | Yes | no values for loan amounts over 150K |
| source_file | No | name of the file the data was pulled from |
| version | No | release date of PPP data |
| LoanRange_Unified | No | loan ranges regardless of amount |
| JobsRetained_Grouped | No | brackets for # of ‘jobs retained’ |
| JobsReported_Grouped | No | brackets for # of ‘jobs reported’ |
| LoanRangeMin | No | minimum possible loan value |
| LoanRangeMax | No | maximum possible loan value |
| LoanRangeMid | No | middle estimated loan value |
| naics_lvl_1 | No | most general NAICS industry class |
| naics_lvl_2 | No | second level NAICS industry class |
| naics_lvl_3 | No | third level NAICS industry class |
| naics_lvl_4 | No | fourth level NAICS industry class |
| naics_lvl_5 | No | most specific NAICS industry class |
| NAICS_version | No | version of NAICS where code was found |
| NAICS_valid | No | was NAICS code matched? |
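The five naics_lvl_* columns mirror the NAICS hierarchy, in which a six-digit code nests a 2-digit sector, 3-digit subsector, 4-digit industry group, 5-digit NAICS industry, and 6-digit national industry. As a rough sketch of that nesting, the hypothetical helper below splits codes into their five prefixes. It is not the package's implementation, and the package's own columns may hold industry titles rather than code prefixes.

```r
# Hypothetical helper (not part of the ppp package): split six-digit NAICS
# codes into their five nested prefixes, from most general to most specific.
naics_prefixes <- function(code) {
  code <- as.character(code)
  # Only well-formed six-digit codes are split; anything else yields NA.
  valid <- !is.na(code) & grepl("^[0-9]{6}$", code)
  widths <- c(2, 3, 4, 5, 6)
  out <- lapply(widths, function(w) {
    ifelse(valid, substr(code, 1, w), NA_character_)
  })
  names(out) <- paste0("prefix_lvl_", seq_along(widths))
  as.data.frame(out, stringsAsFactors = FALSE)
}

# One valid code, one malformed code, and one missing value
naics_prefixes(c("722511", "99", NA))
```

Codes that are missing or malformed come back as NA in every column, which is roughly the situation a flag such as NAICS_valid would record.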


ppp's Issues

fallback for URL download fail

  1. naics files bundled with package?
  2. check if url is still good
  3. offer option to provide your own URL
  4. offer option to use a local file
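Items 2 to 4 could be combined into a single helper along the following lines. This is only a sketch: `fetch_with_fallback()` and its arguments are hypothetical names, not functions in the package, and the `downloader` argument exists mainly so the logic can be exercised without network access.

```r
# Hypothetical helper (not part of the ppp package): try each URL in order,
# then fall back to a user-supplied local file if every download fails.
fetch_with_fallback <- function(urls, destfile, local_file = NULL,
                                downloader = utils::download.file) {
  for (url in urls) {
    ok <- tryCatch({
      # download.file() returns 0 on success; warnings count as failures here
      status <- downloader(url, destfile, mode = "wb", quiet = TRUE)
      identical(as.integer(status), 0L)
    }, warning = function(w) FALSE, error = function(e) FALSE)
    if (ok && file.exists(destfile)) {
      return(destfile)
    }
  }
  if (!is.null(local_file) && file.exists(local_file)) {
    return(local_file)
  }
  stop("All downloads failed and no usable local file was supplied.",
       call. = FALSE)
}
```

A caller would pass the default URL first, then any user-provided URL, and optionally the path to a copy of the file already on disk.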

Set up tests

We need tests for the R package set up with testthat

Ones in particular that would be super useful:

  1. Check that the download URLs for the raw data files still return a 200 status code before attempting to download
  2. Check the number of rows before and after joins to ensure join duplication does not occur
  3. Ensure that coercion of variables in ppp_clean does not lose information (e.g., turning LoanAmount into a double doesn't turn some useful character data into NA)
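Checks 2 and 3 might look something like the testthat sketch below. The data frames are made-up stand-ins rather than real PPP or NAICS data, the set of accepted missing-value markers is an assumption, and base `merge()` stands in for whatever join the package actually uses. The URL check is left out here because it needs network access.

```r
library(testthat)

# Toy stand-ins for the real PPP and NAICS tables (values are invented).
loans <- data.frame(
  NAICSCode  = c("111110", "722511", "999999"),
  LoanAmount = c("1000.50", "20000", "N/A"),
  stringsAsFactors = FALSE
)
naics <- data.frame(
  NAICSCode = c("111110", "722511"),
  title     = c("A", "B"),
  stringsAsFactors = FALSE
)

test_that("left join on NAICSCode does not duplicate rows", {
  joined <- merge(loans, naics, by = "NAICSCode", all.x = TRUE)
  expect_equal(nrow(joined), nrow(loans))
  # A duplicated key in the lookup table is the usual cause of row inflation
  expect_false(anyDuplicated(naics$NAICSCode) > 0)
})

test_that("numeric coercion only creates NA from known missing markers", {
  amount <- suppressWarnings(as.numeric(loans$LoanAmount))
  # Values that were present as text but became NA after coercion
  newly_na <- is.na(amount) & !is.na(loans$LoanAmount)
  expect_true(all(loans$LoanAmount[newly_na] %in% c("", "N/A")))
})
```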

ppp_assemble(version = 1) --- crash!

Runs.
But no error, just no data.
Nor are the data directories set up.

(Have not traced through code yet).

I mostly use a plain terminal and nvim, without any graphical interface.

qualified names (package::FUN())

Not a biggie, but the R-pkg book (Ch 6 and Ch 13.6.1) recommends:

All of our calls to tidyverse functions have now been qualified with the name of the specific package that actually provides the function, e.g. dplyr::mutate(). There are other ways to access functions in another package, explained in chapter 13, but this is our recommended default. It is also our strong recommendation that no one depend on the tidyverse meta-package in a package. Instead, it is better to identify the specific package(s) you actually use. In this case, our package only uses dplyr.
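Applied to this package, a function written in that style might look like the sketch below. `add_loan_range_mid()` is a hypothetical example, and treating LoanRangeMid as the arithmetic midpoint of the range is an assumption, not something confirmed from the package source.

```r
# Hypothetical package function, for illustration only.
# dplyr would be listed under Imports in DESCRIPTION; no library() call and
# no dependency on the tidyverse meta-package is needed.
add_loan_range_mid <- function(df) {
  dplyr::mutate(
    df,
    # .data is the tidy-eval pronoun; in a package it is usually declared
    # with @importFrom rlang .data to keep R CMD check quiet.
    LoanRangeMid = (.data$LoanRangeMin + .data$LoanRangeMax) / 2
  )
}

add_loan_range_mid(data.frame(LoanRangeMin = 150000, LoanRangeMax = 350000))
```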
