
FEC Scraper Toolbox

The FEC Scraper Toolbox is a series of Python modules you can use to find and download electronically filed campaign finance reports housed on the Federal Election Commission website and load those reports into a database manager.

Generally, the FEC Scraper Toolbox is meant to replace the FEC Scraper repository. You might want to use the older repository, however, if you want to limit the scope of your database to include only specific committees. The default behavior of the FEC Scraper Toolbox is to download every available report whereas FEC Scraper downloads reports only for the committees you specify.

Presently, the Toolbox consists of three major modules. They are documented fully below, but in brief, they are:

  • download_reports: Downloads daily compilations of electronically filed reports and consumes an RSS feed to find and download recently filed reports.
  • parse_reports: Combines any number of reports into a single file for each form type (Schedule A, Schedule B and so on). Report header information is loaded into a database.
  • update_master_files: Downloads daily and weekly master files housing detailed information about all candidates and committees, individual contributions and contributions from committees to candidates and other committees.

FEC Scraper Toolbox was developed under Python 2.7.4; I presently run it under 2.7.6.

Requirements

The following modules are required to use FEC Scraper Toolbox. All of them except pyodbc are included with a standard Python 2.7 installation; pyodbc must be installed separately (for example, with pip install pyodbc):

  • csv
  • datetime
  • ftplib
  • glob
  • linecache
  • multiprocessing
  • os
  • pickle
  • pyodbc
  • re
  • shutil
  • time
  • urllib
  • urllib2
  • zipfile

User Settings

You can add an optional usersettings.py file to the directory housing your Python modules to customize database connection strings and file locations. Each module wraps its attempt to import this file in a try statement; default values can be specified in the except branch. (A minimal sketch of that pattern appears after the settings below.)

You can copy and paste the text below into your usersettings.py file, then specify the values you want to use.

    ARCPROCDIR = '' # Directory to house archives that have been processed
    ARCSVDIR = '' # Directory to house archives that have been downloaded but not processed
    DBCONNSTR = '' # Database connection string
    MASTERDIR = '' # Master directory for weekly candidate and committee master files
    RPTERRDIR = '' # Directory to house error logs generated when a field can't be parsed
    RPTHOLDDIR = '' # Directory to house electronically filed reports that cannot be processed
    RPTOUTDIR = '' # Directory to house data files generated by parse_reports
    RPTPROCDIR = '' # Directory to house electronically filed reports that have been processed
    RPTRVWDIR = '' # Directory to house electronically filed reports that could not be imported and need to be reviewed
    RPTSVDIR = '' # Directory to house electronically filed reports that have been downloaded but not processed
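
Here is a minimal sketch of the try/except import pattern described above, as it might appear near the top of a module. The fallback paths are purely illustrative; use whatever defaults make sense on your system:

    # Attempt to pull settings from usersettings.py; fall back to defaults.
    try:
        import usersettings
        ARCPROCDIR = usersettings.ARCPROCDIR
        ARCSVDIR = usersettings.ARCSVDIR
        DBCONNSTR = usersettings.DBCONNSTR
    except ImportError:
        ARCPROCDIR = 'C:\\data\\FEC\\ArchivesProcessed'  # illustrative default
        ARCSVDIR = 'C:\\data\\FEC\\Archives'              # illustrative default
        DBCONNSTR = ''                                    # no database by default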

download_reports Module

This module tracks and downloads all electronically filed reports housed on the Federal Election Commission website. Specifically, it ensures all daily archives of reports (which go back to 2001) have been downloaded and extracted. It then consumes the FEC's RSS feed listing all reports filed within the past seven days to look for new reports.

Electronic reports filed voluntarily by a handful of Senators presently are not included here.

This module does not load any data or otherwise interact with a database manager (though I plan to add functionality to ping a database to build a list of previously downloaded reports rather than require the user to warehouse them). Its sole purpose is to track and download reports.

If you don't want to download archives back to 2001 or otherwise want to manually control what is downloaded, you'll find commented-out code below, as well as in the module, that you can use to modify the zipinfo.p pickle (described in the first bullet point below).

This module goes through the following process in this order:

  • Uses the pickle module to attempt to load zipinfo.p, a dictionary housing the name of the most recent archive downloaded as well as a list of files not downloaded previously. Commented-out code available below and in the module can be used to modify this pickle if you want to control which archives will be retrieved.
  • Calls build_prior_archive_list to construct a list of archives that already have been downloaded and saved to ARCPROCDIR or ARCSVDIR.
    NOTE: I plan to deprecate this function. I added this feature for development and to test the implementation of the zipinfo.p pickle. Using the pickle saves a lot of time and disk space compared to warehousing all the archives.
  • Calls build_archive_download_list, which processes the zipinfo.p pickle to build a list of available archive files that have not been downloaded.
  • Uses multiprocessing and calls download_archive to download each archive file. These files are saved in the directory specified with the ARCSVDIR variable. After downloading an archive, the subroutine compares the length of the downloaded file with the length of the source file. If the lengths do not match, the file is deleted from the file system. The subroutine tries to download a file up to five times. (A sketch of this download-and-verify retry logic appears after this list.)
    NOTE: You can set the NUMPROC variable in the user variables section to specify the number of downloads that occur simultaneously. The default value is 10.
  • Uses multiprocessing and calls unzip_archive to extract any files in the archive that have not been downloaded previously. The second parameter is an overwrite flag; existing files are overwritten when this flag is set to 1. Default is 0.
    NOTE: You can set the NUMPROC variable in the user variables section to specify the number of worker processes that run simultaneously. The default value is 10.
  • Again calls build_prior_archive_list to reconstruct the list of archives that already have been downloaded and saved to ARCPROCDIR or ARCSVDIR.
    NOTE: As stated above, this feature is slated for deprecation.
  • Calls pickle_archives to rebuild zipinfo.p and save it to the same directory as this module.
  • Calls build_prior_report_list to build a list of reports housed in RPTHOLDDIR, RPTPROCDIR and RPTSVDIR.
    NOTE: I plan to add a function that can build a list of previously processed files using a database call rather than combing the file system (though it will remain necessary to look in the RPTHOLDDIR and RPTSVDIR directories to find files that have not been loaded into the database).
  • Calls consume_rss, which uses a regular expression to scan an FEC RSS feed listing all electronically filed reports submitted within the past seven days. The function returns a list of these reports.
  • Calls verify_reports to test whether filings flagged for download by consume_rss already have been downloaded. If so, the function verifies that the length of the downloaded file matches the length of the file posted on the FEC website. When the lengths do not match, the saved file is deleted and the report is retained in the download list.
  • Uses multiprocessing and calls download_report to download each report returned by verify_reports. After downloading a report, the subroutine compares the length of the downloaded file with the length of the source file. If the lengths do not match, the file is deleted from the file system. The subroutine tries to download a file up to five times.
    NOTE: You can set the NUMPROC variable in the user variables section to specify the number of downloads that occur simultaneously. The default value is 10.
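
The download-and-verify logic used by download_archive and download_report can be sketched roughly as follows. The function below is an illustration, not the module's actual code; the real subroutines add logging and other housekeeping:

    import os
    import urllib2

    def download_with_retry(url, destpath, maxtries=5):
        # Try up to maxtries times; keep the file only if its size matches
        # the Content-Length reported by the FEC server.
        for attempt in range(maxtries):
            response = urllib2.urlopen(url)
            expected = int(response.info().getheader('Content-Length', '0'))
            with open(destpath, 'wb') as destfile:
                destfile.write(response.read())
            if expected == 0 or os.path.getsize(destpath) == expected:
                return True
            os.remove(destpath)  # lengths do not match; delete and retry
        return False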

Modifying the zipinfo Pickle

Here is the commented-out code available in the download_reports module that you can use to manually control the zipinfo.p pickle if you don't want to download all available archives back to 2001:

    # Set mostrecent to the last date you DON'T want, so if you want
    # everything since Jan. 1, 2013, set mostrecent to: '20121231.zip'
    zipinfo['mostrecent'] = '20121231.zip' # YYYYMMDD.zip
    zipinfo['badfiles'] = [] # You probably want to leave this blank
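
If you would rather adjust the pickle outside the module, a small standalone script like this one (assuming zipinfo.p sits in the same directory) accomplishes the same thing:

    import pickle

    # Load the existing pickle, or start fresh if it does not exist yet.
    try:
        with open('zipinfo.p', 'rb') as infile:
            zipinfo = pickle.load(infile)
    except IOError:
        zipinfo = {}

    zipinfo['mostrecent'] = '20121231.zip'  # last archive you DON'T want
    zipinfo['badfiles'] = []                # files to skip; usually left empty

    with open('zipinfo.p', 'wb') as outfile:
        pickle.dump(zipinfo, outfile)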

parse_reports Module

This module is the workhorse of the Toolbox. It parses all downloaded reports in a specified directory and saves child rows for each subform type into a single data file for easy import into a database manager.

One of the main challenges with parsing electronically filed reports is that each form (presently there are 56) has its own column headers. What's more, the layout of each form has been through as many as 13 iterations, each of which also can have its own headers.

The parse_reports module handles this by examining the two header rows atop each electronically filed report to determine the form type and version of that report. (The current version is 8.1.) If the form type and version are supported by the parser, the columns in each data row are mapped to standardized columns. Generally speaking, the standardized column headings largely mimic a form's version 8.0 headings, though a few legacy columns have been retained.

You can modify the standardized columns at any time by manipulating the outputhdrs variable. All data for any columns not included in this variable are dropped from the data stream and are not included in the output files. The headers for each version of each form type are housed in the filehdrs variable in this format:

    [[formtype-1, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]],
     [formtype-2, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]],
     ...
     [formtype-n, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]]]
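
Here is a rough sketch of how a structure like that can be used to look up the header list for a given form type and version. The form type, version numbers and column names below are illustrative only, not the module's actual values:

    # Illustrative filehdrs entry: one header list shared by versions 6.4
    # and 7.0 of Schedule A, and another for versions 8.0 and 8.1.
    filehdrs = [
        ['SA', [[['6.4', '7.0'], ['FORMTYPE', 'FILERID', 'CONTRIBLASTNAME']],
                [['8.0', '8.1'], ['form_type', 'filer_id', 'contributor_last_name']]]],
    ]

    def headers_for(formtype, version):
        # Return the version-specific header list, or None if unsupported.
        for ftype, versionsets in filehdrs:
            if ftype == formtype:
                for versions, headers in versionsets:
                    if version in versions:
                        return headers
        return None

    print headers_for('SA', '8.1')  # ['form_type', 'filer_id', 'contributor_last_name']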

While all child rows are parsed and saved to delimited text files, the two header rows for each file are loaded into a database manager. By default, the database manager is SQL Server, and the module calls a series of stored procedures to load the header rows into the database.
However, you can easily modify this behavior to use a different database manager or save the header rows to their own text files.

The main reason to use the module to load the headers is that the database manager can verify each report is not already in the database.
Eventually, I will add my database structure and stored procedure code to this repository to make it easier to port the functionality to interact with other database managers.

The parser supports the following form types:

  • Form 1, Statement of Organization: Includes Form1S and Text records
  • Form 3, Report of Receipts and Disbursements: Filed by Congressional candidates; includes all schedules and Text records.
  • Form 3L, Report of Contributions Bundled by Lobbyists/Registrants: Includes all schedules and Text records.
  • Form 3P, Report of Receipts and Disbursements: Filed by presidential and vice-presidential candidates; includes all schedules and Text records.
  • Form 3X, Report of Receipts and Disbursements: Filed by all committees other than the principal campaign committees of Congressional, presidential and vice-presidential candidates; includes all schedules and Text records.

Reports not recognized by parse_reports are moved to the directory specified by the RPTHOLDDIR variable.

This module goes through the following process in this order:

  • Calls build_list_of_supported_report_types, which examines the list housed in the filehdrs variable to determine which types of electronically filed reports can be parsed by the module.
  • Calls create_file_timestamp, which creates a timestamp string that is affixed to the filename of each data file generated by the module.
  • Creates an output file for each type of child row data (one for Schedule A, one for Schedule B, one for Text and so on) and writes the column headers to each file. These files are saved in the directory specified by RPTOUTDIR. The module also generates an "Other Data" file, where rows the module can't write to other data files are saved.
  • Appends various full name fields to header lists. These fields were used in older electronic filings until the FEC decided to split names across multiple fields. These extra headers are appended here because the module attempts to parse these names and does not write the full name fields to the output data files. If a name can't be parsed, it is saved to the appropriate last name field.

From this point, the module iterates over each electronic filing saved in the directory specified by RPTSVDIR. For each file, the module:

  • Saves the six-digit filename as ImageID. This value is prepended to every child row so those rows can be mapped to the parent header row.
  • If the ImageID is contained in BADREPORTS, the module moves the file to the directory specified by RPTHOLDDIR, then proceeds to the next electronic filing.
  • Extracts the file header. The file header contains basic information about the file, such as the header version, the software used to generate the report, the delimiter used for names and a date format string. For most forms, this information is constrained to the first line of the file. But header versions 1 and 2 for all form types use multi-line file headers. When a multi-line file header is detected, the module scans from the top of the file until it finds and reads the entire header.
  • Extracts the report header, which is always contained on only one line, immediately below the file header.
  • Checks to see whether the default delimiter specified by DELIMITER can be found anywhere in the file header. If not, the default delimiter is changed to a comma.
  • Extracts the report type (e.g., F3PA, F3XN, F3T) and an abbreviated report type (e.g., F3P, F3X, F3) from the report header. The last letter of a Form 3 report type indicates whether the report is a New, Amended or Termination report. If the module does not support that report type, it moves the file to the RPTHOLDDIR and proceeds to the next electronic filing. (A simplified sketch of these header-parsing steps appears after this list.)
  • Extracts the version number from the file header.
  • Parses the file header. Custom code is used for header versions 1 and 2 while the parse_data_row function is used for all later, single-line headers.
  • Creates a dictionary to house data for the report header, adds a key for each column, then calls parse_data_row for all versions.
  • If for some reason the delimiter used for names is unknown, the module attempts to determine the delimiter.
  • Calls a custom function for each form type to validate the report header data, then calls load_report_header to load that data into a database manager. If the data can't be validated, the module will fail. If the data is valid but can't be loaded into the database (either because of an error or because the report already exists in the database), the module moves the file to the directory specified by RPTRVWDIR and proceeds to the next electronic filing.
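
A simplified sketch of the first few steps described above, covering the ImageID, the delimiter check and the report type. The helper below is illustrative; the module's actual code handles far more cases, including the multi-line file headers used by header versions 1 and 2:

    import os

    DELIMITER = chr(28)  # ASCII 28, used by recent filings; falls back to a comma

    def read_headers(filepath):
        # The six-digit filename (e.g. '123456.fec') supplies the ImageID.
        imageid = os.path.basename(filepath).split('.')[0]
        with open(filepath, 'r') as infile:
            fileheader = infile.readline()
            reportheader = infile.readline()
        # If the default delimiter never appears in the file header,
        # assume the report is comma-delimited instead.
        delimiter = DELIMITER if DELIMITER in fileheader else ','
        # The first field of the report header is the report type (e.g. F3XN);
        # a trailing N, A or T means New, Amended or Termination.
        reporttype = reportheader.split(delimiter)[0].strip().strip('"')
        return imageid, reporttype, delimiter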

As noted elsewhere, FEC Scraper Toolbox uses SQL Server as the default database manager and uses stored procedures to interact with the database. However, this is an issue only in terms of loading header data. All other data is saved to delimited text files you can load into any database manager. (The default delimiter is a tab, which you can change by setting the OUTPUTDELIMITER variable.)

The main reason to load the headers from within the Python module is to verify an electronic report does not already exist in the database. It also ensures a valid parent-child relationship exists before any child rows are loaded. Saving all data, including the headers, to flat files and importing everything in one pass would be faster, but querying the database and adding one header at a time is a trade-off made to protect the integrity of the database.

Once the header has been parsed and loaded into the database, the module creates a list to hold each type of child row (one list for Schedule A data, one for Schedule B data and so on). The module then iterates over the file, skipping the headers, and processes each child row as follows:

  • The module removes double spaces from the data. If OUTPUTDELIMITER is set to a tab, the module also converts tabs to spaces.
  • The data row is converted to a list.
  • The module looks at the first element of the list to determine the row's form type. If the type can't be determined, the row is written to the "Other Data" file.
  • The module calls populate_data_row_dict to build a dictionary to house the output data, mapping the version-specific headers of the data row to the headers used in the data output file.
  • The module calls a form type-specific function to validate and clean the data.
  • Full name fields, if any, are removed from the data row.
  • The module calls build_data_row to convert the dictionary to a list.
  • The list is appended to the list created to house that type of data.

Once the module has finished iterating over the data for an electronic report, each line of each form type-specific list is converted to a delimited string and written to the appropriate data file before the module proceeds to the next file.
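
In outline, the buffer-then-flush approach described above looks something like the function below. The helper is illustrative and self-contained; the real module also maps version-specific columns through populate_data_row_dict and build_data_row and validates each row before writing it:

    def write_child_rows(childrows, delimiter, outfiles, otherfile, outputdelim='\t'):
        # outfiles maps a form type (e.g. 'SA') to an open output file.
        buffers = {}
        for line in childrows:
            # Remove double spaces (the module also converts tabs when needed).
            fields = line.rstrip('\r\n').replace('  ', ' ').split(delimiter)
            rowtype = fields[0].strip().strip('"').upper()  # e.g. 'SA11AI'
            formtype = rowtype[:2]  # crude form-type guess for illustration
            if formtype not in outfiles:
                otherfile.write(line)  # row type not recognized
                continue
            buffers.setdefault(formtype, []).append(fields)
        # Flush each buffer to its data file once the report is fully parsed.
        for formtype, rows in buffers.items():
            for row in rows:
                outfiles[formtype].write(outputdelim.join(row) + '\n')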

At the end of the module, you'll see a call to a SQL Server stored procedure called usp_DeactivateOverlappingReports. (Again, I plan to post all my SQL Server code in this repository very soon.) Briefly, this stored procedure addresses the problem of reports with different but overlapping date ranges. Rather than delete amended reports, I use database triggers to flag the most recent report as Active and to deactivate all earlier reports whenever the date ranges are exactly the same. But there are dozens and perhaps hundreds of cases where reports with different date ranges overlap. This stored procedure addresses this problem by scrubbing all headers in the database each time this module is run.

update_master_files Module

This module can be used to download and extract the master files housed on the FEC website. The FEC updates the candidate (cn), committee (cm) and candidate-committee linkage (ccl) files daily. The FEC updates the individual contributions (indiv), committee-to-committee transactions (oth) and committee-to-candidate transactions (pas2) files every Sunday evening.

Warehousing Master Files

By default, the update_master_files module archives the compressed master files (but not the extracted data files). This was done to preserve the source data during development and because the FEC overwrites the data files.

To disable this behavior, set the value of the ARCHIVEFILES user variable (see the user variables section near the top of the module) to zero. When ARCHIVEFILES is set to any value other than one (1), the master files are not archived.

The script assumes you will run it daily and late enough in the day to make sure the FEC has created the daily data files before you try to process them. At the time of this writing, the FEC updates the three daily files around 7:30 a.m. and the weekly files around 4 p.m. If you're scheduling a daily job to run this script, I recommend you schedule it for late evening to give yourself plenty of leeway.

By default, the script will ignore the weekly files if the script is not run on a Sunday. To change this behavior and download the Sunday files on a different day of the week, set the OMITNONSUNDAYFILES variable near the top of the script to 0.

How the update_master_files Module Works

This module goes through the following process in this order:

  • Calls delete_data to remove all .txt and .zip files from the working directory specified by the MASTERDIR variable. (The .zip files from the previous execution of the script will be in this directory only if they were not archived when the script was last run.)
  • Uses the local date on the machine running the code to calculate the current election cycle and determine whether the files are being downloaded on a Sunday. If the weekday is not Sunday and OMITNONSUNDAYFILES is set to 1, the script will ignore all master files except for the candidate (cn), committee (cm) and candidate-committee linkage (ccl) files. (A sketch of this date logic appears after this list.)
  • Uses multiprocessing and calls download_file to download each master file specified by the MASTERFILES user variable. (By default, all nine master files are downloaded.) These files are saved in the directory specified by the MASTERDIR variable.
    After downloading a file, the subroutine compares the length of the downloaded file with the length of the source file on the FEC website. If the lengths do not match, the file is deleted from the file system. The subroutine tries to download a file up to five times.
    NOTE: You can set the NUMPROC variable in the user variables section to specify the number of downloads that occur simultaneously. The default value is 10.
  • Uses multiprocessing and calls unzip_master_file to extract the data files from each master file. If the extracted filename does not include a year reference, the subroutine appends a two-digit year.
    NOTE: You can set the NUMPROC variable in the user variables section to specify the number of worker processes that run simultaneously. The default value is 10.
  • When the ARCHIVEFILES user variable is set to 1, the module calls the archive_master_files subroutine, which creates a YYYYMMDD directory for the most recent Sunday date (if that directory does not already exist) and moves all .zip files in the MASTERDIR directory to the new directory.
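
The date logic mentioned in the second bullet amounts to ordinary calendar arithmetic. Here is a sketch, with the nine master-file prefixes written out; the module's actual variable names may differ:

    import datetime

    today = datetime.date.today()
    # Election cycles are two-year periods ending in even years,
    # so an odd year belongs to the following year's cycle.
    cycle = today.year + (today.year % 2)

    is_sunday = (today.weekday() == 6)  # Monday == 0 ... Sunday == 6

    OMITNONSUNDAYFILES = 1
    if OMITNONSUNDAYFILES == 1 and not is_sunday:
        masterfiles = ['cn', 'cm', 'ccl']  # daily files only
    else:
        masterfiles = ['add', 'ccl', 'changes', 'cm', 'cn',
                       'delete', 'indiv', 'oth', 'pas2']  # all nine files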

About the Master Files

The FEC recreates three of the master files daily and the remaining six every Sunday evening. Each time the files are generated, they overwrite the previously posted files. The archive filenames include a two-digit year to identify the election cycle, but the files housed in those archives often do not. For that reason, this module appends a two-digit election cycle to extracted filenames that do not include a year reference.

There are nine compressed files for each election cycle. The FEC publishes a data dictionary for each file on its website; in brief, the files are:

  • add: New Individual Contributions
    Lists all contributions added to the master Individuals file in the past week.
    Files extracted from these archives are named addYYYY.txt. Generated every Sunday.
  • ccl: Candidate-Committee Linkage
    Houses all links between candidates and committees that have been reported to the FEC. Strangely, this file does not include candidate ties to Leadership PACs, which are reported on Form 1.
    Files extracted from these archives are named ccl.txt. Generated daily.
  • changes: Individual Contribution Changes
    Lists all transactions in the master Individuals file that have been changed during the past month.
    Files extracted from these archives are named chgYYYY.txt. Generated every Sunday.
  • cm: Committee Master File
    Lists all committees registered with the FEC for a specific election cycle. Among other information, you can use this file to see a committee's FEC ID, name, address and treasurer. You can use the Committee Designation field (CMTE_DSGN) to find a specific committee type (such as principal campaign committees, joint fundraisers, lobbyist PACs and leadership PACs). Additionally, you can look for code O in the Committee Type field (CMTE_TP) to identify independent expenditure-only committees, commonly known as super PACs.
    Files extracted from these archives are named cm.txt. Generated daily.
  • cn: Candidate Master File
    Lists all candidates registered with the FEC for a specific election cycle. You can use this file to see all candidates who have filed to run for a particular seat as well as information about their political parties, addresses, treasurers and FEC IDs.
    Files extracted from these archives are named cn.txt. Generated daily.
  • delete: Deleted Individual Contributions
    Lists all contributions deleted from the master Individuals file in the past week.
    Files extracted from these archives are named delYYYY.txt. Generated every Sunday.
  • indiv: Individual Contributions
    For the most part, lists all itemized contributions of $200 or more made by INDIVIDUALS to any committee during the election cycle. Does not include most contributions from businesses, PACs and other organizations.
    Files extracted from these archives are named itcont.txt. Generated every Sunday.
  • oth: Committee-to-Committee Transactions
    Lists contributions and independent expenditures made by one committee to another.
    Files extracted from these archives are named itoth.txt. Generated every Sunday.
  • pas2: Committee-to-Candidate Transactions
    Lists contributions made by a committee to a candidate.
    Files extracted from these archives are named itpas2.txt. Generated every Sunday.

Using the indiv, add, changes and delete files

The indiv file generated every Sunday for each election cycle is comprehensive, meaning it is a snapshot, as of that Sunday, of all individual contributions in the database. Many users simply drop and recreate this table every week. If you do this, you do not need the add, changes and delete files. The FEC generates these files to provide an alternative to rebuilding the indiv table each week, simply because of the sheer volume of data it houses.

I presently don't use any of these files and instead rely on the raw filings themselves, which are available on the FEC website as soon as they're filed and contain more data than the indiv files. But many journalists I know who do use these files say they just rebuild the indiv table each week because it's easier and less error-prone than trying to patch it.

If you decide to use the add, changes and delete files rather than rebuild the indiv table each week, just be aware that if you ever miss a weekly download, you will have to rebuild the indiv table.

Next Steps

I house all of my campaign-finance data in a SQL Server database and tend to use SQL Server Integration Services packages to load the data generated/extracted by the FEC Scraper Toolbox. I do this because of the sheer volume of the data (the table housing all Schedule A data, for example, contains more than 120 million rows so far) and because I can use SSIS to bulk load the data rather than load it from Python a row at a time.

The lone exception is the parse_reports module. That module attempts to load a report's header row into the database to test whether that report previously has been loaded. All child rows in new reports are parsed and moved to separate data files (one for Schedule A, one for Schedule B and so on).

Understandably, the lack of code to load the data makes the Toolbox less attractive to some potential users, who must manually import the data files or develop their own Pythonic means of doing so.
Nevertheless, the modules presently provide very fast and efficient means of downloading massive quantities of data, managing that data and preparing it for import.

At some point, I plan to develop Python functions to handle the data imports, and of course I welcome any contributions from the open-source community. I also plan to open source my entire database design so others can recreate it in any database manager they choose.

Stay tuned!
