
sc_salary_data's Introduction

Salary Data Explorations

This is a project to explore the SC Salary data that is provided through the state's Transparency Portal.

There are currently two sets of files in the raw_data directory: salary CSV files and state employee counts by agency. Files from before July 2019 were pulled from Archive.org's copy of the site; at some points the CSV file links appear to have been broken, so some files may be missing. Please also note that the SC State Accountability Portal provides a careful description of the limitations of this information, including agencies that are not included or that do not provide full details about all compensation. Review that source fully to understand the accuracy limits of this data.

Data Files

There are currently two sets of data in this project: salary data as disclosed by the state government for people making more than $50,000 a year, and a count of employees by agency as provided by the state. The two data sets should not be expected to match: the state does not include a disclaimer stating that only employees making more than $50,000 are included in the count, so the counts appear to cover all employees. The employee counts are included here because they should, in theory, make it easier to create rough estimates of how these lower-paid employees affect the calculation of averages.
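As a rough illustration of that kind of estimate, the sketch below blends the average of the disclosed salaries with an assumed average for the employees who are counted but not disclosed. The function name, its parameters, and every number in the usage note are hypothetical; nothing here comes from the project's scripts or from the state's data.

```python
def blended_average(total_employees, disclosed_employees,
                    disclosed_average, assumed_undisclosed_average):
    """Estimate an agency-wide average salary.

    Hypothetical helper: the undisclosed average is a guess supplied by
    the caller, because the state does not publish those salaries.
    """
    if total_employees <= 0:
        raise ValueError("total_employees must be positive")
    if not 0 <= disclosed_employees <= total_employees:
        raise ValueError("disclosed_employees must be between 0 and the total")
    undisclosed_employees = total_employees - disclosed_employees
    total_pay = (disclosed_employees * disclosed_average
                 + undisclosed_employees * assumed_undisclosed_average)
    return total_pay / total_employees
```

For example, with made-up figures of 100 employees, 40 of them disclosed at an average of $80,000, and an assumed $30,000 average for the other 60, `blended_average(100, 40, 80000, 30000)` gives 50000.0. The result is only as good as the assumed figure for the undisclosed group.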

Scripts

This project provides a few simple scripts to help acquire files over time and clean them up.

get_salary_data.py

Checks for a new version of the salary data file on the admin.sc.gov site and adds it to the raw_data directory if one is found. This script is designed to run as a daily cron job and assumes the page layout and links use the same markup they have used since 2015 (one day this will be an invalid assumption).

combine_files.py

Takes all .csv files in the raw_data directory and combines them into the processed.json file in order to provide the full historic data set.
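A minimal sketch of this kind of merge is shown below. It is not the project's combine_files.py: the function name, the flat list-of-rows JSON layout, and the added source_file key are all assumptions made for illustration, and the real structure of processed.json may differ.

```python
import csv
import json
from pathlib import Path


def combine_csv_files(raw_dir, output_path):
    """Merge every .csv file in raw_dir into one JSON list of row dicts.

    Hypothetical layout: each row keeps its CSV columns as strings and
    gains a "source_file" key naming the file it came from.
    """
    rows = []
    # Sort so the output order is stable from run to run.
    for csv_path in sorted(Path(raw_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = csv_path.name
                rows.append(row)
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, indent=2)
    return len(rows)
```

Recording the source file on each row is one way to keep the historical dimension once everything is in a single file. The UTF-8 encoding is also an assumption about the state's CSV exports.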

get_emp_data.py

Downloads the employee count reports, which are currently provided in PDF format. It is designed to run as a daily cron job and assumes the link doesn't change and that the state does not switch to CSV (which could cause naming conflicts with the salary data).

Licensing

The included License applies only to the other material in this project. The salary information (provided by the SC State government) and the content in the raw_data directory (which is unaltered from the source except for the file names) are both public domain.

sc_salary_data's People

Contributors

acrosman, andrefpoliveira, ianfindlay, jaystag, sanchitjain123

sc_salary_data's Issues

Clean up file names

The file names in the raw_data directory have become inconsistent. They should be updated so that each %20 is replaced with a space.

Fix file naming on extraction

The process in get_salary_data.py is doing a nice job of pulling data from the state's servers on a regular basis. But #8 is caused by the fact that it's not handling file names well. File names should be consistent when saved to disk. Please update the process to make sure the files are consistently named, and artifacts like %20 from the URL are properly decoded.
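One way to decode such names, sketched here under assumptions, is to take the last path segment of the download URL and percent-decode it with the standard library. The helper name and the example URL are hypothetical, and this is not the code in get_salary_data.py.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def local_file_name(url):
    """Return a decoded file name for saving a downloaded URL to disk.

    Hypothetical helper: the last path segment is taken first, then
    decoded, so an encoded slash (%2F) cannot add directories.
    """
    encoded_name = PurePosixPath(urlsplit(url).path).name
    return unquote(encoded_name)
```

With a made-up URL such as `https://example.com/files/FOIA%20Transparency%20Salary%20Data%201.2021.csv`, this returns `FOIA Transparency Salary Data 1.2021.csv`.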

Fix Code Standard

This project should conform to Flake8 but doesn't in some places.

Fix combine_files.py

Currently, attempting to merge files into the processed.json file using combine_files.py triggers the following:

$ python combine_files.py 
/path/to/project/raw_data/FOIA Transparency Salary Data 1.2021.csv
Traceback (most recent call last):
  File "combine_files.py", line 34, in <module>
    file_date = re.match('.*([0-9]{8}).*csv$', listing).group(1)
AttributeError: 'NoneType' object has no attribute 'group'
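The traceback shows re.match returning None because the file name contains "1.2021" rather than eight consecutive digits. A possible guard is sketched below. It assumes a second naming pattern of month.year and normalizes that to a YYYYMM string; both are guesses based on this one file name, and the function name is hypothetical.

```python
import re

# Pattern from the traceback, with the dot before "csv" made literal.
EIGHT_DIGIT = re.compile(r".*?([0-9]{8}).*\.csv$")
# Assumed second pattern, e.g. "... Data 1.2021.csv" (month.year).
MONTH_YEAR = re.compile(r".*?([0-9]{1,2})\.([0-9]{4})\.csv$")


def extract_file_date(listing):
    """Return a date token from a file name, or None if none is found."""
    match = EIGHT_DIGIT.match(listing)
    if match:
        # The eight digits are returned as-is, without interpretation.
        return match.group(1)
    match = MONTH_YEAR.match(listing)
    if match:
        month, year = match.groups()
        return f"{year}{int(month):02d}"
    # Callers should skip or report the file instead of calling .group().
    return None
```

Returning None for unrecognized names lets the caller decide whether to skip the file or stop with a clearer message than an AttributeError.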
