Giter Site home page Giter Site logo

etl-project's Introduction

ETL-project Proposal

  • Data Sources
  • Data Cleanup
  • Database
  • Data Analysis

Data Sources

  1. NFL teams looked up and then typed into excel, saved as csv
  2. NFL team stats from https://www.nfl.com/stats/team-stats/
  3. https://www.profootballnetwork.com/nfl-power-rankings-end-of-regular-season-week-17-2020/

Data Cleaning

  1. We typed the data of each NFL team name manually (sometimes this is just easier!)

  2. Data was extracted for NFL teams were aggregated for each team and their stats into a data base by looping through each team name and scraping the list of stats from the website. Some of the stats used hyphens as joiners, but they were read as new line characters, so we decided to use string replace method instead. We then stored each team stats table in its own csv file. After converting the data to a csv, in a different jupyter file the csv was pulled. The team_id column was made by sorting the table by team name alphabetically, then making an index and incrementing by one. We then had to create good column names that are effectively used in SQL and rewrite our jupyter notebook to be able to use these column names instead. In order to aggregate every teams’ stats into one database, we first looped through every team’s individual csv file and pulled either the home or opponent column for the_all_stats or allowed_all_stats tables, respectively. These were then appended together using pandas. Next, we added the team_id column to the dataframe. Since our data always stayed in the same order as our teams csv, we did this just by adding one to the index. The last step of cleaning this data was to reorder the columns to get the team_id first. Finally, the dataframes were written to csv files.

  3. The data for rankings was pretty clean, we had to gather each line from the html which had team, record, and rank, then split them using the .split() function. The Philidelphia Eagles ranking didn't work right because of a typo, but since it was the only one that was broken we just manually wrote in the info for it. From here the team name was dropped, leaving team_id, ranking, and record. This data could be further cleaned by adding the team records somewhere more central to the database.

Database

Since all our data was based on the NFL teams; we decided to use a relational database. This let us store the team names in a table with a 'team_id', and then have a column 'team_id' in the rest of our tables to reference this instead of storing the team names every time.

Our database has a team’s table that contains the columns 'team_id' and 'team_name'. The team_id is how we then refer to the team in the other tables. The all_stats table contains the stats for every team from nfl.com. Some examples of the stats are first downs, touchdowns, passing yards, etc. The allowed_all_stats tables contains the same kind of data as all_stats, but it also contains data for how many first downs, touchdowns, passing yards, etc., that each team allowed. The pro_fb_net_rankings table has the rankings from ProFootballNetwork, as well as each team's record.

Possibilities for Analysis:

Developers can use this data in many ways. The nfl_stats_rankings table is an example of how the data can be used. the jupyter notebook file stat_based_rankings.ipynb uses the all_stats data to calculate a ranking for each team.

Developers could use this data to identify how much time during the game each team had the ball- average possession time vs yards or throwing touchdowns. This would show which offense moved the fastest.

Data could be used to make comparisons of number of offensive plays vs touch downs between teams. Additionally, data can help make projections about certain teams based on this data for upcoming season. Identify teams who may use a "bend not break defense" – may give up a ton of yards but not a lot of points.

Team members: Britney Hopkins, Ryan Hathaway, Jet-Lania Simpson, Connor Mackensie, and Marisabel Matta-Hyams

etl-project's People

Contributors

amerikonnor avatar mayan5 avatar jetsimpson avatar hathawayryan avatar bhopkins21 avatar

Watchers

 avatar

Forkers

bhopkins21

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.