ETL-project Proposal

Data Sources
Data Cleanup
Database
Data Analysis

Data Sources

NFL teams looked up and then typed into excel, saved as csv
NFL team stats from https://www.nfl.com/stats/team-stats/
https://www.profootballnetwork.com/nfl-power-rankings-end-of-regular-season-week-17-2020/

Data Cleaning

We typed the data of each NFL team name manually (sometimes this is just easier!)
Data was extracted for NFL teams were aggregated for each team and their stats into a data base by looping through each team name and scraping the list of stats from the website. Some of the stats used hyphens as joiners, but they were read as new line characters, so we decided to use string replace method instead. We then stored each team stats table in its own csv file. After converting the data to a csv, in a different jupyter file the csv was pulled. The team_id column was made by sorting the table by team name alphabetically, then making an index and incrementing by one. We then had to create good column names that are effectively used in SQL and rewrite our jupyter notebook to be able to use these column names instead. In order to aggregate every teams’ stats into one database, we first looped through every team’s individual csv file and pulled either the home or opponent column for the_all_stats or allowed_all_stats tables, respectively. These were then appended together using pandas. Next, we added the team_id column to the dataframe. Since our data always stayed in the same order as our teams csv, we did this just by adding one to the index. The last step of cleaning this data was to reorder the columns to get the team_id first. Finally, the dataframes were written to csv files.
The data for rankings was pretty clean, we had to gather each line from the html which had team, record, and rank, then split them using the .split() function. The Philidelphia Eagles ranking didn't work right because of a typo, but since it was the only one that was broken we just manually wrote in the info for it. From here the team name was dropped, leaving team_id, ranking, and record. This data could be further cleaned by adding the team records somewhere more central to the database.

Database

Since all our data was based on the NFL teams; we decided to use a relational database. This let us store the team names in a table with a 'team_id', and then have a column 'team_id' in the rest of our tables to reference this instead of storing the team names every time.

Our database has a team’s table that contains the columns 'team_id' and 'team_name'. The team_id is how we then refer to the team in the other tables. The all_stats table contains the stats for every team from nfl.com. Some examples of the stats are first downs, touchdowns, passing yards, etc. The allowed_all_stats tables contains the same kind of data as all_stats, but it also contains data for how many first downs, touchdowns, passing yards, etc., that each team allowed. The pro_fb_net_rankings table has the rankings from ProFootballNetwork, as well as each team's record.

Possibilities for Analysis:

Developers can use this data in many ways. The nfl_stats_rankings table is an example of how the data can be used. the jupyter notebook file stat_based_rankings.ipynb uses the all_stats data to calculate a ranking for each team.

Developers could use this data to identify how much time during the game each team had the ball- average possession time vs yards or throwing touchdowns. This would show which offense moved the fastest.

Data could be used to make comparisons of number of offensive plays vs touch downs between teams. Additionally, data can help make projections about certain teams based on this data for upcoming season. Identify teams who may use a "bend not break defense" – may give up a ton of yards but not a lot of points.

Team members: Britney Hopkins, Ryan Hathaway, Jet-Lania Simpson, Connor Mackensie, and Marisabel Matta-Hyams

amerikonnor / etl-project Goto Github PK

etl-project's Introduction

ETL-project Proposal

etl-project's People

Contributors

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent