This is an unofficial Python-based parser for www.basketball-reference.com that lets NBA and statistics enthusiasts observe and analyze NBA data.
The goal of this repository is to provide easy-to-execute methods to access NBA player and team data.
Our hope is that everyday fans like us can use this repo to perform statistical analyses and build models that can:
- Regress team wins by player biodata (e.g. win percentage vs. height, age, etc.)
- Generate comprehensive statistics to determine the GOAT (greatest of all time) and GOED (greatest of each decade)
- Assess anomalies in betting lines
and much more.
The two easiest ways to use this code are to run it in the Google Colab environment or to set up your own virtualenv.
The following notebooks have been provided if you prefer to use Google Colab:
- single_player_search.ipynb: scrape biodata and basic (per game, total) and advanced (per-minute, per-possession, per-play, shooting, salary) season-wide statistics for single players, as well as gamelogs for that player
- Example: all major Kobe Bryant Data Tables (including season and playoffs) and gamelogs
- basketball-reference-scraper.ipynb: scrape the same data as single_player_search.ipynb but for all players in the entire database
- Note: this issues tens of thousands of HTTP requests and takes many hours to complete, but the data can be saved so that you only have to run it once
Follow the Hitchhiker's Guide to Python to set up a virtual environment. Most of the packages used ship with Python 3, but you may need to install the following two with these commands:
BeautifulSoup: pip3 install bs4
tqdm: pip3 install tqdm
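Before running the scraper, it can help to confirm that both packages are importable from the active environment. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name and is not part of this repo.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 and tqdm are the two third-party imports listed above
missing = missing_packages(["bs4", "tqdm"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, rerun the matching `pip3 install` command inside the same virtual environment.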
kb_meta, kb_data, kb_gamelogs = single_player_scraper('Kobe Bryant')
kb_meta
kb_data
kb_gamelogs
This step only needs to be completed once.
ROOT = <set/path/to/repo>
df_players_meta, df_players_data, df_players_gamelogs = main(ROOT)
After scraping, you can access the data by unpickling the DataFrames:
players_df_meta = pickle_load(DATA_PATH+'players_df_meta.pkl')
players_df_data = pickle_load(DATA_PATH+'players_df_data.pkl')
players_df_gamelogs = pickle_load(DATA_PATH+'players_df_gamelogs.pkl')
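Because the full scrape takes hours, a common pattern is to load the saved pickle when it exists and only rebuild it otherwise. The sketch below illustrates that idea with the standard `pickle` module; `load_or_build` and `build_fn` are hypothetical names, and this is not necessarily how the repo's own `pickle_load` is implemented.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Load a pickled object from `path`, or build it once with `build_fn` and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()  # stands in for the slow scraping step
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Usage with a stand-in for the scraper: the second call reads the saved file
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "players_df_meta.pkl")
    first = load_or_build(cache_path, lambda: {"players": 3})   # builds and saves
    second = load_or_build(cache_path, lambda: {"players": 0})  # loads the saved copy
    print(first == second)
```

In real use, `build_fn` would wrap the scraping call and `path` would point inside your data directory. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.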
# Player Meta Query
df_large = df_players_meta.loc[(df_players_meta['height']>80) &
(df_players_meta['weight']>30)]
df_large = df_large.dropna(how='all', axis='columns') # drops all columns that are empty
display(df_large[df_large['weight'] == df_large['weight'].max()])
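To see what the query above does without running the scraper, here is a self-contained sketch on a toy DataFrame. The rows are made up for illustration, `player_name` and `college` are hypothetical column names, and the units (inches, pounds) are an assumption based on the thresholds used in the query.

```python
import pandas as pd

# Toy stand-in for df_players_meta; the values are invented for illustration.
df_players_meta = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],  # hypothetical column
    "height": [84, 78, 82],          # assumed to be inches
    "weight": [250, 190, 275],       # assumed to be pounds
    "college": [None, None, None],   # an entirely empty column
})

# Keep players above both thresholds, then drop columns with no values at all
mask = (df_players_meta["height"] > 80) & (df_players_meta["weight"] > 30)
df_large = df_players_meta.loc[mask].dropna(how="all", axis="columns")

# Row(s) with the maximum weight among the remaining players
heaviest = df_large[df_large["weight"] == df_large["weight"].max()]
print(heaviest)
```

`print` is used here in place of `display`, which is only available inside a notebook.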
Select any of the following data types for different table fields:
- Per Game (per_game)
- Totals (totals)
- Advanced
- Per Minute
- Per Possession
- Adjusted Shooting
- Play-By-Play
- Shooting
- All-Star
- Salaries
The following is an example that filters for Per Game statistics:
table_selected = players_df_data[players_df_data['data_type'] == 'per_game']
table_selected[table_selected['pts_per_g'] == np.nanmax(table_selected['pts_per_g'])]
players_df_gamelogs[players_df_gamelogs['pts']==np.nanmax(players_df_gamelogs['pts'])]
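The same filter-then-maximum pattern can be tried on a toy table. In this sketch the rows are invented and `player_name` is a hypothetical column; only `data_type` and `pts_per_g` come from the examples above. It uses `Series.max()`, which skips NaN values by default and so plays the same role as `np.nanmax` here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for players_df_data; the values are invented for illustration.
players_df_data = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player B", "Player B"],  # hypothetical column
    "data_type": ["per_game", "totals", "per_game", "per_game"],
    "pts_per_g": [21.5, np.nan, 30.1, np.nan],
})

# Step 1: keep only the rows for the chosen table type
table_selected = players_df_data[players_df_data["data_type"] == "per_game"]

# Step 2: keep the row(s) whose scoring average equals the NaN-aware maximum
top_season = table_selected[
    table_selected["pts_per_g"] == table_selected["pts_per_g"].max()
]
print(top_season)
```

Comparing against the maximum returns every row tied for the top value, which is usually what you want when several seasons or games share a record.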
This project was built entirely by Rahim Hashim and Aunoy Poddar. None of this could have been done without the tireless and comprehensive effort of those who work at Basketball Reference providing an open-source, API-friendly database containing millions of datapoints from which the entirety of this codebase is built.