NBA Standings Scraper

This project scrapes NBA standings pages for a given list of seasons and pushes the standings into a MySQL database.

Files included

scrape_nba.py - Default script which, when provided with a list of years on the command line, will scrape the NBA standings websites for those years and load the data into a MySQL database.

Scraper.py - Module containing the web scraper. Takes in raw HTML through the scrape_nba_html(year, html) method and produces a Python dict containing the standings data via the get_standings() method.

NBAUploader.py - Module containing the class which uploads standings data produced by Scraper to a database. Takes a database handle via the constructor and uploads with the upload_standings(standings_dict) method.

test.py - Basic test driver for the web scraper.

schema.sql - SQL schema for the MySQL database.

data.sql - SQL dump of the loaded data for 2012-2013, 2013-2014, and 2014-2015 NBA standings.

data.json - JSON dump of the loaded data for 2012-2013, 2013-2014, and 2014-2015 NBA standings.

Setup Instructions

This project requires that python3, pip, and virtualenv are installed on the host system. To install the dependencies for this project, run the following commands in a Unix shell:

$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

To set up the required MySQL database, apply the schema in schema.sql to a running MySQL database. This can be done using the MySQL command line client:

$ mysql -u <username> -h <hostname> -p <database_name> < schema.sql

The data.sql file can also be loaded into the database in a similar fashion:

$ mysql -u <username> -h <hostname> -p <database_name> < data.sql

Default usage

To use the default scrape_nba.py script, you must define the login details of the MySQL datastore. These should be set by exporting the following environment variables (assuming a unix shell environment):

$ export NBA_DB_HOST=<Hostname of the MySQL DB> 
$ export NBA_DB_USER=<your MySQL username>
$ export NBA_DB_PASSWORD=<your MySQL password>
$ export NBA_DB_NAME=<Name of the MySQL DB>
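
For reference, a minimal sketch of how these variables might be read from Python (the helper name is invented; this is not the actual scrape_nba.py code):

```python
import os

# Hypothetical helper: reads the connection settings exported above.
# The environment variable names match the exports; db_config() itself
# is an assumption for illustration only.
def db_config():
    return {
        "host": os.environ["NBA_DB_HOST"],
        "user": os.environ["NBA_DB_USER"],
        "passwd": os.environ["NBA_DB_PASSWORD"],
        "db": os.environ["NBA_DB_NAME"],
    }
```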

To scrape individual NBA pages using the default script, activate the virtual environment created above and provide a list of years on the command line to scrape_nba.py, for example:

$ source venv/bin/activate
$ python3 scrape_nba.py 2012 2013 2014

This will pull data for each of the years 2012, 2013, and 2014, and then push the scraped data into the MySQL database specified by the NBA_DB_* environment variables.
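
The exact URL construction lives inside scrape_nba.py, but the standings pages follow a predictable pattern (the same one visible in the Library usage example further down). A hypothetical helper that builds the page URL for a season might look like this:

```python
# Hypothetical helper (not part of the project): builds the standings
# page URL for a given season, following the pattern of the pages the
# default script scrapes.
def standings_url(year):
    return ("http://www.nba.com/standings/%s/"
            "team_record_comparison/conferenceNew_Std_Div.html" % year)
```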

Library usage

Two classes are provided, Scraper and Uploader. HTML pulled from the NBA standings website is passed to the Scraper.scrape_nba_html(year, html) method. Below is a basic example of how you would use these classes in your own code:

from NBAUploader import Uploader
from Scraper import Scraper
import MySQLdb 
import urllib.request

urls = {
	'2014' : 'http://www.nba.com/standings/2014/team_record_comparison/conferenceNew_Std_Div.html',
	'2013' : 'http://www.nba.com/standings/2013/team_record_comparison/conferenceNew_Std_Div.html',
	'2012' : 'http://www.nba.com/standings/2012/team_record_comparison/conferenceNew_Std_Div.html'
}

scraper = Scraper()
for year in urls:
	request = urllib.request.Request(urls[year])
	response = urllib.request.urlopen(request)
	# Must pass raw HTML (decoded to str) into the scraper object
	scraper.scrape_nba_html(year, response.read().decode('utf-8'))

# Create database handle and pass it into uploader
dbh = MySQLdb.connect('localhost', 'myuser', 'mypass', 'mydbname')
try:
	uploader = Uploader(dbh)
	uploader.upload_standings(scraper.get_standings())
finally:
	dbh.close()

To use the Uploader module, a database connection that adheres to the Python Database API specification (PEP 249) must be passed into the constructor of the object:

dbh = MySQLdb.connect(host, username, password, dbname)
try:
	uploader = Uploader(dbh)
	...

Afterwards, pass upload_standings the output from the get_standings() method on Scraper, or alternatively a dict of the format:

{
	'<year>' : {
		'<conference name>' : {
			'<division name>' : {
				'<team name>' : {
					'wins' : <wins no.>,
					'losses' : <losses no.>,
					'pct' : <PCT stat>,
					'gb' : <GB stat>,
					'conf_wins' : <conf wins>,
					'conf_losses' : <conf losses>,
					'div_wins' : <division wins>,
					'div_losses' : <division losses>,
					'home_wins' : <home wins>,
					'home_losses' : <home losses>,
					'road_wins' : <road wins>,
					'road_losses' : <road losses>,
					'l10_wins' : <last ten wins>,
					'l10_losses' : <last ten losses>,
					'streak' : <streak>
				},
				...
			},
			...
		},
		...
	},
	...
}
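
As a concrete illustration, a single-team entry in that format might look like the following. Every value, including the conference and division key names, is invented for the example:

```python
# Illustrative standings dict with one team; all numbers are made up.
standings = {
    "2014": {
        "Eastern Conference": {
            "Atlantic Division": {
                "Toronto Raptors": {
                    "wins": 48, "losses": 34,
                    "pct": 0.585, "gb": 0.0,
                    "conf_wins": 32, "conf_losses": 20,
                    "div_wins": 12, "div_losses": 6,
                    "home_wins": 26, "home_losses": 15,
                    "road_wins": 22, "road_losses": 19,
                    "l10_wins": 5, "l10_losses": 5,
                    "streak": "W 2",
                },
            },
        },
    },
}
```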

Testing

I have provided a basic test driver for the Scraper module. It passes two example HTML files into the Scraper module and tests the output it generates against the content of the two JSON files expected_output_1.json and expected_output_2.json. To run the test driver, run the following command:

$ python3 -m unittest test.py
