
nflscrapr's People

Contributors

dutta, maksimhorowitz, michaelm91, ryurko, tanho63


nflscrapr's Issues

season_player_game - subscript errors

Running the code below:
> temp <- nflscrapR::season_player_game(2017, Weeks = 11)
gives this error:
Error in nfl.json[[1]] : subscript out of bounds
However, running this with Weeks = 10 retrieves the data.

Digging in a little I suspect it might be related to the code inside the function season_player_game() below;

if (Weeks %in% 3:15) {
game_ids <- game_ids[1:(16*Weeks)-1]
} else if (Weeks %in% 1:2) {
game_ids <- game_ids[1:(16*Weeks)]
}

If we wanted the first 11 weeks using the first clause, we'd get 16 * 11 - 1 = 175 games, which ends at game_id = 2017112611, the last of the Sunday games in Week 12. Using Weeks = 10 we get all the games up to the end of Week 11, so the games are off by a week.

I've submitted a pull request with my changes which removes the above calculations entirely and has extracting_gameids() take a LastWeek parameter to stipulate the week to scrape last.
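
A small base-R check of what the quoted index expression evaluates to may help here. The game ids below are hypothetical placeholders, not scraped values. The sketch suggests the subset really does contain 175 ids, so the off-by-a-week result presumably comes from assuming 16 games in every week rather than from the arithmetic itself.

```r
# Hypothetical stand-in for the scraped game ids (a full 256-game season).
game_ids <- sprintf("game_%03d", 1:256)
Weeks <- 11

# `:` binds tighter than binary minus, so this is (1:176) - 1, i.e. 0:175.
idx_as_written <- 1:(16 * Weeks) - 1

# R silently drops index 0, so the subset still holds the first 175 ids.
subset_as_written <- game_ids[idx_as_written]
length(subset_as_written)

# The count assumes 16 games per week. Weeks with byes have fewer games,
# so the 175th id can belong to a later week than the one requested.
```

Selecting games by their actual week, as the pull request's LastWeek parameter does, avoids relying on a fixed number of games per week.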

gameplayer function

Just want to let you know that after you have changed your -playergame- command to -player_game-, you have not made the change in the -season_player_game- command, which still calls -playergame- instead of -player_game-. Also, the example on github homepage of nflscrapR still demonstrates with playergame and season_playergame commands. Great package, by the way!

How to produce "first down conversion" field?

In the play-by-play, it appears the "FirstDown" field includes plays in which the offense did not convert for a first down. Using the fields already in the dataset (i.e. Touchdown, yds.gained, PenalizedTeam), is it possible to create an indicator variable equal to "1" only when the offense successfully converts and "0" otherwise?
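
One possible way to derive such an indicator, as a rough sketch only: treat a play as a conversion when the yards gained cover the distance needed, or when it ends in a touchdown. The `ydstogo` column name and the toy rows are assumptions for illustration, and first downs awarded by penalty are deliberately ignored here.

```r
# Toy play-by-play rows. `ydstogo` is an assumed column name; the values
# are made up for illustration.
pbp <- data.frame(
  ydstogo    = c(10, 3, 5, 1),
  yds.gained = c(4, 3, 12, 0),
  Touchdown  = c(0, 0, 1, 0)
)

# 1 when the gain covers the distance to go or the play scores, else 0.
pbp$converted <- as.integer(pbp$yds.gained >= pbp$ydstogo | pbp$Touchdown == 1)
pbp$converted
```

Handling penalties would need an extra rule based on PenalizedTeam, which this sketch leaves out.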

Suggestion: More Informative Error Message for "Cannot Open the Connection"

I know there's a closed issue where someone asked about the error message below here.

Error in file(con, "r") : cannot open the connection

And I think this has caused some confusion here as well, based on bnjcbsn's comment.

I was also pretty confused at one point on this before I found the first issue thread above. I think if some error-handling code is added to throw a more explanatory error message when the NFL API returns a connection error, there will be less confusion and fewer mis-diagnosed issues among users.
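
A minimal sketch of what such error handling could look like in base R. The wrapper name `with_clear_connection_error` and the wording of the message are hypothetical, and `fetch` stands in for whatever function the package uses to read from a URL.

```r
# Hypothetical wrapper: run `fetch(url)` and, if it fails, re-raise the
# error with the URL and a hint about the likely cause.
with_clear_connection_error <- function(fetch, url) {
  tryCatch(
    fetch(url),
    error = function(e) {
      stop(
        "Could not retrieve data from ", url, ". ",
        "The data source may be unavailable or the game id may not exist. ",
        "Original error: ", conditionMessage(e),
        call. = FALSE
      )
    }
  )
}

# Usage with a stand-in fetcher that always fails, so no network is needed.
failing_fetch <- function(url) stop("cannot open the connection")
msg <- tryCatch(
  with_clear_connection_error(failing_fetch, "http://example.invalid/game.json"),
  error = function(e) conditionMessage(e)
)
msg
```

Keeping the original message in the new one means existing searches for "cannot open the connection" still find the relevant threads.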

Missing plays for 2017 week 1

There is missing information for week 1 in the play-by-play data. For example, Stefon Diggs shows up as having 6 receptions on 7 targets if you query the play-by-play data, but in the 'stats' section in the json document for the game, he is listed as having 7 receptions (http://www.nfl.com/liveupdate/game-center/2017091100/2017091100_gtd.json).

Is there an accepted methodology for dealing with missing data? Also, since the json file has totals for each player is it possible to internally check for missing plays?
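
An internal check along these lines could be sketched as follows: count receptions per player from the play-by-play rows, then compare against the totals in the stats section. The data frames are toy stand-ins. Only the Diggs figures (6 in the play-by-play, 7 in the stats) come from the report above; "Player.B" and the column layout are made up.

```r
# Toy play-by-play: one row per completed pass.
pbp <- data.frame(
  Receiver = c(rep("S.Diggs", 6), rep("Player.B", 4)),
  stringsAsFactors = FALSE
)

# Toy per-player totals, as they might be read from the stats section.
box <- data.frame(
  Receiver = c("S.Diggs", "Player.B"),
  box_rec  = c(7, 4),
  stringsAsFactors = FALSE
)

# Count receptions per player in the play-by-play.
pbp_rec <- as.data.frame(
  table(Receiver = pbp$Receiver),
  responseName = "pbp_rec",
  stringsAsFactors = FALSE
)

# Join the two sources and flag players whose totals disagree.
check <- merge(box, pbp_rec, by = "Receiver")
check$missing <- check$box_rec - check$pbp_rec
flagged <- check[check$missing != 0, ]
flagged
```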

Thanks,
Eric

Player positions?

This package is AWESOME.

I'm very new to the football stats community and I'd love to use this package to make predictions of players performance within given positions. However, it doesn't seem like player position is available anywhere within the package?

I've already tried some painstaking string matching with other packages but it's quite messy...Any chance that player position will be available in the near future as part of the data?

nflteams data frame could use updating

For example, the Rams are listed as the St. Louis Rams and the Chargers are still the San Diego Chargers. Adding a column for season could allow for all teams and abbreviations going as far back as the data can be scraped.
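
A sketch of how a season-aware lookup might be laid out. The table name, column names, abbreviations, and the use of 2009 as the first scraped season are all assumptions for illustration; only the two relocations (Rams from the 2016 season, Chargers from the 2017 season) are taken as given.

```r
# Illustrative lookup: one row per team name with the seasons it applies to.
# Inf marks a name that is still in use.
teams_by_season <- data.frame(
  abbr = c("STL", "LA", "SD", "LAC"),
  name = c("St. Louis Rams", "Los Angeles Rams",
           "San Diego Chargers", "Los Angeles Chargers"),
  first_season = c(2009, 2016, 2009, 2017),
  last_season  = c(2015, Inf, 2016, Inf),
  stringsAsFactors = FALSE
)

# Return the team name valid for a given abbreviation and season.
team_name <- function(abbr, season) {
  hit <- teams_by_season$abbr == abbr &
    season >= teams_by_season$first_season &
    season <= teams_by_season$last_season
  teams_by_season$name[hit]
}

team_name("LA", 2017)
team_name("SD", 2016)
```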

Player data 2017

I see there is player stat data only up to 2016. Is 2017 in the works? I would be more than happy to help compile them if the work hasn't been done.

Add more game-level metadata to season_games()

The following game-level metadata would be helpful for many public football researchers:

  • In which stadium was the game played?
  • What was the weather at kickoff (and/or at other points in the game)?
  • Who won the coin toss? What did they choose?

To do this, consider updating the season_games() function so that the resulting data frames would include the above information and existing data from the season_games() output or in the play-by-play data (e.g. date, home team, away team, game ID, etc).

Passer column in play-by-play sometimes empty

Hi Max,

sorry to report another issue I just found out...
It seems to be deeply related to issue #14.
If the passer in a play-by-play row has his first name shortened to more than one letter, the value in the "Passer" column is incorrectly set to "NA".

An example of this can be seen in the second play of the 2016010305 game:

> pbp <- game_play_by_play(2016010305)
> pbp[2,]$desc
[1] "(15:00) (Shotgun) Jo.Freeman pass incomplete deep left to T.Hilton."
> pbp[2,]$Passer
[1] NA
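
For illustration, a pattern that allows more than one letter before the dot picks up the passer in this description. This is a sketch only: the helper name `extract_passer` is hypothetical, and it is not the parsing code the package actually uses.

```r
# Hypothetical helper: find a "First.Last" style name directly followed by
# " pass", allowing abbreviations longer than one letter (e.g. "Jo.").
extract_passer <- function(desc) {
  pattern <- "[A-Z][A-Za-z]*\\.[A-Z][A-Za-z'-]+(?= pass)"
  m <- regexpr(pattern, desc, perl = TRUE)
  out <- rep(NA_character_, length(desc))
  out[m > 0] <- regmatches(desc, m)  # fill only the rows that matched
  out
}

desc <- "(15:00) (Shotgun) Jo.Freeman pass incomplete deep left to T.Hilton."
extract_passer(desc)
```

Descriptions without a pass, such as a run play, return NA under this pattern.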

Connection errors trying to pull data

When I attempt to pull play by play data for any season I am getting the following error:

Error in file(con, "r") : cannot open the connection

The code I am running is as follows:

playByPlay.2009 <- season_play_by_play(Season = 2009)

Any help resolving this would be greatly appreciated.

Thanks,

Brian

2017 play-by-play error

I tried to obtain the data for week 1 of the 2017 season. Received the following error:

pbp_2017 <- season_play_by_play(2017)
Loading required package: XML
Loading required package: RCurl
Loading required package: bitops
Error in nfl.json[[1]]$drives[[x]]$plays :
$ operator is invalid for atomic vectors

season_games Week parameter

The help for v1.4.0 lists a Week parameter for season_games, but the source code does not include this parameter. Is there a script or recommended practice for getting a whole week's worth of data?
During the NFL season, that is a common use case. Thanks!

Assign the passer role to the quarterback being sacked

Incredible contribution to our community of football statheads! Just a suggestion for a little tweak: assign the passer role to the quarterback being sacked. Currently, the passer column is blank in plays that ended with a sack. That makes it harder to evaluate passers' overall performance, since it has been shown that sacks are mostly the quarterback's responsibility. Thank you.

Inaccurate GoalToGo variable

The GoalToGo column is always 1 when the offense is within ten yards of the endzone, and is always 0 otherwise.

pbp_data$yrdline100_factor <- cut(pbp_data$yrdline100, breaks=c(0,10, 100))

table(pbp_data$GoalToGo, pbp_data$yrdline100_factor)
    (0,10] (10,100]
  0      0    43135
  1   2886        0

This is not correct.

I highlighted some cases where a team committed an offensive penalty and so was backed up for goal-to-go outside the ten yard line in a code notebook
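
One possible alternative definition, sketched on toy rows: flag goal-to-go when the distance to a first down equals the distance to the goal line, instead of using a fixed ten-yard cutoff. The `ydstogo` column name and the values are assumptions for illustration.

```r
# Toy rows: yards to the goal line and yards needed for a first down.
pbp <- data.frame(
  yrdline100 = c(8, 8, 18, 45),
  ydstogo    = c(8, 3, 18, 10)
)

# Goal-to-go when the line to gain is the goal line itself.
pbp$goal_to_go <- as.integer(pbp$ydstogo == pbp$yrdline100)
pbp$goal_to_go
```

The third row shows the penalty case described above: goal-to-go from the 18, which a ten-yard cutoff would miss. The second row is inside the ten but has a first down available short of the goal line.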

season_play_by_play not working

library("nflscrapR", lib.loc="/usr/local/lib/R/3.3/site-library")
nfl15 <- season_play_by_play(2015)
Loading required package: XML
Loading required package: RCurl
Loading required package: bitops
Error in [<-.POSIXlt(*tmp*, not_same, value = c(1474521300, 1474521300, :
NAs are not allowed in subscripted assignments

cannot open the connection - error with season_games function

I can't get around this error:

Loading required package: XML Error in file(con, "r") : cannot open the connection

It happens when trying the following command:
'games.2015 <- season_games(2015)'

It happens with other years too. All other season_x functions work fine though. Has the nfl changed its json format or something? Would like to merge game data (e.g., stadium, weather, etc.) with play-by-play data, so I'd appreciate any advice or improvements to help with this.

Great job!

Jonathan

season_play_by_play not loading

I tried fetching the pbp data from 2009 and 2010 but this seems not to work for me. While it is mentioned that this function can take a few minutes to run, I don't get any results after waiting (imo) long enough for it to finish. There are no error messages thrown, it just freezes after this output:

Loading required package: XML
Loading required package: RCurl
Loading required package: bitops

my script:

Sys.setenv(LANG = "en")
install.packages('devtools',repos = "http://cran.us.r-project.org")
library(devtools)
devtools::install_github(repo = "maksimhorowitz/nflscrapR")
library(nflscrapR)

pbp_2009 <- season_play_by_play(2009)
pbp_2010 <- season_play_by_play(2010)

library(tidyverse)
pbp_data <- bind_rows(pbp_2009, pbp_2010)

Issue Loading pbp data

Hi,

I downloaded and loaded the instructions for the package but am still having a hard time loading the data with the season_play_by_play function. For example, when I execute the following code, I get this error:

pbp_2017 <- season_play_by_play(2017)
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
namespace ‘rlang’ 0.1.1 is already loaded, but >= 0.1.2 is required

Any additional help would be much appreciated.

Thanks!

game times

I am trying to retrieve season dates and times. season_games gives me the right dates. Do you know how I can retrieve game times? Is there another package or API that I could use?

Anyone else having trouble with season_play_by_play()?

My system could be the culprit here so I'm checking with other users. As usual, the latest incarnation of nflscrapR is installed. In fact, that's when the problem started.

Log:

nfl16 <- season_play_by_play(2016,6)
Error in stringr::str_extract_all(unlist(sourceHTML), pattern = "data-gameid="[0-9]{10}"") :
lazy-load database '/usr/local/lib/R/3.3/site-library/stringi/R/stringi.rdb' is corrupt
In addition: Warning messages:
1: In stringr::str_extract_all(unlist(sourceHTML), pattern = "data-gameid="[0-9]{10}"") :
restarting interrupted promise evaluation
2: In stringr::str_extract_all(unlist(sourceHTML), pattern = "data-gameid="[0-9]{10}"") :
internal error -3 in R_decompress1

Default Weeks

I think weeks should default to 17 instead of 16 in some of these functions so that you get the whole season, not just the first 16 weeks.

Missing XP in Super Bowl LI

There's currently an issue with Super Bowl LI's play-by-play data where the XP following ATL's TD interception return is missing. Unfortunately this play is missing from the NFL API so this can have a major impact on the win probability during the game. We're looking into this issue and hoping to resolve this with boxscore information to ensure there are no missing scoring plays in the data.

Add direction of plays to play-by-play

Some have expressed interest in the direction of plays within a stadium.

For example, if you are modeling field goals, it could be useful to know what direction the ball was being kicked, since factors associated with some stadium-direction interactions may affect kicking accuracy (e.g. Heinz Field's open end).

Having the direction a team is going in a drive may also help when analyzing penalties, as shown by Davis & Lopez (2017).

This information could probably be obtained by looking at coin toss results.

Add information about coaches

Many would benefit from a centralized dataset of head coaches, offensive coordinators, and defensive coordinators.

This could be a data frame where each row is a coach-team-type, with the following columns: coach_id, coach_type (head/OC/DC), team, start_date, end_date, [additional variables]

Alternatively, this could be a data frame where each row is a team-game, with the following columns: game_id, date, team, team_type (home/away), head_coach, offensive_coordinator, defensive_coordinator, [additional variables]

Error in full season data pull

When trying to scrape full season data, I get the following error:

Error in if (!(any(stringr::str_detect(play_player_data$nfl_stat, "_td")))) { :
missing value where TRUE/FALSE needed

I've tried reinstalling both forks of the package. I noticed that this error isn't thrown for all game id's, just some. I've tried cropping bad game id's but I keep running into more that cause the error.

I've tried using scrape_season_play_by_play for 2016 and 2017, happens with both. I've also tried scrape_json_play_by_play and looping across game id's for both seasons. Same result.

Any help/advice would be greatly appreciated! Thanks!

Warning when calling season_play_by_play(): number of items to replace is not a multiple of replacement length

I ran the following code:
pbp_2009 <- season_play_by_play(2009, Weeks = 3)
pbp_2010 <- season_play_by_play(2010, Weeks = 3)
pbp_2011 <- season_play_by_play(2011, Weeks = 3)
pbp_2012 <- season_play_by_play(2012, Weeks = 3)
pbp_2013 <- season_play_by_play(2013, Weeks = 3)
pbp_2014 <- season_play_by_play(2014, Weeks = 3)
pbp_2015 <- season_play_by_play(2015, Weeks = 3)

This ran without errors/warnings for most years, but I did get the following warning for both 2009 and 2012:

Warning message:
In PBP$Rusher[which(PBP$PlayType == "Run")][elidgiblePlays] <- rusherStepfinal :
number of items to replace is not a multiple of replacement length

This seems to indicate that rusherStepfinal should be the same length as the array slice on the left-hand side of the assignment, but isn't, which seems like a cause for concern to me. But I could be misunderstanding what's going on. Thanks a lot for any clarity you can give on what's going on here.
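
The warning can be reproduced in isolation with base R, which may help in reasoning about it. The vectors and the `assign_checked` helper below are made up for illustration and are not taken from the package.

```r
# Three target positions but only two replacement values (made-up names).
rusher   <- rep(NA_character_, 5)
eligible <- c(1, 3, 5)
parsed   <- c("A.Back", "B.Runner")

# R recycles the shorter vector and warns; capture the warning text.
msg <- tryCatch({
  rusher[eligible] <- parsed
  "no warning"
}, warning = function(w) conditionMessage(w))
msg

# Hypothetical guard: fail loudly instead of recycling silently.
assign_checked <- function(target, idx, values) {
  stopifnot(length(idx) == length(values))
  target[idx] <- values
  target
}
```

Because R recycles the replacement values, a length mismatch like this can put names on the wrong rows rather than simply leaving gaps.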

PlaybyPlay Hard to Aggregate TD's accurately due to challenged plays

To see an example, apply the following filters to the 2017 play-by-play data:

filter(PlayAttempted == 1,
touchdown == 1,
playtype == "Pass",
PassOutcome %in% c("Complete",NA),
posteam == "PHI",
Passer == "C.Wentz")

If you then sum the data, you'll get 34 TD. If you look at the stats for 2017, C.Wentz has 33 TD. So I did some digging and found out the flagged TD was due to a pass to N.Agholor against DAL that was challenged and reversed in favor of DAL. So that touchdown shouldn't be flagged as 1.

So I thought, no problem, just filter by reversed challenges.

But then I added filter(!Challenge.Replay ==1 | !chalReplayResult == "Reversed")
and reran it, and got 32 touchdowns, which is now one too few.

Did some more digging and there was a pass to A.Jeffery against KC that was reversed in favor of PHI, so that one should be counted.

So basically we need a way of knowing whether a touchdown stands after a challenge. It seems touchdown gets flagged as 1 any time the play ends in the endzone, even for incomplete passes.

I don't know how the scraper works, but maybe there can be a column added called "touchdownRes" which is the result of a touchdown.

It could be coded as follows: if the next play has PlayAttempted == 1 and (!is.na(ExtraPointResult) | !is.na(TwoPointConv)), then it was a successful TD.
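
The rule suggested above can be sketched on toy rows as follows. The data are made up, the rows are assumed to be one game in play order, and edge cases such as a touchdown on the last play of a game are not handled.

```r
# Toy rows in play order: a TD followed by an extra point, then a flagged
# "TD" followed by an ordinary play (for example, after a reversal).
pbp <- data.frame(
  touchdown        = c(1, 0, 1, 0),
  ExtraPointResult = c(NA, "Made", NA, NA),
  TwoPointConv     = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE when the following row is an extra point or two-point attempt.
# The last row has no following play, so it gets FALSE.
next_is_try <- c(
  !is.na(pbp$ExtraPointResult[-1]) | !is.na(pbp$TwoPointConv[-1]),
  FALSE
)

pbp$touchdownRes <- as.integer(pbp$touchdown == 1 & next_is_try)
pbp$touchdownRes
```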

Issue: Play-by-Play EPA data for Kick-Off Fumble Returns for TD is wrong

EPA is defined as the expected points added for the Possession Team (PT). On Kickoffs, the team receiving the kick is the PT.

However, on Kickoffs, if the KR team fumbles the ball and the Defensive Team (Kickoff Team) returns it for a TD, the EPA is positive. It should be negative since the Defensive Team scored.

There are 3 examples from 2013 play by play data. Below are details of one of these examples:
GameID: 2013101303 TimeSecs: 1362 EPA: 6.54564648081907 Desc: G.Zuerlein kicks 70 yards from STL 35 to HOU -5. K.Martin to HOU 10 for 15 yards (R.McLeod). FUMBLES (R.McLeod), RECOVERED by STL-D.Bates at HOU 11. D.Bates for 11 yards, TOUCHDOWN.

error in season_play_by_play

I'm getting an error when I try to use the season_play_by_play function.

Whenever I use data = season_play_by_play(2016)

I get the error

NA/NaN/Inf in foreign function call

Rusher column in play-by-play sometimes wrong

Hi Maksim,

first of all, thanks for your wonderful package!
Unfortunately, I just stumbled upon an issue on the "Rusher" column of the play-by-play dataset when extracting data for the 2015 season.
If the runner has his first name shortened to more than one letter, the value in the "Rusher" column is incorrectly set to the tackler (instead of the real runner in the play).

An example of this can be seen in the second play of the first game of the 2015 season:

> pbp <- game_play_by_play(2015091000)
> pbp[2,]$desc
[1] "(15:00) De.Williams right tackle to PIT 38 for 18 yards (D.Hightower)."
> pbp[2,]$Rusher
[1] "D.Hightower"
> pbp[2,]$Tackler1
[1] "D.Hightower"

Rush Yds not adding up for season total (NFL.COM)

Hi All,

I'm trying to replicate player stats for a fantasy analysis. I'm using the SQL function in R to manipulate and query the data into smaller data sets by Season and by Player. When I query rushers, for example, and compute sum(ydsGained) and sum(rushAttempts), my totals for 2016 don't match up with NFL.com or Pro Football Reference.

Am I missing something? Any suggestions on how to replicate the data found elsewhere? My query is below:

SELECT Rusher, sum(case when PlayType = 'Run' then 1 else 0 end) as PlayTy, sum(RushAttempt) as Carries, sum(Yards_Gained) FROM NFL_DATA_2016 GROUP BY Rusher ORDER BY Carries DESC

My goal in the research and analysis is for a Fantasy Football Rebuild. I was hoping to replicate Year Stats for QB, RB, WR & TE for 2009-2017. I tried with the above SQL code in R and totals do not match NFL.com results for the year.

The project will run various regressions, classification and/or clustering algorithms against the data. Any insight would be awesome, even if it explains the discrepancies.

Thanks in advance,

POSIXlt error for season_play_by_play(2010)

is there any package requirement or some fix for that?

Error in [<-.POSIXlt(*tmp*, not_same, value = c(1498256100, 1498256100, : NAs are not allowed in subscripted assignments

Player on incorrect team in game_play_by_play

Hi
Just trying out this great package for first time and attempting to extend some work done here

For the DEN ATL matchup the ATL running back, D.Freeman, has 3 plays as Rusher with his posteam as DEN on 2016-10-09 when they played Atlanta. They all appear to be related to a shotgun play

season_player_game goes to wrong url

When I do this

library(nflscrapR)
season2009 <- season_player_game(2009)

I get

invalid multibyte string at '<92>ve <72>eached this page by selecting a bookmark that worked previously, it<92>s likely the file moved to 
      a new location because of our recent redesign.</p>
      <p>You can return to the <a href="http://www.nfl.com">NFL.com home page</a>.</p>
  </div>
</div>

Which shows that season_player_game is scraping from the wrong location.

NFLScrapR not correctly parsing penalties; many "NaN" results in PenaltyType

Encountered this after scraping all available data and importing into Python via .csv. I happened to be looking at Seahawks 2017 penalties; here is code and results:

penalties = pbp[(pbp.DefensiveTeam == 'SEA') &
(pbp['Accepted.Penalty']==1) &
(pbp.Season == 2017) &
(pbp.PenalizedTeam == 'SEA')][['desc', 'PenaltyType']]

penalties

363651 | (13:26) (Shotgun) A.Rodgers pass incomplete sh... | Defensive Pass Interference
(9:37) (Shotgun) A.Rodgers pass short middle i... | NaN
(13:51) (No Huddle, Shotgun) A.Rodgers pass sh... | Defensive Holding
(7:24) J.Vogel punts 57 yards to SEA 35, Cente... | NaN
(13:03) (Shotgun) A.Rodgers pass incomplete de... | Defensive Offside

Seems pretty prevalent; of the Seahawks' 79 penalties in 2017, only 42 seem to have been correctly parsed.

Game scores?

Hello!

Not really an issue... more of a question. So I apologize in advance if this is not the right forum.

I don't see anywhere that game scores are returned. Am I missing something? Or does this need to be derived from the simple_boxscore data... tds, fgm, etc...

Extra team in 2016?

This may be an issue with the NFL API but the data contained in the package and the data retrieved using the package for the 2016 season includes 33 teams (not 32). Seems that data is recorded for the Jacksonville Jaguars as both 'JAC' and 'JAX'. This only seems to affect the 2016 season.

length(unique(nflscrapR::playerstats15$Team)) ## returns 33
length(unique(nflscrapR::playerstats16$Team)) ## returns 34

This is probably an error caused by how the NFL entered the data, but you may want to update the playerstats16 dataset.
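
Until the dataset is updated, the two codes could be merged on the user's side. A minimal sketch on made-up rows follows; which of the two abbreviations to keep is an arbitrary choice here.

```r
# Toy stats table with both Jacksonville codes present (made-up rows).
stats <- data.frame(
  Team = c("JAC", "JAX", "SEA", "JAX"),
  stringsAsFactors = FALSE
)

# Collapse the duplicate code onto a single abbreviation.
stats$Team[stats$Team == "JAC"] <- "JAX"
length(unique(stats$Team))
```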

Incorrect Rusher in season_play_by_play

In the 2018 data (all that I've played with, so far), I've found that a single "Rusher" may have many "Rusher_ID"s across different play-by-play records. In some cases this is due to an incorrect value in the Rusher field. For example, GameID 2018092307, play_id 3355, has this desc:
"(11:27) C.Ivory right guard to BUF 30 for 4 yards (E.Kendricks, M.Hughes)"
However, the Rusher is K.Cousins. The Rusher_ID may be correct (other C.Ivory records seem to have 00-0027531 as the Rusher_ID).

Return team not always added in play by play

The line for detecting the return stat does not always catch the return team:
dplyr::filter(stringr::str_detect(nfl_stat, "_return") |
nfl_stat %in% c("punt_touchback_receiving",
"punt_downed","punt_fair_catch",
"kickoff_fair_catch",
"kickoff_touchback_receiving"))
should be
dplyr::filter(stringr::str_detect(nfl_stat, "_return") |
any(nfl_stat %in% c("punt_touchback_receiving",
"punt_downed","punt_fair_catch",
"kickoff_fair_catch",
"kickoff_touchback_receiving")))

Yards.Gained if ChalReplayResult = Reversed

I'm having issues with data when a challenge is reversed. For example Date 2017-10-15, GameID 2017101506, qtr 3 down 3 time 7:48. Kirk Cousins completes a pass for 17 yards to Jamison Crowder, but the play is reversed as he stepped out of bounds and he only gained 3 yards. However, Yards.Gained shows 17. I think there are probably other issues with reversed plays as well, but this is the first one I've noticed.
