Using R to Analyze Baseball Games in “Real Time”

To honor the last day of the 2009 MLB regular season (excepting the Twins/Tigers tiebreaker Tuesday night), I was reading a book that combines a few of my favorite things: statistics, R, and baseball. The book, by Joseph Adler, is called Baseball Hacks, and I highly recommend it if you are interested in analyzing baseball data. Joseph uses Excel for some tips and R for others, and shows you how to download historical and current baseball data for further analysis. One tip in the book describes a way to download “real time” baseball data from MLB’s site in XML format. I decided to try to write some R functions to retrieve, summarize, and analyze what was available.

Where are the data?

Joseph shows how, at least at the time of the writing of his book and this post, you can download a wealth of XML data from past and current seasons from MLB’s Gameday server (the code below uses http://gd2.mlb.com/components/game/mlb/ as its base URL). If you drill down far enough into the directories, which are organized by year, month, and day, you can find a file called miniscoreboard.xml, which is the one I use for this analysis.

The R functions

Here are the R functions I wrote. You can copy and paste them into your R session so that they are available to you. The next section will describe how to use them. Writing these was fairly straightforward, and simply a matter of XML manipulation. I admit that there may be far better ways to do this manipulation using the XML package, but this worked for now.
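
As one possible alternative to the xmlToList() step used below, the attributes of each game node can be read directly with xmlAttrs(). The following is only a sketch: the XML fragment is hand-written for illustration (it is not a copy of MLB's file), it uses the attribute names that createBoxScore() reads, and its values are taken from the CWS @ DET box score shown later in this post.

```r
library(XML)

## hand-written fragment for illustration only; the real miniscoreboard.xml
## contains many more attributes and nodes
txt <- '<games>
  <game status="Final" inning="9" top_inning="N"
        away_name_abbrev="CWS" away_team_runs="3"
        away_team_hits="7" away_team_errors="0"
        home_name_abbrev="DET" home_team_runs="5"
        home_team_hits="12" home_team_errors="0"/>
</games>'

## parse the string, then pull one named character vector per <game> node
doc   <- xmlParse(txt, asText = TRUE)
attrs <- xpathApply(doc, "//game", xmlAttrs)

a <- attrs[[1]]
away.team <- a[["away_name_abbrev"]]
home.runs <- as.numeric(a[["home_team_runs"]])
```

Each element of attrs is a plain named character vector, so numeric fields still need as.numeric(), just as in createBoxScore().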

################################################################################
#   Program Name:     xml-mlb-gameday.R
#   Author:           Erik
#   Created:          10/04/2009
#
#   Last saved
#    Time-stamp:      <2009-10-04 17:23:02 erik>
#
#   Purpose:          show current scoreboard in R 
#
#   ** Generated by auto-insert on 10/04/2009 at 13:25:58**
################################################################################

## need XML package, may need to install w/ install.packages()
library(XML)

## create a boxscore object from an XML description of a game 
createBoxScore <- function(x) {
  status <- if(x$.attrs["status"] != "In Progress")
    "Final" else if(x$.attrs["top_inning"] == "Y")
      "Top" else "Bot"
  
  bs <- list(status = status, 
             inning = as.numeric(x$.attrs["inning"]),
             away.team = x$.attrs["away_name_abbrev"],
             away.runs = as.numeric(x$.attrs["away_team_runs"]),
             away.hits = as.numeric(x$.attrs["away_team_hits"]), 
             away.errors = as.numeric(x$.attrs["away_team_errors"]),
             home.team = x$.attrs["home_name_abbrev"],
             home.runs = as.numeric(x$.attrs["home_team_runs"]), 
             home.hits = as.numeric(x$.attrs["home_team_hits"]), 
             home.errors = as.numeric(x$.attrs["home_team_errors"]))
  class(bs) <- "boxscore"
  bs
}

## print the boxscore object in traditional format
print.boxscore <- function(x, ...) {
  cat("     ", "R   ", "H  ", "E (",
      x$status, " ",
      x$inning, ")\n",
      format(x$away.team, width = 3), " ", 
      format(x$away.runs, width = 2), "  ", 
      format(x$away.hits, width = 2), "  ", 
      x$away.errors, "\n",
      format(x$home.team, width = 3), " ", 
      format(x$home.runs, width = 2), "  ", 
      format(x$home.hits, width = 2), "  ", 
      x$home.errors, "\n\n", sep = "")
}

## utility function: convert a boxscore object to a one-row data.frame
as.data.frame.boxscore <- function(x, row.names, optional, ...) {
  class(x) <- "list"
  as.data.frame(x)
}

## This is the "user accessible" public function you should be calling!
## downloads the XML data, and prints out boxscores for games on "date"
boxscore <- function(date = Sys.Date()) {
  if(date > Sys.Date())
    stop("Cannot retrieve scores from the future.")
         
  year  <- paste("year_", format(date, "%Y"), "/", sep = "")
  month <- paste("month_", format(date, "%m"), "/", sep = "")
  day   <- paste("day_", format(date, "%d"), "/", sep = "")
         
  xmlFile <-
    paste("http://gd2.mlb.com/components/game/mlb/",
          year, month, day, "miniscoreboard.xml", sep = "")
  xmlTree <- xmlTreeParse(xmlFile, useInternalNodes = TRUE)
  xp <- xpathApply(xmlTree, "//game")
  xmlList <- lapply(xp, xmlToList)

  bs.list <- lapply(xmlList, createBoxScore)
  names(bs.list) <-
    paste(sapply(bs.list, "[[", "away.team"),
                 "@",
          sapply(bs.list, "[[", "home.team"))
  bs.list
}
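
boxscore() stops with an error when the scoreboard file cannot be loaded, for instance on a day with no games. A minimal sketch of one way to soften that, assuming you would rather get an empty list and a message than an error, is a small wrapper around tryCatch(). The name safeScores and its fetch argument are hypothetical and not part of the functions above.

```r
## hypothetical wrapper: call fetch(date) and fall back to an empty list
## if it signals an error (e.g., the XML file could not be downloaded)
safeScores <- function(date = Sys.Date(), fetch = boxscore) {
  tryCatch(fetch(date),
           error = function(e) {
             message("No scoreboard available for ", format(date),
                     ": ", conditionMessage(e))
             list()
           })
}

## usage with a stand-in fetch function that always fails,
## so the example runs without a network connection
res <- safeScores(as.Date("2009-10-05"),
                  fetch = function(d) stop("failed to load HTTP resource"))
length(res)
```

With the functions above loaded, safeScores() with no arguments would call boxscore() for today's date.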

Examples of summarizing real-time baseball data

Here is how to run some simple analyses on baseball games happening right now. This is the real value of downloading the data through R. Obviously, if you only wanted to know how your team was doing, you could just go to your favorite sports site; but pulling the data into R lets you analyze them further, and even combine them with other data sources (e.g., weather).

> ## print boxscores for games happening NOW!
> boxscore()
$`CWS @ DET`
     R   H  E (Final 9)
CWS  3   7  0
DET  5  12  0


$`HOU @ NYM`
     R   H  E (Final 9)
HOU  0   4  1
NYM  4   9  0


$`PIT @ CIN`
     R   H  E (Final 9)
PIT  0  10  0
CIN  6  11  0


$`WSH @ ATL`
     R   H  E (Final 15)
WSH  2  13  0
ATL  1  13  0


$`CLE @ BOS`
     R   H  E (Final 9)
CLE  7   8  0
BOS 12  11  0


$`FLA @ PHI`
     R   H  E (Final 10)
FLA  6  11  1
PHI  7  12  0


$`TOR @ BAL`
     R   H  E (Final 11)
TOR  4   9  2
BAL  5   8  0


$`NYY @ TB`
     R   H  E (Final 9)
NYY 10  12  0
TB   2   7  2


$`KC @ MIN`
     R   H  E (Final 9)
KC   4  12  0
MIN 13  11  0


$`MIL @ STL`
     R   H  E (Final 10)
MIL  9  15  2
STL  7   7  0


$`ARI @ CHC`
     R   H  E (Final 9)
ARI  5   8  0
CHC  2   6  0


$`LAA @ OAK`
     R   H  E (Final 9)
LAA  5   9  1
OAK  3  12  1


$`SF @ SD`
     R   H  E (Bot 9)
SF   3  11  1
SD   3   4  0


$`COL @ LAD`
     R   H  E (Top 8)
COL  1   4  1
LAD  5  12  0


$`TEX @ SEA`
     R   H  E (Final 9)
TEX  3   4  0
SEA  4   8  1 
> ## print boxscores for a different day's games
> boxscore(date = as.Date("2009-10-01"))
$`STL @ CIN`
     R   H  E (Final 9)
STL 13  15  1
CIN  0   5  0


$`MIN @ DET`
     R   H  E (Final 9)
MIN  8  13  4
DET  3   7  1


$`MIL @ COL`
     R   H  E (Final 9)
MIL  2   6  0
COL  9  14  1


$`ARI @ SF`
     R   H  E (Final 9)
ARI  3   6  1
SF   7  11  0


$`TEX @ LAA`
     R   H  E (Final 9)
TEX 11  15  1
LAA  3   7  2


$`WSH @ ATL`
     R   H  E (Final 9)
WSH  2   7  0
ATL  1   6  0


$`HOU @ PHI`
     R   H  E (Final 9)
HOU  5  10  0
PHI  3  13  1


$`BAL @ TB`
     R   H  E (Final 9)
BAL  3   7  0
TB   2   5  1


$`CLE @ BOS`
     R   H  E (Final 9)
CLE  0   3  0
BOS  3  12  0


$`PIT @ CHC`
     R   H  E (Final NA)
PIT NA  NA  NA
CHC NA  NA  NA


$`OAK @ SEA`
     R   H  E (Final 9)
OAK  2   7  1
SEA  4   8  0 
> ## save the boxscores for further analysis
> bs <- boxscore()
> ## convert to a more useful form, a data.frame
> ## with one game per row 
> bs.df <- do.call(rbind, lapply(bs, as.data.frame))
> ## status of today's games
> table(bs.df$status)
Final   Bot   Top 
   13     1     1  
> ## how many innings have been played today? 
> sum(bs.df$inning, na.rm = TRUE)
[1] 144 
> ## how many runs have been scored by the home teams today?
> sum(bs.df$home.runs, na.rm = TRUE)
[1] 79 
> ## how many runs have been scored by the away teams today?
> sum(bs.df$away.runs, na.rm = TRUE)
[1] 62 
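
Once the games are in a data.frame, derived columns are easy to add. The sketch below is only an illustration: rather than downloading anything, it types in three finished games from the box scores printed above, using the same column names as bs.df, and stores team names as character strings (an assumption; depending on your R version they may be factors in bs.df).

```r
## three finished games copied by hand from the box scores above
games <- data.frame(
  away.team = c("CWS", "HOU", "PIT"),
  away.runs = c(3, 0, 0),
  home.team = c("DET", "NYM", "CIN"),
  home.runs = c(5, 4, 6),
  stringsAsFactors = FALSE
)

## run differential from the home team's point of view
games$run.diff <- games$home.runs - games$away.runs

## winning team for each game
games$winner <- ifelse(games$run.diff > 0, games$home.team, games$away.team)

## share of these games won by the home team
home.share <- mean(games$run.diff > 0)
```

The same expressions should carry over to bs.df, ideally after keeping only the rows whose status is "Final" so that games in progress are not counted as wins.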

Conclusion

These functions are far from robust, and I think they only work for the current year (i.e., 2009; dates from 2008 were not working right). The format looks like it has changed over time, which is not surprising. I only use a very small subset of the available data; even the miniscoreboard.xml file contains far more information than I summarize here. This is really the first time I have dealt with XML data, so I am sure there is a lot more that can be done, but for a one-day project, I think the results are pretty interesting. I will definitely post any updates I make to these functions, and may even start a baseball R package if they grow extensive enough. I suppose this is a project I can work on in the off-season!

4 Responses to Using R to Analyze Baseball Games in “Real Time”

  1. keijo says:

    Thanks, after running these functions on my R session, I get the following error on

    > boxscore()
    failed to load HTTP resource
    Error: 1: failed to load HTTP resource

    This works fine

    > boxscore(date = as.Date("2009-10-01"))
    $`STL @ CIN`
    R H E (Final 9)
    STL 13 15 1
    CIN 0 5 0

    $`MIN @ DET`
    R H E (Final 9)
    …snip…

  2. erikr says:

    Hello. Thanks for trying this out. The error you are getting is because there are no MLB games today (Monday, October 5, 2009), and “boxscore” uses today’s date as the default. These functions definitely could do more error checking. Thanks for reporting the issue!

  3. SoxFan says:

    Does the MLB dataset include post-season data? If so, you should create some functions for this post-season. It would be interesting to compare the teams in this year’s match-ups, look for patterns, or make some predictions! (Go Red Sox!)
