Baseball Data Wrangling with Vagrant, R, and Retrosheet

Analytics with the Chadwick tools, dplyr, and ggplot.
Rating: 4.8 out of 5 (146 ratings)
10,695 students
  • Install VirtualBox and Vagrant
  • Run a virtual Linux machine
  • Install the Chadwick software tools
  • Extract game and play-by-play baseball data from Retrosheet files
  • Produce graphs with ggplot

Requirements

  • Students will need to have R and RStudio installed on their own computers.
Description

This course is for those interested in doing baseball analytics with the Retrosheet game-by-game and play-by-play data. The main tools for working with such data are the Chadwick software tools. We set up a virtual Linux machine and install the Chadwick software on it. We then learn how to extract baseball data with the Chadwick software, how to filter the data further with dplyr in R, and how to plot our results with ggplot.
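
As a rough illustration of the dplyr and ggplot steps described above, here is a minimal sketch in R. It does not use real Retrosheet data: the `events` data frame, its column names (`game_date`, `batter`, `result`), and the player labels are all hypothetical stand-ins for whatever the Chadwick extraction produces in the course, and the course's own code may look quite different.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for play-by-play data read into R.
# Column names and values are made up for illustration only.
events <- data.frame(
  game_date = as.Date(c("1998-04-01", "1998-04-01", "1998-04-03",
                        "1998-04-05", "1998-04-05", "1998-04-08")),
  batter    = c("player_a", "player_b", "player_a",
                "player_b", "player_a", "player_b"),
  result    = c("HR", "1B", "HR", "HR", "K", "HR"),
  stringsAsFactors = FALSE
)

# Keep only the home runs, order them by date within each batter,
# and number them to get a running (cumulative) count per batter.
home_runs <- events %>%
  filter(result == "HR") %>%
  arrange(batter, game_date) %>%
  group_by(batter) %>%
  mutate(cumulative_hr = row_number()) %>%
  ungroup()

# Step plot of cumulative home runs over time, one colored line per batter.
hr_plot <- ggplot(home_runs,
                  aes(x = game_date, y = cumulative_hr, color = batter)) +
  geom_step() +
  labs(x = "Date", y = "Cumulative home runs", color = "Batter")

print(hr_plot)
```

With real data, the only parts expected to change are how `events` is loaded and which columns identify the date, the batter, and the play result.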

For the first part of the course, in which we install the virtual Linux machine and learn how to work with the Chadwick software, there are no prerequisites. To follow the second part of the course, knowledge of dplyr is necessary. This can be obtained through my course "Baseball Database Queries with SQL and dplyr".

At a relaxed pace, the course should take two to three weeks to complete.

Who this course is for:
  • This course is for those interested in doing baseball analytics with Retrosheet files.
  • No background is needed for the first part of the course. A background in the R package dplyr is necessary to follow the second part of the course.
Course content
4 sections • 28 lectures • 2h 9m total length
  • Introduction
    01:22
  • Installing VirtualBox
    00:57
  • Installing Vagrant
    00:38
  • Creating a Project Folder
    03:17
  • Vagrant Up
    03:11
  • Directory Structure
    04:34
  • Downloading the Chadwick Software
    06:50
  • Installing the Chadwick Software
    08:46
  • The Retrosheet Files
    03:55
  • cwevent and cwgame
    06:06
  • Data Extraction
    11:08
  • Reading our data into R
    06:23
  • The Result Column
    05:07
  • The Date Column
    03:39
  • The Date Column Part II
    04:29
  • The Player Data Frames
    06:46
  • ggplot Crash Course
    08:22
  • Cumulative Home Run Plots
    02:25
  • Colors and Legend
    05:25
  • Project Description
    03:20
  • Data Extraction
    03:02
  • Reading the data into R
    03:41
  • The Date Column
    03:18
  • The Result and AB Columns
    03:15
  • The Player Data Frames
    09:41
  • The Plots
    03:54
  • The Four-Hundred Line
    04:31
  • The Marchi/Albert Book and Course Wrap-Up
    01:30

Instructor
Charles Redmond
Professor at Mercyhurst University
  • 4.6 Instructor Rating
  • 5,757 Reviews
  • 70,060 Students
  • 7 Courses

Dr. Charles Redmond is a professor in the Tom Ridge School of Intelligence Studies and Information Science at Mercyhurst University. He has been a member of the Department of Mathematics and Computer Systems at Mercyhurst for 21 years and has recently completed a term as chair of the department. Dr. Redmond received his PhD in mathematics from Lehigh University in 1993 and has published in the Annals of Applied Probability, the Journal of Stochastic Processes and Their Applications, Mathematics Magazine, the College Mathematics Journal, and Mathematics Teacher. In his spare time he enjoys making music and computer-generated art, reading, and owning a Clumber Spaniel.