Baseball Data Wrangling with Vagrant, R, and Retrosheet
4.8 (62 ratings)
4,306 students enrolled

Analytics with the Chadwick tools, dplyr, and ggplot.
Created by Charles Redmond
Last updated 6/2015
English
Price: Free
Includes:
  • 2 hours on-demand video
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Install VirtualBox and Vagrant
  • Run a virtual Linux machine
  • Install the Chadwick software tools
  • Extract game and play-by-play baseball data from Retrosheet files
  • Produce graphs with ggplot
Requirements
  • Students will need to have R and RStudio installed on their own computers.
Description

This course is for those interested in doing baseball analytics with the Retrosheet game-by-game and play-by-play data. The main tools for working with such data are the Chadwick software tools. We set up a virtual Linux machine and install the Chadwick software on it. We then learn how to extract baseball data with the Chadwick software, how to filter the data further with dplyr in R, and how to plot the results with ggplot.

For the first part of the course, in which we install the virtual Linux machine and learn how to work with the Chadwick software, there are no prerequisites. To follow the second part of the course, knowledge of dplyr is necessary. This can be obtained through my course "Baseball Database Queries with SQL and dplyr".

At a relaxed pace, the course should take two to three weeks to complete.

Who is the target audience?
  • This course is for those interested in doing baseball analytics with Retrosheet files.
  • No background is needed for the first part of the course. A background in the R package dplyr is necessary to follow the second part of the course.
Curriculum For This Course
28 Lectures
02:09:32
Setting up Vagrant
6 Lectures 13:59

This is our course introduction in which I detail the structure of the course.

Introduction
01:22

After viewing this lecture, you will be able to install VirtualBox on your machine.

Installing VirtualBox
00:57

After viewing this lecture, you will be able to install Vagrant on your machine.

Installing Vagrant
00:38

After viewing this lecture, you will be able to create a folder for your project, navigate to it from the command line, and check that Vagrant is installed.

Creating a Project Folder
03:17

After viewing this lecture, you will be able to download your Linux box and start it.

Vagrant Up
03:11

After viewing this lecture, you will be able to ssh into your Linux box and navigate the directory structure.

Directory Structure
04:34
Installing and Working with the Chadwick Software
4 Lectures 25:37

After viewing this lecture, you will be able to download the Chadwick software and copy files from one directory to another on your Linux machine.

Downloading the Chadwick Software
06:50

After viewing this lecture, you will be able to install the Chadwick software on your Linux machine.

Installing the Chadwick Software
08:46

After viewing this lecture, you will be familiar with the contents of the Retrosheet event files.

The Retrosheet Files
03:55

After viewing this lecture, you will be able to work with the cwevent and cwgame programs from the Chadwick software.

cwevent and cwgame
06:06
Project #1: Mike Schmidt and Greg Luzinski
9 Lectures 53:44

After viewing this video, you will be able to extract the information you need for our first project using the Chadwick software. You will also see how to work with wildcards in Linux.

Data Extraction
11:08

After viewing this lecture, you will be able to assign names to data frame columns. We will also review how to read CSV files into R.


Reading our data into R
06:23
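
As a rough illustration of the two skills named in the "Reading our data into R" lecture, the sketch below reads header-less comma-separated text into a data frame and then assigns column names. Only the data frame name bdat comes from the course outline. The three column names follow Chadwick's field naming, but the choice of fields and both rows are assumptions made up for illustration, not the course's actual extract.

```r
# Two made-up rows standing in for extracted event data with no header row.
raw <- "PHI198004110,schmm001,23
PHI198004110,luzig001,2"

# header = FALSE keeps the first line as data instead of treating it as names.
bdat <- read.csv(text = raw, header = FALSE, stringsAsFactors = FALSE)

# Replace the default V1, V2, V3 names with descriptive ones (assumed fields).
names(bdat) <- c("GAME_ID", "BAT_ID", "EVENT_CD")

print(bdat)
```

With a real extract, one would pass a file path to read.csv in place of the text argument.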

After viewing this lecture, you will be able to work with some of the logical operators in R. We will also review the mutate verb in dplyr.

The Result Column
05:07
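
The combination of logical operators and mutate mentioned for "The Result Column" might look something like the following. This is a minimal sketch under assumptions: the column names, the made-up rows, and the HR and HIT columns are hypothetical, and the course's own result column may be built differently. It relies on Chadwick's event codes, in which 20 through 23 denote a single through a home run.

```r
library(dplyr)

# Made-up plate appearances: a home run, a strikeout (code 3), and a single.
bdat <- data.frame(BAT_ID   = c("schmm001", "schmm001", "luzig001"),
                   EVENT_CD = c(23, 3, 20))

bdat <- bdat %>%
  mutate(HR  = EVENT_CD == 23,                   # equality test
         HIT = EVENT_CD >= 20 & EVENT_CD <= 23)  # two comparisons joined with &

print(bdat)
```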

After viewing this lecture, you will be able to work with the substr function in R.


The Date Column
03:39
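
For a sense of how substr could apply to a date column, the sketch below pulls the year, month, and day out of Retrosheet-style game IDs, which consist of a three-letter home team code, an eight-digit date, and a game number. The two IDs are illustrative, and the variable names are assumptions rather than the ones used in the lecture.

```r
# Illustrative game IDs: home team, yyyymmdd, then a one-digit game number.
game_id <- c("PHI198004110", "PHI198004120")

# substr(x, start, stop) keeps the characters from start through stop.
yr <- substr(game_id, 4, 7)    # "1980" "1980"
mo <- substr(game_id, 8, 9)    # "04"   "04"
dy <- substr(game_id, 10, 11)  # "11"   "12"
```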

After viewing this lecture, you will be able to work with the paste function and the as.Date function in R.

The Date Column Part II
04:29
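
One plausible way to combine paste and as.Date, as "The Date Column Part II" describes, is sketched here. The year, month, and day vectors are made-up inputs with hypothetical names.

```r
# Made-up date parts stored as character strings.
yr <- c("1980", "1980")
mo <- c("04", "10")
dy <- c("11", "05")

# paste joins the parts element by element, separated by a hyphen.
date_chr <- paste(yr, mo, dy, sep = "-")

# as.Date parses "yyyy-mm-dd" strings without needing a format argument.
game_date <- as.Date(date_chr)

print(game_date)
```

Because game_date holds Date objects rather than strings, it sorts chronologically and can be placed directly on a plot axis.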

In this video, we extract the information we need for our player data frames from our main bdat data frame.

The Player Data Frames
06:46

After viewing this video, you will understand enough of the essentials of ggplot to create our cumulative home run plots.

ggplot Crash Course
08:22

After viewing this video, you will be able to put multiple plots on one graph.

Cumulative Home Run Plots
02:25

After viewing this lecture, you will be able to add axis labels, a title, and a legend to your graph. You will also understand how to use color within the aesthetics.

Colors and Legend
05:25
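
The plotting steps in this project (two players on one graph, color inside the aesthetics, axis labels, a title, and a legend) can be sketched roughly as follows. All of the data are made up for illustration and are not real game logs, and the column names are assumptions. The lectures may layer separate data frames instead of using one combined data frame as done here.

```r
library(ggplot2)

# Made-up home run indicators (1 = a home run on that date).
hr <- data.frame(
  date   = rep(as.Date(c("1980-04-11", "1980-04-12", "1980-04-13")), times = 2),
  player = rep(c("Schmidt", "Luzinski"), each = 3),
  HR     = c(1, 0, 1, 0, 1, 0)
)

# Running home run total computed separately for each player.
hr$cumHR <- ave(hr$HR, hr$player, FUN = cumsum)

# Mapping color to player draws one line per player and creates the legend.
p <- ggplot(hr, aes(x = date, y = cumHR, color = player)) +
  geom_line() +
  labs(x = "Date", y = "Cumulative home runs",
       title = "Cumulative home runs (illustrative data)",
       color = "Player")
```

Calling print(p) draws the graph.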
Project #2: Dykstra, Murray, and Brett
9 Lectures 36:12

In this video, I give the details of project #2.

Project Description
03:20

In this video, we return to our Linux machine and extract the data we need for our project.

Data Extraction
03:02

In this video, we read our data into R.

Reading the data into R
03:41

In this video, we generate a column of date objects in the default R date format.

The Date Column
03:18

In this video, we modify the AB column and generate an H (hits) column.

The Result and AB Columns
03:15

In this video, we generate the player data frames. We accumulate the AB and H columns and then divide them to obtain a batting average column.

The Player Data Frames
09:41
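
The running batting average described for the player data frames (accumulate AB and H, then divide) can be sketched in base R as below. The game lines are made up, and the cumAB, cumH, and AVG column names are hypothetical.

```r
# Made-up per-game at-bats and hits for one player.
player <- data.frame(AB = c(4, 3, 5), H = c(2, 1, 1))

# cumsum turns per-game counts into season-to-date totals.
player$cumAB <- cumsum(player$AB)
player$cumH  <- cumsum(player$H)

# Season-to-date batting average after each game.
player$AVG <- player$cumH / player$cumAB

print(player)
```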

In this video, we finally generate our plots.

The Plots
03:54

In this video, we add a horizontal line to our graph to mark a .400 batting average.

The Four-Hundred Line
04:31
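
A horizontal reference line of this kind is commonly drawn with geom_hline in ggplot2. The following is a minimal sketch with made-up averages and hypothetical column names. The dashed line type is a styling choice made here, not something taken from the lecture.

```r
library(ggplot2)

# Made-up season-to-date batting averages on three dates.
avg <- data.frame(
  date = as.Date(c("1980-05-01", "1980-06-01", "1980-07-01")),
  AVG  = c(0.350, 0.390, 0.410)
)

# geom_hline draws a horizontal line at the given y value.
p <- ggplot(avg, aes(x = date, y = AVG)) +
  geom_line() +
  geom_hline(yintercept = 0.400, linetype = "dashed")
```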

In this video, I recommend a text for additional ideas and examples.

The Marchi/Albert Book and Course Wrap-Up
01:30
About the Instructor
Charles Redmond
4.6 Average rating
2,067 Reviews
26,375 Students
7 Courses
Professor at Mercyhurst University

Dr. Charles Redmond is a professor in the Tom Ridge School of Intelligence Studies and Information Science at Mercyhurst University. He has been a member of the Department of Mathematics and Computer Systems at Mercyhurst for 21 years and has recently completed a term as chair of the department. Dr. Redmond received his PhD in mathematics from Lehigh University in 1993 and has published in the Annals of Applied Probability, the Journal of Stochastic Processes and Their Applications, Mathematics Magazine, the College Mathematics Journal, and Mathematics Teacher. In his spare time he enjoys making music and computer-generated art, reading, and owning a Clumber Spaniel.