Baseball Data Wrangling with Vagrant, R, and Retrosheet

Analytics with the Chadwick tools, dplyr, and ggplot.
4.6 (32 ratings)
2,764 students enrolled
Free
  • Lectures: 28
  • Length: 2 hours
  • Skill Level: Intermediate
  • Languages: English
  • Includes: Lifetime access
    30 day money back guarantee
    Available on iOS and Android
    Certificate of Completion

About This Course

Published 6/2015 English

Course Description

This course is for those interested in doing baseball analytics with the Retrosheet game-by-game and play-by-play data. The main tools for working with such data are the Chadwick software tools. We set up a virtual Linux machine and install the Chadwick software on it. We then learn how to extract baseball data with the Chadwick software, how to filter the data further with dplyr in R, and how to plot the results with ggplot.

For the first part of the course, in which we install the virtual Linux machine and learn how to work with the Chadwick software, there are no prerequisites. To follow the second part of the course, knowledge of dplyr is necessary. This can be obtained through my course "Baseball Database Queries with SQL and dplyr".

At a relaxed pace, the course should take two to three weeks to complete.

What are the requirements?

  • Students will need to have R and RStudio installed on their own computers.

What am I going to get from this course?

  • install VirtualBox and Vagrant
  • run a virtual Linux machine
  • install the Chadwick software tools
  • extract game and play-by-play baseball data from Retrosheet files
  • produce graphs with ggplot

What is the target audience?

  • This course is for those interested in doing baseball analytics with Retrosheet files.
  • No background is needed for the first part of the course. A background in the R package dplyr is necessary to follow the second part of the course.

Curriculum

Section 1: Setting up Vagrant
01:22

This is our course introduction in which I detail the structure of the course.

00:57

After viewing this lecture, you will be able to install VirtualBox on your machine.

00:38

After viewing this lecture, you will be able to install Vagrant on your machine.

03:17

After this lecture, you will create a folder for your project and navigate to the folder via the command-line. You will also check to make sure you have Vagrant installed.

03:11

After viewing this lecture, you will download your Linux box and start it.

04:34

After viewing this lecture, you will be able to ssh into your Linux box and navigate the directory structure.

Section 2: Installing and Working with the Chadwick Software
06:50

After viewing this lecture, you will be able to download the Chadwick software and copy files from one directory to another on your Linux machine.

08:46

After viewing this lecture, you will be able to install the Chadwick software to your Linux machine.

03:55

After viewing this lecture, you will be familiar with the contents of the Retrosheet event files.

06:06

After viewing this lecture, you will be able to work with the cwevent and cwgame programs from the Chadwick software.

Section 3: Project #1: Mike Schmidt and Greg Luzinski
11:08

After viewing this video, you will be able to extract the information you need for our first project using the Chadwick software. You will also see how to work with wildcards in Linux.

06:23

After viewing this lecture, you will be able to assign names to data frame columns. We will also review how to read csv files into R.
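
As a rough illustration of the idea in this lecture, the sketch below reads a small headerless CSV into R and then assigns column names. The two rows, the column names, and the player IDs are made up for the example and are not taken from the course materials; only the data frame name bdat comes from the course.

```r
# Two made-up rows in a headerless, comma-separated shape; the values
# are illustrative only and do not come from the course files.
raw <- 'PHI198004110,schmm001,23
PHI198004110,luzig001,2'

# header = FALSE stops R from treating the first record as column names.
bdat <- read.csv(text = raw, header = FALSE, stringsAsFactors = FALSE)

# Replace the default V1, V2, V3 names with descriptive ones
# (names chosen for this sketch).
names(bdat) <- c("GAME_ID", "BAT_ID", "EVENT_CD")

print(bdat)
```

With a real extract, the text argument would be replaced by the path to the CSV file.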


05:07

After viewing this lecture, you will be able to work with some of the logical operators in R. We will also review the mutate verb in dplyr.

03:39

After viewing this lecture, you will be able to work with the substr function in R.


04:29

After viewing this lecture, you will be able to work with the paste function and the as.Date function in R.
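
One way these functions can fit together, sketched under the assumption that the dates are being pulled out of Retrosheet game IDs (a three-character home team code, then the date as YYYYMMDD, then a game number). The sample IDs and variable names are illustrative, and the lecture may build its dates differently.

```r
# Hypothetical game IDs in the Retrosheet layout: three-character home
# team code, then YYYYMMDD, then a one-digit game number.
game_id <- c("PHI198004110", "PHI198010040")

year  <- substr(game_id, 4, 7)    # characters 4 through 7
month <- substr(game_id, 8, 9)    # characters 8 and 9
day   <- substr(game_id, 10, 11)  # characters 10 and 11

# paste() joins the pieces into "YYYY-MM-DD", which as.Date() parses
# with its default format.
game_date <- as.Date(paste(year, month, day, sep = "-"))

print(game_date)
```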

06:46

In this video, we extract the information we need for our player data frames from our main bdat data frame.

08:22

After viewing this video, you will understand enough of the essentials of ggplot to create our cumulative home run plots.

02:25

After viewing this video, you will be able to put multiple plots on one graph.

05:25

After viewing this lecture, you will be able to add axis labels, a title, and a legend to your graph. You will also understand how to use color within the aesthetics.

Section 4: Project #2: Dykstra, Murray, and Brett
03:20

In this video, I give the details of project #2.

03:02

In this video, we return to our Linux machine and extract the data we need for our project.

03:41

In this video, we read our data into R.

03:18

In this video, we generate a column of date objects in the default R date format.

03:15

In this video, we modify the AB column and generate an H (hits) column.

09:41

In this video, we generate the player data frames. We accumulate the AB and H columns and then divide them to obtain a batting average column.
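
The accumulate-then-divide step can be sketched in base R as follows. The data frame, its values, and the new column names are made up for illustration; the course itself works with dplyr, so its code probably looks different.

```r
# Made-up per-game totals for one player, already sorted by date.
player <- data.frame(
  date = as.Date(c("1980-04-11", "1980-04-12", "1980-04-13")),
  AB   = c(4, 3, 5),
  H    = c(2, 0, 3)
)

# Running totals of at-bats and hits up to and including each row.
player$cum_AB <- cumsum(player$AB)
player$cum_H  <- cumsum(player$H)

# Batting average to date: cumulative hits divided by cumulative at-bats.
player$AVG <- player$cum_H / player$cum_AB

print(player)
```

Dividing the running totals, rather than averaging per-game averages, weights each game by its number of at-bats.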

03:54

In this video, we finally generate our plots.

04:31

In this video we add a horizontal line to our graph to represent the .400 batting average line.
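
A minimal sketch of such a reference line, assuming the ggplot2 package is installed and using geom_hline(). The data values, column names, and styling are invented for the example, and the video may use a different approach.

```r
library(ggplot2)

# Made-up running batting averages for illustration.
avg_df <- data.frame(
  date = as.Date(c("1980-05-01", "1980-06-01", "1980-07-01", "1980-08-01")),
  AVG  = c(0.310, 0.355, 0.392, 0.401)
)

p <- ggplot(avg_df, aes(x = date, y = AVG)) +
  geom_line() +
  # geom_hline() draws a horizontal reference line at the given y value.
  geom_hline(yintercept = 0.400, linetype = "dashed") +
  labs(x = "Date", y = "Batting average")

print(p)
```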

01:30

In this video, I recommend a text for additional ideas and examples.

Instructor Biography

Charles Redmond, Professor at Mercyhurst University

Dr. Charles Redmond is a professor in the Tom Ridge School of Intelligence Studies and Information Science at Mercyhurst University. He has been a member of the Department of Mathematics and Computer Systems at Mercyhurst for 21 years and has recently completed a term as chair of the department. Dr. Redmond received his PhD in mathematics from Lehigh University in 1993 and has published in the Annals of Applied Probability, the Journal of Stochastic Processes and Their Applications, Mathematics Magazine, the College Mathematics Journal, and Mathematics Teacher. In his spare time he enjoys making music and computer generated art, reading, and owning a Clumber Spaniel.
