Are you interested in learning about data analysis and machine learning, but don't know where to start? Are you a sports fan who wonders how analytics can be applied to sports? In the game of football, are you curious about which positions are the most important (other than the quarterback)?
If so, you've come to the right course! In this course, I will show you how easy it is to use the statistical programming language R, working in RStudio, to analyze NFL data and answer the question of which positions matter the most in the game of football! I will walk you through this project so that you learn R by doing, as opposed to watching boring lectures that cover theory without any applications.
My hope is that working through this project will give you the interest and motivation to answer your own statistics and data-related questions using the concepts I cover in this course. I want you to become proactive practitioners instead of just spectators and consumers.
An introduction to the concepts I will cover in this course: web scraping using R and Python, working with dataframes, manipulating and merging dataframes, and then using Lasso regression and Random Forest to build models that show which positions in football correlate best with teams' winning percentages.
Data for projects in which you want to generate predictions is not always easy to obtain. It will not always be available as a simple file you can download. I will show you web scraping, a valuable skill that data scientists and data analysts use to obtain data from websites by extracting their HTML content. I use the BeautifulSoup package in Python for this.
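Here is a minimal sketch of the idea behind scraping an HTML table. It is not the course code: to keep it runnable without installing anything, it uses Python's built-in html.parser module instead of BeautifulSoup, and the HTML snippet, team names, and records are made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>W</th><th>L</th></tr>
  <tr><td>Team A</td><td>12</td><td>4</td></tr>
  <tr><td>Team B</td><td>6</td><td>10</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
# Winning percentage = wins / games played, computed from the scraped cells.
win_pct = {team: int(w) / (int(w) + int(l)) for team, w, l in records}
```

The point is that a web page's table is just nested tags, so once the cells are pulled out as rows, they can be treated like any other dataset.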
I cover where to obtain RStudio, where to find the code I use in this course (https://github.com/jk34/NFL_model), how to read the Excel files into R and store them as dataframes, and how to combine dataframes using cbind and merge.
I cover how to extract the HTML tables from ESPN that contain the teams and their winning percentages for each season. I extract these tables using web scraping in R.
I cover how to take a subset of the dataframe of players and positions so that it contains only the top-rated player at each position for each team in each season. I then merge this new dataframe with the dataframe containing the winning percentage of each team.
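The two steps in this lecture (keep the top-rated player per position, then merge with winning percentage) can be sketched in plain Python. This is only an illustration of the logic, not the R code from the course; the column layout, player names, ratings, and winning percentages are all invented.

```python
# (season, team, position, player, rating): hypothetical example rows
players = [
    (2014, "Team A", "QB", "Player 1", 88),
    (2014, "Team A", "QB", "Player 2", 71),
    (2014, "Team A", "WR", "Player 3", 90),
    (2014, "Team B", "QB", "Player 4", 79),
    (2014, "Team B", "WR", "Player 5", 83),
    (2014, "Team B", "WR", "Player 6", 85),
]

# (season, team) -> winning percentage, also hypothetical
win_pct = {(2014, "Team A"): 0.750, (2014, "Team B"): 0.375}


def top_rated(players):
    """Keep only the highest rating for each (season, team, position)."""
    best = {}
    for season, team, position, _name, rating in players:
        key = (season, team, position)
        if key not in best or rating > best[key]:
            best[key] = rating
    return best


def merge_with_wins(best, win_pct):
    """Build one row per (season, team): win_pct plus one column per position.

    Teams without a winning percentage are dropped, like an inner join.
    """
    rows = {}
    for (season, team, position), rating in best.items():
        if (season, team) not in win_pct:
            continue
        row = rows.setdefault((season, team),
                              {"win_pct": win_pct[(season, team)]})
        row[position] = rating
    return rows


table = merge_with_wins(top_rated(players), win_pct)
```

Each row of the resulting table has one rating per position plus the team's winning percentage, which is the shape a regression model needs: positions as predictors, winning percentage as the target.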
I cover how to split our dataframe into training, validation, and test sets. Because I don't have a separate test set, I hold out 1/5 of our data as the "test" set and use K-fold cross-validation on the remaining data to generate the training and validation sets.
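A small sketch of this splitting scheme, written in plain Python rather than R. The function names, the seed, and the choice of 4 folds are assumptions made for the example; the course may use different settings.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (remaining, test)


def k_folds(rows, k=5):
    """Yield (training, validation) pairs; each fold is validation once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, validation


data = list(range(100))             # stand-in for 100 team-season rows
remaining, test = holdout_split(data)
splits = list(k_folds(remaining, k=4))
```

The test rows are never touched while fitting or tuning. Only the remaining rows rotate through the training and validation roles.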
I talk about the problem of overfitting and how to overcome it when generating predictive models.
I finally cover how to use our data to generate a model that can answer our question of which positions matter the most in the NFL. I go over Linear regression and how Lasso regression modifies it so that it picks out only the relevant positions in the game of football. I also cover how to use Linear and Lasso regression in R.
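To see why Lasso "picks out" predictors, here is a minimal sketch that minimizes (1/(2n)) * sum of squared errors + alpha * sum of |coefficients| by coordinate descent. It is a plain-Python illustration, not the R workflow used in the course, and it assumes the predictors and the target are already centered so that no intercept is needed. The toy numbers and position labels are invented.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Update one coefficient at a time, holding the others fixed."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # sum of squares of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k]
                                     for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            if z == 0:
                beta[j] = 0.0  # a constant-zero column carries no information
                continue
            beta[j] = soft_threshold(rho / n, alpha) / (z / n)
    return beta


# Toy centered data with hypothetical columns: QB, WR, K ratings.
X = [[1, 1, 1],
     [-1, 1, -1],
     [1, -1, -1],
     [-1, -1, 1]]
# Target built as 2 * QB + 0.1 * K, so WR has no effect at all.
y = [2.1, -2.1, 1.9, -1.9]

ols_like = lasso_coordinate_descent(X, y, alpha=0.0)  # no penalty
lasso = lasso_coordinate_descent(X, y, alpha=0.5)     # with L1 penalty
```

With alpha set to 0 the fit is ordinary linear regression and keeps a small coefficient for the third column. With the penalty turned on, the weak coefficients are set exactly to zero and only the strong predictor survives, which is the behavior that makes Lasso useful for picking out the relevant positions.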
I cover another predictive model we will use: Random Forest. It is an ensemble method that averages the predictions of many decision trees.
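The "average of many trees" idea can be sketched in a few lines of plain Python. This is a deliberate simplification and not the R implementation used in the course: each "tree" here is a single-split stump, fit on a bootstrap sample with a random subset of features, whereas a real Random Forest grows deeper trees and re-samples features at every split. All names, parameters, and data are hypothetical.

```python
import random


def fit_stump(X, y, features):
    """Depth-1 regression tree: best single split among the given features."""
    best = None
    for j in features:
        for threshold in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((v - left_mean) ** 2 for v in left)
                   + sum((v - right_mean) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:  # no valid split found: always predict the mean
        mean = sum(y) / len(y)
        return (None, None, mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    if j is None:
        return left_mean
    return left_mean if row[j] <= threshold else right_mean


def fit_forest(X, y, n_trees=50, n_features=1, seed=0):
    """Fit each stump on a bootstrap sample using a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        features = rng.sample(range(p), n_features)  # random feature subset
        forest.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """The ensemble prediction is the average of the individual trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Hypothetical ratings for two positions and a winning percentage.
X = [[90, 60], [85, 75], [70, 80], [65, 55], [80, 70], [60, 65]]
y = [0.75, 0.69, 0.44, 0.31, 0.56, 0.25]
forest = fit_forest(X, y, n_trees=50, n_features=1, seed=0)
prediction = predict_forest(forest, [88, 70])
```

Because every tree sees slightly different rows and features, their individual mistakes tend to differ, and averaging them gives a steadier prediction than any single tree.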
I talk about how I implement Random Forest on our data in R, and then how to generate a visualization that shows which positions are the most important to winning in the NFL.
Jerry Kim has an MA in Physics from the University of Texas at Austin, along with a BA in Physics and a BS in Applied Mathematics from the University of California at Los Angeles. He taught Physics courses for two years as a teaching assistant at the University of Texas at Austin. He has experience with programming, computational physics, and data science.
He wants to share his experience and knowledge of programming and data analysis with others. He currently works as a freelancer on data science projects.