Exploratory Data Analysis in R

Name: Exploratory Data Analysis in R
Rating: 4.9 (4 reviews)

Four graphical techniques you can use to quickly explore your data

Created by Ray James Hoobler

Last updated 1/2022

English

What you'll learn

Develop a fundamental framework to carry out your own Exploratory Data Analysis
The use of scatter plots and how to incorporate linear and non-linear models into your graphics
How to evaluate if your data is "normal" using histograms and probability plots
The power of box plots to compare groups

Course content

7 sections • 28 lectures • 5h 17m total length

Introduction to EDA in R (2:43)

03 - Histograms - overview presentation (11:27)
03 - Histograms - getting started (2:26)
03 - Histograms - normal data (12:18)
03 - Histograms - non-normal, short-tailed (5:56)
03 - Histograms - non-normal, long-tailed (4:26)
03 - Histograms - symmetric and bimodal (9:43)
03 - Histograms - bimodal mixture of two normal distributions (4:43)
03 - Histograms - non-normal, skewed right (9:33)
We take some diversions along the way as we plot a skewed dataset. We often need to assign names to our variables (i.e., column headers) or specify the data type (numeric, factor, and so on). As we import the data, we'll run into these issues, and instead of editing them out of the video, I've included them as an example of how you can solve these problems when importing data.
03 - Histograms - symmetric with outliers (10:14)
We will include a simple method for adding data to an existing data frame.
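
The two import issues mentioned above (naming columns and setting data types) and the idea of adding data to an existing data frame can be sketched in base R. This is only an illustration under stated assumptions: the file contents, the column names batch and strength, and the added value are all hypothetical and are not taken from the course datasets.

```r
# Hypothetical headerless data file, written to a temp path so the sketch runs on its own.
path <- tempfile(fileext = ".dat")
writeLines(c("1 9.2", "1 10.1", "2 11.4"), path)

# Assign column names and column types while importing.
dat <- read.table(path, header = FALSE,
                  col.names  = c("batch", "strength"),
                  colClasses = c("integer", "numeric"))

# Treat the numeric batch code as a categorical variable.
dat$batch <- factor(dat$batch)

# Add one more observation (here an outlier) by binding a data frame
# that has the same column names.
extra <- data.frame(batch = factor("2"), strength = 25.0)
dat <- rbind(dat, extra)

str(dat)
```

The course video may solve these problems differently; rbind() is just one simple way to append rows.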

Extra Materials (51:07)
These are additional topics related to the creation of "complete" graphics. They are not part of the course but are presented as a guide to help you move beyond the basics. The timestamps below are approximate:
0:00 Intro and loading data
10:00 Themes
12:34 Controlling chunk messages and warnings
13:25 Adding fit lines
17:13 Color and factors
19:55 Filtering data inside ggplot()
22:01 Using facets to create multiple plots
25:00 Adding ASCII characters
26:55 Creating a complete graphic
31:23 Adjusting transparency with the alpha argument
35:50 Adding color to the data points
39:30 Adding data to the data frame using mutate
41:08 Summarizing data with the group_by() and summarise() functions
43:25 Arranging data inside a data frame
46:43 Reordering data in ggplot()
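
Several of the topics listed above (mutate, group_by() and summarise(), arranging, filtering inside ggplot(), facets, themes, fit lines, and alpha) can be combined in one short sketch. It assumes the dplyr and ggplot2 packages are installed and uses R's built-in mtcars data rather than the course datasets; the column name wt_kg and the hp < 200 cutoff are illustrative choices only.

```r
library(dplyr)
library(ggplot2)

# mutate(): add columns to the data frame (wt is recorded in units of 1000 lb).
cars <- mtcars %>%
  mutate(cyl = factor(cyl), wt_kg = wt * 453.592)

# group_by() + summarise(): one summary row per group, then arrange() to sort.
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))
print(by_cyl)

# Filter inside ggplot(), color by a factor, set transparency with alpha,
# add a linear fit line, split into facets, and apply a theme.
p <- ggplot(dplyr::filter(cars, hp < 200), aes(x = wt_kg, y = mpg, color = cyl)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ cyl) +
  theme_minimal() +
  labs(x = "Weight (kg)", y = "Miles per gallon", color = "Cylinders")
# print(p) draws the figure in the RStudio Plots pane.
```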

Requirements

You will need to have R and RStudio Desktop installed on your computer (Mac or PC) as well as an internet connection to download and install packages within RStudio Desktop. A basic understanding of the RStudio environment is assumed.

Description

This example-based course introduces exploratory data analysis (EDA) using R. A primary objective is to apply graphical EDA techniques to representative data sets using the RStudio platform.

I have incorporated datasets from the NIST/SEMATECH e-Handbook of Statistical Methods into this course and adopted its fundamental approach to Exploratory Data Analysis.

We use scatter plots to examine relationships between two variables, determine if there is a linear or non-linear relationship, analyze variations of the dependent variable, and determine if there are outliers in the dataset.
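
As a rough sketch of what this looks like in ggplot2, the example below overlays a linear fit and a non-linear (loess) smooth on the same scatter plot. The data are simulated for illustration, which is an assumption of this sketch and not one of the course's NIST datasets.

```r
library(ggplot2)

set.seed(1)
# Simulated data with a curved relationship between x and y.
x  <- seq(0, 10, length.out = 100)
df <- data.frame(x = x, y = 2 + 0.5 * x^2 + rnorm(100, sd = 2))

# Scatter plot with a straight-line fit (blue) and a loess smooth (red).
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "steelblue") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "firebrick")
# print(p) draws the figure in the RStudio Plots pane.

# A numerical companion to the picture: compare a linear and a quadratic model.
fit_lin  <- lm(y ~ x, data = df)
fit_quad <- lm(y ~ poly(x, 2), data = df)
r2 <- c(linear = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared)
print(r2)
```

When the red curve departs visibly from the blue line, as it should here, a straight-line model is probably missing structure in the data.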

Of course, we need to remember that causality implies association and that association does NOT imply causality.

We will summarise the distribution of a dataset graphically using histograms. This tool can quickly show us the location and spread of the data and give us a good indication of whether the data follows a normal distribution, is skewed, or has multiple modes or outliers.
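
A minimal sketch of this idea in ggplot2 follows. It compares a simulated normal sample with a simulated right-skewed sample; both samples, the bin count, and the colors are assumptions made for illustration.

```r
library(ggplot2)

set.seed(42)
# Two simulated samples: one roughly normal, one skewed to the right.
df <- data.frame(
  value = c(rnorm(500, mean = 50, sd = 5), rexp(500, rate = 1 / 10)),
  shape = rep(c("normal", "skewed right"), each = 500)
)

# One histogram per sample, each with its own x-axis range.
p <- ggplot(df, aes(x = value)) +
  geom_histogram(bins = 30, fill = "grey70", color = "white") +
  facet_wrap(~ shape, scales = "free_x")
# print(p) draws the figure in the RStudio Plots pane.

# For a right-skewed sample the mean sits well above the median.
gap <- tapply(df$value, df$shape, function(v) mean(v) - median(v))
print(gap)
```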

An underused, complementary technique to histograms is the probability plot. We will construct probability plots by plotting the data against a theoretical normal distribution. If the data follows a normal distribution, the plot will form a straight line. We will use the normal probability plot to assess whether or not our examples follow a normal distribution.
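
The sketch below shows one way to build a normal probability plot with ggplot2's stat_qq() and stat_qq_line(), using simulated samples (an assumption for illustration). It also computes the correlation of the probability-plot points with base R's qqnorm(); that numeric check is an addition here and is not claimed to be part of the course.

```r
library(ggplot2)

set.seed(7)
# Simulated samples: one normal, one right-skewed.
df <- data.frame(
  value = c(rnorm(200, mean = 10, sd = 2), rexp(200, rate = 1)),
  shape = rep(c("normal", "skewed right"), each = 200)
)

# Normal probability plots: points close to the line suggest normality.
p <- ggplot(df, aes(sample = value)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ shape, scales = "free_y")
# print(p) draws the figure in the RStudio Plots pane.

# Straightness of each plot, measured as the correlation between
# theoretical normal quantiles and the ordered data.
qq_cor <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)
  cor(qq$x, qq$y)
}
r <- tapply(df$value, df$shape, qq_cor)
print(r)
```

A correlation very close to 1 corresponds to a nearly straight plot; the skewed sample should score noticeably lower.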

Finally, we will use box plots to view the variation between different groups within the data.
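
As a small illustration, the sketch below draws one box per group with geom_boxplot(). It uses R's built-in iris data, which is an assumption of this example and not one of the course datasets.

```r
library(ggplot2)

# One box per species, so location and spread can be compared side by side.
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(x = "Species", y = "Sepal length (cm)")
# print(p) draws the figure in the RStudio Plots pane.

# The group medians correspond to the center line of each box.
med <- tapply(iris$Sepal.Length, iris$Species, median)
print(med)
```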

Aside from scatterplots, most spreadsheet programs do not support these methods, so learning how to do this fundamental analysis in R can improve your ability to explore your data.

Who this course is for:

If you currently create multiple data visualizations in spreadsheets, you've probably wondered how you could improve your work or how you could work more efficiently. Or, if you have to recreate graphics repeatedly, you might be looking for a tool to make your work more reproducible. This course focuses on the basic techniques used in Exploratory Data Analysis: scatterplots, histograms, probability plots, and box plots. Learning R and ggplot2 will allow you to move beyond spreadsheets and use a professional tool to explore your data effectively.

Course content

Introduction to EDA in R: 1 lecture • 3min

Graphical techniques - scatter plots: 8 lectures • 1hr 56min

Graphical techniques - histograms: 9 lectures • 1hr 11min

Graphical techniques - box plots: 4 lectures • 47min

Graphical techniques - probability plots: 4 lectures • 28min

Conclusion to EDA in R: 1 lecture • 2min

Extra materials for EDA in R: 1 lecture • 51min
