Understanding New Data - Exploratory Analysis in R

Name: Understanding New Data - Exploratory Analysis in R
Rating: 4.4 (20 reviews)

Learn how to use R to quickly understand and analyze new data and start your data analysis projects with ease

Created by R-Tutorials Training

Last updated 3/2021

English

What you'll learn

Identify suitable R libraries for data exploration
Create suitable data visualizations
Learn the succession of steps in data exploration
Use a combination of hypothesis tests, explorations and models
How to prepare data for exploration
What to do when problems arise in the initial stages
Work with the main variable types
Use time series data

Course content

3 sections • 63 lectures • 6h 52m total length

The Landscape: Data Science and Data Analysis 10:30
Explore how data analysis sits inside the data science landscape, from research questions and data collection to data cleaning, validation, exploratory analysis, modeling, and communicating findings.
Data Analysis Stages: IDA, EDA and CDA 8:05
Explore the data analysis pipeline from initial data analysis to exploratory data analysis and confirmatory data analysis, focusing on data quality checks, preprocessing, visualization, and hypothesis testing.
Why Do We Work with Statistical Samples? - Population vs. Sample 8:36
Learn why researchers use samples instead of testing whole populations, and how random, representative samples estimate population parameters with sample statistics.
The Normal Probability Distribution 8:36
The Tidyverse 5:59
Explore the tidyverse ecosystem, a suite of data cleaning and preprocessing packages with pipe-based workflows. Master core tools and the readr and tibble concepts for efficient data preparation.
Datasets and R Libraries 3:51
Install and update R and RStudio, run the course scripts to install required packages (base and tidyverse), and explore the car parts, diamonds, and flights datasets used as training data.
Summary 1:39
Contrast data science with data analysis and the initial and exploratory stages. Grasp population versus sample and the normal distribution while setting up R and RStudio.

Introduction 2:46
Learn to perform initial data analysis by preparing and validating data quality, importing and restructuring oddly shaped data, and applying duplicate removal, missing value imputation, and outlier detection in R.
The Succession of Data Pre-processing Steps 7:19
Learn the succession of data preprocessing steps in R, from importing and tabular restructuring to type classification, duplicate removal, missing value imputation, and outlier handling.
Importing Tabular Data 9:30
Import tabular data into R using the readr package, skip the X1 column, and classify fields as factors or dates.
Reading and Parsing JSON Files 9:27
Reshaping Techniques 7:20
Learn how to reshape oddly shaped data in R by using tidyr's gather and spread to convert between long and wide formats, and split concatenated variables with the separate function.
Sampling Approaches: Creating Subsets with R Base 6:59
Apply sampling methods in base R to create representative subsets, addressing sample size and sampling error. Use stratified sampling to preserve population proportions and ensure reproducibility with a seed.
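As a rough illustration of the sampling ideas in this lecture, the sketch below draws a simple random sample and a stratified sample in base R. It uses the built-in iris data; the sample sizes (30 rows overall, 10 per species) and the seed value are arbitrary assumptions, not values taken from the course.

```r
# Fix the random number generator so the samples are reproducible.
set.seed(42)

n <- nrow(iris)

# Simple random sample: 30 row indices drawn without replacement.
simple_idx <- sample(n, size = 30)
simple_sample <- iris[simple_idx, ]

# Stratified sample: split the row indices by species, then draw
# 10 rows from each stratum so every group is equally represented.
strata <- split(seq_len(n), iris$Species)
strat_idx <- unlist(lapply(strata, function(rows) sample(rows, size = 10)))
strat_sample <- iris[strat_idx, ]

# Group sizes in the two samples; only the stratified one is guaranteed
# to match the proportions in the full data.
table(simple_sample$Species)
table(strat_sample$Species)
```

Because iris has equal group sizes, drawing the same number of rows per stratum preserves the population proportions; with unequal groups the per-stratum size would have to scale with each group's share.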
Sampling Approaches: Stratified Sampling 4:59
Learn stratified sampling in R with slice_sample and group_by to achieve proportional representation across iris species, using seeds and pipe syntax for reproducible, streamlined sampling.
Classifying Variables and Objects 9:25
Explore core data types and structures in R (vectors, lists, data frames, and tables) and how class and structure affect analysis and visualization.
Data Class Conversion 2:50
Managing Duplicates 9:56
Relative Group Sizes: Calculating Marginal Sums 5:51
Understanding Missing Values 8:07
Identify and categorize missing values as MCAR, MAR, or MNAR. Apply deletion, hot deck, mean imputation, interpolation, and multiple imputation by chained equations to mitigate bias.
R's Toolbox for Missing Data Handling 5:34
Detecting Missing Data with Visual Tools: Pattern Identification 5:33
Explore visual tools from the visdat package to detect patterns in missing data, using color-coded and greyscale plots to relate missingness to data types and overall missing value percentages.
Simple NA Handling Methods 6:33
Learn simple missing-value handling in R: delete with na.omit, and impute with zoo package tools (na.fill, na.locf, na.approx, na.spline) and na.aggregate for group-wise means.
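The simple NA handling functions named in this lecture can be tried on a toy vector. This is a minimal sketch that assumes the zoo package is installed; the vector and the grouping variable g are made up for illustration and are not course data.

```r
library(zoo)

# Toy numeric vector with two missing values (illustrative only).
x <- c(2, NA, 6, 8, NA, 12)

# Deletion: drop the missing observations.
x_deleted <- na.omit(x)

# Last observation carried forward.
x_locf <- na.locf(x)

# Linear interpolation between the neighbouring observed values.
x_approx <- na.approx(x)

# Replace each NA with the mean of all observed values.
x_mean <- na.aggregate(x)

# Group-wise means: each NA is replaced by the mean of its own group.
g <- c("a", "a", "a", "b", "b", "b")
x_group_mean <- na.aggregate(x, by = g)

print(x_approx)
print(x_group_mean)
```

Each method makes a different assumption: carrying values forward and interpolating only make sense for ordered data, while mean-based replacement ignores order but shrinks the variance.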
Investigating the Structure of Missing Values 5:31
Explore missing data with the mice package, performing multiple imputation by chained equations using 24 methods for diverse data types, and inspect patterns with the md.pattern tool.
Deciding for a Suitable NA Handling Method 3:54
Assess missing values in profit and sales to decide an appropriate imputation strategy, considering outliers and avoiding mean-based or ordered methods, and apply random forest imputation.
Multiple Imputation with Random Forest 9:39
Explore a random forest-based multiple imputation method using mice with three imputations and a fixed seed, then validate data for sales and profit with no new outliers.
Validating Numeric Variables 3:58
Validate numeric variables before outlier detection by ensuring positive sales and profit and that sales exceed profit, then flag validity and remove invalid observations for reliable analysis in R.
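The validation rules described here (positive sales, positive profit, sales greater than profit) can be sketched in base R as follows. The data frame, its column names, and its values are hypothetical stand-ins for the course's car parts data, and the sketch assumes missing values have already been imputed.

```r
# Hypothetical data; not the course dataset.
parts <- data.frame(
  sales  = c(1200, 850, -40, 300, 500),
  profit = c(300, 900, 10, 0, 120)
)

# Flag a row as valid only if all three plausibility rules hold.
parts$valid <- parts$sales > 0 &
  parts$profit > 0 &
  parts$sales > parts$profit

# Keep the valid observations for the later outlier checks.
parts_valid <- parts[parts$valid, ]

print(parts)
print(parts_valid)
```

Keeping the flag as its own column before filtering makes it easy to report how many rows failed the checks.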
Understanding Outliers and the Reasons Behind Them 4:52
Identify outliers and invalid values after imputing missing data, and assess their impact on analysis accuracy. Explore delete, substitute, or preserve strategies, with one-column and multivariable approaches and plausibility checks.
Exploring Outliers in the Data 4:07
Explore univariate outlier detection with descriptive statistics, six sigma methods, visual tools, and box plots in a car parts dataset, imputing missing values and examining profit and sales.
Outlier Detection with Visual Methods: The Boxplot Method 4:25
Learn how the boxplot method detects outliers in a single variable using Q1, Q3, and 1.5 times the interquartile range, with whiskers and the out element.
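A minimal sketch of the boxplot rule in base R, using a made-up vector. It shows both the out element returned by boxplot.stats and a manual version built from Q1, Q3, and the interquartile range.

```r
# Illustrative data with one obviously large value.
x <- c(10, 12, 11, 13, 12, 11, 10, 50)

# boxplot.stats returns the whisker statistics and the points
# beyond the whiskers in its `out` element.
bp <- boxplot.stats(x, coef = 1.5)
bp$out

# Manual version of the same rule using quantiles.
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
manual_out <- x[x < lower_fence | x > upper_fence]
manual_out
```

Note that boxplot.stats works with hinges rather than quantile(), so the two fences can differ slightly on small samples even though both flag the same point here.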
Outlier Detection with the Six Sigma Method 9:25
Apply the Six Sigma edit rule to identify outliers using the mean plus or minus three standard deviations, with attention to normal distribution and imputation.
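The mean plus or minus three standard deviations rule can be sketched in a few lines of base R. The vector below is invented for illustration and is not course data.

```r
# Illustrative data: 30 values near 50 plus one extreme value.
x <- c(rep(c(48, 50, 52), times = 10), 120)

# Standardise each value and flag those more than
# three standard deviations away from the mean.
z <- (x - mean(x)) / sd(x)
is_outlier <- abs(z) > 3

x[is_outlier]
```

Two caveats apply. The outlier itself inflates the mean and the standard deviation, which can mask other suspicious values, and in very small samples the rule cannot flag anything: with 10 or fewer observations no standardised value can exceed 3 in absolute terms.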
Detecting Outliers with Hypothesis Tests 8:07
Explore outlier detection with hypothesis tests in R using Grubbs' test and the outliers package, including one- and two-sided options, test types, p-values, and practical interpretation of results.
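To show what a two-sided Grubbs' test computes, the sketch below builds the test statistic and its critical value by hand in base R. It is not the outliers package implementation used in the lecture; the helper name grubbs_two_sided and the example data are assumptions made for illustration, and the test itself presumes approximately normal data.

```r
# Hand-rolled two-sided Grubbs' test for a single outlier (illustrative helper).
grubbs_two_sided <- function(x, alpha = 0.05) {
  n <- length(x)
  deviations <- abs(x - mean(x))

  # Test statistic: largest absolute deviation in units of the sample sd.
  g <- max(deviations) / sd(x)

  # Critical value derived from the t distribution with n - 2 df.
  t_crit <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)
  g_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))

  list(
    statistic = g,
    critical  = g_crit,
    suspect   = x[which.max(deviations)],
    reject    = g > g_crit
  )
}

with_outlier    <- c(rep(c(48, 50, 52), times = 10), 120)
without_outlier <- rep(c(48, 50, 52), times = 10)

res_with    <- grubbs_two_sided(with_outlier)
res_without <- grubbs_two_sided(without_outlier)

res_with$suspect
res_with$reject
res_without$reject
```

The test addresses one suspect value at a time, so it is usually applied to the most extreme observation only.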
Multivariate Outlier Detection 4:15
Explore multivariate outlier detection by contrasting model-based and proximity-based approaches. Learn methods that transform data along eigenvectors to account for covariance and high-dimensional data, and compare points to a distribution.
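One common way to compare points to a distribution while accounting for covariance is the Mahalanobis distance. The sketch below uses simulated data with one planted point that breaks the correlation between two variables; the variable names, the seed, and the 97.5% chi-squared cutoff are illustrative assumptions rather than choices taken from the course.

```r
set.seed(7)

# Two positively correlated variables (simulated, illustrative).
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)

# Add one planted point that is unremarkable in each variable's own
# range but contradicts the positive correlation.
dat <- rbind(cbind(x1, x2), c(4, -4))

# Squared Mahalanobis distance of every row from the centre,
# scaled by the sample covariance matrix.
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# Under multivariate normality d2 is roughly chi-squared with
# df = number of variables; use the 97.5% quantile as cutoff.
cutoff <- qchisq(0.975, df = ncol(dat))
flagged <- which(d2 > cutoff)

flagged
```

Because the mean and covariance are estimated from data that include the outlier, robust estimates are often preferred in practice.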
Robust Principal Component Algorithm for Outlier Detection 5:45
Apply the robust principal component algorithm with the pcout function to detect outliers in numeric car parts data, then assign a binary normal/outlier classifier and visualize the results.
Outlier Detection with the Mahalanobis Distance 7:42
Testing for Outliers in Transformed Data 8:09
Apply the Box-Cox transformation to right-skewed data using lambda to assess normality and detect outliers. Compare methods after transformation and decide to keep, remove, or impute identified observations.
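The Box-Cox transformation itself is a short formula, sketched here in base R. The helper name box_cox is made up, lambda is chosen by hand rather than estimated, and the data are simulated, so treat this as an illustration of the idea rather than the lecture's procedure.

```r
# Box-Cox transform for strictly positive data:
# (x^lambda - 1) / lambda, with log(x) as the limiting case lambda = 0.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

set.seed(11)
skewed <- rlnorm(200, meanlog = 3, sdlog = 0.8)  # right-skewed sample

transformed <- box_cox(skewed, lambda = 0)

# Compare how many points the boxplot rule flags before and after.
n_out_raw <- length(boxplot.stats(skewed)$out)
n_out_transformed <- length(boxplot.stats(transformed)$out)

c(raw = n_out_raw, transformed = n_out_transformed)
```

On right-skewed data many values flagged on the raw scale are ordinary tail observations, so checking again after the transformation helps separate genuine outliers from skewness.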
Plausibility Checks for Non-numeric Data 6:56
Explore plausibility checks for non-numeric data using a summary to spot typos and misclassified categories, then reclassify and drop empty levels; adjust dates with the lubridate package's year function.
Writing a Report: What to Include in an IDA Progress Documentation 5:55
Summary: Initial Data Analysis 7:50
Learn initial data analysis in R by importing, cleaning, and converting data to tables, then handle duplicates, missing values, and outliers with simple and multiple imputation.

Introduction 2:55
Master exploratory data analysis with a reusable blueprint, using data visualizations and chart types to reveal factors influencing price and arrival delays, including time-series summaries and a generalized linear model.
What Is EDA and What Is the Succession of Steps? 8:39
The Benefits of Using Data Visualizations in EDA 4:55
Explore how data visualizations empower exploratory data analysis by enabling interactive, rapid insights, telling the analytical story through charts like box plots and a dashboard.
Basic Plot Types for EDA 7:37
Dataset Overview and Quality Check: Diamonds from Ggplot2 5:48
Explore the ggplot2 diamonds data set in R by inspecting its structure, summary statistics, and potential outliers, duplicates, balance, and missing values for a quick quality check.
Non-parametric Methods to Explore the Distribution in Numeric Variables 9:50
Explore numeric variable distributions using non-parametric methods: compare histograms and box plots to reveal skewness, center, and variability without assuming a specific distribution.
Parametric Methods to Explore the Distribution in Numeric Variables 7:23
Compare numeric variable distributions to theoretical models using the Q-Q plot to assess normality and fit. Explore parametric options like lognormal, gamma, and Weibull, with data transformations and outlier considerations.
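A normal Q-Q plot can be drawn with base R alone. The sketch below uses simulated lognormal data (an assumption for illustration, not the diamonds data) and compares the raw values with their logarithms; the correlation between theoretical and sample quantiles is used here only as a rough numeric summary of how straight the Q-Q line is.

```r
set.seed(1)

# Simulated right-skewed sample (lognormal), illustrative only.
x <- rlnorm(200, meanlog = 0, sdlog = 1)

# Q-Q coordinates without plotting, for the raw and log-transformed data.
qq_raw <- qqnorm(x, plot.it = FALSE)
qq_log <- qqnorm(log(x), plot.it = FALSE)

# The closer this correlation is to 1, the straighter the Q-Q line.
r_raw <- cor(qq_raw$x, qq_raw$y)
r_log <- cor(qq_log$x, qq_log$y)
c(raw = r_raw, log = r_log)

# The plots themselves: points near the reference line suggest normality.
qqnorm(log(x), main = "Normal Q-Q plot of log(x)")
qqline(log(x))
```

A curved pattern on the raw scale that straightens after the log transformation is the typical signature of a lognormal variable.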
Exploring Categorical Variables 4:41
The Distribution in Relation to Grouping Variables 5:54
Density Plot 7:38
Relationships Between Numeric Variables 7:02
Explore how numeric variables relate using scatterplots to reveal linear trends, correlation, and regression concepts, and understand why correlation does not imply causation in exploratory data analysis.
Dataset Overview: Flights 5:13
Explore the nycflights13 flights data with 336k observations and 19 variables, focusing on delays and fragmented date-time fields. Learn when to use correlation and hypothesis tests, plus sampling.
Dataset Summary and Variable Classification 5:38
Analyze the dataset summary and fix variable classes and date-time components from the 2013 data with 24-hour time, remove incomplete rows, and prepare for the exploratory phase in R.
Summaries for Grouping Variables 3:43
Explore grouping variables using count summaries to reveal unbalanced carrier groups and evenly distributed origin airports, then visualize with a bar chart using reorder for clear ordering.
Assembling Summary Tables of Custom Aggregations 5:51
Assemble summary tables using group_by and summarize to compute flights, total distance, arrival delays in minutes, and arrival delay per 1,000 miles by carrier, then compare punctuality across carriers.
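A summary table of this kind could look roughly like the sketch below. It assumes the dplyr package is installed and uses a tiny made-up data frame instead of the real flights data; the column names carrier, distance, and arr_delay mirror nycflights13, while the output column names are illustrative choices.

```r
library(dplyr)

# Toy stand-in for the flights data (values are invented).
flights_toy <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA", "DL"),
  distance  = c(1000, 500, 2000, 1000, 1000, 750),
  arr_delay = c(10, 5, 30, -10, 40, 0)
)

by_carrier <- flights_toy %>%
  group_by(carrier) %>%
  summarise(
    flights              = n(),
    total_distance       = sum(distance),
    total_arr_delay      = sum(arr_delay),
    delay_per_1000_miles = total_arr_delay / total_distance * 1000
  ) %>%
  arrange(delay_per_1000_miles)

print(by_carrier)
```

Normalising the delay by distance puts carriers with very different route lengths on a comparable scale. With real data containing missing delays, the sums would need na.rm = TRUE or a prior filter.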
Numeric Distributions 8:00
Analyze numeric distributions with histograms and box plots to compare flight distance, air time, and departure and arrival delays, reveal skewness, and guide nonparametric testing.
Time Series Based Summaries 7:08
Explore a year's flight data by merging date and time components into a daily timeline, then analyze daily flights and delays with histograms, box plots, and scatterplots to reveal irregularities.
Visual Exploration of the Time Component 7:49
Explore time-based data in R by visualizing daily flights against the date, and analyze weekly and seasonal patterns with line and bar plots, delays, and derived weekday and week variables.
Analysing What Is Missing: Cancelled Flights 7:59
Linear Relationships Between Numeric Variables 6:18
Examine how to assess linear relationships between numeric variables using correlation coefficients and scatterplots, with Spearman and Pearson comparisons showing near-zero distance–delay and strong departure–arrival delay correlations.
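The difference between the Pearson and Spearman coefficients is easiest to see on a relationship that is monotone but not linear. The vectors below are invented for illustration.

```r
# A strictly increasing but strongly non-linear relationship.
x <- 1:10
y <- exp(x)

# Pearson measures linear association, Spearman measures
# monotone association based on ranks.
r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Spearman reaches 1 because the ranks agree perfectly, while Pearson stays below 1 because the points do not lie on a straight line. For skewed variables such as delays, comparing both coefficients is a quick check on whether a few extreme values drive the result.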
Measuring the Strength of Association Between Events 10:34
Learn to measure the association between departure and arrival delays using the odds ratio with binary delay indicators and relative risk, and interpret the results.
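Both measures can be computed directly from a 2x2 table of binary delay indicators. The counts below are invented for illustration and are not results from the flights data.

```r
# Rows: departure delayed (yes/no); columns: arrival delayed (yes/no).
tab <- matrix(
  c(80, 20,
    10, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    dep_delayed = c("yes", "no"),
    arr_delayed = c("yes", "no")
  )
)

# Risk of a late arrival within each departure group.
risk_dep_late   <- tab["yes", "yes"] / sum(tab["yes", ])
risk_dep_ontime <- tab["no", "yes"] / sum(tab["no", ])

# Relative risk: ratio of the two risks.
relative_risk <- risk_dep_late / risk_dep_ontime

# Odds ratio: ratio of the odds of a late arrival in the two groups.
odds_ratio <- (tab["yes", "yes"] / tab["yes", "no"]) /
  (tab["no", "yes"] / tab["no", "no"])

c(relative_risk = relative_risk, odds_ratio = odds_ratio)
```

With these counts the relative risk is 8 and the odds ratio is 36. The two diverge because a late arrival is not a rare event here; the odds ratio only approximates the relative risk when the outcome is rare.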
Statistical Models in Exploratory Analysis 6:07
Explore how statistical models in exploratory analysis identify independent and dependent variables, using correlations and regressions to assess relationships, select features, and quantify uncertainty with coefficients and random error.
Identifying Covariates with Logistic Regression 6:56
Explore logistic regression for binary outcomes with grouping variables, using a binomial GLM to model arrival delay. Learn that departure delay and carrier matter, while distance and origin/destination may not.
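A binomial GLM of this kind can be sketched with base R's glm. The data below are simulated so that departure delay drives the outcome and distance does not; the variable names, the seed, and the coefficients used in the simulation are assumptions for illustration, not estimates from the flights data.

```r
set.seed(123)
n <- 500

# Simulated predictors (illustrative only).
dep_delay <- rexp(n, rate = 1 / 20)        # departure delay in minutes
distance  <- runif(n, min = 100, max = 2500)

# Simulated binary outcome: the chance of a late arrival depends
# on departure delay only, not on distance.
p_late   <- plogis(-2 + 0.08 * dep_delay)
arr_late <- rbinom(n, size = 1, prob = p_late)

# Logistic regression: binomial family with the default logit link.
fit <- glm(arr_late ~ dep_delay + distance, family = binomial)

coefs <- summary(fit)$coefficients
print(round(coefs, 4))

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
exp(coef(fit))
```

In an exploratory setting the coefficient table is read as a screening tool: covariates with clear effects are candidates for the later confirmatory analysis, while the rest can often be set aside.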
Conclusions about the Flights Dataset 7:44
Investigate the factors behind arrival delays in the flights dataset through exploratory analysis. Highlight carrier and departure delay as influential predictors, and note weather context on extreme days.
Farewell 0:58
Celebrate completing the data science course and apply techniques from data analysis courses in Python and Tableau. Explore the portfolio and instructor profile for latest news, courses, monthly promotions, and discounts.

Requirements

Basic R programming skills
A general understanding of statistics and data visualization
R and RStudio ready on your computer

Description

Are you new to R and data analysis?
Do you ever struggle starting an analysis with a new dataset?
Do you have problems getting the data into shape and selecting the right tools to work with?
Have you ever wondered if a dataset had the information you were interested in and if it was worth the effort?

If some of these questions occurred to you, then this program might be a good start to set you up on your data analysis journey. Actually, these were the questions I had in mind when I designed the curriculum of this course. As you can see below, the curriculum is divided into three main sections. Although this course doesn't have a focus on the basic concepts of statistics, some of the most important concepts are covered in the first section of the course.

The two other sections have their focus on the initial and the exploratory data analysis phases respectively. Initial data analysis (or IDA for short) is where we clean and shape the data into a form suitable for the planned methods. This is also where we make sure the data makes sense from a statistical point of view. In the IDA section I present tools and methods that will help you figure out if the data was collected properly and if it is worthy of being analyzed.

On the other hand, the exploratory data analysis (EDA) section offers techniques to find out if the data can answer your analytical questions, or in other words, if the data has a relevant story to tell. This will spare you from investing time and effort into a project that will not deliver the results you hoped for. In an ideal case the results of EDA may confirm that the planned analysis is worth it and that there are insights to be gained from that dataset and project.

If you are interested in statistical methods and R tools that help you bridge the gap between data collection and the confirmatory data analysis (CDA), then this program is for you. Take a look at the curriculum and give this course a try!

Who this course is for:

Data scientists
Analysts of all fields
Researchers working with and analyzing data
Young professionals wanting to switch to data analysis related work
Students taking data analysis exams
Everyone interested in analyzing data
Data exploration is an initial phase of a data analysis project, so you will need these skills in most of your projects

Course content

Introduction • 7 lectures • 47min

Initial Data Analysis and Data Pre-processing • 31 lectures • 3hr 23min

Exploratory Data Analysis • 25 lectures • 2hr 42min