Data Analysis and Machine Learning with R

Explore advanced algorithm and visualization concepts to get the most out of your data through real-world examples.
3.0 (1 rating)
2 students enrolled
Created by Packt Publishing
Last updated 4/2019
English
English [Auto-generated]
This course includes
  • 4.5 hours on-demand video
  • 1 downloadable resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What you'll learn
  • Handle missing values and duplicates
  • Learn to scale and standardize values
  • Learn to apply classification and regression techniques
  • Work with advanced algorithms and techniques to enable efficient machine learning using the R programming language.
  • Explore concepts such as the random forest algorithm.
  • Work with support vector machines, and examine and plot the results.
  • Find out how to use the K-Nearest Neighbors algorithm for classification.
  • Work with a variety of real-world algorithms that suit your problem.
Course content
62 lectures 04:26:45
+ R Data Analysis Solutions - Machine Learning Techniques
43 lectures 03:11:28

This video gives an overview of the entire course.

Preview 03:48

CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. A minimal R sketch follows this recipe.

  • Read data from data-file

  • Verify the result

  • Use optional arguments

Reading Data from CSV Files
06:30
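
A minimal R sketch of this recipe. The file name auto-mpg.csv matches a data file mentioned later in the course, but the optional arguments shown are only illustrative:

    # auto-mpg.csv is a placeholder path; point this at your own CSV file
    auto <- read.csv("auto-mpg.csv", header = TRUE, stringsAsFactors = FALSE)

    # Verify the result
    str(auto)
    head(auto)

    # Useful optional arguments: sep for the delimiter, na.strings for missing-value markers
    auto <- read.csv("auto-mpg.csv", sep = ",", na.strings = c("", "NA"))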

You may sometimes need to extract data from websites. Many providers also supply data in XML and JSON formats.

  • Load the XML and JSON libraries

  • Extract the data

  • Convert the extracted data into data frames

Reading XML and JSON Data
06:07

In fixed-width formatted files, columns have fixed widths; if a data element does not use up the entire allotted column width, the element is padded with spaces to make up the specified width. During data analysis, you will also create several R objects. A short R sketch follows this recipe.

  • Download and store the student-fwf.txt file

  • Specify the width

  • Load the data from R files and libraries

Reading Data from Fixed-Width Formatted Files, R Files, and R Libraries
06:39
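
A short, self-contained sketch of this recipe. The tiny sample file, the widths, and the column names below are illustrative assumptions, not the course's exact student-fwf.txt layout:

    # Write a tiny fixed-width sample so the example runs on its own
    writeLines(c("1001Ann  CS  ", "1002Bob  MATH"), "student-fwf.txt")

    # widths gives the number of characters occupied by each column
    students <- read.fwf("student-fwf.txt",
                         widths = c(4, 5, 4),
                         col.names = c("id", "name", "major"),
                         strip.white = TRUE)
    students

    # Saving and reloading R objects from R files
    saveRDS(students, "students.rds"); students <- readRDS("students.rds")
    save(students, file = "students.RData"); load("students.RData")

    # Loading data that ships with an installed R library
    data(mtcars)
    head(mtcars)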

When we have abundant data, we sometimes want to eliminate the cases that have missing values for one or more variables. However, when you disregard cases with any missing values, you lose the useful information that the non-missing values in those cases convey. A sketch follows this recipe.

  • Download the data file

  • Get a data frame that has only the cases with no missing values

  • Read data and replace the missing values

Removing and Replacing Missing Values
06:17
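
A minimal sketch of this recipe using the built-in airquality data set (which already contains NAs) as a stand-in for the course's downloaded file:

    dat <- airquality   # stand-in for the downloaded data file

    # Data frame with only the cases that have no missing values
    complete <- dat[complete.cases(dat), ]   # or equivalently: na.omit(dat)

    # Replace missing values in one column with that column's mean
    dat$Ozone[is.na(dat$Ozone)] <- mean(dat$Ozone, na.rm = TRUE)
    sum(is.na(dat$Ozone))   # now 0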

We sometimes end up with duplicate cases in our datasets and want to retain only one among the duplicates.

  • Create a sample data frame

  • Get unique values

Removing Duplicate Cases
02:03

Variables with higher values tend to dominate distance computations, so you may want to rescale the values to the range [0, 1]. A brief R sketch follows this recipe.

  • Install the scales package

  • Rescale the variable to [0, 1]

  • Rescale variable to [0, 100]


Rescaling a Variable
02:15
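
A brief sketch of this recipe. It assumes the scales package (which provides rescale()) is what the video installs; the income vector is hypothetical:

    library(scales)                  # provides rescale()

    income <- c(12000, 35000, 58000, 91000, 150000)   # hypothetical values

    rescale(income)                  # rescale to [0, 1] (the default)
    rescale(income, to = c(0, 100))  # rescale to [0, 100]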

Variables with higher values tend to dominate distance computations, so you may want to use standardized values instead.

  • Download the BostonHousing.csv data file

  • Use the scale() function

  • Standardize several variables simultaneously

Normalizing or Standardizing Data in a Data Frame
03:04

Sometimes we need to convert numerical data to categorical data or a factor. A short R sketch follows this recipe.

  • Create a vector of break points

  • Create a vector of names for break points

  • Cut the vector using the break points

Binning Numerical Data
03:27
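
A short sketch of this recipe; the values, break points, and labels are illustrative:

    income <- c(12000, 35000, 58000, 91000, 150000)   # hypothetical values

    breaks <- c(0, 25000, 50000, 100000, Inf)          # vector of break points
    labels <- c("Low", "LowMid", "HighMid", "High")    # names for the resulting bins

    income.bracket <- cut(income, breaks = breaks, labels = labels)
    table(income.bracket)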

In situations where we have categorical variables (factors) but need to use them in analytical methods that require numbers, we need to create dummy variables.

  • Read the data-conversion.csv file

  • Create dummies for all factors in the data frame

  • Choose the variable to create dummies for

Creating Dummies for Categorical Variables
03:49

In this video, we summarize the data using the summary function.

  • Read the data

  • Get the summary statistics

Preview 03:28

In this video, we will look at two ways to subset data.

  • Index by position and name

  • Retrieve all data

  • Get mpg and car_name for all cars

Extracting Subset of a Dataset
05:45

Split a dataset to create groups corresponding to each factor level and analyze each group separately.

  • Download the datafile auto-mpg.csv

  • Split cylinders

Splitting a Dataset
01:55

By partitioning the data, we can obtain an unbiased evaluation of our models on held-out cases. A short caret-based sketch follows this recipe.

  • Install the packages

  • Create a numerical target variable and two partitions and three partitions

  • Create a categorical target variable and two partitions and three partitions

Creating Random Data Partitions
07:37
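
A short sketch of this recipe. It assumes the caret package's createDataPartition() is the partitioning function used; the built-in mtcars and iris data sets stand in for the course's files:

    library(caret)   # assumed package; provides createDataPartition()
    set.seed(123)    # make the partitions reproducible

    # Numerical target (mtcars$mpg): two partitions, ~75% training and the rest validation
    idx        <- createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
    training   <- mtcars[idx, ]
    validation <- mtcars[-idx, ]

    # Categorical target (iris$Species): three partitions -- training, validation, test
    idx1  <- createDataPartition(iris$Species, p = 0.6, list = FALSE)
    train <- iris[idx1, ]
    rest  <- iris[-idx1, ]
    idx2  <- createDataPartition(rest$Species, p = 0.5, list = FALSE)
    valid <- rest[idx2, ]
    test  <- rest[-idx2, ]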

Before embarking on any numerical analyses, you may want to get a good feel for the data through a few quick plots, so we cover only the simplest forms of basic graphs.

  • Generate a histogram for acceleration

  • Create a boxplot for mpg

  • Create a scatterplot for mpg

Generating Standard Plots
05:23

We often want to see plots side by side for comparisons. This video shows how we can achieve this.

  • Get old graphical parameter settings

  • Create a grid of one row and two columns

  • Reset par back to its old value so that subsequent plots use the default settings

Generating Multiple Plots
01:49

R can send its output to several different graphics devices to display graphics in different formats. This video deals with selecting the proper graphics device.

  • Create a PostScript file

Selecting a Graphics Device
01:53

The lattice package produces Trellis plots to capture multivariate relationships in the data. Also, ggplot2 graphs are built iteratively, starting with the most basic plot.

  • Load the lattice package

  • Draw a boxplot and a scatter plot

  • For ggplot2, draw an initial plot and add layers

Creating Plots with the lattice and ggplot2 Packages
09:05

In large datasets, we often gain good insights by examining how different segments behave. This video shows how to create graphs that enable such comparisons.

  • Set up a 2 × 2 grid

  • Extract data

  • Plot the histogram

Creating Charts that Facilitate Comparisons
02:43

Visualizing hypothesized causality helps to communicate our ideas clearly.

  • Show a hypothesized causality between the weather situation and the number of rentals

  • Overlay the actual points

Creating Charts that Visualize Possible Causality
01:35

When exploring data, we want to get a feel for the interaction of as many variables as possible. In this video, we will show you how you can bring up to five variables into play.

  • Read the data from file and create factors

  • Create a multivariate plot


Creating Multivariate Plots
02:14

Getting an idea of how the model does on the training data itself is useful, but you should never use that as an objective measure.

  • Create and display a two-way table

  • Display raw numbers as proportions

  • Get row-wise percentages rounded to one decimal place

Preview 04:25

Receiver operating characteristic (ROC) charts help by giving a visual representation of the true and false positive rates at various cutoff levels. A short ROCR-based sketch follows this recipe.

  • Load the package and read data file.

  • Create the prediction and performance objects

  • Plot the chart

Generating ROC Charts
03:47
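
A short sketch of this recipe using the ROCR package (suggested by the "prediction and performance objects" steps above); the labels and scores are synthetic stand-ins for a real model's validation-set output:

    library(ROCR)

    set.seed(1)
    labels <- sample(c(0, 1), 200, replace = TRUE)                             # synthetic true classes
    scores <- ifelse(labels == 1, rnorm(200, 0.7, 0.2), rnorm(200, 0.4, 0.2))  # synthetic scores

    pred <- prediction(scores, labels)
    perf <- performance(pred, measure = "tpr", x.measure = "fpr")

    plot(perf, colorize = TRUE)   # ROC curve, colored by cutoff
    abline(0, 1, lty = 2)         # reference line for a random classifier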

This video shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams. A minimal sketch follows this recipe.

  • Create data partitions

  • Generate a diagram of the tree

  • Generate the error/classification-confusion matrix

Building, Plotting, and Evaluating – Classification Trees
06:07
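
A minimal sketch of this recipe; the built-in iris data set stands in for the course's data file:

    library(rpart)
    library(rpart.plot)

    set.seed(123)
    idx   <- sample(nrow(iris), 0.7 * nrow(iris))   # simple train/validation split
    train <- iris[idx, ]
    valid <- iris[-idx, ]

    fit <- rpart(Species ~ ., data = train, method = "class")
    prp(fit, type = 2, extra = 104)                 # tree diagram via rpart.plot

    pred <- predict(fit, valid, type = "class")
    table(actual = valid$Species, predicted = pred) # error/classification-confusion matrix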

The randomForest package helps you easily apply the very powerful but computationally intensive random forest classification technique. A brief sketch follows this recipe.

  • Load the package and read the data

  • Build the random forest model

  • Build the error matrix

Using Random Forest Models for Classification
04:20
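
A brief sketch of this recipe, again using iris as a stand-in data set:

    library(randomForest)

    set.seed(123)
    fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

    print(fit)        # includes the out-of-bag error estimate
    fit$confusion     # error (confusion) matrix
    importance(fit)   # variable importance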

The e1071 package helps you easily apply the very powerful Support Vector Machine (SVM) classification technique. A brief sketch follows this recipe.

  • Load the package and read the data

  • Convert the outcome variable class to a factor

  • Partition the data and build the model

Classifying Using the Support Vector Machine Approach
05:26
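
A brief sketch of this recipe with e1071; iris stands in for the course's data, and its outcome variable is already a factor:

    library(e1071)

    set.seed(123)
    idx   <- sample(nrow(iris), 0.7 * nrow(iris))
    train <- iris[idx, ]
    valid <- iris[-idx, ]

    fit  <- svm(Species ~ ., data = train)          # radial kernel by default
    pred <- predict(fit, valid)
    table(actual = valid$Species, predicted = pred)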

The e1071 package contains the naiveBayes function for the Naïve Bayes classification.

  • Load the package and read the data

  • Partition the data and build the model

  • Predict for each case of the validation partition

Classifying Using the Naïve Bayes Approach
02:22

The class package contains the knn function for KNN classification. A minimal sketch follows this recipe.

  • Load the package and read the data

  • Partition the data

  • Generate predictions for validation cases with k=1

Classifying Using the KNN Approach
05:02
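
A minimal sketch of this recipe with class::knn(); iris is a stand-in data set:

    library(class)

    set.seed(123)
    idx   <- sample(nrow(iris), 0.7 * nrow(iris))
    train <- iris[idx, ]
    valid <- iris[-idx, ]

    # knn() takes the training predictors, the validation predictors, and the training labels;
    # k = 1 classifies each validation case by its single nearest neighbor
    pred <- knn(train[, 1:4], valid[, 1:4], train$Species, k = 1)
    table(actual = valid$Species, predicted = pred)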

The nnet package contains the nnet function for classification using neural networks.

  • Load the package and read the data

  • Convert the outcome variable class to a factor

  • Partition the data and build the neural network model

Using Neural Networks for Classification
04:18

The MASS package contains the lda function for classification using linear discriminant function analysis.

  • Load the package and read the data

  • Convert the outcome variable class to a factor

  • Partition the data and build the Linear Discriminant Function model

Classifying Using Linear Discriminant Function Analysis
02:48

The stats package contains the glm function for classification using logistic regression.

  • Load the package and read the data

  • Convert the outcome variable class to a factor

  • Partition the data and build the logistic regression model

Classifying Using Logistic Regression
04:01

R has several libraries that implement boosting where we combine many relatively inaccurate models to get a much more accurate model. The ada package provides boosting functionality on top of classification trees.

  • Load the package and read the data

  • Convert the outcome variable class to a factor

  • Generate predictions on the validation partition

Using AdaBoost to Combine Classification Tree Models
03:32

You generally evaluate a model's performance on the training data first, but you should rely on its performance on the hold-out data for an objective measure. A short sketch follows this recipe.

  • Compute the RMS error

  • Plot the results

  • Show the 45 degree line

Computing the Root Mean Squared Error
02:43
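
A short sketch of this recipe; the actual and predicted vectors are synthetic stand-ins for a regression model's hold-out output:

    rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

    set.seed(1)
    actual    <- rnorm(100, mean = 20, sd = 5)           # synthetic actuals
    predicted <- actual + rnorm(100, mean = 0, sd = 2)   # synthetic predictions

    rmse(predicted, actual)

    # Plot predicted vs. actual with the 45-degree reference line
    plot(actual, predicted, xlab = "Actual", ylab = "Predicted")
    abline(0, 1)   # points on this line are perfect predictions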

In this video, we look at the use of the knn.reg function to build the model and then the process of predicting with the model as well. We also show some additional convenience mechanisms to make the process easier.

  • Load the dummies, FNN, scales, and caret packages

  • Generate dummies for the categorical variable

  • Create three partitions and build model for several values of K

Building KNN Models for Regression
08:23

In this video, we will discuss linear regression, arguably the most widely used technique. The stats package has the functionality for linear regression and R loads it automatically at startup.

  • Create partitions

  • Build the linear regression model

  • View the results

Performing Linear Regression
07:17

The MASS package has the functionality for variable selection and this recipe illustrates its use.

  • Load the package and read the data

  • Build the linear regression model

  • Run the variable selection procedure


Performing Variable Selection in Linear Regression
02:22

This video covers the use of tree models for regression. The rpart package provides the necessary functions to build regression trees.

  • Partition the data

  • Build and view the regression tree model

  • Plot the tree and prune the tree with the chosen cp value

Building Regression Trees
07:55

This video looks at random forests—one of the most successful machine learning techniques.

  • Build the random forest model

  • Examine variable importance

  • Compare predicted and actual values for the training partition

Building Random Forest Models for Regression
05:07

The nnet package contains functionality to build neural network models for classification as well as prediction. In this recipe, we cover the steps to build a neural network regression model using nnet.

  • Find the range of the response variable

  • Build the model

  • Plot the network and compute the RMS error on the training data

Using Neural Networks for Regression
03:35

The R implementation of some techniques, such as classification and regression trees, performs cross-validation out of the box to aid model selection and avoid overfitting. A caret-based sketch of explicit cross-validation follows this recipe.

  • Read the data

  • Show line numbers for discussion

  • Run k-fold cross-validation with k=5

  • Run leave-one-out cross-validation

Performing k-Fold Cross-Validation and Leave-One-Out Cross-Validation
05:07
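
A sketch of explicit cross-validation. The video leans on rpart's built-in cross-validation, so the caret calls below are an assumption; mtcars stands in for the course's data:

    library(caret)

    set.seed(123)

    # k-fold cross-validation with k = 5
    ctrl.cv <- trainControl(method = "cv", number = 5)
    fit.cv  <- train(mpg ~ ., data = mtcars, method = "rpart", trControl = ctrl.cv)

    # Leave-one-out cross-validation
    ctrl.loo <- trainControl(method = "LOOCV")
    fit.loo  <- train(mpg ~ ., data = mtcars, method = "rpart", trControl = ctrl.loo)

    fit.cv$results
    fit.loo$results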

The standard R package stats provides the kmeans function for K-means clustering. We also use the cluster package to plot the results of our cluster analysis. A brief sketch follows this recipe.

  • Define a convenience function to standardize the relevant variables

  • Use the convenience function to standardize the variables of interest

  • Perform K-means clustering for a given value of K


Performing Cluster Analysis Using K-Means Clustering
06:48
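
A brief sketch of this recipe; iris stands in for the course's data, and K = 3 is an illustrative choice:

    library(cluster)   # for clusplot()

    vars <- scale(iris[, 1:4])   # standardize the variables of interest

    set.seed(123)
    km <- kmeans(vars, centers = 3, nstart = 25)   # K-means with K = 3

    km$size      # cluster sizes
    km$centers   # cluster centers (in standardized units)

    # Plot the clusters against the first two principal components
    clusplot(vars, km$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)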

The hclust function in the stats package helps us perform hierarchical clustering. A short sketch follows this recipe.

  • Define a convenience function to standardize the relevant variables

  • Use the convenience function to standardize the variables of interest

  • Compute the distance matrix to provide as input to the hclust function

Performing Cluster Analysis Using Hierarchical Clustering
04:00
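
A short sketch of this recipe, with iris as a stand-in data set and Ward linkage as an illustrative choice:

    vars <- scale(iris[, 1:4])      # standardize the variables of interest

    d  <- dist(vars)                # distance matrix passed to hclust
    hc <- hclust(d, method = "ward.D2")

    plot(hc, labels = FALSE)        # dendrogram
    clusters <- cutree(hc, k = 3)   # cut the tree into three clusters
    table(clusters, iris$Species)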

The stats package offers the prcomp function to perform PCA; this recipe shows you how to use it, and a short sketch follows.

  • View the correlation matrix to check whether some variables are highly correlated

  • Examine the rotations for the principal components generated

  • Visualize the importance of the components through a scree plot or a barplot

Reducing Dimensionality with Principal Component Analysis
04:37
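
A short sketch of this recipe; mtcars stands in for the course's data:

    round(cor(mtcars), 2)     # check whether some variables are highly correlated

    pca <- prcomp(mtcars, scale. = TRUE)    # standardize, then compute components
    pca$rotation              # rotations (loadings) of the principal components
    summary(pca)              # proportion of variance explained

    screeplot(pca, type = "lines")          # scree plot
    barplot(summary(pca)$importance[2, ])   # variance explained per component
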
+ Machine Learning using Advanced Algorithms and Visualization in R
19 lectures 01:15:17

This video provides an overview of the entire course.


Preview 01:42

The goal of this video is to explain the random forest algorithm.

  • Open RStudio

  • Create a model with randomForest()

  • Inspect the model with getTree()


Random Forest Overview
07:22

In this video, we will do an exploratory analysis of the vote92 data set.

  • Examine the data frame

  • Plot votes faceted by features

  • Graph continuous variables as bucketed categories

Exploring the Vote92 Data Set
06:57

In this video, we will use a randomForest() model.

  • Load our data using the get_vote_df() function

  • Create a randomForest() model

  • Examine the results in a confusion matrix


Using a Random Forest Model
02:59

In this video, we will examine the model results more closely.

  • Create a single feature random forest and examine one of the decision trees

  • Create a more complicated random forest and examine the decision tree

  • Examine the feature importance (information gain) results from the more complicated model

Examining the model
03:16

In this video, we will examine the test set predictions versus actuals in terms of election results.

  • Create our final version of the model

  • Use that model on the test set

  • Compare the election results and discuss the implications


New Model and Final Results
02:22

The goal of this video is to understand Support Vector Machines and perform exploratory data analysis (EDA).

  • Discuss Support Vector Machines

  • Explore the MNIST digit data set

  • Look at the code to load the data set

SVM Overview and EDA
05:08

In this video, we will create a support vector machine model.

  • Load the data and use PCA to reduce dimensions

  • Model on the new data

  • View the confusion matrix and best tuning parameters

Building an SVM Model
02:41

In this video, we will examine the results and create a support vector machine model with advanced parameters.

  • Load the data and use PCA to reduce dimensions

  • Model on the new data

  • View the confusion matrix and best tuning parameters

Examining the Results and Model
01:55

In this video, we will do a more advanced plot of the confusion matrix.

  • Create a model with results

  • Create a confusion matrix

  • Plot the confusion matrix with ggplot2


Visualizing a Confusion Matrix
02:17

The goal of this video is to examine the Satellite data.

  • Do some k-means clustering

  • Make a table of cluster versus class

  • Discuss the results

Overview of Satellite Data
04:39

In this video, we will explore k-nearest neighbors in R.

  • Show a simple k-means model

  • Examine the model

  • Discuss the results

Overview of K-Nearest Neighbor
04:58

In this video, we will create a k-nearest neighbor model.

  • Build a k-nearest neighbor model

  • Examine the model


Using KNN
01:35

In this video, we will do a more advanced plot of the records.

  • Tune the number of votes (K)

  • Plot the results


Visualizing KNN Results
02:31

The goal of this video is to examine the movie data.

  • Explore features

  • Discuss challenges

  • Identify a plan


Overview of Movie Review Data
02:42

The goal of this video is to understand how documents are represented by vectors, such as count vectors; a small tm-based sketch follows below.

  • Use tm to create a corpus

  • Vectorize the documents

  • Examine the results

Overview of Document Vectors
06:55
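
A small sketch of the document-vector idea using the tm package named above; the three toy documents stand in for the movie review data:

    library(tm)

    docs <- c("a fun family movie", "a dark thriller movie", "family fun for everyone")

    corpus <- Corpus(VectorSource(docs))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Document-term matrix: one row per document, one column per term (count vectors)
    dtm <- DocumentTermMatrix(corpus)
    inspect(dtm)
    as.matrix(dtm)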

In this video, we will classify the documents we vectorized.

  • Discuss topic modeling

  • Do NMF matrix reduction on the documents

  • Classify on the output of the NMF reduction

Classifying Document Matrices
07:30

In this video, we will cluster our documents and compare the clusters with the classification.

  • Use k-means clustering on our documents

  • Use a table to compare our clusters with classes

  • Plot the clusters and centroids


Clustering Documents
02:38

In this video, we will rank similar documents as you would in a search.

  • Again, examine our topic modeling scores

  • Use those with cosine similarity

  • Look at the top-n similar documents

Similar Documents
05:10
Test your knowledge
4 questions
Requirements
  • Basic R programming knowledge, familiarity with data frames, and some basic knowledge of statistics are assumed.
Description

Data analysis has recently emerged as a very important focus for a huge range of organizations and businesses. Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. R makes detailed data analysis easier, making advanced data exploration and insight accessible to anyone interested in learning it. The R language is widely used among statisticians and data miners to develop statistical software and data analysis.

This comprehensive 2-in-1 course follows a recipe-based approach to exploring advanced algorithm and visualization concepts to get the most out of your data through real-world examples. To begin with, you'll apply data analysis techniques and learn to handle missing values and duplicates. You'll also learn to apply classification and regression techniques. Moving further, you'll work with advanced algorithms and techniques to enable efficient Machine Learning using the R programming language. Finally, you'll work with a variety of real-world algorithms such as decision trees and support vector machines.

Towards the end of this course, you'll explore advanced algorithm and visualization concepts to get the most out of your data through real-world examples.

Contents and Overview

This training program includes 2 complete courses, carefully chosen to give you the most comprehensive training possible.

The first course, R Data Analysis Solutions - Machine Learning Techniques, covers analysis techniques to get the most out of your data. This course empowers you by showing you ways to use R to generate professional analysis reports. It provides examples of various important analysis and machine-learning tasks that you can try out with associated and readily available data. You will learn to carry out different tasks on the data to put it to use. By the end of this course, you will be able to apply a range of analysis techniques, perform classification and regression, and reduce the dimensionality of your data.

The second course, Machine Learning using Advanced Algorithms and Visualization in R, covers advanced algorithms and additional visualization techniques. In this course, you will work through various examples of advanced algorithms and focus a bit more on some visualization options. We'll start by showing you how to use a random forest to predict what type of insurance a patient has based on their treatment, and you will get an overview of how to use random forests/decision trees and examine the model. Then, we'll walk you through the next example on letter recognition, where you will train a program to recognize letters using a Support Vector Machine, examine the results, and plot a confusion matrix. After that, you will look into the next example on soil classification from satellite data using K-Nearest Neighbors, where you will predict what neighborhood a house is in based on other data about it. Finally, you'll dive into the last example of predicting a movie genre based on its title, where you will use the tm package and learn some techniques for working with text data.


About the Authors

  • Viswa Viswanathan is an associate professor of Computing and Decision Sciences at the Stillman School of Business, Seton Hall University. After completing his Ph.D. in Artificial Intelligence, Viswa spent a decade in academia and then switched to a leadership position in the software industry for a decade. During this period, he worked for Infosys, Igate, and Starbase. He embraced academia once again in 2001. Viswa has taught extensively in diverse fields, including operations research, computer science, software engineering, management information systems, and enterprise systems. In addition to teaching at the university, Viswa has conducted training programs for industry professionals. He has written several peer-reviewed research publications in journals such as Operations Research, IEEE Software, Computers and Industrial Engineering, and the International Journal of Artificial Intelligence in Education. He has authored a book entitled Data Analytics with R: A Hands-on Approach.


  • Shanthi Viswanathan is an experienced technologist who has delivered technology management and enterprise architecture consultations to many enterprise customers. She has worked for Infosys Technologies, Oracle Corporation, and Accenture. As a consultant, Shanthi has helped several large organizations, such as Canon, Cisco, Celgene, Amway, Time Warner Cable, and GE, among others, in areas such as data architecture and analytics, master data management, service-oriented architecture, business process management, and modeling. When she is not in front of her Mac, Shanthi spends time hiking in the suburbs of NY/NJ, working in the garden, and teaching yoga. Shanthi would like to thank her husband, Viswa, for all the great discussions on numerous topics during their hikes together and for exposing her to R and Java. She would also like to thank her sons, Nitin and Siddarth, for getting her into the data analytics world.


  • Tim Hoolihan currently works at DialogTech, a marketing analytics company focused on conversations, where he is the Senior Director of Data Science. Prior to that, he was CTO at Level Seven, a regional consulting company in the US Midwest. He is the organizer of the Cleveland R User Group. In his job, he uses deep neural networks to help automate a lot of conversation classification problems. In addition, he works on some side projects researching other areas of Artificial Intelligence and Machine Learning. Outside data science, he is interested in mathematical computation in general; he is a lifelong math learner and really enjoys applying it wherever he can. Recently, he has been spending time on financial analysis and game development. He also knows a variety of languages: R, Python, Ruby, PHP, C/C++, and so on. Previously, he worked in web application and mobile development.

Who this course is for:
  • This course is perfect for:
  • Data scientists and professional developers who want to learn analytical techniques from scratch and understand how the R programming environment and packages can be used to develop Machine Learning systems.