R: Complete Machine Learning Solutions
- 8 hours on-demand video
- 2 articles
- 13 downloadable resources
- Full lifetime access
- Access on mobile and TV
- Certificate of Completion
- Create and inspect the transaction dataset and perform association analysis with the Apriori algorithm
- Predict possible churn users with the classification approach
- Implement the clustering method to segment customer data
- Compress images with the dimension reduction method
- Build a product recommendation system
RStudio makes the process of development with R easier.
- Download RStudio
- Install RStudio
- Downloading and Installing RStudio
In R, nominal, ordinal, interval, and ratio variables are treated differently in statistical modeling, so we have to convert a nominal variable from a character into a factor.
- Display the structure of the data using str
- Find the attribute name, data type, and values contained in each attribute
- Use the factor function to transform data from character to factor
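As a minimal sketch of these steps, assuming a small made-up data frame (the `passengers` object and its column names are hypothetical and not taken from the course material):

```r
# Hypothetical toy data; names and values are illustrative only.
passengers <- data.frame(
  Sex = c("male", "female", "female", "male"),
  Survived = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Display the structure: Sex is stored as character, Survived as numeric.
str(passengers)

# Convert the nominal variables into factors.
passengers$Sex <- factor(passengers$Sex)
passengers$Survived <- factor(passengers$Survived)

# Sex and Survived are now factors with two levels each.
str(passengers)
levels(passengers$Sex)
```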
The exploratory analysis helps users gain insights into how single or multiple variables may affect the survival rate. However, it does not determine what combinations may generate a prediction model. We need to use a decision tree for that.
- Construct a data split function
- Split data according to the need
- Generate the prediction model and plot the tree
Univariate statistics deals with a single variable and hence is very simple.
- Load data into a data frame. Compute the length of the variable.
- Obtain mean, median, standard deviation and variance.
- Obtain IQR, quantile, maxima, minima, and so on. Plot a histogram.
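The steps above can be sketched with base R alone. The choice of the built-in `mtcars` dataset and its `mpg` column is an assumption made for illustration; the course may use different data.

```r
# Load a built-in dataset into a data frame (illustrative choice).
data(mtcars)
mpg <- mtcars$mpg

# Length of the variable.
n <- length(mpg)

# Central tendency and spread.
mpg_mean <- mean(mpg)
mpg_median <- median(mpg)
mpg_sd <- sd(mpg)
mpg_var <- var(mpg)

# Interquartile range, quantiles, maximum, and minimum.
mpg_iqr <- IQR(mpg)
mpg_quantiles <- quantile(mpg)
mpg_range <- c(min = min(mpg), max = max(mpg))

# Plot a histogram of the variable.
hist(mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
```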
Comparing a sample with a reference probability distribution or comparing the cumulative distributions of two data sets calls for a Kolmogorov-Smirnov test.
- Check a normal distribution with a one-sample Kolmogorov-Smirnov test.
- Generate uniformly distributed sample data.
- Plot the ecdf of two samples. Apply a two-sample Kolmogorov-Smirnov test.
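A rough sketch of both tests using `ks.test` from base R. The sample sizes, the seed, and the variable names are assumptions chosen for illustration.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# One-sample test: compare a sample against the standard normal distribution.
x <- rnorm(50)
one_sample <- ks.test(x, "pnorm")

# Generate uniformly distributed sample data.
y <- runif(50)

# Plot the empirical cumulative distribution functions of both samples.
plot(ecdf(x), main = "ECDF of x and y")
plot(ecdf(y), add = TRUE, col = "red")

# Two-sample test: compare the distributions of x and y.
two_sample <- ks.test(x, y)

one_sample
two_sample
```

A small p-value in the two-sample test suggests the two samples do not come from the same distribution.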
To examine the relation between categorical independent variables and a continuous dependent variable, ANOVA is used. When there is a single categorical variable, one-way ANOVA is used.
- Visualize the data with a boxplot
- Conduct a one-way ANOVA and perform ANOVA analysis
- Plot the differences in mean level.
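One possible way to carry out these steps in base R, assuming the built-in `mtcars` data with `gear` as the categorical variable and `mpg` as the continuous one (an illustrative choice, not confirmed by the course description):

```r
# Work on a copy and treat the number of gears as a categorical variable.
cars <- mtcars
cars$gear <- factor(cars$gear)

# Visualize the data with a boxplot.
boxplot(mpg ~ gear, data = cars, xlab = "gear", ylab = "mpg")

# Conduct a one-way ANOVA.
fit <- aov(mpg ~ gear, data = cars)
summary(fit)

# Pairwise differences in mean level (Tukey's honest significant differences).
tukey <- TukeyHSD(fit)
plot(tukey)
```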
It would be really convenient for us if we could predict unknown values. You can do that using linear regression.
- Build a linear fitted model
- Compute the prediction result using confidence interval
- Compute the prediction result using prediction interval
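The difference between the two intervals can be sketched as follows. The model `mpg ~ wt` on the built-in `mtcars` data and the new observation `wt = 3` are assumptions made for illustration.

```r
# Build a linear fitted model (illustrative variables).
fit <- lm(mpg ~ wt, data = mtcars)

# A hypothetical new observation to predict for.
new_car <- data.frame(wt = 3)

# Confidence interval: uncertainty about the mean response at wt = 3.
conf <- predict(fit, new_car, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new response at wt = 3.
pred <- predict(fit, new_car, interval = "prediction", level = 0.95)

conf
pred
```

Both calls return the same fitted value, but the prediction interval is wider because it also accounts for the residual variance of an individual observation.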
GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
- Input the independent and dependent variables
- Fit variables to a model
- Compare the fitted models with ANOVA function
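A minimal sketch of fitting and comparing two generalized linear models. The Poisson family and the built-in `warpbreaks` dataset are assumptions chosen for illustration; the course may use a different family and data.

```r
# Dependent variable: breaks (a count). Independent variables: wool, tension.
data(warpbreaks)

# Fit two nested models with a Poisson family (log link by default).
fit1 <- glm(breaks ~ tension, data = warpbreaks, family = poisson)
fit2 <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Compare the fitted models with an analysis of deviance.
cmp <- anova(fit1, fit2, test = "Chisq")
cmp
```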
A classification tree can contain branches that are not essential for classification. To remove these branches, we have to prune the tree.
- Locate the record with minimum cross validation errors
- Extract the CP of the record and assign the value to churn
- Prune the classification tree
Like the prediction performance of a traditional classification tree, we can also evaluate the performance of a conditional inference tree.
- Predict the category of the testing dataset
- Generate a classification table
- Determine the performance measurements
The k-nearest neighbor classifier is a non-parametric, lazy learning method. It therefore makes no assumption about the data distribution and needs no explicit training phase.
- Build a classification model
- Generate a classification table. Generate a confusion matrix from it
- Examine the sensitivity and specificity
Classification in logistic regression is based on one or more features. It is more robust and doesn’t have as many conditions as the traditional classification model.
- Generate a logistic regression model. Generate the model’s summary
- Predict the categorical dependent variable of the testing dataset
- Generate the confusion matrix
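These steps might look as follows in base R. The course works with a churn dataset; here the built-in `mtcars` data (predicting the binary `am` column from `wt`), the split size, and the 0.5 threshold are stand-in assumptions so that the sketch runs without extra packages.

```r
set.seed(2)  # arbitrary seed for a reproducible split

# Split the data into training and testing datasets (illustrative sizes).
idx <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

# Generate a logistic regression model and its summary.
fit <- glm(am ~ wt, data = train, family = binomial)
summary(fit)

# Predict the categorical dependent variable of the testing dataset.
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

# Generate the confusion matrix as a simple classification table.
tb <- table(actual = factor(test$am, levels = c(0, 1)), predicted = pred)
tb
```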
The Naïve Bayes classifier is based on applying Bayes’ theorem with a strong independence assumption.
- Specify the variables as first input parameters and churn label as the second input parameter in the function call
- Assign the classification model to the classifier variable
- Use a confusion matrix to calculate the performance measurement
Support vector machines are well suited to classification because they can capture complex relations between data points and provide both linear and non-linear classifications.
- Train a support vector machine
- Use different functions and arguments as desired for the output
- Obtain a summary of the built support vector machine
A neural network is used in classification, clustering and prediction. Its efficiency depends on how well you train it. Let’s learn to do that.
- Split the dataset into training and testing datasets.
- Add the required columns. Train the network model.
- Configure the hidden neurons. Examine the information of the neural network model.
Similar to other classification models, we can predict labels using neural networks and also validate performance using a confusion matrix.
- Create an output probability matrix. Convert the probability matrix to class labels.
- Generate a classification matrix based on the labels obtained.
- Employ a confusion matrix to measure the prediction performance of the built neural network.
As we have already trained the neural network using nnet, we can use the model to predict labels.
- Generate the predicted labels based on a testing dataset.
- Generate a classification table based on predicted labels.
- Employ a confusion matrix to measure the prediction performance of the trained neural network.
The k-fold cross-validation technique is commonly used to estimate the performance of a classifier because it reduces the risk of an over-optimistic estimate caused by over-fitting. In this video we will illustrate how to perform a k-fold cross-validation:
- Generate an index with 10 folds with the cut function
- Use a for loop to perform a 10-fold cross-validation
- Generate average accuracies with the mean function
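The three steps can be sketched in base R. The data (two species from the built-in `iris` dataset), the logistic regression classifier, and all variable names are assumptions made for illustration; the course applies the same idea to its own classifier and dataset.

```r
set.seed(42)  # arbitrary seed for a reproducible shuffle

# Illustrative binary problem: versicolor vs. virginica.
dat <- subset(iris, Species != "setosa")
dat$is_virginica <- as.integer(dat$Species == "virginica")
dat <- dat[sample(nrow(dat)), ]  # shuffle the rows

# Generate an index with 10 folds with the cut function.
folds <- cut(seq_len(nrow(dat)), breaks = 10, labels = FALSE)

# Use a for loop to perform a 10-fold cross-validation.
accuracies <- numeric(10)
for (i in 1:10) {
  test_idx <- which(folds == i)
  train <- dat[-test_idx, ]
  test <- dat[test_idx, ]
  fit <- glm(is_virginica ~ Petal.Length + Petal.Width,
             data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  accuracies[i] <- mean((prob > 0.5) == (test$is_virginica == 1))
}

# Generate the average accuracy with the mean function.
mean(accuracies)
```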
In this video, we will illustrate how to use tune.svm to perform 10-fold cross-validation and obtain the optimum classification model.
- Apply tune.svm to the training dataset
- Obtain the summary information of the model
- Access the performance details of the tuned model
- Generate a classification table
To measure the performance of a regression model, we can calculate the distance between the predicted output and the actual output as a quantifier of the performance of the model. In this video we will illustrate how to compute these measurements from a built regression model.
- Load the dataset
- Calculate the root mean square error, relative square error and R-Square value
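A short sketch of the three measurements, assuming a simple linear model on the built-in `mtcars` data (the dataset and model are illustrative stand-ins):

```r
# Load the dataset and build a regression model (illustrative choice).
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

actual <- mtcars$mpg
predicted <- predict(fit, mtcars)

# Root mean square error: typical size of the prediction error.
rmse <- sqrt(mean((predicted - actual)^2))

# Relative square error: squared error relative to always predicting the mean.
rse <- sum((predicted - actual)^2) / sum((actual - mean(actual))^2)

# R-Square: share of the variance explained by the model.
rsquare <- 1 - rse

c(rmse = rmse, rse = rse, rsquare = rsquare)
```

For an ordinary least squares fit with an intercept, evaluated on its own training data, this `rsquare` agrees with the value reported by `summary(fit)$r.squared`.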
In this video we will see how to measure performance differences between fitted models with the caret package.
- Resample the three generated models and obtain their summary
- Plot the re-sampling result in the ROC metric or box-whisker plot
The adabag package implements both boosting and bagging methods. For the bagging method, the package first generates multiple versions of classifiers, and then obtains an aggregated classifier. Let’s learn the bagging method from adabag to generate a classification model.
- Install the adabag package and use the bagging function
- Generate the classification model
- Obtain a classification table and average error
To assess the prediction power of a classifier, you can run a cross validation method to test the robustness of the classification model. This video will show how to use bagging.cv to perform cross validation with the bagging method.
- Use bagging.cv to perform cross-validation
- Obtain the confusion matrix
- Retrieve the minimum estimation error
Boosting starts with a simple or weak classifier and gradually improves it by reweighting the misclassified samples. Thus, the new classifier can learn from previous classifiers. One can use the boosting method to perform ensemble learning. Let’s see how to use the boosting method to classify the telecom churn dataset.
- Use the boosting function from the adabag package
- Make a prediction based on the boosted model and testing dataset
- Retrieve the classification table and obtain average errors
Similar to the bagging function, adabag provides a cross validation function for the boosting method, named boosting.cv. In this video, we will learn how to perform cross-validation using boosting.cv.
- Use boosting.cv to cross-validate the training dataset
- Obtain the confusion matrix
- Retrieve the average errors
Gradient boosting creates a new base learner that maximally correlates with the negative gradient of the loss function. One may apply this method on either regression or classification problems. But first, we need to learn how to use gbm.
- Install the gbm package and use the gbm function to train a training dataset
- Use cross-validation and plot the ROC curve
- Use the coords function and obtain a classification table from the predicted results
A margin is a measure of certainty of a classification. It calculates the difference between the support of a correct class and the maximum support of an incorrect class. This video will show us how to calculate the margins of the generated classifiers.
- Use the margins function
- Use the plot function to plot a marginal cumulative distribution graph
- Compute the percentage of negative margin
The adabag package provides the errorevol function for a user to estimate the ensemble method errors in accordance with the number of iterations. Let’s explore how to use errorevol to show the evolution of errors of each ensemble classifier.
- Use the errorevol function for error evolution of boosting classifiers
- Use the errorevol function for error evolution of bagging classifiers
Random forest grows multiple decision trees which will output their own prediction results. The forest will use the voting mechanism to select the most voted class as the prediction result. In this video, we illustrate how to classify data using the randomForest package.
- Install and load the randomForest package
- Plot the mean square error of the forest object
- Use the varImpPlot function, the margin function, hist, and boxplot
At the beginning of this section, we discussed why we use ensemble learning and how it can improve the prediction performance. Let’s now validate whether the ensemble model performs better than a single decision tree by comparing the performance of each method.
- Estimate the error rate of the bagging model
- Estimate the error rate of the boosting method
- Estimate the error rate of the random forest model
- Use churn.predict and estimate the error rate of single decision tree
Hierarchical clustering adopts either an agglomerative or a divisive method to build a hierarchy of clusters. This video shows us how to cluster data with the help of hierarchical clustering.
- Load the data and save it
- Examine the dataset structure
- Use agglomerative hierarchical clustering to cluster data
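Agglomerative clustering is available in base R through `hclust`. The sketch below assumes three scaled columns of the built-in `mtcars` data, Ward linkage, and four clusters; all of these are illustrative choices rather than the course's actual customer data.

```r
# Load the data and keep a scaled copy of a few numeric columns.
dat <- scale(mtcars[, c("mpg", "hp", "wt")])

# Examine the dataset structure.
str(dat)

# Agglomerative hierarchical clustering on Euclidean distances.
hc <- hclust(dist(dat, method = "euclidean"), method = "ward.D2")

# Plot the dendrogram and cut it into four clusters.
plot(hc, hang = -1, cex = 0.7)
groups <- cutree(hc, k = 4)
table(groups)
```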
Before mining association rules, you need to transform the data into transactions. This video will show how to transform a list, matrix, or data frame into transactions.
- Install and load the arules package
- Use the as function
- Transform the matrix-format data and data-frame-format dataset into transactions
The arules package uses its own transactions class to store transaction data. As such, we must use the generic functions provided by arules to display transactions and association rules. Let’s see how to display transactions and association rules via various functions in the arules package.
- Obtain a LIST representation and use the summary function
- Use the inspect function and filter transactions by size
- Use the image function and itemFrequencyPlot
Association mining is a technique that can discover interesting relationships hidden in transaction datasets. This approach first finds all frequent itemsets and then generates strong association rules from frequent itemsets. In this video, we see how to perform association analysis using the Apriori algorithm.
- Load the Groceries dataset and examine the summary
- Use itemFrequencyPlot and apriori
- Inspect the first few rules
Besides listing rules as text, you can visualize association rules, making it easier to find the relationship between itemsets. In this video, we will learn how to use the arulesViz package to visualize the association rules.
- Install and load the arulesViz package
- Make a scatter plot from the pruned rules and add jitter to it
- Plot soda_rule in a graph plot and a balloon plot
The Apriori algorithm performs a breadth-first search to scan the database, so support counting becomes time consuming. Alternatively, if the database fits into the memory, you can use the Eclat algorithm, which performs a depth-first search to count the supports. Let’s see how to use the Eclat algorithm.
- Use the eclat function to generate a frequent itemset
- Obtain the summary information
- Examine the top ten support frequent itemsets
In addition to mining interesting associations within the transaction database, we can mine interesting sequential patterns using transactions with temporal information. This video demonstrates how to create transactions with temporal information.
- Install and load the arulesSequences package
- Turn the list into transactions and use the inspect function
- Obtain summary information and read transaction data in basket format
In contrast to association mining, we should explore patterns shared among transactions where a set of itemsets occurs sequentially. One of the most famous frequent sequential pattern mining algorithms is the Sequential Pattern Discovery using Equivalence classes (SPADE) algorithm. Let’s see how to use SPADE to mine frequent sequential patterns.
- Use the cspade function to generate frequent sequential patterns
- Examine the summary of the frequent sequential patterns
- Transform a generated sequence format data back to the data frame
- No prior knowledge of R is required
Are you interested in understanding machine learning concepts and building real-time projects with R, but don’t know where to start? Then, this is the perfect course for you!
The aim of machine learning is to uncover hidden patterns and unknown correlations, and to find useful information from data. In addition to this, through incorporation with data analysis, machine learning can be used to perform predictive analysis. With machine learning, the analysis of business operations and processes is not limited to human scale thinking; machine scale analysis enables businesses to capture hidden values in big data.
Machine learning has similarities to the human reasoning process. Unlike traditional analysis, where the generated model cannot evolve as data is accumulated, machine learning can learn from the data that is processed and analyzed. In other words, the more data that is processed, the more it can learn.
R, as a dialect of GNU-S, is a powerful statistical language that can be used to manipulate and analyze data. Additionally, R provides many machine learning packages and visualization functions, which enable users to analyze data on the fly. Most importantly, R is open source and free.
Using R greatly simplifies machine learning. All you need to know is how each algorithm can solve your problem, and then you can simply use a written package to quickly generate prediction models on data with a few command lines.
By taking this course, you will gain a detailed and practical knowledge of R and machine learning concepts to build complex machine learning models.
What details do you cover in this course?
We start off with basic R operations, reading data into R, manipulating data, and forming simple statistics for visualizing data. We will then walk through the processes of transforming, analyzing, and visualizing the RMS Titanic data. You will also learn how to perform descriptive statistics.
This course will teach you to use regression models. We will then see how to fit data to tree-based classifiers, the Naive Bayes classifier, and so on.
We then move on to introducing powerful classification methods such as neural networks and support vector machines. During this journey, we will introduce the power of ensemble learners to produce better classification and regression results.
We will see how to apply the clustering technique to segment customers and further compare differences between each clustering method.
We will discover associated items and uncover frequent patterns in transaction data.
We will go through the process of compressing and restoring images, using the dimension reduction approach and R Hadoop, starting from setting up the environment to actual big data processing and machine learning on big data.
By the end of this course, we will build our own project in the e-commerce domain.
This course will take you from the very basics of R to creating insightful machine learning models with R.
We have combined the best of the following Packt products:
- R Machine Learning Solutions by Yu-Wei, Chiu (David Chiu)
- Machine Learning with R Cookbook by Yu-Wei, Chiu (David Chiu)
- R Machine Learning By Example by Raghav Bali and Dipanjan Sarkar
The source content has been received well by the audience. Here is one of the reviews:
"good product, I enjoyed it"
- Ertugrul Bayindir
Meet your expert instructors:
Yu-Wei, Chiu (David Chiu) is the founder of LargitData, a startup company that mainly focuses on providing big data and machine learning products. He has previously worked for Trend Micro as a software engineer, where he was responsible for building big data platforms for business intelligence and customer relationship management systems.
Dipanjan Sarkar is an IT engineer at Intel, the world's largest silicon company, where he works on analytics, business intelligence, and application development. His areas of specialization include software engineering, data science, machine learning, and text analytics.
Raghav Bali has a master's degree (gold medalist) in IT from the International Institute of Information Technology, Bangalore. He is an IT engineer at Intel, the world's largest silicon company, where he works on analytics, business intelligence, and application development.
Meet your managing editor:
This course has been planned and designed for you by me, Tanmayee Patil. I'm here to help you be successful every step of the way, and get maximum value out of your course purchase. If you have any questions along the way, you can reach out to me and our author group via the instructor contact feature on Udemy.
- If you are interested in understanding machine learning concepts and building real-time projects with R, then this is the perfect course for you!