R is both data analysis software and a programming language. Data scientists, statisticians, and analysts use R for statistical analysis, data visualization, and predictive modeling. R is open source and integrates with other applications and systems. Compared to other data analysis platforms, R has an extensive set of data products. Its strong data visualization capabilities also help in spotting and resolving problems in data.
The first section in this course deals with how to create R functions to avoid the unnecessary duplication of code. You will learn how to prepare, process, and perform sophisticated ETL for heterogeneous data sources with R packages. An example of data manipulation is provided, illustrating how to use the ‘dplyr’ and ‘data.table’ packages to efficiently process larger data structures. We also focus on ‘ggplot2’ and show you how to create advanced figures for data exploration.
In addition, you will learn how to build an interactive report using the ‘ggvis’ package. Later sections offer insight into time series analysis and provide detailed information on the hot topic of machine learning, including data classification, regression, clustering, association rule mining, and dimension reduction.
By the end of this course, you will understand how to resolve issues and will be able to comfortably offer solutions to problems encountered while performing data analysis.
About The Author
Yu-Wei, Chiu (David Chiu) is the founder of LargitData, a startup company that mainly focuses on providing big data and machine learning products. He has previously worked for Trend Micro as a software engineer, where he was responsible for building big data platforms for business intelligence and customer relationship management systems. In addition to being a start-up entrepreneur and data scientist, he specializes in using Spark and Hadoop to process big data and apply data mining techniques for data analysis. Yu-Wei is also a professional lecturer and has delivered lectures on big data and machine learning in R and Python, and given tech talks at a variety of conferences.
In 2015, Yu-Wei wrote Machine Learning with R Cookbook, Packt Publishing. In 2013, Yu-Wei reviewed Bioinformatics with R Cookbook, Packt Publishing.
R ships with many functions, and a user can also define a function for a specific purpose. Once a user starts creating functions, it becomes important to learn how to pass arguments. Let’s explore how to create an R function and pass arguments to it.
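As a minimal sketch of defining a function and passing arguments, consider the following. The function name addnum and its default value are made up for illustration and are not taken from the course.

```r
# Hypothetical example function: adds two numbers; b defaults to 2.
addnum <- function(a, b = 2) {
  a + b
}

r1 <- addnum(3)              # b falls back to its default value
r2 <- addnum(3, 5)           # arguments matched by position
r3 <- addnum(b = 10, a = 1)  # arguments matched by name, in any order
```

Naming arguments at the call site makes the call independent of the order in which the parameters were declared.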
R stores and manages variables using environments. Each function is bound to the environment in which it was created, and every call to a function activates a new environment of its own. Let’s see how the environment of each function works.
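One way to observe function environments is sketched below. The helper get_call_env is a hypothetical name used only for this illustration.

```r
# Hypothetical helper: returns the environment created for its own call.
get_call_env <- function() {
  x <- 1         # x lives in the environment of this particular call
  environment()  # return that environment to the caller
}

e1 <- get_call_env()
e2 <- get_call_env()

same_env   <- identical(e1, e2)         # each call gets a fresh environment
x_in_call  <- get("x", envir = e1)      # look up x inside the first call's environment
defined_in <- environment(get_call_env) # the environment where the function was created
```

The parent of each call environment is the environment in which the function was defined, which is what makes lexical scoping work.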
Lexical scoping determines how a value binds to a free variable in a function. This is a key feature that originated in the Scheme functional programming language. This video will show us how lexical scoping works in R.
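A small sketch can make the lookup rule concrete. The names f, g, and x below are illustrative only.

```r
x <- 10

# f has a free variable x; R looks it up where f was defined, not where f is called.
f <- function() x

g <- function() {
  x <- 20  # local to g; it does not affect the lookup inside f
  f()
}

result <- g()  # 10 under lexical scoping (dynamic scoping would have given 20)
```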
In previous videos, we illustrated how to create a named function. But dealing with functions without a name (anonymous functions), which are often used as closures, can be a bit tricky. Let’s see how to use one in a standard function.
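The sketch below shows both ideas under simple assumptions: an anonymous function passed to sapply, and a closure that remembers state between calls. The name make_counter is hypothetical.

```r
# An anonymous function passed directly to a standard function.
squares <- sapply(1:3, function(v) v^2)

# A closure: the returned function keeps access to count,
# which lives in the environment of the make_counter call.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # modify count in the enclosing environment
    count
  }
}

counter <- make_counter()
first_call  <- counter()
second_call <- counter()
```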
R functions evaluate arguments lazily; that is, arguments are evaluated only when they are needed. This can reduce computation time, because an argument that is never used is never evaluated. Let’s take a look at how lazy evaluation works.
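To see lazy evaluation in action, here is a minimal sketch. The function lazy_demo is a made-up example; passing stop() as an argument is only a trick to prove the argument is never touched.

```r
# Hypothetical function that never uses its second argument.
lazy_demo <- function(a, b) {
  a * 2  # b is not referenced, so it is never evaluated
}

# The stop() call would raise an error if b were evaluated; it is not.
res1 <- lazy_demo(5, stop("b was evaluated"))

# For the same reason, b can be left out entirely.
res2 <- lazy_demo(5)
```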
Normally, we operate on variables a and b by creating a function func(a, b). Although this is standard function syntax, it can be hard to read. We can simplify the syntax with infix operators. Let’s see how to do that.
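A user-defined infix operator might look like the sketch below. The operator name %+% is chosen for illustration; any name wrapped in percent signs works.

```r
# User-defined infix operators must be named with surrounding % signs.
`%+%` <- function(a, b) paste(a, b)

greeting <- "Hello" %+% "World"  # same as `%+%`("Hello", "World")

# Built-in operators are ordinary functions too and can be called in prefix form.
total <- `+`(1, 2)
```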
In R, there might be instances where we have to assign a value to a function call. A replacement function does exactly that, so it is important to learn about it. Let’s explore how it works and how we can use it.
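As a rough sketch, a replacement function is defined with a name ending in <- and a final argument called value. The function second<- below is hypothetical and not part of base R.

```r
# Hypothetical replacement function: sets the second element of a vector.
# The last argument must be named value, and the modified object is returned.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

v <- c(1, 2, 3)
second(v) <- 10  # R rewrites this as v <- `second<-`(v, value = 10)
```

Built-in examples of the same mechanism include names(x) <- ... and levels(x) <- ....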
As in any other programming language, we may encounter various errors during development in R. We need to learn how to handle those errors. Doing so not only helps us correct them but also makes the program more robust.
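One common way to handle errors in R is tryCatch. The sketch below assumes a hypothetical wrapper, safe_log, that returns NA instead of failing; the course may use other tools as well.

```r
# Hypothetical wrapper: returns NA instead of failing on bad input.
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) stop("x must be numeric")
      log(x)
    },
    warning = function(w) {
      # log() of a negative number raises a warning; treat it as missing
      NA_real_
    },
    error = function(e) {
      message("Caught error: ", conditionMessage(e))
      NA_real_
    }
  )
}

ok  <- safe_log(1)    # normal path
bad <- safe_log("a")  # error path, handled
neg <- safe_log(-1)   # warning path, handled
```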
Since bugs are inevitable in any code, an R programmer has to be well prepared for them with a good debugging toolset. Let’s explore how to debug a function using R’s debugging functions.
The primary step for any data analysis is to collect high-quality, meaningful data. One important data source is open data, which is published online in either text format or as APIs. Let’s see how we can download the text format of an open data file.
Now that we’ve learned how to download open data files, it becomes crucial to know how to read and write them for further processing. Let’s see how we can read a file with R.
The functions we’ve learned, read.table and read.csv, are useful only when the data size is small. We need to know how to read large files for flexible data processing. Let’s explore how we can do that using the scan function.
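A minimal sketch of reading a delimited file with scan follows. The file contents and column names are invented for the example, and a temporary file stands in for a real dataset.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("date,price", "2015-01-01,10.5", "2015-01-02,11.0"), path)

# Describe each field's type up front; skip = 1 drops the header line.
fields <- list(date = character(), price = numeric())
records <- scan(path, what = fields, sep = ",", skip = 1, quiet = TRUE)

# nlines limits how many lines are read, which allows reading in chunks.
first_chunk <- scan(path, what = fields, sep = ",", skip = 1, nlines = 1, quiet = TRUE)
```

Because scan returns a list of vectors rather than a data frame, it leaves the choice of downstream structure to the caller.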
Excel is widely used for storing and analyzing data. One can convert Excel files to other formats, but the process is somewhat complex. This video shows how to read and write an Excel file containing world development indicators with the xlsx package.
As R reads data into memory, it is well suited to processing and analyzing small datasets. However, databases are becoming more common for storing and analyzing bigger data. In this video, we will demonstrate how to use RJDBC to connect to data stored in a database.
In most cases, the majority of data will not exist in the database, but will instead be published in different forms on the Internet. To dig up more valuable information from these data sources, we need to know how to access and scrape data from the Web.
Data analysis requires preprocessing of data. Various steps need to be performed to get data ready for analysis. The first step is renaming data variables so that one can operate on them efficiently. Let’s see how we can use the names function to rename variables.
There are many instances where one does not specify the data type while importing. This makes data manipulation difficult, as the assigned data type differs from the actual one. Let’s explore how we can resolve this by converting data types.
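As a sketch of type conversion, suppose every column was imported as text. The data frame and its values below are invented for illustration.

```r
# Illustrative data where every column was imported as character.
raw <- data.frame(
  emp_no    = c("10001", "10002"),
  salary    = c("60117", "65828"),
  hire_date = c("1986-06-26", "1985-11-21"),
  stringsAsFactors = FALSE
)

# Convert each column to the type it actually represents.
raw$emp_no    <- as.integer(raw$emp_no)
raw$salary    <- as.numeric(raw$salary)
raw$hire_date <- as.Date(raw$hire_date)

types <- sapply(raw, class)  # inspect the resulting column classes
```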
Some attributes in employees and salaries are in date format. So, we have to calculate the number of years between the employees' date of birth and current year to estimate their age. This might be a tedious task. Let’s see how we can do it by manipulating date data.
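The age estimate described above can be sketched as follows. The birth dates are made up, and a fixed reference date is assumed so the result is reproducible; in practice Sys.Date() would supply the current date.

```r
# Illustrative birth dates; not taken from the course dataset.
birth_date <- as.Date(c("1953-09-02", "1964-06-02"))

# Assumption: a fixed reference date stands in for Sys.Date().
ref_date <- as.Date("2016-01-01")

# Extract the year from each date and subtract.
birth_year <- as.integer(format(birth_date, "%Y"))
ref_year   <- as.integer(format(ref_date, "%Y"))
age        <- ref_year - birth_year
```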
Similar to database operations, we can add a new record to a data frame according to the schema of the dataset. In R, however, we can perform these operations much more easily. In this video, we’ll see how to use the rbind and cbind functions to add a new record or attribute.
Some analyses require partial data of particular interest. For that purpose, data filtering is required. In database operations, a SQL command with a WHERE clause is used to subset data. We need to know how the same is done in R. Let’s see how we can do that.
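Two base R ways to filter rows, roughly analogous to a WHERE clause, are sketched below. The salaries data frame is invented for the example.

```r
# Illustrative data.
salaries <- data.frame(
  emp_no = c(1, 1, 2, 3),
  salary = c(60000, 62000, 48000, 75000)
)

# Logical indexing: keep rows where the condition is TRUE.
high <- salaries[salaries$salary > 60000, ]

# subset() lets us refer to columns by name and combine conditions.
high_not_1 <- subset(salaries, salary > 60000 & emp_no != 1)
```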
There might be some unwanted records in the dataset even after filtering. This can generate inaccurate results. Now that we’ve learned how to filter the dataset, let’s see how we remove or drop bad data.
Similar to joining tables in a database, we sometimes need to combine two datasets to correlate data. In R, we can do that using merge and plyr. Also, to analyze data more efficiently, R provides two functions, sort and order, which we must learn in order to sort data.
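A base R sketch of merging and sorting follows; it uses merge and order only and leaves plyr aside. The two data frames are invented for illustration.

```r
# Illustrative data.
employees <- data.frame(emp_no = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
salaries  <- data.frame(emp_no = c(1, 2, 4), salary = c(500, 700, 600))

# Inner join: only emp_no values present in both tables.
joined <- merge(employees, salaries, by = "emp_no")

# Left join: keep every employee; unmatched salaries become NA.
left <- merge(employees, salaries, by = "emp_no", all.x = TRUE)

# sort() returns the sorted values; order() returns row positions,
# which is what we need to reorder a whole data frame.
sorted_values <- sort(salaries$salary)
by_salary     <- joined[order(joined$salary, decreasing = TRUE), ]
```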
There are instances where data analysis is possible only when the data is in a specific format. We must know how to reshape data and remove data with missing values for efficient data processing.
Missing data may arise from flaws in data processing or simply from typos. Even such small mistakes can affect the whole analysis, as the results may be misleading. Thus, it becomes really important to learn how to detect missing values in R.
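A few base R checks for missing values are sketched here on an invented data frame.

```r
# Illustrative data with one missing value in each column.
df <- data.frame(
  age  = c(25, NA, 40),
  city = c("Taipei", "Tokyo", NA),
  stringsAsFactors = FALSE
)

any_missing <- any(is.na(df))             # is anything missing at all?
na_per_col  <- colSums(is.na(df))         # count of missing values per column
complete    <- df[complete.cases(df), ]   # rows with no missing values
```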
We’ve learned how to detect missing data. But, there might be some instances where analysis may go wrong due to those missing values. This video will introduce some techniques to impute missing values for efficient data processing.
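As one simple illustration, missing values can be replaced with the mean of the observed values. This is only one of several possible techniques and is not necessarily the one the video uses; the vector is invented for the example.

```r
# Illustrative vector with one missing value.
age <- c(25, NA, 40, 35)

# Mean imputation: compute the mean of the observed values
# and assign it to the missing positions.
observed_mean   <- mean(age, na.rm = TRUE)
age[is.na(age)] <- observed_mean
```

Mean imputation keeps the sample mean unchanged but shrinks the variance, so it is worth comparing against other approaches before relying on it.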
When you process a dataset that is a gigabyte or larger in size, you may find that data.frame is rather inefficient. To address this issue, you can use the enhanced extension of data.frame—data.table. In this video, we will see how to create a data.table in R.
Two major advantages of a data.table as compared to a data.frame are the speed and clearer syntax of the former. Similar to a data.frame, we can perform operations to slice and subset a data.table. This video shows some operations that you can perform on data.table.
Another advantage of a data.table is that we can easily aggregate data without the help of additional packages. This video illustrates how to perform data aggregation using data.table.
In addition to performing data manipulation on a single table, we often need to import more features or correlate data from other data sources. Therefore, we can join two or more tables into one. In this video, we look at some methods to merge two data.table objects.
To perform more advanced descriptive analysis, we must know how to use the dplyr package to reshape data and obtain summary statistics. This video will show us how to use dplyr to manipulate data and how to use the filter and slice functions to subset and slice data.
As a single machine cannot efficiently process big data problems, a practical approach is to take samples that we can effectively use to draw conclusions. Here, we will see how to use dplyr to sample from data.
Besides selecting individual rows from the dataset, we can use the select function in dplyr to select a single or multiple columns from the dataset. In this video, we will look at how to select particular columns using the select function.
To perform multiple operations on data using dplyr, we can wrap up the function calls into a larger function call. Or, we can use the %>% chaining operator to chain operations instead. This video will introduce chaining of operations when using dplyr.
Arranging rows in order may help us rank data by value or gain a more structured view of data in the same category. In this video, we will take a look at how to arrange rows with dplyr.
To avoid counting duplicate rows, we can use the DISTINCT operation in SQL. In dplyr, we can also eliminate duplicated rows from a given dataset. Let’s explore how to do that.
Besides performing data manipulation on existing columns, there are situations where a user may need to create a new column for more advanced analysis. Let’s see how to add a new column using dplyr.
Besides manipulating a dataset, the most important part of dplyr is that one can easily obtain summary statistics from the data. In SQL, we use the GROUP BY clause for this purpose. This video will show us how to summarize data with dplyr.
In a SQL operation, we can perform a join operation to combine two different datasets. In dplyr, we have the same join operation that enables us to merge data easily. In this video, we’ll learn how join works in dplyr.
In ggplot2, the data is charted by mapping the element from mathematical space to physical space. We can use simple elements to build a figure. This video shows how to construct our very first ggplot2 plot using the superstore sales dataset.
Aesthetics mapping describes how data variables are mapped to the visual property of a plot. In this video, we discuss how to modify aesthetics mapping on geometric objects so that we can change the position, size and color of a given geometric object.
Geometric objects are elements that we mark on the plot. One can use geometric objects in ggplot2 to create a line, bar, or box chart. Moreover, one can integrate them with aesthetic mapping to create a more professional plot. This video shows how to use geometric objects to create various charts.
Besides mapping particular variables to the x or y axis, one can first perform statistical transformations on variables, and then remap the transformed variable to a specific position. With the help of this video, we’ll be able to perform variable transformations with ggplot2.
Besides setting aesthetic mapping for each plot or geometric object, one can use scale to control how variables are mapped to the visual property. Let’s explore how to adjust the scale of aesthetics in ggplot2.
When performing data exploration, it is essential to compare data across different groups. Faceting is a technique used to create graphs for subsets of data. This video will help us use the facet function to create a chart for multiple subsets of data.
One can adjust the layout, color, font, and other attributes of a non-data object using the theme system in ggplot2. By default, ggplot2 provides many themes, and one can adjust the current theme. This video will show us how to use the theme_* function and customize a theme.
To create an overview of a dataset, we may need to combine individual plots into one. This video will guide us on how to combine individual subplots into one plot.
One can use a map to visualize the geographical relationship of spatial data. This video shows us how to create a map from a shapefile with ggplot2 and use ggmap to download data from a mapping service.
Creating an R Markdown report with RStudio is a straightforward process. This video will teach us how to use the built-in GUI to create markdown reports in different formats.
The most attractive feature of a markdown report is that it enables the user to create a well-formatted document with plain text and simple markup syntax. Let’s see how we can use Markdown to create, edit, organize, and highlight data.
In an R Markdown report, we can embed R code chunks with the knitr package. This video will guide us on how to create and control the output with different code chunk configurations.
In ggvis, one can use a simple layer to create lines, points, and other geometry objects in the plot. This video guides us through using ggvis syntax and grammar to create different plots.
In addition to making different plots in ggvis, we can control how axes and legends are displayed in a ggvis figure with the *_axis and *_legend functions. Let’s see how we can set their appearance properties and rescale the mapping of the data with the scale function.
ggvis can be used to create an interactive web form. It allows the user to subset data and change the visual properties of the plot by interacting with the web form. In this video, we learn how to add interactivity to a ggvis plot.
An R Markdown report outputs code and static figures; one cannot perform exploratory data analysis through web interaction. To enable the user to explore data via a web form, we have to build an interactive web page. In this video, we see how to create an interactive web report with Shiny.
In addition to hosting a Shiny app on a local machine, we can host our Shiny app online. RStudio provides a service, http://www.shinyapps.io/, that allows anyone to upload their Shiny app. Let’s see how to publish an R Shiny report using shinyapps.io.
Generating samples is the first step for working with probability distributions. So, learning this basic concept is very important.
When every outcome is equally likely, we use a uniform distribution to model that.
You need to generate samples from a binomial distribution when you evaluate the success or failure of several independent trials. This video will enable you to do that.
To model the number of events occurring within a fixed time interval, the Poisson distribution is a natural choice.
Much real-world data approximately follows a normal distribution curve, so it is worth learning how to sample from a normal distribution. This video will help you with that.
This video shows how to use R to generate samples from a chi-squared distribution.
To estimate the mean of a normally distributed population when its standard deviation is unknown, the Student’s t distribution is used.
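The distributions mentioned above each have a random-sample generator in base R. The sketch below draws from all of them; the sample size, seed, and parameters are arbitrary choices made for illustration.

```r
set.seed(123)  # fix the seed so the samples are reproducible

n <- 1000
u  <- runif(n, min = 0, max = 1)        # uniform on [0, 1]
b  <- rbinom(n, size = 10, prob = 0.5)  # successes in 10 independent trials
p  <- rpois(n, lambda = 3)              # event counts per interval, mean rate 3
z  <- rnorm(n, mean = 0, sd = 1)        # standard normal
k  <- rchisq(n, df = 2)                 # chi-squared, 2 degrees of freedom
t5 <- rt(n, df = 5)                     # Student's t, 5 degrees of freedom

# Sample means should land near the theoretical means: 0.5, 5, 3, 0, 2, 0.
sample_means <- c(mean(u), mean(b), mean(p), mean(z), mean(k), mean(t5))
```

Each generator has matching d, p, and q functions (for example dnorm, pnorm, qnorm) for the density, cumulative probability, and quantiles.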
Along with generating samples, we can also sample subsets from datasets. This video will equip you to do that.
When there are one or more random variables within the model, we need stochastic processes.
To estimate the interval range of unknown parameters in data, we use confidence intervals.
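As a sketch, a 95% confidence interval for a mean can be computed by hand from the t distribution and checked against t.test. The measurements below are invented for illustration.

```r
# Illustrative measurements.
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 4.7)

n  <- length(x)
m  <- mean(x)
se <- sd(x) / sqrt(n)  # standard error of the mean

# 95% interval: mean plus or minus the t quantile times the standard error.
ci_manual <- m + c(-1, 1) * qt(0.975, df = n - 1) * se

# t.test reports the same interval.
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
```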
To compare two mean values, we perform Z-tests on data.
In cases where the standard deviation is unknown, we need to perform Student’s t-tests.
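A minimal sketch of a two-sample t-test uses the mtcars dataset that ships with R. The choice of dataset and variables is an assumption made for illustration, not necessarily the example used in the video.

```r
# Compare fuel economy (mpg) between automatic (am = 0) and manual (am = 1) cars.
# By default, t.test runs Welch's test, which does not assume equal variances.
result <- t.test(mpg ~ am, data = mtcars)

p_value     <- result$p.value   # probability of a difference this large under the null
group_means <- result$estimate  # mean mpg in each group
```

A one-sample version, such as t.test(mtcars$mpg, mu = 20), tests a single mean against a hypothesized value.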
When the data distribution is unknown, non-parametric testing comes into the picture. We do that by conducting exact binomial tests in R.
When comparing two samples, or a sample with a reference probability distribution, we use Kolmogorov-Smirnov tests.
To discover the relationship between two categorical variables, we need to conduct a Pearson’s chi-squared test.
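The test can be sketched on a small contingency table. The counts and labels below are made up for illustration.

```r
# Illustrative 2 x 2 table of counts: rows are groups, columns are outcomes.
counts <- matrix(
  c(30, 10, 20, 40),
  nrow = 2,
  dimnames = list(group = c("A", "B"), outcome = c("yes", "no"))
)

# Pearson's chi-squared test of independence
# (Yates' continuity correction is applied by default for 2 x 2 tables).
res <- chisq.test(counts)

p_value  <- res$p.value
expected <- res$expected  # counts expected if group and outcome were independent
```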
To test whether two groups come from the same population, we use the Wilcoxon rank sum and signed rank tests.
To investigate the relationship between a single categorical variable and a continuous variable, one-way ANOVA is used.
When two categorical variables are involved, two-way ANOVA is used.
Before rule mining, it is important to transform the data into transactions.
You will learn to display transactions and associations in this video.
To find relations within a transaction dataset, we use the Apriori algorithm.
Sometimes, rules are repeated and are redundant. We need to know how to remove these rules to get significant information. This video will enable you to do that.
To explore the relation between items, we visualize association rules.
Eclat is often faster than Apriori at mining frequent itemsets, so it is worth learning how it works.
You will learn to create transactions with temporal information in this video.
A better algorithm for mining frequent sequential patterns is cSPADE. It is important to learn about it and understand it.
Time-indexed variables should be represented in time series data. Hence it is important to know how to create one.
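Creating a time series object in base R can be sketched as follows. The quarterly figures and the start date are invented for the example.

```r
# Illustrative quarterly values covering two years.
quarterly_sales <- c(120, 135, 150, 160, 128, 142, 158, 171)

# frequency = 4 means four observations per year; start = c(2015, 1) is 2015 Q1.
sales_ts <- ts(quarterly_sales, start = c(2015, 1), frequency = 4)

# window() extracts a sub-period by time rather than by position.
sales_2016 <- window(sales_ts, start = c(2016, 1))
```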
Plotting a time series object will make visualization easy and effective.
To get the components of a time series, we need to decompose it.
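As a sketch, base R’s decompose function splits a series into seasonal, trend, and random components. The built-in AirPassengers dataset and the multiplicative model are assumptions chosen for illustration.

```r
# AirPassengers is a monthly series (frequency 12) that ships with R.
parts <- decompose(AirPassengers, type = "multiplicative")

seasonal  <- parts$seasonal  # repeating within-year pattern
trend     <- parts$trend     # moving-average trend (NA at both ends)
remainder <- parts$random    # what is left after removing trend and seasonality
```

Calling plot(parts) draws the observed series and the three components in stacked panels.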
To measure the error rate of a regression model, we need to calculate RMSE and RSE.
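The sketch below computes both measures for a simple linear model on the built-in mtcars data. It assumes RSE stands for relative squared error; if residual standard error is meant instead, summary(fit)$sigma gives that value. The model and dataset are illustrative choices.

```r
# Fit a simple regression: fuel economy as a function of weight.
fit <- lm(mpg ~ wt, data = mtcars)

actual    <- mtcars$mpg
predicted <- fitted(fit)

# Root mean squared error: typical size of a prediction error, in mpg units.
rmse <- sqrt(mean((predicted - actual)^2))

# Relative squared error (assumed meaning of RSE): model error relative to
# the error of always predicting the mean.
rse <- sum((predicted - actual)^2) / sum((mean(actual) - actual)^2)
```

For a linear model with an intercept evaluated on its own training data, this relative squared error equals one minus R-squared.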
We can forecast a time series from the smoothed model. Let’s learn how to do that.
ARIMA takes auto-correlation into consideration. This helps in real-life examples.
After understanding the ARIMA model, we can create an ARIMA model of our own. Let’s see how to do that.
We can predict values with the ARIMA model.
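Fitting and predicting with ARIMA can be sketched using only base R’s arima and predict. The dataset (AirPassengers), the log transform, and the model orders are assumptions chosen for illustration; the course may select orders differently or use another package.

```r
# Log-transform to stabilize the growing seasonal swings.
y <- log(AirPassengers)

# Seasonal ARIMA(0,1,1)(0,1,1) with period 12; the orders are an assumption.
fit <- arima(
  y,
  order    = c(0, 1, 1),
  seasonal = list(order = c(0, 1, 1), period = 12)
)

# Predict the next 12 months; predict() returns point forecasts and standard errors.
fc <- predict(fit, n.ahead = 12)

forecast_values <- exp(fc$pred)  # back-transform to the original scale
```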
You will apply your knowledge of the ARIMA model in prediction of stock prices.
Packt has been committed to developer learning since 2004. A lot has changed in software since then, but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live, and how to put them to work.
With an extensive library of content (more than 4000 books and video courses), Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.
From skills that will help you to develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource to make you a better, smarter developer.
Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.