Case Studies in Data Mining with R

1,499 students enrolled

Learn to use the "Data Mining with R" (DMwR) package and R software to build and evaluate predictive data mining models.

Current price: $12
Original price: $60
Discount: 80% off

30-Day Money-Back Guarantee

- 22 hours on-demand video
- 2 Supplemental Resources
- Full lifetime access
- Access on mobile and TV
- Certificate of Completion

What Will I Learn?

- Understand how to implement and evaluate a variety of predictive data mining models in three different domains, each described as extended case studies: (1) harmful plant growth; (2) fraudulent transaction detection; and (3) stock market index changes.
- Perform sophisticated data mining analyses using the "Data Mining with R" (DMwR) package and R software.
- Have a greatly expanded understanding of the use of R software as a comprehensive data mining tool and platform.
- Understand how to implement and evaluate supervised, semi-supervised, and unsupervised learning algorithms.

Requirements

- Students will need to install no-cost R software and the no-cost RStudio IDE (instructions are provided).

Description

**Case Studies in Data Mining** was originally taught as three separate online data mining courses. We examine three case studies which together present a broad-based tour of the basic and extended tasks of data mining in three different domains: (1) predicting algae blooms; (2) detecting fraudulent sales transactions; and (3) predicting stock market returns.

Together, the fifteen hands-on sessions across the three courses showcase the use of Luis Torgo's amazingly useful "Data Mining with R" (DMwR) package and R software. Everything that you see on-screen is included with the course: all of the R scripts; all of the data files and R objects used and/or referenced; and all of the R packages' documentation. You can be new to R software and/or to data mining and still complete the course successfully.

The first case study, *Predicting Algae Blooms*, provides instruction regarding the many useful, unique data mining functions contained in the R software 'DMwR' package. For the algae blooms prediction case, we specifically look at the tasks of data pre-processing, exploratory data analysis, and predictive model construction. For individuals completely new to R, the first two sessions of the algae blooms case (almost 4 hours of video and materials) provide an accelerated introduction to the use of R and RStudio and to basic techniques for inputting and outputting data and text.

Who is the target audience?

- The course is appropriate for anyone seeking to expand their knowledge and analytical skills related to conducting predictive data mining analyses.
- The course is appropriate for undergraduate students seeking to acquire additional in-demand job skill sets for business analytics.
- The course is appropriate for graduate students seeking to acquire additional data analysis skills.
- Knowledge of R software is not required to successfully complete this course.
- The course is appropriate for practicing business analytics professionals seeking to acquire additional job skill sets.

Curriculum For This Course

136 Lectures

21:53:54

A Brief Introduction to R and RStudio using Scripts
14 Lectures
02:17:29

Preview
01:35

Introduction to R for Data Mining

15:57

A vector is a sequence of data elements of the same basic type. Members of a vector are officially called components. Nevertheless, we will just call them members on this site.

Here is a vector containing three numeric values 2, 3 and 5.

> c(2, 3, 5)

[1] 2 3 5

And here is a vector of logical values.

> c(TRUE, FALSE, TRUE, FALSE, FALSE)

[1] TRUE FALSE TRUE FALSE FALSE

A vector can contain character strings.

> c("aa", "bb", "cc", "dd", "ee")

[1] "aa" "bb" "cc" "dd" "ee"

Preview
08:41

Data Structures: Vectors (part 2)

09:35

The function `factor` is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument `ordered` is `TRUE`, the factor levels are assumed to be ordered. For compatibility with S there is also a function `ordered`.

`is.factor`, `is.ordered`, `as.factor` and `as.ordered` are the membership and coercion functions for these classes.
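As a minimal sketch of the functions named above (the `sizes` vector is made-up illustration data, not course material), an unordered and an ordered factor might be built like this:

```r
# Hypothetical responses, used only for illustration
sizes <- c("small", "large", "medium", "small", "large")

# Unordered factor: levels default to the sorted unique values
f <- factor(sizes)
levels(f)

# Ordered factor: supply the levels in their intended order
o <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
is.ordered(o)

# Comparisons are only meaningful for ordered factors
below_large <- o < "large"

# Membership test and coercion
is.factor(as.factor(sizes))
```

Passing `levels` explicitly is the design choice to notice: without it, R would order the levels alphabetically, which is rarely the intended ranking.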

Preview
08:20

Factors (part 2)

10:31

seq() is the R function that generates a regular sequence of numbers and returns it as a vector.
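A few common ways of calling `seq()` and its relatives, shown as a short sketch with arbitrary example values:

```r
# Fixed step size
s1 <- seq(1, 10, by = 3)

# Fixed number of equally spaced points
s2 <- seq(0, 1, length.out = 5)

# Integer sequence 1..n; safe even when n is 0
s3 <- seq_len(4)

# Descending sequence with a negative step
s4 <- seq(8, 2, by = -2)
```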

Generating Sequences

13:42

Given a vector of data one common task is to isolate particular entries or censor items that meet some criteria.
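The main indexing styles can be sketched on a small named vector (the values are arbitrary illustration data):

```r
x <- c(a = 12, b = 7, c = 30, d = 3, e = 18)

second  <- x[2]             # by position
named   <- x[c("a", "e")]   # by name
dropped <- x[-1]            # negative index drops an element
big     <- x[x > 10]        # logical index isolates entries meeting a criterion

# "Censor" entries that exceed a threshold by capping them at 10
x[x > 10] <- 10
```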

Indexing (aka Subscripting or Subsetting)

07:53

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.

Data Structures: Matrices and Arrays (part 1)

08:22

An array in **R** can have one, two or more dimensions. It is simply a vector which is stored with additional attributes giving the dimensions (attribute `"dim"`) and optionally names for those dimensions (attribute `"dimnames"`).
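To make the "vector plus attributes" idea concrete, here is a small sketch with arbitrary values:

```r
# A matrix is filled column by column from a vector
m <- matrix(1:6, nrow = 2, ncol = 3)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
cell <- m["r2", "c3"]

# Setting the "dim" attribute turns a plain vector into a 3-d array
a <- 1:24
dim(a) <- c(2, 3, 4)
last <- a[2, 3, 4]

# The attributes are ordinary R objects that can be inspected
attributes(m)
```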

Data Structures: Matrices and Arrays (part 2)

07:41

A **list** is an R structure that may contain objects of any other type, including other lists. Many of the modeling functions (like t.test() for the t test or lm() for linear models) return lists, but you can also construct one yourself:

mylist <- list (a = 1:5, b = "Hi There", c = function(x) x * sin(x))
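Continuing with the same `mylist`, the sketch below shows the usual ways of reaching into a list; the numbers passed to `t.test()` are made-up illustration data:

```r
mylist <- list(a = 1:5, b = "Hi There", c = function(x) x * sin(x))

by_name <- mylist$a        # element by name
element <- mylist[["b"]]   # double brackets return the element itself
sublist <- mylist["b"]     # single brackets return a list of length 1
value   <- mylist$c(pi / 2)  # call the stored function

# Model functions return lists too (hypothetical measurements)
fit <- t.test(c(5.1, 4.9, 5.6, 5.8, 6.0))
names(fit)
```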

Data Structures: Lists

13:09

A data frame is a list of variables of the same number of rows with unique row names, given class `"data.frame"`. If no variables are included, the row names determine the number of rows.
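A brief sketch of both statements; the column names and values are hypothetical and not taken from the course data files:

```r
# Columns of equal length, possibly of different types
df <- data.frame(
  id     = 1:3,
  season = c("winter", "spring", "summer"),
  pH     = c(8.0, 8.35, 8.1)
)
is.list(df)    # a data frame is a list of columns
class(df)
rownames(df)

# With no variables, the row names alone determine the number of rows
empty <- data.frame(row.names = c("a", "b"))
dim(empty)
```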

Data Structures: Dataframes (part 1)

09:38

Data Structures: Dataframes (part 2)

10:11

Creating New Functions

12:14

Inputting and Outputting Data and Text
9 Lectures
01:21:31

The scan() function in R reads data into a vector or list from the console or file.
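The sketch below uses the `text` argument so that it runs without an input file; reading from a file would pass a file name as the first argument instead. The strings are illustration data.

```r
# Numeric vector (the default type)
nums <- scan(text = "3 1 4 1 5", quiet = TRUE)

# Character vector with a custom separator
words <- scan(text = "red,green,blue", what = character(), sep = ",", quiet = TRUE)

# A list template in `what` reads records with mixed types
rec <- scan(text = "alice 30 bob 25", what = list(name = "", age = 0), quiet = TRUE)
```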

Using the scan() Function for Input (part 1)

08:20

Using the scan() Function for Input (part 2)

07:04

The readline() function in R reads a line from the terminal (in interactive use).

Preview
12:00

The readLines() function

Using readLines() Function and Text Data

12:45

Example Program: powers.r

06:06

Example Program: quartiles1.r

07:23

Example Program: quad2b.r

08:23

Reading and Writing Files (part 1)

05:48

Reading and Writing Files (part 2)

13:42

Introduction to Predicting Algae Blooms
9 Lectures
01:59:52

This case study introduces you to some basic tasks of data mining: data pre-processing, exploratory data analysis, and predictive model construction. It addresses a relatively small problem by data mining standards: predicting the frequency of occurrence of several harmful algae in water samples.

Predicting Algae Blooms

12:29

A **histogram** is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson.
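As a small sketch of what a histogram computes, the code below bins simulated values (not the course's algae data) without drawing anything, then checks that the density version integrates to one:

```r
set.seed(1)
x <- rnorm(200, mean = 8, sd = 0.5)   # simulated measurements

# plot = FALSE returns the bins and counts instead of drawing
h <- hist(x, breaks = 10, plot = FALSE)
h$breaks
h$counts

# Bar areas of the density histogram sum to 1, which is why it
# can be read as an estimate of the probability distribution
area <- sum(h$density * diff(h$breaks))
```

Calling `hist(x, freq = FALSE)` would draw the same density histogram on the active graphics device.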

Data Visualization and Summarization: Histograms

14:34

The **box plot** (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.
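The five-number summary behind a box plot can be computed directly. This sketch uses arbitrary values; note that `fivenum()` reports Tukey's hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

# boxplot.stats() gives the values a box plot would actually draw:
# whisker ends and hinges in $stats, points beyond the whiskers in $out
b <- boxplot.stats(x)
b$stats
b$out
```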

Data Visualization: Boxplot and Identity Plot

13:15

**Conditioning plot**. Purpose: check the pairwise relationship between two variables conditional on a third variable. A conditioning plot, also known as a coplot or subset plot, is a plot of two variables conditional on the value of a third variable (called the conditioning variable).

Preview
14:48

In statistics, **imputation** is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with a probable value based on other available information. Once all missing values have been imputed, the data set can then be analysed using standard techniques for complete data.
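The contrast between listwise deletion and a simple central-value imputation can be sketched in base R. The data frame and its column names are hypothetical, and this is not the DMwR-based approach used in the lectures:

```r
# Hypothetical water samples with missing entries
df <- data.frame(
  pH     = c(8.0, NA, 8.1, 7.9),
  season = c("winter", "spring", NA, "spring"),
  stringsAsFactors = FALSE
)

# Listwise deletion: keep only fully observed rows
complete <- df[complete.cases(df), ]

# Item imputation with central measures:
# median for the numeric column, most frequent value for the nominal one
imputed <- df
imputed$pH[is.na(imputed$pH)] <- median(df$pH, na.rm = TRUE)
mode_season <- names(which.max(table(df$season)))
imputed$season[is.na(imputed$season)] <- mode_season
```

Deletion halves this tiny data set, while imputation keeps all four rows at the cost of inserting values that were never observed.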

Imputation: Dealing with Unknown or Missing Values

16:03

Imputation: Removing Rows with Missing Values

11:09

Imputation: Replace Missing Values with Central Measures

10:04

Imputation: Replace Missing Values through Correlation

13:57

The **lattice** package, written by Deepayan Sarkar, attempts to improve on base R graphics.

Visualizing other Imputations with Lattice Plots

13:33

Obtaining Prediction Models
5 Lectures
01:14:38

Read in Data Files

14:18

Creating Prediction Models

15:21

In statistics, **regression** analysis is a statistical process for estimating the relationships among variables. It includes many techniques for **modeling** and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').
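A minimal sketch of a multiple linear regression, using R's built-in `mtcars` data as a stand-in for the course data set (the choice of predictors is only for illustration):

```r
# Dependent variable mpg, predictors wt and hp
fit <- lm(mpg ~ wt + hp, data = mtcars)

coefs <- coef(fit)                 # intercept and one slope per predictor
r2    <- summary(fit)$r.squared    # proportion of variance explained

# Predict for a new, hypothetical car
new_car <- data.frame(wt = 3, hp = 110)
pred <- predict(fit, newdata = new_car)
```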

Examine Alternative Regression Models

17:30

**Regression trees** are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values.
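As a sketch, a regression tree can be grown with the `rpart` package that ships with R. The built-in `mtcars` data stands in for the course data, and the `cp` value used for pruning is an arbitrary illustration, not a recommended setting:

```r
library(rpart)

# method = "anova" requests a regression (not classification) tree
tree <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# Squared-error performance on the training data
pred <- predict(tree, mtcars)
mse  <- mean((mtcars$mpg - pred)^2)

# Baseline: always predict the overall mean
baseline <- mean((mtcars$mpg - mean(mtcars$mpg))^2)

# Prune back to a simpler tree using a complexity threshold
pruned <- prune(tree, cp = 0.1)
```

`printcp(tree)` lists the candidate complexity values and their cross-validated errors, which is the usual starting point for choosing where to prune.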

Preview
15:42

Strategy for Pruning Trees

11:47

Evaluating and Selecting Models
9 Lectures
01:38:25

Alternative Model Evaluation Criteria

14:14

**Cross-validation**, sometimes called **rotation estimation**, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of *known data* on which training is run (*training dataset*), and a dataset of *unknown data* (or *first seen* data) against which the model is tested (*testing dataset*). The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the *validation dataset*), in order to limit problems like overfitting and to give insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).

Introduction to K-Fold Cross-Validation

11:32

In *k*-fold cross-validation, the original sample is randomly partitioned into *k* equal sized subsamples. Of the *k* subsamples, a single subsample is retained as the validation data for testing the model, and the remaining *k* − 1 subsamples are used as training data. The cross-validation process is then repeated *k* times (the *folds*), with each of the *k* subsamples used exactly once as the validation data. The *k* results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
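The procedure can be sketched by hand in base R. This is not the DMwR experimental-comparison machinery used in the lectures; the data set (`mtcars`), the model formula, and `k = 5` are all assumptions chosen for illustration:

```r
set.seed(42)
k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k (nearly) equal-sized folds
folds <- sample(rep(1:k, length.out = n))

mse <- numeric(k)
for (i in 1:k) {
  test_idx <- which(folds == i)

  # Train on the other k - 1 folds
  fit <- lm(mpg ~ wt + hp, data = mtcars[-test_idx, ])

  # Validate on the held-out fold
  pred <- predict(fit, newdata = mtcars[test_idx, ])
  mse[i] <- mean((mtcars$mpg[test_idx] - pred)^2)
}

# Average the k fold results into a single estimate
cv_mse <- mean(mse)
```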

Preview
10:30

Setting up K-Fold Evaluation (part 2)

10:26

Best Model (part 1)

10:10

Best Model (part 2)

09:52

Predicting from the Models

10:16

Comparing the Predictions

10:16

Examine the Data in the Fraudulent Transactions Case Study
10 Lectures
01:31:52

Exercise Solution from Evaluating and Selecting Models

03:59

This case study addresses an instantiation of the general problem of detecting unusual observations of a phenomenon, that is, finding rare and quite different observations. The driving application has to do with transactions of a set of products that are reported by the salespeople of some company. The goal is to find "strange" transaction reports that may indicate fraud attempts by some of the salespeople.

Fraudulent Case Study Introduction

03:04

Prelude to Exploring the Data

04:51

**Data visualization** is the presentation of **data** in a pictorial or graphical format. For centuries, people have depended on visual representations such as charts and maps to understand information more easily and quickly.

Preview
11:21

Exploring the Data Continued (part 1)

13:36

Exploring the Data Continued (part 2)

13:24

Exploring the Data Continued (part 3)

13:15

Dealing with Missing Data (part 1)

10:13

Dealing with Missing Data (part 2)

07:23

Dealing with Missing Data (part 3)

10:46

Pre-Processing the Data to Apply Methodology
8 Lectures
01:24:52

Review the Data and the Focus of the Fraudulent Transactions Case

12:50

Pre-Processing the Data (part 1)

10:27

Here we explain the whys and hows of creating a list structure containing the unit prices by product.
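One way such a list could be built is with `split()`. The sketch below assumes a small made-up transactions table; the column names `Prod`, `Quant`, `Val`, and `Uprice` are hypothetical and may not match the course scripts:

```r
# Hypothetical transaction reports
sales <- data.frame(
  Prod  = c("p1", "p2", "p1", "p3", "p2", "p1"),
  Quant = c(10, 5, 20, 2, 4, 8),
  Val   = c(100, 75, 210, 90, 64, 76)
)

# Unit price of each transaction
sales$Uprice <- sales$Val / sales$Quant

# A list with one vector of unit prices per product
uprice_by_prod <- split(sales$Uprice, sales$Prod)

# Per-product summaries are then a one-liner
typical <- sapply(uprice_by_prod, median)
```

A list suits this task because products have different numbers of transactions, so the per-product vectors have different lengths.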

Preview
10:39

Pre-Processing the Data (part 3)

12:28

In supervised learning, the categories to which the data are assigned are known before computation, so the labels can be used to 'learn' the parameters that best distinguish those categories. In unsupervised learning, observations are assigned to segments (clusters) without those clusters being known in advance.

Defining Data Mining Tasks

11:55

**Semi-supervised learning** is a class of supervised learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.

Semi-Supervised Techniques

06:51

In pattern recognition and information retrieval with binary classification, **precision** (also called positive predictive value) is the fraction of retrieved instances that are relevant, while **recall** (also known as sensitivity) is the fraction of relevant instances that are retrieved.
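Both measures follow directly from the counts of true positives, false positives, and false negatives. A short sketch with made-up labels (1 = fraud, 0 = normal):

```r
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)   # true labels
predicted <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)   # model's flags

tp <- sum(predicted == 1 & actual == 1)   # flagged and truly fraud
fp <- sum(predicted == 1 & actual == 0)   # flagged but normal
fn <- sum(predicted == 0 & actual == 1)   # fraud that was missed

precision <- tp / (tp + fp)   # share of flagged cases that are fraud
recall    <- tp / (tp + fn)   # share of fraud cases that were flagged
```

Here 3 of the 5 flagged cases are fraud (precision 0.6) and 3 of the 4 fraud cases are flagged (recall 0.75).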

Precision and Recall

07:31

**Lift** is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance. Both charts consist of a lift curve and a baseline.
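The numbers behind a cumulative gains or lift chart can be computed in a few lines. The scores and labels below are made up; the baseline corresponds to inspecting cases in random order, which gives a lift of 1:

```r
score  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)  # model scores
actual <- c(1,   1,   0,   1,   0,   0,   1,   0,   0,   0)     # 1 = positive

# Inspect cases from highest to lowest score
ord  <- order(score, decreasing = TRUE)
hits <- cumsum(actual[ord])

frac_inspected <- seq_along(hits) / length(hits)
gain <- hits / sum(actual)       # cumulative share of positives found
lift <- gain / frac_inspected    # improvement over random inspection
```

Inspecting the top 20% of cases finds half of the positives, a lift of 2.5; by the time every case is inspected the lift necessarily falls back to 1.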

Lift Charts and Precision Recall Curves

12:11

Methodology to Find Outliers (Fraudulent Transactions)
10 Lectures
01:25:36

Exercise from Previous Session

01:16

Review Precision and Recall

10:05

Review Lift Charts and Precision Recall Curves

08:33

Creating More Functions for the Experimental Methodology

07:04

Experimental Methodology to find Outliers (part 1)

10:36

An **outlier** is an observation that lies outside the overall pattern of a distribution (Moore and McCabe 1999). Usually, the presence of an **outlier** indicates some sort of problem. This can be a case which does not fit the model under study, or an error in measurement. **Outliers** are often easy to spot in histograms.
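A common baseline for flagging such observations is the box plot rule: anything more than 1.5 interquartile ranges beyond the quartiles is treated as an outlier. The sketch below applies it to made-up unit prices; the 1.5 multiplier is the conventional default, not a value taken from the course.

```r
x <- c(10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 25.0)   # hypothetical unit prices

q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

is_out   <- x < lower | x > upper
outliers <- x[is_out]
```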

Experimental Methodology to find Outliers (part 2)

11:27

Experimental Methodology to find Outliers (part 3)

09:12

Experimental Methodology to find Outliers (part 4)

10:18

Experimental Methodology to find Outliers (part 5)

06:57

The Data Mining Tasks to Find the Fraudulent Transactions
9 Lectures
01:33:10

Review of Fraud Case (part 1)

10:50

Review of Fraud Case (part 2)

11:08

Review of Fraud Case (part 3)

10:33

Baseline Boxplot Rule

07:47

state-of-the-art outlier ranking method. The main idea of this system is to try to obtain an outlyingness score for each case by estimating its degree of isolation with respect to its local neighborhood. The method is based on the notion of the local density of the observations. Cases in regions with very low density are considered outliers. The estimates of the density are obtained using the distances between cases.
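The description above matches the idea behind local outlier factors (LOF). Assuming that is the method meant, a simplified base-R sketch is given below. The function name `lof_scores` is hypothetical, ties in neighbor distances are ignored, and duplicated points are not handled, so this is an illustration rather than a reference implementation.

```r
# Simplified LOF: scores near 1 mean "as dense as the neighbors",
# scores well above 1 mean "much more isolated than the neighbors"
lof_scores <- function(X, k = 3) {
  D <- as.matrix(dist(X))
  n <- nrow(D)
  stopifnot(k >= 2, k < n)
  diag(D) <- Inf                       # a case is not its own neighbor

  # Row i holds the indices of the k nearest neighbors of case i
  nn <- t(apply(D, 1, function(d) order(d)[1:k]))

  # Distance from each case to its k-th nearest neighbor
  kdist <- sapply(1:n, function(i) D[i, nn[i, k]])

  # Local reachability density: inverse of the mean reachability distance
  lrd <- sapply(1:n, function(i) {
    reach <- pmax(kdist[nn[i, ]], D[i, nn[i, ]])
    1 / mean(reach)
  })

  # LOF: average neighbor density relative to the case's own density
  sapply(1:n, function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

# Simulated data: a tight cluster of 20 cases plus one distant case
set.seed(7)
X <- rbind(matrix(rnorm(40, sd = 0.5), ncol = 2), c(6, 6))
scores <- lof_scores(X, k = 4)
most_isolated <- which.max(scores)
```

Because the score is a ratio of densities, it ranks cases by how isolated they are relative to their own neighborhood rather than by a single global distance threshold.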

Preview
11:38

Plotting Everything

08:22

From a theoretical point of view, supervised and unsupervised learning differ only in the causal structure of the model. In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs. In other words, the inputs are assumed to be at the beginning and outputs at the end of the causal chain. The models can include mediating variables between the inputs and outputs. In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain. In practice, models for supervised learning often leave the probability for inputs undefined. This model is not needed as long as the inputs are available, but if some of the input values are missing, it is not possible to infer anything about the outputs. If the inputs are also modelled, then missing inputs cause no problem since they can be considered latent variables as in unsupervised learning.

Supervised and Unsupervised Approaches

11:19

SMOTE and Naive Bayes (part 1)

10:49

SMOTE and Naive Bayes (part 2)

10:44

Sidebar on Boosting
6 Lectures
01:00:03

Introduction to Boosting (from Rattle course)

08:34

**Boosting** is a machine learning ensemble meta-algorithm for reducing bias primarily and also variance in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones.

Preview
09:01

**Recursive partitioning** is a statistical method for multivariable analysis. **Recursive partitioning** creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables.

Replicating Adaboost using Rpart (Recursive Partitioning) Package

11:10

**AdaBoost**, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire who won the prestigious "Gödel Prize" in 2003 for their work.
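A compact sketch of discrete AdaBoost with `rpart` decision stumps as the weak learners follows. It is not the instructor's script: the simulated data, the column names `x1`, `x2`, `label`, and the choice of 10 rounds are all assumptions made for illustration.

```r
library(rpart)

# Simulated two-class problem with a diagonal boundary
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
y <- ifelse(d$x1 + d$x2 > 1, 1, -1)      # labels coded as +1 / -1
d$label <- factor(y)

M <- 10                    # number of boosting rounds
w <- rep(1 / n, n)         # start with uniform case weights
alpha  <- numeric(M)
stumps <- vector("list", M)

to_pm1 <- function(f) ifelse(f == "1", 1, -1)

for (m in 1:M) {
  # Weak learner: a one-split tree fitted to the weighted data
  stumps[[m]] <- rpart(label ~ x1 + x2, data = d, weights = w, method = "class",
                       control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  pred <- to_pm1(predict(stumps[[m]], d, type = "class"))

  # Weighted error, clamped away from 0 and 1 to keep alpha finite
  err <- min(max(sum(w * (pred != y)), 1e-10), 1 - 1e-10)
  alpha[m] <- 0.5 * log((1 - err) / err)

  # Increase the weight of misclassified cases, then renormalize
  w <- w * exp(-alpha[m] * y * pred)
  w <- w / sum(w)
}

# Final classifier: sign of the alpha-weighted vote of all stumps
votes <- sapply(1:M, function(m)
  alpha[m] * to_pm1(predict(stumps[[m]], d, type = "class")))
boosted  <- sign(rowSums(votes))
accuracy <- mean(boosted == y)
```

Each stump alone can only split on one variable, so the diagonal boundary is out of its reach; the weighted vote of several stumps approximates it more closely.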

Replicating Adaboost using Rpart (part 2)

10:48

Boosting Extensions and Variants

14:06

Boosting Exercise

06:24

5 More Sections

About the Instructor