Find online courses made by experts from around the world.
Take your courses with you and learn anywhere, anytime.
Learn and practice real-world skills and achieve your goals.
Case Studies in Data Mining was originally taught as three separate online data mining courses. We examine three case studies which together present a broad-based tour of the basic and extended tasks of data mining in three different domains: (1) predicting algae blooms; (2) detecting fraudulent sales transactions; and (3) predicting stock market returns. The cumulative "hands-on" three-course, fifteen-session sequence showcases the use of Luis Torgo's amazingly useful "Data Mining with R" (DMwR) package and R software. Everything that you see on screen is included with the course: all of the R scripts; all of the data files and R objects used and/or referenced; as well as all of the R packages' documentation. You can be new to R software and/or to data mining and be successful in completing the course.

The first case study, Predicting Algae Blooms, provides instruction on the many useful, unique data mining functions contained in the R 'DMwR' package. For the algae blooms prediction case, we specifically look at the tasks of data preprocessing, exploratory data analysis, and predictive model construction. For individuals completely new to R, the first two sessions of the algae blooms case (almost 4 hours of video and materials) provide an accelerated introduction to the use of R and RStudio and to basic techniques for inputting and outputting data and text.

Detecting Fraudulent Transactions is the second extended data mining case study that showcases the DMwR (Data Mining with R) package. The case is specific but may be generalized to a common business problem: how does one sift through mountains of data (401,124 records, in this case) and identify suspicious data entries, or "outliers"? The case problem is very unstructured, and it walks through a wide variety of approaches and techniques in the attempt to discriminate the "normal", or "ok", transactions from the abnormal, suspicious, or "fraudulent" transactions. This case presents a large number of alternative modeling approaches, some of which are appropriate for supervised, some for unsupervised, and some for semi-supervised data scenarios.

The third extended case, Predicting Stock Market Returns, is a data mining case study addressing the domain of automatic stock trading systems. These four sessions address the tasks of building an automated stock trading system based on prediction models that utilize daily stock quote data. The goal is to predict future returns for the S&P 500 market index. The resulting predictions are used together with a trading strategy to make decisions about generating market buy and sell orders. The case examines prediction problems that stem from the time ordering among data observations, that is, from the use of time series data. It also exemplifies the difficulties involved in translating model predictions into decisions and actions in the context of 'real-world' business applications.
Not for you? No problem.
30-day money-back guarantee.
Forever yours.
Lifetime access.
Learn on the go.
Desktop, iOS and Android.
Get rewarded.
Certificate of completion.
Section 1: A Brief Introduction to R and RStudio using Scripts  

Lecture 1 
Course Overview
Preview

01:35  
Lecture 2 
Introduction to R for Data Mining

15:57  
Lecture 3  08:41  
A vector is a sequence of data elements of the same basic type. Members in a vector are officially called components, but we will just call them members here. Here is a vector containing the three numeric values 2, 3 and 5:
> c(2, 3, 5)
And here is a vector of logical values:
> c(TRUE, FALSE, TRUE, FALSE, FALSE)
A vector can also contain character strings:
> c("aa", "bb", "cc", "dd", "ee")
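A minimal sketch of these ideas (the variable names are illustrative):

```r
# Numeric, logical, and character vectors
nums  <- c(2, 3, 5)
flags <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
chars <- c("aa", "bb", "cc", "dd", "ee")

# Every vector has a length and a single basic type (mode)
length(nums)   # 3
mode(nums)     # "numeric"
mode(chars)    # "character"

# Mixing types silently coerces to the most general type
c(1, "a")      # both members become character: "1" "a"
```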

Lecture 4 
Data Structures: Vectors (part 2)

09:35  
Lecture 5  08:20  
The factor() function is used to encode a vector as a factor, R's type for categorical data.

Lecture 6 
Factors (part 2)

10:31  
Lecture 7  13:42  
seq() is the R function that will produce an enumerated vector. 
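A few representative calls (the arguments shown are standard seq() parameters):

```r
# seq() produces enumerated (regular) sequences
seq(1, 10)                 # 1 2 3 ... 10, same as 1:10
seq(1, 10, by = 2)         # 1 3 5 7 9
seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
```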

Lecture 8  07:53  
Given a vector of data one common task is to isolate particular entries or censor items that meet some criteria. 
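A minimal sketch of isolating and censoring entries (the data values are illustrative):

```r
x <- c(12, 5, 21, 8, 30, 3)

# Logical indexing keeps entries meeting a criterion
x[x > 10]      # 12 21 30

# which() returns the positions instead of the values
which(x > 10)  # 1 3 5

# Negative indices censor (drop) entries
x[-c(1, 2)]    # 21 8 30 3
```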

Lecture 9  08:22  
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. 
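For example:

```r
# A matrix is filled column-by-column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)    # 2 3
m[2, 3]   # element in row 2, column 3: 6
m[1, ]    # first row: 1 3 5
t(m)      # transpose: a 3 x 2 matrix
```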

Lecture 10  07:41  
An array in R can have one, two or more dimensions. It is simply a vector which is stored with additional attributes giving the dimensions (the dim attribute). 
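A small sketch of the idea:

```r
# A vector plus a dim attribute makes an array
a <- array(1:24, dim = c(2, 3, 4))
dim(a)       # 2 3 4
a[1, 2, 3]   # indexing across all three dimensions: 15
a[, , 1]     # a 2 x 3 "slice" (the first layer)
```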

Lecture 11  13:09  
A list is an R structure that may contain objects of any other type, including other lists. Many of the modeling functions (like t.test() for the t test or lm() for linear models) produce lists as their return values, but you can also construct one yourself: mylist <- list(a = 1:5, b = "Hi There", c = function(x) x * sin(x)) 

Lecture 12  09:38  
A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame". 
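For example (the column names are illustrative):

```r
# A data frame: a list of equal-length variables, class "data.frame"
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(9.5, 7.2, 8.8))
nrow(df)            # 3
names(df)           # "id" "name" "score"
df$score            # extract one variable as a vector
df[df$score > 8, ]  # row subsetting works like a matrix
```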

Lecture 13 
Data Structures: Dataframes (part 2)

10:11  
Lecture 14 
Creating New Functions

12:14  
Section 2: Inputting and Outputting Data and Text  
Lecture 15  08:20  
The scan() function in R reads data into a vector or list from the console or file. 
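A self-contained sketch; the text= argument stands in here for the console or a file:

```r
# scan() reads whitespace-separated values into a numeric vector
v <- scan(text = "1.5 2.5 3.5")
v   # 1.5 2.5 3.5

# what= sets the target type; sep= sets the delimiter
s <- scan(text = "a,b,c", what = character(), sep = ",")
s   # "a" "b" "c"
```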

Lecture 16 
Using the scan() Function for Input (part 2)

07:04  
Lecture 17  12:00  
The readline() function in R reads a line from the terminal (in interactive use). 

Lecture 18  12:45  
The readLines() function reads some or all of the text lines from a connection. 
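A self-contained sketch; a textConnection stands in for a file here, and readLines() works the same way on a file path:

```r
con <- textConnection("first line\nsecond line\nthird line")
lines <- readLines(con)
close(con)

length(lines)   # 3
lines[2]        # "second line"
```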

Lecture 19 
Example Program: powers.r

06:06  
Lecture 20 
Example Program: quartiles1.r

07:23  
Lecture 21 
Example Program: quad2b.r

08:23  
Lecture 22 
Reading and Writing Files (part 1)

05:48  
Lecture 23 
Reading and Writing Files (part 2)

13:42  
Section 3: Introduction to Predicting Algae Blooms  
Lecture 24  12:29  
This case study introduces you to some basic tasks of data mining: data preprocessing, exploratory data analysis, and predictive model construction. This initial case study addresses a relatively small problem by data mining standards: predicting the frequency of occurrence of several harmful algae in water samples. 

Lecture 25  14:34  
A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson. 
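A minimal sketch (the data values are illustrative):

```r
# hist() both draws and returns the binned distribution;
# plot = FALSE returns the bin breaks and counts without drawing
h <- hist(c(1, 2, 2, 3, 3, 3, 4, 9), plot = FALSE)
h$breaks       # bin boundaries chosen by R
h$counts       # observations per bin
sum(h$counts)  # 8: one per observation
```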

Lecture 26  13:15  
The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. 
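A small sketch of the summary behind the plot (the data are illustrative):

```r
x <- c(1, 3, 4, 5, 7, 9, 11, 13, 40)

# fivenum() gives the five-number summary a box plot draws
fivenum(x)             # 1 4 7 11 40

# boxplot.stats() also reports points beyond the whiskers
boxplot.stats(x)$out   # 40 lies beyond 1.5x the hinge spread
```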

Lecture 27  14:48  
A conditioning plot, also known as a coplot or subset plot, is a plot of two variables conditional on the value of a third variable (called the conditioning variable). Its purpose is to check the pairwise relationship between two variables conditional on that third variable. 

Lecture 28  16:03  
In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with a probable value based on other available information. Once all missing values have been imputed, the data set can then be analysed using standard techniques for complete data. 
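A minimal sketch of item imputation with a central measure (the data are illustrative):

```r
# Replace each missing value by the mean of the observed values
v <- c(2, 4, NA, 8, NA, 10)
v[is.na(v)] <- mean(v, na.rm = TRUE)
v   # 2 4 6 8 6 10; the mean of the observed values is 6
```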

Lecture 29 
Imputation: Removing Rows with Missing Values

11:09  
Lecture 30 
Imputation: Replace Missing Values with Central Measures

10:04  
Lecture 31 
Imputation: Replace Missing Values through Correlation

13:57  
Lecture 32  13:33  
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships. In particular, the package supports the creation of trellis graphs: graphs that display a variable, or the relationship between variables, conditioned on one or more other variables. 
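A minimal sketch using the built-in mtcars data (the variable choices are illustrative):

```r
library(lattice)  # ships with R as a recommended package

# A trellis graph: mpg vs weight, conditioned on cylinder count;
# one panel is drawn per level of the conditioning variable
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            xlab = "Weight", ylab = "Miles per gallon")
class(p)   # "trellis"; printing the object draws the panels
```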

Section 4: Obtaining Prediction Models  
Lecture 33 
Read in Data Files

14:18  
Lecture 34 
Creating Prediction Models

15:21  
Lecture 35  17:30  
In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). 

Lecture 36  15:42  
Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. 
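The course builds its trees with DMwR helpers; the same idea can be sketched with the rpart package (a recommended package that ships with R), using the built-in mtcars data:

```r
library(rpart)

# A regression tree for a continuous response (method = "anova"):
# splits are chosen to reduce the squared prediction error
fit <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# Predictions are the mean response within each leaf
preds <- predict(fit, mtcars)
mean((mtcars$mpg - preds)^2)  # training mean squared error
```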

Lecture 37 
Strategy for Pruning Trees

11:47  
Section 5: Evaluating and Selecting Models  
Lecture 38 
Alternative Model Evaluation Criteria

14:14  
Lecture 39  11:32  
Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset), and a dataset of unknown data (or first-seen data) against which the model is tested (the testing dataset). The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting and to give insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem). 

Lecture 40  10:30  
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random subsampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used. 
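A minimal sketch of building the k-fold partitions by hand (the sizes are illustrative):

```r
# Assign each of n observations a random fold label from 1..k
set.seed(42)
n <- 100; k <- 10
folds <- sample(rep(1:k, length.out = n))

# Each fold serves exactly once as the validation set
sizes <- table(folds)
sizes        # 10 observations in each of the 10 folds
sum(sizes)   # 100: every observation is used exactly once
```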

Lecture 41 
Setting up K-Fold Evaluation (part 2)

10:26  
Lecture 42 
Best Model (part 1)

10:10  
Lecture 43 
Best Model (part 2)

09:52  
Lecture 44 
Finish Evaluating Models
Preview

11:09  
Lecture 45 
Predicting from the Models

10:16  
Lecture 46 
Comparing the Predictions

10:16  
Section 6: Examine the Data in the Fraudulent Transactions Case Study  
Lecture 47 
Exercise Solution from Evaluating and Selecting Models

03:59  
Lecture 48  03:04  
This case study addresses an instantiation of the general problem of detecting unusual observations of a phenomenon, that is, finding rare and quite different observations. The driving application has to do with transactions of a set of products that are reported by the salespeople of some company. The goal is to find "strange" transaction reports that may indicate fraud attempts by some of the salespeople. 

Lecture 49 
Prelude to Exploring the Data

04:51  
Lecture 50  11:21  
Data visualization is the presentation of data in a pictorial or graphical format. For centuries, people have depended on visual representations such as charts and maps to understand information more easily and quickly. 

Lecture 51 
Exploring the Data Continued (part 1)

13:36  
Lecture 52 
Exploring the Data Continued (part 2)

13:24  
Lecture 53 
Exploring the Data Continued (part 3)

13:15  
Lecture 54 
Dealing with Missing Data (part 1)

10:13  
Lecture 55 
Dealing with Missing Data (part 2)

07:23  
Lecture 56 
Dealing with Missing Data (part 3)

10:46  
Section 7: Pre-Processing the Data to Apply the Methodology  
Lecture 57 
Review the Data and the Focus of the Fraudulent Transactions Case

12:50  
Lecture 58 
Pre-Processing the Data (part 1)

10:27  
Lecture 59  10:39  
Here we explain the whys and hows of creating a list structure containing the unit prices by product. 

Lecture 60 
Pre-Processing the Data (part 3)

12:28  
Lecture 61  11:55  
In supervised learning, the categories to which the data are assigned are known before computation, and they are used to 'learn' the parameters that are really significant for those clusters. In unsupervised learning, the data are assigned to segments without the clusters being known in advance. 

Lecture 62  06:51  
Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. 

Lecture 63  07:31  
In pattern recognition and information retrieval with binary classification, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. 
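A minimal sketch of both measures from a binary confusion table (the counts are illustrative):

```r
tp <- 30   # fraudulent cases flagged as fraudulent (true positives)
fp <- 10   # normal cases wrongly flagged (false positives)
fn <- 20   # fraudulent cases missed (false negatives)

precision <- tp / (tp + fp)   # 0.75: fraction of flags that are right
recall    <- tp / (tp + fn)   # 0.60: fraction of frauds that are caught
```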

Lecture 64  12:11  
Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance. Both charts consist of a lift curve and a baseline. 
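A minimal sketch of the lift ratio (the rates are illustrative):

```r
# Lift: how much better the model's top-ranked cases are than chance
overall_rate    <- 0.05   # 5% of all transactions are fraudulent
top_decile_rate <- 0.20   # 20% of the model's top 10% are fraudulent

lift <- top_decile_rate / overall_rate
lift   # 4: the model is 4x better than random in its top decile
```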

Section 8: Methodology to Find Outliers (Fraudulent Transactions)  
Lecture 65 
Exercise from Previous Session

01:16  
Lecture 66 
Review Precision and Recall

10:05  
Lecture 67 
Review Lift Charts and Precision Recall Curves

08:33  
Lecture 68 
Cumulative Recall Chart
Preview

10:08  
Lecture 69 
Creating More Functions for the Experimental Methodology

07:04  
Lecture 70 
Experimental Methodology to find Outliers (part 1)

10:36  
Lecture 71  11:27  
An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe 1999). Usually, the presence of an outlier indicates some sort of problem. This can be a case which does not fit the model under study, or an error in measurement. Outliers are often easy to spot in histograms. 
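A minimal sketch of spotting an outlier with the box plot rule (the data are simulated):

```r
# 50 typical values plus one extreme point
set.seed(1)
x <- c(rnorm(50), 8)

# boxplot.stats() flags points beyond 1.5x the hinge spread
out <- boxplot.stats(x)$out
8 %in% out   # TRUE: the extreme point is flagged
```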

Lecture 72 
Experimental Methodology to find Outliers (part 3)

09:12  
Lecture 73 
Experimental Methodology to find Outliers (part 4)

10:18  
Lecture 74 
Experimental Methodology to find Outliers (part 5)

06:57  
Section 9: The Data Mining Tasks to Find the Fraudulent Transactions  
Lecture 75 
Review of Fraud Case (part 1)

10:50  
Lecture 76 
Review of Fraud Case (part 2)

11:08  
Lecture 77 
Review of Fraud Case (part 3)

10:33  
Lecture 78 
Baseline Boxplot Rule

07:47  
Lecture 79  11:38  
This session presents a state-of-the-art outlier ranking method. The main idea of this system is to obtain an outlyingness score for each case by estimating its degree of isolation with respect to its local neighborhood. The method is based on the notion of the local density of the observations: cases in regions with very low density are considered outliers. The estimates of the density are obtained using the distances between cases. 

Lecture 80 
Plotting Everything

08:22  
Lecture 81  11:19  
From a theoretical point of view, supervised and unsupervised learning differ only in the causal structure of the model. In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs. In other words, the inputs are assumed to be at the beginning and outputs at the end of the causal chain. The models can include mediating variables between the inputs and outputs. In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain. In practice, models for supervised learning often leave the probability for inputs undefined. This model is not needed as long as the inputs are available, but if some of the input values are missing, it is not possible to infer anything about the outputs. If the inputs are also modelled, then missing inputs cause no problem since they can be considered latent variables as in unsupervised learning. 

Lecture 82 
SMOTE and Naive Bayes (part 1)

10:49  
Lecture 83 
SMOTE and Naive Bayes (part 2)

10:44  
Section 10: Sidebar on Boosting  
Lecture 84 
Introduction to Boosting (from Rattle course)

08:34  
Lecture 85  09:01  
Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones. 

Lecture 86  11:10  
Recursive partitioning is a statistical method for multivariable analysis. Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into subpopulations based on several dichotomous independent variables. 

Lecture 87  10:48  
AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the prestigious Gödel Prize in 2003 for their work. 

Lecture 88 
Boosting Extensions and Variants

14:06  
Lecture 89 
Boosting Exercise

06:24  
Section 11: Introduction to Stock Market Prediction Case Study 
Dr. Geoffrey Hubona held full-time tenure-track, and tenured, assistant and associate professor faculty positions at 3 major state universities in the Eastern United States from 1993 to 2010. In these positions, he taught dozens of various statistics, business information systems, and computer science courses to undergraduate, master's and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL (1993); an MA in Economics (1990), also from USF; an MBA in Finance (1979) from George Mason University in Fairfax, VA; and a BA in Psychology (1972) from the University of Virginia in Charlottesville, VA. He was a full-time assistant professor at the University of Maryland Baltimore County (1993-1996) in Catonsville, MD; a tenured associate professor in the department of Information Systems in the Business College at Virginia Commonwealth University (1996-2001) in Richmond, VA; and an associate professor in the CIS department of the Robinson College of Business at Georgia State University (2001-2010). He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-Present), online educational organizations that teach research methods and quantitative analysis techniques. These research methods techniques include linear and nonlinear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling. Dr. Hubona is an expert in the analytical, open-source R software suite and in various PLS path modeling software packages, including SmartPLS. He has published dozens of research articles that explain and use these techniques for the analysis of data, and, with software co-development partner Dean Lim, has created a popular cloud-based PLS software application, PLSGUI.