More Data Mining with R
4.0 (48 ratings)
Instead of using a simple lifetime average, Udemy calculates a course's star rating by considering a number of different factors such as the number of ratings, the age of ratings, and the likelihood of fraudulent ratings.
1,611 students enrolled

How to perform market basket analysis, analyze social networks, mine Twitter data, text, and time series data.
Last updated 1/2016
  • 10.5 hours on-demand video
  • 1 Supplemental Resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Understand the conceptual foundations of association analysis and perform market basket analyses.
  • Be able to create visualizations of social (and other) networks using the iGraph package.
  • Understand how to examine and mine social network data to understand all of the implicit relationships.
  • Mine text data to create word association visualizations, term documents with word frequency counts and associations, and create word clouds.
  • Learn how to process text and string data, including the use of 'regular expressions'.
  • Extract prototypical information about cycles from time series data.
Requirements
  • Students will need to install the no-cost R console software and the no-cost RStudio IDE suite (instructions are provided).

More Data Mining with R presents a comprehensive overview of contemporary data mining techniques. It is the logical follow-on to the Udemy course Data Mining with R: Go from Beginner to Advanced, although the two courses do not need to be taken in sequence. Both courses examine and explain a number of data mining methods and techniques, using concrete data mining modeling examples, extended case studies, and real data sets.

The preceding course, Data Mining with R: Go from Beginner to Advanced, focuses on: (1) linear, logistic and local polynomial regression; (2) decision, classification and regression trees (CART); (3) random forests; and (4) cluster analysis techniques.

This course, More Data Mining with R, presents detailed instruction and plentiful hands-on examples about: (1) association analysis (or market basket analysis) and creating, mining and interpreting association rules using several case examples; (2) network analysis, including the versatile iGraph visualization capabilities, as well as social network data mining analysis cases (marriage and power; friendship links); (3) text mining using Twitter data and word clouds; (4) text and string manipulation, including the use of 'regular expressions'; and (5) time series data mining and analysis, including an extended case study forecasting house price indices in Canberra, Australia.

Who is the target audience?
  • This course would be useful for undergraduate and graduate students wishing to broaden their skills in data mining.
  • This course would be helpful to analytics professionals who wish to augment their data mining skills toolset.
  • Anyone who is interested in learning about association analysis (also called 'market basket analysis'), analyzing and mining data from social networks, text (such as Twitter) data, or time series data should take this course.
Curriculum For This Course
67 Lectures
Introduction to R and to Data Mining
9 Lectures 01:17:58

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Course Preliminaries

Data Input and Output (part 2)

Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It is not owned by any one field, but rather finds interpretation across many (e.g. it is viewed as a modern branch of descriptive statistics by some, but also as a grounded theory development tool by others). It involves the creation and study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".

A primary goal of data visualization is to communicate information clearly and efficiently to users via the statistical graphics, plots, information graphics, tables, and charts selected. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measure of a variable, while charts of various types are used to show patterns or relationships in the data for one or more variables.

More R Scripting and Visualizations (part 1)

More R Scripting and Visualizations (part 2)

More Input and Output (part 1)

More Input and Output (part 2)

Homework Exercise: Execute Second Set of Scripts on your Own
Association Analysis (part 1)
8 Lectures 01:05:02

Affinity analysis, a form of association analysis, is a data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.
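
To make the idea of co-occurrence concrete, here is a minimal Python sketch that counts how often each pair of items appears in the same basket. The baskets and the function name `pair_counts` are made up for illustration and are not taken from the course materials, which use R.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs with high counts are candidates for cross-selling; association rules refine these raw counts with measures such as support, confidence and lift.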

Introduction to Association Analysis (part 1)

Introduction to Association Analysis (part 2)

The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts—from the proportions of first-class passengers to the 'women and children first' policy, and the fact that that policy was not entirely successful in saving the women and children in the third class—are reflected in the survival rates for various classes of passenger.

These data were originally collected by the British Board of Trade in their investigation of the sinking. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.

Preparing the Titanic Dataset

Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, a rule such as {onions, potatoes} ⇒ {hamburger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.
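
Two common measures of interestingness are support (the share of transactions containing all items in a rule) and confidence (the share of transactions containing the antecedent that also contain the consequent). The Python sketch below computes both for the onions-and-potatoes rule on a small invented data set. The transactions and function names are assumptions for illustration only.

```python
# Hypothetical point-of-sale transactions.
transactions = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "hamburger meat", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "hamburger meat"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of the antecedent."""
    whole_rule = set(antecedent) | set(consequent)
    return support(whole_rule, transactions) / support(antecedent, transactions)

antecedent = {"onions", "potatoes"}
consequent = {"hamburger meat"}

rule_support = support(antecedent | consequent, transactions)
rule_confidence = confidence(antecedent, consequent, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here two of the five transactions contain all three items, and two of the three transactions with onions and potatoes also contain hamburger meat.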

Preview 07:10

Rule Mining with Titanic Dataset (part 2)

Interpreting Rules

Association rule mining is one of the most popular data mining methods. However, mining association rules often results in a very large number of found rules, leaving the analyst with the task of going through all the rules to discover the interesting ones. Sifting manually through large sets of rules is time-consuming and strenuous. Visualization has a long history of making large amounts of data more accessible using techniques like selecting and zooming. However, most association rule visualization techniques still fall short when it comes to a large number of rules. In this paper we present a new interactive visualization technique which lets the user navigate through a hierarchy of groups of association rules. We demonstrate how this new visualization technique can be used to analyze large sets of association rules with examples from our implementation in the R-package arulesViz.

Visualizing Association Rules (part 1)

Visualizing Association Rules (part 2)
Association Analysis: Online Radio and Predicting Income
7 Lectures 01:07:24

In data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.

For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%).
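
The arithmetic in this example can be written as a one-line function. The sketch below is in Python, and the function name `lift` is chosen here for illustration.

```python
def lift(target_rate, average_rate):
    """Ratio of the response rate in the target segment to the
    average response rate in the whole population."""
    if average_rate <= 0:
        raise ValueError("average_rate must be positive")
    return target_rate / average_rate

# Segment responds at 20%, population at 5%.
segment_lift = lift(0.20, 0.05)
print(round(segment_lift, 2))
```

A lift of 1 means the segment responds no better than the population as a whole; values above 1 indicate an enhanced response.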

Typically, the modeller seeks to divide the population into quantiles, and rank the quantiles by lift. Organizations can then consider each quantile, and by weighing the predicted response rate (and associated financial benefit) against the cost, they can decide whether to market to that quantile or not.

Lift is analogous to information retrieval's average precision metric, if one treats the precision (fraction of the positives that are true positives) as the target response probability.

The lift curve can also be considered a variation on the receiver operating characteristic (ROC) curve, and is also known in econometrics as the Lorenz or power curve.

The difference between the lifts observed on two different subgroups is called the uplift. The subtraction of two lift curves forms the uplift curve, which is a metric used in uplift modelling.

It is important to note that in general marketing practice the term Lift is also defined as the difference in response rate between the treatment and control groups, indicating the causal impact of a marketing program (versus not having it as in the control group). As a result, "no lift" often means there is no statistically significant effect of the program. On top of this, uplift modelling is a predictive modeling technique to improve (up) lift over control.

Association Rules and Lift Reviewed

Association Rules Reviewed (part 2)

Online Radio Predictor Example (part 2)

Predicting Income Example (part 1)

Predicting Income Example (part 2)

Predicting Income Example (part 3)
Social Network Analysis: iGraph Visualizations
8 Lectures 01:18:03

igraph is a library collection for creating and manipulating graphs and analyzing networks. It is written in C/C++ and also exists as Python and R packages. The software is widely used in academic research in network science and related fields.

Introduction to iGraph

iGraph Visualization Examples (part 1)

iGraph Measurement Examples (part 3)

iGraph Measurement Examples (part 4)

iGraph Visualization Examples (part 5)

iGraph Visualization Examples (part 7)
Social Network Analysis (part 2)
6 Lectures 56:31

The term 'social network' is increasingly used in the mainstream, where it is inextricably tied to notions of influence. Mark Granovetter's articles on "The Strength of Weak Ties" (Granovetter 1973) and "Threshold Models of Collective Behavior" (Granovetter 1978) were probably the first to ignite public fascination with social networks and the spread of ideas, but Malcolm Gladwell's (2000) best-selling The Tipping Point is surely responsible for the most recent public fascination with social networks and the spread of social phenomena. Gladwell writes that change occurs when sociological phenomena (ideas, products, behaviors) reach critical mass; in other words, these phenomena spread through society like diseases. This idea has proven so attractive that people now use the expression "that video went viral" to describe popular YouTube clips. In Gladwell's "framework," the success or failure of any social epidemic depends on the configuration of the network of social ties, which are analogous to disease vectors. He argues that a relatively small number of people, known as "connectors, mavens, and salesmen," hold the keys to spreading a good idea to a large enough number of people so it 'sticks.' The implication is that with the right combination of these few people on your side, you wield major social influence.

Visual Network Basics Revisited

Visual Network: Marriage and Power in 15th Century Florence (part 1)

Example: Friendship Network (part 1)

Example: Friendship Network (part 3)
Text Mining Twitter Data
10 Lectures 01:21:54

The data to analyze is the Twitter text data of @RDataMining used in the Text Mining example, and it can be downloaded as the file “termDocMatrix.rdata” from the Data webpage. Putting it in a general scenario of social networks, the terms can be taken as people and the tweets as groups on LinkedIn, and the term-document matrix can then be taken as the group membership of people. We will build a network of terms based on their co-occurrence in the same tweets, which is similar to a network of people based on their group memberships.
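
The step from a term-document matrix to a network of terms can be sketched as follows. This Python example uses a tiny invented binary matrix (1 if a term appears in a tweet), not the actual @RDataMining data, and the function name `cooccurrence` is hypothetical. The weight of the link between two terms is the number of tweets that contain both.

```python
# Hypothetical terms and a binary term-document matrix:
# rows are terms, columns are tweets.
terms = ["r", "mining", "data", "network"]
term_doc = [
    [1, 1, 0, 1],  # r
    [1, 0, 1, 0],  # mining
    [1, 1, 1, 0],  # data
    [0, 0, 0, 1],  # network
]

def cooccurrence(term_doc):
    """Term-by-term adjacency matrix: entry [i][j] is the number of
    documents containing both term i and term j (diagonal left at 0)."""
    n = len(term_doc)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                adj[i][j] = sum(a * b for a, b in zip(term_doc[i], term_doc[j]))
    return adj

adj = cooccurrence(term_doc)

# Weighted edge list of the term network (each pair listed once).
edges = [
    (terms[i], terms[j], adj[i][j])
    for i in range(len(terms))
    for j in range(i + 1, len(terms))
    if adj[i][j] > 0
]
print(edges)
```

This is the same computation as multiplying the matrix by its transpose and zeroing the diagonal; the resulting adjacency matrix is what a graph library would take as input.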

Preprocessing Twitter Data

Transforming Twitter Data

Stemming and Frequency Counts

Building a Text Term Document

Frequent Terms and Associations

K-Means and K-Medoids Clustering

Using Lists for Text Processing (part 1)

Using Lists for Text Processing (part 2)

Using Lists for Text Processing (part 3)
Text (String) Manipulation
8 Lectures 01:33:26
Introduction to String Manipulation (slides, part 1)

Introduction to String Manipulation (slides, part 2)

Text and String Manipulation Script Demos (part 1)

Text and String Manipulation Demos (part 2)

Text and String Manipulation Demos (part 3)

A 'regular expression' is a pattern that describes a set of strings. Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions, enabled with perl = TRUE. There is also fixed = TRUE, which treats the pattern as a literal string rather than a regular expression.
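
The difference between a pattern and a literal string can be seen in a short example. The sketch below uses Python's `re` module, whose syntax is Perl-style, so it is only a rough analogue of R's perl = TRUE mode; `re.escape` plays a role similar to fixed = TRUE. The sample text is invented.

```python
import re

text = "Price: $5.00 (was $7.50) at 5x00 stores"

# As a pattern, '.' matches any single character.
pattern_hits = re.findall(r"5.00", text)

# Escaping the pattern makes every character literal.
literal_hits = re.findall(re.escape("5.00"), text)

# A pattern with a capture group: dollar amounts with two decimals.
amounts = re.findall(r"\$(\d+\.\d{2})", text)

print(pattern_hits)   # ['5.00', '5x00']
print(literal_hits)   # ['5.00']
print(amounts)        # ['5.00', '7.50']
```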

Regular Expression Basics (slides and script)

More Advanced Regular Expression Capabilities (slides and script)
Time Series Data Mining
8 Lectures 01:14:16

Time series decomposition breaks a time series into trend, seasonal, cyclical and irregular components.
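
A minimal Python sketch of a classical additive decomposition is shown below, assuming the series can be written as trend + seasonal + irregular. The trend is estimated with a centered moving average, so any cyclical movement is absorbed into the trend rather than separated out. The function names and the synthetic quarterly series are illustrative assumptions, not the course's R code.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the window does not fit."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 1:
            trend[t] = sum(window) / period
        else:
            # Even period: the two end points get half weight.
            total = sum(window) - 0.5 * (window[0] + window[-1])
            trend[t] = total / period
    return trend

def decompose_additive(series, period):
    """Split series into trend, seasonal and remainder components."""
    trend = centered_moving_average(series, period)
    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(series, trend)):
        if m is not None:
            buckets[t % period].append(y - m)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period
    pattern_est = [m - centre for m in means]  # seasonal effects sum to zero
    seasonal = [pattern_est[t % period] for t in range(len(series))]
    remainder = [
        None if m is None else y - m - s
        for y, m, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, remainder

# Synthetic quarterly data: linear trend plus a fixed seasonal pattern.
pattern = [2, -1, -3, 2]
series = [10 + 0.5 * t + pattern[t % 4] for t in range(24)]

trend, seasonal, remainder = decompose_additive(series, period=4)
print(seasonal[:4])
```

Because the synthetic series has no noise, the estimated seasonal pattern matches the one used to build it and the remainder is zero wherever the trend is defined.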

Maine Unemployment Data (part 1)

Maine Unemployment Data (part 2)

Airline Travel Example

Electric Consumption in Australia (part 1)

Electric Consumption in Australia (part 2)

Time Series Clustering (part 1)

Time Series Classification
Case Study: Forecasting House Price Indices in Canberra, Australia
3 Lectures 35:23
Forecasting House Prices: Exploring the Data (part 1)

Forecasting House Prices: Exploring the Data (part 2)

About the Instructor
Geoffrey Hubona, Ph.D.
4.0 Average rating
1,406 Reviews
11,968 Students
28 Courses
Professor of Information Systems

Dr. Geoffrey Hubona held full-time tenure-track, and tenured, assistant and associate professor faculty positions at 3 major state universities in the Eastern United States from 1993-2010. In these positions, he taught dozens of various statistics, business information systems, and computer science courses to undergraduate, master's and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL; an MA in Economics, also from USF; an MBA in Finance from George Mason University in Fairfax, VA; and a BA in Psychology from the University of Virginia in Charlottesville, VA. He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-Present), online educational organizations that teach research methods and quantitative analysis techniques. These research methods techniques include linear and non-linear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling.