More Data Mining with R

How to perform market basket analysis, analyze social networks, and mine Twitter, text, and time series data.
3.8 (34 ratings)
1,285 students enrolled
$19
$50
62% off
Take This Course
  • Lectures 67
  • Length 10.5 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

How taking a course works

Discover

Find online courses made by experts from around the world.

Learn

Take your courses with you and learn anywhere, anytime.

Master

Learn and practice real-world skills and achieve your goals.

About This Course

Published 8/2015 English

Course Description

More Data Mining with R presents a comprehensive overview of contemporary data mining techniques. It is the logical follow-on to the preceding Udemy course Data Mining with R: Go from Beginner to Advanced, although it is not necessary to take the two courses in sequence. Both courses examine and explain a range of data mining methods and techniques, using concrete data mining modeling examples, extended case studies, and real data sets.

Whereas the preceding course focuses on: (1) linear, logistic, and local polynomial regression; (2) decision, classification, and regression trees (CART); (3) random forests; and (4) cluster analysis techniques, this course presents detailed instruction and plentiful "hands-on" examples covering: (1) association analysis (or market basket analysis), and creating, mining, and interpreting association rules using several case examples; (2) network analysis, including the versatile iGraph visualization capabilities, as well as social network data mining cases (marriage and power; friendship links); (3) text mining using Twitter data and word clouds; (4) text and string manipulation, including the use of 'regular expressions'; and (5) time series data mining and analysis, including an extended case study forecasting house price indices in Canberra, Australia.

What are the requirements?

  • Students will need to install the no-cost R console software and the no-cost RStudio IDE suite (instructions are provided).

What am I going to get from this course?

  • Understand the conceptual foundations of association analysis and perform market basket analyses.
  • Be able to create visualizations of social (and other) networks using the iGraph package.
  • Understand how to examine and mine social network data to uncover implicit relationships.
  • Mine text data to create word association visualizations, term documents with word frequency counts and associations, and create word clouds.
  • Learn how to process text and string data, including the use of 'regular expressions'.
  • Extract prototypical information about cycles from time series data.

What is the target audience?

  • This course would be useful for undergraduate and graduate students wishing to broaden their skills in data mining.
  • This course would be helpful to analytics professionals who wish to augment their data mining skills toolset.
  • Anyone who is interested in learning about association analysis (also called 'market basket analysis'), analyzing and mining data from social networks, text (such as Twitter) data, or time series data should take this course.

What you get with this course?

Not for you? No problem.
30 day money back guarantee.

Forever yours.
Lifetime access.

Learn on the go.
Desktop, iOS and Android.

Get rewarded.
Certificate of completion.

Curriculum

Section 1: Introduction to R and to Data Mining
Welcome to More Data Mining with R!
Preview
01:29
08:15

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data Input and Output (part 1)
Preview
11:55
Data Input and Output (part 2)
13:22
12:50

Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It is not owned by any one field, but rather finds interpretation across many (e.g., it is viewed by some as a modern branch of descriptive statistics, and by others as a grounded-theory development tool). It involves the creation and study of visual representations of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".

A primary goal of data visualization is to communicate information clearly and efficiently through well-chosen statistical graphics, plots, information graphics, tables, and charts. Effective visualization helps users analyze and reason about data and evidence; it makes complex data more accessible, understandable, and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design of the graphic (i.e., showing comparisons or showing causality) should follow the task. Tables are generally used where users will look up a specific measure of a variable, while charts of various types are used to show patterns or relationships in the data for one or more variables.
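As a minimal base-R sketch of this task-driven choice, using the built-in `pressure` dataset: a table supports look-up of exact values, while a chart reveals the overall pattern.

```r
# Built-in 'pressure' dataset: vapor pressure of mercury vs. temperature.
# A table supports look-up of exact values...
head(pressure)

# ...while a chart shows the non-linear relationship at a glance.
plot(pressure$temperature, pressure$pressure, type = "b",
     xlab = "Temperature (deg C)", ylab = "Pressure (mm Hg)",
     main = "Vapor pressure of mercury")
```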

More R Scripting and Visualizations (part 2)
06:29
More Input and Output (part 1)
10:52
More Input and Output (part 2)
11:39
Homework Exercise: Execute Second Set of Scripts on your Own
01:07
Section 2: Association Analysis (part 1)
04:58

Affinity analysis, a form of association analysis, is a data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.
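A minimal sketch of a market basket analysis with the `arules` package (assumed installed), using its bundled `Groceries` point-of-sale transactions; the support and confidence thresholds here are illustrative choices, not values from the course:

```r
library(arules)
data("Groceries")  # real grocery point-of-sale transactions

# Mine rules with minimum support 1% and minimum confidence 50%
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))

# Show the rules with the highest lift, e.g. for cross-selling decisions
inspect(head(sort(rules, by = "lift"), 3))
```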

Introduction to Association Analysis (part 2)
09:53
09:37

The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts are reflected in the survival rates for the various classes of passenger: the proportions of first-class passengers, the 'women and children first' policy, and the fact that this policy was not entirely successful in saving the women and children in third class.

These data were originally collected by the British Board of Trade in their investigation of the sinking. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.
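These counts are available as R's built-in `Titanic` table, and the survival patterns can be mined directly from it. A sketch following the widely circulated RDataMining-style example (`arules` assumed installed; thresholds are illustrative):

```r
library(arules)

# Expand the built-in 4-way count table into one row per person
df <- as.data.frame(Titanic)
titanic.raw <- df[rep(seq_len(nrow(df)), df$Freq), 1:4]

# Mine only rules whose right-hand side predicts survival
rules <- apriori(as(titanic.raw, "transactions"),
                 parameter  = list(supp = 0.005, conf = 0.8),
                 appearance = list(rhs = c("Survived=No", "Survived=Yes"),
                                   default = "lhs"))
inspect(sort(rules, by = "lift"))
```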

07:10

Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {hamburger meat} found in the sales data of a supermarket would indicate that customers who buy onions and potatoes together are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. Beyond market basket analysis, association rules are employed today in many application areas, including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.

Rule Mining with Titanic Dataset (part 2)
05:16
Interpreting Rules
05:17
11:12

Association rule mining is one of the most popular data mining methods. However, mining association rules often results in a very large number of rules, leaving the analyst with the task of going through them all to discover the interesting ones. Sifting manually through large sets of rules is time-consuming and strenuous. Visualization has a long history of making large amounts of data more accessible through techniques like selection and zooming, yet most association rule visualization techniques still fall short when the number of rules is large. The arulesViz authors present an interactive visualization technique that lets the user navigate through a hierarchy of groups of association rules, and demonstrate how it can be used to analyze large sets of rules, with examples from their implementation in the R package arulesViz.
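A sketch of the visualizations described above, assuming `arules` and `arulesViz` are installed; the mined rule set is the illustrative Groceries example, not data from the lecture:

```r
library(arules)
library(arulesViz)

data("Groceries")
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

plot(rules)                      # scatter plot: support vs. confidence, shaded by lift
plot(rules, method = "grouped")  # grouped-matrix view, suited to larger rule sets
plot(rules, method = "graph")    # graph-based view for a small number of rules
```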

Visualizing Association Rules (part 2)
11:39
Section 3: Association Analysis: Online Radio and Predicting Income
11:01

In data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.

For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%).
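The arithmetic in this example is simply the ratio of the two response rates:

```r
# Lift = target (segment) response rate / population average response rate
population_rate <- 0.05   # 5% average response
segment_rate    <- 0.20   # 20% response within the model's segment

lift <- segment_rate / population_rate
lift  # 4: the segment responds 4 times better than average
```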

Typically, the modeller seeks to divide the population into quantiles, and rank the quantiles by lift. Organizations can then consider each quantile, and by weighing the predicted response rate (and associated financial benefit) against the cost, they can decide whether to market to that quantile or not.

Lift is analogous to information retrieval's average precision metric, if one treats the precision (the fraction of predicted positives that are true positives) as the target response probability.

The lift curve can also be considered a variation on the receiver operating characteristic (ROC) curve, and is also known in econometrics as the Lorenz or power curve.

The difference between the lifts observed on two different subgroups is called the uplift. The subtraction of two lift curves forms the uplift curve, which is a metric used in uplift modelling.

Note that in general marketing practice the term lift is also defined as the difference in response rate between the treatment and control groups, indicating the causal impact of a marketing program (versus not running it, as in the control group). As a result, "no lift" often means the program has no statistically significant effect. Relatedly, uplift modelling is a predictive modeling technique for improving (up) lift over the control.

Association Rules Reviewed (part 2)
08:07
Online Radio Predictor Example (part 1)
Preview
09:06
Online Radio Predictor Example (part 2)
11:17
Predicting Income Example (part 1)
10:59
Predicting Income Example (part 2)
07:13
Predicting Income Example (part 3)
09:41
Section 4: Social Network Analysis: iGraph Visualizations
09:28

igraph is a library collection for creating and manipulating graphs and analyzing networks. It is written in C/C++ and also exists as Python and R packages. The software is widely used in academic research in network science and related fields.
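A minimal igraph sketch in R (package assumed installed; the names are made up for illustration): build a small undirected graph, compute basic measures, and plot it.

```r
library(igraph)

# A small undirected friendship graph (hypothetical people)
g <- graph_from_literal(Ann - Bob, Bob - Carl, Carl - Ann, Carl - Dee)

degree(g)       # ties per person; Carl has the most (3)
betweenness(g)  # Carl also bridges Dee to the rest of the network
plot(g, vertex.color = "lightblue")
```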

iGraph Visualization Examples (part 1)
10:42
iGraph Visualization Examples (part 2)
Preview
08:16
iGraph Measurement Examples (part 3)
11:28
iGraph Measurement Examples (part 4)
08:16
iGraph Visualization Examples (part 5)
11:25
iGraph Visualization Examples (part 6)
Preview
07:12
iGraph Visualization Examples (part 7)
11:16
Section 5: Social Network Analysis (part 2)
06:36

The term 'social network' is increasingly used in the mainstream, where it is inextricably tied to notions of influence. Mark Granovetter's articles on "The Strength of Weak Ties" (Granovetter 1973) and "Threshold Models of Collective Behavior" (Granovetter 1978) were probably the first to ignite public fascination with social networks and the spread of ideas, but Malcolm Gladwell's (2000) best-selling The Tipping Point is surely responsible for the most recent wave of public fascination with social networks and the spread of social phenomena. Gladwell writes that change occurs when sociological phenomena (ideas, products, behaviors) reach critical mass; in other words, these phenomena spread through society like diseases. This idea has proven so attractive that people now use the expression "that video went viral" to describe popular YouTube clips. In Gladwell's framework, the success or failure of any social epidemic depends on the configuration of the network of social ties, which are analogous to disease vectors. He argues that a relatively small number of people, known as "connectors, mavens, and salesmen," hold the keys to spreading a good idea to a large enough number of people that it 'sticks.' The implication is that with the right combination of these few people on your side, you wield major social influence.

Visual Network: Marriage and Power in 15th Century Florence (part 1)
09:04
Visual Network: Marriage and Power in 15th Century Florence (part 2)
Preview
04:32
Example: Friendship Network (part 1)
12:00
Example: Friendship Network (part 2)
Preview
10:15
Example: Friendship Network (part 3)
14:04
Section 6: Text Mining Twitter Data
06:25

The data to analyze are the Twitter text data of @RDataMining used in the text mining example; they can be downloaded as the file "termDocMatrix.rdata" from the Data webpage. Putting this in a general social network scenario, the terms can be taken as people and the tweets as groups on LinkedIn, and the term-document matrix can then be taken as the group membership of people. We will build a network of terms based on their co-occurrence in the same tweets, which is similar to a network of people based on their group memberships.
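The co-occurrence construction above can be sketched as follows; the tiny 0/1 matrix here is a made-up stand-in for the real term-document matrix loaded from termDocMatrix.rdata (`igraph` assumed installed):

```r
library(igraph)

# Toy stand-in for the lecture's termDocMatrix (terms x tweets, 0/1 entries)
tdm <- matrix(c(1, 1, 0,
                1, 0, 1,
                0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("r", "mining", "data"), NULL))

# Term-term co-occurrence: terms appearing in the same tweet become linked
termM <- tdm %*% t(tdm)

g <- graph_from_adjacency_matrix(termM, weighted = TRUE,
                                 mode = "undirected", diag = FALSE)
plot(g, edge.width = E(g)$weight)
```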

Transforming Twitter Data
07:10
Stemming and Frequency Counts
11:09
Building a Text Term Document
06:11
Frequent Terms and Associations
06:03
Word Cloud and Word Clustering
Preview
08:39
K-Means and K-Medoids Clustering
10:36
Using Lists for Text Processing (part 1)
10:07
Using Lists for Text Processing (part 2)
07:54
Using Lists for Text Processing (part 3)
07:40
Section 7: Text (String) Manipulation
Introduction to String Manipulation (slides, part 1)
09:10
Introduction to String Manipulation (slides, part 2)
09:10
Text and String Manipulation Script Demos (part 1)
09:30
Text and String Manipulation Demos (part 2)
20:34
Text and String Manipulation Demos (part 3)
11:42
Text and String Manipulation Demos (part 4)
Preview
06:06
11:32

A 'regular expression' is a pattern that describes a set of strings. Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions used when perl = TRUE. There is also fixed = TRUE, which can be considered to use a literal regular expression.
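A few base-R sketches contrasting the default extended regular expressions with fixed = TRUE and perl = TRUE:

```r
x <- c("apple pie", "Apple", "pineapple")

grepl("^apple", x)               # extended regex (default): TRUE FALSE FALSE
grepl("apple", x, fixed = TRUE)  # literal string match:     TRUE FALSE TRUE

# Backreferences work with sub()/gsub()
sub("(\\w+) (\\w+)", "\\2 \\1", "data mining")  # "mining data"

# perl = TRUE enables Perl-style extensions such as negative lookahead
grepl("apple(?! pie)", x, perl = TRUE)          # TRUE for "pineapple" only
```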

More Advanced Regular Expression Capabilities (slides and script)
15:42
Section 8: Time Series Data Mining
10:11

Time series decomposition breaks a time series into trend, seasonal, cyclical, and irregular components.
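A base-R sketch using the built-in `AirPassengers` series (monthly airline passengers, 1949-1960), not the course's own data:

```r
# decompose() splits a seasonal series into trend, seasonal, and
# random (irregular) components using moving averages
parts <- decompose(AirPassengers, type = "multiplicative")
plot(parts)

# stl() is a more flexible loess-based alternative
fit <- stl(log(AirPassengers), s.window = "periodic")
plot(fit)
```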

Maine Unemployment Data (part 2)
06:44
Airline Travel Example
10:11
Electric Consumption in Australia (part 1)
06:02
Electric Consumption in Australia (part 2)
10:10
Time Series Clustering (part 1)
08:53
Time Series Clustering (part 2)
Preview
09:21
Time Series Classification
12:44
Section 9: Case Study: Forecasting House Price Indices in Canberra, Australia
Forecasting House Prices: Exploring the Data (part 1)
11:31
Forecasting House Prices: Exploring the Data (part 2)
13:15
Forecast House Prices: Use Trend and Seasonal Components
Preview
10:37


Instructor Biography

Geoffrey Hubona, Ph.D., Professor of Information Systems

Dr. Geoffrey Hubona held full-time tenure-track, and tenured, assistant and associate professor positions at three major state universities in the Eastern United States from 1993 to 2010, teaching dozens of statistics, business information systems, and computer science courses to undergraduate, master's, and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL (1993); an MA in Economics (1990), also from USF; an MBA in Finance (1979) from George Mason University in Fairfax, VA; and a BA in Psychology (1972) from the University of Virginia in Charlottesville, VA. He was a full-time assistant professor at the University of Maryland Baltimore County (1993-1996) in Catonsville, MD; a tenured associate professor in the department of Information Systems in the Business College at Virginia Commonwealth University (1996-2001) in Richmond, VA; and an associate professor in the CIS department of the Robinson College of Business at Georgia State University (2001-2010).

He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-present), online educational organizations that teach research methods and quantitative analysis techniques, including linear and non-linear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling. Dr. Hubona is an expert in the analytical, open-source R software suite and in various PLS path modeling software packages, including SmartPLS. He has published dozens of research articles that explain and use these techniques for the analysis of data, and, with software co-development partner Dean Lim, has created a popular cloud-based PLS software application, PLS-GUI.

Ready to start learning?
Take This Course