Data Mining with Rattle is a unique course that teaches both the concepts of data mining and the hands-on use of a popular, contemporary data mining tool: "Data Miner," also known as the 'Rattle' package for R. Rattle is a popular GUI-based tool that sits on top of R. The course focuses on the life-cycle issues, processes, and tasks involved in supporting a 'cradle-to-grave' data mining project. These include: data exploration and visualization; testing data for random variable family characteristics and distributional assumptions; transforming data by scale or by data type; performing cluster analyses; creating, analyzing and interpreting association rules; and creating and evaluating predictive models that may use regression, generalized linear models (GLMs), decision trees, recursive partitioning, random forests, boosting, and/or support vector machines (SVMs). The course is both conceptual and practical: it teaches the ideas behind data mining and provides ample demonstrations of data mining tasks carried out with the Rattle R package. The course is ideal for undergraduate students seeking to master additional 'in-demand' analytical job skills to offer a prospective employer. It is also suitable for graduate students seeking to learn a variety of techniques for analyzing research data. Finally, it is useful for practicing quantitative analysis professionals who seek to acquire and master a wider set of job skills and knowledge. The course is organized into 10 distinct topics, each of which should be the focus of a participant's study for a separate week.
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Rattle - the R Analytical Tool To Learn Easily - is a popular GUI for data mining using R. It presents statistical and visual summaries of data, transforms data so that it can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.
Rattle is a tab-oriented graphical user interface (GUI), similar to Microsoft Office's ribbon interface, that makes getting started with data mining in R very easy. It performs a myriad of data mining functions through a "point-and-click" style of interaction, and it also generates the underlying R code that actually carries out each action. Rattle therefore appeals both to people seeking the ease of use that is very much missing from R and to people looking to learn R programming.
What the Tabs do:
Data: The Data tab allows you to select your data source and import from a variety of file formats.
Explore: The Explore tab contains various tools for performing exploratory work on your data to help you understand its distribution.
Test: The Test tab allows you to perform various statistical tests, from the familiar t-test and F-test to a range of less common ones.
Transform: The Transform tab lets you clean up or modify your data set, using techniques such as ranking or rescaling.
Cluster: The Cluster tab lets you do various forms of clustering, from numeric K-means clustering to hierarchical clustering and biclustering.
Associate: The Associate tab lets you do association rule data mining, which would be great for doing market basket analysis for retail data mining.
Model: The Model tab lets you create decision tree models, random forests, neural nets and other sophisticated data models.
Evaluate: The Evaluate tab is crucial because it helps you determine how well your model has worked. It provides an error matrix showing true outcomes versus the predicted outcomes.
Log: Lastly, the Log tab records the R code that Rattle runs for each action, which helps you monitor performance, progress and errors.
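To make the Evaluate tab's error matrix more concrete, here is a minimal sketch of how such a matrix can be tallied. It is written in Python purely for illustration and is not the R code that Rattle generates; the function name `error_matrix` and the sample labels are assumptions invented for this example.

```python
from collections import Counter

def error_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; outer keys are the true outcomes."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

# Hypothetical outcomes for six observations.
actual = ["yes", "no", "no", "yes", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "no", "yes"]

matrix = error_matrix(actual, predicted)
# Overall error rate: share of observations whose prediction is wrong.
error_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)

for true_label, row in matrix.items():
    print(true_label, row)
print("error rate:", round(error_rate, 3))
```

The diagonal cells hold the correctly predicted cases, and the off-diagonal cells show which kinds of mistakes the model makes.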
We explore the shape or distribution of our data before we begin mining.
Through this exploration we begin to understand the "lay of the land," just as a miner works to understand the terrain before blindly digging for gold. Through this exploration we may identify problems with the data, including missing values, noise and erroneous data, and skewed distributions. This will then drive our choice of tools for preparing and transforming our data and for mining it.
Rattle provides tools ranging from textual summaries to visually appealing graphical summaries, tools for identifying correlations between variables, and a link to the very sophisticated GGobi tool for visualising data. The Explore tab provides an opportunity to understand our data in various ways.
In statistics, interactive data exploration is an applied form of exploratory data analysis (EDA), an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
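A textual summary of the kind used in this exploration can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not Rattle output: the function `summarize`, the `incomes` values, and the use of `None` to mark a missing value are all hypothetical.

```python
import statistics

def summarize(values):
    """Textual summary of one numeric variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    median = statistics.median(present)
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median,
        "mean": mean,
        "max": max(present),
        # A mean well above the median hints at a right-skewed distribution.
        "right_skew_hint": mean > median,
    }

# Hypothetical variable with one missing value and one extreme value.
incomes = [21, 23, 25, None, 27, 30, 250]
summary = summarize(incomes)
print(summary)
```

Even this small summary surfaces two of the problems mentioned above: a missing value, and a single large observation that pulls the mean far away from the median.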
GGobi is an open source visualization program for exploring high-dimensional data. It provides highly dynamic and interactive graphics such as tours, as well as familiar graphics such as the scatterplot, barchart and parallel coordinates plots. Plots are interactive and linked with brushing and identification.
Reshaping data is a common task in real-life data analysis, and it is usually tedious and frustrating. You've struggled with this task in Excel, in SAS, and in R: how do you get your clients' data into the form that you need for summary and analysis? The reshape package for R (R Development Core Team 2007) presents a new approach that aims to reduce the tedium and complexity of reshaping data.
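The basic idea of reshaping can be shown with a small wide-to-long conversion, loosely modeled on the "melt" step of the reshape package. The sketch below is Python written for illustration only and is not the package's R implementation; the function `melt` as defined here, the `id_vars` parameter, and the sample records are assumptions.

```python
def melt(rows, id_vars):
    """Turn wide rows into long records of (id columns, variable, value)."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for key, value in row.items():
            if key not in id_vars:
                long_rows.append({**ids, "variable": key, "value": value})
    return long_rows

# Hypothetical wide data: one row per subject, one column per measurement.
wide = [
    {"subject": "A", "height": 170, "weight": 65},
    {"subject": "B", "height": 182, "weight": 80},
]
long = melt(wide, id_vars=["subject"])
for record in long:
    print(record)
```

Once the data is in this long form, each measurement is a separate record, which makes it straightforward to regroup and summarize by any combination of identifier and variable.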
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. This is a form of "similarity." These algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.
Connectivity based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to compute the distance to) to use. Popular choices are known as single-linkage clustering (the minimum of object distances), complete linkage clustering (the maximum of object distances) or UPGMA ("Unweighted Pair Group Method with Arithmetic Mean", also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).
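The effect of the linkage criterion can be seen in a small agglomerative sketch. This is illustrative Python on one-dimensional points, not the routine Rattle calls; the function names and the sample points are assumptions, and the brute-force search is only suitable for tiny data sets.

```python
def linkage_distance(a, b, points, method):
    """Distance between clusters a and b (lists of indices into points)."""
    dists = [abs(points[i] - points[j]) for i in a for j in b]
    if method == "single":      # minimum of object distances
        return min(dists)
    if method == "complete":    # maximum of object distances
        return max(dists)
    if method == "average":     # arithmetic mean of object distances (UPGMA)
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, method="single"):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        pairs = [
            (linkage_distance(clusters[i], clusters[j], points, method), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        dist, i, j = min(pairs)
        merged = clusters[i] + clusters[j]
        merges.append((sorted(merged), dist))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical one-dimensional observations.
points = [1.0, 2.0, 6.0, 7.0, 15.0]
single = agglomerate(points, "single")
complete = agglomerate(points, "complete")
for members, height in single:
    print("single  ", members, height)
for members, height in complete:
    print("complete", members, height)
```

Each recorded distance is the height at which that merge would appear on the dendrogram's y-axis. Both criteria produce the same sequence of merges on this data, but complete linkage places the later merges at larger heights than single linkage does.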
Affinity analysis is a form of association analysis: a type of data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.
Association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, a rule such as {onions, potatoes} => {hamburger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.
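Two commonly used measures of interestingness are support (how often the items of a rule appear together) and confidence (how often the right-hand side appears when the left-hand side does). The following Python sketch computes both for the onions-and-potatoes rule; the baskets are invented for illustration and the function names are assumptions, not part of Rattle.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "hamburger meat", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "milk"},
]

lhs, rhs = {"onions", "potatoes"}, {"hamburger meat"}
rule_support = support(lhs | rhs, baskets)
rule_confidence = confidence(lhs, rhs, baskets)
print("support:", rule_support)
print("confidence:", round(rule_confidence, 3))
```

Because baskets are stored as sets, the order of items within a transaction plays no role, which matches the contrast with sequence mining noted above.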
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
Recursive partitioning is a statistical method for multivariable analysis. Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached.
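The recursive splitting described above can be sketched in a few lines. This Python example is illustrative only: it assumes dichotomous (True/False) predictors, uses Gini impurity as the splitting measure (one common choice among several), and stops when a node is pure, too small, or out of predictors. The function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, features, min_size=2):
    """Recursively split on the binary feature with the lowest weighted Gini."""
    # Stopping criteria: pure node, too few rows, or no features left.
    if len(set(labels)) == 1 or len(rows) < min_size or not features:
        return majority(labels)

    def weighted_gini(f):
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        if not yes or not no:
            return float("inf")  # this split does not separate anything
        return (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)

    best = min(features, key=weighted_gini)
    if weighted_gini(best) == float("inf"):
        return majority(labels)
    rest = [f for f in features if f != best]
    yes = [(r, l) for r, l in zip(rows, labels) if r[best]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[best]]
    return {
        "split": best,
        "yes": build_tree([r for r, _ in yes], [l for _, l in yes], rest, min_size),
        "no": build_tree([r for r, _ in no], [l for _, l in no], rest, min_size),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["split"]] else tree["no"]
    return tree

# Hypothetical observations with two dichotomous predictors.
rows = [
    {"rain": True, "windy": True},
    {"rain": True, "windy": False},
    {"rain": False, "windy": True},
    {"rain": False, "windy": False},
    {"rain": False, "windy": False},
]
labels = ["stay", "stay", "stay", "go", "go"]
tree = build_tree(rows, labels, ["rain", "windy"])
print(tree)
```

Each sub-population produced by a split is handed back to the same function, which is what makes the procedure recursive.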
Recursive partitioning methods have been developed since the 1980s. Well known methods of recursive partitioning include Ross Quinlan's ID3 algorithm and its successors, C4.5 and C5.0, as well as Classification and Regression Trees (CART). Ensemble learning methods such as Random Forests help to overcome a common criticism of these methods - their vulnerability to overfitting of the data - by employing different algorithms and combining their output in some way.
Random forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random forests correct for decision trees' habit of overfitting to their training set.
The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. The method combines Breiman's "bagging" idea and the random selection of features, introduced independently by Ho and by Amit and Geman, in order to construct a collection of decision trees with controlled variance.
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
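A minimal sketch of bagging follows. Since bagging can be used with any type of method, this Python example uses a one-dimensional nearest-neighbour rule as the base learner rather than a decision tree, to keep the code short; the function names, the fixed random seed, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def nearest_neighbour(train, x):
    """Base learner: label of the closest training point (1-D)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bagged_predict(data, x, n_models=25, seed=0):
    """Train one base learner per bootstrap sample and take the majority vote."""
    rng = random.Random(seed)
    votes = [
        nearest_neighbour(bootstrap_sample(data, rng), x)
        for _ in range(n_models)
    ]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical labelled observations: (value, class).
data = [(v, "low") for v in (1, 2, 3, 4, 5)] + [(v, "high") for v in (6, 7, 8, 9, 10)]
print(bagged_predict(data, 0.5))
print(bagged_predict(data, 10.5))
```

Each bootstrap sample omits some observations and repeats others, so the individual learners differ slightly; aggregating their votes is what reduces the variance of the final prediction.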
Boosting is a machine learning ensemble meta-algorithm for reducing bias primarily and also variance in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant (1988, 1989): Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
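One well known member of this family is AdaBoost, sketched here in Python on one-dimensional data. The weak learners are threshold "stumps" that each predict +1 on one side of a cut point and -1 on the other. The function names and the toy data are assumptions, and this is an illustrative sketch rather than the boosting implementation that Rattle uses.

```python
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` above its threshold and the opposite below."""
    threshold, polarity = stump
    return polarity if x > threshold else -polarity

def adaboost(xs, ys, rounds):
    """AdaBoost with threshold stumps; returns a list of (alpha, stump) pairs."""
    n = len(xs)
    weights = [1.0 / n] * n
    stumps = [(t, p) for t in sorted({x - 0.5 for x in xs}) for p in (1, -1)]
    model = []
    for _ in range(rounds):
        def weighted_error(stump):
            return sum(w for w, x, y in zip(weights, xs, ys)
                       if stump_predict(stump, x) != y)

        best = min(stumps, key=weighted_error)
        # Clamp the error to avoid dividing by zero or taking log(0).
        err = min(max(weighted_error(best), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, best))
        # Raise the weight of misclassified points, lower the rest, renormalise.
        weights = [w * math.exp(-alpha * y * stump_predict(best, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def boosted_predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score > 0 else -1

# Hypothetical data that no single stump can classify perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([boosted_predict(model, x) for x in xs])
```

On this data any single stump misclassifies at least two of the six points, yet the weighted vote of three stumps classifies all six correctly, which is the weak-to-strong conversion described above.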
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
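The prediction step of a linear SVM can be sketched as follows. This Python fragment does not train anything: the weight vector `w` and offset `b` are hypothetical values standing in for an already-trained model, and the function names are assumptions. It shows only how the side of the boundary determines the predicted category and how the width of the gap follows from `w`.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the boundary x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign category +1 or -1 according to the side of the boundary."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Width of the gap between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical, already-trained parameters (not produced by a real solver).
w, b = [1.0, 1.0], -3.0

print(classify(w, b, [3.0, 3.0]))   # falls on the positive side
print(classify(w, b, [0.0, 0.0]))   # falls on the negative side
print(margin_width(w))
```

With these parameters the points (2, 2) and (1, 1) have decision values of exactly +1 and -1, so they sit on the two edges of the gap; training an SVM amounts to choosing `w` and `b` so that this gap is as wide as the data allows.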
Dr. Geoffrey Hubona held full-time tenure-track and tenured assistant and associate professor faculty positions at three major state universities in the Eastern United States from 1993 to 2010. In these positions, he taught dozens of statistics, business information systems, and computer science courses to undergraduate, master's and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL (1993); an MA in Economics (1990), also from USF; an MBA in Finance (1979) from George Mason University in Fairfax, VA; and a BA in Psychology (1972) from the University of Virginia in Charlottesville, VA. He was a full-time assistant professor at the University of Maryland Baltimore County (1993-1996) in Catonsville, MD; a tenured associate professor in the department of Information Systems in the Business College at Virginia Commonwealth University (1996-2001) in Richmond, VA; and an associate professor in the CIS department of the Robinson College of Business at Georgia State University (2001-2010). He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-Present), online educational organizations that teach research methods and quantitative analysis techniques. These techniques include linear and non-linear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling. Dr. Hubona is an expert in the analytical, open-source R software suite and in various PLS path modeling software packages, including SmartPLS. He has published dozens of research articles that explain and use these techniques for the analysis of data, and, with software co-development partner Dean Lim, has created a popular cloud-based PLS software application, PLS-GUI.