
Know your instructor
Modules in the course
Data mining is about building models from data. We build models to gain insight into how the world works, so that we can predict how things will behave in the future. In building models, a data miner deploys many different data analysis and model building techniques, and the choice among them depends on the business problem to be solved. Although data mining is not the only approach, it is becoming very widely used because it is well suited to the data environments found in today's enterprises. These environments are characterised by the volume of data available, commonly in the gigabytes and fast approaching the terabytes, and by the complexity of that data, both in the relationships awaiting discovery and in the data types now available, including text, image, audio, and video. Business environments are also changing rapidly, so analyses need to be performed regularly and models updated regularly to keep up with today's dynamic world.
Know the basics of R and Rattle
Find the link in the resources
A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs.
A fuller description is available at https://www.togaware.com/datamining/survivor/Introduction.html
A key task in any data mining project is exploratory data analysis (often abbreviated as EDA), which generally involves getting a basic understanding of a dataset. Statistics, the fundamental tool here, is essentially about uncertainty: understanding it and thereby making allowance for it.
It is usually a good idea to review the distributions of the values of each of the variables in your dataset. The Distributions option allows you to visually explore the distributions for specific variables.
Using graphical tools to investigate the data's characteristics visually helps us understand the data, correct errors, and select and transform variables.
Graphical presentations are more effective for most people, and Rattle provides a graphical summary of the distribution of the data through the Distributions option of the Explore tab.
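As a rough illustration of what a distribution summary captures, the sketch below computes a five-number summary (minimum, quartiles, maximum) for one numeric variable. It is plain Python rather than Rattle's own code; the function name `five_number_summary` and the sample values are hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the method R uses.

```python
import statistics

def five_number_summary(values):
    """Return min, Q1, median, Q3, and max for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points that split the data into quarters
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }

# Hypothetical values for a single variable
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

These are the same quantities a box plot draws, which is why a plot and a numeric summary tend to be read side by side.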
A correlation plot will display correlations between the values of variables in the dataset. In addition to the usual correlation calculated between values of different variables, the correlation between missing values can be explored by checking the Explore Missing check box.
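The quantity behind a correlation plot can be sketched in a few lines. The example below computes the Pearson correlation coefficient for every pair of variables; it is an illustrative Python sketch, not what Rattle runs internally, and the names `pearson`, `correlation_matrix`, and the sample columns are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping variable name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical dataset with three numeric variables
data = {
    "height": [150, 160, 170, 180],
    "weight": [50, 58, 69, 80],
    "age": [40, 35, 30, 22],
}
matrix = correlation_matrix(data)
print(matrix["height"]["weight"])
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.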
Learn how to use Principal Component Analysis and interactive plots.
Statistical Tests: These tests apply to two samples. The paired two-sample tests assume that we have two samples or observations and that we are testing for a change, usually from one time period to another.
Distribution of the Data
* Kolmogorov-Smirnov (non-parametric): Are the distributions different?
* Wilcoxon Signed Rank (non-parametric): Do paired samples have different distributions?
Location of the Average
* T-test (parametric): Are the means different?
* Wilcoxon Rank-Sum (non-parametric): Are the medians different?
Variation in the Data
* F-test (parametric): Are the variances different?
Correlation
* Correlation (Pearson's): Are the values from the paired samples correlated?
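To make the idea of a paired test concrete, here is a minimal sketch of the paired t statistic, which asks whether the mean of the per-observation differences is far from zero. It assumes the paired form of the T-test listed above, and it stops at the test statistic: turning it into a p-value needs the t distribution, which the Python standard library does not provide. The function name and the before/after values are hypothetical.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two paired samples of equal length (at least 2)")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; statistic is undefined")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Hypothetical measurements of the same four entities at two time periods
before = [10, 12, 14, 16]
after = [11, 14, 15, 18]
t = paired_t_statistic(before, after)
print(t, "with", len(before) - 1, "degrees of freedom")
```

A large absolute value of the statistic suggests the change between the two periods is unlikely to be due to chance alone.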
The Transform tab provides numerous options for transforming our datasets. Cleaning our data and creating new features from the data occupies much of our time as data miners. There is a myriad of approaches, and a programming language like R supports them all. Through the Rattle user interface, we can perform some of the more common transformations. These include normalising our data, filling in missing values, turning numeric variables into categorical variables (and vice versa), dealing with outliers, and removing variables or entities with missing values.
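Three of these transformations are simple enough to sketch directly. The Python below is an illustration under stated assumptions, not Rattle's implementation: normalising is taken to mean rescaling to the range 0 to 1, missing values are represented by `None` and filled with the mean, and numeric values are turned into categories by equal-width binning. All function names are hypothetical.

```python
def rescale_01(values):
    """Normalise numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant variable: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no observed values to compute a mean from")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def equal_width_bins(values, n_bins):
    """Turn a numeric variable into a categorical one: bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(rescale_01([2, 4, 6]))
print(impute_mean([1.0, None, 3.0]))
print(equal_width_bins([0, 5, 10], 2))
```

Other choices are just as reasonable, for example imputing with the median or binning by quantiles; which one suits depends on the data.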
The task of classification is at the heart of data mining! Most of what we learn from a traditional data mining course focuses on the algorithms from machine learning and statistics that build classification models. These models can then be used to classify new entities. The actual structure of the model also gives us insight into the relationships between the variables that are important in differentiating the classes.
This chapter focuses on this common data mining task of classification and prediction. We consider binary (or two class) classification, but the concepts also apply to multi-class classification.
The two-class model builders provided by Rattle are Decision Trees, Boosted Decision Trees, Random Forests, Support Vector Machines, and Logistic Regression.
To evaluate a model, we have a range of options, depending on the kind of model or clustering we have built. Various evaluation criteria are available.
Error Matrix
An error matrix shows the true outcomes against the predicted outcomes. Two tables will be presented here. The first will be the count of observations and the second will be the proportions. For a binary classification model, the cells of the error matrix are referred to, from the top left going clockwise, as the True Negatives, False Positives, True Positives, and False Negatives. An error matrix is also known as a confusion matrix.
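The two tables can be reproduced with a short sketch. The Python below assumes classes are coded 0 (negative) and 1 (positive), with rows for the actual class and columns for the predicted class, so that reading clockwise from the top left gives the cell order described above. The function name and the sample labels are hypothetical.

```python
def error_matrix(actual, predicted):
    """Return (counts, proportions) as 2x2 lists: rows = actual, columns = predicted."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    counts = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    total = len(actual)
    proportions = [[c / total for c in row] for row in counts]
    return counts, proportions

# Hypothetical true and predicted classes for eight observations
actual = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 0, 1, 0, 1, 0]
counts, proportions = error_matrix(actual, predicted)

true_neg, false_pos = counts[0]   # top row: actual 0
false_neg, true_pos = counts[1]   # bottom row: actual 1
print(counts)
print(proportions)
print("overall error rate:", (false_pos + false_neg) / len(actual))
```

The overall error rate is the sum of the two off-diagonal proportions.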
Association Rule Analysis
Association analysis identifies relationships or affinities between observations and/or between variables. These relationships are then expressed as a collection of association rules. The approach has been particularly successful in mining very large transaction databases. It is also often referred to as basket (as in shopping basket) analysis.
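Two standard measures used to rank association rules are support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for a handful of shopping baskets. It is illustrative Python only, with hypothetical function names and made-up baskets, and it does not search for rules the way a full algorithm such as Apriori does.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        raise ValueError("need at least one transaction")
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    sup_antecedent = support(transactions, antecedent)
    if sup_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    sup_both = support(transactions, set(antecedent) | set(consequent))
    return sup_both / sup_antecedent

# Hypothetical shopping baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Here the rule "bread implies milk" holds in two of the three baskets that contain bread.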
In this course, you will learn about the Rattle GUI, an interactive tool for data mining.
Rattle is a free and open-source software package that provides a graphical user interface (GUI) for data mining with the R statistical programming language. It offers considerable data mining functionality by exposing the power of R through that interface.
Rattle is also used as a teaching tool for learning the R language. A Log Code tab replicates the R code for any activity undertaken in the GUI, and that code can be copied and pasted. Rattle can be used for statistical analysis or for model generation. It allows the dataset to be partitioned into training, validation, and testing sets. The dataset can be viewed and edited. There is also an option for scoring an external data file.
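The partitioning step can be pictured with a short sketch. The Python below shuffles the row indices with a fixed seed and slices them into three sets; the 70/15/15 split, the seed value, and the function name `partition` are assumptions made for illustration, not settings taken from this course.

```python
import random

def partition(rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Randomly split rows into (training, validation, testing) lists."""
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three numbers summing to 1")
    indices = list(range(len(rows)))
    # A seeded generator makes the split reproducible from run to run
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * len(rows))
    n_validate = round(fractions[1] * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    validate = [rows[i] for i in indices[n_train:n_train + n_validate]]
    test = [rows[i] for i in indices[n_train + n_validate:]]  # whatever remains
    return train, validate, test

# Hypothetical dataset of 100 rows, represented here by row ids
rows = list(range(100))
train, validate, test = partition(rows)
print(len(train), len(validate), len(test))
```

The model is built on the training set, tuned against the validation set, and assessed once on the testing set, which it has not seen.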
Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The computer is responsible for finding the patterns by identifying the underlying rules and features in the data. The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records