
In this class we introduce the problem of outliers in data and use an example to show why it matters. It is an essential issue in Data Mining, Data Analysis, Pattern Recognition, and Machine Learning, and we need to understand the problem and the basics of the methods for dealing with it before we use software such as R or Matlab.
Lesson where we see the Matlab code.
Lesson where we see the R code.
We will learn what an outlier (or atypical observation) really is and how outliers can arise, and with a simple example we will see how their presence can affect a statistical analysis.
Lesson where we see the Matlab code.
Lesson where we see the R code.
Let's review the concept of sample and population, and introduce the notion of random variable. We will see some simple examples.
Let's see what the distribution of a random variable is, and understand it using an example.
We will get to know the Normal distribution, the best-known and most widely used distribution in Statistics.
We will introduce the Student's t and chi-square distributions, which arise from the Normal.
We are going to see what a sample estimator is, what an estimate is, and the properties that estimators need in order to provide good estimates of the unknown population parameters. We will see the best-known sample estimators.
Before starting to see the methods to detect outliers in the univariate space, we must know important concepts that we will use throughout the course, such as the central tendency estimators: mean and median.
We will see the spread (or variation) estimators, the range and the standard deviation, and their respective robust versions, the interquartile range and the MAD (median absolute deviation).
We will see the estimators of shape: the classical skewness and the Medcouple.
Lesson where we see the R code for the examples with the sample estimators and their robust versions.
We will learn the basics of the SD method, a classical approach to detecting outliers in a random variable.
We will learn the basics of the Z-score method, another classical approach to detecting outliers in a random variable.
We will learn the basics of the Tukey boxplot method, another classical approach to detecting outliers in a random variable.
We will learn the basics of the MADe method, a robust method to detect outliers in a random variable.
We will learn the basics of the modified Z-score method, a robust method to detect outliers in a random variable.
We will learn the basics of the adjusted boxplot method, a robust method to detect outliers in a random variable.
Lesson where we see the Matlab and the R codes for the methods for outlier detection in the univariate space, and their robust versions.
We will draw some conclusions about the methods to detect outliers in the univariate space, and summarize everything we saw in this section.
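To make the contrast between the classical and robust univariate rules concrete, here is a minimal sketch. It is written in Python with only the standard library, not in the course's R or Matlab, and it is not the course's own code. The sample values are made up for illustration, and the cut-offs (3 for the Z-score, 3.5 for the modified Z-score, k = 1.5 for the boxplot fences) are common conventions assumed here rather than values taken from the lessons.

```python
import statistics

def z_scores(data):
    """Classical Z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(x - mean) / sd for x in data]

def modified_z_scores(data):
    """Modified Z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    return [0.6745 * (x - med) / mad for x in data]

def tukey_fences(data, k=1.5):
    """Boxplot fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: eight similar values and one gross error (9.0).
data = [2.1, 2.3, 2.2, 2.4, 2.0, 2.5, 2.3, 2.2, 9.0]

# Assumed conventional cut-offs: |z| > 3 and |M| > 3.5.
flag_z = [x for x, z in zip(data, z_scores(data)) if abs(z) > 3]
flag_mz = [x for x, m in zip(data, modified_z_scores(data)) if abs(m) > 3.5]
low, high = tukey_fences(data)
flag_box = [x for x in data if x < low or x > high]

print("Z-score (|z| > 3):", flag_z)
print("Modified Z-score (|M| > 3.5):", flag_mz)
print("Tukey boxplot (k = 1.5):", flag_box)
```

On this sample the classical Z-score rule flags nothing, because the value 9.0 inflates both the mean and the standard deviation that are used to judge it, while the modified Z-score and the boxplot fences flag only 9.0. The sketch does not guard against a MAD of zero, which happens when more than half of the values are identical.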
We are going to look at some necessary linear algebra concepts, which are the definition of a vector, a matrix, the transpose of a vector or a matrix, the identity matrix, the inverse of a matrix, the product between a vector and a matrix, the product between two matrices.
We will see how a multivariate random variable is defined and we will study some examples.
We will define the joint distribution and density function of a multidimensional random vector, as well as marginal distributions and densities.
We will see the concept of covariance, correlation and independence between two random variables of a random vector, linked to the notion of joint distribution.
We will see some R functions that allow us to plot the bivariate Normal distribution, and see how changing its parameters affects its shape.
We will see how outliers are defined in a multivariate space, and why the methods of the univariate space cannot be used to detect them in the multivariate case.
We will see an example of real data in the multivariate space, specifically simple linear regression, where the presence of outliers influences the results.
We will learn the concept of location in the multivariate space and which estimators are used for it.
We will see in Matlab how to calculate multivariate location estimators and a graphical example.
We will see in R how to calculate multivariate location estimators and a graphical example.
We will learn the idea of dispersion in the multivariate space and which estimators are used for it.
We will see the multivariate dispersion estimators with an example in R.
We will learn about the Euclidean distance, which allows us to order the data in a space of more than one dimension.
We will learn about the Mahalanobis distance, which allows us to order the data in a space of more than one dimension while taking into account the relationship between the variables. The classical version is sensitive to outliers, so the robust version of the distance should be used.
We will learn how to calculate the Mahalanobis distance in R.
We will study the MCD method for the calculation of robust location and dispersion estimators.
How to obtain the MCD estimator in Matlab.
How to obtain the MCD estimator in R.
We will see another way to detect outliers with the robust Mahalanobis distance based on the MCD, considering another cut-off value, the adjusted quantile.
We will study some real data, the Kola project, and use the robust Mahalanobis distance based on the MCD with the adjusted quantile.
We will study some real data, the Kola project, and use the robust Mahalanobis distance based on the MCD with the adjusted quantile, in R.
We will learn about another robust estimation method, this one based on projections.
We will see the R code for the Stahel-Donoho estimator.
We will see the Kurtosis method, which is based on projecting the data onto directions chosen according to the value of the kurtosis coefficient.
We will see in Matlab the application of the learned multivariate outlier detection methods.
We will see in R the application of the learned multivariate outlier detection methods.
We will summarize what we have seen in this section and come to some conclusions.
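As an illustration of why the Mahalanobis distance is preferred over the Euclidean distance in this section, here is a minimal sketch in Python (standard library only, not the course's R or Matlab code). It uses the classical mean and covariance, not the MCD, on made-up bivariate data, and assumes the 0.975 chi-square quantile with 2 degrees of freedom as the cut-off for the squared distance. All function names are hypothetical.

```python
import math

def mean_vector(points):
    """Componentwise sample mean of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def covariance_2d(points, center):
    """Sample covariance entries (s_xx, s_xy, s_yy) with divisor n - 1."""
    n = len(points)
    cx, cy = center
    s_xx = sum((x - cx) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - cy) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)
    return s_xx, s_xy, s_yy

def squared_mahalanobis(point, center, cov):
    """(p - c)^T S^{-1} (p - c), using the closed-form inverse of a 2x2 matrix."""
    s_xx, s_xy, s_yy = cov
    det = s_xx * s_yy - s_xy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det

# Hypothetical data: nine points close to the line y = x, plus (5, 9.0),
# whose coordinates are ordinary one by one but break the correlation pattern.
points = [(1, 1.1), (2, 1.9), (3, 3.1), (4, 3.9), (5, 5.0),
          (6, 6.1), (7, 6.9), (8, 8.1), (9, 8.9), (5, 9.0)]

center = mean_vector(points)
cov = covariance_2d(points, center)

# Assumed cut-off: 0.975 chi-square quantile with 2 degrees of freedom,
# which for 2 degrees of freedom has the closed form -2 * ln(1 - p).
cutoff = -2.0 * math.log(1.0 - 0.975)

d2 = [squared_mahalanobis(p, center, cov) for p in points]
flagged = [p for p, d in zip(points, d2) if d > cutoff]
farthest_euclidean = max(points, key=lambda p: math.dist(p, center))

print("Flagged by Mahalanobis distance:", flagged)
print("Farthest from the mean in Euclidean distance:", farthest_euclidean)
```

Here the point (5, 9.0) is the only one flagged by the Mahalanobis distance, although the point farthest from the mean in Euclidean terms is (1, 1.1), which lies on the main trend. With several outliers the classical mean and covariance can themselves be distorted, which is the motivation for replacing them with robust estimates such as the MCD.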
Let's see what the linear regression problem consists of, how it can be expressed in a matrix form and what are the assumptions of the model.
We will see the classical method of estimating the parameters of the multidimensional linear regression model, and we will see with an example that this method is not robust to outliers.
We will learn about the robust methods used in regression analysis and the types of outliers that we can find.
We will see the code for the robust methods to estimate the regression model with two datasets, in Matlab.
We will see the code for the robust methods to estimate the regression model with two datasets, in R.
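To show how a single outlier can distort the classical least squares fit, here is a minimal Python sketch (standard library only, not the course's R or Matlab code). The data are made up, and the robust fit uses the Theil-Sen estimator (median of pairwise slopes) purely as a simple stand-in; it is not claimed to be one of the robust regression methods covered in the lessons.

```python
from itertools import combinations
from statistics import mean, median

def ols_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

def theil_sen_line(xs, ys):
    """Theil-Sen fit: median of pairwise slopes, then median residual intercept."""
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i, j in combinations(range(len(xs)), 2)
              if xs[j] != xs[i]]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

# Hypothetical data on the exact line y = 2x + 1, with the last response
# replaced by a gross error (1 instead of 19).
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[-1] = 1

ols_slope, ols_intercept = ols_line(xs, ys)
ts_slope, ts_intercept = theil_sen_line(xs, ys)

print("OLS fit:       slope", round(ols_slope, 3), "intercept", round(ols_intercept, 3))
print("Theil-Sen fit: slope", ts_slope, "intercept", ts_intercept)
```

With this data the least squares slope drops to roughly half of the true value of 2, while the Theil-Sen fit recovers slope 2 and intercept 1, because most pairwise slopes are unaffected by the single bad point.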
Robust data analysis and outlier detection are crucial in Statistics, Data Analysis, Data Mining, Machine Learning, Artificial Intelligence, Pattern Recognition, Classification, Principal Components, Regression, Big Data, and any other field that works with data.
With the course you will obtain the FREE BOOK ABOUT OUTLIERS, with specific tips and tricks and a summary of all the robust methods to detect them, which will help you obtain accurate results and awesome data analysis.
Researchers, students, data analysts, and almost anyone dealing with real data should be aware of the problem of outliers (atypical data), and should know how to deal with it and which robust methods to use. The vast majority of Machine Learning algorithms are capable of detecting characteristics common to the majority of the data, but they are often confused by atypical data or even ignore it. Such data should not be ignored where people's safety is at stake, as in the analysis of medical data, the Internet of Things (IoT), or risk and security in companies.
What would happen if a virus spread throughout the world because we ignored anomalous data? We would have a pandemic like that of COVID-19, which we might have acted on earlier if the outlier signals detected by neural networks had not been ignored.
What would happen if we ignored any signal from a Smart City system? We could miss a gas leak.
What would happen if, by ignoring an alarm, we missed a meteorite heading towards the Earth? We would have to call Bruce Willis to save us from Armageddon.
With this course you will become an expert in robust data analysis and in the detection and treatment of atypical data, both by learning the theoretical concepts and by having at your disposal practical implementations of the algorithms in two different languages, so that you can choose the one that best suits you: R (in RStudio) and Matlab.
You will also have access to a community for questions, shared with all the other students, where you can ask whatever you want about the analysis of outliers.
The example implementation code is available in the open GitHub repository for you to download and use.
In addition, we have two sections of basic concepts that will help you recall some notions needed to understand the outlier detection methods.
With this course you will understand and know how to deal with one of today's most important topics in academia, in industry, and in data analysis and machine learning. The examples will help you see the importance of the analysis of outliers, and will serve as a guide for carrying out these analyses yourself.