Categorical Data

A free video tutorial from Ankit Mistry

Data Science with Python Course : Hands-on Data Science 2021

Numpy, Pandas, Matplotlib, Scikit-Learn, WebScraping, Data Science, Machine Learning, Pyspark, statistics, Data Science

15:35:20 of on-demand video • Updated November 2020

  • You will learn one of the most in-demand skills of the 21st century: data science
  • Add data science skills to your resume: Python, NumPy, pandas, Plotly, Tableau, machine learning, statistics, probability
  • Apply linear regression and logistic regression on real datasets
  • Crash course on Python
  • Apply matrix operations with NumPy, the numerical Python library
  • Visualize your data with the mother of all visualisation libraries available in Python: Matplotlib
  • Perform data analysis, wrangling and cleaning with the pandas library
  • Get hands-on with the interactive visualisation library Plotly
  • Getting started with the data visualization tool Tableau
  • Data pre-processing techniques: missing data, normalization, one-hot encoding
  • Importing data in Python from different sources, Files
  • Web Scraping to download web page and extract data
  • Data scaling and transformation
  • Exploratory Data analysis
  • Feature engineering process in Machine Learning system design
  • Machine learning theory
  • Apache spark installation : pyspark
  • Getting started with spark session
  • Maths required for machine learning: statistics, probability
  • Setup Data Science Virtual machine on Microsoft Azure Cloud
Hello friends, welcome back to the section on data preprocessing. In this tutorial we're going to see how to deal with categorical data. What is categorical data, and why do we need to deal with it? Let's take our earlier example first. I'm just going to print X. All the data here that is not numeric in nature is called categorical data: the first feature and the third feature, which have values like Mumbai, London, New York and Yes, No.
The machine learning algorithms we use can take only numerical data into account, so we cannot throw Mumbai, London, New York, Yes, No kind of data directly at a machine learning algorithm. We need some strategy to transform this non-numeric data into numeric data, and only then can we proceed with the machine learning system design. If you look at y, the output variable, it also contains Yes, No, Yes, No, so it is non-numeric as well and needs the same kind of transformation.
So what strategies are there to transform the data into numeric form? Look at how many values each feature takes. The first feature has only three different labels: Mumbai, London or New York. The feature that says whether a particular person smokes has only two possible values, Yes or No, and the output also has two values. So you can map Yes to, let's say, 1 and No to, let's say, 0. That is one way to do it. In the case of Mumbai, London and New York, you can map Mumbai to 0, London to 1 and New York to 2. Whenever only two labels are available, your problem is almost solved, because the value goes either one way or the other, especially for the output value.
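The simple mapping described here can be sketched in plain Python. The sample values below are hypothetical (the video's full dataset is not reproduced), and the dictionary names are made up for illustration; the codes follow the ones mentioned in the video.

```python
# Hypothetical sample values standing in for the video's data.
smokes = ["Yes", "No", "Yes", "No"]
cities = ["Mumbai", "London", "New York", "London"]

# Two-valued feature: a fixed Yes/No mapping is enough.
yes_no_codes = {"No": 0, "Yes": 1}
smokes_encoded = [yes_no_codes[value] for value in smokes]

# Three-valued feature: the same idea, using the codes mentioned in the video.
city_codes = {"Mumbai": 0, "London": 1, "New York": 2}
cities_encoded = [city_codes[value] for value in cities]

print(smokes_encoded)  # [1, 0, 1, 0]
print(cities_encoded)  # [0, 1, 2, 1]
```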
So Yes will be encoded as 1 and supplied to the machine learning algorithm, and No will be encoded as 0. But there is one problem whenever a single feature has more than two labels. Here we have a total of three values. If you supply Mumbai as 0, London as 1 and New York as 2, the machine learning algorithm may try to capture a relationship like London is greater than Mumbai, or New York is greater than London. Our intention is not to tell the algorithm that London is greater than Mumbai just because we supply 1 for London and 0 for Mumbai, or that New York is greater than London because New York is 2 and London is 1.
So how do we solve this problem? The simple mapping we will handle with label encoding, so we are going to use the label encoder. But for the first feature, that strategy alone is not enough for the non-numeric to numeric conversion. First we convert it with the label encoder, and then we need another strategy: instead of transforming to just 0, 1 and 2, we create three new features, one for Mumbai, one for London and one for New York.
Let's take the examples one by one. The first record says the person belongs to Mumbai, so Mumbai becomes 1, London becomes 0 and New York becomes 0. The second one is London, so Mumbai becomes 0, London becomes 1 and New York becomes 0. In this way we transform three different labels in a single feature into three different features. For a Mumbai record, instead of supplying a single value such as 1, we supply 1 0 0: 1 in the Mumbai column for that record, 0 for London and 0 for New York.
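To make the three-column idea concrete, here is a minimal hand-written sketch in plain Python. The helper `one_hot` and the column order (Mumbai, London, New York) are assumptions chosen for illustration; this is not the library routine used later in the video.

```python
# Hypothetical records, one city per person.
cities = ["Mumbai", "London", "New York"]

# Column order chosen for this sketch: Mumbai, London, New York.
labels = ["Mumbai", "London", "New York"]


def one_hot(value, labels):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == label else 0 for label in labels]


encoded = [one_hot(city, labels) for city in cities]

for city, row in zip(cities, encoded):
    print(city, row)
# Mumbai [1, 0, 0]
# London [0, 1, 0]
# New York [0, 0, 1]
```

Each row contains exactly one 1, so no city is numerically "greater" than another.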
The one-hot format won't capture an interdependency like Mumbai is greater than London, London is greater than New York, or New York is less than Mumbai. Those kinds of interdependency are not taken into consideration if we put the data into this format, and this format is called one-hot encoding: for every single label in your feature you add one extra feature. So we have two mechanisms: label encoding and one-hot encoding.
Now let's see how we can do it with the scikit-learn library. We're going to use the preprocessing module, "sklearn.preprocessing", and from it we import two classes: the label encoder and the one-hot encoder. First we will pass all three of these, the city, whether the person smokes, and the output variable, through the label encoder. After that we will pass our first feature to the one-hot encoder, and we will see that every record with a non-numeric value in our input or output data has been converted to a numerical value.
Let's create the label encoder objects. We are going to use the label encoder for both input and output, so I will create one variable, "le_x", and another one, "le_y". OK, so we have created two objects, "le_x" and "le_y", from the label encoder class. Now let's fit our input data and assign the result back to the same feature. I'll take "le_x" and use the fit and transform method, which is a very common method across the transformers in the scikit-learn library.
It first fits and then transforms. We supply all rows of the zeroth column, because the zeroth column is the Mumbai, London column that we need to transform. The other one is whether the person smokes, which is the third column, indexed by number 2. We assign each result back to the same column, because we need the data in that same place for further processing. In the same way we transform the third column, referenced by the number 2, and then y as well. For y we use the "le_y" variable and apply fit and transform on top of y. OK, so it has executed successfully.
Now let's see whether it has transformed correctly. OK, it has transformed the values. Compared with our earlier array, Mumbai has been transformed to 1, London to 0 and New York to 2. But that is not sufficient; we are going to use the one-hot encoder on top of it. The next one was Yes, No, Yes, No, Yes, No, Yes, Yes, Yes, Yes. You can see that Yes is represented by 1 and No by 0, and then again the three Yes values are 1. Now let's look at the output y. OK, the output has also changed, to 1 0 0 1. So with the help of the label encoder we have transformed our data to numerical values.
Now let's use the one-hot encoder on the very first feature, which says which city the person belongs to. We create an object of the one-hot encoder, and here we need to pass one argument, the categorical features: which feature do you consider categorical? Only the first column, so the first feature, whose index is 0, is the categorical feature. We create this object and assign it to a variable, "ohe". OK.
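The label-encoding step walked through above can be sketched as follows. The array contents are a hypothetical stand-in for the video's data (city, age, smoker, happiness index); only the variable names `le_x` and `le_y` come from the video.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data: city, age, smoker, happiness index.
X = np.array(
    [
        ["Mumbai", 24, "Yes", 241],
        ["London", 31, "No", 180],
        ["New York", 45, "Yes", 205],
        ["London", 28, "No", 310],
    ],
    dtype=object,
)
y = np.array(["Yes", "No", "No", "Yes"])

le_x = LabelEncoder()
le_y = LabelEncoder()

# Column 0 (city): labels are sorted, so London=0, Mumbai=1, New York=2.
X[:, 0] = le_x.fit_transform(X[:, 0])
# Column 2 (smoker): le_x is refit here, so it now remembers only No=0, Yes=1.
X[:, 2] = le_x.fit_transform(X[:, 2])
# Output variable: No=0, Yes=1.
y = le_y.fit_transform(y)

print(X)
print(y)
```

LabelEncoder assigns codes in sorted label order, which is why London gets 0 and Mumbai gets 1. The scikit-learn documentation describes LabelEncoder as intended for target values, so using it on input columns as shown is a convenience that follows the video rather than a recommended pattern.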
Now let's fit this "ohe" to our input data X, call "toarray" on the result, and assign it back to X. OK, so it has executed successfully. Now let's see whether X has been transformed. You can see that our input variable X earlier had a total of 4 features, but now there are a total of 6. The first three features represent the city. Immediately after them comes the age, 24, then 1 for whether the person smokes, and 241 is the happiness index. The first record belongs to, I guess, Mumbai, the second record is London and the third is New York. You can see that Mumbai is represented by the second column, London by the first column and New York by the third column.
In this way we have transformed all the data from non-numeric form into numeric form, which is suitable for a machine learning algorithm to process. So in this video we've seen two important encoding techniques that transform your non-numeric data into numeric data. That's all for this video, friends. I hope you enjoyed it, and see you in the next video.
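A hedged sketch of the one-hot step follows. The argument described in the video appears to be the `categorical_features` parameter of older scikit-learn releases, which was removed in version 0.22. This sketch therefore swaps in `ColumnTransformer` to select column 0, which is not the call shown in the video. The data and the names `ct` and `X_encoded` are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded data: city code, age, smoker flag, happiness index.
X = np.array(
    [
        [1, 24, 1, 241],  # Mumbai
        [0, 31, 0, 180],  # London
        [2, 45, 1, 205],  # New York
    ]
)

# One-hot encode column 0 only and pass the other columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
ct = ColumnTransformer(
    transformers=[("city", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X)

print(X_encoded.shape)  # (3, 6)
print(X_encoded)
```

The encoded columns come first, in sorted code order (London, Mumbai, New York), followed by age, smoker flag and happiness index. So the Mumbai record has its 1 in the second column, as observed in the video.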