Decision Trees and Random Forests with scikit-learn

Dr. Junaid Qazi, PhD
A free video tutorial from Dr. Junaid Qazi, PhD
Data Scientist

Lecture description

You will create a machine learning model using Decision Trees and Random Forests with scikit-learn. These are among the most important machine learning algorithms in business data science!

Learn more from the full course

Data Science and Machine Learning using Python - A Bootcamp

NumPy Pandas Matplotlib Seaborn Plotly Machine Learning Scikit-Learn Data Science Recommender system NLP Theory Hands-on

24:52:28 of on-demand video • Updated February 2020

  • Python to analyze data, create state-of-the-art visualizations and use machine learning algorithms to facilitate decision making.
  • Python for Data Science and Machine Learning
  • NumPy for Numerical Data
  • Pandas for Data Analysis
  • Plotting with Matplotlib
  • Statistical Plots with Seaborn
  • Interactive dynamic visualizations of data using Plotly
  • Scikit-Learn for Machine Learning
  • K-Means Clustering, Logistic Regression, Linear Regression
  • Random Forest and Decision Trees
  • Principal Component Analysis (PCA)
  • Support Vector Machines
  • Recommender Systems
  • Natural Language Processing and Spam Filters
  • and much more!
Hi guys, welcome to the decision tree and random forests lecture using scikit-learn in Python. After learning the key concepts of decision trees and random forests in the theory lecture, let's move on and use another famous dataset, on heart disease in Cleveland.
The original, full dataset is part of the UCI Machine Learning Repository and contains four databases: Cleveland, Hungary, Switzerland, and the VA Long Beach. This dataset was donated to the UCI repository in 1988. The original database contains 76 attributes, but all published experiments by machine learning researchers refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by machine learning researchers to this date. In the original database, the "goal" field refers to the presence of heart disease in the patient. It is integer valued from zero to four. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence from absence.
We are also using the Cleveland database in this section. You can download the original one from the UCI website or use the one provided along with this course. I recommend using the one provided in the course material because it is already cleaned for missing data. A new column, target, is also added, with N for zero and Y for 1, 2, 3 and 4, which mean the presence of heart disease. If you are interested in knowing more about the databases, you can visit the UCI Machine Learning Repository page for the heart disease data and read more about this dataset. However, the information on the 14 attributes that we are going to use is provided in the notebook. The attributes we are going to use are age, sex, cp (the chest pain), trestbps (the resting blood pressure), chol (related to cholesterol), fbs (the fasting blood sugar) and a couple more features.
Let's move on to the Jupyter notebook and learn by doing. So, as always, let's import the required libraries first.
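
As a side note on the target column described above, here is a minimal sketch of how a 0 to 4 "goal" value could be collapsed into a two-class label. The series name `goal`, the sample values and the exact "Y"/"N" spelling are assumptions for illustration; the course file may have been prepared differently.

```python
import pandas as pd

# Hypothetical values of the original 0-4 "goal" field for six patients.
goal = pd.Series([0, 2, 0, 4, 1, 3], name="goal")

# 0 means absence of heart disease; any value from 1 to 4 is treated as presence.
target = goal.map(lambda g: "Y" if g > 0 else "N").rename("target")

print(target.tolist())
```
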
So "import pandas as pd", "import numpy as np", "import seaborn as sns" and "import matplotlib.pyplot as plt", and then "%matplotlib inline". Let's run this cell, and we are done with importing the required libraries.
Let's read the dataset into the dataframe df. So df equal to pd.read_csv, and the file is provided with the name HD; if you press Tab, you will see the file "HD cleaveland underscore data underscore clean dot csv", and this is the file we are going to use in this lecture. Let's run this cell, and we have our dataframe in df now.
As always, let's check the head of our dataframe: df.head(), and run this cell. And here we have age, sex, cp, which was the chest pain, and columns related to the other features. The last column is target, with yes and no. Yes means the presence of heart disease and no means the absence of heart disease. Let's add a few cells and call info on our dataframe. So df.info(), and run this cell. We see that we have 297 entries in our dataframe and there are no missing values. As mentioned earlier, I have already dealt with the missing values in the file which is provided in the course material. So let's move on and call describe on our dataframe: df.describe(). You can always press Tab for autocomplete. Let's run this. And here we have the basic statistics on our dataframe. So we have the count, 297 entries in each column, the mean for each column, the standard deviation, min, max and so on.
Let's move on and do some exploratory data analysis. We want to know how many persons in the database were diagnosed as heart patients. Let's grab the target column and call value_counts: so df target dot value_counts(), and run this cell. So 160 persons were diagnosed with the absence of heart disease, whereas 137 persons were diagnosed with heart disease. Let's move on and see how many males and females were diagnosed with the presence and absence of heart disease.
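
The loading and inspection steps so far can be sketched roughly as follows. Because the course CSV is not reproduced here, the sketch builds a small made-up dataframe instead of calling pd.read_csv; the values, the subset of columns and the "Y"/"N" labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the course file: a few of the columns named in the lecture.
rng = np.random.default_rng(0)
n = 10
df = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "sex": rng.integers(0, 2, size=n),        # 1 = male, 0 = female
    "chol": rng.integers(120, 400, size=n),
    "target": rng.choice(["Y", "N"], size=n),  # assumed label spelling
})

print(df.head())       # first five rows
df.info()              # row count, dtypes and non-null counts per column
print(df.describe())   # count, mean, std, min, quartiles and max of numeric columns

# Number of rows in each class of the target column.
counts = df["target"].value_counts()
print(counts)
```

With the real file, the dataframe construction would be replaced by a single pd.read_csv call on the provided CSV.
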
We can pass in hue equal to sex and call countplot from seaborn: so sns.countplot, pass in x equal to target and hue equal to sex, and our data is df. Let's run this cell. Here, one is for male and zero is for female. So most of the men were diagnosed with heart disease.
Moving forward, let's see how age is related to cholesterol. We can call jointplot from sns: so sns.jointplot, pass in x equal to age and y equal to chol, which is the cholesterol, and our dataframe is df. Let's run this cell. And here we have a joint plot: age is along the x axis and cholesterol is along the y axis. We see some trend here: with increasing age, the level of cholesterol is getting higher in the patients. If you want, you can do more exploratory data analysis so that you get an even better understanding of your dataset. I encourage you to ask yourself questions and get more familiar with the dataset.
Our focus is machine learning, so let's split the data and move on to the machine learning part. The approach we are going to follow is to start with a single decision tree and then compare the results with a random forest. But first, we need to do the train test split. So let's import it: from sklearn.model_selection import train_test_split. Run this cell, and separate the features in X: df.drop, drop target along axis 1, and our y is df target. Let's run this cell, and now we have the features in capital X and the target in y. For train_test_split, I will press Shift+Tab and copy this code just to save some typing. Let's set test_size equal to 0.30, leave random_state equal to 42, and run this cell to get the test and train samples.
So we'll start with training a single decision tree. For this purpose, we need to import DecisionTreeClassifier, which is a part of tree in sklearn. So let's do another import: from sklearn.tree import DecisionTreeClassifier. Let's run this cell and add a few lines.
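
The feature/target separation and the split described above can be sketched like this. The dataframe is again made up (random values, only three feature columns) and merely matches the 297 rows mentioned in the lecture; test_size and random_state are the values used in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the same number of rows as the lecture's file (297).
rng = np.random.default_rng(0)
n = 297
df = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "sex": rng.integers(0, 2, size=n),
    "chol": rng.integers(120, 400, size=n),
    "target": rng.choice(["Y", "N"], size=n),  # assumed label spelling
})

X = df.drop("target", axis=1)   # features: every column except the label
y = df["target"]                # label column

# Hold out 30% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(X_train.shape, X_test.shape)
```
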
We need to create an instance of the decision tree. So dtree is going to be the instance of DecisionTreeClassifier. If you press Shift+Tab here, you will see the options for the parameters that you can pass in. I'm going to use the default values at this stage. However, if you want, you can move down and explore more on these parameters: criterion, for example, is the function to measure the quality of a split, and we are going to use the default value. And if you move down, you will see max_features, max_depth, min_samples_split and so on. Let's move on with the default values and run this cell.
The next step after creating an instance is to train our model on the training dataset. For this purpose, we need to call fit on our instance and pass in the training dataset. So dtree.fit, pass in X_train and y_train, and here we have trained our classifier, and these are all the default parameters that we are using to train our model on the training dataset.
The next step is always predictions and evaluations. So let's get the predictions in pred: pred equal to dtree.predict, pass in X_test, and run this cell. And now we have the predictions from our model in pred. We have the observed values in y_test, and now we have the predictions in pred, so we can print the classification report and the confusion matrix. Once again, we need to import classification_report and confusion_matrix from sklearn.metrics. So let's do another import here: from sklearn.metrics import classification_report and confusion_matrix, and let's run this cell. As usual, we are going to use the print function to print the classification report and the confusion matrix. So print classification_report, and we need to pass in the observed values, which are in y_test, and the predicted values, which are in pred; then print confusion_matrix, pass in y_test and pred. Let's run this, and here we have our classification report. Our precision is 0.68, recall is 0.68, and in the confusion matrix we are getting some mislabelling.
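
The train, predict and evaluate steps for the single tree can be sketched as below. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification, not the heart disease file), and a random_state is passed to the classifier for repeatability, which the lecture does not do. The scores it prints say nothing about the lecture's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data: 297 rows, 13 features, 2 classes.
X, y = make_classification(n_samples=297, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Default parameters, apart from the seed added here for repeatability.
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)      # train on the training split
pred = dtree.predict(X_test)     # predict labels for the held-out split

# Observed values first, predicted values second.
print(classification_report(y_test, pred))
cm = confusion_matrix(y_test, pred)   # rows: observed class, columns: predicted class
print(cm)
```

The off-diagonal entries of the confusion matrix are the mislabelled test samples.
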
It looks like we are doing reasonably well using a single decision tree. Let's try a random forest model on the data and compare our results with the decision tree model. Random forest is under the ensemble module in sklearn, and we need to import RandomForestClassifier from ensemble. Let's add a few cells and do the import: from sklearn.ensemble import RandomForestClassifier. Let's run this cell.
So the next step is to create an instance: rfc equal to RandomForestClassifier. And if you press Shift+Tab, we see the default value of n_estimators is 10 (in the scikit-learn version used here; from version 0.22 the default is 100). There is a range of other parameters that you can play with, but we're going to use all the default values, apart from passing in n_estimators equal to 100. So let's copy n_estimators from here and pass in equal to 100. You can always change n_estimators and see how the model behaves and how the results differ as you change the number of estimators in your model. Let's run this cell.
The next step is training the model. So we need to call fit on rfc now: rfc.fit, pass in X_train and y_train, and run this cell. And now we have trained our model. The next step is predictions. Let's get the predictions in rfc_pred, equal to rfc.predict, pass in X_test, and let's run this cell. And now we have the predictions in rfc_pred.
Now we need to evaluate our model. Let's print the classification report and the confusion matrix: so print classification_report, pass in y_test and rfc_pred, and then print confusion_matrix and pass in the observed values, which are in y_test, and the predicted values from our random forest model, rfc_pred. Let's run this. So here we have it: it looks like the random forest has given improved results over a single tree for the dataset that we have used in this lecture. We got better precision, recall and f1 score using the random forest, and there is less mislabelling in the sample.
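
The random forest steps can be sketched in the same way. As before, the data is synthetic and a random_state is added for repeatability; both are assumptions for illustration, and the printed scores are not the lecture's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data: 297 rows, 13 features, 2 classes.
X, y = make_classification(n_samples=297, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# 100 trees as in the lecture; other parameters are left at their defaults.
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)          # train on the training split
rfc_pred = rfc.predict(X_test)     # predict labels for the held-out split

print(classification_report(y_test, rfc_pred))
rfc_cm = confusion_matrix(y_test, rfc_pred)
print(rfc_cm)
```

Changing n_estimators and rerunning the cell is a quick way to see how the number of trees affects the report.
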
You will see that as the dataset gets larger and larger, random forests will generally do better than a single decision tree. In the current situation, the dataset is not very large, but the random forest model still works better than a single decision tree. However, the model will really shine with a larger dataset. So this was all about decision tree classifiers and random forests.
Let's have a quick overview, and then we'll move on to the next lecture, where we will do the project using another real dataset. So first we did some imports and then read the data into df. We got a summary of the dataset and the basic statistics, and then we did some exploratory data analysis. Then we imported train_test_split, separated our data into features and target, and split the data into test and train datasets. We imported the decision tree classifier and created its instance using default parameters. We trained our classifier on the training dataset, and then we did the predictions using X_test, which is the test dataset. We printed the classification report and the confusion matrix and learned that the decision tree is doing OK but doing quite a lot of mislabelling. The precision, recall and F1 scores are also low. Moving forward, we imported the random forest classifier, passed in n_estimators equal to 100 and then trained our classifier, otherwise using default parameters, on the training dataset. Then we did the predictions and printed the classification report and the confusion matrix. We learnt that random forests have improved the predictions over the decision tree, and we got less mislabelling using the random forest classifier.
Once again, I want to repeat that the random forest model shines with larger datasets: the bigger the dataset is, the better the job the random forest model will do compared to a single decision tree. So guys, go through this lecture once again and try to understand each and every step.
In the next lecture we are going to practice our skills on decision trees and random forests using another real dataset. See you in the next lecture, Good luck!