
First of all, let's understand the application we want to develop and the problem we are trying to solve. Once we understand the problem statement and its use case, developing the application becomes much easier. So let's begin!
Here, we want to help financial companies such as banks, non-banking financial companies (NBFCs), and other lenders. We will build an algorithm that predicts to whom financial institutions should give loans or credit. You may ask: what is the significance of this algorithm? Let me explain in detail. When a financial institution lends money to a customer, it takes on risk. So, before lending, financial institutions check whether the borrower will have enough money in the future to pay back the loan.
Based on the customer's current income and expenditure, many financial institutions perform an analysis that helps them decide whether the borrower will be a good customer. This kind of analysis is manual and time-consuming.
So, it needs some kind of automation. If we develop an algorithm, it will help financial institutions assess their customers efficiently and effectively. Your next question may be: what is the output of our algorithm? Our algorithm will generate a probability. This value indicates the chance of a borrower defaulting. Defaulting means that the borrower cannot repay the loan within a certain amount of time.
Here, the probability indicates the chance of a customer not paying their loan EMI (equated monthly installment) on time, resulting in default. A higher probability means that the customer would be a bad or inappropriate borrower for the financial institution, as they may default in the next 2 years. A lower probability means that the customer would be a good or appropriate borrower and is unlikely to default in the next 2 years.
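To make the idea of a default probability concrete, here is a minimal sketch using scikit-learn's `LogisticRegression`, one of the models covered later in this case study. The data is synthetic, and the two features (`income` and `debt_ratio`) are hypothetical names chosen only for illustration; they are not columns from the actual dataset. The 0.5 cut-off used for the printed label is also just an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical, synthetic borrower features (not the real dataset columns).
income = rng.normal(5000, 1500, n)   # monthly income
debt_ratio = rng.uniform(0, 1, n)    # share of income spent on debt

# Synthetic labels: a high debt ratio and a low income make default more likely.
true_p = 1 / (1 + np.exp(-(3.0 * debt_ratio - income / 2500.0)))
defaulted = (rng.uniform(size=n) < true_p).astype(int)  # 1 = default

# Rescale income so both features are on a similar scale.
X = np.column_stack([income / 1000.0, debt_ratio])
model = LogisticRegression().fit(X, defaulted)

# Column 1 of predict_proba holds the probability of class 1 (default).
default_proba = model.predict_proba(X)[:, 1]

for p in default_proba[:3]:
    label = "higher risk" if p >= 0.5 else "lower risk"
    print(f"probability of default = {p:.2f} ({label})")
```

Each value in `default_proba` lies between 0 and 1; the closer it is to 1, the more likely the model considers that borrower to default.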
So far, I have described the problem statement and its output, but there is another important aspect of this algorithm: its input. So, let's discuss what our input will be!
Let's discuss the dataset and its attributes in detail. The dataset contains the following files:
• cs-training.csv
  ◦ Records in this file are used for training, so this is our training dataset.
• cs-test.csv
  ◦ Records in this file are used for testing our machine learning models, so this is our testing dataset.
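As a first look at these two files, the following sketch loads a CSV with pandas and reports its size and the number of missing values per column. The helper names `load_dataset` and `summarize` are hypothetical, and the usage lines assume that both files sit in the current working directory.

```python
import pandas as pd

def load_dataset(path):
    """Read one of the CSV files into a DataFrame."""
    return pd.read_csv(path)

def summarize(df):
    """Return the row/column counts and the per-column count of missing values."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": df.isna().sum().to_dict(),
    }

# Example usage, assuming both files are in the working directory:
# train_df = load_dataset("cs-training.csv")
# test_df = load_dataset("cs-test.csv")
# print(summarize(train_df))
# print(summarize(test_df))
```

Checking the missing-value counts this early is useful because the preprocessing steps covered later deal with missing values and outliers.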
Random Forest was used to evaluate the importance of 10 features on an artificial classification task.
AdaBoost was used to evaluate the importance of 10 features on an artificial classification task. The findings of Random Forest and AdaBoost were discussed (the same code was used for both models).
A Gradient Boosting model was also used for feature importance, but selecting only 3 attributes.
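The feature-importance notes above can be sketched roughly as follows. This is not the case study's own code: it is a minimal illustration that assumes scikit-learn's `make_classification` for the artificial task and reads "selecting 3 attributes" as keeping the three highest-ranked features. The helper `rank_features` and all parameter values are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Artificial task: 10 features, of which only the first 3 are informative
# (shuffle=False keeps the informative columns first).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, n_redundant=0,
    n_repeated=0, n_classes=2, random_state=0, shuffle=False,
)

def rank_features(model, X, y, top_k=None):
    """Fit the model and return (feature index, importance) pairs, best first."""
    model.fit(X, y)
    importances = model.feature_importances_
    order = importances.argsort()[::-1]
    if top_k is not None:
        order = order[:top_k]
    return [(int(i), float(importances[i])) for i in order]

# The same ranking code is reused for both models.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=250, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    print(name)
    for idx, score in rank_features(model, X, y):
        print(f"  feature {idx}: {score:.3f}")

# Gradient Boosting, keeping only the 3 highest-ranked features.
top3 = rank_features(GradientBoostingClassifier(random_state=0), X, y, top_k=3)
print("Gradient Boosting top 3:", top3)
```

Because the ranking lives in one function, comparing the models only requires swapping the estimator that is passed in.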
One case study and five models, from data preprocessing to implementation in Python, with some examples where no coding is required.
We will cover the following topics in this case study:
Problem Statement
Data
Data Preprocessing 1
Understanding Dataset
Data Manipulation and Data Statistics
Data Preprocessing 2
Missing values
Replacing missing values
Correlation Matrix
Data Preprocessing 3
Outliers
Outliers Detection Techniques
Percentile-based outlier detection
Mean Absolute Deviation (MAD)-based outlier detection
Standard Deviation (STD)-based outlier detection
Majority-vote based outlier detection
Visualizing outliers
Data Preprocessing 4
Handling outliers
Feature Engineering
Models Selected
·K-Nearest Neighbor (KNN)
·Logistic regression
·AdaBoost
·Gradient Boosting
·Random Forest
·Performing the Baseline Training
Understanding the testing matrix
·The Mean accuracy of the trained models
·The ROC-AUC score
ROC
AUC
Performing the Baseline Testing
Problems with this Approach
Optimization Techniques
·Understanding key concepts to optimize the approach
Cross-validation
The approach of using CV
Hyperparameter tuning
Grid search parameter tuning
Random search parameter tuning
Optimized Parameters Implementation
·Implementing a cross-validation based approach
·Implementing hyperparameter tuning
·Implementing and testing the revised approach
·Understanding problems with the revised approach
Implementation of the revised approach
·Implementing the best approach
Log transformation of features
Voting-based ensemble ML model
·Running ML models on real test data
Best approach & Summary
Examples with No Code
Downloads – Full Code