6 Machine Learning Algorithms for Credit Analysis

Gustavo Jannuzzi
May 10, 2022

Creating your first models and algorithms can be challenging when you are entering the world of data science and machine learning. In a quick and easy way, I'll show you how to implement six credit analysis algorithms and compare which one performs best on a given dataset.

As the objective here is just to present the algorithms, I'll leave the GitHub link with the data pre-processing, which is simply the phase in which we clean and prepare the data before creating the models.

For all models, we need to split the database into predictor and class attributes.

  • The Predictors are the inputs we feed the algorithm; in this case they are client (identifier number), income (annual salary), and loan (the debt the person has).
  • The Class is the output the model returns; in this case it is default: 0 if the person paid the debt, 1 if the person did NOT pay.

To evaluate the models correctly, we also split both the predictors and the class into a training base, which the model uses to learn the patterns, and a test base, which we use to measure the model's accuracy.
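As a minimal sketch of this split (assuming x_credit and y_credit are the arrays produced by the pre-processing step, and a 25% test share), it can be done with scikit-learn and saved with pickle for the sections below:

from sklearn.model_selection import train_test_split
import pickle

# x_credit (predictors) and y_credit (class) come from the pre-processing step
x_credit_treinamento, x_credit_teste, y_credit_treinamento, y_credit_teste = train_test_split(x_credit, y_credit, test_size=0.25, random_state=0)

# Save the four arrays so each model section can simply load 'credit.pkl'
with open('credit.pkl', 'wb') as f:
    pickle.dump([x_credit_treinamento, y_credit_treinamento, x_credit_teste, y_credit_teste], f)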

THE ML MODELS:

  • Decision Tree — 98.8%

To decide which attributes go at the top of our tree, we use the two formulas below. But don't be alarmed by them: the algorithm takes care of this directly.

Entropy and Gain formulas:

Entropy(S) = −Σ pᵢ · log₂(pᵢ)
Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

Entropy measures how organized or disorganized the data is. Gain shows us which attributes are most important to stay at the top of the tree.
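As a toy illustration (not part of the credit pipeline), here is how these two quantities can be computed for a single attribute split:

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(labels, split_mask):
    # Gain = Entropy(S) minus the weighted entropy of the resulting subsets
    left, right = labels[split_mask], labels[~split_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy data: 3 payers (0) and 2 defaulters (1), split by a hypothetical income test
y = np.array([0, 0, 0, 1, 1])
income_high = np.array([True, True, True, False, False])
print(gain(y, income_high))  # 0.971 -- this split separates the classes perfectly

Creating and training the tree: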

from sklearn.tree import DecisionTreeClassifier

arvore_credit = DecisionTreeClassifier(criterion='entropy', random_state=0)
arvore_credit.fit(x_credit_treinamento, y_credit_treinamento)

How do we know the model’s accuracy?

previsoes = arvore_credit.predict(x_credit_teste)

To know the percentage of hits, let's compare the predictions with y_credit_teste; this gives us an accuracy metric for the algorithm.

# Metric: percentage of hits between the predictions and y_credit_teste
from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_credit_teste, previsoes)

After that, we can generate a Confusion Matrix to visually analyze how many hits we had for each type of data within the base.

from yellowbrick.classifier import ConfusionMatrix

cm = ConfusionMatrix(arvore_credit)
cm.fit(x_credit_treinamento, y_credit_treinamento)
cm.score(x_credit_teste, y_credit_teste)
print(classification_report(y_credit_teste, previsoes))
Confusion Matrix — Decision Tree

The decision tree structure that the algorithm created can be summarized in the figure below:

Decision tree structure.

We can analyze the performance of the algorithm by generating the Classification Report, a table with the precision, recall, F1-score, and support for each class the algorithm returns; here the classes are 0 (the person paid the debt) and 1 (the person did not pay off the debt).

Classification Report — Decision Tree.

Looking at the table, we can see that the algorithm correctly identified 99% of the people who paid the debt and 93% of those who didn't.

  • Rules (CN2Learner) — 97.4%

This algorithm generates a sequence of conditional rules over the predictor attributes; it works through these rules until it decides which class the record belongs to, in this case "1" or "0".

Importing the database:

import Orange

base_credit = Orange.data.Table('credit_data_regras.csv')

Splitting the data

base_dividida = Orange.evaluation.testing.sample(base_credit, n = 0.25)

We split the database into two parts, one for training and one for testing. In this case we don't need to split the base into predictor attributes and class.

base_treinamento = base_dividida[1]
base_teste = base_dividida[0]

Using the CN2Learner function we make the model learn from the training base.

cn2 = Orange.classification.rules.CN2Learner()
regras_credit = cn2(base_treinamento)

Checking the generated rules:

for regras in regras_credit.rule_list:
    print(regras)

The output in this case returns the list of rules that the algorithm created to decide which path to take.
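For illustration only (the thresholds depend entirely on the data), the printed rules take an IF/THEN form along these lines:

IF income >= <threshold> THEN default=0
IF loan >= <threshold> AND age < <threshold> THEN default=1
...
IF TRUE THEN default=0

The final, always-true rule is the catch-all that fires when no other rule matches.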

Rules created by CN2Learner.

Checking the accuracy and precision of the model.

previsoes = Orange.evaluation.testing.TestOnTestData(base_treinamento, base_teste, [lambda testdata: regras_credit])
Orange.evaluation.CA(previsoes)

With this function we can observe that the algorithm obtained an accuracy of 97.4% of correct answers.

  • Instance-based learning (KNN) — 98.6%

The idea of this algorithm is to place the records as points in a Cartesian space and, for a new input value, search the training base for the closest points; as the name implies, these are the k nearest neighbors, and the new record receives the majority class among them. For this training we will use the Minkowski metric to measure the distances.

Minkowski metric: D(x, y) = (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/p). With p = 2 it reduces to the Euclidean distance.
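A quick toy check (illustrative, not part of the pipeline) that Minkowski with p = 2 is just the Euclidean distance:

import numpy as np

def minkowski(x, y, p):
    # D(x, y) = (sum |x_i - y_i|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, p=2))   # 5.0
print(np.linalg.norm(a - b))  # 5.0 -- the Euclidean distance agrees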
import pickle
from sklearn.neighbors import KNeighborsClassifier

with open('credit.pkl', 'rb') as f:
    X_credit_treinamento, y_credit_treinamento, X_credit_teste, y_credit_teste = pickle.load(f)

In this application, n_neighbors = 5 was used, which is the library's default value, making the test easier. As the metric, the Minkowski distance presented in the formula above is used, with p = 2 (i.e., the Euclidean distance).

knn_credit = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p = 2)
knn_credit.fit(X_credit_treinamento, y_credit_treinamento)

To know the accuracy of the algorithm, we first generate the predictions and then compare them with the test labels:

previsoes = knn_credit.predict(X_credit_teste)

from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_credit_teste, previsoes)

Confusion matrix and classification report:

from yellowbrick.classifier import ConfusionMatrix

cm = ConfusionMatrix(knn_credit)
cm.fit(X_credit_treinamento, y_credit_treinamento)
cm.score(X_credit_teste, y_credit_teste)
print(classification_report(y_credit_teste, previsoes))
Confusion Matrix — KNN
Classification Report — KNN.

With the generated classification report, we observe a precision of 99% for class 0 and 94% for class 1.

  • Logistic Regression — 94.6%

Despite its name, this algorithm is not necessarily a regression technique; here we use it for classification.
Unlike linear regression, logistic regression fits a sigmoid function that maps the predictor attributes to a probability between 0 and 1, which is then used to assign each record to a class.
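The sigmoid itself is a simple squashing function, σ(z) = 1 / (1 + e^(−z)): it maps any real number into the interval (0, 1), so its output can be read as the probability of class 1. A tiny illustrative sketch:

import numpy as np

def sigmoid(z):
    # Maps any real z into (0, 1); outputs above 0.5 are classified as 1
    return 1 / (1 + np.exp(-z))

print(sigmoid(-4), sigmoid(0), sigmoid(4))  # ~0.018, 0.5, ~0.982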

The theory behind this model is vast, and I encourage curious readers to seek more information about how it works at a fundamental level. As the objective here is not to explain the fundamentals of each model, let's go straight to the code:

import pickle
from sklearn.linear_model import LogisticRegression

with open('credit.pkl', 'rb') as f:
    X_credit_treinamento, y_credit_treinamento, X_credit_teste, y_credit_teste = pickle.load(f)

After loading the training and test data, let's make the algorithm learn from it.

logistic_credit = LogisticRegression(random_state=1)
logistic_credit.fit(X_credit_treinamento, y_credit_treinamento)

logistic_credit.intercept_  # the fitted intercept (b0)
logistic_credit.coef_       # the fitted coefficients of the predictors
previsoes = logistic_credit.predict(X_credit_teste)

Next, we will analyze the accuracy and efficiency of this algorithm by printing the confusion matrix and the classification report.

from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_credit_teste, previsoes)

from yellowbrick.classifier import ConfusionMatrix
cm = ConfusionMatrix(logistic_credit)
cm.fit(X_credit_treinamento, y_credit_treinamento)
cm.score(X_credit_teste, y_credit_teste)
print(classification_report(y_credit_teste, previsoes))
Confusion Matrix — Log. Regression
Classification Report — Log. Regression

With the confusion matrix and the generated classification report, we can see that the algorithm hit 97% of the values for class 0 and 79% of the values for class 1, with an overall accuracy of 94.6%.

  • Support Vector Machine — 98.8%

The SVM works, conceptually, by finding the hyperplane that separates the classes with the maximum possible margin. Despite the heavy mathematics behind it, the algorithm can be implemented quickly and practically, as shown below.

We import the library and define the test and training bases of the attributes.

import pickle
from sklearn.svm import SVC

with open('credit.pkl', 'rb') as f:
    X_credit_treinamento, y_credit_treinamento, X_credit_teste, y_credit_teste = pickle.load(f)

Here the model is created with the RBF kernel and a penalty parameter C of 2.0, raised from the default of 1.0.

svm_credit = SVC(kernel='rbf', random_state=1, C=2.0)
svm_credit.fit(X_credit_treinamento, y_credit_treinamento)
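Since the kernel and C are the main knobs of an SVM, a minimal, illustrative sweep (assuming the same arrays loaded above; the kernels and C value are just examples) shows how they affect the score:

# Illustrative hyperparameter sweep over the built-in kernels
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    svm = SVC(kernel=kernel, random_state=1, C=2.0)
    svm.fit(X_credit_treinamento, y_credit_treinamento)
    print(kernel, svm.score(X_credit_teste, y_credit_teste))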

As with the other models, we will generate the classification report to analyze the accuracy and precision.

previsoes = svm_credit.predict(X_credit_teste)

from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_credit_teste, previsoes)
print(classification_report(y_credit_teste, previsoes))
Classification Report — SVM
from yellowbrick.classifier import ConfusionMatrix
cm = ConfusionMatrix(svm_credit)
cm.fit(X_credit_treinamento, y_credit_treinamento)
cm.score(X_credit_teste, y_credit_teste)
Confusion Matrix — SVM.

In this application, the results were very positive, with 99% correct for class 0 and 97% for class 1.

  • Artificial Neural Network — 99.8%

Neural networks are the backbone of the rise of applied machine learning in the 21st century. Thankfully, machine learning libraries such as scikit-learn have abstracted this out for us. We're going to use the MLPClassifier, or "Multi-Layer Perceptron Classifier", from sklearn.

import pickle
from sklearn.neural_network import MLPClassifier

with open('credit.pkl', 'rb') as f:
    X_credit_treinamento, y_credit_treinamento, X_credit_teste, y_credit_teste = pickle.load(f)

Creating the neural network:

# Architecture: 3 input attributes -> 20 neurons -> 20 neurons -> 1 output
rede_neural_credit = MLPClassifier(max_iter=1500, verbose=True, tol=0.0000100, solver='adam', activation='relu', hidden_layer_sizes=(20, 20))
rede_neural_credit.fit(X_credit_treinamento, y_credit_treinamento)

Generating the classification report…

previsoes = rede_neural_credit.predict(X_credit_teste)

from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_credit_teste, previsoes)

The final confusion matrix.

from yellowbrick.classifier import ConfusionMatrix
cm = ConfusionMatrix(rede_neural_credit)
cm.fit(X_credit_treinamento, y_credit_treinamento)
cm.score(X_credit_teste, y_credit_teste)
Confusion Matrix — Artificial Neural Network
Classification Report — Artificial Neural Network.

CONCLUSION

There are several ways to train an algorithm; even using the same model we can improve it by trying other parameterizations. In these tests, the model with the best performance was the Artificial Neural Network, which reached 99.8% of correct answers. But this does not mean it is the best model for every dataset: it performed excellently in this test, yet in other applications this type of model can suffer from overfitting.

The human factor must always be taken into account when selecting the best model. The machine learns from the data a person presents to it; the algorithms have no value in themselves, and it's up to you to decide how and with what to teach them.
