MORE ON MACHINE LEARNING METHODS IN CREDIT RISK ESTIMATION

One of the distinguishing features of Valuatum Platform is its use of machine learning to develop state-of-the-art bankruptcy risk models for credit risk estimation.

Historically, we relied on models based on logistic regression, but the emergence of more powerful machine learning algorithms has allowed us to make bankruptcy risk estimation considerably more accurate. Machine learning is now a central feature of our models. It enables us to optimize both the prediction of credit defaults and the estimation of correct risk levels.

This page provides an overview of the machine learning techniques we use or have studied. Additionally, we discuss logistic regression models, which we still use as they have their advantages.

If you want to dig deeper into the technical details of these machine learning methods, you can download our white paper. For our main model, see our current Bankruptcy Risk Model.

MACHINE LEARNING MODELS

The machine learning techniques we have studied include random forest, artificial neural networks, the ensemble method and gradient boosting. Gradient boosting has proved to be the most accurate method, and it is currently our flagship model. You can read more about gradient boosting on our main page about the Bankruptcy Risk Model.

RANDOM FOREST

A random forest is built from decision trees. A decision tree is a flowchart-like structure consisting of nodes: each internal node is a test on one of the input variables (such as EBIT or equity ratio), and each branch represents an outcome of that test. Following the branches leads to a final node, which in this case gives a prediction of the probability of a company going bankrupt.

Each decision tree is trained so that the tests at its nodes bring the estimated bankruptcies as close as possible to the actual bankruptcies. The trees are generated using different parts of the data set, which results in a collection of hundreds or even thousands of decision trees. The random forest model averages the outcomes of these trees to produce a probability of bankruptcy.

Example of a decision tree used in a random forest model
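
The idea of training many trees on different parts of the data and averaging them can be sketched in a few lines of Python. This is only an illustration, not our production model: the data is made up, the function names are hypothetical, and each "tree" is reduced to a single test (a so-called stump) to keep the example short.

```python
import random

def fit_stump(rows):
    """Fit a one-test tree: choose the (variable, threshold) with the lowest squared error."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xi, y in rows if xi[f] <= t]
            right = [y for xi, y in rows if xi[f] > t]
            if not left or not right:
                continue
            p_left = sum(left) / len(left)      # bankruptcy rate on the "<= t" branch
            p_right = sum(right) / len(right)   # bankruptcy rate on the "> t" branch
            err = (sum((y - p_left) ** 2 for y in left)
                   + sum((y - p_right) ** 2 for y in right))
            if best is None or err < best[0]:
                best = (err, f, t, p_left, p_right)
    if best is None:  # no usable test: fall back to the overall bankruptcy rate
        rate = sum(y for _, y in rows) / len(rows)
        return (0, float("inf"), rate, rate)
    return best[1:]

def predict_stump(stump, x):
    f, t, p_left, p_right = stump
    return p_left if x[f] <= t else p_right

def fit_forest(rows, n_trees=200, seed=0):
    """Train each tree on a different random resample of the data."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Average the predictions of all trees into one bankruptcy probability."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)

# Made-up observations: (EBIT margin %, equity ratio %) -> 1 if the company went bankrupt
data = [
    ((-12.0, 5.0), 1), ((-8.0, 10.0), 1), ((-5.0, 8.0), 1), ((-3.0, 40.0), 0),
    ((-1.0, 12.0), 1), ((2.0, 15.0), 0), ((4.0, 35.0), 0), ((6.0, 20.0), 0),
    ((9.0, 50.0), 0), ((12.0, 45.0), 0),
]
forest = fit_forest(data)
weak = predict_forest(forest, (-10.0, 6.0))    # loss-making, thinly capitalised
strong = predict_forest(forest, (10.0, 48.0))  # profitable, well capitalised
print(f"weak company: {weak:.2f}, strong company: {strong:.2f}")
```

Real random forests use deeper trees and usually also randomise which variables each test may consider; the averaging step is the same.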

ARTIFICIAL NEURAL NETWORK

Illustration of a neural network model

An artificial neural network is a collection of nodes, called neurons, that are connected to each other. These networks consist of an input layer, an output layer and one or more hidden layers. Simply put, the hidden layers transform the input into something the output can use.

The model receives one or more numbers (such as EBIT or equity ratio) as input. The inputs are weighted and passed on to the hidden layers. The hidden layers then apply an activation function to the weighted inputs, which enables the model to find non-linear dependencies between the inputs and outputs. Artificial neural networks are excellent at finding patterns in data that would be too complex or even impossible for a person to teach to a computer.
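
The flow from inputs through a hidden layer to an output can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are invented rather than trained, the inputs are assumed to be scaled to roughly the range -1 to 1, and the choice of tanh and sigmoid as activation functions is ours for the example.

```python
import math

def sigmoid(z):
    """Squash any number into the range 0..1 so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One pass through a network with a single hidden layer."""
    # Each hidden neuron takes a weighted sum of the inputs and applies tanh,
    # the non-linear activation function in this sketch.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output neuron weights the hidden values and converts them to a probability.
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(z), hidden

# Hypothetical, hand-picked weights for two inputs and three hidden neurons
HIDDEN_WEIGHTS = [[-2.0, -1.0], [1.5, -0.5], [0.5, 2.0]]
HIDDEN_BIASES = [0.0, 0.1, -0.2]
OUTPUT_WEIGHTS = [1.5, -1.0, -1.2]
OUTPUT_BIAS = -1.0

# Inputs: (scaled EBIT, scaled equity ratio)
p_weak, hidden_weak = forward([-0.8, 0.1], HIDDEN_WEIGHTS, HIDDEN_BIASES,
                              OUTPUT_WEIGHTS, OUTPUT_BIAS)
p_strong, _ = forward([0.6, 0.7], HIDDEN_WEIGHTS, HIDDEN_BIASES,
                      OUTPUT_WEIGHTS, OUTPUT_BIAS)
print(f"weak company: {p_weak:.3f}, strong company: {p_strong:.3f}")
```

In a trained network the weights are not chosen by hand but adjusted automatically so that the predictions match the observed bankruptcies as closely as possible.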

ENSEMBLE METHOD

The ensemble method is a combination of several machine learning models. In our case, the ensemble method combines the aforementioned random forest and neural network models. Each individual model gives its own result even when using the same data set, and each has its own weaknesses. By combining different kinds of models, we can get better results than by using only one model. The ensemble method simply averages the outcome probabilities of several different random forest and artificial neural network models.

The ensemble model combines several individual models and averages their outcomes into a final prediction.
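
The averaging step itself is simple enough to show directly. In the sketch below, the three model functions are hypothetical stand-ins for trained random forest and neural network models; only the averaging reflects the method described above.

```python
def ensemble_probability(models, features):
    """Average the bankruptcy probabilities given by several models."""
    probabilities = [model(features) for model in models]
    return sum(probabilities) / len(probabilities)

# Hypothetical stand-ins for trained models. Each maps a company's
# key figures to a bankruptcy probability using an invented rule.
def forest_like(features):
    return 0.30 if features["equity_ratio_pct"] < 20 else 0.05

def network_like(features):
    return 0.20 if features["roa_pct"] < 0 else 0.08

def second_forest_like(features):
    return 0.25 if features["roa_pct"] < -5 else 0.10

company = {"roa_pct": -3.0, "equity_ratio_pct": 15.0}
models = [forest_like, network_like, second_forest_like]
combined = ensemble_probability(models, company)
print(f"ensemble estimate: {combined:.2f}")
```

Because the models disagree (0.30, 0.20 and 0.10 for this company), the average smooths out the individual extremes.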

LOGISTIC REGRESSION MODELS

In addition to the state-of-the-art machine learning models, we also use the good old logistic regression models, as they have their advantages. Mainly, logistic regression models are easier for humans to understand, since the bankruptcy risk is estimated using an explicit equation (a weighted sum of the explanatory variables passed through the logistic function). Moreover, the result can easily be divided into components and visualized to gain a better understanding of what contributes to the risk.
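
How a logistic regression result divides into components can be sketched as follows. The coefficients and the intercept are hypothetical values chosen for illustration, not those of our models.

```python
import math

# Hypothetical coefficients: a higher ROA or equity ratio lowers the risk.
INTERCEPT = -2.0
COEFFICIENTS = {"roa_pct": -0.08, "equity_ratio_pct": -0.04}

def risk_with_contributions(company):
    """Return the bankruptcy probability and each variable's share of the log-odds."""
    contributions = {
        name: coefficient * company[name]
        for name, coefficient in COEFFICIENTS.items()
    }
    log_odds = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))  # logistic function
    return probability, contributions

company = {"roa_pct": -5.0, "equity_ratio_pct": 10.0}
probability, contributions = risk_with_contributions(company)
for name, value in contributions.items():
    direction = "raises" if value > 0 else "lowers"
    print(f"{name}: {value:+.2f} ({direction} the risk)")
print(f"bankruptcy probability: {probability:.3f}")
```

Each contribution is simply the coefficient multiplied by the company's value, so the components add up exactly to the log-odds behind the final probability. This additivity is what makes the model easy to explain and visualize.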

SINGLE VARIABLE MODELS

Single variable logistic regression models are simple and provide a clear demonstration of how the explanatory variable affects the risk.

The picture on the right illustrates the connection between profitability (ROA %) and bankruptcy and default risks. As the picture shows, the bankruptcy and default risks of companies with a positive ROA % stay rather low, but just below zero the risk function becomes steeper.

Risk estimates are based on Finnish default and bankruptcy data and the financial statements of nearly 200 000 companies. Companies are grouped into groups large enough to make the risk estimates statistically significant.
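
One simple way to form such groups is to sort the companies by the explanatory variable and cut the sorted list into groups of at least a minimum size. The sketch below assumes this approach for illustration only; the data is made up, and the function name and minimum group size are hypothetical.

```python
def grouped_default_rates(observations, min_group_size):
    """Group (roa_pct, defaulted) pairs by ROA and return per-group statistics.

    Returns a list of (average ROA, observed default rate, group size).
    """
    groups, current = [], []
    for observation in sorted(observations):
        current.append(observation)
        if len(current) >= min_group_size:
            groups.append(current)
            current = []
    if current:  # leftovers are too few to stand alone: merge into the last group
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return [
        (
            sum(roa for roa, _ in group) / len(group),
            sum(defaulted for _, defaulted in group) / len(group),
            len(group),
        )
        for group in groups
    ]

# Made-up observations: (ROA %, 1 if the company defaulted)
observations = [
    (-15.0, 1), (-9.0, 1), (-6.0, 0), (-4.0, 1), (-2.0, 0), (0.0, 0),
    (1.0, 1), (3.0, 0), (5.0, 0), (8.0, 0), (12.0, 0),
]
stats = grouped_default_rates(observations, min_group_size=4)
for average_roa, default_rate, size in stats:
    print(f"avg ROA {average_roa:6.2f} %  default rate {default_rate:.2f}  n={size}")
```

With real data the minimum group size would be far larger than four, so that each group's observed default rate is a reliable estimate.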

MULTI-VARIABLE MODELS

Multi-variable models

A multi-variable model estimates the overall bankruptcy and default risks using several selected components to ensure the best possible accuracy. This overall statistical bankruptcy risk can be divided into components.

We can easily illustrate each variable’s impact on the final bankruptcy risk graphically.

Two-variable models

Single variable models have their advantages, but they give only a limited view of bankruptcy risk, which is why the Valuatum system also uses models with more variables.

A two-variable model gives more accurate bankruptcy and payment default risk estimates than single variable models. Any key figures and ratios can be used in these models.

We can also illustrate the development of the risk of the examined company graphically.

For example, the bankruptcy risk of companies with the same profitability (ROA %) can vary a lot depending on their solvency (equity ratio), and even companies with poor profitability can have a low bankruptcy risk if their solvency is high.
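
The interplay between profitability and solvency can be sketched with a two-variable logistic model. The coefficients below are hypothetical and chosen only to illustrate the effect; they are not taken from our models.

```python
import math

def two_variable_risk(roa_pct, equity_ratio_pct):
    """Bankruptcy probability from ROA % and equity ratio % (hypothetical coefficients)."""
    log_odds = -1.5 - 0.10 * roa_pct - 0.05 * equity_ratio_pct
    return 1.0 / (1.0 + math.exp(-log_odds))

# Same poor profitability (ROA -10 %), increasing solvency
same_roa = [two_variable_risk(-10.0, equity) for equity in (5.0, 30.0, 60.0)]
for equity, risk in zip((5.0, 30.0, 60.0), same_roa):
    print(f"ROA -10 %, equity ratio {equity:4.0f} %: risk {risk:.3f}")

# Poor profitability with high solvency versus good profitability with low solvency
poor_but_solvent = two_variable_risk(-10.0, 60.0)
profitable_but_thin = two_variable_risk(5.0, 5.0)
print(f"poor but solvent: {poor_but_solvent:.3f}, "
      f"profitable but thin: {profitable_but_thin:.3f}")
```

Under these assumed coefficients, the loss-making company with a strong balance sheet ends up with a lower estimated risk than the profitable company with very little equity.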
