Predicting the loan default risk of Czech bank customers using getML

Introduction to relational learning with getML

This notebook demonstrates the application of our relational learning algorithm to predict whether a bank customer will default on their loan. We train the predictor on customer metadata, transaction history, and other successful and unsuccessful loans.

Summary:

  • Prediction type: Binary classification
  • Domain: Finance
  • Prediction target: Loan default
  • Source data: 8 tables, 78.8 MB
  • Population size: 682

Authors: Dr. Johannes King, Dr. Patrick Urbanke

Background

This notebook features a textbook example of predictive analytics applied to the financial sector. A loan is the lending of money to companies or individuals. Banks grant loans in exchange for the promise of repayment. Loan default is defined as the failure to meet this legal obligation, for example, when a home buyer fails to make a mortgage payment. A bank needs to estimate the risk it carries when granting loans to potentially non-performing customers.

The analysis is based on the financial dataset from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015).

A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows for easy data and feature exploration. You can launch the getML monitor here.

Where is this running?

Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

Analysis

Let's get started with the analysis and set up your session:

In [1]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Image, Markdown

plt.style.use("seaborn")
%matplotlib inline

import getml

getml.engine.launch()
getml.engine.set_project("loans")
Launched the getML engine. The log output will be stored in /home/patrick/.getML/logs/20220324104157.log.



Connected to project 'loans'

1. Loading data

1.1 Download from source

Downloading the raw data from the CTU Prague Relational Learning Repository and bringing it into a prediction-ready format takes time. To get to the getML model building as fast as possible, we have prepared the data for you and excluded the preparation code from this notebook. It will be made available in a future version.

In [2]:
population_train, population_test, order, trans, meta = getml.datasets.load_loans(roles=True, units=True)
Loading population_train...
[========================================] 100%

Loading population_test...
[========================================] 100%

Loading order...
[========================================] 100%

Loading trans...
[========================================] 100%

Loading meta...
[========================================] 100%

1.2 Prepare data for getML

The getml.datasets.load_loans method took care of the heavy lifting for us:

  • Downloaded the CSVs from our servers into Python
  • Converted the CSVs into getML DataFrames
  • Set the roles of the columns inside the getML DataFrames

The only thing left is to assign units to the columns that the relational learning algorithm is allowed to compare with one another.
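
For illustration, here is a minimal sketch of how units could be assigned manually had they not already been set by load_loans(units=True); the column names follow the trans table shown further below, and the exact set_unit signature should be checked against the getML documentation:

# Sketch only: load_loans(units=True) has already set these units.
trans.set_unit(["amount", "balance"], "money")
trans.set_unit(["date"], "time stamp", comparison_only=True)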

Data visualization

To simplify the notebook, the original data model (image below) is condensed into 4 tables by resolving the trivial one-to-one and many-to-one joins:

  • A population table population_{train, test}, consisting of the loan and account tables
  • Three peripheral tables: order, trans, and meta
  • The meta table itself is made up of card, client, disp, and district
In [3]:
Image("assets/loans-schema.png", width=500)
Out[3]:

Population table

  • Information on the loan itself (duration, amount, date, ...)
  • Geo-information about the branch where the loan was granted (A**)
  • The column status contains the binary target: levels [A, C] := loan paid back and [B, D] := loan default; we recoded status into our binary target default (a sketch of this recoding follows below)
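
As a hypothetical sketch of that recoding (load_loans already ships the data with the default column in place), the mapping from status to the binary target would look like this in pandas:

# Hypothetical sketch of the upstream recoding:
# [A, C] := loan paid back, [B, D] := loan default.
import pandas as pd

status = pd.Series(["A", "B", "C", "D"])
default = status.isin(["B", "D"]).astype(int)  # 1 := default, 0 := paid back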
In [4]:
population_train.set_role("date_loan", "time_stamp")
population_test.set_role("date_loan", "time_stamp")
In [5]:
population_test
Out[5]:
name date_loan account_id default frequency duration payments amount loan_id district_id date_account status
role time_stamp join_key target categorical numerical numerical numerical unused_float unused_float unused_float unused_string
unit time stamp, comparison only money
0 1996-04-29 19 1  POPLATEK MESICNE 12  2523  30276  4961  21  1995  B
1 1998-10-14 37 1  POPLATEK MESICNE 60  5308  318480  4967  20  1997  D
2 1998-04-19 38 0  POPLATEK TYDNE 48  2307  110736  4968  19  1997  C
3 1997-08-10 97 0  POPLATEK MESICNE 12  8573  102876  4986  74  1996  A
4 1996-11-06 132 0  POPLATEK PO OBRATU 12  7370  88440  4996  40  1996  A
... ... ...  ... ...  ...  ...  ...  ...  ...  ...
218 1995-12-04 11042 0  POPLATEK MESICNE 36  6032  217152  7243  72  1995  A
219 1996-08-20 11054 0  POPLATEK TYDNE 60  2482  148920  7246  59  1996  C
220 1994-01-31 11111 0  POPLATEK MESICNE 36  3004  108144  7259  1  1993  A
221 1998-11-22 11317 0  POPLATEK MESICNE 60  5291  317460  7292  50  1997  C
222 1996-12-27 11362 0  POPLATEK MESICNE 24  5392  129408  7308  67  1995  A

223 rows x 11 columns
memory usage: 0.02 MB
name: population_test
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/loans/population_test/

Peripheral tables

  • meta
    • Meta info about the client (card_type, gender, ...)
    • Geo-information about the client
  • order
    • Permanent orders related to a loan (amount, balance, ...)
  • trans
    • Transactions related to a given loan (amount, ...)

While the contents of meta and order are omitted for brevity here, these are the contents of trans:

In [6]:
trans
Out[6]:
name date account_id type k_symbol bank operation amount balance trans_id account
role time_stamp join_key categorical categorical categorical categorical numerical numerical unused_float unused_string
unit time stamp, comparison only money
0 1995-03-24 1 PRIJEM NULL NULL VKLAD 1000  1000  1 
1 1995-04-13 1 PRIJEM NULL AB PREVOD Z UCTU 3679  4679  5  41403269.0
2 1995-05-13 1 PRIJEM NULL AB PREVOD Z UCTU 3679  20977  6  41403269.0
3 1995-06-13 1 PRIJEM NULL AB PREVOD Z UCTU 3679  26835  7  41403269.0
4 1995-07-13 1 PRIJEM NULL AB PREVOD Z UCTU 3679  30415  8  41403269.0
... ... ... ... ... ... ...  ...  ...  ...
1056315 1998-08-31 10451 PRIJEM UROK NULL NULL 62  17300  3682983 
1056316 1998-09-30 10451 PRIJEM UROK NULL NULL 49  13442  3682984 
1056317 1998-10-31 10451 PRIJEM UROK NULL NULL 34  10118  3682985 
1056318 1998-11-30 10451 PRIJEM UROK NULL NULL 26  8398  3682986 
1056319 1998-12-31 10451 PRIJEM UROK NULL NULL 42  13695  3682987 

1056320 rows x 10 columns
memory usage: 67.20 MB
name: trans
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/loans/trans/

1.3 Define relational model

To start with relational learning, we need to specify an abstract data model. Here, we use the high-level star schema API, which allows us to define the abstract data model and construct a container with the concrete data in one go. While a simple StarSchema works in many cases, it is not sufficient for more complex data models such as snowflake schemas; there, you have to define the data model and construct the container in separate steps, utilizing getML's full-fledged data model and container APIs, respectively.
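
Before turning to the star schema API below, here is a rough sketch of what the equivalent two-step approach would look like; treat the exact calls as assumptions and consult the getML documentation on getml.data.DataModel and getml.data.Container for the authoritative interface:

# Sketch only: the same star schema built in two explicit steps.
# Step 1: abstract data model defined on placeholders (no concrete data).
dm = getml.data.DataModel(population_train.to_placeholder("population"))
dm.add(getml.data.to_placeholder(meta=meta, order=order, trans=trans))

dm.population.join(dm.trans, on="account_id", time_stamps=("date_loan", "date"))
dm.population.join(dm.order, on="account_id")
dm.population.join(dm.meta, on="account_id")

# Step 2: container holding the concrete data frames.
container = getml.data.Container(train=population_train, test=population_test)
container.add(meta=meta, order=order, trans=trans)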

In [7]:
star_schema = getml.data.StarSchema(
    train=population_train, test=population_test, alias="population"
)

star_schema.join(
    trans,
    on="account_id",
    time_stamps=("date_loan", "date"),
)

star_schema.join(
    order,
    on="account_id",
)

star_schema.join(
    meta,
    on="account_id",
)

star_schema
Out[7]:

data model

diagram


[diagram: trans, order, and meta each join population on account_id; the join with trans additionally requires date <= date_loan]

staging

data frames staging table
0 population POPULATION__STAGING_TABLE_1
1 meta META__STAGING_TABLE_2
2 order ORDER__STAGING_TABLE_3
3 trans TRANS__STAGING_TABLE_4

container

population

subset name rows type
0 train population_train 459 DataFrame
1 test population_test 223 DataFrame

peripheral

name rows type
0 trans 1056320 DataFrame
1 order 6471 DataFrame
2 meta 5369 DataFrame
In [8]:
meta
Out[8]:
name account_id type_disp type_card gender A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 disp_id client_id birth_date district_id card_id issued A2
role join_key categorical categorical categorical categorical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical unused_float unused_float unused_float unused_float unused_string unused_string unused_string
0 1 OWNER NULL F south Bohemia 70699  60  13  2  1  4  65.3 8968  2.8 3.35 131  1740  1910  1  1  1970  18  Pisek
1 2 OWNER NULL M Prague 1204953  0  0  0  1  1  100  12541  0.2 0.43 167  85677  99107  2  2  1945  1  Hl.m. Praha
2 2 DISPONENT NULL F Prague 1204953  0  0  0  1  1  100  12541  0.2 0.43 167  85677  99107  3  3  1940  1  Hl.m. Praha
3 3 OWNER NULL M central Bohemia 95616  65  30  4  1  6  51.4 9307  3.8 4.43 118  2616  3040  4  4  1956  5  Kolin
4 3 DISPONENT NULL F central Bohemia 95616  65  30  4  1  6  51.4 9307  3.8 4.43 118  2616  3040  5  5  1960  5  Kolin
... ... ... ... ... ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... ... ...
5364 11349 OWNER NULL F Prague 1204953  0  0  0  1  1  100  12541  0.2 0.43 167  85677  99107  13647  13955  1945  1  Hl.m. Praha
5365 11349 DISPONENT NULL M Prague 1204953  0  0  0  1  1  100  12541  0.2 0.43 167  85677  99107  13648  13956  1943  1  Hl.m. Praha
5366 11359 OWNER classic M south Moravia 117897  139  28  5  1  6  53.8 8814  4.7 5.74 107  2112  2059  13660  13968  1968  61  1247.0 1995-06-13 Trebic
5367 11362 OWNER NULL F north Moravia 106054  38  25  6  2  6  63.1 8110  5.7 6.55 109  3244  3079  13663  13971  1962  67  Bruntal
5368 11382 OWNER NULL F north Moravia 323870  0  0  0  1  1  100  10673  4.7 5.44 100  18782  18347  13690  13998  1953  74  Ostrava - mesto

5369 rows x 25 columns
memory usage: 1.05 MB
name: meta
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/loans/meta/

2. Predictive modeling

We have loaded the data, defined the roles and units, and specified the abstract data model. Next, we create a getML pipeline for relational learning.

2.1 getML Pipeline

Set-up of feature learners, selectors & predictor

In [9]:
mapping = getml.preprocessors.Mapping(min_freq=100)

fast_prop = getml.feature_learning.FastProp(
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,
)

feature_selector = getml.predictors.XGBoostClassifier(n_jobs=1)

# the population is really small, so we set gamma to mitigate overfitting
predictor = getml.predictors.XGBoostClassifier(gamma=2, n_jobs=1)

Build the pipeline

In [10]:
pipe = getml.pipeline.Pipeline(
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=predictor,
)

2.2 Model training

In [11]:
pipe.fit(star_schema.train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


OK.


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Trying 794 features...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

XGBoost: Training as feature selector...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:0m:2.111608

Out[11]:
Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=['XGBoostClassifier'],
         include_categorical=False,
         loss_function=None,
         peripheral=['meta', 'order', 'trans'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['container-o7N3L9'])

url: http://localhost:1709/#/getpipeline/loans/paFZ9f/0/

2.3 Model evaluation

In [12]:
pipe.score(star_schema.test)

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


Out[12]:
date time set used target accuracy auc cross entropy
0 2022-03-24 10:42:21 train default 0.9978 1. 0.06577
1 2022-03-24 10:42:22 test default 0.9686 0.9564 0.13536

2.4 Studying features

Visualizing the learned features

The feature with the highest importance is:

In [13]:
by_importances = pipe.features.sort(by="importances")
by_importances[0].sql
Out[13]:
DROP TABLE IF EXISTS "FEATURE_1_26";

CREATE TABLE "FEATURE_1_26" AS
SELECT Q1( t2."balance" ) AS "feature_1_26",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "TRANS__STAGING_TABLE_4" t2
ON t1."account_id" = t2."account_id"
WHERE t2."date" <= t1."date_loan"
GROUP BY t1.rowid;

Feature correlations

We want to analyze how the features are correlated with the target variable.

In [14]:
names, correlations = pipe.features[:50].correlations()

fig, ax = plt.subplots(figsize=(20, 10))

ax.bar(names, correlations, color="#6829c2")

ax.set_title("feature correlations")
ax.set_xlabel("feature")
ax.set_ylabel("correlation")
ax.tick_params(axis="x", rotation=90)

Feature importances

Feature importances are calculated by analyzing the improvement in predictive accuracy at each node of the trees in the XGBoost predictor. They are then normalized so that all importances add up to 100%.
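
Since the importances are normalized, they should sum to roughly 1.0 (i.e. 100%) across all features. A quick sketch of the check, assuming importances() can be called on the unsliced feature container just as on a slice:

# Sketch: the normalized feature importances should sum to approximately 1.0.
names, importances = pipe.features.importances()
print(f"sum of importances: {sum(importances):.4f}")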

In [15]:
names, importances = pipe.features[:50].importances()

fig, ax = plt.subplots(figsize=(20, 10))

ax.bar(names, importances, color='#6829c2')

ax.set_title("feature importances")
ax.set_xlabel("feature")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)

Column importances

Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.

As we can see, much of the predictive power stems from the account balance. This is unsurprising: people with less money in their bank accounts are more likely to default on their loans.

In [16]:
names, importances = pipe.columns.importances()

fig, ax = plt.subplots(figsize=(20, 10))

ax.bar(names, importances, color="#6829c2")

ax.set_title("column importances")
ax.set_xlabel("column")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)

For reference, the most important feature can also be retrieved through the to_sql() method:

In [17]:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
Out[17]:
DROP TABLE IF EXISTS "FEATURE_1_26";

CREATE TABLE "FEATURE_1_26" AS
SELECT Q1( t2."balance" ) AS "feature_1_26",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "TRANS__STAGING_TABLE_4" t2
ON t1."account_id" = t2."account_id"
WHERE t2."date" <= t1."date_loan"
GROUP BY t1.rowid;

2.5 Productionization

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.

In [18]:
# Creates a folder named loans_pipeline containing
# the SQL code.
pipe.features.to_sql().save("loans_pipeline")
In [19]:
# Creates a folder named loans_pipeline_spark containing
# the SQL code.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("loans_pipeline_spark")
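
As a hedged sketch of what executing the transpiled features could look like, the snippet below uses Python's standard sqlite3 module. It assumes the staging tables (e.g. POPULATION__STAGING_TABLE_1) have already been loaded into the database; getML's getml.sqlite3 module provides dedicated helpers for this workflow.

# Sketch only: run the saved feature scripts against a SQLite database
# that already contains the staging tables.
import pathlib
import sqlite3

conn = sqlite3.connect("loans.db")
for script in sorted(pathlib.Path("loans_pipeline").glob("*.sql")):
    conn.executescript(script.read_text())
conn.commit()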

3. Conclusion

By applying getML to the PKDD'99 Financial dataset, we were able to show the power and relevance of relational learning on a real-world data set. With a training time of under one minute, we outperformed almost all approaches based on manually generated features. This makes getML a prime choice when dealing with complex relational data schemas. The result is independent of the problem domain, since no financial-sector expertise went into this analysis.

The present analysis could be improved in two directions. First, an extensive hyperparameter optimization could further improve the out-of-sample AUC; a sketch follows below. Alternatively, the hyperparameters could be tuned to produce less complex features that perform worse in terms of AUC but are easier for humans to interpret.
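
As a minimal sketch of the first direction, one could manually search over XGBoost's regularization parameter gamma, reusing the building blocks defined above (getML's hyperopt module offers more systematic tooling):

# Sketch only: a naive manual search over gamma to curb overfitting.
for gamma in [1, 2, 4, 8]:
    candidate = getml.pipeline.Pipeline(
        data_model=star_schema.data_model,
        preprocessors=[mapping],
        feature_learners=[fast_prop],
        feature_selectors=[feature_selector],
        predictors=getml.predictors.XGBoostClassifier(gamma=gamma, n_jobs=1),
    )
    candidate.fit(star_schema.train)
    candidate.score(star_schema.test)  # compare the test AUC across gammas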

References

Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

Schulte, Oliver, et al. "A hierarchy of independence assumptions for multi-relational Bayes net classifiers." 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE, 2013.

Next Steps

This tutorial went through the basics of applying getML to relational data.

If you are interested in further real-world applications of getML, head back to the notebook overview and choose one of the remaining examples.

Our documentation contains additional material if you want to learn more about getML.

Get in contact

If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Would you prefer a private demo of getML? Just contact us to make an appointment.