This notebook demonstrates the application of our relational learning algorithm to predict whether a bank customer will default on a loan. We train the predictor on customer metadata, transaction history, and other successful and unsuccessful loans.
Summary:
Author: Dr. Johannes King, Dr. Patrick Urbanke
This notebook features a textbook example of predictive analytics applied to the financial sector. A loan is the lending of money to companies or individuals. Banks grant loans in exchange for the promise of repayment. Loan default is defined as the failure to meet this legal obligation, for example, when a home buyer fails to make a mortgage payment. A bank needs to estimate the risk it carries when granting loans to potentially non-performing customers.
The analysis is based on the financial dataset from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015).
Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.
Let's get started with the analysis and set up your session:
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Image, Markdown
plt.style.use("seaborn")
%matplotlib inline
import getml
getml.engine.launch()
getml.engine.set_project("loans")
Downloading the raw data from the CTU Prague Relational Learning Repository and bringing it into a prediction-ready format takes time. To get to the getML model building as quickly as possible, we prepared the data for you and excluded the code from this notebook. It will be made available in a future version.
population_train, population_test, order, trans, meta = getml.datasets.load_loans(roles=True, units=True)
The getml.datasets.load_loans method took care of the entire data preparation. The only thing left is to assign units to the columns that the relational learning algorithm is allowed to compare with each other.
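Since we loaded the data with units=True, this has already been done for us. For illustration, assigning a unit by hand could look like the following sketch; the column name "amount" is an assumption:

# Illustrative sketch only: load_loans(units=True) has already assigned units.
# Columns sharing the same unit may be compared by the feature learner.
trans.set_unit("amount", "money")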
Data visualization
To simplify the notebook, the original data model (image below) is condensed into 4 tables by resolving the trivial one-to-one and many-to-one joins:
population_{train, test}, consisting of the loan and account tables
order
trans
meta, made up of the card, client, disp and district tables
Image("assets/loans-schema.png", width=500)
status contains the binary target. Levels [A, C] := loan paid back and [B, D] := loan default; we recoded status into our binary target: default.
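The recoding itself is already handled by load_loans; a pandas sketch of the equivalent transformation on a raw status column might look like this:

# Hypothetical sketch: map the raw status levels onto the binary target.
# [A, C] -> paid back (0), [B, D] -> default (1).
raw = pd.DataFrame({"status": ["A", "B", "C", "D"]})
raw["default"] = raw["status"].isin(["B", "D"]).astype(int)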
# mark the loan date as the time stamp, so that only information
# available before the loan was granted is used for prediction
population_train.set_role("date_loan", "time_stamp")
population_test.set_role("date_loan", "time_stamp")
population_test
While the contents of meta and order are omitted for brevity, here are the contents of trans:
trans
To start with relational learning, we need to specify an abstract data model. Here, we use the high-level star schema API, which allows us to define the abstract data model and construct a container with the concrete data in one go. While a simple StarSchema works in many cases, it is not sufficient for more complex data models like snowflake schemas. There, you have to define the data model and construct the container in separate steps, utilizing getML's full-fledged data model and container APIs, respectively.
star_schema = getml.data.StarSchema(
    train=population_train, test=population_test, alias="population"
)

star_schema.join(
    trans,
    on="account_id",
    # only transactions dated before the loan are used, preventing leakage
    time_stamps=("date_loan", "date"),
)

star_schema.join(
    order,
    on="account_id",
)

star_schema.join(
    meta,
    on="account_id",
)
star_schema
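For snowflake-like schemas, the data model and the container would be built in two explicit steps instead. The following is a rough, untested sketch of what that could look like for our (star-shaped) data; the method names follow the getML documentation as we recall it, so consult the docs for the exact API:

# Sketch only: explicit two-step alternative to StarSchema.
# Step 1: define the abstract data model on placeholders.
dm = getml.data.DataModel(population_train.to_placeholder("population"))
dm.add(getml.data.to_placeholder(trans=trans, order=order, meta=meta))
dm.population.join(dm.trans, on="account_id", time_stamps=("date_loan", "date"))
dm.population.join(dm.order, on="account_id")
dm.population.join(dm.meta, on="account_id")

# Step 2: fill the model with concrete data via a container.
container = getml.data.Container(train=population_train, test=population_test)
container.add(trans=trans, order=order, meta=meta)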
We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.
Set-up of feature learners, selectors & predictor
mapping = getml.preprocessors.Mapping(min_freq=100)
fast_prop = getml.feature_learning.FastProp(
aggregation=getml.feature_learning.FastProp.agg_sets.All,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
feature_selector = getml.predictors.XGBoostClassifier(n_jobs=1)
# the population is really small, so we set gamma to mitigate overfitting
predictor = getml.predictors.XGBoostClassifier(gamma=2, n_jobs=1)
Build the pipeline
pipe = getml.pipeline.Pipeline(
data_model=star_schema.data_model,
preprocessors=[mapping],
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=predictor,
)
pipe.fit(star_schema.train)
pipe.score(star_schema.test)
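Beyond the aggregate scores, we can also retrieve the predicted default probability for each loan in the test set; a short sketch using the pipeline's predict method:

# Sketch: predicted probabilities of default for the test set.
probabilities = pipe.predict(star_schema.test)
print(probabilities[:5])  # the first five predicted default probabilities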
Visualizing the learned features
The feature with the highest importance is:
by_importances = pipe.features.sort(by="importances")
by_importances[0].sql
Feature correlations
We want to analyze how the features are correlated with the target variable.
names, correlations = pipe.features[:50].correlations()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, correlations, color="#6829c2")
ax.set_title("feature correlations")
ax.set_xlabel("feature")
ax.set_ylabel("correlation")
ax.tick_params(axis="x", rotation=90)
Feature importances
Feature importances are calculated by analyzing the improvement in predictive accuracy at each node of the trees in the XGBoost predictor. They are then normalized so that all importances add up to 100%.
names, importances = pipe.features[:50].importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances, color="#6829c2")
ax.set_title("feature importances")
ax.set_xlabel("feature")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
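Because the importances are normalized over all features, they should sum to one; a quick sanity check (assuming importances() is also available on the unsliced feature container):

# Sanity check (sketch): the normalized importances should sum to ~1.
_, all_importances = pipe.features.importances()
assert np.isclose(np.sum(all_importances), 1.0)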
Column importances
Because getML uses relational learning, we can apply the same principles used to calculate the feature importances to individual columns as well.
As we can see below, a lot of the predictive power stems from the account balance. This is unsurprising: people with less money in their bank accounts are more likely to default on their loans.
names, importances = pipe.columns.importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances, color="#6829c2")
ax.set_title("column importances")
ax.set_xlabel("column")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
The most important feature looks as follows:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named loans_pipeline containing
# the SQL code.
pipe.features.to_sql().save("loans_pipeline")
# Creates a folder named loans_pipeline_spark containing
# the SQL code.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("loans_pipeline_spark")
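To illustrate how the transpiled features might be consumed downstream, here is a minimal sketch using Python's built-in sqlite3 module; the database path and the script file name are hypothetical:

import sqlite3

# Sketch: run one transpiled feature script against a SQLite database
# that already holds the source tables. "FEATURE_1.sql" is a
# hypothetical file name inside the saved folder.
conn = sqlite3.connect("loans.db")
with open("loans_pipeline/FEATURE_1.sql") as f:
    conn.executescript(f.read())
conn.close()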
By applying getML to the PKDD'99 Financial dataset, we were able to demonstrate the power and relevance of relational learning on a real-world dataset. With a training time below 1 minute, we outperformed almost all approaches based on manually generated features. This makes getML a prime choice when dealing with complex relational data schemas. This result holds independent of the problem domain, since no expertise in the financial sector was used in this analysis.
The present analysis could be improved in two directions. On the one hand, an extensive hyperparameter optimization could further improve the out-of-sample AUC. On the other hand, the hyperparameters could be tuned to produce less complex features that sacrifice some performance (in terms of AUC) but are easier for humans to interpret.
Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
This tutorial went through the basics of applying getML to relational data.
If you are interested in further real-world applications of getML, head back to the notebook overview and choose one of the remaining examples.
Here is some additional material from our documentation if you want to learn more about getML:
If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Would you prefer a private demo of getML? Just contact us to make an appointment.