In this notebook, we compare getML against existing approaches from the relational learning literature on the CORA data set, which is often used for benchmarking. We demonstrate that getML outperforms the state of the art on this data set. Beyond the benchmarking aspects, this notebook showcases getML's excellent capabilities in dealing with categorical data.
Summary:
Author: Dr. Patrick Urbanke
CORA is a well-known benchmarking dataset in the academic literature on relational learning. The dataset contains 2708 scientific publications on machine learning. The papers are divided into 7 categories. The challenge is to predict the category of a paper based on the papers it cites, the papers it is cited by and keywords contained in the paper.
It has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).
Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.
Let's get started with the analysis and set up your session:
import copy
import os
from urllib import request
import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline
import getml
getml.engine.launch()
getml.engine.set_project('cora')
We begin by downloading the data from the source database:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="CORA",
    port=3306,
    user="guest",
    password="relational"
)
conn
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")
paper
cites
content
getML requires that we define roles for each of the columns.
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
paper
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
cites
We do the same for the content table:
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)
content
The goal is to predict which of seven class labels applies to a paper. We generate a target column for each of those labels. We also have to separate the data set into a training set and a testing set.
data_full = getml.data.make_target_columns(paper, "class_label")
data_full
split = getml.data.split.random(train=0.7, test=0.3, validation=0.0)
split
container = getml.data.Container(population=data_full, split=split)
container.add(cites=cites, content=content, paper=paper)
container.freeze()
container
We have loaded the data and defined the roles. Next, we create a getML data model and pipeline for relational learning.
To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.
That is because the class label can be predicted using three different pieces of information:

1. the keywords contained in the paper itself,
2. the keywords contained in the papers it cites and in the papers citing it,
3. the class labels of the papers it cites and of the papers citing it.
The main challenge here is that cites is used twice: once to connect the cited papers and once to connect the citing papers. To resolve this, we need two placeholders on cites.
dm = getml.data.DataModel(paper.to_placeholder("population"))
# We need two different placeholders for cites.
dm.add(getml.data.to_placeholder(cites=[cites]*2, content=content, paper=paper))
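# First branch: the papers that cite the target paper, their keywords and their class labels.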
dm.population.join(
    dm.cites[0],
    on=('paper_id', 'cited_paper_id')
)
dm.cites[0].join(
    dm.content,
    on=('citing_paper_id', 'paper_id')
)
dm.cites[0].join(
    dm.paper,
    on=('citing_paper_id', 'paper_id'),
    relationship=getml.data.relationship.many_to_one
)
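# Second branch: the papers cited by the target paper, their keywords and their class labels.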
dm.population.join(
    dm.cites[1],
    on=('paper_id', 'citing_paper_id')
)
dm.cites[1].join(
    dm.content,
    on=('cited_paper_id', 'paper_id')
)
dm.cites[1].join(
    dm.paper,
    on=('cited_paper_id', 'paper_id'),
    relationship=getml.data.relationship.many_to_one
)
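# Third branch: the keywords contained in the target paper itself.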
dm.population.join(
    dm.content,
    on='paper_id'
)
dm
Set up the feature learners & predictor
We use the FastProp and Relboost algorithms for this problem. Because of the large number of keywords, we regularize the Relboost model a bit by requiring a minimum support for the keywords (min_num_samples).
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1
)
relboost = getml.feature_learning.Relboost(
    num_features=10,
    num_subfeatures=10,
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    seed=4367,
    num_threads=1,
    min_num_samples=30
)
predictor = getml.predictors.XGBoostClassifier()
Build the pipelines
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor]
)
pipe1
pipe2 = getml.pipeline.Pipeline(
    tags=['relboost'],
    data_model=dm,
    feature_learners=[relboost],
    predictors=[predictor]
)
pipe2
pipe1.check(container.train)
pipe1.fit(container.train)
pipe2.check(container.train)
The training process may seem a bit intimidating. That is because the Relboost algorithm needs to train a separate model for each class label. This is due to the nature of the generated features.
pipe2.fit(container.train)
pipe1.score(container.test)
pipe2.score(container.test)
To make things a bit easier, we look only at the results on the test set.
pipe1.scores.filter(lambda score: score.set_used == "test")
pipe2.scores.filter(lambda score: score.set_used == "test")
We take the average of the AUC values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).
print(np.mean(pipe1.auc))
print(np.mean(pipe2.auc))
The accuracy for multiple targets can be calculated using one of two methods. The first method is to simply take the average of the per-target accuracy values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).
print(np.mean(pipe1.accuracy))
print(np.mean(pipe2.accuracy))
However, the benchmarking papers actually use a different approach:
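# Predicted probabilities for each of the seven class labels, for both pipelines.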
probabilities1 = pipe1.predict(container.test)
probabilities2 = pipe2.predict(container.test)
class_label = paper.class_label.unique()
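# For each paper, pick the class label with the highest predicted probability.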
ix_max = np.argmax(probabilities1, axis=1)
predicted_labels1 = np.asarray([class_label[ix] for ix in ix_max])
ix_max = np.argmax(probabilities2, axis=1)
predicted_labels2 = np.asarray([class_label[ix] for ix in ix_max])
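# The actual class labels of the papers in the test set.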
actual_labels = paper[split == "test"].class_label.to_numpy()
print("Share of accurately predicted class labels (pipe1):")
print((actual_labels == predicted_labels1).sum() / len(actual_labels))
print()
print("Share of accurately predicted class labels (pipe2):")
print((actual_labels == predicted_labels2).sum() / len(actual_labels))
print()
Since this is the method the benchmark papers use, this is the accuracy score we will report as well.
Feature correlations
We want to analyze how the features are correlated with the target variables.
TARGET_NUM = 0
names, correlations = pipe2.features.correlations(target_num=TARGET_NUM)
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations, color='#6829c2')
plt.title('Feature correlations with class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Correlations')
plt.xticks(rotation='vertical')
plt.show()
Feature importances
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.
names, importances = pipe2.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances, color='#6829c2')
plt.title('Feature importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
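As a quick sanity check of the normalization described above, we can sum the importances. Assuming the API returns them as fractions, the total should be close to 1.0, i.e. 100%:
# Sanity check (assumption): the importances are returned as normalized
# fractions, so their sum should be close to 1.0, i.e. 100%.
print(np.sum(importances))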
Column importances
Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.
names, importances = pipe2.columns.importances(target_num=TARGET_NUM)
plt.subplots(figsize=(20, 10))
plt.bar(names, importances, color='#6829c2')
plt.title('Column importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
The most important features look as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[0].name]
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder containing the SQL code.
pipe1.features.to_sql().save("cora_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("cora_spark")
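As a minimal sketch of how the transpiled features could be consumed downstream, the saved scripts can be executed against a SQLite3 database using nothing but Python's standard library. The folder layout (one .sql script per feature), the database file name cora.db and the presence of the source tables in that database are assumptions here; for a fully supported workflow, refer to getML's sqlite3 module.
import pathlib
import sqlite3  # Python's built-in SQLite3 driver

# Assumptions: "cora.db" already contains the source tables and the scripts
# in "cora_pipeline" are written in a SQLite3-compatible dialect.
db = sqlite3.connect("cora.db")
for script in sorted(pathlib.Path("cora_pipeline").glob("*.sql")):
    db.executescript(script.read_text())
db.commit()
db.close()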
State-of-the-art approaches on this data set perform as follows:
Approach | Study | Accuracy | AUC |
---|---|---|---|
RelF | Dinh et al (2012) | 85.7% | -- |
LBP | Dinh et al (2012) | 85.0% | -- |
EPRN | Preisach and Thieme (2006) | 84.0% | -- |
PRN | Preisach and Thieme (2006) | 81.0% | -- |
ACORA | Perlich and Provost (2006) | -- | 97.0% |
As the following table shows, the performance of the pipelines trained in this notebook compares favorably to these benchmarks:
Approach | Accuracy | AUC |
---|---|---|
FastProp | 89.9% | 98.5% |
Relboost | 89.9% | 98.3% |
In this notebook we have demonstrated that getML outperforms state-of-the-art relational learning algorithms on the CORA dataset.
Dinh, Quang-Thang, Christel Vrain, and Matthieu Exbrayat. "A Link-Based Method for Propositionalization." ILP (Late Breaking Papers). 2012.
Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
Perlich, Claudia, and Foster Provost. "Distribution-based aggregation for relational learning with identifier attributes." Machine Learning 62.1-2 (2006): 65-105.
Preisach, Christine, and Lars Schmidt-Thieme. "Relational ensemble classification." Sixth International Conference on Data Mining (ICDM'06). IEEE, 2006.
This tutorial benchmarked getML against academic state-of-the-art algorithms from the relational learning literature and demonstrated getML's capabilities with respect to categorical data.
If you are interested in further real-world applications of getML, head back to the notebook overview and choose one of the remaining examples.
Here is some additional material from our documentation if you want to learn more about getML:
If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Would you prefer a private demo of getML? Just contact us to make an appointment.