MovieLens - Predicting a user's gender based on the movies they have watched

NOTE: Due to the size of the dataset, this notebook will not run on MyBinder.

In this notebook, we will apply getML to a dataset that is often used for benchmarking in the relational learning literature: The MovieLens dataset.

Summary:

  • Prediction type: Classification model
  • Domain: Entertainment
  • Prediction target: The gender of a user
  • Population size: 6039

Author: Dr. Patrick Urbanke

Background

The MovieLens dataset is often used in the relational learning literature as a benchmark for newly developed algorithms. Following this tradition, we benchmark getML's own algorithms on this dataset as well. The task is to predict a user's gender based on the movies they have watched.

It has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows for easy data and feature exploration. You can open the getML monitor in your browser at http://localhost:1709.

Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml

getml.engine.launch()
getml.engine.set_project('MovieLens')
Launched the getML engine. The log output will be stored in /home/patrick/.getML/logs/20220320122747.log.


Loading pipelines...
[========================================] 100%


Connected to project 'MovieLens'

1. Loading data

1.1 Download from source

We begin by downloading the data from the source file:

In [2]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="imdb_MovieLens",
    port=3306,
    user="guest",
    password="relational"
)

conn
Out[2]:
Connection(conn_id='default',
           dbname='imdb_MovieLens',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)
In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
In [4]:
users = load_if_needed("users")
u2base = load_if_needed("u2base")
movies = load_if_needed("movies")
movies2directors = load_if_needed("movies2directors")
directors = load_if_needed("directors")
movies2actors = load_if_needed("movies2actors")
actors = load_if_needed("actors")

1.2 Prepare data for getML

getML requires that we define roles for each of the columns.

In [5]:
users["target"] = (users.u_gender == 'F')
In [6]:
users.set_role("userid", getml.data.roles.join_key)
users.set_role("age", getml.data.roles.numerical)
users.set_role("occupation", getml.data.roles.categorical)
users.set_role("target", getml.data.roles.target)

users.save()
Out[6]:
name userid target occupation age u_gender
role join_key target categorical numerical unused_string
0 1 1  2 1  F
1 51 1  2 1  F
2 75 1  2 1  F
3 86 1  2 1  F
4 99 1  2 1  F
... ...  ... ...  ...
6034 5658 0  5 56  M
6035 5669 0  5 56  M
6036 5703 0  5 56  M
6037 5948 0  5 56  M
6038 5980 0  5 56  M

6039 rows x 5 columns
memory usage: 0.21 MB
name: users
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/MovieLens/users/

In [7]:
u2base.set_role(["userid", "movieid"], getml.data.roles.join_key)
u2base.set_role("rating", getml.data.roles.numerical)

u2base.save()
Out[7]:
name userid movieid rating
role join_key join_key numerical
0 2 1964242 1 
1 2 2219779 1 
2 3 1856939 1 
3 4 2273044 1 
4 5 1681655 1 
... ... ... 
996154 6040 2560616 5 
996155 6040 2564194 5 
996156 6040 2581228 5 
996157 6040 2581428 5 
996158 6040 2593112 5 

996159 rows x 3 columns
memory usage: 15.94 MB
name: u2base
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/MovieLens/u2base/

In [8]:
movies.set_role("movieid", getml.data.roles.join_key)
movies.set_role(["year", "runningtime"], getml.data.roles.numerical)
movies.set_role(["isEnglish", "country"], getml.data.roles.categorical)

movies.save()
Out[8]:
name movieid isEnglish country year runningtime
role join_key categorical categorical numerical numerical
0 1672052 T other 3  2 
1 1672111 T other 4  2 
2 1672580 T USA 4  3 
3 1672716 T USA 4  2 
4 1672946 T USA 4  0 
... ... ... ...  ... 
3827 2591814 T other 4  2 
3828 2592334 T USA 4  2 
3829 2592963 F France 2  2 
3830 2593112 T USA 4  1 
3831 2593313 F other 4  3 

3832 rows x 5 columns
memory usage: 0.11 MB
name: movies
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/MovieLens/movies/

In [9]:
movies2directors.set_role(["movieid", "directorid"], getml.data.roles.join_key)
movies2directors.set_role("genre", getml.data.roles.categorical)

movies2directors.save()
Out[9]:
name movieid directorid genre
role join_key join_key categorical
0 1672111 54934 Action
1 1672946 188940 Action
2 1679461 179783 Action
3 1691387 291700 Action
4 1693305 14663 Action
... ... ...
4136 2570825 265215 Other
4137 2572478 149311 Other
4138 2577062 304827 Other
4139 2590181 270707 Other
4140 2591814 57348 Other

4141 rows x 3 columns
memory usage: 0.05 MB
name: movies2directors
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/MovieLens/movies2directors/

In [10]:
directors.set_role("directorid", getml.data.roles.join_key)
directors.set_role(["d_quality", "avg_revenue"], getml.data.roles.numerical)

directors.save()
Out[10]:
name directorid d_quality avg_revenue
role join_key numerical numerical
0 67 4  1 
1 92 2  3 
2 284 4  0 
3 708 4  1 
4 746 4  4 
... ...  ... 
2196 305962 4  4 
2197 305978 4  2 
2198 306168 3  2 
2199 306343 4  1 
2200 306351 4  1 

2201 rows x 3 columns
memory usage: 0.04 MB
name: directors
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/MovieLens/directors/

In [11]:
movies2actors.set_role(["movieid", "actorid"], getml.data.roles.join_key)
movies2actors.set_role("cast_num", getml.data.roles.numerical)

movies2actors.save()
Out[11]:
name movieid actorid cast_num
role join_key join_key numerical
0 1672580 981535 0 
1 1672946 1094968 0 
2 1673647 149985 0 
3 1673647 261595 0 
4 1673647 781357 0 
... ... ... 
138344 2593313 947005 3 
138345 2593313 1090590 3 
138346 2593313 1347419 3 
138347 2593313 2099917 3 
138348 2593313 2633550 3 

138349 rows x 3 columns
memory usage: 2.21 MB
name: movies2actors
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/MovieLens/movies2actors/

We also assign roles to the actors table. Afterwards, we separate the data set into a training set and a testing set:

In [12]:
actors.set_role("actorid", getml.data.roles.join_key)
actors.set_role("a_quality", getml.data.roles.numerical)
actors.set_role("a_gender", getml.data.roles.categorical)

actors.save()
Out[12]:
name actorid a_gender a_quality
role join_key categorical numerical
0 4 M 4 
1 16 M 0 
2 28 M 4 
3 566 M 4 
4 580 M 4 
... ... ... 
98685 2749162 F 3 
98686 2749168 F 3 
98687 2749204 F 3 
98688 2749377 F 4 
98689 2749386 F 4 

98690 rows x 3 columns
memory usage: 1.58 MB
name: actors
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/MovieLens/actors/

In [13]:
split = getml.data.split.random(train=0.75, test=0.25)
split
Out[13]:
0 train
1 train
2 train
3 test
4 test
...

infinite number of rows
type: StringColumnView

In [14]:
container = getml.data.Container(population=users, split=split)

container.add(
    u2base=u2base,
    movies=movies,
    movies2directors=movies2directors,
    directors=directors,
    movies2actors=movies2actors,
    actors=actors,
)

container
Out[14]:

population

subset name rows type
0 test users 1511 View
1 train users 4528 View

peripheral

name rows type
0 u2base 996159 DataFrame
1 movies 3832 DataFrame
2 movies2directors 4141 DataFrame
3 directors 2201 DataFrame
4 movies2actors 138349 DataFrame
5 actors 98690 DataFrame

2. Predictive modeling

We have loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.

2.1 Define relational model

To get started with relational learning, we need to specify the data model.

In [15]:
dm = getml.data.DataModel(users.to_placeholder())

dm.add(getml.data.to_placeholder(
    u2base=u2base,
    movies=movies,
    movies2directors=movies2directors,
    directors=directors,
    movies2actors=movies2actors,
    actors=actors,
))

dm.population.join(
    dm.u2base,
    on='userid'
)

dm.u2base.join(
    dm.movies,
    on='movieid',
    relationship=getml.data.relationship.many_to_one
)

dm.movies.join(
    dm.movies2directors,
    on='movieid',
    relationship=getml.data.relationship.propositionalization
)

dm.movies2directors.join(
    dm.directors,
    on='directorid',
    relationship=getml.data.relationship.many_to_one
)

dm.movies.join(
    dm.movies2actors,
    on='movieid',
    relationship=getml.data.relationship.propositionalization
)

dm.movies2actors.join(
    dm.actors,
    on='actorid',
    relationship=getml.data.relationship.many_to_one
)

dm
Out[15]:

diagram


[Data model diagram: users is joined to u2base on userid; u2base to movies on movieid (many-to-one); movies to movies2directors and movies2actors on movieid (propositionalization); movies2directors to directors on directorid (many-to-one); movies2actors to actors on actorid (many-to-one).]

staging

data frames staging table
0 users USERS__STAGING_TABLE_1
1 movies2actors, actors MOVIES2ACTORS__STAGING_TABLE_2
2 movies2directors, directors MOVIES2DIRECTORS__STAGING_TABLE_3
3 u2base, movies U2BASE__STAGING_TABLE_4

2.2 getML pipeline

Set up the feature learner & predictor

We will set up two pipelines: one uses FastProp, the other uses Relboost. Note that we have marked some of the joins in the data model as propositionalization relationships. This means that FastProp will be used for these joins, even in the second pipeline, which can significantly speed up training.

In [16]:
mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,
)

relboost = getml.feature_learning.Relboost(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_subfeatures=50,
    num_threads=1
)

predictor = getml.predictors.XGBoostClassifier(
    max_depth=5,
    n_jobs=1,
)

Build the pipeline

In [17]:
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor]
)

pipe1
Out[17]:
Pipeline(data_model='users',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors',
                     'u2base'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['fast_prop'])
In [18]:
pipe2 = getml.pipeline.Pipeline(
    tags=['relboost'],
    data_model=dm,
    preprocessors=[mapping],
    feature_learners=[relboost],
    predictors=[predictor]
)

pipe2
Out[18]:
Pipeline(data_model='users',
         feature_learners=['Relboost'],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors',
                     'u2base'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['relboost'])

2.3 Model training

In [19]:
pipe1.check(container.train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2DIRECTORS__STAGING_TABLE_3 over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2ACTORS__STAGING_TABLE_2 over 'movieid' and 'movieid', there are no corresponding entries for 0.340408% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.
In [20]:
pipe1.fit(container.train)
Checking data model...


Staging...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2DIRECTORS__STAGING_TABLE_3 over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2ACTORS__STAGING_TABLE_2 over 'movieid' and 'movieid', there are no corresponding entries for 0.340408% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Trying 941 features...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:9m:52.645106

Out[20]:
Pipeline(data_model='users',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors',
                     'u2base'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['fast_prop', 'container-ixJT2v'])

url: http://localhost:1709/#/getpipeline/MovieLens/46EfC1/0/
In [21]:
pipe2.check(container.train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2DIRECTORS__STAGING_TABLE_3 over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2ACTORS__STAGING_TABLE_2 over 'movieid' and 'movieid', there are no corresponding entries for 0.340408% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.
In [22]:
pipe2.fit(container.train)
Checking data model...


Staging...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2DIRECTORS__STAGING_TABLE_3 over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining U2BASE__STAGING_TABLE_4 and MOVIES2ACTORS__STAGING_TABLE_2 over 'movieid' and 'movieid', there are no corresponding entries for 0.340408% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys.


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

Relboost: Training features...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

Relboost: Building features...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 1h:16m:23.545015

Out[22]:
Pipeline(data_model='users',
         feature_learners=['Relboost'],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors',
                     'u2base'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['relboost', 'container-ixJT2v'])

url: http://localhost:1709/#/getpipeline/MovieLens/oQi9pX/0/

2.4 Model evaluation

In [23]:
pipe1.score(container.test)

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


Out[23]:
date time set used target accuracy auc cross entropy
0 2022-03-20 12:38:25 train target 0.9114 0.9658 0.2847
1 2022-03-20 13:58:37 test target 0.7776 0.7896 0.4757
In [24]:
pipe2.score(container.test)

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

Relboost: Building features...
[========================================] 100%


Out[24]:
date time set used target accuracy auc cross entropy
0 2022-03-20 13:54:53 train target 0.9691 0.9948 0.1577
1 2022-03-20 14:08:11 test target 0.816 0.8368 0.4398
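
Beyond the aggregate scores, we can also generate predictions for the test set directly. Below is a minimal sketch, under the assumption that Pipeline.predict accepts a container subset in the same way as Pipeline.score does; the call then returns the predicted probability that a user is female for each row of the test set.

# Minimal sketch (assumption: Pipeline.predict accepts a container subset,
# just like Pipeline.score above): predicted probabilities that a user is female.
probabilities = pipe2.predict(container.test)

# Convert the probabilities into hard class labels at a 0.5 threshold.
predicted_female = probabilities > 0.5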

2.5 Studying features

Column importances

Because getML uses relational learning, importances can be calculated not only for the generated features, but also for the individual columns of the original tables.

As we can see, most of the predictive accuracy is drawn from the columns describing the actors appearing in the movies a user has rated.

In [25]:
names, importances = pipe1.columns.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Column importances')
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
In [26]:
names, importances = pipe2.columns.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Column importances')
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
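
If a tabular view is more convenient than the bar plots, the same columns.importances() call can be used to print the top columns directly. A minimal sketch:

# Print the ten most important columns for each pipeline,
# reusing the columns.importances() call from the cells above.
for label, pipe in [("fast_prop", pipe1), ("relboost", pipe2)]:
    names, importances = pipe.columns.importances()
    top = sorted(zip(names, importances), key=lambda pair: -pair[1])[:10]
    print(f"--- {label} ---")
    for name, importance in top:
        print(f"{name}: {importance:.4f}")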

2.6 Features

The most important features look as follows:

In [27]:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
Out[27]:
DROP TABLE IF EXISTS "FEATURE_1_138";

CREATE TABLE "FEATURE_1_138" AS
SELECT MEDIAN( COALESCE( f_1_1_69."feature_1_1_69", 0.0 ) ) AS "feature_1_138",
       t1.rowid AS "rownum"
FROM "USERS__STAGING_TABLE_1" t1
INNER JOIN "U2BASE__STAGING_TABLE_4" t2
ON t1."userid" = t2."userid"
LEFT JOIN "FEATURE_1_1_69" f_1_1_69
ON t2.rowid = f_1_1_69."rownum"
GROUP BY t1.rowid;
In [28]:
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[0].name]
Out[28]:
DROP TABLE IF EXISTS "FEATURE_1_1";

CREATE TABLE "FEATURE_1_1" AS
SELECT AVG( 
    CASE
        WHEN ( p_1_1."feature_1_1_69" > 0.242159 ) AND ( p_1_1."feature_1_1_21" > 0.232813 ) AND ( t2."t3__year__mapping_1_target_1_avg" > 0.282119 ) THEN 20.46317569156853
        WHEN ( p_1_1."feature_1_1_69" > 0.242159 ) AND ( p_1_1."feature_1_1_21" > 0.232813 ) AND ( t2."t3__year__mapping_1_target_1_avg" <= 0.282119 OR t2."t3__year__mapping_1_target_1_avg" IS NULL ) THEN 7.321538279840953
        WHEN ( p_1_1."feature_1_1_69" > 0.242159 ) AND ( p_1_1."feature_1_1_21" <= 0.232813 OR p_1_1."feature_1_1_21" IS NULL ) AND ( p_1_1."feature_1_1_69" > 0.243429 ) THEN 5.046599618766721
        WHEN ( p_1_1."feature_1_1_69" > 0.242159 ) AND ( p_1_1."feature_1_1_21" <= 0.232813 OR p_1_1."feature_1_1_21" IS NULL ) AND ( p_1_1."feature_1_1_69" <= 0.243429 OR p_1_1."feature_1_1_69" IS NULL ) THEN -8.250725468943104
        WHEN ( p_1_1."feature_1_1_69" <= 0.242159 OR p_1_1."feature_1_1_69" IS NULL ) AND ( p_1_1."feature_1_1_21" > 0.273123 ) AND ( p_1_1."feature_1_1_76" > 0.008673 ) THEN -3.885674068832839
        WHEN ( p_1_1."feature_1_1_69" <= 0.242159 OR p_1_1."feature_1_1_69" IS NULL ) AND ( p_1_1."feature_1_1_21" > 0.273123 ) AND ( p_1_1."feature_1_1_76" <= 0.008673 OR p_1_1."feature_1_1_76" IS NULL ) THEN -12.86974979841147
        WHEN ( p_1_1."feature_1_1_69" <= 0.242159 OR p_1_1."feature_1_1_69" IS NULL ) AND ( p_1_1."feature_1_1_21" <= 0.273123 OR p_1_1."feature_1_1_21" IS NULL ) AND ( p_1_1."feature_1_1_85" > 0.003477 ) THEN 26.50336909269918
        WHEN ( p_1_1."feature_1_1_69" <= 0.242159 OR p_1_1."feature_1_1_69" IS NULL ) AND ( p_1_1."feature_1_1_21" <= 0.273123 OR p_1_1."feature_1_1_21" IS NULL ) AND ( p_1_1."feature_1_1_85" <= 0.003477 OR p_1_1."feature_1_1_85" IS NULL ) THEN -2.663699179011978
        ELSE NULL
    END
) AS "feature_1_1",
     t1.rowid AS "rownum"
FROM "USERS__STAGING_TABLE_1" t1
INNER JOIN "U2BASE__STAGING_TABLE_4" t2
ON t1."userid" = t2."userid"
LEFT JOIN "FEATURES_1_1_PROPOSITIONALIZATION" p_1_1
ON t2.rowid = p_1_1."rownum"
GROUP BY t1.rowid;

2.7 Productionization

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.

In [29]:
# Creates a folder named movie_lens_pipeline containing
# the SQL code.
pipe2.features.to_sql().save("movie_lens_pipeline")
In [30]:
pipe2.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("movie_lens_spark")
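
To illustrate how the exported code might be consumed downstream, here is a minimal sketch using Python's built-in sqlite3 module. It assumes that the movie_lens_pipeline folder written above contains one plain .sql script per feature and that the staging tables referenced by those scripts already exist in the target database; for a fully supported workflow, refer to getML's sqlite3 and spark modules mentioned above.

import glob
import sqlite3

# Assumption: the folder created by .save() above contains plain .sql scripts
# and the staging tables they reference already exist in the database.
conn_sqlite = sqlite3.connect("movie_lens.db")

for path in sorted(glob.glob("movie_lens_pipeline/*.sql")):
    with open(path, encoding="utf-8") as sql_file:
        conn_sqlite.executescript(sql_file.read())

conn_sqlite.commit()
conn_sqlite.close()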

2.8 Benchmarks

State-of-the-art approaches on this dataset perform as follows:

Approach Study Accuracy AUC
Probabilistic Relational Model Ghanem (2009) -- 69.2%
Multi-Relational Bayesian Network Schulte and Khosravi (2012) 69% --
Multi-Relational Bayesian Network Schulte et al. (2013) 66% --

By contrast, getML's algorithms, as used in this notebook, perform as follows:

Approach Accuracy AUC
FastProp 77.8% 79.0%
Relboost 81.6% 83.7%

3. Conclusion

In this notebook, we have demonstrated how getML can be applied to the MovieLens dataset and shown that our approach outperforms state-of-the-art relational learning algorithms.

Citations

Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

Ghanem, Amal S. "Probabilistic models for mining imbalanced relational data." Doctoral dissertation, Curtin University (2009).

Schulte, Oliver, and Hassan Khosravi. "Learning graphical models for relational data via lattice search." Machine Learning 88.3 (2012): 331-368.

Schulte, Oliver, et al. "A hierarchy of independence assumptions for multi-relational Bayes net classifiers." 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE, 2013.

Next Steps

This tutorial went through the basics of applying getML to relational data. If you want to learn more about getML, have a look at the additional tutorials and the user guides in our documentation.

Get in contact

If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Would you prefer a private demo of getML? Just contact us to make an appointment.