IMDb - Predicting actors' gender using getML

Note that due to memory limitations, this notebook will not run on MyBinder.

In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.

The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.

As an example data set, we use the Internet Movie Database, which has been used by previous studies in the relational learning literature. This allows us to benchmark our approach to state-of-the-art algorithms in the relational learning literature. We demonstrate that getML outperforms these state-of-the-art algorithms.

Summary:

  • Prediction type: Classification model
  • Domain: Entertainment
  • Prediction target: The gender of an actor
  • Population size: 817718

Author: Dr. Patrick Urbanke

Background

The data set contains about 800,000 actors. The goal is to predict the gender of said actors based on other information we have about them, such as the movies they have participated in and the roles they have played in these movies.

It has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

A web frontend for getML

The getML monitor is a frontend built to support your work with getML. The getML monitor displays information such as the imported data frames, trained pipelines and allows easy data and feature exploration. You can launch the getML monitor here.

Where is this running?

Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml
from pyspark.sql import SparkSession

getml.engine.launch()
getml.engine.set_project('imdb')
Launched the getML engine. The log output will be stored in /home/patrick/.getML/logs/20220322172429.log.


Loading pipelines...
[========================================] 100%


Connected to project 'imdb'

In the following, we set some flags that affect execution of the notebook:

  • We don't let the algorithms utilize the information on actors' first names (see below for an explanation).
In [2]:
USE_FIRST_NAMES = False
RUN_SPARK = False

1. Loading data

1.1 Download from source

We begin by downloading the data from the source file:

In [3]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="imdb_ijs",
    port=3306,
    user="guest",
    password="relational"
)

conn
Out[3]:
Connection(conn_id='default',
           dbname='imdb_ijs',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)
In [4]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if getml.data.exists(name):
        return getml.data.load_data_frame(name)
    data_frame = getml.data.DataFrame.from_db(
        name=name,
        table_name=name,
        conn=conn
    )
    data_frame.save()
    return data_frame
In [5]:
actors = load_if_needed("actors")
roles = load_if_needed("roles")
movies = load_if_needed("movies")
movies_genres = load_if_needed("movies_genres")
In [6]:
actors
Out[6]:
name id first_name last_name gender
role unused_float unused_string unused_string unused_string
0 2  Michael 'babeepower' Viera M
1 3  Eloy 'Chincheta' M
2 4  Dieguito 'El Cigala' M
3 5  Antonio 'El de Chipiona' M
4 6  José 'El Francés' M
...  ... ... ...
817713 845461  Herdís Þorvaldsdóttir F
817714 845462  Katla Margrét Þorvaldsdóttir F
817715 845463  Lilja Nótt Þórarinsdóttir F
817716 845464  Hólmfríður Þórhallsdóttir F
817717 845465  Theódóra Þórðardóttir F

817718 rows x 4 columns
memory usage: 40.22 MB
name: actors
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/actors/

In [7]:
roles
Out[7]:
name actor_id movie_id role
role unused_float unused_float unused_string
0 2  280088  Stevie
1 2  396232  Various/lyricist
2 3  376687  Gitano 1
3 4  336265  El Cigala
4 5  135644  Himself
...  ...  ...
3431961 845461  137097  Kata
3431962 845462  208838  Magga
3431963 845463  870  Gunna
3431964 845464  378123  Gudrun
3431965 845465  378123 

3431966 rows x 3 columns
memory usage: 115.41 MB
name: roles
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/roles/

In [8]:
movies
Out[8]:
name id year rank name
role unused_float unused_float unused_float unused_string
0 0  2002  nan  #28
1 1  2000  nan  #7 Train: An Immigrant Journey, ...
2 2  1971  6.4 $
3 3  1913  nan  $1,000 Reward
4 4  1915  nan  $1,000 Reward
...  ...  ...  ...
388264 412316  1991  nan  "zem blch krlu"
388265 412317  1995  nan  "rgammk"
388266 412318  2002  nan  "zgnm Leyla"
388267 412319  1983  nan  " Istanbul"
388268 412320  1958  nan  "sterreich"

388269 rows x 4 columns
memory usage: 19.92 MB
name: movies
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/movies/

In [9]:
movies_genres
Out[9]:
name movie_id genre
role unused_float unused_string
0 1  Documentary
1 1  Short
2 2  Comedy
3 2  Crime
4 5  Western
...  ...
395114 378612  Adventure
395115 378612  Drama
395116 378613  Comedy
395117 378613  Drama
395118 378614  Comedy

395119 rows x 2 columns
memory usage: 9.24 MB
name: movies_genres
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/movies_genres/

1.2 Prepare data for getML

getML requires that we define roles for each of the columns.

In [10]:
actors["target"] = (actors.gender == 'F')
actors.set_role("id", getml.data.roles.join_key)
actors.set_role("target", getml.data.roles.target)

The benchmark studies do not state clearly, whether it is fair game to use the first names of the actors. Using the first names, we can easily increase the predictive accuracy to above 90%. However, when doing so the problem basically becomes a first name identification problem rather than a relational learning problem. This would undermine the point of this notebook: Showcase relational learning. Therefore, our assumption is that using the first names is not allowed. Feel free to set this flag above to see how well getML incoporates such starightforward information into its feature logic.

In [11]:
if USE_FIRST_NAMES:
    actors.set_role("first_name", getml.data.roles.text)
actors
Out[11]:
name id target first_name last_name gender
role join_key target unused_string unused_string unused_string
0 2 0  Michael 'babeepower' Viera M
1 3 0  Eloy 'Chincheta' M
2 4 0  Dieguito 'El Cigala' M
3 5 0  Antonio 'El de Chipiona' M
4 6 0  José 'El Francés' M
... ...  ... ... ...
817713 845461 1  Herdís Þorvaldsdóttir F
817714 845462 1  Katla Margrét Þorvaldsdóttir F
817715 845463 1  Lilja Nótt Þórarinsdóttir F
817716 845464 1  Hólmfríður Þórhallsdóttir F
817717 845465 1  Theódóra Þórðardóttir F

817718 rows x 5 columns
memory usage: 43.49 MB
name: actors
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/actors/

In [12]:
roles.set_role(["actor_id", "movie_id"], getml.data.roles.join_key)
roles.set_role("role", getml.data.roles.text)
roles
Out[12]:
name actor_id movie_id role
role join_key join_key text
0 2 280088 Stevie
1 2 396232 Various/lyricist
2 3 376687 Gitano 1
3 4 336265 El Cigala
4 5 135644 Himself
... ... ...
3431961 845461 137097 Kata
3431962 845462 208838 Magga
3431963 845463 870 Gunna
3431964 845464 378123 Gudrun
3431965 845465 378123

3431966 rows x 3 columns
memory usage: 87.96 MB
name: roles
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/roles/

In [13]:
movies.set_role("id", getml.data.roles.join_key)
movies.set_role(["year", "rank"], getml.data.roles.numerical)
movies
Out[13]:
name id year rank name
role join_key numerical numerical unused_string
0 0 2002  nan  #28
1 1 2000  nan  #7 Train: An Immigrant Journey, ...
2 2 1971  6.4 $
3 3 1913  nan  $1,000 Reward
4 4 1915  nan  $1,000 Reward
... ...  ...  ...
388264 412316 1991  nan  "zem blch krlu"
388265 412317 1995  nan  "rgammk"
388266 412318 2002  nan  "zgnm Leyla"
388267 412319 1983  nan  " Istanbul"
388268 412320 1958  nan  "sterreich"

388269 rows x 4 columns
memory usage: 18.37 MB
name: movies
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/movies/

In [14]:
movies_genres.set_role("movie_id", getml.data.roles.join_key)
movies_genres.set_role("genre", getml.data.roles.categorical)
movies_genres
Out[14]:
name movie_id genre
role join_key categorical
0 1 Documentary
1 1 Short
2 2 Comedy
3 2 Crime
4 5 Western
... ...
395114 378612 Adventure
395115 378612 Drama
395116 378613 Comedy
395117 378613 Drama
395118 378614 Comedy

395119 rows x 2 columns
memory usage: 3.16 MB
name: movies_genres
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/imdb/movies_genres/

We need to separate our data set into a training, testing and validation set:

In [15]:
split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)
split
Out[15]:
0 train
1 validation
2 train
3 validation
4 validation
...

infinite number of rows
type: StringColumnView

In [16]:
container = getml.data.Container(population=actors, split=split)

container.add(
    roles=roles,
    movies=movies,
    movies_genres=movies_genres,
)

container
Out[16]:

population

subset name rows type
0 test actors 122794 View
1 train actors 571807 View
2 validation actors 123117 View

peripheral

name rows type
0 roles 3431966 DataFrame
1 movies 388269 DataFrame
2 movies_genres 395119 DataFrame

2. Predictive modelling

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

2.1 Define relational model

To get started with relational learning, we need to specify the data model.

In [17]:
dm = getml.data.DataModel("actors")

dm.add(getml.data.to_placeholder(
    roles=roles,
    movies=movies,
    movies_genres=movies_genres,
))

dm.population.join(
    dm.roles,
    on=("id", "actor_id"),
)

dm.roles.join(
    dm.movies,
    on=("movie_id", "id"),
    relationship=getml.data.relationship.many_to_one,
)

dm.movies.join(
    dm.movies_genres,
    on=("id", "movie_id"),
)

dm
Out[17]:

diagram


movies_genresmoviesrolesactorsmovie_id = idid = movie_idRelationship: many-to-oneactor_id = id

staging

data frames staging table
0 actors ACTORS__STAGING_TABLE_1
1 movies_genres MOVIES_GENRES__STAGING_TABLE_2
2 roles, movies ROLES__STAGING_TABLE_3

2.2 getML pipeline

Set-up the feature learner & predictor

We can either use the relboost default parameters or some more fine-tuned parameters. Fine-tuning these parameters in this way can increase our predictive accuracy to 85%, but the training time increases to over 4 hours. We therefore assume that we want to use the default parameters.

In [18]:
text_field_splitter = getml.preprocessors.TextFieldSplitter()

mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)

feature_selector = getml.predictors.XGBoostClassifier()

predictor = getml.predictors.XGBoostClassifier()

Build the pipeline

In [19]:
pipe = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[text_field_splitter, mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=[predictor],
    share_selected_features=0.1,
)

2.3 Model training

In [20]:
pipe.check(container.train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining ROLES__STAGING_TABLE_3 and MOVIES_GENRES__STAGING_TABLE_2 over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'ROLES__STAGING_TABLE_3'. You might want to double-check your join keys.
In [21]:
pipe.fit(container.train)
Checking data model...


Staging...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining ROLES__STAGING_TABLE_3 and MOVIES_GENRES__STAGING_TABLE_2 over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'ROLES__STAGING_TABLE_3'. You might want to double-check your join keys.


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Indexing text fields...
[========================================] 100%

FastProp: Trying 226 features...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

XGBoost: Training as feature selector...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:10m:59.716192

Out[21]:
Pipeline(data_model='actors',
         feature_learners=['FastProp'],
         feature_selectors=['XGBoostClassifier'],
         include_categorical=False,
         loss_function=None,
         peripheral=['movies', 'movies_genres', 'roles'],
         predictors=['XGBoostClassifier'],
         preprocessors=['TextFieldSplitter', 'Mapping'],
         share_selected_features=0.1,
         tags=['fast_prop', 'container-99dJnF'])

url: http://localhost:1709/#/getpipeline/imdb/PpQOSV/0/

2.4 Model evaluation

In [22]:
pipe.score(container.test)

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building features...
[========================================] 100%==] 122%


Out[22]:
date time set used target accuracy auc cross entropy
0 2022-03-22 17:36:28 train target 0.8413 0.9136 0.3223
1 2022-03-22 17:36:49 test target 0.8417 0.9134 0.3236

2.5 Features

The most important feature looks as follows:

In [23]:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
Out[23]:
DROP TABLE IF EXISTS "FEATURE_1_142";

CREATE TABLE "FEATURE_1_142" AS
SELECT MAX( COALESCE( f_1_1_16."feature_1_1_16", 0.0 ) ) AS "feature_1_142",
       t1.rowid AS rownum
FROM "ACTORS__STAGING_TABLE_1" t1
INNER JOIN "ROLES__STAGING_TABLE_3" t2
ON t1."id" = t2."actor_id"
LEFT JOIN "FEATURE_1_1_16" f_1_1_16
ON t2.rowid = f_1_1_16.rownum
GROUP BY t1.rowid;

2.6 Productionization

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Here, we will demonstrate how the pipeline can be transpiled to Spark SQL and then executed on a Spark cluster.

In [24]:
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("imdb_spark")
In [25]:
if RUN_SPARK:
    spark = SparkSession.builder.appName(
        "online_retail"
    ).config(
        "spark.driver.maxResultSize","10g"
    ).config(
        "spark.driver.memory", "10g"
    ).config(
        "spark.executor.memory", "20g"
    ).config(
        "spark.sql.execution.arrow.pyspark.enabled", "true"
    ).config(
        "spark.sql.session.timeZone", "UTC"
    ).enableHiveSupport().getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")
In [26]:
if RUN_SPARK:
    population_spark = container.train.population.to_pyspark(spark, name="actors")
In [27]:
if RUN_SPARK:
    movies_genres_spark = container.movies_genres.to_pyspark(spark, name="movies_genres")
    roles_spark = container.roles.to_pyspark(spark, name="roles")
    movies_spark = container.movies.to_pyspark(spark, name="movies")
In [28]:
if RUN_SPARK:
    getml.spark.execute(spark, "imdb_spark")
In [29]:
if RUN_SPARK:
    spark.sql("SELECT * FROM `FEATURES` LIMIT 20").toPandas()

3. Conclusion

In this notebook we have demonstrated how getML can be applied to text fields. We have demonstrated the our approach outperforms state-of-the-art relational learning algorithms on the IMDb dataset.

References

Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

Neville, Jennifer, and David Jensen. "Relational dependency networks." Journal of Machine Learning Research 8.Mar (2007): 653-692.

Neville, Jennifer, and David Jensen. "Collective classification with relational dependency networks." Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.

Neville, Jennifer, et al. "Learning relational probability trees." Proceedings of the Ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003.

Perovšek, Matic, et al. "Wordification: Propositionalization by unfolding relational data into bags of words." Expert Systems with Applications 42.17-18 (2015): 6442-6456.

Next Steps

This tutorial went through the basics of applying getML to relational data that contains columns with freeform text. If you want to learn more about getML, here are some additional tutorials and articles that will help you:

Tutorials:

User Guides (from our documentation):

Get in contact

If you have any question schedule a call with Alex, the co-founder of getML, or write us an email. Prefer a private demo of getML? Just contact us to make an appointment.