Note that due to memory limitations, this notebook will not run on MyBinder.
In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.
The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.
As an example data set, we use the Internet Movie Database, which has been used by previous studies in the relational learning literature. This allows us to benchmark our approach against state-of-the-art algorithms in the relational learning literature. We demonstrate that getML outperforms these state-of-the-art algorithms.
Summary:
Author: Dr. Patrick Urbanke
The data set contains about 800,000 actors. The goal is to predict the gender of said actors based on other information we have about them, such as the movies they have participated in and the roles they have played in these movies.
It has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).
Let's get started with the analysis and set up your session:
import copy
import os
from urllib import request
import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline
import getml
from pyspark.sql import SparkSession
getml.engine.launch()
getml.engine.set_project('imdb')
In the following, we set some flags that affect execution of the notebook:
USE_FIRST_NAMES = False
RUN_SPARK = False
We begin by downloading the data from the MariaDB server of the relational learning repository:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="imdb_ijs",
    port=3306,
    user="guest",
    password="relational"
)
conn
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if getml.data.exists(name):
        return getml.data.load_data_frame(name)
    data_frame = getml.data.DataFrame.from_db(
        name=name,
        table_name=name,
        conn=conn
    )
    data_frame.save()
    return data_frame
actors = load_if_needed("actors")
roles = load_if_needed("roles")
movies = load_if_needed("movies")
movies_genres = load_if_needed("movies_genres")
actors
roles
movies
movies_genres
getML requires that we define roles for each of the columns.
actors["target"] = (actors.gender == 'F')
actors.set_role("id", getml.data.roles.join_key)
actors.set_role("target", getml.data.roles.target)
The benchmark studies do not state clearly whether it is fair game to use the first names of the actors. Using the first names, we can easily increase the predictive accuracy to above 90%. However, doing so essentially turns the problem into a first-name identification problem rather than a relational learning problem, which would undermine the point of this notebook: showcasing relational learning. Therefore, we assume that using the first names is not allowed. Feel free to set the USE_FIRST_NAMES flag above to see how well getML incorporates such straightforward information into its feature logic.
if USE_FIRST_NAMES:
    actors.set_role("first_name", getml.data.roles.text)
actors
roles.set_role(["actor_id", "movie_id"], getml.data.roles.join_key)
roles.set_role("role", getml.data.roles.text)
roles
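To give an intuition for what "simple features from text fields" can mean, here is a minimal sketch in plain Python. It is not getML's implementation: the toy rows, the `tokenize` helper and the `word_counts_per_actor` function are hypothetical names introduced only for illustration. The idea is that a free-form `role` string is split into words, and the words are then counted per `actor_id`.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy rows standing in for the `roles` table.
toy_roles = [
    {"actor_id": 1, "movie_id": 10, "role": "Police Officer"},
    {"actor_id": 1, "movie_id": 11, "role": "Wife of Officer"},
    {"actor_id": 2, "movie_id": 10, "role": "Bartender"},
]


def tokenize(text):
    """Lowercase a text field and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts_per_actor(rows):
    """Count how often each word occurs across all roles of each actor."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["actor_id"]].update(tokenize(row["role"]))
    return counts


counts = word_counts_per_actor(toy_roles)
print(counts[1]["officer"])  # 2: "officer" appears in both roles of actor 1
```

Counts like these can then serve as numerical features for the population table. A real text pipeline would typically also limit the vocabulary size, which this sketch omits.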
movies.set_role("id", getml.data.roles.join_key)
movies.set_role(["year", "rank"], getml.data.roles.numerical)
movies
movies_genres.set_role("movie_id", getml.data.roles.join_key)
movies_genres.set_role("genre", getml.data.roles.categorical)
movies_genres
We need to separate our data set into a training, testing and validation set:
split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)
split
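Conceptually, a random split assigns every row of the population table to exactly one subset with the given probabilities. The following sketch illustrates this with the standard library only. The `random_split` function and its `seed` default are assumptions made for this example and do not mirror how getML implements the split internally.

```python
import random


def random_split(n_rows, train=0.7, validation=0.15, test=0.15, seed=0):
    """Assign each row index to 'train', 'validation' or 'test' at random."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    names = ["train", "validation", "test"]
    weights = [train, validation, test]
    return [rng.choices(names, weights=weights)[0] for _ in range(n_rows)]


labels = random_split(10_000)
shares = {
    name: labels.count(name) / len(labels)
    for name in ("train", "validation", "test")
}
print(shares)  # roughly 0.7 / 0.15 / 0.15
```

Because the assignment is random per row, the realized shares only approximate the requested ones. With about 800,000 actors the deviation is negligible.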
container = getml.data.Container(population=actors, split=split)
container.add(
    roles=roles,
    movies=movies,
    movies_genres=movies_genres,
)
container
We loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.
To get started with relational learning, we need to specify the data model.
dm = getml.data.DataModel("actors")
dm.add(getml.data.to_placeholder(
    roles=roles,
    movies=movies,
    movies_genres=movies_genres,
))
dm.population.join(
    dm.roles,
    on=("id", "actor_id"),
)
dm.roles.join(
    dm.movies,
    on=("movie_id", "id"),
    relationship=getml.data.relationship.many_to_one,
)
dm.movies.join(
    dm.movies_genres,
    on=("id", "movie_id"),
)
dm
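The data model describes a chain of joins: actors to roles, roles to movies (many-to-one) and movies to movies_genres. To make the chain concrete, the sketch below follows it by hand on toy data and counts the genres each actor has appeared in. All tables and the `genre_counts` helper are hypothetical stand-ins. This is the kind of aggregation a feature learner searches over, not what getML literally executes.

```python
from collections import Counter, defaultdict

# Hypothetical toy tables; column names follow the data model above.
toy_roles = [
    {"actor_id": 1, "movie_id": 10},
    {"actor_id": 1, "movie_id": 11},
    {"actor_id": 2, "movie_id": 11},
]
toy_movies = {10: {"year": 1999}, 11: {"year": 2004}}  # keyed by movies.id
toy_movies_genres = [
    {"movie_id": 10, "genre": "Drama"},
    {"movie_id": 11, "genre": "Comedy"},
    {"movie_id": 11, "genre": "Drama"},
]

# movies -> movies_genres: one movie can have several genres.
genres_by_movie = defaultdict(list)
for row in toy_movies_genres:
    genres_by_movie[row["movie_id"]].append(row["genre"])


def genre_counts(actor_id):
    """Follow actors -> roles -> movies -> movies_genres and count genres."""
    counts = Counter()
    for role in toy_roles:
        if role["actor_id"] != actor_id:
            continue
        # roles -> movies is many-to-one: each role points to one movie.
        if role["movie_id"] in toy_movies:
            counts.update(genres_by_movie[role["movie_id"]])
    return counts


print(genre_counts(1))
```

For actor 1 this yields two "Drama" entries and one "Comedy" entry, which could be used directly as numerical features.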
Set up the feature learner & predictor
We can use either the relboost default parameters or more fine-tuned parameters. Fine-tuning can increase our predictive accuracy to 85%, but the training time then increases to over 4 hours. We therefore use the default parameters here.
text_field_splitter = getml.preprocessors.TextFieldSplitter()
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)
feature_selector = getml.predictors.XGBoostClassifier()
predictor = getml.predictors.XGBoostClassifier()
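The general idea behind a mapping-style preprocessor can be illustrated as follows: each categorical value (or word) is replaced by an aggregate of the target values observed together with it. The sketch below uses the plain mean and is only a conceptual illustration under that assumption. The `fit_mapping` function and the toy data are hypothetical, and getML's `Mapping` preprocessor may differ in its details.

```python
from collections import defaultdict

# Hypothetical toy data: (genre, target) pairs, where target is 1 for 'F'.
samples = [
    ("Drama", 1),
    ("Drama", 0),
    ("Action", 0),
    ("Action", 0),
    ("Romance", 1),
]


def fit_mapping(pairs):
    """Map each categorical value to the mean target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, target in pairs:
        sums[value] += target
        counts[value] += 1
    return {value: sums[value] / counts[value] for value in sums}


mapping_table = fit_mapping(samples)
print(mapping_table)  # {'Drama': 0.5, 'Action': 0.0, 'Romance': 1.0}
```

Such a table must be fitted on the training set only. Otherwise, target information from the test set would leak into the features.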
Build the pipeline
pipe = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[text_field_splitter, mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=[predictor],
    share_selected_features=0.1,
)
pipe.check(container.train)
pipe.fit(container.train)
pipe.score(container.test)
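For a classification problem like this one, typical scores are accuracy and cross entropy. As a reminder of what these numbers mean, the sketch below computes both from predicted probabilities. The helper functions, the 0.5 threshold and the toy values are assumptions for illustration, so the exact definitions used by getML may differ.

```python
import math


def accuracy(targets, probabilities, threshold=0.5):
    """Share of rows where the thresholded prediction matches the target."""
    hits = sum(
        (p >= threshold) == bool(t) for t, p in zip(targets, probabilities)
    )
    return hits / len(targets)


def cross_entropy(targets, probabilities, eps=1e-12):
    """Mean negative log-likelihood of binary targets."""
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


toy_targets = [1, 0, 1, 0]
toy_probabilities = [0.9, 0.2, 0.4, 0.6]

print(accuracy(toy_targets, toy_probabilities))  # 0.5
print(round(cross_entropy(toy_targets, toy_probabilities), 4))  # 0.5403
```

Accuracy only looks at which side of the threshold a prediction falls on, whereas cross entropy also penalizes confident wrong predictions.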
The most important feature looks as follows:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Here, we will demonstrate how the pipeline can be transpiled to Spark SQL and then executed on a Spark cluster.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("imdb_spark")
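To illustrate why SQL is a convenient production format, the following sketch runs a hand-written feature query against an in-memory SQLite database. The query is a hypothetical example (the number of roles per actor), not actual getML output, and SQLite stands in for Spark only so that the example runs without a cluster.

```python
import sqlite3

# In-memory toy database standing in for the production data store.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript(
    """
    CREATE TABLE actors (id INTEGER PRIMARY KEY);
    CREATE TABLE roles (actor_id INTEGER, movie_id INTEGER, role TEXT);
    INSERT INTO actors VALUES (1), (2), (3);
    INSERT INTO roles VALUES
        (1, 10, 'Police Officer'),
        (1, 11, 'Nurse'),
        (2, 10, 'Bartender');
    """
)

# Hypothetical hand-written feature, not generated by getML:
# the number of roles each actor has played.
feature_sql = """
    SELECT a.id AS actor_id, COUNT(r.movie_id) AS feature_1
    FROM actors a
    LEFT JOIN roles r ON r.actor_id = a.id
    GROUP BY a.id
    ORDER BY a.id
"""

features = conn_demo.execute(feature_sql).fetchall()
print(features)  # [(1, 2), (2, 1), (3, 0)]
```

Once features are expressed as plain SQL, they can be computed wherever the data lives, without a running getML engine.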
if RUN_SPARK:
    spark = SparkSession.builder.appName(
        "imdb"
    ).config(
        "spark.driver.maxResultSize", "10g"
    ).config(
        "spark.driver.memory", "10g"
    ).config(
        "spark.executor.memory", "20g"
    ).config(
        "spark.sql.execution.arrow.pyspark.enabled", "true"
    ).config(
        "spark.sql.session.timeZone", "UTC"
    ).enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
if RUN_SPARK:
    population_spark = container.train.population.to_pyspark(spark, name="actors")
if RUN_SPARK:
    movies_genres_spark = container.movies_genres.to_pyspark(spark, name="movies_genres")
    roles_spark = container.roles.to_pyspark(spark, name="roles")
    movies_spark = container.movies.to_pyspark(spark, name="movies")
if RUN_SPARK:
    getml.spark.execute(spark, "imdb_spark")
if RUN_SPARK:
    spark.sql("SELECT * FROM `FEATURES` LIMIT 20").toPandas()
In this notebook, we have demonstrated how getML can be applied to text fields. We have demonstrated that our approach outperforms state-of-the-art relational learning algorithms on the IMDb dataset.
Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
Neville, Jennifer, and David Jensen. "Relational dependency networks." Journal of Machine Learning Research 8.Mar (2007): 653-692.
Neville, Jennifer, and David Jensen. "Collective classification with relational dependency networks." Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.
Neville, Jennifer, et al. "Learning relational probability trees." Proceedings of the Ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003.
Perovšek, Matic, et al. "Wordification: Propositionalization by unfolding relational data into bags of words." Expert Systems with Applications 42.17-18 (2015): 6442-6456.
This tutorial went through the basics of applying getML to relational data that contains columns with free-form text.