In this tutorial, we demonstrate a time series application of getML. We predict the hourly traffic volume on I-94 westbound from Minneapolis-St Paul. We benchmark our results against Facebook's Prophet. getML's relational learning algorithms outperform Prophet's classical time series approach by ~15%.
Summary:
Author: Sören Nikolaus
The dataset features some particularly interesting characteristics common to time series that classical models may struggle to handle appropriately.
The analysis is built on top of a dataset provided by the MN Department of Transportation, with some data preparation done by John Hogue.
Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.
Let's get started with the analysis and set up your session:
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Image
plt.style.use("seaborn")
%matplotlib inline
import getml
print(f"getML API version: {getml.__version__}\n")
getml.engine.launch()
getml.engine.set_project("interstate94")
Downloading the raw data and converting it into a prediction-ready format takes time. To get you to the getML model building as fast as possible, we prepared the data for you and excluded the code from this notebook; it is available in the example notebook featuring the full analysis. We only include data after 2016 and introduce a fixed train/test split at 80% of the available data.
traffic = getml.datasets.load_interstate94()
traffic
The getml.datasets.load_interstate94 method took care of the entire data preparation.
Data visualization
The first week of the original traffic time series is plotted below.
col_data = "black"
col_getml = "darkviolet"
fig, ax = plt.subplots(figsize=(20, 10))
# 2016/01/01 was a friday, we'd like to start the visualizations on a monday
start = 72
end = 72 + 168
fig.suptitle(
"Traffic volume for first full week of the training set",
fontsize=14,
fontweight="bold",
)
ax.plot(
traffic["ds"].to_numpy()[start:end],
traffic["traffic_volume"].to_numpy()[start:end],
color=col_data,
)
Traffic: population table
To allow the algorithm to capture seasonal information, we include time components (such as the day of the week) as categorical variables. Note that we could also have used getML's Seasonal preprocessor (getml.preprocessors.Seasonal()), but in this case the information was already included in the dataset.
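To illustrate what such seasonal components look like, here is a minimal pandas sketch (purely illustrative, using a small synthetic index rather than getML's preprocessor or the actual dataset):

```python
import pandas as pd

# Illustrative only: derive the kind of seasonal components the dataset
# already ships with, using plain pandas on a small synthetic hourly index.
ds = pd.date_range("2016-01-01", periods=5, freq="h")
seasonal = pd.DataFrame({
    "ds": ds,
    "hour": ds.hour,           # hour of day (0-23)
    "weekday": ds.day_name(),  # day of week as a categorical label
    "month": ds.month,         # month of year (1-12)
})
print(seasonal)
```

Components like `weekday` and `hour` are what allow a learner to pick up weekly and daily seasonality when they are supplied as categorical columns.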
Train/test split
We use getML's split functionality to retrieve a lazily evaluated split column that we can supply to the time series API below.
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))
Split columns are plain string columns that can be used to subset the data by forming boolean conditions over them:
traffic[split == "test"]
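Conceptually, a time-based split assigns each row to "train" or "test" depending on whether its time stamp lies before the cutoff. A minimal pandas analogue (illustrative only, not getML's API, on a synthetic hourly range):

```python
import pandas as pd

# Illustrative analogue of a time-based split: rows before the cutoff go
# to the training set, rows on or after it go to the test set.
cutoff = pd.Timestamp(2018, 3, 15)
df = pd.DataFrame({"ds": pd.date_range("2018-03-13", periods=96, freq="h")})
df["split"] = (df["ds"] >= cutoff).map({False: "train", True: "test"})
print(df["split"].value_counts())
```

Because the assignment depends only on the time stamp, the split is deterministic and leakage-free: no future observation can end up in the training set.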
To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time-series-related join conditions (horizon, memory and allow_lagged_targets). We use the high-level time series API for this.
Under the hood, the time series API abstracts away a self cross join of the population table (traffic) that allows getML's feature learning algorithms to learn patterns from past observations.
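To build intuition for horizon, memory and lagged targets, here is a minimal pandas sketch (illustrative only, with toy numbers): with a horizon of 1 hour and a memory of 7 days, the features for a row at time t may draw on past target values in the window [t - 7 days, t - 1 hour].

```python
import pandas as pd

# Illustrative only: the simplest lagged-target feature under a 1-hour
# horizon is the target value one step (= one hour) in the past.
df = pd.DataFrame({
    "ds": pd.date_range("2016-01-01", periods=6, freq="h"),
    "traffic_volume": [100, 120, 90, 110, 130, 95],
})
# shift(1) exposes the target one hour (= horizon) in the past; memory
# would bound how far back such lags may reach (here: up to 7 days).
df["volume_1h_ago"] = df["traffic_volume"].shift(1)
print(df)
```

getML's feature learners explore this window automatically through the self join; the sketch above only shows the kind of information that becomes accessible.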
time_series = getml.data.TimeSeries(
population=traffic,
split=split,
time_stamps="ds",
horizon=getml.data.time.hours(1),
memory=getml.data.time.days(7),
lagged_targets=True,
)
time_series
We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.
Set-up of feature learners, selectors & predictor
relmt = getml.feature_learning.RelMT(
num_features=20,
loss_function=getml.feature_learning.loss_functions.SquareLoss,
seed=4367,
num_threads=1,
)
predictor = getml.predictors.XGBoostRegressor()
Build the pipeline
pipe = getml.pipeline.Pipeline(
tags=["memory: 7d", "horizon: 1h", "relmt"],
data_model=time_series.data_model,
feature_learners=[relmt],
predictors=[predictor],
)
pipe.fit(time_series.train)
pipe.score(time_series.test)
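For reference, regression metrics of the kind reported by score() can be computed from actuals and predictions as follows (a toy numpy sketch with made-up numbers, not the real scores of this pipeline):

```python
import numpy as np

# Illustrative only: mean absolute error and root mean squared error
# computed on toy arrays.
actual = np.array([100.0, 120.0, 90.0, 110.0])
predicted = np.array([105.0, 118.0, 95.0, 108.0])
mae = np.mean(np.abs(actual - predicted))
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```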
Feature correlations
Correlations of the calculated features with the target
names, correlations = pipe.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations, color=col_getml)
plt.title("Feature Correlations")
plt.xlabel("Features")
plt.ylabel("Correlations")
plt.xticks(rotation="vertical")
plt.show()
Feature importances
names, importances = pipe.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances, color=col_getml)
plt.title("Feature Importances")
plt.xlabel("Features")
plt.ylabel("Importances")
plt.xticks(rotation="vertical")
plt.show()
Visualizing the learned features
We can also transpile the features as SQL code. Here, we show the most important feature.
by_importance = pipe.features.sort(by="importance")
by_importance[0].sql
To showcase getML's ability to handle categorical data, we now look for features that contain information from the holiday column:
w_holiday = by_importance.filter(lambda feature: "holiday" in feature.sql)
w_holiday
As you can see, the getML features which incorporate information about holidays have rather low importance. This is not surprising, given that most information about holidays can be fully recovered from the calendar information that is already present. In other words: for the algorithm, it doesn't matter whether the traffic is lower on every 4th of July or whether there is a corresponding holiday named 'Independence Day'. Here is the SQL transpilation of the most important feature relying on information about holidays anyway:
w_holiday[0].sql
Plot predictions & traffic volume vs. time
We now plot the predictions against the observed values of the target for the first 7 days of the test set. You can see that the predictions closely follow the original series. RelMT was able to identify distinct patterns in the series.
predictions = pipe.predict(time_series.test)
fig, ax = plt.subplots(figsize=(20, 10))
# the test set starts at 2018/03/15 – a thursday; we introduce an offset to, once again, start on a monday
start = 96
end = 96 + 168
actual = time_series.test.population[start:end].to_pandas()
predicted = predictions[start:end]
ax.plot(actual["ds"], actual["traffic_volume"], color=col_data, label="Actual")
ax.plot(actual["ds"], predicted, color=col_getml, label="Predicted")
fig.suptitle(
"Predicted vs. actual traffic volume for first full week of testing set",
fontsize=14,
fontweight="bold",
)
fig.legend()
The most important feature looks as follows:
pipe.features.to_sql()[pipe.features.sort(by="importance")[0].name]
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 module.
# Creates a folder named interstate94_pipeline containing
# the SQL code.
pipe.features.to_sql().save("interstate94_pipeline")
Benchmarks against Prophet
By design, Prophet isn't capable of delivering the 1-step-ahead predictions we produced with getML. To retrieve a benchmark for the 1-step case nonetheless, we mimic 1-step-ahead predictions by cross-validating the model on a rolling origin. This gives Prophet an advantage, as all information up to the origin is incorporated when fitting the model and a new fit is calculated for every 1-step-ahead forecast. If you are interested in the full analysis, please refer to the extended version of this notebook.
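The rolling-origin scheme can be sketched in a few lines of pandas. This is illustrative only: a naive last-value model stands in for Prophet, and the series is a toy one.

```python
import pandas as pd

# Illustrative rolling-origin evaluation: at each origin, "refit" on all
# data up to that point, then forecast one step ahead. A naive last-value
# model stands in for Prophet here.
series = pd.Series([10, 12, 11, 13, 14, 12, 15], name="traffic_volume")
forecasts = []
for origin in range(3, len(series)):
    train = series.iloc[:origin]       # everything up to the origin
    forecasts.append(train.iloc[-1])   # 1-step-ahead "forecast"
print(forecasts)
```

In the actual benchmark, the naive model is replaced by a full Prophet fit at every origin, which is what makes the comparison generous towards Prophet.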
Results
We have benchmarked getML against Facebook's Prophet library on a univariate time series with strong seasonal components. Prophet is made for exactly this sort of dataset, so you would expect this to be a home run for Prophet. The opposite is true: getML's relational learning algorithms outperform Prophet's 1-step-ahead predictions by ~15 percentage points:
This tutorial went through the basics of applying getML to time series.
If you are interested in further real-world applications of getML, head back to the notebook overview and choose one of the remaining examples.
Here is some additional material from our documentation if you want to learn more about getML:
If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Prefer a private demo of getML? Just contact us to make an appointment.