In this tutorial, we demonstrate a time series application of getML.
We benchmark our results against Facebook's Prophet and tsfresh.
getML's relational learning algorithms outperform Prophet's classical time series approach by ~14% and tsfresh's brute force approaches to feature engineering by ~26% (measured in terms of the predictive R-squared).
Summary:
Author: Patrick Urbanke
The data set features some particularly interesting characteristics common for time series, which classical models may struggle to deal with. Such characteristics are:
To quote the maintainers of the data set:
"This loop sensor data was collected for the Glendale on ramp for the 101 North freeway in Los Angeles. It is close enough to the stadium to see unusual traffic after a Dodgers game, but not so close and heavily used by game traffic so that the signal for the extra traffic is overly obvious."
The dataset was originally collected for this paper:
"Adaptive event detection with time-varying Poisson processes" A. Ihler, J. Hutchins, and P. Smyth Proceedings of the 12th ACM SIGKDD Conference (KDD-06), August 2006.
It is maintained by the UCI Machine Learning Repository:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.
Let's get started with the analysis and set up your session:
from datetime import datetime
import gc
import os
from urllib import request
import time
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import scipy
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline
For various technical reasons, we want to keep our MyBinder notebook short. That is why we pre-store the features for Prophet and tsfresh. However, you are very welcome to try this at home and fully reproduce our results. To do so, set the two constants below to True.
RUN_PROPHET = False
RUN_TSFRESH = False
if RUN_PROPHET:
    from fbprophet import Prophet
if RUN_TSFRESH:
    import tsfresh
    from tsfresh.utilities.dataframe_functions import roll_time_series
    from tsfresh.feature_selection.relevance import calculate_relevance_table
col_data = "black"
col_getml = "darkviolet"
col_prophet="cornflowerblue"
col_tsfresh="green"
import getml
getml.engine.launch()
getml.engine.set_project('dodgers')
We begin by downloading the data from the UC Irvine Machine Learning repository:
fname = "Dodgers.data"
if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/" + fname,
        fname
    )
data_full_pandas = pd.read_csv(fname, header=None)
If we use the pre-stored features, we have to download them as well:
PROPHET_FILES = [
    "predictions_prophet_train.csv",
    "predictions_prophet_test.csv",
    "combined_train_pandas.csv",
    "combined_test_pandas.csv"
]
if not RUN_PROPHET:
    for fname in PROPHET_FILES:
        if not os.path.exists(fname):
            fname, res = request.urlretrieve(
                "https://static.getml.com/datasets/dodgers/" + fname,
                fname
            )
TSFRESH_FILES = [
    "tsfresh_train_pandas.csv",
    "tsfresh_test_pandas.csv"
]
if not RUN_TSFRESH:
    for fname in TSFRESH_FILES:
        if not os.path.exists(fname):
            fname, res = request.urlretrieve(
                "https://static.getml.com/datasets/dodgers/" + fname,
                fname
            )
Prophet is pretty strict about how the columns should be named, so we adapt to these restrictions:
data_full_pandas.columns = ["ds", "y"]
data_full_pandas = data_full_pandas[data_full_pandas["y"] >= 0]
data_full_pandas = data_full_pandas.reset_index()
del data_full_pandas["index"]
data_full_pandas["ds"] = [
    datetime.strptime(dt, "%m/%d/%Y %H:%M") for dt in data_full_pandas["ds"]
]
data_full_pandas
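The cleaning steps above can also be written in vectorized form. The following is a minimal sketch on a tiny made-up frame (the three rows are illustrative and not taken from Dodgers.data). It assumes that negative counts mark invalid readings, which is what the `>= 0` filter implies.

```python
import pandas as pd

# Hypothetical rows in the same layout as the raw file: time stamp, count.
raw = pd.DataFrame(
    [["4/10/2005 0:00", -1], ["4/10/2005 0:05", 12], ["4/10/2005 0:10", 7]],
    columns=["ds", "y"],
)

# Drop negative counts and renumber the remaining rows in one step.
clean = raw[raw["y"] >= 0].reset_index(drop=True)

# Parse all time stamps at once instead of looping with strptime.
clean["ds"] = pd.to_datetime(clean["ds"], format="%m/%d/%Y %H:%M")
```

Passing an explicit `format` keeps the parsing unambiguous for month-first dates.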
data_full = getml.data.DataFrame.from_pandas(data_full_pandas, "data_full")
data_full.set_role("y", getml.data.roles.target)
data_full.set_role("ds", getml.data.roles.time_stamp)
data_full
split = getml.data.split.time(population=data_full, time_stamp="ds", test=getml.data.time.datetime(2005, 8, 20))
split
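Conceptually, a time-based split assigns each row to the training or testing set depending on whether its time stamp precedes a cutoff. The sketch below shows the idea in plain pandas on made-up data. It is not getML's implementation, and the boundary handling (rows at the cutoff go to the test set) is an assumption.

```python
import pandas as pd

# Hypothetical daily observations around the cutoff used above.
df = pd.DataFrame({
    "ds": pd.date_range("2005-08-18", periods=4, freq="D"),
    "y": [10, 20, 30, 40],
})

cutoff = pd.Timestamp(2005, 8, 20)

# Rows strictly before the cutoff go to train, the rest to test,
# so no future observation ends up in the training set.
df["split"] = ["train" if ts < cutoff else "test" for ts in df["ds"]]

train = df[df["split"] == "train"]
test = df[df["split"] == "test"]
```

Splitting by time rather than at random matters for time series, because a random split would let the model train on observations that lie after some of the test rows.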
Traffic: population table
To allow the algorithm to capture seasonal information, we include time components (such as the day of the week) as categorical variables.
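As a rough illustration of what such time components look like, the sketch below derives the hour and the day of the week from a few made-up time stamps using plain pandas. It only shows the idea and is not how getML computes these components internally.

```python
import pandas as pd

# Hypothetical time stamps; the column name "ds" follows the tutorial.
df = pd.DataFrame({"ds": pd.to_datetime([
    "2005-04-10 00:05", "2005-04-11 17:30", "2005-04-16 19:00",
])})

# Derive calendar components and store them as strings so that a model
# treats them as categories rather than as ordered numbers.
df["hour"] = df["ds"].dt.hour.astype(str)
df["weekday"] = df["ds"].dt.day_name()
```

Treating the components as categories lets a model learn, for example, that Saturday evenings behave differently from Monday evenings without assuming any ordering between the days.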
To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time-series-related join conditions (horizon, memory and allow_lagged_targets). This is done abstractly using Placeholders.
The data model consists of two tables:
traffic_{test/train}: holds the target and the contemporaneously available time-based components
traffic: same table as the population table
The two tables are joined subject to a horizon, which prevents leaks, and a memory, which keeps the computations feasible.
# 1. The horizon is 1 hour (we predict the traffic volume in one hour).
# 2. The memory is 2 hours, so we allow the algorithm to
#    use information from up to 2 hours ago.
# 3. We allow lagged targets. Thus, the algorithm can
#    identify autoregressive processes.
time_series = getml.data.TimeSeries(
    population=data_full,
    split=split,
    time_stamps='ds',
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(2),
    lagged_targets=True
)
time_series
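To make the effect of horizon and memory concrete, the sketch below lists which past rows would be visible when predicting a single target. It uses only the standard library and made-up time stamps on a 5-minute grid. The exact boundary convention (inclusive at the horizon, exclusive at horizon plus memory) is an assumption, and `visible_rows` is a hypothetical helper, not part of getML.

```python
from datetime import datetime, timedelta

horizon = timedelta(hours=1)
memory = timedelta(hours=2)

# Hypothetical sensor readings on a 5-minute grid.
start = datetime(2005, 4, 10)
stamps = [start + timedelta(minutes=5 * i) for i in range(100)]


def visible_rows(t):
    """Time stamps that may be used to predict the target at time t.

    Assumed convention: a row is visible if it is at least `horizon`
    old but less than `horizon + memory` old.
    """
    return [s for s in stamps if s + horizon <= t < s + horizon + memory]


window = visible_rows(datetime(2005, 4, 10, 6, 0))
print(len(window), window[0], window[-1])
```

Under these assumptions, a prediction for 06:00 only sees the readings from 03:05 through 05:00. Nothing from the last hour leaks in, and nothing older than three hours has to be processed.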
We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.
Set-up of feature learners, selectors & predictor
mapping = getml.preprocessors.Mapping()
seasonal = getml.preprocessors.Seasonal()
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
)
relboost = getml.feature_learning.Relboost(
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    num_threads=1
)
predictor = getml.predictors.XGBoostRegressor()
Build the pipeline
pipe = getml.pipeline.Pipeline(
    tags=['memory: 2h', 'horizon: 1h', 'fast_prop', 'relboost'],
    data_model=time_series.data_model,
    preprocessors=[seasonal, mapping],
    feature_learners=[fast_prop, relboost],
    predictors=[predictor]
)
pipe.check(time_series.train)
pipe.fit(time_series.train)
pipe.score(time_series.test)
predictions_getml_test = pipe.predict(time_series.test)
Prophet is a library for generating predictions on univariate time series that contain strong seasonal components. We would therefore expect it to do well on this particular time series.
data_train_pandas = time_series.train.population.to_pandas()
data_test_pandas = time_series.test.population[1:].to_pandas()
data_train_pandas
if RUN_PROPHET:
    model_prophet = Prophet()
    model_prophet = model_prophet.fit(data_train_pandas)
    predictions_prophet_train = model_prophet.predict(data_train_pandas)["yhat"]
    predictions_prophet_test = model_prophet.predict(data_test_pandas)["yhat"]
else:
    predictions_prophet_train = pd.read_csv("predictions_prophet_train.csv")["yhat"]
    predictions_prophet_test = pd.read_csv("predictions_prophet_test.csv")["yhat"]
Since we are not using the getML engine for Prophet, we have to implement the metrics ourselves. Luckily, that is not very hard.
in_sample = dict()
out_of_sample = dict()
predictions_prophet_test
data_train_pandas
def r_squared(yhat, y):
    # Squared Pearson correlation between predictions and targets.
    yhat = np.asarray(yhat)
    y = np.asarray(y)
    r = scipy.stats.pearsonr(yhat, y)[0]
    return r * r

in_sample["rsquared"] = r_squared(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["rsquared"] = r_squared(predictions_prophet_test, data_test_pandas["y"])

def rmse(yhat, y):
    # Root mean squared error.
    yhat = np.asarray(yhat)
    y = np.asarray(y)
    return np.sqrt(
        ((y - yhat)*(y - yhat)).sum()/len(y)
    )

in_sample["rmse"] = rmse(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["rmse"] = rmse(predictions_prophet_test, data_test_pandas["y"])

def mae(yhat, y):
    # Mean absolute error.
    yhat = np.asarray(yhat)
    y = np.asarray(y)
    return np.abs(y - yhat).sum()/len(y)

in_sample["mae"] = mae(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["mae"] = mae(predictions_prophet_test, data_test_pandas["y"])
print("""
In sample mae: {:.4f}
In sample rmse: {:.4f}
In sample rsquared: {:.4f}\n
Out of sample mae: {:.4f}
Out of sample rmse: {:.4f}
Out of sample rsquared: {:.4f}
""".format(
in_sample['mae'],
in_sample['rmse'],
in_sample['rsquared'],
out_of_sample['mae'],
out_of_sample['rmse'],
out_of_sample['rsquared'])
)
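One detail worth keeping in mind: the r_squared helper above is the squared Pearson correlation, which can differ from the coefficient of determination (1 - SS_res / SS_tot) when predictions are biased or wrongly scaled. The sketch below illustrates the gap on made-up numbers; both function names are hypothetical and only serve this illustration.

```python
import numpy as np


def squared_correlation(yhat, y):
    # Same idea as r_squared above: the square of Pearson's r.
    yhat, y = np.asarray(yhat, dtype=float), np.asarray(y, dtype=float)
    r = np.corrcoef(yhat, y)[0, 1]
    return r * r


def coefficient_of_determination(yhat, y):
    # 1 - SS_res / SS_tot, which also penalizes bias and wrong scale.
    yhat, y = np.asarray(yhat, dtype=float), np.asarray(y, dtype=float)
    ss_res = ((y - yhat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


y = np.array([1.0, 2.0, 3.0, 4.0])
biased = y + 10.0  # hypothetical predictions that are off by a constant

print(squared_correlation(biased, y))
print(coefficient_of_determination(biased, y))
```

Because the squared correlation ignores a constant offset, it is best read together with RMSE and MAE, which is what the summary in this tutorial does.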
Let's take a closer look at the predictions to get a better understanding of why getML does better than Prophet.
length = 4000
plt.subplots(figsize=(20, 10))
plt.plot(np.asarray(data_test_pandas["y"])[:length], color=col_data, label="ground truth")
plt.plot(predictions_getml_test[:length], color=col_getml, label="getml")
plt.plot(predictions_prophet_test[:length], color=col_prophet, label="prophet")
plt.legend(loc="upper right")
As this plot indicates, getML does better than Prophet, because it can integrate autoregressive processes in addition to seasonal data.
tsfresh is a library for generating features on time series. It uses a brute-force approach: It generates a large number of hard-coded features and then does a feature selection.
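The generate-then-select idea can be sketched without tsfresh. The example below builds a small fixed catalogue of rolling-window features on a made-up series and keeps the ones most correlated with the target. It is a simplified illustration under stated assumptions (window length 12, five hand-picked features, correlation-based ranking), not tsfresh's actual feature set or selection procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical series: a sine wave plus noise.
n = 300
series = pd.Series(np.sin(np.arange(n) / 10.0) + rng.normal(0, 0.1, n))

# The target is the value one step ahead of each window.
target = series.shift(-1)

# Step 1: generate a fixed catalogue of rolling-window features.
window = series.rolling(window=12)
features = pd.DataFrame({
    "mean": window.mean(),
    "std": window.std(),
    "min": window.min(),
    "max": window.max(),
    "last": series,
})

# Step 2: keep the features most correlated with the target.
valid = features.notna().all(axis=1) & target.notna()
scores = features[valid].corrwith(target[valid]).abs()
selected = scores.sort_values(ascending=False).head(2).index.tolist()
print(selected)
```

The catalogue is fixed in advance, so the approach can only find patterns that one of the hard-coded features happens to express.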
For convenience, we have built a wrapper around tsfresh.
As we have discussed in a different notebook, tsfresh consumes a lot of memory. To limit the memory consumption to a feasible level, we only use tsfresh's MinimalFCParameters and IndexBasedFCParameters, which are a superset of the TimeBasedFCParameters.
class TSFreshBuilder():

    def __init__(self, num_features, memory, column_id, time_stamp, target):
        """
        Scikit-learn style feature builder based on TSFresh.

        Args:
            num_features: The (maximum) number of features to build.
            memory: How much back in time you want to go until the
                feature builder starts "forgetting" data.
            column_id: The name of the column containing the ids.
            time_stamp: The name of the column containing the time stamps.
            target: The name of the target column.
        """
        self.num_features = num_features
        self.memory = memory
        self.column_id = column_id
        self.time_stamp = time_stamp
        self.target = target
        self.selected_features = []

    def _add_original_columns(self, original_df, df_selected):
        for colname in original_df.columns:
            df_selected[colname] = np.asarray(
                original_df[colname])
        return df_selected

    def _extract_features(self, df):
        df_rolled = roll_time_series(
            df,
            column_id=self.column_id,
            column_sort=self.time_stamp,
            max_timeshift=self.memory
        )
        extracted_minimal = tsfresh.extract_features(
            df_rolled,
            column_id=self.column_id,
            column_sort=self.time_stamp,
            default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters()
        )
        extracted_index_based = tsfresh.extract_features(
            df_rolled,
            column_id=self.column_id,
            column_sort=self.time_stamp,
            default_fc_parameters=tsfresh.feature_extraction.settings.IndexBasedFCParameters()
        )
        extracted_features = pd.concat(
            [extracted_minimal, extracted_index_based], axis=1
        )
        del extracted_minimal
        del extracted_index_based
        gc.collect()
        # NaN is the only value that is not equal to itself,
        # so this replaces all NaN entries with zero.
        extracted_features[
            extracted_features != extracted_features] = 0.0
        extracted_features[
            np.isinf(extracted_features)] = 0.0
        return extracted_features

    def _print_time_taken(self, begin, end):
        seconds = end - begin
        hours = int(seconds / 3600)
        seconds -= float(hours * 3600)
        minutes = int(seconds / 60)
        seconds -= float(minutes * 60)
        seconds = round(seconds, 6)
        print(
            "Time taken: " + str(hours) + "h:" +
            str(minutes) + "m:" + str(seconds)
        )
        print("")

    def _remove_target_column(self, df):
        colnames = np.asarray(df.columns)
        if self.target not in colnames:
            return df
        colnames = colnames[colnames != self.target]
        return df[colnames]

    def _select_features(self, df, target):
        df_selected = tsfresh.select_features(
            df,
            target
        )
        colnames = np.asarray(df_selected.columns)
        correlations = np.asarray([
            np.abs(pearsonr(target, df_selected[col])[0]) for col in colnames
        ])
        # [::-1] is somewhat unintuitive syntax,
        # but it reverses the entire column.
        self.selected_features = colnames[
            np.argsort(correlations)
        ][::-1][:self.num_features]
        return df_selected[self.selected_features]

    def fit(self, df):
        """
        Fits the features.
        """
        begin = time.time()
        target = np.asarray(df[self.target])
        df_without_target = self._remove_target_column(df)
        df_extracted = self._extract_features(
            df_without_target)
        df_selected = self._select_features(
            df_extracted, target)
        del df_extracted
        gc.collect()
        df_selected = self._add_original_columns(df, df_selected)
        end = time.time()
        self._print_time_taken(begin, end)
        return df_selected

    def transform(self, df):
        """
        Transforms the raw data into a set of features.
        """
        df_extracted = self._extract_features(df)
        df_selected = df_extracted[self.selected_features]
        del df_extracted
        gc.collect()
        df_selected = self._add_original_columns(df, df_selected)
        return df_selected
We need to lag our target variable, so that we can pass it to tsfresh as a feature.
y_lagged = np.asarray(data_full_pandas["y"][:-12])
# .copy() avoids pandas' SettingWithCopyWarning when adding columns to a slice.
data_full_tsfresh = data_full_pandas[12:].copy()
data_full_tsfresh["y_lagged"] = y_lagged
data_full_tsfresh["id"] = 1
separation = datetime(2005, 8, 20, 0, 0)
data_train_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] < separation]
data_test_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] >= separation]
data_train_tsfresh
data_test_tsfresh
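The lagging step above pairs every row with the target value observed a fixed number of rows earlier. The same idea can be sketched with pandas' `shift` on a made-up column; the lag of 2 is chosen only to keep the example short, whereas the tutorial uses 12 rows.

```python
import pandas as pd

lag = 2  # the tutorial uses 12 rows; 2 keeps the example short

df = pd.DataFrame({"y": [5, 6, 7, 8, 9]})

# shift(lag) moves every value `lag` rows down, so each row sees the
# target as it was `lag` rows earlier; the first rows have no history.
df["y_lagged"] = df["y"].shift(lag)
lagged = df.dropna().reset_index(drop=True)
```

The first `lag` rows are dropped because no earlier observation exists for them, which mirrors the `[12:]` slice in the tutorial code.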
We build 20 features, just like we have with getML.
tsfresh_builder = TSFreshBuilder(
    num_features=20,
    memory=24,
    column_id="id",
    time_stamp="ds",
    target="y"
)
if RUN_TSFRESH:
    tsfresh_train_pandas = tsfresh_builder.fit(data_train_tsfresh)
    tsfresh_test_pandas = tsfresh_builder.transform(data_test_tsfresh)
else:
    tsfresh_train_pandas = pd.read_csv("tsfresh_train_pandas.csv")
    tsfresh_test_pandas = pd.read_csv("tsfresh_test_pandas.csv")
Because tsfresh does not come with built-in predictors, we upload the generated features into a getML pipeline.
tsfresh_train = getml.data.DataFrame.from_pandas(tsfresh_train_pandas, "tsfresh_train")
tsfresh_test = getml.data.DataFrame.from_pandas(tsfresh_test_pandas, "tsfresh_test")
for df in [tsfresh_train, tsfresh_test]:
    df.set_role("y", getml.data.roles.target)
    df.set_role("ds", getml.data.roles.time_stamp)
    df.set_role(df.roles.unused_float, getml.data.roles.numerical)
    df.set_role(["y_lagged", "id"], getml.data.roles.unused_float)
tsfresh_train
We use an untuned XGBoostRegressor to generate predictions from our tsfresh features, just like we have for getML.
predictor = getml.predictors.XGBoostRegressor()
pipe_tsfresh = getml.pipeline.Pipeline(
    tags=['tsfresh'],
    predictors=[predictor]
)
pipe_tsfresh.fit(tsfresh_train)
pipe_tsfresh.score(tsfresh_test)
predictions_tsfresh_test = pipe_tsfresh.predict(tsfresh_test)
Let's take a closer look at the predictions to get a better understanding of why getML does better than tsfresh.
length = 4000
plt.subplots(figsize=(20, 10))
plt.plot(np.asarray(data_test_pandas["y"])[:length], color=col_data, label="ground truth")
plt.plot(predictions_getml_test[:length], color=col_getml, label="getml")
plt.plot(predictions_tsfresh_test[:length], color=col_tsfresh, label="tsfresh")
plt.legend(loc="upper right")
As we can see, tsfresh struggles with the strong seasonal components of this data set and therefore cannot separate signal from noise to the same extent that getML can.
Prophet is good at extracting seasonal features. tsfresh is good at extracting autoregressive features. So what if we tried to combine them? How well would that perform compared to getML?
Let's give it a try. We begin by extracting all of the seasonal features from Prophet and combining them with the tsfresh features:
def combine(dfs):
    # Copies all columns of all data frames into one data frame;
    # columns that appear more than once are overwritten by the later frame.
    combined = pd.DataFrame()
    for df in dfs:
        df = df.copy()
        if "id" in df.columns:
            del df["id"]
        df = df.reset_index()
        for col in df.columns:
            combined[col] = df[col]
    return combined

if RUN_PROPHET:
    prophet_train_pandas = model_prophet.predict(data_train_tsfresh)
    prophet_test_pandas = model_prophet.predict(data_test_tsfresh)
    combined_train_pandas = combine([tsfresh_train_pandas, prophet_train_pandas])
    combined_test_pandas = combine([tsfresh_test_pandas, prophet_test_pandas])
else:
    combined_train_pandas = pd.read_csv("combined_train_pandas.csv")
    combined_test_pandas = pd.read_csv("combined_test_pandas.csv")
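Combining two feature tables column by column can also be sketched with `pd.concat`. The example below uses two small made-up frames that share one column. The frame and column names are hypothetical, and the resulting column order may differ from what the combine function above produces.

```python
import pandas as pd

# Hypothetical feature tables with one shared column ("ds").
tsfresh_like = pd.DataFrame(
    {"ds": [1, 2, 3], "y__mean": [0.1, 0.2, 0.3]}, index=[10, 11, 12]
)
prophet_like = pd.DataFrame({"ds": [1, 2, 3], "weekly": [1.0, 2.0, 3.0]})

# Align rows by position, then drop columns that appear twice,
# keeping the copy from the later frame.
parts = [df.reset_index(drop=True) for df in (tsfresh_like, prophet_like)]
combined = pd.concat(parts, axis=1)
combined = combined.loc[:, ~combined.columns.duplicated(keep="last")]
```

Resetting the index first matters here: `pd.concat` aligns on index labels, so frames with different indices would otherwise produce rows filled with missing values.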
We upload the data to getML:
combined_train = getml.data.DataFrame.from_pandas(combined_train_pandas, "combined_train")
combined_test = getml.data.DataFrame.from_pandas(combined_test_pandas, "combined_test")
The multiplicative terms are all zero, so we set them to unused to avoid an ugly warning message we would get from getML.
for df in [combined_train, combined_test]:
    df.set_role("y", getml.data.roles.target)
    df.set_role("ds", getml.data.roles.time_stamp)
    df.set_role(df.roles.unused_float, getml.data.roles.numerical)
    df.set_role(["multiplicative_terms", "multiplicative_terms_lower", "multiplicative_terms_upper", "y_lagged"], getml.data.roles.unused_float)
Once again, we train an untuned XGBoostRegressor on top of these features.
predictor = getml.predictors.XGBoostRegressor()
pipe_combined = getml.pipeline.Pipeline(
    tags=['prophet + tsfresh'],
    predictors=[predictor]
)
pipe_combined.fit(combined_train)
pipe_combined.score(combined_test)
As we can see, combining tsfresh and Prophet yields a higher R-squared than either of them achieves alone, but it is still considerably worse than getML.
predictions_combined_test = pipe_combined.predict(combined_test)
length = 4000
plt.subplots(figsize=(20, 10))
plt.plot(np.asarray(data_test_pandas["y"])[:length], color=col_data, label="ground truth")
plt.plot(predictions_getml_test[:length], color=col_getml, label="getml")
plt.plot(predictions_combined_test[:length], color=col_prophet, label="tsfresh + prophet")
plt.legend(loc="upper right")
The most important feature is the following:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
pipe.features.to_sql().save("dodgers_pipeline")
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("dodgers_spark")
For a more convenient overview, we summarize these results into a table.
| Name | R-squared | RMSE | MAE |
|---|---|---|---|
| getML | 76% | 6.39 | 4.64 |
| Prophet | 63% | 8.32 | 6.22 |
| tsfresh | 49% | 9.30 | 7.19 |
| Prophet + tsfresh | 67% | 8.41 | 6.18 |
As we can see, getML outperforms both Prophet and tsfresh by all three measures.
We have compared getML's feature learning algorithms to Prophet and tsfresh on a data set related to traffic on LA's 101 North freeway. We found that getML significantly outperforms both Prophet and tsfresh. These results are consistent with the view that relational learning is a powerful tool for time series analysis.
You are encouraged to reproduce these results. You will need getML (https://getml.com/product) to do so. You can download it for free.
This tutorial showcased another time series application of getML and benchmarked getML against popular time series libraries.
If you are interested in further real-world applications of getML, head back to the notebook overview and choose one of the remaining examples.
Here is some additional material from our documentation if you want to learn more about getML:
If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Prefer a private demo of getML? Just contact us to make an appointment.