Propositionalization: Interstate 94

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

  • Prediction type: Regression model
  • Domain: Transportation
  • Prediction target: Hourly traffic volume
  • Source data: Multivariate time series, 5 components
  • Population size: 24096

Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
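To make the idea concrete, here is a minimal, hypothetical sketch of propositionalization in plain pandas (all table and column names are made up for illustration): a fixed set of aggregations is applied to a peripheral table and the result is joined back to the population table, yielding one flat attribute-value row per population entry.

import pandas as pd

# Hypothetical relational data: one population table, one peripheral table.
population = pd.DataFrame({"id": [1, 2], "target": [0.3, 0.7]})
peripheral = pd.DataFrame(
    {"id": [1, 1, 2, 2, 2], "value": [1.0, 2.0, 3.0, 4.0, 5.0]}
)

# Propositionalization: apply a fixed set of aggregations to every column of
# interest and join the aggregates back to the population table.
aggregations = ["mean", "min", "max", "sum", "count"]
features = peripheral.groupby("id")["value"].agg(aggregations)
features.columns = [f"value__{agg}" for agg in aggregations]

flat_table = population.join(features, on="id")
print(flat_table)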

getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.

In this notebook, we predict the hourly traffic volume on I-94 westbound from Minneapolis-St Paul. The analysis is built on top of a dataset provided by the MN Department of Transportation, with some data preparation done by John Hogue. For further details about the dataset, refer to the full notebook.

Analysis

Let's get started with the analysis and set up our session:

In [1]:
import datetime
import os
import sys
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Image

plt.style.use("seaborn")
%matplotlib inline

import getml

print(f"getML API version: {getml.__version__}\n")

getml.engine.launch()
getml.engine.set_project("interstate94")
getML API version: 1.2.0

getML engine is already running.



Connected to project 'interstate94'
http://localhost:1709/#/listprojects/interstate94/
In [2]:
sys.path.append(os.path.join(sys.path[0], ".."))

from utils import Benchmark, FTTimeSeriesBuilder, TSFreshBuilder

1. Loading data

1.1 Download from source

We begin by downloading the data from the UC Irvine Machine Learning Repository:

In [3]:
traffic = getml.datasets.load_interstate94(roles=True, units=True)
Loading traffic...
[========================================] 100%
In [4]:
traffic.set_role(traffic.roles.categorical, getml.data.roles.unused_string)
In [5]:
traffic
Out[5]:
name ds traffic_volume holiday day month weekday hour year
role time_stamp target unused_string unused_string unused_string unused_string unused_string unused_string
unit time stamp, comparison only day month weekday hour year
0 2016-01-01 1513  New Years Day 1 1 4 0 2016
1 2016-01-01 01:00:00 1550  New Years Day 1 1 4 1 2016
2 2016-01-01 02:00:00 993  New Years Day 1 1 4 2 2016
3 2016-01-01 03:00:00 719  New Years Day 1 1 4 3 2016
4 2016-01-01 04:00:00 533  New Years Day 1 1 4 4 2016
... ...  ... ... ... ... ... ...
24091 2018-09-30 19:00:00 3543  No holiday 30 9 6 19 2018
24092 2018-09-30 20:00:00 2781  No holiday 30 9 6 20 2018
24093 2018-09-30 21:00:00 2159  No holiday 30 9 6 21 2018
24094 2018-09-30 22:00:00 1450  No holiday 30 9 6 22 2018
24095 2018-09-30 23:00:00 954  No holiday 30 9 6 23 2018

24096 rows x 8 columns
memory usage: 2.16 MB
name: traffic
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/interstate94/traffic/

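As a side note, the same DataFrame could also be constructed from a pandas DataFrame with explicit roles via getml.data.DataFrame.from_pandas, the same call we use for the featuretools features later on. The sketch below is hypothetical: traffic_pandas stands for a pandas frame with the same columns as the dataset above.

# Hypothetical alternative to getml.datasets.load_interstate94:
# build the DataFrame from a pandas frame and assign the roles by hand.
roles = {
    getml.data.roles.time_stamp: ["ds"],
    getml.data.roles.target: ["traffic_volume"],
    getml.data.roles.unused_string: [
        "holiday", "day", "month", "weekday", "hour", "year"
    ],
}

traffic_manual = getml.data.DataFrame.from_pandas(
    traffic_pandas, name="traffic_manual", roles=roles
)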
1.2 Define relational model

In [6]:
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))
In [7]:
time_series = getml.data.TimeSeries(
    population=traffic,
    split=split,
    alias="traffic",
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(24),
    lagged_targets=True,
)

time_series
Out[7]:

data model

diagram


traffic → traffic: join on ds <= ds, memory: 1.0 days, horizon: 1.0 hours, lagged targets allowed

staging

data frames staging table
0 traffic TRAFFIC__STAGING_TABLE_1
1 traffic TRAFFIC__STAGING_TABLE_2

container

population

subset name rows type
0 test traffic 4800 View
1 train traffic 19296 View

peripheral

name rows type
0 traffic 24096 DataFrame

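To build intuition for the horizon and memory parameters: for each prediction time t, FastProp may only aggregate peripheral rows whose time stamps fall in the 24-hour window ending one hour before t; because lagged_targets=True, past traffic_volume values inside that window may be used as well. The following rough pandas sketch (not getML's implementation, and ignoring exact boundary handling) computes one such hand-written feature for a single time stamp.

# Rough illustration of the self-join window defined by horizon and memory.
horizon = pd.Timedelta(hours=1)
memory = pd.Timedelta(hours=24)

traffic_pd = traffic.to_pandas()  # the getML DataFrame loaded above

t = pd.Timestamp("2017-06-01 12:00:00")
window = traffic_pd[
    (traffic_pd["ds"] > t - horizon - memory) & (traffic_pd["ds"] <= t - horizon)
]

# One hand-written feature: average traffic volume over the last 24 hours.
print(window["traffic_volume"].mean())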
2. Predictive modeling

We have loaded the data and defined the roles, units, and the abstract data model. Next, we create a getML pipeline for relational learning.

2.1 Propositionalization with getML's FastProp

In [8]:
seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)
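Conceptually, the Seasonal preprocessor derives calendar components (hour, weekday, month, year, ...) from time stamp columns so that FastProp can use them in its features. A rough pandas equivalent, shown purely for illustration and not getML's implementation, would be:

# Illustration only: extracting seasonal components from the time stamp.
seasonal_pd = traffic.to_pandas()[["ds"]].copy()
seasonal_pd["hour"] = seasonal_pd["ds"].dt.hour
seasonal_pd["weekday"] = seasonal_pd["ds"].dt.weekday
seasonal_pd["month"] = seasonal_pd["ds"].dt.month
seasonal_pd["year"] = seasonal_pd["ds"].dt.year
print(seasonal_pd.head())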

Build the pipeline

In [9]:
pipe_fp_fl = getml.pipeline.Pipeline(
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    data_model=time_series.data_model,
    tags=["feature learning", "fastprop"],
)

pipe_fp_fl
Out[9]:
Pipeline(data_model='traffic',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=['traffic'],
         predictors=[],
         preprocessors=['Seasonal'],
         share_selected_features=0.5,
         tags=['feature learning', 'fastprop'])
In [10]:
pipe_fp_fl.check(time_series.train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


OK.
In [11]:
benchmark = Benchmark()
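The Benchmark helper comes from the utils module of the getml-demo repository. It records wall-clock runtimes under a given name so that we can compare the libraries later. A minimal sketch of such a helper (hypothetical, not the actual implementation) could look like this:

import datetime
import time
from contextlib import contextmanager

class SimpleBenchmark:
    """Hypothetical stand-in for utils.Benchmark: records wall-clock runtimes."""

    def __init__(self):
        self.runtimes = {}

    @contextmanager
    def __call__(self, name):
        begin = time.monotonic()
        try:
            yield
        finally:
            self.runtimes[name] = datetime.timedelta(
                seconds=time.monotonic() - begin
            )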
In [12]:
with benchmark("fastprop"):
    pipe_fp_fl.fit(time_series.train)
    fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")
Checking data model...


Staging...
[========================================] 100%


OK.


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Trying 365 features...
[========================================] 100%


Trained pipeline.
Time taken: 0h:0m:12.490392



Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


In [13]:
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


In [14]:
predictor = getml.predictors.XGBoostRegressor()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)
In [15]:
pipe_fp_pr.fit(fastprop_train)
Checking data model...


Staging...
[========================================] 100%

Checking...
[========================================] 100%


OK.


Staging...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:0m:9.742785

Out[15]:
Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['prediction', 'fastprop'])

url: http://localhost:1709/#/getpipeline/interstate94/NdqZ5G/0/
In [16]:
pipe_fp_pr.score(fastprop_test)

Staging...
[========================================] 100%


Out[16]:
   date time            set used        target          mae       rmse      rsquared
0  2022-03-30 00:48:07  fastprop_train  traffic_volume  198.9482  292.2493  0.9779
1  2022-03-30 00:48:07  fastprop_test   traffic_volume  180.4867  261.9389  0.9827
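The scores above can also be reproduced by hand from the pipeline's predictions. The sketch below assumes that pipe_fp_pr.predict returns a numpy array for the given DataFrame and that the target column can be exported via to_numpy; rsquared is computed here as the squared correlation between predictions and targets.

# Sketch: recompute mae, rmse and rsquared from the raw predictions.
y_true = fastprop_test["traffic_volume"].to_numpy()
y_pred = pipe_fp_pr.predict(fastprop_test).ravel()

mae = np.abs(y_true - y_pred).mean()
rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
rsquared = np.corrcoef(y_true, y_pred)[0, 1] ** 2

print(mae, rmse, rsquared)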

2.2 Propositionalization with featuretools

In [17]:
traffic_train = time_series.train.population
traffic_test = time_series.test.population
In [18]:
dfs_pandas = {}

for df in [traffic_train, traffic_test, traffic]:
    dfs_pandas[df.name] = df.drop(df.roles.unused).to_pandas()
    dfs_pandas[df.name]["join_key"] = 1
In [19]:
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=24),
    column_id="join_key",
    time_stamp="ds",
    target="traffic_volume",
    allow_lagged_targets=True,
)
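FTTimeSeriesBuilder is a thin wrapper from the utils module that calls featuretools under the hood. The following is a heavily simplified, hypothetical sketch of the kind of featuretools code involved (featuretools >= 1.0 API assumed): the constant join_key attaches all hourly rows to a single parent row, so that deep feature synthesis can apply aggregation primitives across the series. The actual builder additionally enforces the one-hour horizon and 24-hour memory via cutoff times, which is omitted here.

import featuretools as ft

# Hypothetical, simplified sketch; the real logic lives in utils.FTTimeSeriesBuilder.
df = dfs_pandas["train"].copy()
df["id"] = range(len(df))  # explicit index for the hourly rows

es = ft.EntitySet(id="traffic")
es = es.add_dataframe(
    dataframe_name="hourly", dataframe=df, index="id", time_index="ds"
)

# The constant join_key turns the whole series into children of one parent row.
es = es.normalize_dataframe(
    base_dataframe_name="hourly", new_dataframe_name="series", index="join_key"
)

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="hourly",
    agg_primitives=["mean", "std", "min", "max", "sum"],
    trans_primitives=[],
    max_depth=2,
)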
In [20]:
with benchmark("featuretools"):
    featuretools_train = ft_builder.fit(dfs_pandas["train"])

featuretools_test = ft_builder.transform(dfs_pandas["test"])
featuretools: Trying features...
/usr/local/lib/python3.9/dist-packages/featuretools/synthesis/dfs.py:309: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  agg_primitives: ['all', 'any', 'count', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)
Selecting the best out of 59 features...
Time taken: 0h:3m:0.332521

/usr/local/lib/python3.9/dist-packages/featuretools/synthesis/dfs.py:309: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  agg_primitives: ['all', 'any', 'count', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)
In [21]:
roles = {
    getml.data.roles.join_key: ["join_key"],
    getml.data.roles.target: ["traffic_volume"],
    getml.data.roles.time_stamp: ["ds"],
}

df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=roles
)

df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=roles
)
In [22]:
df_featuretools_train.set_role(
    df_featuretools_train.roles.unused, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.roles.unused, getml.data.roles.numerical
)
In [23]:
predictor = getml.predictors.XGBoostRegressor()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr
Out[23]:
Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['prediction', 'featuretools'])
In [24]:
pipe_ft_pr.check(df_featuretools_train)
Checking data model...


Staging...
[========================================] 100%

Checking...
[========================================] 100%


OK.
In [25]:
pipe_ft_pr.fit(df_featuretools_train)
Checking data model...


Staging...
[========================================] 100%


OK.


Staging...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:0m:2.342894

Out[25]:
Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function=None,
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['prediction', 'featuretools'])

url: http://localhost:1709/#/getpipeline/interstate94/fP9Fja/0/
In [26]:
pipe_ft_pr.score(df_featuretools_test)

Staging...
[========================================] 100%


Out[26]:
   date time            set used            target          mae       rmse      rsquared
0  2022-03-30 00:51:58  featuretools_train  traffic_volume  217.4832  317.1563  0.974
1  2022-03-30 00:51:58  featuretools_test   traffic_volume  209.5696  330.5634  0.9724

2.3 Propositionalization with tsfresh

tsfresh failed to run due to an apparent bug in the tsfresh library and is therefore excluded from this comparison.

3. Comparison

In [27]:
num_features = dict(
    fastprop=461,
    featuretools=59,
)

runtime_per_feature = [
    benchmark.runtimes["fastprop"] / num_features["fastprop"],
    benchmark.runtimes["featuretools"] / num_features["featuretools"],
]

features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]

normalized_runtime_per_feature = [
    r / runtime_per_feature[0] for r in runtime_per_feature
]

comparison = pd.DataFrame(
    dict(
        runtime=[benchmark.runtimes["fastprop"], benchmark.runtimes["featuretools"]],
        num_features=num_features.values(),
        features_per_second=features_per_second,
        normalized_runtime=[
            1,
            benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
        ],
        normalized_runtime_per_feature=normalized_runtime_per_feature,
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse],
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae],
    )
)

comparison.index = ["getML: FastProp", "featuretools"]
In [28]:
comparison
Out[28]:
                 runtime                 num_features  features_per_second  normalized_runtime  normalized_runtime_per_feature  rsquared  rmse        mae
getML: FastProp  0 days 00:00:18.836219  461           24.474412            1.000000            1.000000                        0.982678  261.938873  180.486734
featuretools     0 days 00:03:00.333020  59            0.327172             9.573738            74.805844                       0.972389  330.563417  209.569640
In [29]:
comparison.to_csv("comparisons/interstate94.csv")

Why is FastProp so fast?

First, FastProp hugely benefits from getML's custom-built, C++-native in-memory database engine. The engine is highly optimized for relational data structures and uses information about the relational structure of the data to store it efficiently and to carry out computations on it. This matters in particular for time series, where the current observation is related to a certain number of observations from the past: other libraries have to deal explicitly with this inherent structure of (multivariate) time series, and such explicit transformations are costly in terms of both memory and computational resources.

All operations on data stored in getML's engine benefit from implementations in modern C++. Further, getML takes advantage of functional design patterns in which all column-based operations are evaluated lazily. Aggregations, for example, are carried out only on the rows that matter, even under complex conditions that may span multiple tables of the relational model. Duplicate operations are reduced to a bare minimum by keeping track of the relational data model.

Beyond the mere advantage in performance, FastProp also has an edge in memory consumption: it builds on an abstract data model, relies on efficient storage patterns (pointers and indices) for the concrete data, and benefits from the same lazy, functional design. This allows working with data sets of substantial size without falling back to distributed computing models.
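The following toy example illustrates the idea of lazy, column-based evaluation described above. It is an intentionally naive Python sketch, not getML's C++ implementation: operations on columns only build up a description of the computation, and nothing is evaluated until the result is actually needed, and then only for the rows that matter.

# Toy illustration of lazy column evaluation (not getML's implementation).
class LazyColumn:
    def __init__(self, func):
        self._func = func  # describes how to compute the column

    def __add__(self, other):
        # Combining columns creates a new description; nothing is computed yet.
        return LazyColumn(
            lambda rows: [a + b for a, b in zip(self._func(rows), other._func(rows))]
        )

    def evaluate(self, rows):
        # Evaluation happens only here, and only on the requested rows.
        return self._func(rows)

data = {"volume": [1500, 1550, 990], "lagged": [1400, 1500, 1550]}

volume = LazyColumn(lambda rows: [data["volume"][i] for i in rows])
lagged = LazyColumn(lambda rows: [data["lagged"][i] for i in rows])

total = volume + lagged        # no computation yet
print(total.evaluate([0, 2]))  # computed only for rows 0 and 2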

Next Steps

If you are interested in further real-world applications of getML, visit the notebook section on getml.com. If you want to gain a deeper understanding of the notebooks' contents or download the code behind them, have a look at the getml-demo repository. There, you can also find further benchmarks of getML.

Want to try getML without much hassle? Just head to try.getml.com to launch an instance of getML directly in your browser.

Further, the getML documentation provides additional material if you want to learn more about getML.

Get in contact

If you have any questions, just write us an email. Prefer a private demo for your team? Just contact us to arrange an introduction to getML.