# Propositionalization: Occupancy detection¶

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

• Prediction type: Binary classification
• Domain: Energy
• Prediction target: Room occupancy
• Source data: 1 table, 32k rows
• Population size: 32k

# Background¶

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
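To make this concrete, here is a minimal sketch of propositionalization in plain pandas on hypothetical toy data: a fixed set of aggregations is applied to a peripheral table, and the generated features are joined onto the population table.

```python
import pandas as pd

# Toy population table (one row per prediction) and peripheral table
# (many rows per population row) -- purely illustrative data.
population = pd.DataFrame({"id": [1, 2], "target": [0, 1]})
peripheral = pd.DataFrame(
    {"id": [1, 1, 2, 2, 2], "value": [1.0, 3.0, 2.0, 4.0, 6.0]}
)

# Apply a fixed set of aggregations to each group of the peripheral table ...
features = peripheral.groupby("id")["value"].agg(["mean", "min", "max", "count"])

# ... and join the generated features onto the population table.
table = population.join(features, on="id")
print(table)
```

Real propositionalization engines differ mainly in how many aggregations they try and how efficiently they compute them, which is exactly what this notebook benchmarks.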

getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.

Our use case here is a public domain data set for predicting room occupancy from sensor data. For further details about the data set refer to the full notebook.

### A web frontend for getML¶

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows easy data and feature exploration. You can launch the getML monitor here.

### Where is this running?¶

Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis¶

Let's get started with the analysis and set up the session:

In [1]:
import datetime
import os
import sys
import time
from urllib import request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use("seaborn")
%matplotlib inline

import getml

print(f"getML API version: {getml.__version__}\n")

getml.engine.launch()
getml.engine.set_project("occupancy")

getML API version: 1.2.0

Launched the getML engine. The log output will be stored in /home/patrick/.getML/logs/20220330010336.log.

Connected to project 'occupancy'
http://localhost:1709/#/listprojects/occupancy/

In [2]:
sys.path.append(os.path.join(sys.path[0], ".."))

from utils import Benchmark, FTTimeSeriesBuilder, TSFreshBuilder


The data set can be downloaded directly from GitHub. It is conveniently separated into a train, a validation and a test set. This allows us to directly benchmark our results against those of the original paper later.

In [3]:
data_test, data_train, data_validate = getml.datasets.load_occupancy(roles=True)

Loading population_train...
[========================================] 100%

[========================================] 100%

[========================================] 100%

In [4]:
data_all, split = getml.data.split.concat(
"data_all",
train=data_train,
validation=data_validate,
test=data_test,
)


The train set looks like this:

In [5]:
data_train

Out[5]:
| name | date | Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
|---|---|---|---|---|---|---|---|
| role | time_stamp | target | numerical | numerical | numerical | numerical | numerical |
| unit | time stamp |  |  |  |  |  |  |
| 0 | 2015-02-11 14:48:00 | 1 | 21.76 | 31.1333 | 437.3333 | 1029.6667 | 0.005021 |
| 1 | 2015-02-11 14:49:00 | 1 | 21.79 | 31 | 437.3333 | 1000 | 0.005009 |
| 2 | 2015-02-11 14:50:00 | 1 | 21.7675 | 31.1225 | 434 | 1003.75 | 0.005022 |
| 3 | 2015-02-11 14:51:00 | 1 | 21.7675 | 31.1225 | 439 | 1009.5 | 0.005022 |
| 4 | 2015-02-11 14:51:59 | 1 | 21.79 | 31.1333 | 437.3333 | 1005.6667 | 0.00503 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9747 | 2015-02-18 09:15:00 | 1 | 20.815 | 27.7175 | 429.75 | 1505.25 | 0.004213 |
| 9748 | 2015-02-18 09:16:00 | 1 | 20.865 | 27.745 | 423.5 | 1514.5 | 0.00423 |
| 9749 | 2015-02-18 09:16:59 | 1 | 20.89 | 27.745 | 423.5 | 1521.5 | 0.004237 |
| 9750 | 2015-02-18 09:17:59 | 1 | 20.89 | 28.0225 | 418.75 | 1632 | 0.004279 |
| 9751 | 2015-02-18 09:19:00 | 1 | 21 | 28.1 | 409 | 1864 | 0.004321 |

9752 rows x 7 columns
memory usage: 0.55 MB
name: population_test
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/occupancy/population_test/

## 2. Predictive modeling¶

We have loaded the data and defined the roles and units. Next, we define the data model and create getML pipelines for relational learning.

### 2.1 Propositionalization with getML's FastProp¶

We use all possible aggregations. Because tsfresh and featuretools are single-threaded, we limit FastProp to a single thread as well to ensure a fair comparison.

In [6]:
# Our forecast horizon is 0.
# We do not predict the future, instead we infer
# the present state from current and past sensor data.
horizon = 0.0

# We do not allow the time series features
# to use target values from the past.
# (Otherwise, we would need the horizon to
# be greater than 0.0).
allow_lagged_targets = False

# We want our time series features to only use
# data from the last 15 minutes
memory = getml.data.time.minutes(15)

time_series = getml.data.TimeSeries(
population=data_all,
split=split,
time_stamps="date",
horizon=horizon,
memory=memory,
lagged_targets=allow_lagged_targets,
)

time_series

Out[6]:

## data model

#### staging

|  | data frames | staging table |
|---|---|---|
| 0 | population | POPULATION__STAGING_TABLE_1 |
| 1 | data_all | DATA_ALL__STAGING_TABLE_2 |

## container

#### population

|  | subset | name | rows | type |
|---|---|---|---|---|
| 0 | test | data_all | 8142 | View |
| 1 | train | data_all | 9753 | View |
| 2 | validation | data_all | 2665 | View |

#### peripheral

|  | name | rows | type |
|---|---|---|---|
| 0 | data_all | 20560 | DataFrame |
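The memory parameter above means that every feature may only aggregate observations from the trailing 15 minutes. In plain pandas, this corresponds roughly to a time-based rolling window (toy sensor data, purely illustrative):

```python
import pandas as pd

# Minute-wise toy CO2 readings, mimicking the layout of the occupancy data.
df = pd.DataFrame(
    {"CO2": [1000.0, 1010.0, 1030.0, 1020.0]},
    index=pd.date_range("2015-02-11 14:48", periods=4, freq="min"),
)

# Each row aggregates only the last 15 minutes of data,
# including the current observation (horizon = 0).
rolling_mean = df["CO2"].rolling("15min").mean()
print(rolling_mean)
```

FastProp applies many such window-restricted aggregations at once, across all columns, instead of one hand-written rolling statistic at a time.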
In [7]:
feature_learner = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
aggregation=getml.feature_learning.FastProp.agg_sets.All,
)


Next, we create the pipeline. In contrast to our usual approach, we create two pipelines in this notebook: one for feature learning (suffix _fl) and one for prediction (suffix _pr). This allows for a fair comparison of runtimes.

In [8]:
pipe_fp_fl = getml.pipeline.Pipeline(
feature_learners=[feature_learner],
data_model=time_series.data_model,
tags=["feature learning", "fastprop"],
)

In [9]:
pipe_fp_fl.check(time_series.train)

Checking data model...

Staging...
[========================================] 100%

Checking...
[========================================] 100%

OK.


The wrappers around featuretools and tsfresh fit on the training set and then return the training features. We therefore measure the time it takes getML's FastProp algorithm to fit on the training set and create the training features.
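The Benchmark helper comes from utils; we do not reproduce it here, but a minimal context-manager stand-in with the same interface (a hypothetical sketch, assuming it stores wall-clock runtimes as pandas Timedeltas in a runtimes dict) might look like this:

```python
import time
from contextlib import contextmanager

import pandas as pd


class MiniBenchmark:
    """Hypothetical stand-in for utils.Benchmark: times named code blocks."""

    def __init__(self):
        self.runtimes = {}

    @contextmanager
    def __call__(self, name):
        begin = time.perf_counter()
        try:
            yield
        finally:
            # Store as a Timedelta so runtimes can later be divided and compared.
            self.runtimes[name] = pd.Timedelta(
                seconds=time.perf_counter() - begin
            )


bench = MiniBenchmark()
with bench("sleep"):
    time.sleep(0.01)
print(bench.runtimes["sleep"])
```

The actual Benchmark used below may record additional details, but the comparison in section 3 only relies on the runtimes dict.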

In [10]:
benchmark = Benchmark()

In [11]:
with benchmark("fastprop"):
pipe_fp_fl.fit(time_series.train)
fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")

Checking data model...

Staging...
[========================================] 100%

OK.

Staging...
[========================================] 100%

FastProp: Trying 289 features...
[========================================] 100%

Trained pipeline.
Time taken: 0h:0m:2.91909

Staging...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


In [12]:
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")


Staging...
[========================================] 100%

FastProp: Building features...
[========================================] 100%



Now we create a dedicated prediction pipeline and provide the FastProp features (contained in fastprop_train and fastprop_test).

In [13]:
predictor = getml.predictors.XGBoostClassifier()

pipe_fp_pr = getml.pipeline.Pipeline(
tags=["prediction", "fastprop"], predictors=[predictor]
)

In [14]:
pipe_fp_pr.check(fastprop_train)

pipe_fp_pr.fit(fastprop_train)

Checking data model...

Staging...
[========================================] 100%

Checking...
[========================================] 100%

OK.
Checking data model...

Staging...
[========================================] 100%

OK.

Staging...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%

Trained pipeline.
Time taken: 0h:0m:8.948003


Out[14]:
Pipeline(data_model='population',
feature_learners=[],
feature_selectors=[],
include_categorical=False,
loss_function=None,
peripheral=[],
predictors=['XGBoostClassifier'],
preprocessors=[],
share_selected_features=0.5,
tags=['prediction', 'fastprop'])

url: http://localhost:1709/#/getpipeline/occupancy/S0K2yk/0/
In [15]:
pipe_fp_pr.score(fastprop_test)


Staging...
[========================================] 100%


Out[15]:
|  | date time | set used | target | accuracy | auc | cross entropy |
|---|---|---|---|---|---|---|
| 0 | 2022-03-30 01:04:00 | fastprop_train | Occupancy | 0.9997 | 1.0 | 0.004466 |
| 1 | 2022-03-30 01:04:01 | fastprop_test | Occupancy | 0.9889 | 0.9982 | 0.046245 |

### 2.2 Propositionalization with featuretools¶

In [16]:
data_train = time_series.train.population.to_df("train")
data_test = time_series.test.population.to_df("test")

In [17]:
dfs_pandas = {}

for df in getml.project.data_frames:
dfs_pandas[df.name] = df.to_pandas()
dfs_pandas[df.name]["id"] = 1

In [18]:
ft_builder = FTTimeSeriesBuilder(
num_features=200,
horizon=pd.Timedelta(minutes=0),
memory=pd.Timedelta(minutes=15),
column_id="id",
time_stamp="date",
target="Occupancy",
)


The FTTimeSeriesBuilder provides a fit method that is designed to be equivalent to the fit method of the predictorless getML pipeline above.

In [19]:
with benchmark("featuretools"):
featuretools_train = ft_builder.fit(dfs_pandas["train"])

featuretools_test = ft_builder.transform(dfs_pandas["test"])

df_featuretools_train = getml.data.DataFrame.from_pandas(
featuretools_train, name="featuretools_train", roles=data_train.roles
)
df_featuretools_test = getml.data.DataFrame.from_pandas(
featuretools_test, name="featuretools_test", roles=data_train.roles
)

df_featuretools_train.set_role(
df_featuretools_train.roles.unused, getml.data.roles.numerical
)

df_featuretools_test.set_role(
df_featuretools_test.roles.unused, getml.data.roles.numerical
)

featuretools: Trying features...

/usr/local/lib/python3.9/dist-packages/featuretools/synthesis/dfs.py:309: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
agg_primitives: ['all', 'any', 'count', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data.
warnings.warn(warning_msg, UnusedPrimitiveWarning)

Selecting the best out of 103 features...
Time taken: 0h:3m:11.436529


/usr/local/lib/python3.9/dist-packages/featuretools/synthesis/dfs.py:309: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
agg_primitives: ['all', 'any', 'count', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data.
warnings.warn(warning_msg, UnusedPrimitiveWarning)

In [20]:
predictor = getml.predictors.XGBoostClassifier()

pipe_ft_pr = getml.pipeline.Pipeline(
tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

Out[20]:
Pipeline(data_model='population',
feature_learners=[],
feature_selectors=[],
include_categorical=False,
loss_function=None,
peripheral=[],
predictors=['XGBoostClassifier'],
preprocessors=[],
share_selected_features=0.5,
tags=['prediction', 'featuretools'])
In [21]:
pipe_ft_pr.check(df_featuretools_train)

Checking data model...

Staging...
[========================================] 100%

Checking...
[========================================] 100%

WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').

In [22]:
pipe_ft_pr.fit(df_featuretools_train)

Checking data model...

Staging...
[========================================] 100%

WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').

Staging...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%

Trained pipeline.
Time taken: 0h:0m:3.44978


Out[22]:
Pipeline(data_model='population',
feature_learners=[],
feature_selectors=[],
include_categorical=False,
loss_function=None,
peripheral=[],
predictors=['XGBoostClassifier'],
preprocessors=[],
share_selected_features=0.5,
tags=['prediction', 'featuretools'])

url: http://localhost:1709/#/getpipeline/occupancy/JKgueG/0/
In [23]:
pipe_ft_pr.score(df_featuretools_test)


Staging...
[========================================] 100%


Out[23]:
|  | date time | set used | target | accuracy | auc | cross entropy |
|---|---|---|---|---|---|---|
| 0 | 2022-03-30 01:10:00 | featuretools_train | Occupancy | 0.9993 | 1.0 | 0.00537 |
| 1 | 2022-03-30 01:10:00 | featuretools_test | Occupancy | 0.9881 | 0.9971 | 0.050637 |

### 2.3 Propositionalization with tsfresh¶

In [24]:
tsfresh_builder = TSFreshBuilder(
num_features=200, memory=15, column_id="id", time_stamp="date", target="Occupancy"
)

with benchmark("tsfresh"):
tsfresh_train = tsfresh_builder.fit(dfs_pandas["train"])

tsfresh_test = tsfresh_builder.transform(dfs_pandas["test"])

df_tsfresh_train = getml.data.DataFrame.from_pandas(
tsfresh_train, name="tsfresh_train", roles=data_train.roles
)
df_tsfresh_test = getml.data.DataFrame.from_pandas(
tsfresh_test, name="tsfresh_test", roles=data_train.roles
)

df_tsfresh_train.set_role(df_tsfresh_train.roles.unused, getml.data.roles.numerical)

df_tsfresh_test.set_role(df_tsfresh_test.roles.unused, getml.data.roles.numerical)

/home/patrick/.local/lib/python3.9/site-packages/tsfresh/utilities/dataframe_functions.py:520: UserWarning: Your time stamps are not uniformly sampled, which makes rolling nonsensical in some domains.
warnings.warn(
Rolling: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:05<00:00, 10.27it/s]
Feature Extraction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:12<00:00,  4.95it/s]
Feature Extraction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:11<00:00,  5.29it/s]

Selecting the best out of 65 features...
Time taken: 0h:0m:34.374818


/home/patrick/.local/lib/python3.9/site-packages/tsfresh/utilities/dataframe_functions.py:520: UserWarning: Your time stamps are not uniformly sampled, which makes rolling nonsensical in some domains.
warnings.warn(
Rolling: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:04<00:00, 12.34it/s]
Feature Extraction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:10<00:00,  5.96it/s]
Feature Extraction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:09<00:00,  6.16it/s]

In [25]:
pipe_tsf_pr = getml.pipeline.Pipeline(
tags=["prediction", "tsfresh"], predictors=[predictor]
)

pipe_tsf_pr

Out[25]:
Pipeline(data_model='population',
feature_learners=[],
feature_selectors=[],
include_categorical=False,
loss_function=None,
peripheral=[],
predictors=['XGBoostClassifier'],
preprocessors=[],
share_selected_features=0.5,
tags=['prediction', 'tsfresh'])
In [26]:
pipe_tsf_pr.check(df_tsfresh_train)

Checking data model...

Staging...
[========================================] 100%

Checking...
[========================================] 100%

WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').

In [27]:
pipe_tsf_pr.fit(df_tsfresh_train)

Checking data model...

Staging...
[========================================] 100%

WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').

Staging...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%

Trained pipeline.
Time taken: 0h:0m:3.72172


Out[27]:
Pipeline(data_model='population',
feature_learners=[],
feature_selectors=[],
include_categorical=False,
loss_function=None,
peripheral=[],
predictors=['XGBoostClassifier'],
preprocessors=[],
share_selected_features=0.5,
tags=['prediction', 'tsfresh'])

url: http://localhost:1709/#/getpipeline/occupancy/D8A0II/0/
In [28]:
pipe_tsf_pr.score(df_tsfresh_test)


Staging...
[========================================] 100%


Out[28]:
|  | date time | set used | target | accuracy | auc | cross entropy |
|---|---|---|---|---|---|---|
| 0 | 2022-03-30 01:11:09 | tsfresh_train | Occupancy | 0.9985 | 1.0 | 0.006898 |
| 1 | 2022-03-30 01:11:09 | tsfresh_test | Occupancy | 0.9877 | 0.9979 | 0.049359 |

## 3. Comparison¶

In [29]:
num_features = dict(
fastprop=289,
featuretools=103,
tsfresh=60,
)

runtime_per_feature = [
benchmark.runtimes["fastprop"] / num_features["fastprop"],
benchmark.runtimes["featuretools"] / num_features["featuretools"],
benchmark.runtimes["tsfresh"] / num_features["tsfresh"],
]

features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]

normalized_runtime_per_feature = [
r / runtime_per_feature[0] for r in runtime_per_feature
]

comparison = pd.DataFrame(
dict(
runtime=[
benchmark.runtimes["fastprop"],
benchmark.runtimes["featuretools"],
benchmark.runtimes["tsfresh"],
],
num_features=num_features.values(),
features_per_second=features_per_second,
normalized_runtime=[
1,
benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
benchmark.runtimes["tsfresh"] / benchmark.runtimes["fastprop"],
],
normalized_runtime_per_feature=normalized_runtime_per_feature,
accuracy=[pipe_fp_pr.accuracy, pipe_ft_pr.accuracy, pipe_tsf_pr.accuracy],
auc=[pipe_fp_pr.auc, pipe_ft_pr.auc, pipe_tsf_pr.auc],
cross_entropy=[
pipe_fp_pr.cross_entropy,
pipe_ft_pr.cross_entropy,
pipe_tsf_pr.cross_entropy,
],
)
)

comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]

In [30]:
comparison

Out[30]:
|  | runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | accuracy | auc | cross_entropy |
|---|---|---|---|---|---|---|---|---|
| getML: FastProp | 0 days 00:00:04.814605 | 289 | 60.024010 | 1.000000 | 1.000000 | 0.988946 | 0.998243 | 0.046245 |
| featuretools | 0 days 00:03:11.436908 | 103 | 0.538036 | 39.761706 | 111.561285 | 0.988086 | 0.997104 | 0.050637 |
| tsfresh | 0 days 00:00:34.374999 | 60 | 1.745454 | 7.139734 | 34.388776 | 0.987718 | 0.997861 | 0.049359 |
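The normalized runtimes are simple ratios of the measured wall-clock times. Recomputing them from the runtime column confirms that, in this run, FastProp was roughly 40 times faster than featuretools and roughly 7 times faster than tsfresh:

```python
import pandas as pd

# Wall-clock runtimes as measured above.
fastprop = pd.Timedelta("0 days 00:00:04.814605")
featuretools = pd.Timedelta("0 days 00:03:11.436908")
tsfresh = pd.Timedelta("0 days 00:00:34.374999")

# Dividing two Timedeltas yields a plain float ratio.
print(round(featuretools / fastprop, 2))  # ~39.76
print(round(tsfresh / fastprop, 2))       # ~7.14
```

And this is while FastProp generated more features than either competitor, so the per-feature gap is even larger.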
In [31]:
# export for further use
comparison.to_csv("comparisons/occupancy.csv")


## Why is FastProp so fast?¶

First, FastProp hugely benefits from getML's custom-built, C++-native in-memory database engine. The engine is highly optimized for relational data structures and uses information about the relational structure of the data to store it and compute on it efficiently. This matters in particular for time series, where we relate the current observation to a certain number of observations from the past: other libraries have to deal explicitly with this inherent structure of (multivariate) time series, and such explicit transformations are costly in terms of both memory and computational resources.

Second, all operations on data stored in getML's engine benefit from implementations in modern C++, and getML takes advantage of functional design patterns in which all column-based operations are evaluated lazily. Aggregations, for example, are carried out only on the rows that matter (taking into account even complex conditions that might span multiple tables in the relational model), and duplicate operations are reduced to a bare minimum by keeping track of the relational data model.

Beyond the mere performance advantage, FastProp also has an edge in memory consumption: it builds on an abstract data model, relies on efficient storage patterns (pointers and indices) for the concrete data, and profits from the same lazy, functional evaluation. This allows working with data sets of substantial size without falling back to distributed computing models.
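The lazy, column-based evaluation described above can be illustrated with a toy sketch (this is not getML's actual implementation): column operations merely compose functions, and values are materialized only for the rows an aggregation actually needs.

```python
import numpy as np


class LazyColumn:
    """Toy sketch of a lazily evaluated column: arithmetic composes
    functions instead of materializing intermediate arrays."""

    def __init__(self, fn):
        self._fn = fn  # maps an array of row indices to values

    def __add__(self, other):
        # No computation here -- just composition of the two columns.
        return LazyColumn(lambda idx: self._fn(idx) + other._fn(idx))

    def aggregate(self, idx, agg):
        # Materialize only the rows the aggregation needs.
        return agg(self._fn(idx))


data = {
    "co2": np.array([1000.0, 1010.0, 1030.0, 1020.0]),
    "temp": np.array([21.0, 21.5, 22.0, 21.8]),
}

co2 = LazyColumn(lambda idx: data["co2"][idx])
temp = LazyColumn(lambda idx: data["temp"][idx])

# Building the combined column does no work yet.
combined = co2 + temp

# Only rows 1..3 (say, the rows inside a 15-minute window) are evaluated.
print(combined.aggregate(np.array([1, 2, 3]), np.mean))
```

Applied to many columns, windows and aggregations at once, this pattern avoids materializing the explicitly rolled-out copies of the data that other libraries construct.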

# Next Steps¶

If you are interested in further real-world applications of getML, visit the notebook section on getml.com. If you want to gain a deeper understanding of our notebooks' contents or download the code behind them, have a look at the getml-demo repository. There, you can also find further benchmarks of getML.

Want to try getML without much hassle? Just head to try.getml.com to launch an instance of getML directly in your browser.