Formula 1 - Predicting the winner of a race

NOTE: Due to featuretools' memory requirement, this notebook will not run on MyBinder.

In this notebook we will benchmark getML against featuretools to predict the winner of a Formula 1 race.

Summary:

  • Prediction type: Classification model
  • Domain: Sports
  • Prediction target: Win
  • Population size: 31578

Author: Dr. Patrick Urbanke

Background

We would like to develop a prediction model for Formula 1 races that allows us to predict the winner of a race before it has started.

We use a dataset of all Formula 1 races from 1950 to 2017. The dataset includes information such as the time taken in each lap, the time taken for pit stops, the performance in the qualifying rounds, etc.

The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

We will benchmark getML's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, which is similar to getML's FastProp.

A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows easy data and feature exploration. You can launch the getML monitor here.

Where is this running?

Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import featuretools
import getml

getml.engine.set_project('formula1')


Connected to project 'formula1'

1. Loading data

1.1 Download from source

We begin by downloading the data:

In [2]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="ErgastF1",
    port=3306,
    user="guest",
    password="relational"
)

conn
Out[2]:
Connection(conn_id='default',
           dbname='ErgastF1',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)
In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
In [4]:
driverStandings = load_if_needed("driverStandings")
drivers = load_if_needed("drivers")
lapTimes = load_if_needed("lapTimes")
pitStops = load_if_needed("pitStops")
races = load_if_needed("races")
qualifying = load_if_needed("qualifying")
In [5]:
driverStandings
Out[5]:
name driverStandingsId raceId driverId points position wins positionText
role unused_float unused_float unused_float unused_float unused_float unused_float unused_string
0 1  18  1  10  1  1  1
1 2  18  2  8  2  0  2
2 3  18  3  6  3  0  3
3 4  18  4  5  4  0  4
4 5  18  5  4  5  0  5
...  ...  ...  ...  ...  ...  ...
31573 68456  982  835  8  16  0  16
31574 68457  982  154  26  13  0  13
31575 68458  982  836  5  18  0  18
31576 68459  982  18  0  22  0  22
31577 68460  982  814  0  23  0  23

31578 rows x 7 columns
memory usage: 1.85 MB
name: driverStandings
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/driverStandings/

In [6]:
drivers
Out[6]:
name driverId number driverRef code forename surname dob nationality url
role unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string unused_string
0 1  44  hamilton HAM Lewis Hamilton 1985-01-07 British http://en.wikipedia.org/wiki/Lew...
1 2  nan  heidfeld HEI Nick Heidfeld 1977-05-10 German http://en.wikipedia.org/wiki/Nic...
2 3  6  rosberg ROS Nico Rosberg 1985-06-27 German http://en.wikipedia.org/wiki/Nic...
3 4  14  alonso ALO Fernando Alonso 1981-07-29 Spanish http://en.wikipedia.org/wiki/Fer...
4 5  nan  kovalainen KOV Heikki Kovalainen 1981-10-19 Finnish http://en.wikipedia.org/wiki/Hei...
...  ...  ... ... ... ... ... ... ...
835 837  88  haryanto HAR Rio Haryanto 1993-01-22 Indonesian http://en.wikipedia.org/wiki/Rio...
836 838  2  vandoorne VAN Stoffel Vandoorne 1992-03-26 Belgian http://en.wikipedia.org/wiki/Sto...
837 839  31  ocon OCO Esteban Ocon 1996-09-17 French http://en.wikipedia.org/wiki/Est...
838 840  18  stroll STR Lance Stroll 1998-10-29 Canadian http://en.wikipedia.org/wiki/Lan...
839 841  36  giovinazzi GIO Antonio Giovinazzi 1993-12-14 Italian http://en.wikipedia.org/wiki/Ant...

840 rows x 9 columns
memory usage: 0.14 MB
name: drivers
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/drivers/

In [7]:
lapTimes
Out[7]:
name raceId driverId lap position milliseconds time
role unused_float unused_float unused_float unused_float unused_float unused_string
0 1  1  1  13  109088  1:49.088
1 1  1  2  12  93740  1:33.740
2 1  1  3  11  91600  1:31.600
3 1  1  4  10  91067  1:31.067
4 1  1  5  10  92129  1:32.129
...  ...  ...  ...  ...  ...
420364 982  840  54  8  107528  1:47.528
420365 982  840  55  8  107512  1:47.512
420366 982  840  56  8  108143  1:48.143
420367 982  840  57  8  107848  1:47.848
420368 982  840  58  8  108699  1:48.699

420369 rows x 6 columns
memory usage: 23.96 MB
name: lapTimes
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/lapTimes/

In [8]:
pitStops
Out[8]:
name raceId driverId stop lap milliseconds time duration
role unused_float unused_float unused_float unused_float unused_float unused_string unused_string
0 841  1  1  16  23227  17:28:24 23.227
1 841  1  2  36  23199  17:59:29 23.199
2 841  2  1  15  22994  17:27:41 22.994
3 841  2  2  30  25098  17:51:32 25.098
4 841  3  1  16  23716  17:29:00 23.716
...  ...  ...  ...  ...  ... ...
6065 982  839  6  38  29134  21:29:07 29.134
6066 982  840  1  1  37403  20:06:43 37.403
6067 982  840  2  2  29294  20:10:07 29.294
6068 982  840  3  3  25584  20:13:16 25.584
6069 982  840  4  26  29412  21:05:07 29.412

6070 rows x 7 columns
memory usage: 0.44 MB
name: pitStops
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/pitStops/

In [9]:
races
Out[9]:
name raceId year round circuitId name date time url
role unused_float unused_float unused_float unused_float unused_string unused_string unused_string unused_string
0 1  2009  1  1  Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/200...
1 2  2009  2  2  Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/200...
2 3  2009  3  17  Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/200...
3 4  2009  4  3  Bahrain Grand Prix 2009-04-26 12:00:00 http://en.wikipedia.org/wiki/200...
4 5  2009  5  4  Spanish Grand Prix 2009-05-10 12:00:00 http://en.wikipedia.org/wiki/200...
...  ...  ...  ...  ... ... ... ...
971 984  2017  16  22  Japanese Grand Prix 2017-10-08 05:00:00 https://en.wikipedia.org/wiki/20...
972 985  2017  17  69  United States Grand Prix 2017-10-22 19:00:00 https://en.wikipedia.org/wiki/20...
973 986  2017  18  32  Mexican Grand Prix 2017-10-29 19:00:00 https://en.wikipedia.org/wiki/20...
974 987  2017  19  18  Brazilian Grand Prix 2017-11-12 16:00:00 https://en.wikipedia.org/wiki/20...
975 988  2017  20  24  Abu Dhabi Grand Prix 2017-11-26 17:00:00 https://en.wikipedia.org/wiki/20...

976 rows x 8 columns
memory usage: 0.15 MB
name: races
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/races/

In [10]:
qualifying
Out[10]:
name qualifyId raceId driverId constructorId number position q1 q2 q3
role unused_float unused_float unused_float unused_float unused_float unused_float unused_string unused_string unused_string
0 1  18  1  1  22  1  1:26.572 1:25.187 1:26.714
1 2  18  9  2  4  2  1:26.103 1:25.315 1:26.869
2 3  18  5  1  23  3  1:25.664 1:25.452 1:27.079
3 4  18  13  6  2  4  1:25.994 1:25.691 1:27.178
4 5  18  2  2  3  5  1:25.960 1:25.518 1:27.236
...  ...  ...  ...  ...  ...  ... ... ...
7392 7415  982  825  210  20  16  1:43.756 NULL NULL
7393 7416  982  13  3  19  17  1:44.014 NULL NULL
7394 7417  982  840  3  18  18  1:44.728 NULL NULL
7395 7418  982  836  15  94  19  1:45.059 NULL NULL
7396 7419  982  828  15  9  20  1:45.570 NULL NULL

7397 rows x 9 columns
memory usage: 0.70 MB
name: qualifying
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/qualifying/

1.2 Prepare data for getML

In [11]:
racesPd = races.to_pandas()
racesPd
Out[11]:
raceId year round circuitId name date time url
0 1.0 2009.0 1.0 1.0 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G...
1 2.0 2009.0 2.0 2.0 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr...
2 3.0 2009.0 3.0 17.0 Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran...
3 4.0 2009.0 4.0 3.0 Bahrain Grand Prix 2009-04-26 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran...
4 5.0 2009.0 5.0 4.0 Spanish Grand Prix 2009-05-10 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran...
... ... ... ... ... ... ... ... ...
971 984.0 2017.0 16.0 22.0 Japanese Grand Prix 2017-10-08 05:00:00 https://en.wikipedia.org/wiki/2017_Japanese_Gr...
972 985.0 2017.0 17.0 69.0 United States Grand Prix 2017-10-22 19:00:00 https://en.wikipedia.org/wiki/2017_United_Stat...
973 986.0 2017.0 18.0 32.0 Mexican Grand Prix 2017-10-29 19:00:00 https://en.wikipedia.org/wiki/2017_Mexican_Gra...
974 987.0 2017.0 19.0 18.0 Brazilian Grand Prix 2017-11-12 16:00:00 https://en.wikipedia.org/wiki/2017_Brazilian_G...
975 988.0 2017.0 20.0 24.0 Abu Dhabi Grand Prix 2017-11-26 17:00:00 https://en.wikipedia.org/wiki/2017_Abu_Dhabi_G...

976 rows × 8 columns

We need some additional set-up, because the target variable is not readily available. The wins column in driverStandings is the number of wins accumulated over a season, but what we want is a boolean variable indicating whether someone has won a particular race.

In [12]:
driverStandingsPd = driverStandings.to_pandas()

# Add the year, date and round of each race to the standings.
driverStandingsPd = driverStandingsPd.merge(
    racesPd[["raceId", "year", "date", "round"]],
    on="raceId"
)

# Self-join on driver and season to pair each standing with the
# standings from the other rounds of the same season.
previousStanding = driverStandingsPd.merge(
    driverStandingsPd[["driverId", "year", "wins", "round"]],
    on=["driverId", "year"],
)

# Keep only the pairs where the right-hand standing comes from the previous round.
isPreviousRound = (previousStanding["round_x"] - previousStanding["round_y"] == 1.0)

previousStanding = previousStanding[isPreviousRound]

# The increase in accumulated wins since the previous round tells us
# whether the driver won this particular race.
previousStanding["win"] = previousStanding["wins_x"] - previousStanding["wins_y"]

driverStandingsPd = driverStandingsPd.merge(
    previousStanding[["raceId", "driverId", "win"]],
    on=["raceId", "driverId"],
    how="left",
)

# For the first round of a season there is no previous round, so 'win' is
# NaN after the left join (and NaN != NaN); fall back to the accumulated
# 'wins' column, which after round one equals whether that round was won.
driverStandingsPd["win"] = [
    win if win == win else wins
    for win, wins in zip(driverStandingsPd["win"], driverStandingsPd["wins"])
]

driver_standings = getml.data.DataFrame.from_pandas(driverStandingsPd, "driver_standings")

driver_standings
Out[12]:
name driverStandingsId raceId driverId points position wins year round win positionText date
role unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_string unused_string
0 1  18  1  10  1  1  2008  1  1  1 2008-03-16
1 2  18  2  8  2  0  2008  1  0  2 2008-03-16
2 3  18  3  6  3  0  2008  1  0  3 2008-03-16
3 4  18  4  5  4  0  2008  1  0  4 2008-03-16
4 5  18  5  4  5  0  2008  1  0  5 2008-03-16
...  ...  ...  ...  ...  ...  ...  ...  ...  ... ...
31573 68456  982  835  8  16  0  2017  14  0  16 2017-09-17
31574 68457  982  154  26  13  0  2017  14  0  13 2017-09-17
31575 68458  982  836  5  18  0  2017  14  0  18 2017-09-17
31576 68459  982  18  0  22  0  2017  14  0  22 2017-09-17
31577 68460  982  814  0  23  0  2017  14  0  23 2017-09-17

31578 rows x 11 columns
memory usage: 3.21 MB
name: driver_standings
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/driver_standings/
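
The expression win if win == win else wins in the cell above is simply a NaN check, since NaN is the only value that is not equal to itself. A more pandas-idiomatic alternative (a sketch that assumes, as above, that missing values in win only occur for the first round of a season) would be:

# Hypothetical drop-in replacement for the list comprehension above:
# fill the missing 'win' values with the accumulated 'wins' column.
driverStandingsPd["win"] = driverStandingsPd["win"].fillna(driverStandingsPd["wins"])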

We also need to add the date of the race to lapTimes and pitStops, because we cannot use this data for the race we would like to predict. We can only use lap times and pit stops from previous races.

In [13]:
lapTimesPd = lapTimes.to_pandas()

lapTimesPd = lapTimesPd.merge(
    racesPd[["raceId", "date", "year"]],
    on="raceId"
)

lap_times = getml.data.DataFrame.from_pandas(lapTimesPd, "lap_times")

lap_times
Out[13]:
name raceId driverId lap position milliseconds year time date
role unused_float unused_float unused_float unused_float unused_float unused_float unused_string unused_string
0 1  1  1  13  109088  2009  1:49.088 2009-03-29
1 1  1  2  12  93740  2009  1:33.740 2009-03-29
2 1  1  3  11  91600  2009  1:31.600 2009-03-29
3 1  1  4  10  91067  2009  1:31.067 2009-03-29
4 1  1  5  10  92129  2009  1:32.129 2009-03-29
...  ...  ...  ...  ...  ...  ... ...
420364 982  840  54  8  107528  2017  1:47.528 2017-09-17
420365 982  840  55  8  107512  2017  1:47.512 2017-09-17
420366 982  840  56  8  108143  2017  1:48.143 2017-09-17
420367 982  840  57  8  107848  2017  1:47.848 2017-09-17
420368 982  840  58  8  108699  2017  1:48.699 2017-09-17

420369 rows x 8 columns
memory usage: 35.31 MB
name: lap_times
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/lap_times/

In [14]:
pitStopsPd = pitStops.to_pandas()

pitStopsPd = pitStopsPd.merge(
    racesPd[["raceId", "date", "year"]],
    on="raceId"
)

pit_stops = getml.data.DataFrame.from_pandas(pitStopsPd, "pit_stops")

pit_stops
Out[14]:
name raceId driverId stop lap milliseconds year time duration date
role unused_float unused_float unused_float unused_float unused_float unused_float unused_string unused_string unused_string
0 841  1  1  16  23227  2011  17:28:24 23.227 2011-03-27
1 841  1  2  36  23199  2011  17:59:29 23.199 2011-03-27
2 841  2  1  15  22994  2011  17:27:41 22.994 2011-03-27
3 841  2  2  30  25098  2011  17:51:32 25.098 2011-03-27
4 841  3  1  16  23716  2011  17:29:00 23.716 2011-03-27
...  ...  ...  ...  ...  ...  ... ... ...
6065 982  839  6  38  29134  2017  21:29:07 29.134 2017-09-17
6066 982  840  1  1  37403  2017  20:06:43 37.403 2017-09-17
6067 982  840  2  2  29294  2017  20:10:07 29.294 2017-09-17
6068 982  840  3  3  25584  2017  20:13:16 25.584 2017-09-17
6069 982  840  4  26  29412  2017  21:05:07 29.412 2017-09-17

6070 rows x 9 columns
memory usage: 0.60 MB
name: pit_stops
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/pit_stops/

getML requires that we define roles for each of the columns.

In [15]:
driver_standings.set_role("win", getml.data.roles.target)
driver_standings.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
driver_standings.set_role("position", getml.data.roles.numerical)
driver_standings.set_role("date", getml.data.roles.time_stamp)

driver_standings
Out[15]:
name date raceId driverId year win position driverStandingsId points wins round positionText
role time_stamp join_key join_key join_key target numerical unused_float unused_float unused_float unused_float unused_string
unit time stamp, comparison only
0 2008-03-16 18 1 2008 1  1  1  10  1  1  1
1 2008-03-16 18 2 2008 0  2  2  8  0  1  2
2 2008-03-16 18 3 2008 0  3  3  6  0  1  3
3 2008-03-16 18 4 2008 0  4  4  5  0  1  4
4 2008-03-16 18 5 2008 0  5  5  4  0  1  5
... ... ... ... ...  ...  ...  ...  ...  ...  ...
31573 2017-09-17 982 835 2017 0  16  68456  8  0  14  16
31574 2017-09-17 982 154 2017 0  13  68457  26  0  14  13
31575 2017-09-17 982 836 2017 0  18  68458  5  0  14  18
31576 2017-09-17 982 18 2017 0  22  68459  0  0  14  22
31577 2017-09-17 982 814 2017 0  23  68460  0  0  14  23

31578 rows x 11 columns
memory usage: 2.49 MB
name: driver_standings
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/driver_standings/

In [16]:
drivers.set_role("driverId", getml.data.roles.join_key)
drivers.set_role(["nationality", "driverRef"], getml.data.roles.categorical)

drivers
Out[16]:
name driverId nationality driverRef number code forename surname dob url
role join_key categorical categorical unused_float unused_string unused_string unused_string unused_string unused_string
0 1 British hamilton 44  HAM Lewis Hamilton 1985-01-07 http://en.wikipedia.org/wiki/Lew...
1 2 German heidfeld nan  HEI Nick Heidfeld 1977-05-10 http://en.wikipedia.org/wiki/Nic...
2 3 German rosberg 6  ROS Nico Rosberg 1985-06-27 http://en.wikipedia.org/wiki/Nic...
3 4 Spanish alonso 14  ALO Fernando Alonso 1981-07-29 http://en.wikipedia.org/wiki/Fer...
4 5 Finnish kovalainen nan  KOV Heikki Kovalainen 1981-10-19 http://en.wikipedia.org/wiki/Hei...
... ... ... ...  ... ... ... ... ...
835 837 Indonesian haryanto 88  HAR Rio Haryanto 1993-01-22 http://en.wikipedia.org/wiki/Rio...
836 838 Belgian vandoorne 2  VAN Stoffel Vandoorne 1992-03-26 http://en.wikipedia.org/wiki/Sto...
837 839 French ocon 31  OCO Esteban Ocon 1996-09-17 http://en.wikipedia.org/wiki/Est...
838 840 Canadian stroll 18  STR Lance Stroll 1998-10-29 http://en.wikipedia.org/wiki/Lan...
839 841 Italian giovinazzi 36  GIO Antonio Giovinazzi 1993-12-14 http://en.wikipedia.org/wiki/Ant...

840 rows x 9 columns
memory usage: 0.11 MB
name: drivers
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/drivers/

In [17]:
lap_times.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
lap_times.set_role(["lap", "milliseconds", "position"], getml.data.roles.numerical)
lap_times.set_role("date", getml.data.roles.time_stamp)

lap_times
Out[17]:
name date raceId driverId year lap milliseconds position time
role time_stamp join_key join_key join_key numerical numerical numerical unused_string
unit time stamp, comparison only
0 2009-03-29 1 1 2009 1  109088  13  1:49.088
1 2009-03-29 1 1 2009 2  93740  12  1:33.740
2 2009-03-29 1 1 2009 3  91600  11  1:31.600
3 2009-03-29 1 1 2009 4  91067  10  1:31.067
4 2009-03-29 1 1 2009 5  92129  10  1:32.129
... ... ... ... ...  ...  ...  ...
420364 2017-09-17 982 840 2017 54  107528  8  1:47.528
420365 2017-09-17 982 840 2017 55  107512  8  1:47.512
420366 2017-09-17 982 840 2017 56  108143  8  1:48.143
420367 2017-09-17 982 840 2017 57  107848  8  1:47.848
420368 2017-09-17 982 840 2017 58  108699  8  1:48.699

420369 rows x 8 columns
memory usage: 25.64 MB
name: lap_times
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/lap_times/

In [18]:
pit_stops.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
pit_stops.set_role(["lap", "milliseconds", "stop"], getml.data.roles.numerical)
pit_stops.set_role("date", getml.data.roles.time_stamp)

pit_stops
Out[18]:
name date raceId driverId year lap milliseconds stop time duration
role time_stamp join_key join_key join_key numerical numerical numerical unused_string unused_string
unit time stamp, comparison only
0 2011-03-27 841 1 2011 16  23227  1  17:28:24 23.227
1 2011-03-27 841 1 2011 36  23199  2  17:59:29 23.199
2 2011-03-27 841 2 2011 15  22994  1  17:27:41 22.994
3 2011-03-27 841 2 2011 30  25098  2  17:51:32 25.098
4 2011-03-27 841 3 2011 16  23716  1  17:29:00 23.716
... ... ... ... ...  ...  ...  ... ...
6065 2017-09-17 982 839 2017 38  29134  6  21:29:07 29.134
6066 2017-09-17 982 840 2017 1  37403  1  20:06:43 37.403
6067 2017-09-17 982 840 2017 2  29294  2  20:10:07 29.294
6068 2017-09-17 982 840 2017 3  25584  3  20:13:16 25.584
6069 2017-09-17 982 840 2017 26  29412  4  21:05:07 29.412

6070 rows x 9 columns
memory usage: 0.46 MB
name: pit_stops
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/pit_stops/

In [19]:
qualifying.set_role(["raceId", "driverId", "qualifyId"], getml.data.roles.join_key)
qualifying.set_role(["position", "number"], getml.data.roles.numerical)

qualifying
Out[19]:
name raceId driverId qualifyId position number constructorId q1 q2 q3
role join_key join_key join_key numerical numerical unused_float unused_string unused_string unused_string
0 18 1 1 1  22  1  1:26.572 1:25.187 1:26.714
1 18 9 2 2  4  2  1:26.103 1:25.315 1:26.869
2 18 5 3 3  23  1  1:25.664 1:25.452 1:27.079
3 18 13 4 4  2  6  1:25.994 1:25.691 1:27.178
4 18 2 5 5  3  2  1:25.960 1:25.518 1:27.236
... ... ... ...  ...  ...  ... ... ...
7392 982 825 7415 16  20  210  1:43.756 NULL NULL
7393 982 13 7416 17  19  3  1:44.014 NULL NULL
7394 982 840 7417 18  18  3  1:44.728 NULL NULL
7395 982 836 7418 19  94  15  1:45.059 NULL NULL
7396 982 828 7419 20  9  15  1:45.570 NULL NULL

7397 rows x 9 columns
memory usage: 0.61 MB
name: qualifying
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/qualifying/

2. Predictive modeling

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

In [20]:
split = getml.data.split.random(train=0.8, test=0.2)
split
Out[20]:
0 train
1 train
2 train
3 test
4 train
...

infinite number of rows
type: StringColumnView

2.1 Define relational model

In [21]:
star_schema = getml.data.StarSchema(
    population=driver_standings.drop(["position"]),
    alias="population",
    split=split,
)

star_schema.join(
    driver_standings,
    on=["driverId"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
    lagged_targets=True,
)

# We cannot use lap times for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
    lap_times,
    on=["driverId"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

# We cannot use pit stops for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
    pit_stops,
    on=["driverId"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

star_schema.join(
    qualifying,
    on=["driverId", "raceId"],
    relationship=getml.data.relationship.many_to_one,
)

star_schema.join(
    drivers,
    on=["driverId"],
    relationship=getml.data.relationship.many_to_one,
)

star_schema
Out[21]:

data model

diagram


[Diagram: population is joined to driver_standings, lap_times and pit_stops on driverId = driverId with date <= date and a horizon of 1.0 days (lagged targets allowed for driver_standings), and to qualifying (driverId = driverId, raceId = raceId) and drivers (driverId = driverId) as many-to-one relationships.]


staging

data frames staging table
0 population, qualifying, drivers POPULATION__STAGING_TABLE_1
1 driver_standings DRIVER_STANDINGS__STAGING_TABLE_2
2 lap_times LAP_TIMES__STAGING_TABLE_3
3 pit_stops PIT_STOPS__STAGING_TABLE_4

container

population

subset name rows type
0 test driver_standings 6229 View
1 train driver_standings 25349 View

peripheral

name rows type
0 driver_standings 31578 DataFrame
1 lap_times 420369 DataFrame
2 pit_stops 6070 DataFrame
3 qualifying 7397 DataFrame
4 drivers 840 DataFrame

2.2 getML pipeline

Set-up the feature learner & predictor

We define three getML feature learners, FastProp, Relboost, and RelMT, all using the cross-entropy loss, and an untuned XGBoostClassifier as the predictor. The pipeline below combines FastProp with getML's Mapping preprocessor.

In [22]:
mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    num_threads=1,
)

relboost = getml.feature_learning.Relboost(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,
)

relmt = getml.feature_learning.RelMT(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,
)

predictor = getml.predictors.XGBoostClassifier(n_jobs=1)

Build the pipeline

In [23]:
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor],
    include_categorical=True,
)

pipe1
Out[23]:
Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['fast_prop'])

2.3 Model training

In [24]:
pipe1.check(star_schema.train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId' and 'driverId', there are no corresponding entries for 68.551028% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId' and 'driverId', there are no corresponding entries for 82.527910% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
In [25]:
pipe1.fit(star_schema.train)
Checking data model...


Staging...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId' and 'driverId', there are no corresponding entries for 68.551028% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId' and 'driverId', there are no corresponding entries for 82.527910% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Trying 867 features...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:10m:31.904886

Out[25]:
Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['fast_prop', 'container-iDDODO'])

url: http://localhost:1709/#/getpipeline/formula1/sSpaJ1/0/

2.4 Model evaluation

In [26]:
pipe1.score(star_schema.test)

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


Out[26]:
date time set used target accuracy auc cross entropy
0 2021-08-23 21:25:33 train win 0.9734 0.957 0.07563
1 2021-08-23 21:26:01 test win 0.9727 0.9242 0.08486

2.5 featuretools

featuretools requires some additional data preparation before we can start engineering features.

In [27]:
population_train_pd = star_schema.train.population.drop(["position"]).to_pandas()
population_test_pd = star_schema.test.population.drop(["position"]).to_pandas()
In [28]:
driver_standings_pd = driver_standings.drop(driver_standings.roles.unused).to_pandas()
lap_times_pd = lap_times.drop(lap_times.roles.unused).to_pandas()
pit_stops_pd = pit_stops.drop(pit_stops.roles.unused).to_pandas()
qualifying_pd = qualifying.drop(qualifying.roles.unused).to_pandas()
drivers_pd = drivers.drop(drivers.roles.unused).to_pandas()

Because qualifying and drivers are joined many-to-one, we can merge them directly onto the population table.

In [29]:
population_train_pd["id"] = population_train_pd.index

population_train_pd = population_train_pd.merge(
    qualifying_pd,
    on=["driverId", "raceId"],
    how="left",
)

population_train_pd = population_train_pd.merge(
    drivers_pd,
    on=["driverId"],
    how="left",
)

population_train_pd
Out[29]:
date raceId driverId year win id qualifyId position number nationality driverRef
0 2008-03-16 18 1 2008 1.0 0 1 1.0 22.0 British hamilton
1 2008-03-16 18 2 2008 0.0 1 5 5.0 3.0 German heidfeld
2 2008-03-16 18 3 2008 0.0 2 7 7.0 7.0 German rosberg
3 2008-03-16 18 5 2008 0.0 3 3 3.0 23.0 Finnish kovalainen
4 2008-03-16 18 6 2008 0.0 4 14 14.0 8.0 Japanese nakajima
... ... ... ... ... ... ... ... ... ... ... ...
25344 2017-09-17 982 835 2017 0.0 25344 7410 11.0 30.0 British jolyon_palmer
25345 2017-09-17 982 154 2017 0.0 25345 7414 15.0 8.0 French grosjean
25346 2017-09-17 982 836 2017 0.0 25346 7418 19.0 94.0 German wehrlein
25347 2017-09-17 982 18 2017 0.0 25347 NaN NaN NaN British button
25348 2017-09-17 982 814 2017 0.0 25348 NaN NaN NaN British resta

25349 rows × 11 columns

Same for the testing set.

In [30]:
population_test_pd["id"] = population_test_pd.index

population_test_pd = population_test_pd.merge(
    qualifying_pd,
    on=["driverId", "raceId"],
    how="left",
)

population_test_pd = population_test_pd.merge(
    drivers_pd,
    on=["driverId"],
    how="left",
)

population_test_pd
Out[30]:
date raceId driverId year win id qualifyId position number nationality driverRef
0 2008-03-16 18 4 2008 0.0 0 12 12.0 5.0 Spanish alonso
1 2008-03-16 18 7 2008 0.0 1 18 18.0 14.0 French bourdais
2 2008-03-23 19 3 2008 0.0 2 38 16.0 7.0 German rosberg
3 2008-03-23 19 11 2008 0.0 3 42 20.0 18.0 Japanese sato
4 2008-04-06 20 7 2008 0.0 4 59 15.0 14.0 French bourdais
... ... ... ... ... ... ... ... ... ... ... ...
6224 2017-09-03 981 835 2017 0.0 6224 7396 17.0 30.0 British jolyon_palmer
6225 2017-09-03 981 814 2017 0.0 6225 NaN NaN NaN British resta
6226 2017-09-17 982 822 2017 0.0 6226 7405 6.0 77.0 Finnish bottas
6227 2017-09-17 982 832 2017 0.0 6227 7409 10.0 55.0 Spanish sainz
6228 2017-09-17 982 826 2017 0.0 6228 7412 13.0 26.0 Russian kvyat

6229 rows × 11 columns

featuretools requires us to expand our peripheral tables so they can be joined using the unique id from the population table. Luckily, we can write a simple helper function that works for all three remaining peripheral tables.

In [31]:
def prepare_peripheral(peripheral, train_or_test):
    """
    Helper function that imitates the behavior of 
    the data model defined above.
    """
    # Join every row of the peripheral table onto every population row
    # of the same driver.
    peripheral_new = peripheral.merge(
        train_or_test[["id", "driverId", "date"]],
        on=["driverId"],
    )

    # Keep only rows from races that took place strictly before the race
    # we want to predict (mirroring the horizon in the data model above).
    peripheral_new = peripheral_new[
        peripheral_new["date_x"] < peripheral_new["date_y"]
    ]

    del peripheral_new["date_y"]
    del peripheral_new["driverId"]
    del peripheral_new["raceId"]

    return peripheral_new.rename(columns={"date_x": "date"})
In [32]:
driver_standings_train_pd = prepare_peripheral(driver_standings_pd, population_train_pd)
driver_standings_test_pd = prepare_peripheral(driver_standings_pd, population_test_pd)
driver_standings_train_pd
Out[32]:
date year win position id
1 2008-03-16 2008 1.0 1.0 6
2 2008-03-16 2008 1.0 1.0 22
3 2008-03-16 2008 1.0 1.0 39
4 2008-03-16 2008 1.0 1.0 54
5 2008-03-16 2008 1.0 1.0 72
... ... ... ... ... ...
2656230 2017-07-30 2017 0.0 12.0 25323
2656231 2017-07-30 2017 0.0 12.0 25341
2656242 2017-08-27 2017 0.0 13.0 25323
2656243 2017-08-27 2017 0.0 13.0 25341
2656255 2017-09-03 2017 0.0 12.0 25341

1317941 rows × 5 columns

In [33]:
lap_times_train_pd = prepare_peripheral(lap_times_pd, population_train_pd)
lap_times_test_pd = prepare_peripheral(lap_times_pd, population_test_pd)
lap_times_train_pd
Out[33]:
date year lap milliseconds position id
28 2009-03-29 2009 1.0 109088.0 13.0 5874
29 2009-03-29 2009 1.0 109088.0 13.0 5904
30 2009-03-29 2009 1.0 109088.0 13.0 5918
31 2009-03-29 2009 1.0 109088.0 13.0 5936
32 2009-03-29 2009 1.0 109088.0 13.0 5954
... ... ... ... ... ... ...
54811898 2017-04-09 2017 3.0 156151.0 17.0 25264
54811899 2017-04-09 2017 3.0 156151.0 17.0 25281
54811900 2017-04-09 2017 3.0 156151.0 17.0 25300
54811901 2017-04-09 2017 3.0 156151.0 17.0 25320
54811902 2017-04-09 2017 3.0 156151.0 17.0 25337

25544456 rows × 6 columns

lap_times demonstrates one of the greatest dangers of featuretools. Because it is written in pure Python, featuretools requires you to expand your tables so that they can be joined more easily. But this comes at the cost of increased memory consumption: in this case, a table that used to have about 420,000 rows has been expanded to over 25 million rows.
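
To make this concrete, here is a small sketch that compares the memory footprints directly; it assumes the pandas data frames lapTimesPd and lap_times_train_pd created above are still in memory:

# Rough comparison of memory footprints (deep=True also counts string columns).
original_mb = lapTimesPd.memory_usage(deep=True).sum() / 1e6
expanded_mb = lap_times_train_pd.memory_usage(deep=True).sum() / 1e6
print(f"original lap times: {len(lapTimesPd):>11,} rows, {original_mb:9.1f} MB")
print(f"expanded lap times: {len(lap_times_train_pd):>11,} rows, {expanded_mb:9.1f} MB")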

In [34]:
pit_stops_train_pd = prepare_peripheral(pit_stops_pd, population_train_pd)
pit_stops_test_pd = prepare_peripheral(pit_stops_pd, population_test_pd)
pit_stops_train_pd
Out[34]:
date year lap milliseconds stop id
57 2011-03-27 2011 16.0 23227.0 1.0 22459
58 2011-03-27 2011 16.0 23227.0 1.0 22498
59 2011-03-27 2011 16.0 23227.0 1.0 22536
60 2011-03-27 2011 16.0 23227.0 1.0 22555
61 2011-03-27 2011 16.0 23227.0 1.0 22574
... ... ... ... ... ... ...
688120 2017-04-09 2017 2.0 29443.0 1.0 25264
688121 2017-04-09 2017 2.0 29443.0 1.0 25281
688122 2017-04-09 2017 2.0 29443.0 1.0 25300
688123 2017-04-09 2017 2.0 29443.0 1.0 25320
688124 2017-04-09 2017 2.0 29443.0 1.0 25337

213549 rows × 6 columns

In [35]:
del population_train_pd["driverId"]
del population_train_pd["raceId"]
del population_train_pd["year"]
del population_train_pd["qualifyId"]
In [36]:
del population_test_pd["driverId"]
del population_test_pd["raceId"]
del population_test_pd["year"]
del population_test_pd["qualifyId"]
In [37]:
entities_train = {
    "population" : (population_train_pd, "id"),
    "driver_standings": (driver_standings_train_pd, "index"),
    "lap_times" : (lap_times_train_pd, "index"),
    "pit_stops" : (pit_stops_train_pd, "index"),
}
In [38]:
entities_test = {
    "population" : (population_test_pd, "id"),
    "driver_standings": (driver_standings_test_pd, "index"),
    "lap_times" : (lap_times_test_pd, "index"),
    "pit_stops" : (pit_stops_test_pd, "index")
}
In [39]:
relationships = [
    ("population", "id", "driver_standings", "id"),
    ("population", "id", "lap_times", "id"),
    ("population", "id", "pit_stops", "id")
]
In [40]:
featuretools_train_pd = featuretools.dfs(
    entities=entities_train,
    relationships=relationships,
    target_entity="population")[0]
/home/patrick/.local/lib/python3.9/site-packages/featuretools/entityset/entity.py:462: UserWarning: index index not found in dataframe, creating new integer column
  warnings.warn("index {} not found in dataframe, creating new "
In [41]:
featuretools_test_pd = featuretools.dfs(
    entities=entities_test,
    relationships=relationships,
    target_entity="population")[0]
In [42]:
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
In [43]:
featuretools_train.set_role("win", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)

featuretools_train
90.725472% of all entries of column 'MAX(pit_stops.lap)' are NULL values.
90.725472% of all entries of column 'MAX(pit_stops.milliseconds)' are NULL values.
90.725472% of all entries of column 'MAX(pit_stops.stop)' are NULL values.
90.725472% of all entries of column 'MEAN(pit_stops.lap)' are NULL values.
90.725472% of all entries of column 'MEAN(pit_stops.milliseconds)' are NULL values.
90.725472% of all entries of column 'MEAN(pit_stops.stop)' are NULL values.
90.725472% of all entries of column 'MIN(pit_stops.lap)' are NULL values.
90.725472% of all entries of column 'MIN(pit_stops.milliseconds)' are NULL values.
90.725472% of all entries of column 'MIN(pit_stops.stop)' are NULL values.
90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.year)' are NULL values.
90.934554% of all entries of column 'SKEW(pit_stops.lap)' are NULL values.
90.934554% of all entries of column 'SKEW(pit_stops.milliseconds)' are NULL values.
90.934554% of all entries of column 'SKEW(pit_stops.stop)' are NULL values.
90.757032% of all entries of column 'STD(pit_stops.lap)' are NULL values.
90.757032% of all entries of column 'STD(pit_stops.milliseconds)' are NULL values.
90.757032% of all entries of column 'STD(pit_stops.stop)' are NULL values.
90.725472% of all entries of column 'MODE(pit_stops.DAY(date))' are NULL values.
90.725472% of all entries of column 'MODE(pit_stops.MONTH(date))' are NULL values.
90.725472% of all entries of column 'MODE(pit_stops.WEEKDAY(date))' are NULL values.
90.725472% of all entries of column 'MODE(pit_stops.YEAR(date))' are NULL values.
90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.DAY(date))' are NULL values.
90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.MONTH(date))' are NULL values.
90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.WEEKDAY(date))' are NULL values.
90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.YEAR(date))' are NULL values.
90.725472% of all entries of column 'MODE(pit_stops.year)' are NULL values.
Out[43]:
name win nationality driverRef MODE(driver_standings.year) MODE(lap_times.year) MODE(pit_stops.year) position number COUNT(driver_standings) MAX(driver_standings.position) MAX(driver_standings.win) MEAN(driver_standings.position) MEAN(driver_standings.win) MIN(driver_standings.position) MIN(driver_standings.win) NUM_UNIQUE(driver_standings.year) SKEW(driver_standings.position) SKEW(driver_standings.win) STD(driver_standings.position) STD(driver_standings.win) SUM(driver_standings.position) SUM(driver_standings.win) COUNT(lap_times) MAX(lap_times.lap) MAX(lap_times.milliseconds) MAX(lap_times.position) MEAN(lap_times.lap) MEAN(lap_times.milliseconds) MEAN(lap_times.position) MIN(lap_times.lap) MIN(lap_times.milliseconds) MIN(lap_times.position) NUM_UNIQUE(lap_times.year) SKEW(lap_times.lap) SKEW(lap_times.milliseconds) SKEW(lap_times.position) STD(lap_times.lap) STD(lap_times.milliseconds) STD(lap_times.position) SUM(lap_times.lap) SUM(lap_times.milliseconds) SUM(lap_times.position) COUNT(pit_stops) MAX(pit_stops.lap) MAX(pit_stops.milliseconds) MAX(pit_stops.stop) MEAN(pit_stops.lap) MEAN(pit_stops.milliseconds) MEAN(pit_stops.stop) MIN(pit_stops.lap) MIN(pit_stops.milliseconds) MIN(pit_stops.stop) NUM_UNIQUE(pit_stops.year) SKEW(pit_stops.lap) SKEW(pit_stops.milliseconds) SKEW(pit_stops.stop) STD(pit_stops.lap) STD(pit_stops.milliseconds) STD(pit_stops.stop) SUM(pit_stops.lap) SUM(pit_stops.milliseconds) SUM(pit_stops.stop) DAY(date) MONTH(date) WEEKDAY(date) YEAR(date) MODE(driver_standings.DAY(date)) MODE(driver_standings.MONTH(date)) MODE(driver_standings.WEEKDAY(date)) MODE(driver_standings.YEAR(date)) NUM_UNIQUE(driver_standings.DAY(date)) NUM_UNIQUE(driver_standings.MONTH(date)) NUM_UNIQUE(driver_standings.WEEKDAY(date)) NUM_UNIQUE(driver_standings.YEAR(date)) MODE(lap_times.DAY(date)) MODE(lap_times.MONTH(date)) MODE(lap_times.WEEKDAY(date)) MODE(lap_times.YEAR(date)) NUM_UNIQUE(lap_times.DAY(date)) NUM_UNIQUE(lap_times.MONTH(date)) NUM_UNIQUE(lap_times.WEEKDAY(date)) NUM_UNIQUE(lap_times.YEAR(date)) MODE(pit_stops.DAY(date)) MODE(pit_stops.MONTH(date)) MODE(pit_stops.WEEKDAY(date)) MODE(pit_stops.YEAR(date)) NUM_UNIQUE(pit_stops.DAY(date)) NUM_UNIQUE(pit_stops.MONTH(date)) NUM_UNIQUE(pit_stops.WEEKDAY(date)) NUM_UNIQUE(pit_stops.YEAR(date))
role target categorical categorical categorical categorical categorical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical
0 1  British hamilton 2007 2007 NULL 1  22  17  3  1  1.4706 0.2353 1  0  1  1.3538 1.3723 0.7998 0.4372 25  4  1037  78  1453884  19  32.0559 90600.8602 3.2575 1  72506  1  1  0.1882 26.093 2.4493 19.1156 45564.9471 3.4488 33242  93953092  3378  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  16  3  6  2008  8  7  6  2007  16  8  1  1  8  7  6  2007  16  8  1  1  nan  nan  nan  nan  nan  nan  nan  nan 
1 0  German heidfeld 2005 2007 NULL 5  3  134  21  0  10.8582 0  4  0  8  0.4681 0  5.136 0  1455  0  6863  78  1285986  22  30.4682 89072.7102 9.6799 1  70516  1  8  0.2623 30.5344 0.3339 18.7041 20544.1746 4.199 209103  611306010  66433  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  16  3  6  2008  1  7  6  2005  31  8  1  8  1  7  6  2007  31  8  1  8  nan  nan  nan  nan  nan  nan  nan  nan 
2 0  German rosberg 2006 2007 NULL 7  7  35  17  0  11.9714 0  7  0  2  0.1328 0  3.1761 0  419  0  1627  77  191891  22  30.4124 89820.1174 9.9809 1  73159  3  2  0.2225 1.7858 0.5491 18.6109 14212.8848 3.733 49481  146137331  16239  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  16  3  6  2008  8  7  6  2006  25  8  1  2  8  5  6  2007  25  8  1  2  nan  nan  nan  nan  nan  nan  nan  nan 
3 0  Finnish kovalainen 2007 2007 NULL 3  23  17  11  0  8.3529 0  7  0  1  0.7931 0  1.4116 0  142  0  1023  76  1394884  22  31.4526 92458.8299 9.4018 1  73998  1  1  0.1906 26.2493 0.6941 18.6943 43578.4946 4.1724 32176  94585383  9618  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  16  3  6  2008  8  7  6  2007  16  8  1  1  8  7  6  2007  16  8  1  1  nan  nan  nan  nan  nan  nan  nan  nan 
4 0  Japanese nakajima 2007 2007 NULL 14  8  1  22  0  22  0  22  0  1  nan  nan  nan  nan  22  0  70  70  104396  16  35.5 76228.3571 10.9143 1  73116  9  1  0  4.4222 1.2783 20.3511 4785.2529 1.8552 2485  5335985  764  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  16  3  6  2008  21  10  6  2007  1  1  1  1  21  10  6  2007  1  1  1  1  nan  nan  nan  nan  nan  nan  nan  nan 
...  ... ... ... ... ... ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
25344 0  British jolyon_palmer 2016 2016 2016 11  30  34  20  0  18.0588 0  11  0  2  -1.7626 0  1.8413 0  614  0  1517  78  1223356  22  29.4199 98411.6038 14.5175 1  68652  7  2  0.2921 21.0237 0.3751 18.5854 43393.1519 3.0483 44630  149290403  22023  56  50  1088294  4  20.9286 61015.1429 1.7857 1  16851  1  2  0.3035 5.1507 0.8981 13.473 191214.0556 0.9088 1172  3416848  100  17  9  6  2017  3  7  6  2016  25  9  1  2  9  7  6  2016  25  9  1  2  9  7  6  2016  21  9  1  2 
25345 0  French grosjean 2016 2016 2014 15  8  118  24  0  11.5254 0  5  0  7  1.3306 0  4.1709 0  1360  0  5518  78  3670560  24  29.4511 99489.4235 10.7356 1  68590  1  7  0.2565 39.9299 0.06907 18.0521 63108.3058 4.8371 162511  548982639  59239  206  62  1313665  5  22.9369 46260.6214 1.801 1  15073  1  6  0.3075 7.1215 1.2186 13.7483 154388.9496 0.9288 4725  9529688  371  17  9  6  2017  27  7  6  2016  31  9  1  7  9  7  6  2016  31  9  1  7  9  7  6  2014  31  9  1  6 
25346 0  German wehrlein 2016 2016 2016 19  94  32  22  0  17.25 0  13  0  2  0.3213 0  2.2718 0  552  0  1636  76  2118880  22  30.3839 99952.8594 15.8319 1  69241  4  2  0.2027 20.0905 -0.8509 18.2369 77850.4087 3.2136 49708  163522878  25901  62  52  2008464  5  21.6129 119925.5968 1.8065 1  18064  1  2  0.1765 3.9986 1.1414 12.8896 368710.9372 0.9381 1340  7435387  112  17  9  6  2017  3  7  6  2016  24  9  1  2  3  7  6  2016  24  9  1  2  3  7  6  2016  24  8  1  2 
25347 0  British button 2016 2013 2011 nan  nan  309  22  1  9.6828 0.04854 1  0  18  0.3055 4.2218 5.8899 0.2153 2992  15  16272  78  7503775  22  30.2853 95662.1676 8.4123 1  69828  1  18  0.2244 64.7495 0.4521 18.3544 79242.6165 4.859 492803  1556614792  136885  259  64  2010062  6  24.9344 45562.5985 1.8571 1  14501  1  7  0.3229 8.7373 1.148 13.5509 179696.8608 0.9353 6458  11800713  481  17  9  6  2017  24  7  6  2016  31  9  1  18  24  7  6  2013  31  9  1  18  25  7  6  2011  31  9  1  7 
25348 0  British resta 2012 2012 2011 nan  nan  61  23  0  12.6066 0  8  0  4  1.244 0  3.4558 0  769  0  3284  78  7503980  24  30.3097 103049.4318 10.7728 1  76855  1  4  0.1989 44.5439 0.1348 18.0742 144158.4761 3.5107 99537  338414334  35378  127  57  123124  4  25.1496 24084.2598 1.7323 1  14538  1  4  0.3166 8.1337 0.8105 13.8456 9973.774 0.7913 3194  3058701  220  17  9  6  2017  27  7  6  2012  27  9  1  4  27  7  6  2012  27  9  1  4  25  5  6  2011  27  9  1  4 

25349 rows x 90 columns
memory usage: 17.74 MB
name: featuretools_train
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/featuretools_train/

In [44]:
featuretools_test.set_role("win", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)

featuretools_test
90.191042% of all entries of column 'MAX(pit_stops.lap)' are NULL values.
90.191042% of all entries of column 'MAX(pit_stops.milliseconds)' are NULL values.
90.191042% of all entries of column 'MAX(pit_stops.stop)' are NULL values.
90.191042% of all entries of column 'MEAN(pit_stops.lap)' are NULL values.
90.191042% of all entries of column 'MEAN(pit_stops.milliseconds)' are NULL values.
90.191042% of all entries of column 'MEAN(pit_stops.stop)' are NULL values.
90.191042% of all entries of column 'MIN(pit_stops.lap)' are NULL values.
90.191042% of all entries of column 'MIN(pit_stops.milliseconds)' are NULL values.
90.191042% of all entries of column 'MIN(pit_stops.stop)' are NULL values.
90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.year)' are NULL values.
90.351581% of all entries of column 'SKEW(pit_stops.lap)' are NULL values.
90.351581% of all entries of column 'SKEW(pit_stops.milliseconds)' are NULL values.
90.351581% of all entries of column 'SKEW(pit_stops.stop)' are NULL values.
90.239204% of all entries of column 'STD(pit_stops.lap)' are NULL values.
90.239204% of all entries of column 'STD(pit_stops.milliseconds)' are NULL values.
90.239204% of all entries of column 'STD(pit_stops.stop)' are NULL values.
90.191042% of all entries of column 'MODE(pit_stops.DAY(date))' are NULL values.
90.191042% of all entries of column 'MODE(pit_stops.MONTH(date))' are NULL values.
90.191042% of all entries of column 'MODE(pit_stops.WEEKDAY(date))' are NULL values.
90.191042% of all entries of column 'MODE(pit_stops.YEAR(date))' are NULL values.
90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.DAY(date))' are NULL values.
90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.MONTH(date))' are NULL values.
90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.WEEKDAY(date))' are NULL values.
90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.YEAR(date))' are NULL values.
90.191042% of all entries of column 'MODE(pit_stops.year)' are NULL values.
Out[44]:
name win nationality driverRef MODE(driver_standings.year) MODE(lap_times.year) MODE(pit_stops.year) position number COUNT(driver_standings) MAX(driver_standings.position) MAX(driver_standings.win) MEAN(driver_standings.position) MEAN(driver_standings.win) MIN(driver_standings.position) MIN(driver_standings.win) NUM_UNIQUE(driver_standings.year) SKEW(driver_standings.position) SKEW(driver_standings.win) STD(driver_standings.position) STD(driver_standings.win) SUM(driver_standings.position) SUM(driver_standings.win) COUNT(lap_times) MAX(lap_times.lap) MAX(lap_times.milliseconds) MAX(lap_times.position) MEAN(lap_times.lap) MEAN(lap_times.milliseconds) MEAN(lap_times.position) MIN(lap_times.lap) MIN(lap_times.milliseconds) MIN(lap_times.position) NUM_UNIQUE(lap_times.year) SKEW(lap_times.lap) SKEW(lap_times.milliseconds) SKEW(lap_times.position) STD(lap_times.lap) STD(lap_times.milliseconds) STD(lap_times.position) SUM(lap_times.lap) SUM(lap_times.milliseconds) SUM(lap_times.position) COUNT(pit_stops) MAX(pit_stops.lap) MAX(pit_stops.milliseconds) MAX(pit_stops.stop) MEAN(pit_stops.lap) MEAN(pit_stops.milliseconds) MEAN(pit_stops.stop) MIN(pit_stops.lap) MIN(pit_stops.milliseconds) MIN(pit_stops.stop) NUM_UNIQUE(pit_stops.year) SKEW(pit_stops.lap) SKEW(pit_stops.milliseconds) SKEW(pit_stops.stop) STD(pit_stops.lap) STD(pit_stops.milliseconds) STD(pit_stops.stop) SUM(pit_stops.lap) SUM(pit_stops.milliseconds) SUM(pit_stops.stop) DAY(date) MONTH(date) WEEKDAY(date) YEAR(date) MODE(driver_standings.DAY(date)) MODE(driver_standings.MONTH(date)) MODE(driver_standings.WEEKDAY(date)) MODE(driver_standings.YEAR(date)) NUM_UNIQUE(driver_standings.DAY(date)) NUM_UNIQUE(driver_standings.MONTH(date)) NUM_UNIQUE(driver_standings.WEEKDAY(date)) NUM_UNIQUE(driver_standings.YEAR(date)) MODE(lap_times.DAY(date)) MODE(lap_times.MONTH(date)) MODE(lap_times.WEEKDAY(date)) MODE(lap_times.YEAR(date)) NUM_UNIQUE(lap_times.DAY(date)) NUM_UNIQUE(lap_times.MONTH(date)) NUM_UNIQUE(lap_times.WEEKDAY(date)) NUM_UNIQUE(lap_times.YEAR(date)) MODE(pit_stops.DAY(date)) MODE(pit_stops.MONTH(date)) MODE(pit_stops.WEEKDAY(date)) MODE(pit_stops.YEAR(date)) NUM_UNIQUE(pit_stops.DAY(date)) NUM_UNIQUE(pit_stops.MONTH(date)) NUM_UNIQUE(pit_stops.WEEKDAY(date)) NUM_UNIQUE(pit_stops.YEAR(date))
role target categorical categorical categorical categorical categorical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical
0 0  Spanish alonso 2005 2006 NULL 12  5  105  23  1  5.6667 0.181 1  0  6  1.7319 1.6816 7.1598 0.3868 595  19  5769  78  1406432  21  30.436 88439.8289 5.2479 1  70526  1  6  0.2301 34.9402 1.3932 18.3774 22633.3599 4.9145 175585  510209373  30275  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  16  3  6  2008  1  7  6  2005  31  8  1  6  24  7  6  2006  31  8  1  6  nan  nan  nan  nan  nan  nan  nan  nan 
1 0  French bourdais NULL NULL NULL 18  14  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  16  3  6  2008  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan 
2 0  German rosberg 2006 2007 NULL 16  7  36  17  0  11.7222 0  3  0  3  -0.1958 0  3.4691 0  422  0  1685  77  191891  22  30.381 90110.451 9.7822 1  73159  2  3  0.2179 1.8323 0.5148 18.5504 14476.8352 3.8237 51192  151836110  16483  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  23  3  6  2008  8  7  6  2006  25  8  1  3  16  5  6  2007  25  8  1  3  nan  nan  nan  nan  nan  nan  nan  nan 
3 0  Japanese sato 2004 2007 NULL 20  18  84  23  0  16.9405 0  7  0  6  -0.493 0  5.5479 0  1423  0  4309  76  1921214  22  29.1091 91287.2075 12.4342 1  70727  1  7  0.2832 32.4719 -0.2034 18.0839 42719.4804 4.861 125431  393356577  53579  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  23  3  6  2008  12  7  6  2004  31  8  1  6  25  7  6  2007  31  8  1  7  nan  nan  nan  nan  nan  nan  nan  nan 
4 0  French bourdais 2008 2008 NULL 15  14  2  10  0  8.5 0  7  0  1  nan  nan  2.1213 0  17  0  55  55  156156  16  28  98864.3818 9.7818 1  89534  4  1  0  2.2604 -0.06348 16.0208 17491.8075 4.3149 1540  5437541  538  0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  6  4  6  2008  16  3  6  2008  2  1  1  1  16  3  6  2008  1  1  1  1  nan  nan  nan  nan  nan  nan  nan  nan 
...  ... ... ... ... ... ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
6224 0  British jolyon_palmer 2016 2016 2016 17  30  33  20  0  18.0303 0  11  0  2  -1.7159 0  1.8622 0  595  0  1488  78  1223356  22  29.7009 98587.2056 14.5195 1  68652  7  2  0.2691 20.8455 0.3776 18.6186 43789.6099 3.0523 44195  146697762  21605  55  50  1088294  4  20.8545 61560.7273 1.8 1  16851  1  2  0.3181 5.1018 0.8728 13.5857 192932.4401 0.9108 1147  3385840  99  3  9  6  2017  9  7  6  2016  25  9  1  2  9  7  6  2016  25  9  1  2  9  7  6  2016  21  9  1  2 
6225 0  British resta 2012 2012 2011 nan  nan  60  23  0  12.4333 0  8  0  4  1.155 0  3.2068 0  746  0  3284  78  7503980  24  30.3097 103049.4318 10.7728 1  76855  1  4  0.1989 44.5439 0.1348 18.0742 144158.4761 3.5107 99537  338414334  35378  127  57  123124  4  25.1496 24084.2598 1.7323 1  14538  1  4  0.3166 8.1337 0.8105 13.8456 9973.774 0.7913 3194  3058701  220  3  9  6  2017  27  7  6  2012  27  9  1  4  27  7  6  2012  27  9  1  4  25  5  6  2011  27  9  1  4 
6226 0  Finnish bottas 2016 2016 2016 6  77  91  18  1  8.1868 0.02198 3  0  5  1.0124 6.6307 4.9976 0.1474 745  2  5153  78  3689718  22  30.4306 99678.869 8.0584 1  67847  1  5  0.2271 32.1149 0.5698 18.18 74208.5105 4.885 156809  513645212  41525  183  59  2008595  5  24.1093 68069.5082 1.7268 1  14978  1  5  0.3284 5.9016 1.0949 13.4151 242520.4324 0.8527 4412  12456720  316  17  9  6  2017  9  7  6  2016  31  9  1  5  9  7  6  2016  31  9  1  5  9  7  6  2016  31  9  1  5 
6227 0  Spanish sainz 2016 2016 2016 10  55  53  16  0  11.8679 0  7  0  3  0.06274 0  2.481 0  629  0  2690  78  2122068  21  29.7457 99861.4349 10.655 1  69150  3  3  0.3078 24.5072 0.3527 18.3133 61998.6099 3.5636 80016  268627260  28662  97  51  2008828  4  22.6289 84961.6907 1.6701 1  14966  1  3  0.3388 5.1188 1.0447 12.7919 297744.5008 0.8 2195  8241284  162  17  9  6  2017  3  7  6  2016  29  9  1  3  30  5  6  2016  29  9  1  3  9  7  6  2016  29  9  1  3 
6228 0  Russian kvyat 2016 2015 2016 13  26  72  22  0  12.6111 0  7  0  4  -0.3133 0  3.4214 0  908  0  3585  78  3690660  22  29.2739 100722.5623 11.2388 1  68061  2  4  0.2624 33.405 -0.07759 18.0505 79720.0053 4.0792 104947  361090386  40291  145  59  2011266  5  23.6207 64171.0069 1.7724 1  16139  1  4  0.2769 6.5459 0.9965 15.0675 236423.1748 0.8559 3425  9304796  257  17  9  6  2017  9  7  6  2016  31  9  1  4  30  7  6  2015  31  9  1  4  9  7  6  2016  31  9  1  4 

6229 rows x 90 columns
memory usage: 4.36 MB
name: featuretools_test
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/featuretools_test/

We train an untuned XGBoostClassifier on top of featuretools' features, just as we did for getML's features.

Since some of featuretools' features are categorical, we allow the pipeline to include these features as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
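
As a quick check of how prevalent these missing values are, the following sketch (using the pandas data frame featuretools_train_pd created above) lists the generated features with the highest share of NaN values:

# Share of NaN values per generated feature, highest first.
nan_share = featuretools_train_pd.isna().mean().sort_values(ascending=False)
print(nan_share.head(10))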

In [45]:
imputation = getml.preprocessors.Imputation()

predictor = getml.predictors.XGBoostClassifier(n_jobs=1)

pipe2 = getml.pipeline.Pipeline(
    tags=['featuretools'],
    preprocessors=[imputation],
    predictors=[predictor],
    include_categorical=True,
)

pipe2
Out[45]:
Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=[],
         predictors=['XGBoostClassifier'],
         preprocessors=['Imputation'],
         share_selected_features=0.5,
         tags=['featuretools'])
In [46]:
pipe2.fit(featuretools_train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


WARNING [COLUMN SHOULD BE UNUSED]: 90.725472% of all entries in column 'mode(pit_stops.year)' in POPULATION__STAGING_TABLE_1 are NULL values. You should consider setting its role to unused_string.
WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( min(lap_times.lap), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( min(pit_stops.stop), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( mode(lap_times.weekday(date)), 6.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( num_unique(lap_times.weekday(date)), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( mode(pit_stops.weekday(date)), 6.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( num_unique(pit_stops.weekday(date)), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:0m:7.319103

Out[46]:
Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=[],
         predictors=['XGBoostClassifier'],
         preprocessors=['Imputation'],
         share_selected_features=0.5,
         tags=['featuretools'])

url: http://localhost:1709/#/getpipeline/formula1/DDRwHL/0/
In [47]:
pipe2.score(featuretools_test)

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%


Out[47]:
date time set used target accuracy auc cross entropy
0 2021-08-23 21:29:30 featuretools_train win 0.9717 0.9391 0.0837
1 2021-08-23 21:29:30 featuretools_test win 0.9724 0.9202 0.08684

2.6 Studying features

We would like to understand why getML outperforms featuretools, particularly since getML's FastProp is based on an approach very similar to the one implemented in featuretools.

To investigate this matter, we first take a look at the importance of the features FastProp has learned:

In [48]:
names, importances = pipe1.features.importances(target_num=0)

plt.subplots(figsize=(20, 10))

plt.bar(names