NOTE: Due to featuretools' memory requirement, this notebook will not run on MyBinder.
In this notebook we will benchmark getML against featuretools to predict the winner of a Formula 1 race.
Summary:
Author: Dr. Patrick Urbanke
We would like to develop a prediction model for Formula 1 races, that would allow us to predict the winner of a race before the race has started.
We use dataset of all Formula 1 races from 1950 to 2017. The dataset includes information such as the time taken in each lap, the time taken for pit stops, the performance in the qualifying rounds etc.
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).
We will benchmark getML 's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.
Let's get started with the analysis and set up your session:
import copy
import os
from urllib import request
import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline
import featuretools
import getml
getml.engine.set_project('formula1')
Connected to project 'formula1'
We begin by downloading the data:
conn = getml.database.connect_mariadb(
host="relational.fit.cvut.cz",
dbname="ErgastF1",
port=3306,
user="guest",
password="relational"
)
conn
Connection(conn_id='default', dbname='ErgastF1', dialect='mysql', host='relational.fit.cvut.cz', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.data.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
driverStandings = load_if_needed("driverStandings")
drivers = load_if_needed("drivers")
lapTimes = load_if_needed("lapTimes")
pitStops = load_if_needed("pitStops")
races = load_if_needed("races")
qualifying = load_if_needed("qualifying")
driverStandings
name | driverStandingsId | raceId | driverId | points | position | wins | positionText |
---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string |
0 | 1 | 18 | 1 | 10 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8 | 2 | 0 | 2 |
2 | 3 | 18 | 3 | 6 | 3 | 0 | 3 |
3 | 4 | 18 | 4 | 5 | 4 | 0 | 4 |
4 | 5 | 18 | 5 | 4 | 5 | 0 | 5 |
... | ... | ... | ... | ... | ... | ... | |
31573 | 68456 | 982 | 835 | 8 | 16 | 0 | 16 |
31574 | 68457 | 982 | 154 | 26 | 13 | 0 | 13 |
31575 | 68458 | 982 | 836 | 5 | 18 | 0 | 18 |
31576 | 68459 | 982 | 18 | 0 | 22 | 0 | 22 |
31577 | 68460 | 982 | 814 | 0 | 23 | 0 | 23 |
31578 rows x 7 columns
memory usage: 1.85 MB
name: driverStandings
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/driverStandings/
drivers
name | driverId | number | driverRef | code | forename | surname | dob | nationality | url |
---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 44 | hamilton | HAM | Lewis | Hamilton | 1985-01-07 | British | http://en.wikipedia.org/wiki/Lew... |
1 | 2 | nan | heidfeld | HEI | Nick | Heidfeld | 1977-05-10 | German | http://en.wikipedia.org/wiki/Nic... |
2 | 3 | 6 | rosberg | ROS | Nico | Rosberg | 1985-06-27 | German | http://en.wikipedia.org/wiki/Nic... |
3 | 4 | 14 | alonso | ALO | Fernando | Alonso | 1981-07-29 | Spanish | http://en.wikipedia.org/wiki/Fer... |
4 | 5 | nan | kovalainen | KOV | Heikki | Kovalainen | 1981-10-19 | Finnish | http://en.wikipedia.org/wiki/Hei... |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
835 | 837 | 88 | haryanto | HAR | Rio | Haryanto | 1993-01-22 | Indonesian | http://en.wikipedia.org/wiki/Rio... |
836 | 838 | 2 | vandoorne | VAN | Stoffel | Vandoorne | 1992-03-26 | Belgian | http://en.wikipedia.org/wiki/Sto... |
837 | 839 | 31 | ocon | OCO | Esteban | Ocon | 1996-09-17 | French | http://en.wikipedia.org/wiki/Est... |
838 | 840 | 18 | stroll | STR | Lance | Stroll | 1998-10-29 | Canadian | http://en.wikipedia.org/wiki/Lan... |
839 | 841 | 36 | giovinazzi | GIO | Antonio | Giovinazzi | 1993-12-14 | Italian | http://en.wikipedia.org/wiki/Ant... |
840 rows x 9 columns
memory usage: 0.14 MB
name: drivers
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/drivers/
lapTimes
name | raceId | driverId | lap | position | milliseconds | time |
---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string |
0 | 1 | 1 | 1 | 13 | 109088 | 1:49.088 |
1 | 1 | 1 | 2 | 12 | 93740 | 1:33.740 |
2 | 1 | 1 | 3 | 11 | 91600 | 1:31.600 |
3 | 1 | 1 | 4 | 10 | 91067 | 1:31.067 |
4 | 1 | 1 | 5 | 10 | 92129 | 1:32.129 |
... | ... | ... | ... | ... | ... | |
420364 | 982 | 840 | 54 | 8 | 107528 | 1:47.528 |
420365 | 982 | 840 | 55 | 8 | 107512 | 1:47.512 |
420366 | 982 | 840 | 56 | 8 | 108143 | 1:48.143 |
420367 | 982 | 840 | 57 | 8 | 107848 | 1:47.848 |
420368 | 982 | 840 | 58 | 8 | 108699 | 1:48.699 |
420369 rows x 6 columns
memory usage: 23.96 MB
name: lapTimes
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/lapTimes/
pitStops
name | raceId | driverId | stop | lap | milliseconds | time | duration |
---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string |
0 | 841 | 1 | 1 | 16 | 23227 | 17:28:24 | 23.227 |
1 | 841 | 1 | 2 | 36 | 23199 | 17:59:29 | 23.199 |
2 | 841 | 2 | 1 | 15 | 22994 | 17:27:41 | 22.994 |
3 | 841 | 2 | 2 | 30 | 25098 | 17:51:32 | 25.098 |
4 | 841 | 3 | 1 | 16 | 23716 | 17:29:00 | 23.716 |
... | ... | ... | ... | ... | ... | ... | |
6065 | 982 | 839 | 6 | 38 | 29134 | 21:29:07 | 29.134 |
6066 | 982 | 840 | 1 | 1 | 37403 | 20:06:43 | 37.403 |
6067 | 982 | 840 | 2 | 2 | 29294 | 20:10:07 | 29.294 |
6068 | 982 | 840 | 3 | 3 | 25584 | 20:13:16 | 25.584 |
6069 | 982 | 840 | 4 | 26 | 29412 | 21:05:07 | 29.412 |
6070 rows x 7 columns
memory usage: 0.44 MB
name: pitStops
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/pitStops/
races
name | raceId | year | round | circuitId | name | date | time | url |
---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/200... |
1 | 2 | 2009 | 2 | 2 | Malaysian Grand Prix | 2009-04-05 | 09:00:00 | http://en.wikipedia.org/wiki/200... |
2 | 3 | 2009 | 3 | 17 | Chinese Grand Prix | 2009-04-19 | 07:00:00 | http://en.wikipedia.org/wiki/200... |
3 | 4 | 2009 | 4 | 3 | Bahrain Grand Prix | 2009-04-26 | 12:00:00 | http://en.wikipedia.org/wiki/200... |
4 | 5 | 2009 | 5 | 4 | Spanish Grand Prix | 2009-05-10 | 12:00:00 | http://en.wikipedia.org/wiki/200... |
... | ... | ... | ... | ... | ... | ... | ... | |
971 | 984 | 2017 | 16 | 22 | Japanese Grand Prix | 2017-10-08 | 05:00:00 | https://en.wikipedia.org/wiki/20... |
972 | 985 | 2017 | 17 | 69 | United States Grand Prix | 2017-10-22 | 19:00:00 | https://en.wikipedia.org/wiki/20... |
973 | 986 | 2017 | 18 | 32 | Mexican Grand Prix | 2017-10-29 | 19:00:00 | https://en.wikipedia.org/wiki/20... |
974 | 987 | 2017 | 19 | 18 | Brazilian Grand Prix | 2017-11-12 | 16:00:00 | https://en.wikipedia.org/wiki/20... |
975 | 988 | 2017 | 20 | 24 | Abu Dhabi Grand Prix | 2017-11-26 | 17:00:00 | https://en.wikipedia.org/wiki/20... |
976 rows x 8 columns
memory usage: 0.15 MB
name: races
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/races/
qualifying
name | qualifyId | raceId | driverId | constructorId | number | position | q1 | q2 | q3 |
---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1:26.572 | 1:25.187 | 1:26.714 |
1 | 2 | 18 | 9 | 2 | 4 | 2 | 1:26.103 | 1:25.315 | 1:26.869 |
2 | 3 | 18 | 5 | 1 | 23 | 3 | 1:25.664 | 1:25.452 | 1:27.079 |
3 | 4 | 18 | 13 | 6 | 2 | 4 | 1:25.994 | 1:25.691 | 1:27.178 |
4 | 5 | 18 | 2 | 2 | 3 | 5 | 1:25.960 | 1:25.518 | 1:27.236 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
7392 | 7415 | 982 | 825 | 210 | 20 | 16 | 1:43.756 | NULL | NULL |
7393 | 7416 | 982 | 13 | 3 | 19 | 17 | 1:44.014 | NULL | NULL |
7394 | 7417 | 982 | 840 | 3 | 18 | 18 | 1:44.728 | NULL | NULL |
7395 | 7418 | 982 | 836 | 15 | 94 | 19 | 1:45.059 | NULL | NULL |
7396 | 7419 | 982 | 828 | 15 | 9 | 20 | 1:45.570 | NULL | NULL |
7397 rows x 9 columns
memory usage: 0.70 MB
name: qualifying
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/qualifying/
racesPd = races.to_pandas()
racesPd
raceId | year | round | circuitId | name | date | time | url | |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2009.0 | 1.0 | 1.0 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... |
1 | 2.0 | 2009.0 | 2.0 | 2.0 | Malaysian Grand Prix | 2009-04-05 | 09:00:00 | http://en.wikipedia.org/wiki/2009_Malaysian_Gr... |
2 | 3.0 | 2009.0 | 3.0 | 17.0 | Chinese Grand Prix | 2009-04-19 | 07:00:00 | http://en.wikipedia.org/wiki/2009_Chinese_Gran... |
3 | 4.0 | 2009.0 | 4.0 | 3.0 | Bahrain Grand Prix | 2009-04-26 | 12:00:00 | http://en.wikipedia.org/wiki/2009_Bahrain_Gran... |
4 | 5.0 | 2009.0 | 5.0 | 4.0 | Spanish Grand Prix | 2009-05-10 | 12:00:00 | http://en.wikipedia.org/wiki/2009_Spanish_Gran... |
... | ... | ... | ... | ... | ... | ... | ... | ... |
971 | 984.0 | 2017.0 | 16.0 | 22.0 | Japanese Grand Prix | 2017-10-08 | 05:00:00 | https://en.wikipedia.org/wiki/2017_Japanese_Gr... |
972 | 985.0 | 2017.0 | 17.0 | 69.0 | United States Grand Prix | 2017-10-22 | 19:00:00 | https://en.wikipedia.org/wiki/2017_United_Stat... |
973 | 986.0 | 2017.0 | 18.0 | 32.0 | Mexican Grand Prix | 2017-10-29 | 19:00:00 | https://en.wikipedia.org/wiki/2017_Mexican_Gra... |
974 | 987.0 | 2017.0 | 19.0 | 18.0 | Brazilian Grand Prix | 2017-11-12 | 16:00:00 | https://en.wikipedia.org/wiki/2017_Brazilian_G... |
975 | 988.0 | 2017.0 | 20.0 | 24.0 | Abu Dhabi Grand Prix | 2017-11-26 | 17:00:00 | https://en.wikipedia.org/wiki/2017_Abu_Dhabi_G... |
976 rows × 8 columns
We actually need some set-up, because the target variable is not readily available. The wins
column in driverStandings
is actually the accumulated number of wins over a year, but what we want is a boolean variable indicated whether someone has one a particular race or not.
driverStandingsPd = driverStandings.to_pandas()
driverStandingsPd = driverStandingsPd.merge(
racesPd[["raceId", "year", "date", "round"]],
on="raceId"
)
previousStanding = driverStandingsPd.merge(
driverStandingsPd[["driverId", "year", "wins", "round"]],
on=["driverId", "year"],
)
isPreviousRound = (previousStanding["round_x"] - previousStanding["round_y"] == 1.0)
previousStanding = previousStanding[isPreviousRound]
previousStanding["win"] = previousStanding["wins_x"] - previousStanding["wins_y"]
driverStandingsPd = driverStandingsPd.merge(
previousStanding[["raceId", "driverId", "win"]],
on=["raceId", "driverId"],
how="left",
)
driverStandingsPd["win"] = [win if win == win else wins for win, wins in zip(driverStandingsPd["win"], driverStandingsPd["wins"])]
driver_standings = getml.data.DataFrame.from_pandas(driverStandingsPd, "driver_standings")
driver_standings
name | driverStandingsId | raceId | driverId | points | position | wins | year | round | win | positionText | date |
---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string |
0 | 1 | 18 | 1 | 10 | 1 | 1 | 2008 | 1 | 1 | 1 | 2008-03-16 |
1 | 2 | 18 | 2 | 8 | 2 | 0 | 2008 | 1 | 0 | 2 | 2008-03-16 |
2 | 3 | 18 | 3 | 6 | 3 | 0 | 2008 | 1 | 0 | 3 | 2008-03-16 |
3 | 4 | 18 | 4 | 5 | 4 | 0 | 2008 | 1 | 0 | 4 | 2008-03-16 |
4 | 5 | 18 | 5 | 4 | 5 | 0 | 2008 | 1 | 0 | 5 | 2008-03-16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31573 | 68456 | 982 | 835 | 8 | 16 | 0 | 2017 | 14 | 0 | 16 | 2017-09-17 |
31574 | 68457 | 982 | 154 | 26 | 13 | 0 | 2017 | 14 | 0 | 13 | 2017-09-17 |
31575 | 68458 | 982 | 836 | 5 | 18 | 0 | 2017 | 14 | 0 | 18 | 2017-09-17 |
31576 | 68459 | 982 | 18 | 0 | 22 | 0 | 2017 | 14 | 0 | 22 | 2017-09-17 |
31577 | 68460 | 982 | 814 | 0 | 23 | 0 | 2017 | 14 | 0 | 23 | 2017-09-17 |
31578 rows x 11 columns
memory usage: 3.21 MB
name: driver_standings
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/driver_standings/
We also need to include the date of the race to lapTimes
and pitStops
, because we cannot use this data for the race we would like to predict. We can only take lap times and pit stops from previous races.
lapTimesPd = lapTimes.to_pandas()
lapTimesPd = lapTimesPd.merge(
racesPd[["raceId", "date", "year"]],
on="raceId"
)
lap_times = getml.data.DataFrame.from_pandas(lapTimesPd, "lap_times")
lap_times
name | raceId | driverId | lap | position | milliseconds | year | time | date |
---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string |
0 | 1 | 1 | 1 | 13 | 109088 | 2009 | 1:49.088 | 2009-03-29 |
1 | 1 | 1 | 2 | 12 | 93740 | 2009 | 1:33.740 | 2009-03-29 |
2 | 1 | 1 | 3 | 11 | 91600 | 2009 | 1:31.600 | 2009-03-29 |
3 | 1 | 1 | 4 | 10 | 91067 | 2009 | 1:31.067 | 2009-03-29 |
4 | 1 | 1 | 5 | 10 | 92129 | 2009 | 1:32.129 | 2009-03-29 |
... | ... | ... | ... | ... | ... | ... | ... | |
420364 | 982 | 840 | 54 | 8 | 107528 | 2017 | 1:47.528 | 2017-09-17 |
420365 | 982 | 840 | 55 | 8 | 107512 | 2017 | 1:47.512 | 2017-09-17 |
420366 | 982 | 840 | 56 | 8 | 108143 | 2017 | 1:48.143 | 2017-09-17 |
420367 | 982 | 840 | 57 | 8 | 107848 | 2017 | 1:47.848 | 2017-09-17 |
420368 | 982 | 840 | 58 | 8 | 108699 | 2017 | 1:48.699 | 2017-09-17 |
420369 rows x 8 columns
memory usage: 35.31 MB
name: lap_times
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/lap_times/
pitStopsPd = pitStops.to_pandas()
pitStopsPd = pitStopsPd.merge(
racesPd[["raceId", "date", "year"]],
on="raceId"
)
pit_stops = getml.data.DataFrame.from_pandas(pitStopsPd, "pit_stops")
pit_stops
name | raceId | driverId | stop | lap | milliseconds | year | time | duration | date |
---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 841 | 1 | 1 | 16 | 23227 | 2011 | 17:28:24 | 23.227 | 2011-03-27 |
1 | 841 | 1 | 2 | 36 | 23199 | 2011 | 17:59:29 | 23.199 | 2011-03-27 |
2 | 841 | 2 | 1 | 15 | 22994 | 2011 | 17:27:41 | 22.994 | 2011-03-27 |
3 | 841 | 2 | 2 | 30 | 25098 | 2011 | 17:51:32 | 25.098 | 2011-03-27 |
4 | 841 | 3 | 1 | 16 | 23716 | 2011 | 17:29:00 | 23.716 | 2011-03-27 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
6065 | 982 | 839 | 6 | 38 | 29134 | 2017 | 21:29:07 | 29.134 | 2017-09-17 |
6066 | 982 | 840 | 1 | 1 | 37403 | 2017 | 20:06:43 | 37.403 | 2017-09-17 |
6067 | 982 | 840 | 2 | 2 | 29294 | 2017 | 20:10:07 | 29.294 | 2017-09-17 |
6068 | 982 | 840 | 3 | 3 | 25584 | 2017 | 20:13:16 | 25.584 | 2017-09-17 |
6069 | 982 | 840 | 4 | 26 | 29412 | 2017 | 21:05:07 | 29.412 | 2017-09-17 |
6070 rows x 9 columns
memory usage: 0.60 MB
name: pit_stops
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/pit_stops/
getML requires that we define roles for each of the columns.
driver_standings.set_role("win", getml.data.roles.target)
driver_standings.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
driver_standings.set_role("position", getml.data.roles.numerical)
driver_standings.set_role("date", getml.data.roles.time_stamp)
driver_standings
name | date | raceId | driverId | year | win | position | driverStandingsId | points | wins | round | positionText |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | target | numerical | unused_float | unused_float | unused_float | unused_float | unused_string |
unit | time stamp, comparison only | ||||||||||
0 | 2008-03-16 | 18 | 1 | 2008 | 1 | 1 | 1 | 10 | 1 | 1 | 1 |
1 | 2008-03-16 | 18 | 2 | 2008 | 0 | 2 | 2 | 8 | 0 | 1 | 2 |
2 | 2008-03-16 | 18 | 3 | 2008 | 0 | 3 | 3 | 6 | 0 | 1 | 3 |
3 | 2008-03-16 | 18 | 4 | 2008 | 0 | 4 | 4 | 5 | 0 | 1 | 4 |
4 | 2008-03-16 | 18 | 5 | 2008 | 0 | 5 | 5 | 4 | 0 | 1 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31573 | 2017-09-17 | 982 | 835 | 2017 | 0 | 16 | 68456 | 8 | 0 | 14 | 16 |
31574 | 2017-09-17 | 982 | 154 | 2017 | 0 | 13 | 68457 | 26 | 0 | 14 | 13 |
31575 | 2017-09-17 | 982 | 836 | 2017 | 0 | 18 | 68458 | 5 | 0 | 14 | 18 |
31576 | 2017-09-17 | 982 | 18 | 2017 | 0 | 22 | 68459 | 0 | 0 | 14 | 22 |
31577 | 2017-09-17 | 982 | 814 | 2017 | 0 | 23 | 68460 | 0 | 0 | 14 | 23 |
31578 rows x 11 columns
memory usage: 2.49 MB
name: driver_standings
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/driver_standings/
drivers.set_role("driverId", getml.data.roles.join_key)
drivers.set_role(["nationality", "driverRef"], getml.data.roles.categorical)
drivers
name | driverId | nationality | driverRef | number | code | forename | surname | dob | url |
---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | British | hamilton | 44 | HAM | Lewis | Hamilton | 1985-01-07 | http://en.wikipedia.org/wiki/Lew... |
1 | 2 | German | heidfeld | nan | HEI | Nick | Heidfeld | 1977-05-10 | http://en.wikipedia.org/wiki/Nic... |
2 | 3 | German | rosberg | 6 | ROS | Nico | Rosberg | 1985-06-27 | http://en.wikipedia.org/wiki/Nic... |
3 | 4 | Spanish | alonso | 14 | ALO | Fernando | Alonso | 1981-07-29 | http://en.wikipedia.org/wiki/Fer... |
4 | 5 | Finnish | kovalainen | nan | KOV | Heikki | Kovalainen | 1981-10-19 | http://en.wikipedia.org/wiki/Hei... |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
835 | 837 | Indonesian | haryanto | 88 | HAR | Rio | Haryanto | 1993-01-22 | http://en.wikipedia.org/wiki/Rio... |
836 | 838 | Belgian | vandoorne | 2 | VAN | Stoffel | Vandoorne | 1992-03-26 | http://en.wikipedia.org/wiki/Sto... |
837 | 839 | French | ocon | 31 | OCO | Esteban | Ocon | 1996-09-17 | http://en.wikipedia.org/wiki/Est... |
838 | 840 | Canadian | stroll | 18 | STR | Lance | Stroll | 1998-10-29 | http://en.wikipedia.org/wiki/Lan... |
839 | 841 | Italian | giovinazzi | 36 | GIO | Antonio | Giovinazzi | 1993-12-14 | http://en.wikipedia.org/wiki/Ant... |
840 rows x 9 columns
memory usage: 0.11 MB
name: drivers
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/drivers/
lap_times.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
lap_times.set_role(["lap", "milliseconds", "position"], getml.data.roles.numerical)
lap_times.set_role("date", getml.data.roles.time_stamp)
lap_times
name | date | raceId | driverId | year | lap | milliseconds | position | time |
---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | numerical | numerical | numerical | unused_string |
unit | time stamp, comparison only | |||||||
0 | 2009-03-29 | 1 | 1 | 2009 | 1 | 109088 | 13 | 1:49.088 |
1 | 2009-03-29 | 1 | 1 | 2009 | 2 | 93740 | 12 | 1:33.740 |
2 | 2009-03-29 | 1 | 1 | 2009 | 3 | 91600 | 11 | 1:31.600 |
3 | 2009-03-29 | 1 | 1 | 2009 | 4 | 91067 | 10 | 1:31.067 |
4 | 2009-03-29 | 1 | 1 | 2009 | 5 | 92129 | 10 | 1:32.129 |
... | ... | ... | ... | ... | ... | ... | ... | |
420364 | 2017-09-17 | 982 | 840 | 2017 | 54 | 107528 | 8 | 1:47.528 |
420365 | 2017-09-17 | 982 | 840 | 2017 | 55 | 107512 | 8 | 1:47.512 |
420366 | 2017-09-17 | 982 | 840 | 2017 | 56 | 108143 | 8 | 1:48.143 |
420367 | 2017-09-17 | 982 | 840 | 2017 | 57 | 107848 | 8 | 1:47.848 |
420368 | 2017-09-17 | 982 | 840 | 2017 | 58 | 108699 | 8 | 1:48.699 |
420369 rows x 8 columns
memory usage: 25.64 MB
name: lap_times
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/lap_times/
pit_stops.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
pit_stops.set_role(["lap", "milliseconds", "stop"], getml.data.roles.numerical)
pit_stops.set_role("date", getml.data.roles.time_stamp)
pit_stops
name | date | raceId | driverId | year | lap | milliseconds | stop | time | duration |
---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | numerical | numerical | numerical | unused_string | unused_string |
unit | time stamp, comparison only | ||||||||
0 | 2011-03-27 | 841 | 1 | 2011 | 16 | 23227 | 1 | 17:28:24 | 23.227 |
1 | 2011-03-27 | 841 | 1 | 2011 | 36 | 23199 | 2 | 17:59:29 | 23.199 |
2 | 2011-03-27 | 841 | 2 | 2011 | 15 | 22994 | 1 | 17:27:41 | 22.994 |
3 | 2011-03-27 | 841 | 2 | 2011 | 30 | 25098 | 2 | 17:51:32 | 25.098 |
4 | 2011-03-27 | 841 | 3 | 2011 | 16 | 23716 | 1 | 17:29:00 | 23.716 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
6065 | 2017-09-17 | 982 | 839 | 2017 | 38 | 29134 | 6 | 21:29:07 | 29.134 |
6066 | 2017-09-17 | 982 | 840 | 2017 | 1 | 37403 | 1 | 20:06:43 | 37.403 |
6067 | 2017-09-17 | 982 | 840 | 2017 | 2 | 29294 | 2 | 20:10:07 | 29.294 |
6068 | 2017-09-17 | 982 | 840 | 2017 | 3 | 25584 | 3 | 20:13:16 | 25.584 |
6069 | 2017-09-17 | 982 | 840 | 2017 | 26 | 29412 | 4 | 21:05:07 | 29.412 |
6070 rows x 9 columns
memory usage: 0.46 MB
name: pit_stops
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/pit_stops/
qualifying.set_role(["raceId", "driverId", "qualifyId"], getml.data.roles.join_key)
qualifying.set_role(["position", "number"], getml.data.roles.numerical)
qualifying
name | raceId | driverId | qualifyId | position | number | constructorId | q1 | q2 | q3 |
---|---|---|---|---|---|---|---|---|---|
role | join_key | join_key | join_key | numerical | numerical | unused_float | unused_string | unused_string | unused_string |
0 | 18 | 1 | 1 | 1 | 22 | 1 | 1:26.572 | 1:25.187 | 1:26.714 |
1 | 18 | 9 | 2 | 2 | 4 | 2 | 1:26.103 | 1:25.315 | 1:26.869 |
2 | 18 | 5 | 3 | 3 | 23 | 1 | 1:25.664 | 1:25.452 | 1:27.079 |
3 | 18 | 13 | 4 | 4 | 2 | 6 | 1:25.994 | 1:25.691 | 1:27.178 |
4 | 18 | 2 | 5 | 5 | 3 | 2 | 1:25.960 | 1:25.518 | 1:27.236 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
7392 | 982 | 825 | 7415 | 16 | 20 | 210 | 1:43.756 | NULL | NULL |
7393 | 982 | 13 | 7416 | 17 | 19 | 3 | 1:44.014 | NULL | NULL |
7394 | 982 | 840 | 7417 | 18 | 18 | 3 | 1:44.728 | NULL | NULL |
7395 | 982 | 836 | 7418 | 19 | 94 | 15 | 1:45.059 | NULL | NULL |
7396 | 982 | 828 | 7419 | 20 | 9 | 15 | 1:45.570 | NULL | NULL |
7397 rows x 9 columns
memory usage: 0.61 MB
name: qualifying
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/qualifying/
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
split = getml.data.split.random(train=0.8, test=0.2)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | train |
... |
infinite number of rows
type: StringColumnView
star_schema = getml.data.StarSchema(population=driver_standings.drop(["position"]), alias="population", split=split)
star_schema.join(
driver_standings,
on=["driverId"],
time_stamps="date",
horizon=getml.data.time.days(1),
lagged_targets=True,
)
# We cannot use lap times for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
lap_times,
on=["driverId"],
time_stamps="date",
horizon=getml.data.time.days(1),
)
# We cannot use pit stops for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
pit_stops,
on=["driverId"],
time_stamps="date",
horizon=getml.data.time.days(1),
)
star_schema.join(
qualifying,
on=["driverId", "raceId"],
relationship=getml.data.relationship.many_to_one,
)
star_schema.join(
drivers,
on=["driverId"],
relationship=getml.data.relationship.many_to_one,
)
star_schema
data frames | staging table | |
---|---|---|
0 | population, qualifying, drivers | POPULATION__STAGING_TABLE_1 |
1 | driver_standings | DRIVER_STANDINGS__STAGING_TABLE_2 |
2 | lap_times | LAP_TIMES__STAGING_TABLE_3 |
3 | pit_stops | PIT_STOPS__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | driver_standings | 6229 | View |
1 | train | driver_standings | 25349 | View |
name | rows | type | |
---|---|---|---|
0 | driver_standings | 31578 | DataFrame |
1 | lap_times | 420369 | DataFrame |
2 | pit_stops | 6070 | DataFrame |
3 | qualifying | 7397 | DataFrame |
4 | drivers | 840 | DataFrame |
Set-up the feature learner & predictor
We use the relboost algorithms for this problem. Because of the large number of keywords, we regularize the model a bit by requiring a minimum support for the keywords (min_num_samples
).
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
aggregation=getml.feature_learning.FastProp.agg_sets.All,
num_threads=1,
)
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
relmt = getml.feature_learning.RelMT(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
predictor = getml.predictors.XGBoostClassifier(n_jobs=1)
Build the pipeline
pipe1 = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=star_schema.data_model,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor],
include_categorical=True,
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function=None, peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
pipe1.check(star_schema.train)
Checking data model... Staging... [========================================] 100% Preprocessing... [========================================] 100% Checking... [========================================] 100% INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId' and 'driverId', there are no corresponding entries for 68.551028% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId' and 'driverId', there are no corresponding entries for 82.527910% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
pipe1.fit(star_schema.train)
Checking data model... Staging... [========================================] 100% INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId' and 'driverId', there are no corresponding entries for 68.551028% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId' and 'driverId', there are no corresponding entries for 82.527910% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. Staging... [========================================] 100% Preprocessing... [========================================] 100% FastProp: Trying 867 features... [========================================] 100% FastProp: Building features... [========================================] 100% XGBoost: Training as predictor... [========================================] 100% Trained pipeline. Time taken: 0h:10m:31.904886
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function=None, peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-iDDODO'])
url: http://localhost:1709/#/getpipeline/formula1/sSpaJ1/0/
pipe1.score(star_schema.test)
Staging... [========================================] 100% Preprocessing... [========================================] 100% FastProp: Building features... [========================================] 100%
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2021-08-23 21:25:33 | train | win | 0.9734 | 0.957 | 0.07563 |
1 | 2021-08-23 21:26:01 | test | win | 0.9727 | 0.9242 | 0.08486 |
featuretools requires some additional data preparation before we can start engineering features.
population_train_pd = star_schema.train.population.drop(["position"]).to_pandas()
population_test_pd = star_schema.test.population.drop(["position"]).to_pandas()
driver_standings_pd = driver_standings.drop(driver_standings.roles.unused).to_pandas()
lap_times_pd = lap_times.drop(lap_times.roles.unused).to_pandas()
pit_stops_pd = pit_stops.drop(pit_stops.roles.unused).to_pandas()
qualifying_pd = qualifying.drop(qualifying.roles.unused).to_pandas()
drivers_pd = drivers.drop(drivers.roles.unused).to_pandas()
Because qualifying
and drivers
are many-to-one joins, we can directly join them onto our table.
population_train_pd["id"] = population_train_pd.index
population_train_pd = population_train_pd.merge(
qualifying_pd,
on=["driverId", "raceId"],
how="left",
)
population_train_pd = population_train_pd.merge(
drivers_pd,
on=["driverId"],
how="left",
)
population_train_pd
date | raceId | driverId | year | win | id | qualifyId | position | number | nationality | driverRef | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-03-16 | 18 | 1 | 2008 | 1.0 | 0 | 1 | 1.0 | 22.0 | British | hamilton |
1 | 2008-03-16 | 18 | 2 | 2008 | 0.0 | 1 | 5 | 5.0 | 3.0 | German | heidfeld |
2 | 2008-03-16 | 18 | 3 | 2008 | 0.0 | 2 | 7 | 7.0 | 7.0 | German | rosberg |
3 | 2008-03-16 | 18 | 5 | 2008 | 0.0 | 3 | 3 | 3.0 | 23.0 | Finnish | kovalainen |
4 | 2008-03-16 | 18 | 6 | 2008 | 0.0 | 4 | 14 | 14.0 | 8.0 | Japanese | nakajima |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
25344 | 2017-09-17 | 982 | 835 | 2017 | 0.0 | 25344 | 7410 | 11.0 | 30.0 | British | jolyon_palmer |
25345 | 2017-09-17 | 982 | 154 | 2017 | 0.0 | 25345 | 7414 | 15.0 | 8.0 | French | grosjean |
25346 | 2017-09-17 | 982 | 836 | 2017 | 0.0 | 25346 | 7418 | 19.0 | 94.0 | German | wehrlein |
25347 | 2017-09-17 | 982 | 18 | 2017 | 0.0 | 25347 | NaN | NaN | NaN | British | button |
25348 | 2017-09-17 | 982 | 814 | 2017 | 0.0 | 25348 | NaN | NaN | NaN | British | resta |
25349 rows × 11 columns
Same for the testing set.
population_test_pd["id"] = population_test_pd.index
population_test_pd = population_test_pd.merge(
qualifying_pd,
on=["driverId", "raceId"],
how="left",
)
population_test_pd = population_test_pd.merge(
drivers_pd,
on=["driverId"],
how="left",
)
population_test_pd
date | raceId | driverId | year | win | id | qualifyId | position | number | nationality | driverRef | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-03-16 | 18 | 4 | 2008 | 0.0 | 0 | 12 | 12.0 | 5.0 | Spanish | alonso |
1 | 2008-03-16 | 18 | 7 | 2008 | 0.0 | 1 | 18 | 18.0 | 14.0 | French | bourdais |
2 | 2008-03-23 | 19 | 3 | 2008 | 0.0 | 2 | 38 | 16.0 | 7.0 | German | rosberg |
3 | 2008-03-23 | 19 | 11 | 2008 | 0.0 | 3 | 42 | 20.0 | 18.0 | Japanese | sato |
4 | 2008-04-06 | 20 | 7 | 2008 | 0.0 | 4 | 59 | 15.0 | 14.0 | French | bourdais |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6224 | 2017-09-03 | 981 | 835 | 2017 | 0.0 | 6224 | 7396 | 17.0 | 30.0 | British | jolyon_palmer |
6225 | 2017-09-03 | 981 | 814 | 2017 | 0.0 | 6225 | NaN | NaN | NaN | British | resta |
6226 | 2017-09-17 | 982 | 822 | 2017 | 0.0 | 6226 | 7405 | 6.0 | 77.0 | Finnish | bottas |
6227 | 2017-09-17 | 982 | 832 | 2017 | 0.0 | 6227 | 7409 | 10.0 | 55.0 | Spanish | sainz |
6228 | 2017-09-17 | 982 | 826 | 2017 | 0.0 | 6228 | 7412 | 13.0 | 26.0 | Russian | kvyat |
6229 rows × 11 columns
featuretools requires us to expand our peripheral tables so they can be joined using the unique id from the population table. Luckily, we can write a simple helper function that works for all three remaining peripheral tables.
def prepare_peripheral(peripheral, train_or_test):
"""
Helper function that imitates the behavior of
the data model defined above.
"""
peripheral_new = peripheral.merge(
train_or_test[["id", "driverId", "date"]],
on=["driverId"],
)
peripheral_new = peripheral_new[
peripheral_new["date_x"] < peripheral_new["date_y"]
]
del peripheral_new["date_y"]
del peripheral_new["driverId"]
del peripheral_new["raceId"]
return peripheral_new.rename(columns={"date_x": "date"})
driver_standings_train_pd = prepare_peripheral(driver_standings_pd, population_train_pd)
driver_standings_test_pd = prepare_peripheral(driver_standings_pd, population_test_pd)
driver_standings_train_pd
date | year | win | position | id | |
---|---|---|---|---|---|
1 | 2008-03-16 | 2008 | 1.0 | 1.0 | 6 |
2 | 2008-03-16 | 2008 | 1.0 | 1.0 | 22 |
3 | 2008-03-16 | 2008 | 1.0 | 1.0 | 39 |
4 | 2008-03-16 | 2008 | 1.0 | 1.0 | 54 |
5 | 2008-03-16 | 2008 | 1.0 | 1.0 | 72 |
... | ... | ... | ... | ... | ... |
2656230 | 2017-07-30 | 2017 | 0.0 | 12.0 | 25323 |
2656231 | 2017-07-30 | 2017 | 0.0 | 12.0 | 25341 |
2656242 | 2017-08-27 | 2017 | 0.0 | 13.0 | 25323 |
2656243 | 2017-08-27 | 2017 | 0.0 | 13.0 | 25341 |
2656255 | 2017-09-03 | 2017 | 0.0 | 12.0 | 25341 |
1317941 rows × 5 columns
lap_times_train_pd = prepare_peripheral(lap_times_pd, population_train_pd)
lap_times_test_pd = prepare_peripheral(lap_times_pd, population_test_pd)
lap_times_train_pd
date | year | lap | milliseconds | position | id | |
---|---|---|---|---|---|---|
28 | 2009-03-29 | 2009 | 1.0 | 109088.0 | 13.0 | 5874 |
29 | 2009-03-29 | 2009 | 1.0 | 109088.0 | 13.0 | 5904 |
30 | 2009-03-29 | 2009 | 1.0 | 109088.0 | 13.0 | 5918 |
31 | 2009-03-29 | 2009 | 1.0 | 109088.0 | 13.0 | 5936 |
32 | 2009-03-29 | 2009 | 1.0 | 109088.0 | 13.0 | 5954 |
... | ... | ... | ... | ... | ... | ... |
54811898 | 2017-04-09 | 2017 | 3.0 | 156151.0 | 17.0 | 25264 |
54811899 | 2017-04-09 | 2017 | 3.0 | 156151.0 | 17.0 | 25281 |
54811900 | 2017-04-09 | 2017 | 3.0 | 156151.0 | 17.0 | 25300 |
54811901 | 2017-04-09 | 2017 | 3.0 | 156151.0 | 17.0 | 25320 |
54811902 | 2017-04-09 | 2017 | 3.0 | 156151.0 | 17.0 | 25337 |
25544456 rows × 6 columns
lap_times
demonstrates one of the greatest dangers of featuretools. Because it is written in pure Python, featuretools requires you expand your tables so that they can be more easily joined. But this comes at the cost of increased memory consumption: In this case a table that used to have about 420,000 rows has now been expanded to over 25 million rows.
pit_stops_train_pd = prepare_peripheral(pit_stops_pd, population_train_pd)
pit_stops_test_pd = prepare_peripheral(pit_stops_pd, population_test_pd)
pit_stops_train_pd
date | year | lap | milliseconds | stop | id | |
---|---|---|---|---|---|---|
57 | 2011-03-27 | 2011 | 16.0 | 23227.0 | 1.0 | 22459 |
58 | 2011-03-27 | 2011 | 16.0 | 23227.0 | 1.0 | 22498 |
59 | 2011-03-27 | 2011 | 16.0 | 23227.0 | 1.0 | 22536 |
60 | 2011-03-27 | 2011 | 16.0 | 23227.0 | 1.0 | 22555 |
61 | 2011-03-27 | 2011 | 16.0 | 23227.0 | 1.0 | 22574 |
... | ... | ... | ... | ... | ... | ... |
688120 | 2017-04-09 | 2017 | 2.0 | 29443.0 | 1.0 | 25264 |
688121 | 2017-04-09 | 2017 | 2.0 | 29443.0 | 1.0 | 25281 |
688122 | 2017-04-09 | 2017 | 2.0 | 29443.0 | 1.0 | 25300 |
688123 | 2017-04-09 | 2017 | 2.0 | 29443.0 | 1.0 | 25320 |
688124 | 2017-04-09 | 2017 | 2.0 | 29443.0 | 1.0 | 25337 |
213549 rows × 6 columns
del population_train_pd["driverId"]
del population_train_pd["raceId"]
del population_train_pd["year"]
del population_train_pd["qualifyId"]
del population_test_pd["driverId"]
del population_test_pd["raceId"]
del population_test_pd["year"]
del population_test_pd["qualifyId"]
entities_train = {
"population" : (population_train_pd, "id"),
"driver_standings": (driver_standings_train_pd, "index"),
"lap_times" : (lap_times_train_pd, "index"),
"pit_stops" : (pit_stops_train_pd, "index"),
}
entities_test = {
"population" : (population_test_pd, "id"),
"driver_standings": (driver_standings_test_pd, "index"),
"lap_times" : (lap_times_test_pd, "index"),
"pit_stops" : (pit_stops_test_pd, "index")
}
relationships = [
("population", "id", "driver_standings", "id"),
("population", "id", "lap_times", "id"),
("population", "id", "pit_stops", "id")
]
featuretools_train_pd = featuretools.dfs(
entities=entities_train,
relationships=relationships,
target_entity="population")[0]
/home/patrick/.local/lib/python3.9/site-packages/featuretools/entityset/entity.py:462: UserWarning: index index not found in dataframe, creating new integer column warnings.warn("index {} not found in dataframe, creating new "
featuretools_test_pd = featuretools.dfs(
entities=entities_test,
relationships=relationships,
target_entity="population")[0]
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
featuretools_train.set_role("win", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)
featuretools_train
90.725472% of all entries of column 'MAX(pit_stops.lap)' are NULL values. 90.725472% of all entries of column 'MAX(pit_stops.milliseconds)' are NULL values. 90.725472% of all entries of column 'MAX(pit_stops.stop)' are NULL values. 90.725472% of all entries of column 'MEAN(pit_stops.lap)' are NULL values. 90.725472% of all entries of column 'MEAN(pit_stops.milliseconds)' are NULL values. 90.725472% of all entries of column 'MEAN(pit_stops.stop)' are NULL values. 90.725472% of all entries of column 'MIN(pit_stops.lap)' are NULL values. 90.725472% of all entries of column 'MIN(pit_stops.milliseconds)' are NULL values. 90.725472% of all entries of column 'MIN(pit_stops.stop)' are NULL values. 90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.year)' are NULL values. 90.934554% of all entries of column 'SKEW(pit_stops.lap)' are NULL values. 90.934554% of all entries of column 'SKEW(pit_stops.milliseconds)' are NULL values. 90.934554% of all entries of column 'SKEW(pit_stops.stop)' are NULL values. 90.757032% of all entries of column 'STD(pit_stops.lap)' are NULL values. 90.757032% of all entries of column 'STD(pit_stops.milliseconds)' are NULL values. 90.757032% of all entries of column 'STD(pit_stops.stop)' are NULL values. 90.725472% of all entries of column 'MODE(pit_stops.DAY(date))' are NULL values. 90.725472% of all entries of column 'MODE(pit_stops.MONTH(date))' are NULL values. 90.725472% of all entries of column 'MODE(pit_stops.WEEKDAY(date))' are NULL values. 90.725472% of all entries of column 'MODE(pit_stops.YEAR(date))' are NULL values. 90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.DAY(date))' are NULL values. 90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.MONTH(date))' are NULL values. 90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.WEEKDAY(date))' are NULL values. 90.725472% of all entries of column 'NUM_UNIQUE(pit_stops.YEAR(date))' are NULL values. 90.725472% of all entries of column 'MODE(pit_stops.year)' are NULL values.
name | win | nationality | driverRef | MODE(driver_standings.year) | MODE(lap_times.year) | MODE(pit_stops.year) | position | number | COUNT(driver_standings) | MAX(driver_standings.position) | MAX(driver_standings.win) | MEAN(driver_standings.position) | MEAN(driver_standings.win) | MIN(driver_standings.position) | MIN(driver_standings.win) | NUM_UNIQUE(driver_standings.year) | SKEW(driver_standings.position) | SKEW(driver_standings.win) | STD(driver_standings.position) | STD(driver_standings.win) | SUM(driver_standings.position) | SUM(driver_standings.win) | COUNT(lap_times) | MAX(lap_times.lap) | MAX(lap_times.milliseconds) | MAX(lap_times.position) | MEAN(lap_times.lap) | MEAN(lap_times.milliseconds) | MEAN(lap_times.position) | MIN(lap_times.lap) | MIN(lap_times.milliseconds) | MIN(lap_times.position) | NUM_UNIQUE(lap_times.year) | SKEW(lap_times.lap) | SKEW(lap_times.milliseconds) | SKEW(lap_times.position) | STD(lap_times.lap) | STD(lap_times.milliseconds) | STD(lap_times.position) | SUM(lap_times.lap) | SUM(lap_times.milliseconds) | SUM(lap_times.position) | COUNT(pit_stops) | MAX(pit_stops.lap) | MAX(pit_stops.milliseconds) | MAX(pit_stops.stop) | MEAN(pit_stops.lap) | MEAN(pit_stops.milliseconds) | MEAN(pit_stops.stop) | MIN(pit_stops.lap) | MIN(pit_stops.milliseconds) | MIN(pit_stops.stop) | NUM_UNIQUE(pit_stops.year) | SKEW(pit_stops.lap) | SKEW(pit_stops.milliseconds) | SKEW(pit_stops.stop) | STD(pit_stops.lap) | STD(pit_stops.milliseconds) | STD(pit_stops.stop) | SUM(pit_stops.lap) | SUM(pit_stops.milliseconds) | SUM(pit_stops.stop) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(driver_standings.DAY(date)) | MODE(driver_standings.MONTH(date)) | MODE(driver_standings.WEEKDAY(date)) | MODE(driver_standings.YEAR(date)) | NUM_UNIQUE(driver_standings.DAY(date)) | NUM_UNIQUE(driver_standings.MONTH(date)) | NUM_UNIQUE(driver_standings.WEEKDAY(date)) | NUM_UNIQUE(driver_standings.YEAR(date)) | MODE(lap_times.DAY(date)) | MODE(lap_times.MONTH(date)) | MODE(lap_times.WEEKDAY(date)) | MODE(lap_times.YEAR(date)) | NUM_UNIQUE(lap_times.DAY(date)) | NUM_UNIQUE(lap_times.MONTH(date)) | NUM_UNIQUE(lap_times.WEEKDAY(date)) | NUM_UNIQUE(lap_times.YEAR(date)) | MODE(pit_stops.DAY(date)) | MODE(pit_stops.MONTH(date)) | MODE(pit_stops.WEEKDAY(date)) | MODE(pit_stops.YEAR(date)) | NUM_UNIQUE(pit_stops.DAY(date)) | NUM_UNIQUE(pit_stops.MONTH(date)) | NUM_UNIQUE(pit_stops.WEEKDAY(date)) | NUM_UNIQUE(pit_stops.YEAR(date)) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 1 | British | hamilton | 2007 | 2007 | NULL | 1 | 22 | 17 | 3 | 1 | 1.4706 | 0.2353 | 1 | 0 | 1 | 1.3538 | 1.3723 | 0.7998 | 0.4372 | 25 | 4 | 1037 | 78 | 1453884 | 19 | 32.0559 | 90600.8602 | 3.2575 | 1 | 72506 | 1 | 1 | 0.1882 | 26.093 | 2.4493 | 19.1156 | 45564.9471 | 3.4488 | 33242 | 93953092 | 3378 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 16 | 3 | 6 | 2008 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | nan | nan | nan | nan | nan | nan | nan | nan |
1 | 0 | German | heidfeld | 2005 | 2007 | NULL | 5 | 3 | 134 | 21 | 0 | 10.8582 | 0 | 4 | 0 | 8 | 0.4681 | 0 | 5.136 | 0 | 1455 | 0 | 6863 | 78 | 1285986 | 22 | 30.4682 | 89072.7102 | 9.6799 | 1 | 70516 | 1 | 8 | 0.2623 | 30.5344 | 0.3339 | 18.7041 | 20544.1746 | 4.199 | 209103 | 611306010 | 66433 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 16 | 3 | 6 | 2008 | 1 | 7 | 6 | 2005 | 31 | 8 | 1 | 8 | 1 | 7 | 6 | 2007 | 31 | 8 | 1 | 8 | nan | nan | nan | nan | nan | nan | nan | nan |
2 | 0 | German | rosberg | 2006 | 2007 | NULL | 7 | 7 | 35 | 17 | 0 | 11.9714 | 0 | 7 | 0 | 2 | 0.1328 | 0 | 3.1761 | 0 | 419 | 0 | 1627 | 77 | 191891 | 22 | 30.4124 | 89820.1174 | 9.9809 | 1 | 73159 | 3 | 2 | 0.2225 | 1.7858 | 0.5491 | 18.6109 | 14212.8848 | 3.733 | 49481 | 146137331 | 16239 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 16 | 3 | 6 | 2008 | 8 | 7 | 6 | 2006 | 25 | 8 | 1 | 2 | 8 | 5 | 6 | 2007 | 25 | 8 | 1 | 2 | nan | nan | nan | nan | nan | nan | nan | nan |
3 | 0 | Finnish | kovalainen | 2007 | 2007 | NULL | 3 | 23 | 17 | 11 | 0 | 8.3529 | 0 | 7 | 0 | 1 | 0.7931 | 0 | 1.4116 | 0 | 142 | 0 | 1023 | 76 | 1394884 | 22 | 31.4526 | 92458.8299 | 9.4018 | 1 | 73998 | 1 | 1 | 0.1906 | 26.2493 | 0.6941 | 18.6943 | 43578.4946 | 4.1724 | 32176 | 94585383 | 9618 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 16 | 3 | 6 | 2008 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | nan | nan | nan | nan | nan | nan | nan | nan |
4 | 0 | Japanese | nakajima | 2007 | 2007 | NULL | 14 | 8 | 1 | 22 | 0 | 22 | 0 | 22 | 0 | 1 | nan | nan | nan | nan | 22 | 0 | 70 | 70 | 104396 | 16 | 35.5 | 76228.3571 | 10.9143 | 1 | 73116 | 9 | 1 | 0 | 4.4222 | 1.2783 | 20.3511 | 4785.2529 | 1.8552 | 2485 | 5335985 | 764 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 16 | 3 | 6 | 2008 | 21 | 10 | 6 | 2007 | 1 | 1 | 1 | 1 | 21 | 10 | 6 | 2007 | 1 | 1 | 1 | 1 | nan | nan | nan | nan | nan | nan | nan | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
25344 | 0 | British | jolyon_palmer | 2016 | 2016 | 2016 | 11 | 30 | 34 | 20 | 0 | 18.0588 | 0 | 11 | 0 | 2 | -1.7626 | 0 | 1.8413 | 0 | 614 | 0 | 1517 | 78 | 1223356 | 22 | 29.4199 | 98411.6038 | 14.5175 | 1 | 68652 | 7 | 2 | 0.2921 | 21.0237 | 0.3751 | 18.5854 | 43393.1519 | 3.0483 | 44630 | 149290403 | 22023 | 56 | 50 | 1088294 | 4 | 20.9286 | 61015.1429 | 1.7857 | 1 | 16851 | 1 | 2 | 0.3035 | 5.1507 | 0.8981 | 13.473 | 191214.0556 | 0.9088 | 1172 | 3416848 | 100 | 17 | 9 | 6 | 2017 | 3 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 21 | 9 | 1 | 2 |
25345 | 0 | French | grosjean | 2016 | 2016 | 2014 | 15 | 8 | 118 | 24 | 0 | 11.5254 | 0 | 5 | 0 | 7 | 1.3306 | 0 | 4.1709 | 0 | 1360 | 0 | 5518 | 78 | 3670560 | 24 | 29.4511 | 99489.4235 | 10.7356 | 1 | 68590 | 1 | 7 | 0.2565 | 39.9299 | 0.06907 | 18.0521 | 63108.3058 | 4.8371 | 162511 | 548982639 | 59239 | 206 | 62 | 1313665 | 5 | 22.9369 | 46260.6214 | 1.801 | 1 | 15073 | 1 | 6 | 0.3075 | 7.1215 | 1.2186 | 13.7483 | 154388.9496 | 0.9288 | 4725 | 9529688 | 371 | 17 | 9 | 6 | 2017 | 27 | 7 | 6 | 2016 | 31 | 9 | 1 | 7 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 7 | 9 | 7 | 6 | 2014 | 31 | 9 | 1 | 6 |
25346 | 0 | German | wehrlein | 2016 | 2016 | 2016 | 19 | 94 | 32 | 22 | 0 | 17.25 | 0 | 13 | 0 | 2 | 0.3213 | 0 | 2.2718 | 0 | 552 | 0 | 1636 | 76 | 2118880 | 22 | 30.3839 | 99952.8594 | 15.8319 | 1 | 69241 | 4 | 2 | 0.2027 | 20.0905 | -0.8509 | 18.2369 | 77850.4087 | 3.2136 | 49708 | 163522878 | 25901 | 62 | 52 | 2008464 | 5 | 21.6129 | 119925.5968 | 1.8065 | 1 | 18064 | 1 | 2 | 0.1765 | 3.9986 | 1.1414 | 12.8896 | 368710.9372 | 0.9381 | 1340 | 7435387 | 112 | 17 | 9 | 6 | 2017 | 3 | 7 | 6 | 2016 | 24 | 9 | 1 | 2 | 3 | 7 | 6 | 2016 | 24 | 9 | 1 | 2 | 3 | 7 | 6 | 2016 | 24 | 8 | 1 | 2 |
25347 | 0 | British | button | 2016 | 2013 | 2011 | nan | nan | 309 | 22 | 1 | 9.6828 | 0.04854 | 1 | 0 | 18 | 0.3055 | 4.2218 | 5.8899 | 0.2153 | 2992 | 15 | 16272 | 78 | 7503775 | 22 | 30.2853 | 95662.1676 | 8.4123 | 1 | 69828 | 1 | 18 | 0.2244 | 64.7495 | 0.4521 | 18.3544 | 79242.6165 | 4.859 | 492803 | 1556614792 | 136885 | 259 | 64 | 2010062 | 6 | 24.9344 | 45562.5985 | 1.8571 | 1 | 14501 | 1 | 7 | 0.3229 | 8.7373 | 1.148 | 13.5509 | 179696.8608 | 0.9353 | 6458 | 11800713 | 481 | 17 | 9 | 6 | 2017 | 24 | 7 | 6 | 2016 | 31 | 9 | 1 | 18 | 24 | 7 | 6 | 2013 | 31 | 9 | 1 | 18 | 25 | 7 | 6 | 2011 | 31 | 9 | 1 | 7 |
25348 | 0 | British | resta | 2012 | 2012 | 2011 | nan | nan | 61 | 23 | 0 | 12.6066 | 0 | 8 | 0 | 4 | 1.244 | 0 | 3.4558 | 0 | 769 | 0 | 3284 | 78 | 7503980 | 24 | 30.3097 | 103049.4318 | 10.7728 | 1 | 76855 | 1 | 4 | 0.1989 | 44.5439 | 0.1348 | 18.0742 | 144158.4761 | 3.5107 | 99537 | 338414334 | 35378 | 127 | 57 | 123124 | 4 | 25.1496 | 24084.2598 | 1.7323 | 1 | 14538 | 1 | 4 | 0.3166 | 8.1337 | 0.8105 | 13.8456 | 9973.774 | 0.7913 | 3194 | 3058701 | 220 | 17 | 9 | 6 | 2017 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 25 | 5 | 6 | 2011 | 27 | 9 | 1 | 4 |
25349 rows x 90 columns
memory usage: 17.74 MB
name: featuretools_train
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/featuretools_train/
featuretools_test.set_role("win", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)
featuretools_test
90.191042% of all entries of column 'MAX(pit_stops.lap)' are NULL values. 90.191042% of all entries of column 'MAX(pit_stops.milliseconds)' are NULL values. 90.191042% of all entries of column 'MAX(pit_stops.stop)' are NULL values. 90.191042% of all entries of column 'MEAN(pit_stops.lap)' are NULL values. 90.191042% of all entries of column 'MEAN(pit_stops.milliseconds)' are NULL values. 90.191042% of all entries of column 'MEAN(pit_stops.stop)' are NULL values. 90.191042% of all entries of column 'MIN(pit_stops.lap)' are NULL values. 90.191042% of all entries of column 'MIN(pit_stops.milliseconds)' are NULL values. 90.191042% of all entries of column 'MIN(pit_stops.stop)' are NULL values. 90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.year)' are NULL values. 90.351581% of all entries of column 'SKEW(pit_stops.lap)' are NULL values. 90.351581% of all entries of column 'SKEW(pit_stops.milliseconds)' are NULL values. 90.351581% of all entries of column 'SKEW(pit_stops.stop)' are NULL values. 90.239204% of all entries of column 'STD(pit_stops.lap)' are NULL values. 90.239204% of all entries of column 'STD(pit_stops.milliseconds)' are NULL values. 90.239204% of all entries of column 'STD(pit_stops.stop)' are NULL values. 90.191042% of all entries of column 'MODE(pit_stops.DAY(date))' are NULL values. 90.191042% of all entries of column 'MODE(pit_stops.MONTH(date))' are NULL values. 90.191042% of all entries of column 'MODE(pit_stops.WEEKDAY(date))' are NULL values. 90.191042% of all entries of column 'MODE(pit_stops.YEAR(date))' are NULL values. 90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.DAY(date))' are NULL values. 90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.MONTH(date))' are NULL values. 90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.WEEKDAY(date))' are NULL values. 90.191042% of all entries of column 'NUM_UNIQUE(pit_stops.YEAR(date))' are NULL values. 90.191042% of all entries of column 'MODE(pit_stops.year)' are NULL values.
name | win | nationality | driverRef | MODE(driver_standings.year) | MODE(lap_times.year) | MODE(pit_stops.year) | position | number | COUNT(driver_standings) | MAX(driver_standings.position) | MAX(driver_standings.win) | MEAN(driver_standings.position) | MEAN(driver_standings.win) | MIN(driver_standings.position) | MIN(driver_standings.win) | NUM_UNIQUE(driver_standings.year) | SKEW(driver_standings.position) | SKEW(driver_standings.win) | STD(driver_standings.position) | STD(driver_standings.win) | SUM(driver_standings.position) | SUM(driver_standings.win) | COUNT(lap_times) | MAX(lap_times.lap) | MAX(lap_times.milliseconds) | MAX(lap_times.position) | MEAN(lap_times.lap) | MEAN(lap_times.milliseconds) | MEAN(lap_times.position) | MIN(lap_times.lap) | MIN(lap_times.milliseconds) | MIN(lap_times.position) | NUM_UNIQUE(lap_times.year) | SKEW(lap_times.lap) | SKEW(lap_times.milliseconds) | SKEW(lap_times.position) | STD(lap_times.lap) | STD(lap_times.milliseconds) | STD(lap_times.position) | SUM(lap_times.lap) | SUM(lap_times.milliseconds) | SUM(lap_times.position) | COUNT(pit_stops) | MAX(pit_stops.lap) | MAX(pit_stops.milliseconds) | MAX(pit_stops.stop) | MEAN(pit_stops.lap) | MEAN(pit_stops.milliseconds) | MEAN(pit_stops.stop) | MIN(pit_stops.lap) | MIN(pit_stops.milliseconds) | MIN(pit_stops.stop) | NUM_UNIQUE(pit_stops.year) | SKEW(pit_stops.lap) | SKEW(pit_stops.milliseconds) | SKEW(pit_stops.stop) | STD(pit_stops.lap) | STD(pit_stops.milliseconds) | STD(pit_stops.stop) | SUM(pit_stops.lap) | SUM(pit_stops.milliseconds) | SUM(pit_stops.stop) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(driver_standings.DAY(date)) | MODE(driver_standings.MONTH(date)) | MODE(driver_standings.WEEKDAY(date)) | MODE(driver_standings.YEAR(date)) | NUM_UNIQUE(driver_standings.DAY(date)) | NUM_UNIQUE(driver_standings.MONTH(date)) | NUM_UNIQUE(driver_standings.WEEKDAY(date)) | NUM_UNIQUE(driver_standings.YEAR(date)) | MODE(lap_times.DAY(date)) | MODE(lap_times.MONTH(date)) | MODE(lap_times.WEEKDAY(date)) | MODE(lap_times.YEAR(date)) | NUM_UNIQUE(lap_times.DAY(date)) | NUM_UNIQUE(lap_times.MONTH(date)) | NUM_UNIQUE(lap_times.WEEKDAY(date)) | NUM_UNIQUE(lap_times.YEAR(date)) | MODE(pit_stops.DAY(date)) | MODE(pit_stops.MONTH(date)) | MODE(pit_stops.WEEKDAY(date)) | MODE(pit_stops.YEAR(date)) | NUM_UNIQUE(pit_stops.DAY(date)) | NUM_UNIQUE(pit_stops.MONTH(date)) | NUM_UNIQUE(pit_stops.WEEKDAY(date)) | NUM_UNIQUE(pit_stops.YEAR(date)) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 0 | Spanish | alonso | 2005 | 2006 | NULL | 12 | 5 | 105 | 23 | 1 | 5.6667 | 0.181 | 1 | 0 | 6 | 1.7319 | 1.6816 | 7.1598 | 0.3868 | 595 | 19 | 5769 | 78 | 1406432 | 21 | 30.436 | 88439.8289 | 5.2479 | 1 | 70526 | 1 | 6 | 0.2301 | 34.9402 | 1.3932 | 18.3774 | 22633.3599 | 4.9145 | 175585 | 510209373 | 30275 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 16 | 3 | 6 | 2008 | 1 | 7 | 6 | 2005 | 31 | 8 | 1 | 6 | 24 | 7 | 6 | 2006 | 31 | 8 | 1 | 6 | nan | nan | nan | nan | nan | nan | nan | nan |
1 | 0 | French | bourdais | NULL | NULL | NULL | 18 | 14 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 16 | 3 | 6 | 2008 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
2 | 0 | German | rosberg | 2006 | 2007 | NULL | 16 | 7 | 36 | 17 | 0 | 11.7222 | 0 | 3 | 0 | 3 | -0.1958 | 0 | 3.4691 | 0 | 422 | 0 | 1685 | 77 | 191891 | 22 | 30.381 | 90110.451 | 9.7822 | 1 | 73159 | 2 | 3 | 0.2179 | 1.8323 | 0.5148 | 18.5504 | 14476.8352 | 3.8237 | 51192 | 151836110 | 16483 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 23 | 3 | 6 | 2008 | 8 | 7 | 6 | 2006 | 25 | 8 | 1 | 3 | 16 | 5 | 6 | 2007 | 25 | 8 | 1 | 3 | nan | nan | nan | nan | nan | nan | nan | nan |
3 | 0 | Japanese | sato | 2004 | 2007 | NULL | 20 | 18 | 84 | 23 | 0 | 16.9405 | 0 | 7 | 0 | 6 | -0.493 | 0 | 5.5479 | 0 | 1423 | 0 | 4309 | 76 | 1921214 | 22 | 29.1091 | 91287.2075 | 12.4342 | 1 | 70727 | 1 | 7 | 0.2832 | 32.4719 | -0.2034 | 18.0839 | 42719.4804 | 4.861 | 125431 | 393356577 | 53579 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 23 | 3 | 6 | 2008 | 12 | 7 | 6 | 2004 | 31 | 8 | 1 | 6 | 25 | 7 | 6 | 2007 | 31 | 8 | 1 | 7 | nan | nan | nan | nan | nan | nan | nan | nan |
4 | 0 | French | bourdais | 2008 | 2008 | NULL | 15 | 14 | 2 | 10 | 0 | 8.5 | 0 | 7 | 0 | 1 | nan | nan | 2.1213 | 0 | 17 | 0 | 55 | 55 | 156156 | 16 | 28 | 98864.3818 | 9.7818 | 1 | 89534 | 4 | 1 | 0 | 2.2604 | -0.06348 | 16.0208 | 17491.8075 | 4.3149 | 1540 | 5437541 | 538 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 6 | 4 | 6 | 2008 | 16 | 3 | 6 | 2008 | 2 | 1 | 1 | 1 | 16 | 3 | 6 | 2008 | 1 | 1 | 1 | 1 | nan | nan | nan | nan | nan | nan | nan | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
6224 | 0 | British | jolyon_palmer | 2016 | 2016 | 2016 | 17 | 30 | 33 | 20 | 0 | 18.0303 | 0 | 11 | 0 | 2 | -1.7159 | 0 | 1.8622 | 0 | 595 | 0 | 1488 | 78 | 1223356 | 22 | 29.7009 | 98587.2056 | 14.5195 | 1 | 68652 | 7 | 2 | 0.2691 | 20.8455 | 0.3776 | 18.6186 | 43789.6099 | 3.0523 | 44195 | 146697762 | 21605 | 55 | 50 | 1088294 | 4 | 20.8545 | 61560.7273 | 1.8 | 1 | 16851 | 1 | 2 | 0.3181 | 5.1018 | 0.8728 | 13.5857 | 192932.4401 | 0.9108 | 1147 | 3385840 | 99 | 3 | 9 | 6 | 2017 | 9 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 21 | 9 | 1 | 2 |
6225 | 0 | British | resta | 2012 | 2012 | 2011 | nan | nan | 60 | 23 | 0 | 12.4333 | 0 | 8 | 0 | 4 | 1.155 | 0 | 3.2068 | 0 | 746 | 0 | 3284 | 78 | 7503980 | 24 | 30.3097 | 103049.4318 | 10.7728 | 1 | 76855 | 1 | 4 | 0.1989 | 44.5439 | 0.1348 | 18.0742 | 144158.4761 | 3.5107 | 99537 | 338414334 | 35378 | 127 | 57 | 123124 | 4 | 25.1496 | 24084.2598 | 1.7323 | 1 | 14538 | 1 | 4 | 0.3166 | 8.1337 | 0.8105 | 13.8456 | 9973.774 | 0.7913 | 3194 | 3058701 | 220 | 3 | 9 | 6 | 2017 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 25 | 5 | 6 | 2011 | 27 | 9 | 1 | 4 |
6226 | 0 | Finnish | bottas | 2016 | 2016 | 2016 | 6 | 77 | 91 | 18 | 1 | 8.1868 | 0.02198 | 3 | 0 | 5 | 1.0124 | 6.6307 | 4.9976 | 0.1474 | 745 | 2 | 5153 | 78 | 3689718 | 22 | 30.4306 | 99678.869 | 8.0584 | 1 | 67847 | 1 | 5 | 0.2271 | 32.1149 | 0.5698 | 18.18 | 74208.5105 | 4.885 | 156809 | 513645212 | 41525 | 183 | 59 | 2008595 | 5 | 24.1093 | 68069.5082 | 1.7268 | 1 | 14978 | 1 | 5 | 0.3284 | 5.9016 | 1.0949 | 13.4151 | 242520.4324 | 0.8527 | 4412 | 12456720 | 316 | 17 | 9 | 6 | 2017 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 5 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 5 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 5 |
6227 | 0 | Spanish | sainz | 2016 | 2016 | 2016 | 10 | 55 | 53 | 16 | 0 | 11.8679 | 0 | 7 | 0 | 3 | 0.06274 | 0 | 2.481 | 0 | 629 | 0 | 2690 | 78 | 2122068 | 21 | 29.7457 | 99861.4349 | 10.655 | 1 | 69150 | 3 | 3 | 0.3078 | 24.5072 | 0.3527 | 18.3133 | 61998.6099 | 3.5636 | 80016 | 268627260 | 28662 | 97 | 51 | 2008828 | 4 | 22.6289 | 84961.6907 | 1.6701 | 1 | 14966 | 1 | 3 | 0.3388 | 5.1188 | 1.0447 | 12.7919 | 297744.5008 | 0.8 | 2195 | 8241284 | 162 | 17 | 9 | 6 | 2017 | 3 | 7 | 6 | 2016 | 29 | 9 | 1 | 3 | 30 | 5 | 6 | 2016 | 29 | 9 | 1 | 3 | 9 | 7 | 6 | 2016 | 29 | 9 | 1 | 3 |
6228 | 0 | Russian | kvyat | 2016 | 2015 | 2016 | 13 | 26 | 72 | 22 | 0 | 12.6111 | 0 | 7 | 0 | 4 | -0.3133 | 0 | 3.4214 | 0 | 908 | 0 | 3585 | 78 | 3690660 | 22 | 29.2739 | 100722.5623 | 11.2388 | 1 | 68061 | 2 | 4 | 0.2624 | 33.405 | -0.07759 | 18.0505 | 79720.0053 | 4.0792 | 104947 | 361090386 | 40291 | 145 | 59 | 2011266 | 5 | 23.6207 | 64171.0069 | 1.7724 | 1 | 16139 | 1 | 4 | 0.2769 | 6.5459 | 0.9965 | 15.0675 | 236423.1748 | 0.8559 | 3425 | 9304796 | 257 | 17 | 9 | 6 | 2017 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 4 | 30 | 7 | 6 | 2015 | 31 | 9 | 1 | 4 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 4 |
6229 rows x 90 columns
memory usage: 4.36 MB
name: featuretools_test
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/formula1/featuretools_test/
We train an untuned XGBoostRegressor on top of featuretools' features, just like we have done for getML's features.
Since some of featuretools features are categorical, we allow the pipeline to include these features as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
imputation = getml.preprocessors.Imputation()
predictor = getml.predictors.XGBoostClassifier(n_jobs=1)
pipe2 = getml.pipeline.Pipeline(
tags=['featuretools'],
preprocessors=[imputation],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function=None, peripheral=[], predictors=['XGBoostClassifier'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
pipe2.fit(featuretools_train)
Checking data model... Staging... [========================================] 100% Preprocessing... [========================================] 100% Checking... [========================================] 100% WARNING [COLUMN SHOULD BE UNUSED]: 90.725472% of all entries in column 'mode(pit_stops.year)' in POPULATION__STAGING_TABLE_1 are NULL values. You should consider setting its role to unused_string. WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( min(lap_times.lap), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( min(pit_stops.stop), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( mode(lap_times.weekday(date)), 6.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( num_unique(lap_times.weekday(date)), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( mode(pit_stops.weekday(date)), 6.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). WARNING [COLUMN SHOULD BE UNUSED]: All non-NULL entries in column 'COALESCE( num_unique(pit_stops.weekday(date)), 1.000000 )' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). Staging... [========================================] 100% Preprocessing... [========================================] 100% XGBoost: Training as predictor... [========================================] 100% Trained pipeline. Time taken: 0h:0m:7.319103
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function=None, peripheral=[], predictors=['XGBoostClassifier'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
url: http://localhost:1709/#/getpipeline/formula1/DDRwHL/0/
pipe2.score(featuretools_test)
Staging... [========================================] 100% Preprocessing... [========================================] 100%
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2021-08-23 21:29:30 | featuretools_train | win | 0.9717 | 0.9391 | 0.0837 |
1 | 2021-08-23 21:29:30 | featuretools_test | win | 0.9724 | 0.9202 | 0.08684 |
We would like to understand why getML outperforms featuretools. In particular, getML's FastProp is based on an approach that is very similar to featuretools.
To investigate this matter, we first take a look at the importance of the features FastProp has learned:
names, importances = pipe1.features.importances(target_num=0)
plt.subplots(figsize=(20, 10))
plt.bar(names