AdventureWorks - Predicting customer churn

NOTE: Due to featuretools' memory requirement, this notebook will not run on MyBinder.

In this notebook, we will demonstrate how getML can be used for a customer churn project using a synthetic dataset of a fictional company. We will also benchmark getML against featuretools.

Summary:

  • Prediction type: Classification model
  • Domain: Customer loyalty
  • Prediction target: churn
  • Population size: 19704

Author: Dr. Patrick Urbanke

Background

AdventureWorks is a fictional company that sells bicycles. It is used by Microsoft to showcase how its MS SQL Server can be used to manage business data. Since this dataset resembles a real-world customer database and it is open-source, we will use it to showcase how getML can be used for a classic customer churn project (real customer databases are not easily available for the purposes of showcasing and benchmarking, for reasons of data privacy).

The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

We will benchmark getML 's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.

A web frontend for getML

The getML monitor is a frontend built to support your work with getML. The getML monitor displays information such as the imported data frames, trained pipelines and allows easy data and feature exploration. You can launch the getML monitor here.

Where is this running?

Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import featuretools
import getml

getml.set_project('adventure_works')

Connected to project 'adventure_works'

1. Loading data

1.1 Download from source

We begin by downloading the data:

In [2]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="AdventureWorks2014",
    port=3306,
    user="guest",
    password="relational"
)

conn
Out[2]:
Connection(conn_id='default',
           dbname='AdventureWorks2014',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)
In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
In [4]:
product = load_if_needed("Product")
sales_order_detail = load_if_needed("SalesOrderDetail")
sales_order_header = load_if_needed("SalesOrderHeader")
sales_order_reason = load_if_needed("SalesOrderHeaderSalesReason")
special_offer = load_if_needed("SpecialOffer")
store = load_if_needed("Store")
In [5]:
product
Out[5]:
name ProductID MakeFlag FinishedGoodsFlag SafetyStockLevel ReorderPoint DaysToManufacture ProductSubcategoryID ProductModelID Name ProductNumber Color StandardCost ListPrice Size SizeUnitMeasureCode WeightUnitMeasureCode Weight ProductLine Class Style SellStartDate SellEndDate DiscontinuedDate rowguid ModifiedDate
role unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string
0 1  0  0  1000  750  0  nan  nan  Adjustable Race AR-5381 NULL 0.0000 0.0000 NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL 694215B7-08F7-4C0D-ACB1-D734BA44... 2014-02-08 10:01:36
1 2  0  0  1000  750  0  nan  nan  Bearing Ball BA-8327 NULL 0.0000 0.0000 NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL 58AE3C20-4F3A-4749-A7D4-D568806C... 2014-02-08 10:01:36
2 3  1  0  800  600  1  nan  nan  BB Ball Bearing BE-2349 NULL 0.0000 0.0000 NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL 9C21AED2-5BFA-4F18-BCB8-F11638DC... 2014-02-08 10:01:36
3 4  0  0  800  600  0  nan  nan  Headset Ball Bearings BE-2908 NULL 0.0000 0.0000 NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL ECFED6CB-51FF-49B5-B06C-7D8AC834... 2014-02-08 10:01:36
4 316  1  0  800  600  1  nan  nan  Blade BL-2036 NULL 0.0000 0.0000 NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL E73E9750-603B-4131-89F5-3DD15ED5... 2014-02-08 10:01:36
...  ...  ...  ...  ...  ...  ...  ...  ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
499 995  1  1  500  375  1  5  96  ML Bottom Bracket BB-8107 NULL 44.9506 101.2400 NULL NULL G 168.00 NULL M NULL 2013-05-30 00:00:00 NULL NULL 71AB847F-D091-42D6-B735-7B0C2D82... 2014-02-08 10:01:36
500 996  1  1  500  375  1  5  97  HL Bottom Bracket BB-9108 NULL 53.9416 121.4900 NULL NULL G 170.00 NULL H NULL 2013-05-30 00:00:00 NULL NULL 230C47C5-08B2-4CE3-B706-69C0BDD6... 2014-02-08 10:01:36
501 997  1  1  100  75  4  2  31  Road-750 Black, 44 BK-R19B-44 Black 343.6496 539.9900 44 CM LB 19.77 R L U 2013-05-30 00:00:00 NULL NULL 44CE4802-409F-43AB-9B27-CA534218... 2014-02-08 10:01:36
502 998  1  1  100  75  4  2  31  Road-750 Black, 48 BK-R19B-48 Black 343.6496 539.9900 48 CM LB 20.13 R L U 2013-05-30 00:00:00 NULL NULL 3DE9A212-1D49-40B6-B10A-F564D981... 2014-02-08 10:01:36
503 999  1  1  100  75  4  2  31  Road-750 Black, 52 BK-R19B-52 Black 343.6496 539.9900 52 CM LB 20.42 R L U 2013-05-30 00:00:00 NULL NULL AE638923-2B67-4679-B90E-ABBAB17D... 2014-02-08 10:01:36

504 rows x 25 columns
memory usage: 0.19 MB
name: Product
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/Product/

In [6]:
sales_order_detail
Out[6]:
name SalesOrderID SalesOrderDetailID OrderQty ProductID SpecialOfferID CarrierTrackingNumber UnitPrice UnitPriceDiscount LineTotal rowguid ModifiedDate
role unused_float unused_float unused_float unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string
0 43659  1  1  776  1  4911-403C-98 2024.9940 0.0000 2024.994000 B207C96D-D9E6-402B-8470-2CC176C4... 2011-05-31 00:00:00
1 43659  2  3  777  1  4911-403C-98 2024.9940 0.0000 6074.982000 7ABB600D-1E77-41BE-9FE5-B9142CFC... 2011-05-31 00:00:00
2 43659  3  1  778  1  4911-403C-98 2024.9940 0.0000 2024.994000 475CF8C6-49F6-486E-B0AD-AFC6A50C... 2011-05-31 00:00:00
3 43659  4  1  771  1  4911-403C-98 2039.9940 0.0000 2039.994000 04C4DE91-5815-45D6-8670-F462719F... 2011-05-31 00:00:00
4 43659  5  1  772  1  4911-403C-98 2039.9940 0.0000 2039.994000 5A74C7D2-E641-438E-A7AC-37BF2328... 2011-05-31 00:00:00
...  ...  ...  ...  ...  ... ... ... ... ... ...
121312 75122  121313  1  878  1  NULL 21.9800 0.0000 21.980000 8CAD6675-18CC-4F47-8287-97B41A8E... 2014-06-30 00:00:00
121313 75122  121314  1  712  1  NULL 8.9900 0.0000 8.990000 84F1C363-1C50-4442-BE16-541C59B6... 2014-06-30 00:00:00
121314 75123  121315  1  878  1  NULL 21.9800 0.0000 21.980000 C18B6476-429F-4BB1-828E-2BE5F82A... 2014-06-30 00:00:00
121315 75123  121316  1  879  1  NULL 159.0000 0.0000 159.000000 75A89C6A-C60A-47EA-8A52-B52A9C43... 2014-06-30 00:00:00
121316 75123  121317  1  712  1  NULL 8.9900 0.0000 8.990000 73646D26-0461-450D-8019-2C6C8586... 2014-06-30 00:00:00

121317 rows x 11 columns
memory usage: 21.84 MB
name: SalesOrderDetail
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SalesOrderDetail/

In [7]:
sales_order_header
Out[7]:
name SalesOrderID RevisionNumber Status OnlineOrderFlag CustomerID SalesPersonID TerritoryID BillToAddressID ShipToAddressID ShipMethodID CreditCardID CurrencyRateID OrderDate DueDate ShipDate SalesOrderNumber PurchaseOrderNumber AccountNumber CreditCardApprovalCode SubTotal TaxAmt Freight TotalDue Comment rowguid ModifiedDate
role unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string
0 43659  8  5  0  29825  279  5  985  985  5  16281  nan  2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00 SO43659 PO522145787 10-4020-000676 105041Vi84182 20565.6206 1971.5149 616.0984 23153.2339 NULL 79B65321-39CA-4115-9CBA-8FE0903E... 2011-06-07 00:00:00
1 43660  8  5  0  29672  279  5  921  921  5  5618  nan  2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00 SO43660 PO18850127500 10-4020-000117 115213Vi29411 1294.2529 124.2483 38.8276 1457.3288 NULL 738DC42D-D03B-48A1-9822-F95A67EA... 2011-06-07 00:00:00
2 43661  8  5  0  29734  282  6  517  517  5  1346  4  2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00 SO43661 PO18473189620 10-4020-000442 85274Vi6854 32726.4786 3153.7696 985.5530 36865.8012 NULL D91B9131-18A4-4A11-BC3A-90B6F53E... 2011-06-07 00:00:00
3 43662  8  5  0  29994  282  6  482  482  5  10456  4  2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00 SO43662 PO18444174044 10-4020-000227 125295Vi53935 28832.5289 2775.1646 867.2389 32474.9324 NULL 4A1ECFC0-CC3A-4740-B028-1C50BB48... 2011-06-07 00:00:00
4 43663  8  5  0  29565  276  4  1073  1073  5  4322  nan  2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00 SO43663 PO18009186470 10-4020-000510 45303Vi22691 419.4589 40.2681 12.5838 472.3108 NULL 9B1E7A40-6AE0-4AD3-811C-A6495185... 2011-06-07 00:00:00
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31460 75119  8  5  1  11981  nan  1  17649  17649  1  6761  nan  2014-06-30 00:00:00 2014-07-12 00:00:00 2014-07-07 00:00:00 SO75119 NULL 10-4030-011981 429826Vi35166 42.2800 3.3824 1.0570 46.7194 NULL 9382F1C9-383A-435F-9449-0EECEA21... 2014-07-07 00:00:00
31461 75120  8  5  1  18749  nan  6  28374  28374  1  8925  nan  2014-06-30 00:00:00 2014-07-12 00:00:00 2014-07-07 00:00:00 SO75120 NULL 10-4030-018749 929849Vi46003 84.9600 6.7968 2.1240 93.8808 NULL AE6A4FCF-FF73-4CD4-AF2C-5993D00D... 2014-07-07 00:00:00
31462 75121  8  5  1  15251  nan  6  26553  26553  1  14220  nan  2014-06-30 00:00:00 2014-07-12 00:00:00 2014-07-07 00:00:00 SO75121 NULL 10-4030-015251 529864Vi73738 74.9800 5.9984 1.8745 82.8529 NULL D7395C0E-00CB-4BFA-A238-0D6A9F49... 2014-07-07 00:00:00
31463 75122  8  5  1  15868  nan  6  14616  14616  1  18719  nan  2014-06-30 00:00:00 2014-07-12 00:00:00 2014-07-07 00:00:00 SO75122 NULL 10-4030-015868 330022Vi97312 30.9700 2.4776 0.7743 34.2219 NULL 4221035A-4159-492F-AF40-4363A64F... 2014-07-07 00:00:00
31464 75123  8  5  1  18759  nan  6  14024  14024  1  10084  nan  2014-06-30 00:00:00 2014-07-12 00:00:00 2014-07-07 00:00:00 SO75123 NULL 10-4030-018759 230370Vi51970 189.9700 15.1976 4.7493 209.9169 NULL D54752FF-2B54-4BE5-95EA-3B72289C... 2014-07-07 00:00:00

31465 rows x 26 columns
memory usage: 12.80 MB
name: SalesOrderHeader
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SalesOrderHeader/

In [8]:
sales_order_reason
Out[8]:
name SalesOrderID SalesReasonID ModifiedDate
role unused_float unused_float unused_string
0 43697  5  2011-05-31 00:00:00
1 43697  9  2011-05-31 00:00:00
2 43702  5  2011-06-01 00:00:00
3 43702  9  2011-06-01 00:00:00
4 43703  5  2011-06-01 00:00:00
...  ...  ...
27642 75119  1  2014-06-30 00:00:00
27643 75120  1  2014-06-30 00:00:00
27644 75121  1  2014-06-30 00:00:00
27645 75122  1  2014-06-30 00:00:00
27646 75123  1  2014-06-30 00:00:00

27647 rows x 3 columns
memory usage: 1.22 MB
name: SalesOrderHeaderSalesReason
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SalesOrderHeaderSalesReason/

In [9]:
special_offer
Out[9]:
name SpecialOfferID MinQty MaxQty Description DiscountPct Type Category StartDate EndDate rowguid ModifiedDate
role unused_float unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string
0 1  0  nan  No Discount 0.0000 No Discount No Discount 2011-05-01 00:00:00 2014-11-30 00:00:00 0290C4F5-191F-4337-AB6B-0A2DDE03... 2011-04-01 00:00:00
1 2  11  14  Volume Discount 11 to 14 0.0200 Volume Discount Reseller 2011-05-31 00:00:00 2014-05-30 00:00:00 D7542EE7-15DB-4541-985C-5CC27AEF... 2011-05-01 00:00:00
2 3  15  24  Volume Discount 15 to 24 0.0500 Volume Discount Reseller 2011-05-31 00:00:00 2014-05-30 00:00:00 4BDBCC01-8CF7-40A9-B643-40EC5B71... 2011-05-01 00:00:00
3 4  25  40  Volume Discount 25 to 40 0.1000 Volume Discount Reseller 2011-05-31 00:00:00 2014-05-30 00:00:00 504B5E85-8F3F-4EBC-9E1D-C1BC5DEA... 2011-05-01 00:00:00
4 5  41  60  Volume Discount 41 to 60 0.1500 Volume Discount Reseller 2011-05-31 00:00:00 2014-05-30 00:00:00 677E1D9D-944F-4E81-90E8-47EB0A82... 2011-05-01 00:00:00
...  ...  ...  ... ... ... ... ... ... ... ...
11 12  0  nan  LL Road Frame Sale 0.3500 Excess Inventory Reseller 2013-05-30 00:00:00 2013-07-14 00:00:00 C0AF1C89-9722-4235-9248-3FBA4D9E... 2013-04-30 00:00:00
12 13  0  nan  Touring-3000 Promotion 0.1500 New Product Reseller 2013-05-30 00:00:00 2013-08-29 00:00:00 5061CCE4-E021-45A8-9A75-DFB36CBB... 2013-04-30 00:00:00
13 14  0  nan  Touring-1000 Promotion 0.2000 New Product Reseller 2013-05-30 00:00:00 2013-08-29 00:00:00 1AF84A9E-A98C-4BD9-B48F-DC2B8B6B... 2013-04-30 00:00:00
14 15  0  nan  Half-Price Pedal Sale 0.5000 Seasonal Discount Customer 2013-07-14 00:00:00 2013-08-14 00:00:00 03E3594D-6EBB-46A6-B8EE-A9289C0C... 2013-06-14 00:00:00
15 16  0  nan  Mountain-500 Silver Clearance Sa... 0.4000 Discontinued Product Reseller 2014-03-31 00:00:00 2014-05-30 00:00:00 EB7CB484-BCCF-4D2D-BF73-521B2001... 2014-03-01 00:00:00

16 rows x 11 columns
memory usage: 0.00 MB
name: SpecialOffer
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SpecialOffer/

In [10]:
store
Out[10]:
name BusinessEntityID SalesPersonID Name Demographics rowguid ModifiedDate
role unused_float unused_float unused_string unused_string unused_string unused_string
0 292  279  Next-Door Bike Store <StoreSurvey xmlns="http://schem... A22517E3-848D-4EBE-B9D9-7437F343... 2014-09-12 11:15:07
1 294  276  Professional Sales and Service <StoreSurvey xmlns="http://schem... B50CA50B-C601-4A13-B07E-2C63862D... 2014-09-12 11:15:07
2 296  277  Riders Company <StoreSurvey xmlns="http://schem... 337C3688-1339-4E1A-A08A-B54B2356... 2014-09-12 11:15:07
3 298  275  The Bike Mechanics <StoreSurvey xmlns="http://schem... 7894F278-F0C8-4D16-BD75-213FDBF1... 2014-09-12 11:15:07
4 300  286  Nationwide Supply <StoreSurvey xmlns="http://schem... C3FC9705-A8C4-4F3A-9550-EB2FA4B7... 2014-09-12 11:15:07
...  ...  ... ... ... ...
696 1988  282  Retreat Inn <StoreSurvey xmlns="http://schem... EA21EC81-1BFA-4A07-9B4D-73D9852A... 2014-09-12 11:15:07
697 1990  281  Technical Parts Manufacturing <StoreSurvey xmlns="http://schem... C8E3C4ED-8F58-4DB2-B600-E0CD11D9... 2014-09-12 11:15:07
698 1992  277  Totes & Baskets Company <StoreSurvey xmlns="http://schem... CE860B58-643C-4567-BFD8-06E97969... 2014-09-12 11:15:07
699 1994  277  World of Bikes <StoreSurvey xmlns="http://schem... 0C10F2B6-A13A-440C-9C25-5B28D482... 2014-09-12 11:15:07
700 2051  275  A Bicycle Association <StoreSurvey xmlns="http://schem... 82237172-D3FE-4A95-82EF-636F6552... 2014-09-12 11:15:07

701 rows x 6 columns
memory usage: 0.38 MB
name: Store
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/Store/

1.2 Prepare data for getML

getML requires that we define roles for each of the columns.

In [11]:
product.set_role("ProductID", getml.data.roles.join_key)
product.set_role(["MakeFlag", "ProductSubcategoryID", "ProductModelID"], getml.data.roles.categorical)
product.set_role(["SafetyStockLevel", "ReorderPoint", "StandardCost", "ListPrice"], getml.data.roles.numerical)

product
Out[11]:
name ProductID MakeFlag ProductSubcategoryID ProductModelID SafetyStockLevel ReorderPoint StandardCost ListPrice FinishedGoodsFlag DaysToManufacture Name ProductNumber Color Size SizeUnitMeasureCode WeightUnitMeasureCode Weight ProductLine Class Style SellStartDate SellEndDate DiscontinuedDate rowguid ModifiedDate
role join_key categorical categorical categorical numerical numerical numerical numerical unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string unused_string
0 1 0 NULL NULL 1000  750  0  0  0  0  Adjustable Race AR-5381 NULL NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL 694215B7-08F7-4C0D-ACB1-D734BA44... 2014-02-08 10:01:36
1 2 0 NULL NULL 1000  750  0  0  0  0  Bearing Ball BA-8327 NULL NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL 58AE3C20-4F3A-4749-A7D4-D568806C... 2014-02-08 10:01:36
2 3 1 NULL NULL 800  600  0  0  0  1  BB Ball Bearing BE-2349 NULL NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL 9C21AED2-5BFA-4F18-BCB8-F11638DC... 2014-02-08 10:01:36
3 4 0 NULL NULL 800  600  0  0  0  0  Headset Ball Bearings BE-2908 NULL NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL ECFED6CB-51FF-49B5-B06C-7D8AC834... 2014-02-08 10:01:36
4 316 1 NULL NULL 800  600  0  0  0  1  Blade BL-2036 NULL NULL NULL NULL NULL NULL NULL NULL 2008-04-30 00:00:00 NULL NULL E73E9750-603B-4131-89F5-3DD15ED5... 2014-02-08 10:01:36
... ... ... ... ...  ...  ...  ...  ...  ...  ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
499 995 1 5 96 500  375  44.9506 101.24 1  1  ML Bottom Bracket BB-8107 NULL NULL NULL G 168.00 NULL M NULL 2013-05-30 00:00:00 NULL NULL 71AB847F-D091-42D6-B735-7B0C2D82... 2014-02-08 10:01:36
500 996 1 5 97 500  375  53.9416 121.49 1  1  HL Bottom Bracket BB-9108 NULL NULL NULL G 170.00 NULL H NULL 2013-05-30 00:00:00 NULL NULL 230C47C5-08B2-4CE3-B706-69C0BDD6... 2014-02-08 10:01:36
501 997 1 2 31 100  75  343.6496 539.99 1  4  Road-750 Black, 44 BK-R19B-44 Black 44 CM LB 19.77 R L U 2013-05-30 00:00:00 NULL NULL 44CE4802-409F-43AB-9B27-CA534218... 2014-02-08 10:01:36
502 998 1 2 31 100  75  343.6496 539.99 1  4  Road-750 Black, 48 BK-R19B-48 Black 48 CM LB 20.13 R L U 2013-05-30 00:00:00 NULL NULL 3DE9A212-1D49-40B6-B10A-F564D981... 2014-02-08 10:01:36
503 999 1 2 31 100  75  343.6496 539.99 1  4  Road-750 Black, 52 BK-R19B-52 Black 52 CM LB 20.42 R L U 2013-05-30 00:00:00 NULL NULL AE638923-2B67-4679-B90E-ABBAB17D... 2014-02-08 10:01:36

504 rows x 25 columns
memory usage: 0.17 MB
name: Product
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/Product/

In [12]:
sales_order_detail.set_role(["SalesOrderID", "SalesOrderDetailID", "ProductID", "SpecialOfferID"], getml.data.roles.join_key)
sales_order_detail.set_role(["OrderQty", "UnitPrice", "UnitPriceDiscount", "LineTotal"], getml.data.roles.numerical)
sales_order_detail.set_role("ModifiedDate", getml.data.roles.time_stamp)

sales_order_detail
Out[12]:
name ModifiedDate SalesOrderID SalesOrderDetailID ProductID SpecialOfferID OrderQty UnitPrice UnitPriceDiscount LineTotal CarrierTrackingNumber rowguid
role time_stamp join_key join_key join_key join_key numerical numerical numerical numerical unused_string unused_string
unit time stamp, comparison only
0 2011-05-31 43659 1 776 1 1  2024.994 0  2024.994 4911-403C-98 B207C96D-D9E6-402B-8470-2CC176C4...
1 2011-05-31 43659 2 777 1 3  2024.994 0  6074.982 4911-403C-98 7ABB600D-1E77-41BE-9FE5-B9142CFC...
2 2011-05-31 43659 3 778 1 1  2024.994 0  2024.994 4911-403C-98 475CF8C6-49F6-486E-B0AD-AFC6A50C...
3 2011-05-31 43659 4 771 1 1  2039.994 0  2039.994 4911-403C-98 04C4DE91-5815-45D6-8670-F462719F...
4 2011-05-31 43659 5 772 1 1  2039.994 0  2039.994 4911-403C-98 5A74C7D2-E641-438E-A7AC-37BF2328...
... ... ... ... ... ...  ...  ...  ...  ... ...
121312 2014-06-30 75122 121313 878 1 1  21.98 0  21.98 NULL 8CAD6675-18CC-4F47-8287-97B41A8E...
121313 2014-06-30 75122 121314 712 1 1  8.99 0  8.99 NULL 84F1C363-1C50-4442-BE16-541C59B6...
121314 2014-06-30 75123 121315 878 1 1  21.98 0  21.98 NULL C18B6476-429F-4BB1-828E-2BE5F82A...
121315 2014-06-30 75123 121316 879 1 1  159  0  159  NULL 75A89C6A-C60A-47EA-8A52-B52A9C43...
121316 2014-06-30 75123 121317 712 1 1  8.99 0  8.99 NULL 73646D26-0461-450D-8019-2C6C8586...

121317 rows x 11 columns
memory usage: 14.32 MB
name: SalesOrderDetail
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SalesOrderDetail/

In [13]:
sales_order_reason.set_role("SalesOrderID", getml.data.roles.join_key)
sales_order_reason.set_role("SalesReasonID", getml.data.roles.categorical)

sales_order_reason
Out[13]:
name SalesOrderID SalesReasonID ModifiedDate
role join_key categorical unused_string
0 43697 5 2011-05-31 00:00:00
1 43697 9 2011-05-31 00:00:00
2 43702 5 2011-06-01 00:00:00
3 43702 9 2011-06-01 00:00:00
4 43703 5 2011-06-01 00:00:00
... ... ...
27642 75119 1 2014-06-30 00:00:00
27643 75120 1 2014-06-30 00:00:00
27644 75121 1 2014-06-30 00:00:00
27645 75122 1 2014-06-30 00:00:00
27646 75123 1 2014-06-30 00:00:00

27647 rows x 3 columns
memory usage: 1.00 MB
name: SalesOrderHeaderSalesReason
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SalesOrderHeaderSalesReason/

In [14]:
special_offer.set_role(["SpecialOfferID"], getml.data.roles.join_key)
special_offer.set_role(["MinQty", "DiscountPct"], getml.data.roles.numerical)
special_offer.set_role(["Category", "Description", "Type"], getml.data.roles.categorical)
special_offer.set_role(["StartDate", "EndDate"], getml.data.roles.time_stamp)

special_offer
Out[14]:
name StartDate EndDate SpecialOfferID Category Description Type MinQty DiscountPct MaxQty rowguid ModifiedDate
role time_stamp time_stamp join_key categorical categorical categorical numerical numerical unused_float unused_string unused_string
unit time stamp, comparison only time stamp, comparison only
0 2011-05-01 2014-11-30 1 No Discount No Discount No Discount 0  0  nan  0290C4F5-191F-4337-AB6B-0A2DDE03... 2011-04-01 00:00:00
1 2011-05-31 2014-05-30 2 Reseller Volume Discount 11 to 14 Volume Discount 11  0.02 14  D7542EE7-15DB-4541-985C-5CC27AEF... 2011-05-01 00:00:00
2 2011-05-31 2014-05-30 3 Reseller Volume Discount 15 to 24 Volume Discount 15  0.05 24  4BDBCC01-8CF7-40A9-B643-40EC5B71... 2011-05-01 00:00:00
3 2011-05-31 2014-05-30 4 Reseller Volume Discount 25 to 40 Volume Discount 25  0.1 40  504B5E85-8F3F-4EBC-9E1D-C1BC5DEA... 2011-05-01 00:00:00
4 2011-05-31 2014-05-30 5 Reseller Volume Discount 41 to 60 Volume Discount 41  0.15 60  677E1D9D-944F-4E81-90E8-47EB0A82... 2011-05-01 00:00:00
... ... ... ... ... ... ...  ...  ...  ... ...
11 2013-05-30 2013-07-14 12 Reseller LL Road Frame Sale Excess Inventory 0  0.35 nan  C0AF1C89-9722-4235-9248-3FBA4D9E... 2013-04-30 00:00:00
12 2013-05-30 2013-08-29 13 Reseller Touring-3000 Promotion New Product 0  0.15 nan  5061CCE4-E021-45A8-9A75-DFB36CBB... 2013-04-30 00:00:00
13 2013-05-30 2013-08-29 14 Reseller Touring-1000 Promotion New Product 0  0.2 nan  1AF84A9E-A98C-4BD9-B48F-DC2B8B6B... 2013-04-30 00:00:00
14 2013-07-14 2013-08-14 15 Customer Half-Price Pedal Sale Seasonal Discount 0  0.5 nan  03E3594D-6EBB-46A6-B8EE-A9289C0C... 2013-06-14 00:00:00
15 2014-03-31 2014-05-30 16 Reseller Mountain-500 Silver Clearance Sa... Discontinued Product 0  0.4 nan  EB7CB484-BCCF-4D2D-BF73-521B2001... 2014-03-01 00:00:00

16 rows x 11 columns
memory usage: 0.00 MB
name: SpecialOffer
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SpecialOffer/

In [15]:
store.set_role(["SalesPersonID"], getml.data.roles.join_key)

store["test"] = store["ModifiedDate"].update(getml.data.random() > 0.5, "NULL")

store.set_role(["SalesPersonID"], getml.data.roles.join_key)
store.set_role(["test"], getml.data.roles.time_stamp)

store
Out[15]:
name test SalesPersonID BusinessEntityID Name Demographics rowguid ModifiedDate
role time_stamp join_key unused_float unused_string unused_string unused_string unused_string
unit time stamp, comparison only
0 2014-09-12 11:15:07 279 292  Next-Door Bike Store <StoreSurvey xmlns="http://schem... A22517E3-848D-4EBE-B9D9-7437F343... 2014-09-12 11:15:07
1 NULL 276 294  Professional Sales and Service <StoreSurvey xmlns="http://schem... B50CA50B-C601-4A13-B07E-2C63862D... 2014-09-12 11:15:07
2 2014-09-12 11:15:07 277 296  Riders Company <StoreSurvey xmlns="http://schem... 337C3688-1339-4E1A-A08A-B54B2356... 2014-09-12 11:15:07
3 NULL 275 298  The Bike Mechanics <StoreSurvey xmlns="http://schem... 7894F278-F0C8-4D16-BD75-213FDBF1... 2014-09-12 11:15:07
4 NULL 286 300  Nationwide Supply <StoreSurvey xmlns="http://schem... C3FC9705-A8C4-4F3A-9550-EB2FA4B7... 2014-09-12 11:15:07
... ... ...  ... ... ... ...
696 NULL 282 1988  Retreat Inn <StoreSurvey xmlns="http://schem... EA21EC81-1BFA-4A07-9B4D-73D9852A... 2014-09-12 11:15:07
697 2014-09-12 11:15:07 281 1990  Technical Parts Manufacturing <StoreSurvey xmlns="http://schem... C8E3C4ED-8F58-4DB2-B600-E0CD11D9... 2014-09-12 11:15:07
698 NULL 277 1992  Totes & Baskets Company <StoreSurvey xmlns="http://schem... CE860B58-643C-4567-BFD8-06E97969... 2014-09-12 11:15:07
699 NULL 277 1994  World of Bikes <StoreSurvey xmlns="http://schem... 0C10F2B6-A13A-440C-9C25-5B28D482... 2014-09-12 11:15:07
700 2014-09-12 11:15:07 275 2051  A Bicycle Association <StoreSurvey xmlns="http://schem... 82237172-D3FE-4A95-82EF-636F6552... 2014-09-12 11:15:07

701 rows x 7 columns
memory usage: 0.38 MB
name: Store
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/Store/

In [16]:
store.to_pandas()
Out[16]:
test SalesPersonID BusinessEntityID Name Demographics rowguid ModifiedDate
0 2014-09-12 11:15:07 279 292.0 Next-Door Bike Store <StoreSurvey xmlns="http://schemas.microsoft.c... A22517E3-848D-4EBE-B9D9-7437F3432304 2014-09-12 11:15:07
1 NaT 276 294.0 Professional Sales and Service <StoreSurvey xmlns="http://schemas.microsoft.c... B50CA50B-C601-4A13-B07E-2C63862D71B4 2014-09-12 11:15:07
2 2014-09-12 11:15:07 277 296.0 Riders Company <StoreSurvey xmlns="http://schemas.microsoft.c... 337C3688-1339-4E1A-A08A-B54B23566E49 2014-09-12 11:15:07
3 NaT 275 298.0 The Bike Mechanics <StoreSurvey xmlns="http://schemas.microsoft.c... 7894F278-F0C8-4D16-BD75-213FDBF13023 2014-09-12 11:15:07
4 NaT 286 300.0 Nationwide Supply <StoreSurvey xmlns="http://schemas.microsoft.c... C3FC9705-A8C4-4F3A-9550-EB2FA4B7B64D 2014-09-12 11:15:07
... ... ... ... ... ... ... ...
696 NaT 282 1988.0 Retreat Inn <StoreSurvey xmlns="http://schemas.microsoft.c... EA21EC81-1BFA-4A07-9B4D-73D9852AFCBF 2014-09-12 11:15:07
697 2014-09-12 11:15:07 281 1990.0 Technical Parts Manufacturing <StoreSurvey xmlns="http://schemas.microsoft.c... C8E3C4ED-8F58-4DB2-B600-E0CD11D9CFAD 2014-09-12 11:15:07
698 NaT 277 1992.0 Totes & Baskets Company <StoreSurvey xmlns="http://schemas.microsoft.c... CE860B58-643C-4567-BFD8-06E97969CC67 2014-09-12 11:15:07
699 NaT 277 1994.0 World of Bikes <StoreSurvey xmlns="http://schemas.microsoft.c... 0C10F2B6-A13A-440C-9C25-5B28D482872A 2014-09-12 11:15:07
700 2014-09-12 11:15:07 275 2051.0 A Bicycle Association <StoreSurvey xmlns="http://schemas.microsoft.c... 82237172-D3FE-4A95-82EF-636F655240AD 2014-09-12 11:15:07

701 rows × 7 columns

In [17]:
sales_order_header["SalesPersonIDCat"] = sales_order_header["SalesPersonID"]
sales_order_header["TerritoryIDCat"] = sales_order_header["TerritoryID"]

sales_order_header.set_role(["CustomerID", "SalesOrderID", "SalesPersonID", "TerritoryID"], getml.data.roles.join_key)
sales_order_header.set_role(
    ["RevisionNumber", "OnlineOrderFlag", "SalesPersonIDCat", "TerritoryIDCat", "ShipMethodID"], 
    getml.data.roles.categorical)
sales_order_header.set_role(["SubTotal", "TaxAmt", "Freight", "TotalDue"], getml.data.roles.numerical)
sales_order_header.set_role(["OrderDate", "DueDate", "ShipDate", "ModifiedDate"], getml.data.roles.time_stamp)

sales_order_header
Out[17]:
name OrderDate DueDate ShipDate ModifiedDate CustomerID SalesOrderID SalesPersonID TerritoryID RevisionNumber OnlineOrderFlag SalesPersonIDCat TerritoryIDCat ShipMethodID SubTotal TaxAmt Freight TotalDue Status BillToAddressID ShipToAddressID CreditCardID CurrencyRateID SalesOrderNumber PurchaseOrderNumber AccountNumber CreditCardApprovalCode Comment rowguid
role time_stamp time_stamp time_stamp time_stamp join_key join_key join_key join_key categorical categorical categorical categorical categorical numerical numerical numerical numerical unused_float unused_float unused_float unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string
unit time stamp, comparison only time stamp, comparison only time stamp, comparison only time stamp, comparison only
0 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29825 43659 279 5 8 0 279 5 5 20565.6206 1971.5149 616.0984 23153.2339 5  985  985  16281  nan  SO43659 PO522145787 10-4020-000676 105041Vi84182 NULL 79B65321-39CA-4115-9CBA-8FE0903E...
1 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29672 43660 279 5 8 0 279 5 5 1294.2529 124.2483 38.8276 1457.3288 5  921  921  5618  nan  SO43660 PO18850127500 10-4020-000117 115213Vi29411 NULL 738DC42D-D03B-48A1-9822-F95A67EA...
2 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29734 43661 282 6 8 0 282 6 5 32726.4786 3153.7696 985.553 36865.8012 5  517  517  1346  4  SO43661 PO18473189620 10-4020-000442 85274Vi6854 NULL D91B9131-18A4-4A11-BC3A-90B6F53E...
3 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29994 43662 282 6 8 0 282 6 5 28832.5289 2775.1646 867.2389 32474.9324 5  482  482  10456  4  SO43662 PO18444174044 10-4020-000227 125295Vi53935 NULL 4A1ECFC0-CC3A-4740-B028-1C50BB48...
4 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29565 43663 276 4 8 0 276 4 5 419.4589 40.2681 12.5838 472.3108 5  1073  1073  4322  nan  SO43663 PO18009186470 10-4020-000510 45303Vi22691 NULL 9B1E7A40-6AE0-4AD3-811C-A6495185...
... ... ... ... ... ... ... ... ... ... ... ... ... ...  ...  ...  ...  ...  ...  ...  ...  ...  ... ... ... ... ... ...
31460 2014-06-30 2014-07-12 2014-07-07 2014-07-07 11981 75119 NULL 1 8 1 NULL 1 1 42.28 3.3824 1.057 46.7194 5  17649  17649  6761  nan  SO75119 NULL 10-4030-011981 429826Vi35166 NULL 9382F1C9-383A-435F-9449-0EECEA21...
31461 2014-06-30 2014-07-12 2014-07-07 2014-07-07 18749 75120 NULL 6 8 1 NULL 6 1 84.96 6.7968 2.124 93.8808 5  28374  28374  8925  nan  SO75120 NULL 10-4030-018749 929849Vi46003 NULL AE6A4FCF-FF73-4CD4-AF2C-5993D00D...
31462 2014-06-30 2014-07-12 2014-07-07 2014-07-07 15251 75121 NULL 6 8 1 NULL 6 1 74.98 5.9984 1.8745 82.8529 5  26553  26553  14220  nan  SO75121 NULL 10-4030-015251 529864Vi73738 NULL D7395C0E-00CB-4BFA-A238-0D6A9F49...
31463 2014-06-30 2014-07-12 2014-07-07 2014-07-07 15868 75122 NULL 6 8 1 NULL 6 1 30.97 2.4776 0.7743 34.2219 5  14616  14616  18719  nan  SO75122 NULL 10-4030-015868 330022Vi97312 NULL 4221035A-4159-492F-AF40-4363A64F...
31464 2014-06-30 2014-07-12 2014-07-07 2014-07-07 18759 75123 NULL 6 8 1 NULL 6 1 189.97 15.1976 4.7493 209.9169 5  14024  14024  10084  nan  SO75123 NULL 10-4030-018759 230370Vi51970 NULL D54752FF-2B54-4BE5-95EA-3B72289C...

31465 rows x 28 columns
memory usage: 8.58 MB
name: SalesOrderHeader
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SalesOrderHeader/

We must also define customer churn. In this case, we define customer churn as a customer not making another purchase within 180 days of his or her last purchase.

Thus, the churn variable is defined as follows:

  • 0, if another purchase by the same customer has been made within 180 days after OrderDate
  • 1, if no purchase by the same customer has been made within 180 days after OrderDate
  • NULL, if max(OrderDate) - OrderDate <= 180 days

NULL targets can not be used in our analysis.

In [18]:
sales_order_header_pd = sales_order_header[["OrderDate", "CustomerID", "SalesOrderID"]].to_pandas()

repeat_purchases = sales_order_header_pd.merge(
    sales_order_header_pd[["OrderDate", "CustomerID"]],
    on="CustomerID",
    how="left",
)

repeat_purchases = repeat_purchases[
    repeat_purchases["OrderDate_y"] > repeat_purchases["OrderDate_x"]
]

repeat_purchases = repeat_purchases[
    repeat_purchases["OrderDate_y"] - repeat_purchases["OrderDate_x"] > pd.Timedelta('180 days')
]

repeat_purchases.groupby("SalesOrderID", as_index=False).aggregate({"CustomerID": "max"})

repeat_purchase_ids = {sid: True for sid in repeat_purchases["SalesOrderID"]}

cut_off_date = max(sales_order_header_pd["OrderDate"]) - pd.Timedelta('180 days')

churn = np.asarray([
    np.nan if order_date >= cut_off_date else 0 if order_id in repeat_purchase_ids else 1 
    for (order_date, order_id) in zip(sales_order_header_pd["OrderDate"], sales_order_header_pd["SalesOrderID"])
])

sales_order_header["churn"] = churn

sales_order_header = sales_order_header[~sales_order_header.churn.is_nan()].to_df("SalesOrderHeaderRefined")

sales_order_header.set_role("churn", getml.data.roles.target)

sales_order_header
Out[18]:
name OrderDate DueDate ShipDate ModifiedDate CustomerID SalesOrderID SalesPersonID TerritoryID churn RevisionNumber OnlineOrderFlag SalesPersonIDCat TerritoryIDCat ShipMethodID SubTotal TaxAmt Freight TotalDue Status BillToAddressID ShipToAddressID CreditCardID CurrencyRateID SalesOrderNumber PurchaseOrderNumber AccountNumber CreditCardApprovalCode Comment rowguid
role time_stamp time_stamp time_stamp time_stamp join_key join_key join_key join_key target categorical categorical categorical categorical categorical numerical numerical numerical numerical unused_float unused_float unused_float unused_float unused_float unused_string unused_string unused_string unused_string unused_string unused_string
unit time stamp, comparison only time stamp, comparison only time stamp, comparison only time stamp, comparison only
0 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29825 43659 279 5 0  8 0 279 5 5 20565.6206 1971.5149 616.0984 23153.2339 5  985  985  16281  nan  SO43659 PO522145787 10-4020-000676 105041Vi84182 NULL 79B65321-39CA-4115-9CBA-8FE0903E...
1 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29672 43660 279 5 0  8 0 279 5 5 1294.2529 124.2483 38.8276 1457.3288 5  921  921  5618  nan  SO43660 PO18850127500 10-4020-000117 115213Vi29411 NULL 738DC42D-D03B-48A1-9822-F95A67EA...
2 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29734 43661 282 6 0  8 0 282 6 5 32726.4786 3153.7696 985.553 36865.8012 5  517  517  1346  4  SO43661 PO18473189620 10-4020-000442 85274Vi6854 NULL D91B9131-18A4-4A11-BC3A-90B6F53E...
3 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29994 43662 282 6 0  8 0 282 6 5 28832.5289 2775.1646 867.2389 32474.9324 5  482  482  10456  4  SO43662 PO18444174044 10-4020-000227 125295Vi53935 NULL 4A1ECFC0-CC3A-4740-B028-1C50BB48...
4 2011-05-31 2011-06-12 2011-06-07 2011-06-07 29565 43663 276 4 0  8 0 276 4 5 419.4589 40.2681 12.5838 472.3108 5  1073  1073  4322  nan  SO43663 PO18009186470 10-4020-000510 45303Vi22691 NULL 9B1E7A40-6AE0-4AD3-811C-A6495185...
... ... ... ... ... ... ... ... ...  ... ... ... ... ... ...  ...  ...  ...  ...  ...  ...  ...  ...  ... ... ... ... ... ...
19699 2013-12-31 2014-01-12 2014-01-07 2014-01-07 20826 63358 NULL 7 1  8 1 NULL 7 1 1173.96 93.9168 29.349 1297.2258 5  24387  24387  3239  nan  SO63358 NULL 10-4030-020826 1142084Vi17039 NULL 41278FBB-3DD8-488B-AEA5-8BF4A6F1...
19700 2013-12-31 2014-01-12 2014-01-07 2014-01-07 24114 63359 NULL 10 1  8 1 NULL 10 1 1179.47 94.3576 29.4868 1303.3144 5  29682  29682  nan  10770  SO63359 NULL 10-4030-024114 NULL NULL 749532D4-BFF7-4FEC-9F77-396C4A96...
19701 2013-12-31 2014-01-12 2014-01-07 2014-01-07 11417 63360 NULL 7 1  8 1 NULL 7 1 548.98 43.9184 13.7245 606.6229 5  21465  21465  6582  nan  SO63360 NULL 10-4030-011417 242387Vi34223 NULL F92FA2A3-73E1-4DD0-987C-99D1C87E...
19702 2013-12-31 2014-01-12 2014-01-07 2014-01-07 18125 63361 NULL 8 1  8 1 NULL 8 1 2384.07 190.7256 59.6018 2634.3974 5  26562  26562  11994  nan  SO63361 NULL 10-4030-018125 1242859Vi61993 NULL 3A3758BF-CDFA-4740-9104-87BD5A08...
19703 2013-12-31 2014-01-12 2014-01-07 2014-01-07 15692 63362 NULL 10 1  8 1 NULL 10 1 2419.06 193.5248 60.4765 2673.0613 5  27090  27090  16083  10770  SO63362 NULL 10-4030-015692 242864Vi83167 NULL 3151D31F-4020-41B2-87F3-B6732869...

19704 rows x 29 columns
memory usage: 5.54 MB
name: SalesOrderHeaderRefined
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/adventure_works/SalesOrderHeaderRefined/

In [19]:
split = getml.data.split.random(train=0.8, test=0.2)
split
Out[19]:
0 train
1 train
2 train
3 test
4 train
...

infinite number of rows
type: StringColumnView

In [20]:
container = getml.data.Container(population=sales_order_header, split=split)

container.add(
    product=product,
    sales_order_detail=sales_order_detail, 
    sales_order_header=sales_order_header, 
    sales_order_reason=sales_order_reason,
    special_offer=special_offer,
    store=store,
)

container
Out[20]:

population

subset name rows type
0 test SalesOrderHeaderRefined 3879 View
1 train SalesOrderHeaderRefined 15825 View

peripheral

alias name rows type
0 product Product 504 DataFrame
1 sales_order_detail SalesOrderDetail 121317 DataFrame
2 sales_order_header SalesOrderHeaderRefined 19704 DataFrame
3 sales_order_reason SalesOrderHeaderSalesReason 27647 DataFrame
4 special_offer SpecialOffer 16 DataFrame
5 store Store 701 DataFrame

2. Predictive modeling

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

2.1 Define relational model

In [21]:
dm = getml.data.DataModel(sales_order_header.to_placeholder("population"))

dm.add(getml.data.to_placeholder(
    product=product,
    sales_order_detail=sales_order_detail, 
    sales_order_header=sales_order_header, 
    sales_order_reason=sales_order_reason,
    special_offer=special_offer,
    store=store,
))


dm.population.join(
    dm.sales_order_header,
    on="CustomerID",
    time_stamps="OrderDate",
    lagged_targets=True,
    horizon=getml.data.time.days(1),
)

dm.population.join(
    dm.sales_order_detail,
    on="SalesOrderID",
)

dm.population.join(
    dm.sales_order_reason,
    on="SalesOrderID", 
)

dm.population.join(
    dm.store,
    on="SalesPersonID", 
)

dm.sales_order_detail.join(
    dm.product,
    on="ProductID",
    relationship=getml.data.relationship.many_to_one,
)

dm.sales_order_detail.join(
    dm.special_offer,
    on="SpecialOfferID",
    relationship=getml.data.relationship.many_to_one,
)

dm
Out[21]:

diagram


sales_order_headerproductspecial_offersales_order_detailsales_order_reasonstorepopulationProductID = ProductIDRelationship: many-to-oneSpecialOfferID = SpecialOfferIDRelationship: many-to-oneCustomerID = CustomerIDOrderDate <= OrderDateHorizon: 1.0 daysLagged targets allowedSalesOrderID = SalesOrderIDSalesOrderID = SalesOrderIDSalesPersonID = SalesPersonID


staging

data frames staging table
0 population POPULATION__STAGING_TABLE_1
1 sales_order_detail, product, special_offer SALES_ORDER_DETAIL__STAGING_TABLE_2
2 sales_order_header SALES_ORDER_HEADER__STAGING_TABLE_3
3 sales_order_reason SALES_ORDER_REASON__STAGING_TABLE_4
4 store STORE__STAGING_TABLE_5

2.2 getML pipeline

Set-up the feature learner & predictor

In [22]:
seasonal = getml.preprocessors.Seasonal()

mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,    
    num_features=400,
)

relboost = getml.feature_learning.Relboost(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,    
)

predictor = getml.predictors.XGBoostClassifier(n_jobs=1)

Build the pipeline

In [23]:
pipe1 = getml.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor],
    include_categorical=True,
)

pipe1
Out[23]:
Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=['product', 'sales_order_detail', 'sales_order_header',
                     'sales_order_reason', 'special_offer', 'store'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['fast_prop'])
In [24]:
pipe2 = getml.Pipeline(
    tags=['relboost'],
    data_model=dm,
    preprocessors=[seasonal, mapping],
    feature_learners=[relboost],
    predictors=[predictor],
    include_categorical=True,
)

pipe2
Out[24]:
Pipeline(data_model='population',
         feature_learners=['Relboost'],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=['product', 'sales_order_detail', 'sales_order_header',
                     'sales_order_reason', 'special_offer', 'store'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Seasonal', 'Mapping'],
         share_selected_features=0.5,
         tags=['relboost'])

2.3 Model training

In [25]:
pipe1.check(container.train)
Checking data model...

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%
[========================================] 100%

Checking...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and SALES_ORDER_REASON__STAGING_TABLE_4 over 'SalesOrderID' and 'SalesOrderID', there are no corresponding entries for 33.769352% of entries in 'SalesOrderID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and STORE__STAGING_TABLE_5 over 'SalesPersonID' and 'SalesPersonID', there are no corresponding entries for 84.941548% of entries in 'SalesPersonID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
In [26]:
pipe1.fit(container.train)
Checking data model...

Staging...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and SALES_ORDER_REASON__STAGING_TABLE_4 over 'SalesOrderID' and 'SalesOrderID', there are no corresponding entries for 33.769352% of entries in 'SalesOrderID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and STORE__STAGING_TABLE_5 over 'SalesPersonID' and 'SalesPersonID', there are no corresponding entries for 84.941548% of entries in 'SalesPersonID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%
[========================================] 100%

FastProp: Trying 710 features...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:1m:7.794911

Out[26]:
Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=['product', 'sales_order_detail', 'sales_order_header',
                     'sales_order_reason', 'special_offer', 'store'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Mapping'],
         share_selected_features=0.5,
         tags=['fast_prop', 'container-qlp0vY'])

url: http://localhost:1709/#/getpipeline/adventure_works/M49AEo/0/
In [27]:
pipe2.check(container.train)
Checking data model...

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%
[========================================] 100%

Checking...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and SALES_ORDER_REASON__STAGING_TABLE_4 over 'SalesOrderID' and 'SalesOrderID', there are no corresponding entries for 33.769352% of entries in 'SalesOrderID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and STORE__STAGING_TABLE_5 over 'SalesPersonID' and 'SalesPersonID', there are no corresponding entries for 84.941548% of entries in 'SalesPersonID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
In [28]:
pipe2.fit(container.train)
Checking data model...

Staging...
[========================================] 100%


INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and SALES_ORDER_REASON__STAGING_TABLE_4 over 'SalesOrderID' and 'SalesOrderID', there are no corresponding entries for 33.769352% of entries in 'SalesOrderID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and STORE__STAGING_TABLE_5 over 'SalesPersonID' and 'SalesPersonID', there are no corresponding entries for 84.941548% of entries in 'SalesPersonID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%
[========================================] 100%

Relboost: Training features...
[========================================] 100%

Relboost: Building features...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:1m:53.175772

Out[28]:
Pipeline(data_model='population',
         feature_learners=['Relboost'],
         feature_selectors=[],
         include_categorical=True,
         loss_function=None,
         peripheral=['product', 'sales_order_detail', 'sales_order_header',
                     'sales_order_reason', 'special_offer', 'store'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Seasonal', 'Mapping'],
         share_selected_features=0.5,
         tags=['relboost', 'container-qlp0vY'])

url: http://localhost:1709/#/getpipeline/adventure_works/xkgbq9/0/

2.4 Model evaluation

In [29]:
pipe1.score(container.test)
Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%
[========================================] 100%

FastProp: Building features...
[========================================] 100%


Out[29]:
date time set used target accuracy auc cross entropy
0 2021-08-23 16:07:18 train churn 0.9151 0.9742 0.2148
1 2021-08-23 16:09:19 test churn 0.9129 0.9712 0.2236
In [30]:
pipe2.score(container.test)
Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%
[========================================] 100%

Relboost: Building features...
[========================================] 100%


Out[30]:
date time set used target accuracy auc cross entropy
0 2021-08-23 16:09:16 train churn 0.9312 0.9832 0.1689
1 2021-08-23 16:09:23 test churn 0.9273 0.9776 0.1929

2.5 featuretools

In [31]:
population_train_pd = container.train.population.drop(container.train.population.roles.unused).to_pandas()
population_test_pd = container.test.population.drop(container.train.population.roles.unused).to_pandas()
In [32]:
product_pd = product.drop(product.roles.unused).to_pandas()
sales_order_detail_pd = sales_order_detail.drop(sales_order_detail.roles.unused).to_pandas()
sales_order_header_pd = sales_order_header.drop(sales_order_header.roles.unused).to_pandas()
sales_order_reason_pd = sales_order_reason.drop(sales_order_reason.roles.unused).to_pandas()
special_offer_pd = special_offer.drop(special_offer.roles.unused).to_pandas()
store_pd = store.drop(store.roles.unused).to_pandas()

featuretools does not support many-to-one joins. Therefore, we must manually merge sales_order_detail_pd, product_pd and special_offer_pd.

In [33]:
sales_order_detail_pd = sales_order_detail_pd.merge(
    product_pd,
    on="ProductID",
    how="left",
)

sales_order_detail_pd = sales_order_detail_pd.merge(
    special_offer_pd,
    on="SpecialOfferID",
    how="left",
)

del sales_order_detail_pd["SalesOrderDetailID"]
del sales_order_detail_pd["ProductID"]
del sales_order_detail_pd["SpecialOfferID"]

sales_order_detail_pd
Out[33]:
ModifiedDate SalesOrderID OrderQty UnitPrice UnitPriceDiscount LineTotal MakeFlag ProductSubcategoryID ProductModelID SafetyStockLevel ReorderPoint StandardCost ListPrice StartDate EndDate Category Description Type MinQty DiscountPct
0 2011-05-31 43659 1.0 2024.994 0.0 2024.994 1 1 19 100.0 75.0 1898.0944 3374.99 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
1 2011-05-31 43659 3.0 2024.994 0.0 6074.982 1 1 19 100.0 75.0 1898.0944 3374.99 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
2 2011-05-31 43659 1.0 2024.994 0.0 2024.994 1 1 19 100.0 75.0 1898.0944 3374.99 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
3 2011-05-31 43659 1.0 2039.994 0.0 2039.994 1 1 19 100.0 75.0 1912.1544 3399.99 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
4 2011-05-31 43659 1.0 2039.994 0.0 2039.994 1 1 19 100.0 75.0 1912.1544 3399.99 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
121312 2014-06-30 75122 1.0 21.980 0.0 21.980 0 30 121 4.0 3.0 8.2205 21.98 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
121313 2014-06-30 75122 1.0 8.990 0.0 8.990 0 19 2 4.0 3.0 6.9223 8.99 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
121314 2014-06-30 75123 1.0 21.980 0.0 21.980 0 30 121 4.0 3.0 8.2205 21.98 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
121315 2014-06-30 75123 1.0 159.000 0.0 159.000 0 27 122 4.0 3.0 59.4660 159.00 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0
121316 2014-06-30 75123 1.0 8.990 0.0 8.990 0 19 2 4.0 3.0 6.9223 8.99 2011-05-01 2014-11-30 No Discount No Discount No Discount 0.0 0.0

121317 rows × 20 columns

In [34]:
def prepare_sales_order_header(peripheral_pd, train_or_test):
    """
    Helper function that imitates the behavior of 
    the data model defined above.
    """
    peripheral_new = peripheral_pd.merge(
        train_or_test[["CustomerID", "OrderDate", "SalesOrderID"]],
        on="CustomerID"
    )

    peripheral_new = peripheral_new[
        peripheral_new["OrderDate_x"] < peripheral_new["OrderDate_y"]
    ]
    
    del peripheral_new["SalesOrderID_x"]
    del peripheral_new["OrderDate_y"]
    del peripheral_new["CustomerID"]
    del peripheral_new["SalesPersonIDCat"]
    del peripheral_new["TerritoryIDCat"]

    return peripheral_new.rename(columns={"OrderDate_x": "OrderDate", "SalesOrderID_y": "SalesOrderID"})
In [35]:
def prepare_store(peripheral_pd, train_or_test):
    """
    Helper function that imitates the behavior of 
    the data model defined above.
    """
    peripheral_new = peripheral_pd.merge(
        train_or_test[["SalesPersonID", "SalesOrderID"]],
        on="SalesPersonID"
    )
    
    return peripheral_new
In [36]:
store_train_pd = prepare_store(store_pd, population_train_pd)
store_test_pd = prepare_store(store_pd, population_test_pd)
store_train_pd
Out[36]:
test SalesPersonID SalesOrderID
0 2014-09-12 11:15:07 279 43659
1 2014-09-12 11:15:07 279 43660
2 2014-09-12 11:15:07 279 43681
3 2014-09-12 11:15:07 279 43685
4 2014-09-12 11:15:07 279 43695
... ... ... ...
142427 NaT 290 63216
142428 NaT 290 63217
142429 NaT 290 63223
142430 NaT 290 63282
142431 NaT 290 63284

142432 rows × 3 columns

In [37]:
sales_order_header_train_pd = prepare_sales_order_header(sales_order_header_pd, population_train_pd)
sales_order_header_test_pd = prepare_sales_order_header(sales_order_header_pd, population_test_pd)
sales_order_header_train_pd
Out[37]:
OrderDate DueDate ShipDate ModifiedDate SalesPersonID TerritoryID churn RevisionNumber OnlineOrderFlag ShipMethodID SubTotal TaxAmt Freight TotalDue SalesOrderID
1 2011-05-31 2011-06-12 2011-06-07 2011-06-07 279 5 0.0 8 0 5 20565.6206 1971.5149 616.0984 23153.2339 44305
2 2011-05-31 2011-06-12 2011-06-07 2011-06-07 279 5 0.0 8 0 5 20565.6206 1971.5149 616.0984 23153.2339 45061
3 2011-05-31 2011-06-12 2011-06-07 2011-06-07 279 5 0.0 8 0 5 20565.6206 1971.5149 616.0984 23153.2339 45779
4 2011-05-31 2011-06-12 2011-06-07 2011-06-07 279 5 0.0 8 0 5 20565.6206 1971.5149 616.0984 23153.2339 46604
5 2011-05-31 2011-06-12 2011-06-07 2011-06-07 279 5 0.0 8 0 5 20565.6206 1971.5149 616.0984 23153.2339 47693
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
39894 2013-12-15 2013-12-27 2013-12-22 2013-12-22 NULL 6 1.0 8 1 1 32.6000 2.6080 0.8150 36.0230 62799
39908 2013-12-13 2013-12-25 2013-12-20 2013-12-20 NULL 4 1.0 8 1 1 42.2800 3.3824 1.0570 46.7194 62770
39952 2013-12-15 2013-12-27 2013-12-22 2013-12-22 NULL 1 1.0 8 1 1 53.9900 4.3192 1.3498 59.6590 62796
40080 2013-12-19 2013-12-31 2013-12-26 2013-12-26 NULL 4 1.0 8 1 1 35.0000 2.8000 0.8750 38.6750 62643
40101 2013-12-20 2014-01-01 2013-12-27 2013-12-27 NULL 10 1.0 8 1 1 12.9400 1.0352 0.3235 14.2987 62914

12271 rows × 15 columns

In [38]:
del population_train_pd["CustomerID"]
del population_train_pd["SalesPersonIDCat"]

del population_test_pd["CustomerID"]
del population_test_pd["SalesPersonIDCat"]

featuretools's entities are similar for getML's container class.

In [39]:
entities_train = {
    "population" : (population_train_pd, "SalesOrderID"),
    "sales_order_header": (sales_order_header_train_pd, "index"),
    "sales_order_detail": (sales_order_detail_pd, "index"),
    "sales_order_reason": (sales_order_reason_pd, "index"),
    "store": (store_train_pd, "index"),
}
In [40]:
entities_test = {
    "population" : (population_test_pd, "SalesOrderID"),
    "sales_order_header": (sales_order_header_test_pd, "index"),
    "sales_order_detail": (sales_order_detail_pd, "index"),
    "sales_order_reason": (sales_order_reason_pd, "index"),
    "store": (store_test_pd, "index"),
}

featuretools's relationships are similar for getML's data model.

In [41]:
relationships = [
    ("population", "SalesOrderID", "sales_order_header", "SalesOrderID"),
    ("population", "SalesOrderID", "sales_order_detail", "SalesOrderID"),
    ("population", "SalesOrderID", "sales_order_reason", "SalesOrderID"),
    ("population", "SalesOrderID", "store", "SalesOrderID"),
]
In [42]:
featuretools_train_pd = featuretools.dfs(
    entities=entities_train,
    relationships=relationships,
    target_entity="population")[0]
/home/patrick/.local/lib/python3.9/site-packages/featuretools/entityset/entity.py:462: UserWarning: index index not found in dataframe, creating new integer column
  warnings.warn("index {} not found in dataframe, creating new "
In [43]:
featuretools_test_pd = featuretools.dfs(
    entities=entities_test,
    relationships=relationships,
    target_entity="population")[0]
In [44]:
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
In [45]:
featuretools_train.set_role("churn", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)

featuretools_train
91.014218% of all entries of column 'SKEW(sales_order_header.Freight)' are NULL values.
91.014218% of all entries of column 'SKEW(sales_order_header.SubTotal)' are NULL values.
91.014218% of all entries of column 'SKEW(sales_order_header.TaxAmt)' are NULL values.
91.014218% of all entries of column 'SKEW(sales_order_header.TotalDue)' are NULL values.
91.014218% of all entries of column 'SKEW(sales_order_header.churn)' are NULL values.
Out[45]:
name churn SalesPersonID TerritoryID RevisionNumber OnlineOrderFlag TerritoryIDCat ShipMethodID MODE(sales_order_header.OnlineOrderFlag) MODE(sales_order_header.RevisionNumber) MODE(sales_order_header.SalesPersonID) MODE(sales_order_header.ShipMethodID) MODE(sales_order_header.TerritoryID) MODE(sales_order_detail.Category) MODE(sales_order_detail.Description) MODE(sales_order_detail.MakeFlag) MODE(sales_order_detail.ProductModelID) MODE(sales_order_detail.ProductSubcategoryID) MODE(sales_order_detail.Type) MODE(sales_order_reason.SalesReasonID) MODE(store.SalesPersonID) SubTotal TaxAmt Freight TotalDue COUNT(sales_order_header) MAX(sales_order_header.Freight) MAX(sales_order_header.SubTotal) MAX(sales_order_header.TaxAmt) MAX(sales_order_header.TotalDue) MAX(sales_order_header.churn) MEAN(sales_order_header.Freight) MEAN(sales_order_header.SubTotal) MEAN(sales_order_header.TaxAmt) MEAN(sales_order_header.TotalDue) MEAN(sales_order_header.churn) MIN(sales_order_header.Freight) MIN(sales_order_header.SubTotal) MIN(sales_order_header.TaxAmt) MIN(sales_order_header.TotalDue) MIN(sales_order_header.churn) NUM_UNIQUE(sales_order_header.OnlineOrderFlag) NUM_UNIQUE(sales_order_header.RevisionNumber) NUM_UNIQUE(sales_order_header.SalesPersonID) NUM_UNIQUE(sales_order_header.ShipMethodID) NUM_UNIQUE(sales_order_header.TerritoryID) SKEW(sales_order_header.Freight) SKEW(sales_order_header.SubTotal) SKEW(sales_order_header.TaxAmt) SKEW(sales_order_header.TotalDue) SKEW(sales_order_header.churn) STD(sales_order_header.Freight) STD(sales_order_header.SubTotal) STD(sales_order_header.TaxAmt) STD(sales_order_header.TotalDue) STD(sales_order_header.churn) SUM(sales_order_header.Freight) SUM(sales_order_header.SubTotal) SUM(sales_order_header.TaxAmt) SUM(sales_order_header.TotalDue) SUM(sales_order_header.churn) COUNT(sales_order_detail) MAX(sales_order_detail.DiscountPct) MAX(sales_order_detail.LineTotal) MAX(sales_order_detail.ListPrice) MAX(sales_order_detail.MinQty) MAX(sales_order_detail.OrderQty) MAX(sales_order_detail.ReorderPoint) MAX(sales_order_detail.SafetyStockLevel) MAX(sales_order_detail.StandardCost) MAX(sales_order_detail.UnitPrice) MAX(sales_order_detail.UnitPriceDiscount) MEAN(sales_order_detail.DiscountPct) MEAN(sales_order_detail.LineTotal) MEAN(sales_order_detail.ListPrice) MEAN(sales_order_detail.MinQty) MEAN(sales_order_detail.OrderQty) MEAN(sales_order_detail.ReorderPoint) MEAN(sales_order_detail.SafetyStockLevel) MEAN(sales_order_detail.StandardCost) MEAN(sales_order_detail.UnitPrice) MEAN(sales_order_detail.UnitPriceDiscount) MIN(sales_order_detail.DiscountPct) MIN(sales_order_detail.LineTotal) MIN(sales_order_detail.ListPrice) MIN(sales_order_detail.MinQty) MIN(sales_order_detail.OrderQty) MIN(sales_order_detail.ReorderPoint) MIN(sales_order_detail.SafetyStockLevel) MIN(sales_order_detail.StandardCost) MIN(sales_order_detail.UnitPrice) MIN(sales_order_detail.UnitPriceDiscount) NUM_UNIQUE(sales_order_detail.Category) NUM_UNIQUE(sales_order_detail.Description) NUM_UNIQUE(sales_order_detail.MakeFlag) NUM_UNIQUE(sales_order_detail.ProductModelID) NUM_UNIQUE(sales_order_detail.ProductSubcategoryID) NUM_UNIQUE(sales_order_detail.Type) SKEW(sales_order_detail.DiscountPct) SKEW(sales_order_detail.LineTotal) SKEW(sales_order_detail.ListPrice) SKEW(sales_order_detail.MinQty) SKEW(sales_order_detail.OrderQty) SKEW(sales_order_detail.ReorderPoint) SKEW(sales_order_detail.SafetyStockLevel) SKEW(sales_order_detail.StandardCost) SKEW(sales_order_detail.UnitPrice) SKEW(sales_order_detail.UnitPriceDiscount) STD(sales_order_detail.DiscountPct) STD(sales_order_detail.LineTotal) STD(sales_order_detail.ListPrice) STD(sales_order_detail.MinQty) STD(sales_order_detail.OrderQty) STD(sales_order_detail.ReorderPoint) STD(sales_order_detail.SafetyStockLevel) STD(sales_order_detail.StandardCost) STD(sales_order_detail.UnitPrice) STD(sales_order_detail.UnitPriceDiscount) SUM(sales_order_detail.DiscountPct) SUM(sales_order_detail.LineTotal) SUM(sales_order_detail.ListPrice) SUM(sales_order_detail.MinQty) SUM(sales_order_detail.OrderQty) SUM(sales_order_detail.ReorderPoint) SUM(sales_order_detail.SafetyStockLevel) SUM(sales_order_detail.StandardCost) SUM(sales_order_detail.UnitPrice) SUM(sales_order_detail.UnitPriceDiscount) COUNT(sales_order_reason) NUM_UNIQUE(sales_order_reason.SalesReasonID) COUNT(store) NUM_UNIQUE(store.SalesPersonID) DAY(DueDate) DAY(ModifiedDate) DAY(OrderDate) DAY(ShipDate) MONTH(DueDate) MONTH(ModifiedDate) MONTH(OrderDate) MONTH(ShipDate) WEEKDAY(DueDate) WEEKDAY(ModifiedDate) WEEKDAY(OrderDate) WEEKDAY(ShipDate) YEAR(DueDate) YEAR(ModifiedDate) YEAR(OrderDate) YEAR(ShipDate) MODE(sales_order_header.DAY(DueDate)) MODE(sales_order_header.DAY(ModifiedDate)) MODE(sales_order_header.DAY(OrderDate)) MODE(sales_order_header.DAY(ShipDate)) MODE(sales_order_header.MONTH(DueDate)) MODE(sales_order_header.MONTH(ModifiedDate)) MODE(sales_order_header.MONTH(OrderDate)) MODE(sales_order_header.MONTH(ShipDate)) MODE(sales_order_header.WEEKDAY(DueDate)) MODE(sales_order_header.WEEKDAY(ModifiedDate)) MODE(sales_order_header.WEEKDAY(OrderDate)) MODE(sales_order_header.WEEKDAY(ShipDate)) MODE(sales_order_header.YEAR(DueDate)) MODE(sales_order_header.YEAR(ModifiedDate)) MODE(sales_order_header.YEAR(OrderDate)) MODE(sales_order_header.YEAR(ShipDate)) NUM_UNIQUE(sales_order_header.DAY(DueDate)) NUM_UNIQUE(sales_order_header.DAY(ModifiedDate)) NUM_UNIQUE(sales_order_header.DAY(OrderDate)) NUM_UNIQUE(sales_order_header.DAY(ShipDate)) NUM_UNIQUE(sales_order_header.MONTH(DueDate)) NUM_UNIQUE(sales_order_header.MONTH(ModifiedDate)) NUM_UNIQUE(sales_order_header.MONTH(OrderDate)) NUM_UNIQUE(sales_order_header.MONTH(ShipDate)) NUM_UNIQUE(sales_order_header.WEEKDAY(DueDate)) NUM_UNIQUE(sales_order_header.WEEKDAY(ModifiedDate)) NUM_UNIQUE(sales_order_header.WEEKDAY(OrderDate)) NUM_UNIQUE(sales_order_header.WEEKDAY(ShipDate)) NUM_UNIQUE(sales_order_header.YEAR(DueDate)) NUM_UNIQUE(sales_order_header.YEAR(ModifiedDate)) NUM_UNIQUE(sales_order_header.YEAR(OrderDate)) NUM_UNIQUE(sales_order_header.YEAR(ShipDate)) MODE(sales_order_detail.DAY(EndDate)) MODE(sales_order_detail.DAY(ModifiedDate)) MODE(sales_order_detail.DAY(StartDate)) MODE(sales_order_detail.MONTH(EndDate)) MODE(sales_order_detail.MONTH(ModifiedDate)) MODE(sales_order_detail.MONTH(StartDate)) MODE(sales_order_detail.WEEKDAY(EndDate)) MODE(sales_order_detail.WEEKDAY(ModifiedDate)) MODE(sales_order_detail.WEEKDAY(StartDate)) MODE(sales_order_detail.YEAR(EndDate)) MODE(sales_order_detail.YEAR(ModifiedDate)) MODE(sales_order_detail.YEAR(StartDate)) NUM_UNIQUE(sales_order_detail.DAY(EndDate)) NUM_UNIQUE(sales_order_detail.DAY(ModifiedDate)) NUM_UNIQUE(sales_order_detail.DAY(StartDate)) NUM_UNIQUE(sales_order_detail.MONTH(EndDate)) NUM_UNIQUE(sales_order_detail.MONTH(ModifiedDate)) NUM_UNIQUE(sales_order_detail.MONTH(StartDate)) NUM_UNIQUE(sales_order_detail.WEEKDAY(EndDate)) NUM_UNIQUE(sales_order_detail.WEEKDAY(ModifiedDate)) NUM_UNIQUE(sales_order_detail.WEEKDAY(StartDate)) NUM_UNIQUE(sales_order_detail.YEAR(EndDate)) NUM_UNIQUE(sales_order_detail.YEAR(ModifiedDate)) NUM_UNIQUE(sales_order_detail.YEAR(StartDate)) MODE(store.DAY(test)) MODE(store.MONTH(test)) MODE(store.WEEKDAY(test)) MODE(store.YEAR(test)) NUM_UNIQUE(store.DAY(test)) NUM_UNIQUE(store.MONTH(test)) NUM_UNIQUE(store.WEEKDAY(test)) NUM_UNIQUE(store.YEAR(test))
role target categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical categorical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical numerical
0 0  279 5 8 0 5 5 NULL NULL NULL NULL NULL No Discount No Discount 1 19 1 No Discount NULL 279 20565.6206 1971.5149 616.0984 23153.2339 0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  0  12  0  6074.982 3399.99 0  6  75  100  1912.1544 2039.994 0  0  1713.8017 1989.8658 0  2.1667 45  60  1120.2742 1193.6427 0  0  10.373 8.99 0  1  3  4  3.3963 5.1865 0  1  1  2  5  5  1  0  1.1963 -0.3885 0  1.4799 -0.3884 -0.3884 -0.3887 -0.3885 0  0  1883.1394 1729.5034 0  1.5859 37.0749 49.4332 971.2526 1038.0419 0  0  20565.6206 23878.39 0  26  540  720  13443.2903 14323.7118 0  0  nan  80  1  12  7  31  7  6  6  5  6  6  1  1  1  2011  2011  2011  2011  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  30  31  1  11  5  5  6  1  6  2014  2011  2011  1  1  1  1  1  1  1  1  1  1  1  1  12  9  4  2014  1  1  1  1 
1 0  279 5 8 0 5 5 NULL NULL NULL NULL NULL No Discount No Discount 1 28 2 No Discount NULL 279 1294.2529 124.2483 38.8276 1457.3288 0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  0  2  0  874.794 1457.99 0  1  75  100  884.7083 874.794 0  0  647.1264 1120.49 0  1  75  100  685.7074 647.1264 0  0  419.4589 782.99 0  1  75  100  486.7066 419.4589 0  1  1  1  2  1  1  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  321.9705 477.2971 0  0  0  0  281.4297 321.9705 0  0  1294.2529 2240.98 0  2  150  200  1371.4149 1294.2529 0  0  nan  80  1  12  7  31  7  6  6  5  6  6  1  1  1  2011  2011  2011  2011  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  30  31  1  11  5  5  6  1  6  2014  2011  2011  1  1  1  1  1  1  1  1  1  1  1  1  12  9  4  2014  1  1  1  1 
2 0  282 6 8 0 6 5 NULL NULL NULL NULL NULL No Discount No Discount 1 19 1 No Discount NULL 282 32726.4786 3153.7696 985.553 36865.8012 0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  0  15  0  8099.976 3399.99 0  5  375  500  1912.1544 2039.994 0  0  2181.7652 1590.4467 0  2.5333 151  201.3333 883.0532 934.9116 0  0  20.746 8.99 0  1  3  4  6.9223 5.1865 0  1  1  2  5  5  1  0  1.2022 0.2799 0  0.7955 0.6718 0.6718 0.3285 0.348 0  0  2515.2803 1423.6488 0  1.1872 166.7505 222.3339 802.3616 858.7914 0  0  32726.4786 23856.7 0  38  2265  3020  13245.7975 14023.6738 0  0  nan  74  1  12  7  31  7  6  6  5  6  6  1  1  1  2011  2011  2011  2011  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  30  31  1  11  5  5  6  1  6  2014  2011  2011  1  1  1  1  1  1  1  1  1  1  1  1  12  9  4  2014  1  1  1  1 
3 0  276 4 8 0 4 5 NULL NULL NULL NULL NULL No Discount No Discount 1 30 2 No Discount NULL 276 419.4589 40.2681 12.5838 472.3108 0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  0  1  0  419.4589 782.99 0  1  75  100  486.7066 419.4589 0  0  419.4589 782.99 0  1  75  100  486.7066 419.4589 0  0  419.4589 782.99 0  1  75  100  486.7066 419.4589 0  1  1  1  1  1  1  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  419.4589 782.99 0  1  75  100  486.7066 419.4589 0  0  nan  39  1  12  7  31  7  6  6  5  6  6  1  1  1  2011  2011  2011  2011  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  30  31  1  11  5  5  6  1  6  2014  2011  2011  1  1  1  1  1  1  1  1  1  1  1  1  12  9  4  2014  1  1  1  1 
4 0  280 1 8 0 1 5 NULL NULL NULL NULL NULL No Discount No Discount 1 19 1 No Discount NULL 280 24432.6088 2344.9921 732.81 27510.4109 0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  0  8  0  8099.976 3399.99 0  4  75  100  1912.1544 2039.994 0  0  3054.0761 2553.115 0  1.75 57  76  1438.4664 1531.5806 0  0  28.8404 49.99 0  1  3  4  38.4923 28.8404 0  1  1  2  2  2  1  0  0.8117 -1.4399 0  1.3554 -1.4402 -1.4402 -1.4399 -1.4399 0  0  2860.3836 1545.0056 0  1.165 33.3295 44.4394 864.1073 927.5374 0  0  24432.6088 20424.92 0  14  456  608  11507.731 12252.6448 0  0  nan  38  1  12  7  31  7  6  6  5  6  6  1  1  1  2011  2011  2011  2011  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  30  31  1  11  5  5  6  1  6  2014  2011  2011  1  1  1  1  1  1  1  1  1  1  1  1  12  9  4  2014  1  1  1  1 
...  ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
15820 1  NULL 7 8 1 7 1 NULL NULL NULL NULL NULL No Discount No Discount 0 29 37 No Discount 1 NULL 1173.96 93.9168 29.349 1297.2258 0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  0  4  0  1120.49 1120.49 0  1  375  500  713.0798 1120.49 0  0  293.49 293.49 0  1  207  276  183.2694 293.49 0  0  3.99 3.99 0  1  3  4  1.4923 3.99 0  1  1  2  4  3  1  0  1.9981 1.9981 0  0  -0.1153 -0.1153 1.9994 1.9981 0  0  551.4201 551.4201 0  0  196.204 261.6053 353.2259 551.4201 0  0  1173.96 1173.96 0  4  828  1104  733.0777 1173.96 0  1  1  0  nan  12  7  31  7  1  1  12  1  6  1  1  1  2014  2014  2013  2014  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  30  31  1  11  12  5  6  1  6  2014  2013  2011  1  1  1  1  1  1  1  1  1  1  1  1  nan  nan  nan  nan  nan  nan  nan  nan 
15821 1  NULL 10 8 1 10 1 NULL NULL NULL NULL NULL No Discount No Discount 0 11 19 No Discount 1 NULL 1179.47 94.3576 29.4868 1303.3144 0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  0  0  0  0  0  3  0  1120.49 1120.49 0  1  75  100  713.0798 1120.49 0  0  393.1567 393.1567 0  1  27  36  252.8315 393.1567 0  0  8.99 8.99 0  1  3  4  6.9223 8.99 0  1  1  2  3  3  1  0  1.7238 1.7238 0  0  1.7321 1.7321 1.7199 1.7238 0  0  630.2226 630.2226 0  0  41.5692 55.4256 398.8992 630.2226 0  0  1179.47 1179.47 0  3  81  108  758.4944 1179.47 0  1  1  0  nan  12  7  31  7  1  1  12  1  6  1  1  1  2014  2014  2013  2014  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  30  31  1  11  12  5  6  1  6  2014  2013  2011  1  1  1  1  1  1  1  1  1  1  1  1  nan  nan  nan  nan  nan  nan  nan