Predicting order cancellations using getML

In this tutorial, we demonstrate how getML can be applied in an e-commerce context. Using a data set of about 400,000 orders, we build a model that predicts whether an order will be cancelled.

We also show that we can significantly improve our results by using getML's built-in hyperparameter tuning routines.

Summary:

  • Prediction type: Classification model
  • Domain: E-commerce
  • Prediction target: Whether an order will be cancelled
  • Population size: 397925

Author: Dr. Patrick Urbanke

Background

The data set contains about 400,000 orders from a British online retailer. Each order consists of a product and the quantity ordered. Several orders can be grouped onto a single invoice. The goal is to predict whether an order will be cancelled.

Because the company mainly sells to other businesses, the cancellation rate is relatively low, namely 1.83%.

The data set was originally collected for the following study:

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

It has been downloaded from the UCI Machine Learning Repository:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows for easy exploration of data and features. You can launch the getML monitor here.

Where is this running?

Your getML live session is running inside a Docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

from pyspark.sql import SparkSession
import getml

getml.engine.launch()
getml.engine.set_project('online_retail')
Launched the getML engine. The log output will be stored in /home/patrick/.getML/logs/20220323121536.log.


Loading pipelines...
[========================================] 100%


Connected to project 'online_retail'
In [2]:
RUN_SPARK = False

1. Loading data

1.1 Download from source

We begin by downloading the data from the source file:

In [3]:
fname = "online_retail.csv"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://static.getml.com/datasets/online_retail/" + fname, 
        fname
    )
    
full_data_pandas = pd.read_csv(fname, sep="|")

1.2 Data preparation

The invoice dates are in a somewhat unusual format, so we need to rectify that.

In [4]:
def add_zero(string):
    if len(string) == 1:
        return "0" + string
    return string
In [5]:
def format_date(string):
    datetime = string.split(" ")
    assert len(datetime) == 2, "Expected date and time"
    
    date_components = datetime[0].split("/")
    assert len(date_components) == 3, "Expected three date components"
    
    date_components = [add_zero(x) for x in date_components]
    
    return "-".join(date_components) + " " + datetime[1] 
In [6]:
full_data_pandas["InvoiceDate"] = [
    format_date(string) for string in np.asarray(full_data_pandas["InvoiceDate"])
]
In [7]:
full_data_pandas
Out[7]:
Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011-12-09 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011-12-09 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011-12-09 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011-12-09 12:50 4.95 12680.0 France
541909 581587 POST POSTAGE 1 2011-12-09 12:50 18.00 12680.0 France

541910 rows × 8 columns

In this data set, the targets aren't as clearly defined as we would like, so we have to define them ourselves.

In [8]:
def add_target(df):
    # Mark an order as cancelled if the same customer later cancels
    # an order for the same stock code.
    df = df.sort_values(by=["Customer ID", "InvoiceDate"])

    cancelled = np.zeros(df.shape[0])

    invoice = np.asarray(df["Invoice"])
    stock_code = np.asarray(df["StockCode"])
    customer_id = np.asarray(df["Customer ID"])

    for i in range(len(invoice)):
        # Skip rows that are themselves cancellations (invoice numbers
        # starting with 'C') as well as the very last row.
        if (invoice[i][0] == 'C') or (i == len(invoice) - 1):
            continue

        j = i + 1

        # Scan forward through the same customer's later rows. The bounds
        # check guards against running past the end of the array.
        while j < len(invoice) and customer_id[j] == customer_id[i]:
            # A later cancellation of the same product: mark this order as cancelled.
            if (invoice[j][0] == 'C') and (stock_code[i] == stock_code[j]):
                cancelled[i] = 1.0
                break

            # A later regular order of the same product: stop searching.
            if stock_code[i] == stock_code[j]:
                break

            j += 1

    df["cancelled"] = cancelled

    return df

Also, we want to remove any orders in the data set that are actually cancellations.

In [9]:
def remove_cancellations(df):
    invoice = np.asarray(df["Invoice"])

    is_order = [inv[0] != 'C' for inv in invoice]
    
    df = df[is_order]
    
    return df
In [10]:
full_data_pandas = add_target(full_data_pandas)
full_data_pandas = remove_cancellations(full_data_pandas)

Finally, there are some orders for which we do not have a customer ID. We want to remove those as well.

In [11]:
full_data_pandas = full_data_pandas[~np.isnan(full_data_pandas["Customer ID"])]
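
As a quick sanity check, we can confirm the numbers quoted in the Background section, namely 397,925 remaining orders and a cancellation rate of roughly 1.83%:

print(full_data_pandas.shape[0])             # expected: 397925
print(full_data_pandas["cancelled"].mean())  # expected: approximately 0.0183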

Now we can upload the data to getML.

In [12]:
full_data = getml.data.DataFrame.from_pandas(full_data_pandas, "full_data")

full_data
Out[12]:
name Quantity Price Customer ID cancelled Invoice StockCode Description InvoiceDate Country
role unused_float unused_float unused_float unused_float unused_string unused_string unused_string unused_string unused_string
0 74215  1.04 12346  1  541431 23166 MEDIUM CERAMIC TOP STORAGE JAR 2011-01-18 10:01 United Kingdom
1 12  2.1 12347  0  537626 85116 BLACK CANDELABRA T-LIGHT HOLDER 2010-12-07 14:57 Iceland
2 4  4.25 12347  0  537626 22375 AIRLINE BAG VINTAGE JET SET BROW... 2010-12-07 14:57 Iceland
3 12  3.25 12347  0  537626 71477 COLOUR GLASS. STAR T-LIGHT HOLDE... 2010-12-07 14:57 Iceland
4 36  0.65 12347  0  537626 22492 MINI PAINT SET VINTAGE 2010-12-07 14:57 Iceland
...  ...  ...  ...  ... ... ... ... ...
397920 12  0.42 18287  0  570715 22419 LIPSTICK PEN RED 2011-10-12 10:23 United Kingdom
397921 12  2.1 18287  0  570715 22866 HAND WARMER SCOTTY DOG DESIGN 2011-10-12 10:23 United Kingdom
397922 36  1.25 18287  0  573167 23264 SET OF 3 WOODEN SLEIGH DECORATIO... 2011-10-28 09:29 United Kingdom
397923 48  0.39 18287  0  573167 21824 PAINTED METAL STAR WITH HOLLY BE... 2011-10-28 09:29 United Kingdom
397924 24  0.29 18287  0  573167 21014 SWISS CHALET TREE DECORATION 2011-10-28 09:29 United Kingdom

397925 rows x 9 columns
memory usage: 57.28 MB
name: full_data
type: getml.DataFrame
url: http://localhost:1709/#/getdataframe/online_retail/full_data/

1.3 Prepare data for getML

getML requires that we define roles for each of the columns.

In [13]:
full_data.set_role("InvoiceDate", getml.data.roles.time_stamp, time_formats=['%Y-%m-%d %H:%M'])
full_data.set_role(["Customer ID", "Invoice"], getml.data.roles.join_key)
full_data.set_role(["cancelled"], getml.data.roles.target)
full_data.set_role(["Quantity", "Price"], getml.data.roles.numerical)
full_data.set_role("Country", getml.data.roles.categorical)
full_data.set_role("Description", getml.data.roles.text)

The StockCode is a 5-digit code that uniquely identifies a product. It is hierarchical, meaning that every digit carries information of its own. We want to make use of that, so we assign a unit to the stock code, which we can then reference in our preprocessors.

In [14]:
full_data.set_unit("StockCode", "code")
In [15]:
split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)
split
Out[15]:
0 train
1 validation
2 train
3 validation
4 validation
...

infinite number of rows
type: StringColumnView

2. Predictive modeling

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

2.1 Define relational model

To get started with relational learning, we need to specify the data model.

In our case, there are two joins we are interested in:

1) We want to take a look at all of the other orders on the same invoice.

2) We want to check out how often a certain customer has cancelled orders in the past. Here, we limit ourselves to the last 90 days. To avoid data leaks, we set a horizon of one day.

In [16]:
star_schema = getml.data.StarSchema(
    population=full_data, 
    alias="population",
    split=split,
)

star_schema.join(
    full_data.drop("Description"),
    alias="full_data",
    on='Invoice',
)

star_schema.join(
    full_data.drop("Description"),
    alias="full_data",
    on='Customer ID',
    time_stamps='InvoiceDate',
    horizon=getml.data.time.days(1),
    memory=getml.data.time.days(90),
    lagged_targets=True,
)

star_schema
Out[16]:

data model

diagram


population is joined to full_data on Invoice = Invoice, and to full_data on Customer ID = Customer ID with InvoiceDate <= InvoiceDate (memory: 90.0 days, horizon: 1.0 days, lagged targets allowed)

staging

data frames staging table
0 population POPULATION__STAGING_TABLE_1
1 full_data FULL_DATA__STAGING_TABLE_2

container

population

subset name rows type
0 test full_data 60013 View
1 train full_data 278171 View
2 validation full_data 59741 View

peripheral

name rows type
0 full_data 397925 View

2.2 getML pipeline

Set up the feature learner & predictor

We have mentioned that the StockCode is a hierarchical code. To make use of that fact, we use getML's substring preprocessor, extracting the first digit, the first two digits etc. Since we have assigned the unit code to the StockCode, the preprocessors know which column they should be applied to.

In [17]:
substr1 = getml.preprocessors.Substring(0, 1, "code")
substr2 = getml.preprocessors.Substring(0, 2, "code")
substr3 = getml.preprocessors.Substring(0, 3, "code")

mapping = getml.preprocessors.Mapping()

text_field_splitter = getml.preprocessors.TextFieldSplitter()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,
    sampling_factor=0.1,
)

feature_selector = getml.predictors.XGBoostClassifier()

predictor = getml.predictors.XGBoostClassifier()

Build the pipeline

In [18]:
pipe = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    preprocessors=[substr1, substr2, substr3, mapping, text_field_splitter],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=[predictor],
    share_selected_features=0.2,
)

pipe
Out[18]:
Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=['XGBoostClassifier'],
         include_categorical=False,
         loss_function=None,
         peripheral=['full_data'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],
         share_selected_features=0.2,
         tags=['fast_prop'])

2.3 Model training

In [19]:
pipe.check(star_schema.train)
Checking data model...


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Checking...
[========================================] 100%


OK.
In [20]:
pipe.fit(star_schema.train)
Checking data model...


Staging...
[========================================] 100%


OK.


Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

Indexing text fields...
[========================================] 100%

FastProp: Trying 206 features...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

XGBoost: Training as feature selector...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%


Trained pipeline.
Time taken: 0h:6m:50.075449

Out[20]:
Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=['XGBoostClassifier'],
         include_categorical=False,
         loss_function=None,
         peripheral=['full_data'],
         predictors=['XGBoostClassifier'],
         preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],
         share_selected_features=0.2,
         tags=['fast_prop', 'container-DNrDmX'])

url: http://localhost:1709/#/getpipeline/online_retail/F0kxVy/0/

2.4 Model evaluation

In [21]:
pipe.score(star_schema.test)

Staging...
[========================================] 100%

Preprocessing...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


Out[21]:
date time set used target accuracy auc cross entropy
0 2022-03-23 12:23:29 train cancelled 0.9825 0.8446 0.0736
1 2022-03-23 12:23:37 test cancelled 0.9825 0.8119 0.07529
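
Beyond the aggregate scores, we can also inspect the raw predicted probabilities on the test set. The following is a minimal sketch, assuming that pipe.predict accepts the test subset just like pipe.score does, that it returns a plain NumPy array, and that the test population can be exported via to_pandas():

# Predicted cancellation probabilities for the test set.
probabilities = pipe.predict(star_schema.test).flatten()

# Actual targets, exported to pandas for comparison.
actual = star_schema.test.population.to_pandas()["cancelled"].to_numpy()

# Compare the probability distributions of cancelled vs. non-cancelled orders.
plt.hist(probabilities[actual == 1.0], bins=50, alpha=0.5, density=True, label="cancelled")
plt.hist(probabilities[actual == 0.0], bins=50, alpha=0.5, density=True, label="not cancelled")
plt.xlabel("predicted probability of cancellation")
plt.legend()
plt.show()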

2.5 Features

The most important feature looks as follows:

In [22]:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
Out[22]:
DROP TABLE IF EXISTS "FEATURE_1_190";

CREATE TABLE "FEATURE_1_190" AS
SELECT AVG( t2."description__mapping_3_target_1_avg" ) AS "feature_1_190",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "POPULATION__STAGING_TABLE_1__DESCRIPTION" t2
ON t1."rowid" = t2."rownum"
GROUP BY t1.rowid;

2.6 Productionization

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code.

In [23]:
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("online_retail_spark")
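
The .save(...) call above is expected to write the transpiled features as individual .sql scripts into the online_retail_spark folder, which getml.spark.execute will later run against the Spark session. A quick way to peek at what was generated (assuming one script per staging step and feature):

print(sorted(os.listdir("online_retail_spark"))[:5])
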
In [24]:
if RUN_SPARK:
    spark = SparkSession.builder.appName(
        "online_retail"
    ).config(
        "spark.driver.maxResultSize","5g"
    ).config(
        "spark.driver.memory", "5g"
    ).config(
        "spark.executor.memory", "5g"
    ).config(
        "spark.sql.execution.arrow.pyspark.enabled", "true"
    ).config(
        "spark.sql.session.timeZone", "UTC"
    ).enableHiveSupport().getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")
In [25]:
if RUN_SPARK:
    population_spark = star_schema.train.population.to_pyspark(spark, name="population")
    peripheral_spark = star_schema.full_data.to_pyspark(spark, name="full_data")
In [26]:
if RUN_SPARK:
    getml.spark.execute(spark, "online_retail_spark")

The resulting features are written to a table called FEATURES. Here is how you can retrieve them:

In [27]:
if RUN_SPARK:
    spark.sql("SELECT * FROM `FEATURES` LIMIT 20").toPandas()

3. Conclusion

In this notebook, we have demonstrated how getML can be applied in an e-commerce setting. In particular, we have seen how results can be improved using the built-in hyperparameter tuning routines.

Next Steps

This tutorial presented an application of getML in the e-commerce context.

If you are interested in further real-world applications of getML, head back to the notebook overview and choose one of the remaining examples.

If you want to learn more about getML, check out the additional material in our documentation.

Get in contact

If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Would you prefer a private demo of getML? Just contact us to make an appointment.