Using Compose with Featuretools

In this guide, we will generate labels and features on a mock dataset of transactions using Compose and Featuretools. Then, we will create a machine learning model for predicting, one hour in advance, whether customers will spend over $1200 within the next hour of transactions.

[1]:
%matplotlib inline
import composeml as cp
import featuretools as ft
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

Load Data

To get an idea of what the transactions look like, we preview the data frame.

[2]:
transactions = ft.demo.load_mock_customer(
    return_single_table=True,
    random_seed=0,
)

transactions[transactions.columns[:7]].head()
[2]:
transaction_id session_id transaction_time product_id amount customer_id device
0 298 1 2014-01-01 00:00:00 5 127.64 2 desktop
1 10 1 2014-01-01 00:09:45 5 57.39 2 desktop
2 495 1 2014-01-01 00:14:05 5 69.45 2 desktop
3 460 10 2014-01-01 02:33:50 5 123.19 2 tablet
4 302 10 2014-01-01 02:37:05 5 64.47 2 tablet

Generate Labels

Now with the transactions loaded, we are ready to generate labels for our prediction problem.

Create Labeling Function

First, we define the labeling function that will return the total purchase amount given an hour of transactions.

[3]:
def total_spent(df):
    """Return the total purchase amount for a window of transactions."""
    total = df["amount"].sum()
    return total
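
As a quick sanity check (this cell is illustrative and not part of the original notebook), we could apply the labeling function to a few raw transactions to confirm that it returns a single numeric total:

total_spent(transactions.head())  # sums the "amount" column of the first five rows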

Construct Label Maker

With our labeling function, we create the LabelMaker for the transactions. The target_entity is set to customer_id so that the labels are generated for each customer. The window_size is set to one hour to process one hour of transactions for a given customer.

[4]:
label_maker = cp.LabelMaker(
    target_entity='customer_id',
    time_index='transaction_time',
    labeling_function=total_spent,
    window_size='1h',
)

Create Labels

Next, we automatically search for and extract the labels by using LabelMaker.search().

See also

For more details on how the label maker works, see Main Concepts.

[5]:
labels = label_maker.search(
    transactions.sort_values('transaction_time'),
    num_examples_per_instance=-1,  # find all possible examples per customer
    gap=1,  # advance the window by one transaction between examples
)

labels.head()
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 5/5
[5]:
customer_id time total_spent
id
0 1 2014-01-01 00:44:25 2880.53
1 1 2014-01-01 00:45:30 2859.18
2 1 2014-01-01 00:46:35 2751.07
3 1 2014-01-01 00:47:40 2638.54
4 1 2014-01-01 00:48:45 2632.25

Transform Labels

With the generated LabelTimes, we will apply transforms specific to our prediction problem.

Apply Threshold on Labels

We apply LabelTimes.threshold() to binarize the labels, marking total amounts that exceed $1200 as True.

[6]:
labels = labels.threshold(1200)

labels.head()
[6]:
customer_id time total_spent
id
0 1 2014-01-01 00:44:25 True
1 1 2014-01-01 00:45:30 True
2 1 2014-01-01 00:46:35 True
3 1 2014-01-01 00:47:40 True
4 1 2014-01-01 00:48:45 True

Lead Label Times

We also use LabelTimes.apply_lead() to shift the label times one hour earlier, so that predictions can be made one hour in advance.

[7]:
labels = labels.apply_lead('1h')

labels.head()
[7]:
customer_id time total_spent
id
0 1 2013-12-31 23:44:25 True
1 1 2013-12-31 23:45:30 True
2 1 2013-12-31 23:46:35 True
3 1 2013-12-31 23:47:40 True
4 1 2013-12-31 23:48:45 True

Describe Labels

After transforming the labels, we can use LabelTimes.describe() to print the label distribution along with the settings and transforms that were used to make the labels. This is useful as a reference for understanding how the labels were generated from the raw data. The label distribution is also helpful for determining whether the labels are imbalanced.

[8]:
labels.describe()
Label Distribution
------------------
False     248
True      252
Total:    500


Settings
--------
gap                                    1
label_type                      discrete
labeling_function            total_spent
minimum_data                        None
num_examples_per_instance             -1
target_entity                customer_id
window_size                       <Hour>


Transforms
----------
1. threshold
  - value:    1200

2. apply_lead
  - value:    1h

Generate Features

Now with the generated labels, we are ready to generate features for our prediction problem.

Create Entity Set

Let’s construct an EntitySet and load the transactions as an entity by using EntitySet.entity_from_dataframe(). Then, we extract additional entities by using EntitySet.normalize_entity().

See also

For more details on working with entity sets, see loading_data/using_entitysets.

[9]:
es = ft.EntitySet('transactions')

es.entity_from_dataframe(
    'transactions',
    transactions,
    index='transaction_id',
    time_index='transaction_time',
)

es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='sessions',
    index='session_id',
    make_time_index='session_start',
    additional_variables=[
        'device',
        'customer_id',
        'zip_code',
        'session_start',
        'join_date',
        'date_of_birth',
    ],
)

es.normalize_entity(
    base_entity_id='sessions',
    new_entity_id='customers',
    index='customer_id',
    make_time_index='join_date',
    additional_variables=[
        'zip_code',
        'join_date',
        'date_of_birth',
    ],
)

es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='products',
    index='product_id',
    additional_variables=['brand'],
    make_time_index=False,
)

es.add_last_time_indexes()

Describe Entity Set

To get information on how the entity set is structured, we can print the entity set and use EntitySet.plot() to create a diagram.

[10]:
print(es, end='\n\n')

es.plot()
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 4]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
    transactions.product_id -> products.product_id

[10]:
[Entity set diagram generated by EntitySet.plot(), showing the transactions, sessions, customers, and products entities and their relationships.]

Create Feature Matrix

Next, we use dfs() to generate features that correspond to the labels we created previously. The target_entity is set to customers so that features are calculated only for customers. The cutoff_time is set to the labels so that features are calculated using only data up to and including each label's cutoff time. Notice that the output of Compose integrates easily with Featuretools.

See also

For more details on calculating features using cutoff times, see automated_feature_engineering/handling_time.

[11]:
feature_matrix, features_defs = ft.dfs(
    entityset=es,
    target_entity='customers',
    cutoff_time=labels,
    cutoff_time_in_index=True,
    verbose=True,
)
Built 77 features
Elapsed: 00:51 | Progress: 100%|██████████
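
Because cutoff_time_in_index is set to True, the resulting feature matrix is indexed by customer_id and the label cutoff time, with one row per label. As an illustrative check (not one of the original notebook cells), we could preview it:

feature_matrix.head()  # rows indexed by (customer_id, time), one per label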

Describe Features

To get an idea of what the generated features look like, we preview the feature definitions.

[12]:
features_defs[:20]
[12]:
[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: NUM_UNIQUE(sessions.device)>,
 <Feature: MODE(sessions.device)>,
 <Feature: SUM(transactions.amount)>,
 <Feature: STD(transactions.amount)>,
 <Feature: MAX(transactions.amount)>,
 <Feature: SKEW(transactions.amount)>,
 <Feature: MIN(transactions.amount)>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: COUNT(transactions)>,
 <Feature: NUM_UNIQUE(transactions.product_id)>,
 <Feature: MODE(transactions.product_id)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: DAY(join_date)>,
 <Feature: YEAR(date_of_birth)>,
 <Feature: YEAR(join_date)>,
 <Feature: MONTH(date_of_birth)>,
 <Feature: MONTH(join_date)>,
 <Feature: WEEKDAY(date_of_birth)>]

Machine Learning

Now with the generated labels and features, we are ready to create a machine learning model for our prediction problem.

Preprocess Features

In the feature matrix, let’s extract the labels and fill any missing values with zeros. Then, we one-hot encode all categorical features by using encode_features().

[13]:
y = feature_matrix.pop('total_spent')  # labels appended to the feature matrix by dfs
x = feature_matrix.fillna(0)  # fill missing feature values with zeros
x, features_enc = ft.encode_features(x, features_defs)  # one-hot encode categorical features
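
To verify the encoding (an illustrative check, not part of the original notebook), we can confirm that the categorical columns were replaced by numeric or boolean ones:

x.dtypes.value_counts()  # no object/categorical dtypes should remain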

Split Labels and Features

After preprocessing, we split the features and corresponding labels into training and testing sets.

[14]:
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    train_size=.8,
    test_size=.2,
    random_state=0,
)
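
The labels here are roughly balanced (as shown by describe() above), so a plain random split is sufficient; for heavily skewed labels, one might also pass stratify=y to train_test_split. As an illustrative check (not part of the original notebook), we could inspect the class balance of the training labels:

y_train.value_counts(normalize=True)  # proportion of True/False labels in the training set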

Train Model

Next, we train a random forest classifier on the training set.

[15]:
clf = RandomForestClassifier(n_estimators=10, random_state=0)

clf.fit(x_train, y_train)
[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Test Model

Lastly, we test the model performance by evaluating predictions on the testing set.

[16]:
y_hat = clf.predict(x_test)

print(classification_report(y_test, y_hat))
              precision    recall  f1-score   support

       False       0.79      0.90      0.84        49
        True       0.89      0.76      0.82        51

    accuracy                           0.83       100
   macro avg       0.84      0.83      0.83       100
weighted avg       0.84      0.83      0.83       100

Feature Importances

This plot uses the feature importance scores from the trained model to illustrate which features it considers most important for its predictions.

[17]:
feature_importances = zip(x_train.columns, clf.feature_importances_)
feature_importances = pd.Series(dict(feature_importances))
feature_importances = feature_importances.rename_axis('Features')
feature_importances = feature_importances.sort_values()

top_features = feature_importances.tail(40)
plot = top_features.plot(kind='barh', figsize=(5, 12), color='#054571')
plot.set_title('Feature Importances')
plot.set_xlabel('Scores');
[Horizontal bar chart titled "Feature Importances" showing the scores of the top 40 features.]