Start

In this example, you generate labels on a mock dataset of transactions. For each customer, you want to label whether the total purchase amount over the next hour of transactions will exceed $300. Additionally, you want to make your predictions one hour in advance.

[1]:
import composeml as cp

Load Data

With the package installed, load the data. To get an idea of how the transactions look, preview the data frame.

[2]:
df = cp.demos.load_transactions()

df[df.columns[:7]].head()
[2]:
   transaction_id  session_id     transaction_time  product_id  amount  customer_id   device
0             298           1  2014-01-01 00:00:00           5  127.64            2  desktop
1              10           1  2014-01-01 00:09:45           5   57.39            2  desktop
2             495           1  2014-01-01 00:14:05           5   69.45            2  desktop
3             460          10  2014-01-01 02:33:50           5  123.19            2   tablet
4             302          10  2014-01-01 02:37:05           5   64.47            2   tablet

Create Labeling Function

Define a labeling function that returns the total purchase amount for a given hour of transactions.

[3]:
def total_spent(df):
    total = df['amount'].sum()
    return total

Construct Label Maker

With the labeling function defined, create the LabelMaker for this prediction problem. To process one hour of transactions for each customer, set target_dataframe_name to the customer ID column (customer_id) and window_size to one hour (1h).

[4]:
label_maker = cp.LabelMaker(
    target_dataframe_name="customer_id",
    time_index="transaction_time",
    labeling_function=total_spent,
    window_size="1h",
)
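
To make the one-hour window concrete, here is a minimal plain-Python sketch of the idea, not Compose's implementation. It sums the amounts for one customer inside a single window, using the five preview rows shown earlier. The helper name window_total and the half-open interval [start, start + window) are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# The five preview rows shown earlier: (transaction_time, amount, customer_id).
rows = [
    (datetime(2014, 1, 1, 0, 0, 0), 127.64, 2),
    (datetime(2014, 1, 1, 0, 9, 45), 57.39, 2),
    (datetime(2014, 1, 1, 0, 14, 5), 69.45, 2),
    (datetime(2014, 1, 1, 2, 33, 50), 123.19, 2),
    (datetime(2014, 1, 1, 2, 37, 5), 64.47, 2),
]


def window_total(rows, customer_id, start, window=timedelta(hours=1)):
    """Sum one customer's amounts in the window [start, start + window).

    Hypothetical helper; the half-open interval is an assumption.
    """
    end = start + window
    return sum(
        amount
        for time, amount, customer in rows
        if customer == customer_id and start <= time < end
    )


first_window = window_total(rows, customer_id=2, start=datetime(2014, 1, 1, 0, 0, 0))
print(round(first_window, 2))  # 254.48
```

Only the first three preview rows fall inside the window that starts at midnight, so the label for that window is their sum.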

Generate Labels

Automatically search and extract the labels using LabelMaker.search().

[5]:
labels = label_maker.search(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1,
    verbose=True,
)

labels.head()
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 5/5
[5]:
   customer_id                 time  total_spent
0            1  2014-01-01 00:45:30       914.73
1            1  2014-01-01 00:46:35       806.62
2            1  2014-01-01 00:47:40       694.09
3            1  2014-01-01 00:52:00       687.80
4            1  2014-01-01 00:53:05       656.43
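
The label times above fall on consecutive transactions, which is consistent with an integer gap of 1 moving the window start forward one row at a time. The sketch below illustrates that reading in plain Python under that assumption; it is not the library's search routine. The helper sliding_totals is hypothetical, and the data is limited to the five preview rows for customer 2, so the totals will not match the real labels.

```python
from datetime import datetime, timedelta

# Preview rows for customer 2, sorted by time: (transaction_time, amount).
transactions = [
    (datetime(2014, 1, 1, 0, 0, 0), 127.64),
    (datetime(2014, 1, 1, 0, 9, 45), 57.39),
    (datetime(2014, 1, 1, 0, 14, 5), 69.45),
    (datetime(2014, 1, 1, 2, 33, 50), 123.19),
    (datetime(2014, 1, 1, 2, 37, 5), 64.47),
]


def sliding_totals(transactions, window=timedelta(hours=1), gap=1):
    """Return (window_start, total) pairs, advancing the start by `gap` rows.

    Assumes the transactions are already sorted by time.
    """
    results = []
    for i in range(0, len(transactions), gap):
        start = transactions[i][0]
        end = start + window
        # Rows before index i are earlier than the start, so skip them.
        total = sum(amount for time, amount in transactions[i:] if time < end)
        results.append((start, round(total, 2)))
    return results


windows = sliding_totals(transactions)
for start, total in windows:
    print(start, total)
```

Each transaction starts its own window, so five rows yield five labels, and overlapping windows share some of the same transactions.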
[6]:
%matplotlib inline
plot = labels.plot.dist()
[Figure: distribution of the generated total_spent labels]

Transform Labels

With the generated LabelTimes, apply specific transforms for our prediction problem.

Apply Threshold on Labels

To make the labels binary, apply LabelTimes.threshold() with a value of 300. Each label becomes True when the total amount exceeds $300 and False otherwise.

[7]:
labels = labels.threshold(300)

labels.head()
[7]:
   customer_id                 time  total_spent
0            1  2014-01-01 00:45:30         True
1            1  2014-01-01 00:46:35         True
2            1  2014-01-01 00:47:40         True
3            1  2014-01-01 00:52:00         True
4            1  2014-01-01 00:53:05         True
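
As a rough illustration of what the threshold does to the values, the sketch below applies a strict greater-than comparison to the five totals shown earlier. The helper exceeds_threshold is hypothetical, the strict comparison is an assumption based on the word "exceeding", and the last value (300.0) is an invented boundary case that does not come from the dataset.

```python
# First five totals from the generated labels, plus one hypothetical
# boundary value (300.0) that is not part of the original output.
totals = [914.73, 806.62, 694.09, 687.80, 656.43, 300.0]


def exceeds_threshold(values, threshold):
    """Return True for each value strictly greater than the threshold."""
    return [value > threshold for value in values]


binary = exceeds_threshold(totals, 300)
print(binary)  # [True, True, True, True, True, False]
```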

Lead Label Times

To predict one hour in advance, use LabelTimes.apply_lead() to shift the label times one hour earlier.

[8]:
labels = labels.apply_lead('1h')

labels.head()
[8]:
   customer_id                 time  total_spent
0            1  2013-12-31 23:45:30         True
1            1  2013-12-31 23:46:35         True
2            1  2013-12-31 23:47:40         True
3            1  2013-12-31 23:52:00         True
4            1  2013-12-31 23:53:05         True
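
The shift can be checked by hand with the standard library. This sketch subtracts one hour from the five label times shown before the lead was applied; shift_earlier is a hypothetical helper name, not part of Compose.

```python
from datetime import datetime, timedelta

# Label times before the lead is applied, taken from the earlier output.
label_times = [
    datetime(2014, 1, 1, 0, 45, 30),
    datetime(2014, 1, 1, 0, 46, 35),
    datetime(2014, 1, 1, 0, 47, 40),
    datetime(2014, 1, 1, 0, 52, 0),
    datetime(2014, 1, 1, 0, 53, 5),
]


def shift_earlier(times, lead):
    """Move every timestamp earlier by the given lead."""
    return [time - lead for time in times]


shifted = shift_earlier(label_times, timedelta(hours=1))
print(shifted[0])  # 2013-12-31 23:45:30
```

Because the first labels sit within the first hour of 2014-01-01, the shifted times land on the previous day, which matches the table above.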

Describe Labels

After transforming the labels, use LabelTimes.describe() to print the label distribution along with the settings and transforms that were used to make these labels. This is a useful reference for understanding how the labels are generated from raw data. The label distribution also helps you determine whether the labels are imbalanced.

[9]:
labels.describe()
Label Distribution
------------------
False      56
True       44
Total:    100


Settings
--------
gap                                    1
maximum_data                        None
minimum_data                        None
num_examples_per_instance             -1
target_column                total_spent
target_dataframe_name        customer_id
target_type                     discrete
window_size                           1h


Transforms
----------
1. threshold
  - value:    300

2. apply_lead
  - value:    1h
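
One quick way to read the distribution is to turn the counts into a positive rate. The sketch below rebuilds the counts reported above (56 False, 44 True) in plain Python; it stands in for the real labels only to show the arithmetic and does not call Compose.

```python
from collections import Counter

# Stand-in labels that reproduce the reported counts: 56 False and 44 True.
label_values = [False] * 56 + [True] * 44

counts = Counter(label_values)
total = sum(counts.values())
positive_rate = counts[True] / total

print(f"False: {counts[False]}, True: {counts[True]}, Total: {total}")
print(f"Positive rate: {positive_rate:.0%}")  # Positive rate: 44%
```

A 44% positive rate is close to balanced, so these labels do not show a strong class imbalance.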

Plot Labels

You can use plots to inspect the labels.

Distribution

This plot shows the label distribution.

[10]:
plot = labels.plot.distribution()
[Figure: label distribution]

Count by Time

This plot shows the label distribution across cutoff times.

[11]:
plot = labels.plot.count_by_time()
[Figure: label distribution across cutoff times]