Start¶
In this example, you generate labels on a mock dataset of transactions. For each customer, you want to label whether the total purchase amount over the next hour of transactions will exceed $300. Additionally, you want to make your predictions one hour in advance.
[1]:
import composeml as cp
Load Data¶
With the package installed, load the data. To get an idea on how the transactions looks, preview the data frame.
[2]:
df = cp.demos.load_transactions()
df[df.columns[:7]].head()
[2]:
transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | |
---|---|---|---|---|---|---|---|
0 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 2 | desktop |
1 | 10 | 1 | 2014-01-01 00:09:45 | 5 | 57.39 | 2 | desktop |
2 | 495 | 1 | 2014-01-01 00:14:05 | 5 | 69.45 | 2 | desktop |
3 | 460 | 10 | 2014-01-01 02:33:50 | 5 | 123.19 | 2 | tablet |
4 | 302 | 10 | 2014-01-01 02:37:05 | 5 | 64.47 | 2 | tablet |
Create Labeling Function¶
Define the labeling function that returns the total purchase amount given a hour of transactions.
[3]:
def total_spent(df):
total = df['amount'].sum()
return total
Construct Label Maker¶
With the labeling function, create the LabelMaker
for this prediction problem. To process one hour of transactions for each customer, set the target_dataframe_index
to the customer ID and the window_size
to one hour.
[4]:
label_maker = cp.LabelMaker(
target_dataframe_index="customer_id",
time_index="transaction_time",
labeling_function=total_spent,
window_size="1h",
)
Generate Labels¶
Automatically search and extract the labels using LabelMaker.search()
.
[5]:
labels = label_maker.search(
df.sort_values('transaction_time'),
num_examples_per_instance=-1,
gap=1,
verbose=True,
)
labels.head()
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 5/5
[5]:
customer_id | time | total_spent | |
---|---|---|---|
0 | 1 | 2014-01-01 00:45:30 | 914.73 |
1 | 1 | 2014-01-01 00:46:35 | 806.62 |
2 | 1 | 2014-01-01 00:47:40 | 694.09 |
3 | 1 | 2014-01-01 00:52:00 | 687.80 |
4 | 1 | 2014-01-01 00:53:05 | 656.43 |
[6]:
%matplotlib inline
plot = labels.plot.dist()
Transform Labels¶
With the generated LabelTimes
, apply specific transforms for our prediction problem.
Apply Threshold on Labels¶
To make the labels binary, LabelTimes.threshold()
is applied for amounts exceeding $300.
[7]:
labels = labels.threshold(300)
labels.head()
[7]:
customer_id | time | total_spent | |
---|---|---|---|
0 | 1 | 2014-01-01 00:45:30 | True |
1 | 1 | 2014-01-01 00:46:35 | True |
2 | 1 | 2014-01-01 00:47:40 | True |
3 | 1 | 2014-01-01 00:52:00 | True |
4 | 1 | 2014-01-01 00:53:05 | True |
Lead Label Times¶
The label times are shifted one hour earlier for predicting in advance by using LabelTimes.apply_lead()
.
[8]:
labels = labels.apply_lead('1h')
labels.head()
[8]:
customer_id | time | total_spent | |
---|---|---|---|
0 | 1 | 2013-12-31 23:45:30 | True |
1 | 1 | 2013-12-31 23:46:35 | True |
2 | 1 | 2013-12-31 23:47:40 | True |
3 | 1 | 2013-12-31 23:52:00 | True |
4 | 1 | 2013-12-31 23:53:05 | True |
Describe Labels¶
After transforming the labels, use LabelTimes.describe()
to print out the distribution with the settings and transforms that were used to make these labels. This is useful as a reference for understanding how the labels are generated from raw data. Also, the label distribution is helpful for determining if we have imbalanced labels.
[9]:
labels.describe()
Label Distribution
------------------
False 56
True 44
Total: 100
Settings
--------
gap 1
maximum_data None
minimum_data None
num_examples_per_instance -1
target_column total_spent
target_dataframe_index customer_id
target_type discrete
window_size 1h
Transforms
----------
1. threshold
- value: 300
2. apply_lead
- value: 1h
Plot Labels¶
You can use plots to inspect the labels.
Distribution¶
This plot shows the label distribution.
[10]:
plot = labels.plot.distribution()
Count by Time¶
This plot shows the label distribution across cutoff times.
[11]:
plot = labels.plot.count_by_time()