In this example, we will generate labels on a mock dataset of transactions. For each customer, we want to label whether the total purchase amount over the next hour of transactions will exceed $300. Additionally, we want to predict one hour in advance.
[1]:
import composeml as cp
With the package installed, we load the data. To get an idea of how the transactions look, we preview the data frame.
[2]:
df = cp.demos.load_transactions()
df[df.columns[:7]].head()
To get started, we define the labeling function, which returns the total purchase amount given an hour of transactions.
[3]:
def total_spent(df):
    total = df['amount'].sum()
    return total
With the labeling function, we create the LabelMaker for our prediction problem. To process one hour of transactions for each customer, we set the target_entity to the customer ID and the window_size to one hour.
[4]:
label_maker = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=total_spent,
    window_size="1h",
)
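To make the idea of a one-hour window concrete, here is a minimal standard-library sketch that sums one customer's amounts per window. It is a conceptual illustration only, not how composeml works internally: the sample transactions, the `sum_by_window` helper, and the assumption that windows are non-overlapping and start at the first transaction are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical transactions for a single customer: (timestamp, amount).
transactions = [
    (datetime(2019, 1, 1, 8, 0), 120.0),
    (datetime(2019, 1, 1, 8, 30), 200.0),
    (datetime(2019, 1, 1, 9, 15), 50.0),
    (datetime(2019, 1, 1, 10, 45), 310.0),
]


def sum_by_window(rows, window=timedelta(hours=1)):
    """Sum amounts in consecutive, non-overlapping windows.

    Assumption for illustration: windows start at the first transaction
    time. Windows that contain no transactions are skipped.
    """
    rows = sorted(rows)
    if not rows:
        return []
    start = rows[0][0]
    totals = {}
    for time, amount in rows:
        # Integer index of the window that contains this transaction.
        index = (time - start) // window
        totals[index] = totals.get(index, 0.0) + amount
    # Return (window start, total) pairs in time order.
    return [(start + i * window, totals[i]) for i in sorted(totals)]


window_totals = sum_by_window(transactions)
for window_start, total in window_totals:
    print(window_start, total)
```

Each total plays the role of the value returned by the labeling function for one window of one customer.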
Next, we search the data and extract the labels automatically by using LabelMaker.search().
[5]:
labels = label_maker.search(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1,
    verbose=True,
)
labels.head()
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 5/5
[6]:
%matplotlib inline
plot = labels.plot.dist()
With the generated LabelTimes, we will apply specific transforms for our prediction problem.
To make the labels binary, LabelTimes.threshold() is applied for amounts exceeding $300.
[7]:
labels = labels.threshold(300)
labels.head()
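The effect of thresholding can be sketched on plain numbers. This example assumes a strict greater-than comparison, as "exceeding $300" suggests; the `apply_threshold` helper and the sample totals are hypothetical and are not part of composeml.

```python
def apply_threshold(values, threshold):
    """Return True for each value strictly greater than the threshold."""
    return [value > threshold for value in values]


# Hypothetical purchase totals, one per window.
totals = [120.5, 300.0, 300.01, 845.2]

binary_labels = apply_threshold(totals, 300)
print(binary_labels)  # [False, False, True, True]
```

Under this assumption, a total of exactly 300 is labeled False.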
Additionally, to predict in advance, the label times are shifted one hour earlier by using LabelTimes.apply_lead().
[8]:
labels = labels.apply_lead('1h')
labels.head()
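Conceptually, applying a lead subtracts a fixed offset from each cutoff time while leaving the label values unchanged. The sketch below shows that arithmetic with the standard library; the `shift_earlier` helper and the sample cutoff times are hypothetical illustrations, not composeml code.

```python
from datetime import datetime, timedelta


def shift_earlier(cutoff_times, lead):
    """Move every cutoff time earlier by the given lead."""
    return [time - lead for time in cutoff_times]


# Hypothetical cutoff times for two labels.
cutoffs = [
    datetime(2019, 1, 1, 8, 0),
    datetime(2019, 1, 1, 9, 0),
]

shifted = shift_earlier(cutoffs, timedelta(hours=1))
for before, after in zip(cutoffs, shifted):
    print(before, "->", after)
```

With an earlier cutoff time, a model only sees data available at least one hour before the window being labeled.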
After transforming the labels, we can use LabelTimes.describe() to print out the distribution with the settings and transforms that were used to make these labels. This is useful as a reference for understanding how the labels were generated from raw data. Also, the label distribution is helpful for determining if we have imbalanced labels.
[9]:
labels.describe()
Label Distribution
------------------
False      56
True       44
Total:    100

Settings
--------
gap                                    1
minimum_data                        None
num_examples_per_instance             -1
target_column                total_spent
target_entity                customer_id
target_type                     discrete
window_size                           1h

Transforms
----------
1. threshold
  - value:    300

2. apply_lead
  - value:    1h
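As a quick check for imbalance, the counts reported in the label distribution (56 False, 44 True) can be turned into class shares and a majority-to-minority ratio. This is a small standalone sketch; the variable names are illustrative, and what ratio counts as "imbalanced" is a judgment call that depends on the problem.

```python
# Counts taken from the label distribution output.
counts = {False: 56, True: 44}

total = sum(counts.values())

# Share of each class among all labels.
shares = {label: count / total for label, count in counts.items()}

# Ratio of the largest class to the smallest class (1.0 means balanced).
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(round(imbalance_ratio, 2))
```

Here the positive class makes up 44% of the labels, so the two classes are close to balanced.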
Plots are also available for insight into the labels.
This plot shows the label distribution.
[10]:
plot = labels.plot.distribution()
This plot shows the label distribution across cutoff times.
[11]:
plot = labels.plot.count_by_time()