Controlling cutoff times in a label search
The time at which the labeling process starts is known as the first cutoff time. Features are built from data that exists before the first cutoff time, so some data must precede it. In a label search, you can use minimum_data to set either the first cutoff time directly or the amount of data required before the first cutoff time. Similarly, you can use maximum_data to set the last cutoff time directly. Together, these parameters let you control when the labeling process starts and finishes.
Labeling customer transactions
For example, suppose you have customer transactions from the first quarter of 2021.
[2]:
import composeml as cp
transactions.head()
[2]:
| | customer_id | transaction_time | amount |
|---|---|---|---|
| 0 | 3 | 2021-03-31 18:51:27 | 52.29 |
| 1 | 5 | 2021-03-22 06:56:05 | 33.81 |
| 2 | 5 | 2021-03-20 23:45:21 | 76.30 |
| 3 | 2 | 2021-03-30 10:06:59 | 32.72 |
| 4 | 1 | 2021-02-17 11:01:22 | 59.16 |
You want to calculate the total amount that customers spent in each two-week period, for February only. Start by defining a labeling function that sums the transaction amounts. Then, create a label maker that labels the data in two-week windows based on the transaction time.
[3]:
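def total_amount(ds):
    # ds is the slice of transactions that falls inside one window.
    return ds.amount.sum()

# Label each customer's transactions in 14-day windows.
lm = cp.LabelMaker(
    labeling_function=total_amount,
    time_index="transaction_time",
    target_dataframe_index="customer_id",
    window_size="14d",
)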
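To make the roles of the labeling function and the window size concrete, here is a minimal pure-Python sketch of the sum computed for a single window. It does not use Compose: window_total is a hypothetical helper, the rows are only the five sample transactions shown above, and treating the window as half-open (start included, end excluded) is an assumption.

```python
from datetime import datetime, timedelta

# The five sample rows from transactions.head():
# (customer_id, transaction_time, amount).
rows = [
    (3, datetime(2021, 3, 31, 18, 51, 27), 52.29),
    (5, datetime(2021, 3, 22, 6, 56, 5), 33.81),
    (5, datetime(2021, 3, 20, 23, 45, 21), 76.30),
    (2, datetime(2021, 3, 30, 10, 6, 59), 32.72),
    (1, datetime(2021, 2, 17, 11, 1, 22), 59.16),
]


def window_total(rows, customer_id, cutoff, window=timedelta(days=14)):
    """Sum one customer's amounts with cutoff <= transaction_time < cutoff + window.

    Hypothetical helper; the half-open window is an assumption.
    """
    total = sum(
        amount
        for cid, ts, amount in rows
        if cid == customer_id and cutoff <= ts < cutoff + window
    )
    return round(total, 2)


feb_15 = window_total(rows, customer_id=1, cutoff=datetime(2021, 2, 15))
mar_12 = window_total(rows, customer_id=5, cutoff=datetime(2021, 3, 12))
mar_01 = window_total(rows, customer_id=1, cutoff=datetime(2021, 3, 1))
print(feb_15, mar_12, mar_01)
```

A window that contains no transactions sums to zero, which is the kind of row that drop_empty=False keeps in the search results on this page.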
Defining the first and last cutoff time
Now, you can use minimum_data in the label search to directly set the 1st of February as the first cutoff time. Since you are labeling data over two weeks, you can define the last cutoff time as the 15th.
[4]:
lt = lm.search(
    df=transactions.sort_values("transaction_time"),
    num_examples_per_instance=-1,
    minimum_data="2021-02-01",
    maximum_data="2021-02-15",
    drop_empty=False,
    verbose=False,
)
lt
[4]:
| | customer_id | time | total_amount |
|---|---|---|---|
| 0 | 1 | 2021-02-01 | 31.29 |
| 1 | 1 | 2021-02-15 | 59.16 |
| 2 | 2 | 2021-02-01 | 49.70 |
| 3 | 2 | 2021-02-15 | 86.15 |
| 4 | 3 | 2021-02-01 | 0.00 |
| 5 | 3 | 2021-02-15 | 28.96 |
| 6 | 4 | 2021-02-01 | 128.70 |
| 7 | 4 | 2021-02-15 | 59.67 |
| 8 | 5 | 2021-02-01 | 0.00 |
| 9 | 5 | 2021-02-15 | 0.00 |
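The two cutoff times per customer in this table are consistent with stepping forward from the first cutoff time by the window size until the last cutoff time is passed. The following is a minimal sketch of that date arithmetic, not Compose's implementation. The helper cutoff_times is hypothetical, and the inclusive upper bound is inferred from the output above.

```python
from datetime import date, timedelta


def cutoff_times(first, last, window=timedelta(days=14)):
    """Return window start dates from first to last, both inclusive.

    Hypothetical helper that mirrors the output above, not Compose's code.
    """
    times = []
    current = first
    while current <= last:
        times.append(current)
        current += window
    return times


february = cutoff_times(date(2021, 2, 1), date(2021, 2, 15))
print([d.isoformat() for d in february])

# A 14-day window that starts on February 15 ends on March 1.
second_window_end = february[-1] + timedelta(days=14)
print(second_window_end.isoformat())
```

Because February 2021 has 28 days, two 14-day windows starting on the 1st and the 15th span the whole month.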
Changing the first cutoff time for each customer
Suppose you have a lookup table that contains the dates when customers signed up and created their accounts. Now, you are interested in calculating the total amount that customers spent over two weeks only after creating an account.
[5]:
created_account
[5]:
customer_id
1 2021-01-10
2 2021-02-12
3 2021-01-23
4 2021-02-13
5 2021-01-24
Name: created_account, dtype: datetime64[ns]
You can pass the series of sign-up dates directly as the first cutoff times in the labeling process. Each customer should have only one cutoff time.
[6]:
lt = lm.search(
    df=transactions.sort_values("transaction_time"),
    num_examples_per_instance=-1,
    minimum_data=created_account,
    drop_empty=False,
    verbose=False,
)
lt.head(10)
[6]:
| | customer_id | time | total_amount |
|---|---|---|---|
| 0 | 1 | 2021-01-10 | 0.00 |
| 1 | 1 | 2021-01-24 | 26.15 |
| 2 | 1 | 2021-02-07 | 90.45 |
| 3 | 1 | 2021-02-21 | 0.00 |
| 4 | 1 | 2021-03-07 | 49.64 |
| 5 | 2 | 2021-02-12 | 86.15 |
| 6 | 2 | 2021-02-26 | 0.00 |
| 7 | 2 | 2021-03-12 | 41.08 |
| 8 | 2 | 2021-03-26 | 32.72 |
| 9 | 3 | 2021-01-23 | 0.00 |
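In this output, each customer's cutoff times start at that customer's sign-up date and advance in 14-day steps. The sketch below reproduces only that date arithmetic in plain Python, using the sign-up dates from the lookup table above. The helper first_cutoffs is hypothetical, and the sketch does not model where the sequence ends for each customer.

```python
from datetime import date, timedelta

# Sign-up dates copied from the created_account lookup table.
created_account = {
    1: date(2021, 1, 10),
    2: date(2021, 2, 12),
    3: date(2021, 1, 23),
    4: date(2021, 2, 13),
    5: date(2021, 1, 24),
}


def first_cutoffs(start, count, window=timedelta(days=14)):
    """Return the first `count` cutoff dates, spaced one window apart.

    Hypothetical helper for illustration, not part of Compose.
    """
    return [start + i * window for i in range(count)]


per_customer = {
    cid: first_cutoffs(start, 3) for cid, start in created_account.items()
}
for cid, cutoffs in per_customer.items():
    print(cid, [d.isoformat() for d in cutoffs])
```

The dates generated for customers 1 and 2 agree with the first three rows shown for each of them in the table above.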
For more details on labeling data over specific periods, see the guide on generating data slices.