Using Label Transforms
In this guide, you learn how to use the transforms that are available on LabelTimes. Each transform returns a copy of the label times. This is useful for trying out multiple transforms in different settings without having to recalculate the labels. As a result, you can see which labels give better performance in less time.
Generate Labels
Start by generating labels on a mock dataset of transactions. Each label is the total amount that a customer spent within a one-hour window of transactions.
[1]:
import composeml as cp
import pandas as pd


def total_spent(df):
    return df["amount"].sum()


label_maker = cp.LabelMaker(
    labeling_function=total_spent,
    target_dataframe_index="customer_id",
    time_index="transaction_time",
    window_size="1h",
)

labels = label_maker.search(
    cp.demos.load_transactions(),
    num_examples_per_instance=10,
    minimum_data=pd.Timedelta("2h"),
    gap="2min",
    verbose=True,
)
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 50/50
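To make the labeling function concrete, here is a minimal sketch that applies total_spent to a single window of transactions. The column names come from the example above, but the rows are hypothetical values chosen for illustration and are not taken from the mock dataset.

```python
import pandas as pd


def total_spent(df):
    # Same labeling function as above: sum the "amount" column of one window.
    return df["amount"].sum()


# Hypothetical one-hour window of transactions for a single customer.
window = pd.DataFrame(
    {
        "customer_id": [1, 1, 1],
        "transaction_time": pd.to_datetime(
            ["2014-01-01 02:00:00", "2014-01-01 02:20:00", "2014-01-01 02:40:00"]
        ),
        "amount": [10.0, 5.5, 20.0],
    }
)

label = total_spent(window)
print(label)  # 35.5
```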
To get an idea of how the labels look, preview the data frame.
[2]:
labels.head()
[2]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 02:45:30 | 217.94 |
1 | 1 | 2014-01-01 02:47:30 | 217.94 |
2 | 1 | 2014-01-01 02:49:30 | 217.94 |
3 | 1 | 2014-01-01 02:51:30 | 217.94 |
4 | 1 | 2014-01-01 02:53:30 | 217.94 |
Threshold on Labels
LabelTimes.threshold() creates binary labels by testing if label values are above a threshold. In this example, a threshold is applied to determine which customers spent over 100.
[3]:
labels.threshold(100).head()
[3]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 02:45:30 | True |
1 | 1 | 2014-01-01 02:47:30 | True |
2 | 1 | 2014-01-01 02:49:30 | True |
3 | 1 | 2014-01-01 02:51:30 | True |
4 | 1 | 2014-01-01 02:53:30 | True |
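The comparison behind a threshold can be sketched in plain pandas. This assumes that "above" means strictly greater than the threshold; the values below are hypothetical and only illustrate the boundary case.

```python
import pandas as pd

# Hypothetical label values, chosen to include the boundary value 100.
total_spent = pd.Series([53.22, 100.00, 217.94], name="total_spent")

# A strictly-greater-than comparison yields boolean labels.
binary = total_spent.gt(100)
print(binary.tolist())  # [False, False, True]
```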
Lead Label Times
LabelTimes.apply_lead() shifts the label time to an earlier moment. This is useful for training a model to predict in advance. In this example, a one-hour lead is applied to the label times.
[4]:
labels.apply_lead("1h").head()
[4]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 01:45:30 | 217.94 |
1 | 1 | 2014-01-01 01:47:30 | 217.94 |
2 | 1 | 2014-01-01 01:49:30 | 217.94 |
3 | 1 | 2014-01-01 01:51:30 | 217.94 |
4 | 1 | 2014-01-01 01:53:30 | 217.94 |
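The shift itself amounts to subtracting a time delta from each label time. The following is a minimal pandas sketch of that arithmetic, using two label times from the preview above; it is not a description of how the library implements the transform.

```python
import pandas as pd

# Two label times taken from the preview of the original labels.
time = pd.Series(pd.to_datetime(["2014-01-01 02:45:30", "2014-01-01 02:47:30"]))

# Subtracting a timedelta moves each label time earlier by the lead.
lead_time = time - pd.Timedelta("1h")
print(lead_time.iloc[0])  # 2014-01-01 01:45:30
```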
Bin Labels
LabelTimes.bin() bins the labels into discrete intervals. Bins can be based on either values or quantiles, and the bin widths can be either defined by the user or divided equally.
Value Based
To use bins based on values, set quantiles to False, the default value.
Equal Width
To group values into bins of equal width, set bins to a scalar value. In this example, total_spent is grouped into 4 bins of equal width.
[5]:
labels.bin(4, quantiles=False).head()
[5]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 02:45:30 | (198.455, 271.072] |
1 | 1 | 2014-01-01 02:47:30 | (198.455, 271.072] |
2 | 1 | 2014-01-01 02:49:30 | (198.455, 271.072] |
3 | 1 | 2014-01-01 02:51:30 | (198.455, 271.072] |
4 | 1 | 2014-01-01 02:53:30 | (198.455, 271.072] |
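Equal-width binning can be sketched with pandas.cut, which splits the range between the minimum and maximum into equally sized intervals. Whether the library calls pandas.cut internally is an assumption here. The series holds a few total_spent values that appear in this guide's outputs, including the minimum (53.22) and maximum (343.69).

```python
import pandas as pd

# A handful of total_spent values that appear in this guide's outputs.
total_spent = pd.Series([53.22, 196.25, 217.94, 290.39, 343.69])

# A scalar splits the range [min, max] into that many equal-width intervals.
binned = pd.cut(total_spent, bins=4)

# Interval assigned to the value 217.94.
print(binned.iloc[2])
```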
Custom Widths
To group values into bins of custom widths, set bins to an array of values that define the bin edges. In this example, total_spent is grouped into bins of custom widths.
[6]:
inf = float("inf")
edges = [-inf, 34, 50, 67, inf]

labels.bin(
    edges,
    quantiles=False,
).head()
[6]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 02:45:30 | (67.0, inf] |
1 | 1 | 2014-01-01 02:47:30 | (67.0, inf] |
2 | 1 | 2014-01-01 02:49:30 | (67.0, inf] |
3 | 1 | 2014-01-01 02:51:30 | (67.0, inf] |
4 | 1 | 2014-01-01 02:53:30 | (67.0, inf] |
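As a rough illustration of how explicit edges partition the values, the same edges can be passed to pandas.cut. This is a sketch under the assumption that the bins are right-closed, as the output above suggests; the two values are taken from this guide's outputs.

```python
import pandas as pd

inf = float("inf")
edges = [-inf, 34, 50, 67, inf]

# Two total_spent values that appear in this guide's outputs.
total_spent = pd.Series([53.22, 217.94])

# Each value is assigned to the right-closed interval between two edges.
binned = pd.cut(total_spent, bins=edges)
print(binned.tolist())
```

Using infinite outer edges means that no value can fall outside the bins.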
Quantile Based
To use bins based on quantiles, set quantiles to True.
Equal Width
To group values into quantile bins of equal width, set bins to the number of quantiles as a scalar value (for example, 4 for quartiles or 10 for deciles). In this example, the total spent is grouped into bins based on the quartiles.
[7]:
labels.bin(4, quantiles=True).head()
[7]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 02:45:30 | (196.25, 217.94] |
1 | 1 | 2014-01-01 02:47:30 | (196.25, 217.94] |
2 | 1 | 2014-01-01 02:49:30 | (196.25, 217.94] |
3 | 1 | 2014-01-01 02:51:30 | (196.25, 217.94] |
4 | 1 | 2014-01-01 02:53:30 | (196.25, 217.94] |
To verify quartile values, check the descriptive statistics.
[8]:
stats = labels.total_spent.describe()
stats = stats.round(3).to_string()
print(stats)
count 50.000
mean 215.182
std 90.518
min 53.220
25% 196.250
50% 217.940
75% 290.390
max 343.690
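The difference from value-based bins is that quantile bins aim to hold an equal share of the observations rather than span an equal range of values. A minimal sketch with pandas.qcut shows this on hypothetical values; that the library relies on pandas.qcut is an assumption.

```python
import pandas as pd

# Hypothetical values, chosen so that each quartile holds two observations.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# q=4 places the bin edges at the 25%, 50%, and 75% quantiles.
quartiles = pd.qcut(values, q=4)

# Count the observations per bin, in bin order.
counts = quartiles.value_counts(sort=False).tolist()
print(counts)  # [2, 2, 2, 2]
```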
Custom Widths
To group values into quantile bins of custom widths, set bins to an array of quantiles. In this example, the total spent is grouped into quantile bins of custom widths.
[9]:
quantiles = [0, 0.34, 0.5, 0.67, 1]
labels.bin(quantiles, quantiles=True).head()
[9]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 02:45:30 | (196.25, 217.94] |
1 | 1 | 2014-01-01 02:47:30 | (196.25, 217.94] |
2 | 1 | 2014-01-01 02:49:30 | (196.25, 217.94] |
3 | 1 | 2014-01-01 02:51:30 | (196.25, 217.94] |
4 | 1 | 2014-01-01 02:53:30 | (196.25, 217.94] |
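To see how an array of quantiles controls the share of observations in each bin, the same quantiles can be passed to pandas.qcut. The values below are hypothetical (the integers 1 to 100), so the bin counts approximate the quantile gaps of 34%, 16%, 17%, and 33%.

```python
import pandas as pd

# Hypothetical values: the integers 1 through 100.
values = pd.Series(range(1, 101))

# Bin edges are placed at the 0%, 34%, 50%, 67%, and 100% quantiles.
quantiles = [0, 0.34, 0.5, 0.67, 1]
binned = pd.qcut(values, q=quantiles)

# Count the observations per bin, in bin order.
counts = binned.value_counts(sort=False).tolist()
print(counts)  # [34, 16, 17, 33]
```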
Label Bins
To assign bins with custom labels, set labels to the array of values. The number of labels needs to match the number of bins. In this example, the total spent is grouped into bins with custom labels.
[10]:
values = ["low", "medium", "high"]
labels.bin(3, labels=values).head()
[10]:
| customer_id | time | total_spent |
---|---|---|---|
0 | 1 | 2014-01-01 02:45:30 | medium |
1 | 1 | 2014-01-01 02:47:30 | medium |
2 | 1 | 2014-01-01 02:49:30 | medium |
3 | 1 | 2014-01-01 02:51:30 | medium |
4 | 1 | 2014-01-01 02:53:30 | medium |
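The requirement that labels and bins match in number can be sketched with pandas.cut, which rejects a mismatch with a ValueError. Whether the library raises the same error is an assumption; the series holds a few total_spent values that appear in this guide's outputs.

```python
import pandas as pd

# A handful of total_spent values that appear in this guide's outputs.
total_spent = pd.Series([53.22, 196.25, 217.94, 290.39, 343.69])

# Three equal-width bins, each replaced by a custom label.
values = ["low", "medium", "high"]
named = pd.cut(total_spent, bins=3, labels=values)
print(named.tolist())  # ['low', 'medium', 'medium', 'high', 'high']

# Two labels for three bins is rejected by pandas.cut.
try:
    pd.cut(total_spent, bins=3, labels=["low", "high"])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```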
Describe Labels
LabelTimes.describe() prints out the distribution with the settings and transforms that you’ve used to make the labels. This is useful as a reference for understanding how the labels were generated from raw data. The label distribution is also helpful for determining whether the labels are imbalanced. In this example, a description of the labels is printed after transforming the labels into discrete values.
[11]:
labels.threshold(100).describe()
Label Distribution
------------------
total_spent
False 8
True 42
Total: 50
Settings
--------
gap 2min
maximum_data None
minimum_data 0 days 02:00:00
num_examples_per_instance 10
target_column total_spent
target_dataframe_index customer_id
target_type discrete
window_size 1h
Transforms
----------
1. threshold
- value: 100
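One way to quantify the imbalance in a distribution like this is to convert the counts into proportions. The sketch below rebuilds a boolean series with the same counts as the output above (8 False, 42 True) and uses plain pandas rather than any LabelTimes method.

```python
import pandas as pd

# Boolean labels with the same counts as the distribution above.
binary = pd.Series([False] * 8 + [True] * 42, name="total_spent")

# normalize=True returns the share of each class instead of the raw count.
share = binary.value_counts(normalize=True).to_dict()
print(share[True], share[False])  # 0.84 0.16
```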
Sample Labels
LabelTimes.sample() samples the labels based on a number or fraction. Samples can be reproduced by fixing random_state to an integer.
To sample 10 labels, set n to 10.
[12]:
labels.sample(n=10, random_state=0)
[12]:
| customer_id | time | total_spent |
---|---|---|---|
2 | 1 | 2014-01-01 02:49:30 | 217.94 |
4 | 1 | 2014-01-01 02:53:30 | 217.94 |
10 | 2 | 2014-01-01 02:00:00 | 290.39 |
11 | 2 | 2014-01-01 02:02:00 | 290.39 |
22 | 3 | 2014-01-01 03:49:05 | 196.25 |
27 | 3 | 2014-01-01 03:59:05 | 196.25 |
28 | 3 | 2014-01-01 04:01:05 | 196.25 |
31 | 4 | 2014-01-01 02:41:00 | 343.69 |
38 | 4 | 2014-01-01 02:55:00 | 225.18 |
41 | 5 | 2014-01-01 03:48:25 | 53.22 |
Similarly, to sample 10% of labels, set frac to 0.1.
[13]:
labels.sample(frac=0.1, random_state=0)
[13]:
| customer_id | time | total_spent |
---|---|---|---|
2 | 1 | 2014-01-01 02:49:30 | 217.94 |
10 | 2 | 2014-01-01 02:00:00 | 290.39 |
11 | 2 | 2014-01-01 02:02:00 | 290.39 |
28 | 3 | 2014-01-01 04:01:05 | 196.25 |
41 | 5 | 2014-01-01 03:48:25 | 53.22 |
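The effect of random_state and the difference between n and frac can be sketched with pandas.DataFrame.sample on a hypothetical frame of 50 rows. The final sort_index call is an assumption, added to mimic the index-ordered output shown above.

```python
import pandas as pd

# Hypothetical frame with 50 rows, standing in for the 50 labels.
df = pd.DataFrame({"total_spent": range(50)})

# The same random_state draws the same rows every time.
first = df.sample(n=10, random_state=0).sort_index()
second = df.sample(n=10, random_state=0).sort_index()
print(first.equals(second))  # True

# A fraction of 0.1 draws 10% of the 50 rows.
tenth = df.sample(frac=0.1, random_state=0).sort_index()
print(len(tenth))  # 5
```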
Categorical Labels
When working with categorical labels, the number or fraction of labels for each category can be sampled by using a dictionary. First, bin the labels into 4 bins to make them categorical.
[14]:
categorical = labels.bin(4, labels=["A", "B", "C", "D"])
To sample 2 labels per category, map each category to the number 2.
[15]:
n = {"A": 2, "B": 2, "C": 2, "D": 2}
categorical.sample(n=n, random_state=0)
[15]:
| customer_id | time | total_spent |
---|---|---|---|
6 | 1 | 2014-01-01 02:57:30 | C |
11 | 2 | 2014-01-01 02:02:00 | D |
16 | 2 | 2014-01-01 02:12:00 | D |
26 | 3 | 2014-01-01 03:57:05 | B |
38 | 4 | 2014-01-01 02:55:00 | C |
42 | 5 | 2014-01-01 03:50:25 | A |
46 | 5 | 2014-01-01 03:58:25 | A |
48 | 5 | 2014-01-01 04:02:25 | B |
Similarly, to sample 10% of labels per category, map each category to 0.1.
[16]:
frac = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1}
categorical.sample(frac=frac, random_state=0)
[16]:
| customer_id | time | total_spent |
---|---|---|---|
6 | 1 | 2014-01-01 02:57:30 | C |
11 | 2 | 2014-01-01 02:02:00 | D |
16 | 2 | 2014-01-01 02:12:00 | D |
26 | 3 | 2014-01-01 03:57:05 | B |
46 | 5 | 2014-01-01 03:58:25 | A |
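Per-category sampling with a dictionary can be approximated in plain pandas by sampling each category separately and combining the results. This is a sketch of the idea on hypothetical labels, not the library's implementation; the frame and the per-category counts are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical labels, four rows per category.
df = pd.DataFrame({"total_spent": list("AAAABBBBCCCC")})

# Number of rows to draw from each category.
n = {"A": 2, "B": 2, "C": 1}

# Sample each category separately, then combine and restore index order.
parts = [
    df[df["total_spent"] == category].sample(n=count, random_state=0)
    for category, count in n.items()
]
sampled = pd.concat(parts).sort_index()

per_category = sampled["total_spent"].value_counts().to_dict()
print(per_category)
```

Passing frac instead of n to each sample call gives the fraction-based variant.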