Data Slice Generator¶
The data slice generator is the underlying function that generates data slices for the labeling function. If the label maker raises an error during the search, or the output labels don’t seem right, you will need to check the logic in the labeling function or inspect the data for inherent errors. The data slice generator can help with both. Ideally, you should also use the generator while developing your labeling function, as a best practice. However, it is an optional step and is not required to generate labels.
In this guide, we will use the data slice generator to inspect data slices and apply our labeling function. To get started, let’s load a mock dataset of transactions and sample the data to see how the transactions look.
[1]:
import composeml as cp
[2]:
df = cp.demos.load_transactions()
df = df[df.columns[:7]]
df.sample(n=5, random_state=0)
[2]:
 | transaction_id | session_id | transaction_time | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|---|
26 | 94 | 24 | 2014-01-01 05:55:20 | 5 | 100.42 | 5 | tablet |
86 | 274 | 7 | 2014-01-01 01:46:10 | 5 | 14.45 | 3 | tablet |
2 | 495 | 1 | 2014-01-01 00:14:05 | 5 | 69.45 | 2 | desktop |
55 | 275 | 4 | 2014-01-01 00:45:30 | 5 | 108.11 | 1 | mobile |
75 | 368 | 27 | 2014-01-01 06:36:30 | 5 | 139.43 | 1 | mobile |
Labeling Function¶
Let’s define a labeling function that will return how much a customer spent given a slice of transactions.
[3]:
def total_spent(df):
    total = df['amount'].sum()
    return total
Data Slices¶
The LabelMaker.slice() method creates the data slice generator. The parameters of this method can also be passed directly to LabelMaker.search() to generate the labels. In the following sections, we will see how to use the data slice generator to make data slices that are consecutive, overlapping, or spread out.
See also
For a conceptual explanation of the process, see Main Concepts.
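The relationship between the window size and the gap size can be sketched with plain datetime arithmetic. This is only an illustration of the three configurations covered in this guide, not how Compose computes cutoff times internally; `window_bounds` is a hypothetical helper, and it assumes that every slice starts at a fixed cutoff time spaced one gap apart.

```python
from datetime import datetime, timedelta


def window_bounds(first_cutoff, window_size, gap, count):
    """Return `count` half-open (start, end) windows spaced `gap` apart.

    Hypothetical helper for illustration; not part of Compose.
    """
    bounds = []
    for i in range(count):
        start = first_cutoff + i * gap
        bounds.append((start, start + window_size))
    return bounds


first_cutoff = datetime(2014, 1, 1)
hour = timedelta(hours=1)

# gap == window: consecutive; gap < window: overlap; gap > window: spread out.
consecutive = window_bounds(first_cutoff, 2 * hour, 2 * hour, 2)
overlapping = window_bounds(first_cutoff, 3 * hour, 1 * hour, 2)
spread_out = window_bounds(first_cutoff, 1 * hour, 3 * hour, 2)

for name, bounds in [
    ("consecutive", consecutive),
    ("overlap", overlapping),
    ("spread out", spread_out),
]:
    for start, end in bounds:
        print(f"{name:<12} [{start}, {end})")
```

The printed intervals should match the window values reported by the data slices in the sections that follow.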
Consecutive¶
When the gap size is equal to the window size, the data slices are consecutive. In other words, the data slices do not overlap and are not spread out (i.e., they don’t skip any data). Because the gap size defaults to the window size, this is the default behavior. To demonstrate, let’s generate data slices using these parameters.
We create a label maker with a 2-hour window size.
[4]:
lm = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=total_spent,
    window_size="2h",
)
Then, we create a data slice generator. We don’t pass a gap size here, so it defaults to the 2-hour window size.
Tip
You can directly set minimum_data as the first cutoff time.
[5]:
slices = lm.slice(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    minimum_data='2014-01-01',
)
Consecutive - Data Slice #1¶
By printing this data slice, we can see that it’s the first slice of transactions (denoted by the slice_number) for customer 1. This data slice contains all of the customer’s transactions that occurred within the 2-hour window between 2014-01-01 00:00:00 (inclusive) and 2014-01-01 02:00:00 (exclusive). We can also see that the 2-hour gap aligns the cutoff times to the window, so the next data slice will start where this one ends.
[6]:
ds = next(slices)
print(ds)
ds
slice_number 1
customer_id 1
window [2014-01-01 00:00:00, 2014-01-01 02:00:00)
gap [2014-01-01 00:00:00, 2014-01-01 02:00:00)
[6]:
transaction_time | transaction_id | session_id | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|
2014-01-01 00:45:30 | 275 | 4 | 5 | 108.11 | 1 | mobile |
2014-01-01 00:46:35 | 101 | 4 | 5 | 112.53 | 1 | mobile |
2014-01-01 00:47:40 | 80 | 4 | 5 | 6.29 | 1 | mobile |
2014-01-01 00:52:00 | 163 | 4 | 5 | 31.37 | 1 | mobile |
2014-01-01 00:53:05 | 293 | 4 | 5 | 82.88 | 1 | mobile |
2014-01-01 00:57:25 | 103 | 4 | 5 | 20.79 | 1 | mobile |
2014-01-01 01:03:55 | 488 | 4 | 5 | 129.00 | 1 | mobile |
2014-01-01 01:05:00 | 413 | 4 | 5 | 119.98 | 1 | mobile |
2014-01-01 01:31:00 | 191 | 6 | 5 | 139.23 | 1 | tablet |
2014-01-01 01:37:30 | 372 | 6 | 5 | 114.84 | 1 | tablet |
2014-01-01 01:38:35 | 387 | 6 | 5 | 49.71 | 1 | tablet |
Let’s apply our labeling function for the total spent on this data slice.
[7]:
total_spent(ds)
[7]:
914.7300000000001
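The trailing digits in this result come from binary floating-point arithmetic, not from an error in the data or in the labeling function. As a quick check, the sketch below re-adds the eleven amounts from the slice above using `decimal.Decimal`, which gives the exact decimal total. Rounding the float result inside a labeling function is optional and shown here only as one possible choice.

```python
from decimal import Decimal

# Amounts copied from the first consecutive data slice for customer 1.
amounts = [
    "108.11", "112.53", "6.29", "31.37", "82.88", "20.79",
    "129.00", "119.98", "139.23", "114.84", "49.71",
]

# Exact decimal arithmetic: no binary rounding error.
exact_total = sum(Decimal(a) for a in amounts)

# Float arithmetic: may carry a tiny representation error.
float_total = sum(float(a) for a in amounts)

print(exact_total)            # 914.73
print(round(float_total, 2))  # float total rounded to cents
```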
Consecutive - Data Slice #2¶
In the second data slice, we can see the next 2 consecutive hours of transactions, between 2014-01-01 02:00:00 and 2014-01-01 04:00:00. This is useful for generating labels that process each part of the data exactly once.
[8]:
ds = next(slices)
print(ds)
ds
slice_number 2
customer_id 1
window [2014-01-01 02:00:00, 2014-01-01 04:00:00)
gap [2014-01-01 02:00:00, 2014-01-01 04:00:00)
[8]:
transaction_time | transaction_id | session_id | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|
2014-01-01 02:28:25 | 287 | 9 | 5 | 50.94 | 1 | desktop |
2014-01-01 03:29:05 | 190 | 14 | 5 | 110.52 | 1 | tablet |
2014-01-01 03:39:55 | 7 | 14 | 5 | 107.42 | 1 | tablet |
Let’s apply our labeling function for the total spent on this data slice.
[9]:
total_spent(ds)
[9]:
268.88
Overlap¶
When the gap size is less than the window size, the data slices overlap. We can use this for labeling processes based on rolling windows. The amount of overlap is the difference between the window size and the gap size. For example, if the window size is 3 hours and the gap size is 1 hour, then consecutive data slices share 2 hours of data. To demonstrate, let’s generate data slices using these parameters.
We create a label maker with a 3-hour window size.
[10]:
lm = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=total_spent,
    window_size="3h",
)
Then, we create a data slice generator with a 1-hour gap size.
[11]:
slices = lm.slice(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    minimum_data='2014-01-01',
    gap="1h",
)
Overlap - Data Slice #1¶
The first data slice contains all of the customer’s transactions that occurred within the 3-hour window between 2014-01-01 00:00:00 and 2014-01-01 03:00:00. We can also see that the 1-hour gap separates the cutoff time of this data slice at 2014-01-01 00:00:00 from the cutoff time of the next data slice at 2014-01-01 01:00:00.
[12]:
ds = next(slices)
print(ds)
ds
slice_number 1
customer_id 1
window [2014-01-01 00:00:00, 2014-01-01 03:00:00)
gap [2014-01-01 00:00:00, 2014-01-01 01:00:00)
[12]:
transaction_time | transaction_id | session_id | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|
2014-01-01 00:45:30 | 275 | 4 | 5 | 108.11 | 1 | mobile |
2014-01-01 00:46:35 | 101 | 4 | 5 | 112.53 | 1 | mobile |
2014-01-01 00:47:40 | 80 | 4 | 5 | 6.29 | 1 | mobile |
2014-01-01 00:52:00 | 163 | 4 | 5 | 31.37 | 1 | mobile |
2014-01-01 00:53:05 | 293 | 4 | 5 | 82.88 | 1 | mobile |
2014-01-01 00:57:25 | 103 | 4 | 5 | 20.79 | 1 | mobile |
2014-01-01 01:03:55 | 488 | 4 | 5 | 129.00 | 1 | mobile |
2014-01-01 01:05:00 | 413 | 4 | 5 | 119.98 | 1 | mobile |
2014-01-01 01:31:00 | 191 | 6 | 5 | 139.23 | 1 | tablet |
2014-01-01 01:37:30 | 372 | 6 | 5 | 114.84 | 1 | tablet |
2014-01-01 01:38:35 | 387 | 6 | 5 | 49.71 | 1 | tablet |
2014-01-01 02:28:25 | 287 | 9 | 5 | 50.94 | 1 | desktop |
Let’s apply our labeling function for the total spent on this data slice.
[13]:
total_spent(ds)
[13]:
965.6700000000001
Overlap - Data Slice #2¶
In the second data slice, we can see a 2-hour overlap with the first slice, covering the transactions that occurred between 2014-01-01 01:00:00 and 2014-01-01 03:00:00. By adjusting the gap size, we can set the precise amount of overlap between data slices. This is useful for generating labels with a specific overlap.
[14]:
ds = next(slices)
print(ds)
ds
slice_number 2
customer_id 1
window [2014-01-01 01:00:00, 2014-01-01 04:00:00)
gap [2014-01-01 01:00:00, 2014-01-01 02:00:00)
[14]:
transaction_time | transaction_id | session_id | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|
2014-01-01 01:03:55 | 488 | 4 | 5 | 129.00 | 1 | mobile |
2014-01-01 01:05:00 | 413 | 4 | 5 | 119.98 | 1 | mobile |
2014-01-01 01:31:00 | 191 | 6 | 5 | 139.23 | 1 | tablet |
2014-01-01 01:37:30 | 372 | 6 | 5 | 114.84 | 1 | tablet |
2014-01-01 01:38:35 | 387 | 6 | 5 | 49.71 | 1 | tablet |
2014-01-01 02:28:25 | 287 | 9 | 5 | 50.94 | 1 | desktop |
2014-01-01 03:29:05 | 190 | 14 | 5 | 110.52 | 1 | tablet |
2014-01-01 03:39:55 | 7 | 14 | 5 | 107.42 | 1 | tablet |
Let’s apply our labeling function for the total spent on this data slice.
[15]:
total_spent(ds)
[15]:
821.6400000000001
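To confirm which transactions are shared between the two overlapping slices, the sketch below recomputes the overlap with the standard library only. The timestamps and amounts are copied from the tables above; `rows_in` is a hypothetical helper, and the half-open `[start, end)` filtering is an assumption based on the interval notation printed for each slice.

```python
from datetime import datetime

# (transaction_time, amount) pairs for customer 1, copied from the tables above.
transactions = [
    ("2014-01-01 00:45:30", 108.11),
    ("2014-01-01 00:46:35", 112.53),
    ("2014-01-01 00:47:40", 6.29),
    ("2014-01-01 00:52:00", 31.37),
    ("2014-01-01 00:53:05", 82.88),
    ("2014-01-01 00:57:25", 20.79),
    ("2014-01-01 01:03:55", 129.00),
    ("2014-01-01 01:05:00", 119.98),
    ("2014-01-01 01:31:00", 139.23),
    ("2014-01-01 01:37:30", 114.84),
    ("2014-01-01 01:38:35", 49.71),
    ("2014-01-01 02:28:25", 50.94),
    ("2014-01-01 03:29:05", 110.52),
    ("2014-01-01 03:39:55", 107.42),
]


def rows_in(window, rows):
    """Keep rows whose timestamp falls in the half-open window [start, end)."""
    start, end = window
    return [(t, a) for t, a in rows if start <= datetime.fromisoformat(t) < end]


window_1 = (datetime(2014, 1, 1, 0), datetime(2014, 1, 1, 3))
window_2 = (datetime(2014, 1, 1, 1), datetime(2014, 1, 1, 4))

# The overlap runs from the later start to the earlier end.
overlap = (max(window_1[0], window_2[0]), min(window_1[1], window_2[1]))
shared = rows_in(overlap, transactions)

print(len(rows_in(window_1, transactions)), len(rows_in(window_2, transactions)), len(shared))
print(round(sum(amount for _, amount in shared), 2))
```

Six transactions, totaling 603.70, fall inside both windows, so they contribute to both labels. Keep this double counting in mind when aggregating labels produced from overlapping slices.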
Spread Out¶
When the gap size is greater than the window size, the data between data slices is skipped. We can use this for labeling data at specific intervals of time. The amount of data skipped is the difference between the gap size and the window size. For example, if the gap size is 3 hours and the window size is 1 hour, then 2 hours of data will be skipped between data slices. To demonstrate, let’s generate data slices using these parameters.
We create a label maker with a 1-hour window size.
[16]:
lm = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=total_spent,
    window_size="1h",
)
Then, we create a data slice generator with a 3-hour gap size.
[17]:
slices = lm.slice(
    df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    minimum_data='2014-01-01',
    gap="3h",
)
Spread Out - Data Slice #1¶
The first data slice contains all of the customer’s transactions that occurred within the 1-hour window between 2014-01-01 00:00:00 and 2014-01-01 01:00:00. We can also see that the 3-hour gap separates the cutoff time of this data slice at 2014-01-01 00:00:00 from the cutoff time of the next data slice at 2014-01-01 03:00:00.
[18]:
ds = next(slices)
print(ds)
ds
slice_number 1
customer_id 1
window [2014-01-01 00:00:00, 2014-01-01 01:00:00)
gap [2014-01-01 00:00:00, 2014-01-01 03:00:00)
[18]:
transaction_time | transaction_id | session_id | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|
2014-01-01 00:45:30 | 275 | 4 | 5 | 108.11 | 1 | mobile |
2014-01-01 00:46:35 | 101 | 4 | 5 | 112.53 | 1 | mobile |
2014-01-01 00:47:40 | 80 | 4 | 5 | 6.29 | 1 | mobile |
2014-01-01 00:52:00 | 163 | 4 | 5 | 31.37 | 1 | mobile |
2014-01-01 00:53:05 | 293 | 4 | 5 | 82.88 | 1 | mobile |
2014-01-01 00:57:25 | 103 | 4 | 5 | 20.79 | 1 | mobile |
Let’s apply our labeling function for the total spent on this data slice.
[19]:
total_spent(ds)
[19]:
361.96999999999997
Spread Out - Data Slice #2¶
In the second data slice, we can see that the 2 hours of transactions between 2014-01-01 01:00:00 and 2014-01-01 03:00:00 were skipped. By adjusting the gap size, we can set the precise amount of data to skip between data slices. This is useful for generating labels that target specific portions of a dataset.
[20]:
ds = next(slices)
print(ds)
ds
slice_number 2
customer_id 1
window [2014-01-01 03:00:00, 2014-01-01 04:00:00)
gap [2014-01-01 03:00:00, 2014-01-01 06:00:00)
[20]:
transaction_time | transaction_id | session_id | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|
2014-01-01 03:29:05 | 190 | 14 | 5 | 110.52 | 1 | tablet |
2014-01-01 03:39:55 | 7 | 14 | 5 | 107.42 | 1 | tablet |
Let’s apply our labeling function for the total spent on this data slice.
[21]:
total_spent(ds)
[21]:
217.94
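As a sanity check on what was skipped, the sketch below computes the skipped interval from the window size and gap size, then lists customer 1's transactions that fall inside it. It uses only the standard library. The timestamps are copied from the slices shown earlier in this guide, and the half-open interval is an assumption based on the printed window notation.

```python
from datetime import datetime, timedelta

window_size = timedelta(hours=1)
gap = timedelta(hours=3)
first_cutoff = datetime(2014, 1, 1)

# With gap > window, the skipped interval runs from the end of one window
# to the next cutoff time.
skipped_start = first_cutoff + window_size
skipped_end = first_cutoff + gap

# Customer 1's transaction times between 01:00 and 04:00, copied from the
# earlier slices in this guide.
times = [
    "2014-01-01 01:03:55", "2014-01-01 01:05:00", "2014-01-01 01:31:00",
    "2014-01-01 01:37:30", "2014-01-01 01:38:35", "2014-01-01 02:28:25",
    "2014-01-01 03:29:05", "2014-01-01 03:39:55",
]
skipped = [
    t for t in times
    if skipped_start <= datetime.fromisoformat(t) < skipped_end
]

print(skipped_end - skipped_start)  # length of the skipped interval
print(skipped)
```

The six transactions between 01:00 and 03:00 appear in neither data slice, so they do not contribute to any label in this configuration.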
Data Slice Context¶
Each data slice has a context attribute to access its metadata. This is useful for integrating the context with the logic in the labeling function.
[22]:
vars(ds.context)
[22]:
{'gap': (Timestamp('2014-01-01 03:00:00'), Timestamp('2014-01-01 06:00:00')),
'window': (Timestamp('2014-01-01 03:00:00'),
Timestamp('2014-01-01 04:00:00')),
'slice_number': 2,
'target_entity': 'customer_id',
'target_instance': 1}
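As one possible use of the context, the sketch below normalizes a total by the window length. `window_hours` and `spend_per_hour` are hypothetical helpers, and a `SimpleNamespace` stands in for `ds.context` so the example runs without Compose. It assumes only that the context exposes `window` as a `(start, end)` pair of timestamps, as in the output above.

```python
from datetime import datetime
from types import SimpleNamespace


def window_hours(context):
    """Length of the slice window in hours, read from a context-like object."""
    start, end = context.window
    return (end - start).total_seconds() / 3600


def spend_per_hour(total, context):
    """Average spend per hour of the window (hypothetical labeling logic)."""
    return total / window_hours(context)


# Stand-in for ds.context, using the values from the last data slice.
context = SimpleNamespace(
    window=(datetime(2014, 1, 1, 3), datetime(2014, 1, 1, 4)),
    slice_number=2,
)

rate = spend_per_hour(217.94, context)
print(rate)
```

Inside a labeling function, the same helper could be called as `spend_per_hour(ds['amount'].sum(), ds.context)`. Pandas timestamps support the same subtraction and `total_seconds()` call used here.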
We hope this guide gives you a better understanding of how to use the data slice generator to develop your labeling function.