Predict Next Purchase¶

In this example, we will generate labels on online grocery orders provided by Instacart using Compose. The labels can be used to train a machine learning model to predict whether a customer will buy a specific product within the next month.

If you plan to run this notebook, you can use the following command at the root directory of the repository.

jupyter notebook docs/source/examples/predict-next-purchase/example.ipynb

Load Data¶

[1]:

%matplotlib inline
import composeml as cp
import data

The data hosted here will be downloaded automatically into the data module of this notebook unless it already exist. Once the data is in place, we can preview the grocery orders to see how they look.

[2]:

df = data.load_orders(nrows=1000000)

df.head()

[2]:

	order_id	product_id	add_to_cart_order	product_name	aisle_id	department_id	department	user_id	order_time
0	120	33120	13	Organic Egg Whites	86	16	dairy eggs	23750	2015-01-11 08:00:00
1	120	31323	7	Light Wisconsin String Cheese	21	16	dairy eggs	23750	2015-01-11 08:00:00
2	120	1503	8	Low Fat Cottage Cheese	108	16	dairy eggs	23750	2015-01-11 08:00:00
3	120	28156	11	Total 0% Nonfat Plain Greek Yogurt	120	16	dairy eggs	23750	2015-01-11 08:00:00
4	120	41273	4	Broccoli Florets	123	4	produce	23750	2015-01-11 08:00:00

Generate Labels¶

Now with the grocery orders loaded, we are ready to generate labels for our prediction problem.

Create Labeling Function¶

To get started, we define the labeling function that will return whether a customer purchased the product in a given month.

[3]:

def bought_product(df, product_name):
    purchased = df.product_name.str.contains(product_name).any()
    return purchased

Construct Label Maker¶

With the labeling function, we create the label maker for our prediction problem. To process one month of orders for each customer, we set the target_entity to the customer ID and the window_size to one month. When window size is set to 1MS, the window size will end on the first day of the next month. Alias definitions are listed here.

[4]:

lm = cp.LabelMaker(
    target_entity='user_id',
    time_index='order_time',
    labeling_function=bought_product,
    window_size='1MS',
)

Search Labels¶

Next, the label maker will search through the data continously to label whether a customer bought bananas in a given month. This happens when we use LabelMaker.search and set the product_name to bananas. If you are running this code yourself, feel free to expirement with other products (e.g. limes, avocados, etc.) and different time frames!

[5]:

lt = lm.search(
    df.sort_values('order_time'),
    minimum_data='2015-01-01',
    num_examples_per_instance=-1,
    product_name='Banana',
    verbose=True,
)

lt.head()

Elapsed: 00:29 | Remaining: 00:00 | Progress: 100%|██████████| user_id: 19477/19477

[5]:

	user_id	time	bought_product
id
0	4	2015-01-01	False
1	7	2015-01-01	False
2	10	2015-01-01	False
3	10	2015-02-01	False
4	13	2015-01-01	False

Describe Labels¶

With the generate label times, we can use LabelTimes.describe to print out the distribution with the settings and transforms that were used to make these labels. This is useful as a reference for understanding how the labels were generated from raw data. Also, the label distribution is helpful for determining if we have imbalanced labels.

[6]:

lt.describe()

Label Distribution
------------------
False     13752
True       7044
Total:    20796


Settings
--------
gap                            <MonthBegin>
label_type                         discrete
labeling_function            bought_product
minimum_data                     2015-01-01
num_examples_per_instance                -1
target_entity                       user_id
window_size                    <MonthBegin>


Transforms
----------
No transforms applied

Plot Labels¶

Additionally, there are plots available for insight to the labels.

Distribution¶

This plot shows the label distribution.

[7]:

lt.plot.distribution();

../../_images/examples_predict-next-purchase_example_14_0.png

Count by Time¶

This plot shows the label distribution across cutoff times.

[8]:

lt.plot.count_by_time();

../../_images/examples_predict-next-purchase_example_16_0.png