Predict Bike Trips¶

In this example, we build a machine learning application to predict the number of bike trips from a station in the next biking period. This application is structured into three important steps:

Prediction Engineering
Feature Engineering
Machine Learning

In the first step, we generate new labels from the data by using Compose. In the second step, we generate features for the labels by using Featuretools. In the third step, we search for the best machine learning pipeline by using EvalML. After working through these steps, you will learn how to build machine learning applications for real-world problems like forecasting demand. Let’s get started.

[1]:

from demo.chicago_bike import load_sample
from matplotlib.pyplot import subplots
import composeml as cp
import featuretools as ft
import evalml

We will use data provided by Divvy which is a bicycle sharing system in Chicago. In this dataset, we have a record of each bike trip.

[2]:

df = load_sample()
df.head()

[2]:

	gender	starttime	stoptime	tripduration	temperature	events	from_station_id	dpcapacity_start	to_station_id	dpcapacity_end
trip_id
2331610	Female	2014-06-29 13:35:00	2014-06-29 13:56:00	20.750000	82.9	cloudy	178	15.0	76	39.0
2347603	Female	2014-06-30 12:07:00	2014-06-30 12:37:00	30.150000	82.0	cloudy	211	19.0	177	15.0
2345120	Male	2014-06-30 08:36:00	2014-06-30 08:43:00	6.516667	75.0	cloudy	340	15.0	67	15.0
2347527	Male	2014-06-30 12:00:00	2014-06-30 12:08:00	7.250000	82.0	cloudy	56	19.0	56	19.0
2344421	Male	2014-06-30 08:04:00	2014-06-30 08:11:00	7.316667	75.0	cloudy	77	23.0	37	19.0

Prediction Engineering¶

How many trips will occur from a station in the next biking period?

We can change the length of the biking period to create different prediction problems. For example, how many bike trips will occur in the next 13 hours or in the next week? These variations can be done by simply tweaking a parameter. This helps us explore different scenarios which is crucial for making better decisions.

Defining the Labeling Function¶

Let’s start by defining a labeling function to calculate the number of trips. Given that each observation is an individual trip, the number of trips is just the number of observations. Our labeling function will be used by a label maker to extract the training examples.

[3]:

def trip_count(ds):
    return len(ds)

Representing the Prediction Problem¶

Then, let’s represent the prediction problem by creating a label maker with the following parameters:

target_entity as the column for station ID where each trip starts from, since we want to process trips from each station.
labeling_function as the function to calculate the number of trips.
time_index as the column for the starting time of a trip. The biking periods are based on this time index.
window_size as the length of a biking period. We can easily change this parameter to create variations of the prediction problem.

[4]:

lm = cp.LabelMaker(
    target_entity='from_station_id',
    labeling_function=trip_count,
    time_index='starttime',
    window_size='13h',
)

Finding the Training Examples¶

Now, let’s run a search to get the training examples by using the following parameters:

The trips sorted by the start time, since the search will expect the trips to be sorted chronologically, otherwise an error will be raised.
num_examples_per_instance to find the number of training examples per station. In this case, the search will return all existing examples.
minimum_data as the start time of the first biking period. This is also the first cutoff time for building features.

[5]:

lt = lm.search(
    df.sort_values('starttime'),
    num_examples_per_instance=-1,
    minimum_data='2014-06-30 08:00',
    verbose=False,
)

lt.head()

[5]:

	from_station_id	time	trip_count
0	5	2014-06-30 08:00:00	3
1	15	2014-06-30 08:00:00	1
2	16	2014-06-30 08:00:00	1
3	17	2014-06-30 08:00:00	2
4	19	2014-06-30 08:00:00	2

The output from the search is a label times table with three columns:

The station ID associated to the trips. There can be many training examples generated from each station.
The start time of the biking period. This is also the cutoff time for building features. Only data that existed beforehand is valid to use for predictions.
The number of trips during the biking period window. This is calculated by our labeling function.

As a helpul reference, we can print out the search settings that were used to generate these labels.

[6]:

lt.describe()

Settings
--------
gap                                      None
minimum_data                 2014-06-30 08:00
num_examples_per_instance                  -1
target_column                      trip_count
target_entity                 from_station_id
target_type                        continuous
window_size                               13h


Transforms
----------
No transforms applied

We can also get a better look at the labels by plotting the distribution and cumulative count across time.

[7]:

%matplotlib inline
fig, ax = subplots(nrows=2, ncols=1, figsize=(6, 8))
lt.plot.distribution(ax=ax[0])
lt.plot.count_by_time(ax=ax[1])
fig.tight_layout(pad=2)

../_images/examples_predict_bike_trips_13_0.png

Feature Engineering¶

In the previous step, we generated the labels. The next step is to generate the features.

Representing the Data¶

Let’s start by representing the data with an entity set. This way, we can generate features based on the relational structure of the dataset. We currently have a single table of trips where one station can have many trips. This one-to-many relationship can be represented by normalizing a station entity. The same can be done with other one-to-many relationships like weather-to-trips. We want to make predictions based on the station where the trips started from, so we will use this station entity as the target entity for generating features. Also, we will use the stop times of the trips as the time index for generating features, since data about a trip would likely be unavailable until the trip is complete.

[8]:

es = ft.EntitySet('chicago_bike')

es.entity_from_dataframe(
    dataframe=df.reset_index(),
    entity_id='trips',
    time_index='stoptime',
    index='trip_id',
)

es.normalize_entity(
    base_entity_id='trips',
    new_entity_id='from_station_id',
    index='from_station_id',
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='trips',
    new_entity_id='weather',
    index='events',
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='trips',
    new_entity_id='gender',
    index='gender',
    make_time_index=False,
)

es["trips"]["gender"].interesting_values = ['Male', 'Female']
es["trips"]["events"].interesting_values = ['tstorms']
es.plot()

[8]:

../_images/examples_predict_bike_trips_15_0.svg

Calculating the Features¶

Now, we can generate features by using a method called Deep Feature Synthesis (DFS). This will automatically build features by stacking and applying mathematical operations called primitives across relationships in an entity set. The more structured an entity set is, the better DFS can leverage the relationships to generate better features. Let’s run DFS using the following parameters:

entity_set as the entity set we structured previously.
target_entity as the station entity where the trips started from.
cutoff_time as the label times that we generated previously. The label values are appended to the feature matrix.

[9]:

fm, fd = ft.dfs(
    entityset=es,
    target_entity='from_station_id',
    trans_primitives=['hour', 'week', 'is_weekend'],
    cutoff_time=lt,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=False,
)

fm.head()

[9]:

		COUNT(trips)	MAX(trips.dpcapacity_end)	MAX(trips.dpcapacity_start)	MAX(trips.temperature)	MAX(trips.to_station_id)	MAX(trips.tripduration)	MEAN(trips.dpcapacity_end)	MEAN(trips.dpcapacity_start)	MEAN(trips.temperature)	MEAN(trips.to_station_id)	...	MODE(trips.HOUR(stoptime))	MODE(trips.WEEK(starttime))	MODE(trips.WEEK(stoptime))	NUM_UNIQUE(trips.HOUR(starttime))	NUM_UNIQUE(trips.HOUR(stoptime))	NUM_UNIQUE(trips.WEEK(starttime))	NUM_UNIQUE(trips.WEEK(stoptime))	PERCENT_TRUE(trips.IS_WEEKEND(starttime))	PERCENT_TRUE(trips.IS_WEEKEND(stoptime))	trip_count
from_station_id	time
5	2014-06-30 08:00:00	1.0	39.0	19.0	84.0	76.0	9.233333	39.000000	19.0	84.000000	76.000000	...	14.0	26.0	26.0	1.0	1.0	1.0	1.0	1.000000	1.000000	3
15	2014-06-30 08:00:00	3.0	19.0	15.0	84.2	280.0	14.166667	16.333333	15.0	76.733333	137.333333	...	7.0	27.0	27.0	3.0	2.0	2.0	2.0	0.333333	0.333333	1
16	2014-06-30 08:00:00	2.0	15.0	11.0	84.9	340.0	25.083333	13.000000	11.0	84.900000	324.500000	...	17.0	26.0	26.0	1.0	2.0	1.0	1.0	1.000000	1.000000	1
17	2014-06-30 08:00:00	1.0	15.0	15.0	84.9	183.0	4.650000	15.000000	15.0	84.900000	183.000000	...	17.0	26.0	26.0	1.0	1.0	1.0	1.0	1.000000	1.000000	2
19	2014-06-30 08:00:00	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	2

5 rows × 49 columns

There are two outputs from DFS: a feature matrix and feature definitions. The feature matrix is a table that contains the feature values with the corresponding labels based on the cutoff times. Feature definitions are features in a list that can be stored and reused later to calculate the same set of features on future data.

Machine Learning¶

In the previous steps, we generated the labels and features. The final step is to build the machine learning pipeline.

Splitting the Data¶

Let’s start by extracting the labels from the feature matrix and splitting the data into a training set and holdout set.

[10]:

y = fm.pop('trip_count')

splits = evalml.preprocessing.split_data(
    X=fm,
    y=y,
    test_size=0.1,
    random_state=0,
    regression=True,
)

X_train, X_holdout, y_train, y_holdout = splits

Finding the Best Model¶

Then, let’s run a search on the training set for the best machine learning model. During the search process, predictions from several different pipelines are evaluated to find the best pipeline.

[11]:

automl = evalml.AutoMLSearch(
    problem_type='regression',
    objective='r2',
    random_state=3,
    allowed_model_families=['extra_trees', 'random_forest'],
    max_pipelines=3,
)

automl.search(
    X=X_train,
    y=y_train,
    data_checks='disabled',
    show_iteration_plot=False,
)

Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************

Optimizing for R2.
Greater score is better.

Searching up to 3 pipelines.
Allowed model families: extra_trees, random_forest

(1/3) Mean Baseline Regression Pipeline        Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean R2: -0.044
(2/3) Extra Trees Regressor w/ Imputer + On... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean R2: 0.058
(3/3) Random Forest Regressor w/ Imputer + ... Elapsed:00:01
        Starting cross validation
        Finished cross validation - mean R2: -0.008

Search finished after 00:02
Best pipeline: Extra Trees Regressor w/ Imputer + One Hot Encoder
Best pipeline R2: 0.057850

Once the search is complete, we can print out information about the best pipeline found, such as the parameters in each component.

[12]:

automl.best_pipeline.describe()
automl.best_pipeline.graph()

******************************************************
* Extra Trees Regressor w/ Imputer + One Hot Encoder *
******************************************************

Problem Type: Regression
Model Family: Extra Trees

Pipeline Steps
==============
1. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * categorical_fill_value : None
         * numeric_fill_value : None
2. One Hot Encoder
         * top_n : 10
         * categories : None
         * drop : None
         * handle_unknown : ignore
         * handle_missing : error
3. Extra Trees Regressor
         * n_estimators : 100
         * max_features : auto
         * max_depth : 6
         * min_samples_split : 2
         * min_weight_fraction_leaf : 0.0
         * n_jobs : -1

[12]:

../_images/examples_predict_bike_trips_23_1.svg

Let’s score the model performance by evaluating predictions on the holdout set.

[13]:

best_pipeline = automl.best_pipeline.fit(X_train, y_train)

score = best_pipeline.score(
    X=X_holdout,
    y=y_holdout,
    objectives=['r2'],
)

dict(score)

[13]:

{'R2': 0.5404753013618724}

From the pipeline, we can see which features are most important for predictions.

[14]:

feature_importance = best_pipeline.feature_importance
feature_importance = feature_importance.set_index('feature')['importance']
top_k = feature_importance.abs().sort_values().tail(20).index
feature_importance[top_k].plot.barh(figsize=(8, 8), fontsize=14, width=.7);

[14]:

<AxesSubplot:ylabel='feature'>

../_images/examples_predict_bike_trips_27_1.png

Making Predictions¶

We are ready to make predictions with our trained model. First, let’s calculate the same set of features by using the feature definitions. We will use a cutoff time based on the latest information available in the dataset.

[15]:

fm = ft.calculate_feature_matrix(
    features=fd,
    entityset=es,
    cutoff_time=ft.pd.Timestamp('2014-07-02 08:00:00'),
    cutoff_time_in_index=True,
    verbose=False,
)

fm.head()

[15]:

		COUNT(trips)	MAX(trips.dpcapacity_end)	MAX(trips.dpcapacity_start)	MAX(trips.temperature)	MAX(trips.to_station_id)	MAX(trips.tripduration)	MEAN(trips.dpcapacity_end)	MEAN(trips.dpcapacity_start)	MEAN(trips.temperature)	MEAN(trips.to_station_id)	...	MODE(trips.HOUR(starttime))	MODE(trips.HOUR(stoptime))	MODE(trips.WEEK(starttime))	MODE(trips.WEEK(stoptime))	NUM_UNIQUE(trips.HOUR(starttime))	NUM_UNIQUE(trips.HOUR(stoptime))	NUM_UNIQUE(trips.WEEK(starttime))	NUM_UNIQUE(trips.WEEK(stoptime))	PERCENT_TRUE(trips.IS_WEEKEND(starttime))	PERCENT_TRUE(trips.IS_WEEKEND(stoptime))
from_station_id	time
186	2014-07-02 08:00:00	1	19.0	15.0	82.9	56	6.483333	19.000000	15.0	82.900000	56.000000	...	13	13	26	26	1	1	1	1	1.00000	1.00000
181	2014-07-02 08:00:00	8	27.0	31.0	84.9	233	18.350000	20.500000	31.0	83.212500	140.750000	...	16	16	27	27	5	5	2	2	0.37500	0.37500
177	2014-07-02 08:00:00	23	31.0	15.0	84.9	334	50.233333	17.434783	15.0	83.334783	219.565217	...	16	18	26	26	8	8	2	2	0.73913	0.73913
13	2014-07-02 08:00:00	3	19.0	19.0	82.9	165	13.916667	17.666667	19.0	82.600000	112.666667	...	13	13	26	26	2	2	1	1	1.00000	1.00000
153	2014-07-02 08:00:00	2	23.0	19.0	82.9	324	13.850000	19.000000	19.0	77.950000	219.500000	...	6	6	26	26	2	2	2	2	0.50000	0.50000

5 rows × 48 columns

Now, let’s predict the number of trips that will occur from a station in the next 13 hours.

[16]:

values = best_pipeline.predict(fm).values.round()

prediction = fm[[]]
prediction['trip_count (estimate)'] = values
prediction.head()

[16]:

		trip_count (estimate)
from_station_id	time
186	2014-07-02 08:00:00	2.0
181	2014-07-02 08:00:00	6.0
177	2014-07-02 08:00:00	5.0
13	2014-07-02 08:00:00	3.0
153	2014-07-02 08:00:00	3.0

Next Steps¶

At this point, we have completed the machine learning application. We can revisit each step to explore and fine-tune with different parameters until the model is ready for deployment. For more information on how to work with the features produced by Featuretools, take a look at the Featuretools documentation. For more information on how to work with the models produced by EvalML, take a look at the EvalML documentation.

Tutorials Predict Next Purchase