Innovation
How We Built: An Early-Stage Machine Learning Model for Recommendations
Written by Shayak Banerjee and Akshay Kashyap
In an earlier post, we described how the Personalization team at Peloton built out a recommender system to power the Daily Picks section on connected fitness devices (Bike, Bike+, and Tread). Our focus there was the system that productionized a Machine Learning model. This article focuses on the vital question of which model we chose and the reasoning behind that choice. As with the earlier article, this work took place while Peloton and the Personalization team were in the early stages of developing our Data and ML ecosystem. We hope the choices we made and the lessons we learned will be valuable as you design and evolve your own recommender systems.
Why did we need a model?

For those not familiar with the product, Your Daily Picks is a section on Peloton connected fitness device home screens that surfaces a few recommendations to the user, along with a reason for suggesting each one (Fig. 1). The three most prominent cards offer For Your Usual, For Something New, and For A Quick Workout. Around March of 2020, we were powering For Your Usual using AWS Personalize, which is backed by a hierarchical recurrent neural network-based model. We used this service as a quick way to prove the power of model-based recommendations in the user experience. At the time, the service had two shortcomings for our needs:
It only offered a ranked list of predictions, without any associated model scores. Without this information, we could not tell how much the top-10 predictions differed in suitability from one another and had to resort to always picking the top-ranked prediction. This led to staleness in the recommendations: a user who did not take their For Your Usual recommendation would end up seeing the same one several days in a row.
Around this time, we were homing in on the concept of using rankers and filters to power multiple recommendation placements. As an example, we could combine a ranked list of predictions with a filter that selects shorter classes to power For a Quick Workout (see the sketch below). This required larger lists of predictions than were available from AWS Personalize.
It was also critical that we owned the modeling, so that we could personalize content more precisely and with a greater degree of interpretability. As a Machine Learning team, owning the model would let us develop and improve it over time; outsourcing it would deny the team that opportunity for growth and learning. The time felt right to design our own model to solve these problems.
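To make the ranker-and-filter idea concrete, here is a minimal sketch. The function name and the duration lookup are illustrative assumptions, not our production code: one ranked list per user can serve several placements once placement-specific filters are applied on top of it.

```python
def for_a_quick_workout(ranked_class_ids, class_duration_minutes, max_minutes=20, k=3):
    """Keep the highest-ranked classes short enough for a quick workout."""
    quick = [
        class_id
        for class_id in ranked_class_ids
        # Unknown durations are excluded rather than assumed to be short.
        if class_duration_minutes.get(class_id, float("inf")) <= max_minutes
    ]
    return quick[:k]


# Example: one ranked list, filtered for a single placement.
ranked = ["c17", "c42", "c08", "c99"]
durations = {"c17": 45, "c42": 15, "c08": 20, "c99": 30}
print(for_a_quick_workout(ranked, durations))  # ['c42', 'c08']
```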
Considerations for Designing a Model
Making it Uniquely Peloton
Peloton classes are diverse along many dimensions. Each class has – amongst other parameters – a vibe, a unique script, an instructor, a carefully selected music playlist, a duration, and a certain level of difficulty. Our in-house model, first and foremost, had to capture this diversity and be able to surface classes based on the preferences and abilities of each user. For someone who loves rock music, we had to be able to surface classes with rock-heavy playlists. For a cyclist who prefers to stay in the saddle, we had to be able to recommend low-impact rides. For someone who usually can only devote 20 minutes to working out, we had to be able to display shorter-duration classes.
Batch Compute, Cache, and Retrieve
Our system generated recommendations offline via batch processing and cached them for retrieval when a user started their session. This meant we could avoid addressing some tough online inference problems such as prediction latency and candidate selection for scoring. Our only constraint on processing times was to ensure that members would have their recommendations refreshed daily.
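As a rough illustration of this pattern, the sketch below precomputes and caches ranked classes for each user; the function and parameter names are hypothetical stand-ins, not our actual job or datastore.

```python
from typing import Callable, Dict, List


def refresh_recommendations(
    user_ids: List[str],
    rank_classes_for_user: Callable[[str], List[str]],  # returns ranked class IDs
    cache: Dict[str, List[str]],                         # stand-in for a key-value store
    top_k: int = 50,
) -> None:
    """Daily batch job sketch: precompute each user's ranked classes and cache
    them so a device session only needs a fast key lookup."""
    for user_id in user_ids:
        cache[f"recs:{user_id}"] = rank_classes_for_user(user_id)[:top_k]
```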

Platform Maturity
A fundamental component of training a model is the ready availability of behavioral data, e.g., which classes were taken by the member, in what context, and with what browsing behaviors preceding or succeeding them. When we designed this in-house model, Peloton’s data platform was in the nascent stages of construction. These various sources of explicit data were in disparate locations and not readily queryable by our batch jobs, which limited us to primarily using user-class consumption data and learning about these interactions implicitly instead. We were also missing critical ML Platform components such as a feature store, which made it harder to adopt feature-driven, content-based filtering models.
Speed of Development vs. Model Performance
While we wanted to carefully think through the various implications of an in-house model, we needed to develop a first iteration within a reasonable time frame, so as to keep making progress on product features. At the same time, we would be measuring this in-house model against the hierarchical recurrent neural network, or HRNN, behind AWS Personalize, and its performance had to be comparable in order for us to justify switching.
Communicating with Product
Personalization was the first Machine Learning team at Peloton, and the awareness of ML as a practice or its value to the product was in its infancy at that point. We were challenged to educate the organization. Things that are obvious to ML engineers, such as the value of representing users/classes in high-dimensional space, were not immediately clear or easy to explain to an external audience. At the same time, we were delighted with how quickly our product managers picked up information and helped pave the way with our stakeholders.
Introducing LSTMNet
Based on the above considerations, we chose to frame recommending Peloton classes as a sequential modeling problem, which we approach with a long short-term memory (LSTM)-based deep neural network. Why sequential modeling? There were two reasons: (a) classes released by Peloton are serial in nature, e.g., The Jess King Experience is a 12-class cycling series released over 2 years, while Power Yoga is a series covering hundreds of classes; (b) users’ classes follow progressions over time. Over long periods of time, we may see users move from shorter to longer classes, or toward more difficult classes. Over shorter periods, sequences may involve pairing classes for a more complete workout, e.g., a 5 min. warm-up before a 30 min. class, or an upper body strength class after a cycling class. We derived inspiration from the prior use of sequential modeling in next-item prediction, machine translation, and natural language generation tasks. A fundamental difference in our problem is that a user’s sequence of classes is taken over a longer period of time, sometimes covering weeks and months. Our neural network learns high-dimensional embeddings of users as a function of their class consumption history, which can be used downstream for ranking class preference for each user. Here, we show the architecture of the model.
The architecture consists of two components:
An Embedding layer that maps sparse Class IDs (represented by Ct in the figure) to dense vector representations, which are our “Class Embeddings” (represented by Et).
An LSTM layer that generates a representation of a fixed-length sequence of class embeddings. The main insight here is that we treat this sequence representation of a user’s history of classes as a dense representation of that user, i.e., the “user embedding”.
Why the name LSTMNet? We do not claim credit for it; we derived the inspiration and code from Maciej Kula’s excellent Spotlight library. LSTMNet’s objective is to minimize the cosine distance between the user embedding up to each timestep (in the sequence we feed into the network) and the embedding of the class at the immediately following timestep. To make the network more robust, we also present negative examples during training: classes sampled at random from the library amongst those the user did not take. We expect the distance for these negative examples to be significantly higher than for the positive examples.
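To make the architecture and objective concrete, here is a minimal PyTorch-style sketch. It is not our production code (which builds on Spotlight): the layer sizes, margin value, and loss formulation are illustrative assumptions that follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTMNetSketch(nn.Module):
    """Class-ID embeddings feeding an LSTM; the hidden state at each timestep
    is treated as the user embedding up to that point."""

    def __init__(self, num_classes: int, embedding_dim: int = 64):
        super().__init__()
        # Index 0 is reserved for padding shorter sequences.
        self.class_embeddings = nn.Embedding(num_classes + 1, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, embedding_dim, batch_first=True)

    def forward(self, class_id_sequences: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) -> (batch, seq_len, embedding_dim)
        embedded = self.class_embeddings(class_id_sequences)
        user_states, _ = self.lstm(embedded)  # user embedding at every timestep
        return user_states

    def score(self, user_states: torch.Tensor, class_ids: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between user embeddings and candidate class embeddings.
        return F.cosine_similarity(user_states, self.class_embeddings(class_ids), dim=-1)

    def training_loss(self, sequences: torch.Tensor, negatives: torch.Tensor,
                      margin: float = 0.5) -> torch.Tensor:
        """Illustrative margin loss: the class taken at the next timestep should be
        closer to the user state than a randomly sampled negative class.
        (Masking of padded timesteps is omitted for brevity.)"""
        user_states = self.forward(sequences[:, :-1])          # states up to each timestep
        positive = self.score(user_states, sequences[:, 1:])    # next class in the sequence
        negative = self.score(user_states, negatives[:, 1:])    # random untaken classes
        return F.relu(margin - positive + negative).mean()
```

The margin loss above is just one way to express the intuition that the positive (next-in-sequence) class should score higher than the sampled negatives.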
Aside from being able to use this in-house model for multiple placements, having user and class embeddings readily available gave us a few other benefits:
We could now quantify users’ preferences for classes with a bounded distance/similarity metric. This allowed us to better recommend classes not just for the For Your Usual module but also for the For Something New module, since we could now gauge which classes a user may not have a strong affinity to (but does not have a strong preference against either).
The class embeddings enable better clustering and nearest-neighbor searches over classes for use in other modules and features. For example, we were now able to look up similar classes via a nearest-neighbor search in this vector space, as sketched below.
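Here is a minimal sketch of such a lookup, reusing the hypothetical LSTMNetSketch model above; in practice an approximate nearest-neighbor index would typically replace this brute-force search.

```python
import torch
import torch.nn.functional as F


def similar_classes(model: "LSTMNetSketch", class_id: int, k: int = 10) -> list:
    """Return the k nearest classes to `class_id` in the learned embedding space."""
    all_embeddings = model.class_embeddings.weight.detach()     # (num_classes + 1, dim)
    query = all_embeddings[class_id].unsqueeze(0)
    similarities = F.cosine_similarity(query, all_embeddings, dim=-1)
    similarities[class_id] = float("-inf")   # exclude the query class itself
    similarities[0] = float("-inf")          # exclude the padding index
    return torch.topk(similarities, k).indices.tolist()
```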

Learnings from LSTMNet
For much of our team, this was the first foray into productionizing a sequential recommendation model. Below are some learnings from this effort, which we hope will be useful for other teams pursuing similar model architectures.
The Quirks of Generating Test Data
A model needs to be evaluated offline prior to being released, so as to have confidence in its predictive power. We were primarily interested in user-level evaluation (as opposed to item-level validation) because the efficacy of the recommendations is also measured by user-level conversion. As it turned out, evaluating sequential recommenders can be tricky.
As shown in Figure 4, the standard 80/20 random split of data into a training and test set does not work. By removing data from the middle of sequences of classes taken, we effectively break the sequence, degrading the predictive power of the model. We can alternatively hold out a handful of the most recent classes taken for each user, but that has a problem as well. Newly aired classes are highly likely to end up amongst the most recent classes. If these classes are not in the training data, we cannot learn their embeddings. This causes evaluation to fail.
Our current approach selects the most recent classes taken by a subset of users as our test data (see the sketch after this list). This overcomes the problem of missing embeddings while still generating a sizable enough dataset for evaluation. We have two controls over the size of this dataset:
The number of most recently taken classes held out. While increasing this will lead to a larger test set, it reduces the predictive power for these test users, since their training sequences are shorter.
The fraction of users for whom we hold out most recent classes. Increasing this fraction increases the size of the test set, but also increases the chances of running into classes with missing embeddings – especially if some newly aired classes were taken only by the test users.
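Below is a minimal pandas sketch of this split; the column names and dataframe layout are assumptions for illustration, not our actual schema.

```python
import pandas as pd


def split_train_test(interactions: pd.DataFrame, holdout_users: set, n_holdout: int = 3):
    """Hold out the n most recent classes for a chosen subset of users as test data;
    everything else (including all other users' full histories) stays in training.
    Assumes columns: user_id, class_id, taken_at."""
    interactions = interactions.sort_values(["user_id", "taken_at"])
    is_holdout_user = interactions["user_id"].isin(holdout_users)
    # Rank each user's classes from most recent (0) to oldest.
    recency_rank = interactions.groupby("user_id").cumcount(ascending=False)
    in_test = is_holdout_user & (recency_rank < n_holdout)
    return interactions[~in_test], interactions[in_test]
```

The two controls above map directly to `n_holdout` and to the size of `holdout_users`.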

The Challenges of Ongoing Evaluation
LSTMNet is trained every day. This allows it to pick up newly released classes in the inventory, as well as new user-class interactions. When we first released the model, we were generating evaluation metrics daily and comparing these metrics against what the previous model in production had generated. If the staged model metrics did not exhibit any large deviations from the previous production model, we would promote the staged model into production, replacing the old one (Figure 5).

This method of comparison has one fundamental flaw: we were evaluating the two models on two different datasets! Even though the test datasets were generated in the same way, they were not exactly the same, so a fair comparison is not guaranteed. This approach could only help with detecting anomalies in newly trained models and guaranteeing some baseline performance. It could not discern any performance differences between the currently trained model and the production model. We had two options for evaluating on the same dataset:
Compare both models against the generated test data for the staged model. With this approach, the curse of missing embeddings strikes us again. The staged model contains some recently released classes that the production model has not seen during training, and hence does not have embeddings for.
Compare both models against the generated test data for the production model. This also has a flaw: the staged model has actually been trained on some of the data that was hidden from the production model. This data leakage means the staged model is very likely to outperform the production model on this dataset, which is not necessarily indicative of a better model in practice.


We formulated an alternative approach to ongoing model evaluation. Periodically (once a week), we select a set of users and a cutoff time for each user. The cutoff time represents the last class for this user that we are allowed to “see” during model training. For each training run, we read in this set of users and their cutoff times and prune the training set to remove any of their classes taken after the cutoff. Then, to compare the current staged model and the production model, we evaluate against the current production test data (as in the second option above), but only try to predict the next N classes after the cutoff date for these test users, where N is a tunable parameter.
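A minimal sketch of the pruning step, again with assumed column names and data layout:

```python
import pandas as pd


def prune_training_set(interactions: pd.DataFrame, eval_cutoffs: dict) -> pd.DataFrame:
    """Remove any classes taken after the cutoff time for held-out evaluation users,
    so that both the staged and the production model are blind to them.
    `eval_cutoffs` maps user_id -> cutoff timestamp; other users are untouched."""
    cutoff = pd.to_datetime(interactions["user_id"].map(eval_cutoffs))
    keep = cutoff.isna() | (interactions["taken_at"] <= cutoff)
    return interactions[keep]
```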
LSTMNet – Now for the Bad News
LSTMNet has served us for a few years and helped us expand recommendations to multiple platforms, shaping the user experience on Peloton. In various A/B tests, it gave us parity with the existing AWS Personalize service, as well as a ~150% increase in workout conversion over existing heuristics. It enabled us to do so without having to invest heavily in complex feature engineering, online datastores, or explicit signal collection. However, as our user base grows in size and diversity, we are running into limitations of this model architecture, some of which are described below.
Low Contextual Relevance
LSTMNet learns representations of users and classes, but these representations are not context-dependent. For example, it does not learn that a particular user prefers outdoor running when the weather is sunny outside, or that the same user would likely take a meditation class when using the Peloton app at night. Exercising is such a contextual activity – related to time, weather conditions, mood, and body state – that learning a static representation is a limiting approach.
Batch Predictions Don’t Scale With Context
Tied into the above, even if we were to model context in our architectures, we would still be limited on the recommendation serving side. With batch predictions, we can generate recommendations for different contextual scenarios, but this approach quickly stops being scalable. Imagine having to generate three recommendations for a user – one each for morning, afternoon, and evening – and having to store all of them offline, to be retrieved based on the time the user logs in. Instead, we are better off generating recommendations online, with real-time context as an input.
Where are the New Classes?
A batch-trained model that typically operates on data from the day before does not see many recently released classes during training. This lack of embeddings for recent classes means that they do not get recommended at all. Further, we found a curious pattern with LSTMNet: older classes have a slightly higher chance of being surfaced. This is not entirely surprising, given that older classes have had a chance to be consumed by more users, are part of more sequences, and are hence more likely to be recommended. This is a different kind of “popularity bias”, where popularity is proportional to the age of the class.
Are You Sure it Isn’t Overfitting?
To be honest, no, we are not. Given how we construct our test data, the train/test data ratio is heavily skewed towards the former. As mentioned earlier, getting a higher proportion of test data for this sequential model is extremely difficult, due to the curse of missing embeddings and data leakage.
What about New Users?
LSTMNet requires a user to have taken a certain number of classes before their recommendations become meaningful. Like any collaborative-filtering model, cold start – and even “cool” start – is a problem. Our new users are shown recommendations via a different mechanism to compensate for this shortcoming.
Looking to The Future
At the time of writing this article, we are already looking ahead to a new model architecture that can help us overcome the above limitations. We want to make our recommendations more contextually relevant, and for that we plan to take our models and systems online so recommendations are generated in real time. We are also looking for model architectures that solve cold start for users and classes while still preserving recommendation quality. We have already identified an exciting model architecture, and a corresponding ecosystem of tools, to achieve this dream. Stay tuned for more!