At Lyft, understanding how riders move through our user experience is fundamental to operating a healthy marketplace. In particular, we need a robust model of whether a rider will actually request a ride after entering a destination and viewing a price and ETA. Accurately predicting this decision, which we call conversion, informs countless decisions across our platform: balancing supply and demand, improving user experiences, optimizing recommendations and advertising, understanding long-term engagement, deciding how to distribute coupons, and more. Rider conversion prediction is a central challenge for the Lyft business.
However, predicting human behavior at scale is incredibly complex: the exact same person might open the app just to check current availability, or to actually request a ride after viewing our prices. The contexts in which riders make their conversion decisions are extremely diverse and almost unique to each session. A user’s intent changes based on where they are and where they want to go, what time it is, their previous interactions with the platform, and current supply-demand market conditions, to name a few.
When we try to model this using standard machine learning approaches, we run into a significant challenge: data sparsity.
The Challenge of High Cardinality and Sparsity
To accurately predict conversion, we need to slice our data very thinly across many categorical features. Imagine trying to predict the conversion probability for a business traveler leaving the suburbs of Detroit at 4:00 AM on a Tuesday to catch their flight at the airport 30 minutes away. While Lyft has vast amounts of data overall, the amount of data available for that specific intersection of contexts is often tiny. Maybe we only have ten examples in history.
If we use standard techniques like Gradient Boosted Trees (e.g., LGBM, XGBoost), we encounter severe overfitting. A standard model looking at 10 examples in the training data where 8 converted might confidently predict an 80% conversion probability. But that’s likely statistical noise. The next 10 examples might show only 2 conversions.
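To see how wide that uncertainty really is, here is a quick back-of-the-envelope check (an illustrative sketch, not part of our production pipeline) using the standard Wilson score interval for a binomial proportion:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(8, 10)
print(f"95% CI for 8/10 conversions: ({lo:.2f}, {hi:.2f})")
```

Eight conversions out of ten sessions is compatible with true rates anywhere from roughly 50% to 94%, so a point estimate of 80% is far less trustworthy than it looks.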
One could argue that more complex models, such as Deep Neural Networks or even Large Language Models used few-shot, could handle this sparsity if properly tuned and calibrated, but that comes at the cost of greatly increased inference time. Predicting session conversion is very often needed live, while the rider interacts with our systems, and we want to predict their conversion confidently as soon as we know their destination. And of course, no one wants to stare at a spinning loading wheel. When you think about the user experience, we need a model that can handle an extremely high-cardinality set of situations, provide robust and accurate predictions even when data is scarce, and serve those predictions with ultra-low latency.
Overcoming Sparsity with Bayesian Trees
To solve this, our team developed a modeling framework to predict session conversion in real time. At its core, it is a highly optimized, hierarchical lookup structure designed to handle sparse categorical data by leveraging Bayesian theory. It incorporates priors (knowledge gained from broader contexts) to smooth predictions in specific, data-rare contexts.
The Hierarchical Tree Structure
We organize the training data into a tree structure based on a hierarchy of partition keys. These keys define the context. While the exact keys we use depend on the specific application, imagine a hierarchy like this:
- Root: All the sessions
- Level 1 Split (e.g., Spatial Context): City Region
- Level 2 Split (e.g., Temporal Context): Time of Day / Day of Week
- Level 3 Split (e.g., Congestion Context): Current local supply-demand balance
- …
As we move down the tree, the context becomes more and more specific, and the data available at each node becomes sparser and sparser.
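The hierarchy above can be sketched as a nested grouping of sessions. This is a hypothetical illustration (the partition key names `region`, `time_of_day`, and `supply_demand` are made up for the example; the actual keys depend on the application):

```python
from collections import defaultdict

# Hypothetical partition keys, ordered from most general to most specific.
PARTITION_KEYS = ["region", "time_of_day", "supply_demand"]

def build_tree(sessions, keys=PARTITION_KEYS):
    """Group sessions into a nested dict: each level splits on the next key.
    Leaves hold the raw session records for that fully-specified context."""
    if not keys:
        return sessions
    groups = defaultdict(list)
    for s in sessions:
        groups[s[keys[0]]].append(s)
    return {k: build_tree(v, keys[1:]) for k, v in groups.items()}

sessions = [
    {"region": "SF", "time_of_day": "am", "supply_demand": "high", "converted": 1},
    {"region": "SF", "time_of_day": "am", "supply_demand": "low", "converted": 0},
    {"region": "NYC", "time_of_day": "pm", "supply_demand": "high", "converted": 1},
]
tree = build_tree(sessions)
print(tree["SF"]["am"]["high"])  # sessions in one fully-specified leaf context
```

Each additional key multiplies the number of leaves, which is exactly why the per-leaf sample counts shrink so quickly.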
The Secret: Bayesian Smoothing with Gaussian Priors
How do we deal with a leaf node that only has five data points? We cannot trust those five points alone and train an independent model, but we also cannot rely on a single global model trained on all the data, as it would miss the specifics of this leaf node that drive rider conversion decisions.
Our conversion model uses the statistical properties of the parent node to inform the prediction at the child node. This is achieved through Bayesian smoothing, specifically using Gaussian priors on model parameters. It allows data-sparse segments (e.g., a leaf node in the tree) to default to the robust average of their parent group.
Here is the intuition:
- Model architecture: Each node in the tree is a copy of the same parametric model. It can be a logistic regression, a support vector machine, or even a shallow neural network. To make a prediction y from an input x, you rely on a parametric model f that depends on trainable parameters Θ such that y = f(x, Θ). Each node has its own version of Θ that we want to be as specific as possible to the segment the node represents.
- Training: The tree is trained top-down, starting from the root, level by level, down to the leaves, with less and less data as we go deeper. For each node, its parent has been trained on more data, so Θ_parent provides a strong belief about what Θ_child should look like. When fitting the child’s model on the child’s specific data, we therefore add an L2 penalty for diverging too far from the parent’s parameters, λ·||Θ_child − Θ_parent||². This is equivalent to placing a Gaussian Bayesian prior on Θ_child, centered at the parent’s parameters.
The choice of the regularization strength λ is critical and should depend on the relative data sizes of the parent node and its children, so that the model automatically balances trust in the parent against specialization to the child’s data. If a leaf node has very little data, its prediction is heavily pulled towards the parent’s stable mean. As the leaf node acquires more data, the model gradually trusts the local data more, moving the prediction away from the parent’s prior and towards the local raw average.
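Putting these pieces together, the parent-to-child training step can be sketched in a few lines. This is an illustrative toy implementation, not Lyft’s production code: the node model is a plain logistic regression fit by gradient descent, and λ is set inversely proportional to the child’s sample size so that small nodes trust the parent more.

```python
import numpy as np

def fit_logistic(X, y, theta_prior, lam, lr=0.1, steps=2000):
    """Fit a logistic regression with a Gaussian prior centered at theta_prior,
    i.e. minimize mean log-loss + lam * ||theta - theta_prior||^2."""
    theta = theta_prior.copy()
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ theta)))
        grad = X.T @ (p - y) / len(y) + 2 * lam * (theta - theta_prior)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
# Parent node: plenty of data, fit with an uninformative (zero-centered) prior.
true_theta = np.array([1.0, -0.5])
X_parent = rng.normal(size=(2000, 2))
y_parent = (rng.random(2000) < 1 / (1 + np.exp(-(X_parent @ true_theta)))).astype(float)
theta_parent = fit_logistic(X_parent, y_parent, np.zeros(2), lam=0.0)

# Child node: only a handful of sessions; shrink its fit toward the parent.
X_child, y_child = X_parent[:8], y_parent[:8]
lam = 1.0 / len(y_child)  # smaller nodes => stronger pull toward the parent
theta_child = fit_logistic(X_child, y_child, theta_parent, lam=lam)
```

With only eight child examples, the penalty keeps Θ_child close to Θ_parent; as the child accumulates data, the data term dominates and Θ_child drifts toward its own local fit.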
Ensuring Behavioral Consistency
Beyond handling sparsity, we often have domain knowledge about how conversion should behave relative to certain continuous variables. For instance, consider the rider’s historical conversion rate: intuitively, as it increases, the predicted conversion probability for the current session should also increase. All else being equal, we should predict a higher conversion probability for a rider with a 90% historical rate than for one with a 40% historical rate. The model should enforce a monotonic relationship between historical conversion and the current session’s conversion.
However, standard ML models can learn erratic shapes that violate this intuitive logic due to noise in the training data, ending up with non-monotonic relationships between an input and the output where there shouldn’t be any. With Bayesian trees, we can keep very simple parametric models at each node because each node’s context is already very specific (that is the whole point of Bayesian trees, right?). Simpler models, e.g. a logistic regression cvr = σ(Θ · cvr_hist), give much more control over monotonicity (e.g., by enforcing Θ > 0) and ensure appropriate model behavior, which also improves explainability. This guarantees that the model’s outputs are not only statistically robust but also directionally intuitive, reliable, and interpretable.
Conclusion
Bayesian Conversion models represent a significant step forward in our ability to model rider conversion decision-making in highly dynamic, sparse environments. By combining a highly structured hierarchical approach with the robust statistical grounding of Bayesian smoothing, we can generate accurate, stable predictions where traditional models fail. This architecture allows us to be hyper-local and specific when the data supports it, while gracefully falling back to broader, stable trends when faced with the unknown. It’s a key piece of infrastructure that helps Lyft make smarter decisions in real-time.
Lyft is hiring! If you’re passionate about developing state of the art machine learning/optimization models or building the infrastructure that powers them, read more about them on our blog and join our team.
Source: eng.lyft.com
