At Lyft, understanding how riders move through our user experience is fundamental to operating a healthy marketplace. In particular, we need a robust model of whether a rider will actually request a ride after entering a destination and viewing a price and ETA. Accurately predicting this decision, which we call conversion, informs countless decisions across our platform: balancing supply and demand, improving user experiences, optimizing recommendations and advertising, understanding long-term engagement, deciding how to distribute coupons, and more. Rider conversion prediction is a central challenge for the Lyft business.
However, predicting human behavior at scale is incredibly complex: the exact same person might open the app just to check current availability, or to actually request a ride after viewing our prices. The contexts in which riders make their conversion decisions are extremely diverse and almost unique to each session. A user’s intent changes based on where they are and where they want to go, what time it is, their previous interactions with the platform, and current supply-demand market conditions, to name a few.
When we try to model this using standard machine learning approaches, we run into a significant challenge: data sparsity.
The Challenge of High Cardinality and Sparsity
To accurately predict conversion, we need to slice our data very thinly across many categorical features. Imagine trying to predict the conversion probability for a business traveler leaving the suburbs of Detroit at 4:00 AM on a Tuesday to catch their flight at the airport 30 minutes away. While Lyft has vast amounts of data overall, the amount of data available for that specific intersection of contexts is often tiny. Maybe we only have ten examples in history.
If we use standard techniques like Gradient Boosted Trees (e.g., LGBM, XGBoost), we encounter severe overfitting. A standard model looking at 10 examples in the training data where 8 converted might confidently predict an 80% conversion probability. But that’s likely statistical noise. The next 10 examples might show only 2 conversions.
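To see how wide that uncertainty really is, here is a quick back-of-the-envelope check (an illustrative sketch, not part of our production pipeline) using the standard Wilson score interval for a binomial proportion:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(8, 10)
print(f"95% CI for 8/10 conversions: ({lo:.2f}, {hi:.2f})")
```

Eight conversions out of ten sessions is compatible with true rates anywhere from roughly 50% to 94%, so a point estimate of 80% is far less trustworthy than it looks.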
One could argue that more complex models, such as Deep Neural Networks or even Large Language Models used few-shot, could handle this sparsity if properly tuned and calibrated, but that comes at the cost of greatly increased inference time. Predicting session conversion is very often needed live, while the rider interacts with our systems, and we want to predict their conversion confidently as soon as we know their destination. And of course, no one wants to stare at a spinning loading wheel. When you think about the user experience, we need a model that can handle an extremely high-cardinality set of situations, provide robust and accurate predictions even when data is scarce, and serve those predictions with ultra-low latency.
Overcoming Sparsity with Bayesian Trees
To solve this, our team developed a modeling framework to predict session conversion in real time. At its core, it is a highly optimized, hierarchical lookup structure designed to handle sparse categorical data by leveraging Bayesian theory. It incorporates priors (knowledge gained from broader contexts) to smooth predictions in specific, data-rare contexts.
The Hierarchical Tree Structure
We organize the training data into a tree structure based on a hierarchy of partition keys. These keys define the context. While the exact keys we use depend on the specific application, imagine a hierarchy like this:
- Root: All the sessions
- Level 1 Split (e.g., Spatial Context): City Region
- Level 2 Split (e.g., Temporal Context): Time of Day / Day of Week
- Level 3 Split (e.g., Congestion Context): Current local supply-demand balance
- …
As we move down the tree, the context becomes more and more specific, and the data available at each node becomes sparser and sparser.
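The hierarchy above can be sketched as a nested grouping of sessions. This is a hypothetical illustration (the partition key names `region`, `time_of_day`, and `supply_demand` are made up for the example; the actual keys depend on the application):

```python
from collections import defaultdict

# Hypothetical partition keys, ordered from most general to most specific.
PARTITION_KEYS = ["region", "time_of_day", "supply_demand"]

def build_tree(sessions, keys=PARTITION_KEYS):
    """Group sessions into a nested dict: each level splits on the next key.
    Leaves hold the raw session records for that fully-specified context."""
    if not keys:
        return sessions
    groups = defaultdict(list)
    for s in sessions:
        groups[s[keys[0]]].append(s)
    return {k: build_tree(v, keys[1:]) for k, v in groups.items()}

sessions = [
    {"region": "SF", "time_of_day": "am", "supply_demand": "high", "converted": 1},
    {"region": "SF", "time_of_day": "am", "supply_demand": "low", "converted": 0},
    {"region": "NYC", "time_of_day": "pm", "supply_demand": "high", "converted": 1},
]
tree = build_tree(sessions)
print(tree["SF"]["am"]["high"])  # sessions in one fully-specified leaf context
```

Each additional key multiplies the number of leaves, which is exactly why the per-leaf sample counts shrink so quickly.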
The Secret: Bayesian Smoothing with Gaussian Priors
How do we deal with a leaf node that only has five data points? We cannot trust those five points alone and train an independent model, but we also cannot rely on a single global model trained on all the data, as it would miss the specifics of this leaf node that drive rider conversion decisions.
Our conversion model uses the statistical properties of the parent node to inform the prediction at the child node. This is achieved through Bayesian smoothing, specifically using Gaussian priors on model parameters. It allows data-sparse segments (e.g., a leaf node in the tree) to default to the robust average of their parent group.
Here is the intuition:
- Model architecture: Each node in the tree is a copy of the same parametric model. It can be a logistic regression, a support vector machine, or even a shallow neural network. To make a prediction y from an input x, you rely on a parametric model f that depends on trainable parameters Θ such that y = f(x, Θ). Each node has its own version of Θ that we want to be as specific as possible to the segment the node represents.
- Training: The tree is trained top-down, starting from the root, level by level, down to the leaves, with less and less data as we go deeper. For each node, its parent has been trained on more data, so Θ_parent provides a strong belief about what Θ_child should look like. When fitting the child’s model on the child’s specific data, we therefore add an L2 penalty for diverging too far from the parent’s parameters, λ·||Θ_child − Θ_parent||². This is equivalent to placing a Gaussian Bayesian prior on Θ_child, centered at the parent’s parameters.
The choice of the regularization strength λ is critical and should depend on the relative data sizes of the parent node and its children, so that the model automatically balances trust in the parent against specialization to the child’s data. If a leaf node has very little data, its prediction is heavily pulled towards the parent’s stable mean. As the leaf node acquires more data, the model gradually trusts the local data more, moving the prediction away from the parent’s prior and towards the local raw average.
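Putting these pieces together, the parent-to-child training step can be sketched in a few lines. This is an illustrative toy implementation, not Lyft’s production code: the node model is a plain logistic regression fit by gradient descent, and λ is set inversely proportional to the child’s sample size so that small nodes trust the parent more.

```python
import numpy as np

def fit_logistic(X, y, theta_prior, lam, lr=0.1, steps=2000):
    """Fit a logistic regression with a Gaussian prior centered at theta_prior,
    i.e. minimize mean log-loss + lam * ||theta - theta_prior||^2."""
    theta = theta_prior.copy()
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ theta)))
        grad = X.T @ (p - y) / len(y) + 2 * lam * (theta - theta_prior)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
# Parent node: plenty of data, fit with an uninformative (zero-centered) prior.
true_theta = np.array([1.0, -0.5])
X_parent = rng.normal(size=(2000, 2))
y_parent = (rng.random(2000) < 1 / (1 + np.exp(-(X_parent @ true_theta)))).astype(float)
theta_parent = fit_logistic(X_parent, y_parent, np.zeros(2), lam=0.0)

# Child node: only a handful of sessions; shrink its fit toward the parent.
X_child, y_child = X_parent[:8], y_parent[:8]
lam = 1.0 / len(y_child)  # smaller nodes => stronger pull toward the parent
theta_child = fit_logistic(X_child, y_child, theta_parent, lam=lam)
```

With only eight child examples, the penalty keeps Θ_child close to Θ_parent; as the child accumulates data, the data term dominates and Θ_child drifts toward its own local fit.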
Ensuring Behavioral Consistency
Beyond handling sparsity, we often have domain knowledge about how conversion should behave relative to certain continuous variables. For instance, consider the rider’s historical conversion rate: intuitively, as it increases, the predicted conversion probability for the current session should also increase. All else being equal, we should predict a higher conversion probability for a rider with a 90% historical rate than for one with a 40% historical rate. The model should enforce a monotonic relationship between historical conversion and the current session’s conversion.
However, standard ML models can learn erratic shapes that violate this intuitive logic due to noise in the training data, ending up with non-monotonic relationships between an input and the output where there shouldn’t be any. With Bayesian trees, we can keep very simple parametric models at each node because each node’s context is already very specific (that is the whole point of Bayesian trees, right?). Simpler models, e.g. a logistic regression cvr = σ(Θ · cvr_hist), give much more control over monotonicity (e.g., by enforcing Θ > 0) and ensure appropriate model behavior, which also improves explainability. This guarantees that the model’s outputs are not only statistically robust but also directionally intuitive, reliable, and interpretable.
Conclusion
Bayesian Conversion models represent a significant step forward in our ability to model rider conversion decision-making in highly dynamic, sparse environments. By combining a highly structured hierarchical approach with the robust statistical grounding of Bayesian smoothing, we can generate accurate, stable predictions where traditional models fail. This architecture allows us to be hyper-local and specific when the data supports it, while gracefully falling back to broader, stable trends when faced with the unknown. It’s a key piece of infrastructure that helps Lyft make smarter decisions in real-time.
Lyft is hiring! If you’re passionate about developing state of the art machine learning/optimization models or building the infrastructure that powers them, read more about them on our blog and join our team.
Source: eng.lyft.com
