The Feature Store Architecture: A Platform of Platforms
Our system is best described as a platform of platforms. While the full architecture diagram is complex, we can break it down into three digestible components: Batch, Online, and Streaming features.
Batch Feature Ingestion and Serving
Batch features are the most widely used family of features within our platform. These features are defined from existing Hive data tables and represent a set of standardized data points that are calculated and refreshed on a set cadence, typically on a daily basis.
The ingestion process begins when customers define features using a Spark SQL query and a simple JSON file representing the dedicated configuration metadata.
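To make this concrete, here is a minimal sketch of what such a configuration might look like. The field names (feature_group, owner, tier, schedule, and so on) are illustrative assumptions, not Lyft's actual schema:

```python
import json

# A hypothetical batch feature configuration: a Spark SQL query plus
# metadata. All field names here are illustrative, not Lyft's schema.
config = {
    "feature_group": "rider_trip_stats",
    "owner": "ds-platform@example.com",
    "tier": 1,          # urgency tier, used for alerting/monitoring
    "schedule": "daily",
    "sql": (
        "SELECT user_id, COUNT(*) AS trips_7d "
        "FROM hive.trips WHERE ds >= date_sub(current_date, 7) "
        "GROUP BY user_id"
    ),
    "features": [
        {"name": "trips_7d", "type": "int", "version": 1}
    ],
}

# The cron service would parse configs like this before generating a DAG.
parsed = json.loads(json.dumps(config))
print(parsed["features"][0]["name"])
```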
A Python cron service reads these configurations and automatically generates an Astronomer-hosted Airflow Directed Acyclic Graph (DAG). Crucially, these generated DAGs are production-ready out-of-the-box. They handle:
- Executing the Spark SQL query to compute the feature data
- Storing the feature data to both the offline and online data paths
- Running integrated data quality checks
- Registering feature metadata so features are discoverable

The executed DAG generates a dataframe and delivers the results to two distinct paths:
- Offline Data Path: The feature data is stored in Hive tables for historical data analysis and machine learning model training.
- Online Data Path: The processed features are translated and sent to our low-latency online serving layer for real-time inference.
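The dual-path delivery step can be sketched as follows. The sinks are stubbed with plain Python containers, and the row shape and function names are illustrative assumptions, not Lyft internals:

```python
# A minimal sketch of dual-path delivery, assuming the DAG has already
# computed the feature rows. Hive and the online store are stubbed.
def deliver(rows, offline_sink, online_sink):
    """Write each computed feature row to both the offline and online paths."""
    for row in rows:
        offline_sink.append(row)                         # Hive table: full history
        online_sink[row["entity_id"]] = row["features"]  # keyed for low-latency reads

rows = [{"entity_id": "user_42", "features": {"trips_7d": 9}}]
offline, online = [], {}
deliver(rows, offline, online)
print(online["user_42"]["trips_7d"])
```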
The Online Serving Layer
Our online serving layer, referred to as dsfeatures (short for “data science features”), is central to our feature serving capability. It is an optimized wrapper over various AWS data stores, providing a reliable and ultra-low-latency retrieval mechanism for real-time serving.
The core structure of dsfeatures is:
- Backing Store: DynamoDB is utilized as the primary, persistent source for features. It uses various metadata fields as the primary key with a GSI for GDPR deletion efficiency.
- Performance Cache: A Valkey write-through LRU cache sits on top of DynamoDB to enable ultra-low-latency retrievals, storing the most frequently accessed (meta)data with a generous TTL.
- Embeddings: An OpenSearch integration is utilized specifically for serving embedding features, which require specialized indexing and retrieval capabilities.
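The cache-over-backing-store layering described above can be sketched in a few lines. The dict-backed "DynamoDB", the cache size, and the TTL are all illustrative stand-ins:

```python
import time
from collections import OrderedDict

# A minimal sketch of a write-through LRU cache with TTL in front of a
# backing store, mirroring the Valkey-over-DynamoDB layering. The sizes,
# TTLs, and dict-backed store are illustrative, not production values.
class WriteThroughCache:
    def __init__(self, backing, max_size=2, ttl_s=3600):
        self.backing, self.max_size, self.ttl_s = backing, max_size, ttl_s
        self.cache = OrderedDict()  # key -> (value, expiry)

    def put(self, key, value):
        self.backing[key] = value    # write-through: persist first...
        self._cache_set(key, value)  # ...then populate the cache

    def get(self, key):
        hit = self.cache.get(key)
        if hit and hit[1] > time.monotonic():
            self.cache.move_to_end(key)  # refresh LRU position
            return hit[0]
        value = self.backing[key]        # miss or expired: fall back to store
        self._cache_set(key, value)
        return value

    def _cache_set(self, key, value):
        self.cache[key] = (value, time.monotonic() + self.ttl_s)
        self.cache.move_to_end(key)
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)  # evict least-recently used

store = {}
c = WriteThroughCache(store)
c.put("user_42:trips_7d", 9)
print(c.get("user_42:trips_7d"), store["user_42:trips_7d"])
```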
Customer Interaction and Data Retrieval
The dsfeatures service centralizes how both internal DAGs and external customers interact with the feature data. From a customer’s perspective, data retrieval and management are straightforward, facilitated by our dedicated Software Development Kits (SDKs): go-lyft-features (Golang) and lyft-dsp-features (Python).
Services use these SDKs to make API calls directly to the dsfeatures service. The most common retrieval methods are Get or BatchGet calls; the service handles these and returns the requested data in a developer-friendly format.
Crucially, the SDK libraries expose full CRUD (Create, Read, Update, Delete) operations. This capability allows system components, such as our internal Airflow DAGs, to read and write features, and even lets customers manage real-time features ad-hoc by directly invoking these API calls against our data stores.
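A hedged sketch of how a service might interact with dsfeatures through such an SDK is below. The class and method names are illustrative stand-ins; the real go-lyft-features and lyft-dsp-features APIs are not public:

```python
# A hypothetical feature client illustrating Get/BatchGet reads and the
# write side of CRUD. The dict stands in for the dsfeatures service.
class FeatureClient:
    def __init__(self, store):
        self.store = store  # stand-in for the remote dsfeatures backend

    def get(self, entity_id, feature_name):
        return self.store.get((entity_id, feature_name))

    def batch_get(self, entity_ids, feature_name):
        # One call fetching the same feature for many entities.
        return {e: self.store.get((e, feature_name)) for e in entity_ids}

    def put(self, entity_id, feature_name, value):
        self.store[(entity_id, feature_name)] = value

client = FeatureClient({})
client.put("user_42", "trips_7d", 9)
print(client.batch_get(["user_42", "user_43"], "trips_7d"))
```

Missing entities come back as None in this sketch; a production SDK would likely surface richer per-key status information.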
The Streaming Pipeline
While batch features are essential, we also rely on streaming features to ensure data recency for low-latency applications and customer demands.
Our streaming pipeline follows a robust, multi-stage architecture to process features in real time.
- Ingestion: Streaming applications, developed primarily using Apache Flink, read analytic events from Kafka topics (or sometimes Kinesis streams).
- Transformation: The Flink applications perform necessary initial transformations on the data. This includes manual metadata creation and proper value formatting.
- Ingest Service: The feature payloads from customer applications are sunk to spfeaturesingest, our "Streaming Platform feature ingest" Flink application. It handles the (de)serialization of the payloads and the subsequent interaction with dsfeatures via WRITE API call(s), ensuring the features are processed in the right format and guaranteeing availability for online retrieval by other services.
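The streaming hop above can be sketched in plain Python: a raw analytic event is transformed into a serialized feature payload with manually attached metadata, then an ingest step deserializes it and writes it to the online store. Payload fields and function names are illustrative assumptions, and the dict stands in for the dsfeatures WRITE call:

```python
import json

# Transformation stage: shape a raw event into a feature payload.
def transform(event: dict) -> bytes:
    payload = {
        "entity_id": event["user_id"],
        "feature": "last_ride_ts",
        "value": event["ts"],
        "metadata": {"source": "kafka", "version": 1},  # manual metadata
    }
    return json.dumps(payload).encode()  # serialized for the ingest topic

# Ingest stage: deserialize and write onward, spfeaturesingest-style.
def ingest(raw: bytes, online_store: dict) -> None:
    payload = json.loads(raw)
    key = (payload["entity_id"], payload["feature"])
    online_store[key] = payload["value"]  # stand-in for the dsfeatures write

store = {}
ingest(transform({"user_id": "user_42", "ts": 1700000000}), store)
print(store[("user_42", "last_ride_ts")])
```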
Regardless of the ingestion method (batch, streaming, or on-demand), the Feature Store maintains uniform metadata and strongly consistent reads. This is crucial for ensuring feature accuracy and availability across all consuming applications and services.
Prioritizing User Experience and Feature Governance
Understanding our architecture is only half the picture; the user experience is central to maximizing productivity. Our Feature Store primarily serves two frequent personas: Software Engineers (who drive service activity) and ML Modelers (who design features and models). Since developers can often embody both roles or work in mixed teams, we’ve designed our system to simplify interaction for everyone.
Ease of Use and Quick Iteration
We learned early on that our core personas are particularly proficient in SQL and place a high value on quick iteration. To facilitate this, our design centers on:
- Performant Spark SQL as the preferred processing engine and language for batch feature queries.
- Simple JSON configuration files to define feature behavior/metadata.
This approach ensures that developers can focus on their primary responsibilities without technical intricacies getting in their way. The Feature Store presents this user-friendly interface and APIs that simplify interaction, minimizing the learning curve and facilitating rapid adoption. Engineers can readily register, update, and retrieve features using well-documented APIs and well-supported examples.
Feature Governance and Metadata
Our configuration files include essential metadata such as ownership details, urgency tiering, run-to-run carryover/rollup logic, and explicit feature naming & data-typing. This metadata is crucial for more than just customer clarity; it is vital for our monitoring and observability systems, aiding in debugging and preserving a historical record of each feature (both metadata and values).
To support robust feature management, the Feature Store incorporates versioning and lineage tracking capabilities, encapsulated in our metadata:
- Versioning allows developers to monitor changes to features over time, ensuring the use of correct versions for their models/services. If the SQL or expected feature behavior undergoes business logic changes, a version bump is expected.
- Lineage tracking offers crucial insights into the origin and transformation of features, enhancing both transparency and accountability across the platform.
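The version-bump convention can be sketched as a simple check, assuming (purely for illustration) that business-logic changes are detected by comparing the configured SQL:

```python
import hashlib

# A hypothetical helper: if a feature's SQL changes, the config's version
# should be bumped so consumers can pin the version they trained against.
# Hash-based change detection here is an illustrative assumption.
def needs_version_bump(old_sql: str, new_sql: str) -> bool:
    digest = lambda s: hashlib.sha256(s.strip().encode()).hexdigest()
    return digest(old_sql) != digest(new_sql)

print(needs_version_bump("SELECT a FROM t", "SELECT a, b FROM t"))
```

In practice a change reviewer, not a hash, would decide whether a change is cosmetic or a true business-logic change warranting a bump.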
Accelerating the Feature Engineering Workflow
To complement our simple SQL/JSON foundation, we've integrated with Kyte to accelerate the development lifecycle. This homegrown solution is central to Airflow local development at Lyft; Kyte is covered in more depth on the Lyft engineering blog.
We provide a custom Command Line Interface (CLI) within the Kyte environment that significantly improves the feature prototyping experience, allowing users to:
- Perform feature validation against their configurations.
- Test SQL runs for immediate feedback and investigable results.
- Execute DAG runs in a local environment.
- Confidently backfill previous dates against their DAGs.
Feature Discoverability
Once features are generating data, discoverability is the next crucial step. Our generated DAGs automatically tag feature metadata within Amundsen, Lyft’s central data discovery platform. This integration allows users to easily search for existing features, a critical step in preventing the duplication of efforts and reducing wasted engineering work.
By simplifying data discovery and feature engineering, we solidify the Feature Store’s crucial role in the ML Model Development lifecycle, ensuring a strong partnership with our Machine Learning Platform (MLP) team, which owns the remaining model-building steps.
Source: eng.lyft.com
