Streaming Feature
Streaming Features enable real-time data processing by reading live data from Kafka topics. These features apply predefined or custom logic to the live data and store the results in specified data sinks.
Introduction
Streaming features operate on live data, making them crucial for real-time machine learning applications. They provide:

- Immediate processing and analysis of data as it arrives, enabling real-time predictions and insights.
- Timely handling of latency-critical scenarios such as fraud detection, recommendation systems, and dynamic pricing models.
- Continuously updated features, ensuring that machine learning models always work with the most current information.
In this context, sliding window operations and custom user-defined functions (UDFs) are supported to provide flexible and powerful data processing capabilities.
Overview of Streaming Aggregations and Transformations
Streaming features enable dynamic real-time data processing by applying a variety of aggregations and transformations on live data streams. These operations are critical for creating powerful, real-time machine learning features.
Supported Operations
| Operation | Description | Use Cases | Example |
| --- | --- | --- | --- |
| Window Aggregations | Compute metrics over defined time intervals using sliding, tumbling, or session windows | Time-series analysis, event monitoring | Calculate average transaction value per 5-minute window |
| Filter | Selectively process events based on specified conditions | Data cleaning, event filtering | Filter out transactions below a threshold value |
| Map | One-to-one transformation of individual events | Data normalization, format conversion | Convert temperature from Celsius to Fahrenheit |
| FlatMap | One-to-many transformation breaking an event into multiple outputs | Event decomposition, data expansion | Split compound events into individual components |
| Count | Calculate event occurrence frequency | Event frequency analysis, usage metrics | Count user interactions per session |
| Repartition | Redistribute data across a specified number of partitions | Performance optimization, load balancing | Rebalance data across processing nodes |
| SelectExpr | SQL-style column transformations and filtering | Column manipulation, data projection | Extract specific fields using SQL expressions |
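As a concrete illustration of the first row, here is a minimal PySpark sketch of the "average transaction value per 5-minute window" example. The broker address, topic name, and event schema are assumptions for illustration only.

```python
# Minimal sketch: 5-minute tumbling-window average over a Kafka stream.
# Broker, topic, and schema below are illustrative assumptions.
# Requires the spark-sql-kafka package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg, col, from_json
from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType

spark = SparkSession.builder.appName("window-agg-demo").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

transactions = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "transactions_topic")            # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

avg_per_window = (
    transactions
    .withWatermark("event_time", "10 minutes")          # tolerate late events
    .groupBy(window(col("event_time"), "5 minutes"))    # tumbling 5-minute window
    .agg(avg("amount").alias("avg_transaction_value"))
)

query = avg_per_window.writeStream.outputMode("append").format("console").start()
```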
Getting Started
1. Choose the appropriate transformation type based on your use case.
2. Configure the transformation parameters.
3. Define your StreamingFeature with the chosen transformation.
4. Register and deploy your feature using the YugenClient (a sketch follows below).
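A minimal sketch of step 4, assuming a Python client. The import path and `register` method name are illustrative guesses, not the documented YugenClient API, and `real_time_click_rate` is the example object defined under "Example Object of Streaming Feature" below.

```python
# Hypothetical registration call; the import path and method name are
# assumptions, not the documented YugenClient API.
from yugen.client import YugenClient  # assumed import path

client = YugenClient()
client.register(real_time_click_rate)  # StreamingFeature object, defined below
```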
Raw Feature Types
Feature with Predefined Logic
Raw Features with predefined logic use built-in transformations and aggregations for ease of use and consistency. These features are created with common windowing strategies, such as tumbling or sliding windows, combined with standard aggregations like SUM, MIN, and MAX.
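For instance, a predefined sliding-window aggregation might be configured as follows. The class name matches the attribute table below, but the import path and parameter names are assumptions, not the exact platform API.

```python
# Hypothetical predefined feature logic; parameter names are illustrative.
from yugen.logic import SlidingWindowAggregation  # assumed import path

feature_logic = SlidingWindowAggregation(
    aggregation="SUM",            # predefined aggregation: SUM, MIN, MAX, ...
    aggregation_column="clicks",  # column to aggregate
    window_duration="5 minutes",  # total window length
    slide_duration="1 minute",    # how often the window advances
)
```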
Feature with Custom UDF
Raw Features with custom user-defined functions (UDFs) are created for more complex and specific transformations. They offer greater flexibility and can handle logic not covered by the predefined transformations.
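Here is a minimal PySpark UDF sketch of the kind of custom logic this covers; the column semantics are illustrative, and how the platform attaches the UDF to a StreamingFeature is platform-specific and not shown.

```python
# A custom PySpark UDF computing a click rate. Only the UDF itself is
# shown -- wiring it into a StreamingFeature is platform-specific.
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def click_rate(clicks, impressions):
    # Guard against division by zero on rows with no impressions.
    if not impressions:
        return 0.0
    return float(clicks) / float(impressions)
```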
Streaming Feature Attributes
| Attribute | Description | Example |
| --- | --- | --- |
| `name` | Unique name of the Kafka feature | `real_time_click_rate` |
| `description` | Description of what the streaming feature contains | Real-time click rate of users |
| `data_type` | Data type of the Kafka feature | `FLOAT` |
| `data_sources` | List of Kafka topics from which data is read | `['clicks_topic']` |
| `staging_sink` | Data sink for staging the processed data | `real_time_clicks_processed_data_s3_bucket` |
| `owners` | List of team members or teams responsible for the feature | `['data_team@company.com']` |
| `feature_logic` | Transformation logic applied to the live data | `SlidingWindowAggregation` |
| `processing_engine` | Engine used for processing the feature logic | `pyspark` |
| `processing_engine_configs` | Configuration options for the processing engine | `{'parallelism': 4}` |
| `online` | Boolean flag indicating if the feature should be available for online retrieval | `True` |
| `offline` | Boolean flag indicating if the feature should be available for offline analysis | `True` |
Special notes on attributes
- `processing_engine` & `processing_engine_configs`: These are the default set of PySpark streaming configurations used to run the Streaming Feature.
- `checkpoint_path`: Users can specify a `checkpoint_path` as part of the processing engine configuration to enable checkpointing for stateful recovery in Spark Streaming jobs. If no `checkpoint_path` is provided, the streaming job will run without checkpointing. Users are encouraged to provide a checkpoint path to ensure fault tolerance and state recovery. For more details, refer to Spark's official documentation on Checkpointing.
- `online`: If the online flag is enabled, the data will be ingested into the online sink (i.e., the Redis cache).
- `offline`: If the offline flag is enabled, the Raw Feature will ingest the data into the offline sink (i.e., the S3 sink).
- Read option configs: Users can provide these configurations at the time of feature registration. Currently, this option is not enabled, but we will be adding support for it soon.
- `staging_sink_write_option_configs` and `online_sink_write_option_configs`: These are optional configurations that provide flexibility in controlling how data is written to sinks. Currently, they are not enabled, but we will be adding support for them soon.
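Putting those notes together, a `processing_engine_configs` value with checkpointing enabled could look like the following; the S3 path is a placeholder, not a real bucket.

```python
# Illustrative processing engine configs; the checkpoint path is a
# placeholder for demonstration only.
processing_engine_configs = {
    "parallelism": 4,
    "checkpoint_path": "s3://example-bucket/checkpoints/real_time_click_rate",
}
```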
Session Windows
Session windows group data based on periods of activity separated by gaps of inactivity. Unlike fixed-time windows, session windows have dynamic lengths that expand as new events arrive and close when no events occur within a specified timeout period.
When to Use Session Windows
Session windows are perfect for analyzing continuous sequences of events that should be grouped together until there's a significant gap in activity. Consider these metrics:
- Time spent per active session
- Events processed per session
- Session-based conversion rates
These metrics are valuable in scenarios like user website interactions, where each user action extends the session timeout. If no new actions occur within the timeout period (e.g., 30 minutes), the session closes and a new one begins with the next action.
For more details on different types of time windows in Spark Structured Streaming, see the official documentation.
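In PySpark (Spark 3.2+), session windows are expressed with `session_window`. Below is a self-contained sketch that uses the built-in `rate` source as a stand-in for user events; the column names and the 30-minute gap are illustrative.

```python
# Session-window aggregation: events from the same user are grouped into
# one session until a 30-minute gap of inactivity closes it.
from pyspark.sql import SparkSession
from pyspark.sql.functions import session_window, count, col

spark = SparkSession.builder.appName("session-window-demo").getOrCreate()

# Synthetic stream standing in for user events: the built-in `rate`
# source emits (timestamp, value); value % 3 plays the role of a user_id.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .selectExpr("value % 3 AS user_id", "timestamp AS event_time")
)

session_counts = (
    events
    .withWatermark("event_time", "1 hour")  # required for append output
    .groupBy(col("user_id"), session_window(col("event_time"), "30 minutes"))
    .agg(count("*").alias("events_per_session"))
)

query = session_counts.writeStream.outputMode("append").format("console").start()
```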
Example Object of Streaming Feature
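The following is a hypothetical example object mirroring the attribute table above. The import paths, constructor names, and `SlidingWindowAggregation` parameters are assumptions, not the exact platform API.

```python
# Hypothetical StreamingFeature object; import paths and parameter names
# are illustrative assumptions based on the attribute table above.
from yugen.features import StreamingFeature       # assumed import path
from yugen.logic import SlidingWindowAggregation  # assumed import path

real_time_click_rate = StreamingFeature(
    name="real_time_click_rate",
    description="Real-time click rate of users",
    data_type="FLOAT",
    data_sources=["clicks_topic"],
    staging_sink="real_time_clicks_processed_data_s3_bucket",
    owners=["data_team@company.com"],
    feature_logic=SlidingWindowAggregation(
        aggregation="AVG",            # assumed parameter names
        aggregation_column="clicks",
        window_duration="5 minutes",
        slide_duration="1 minute",
    ),
    processing_engine="pyspark",
    processing_engine_configs={
        "parallelism": 4,
        "checkpoint_path": "s3://example-bucket/checkpoints/real_time_click_rate",
    },
    online=True,   # ingest into the online sink (Redis cache)
    offline=True,  # ingest into the offline sink (S3)
)
```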
Working with Kafka Features
Kafka Features are defined and registered similarly to batch features but are continuously updated as new data arrives.
This real-time processing allows for immediate updates to features, which can then be used for live model predictions or analytics.
Tool Tips
Go to Top ⬆️
Go back to README.md ⬅️
Go back to data-sources.md ⬅️
Go back to data-sinks.md ⬅️
Move forward to see register-feature.md ➡️