Derived ML Batch Feature
Derived Features are created by applying transformations on data generated by raw features. They serve as advanced features that build upon the foundational raw features, enabling more complex data processing and feature engineering.
Introduction
A Derived Feature is an additional layer on top of raw features:
Derived Features can take multiple raw features as inputs and combine their values into more meaningful, refined results.
They are crucial in machine learning workflows because they allow more sophisticated data transformations and enrichment.
They apply built-in and advanced operations to the raw tabular data.
Derived Feature Types
Feature with built-in operations
Derived Features support a range of built-in operations such as add, subtract, multiply, and safe_divide. Each operation combines the raw input columns, applies the transformation, and adds a new column containing the transformed values to the dataframe.
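To illustrate what these operations do to a table, here is a minimal sketch in plain Python, not the platform's own implementation. The function names mirror the operations listed above, but the `apply_operation` helper, the row layout, and the choice to return `None` from `safe_divide` when the denominator is zero are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


def add(a: float, b: float) -> float:
    return a + b


def subtract(a: float, b: float) -> float:
    return a - b


def multiply(a: float, b: float) -> float:
    return a * b


def safe_divide(a: float, b: float) -> Optional[float]:
    # Assumption: a zero denominator yields None instead of raising an error.
    return None if b == 0 else a / b


OPERATIONS: Dict[str, Callable] = {
    "add": add,
    "subtract": subtract,
    "multiply": multiply,
    "safe_divide": safe_divide,
}


def apply_operation(rows: List[dict], op_name: str, left: str, right: str, output: str) -> List[dict]:
    """Return new rows with an extra column holding op(left, right)."""
    op = OPERATIONS[op_name]
    return [{**row, output: op(row[left], row[right])} for row in rows]


# Hypothetical raw feature values keyed by entity.
rows = [
    {"user_id": 1, "clicks": 5, "impressions": 20},
    {"user_id": 2, "clicks": 0, "impressions": 0},
]
result = apply_operation(rows, "safe_divide", "clicks", "impressions", "user_click_rate")
```

The input rows are left unchanged; each output row carries the original columns plus the new `user_click_rate` column.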
Derived Feature Attributes
| Attribute | Description | Example |
| --- | --- | --- |
| name | Unique name of the derived feature | user_click_rate |
| description | Human-readable description for easier understanding | Click rate of users over time |
| staging_sink | Data sink for staging the processed data | recommendation-data-sink-S3 |
| online_sink | Data sink for storing the processed data for online retrieval | redis://192.168.1.100:6379 |
| data_type | Data type of the derived feature | FLOAT |
| owners | List of team members or teams responsible for the feature | ['data_team@company.com'] |
| schedule | Schedule for computing the derived feature | daily |
| entity | Entity to which the feature belongs | user_id |
| processing_engine | Engine used for processing the feature logic | Spark |
| processing_engine_configs | Configuration options for the processing engine | {'num_partitions': 10} |
| online | Boolean flag indicating if the feature should be available for online retrieval | True |
| offline | Boolean flag indicating if the feature should be available for offline analysis | True |
| transform | Transformation logic applied to the raw feature values | add(raw_feature1, raw_feature2) |
| start_time | Time since when the feature computation should begin | 2024-01-01 00:00:00 |
Special notes on attributes
feature_logic: The Derived Feature supports operations like add, subtract, multiply, and safe_divide.
processing_engine & their configs: These are the default set of PySpark configurations used to run the Derived Feature.
online flag: If the online flag is enabled, the data will be ingested into the online sink (i.e., Redis cache).
offline flag: If the offline flag is enabled, the Derived Feature will ingest the data into the offline sink (i.e., S3 sink).
Read & Write option configs: Users can provide some configurations at the time of feature registration. This option is not enabled yet, but support for it is planned.
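The online and offline flags can be read as a routing rule for the computed data. The following is a hedged sketch of that rule only. The `target_sinks` function and the dictionary keys are hypothetical, and it assumes the offline (S3) sink corresponds to the `staging_sink` attribute, which the notes above do not state explicitly.

```python
from typing import Dict, List


def target_sinks(feature: Dict) -> List[str]:
    """Return the sinks that should receive the computed feature values."""
    sinks: List[str] = []
    if feature.get("online", False):
        # Online retrieval: the Redis-backed online sink.
        sinks.append(feature["online_sink"])
    if feature.get("offline", False):
        # Assumption: the offline S3 sink is the one named by staging_sink.
        sinks.append(feature["staging_sink"])
    return sinks


feature = {
    "online": True,
    "offline": False,
    "online_sink": "redis://192.168.1.100:6379",
    "staging_sink": "recommendation-data-sink-S3",
}
sinks = target_sinks(feature)
```

With only the online flag set, the sketch routes data to the Redis sink alone; enabling both flags would return both sinks.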
Example Object of Derived feature
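A derived feature definition can be pictured as a set of key-value pairs. The sketch below collects the example values listed under Derived Feature Attributes into a plain Python dictionary. The dictionary layout and the `derived_feature` variable name are assumptions; the actual registration object or API may look different.

```python
from datetime import datetime

# Illustrative only: keys follow the attribute names documented above,
# values are the documented example values.
derived_feature = {
    "name": "user_click_rate",
    "description": "Click rate of users over time",
    "staging_sink": "recommendation-data-sink-S3",
    "online_sink": "redis://192.168.1.100:6379",
    "data_type": "FLOAT",
    "owners": ["data_team@company.com"],
    "schedule": "daily",
    "entity": "user_id",
    "processing_engine": "Spark",
    "processing_engine_configs": {"num_partitions": 10},
    "online": True,
    "offline": True,
    "transform": "add(raw_feature1, raw_feature2)",
    "start_time": datetime(2024, 1, 1, 0, 0, 0),
}
```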
Working with Derived Features
Once the derived feature is defined:
It can be used as a standalone feature that combines and transforms raw features or other data sources.
The output of a Derived Feature is used directly for machine learning model training or inference.
It cannot be reused or referenced again in other features.
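The rule that a Derived Feature cannot be referenced by other features could be enforced with a check at definition time. The sketch below is one hypothetical way to do that: it assumes transforms are written as a single call such as `add(raw_feature1, raw_feature2)`, and the `transform_inputs` and `reject_derived_inputs` helpers are illustrative names, not part of the platform.

```python
import re
from typing import List, Set

# Matches a single call of the form op(arg1, arg2, ...).
_CALL = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def transform_inputs(transform: str) -> List[str]:
    """Extract the input names from a transform expression."""
    match = _CALL.match(transform)
    if match is None:
        raise ValueError(f"Unrecognized transform: {transform!r}")
    return [arg.strip() for arg in match.group(2).split(",") if arg.strip()]


def reject_derived_inputs(transform: str, derived_names: Set[str]) -> None:
    """Raise if the transform references an existing derived feature."""
    reused = [name for name in transform_inputs(transform) if name in derived_names]
    if reused:
        raise ValueError(f"Derived features cannot be used as inputs: {reused}")


existing_derived = {"user_click_rate"}

# Passes silently: both inputs are raw features.
reject_derived_inputs("add(raw_feature1, raw_feature2)", existing_derived)
```

A transform such as `multiply(user_click_rate, raw_feature1)` would raise a `ValueError` under this check, because `user_click_rate` is itself a derived feature.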