Derived ML Batch Feature

Derived Features are created by applying transformations to the data produced by raw features. They build on the foundational raw features to enable more complex data processing and feature engineering.

Introduction

A Derived Feature is an additional layer on top of raw features. It provides the following:

It can take multiple raw features as inputs and combine their values into more meaningful results.
It supports the more sophisticated data transformations and enrichment that machine learning workflows need.
It performs built-in and advanced operations on the raw tabular data.

Derived Feature Types

Feature with built-in operations

Derived Features support a range of built-in operations such as add, subtract, multiply, and safe_divide. Each operation combines the raw data, performs the transformation, and adds a new column containing the transformed values to the dataframe.
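
To make the built-in operations concrete, here is a minimal sketch in plain Python rather than the platform's actual PySpark implementation. The column names clicks and impressions, the helper apply_derived, and the rule that safe_divide yields None for a zero or missing divisor are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Number = Optional[float]


def _null_safe(op: Callable[[float, float], float]) -> Callable[[Number, Number], Number]:
    """Wrap a binary operation so a missing input produces a missing output."""
    def wrapped(a: Number, b: Number) -> Number:
        if a is None or b is None:
            return None
        return op(a, b)
    return wrapped


def safe_divide(a: Number, b: Number) -> Number:
    """Divide a by b; return None for a zero or missing divisor (assumed behaviour)."""
    if a is None or b is None or b == 0:
        return None
    return a / b


# The four operation names come from the text above; the bodies are illustrative.
BUILT_IN_OPS: Dict[str, Callable[[Number, Number], Number]] = {
    "add": _null_safe(lambda a, b: a + b),
    "subtract": _null_safe(lambda a, b: a - b),
    "multiply": _null_safe(lambda a, b: a * b),
    "safe_divide": safe_divide,
}


def apply_derived(rows: List[dict], op_name: str, left: str, right: str, output: str) -> List[dict]:
    """Return copies of the rows with one new column holding the derived value."""
    op = BUILT_IN_OPS[op_name]
    return [{**row, output: op(row.get(left), row.get(right))} for row in rows]


# Hypothetical raw feature values keyed by the user_id entity.
rows = [
    {"user_id": "u1", "clicks": 5.0, "impressions": 20.0},
    {"user_id": "u2", "clicks": 3.0, "impressions": 0.0},
]
result = apply_derived(rows, "safe_divide", "clicks", "impressions", "user_click_rate")
```

The input rows are left unchanged and each output row gains a user_click_rate column, which mirrors the "adds a new column" behaviour described above.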

Derived Feature Attributes

| Attribute | Description | Example |
| --- | --- | --- |
| name | Unique name of the derived feature | user_click_rate |
| description | Human-readable description for easier understanding | Click rate of users over time |
| staging_sink | Data sink for staging the processed data | recommendation-data-sink-S3 |
| online_sink | Data sink for storing the processed data for online retrieval | redis://192.168.1.100:6379 |
| data_type | Data type of the derived feature | FLOAT |
| owners | List of team members or teams responsible for the feature | ['[email protected]'] |
| schedule | Schedule for computing the derived feature | daily |
| entity | Entity to which the feature belongs | user_id |
| processing_engine | Engine used for processing the feature logic | Spark |
| processing_engine_configs | Configuration options for the processing engine | {'num_partitions': 10} |
| online | Boolean flag indicating if the feature should be available for online retrieval | True |
| offline | Boolean flag indicating if the feature should be available for offline analysis | True |
| transform | Transformation logic applied to the raw feature values | add(raw_feature1, raw_feature2) |
| start_time | Time since when the feature computation should begin | 2024-01-01 00:00:00 |

Special notes on attributes

feature_logic: The Derived Feature supports operations such as add, subtract, multiply, and safe_divide.
processing_engine and processing_engine_configs: The default set of PySpark configurations used to run the Derived Feature.
online flag: If the online flag is enabled, the data is ingested into the online sink (i.e., the Redis cache).
offline flag: If the offline flag is enabled, the Derived Feature ingests the data into the offline sink (i.e., the S3 sink).
Read and write option configs: Users can provide read and write configurations at feature registration time. This option is not enabled yet; support will be added soon.
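
The effect of the online and offline flags can be sketched as a small routing function. This is only an illustration: target_sinks is a hypothetical helper, and treating staging_sink as the offline S3 sink is an assumption, since the notes above do not say which attribute holds the offline sink.

```python
from typing import Dict, List


def target_sinks(feature: Dict[str, object]) -> List[str]:
    """Return the sinks a derived feature would write to, based on its flags."""
    sinks: List[str] = []
    if feature.get("online"):
        # Online retrieval: data goes to the online sink (the Redis cache).
        sinks.append(str(feature["online_sink"]))
    if feature.get("offline"):
        # Assumption: the S3 staging sink doubles as the offline sink.
        sinks.append(str(feature["staging_sink"]))
    return sinks


# Values taken from the attribute examples in this page.
feature = {
    "name": "user_click_rate",
    "online": True,
    "offline": True,
    "online_sink": "redis://192.168.1.100:6379",
    "staging_sink": "recommendation-data-sink-S3",
}
both_sinks = target_sinks(feature)
offline_only = target_sinks({**feature, "online": False})
```

With both flags enabled the feature is routed to both sinks; turning a flag off drops the corresponding sink.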

Example Object of Derived feature

Feature with built-in logic
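
As a rough illustration of what such an object could look like, the sketch below collects the attributes from the table into a Python dataclass. DerivedFeatureSpec is a hypothetical name, not the platform's SDK class, and the real registration object may differ; the field values are the examples from the attribute table.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class DerivedFeatureSpec:
    """Hypothetical container mirroring the documented attributes."""
    name: str
    description: str
    staging_sink: str
    online_sink: str
    data_type: str
    owners: List[str]
    schedule: str
    entity: str
    processing_engine: str
    processing_engine_configs: Dict[str, Any]
    online: bool
    offline: bool
    transform: str
    start_time: str


user_click_rate = DerivedFeatureSpec(
    name="user_click_rate",
    description="Click rate of users over time",
    staging_sink="recommendation-data-sink-S3",
    online_sink="redis://192.168.1.100:6379",
    data_type="FLOAT",
    owners=["[email protected]"],  # placeholder exactly as shown in the table
    schedule="daily",
    entity="user_id",
    processing_engine="Spark",
    processing_engine_configs={"num_partitions": 10},
    online=True,
    offline=True,
    transform="add(raw_feature1, raw_feature2)",  # built-in operation on two raw features
    start_time="2024-01-01 00:00:00",
)
```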

Working with Derived Features

Once a derived feature is defined:

It can be used as a standalone feature that combines and transforms raw features or other data sources.
Its output is used directly for machine learning model training or inference.
It cannot be reused or referenced by other features.
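
The restriction on reuse can be pictured as a registration-time check. The sketch below is an assumption about how such a check might work, not the platform's actual validation: parse_transform and check_inputs_are_raw are hypothetical helpers, and the transform is assumed to have the op(input1, input2) shape shown in the attribute table.

```python
import re
from typing import List, Set, Tuple

# Matches expressions of the form op(arg1, arg2, ...).
_TRANSFORM_RE = re.compile(r"^\s*(\w+)\s*\(\s*(.*?)\s*\)\s*$")


def parse_transform(expr: str) -> Tuple[str, List[str]]:
    """Split a transform expression into its operation name and input names."""
    match = _TRANSFORM_RE.match(expr)
    if match is None:
        raise ValueError(f"Unrecognised transform expression: {expr!r}")
    op, args = match.groups()
    inputs = [arg.strip() for arg in args.split(",") if arg.strip()]
    return op, inputs


def check_inputs_are_raw(expr: str, derived_names: Set[str]) -> None:
    """Raise if any input of the transform is itself a derived feature."""
    _, inputs = parse_transform(expr)
    reused = sorted(set(inputs) & derived_names)
    if reused:
        raise ValueError(f"Derived features cannot be used as inputs: {reused}")


# Passes: both inputs are raw features.
check_inputs_are_raw("add(raw_feature1, raw_feature2)", {"user_click_rate"})
```

Calling check_inputs_are_raw with an expression that names user_click_rate as an input would raise a ValueError in this sketch.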

