Raw ML Batch Feature

A Raw Feature in the Canso platform is a fundamental component that executes ML pipelines. It processes data from registered data sources, applies feature logic, and stores the results in data sinks.

Introduction

Raw Features are the essential building blocks for machine learning models. They:

Ensure seamless preparation of data for both the training and inference stages.
Apply standardized predefined logic or user-defined functions (UDFs).

Raw Feature Types

Feature with Predefined Logic

Raw Features with predefined logic use built-in transformations and aggregations for ease of use and consistency. Such features are created using common aggregations, such as window or sliding window aggregations, and transformations such as SUM, MIN, and MAX.
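To make the idea of a sliding window aggregation concrete, here is a minimal pure-Python sketch that sums a value per entity over a trailing window. It only illustrates the concept; the function name `sliding_window_sum`, the event tuples, and the dates are assumptions for this example, and the platform itself runs such logic on Spark rather than in plain Python.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sliding_window_sum(events, window, as_of):
    """Sum `value` per entity over the interval (as_of - window, as_of]."""
    totals = defaultdict(float)
    start = as_of - window
    for entity_id, event_time, value in events:
        # Keep only events that fall inside the trailing window.
        if start < event_time <= as_of:
            totals[entity_id] += value
    return dict(totals)


# Hypothetical click events: (user_id, event time, click count)
events = [
    ("u1", datetime(2024, 1, 1), 3),
    ("u1", datetime(2024, 1, 6), 2),
    ("u2", datetime(2024, 1, 7), 5),
    ("u1", datetime(2023, 12, 20), 9),  # older than 7 days, ignored
]

user_clicks_7d = sliding_window_sum(events, timedelta(days=7), datetime(2024, 1, 7))
print(user_clicks_7d)
```

Re-running the same function with a later `as_of` value slides the window forward, which is roughly what a daily schedule would do.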

Feature with Custom UDF

Raw Features with custom UDFs are created using user-defined functions. They offer more flexibility and can handle complex, specific transformations that the predefined logic does not cover.
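As an illustration of the kind of logic a custom UDF might hold, the sketch below computes the average number of clicks per active day for each user, something a single SUM, MIN, or MAX cannot express. The function `clicks_per_active_day` and the row layout are hypothetical; how a UDF is actually registered with the platform is not shown here.

```python
def clicks_per_active_day(rows):
    """Hypothetical UDF: average clicks per day on which the user was active."""
    per_user = {}
    for row in rows:
        # Accumulate clicks per user and per calendar day.
        days = per_user.setdefault(row["user_id"], {})
        days[row["date"]] = days.get(row["date"], 0) + row["clicks"]
    return {
        user: sum(days.values()) / len(days)
        for user, days in per_user.items()
    }


# Hypothetical input rows
rows = [
    {"user_id": "u1", "date": "2024-01-01", "clicks": 4},
    {"user_id": "u1", "date": "2024-01-01", "clicks": 2},
    {"user_id": "u1", "date": "2024-01-03", "clicks": 6},
    {"user_id": "u2", "date": "2024-01-02", "clicks": 3},
]

result = clicks_per_active_day(rows)
print(result)
```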

Raw Feature Attributes

| Attribute | Description | Example |
| --- | --- | --- |
| name | Unique name of the raw feature | user_clicks_7d |
| description | Description of what the raw feature contains | Sum of user clicks in the last 7 days |
| owners | Team that owns the raw feature | ['[email protected]'] |
| entity | The entity the feature is based on | user_id |
| data_type | Data type of the raw feature | FLOAT |
| data_sources | List of data sources used to create the feature | [survey_telemetry_data] |
| staging_sink | S3 sink where the intermediate data is stored | [operational_telemetry_data] |
| online_sink | Redis sink for storing the feature online | ["online_telemetry_data"] |
| online_sink_write_option_configs | Configurations for writing to the online sink | {"online_telemetry_data":{ "file_type_properties": { "type": "PARQUET", "mergeSchema": False}}} |
| feature_logic | Transformation logic to compute the feature | SlidingWindowAggregation |
| processing_engine | Processing engine used for feature computation | spark |
| processing_engine_configs | Configurations for the processing engine | {"memory": "4g", "cores": 2} |
| online | Flag to indicate if the feature should be available online | True |
| offline | Flag to indicate if the feature should be available offline | True |
| schedule | Schedule for feature computation | 1D |
| active | Flag to indicate if the feature is active | True |
| start_time | Start time for feature computation | datetime.now() |

Special notes on attributes

feature_logic: The Raw Feature supports sliding windows and window aggregations.
processing_engine and its configs: These are the default PySpark configurations used to run the Raw Feature.
online flag: If the online flag is enabled, the data will be ingested into the online sink (i.e., Redis cache).
offline flag: If the offline flag is enabled, the Raw Feature will ingest the data into the offline sink (i.e., S3 sink).
Read & Write option configs: Users can provide read and write configurations at the time of feature registration. This option is not enabled yet; support for it will be added soon.
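The effect of the online and offline flags described above can be sketched as a small routing function. This is only a conceptual illustration: `resolve_sinks` is a hypothetical helper, and using `operational_telemetry_data` as the offline (S3) sink name is an assumption borrowed from the attribute examples.

```python
def resolve_sinks(online, offline, online_sink, offline_sink):
    """Return the sinks a feature run would write to, given its flags."""
    targets = []
    if offline:
        targets.extend(offline_sink)  # S3-backed sink
    if online:
        targets.extend(online_sink)   # Redis-backed sink
    return targets


# Online only: data goes to the Redis sink and not to S3.
targets = resolve_sinks(
    online=True,
    offline=False,
    online_sink=["online_telemetry_data"],
    offline_sink=["operational_telemetry_data"],  # assumed sink name
)
print(targets)
```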

Example Object of Raw Features
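A minimal sketch of how the attributes above fit together, written as a plain Python dataclass. `RawFeatureSpec` is a hypothetical stand-in and not the Canso SDK class, so the real constructor and import path may differ. The owner address is a placeholder, and the values are taken from the attribute examples.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class RawFeatureSpec:
    """Hypothetical container mirroring the attribute table; not the SDK class."""
    name: str
    description: str
    owners: List[str]
    entity: str
    data_type: str
    data_sources: List[str]
    staging_sink: List[str]
    online_sink: List[str]
    feature_logic: str
    processing_engine: str
    processing_engine_configs: Dict[str, Any]
    schedule: str
    start_time: datetime
    online_sink_write_option_configs: Dict[str, Any] = field(default_factory=dict)
    online: bool = True
    offline: bool = True
    active: bool = True


user_clicks_7d = RawFeatureSpec(
    name="user_clicks_7d",
    description="Sum of user clicks in the last 7 days",
    owners=["owner@example.com"],  # placeholder address
    entity="user_id",
    data_type="FLOAT",
    data_sources=["survey_telemetry_data"],
    staging_sink=["operational_telemetry_data"],
    online_sink=["online_telemetry_data"],
    feature_logic="SlidingWindowAggregation",
    processing_engine="spark",
    processing_engine_configs={"memory": "4g", "cores": 2},
    schedule="1D",
    start_time=datetime.now(),
)

print(asdict(user_clicks_7d)["name"])
```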

Working with Raw Features

Once the raw feature is defined:

It can be registered and reused by Derived Features. The output of a Raw Feature becomes the input for a Derived Feature, enabling more complex feature engineering and reducing redundancy.
It can be deployed to execute the defined operation.
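The reuse described above, where the output of a Raw Feature feeds a Derived Feature, can be pictured with a small sketch. The function `derive_ratio`, the second feature `user_sessions_7d`, and all values are hypothetical; the actual Derived Feature definition is covered in its own page.

```python
def derive_ratio(numerator, denominator):
    """Combine two raw-feature outputs, keyed by entity, into one derived value."""
    return {
        entity: numerator[entity] / denominator[entity]
        for entity in numerator.keys() & denominator.keys()
        if denominator[entity] != 0  # skip entities that would divide by zero
    }


# Hypothetical outputs of two registered raw features
user_clicks_7d = {"u1": 5.0, "u2": 5.0}
user_sessions_7d = {"u1": 2.0, "u2": 0.0, "u3": 4.0}

clicks_per_session_7d = derive_ratio(user_clicks_7d, user_sessions_7d)
print(clicks_per_session_7d)
```

Because both inputs are already computed and stored, the derived value does not need to re-read the original data sources.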

