Raw ML Batch Feature
A Raw Feature in the Canso platform is a fundamental component that executes ML pipelines. It processes data from registered data sources, applies feature logic, and stores the results in data sinks.
Introduction
Raw Features are the essential building blocks for machine learning models. A Raw Feature:
Ensures seamless preparation of data for both the training and inference stages.
Applies standardized predefined logic or user-defined functions (UDFs).
Raw Feature Types
Feature with Predefined Logic
Raw Features with predefined logic use built-in transformations and aggregations for ease of use and consistency. They are built from common aggregations, such as window or sliding window, combined with transformations such as SUM, MIN, and MAX.
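To make the idea of a sliding window aggregation concrete, here is a minimal sketch in plain Python. It is not the Canso implementation (which runs on Spark); the function name `sliding_window_aggregate`, the event tuple layout, and the inclusive window boundaries are all assumptions made for illustration.

```python
from datetime import date, timedelta


def sliding_window_aggregate(events, as_of, window_days, agg=sum):
    """Aggregate values per entity over the `window_days` days ending at `as_of`.

    Hypothetical helper: `events` is an iterable of (entity_id, event_date, value)
    tuples, and both window boundaries are treated as inclusive.
    """
    start = as_of - timedelta(days=window_days - 1)
    grouped = {}
    for entity_id, event_date, value in events:
        if start <= event_date <= as_of:
            grouped.setdefault(entity_id, []).append(value)
    return {entity_id: agg(values) for entity_id, values in grouped.items()}


# Made-up click events for two users.
events = [
    ("u1", date(2024, 1, 1), 3.0),  # falls outside the 7-day window
    ("u1", date(2024, 1, 5), 2.0),
    ("u1", date(2024, 1, 9), 4.0),
    ("u2", date(2024, 1, 8), 1.0),
]

# SUM of clicks over the 7 days ending on 2024-01-10.
clicks_7d = sliding_window_aggregate(events, as_of=date(2024, 1, 10), window_days=7)
```

Passing `agg=min` or `agg=max` swaps the aggregation without changing the windowing logic.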
Feature with Custom UDF
Raw Features with custom UDFs are created using user-defined functions. They offer more flexibility and can handle complex or specific transformations not covered by predefined logic.
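As a rough illustration of what a custom UDF might compute, the sketch below applies a user-defined function to each input row. The names `click_through_rate` and `apply_udf`, and the row fields `clicks` and `impressions`, are hypothetical and do not come from the Canso SDK; how a UDF is actually registered with the platform is not shown here.

```python
def click_through_rate(row):
    """Hypothetical UDF: clicks divided by impressions, 0.0 when there are no impressions."""
    impressions = row["impressions"]
    return row["clicks"] / impressions if impressions else 0.0


def apply_udf(rows, udf, entity_key="user_id"):
    """Apply `udf` to every row and key the results by entity."""
    return {row[entity_key]: udf(row) for row in rows}


# Made-up input rows.
rows = [
    {"user_id": "u1", "clicks": 5, "impressions": 20},
    {"user_id": "u2", "clicks": 0, "impressions": 0},
]

ctr = apply_udf(rows, click_through_rate)
```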
Raw Feature Attributes
| Attribute | Description | Example |
| --- | --- | --- |
| name | Unique name of the raw feature | user_clicks_7d |
| description | Description of what the raw feature contains | Sum of user clicks in the last 7 days |
| owners | Team that owns the raw feature | ['data_team@company.com'] |
| entity | The entity the feature is based on | user_id |
| data_type | Data type of the raw feature | FLOAT |
| data_sources | List of data sources used to create the feature | [survey_telemetry_data] |
| staging_sink | S3 sink where the intermediate data is stored | [operational_telemetry_data] |
| online_sink | Redis sink for storing the feature online | ["online_telemetry_data"] |
| online_sink_write_option_configs | Configurations for writing to the online sink | {"online_telemetry_data":{ "file_type_properties": { "type": "PARQUET", "mergeSchema": False}}} |
| feature_logic | Transformation logic to compute the feature | SlidingWindowAggregation |
| processing_engine | Processing engine used for feature computation | spark |
| processing_engine_configs | Configurations for the processing engine | {"memory": "4g", "cores": 2} |
| online | Flag to indicate if the feature should be available online | True |
| offline | Flag to indicate if the feature should be available offline | True |
| schedule | Schedule for feature computation | 1D |
| active | Flag to indicate if the feature is active | True |
| start_time | Start time for feature computation | datetime.now() |
Special notes on attributes
feature_logic: The Raw Feature supports sliding window and window aggregations.
processing_engine & their configs: These are the default set of PySpark configurations used to run the Raw Feature.
online flag: If the online flag is enabled, the data will be ingested into the online sink (i.e., Redis cache).
offline flag: If the offline flag is enabled, the Raw Feature will ingest the data into the offline sink (i.e., S3 sink).
Read & Write option configs: Users can provide some configurations at the time of feature registration. Currently, this option is not enabled, but support for it is planned.
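The effect of the online and offline flags can be pictured with a small sketch. The function `target_sinks` and its parameters are hypothetical and only mirror the behavior described in the notes above; the sink names in the usage lines are placeholders, not a statement about which registered sink serves as the offline sink.

```python
def target_sinks(online, offline, online_sinks, offline_sinks):
    """Return the sinks a feature run would write to, given its flags.

    Hypothetical helper: the platform's real routing logic is not shown in this document.
    """
    sinks = []
    if online:
        sinks.extend(online_sinks)   # e.g. a Redis cache
    if offline:
        sinks.extend(offline_sinks)  # e.g. an S3 sink
    return sinks


both = target_sinks(True, True, ["redis_sink"], ["s3_sink"])
online_only = target_sinks(True, False, ["redis_sink"], ["s3_sink"])
```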
Example Object of Raw Features
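As a minimal sketch, the attributes listed above can be gathered into a single Python object. The `RawFeatureSpec` dataclass below is hypothetical and stands in for whatever class the Canso SDK actually provides; its field defaults are assumptions, and `feature_logic` is kept as a plain string here even though the SDK may expect a logic object. The field values are the example values from the attribute list.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class RawFeatureSpec:
    """Hypothetical container for Raw Feature attributes (not the Canso SDK class)."""
    name: str
    description: str
    owners: list
    entity: str
    data_type: str
    data_sources: list
    staging_sink: list
    online_sink: list
    feature_logic: str
    processing_engine: str = "spark"
    processing_engine_configs: dict = field(default_factory=dict)
    online: bool = True
    offline: bool = True
    schedule: str = "1D"
    active: bool = True
    start_time: datetime = field(default_factory=datetime.now)


user_clicks_7d = RawFeatureSpec(
    name="user_clicks_7d",
    description="Sum of user clicks in the last 7 days",
    owners=["data_team@company.com"],
    entity="user_id",
    data_type="FLOAT",
    data_sources=["survey_telemetry_data"],
    staging_sink=["operational_telemetry_data"],
    online_sink=["online_telemetry_data"],
    feature_logic="SlidingWindowAggregation",
    processing_engine_configs={"memory": "4g", "cores": 2},
)
```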
Working with Raw Features
Once the raw feature is defined:
It can be registered and reused by Derived Features. The output of a Raw Feature becomes the input for a Derived Feature, enabling more complex feature engineering and reducing redundancy.
It can be deployed to execute the defined operation.
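The reuse described above, where the output of a Raw Feature becomes the input of a Derived Feature, can be sketched as follows. The function `derive_ratio`, the second feature `user_sessions_7d`, and all values are hypothetical; the real Derived Feature API is documented separately and is not reproduced here.

```python
def derive_ratio(numerator, denominator):
    """Hypothetical derived feature: per-entity ratio of two raw feature outputs.

    Entities missing from either input, or with a zero denominator, are skipped.
    """
    return {
        entity: numerator[entity] / denominator[entity]
        for entity in numerator.keys() & denominator.keys()
        if denominator[entity] != 0
    }


# Made-up outputs of two raw features, keyed by user_id.
user_clicks_7d = {"u1": 6.0, "u2": 1.0}
user_sessions_7d = {"u1": 3.0, "u2": 0.0, "u3": 2.0}

clicks_per_session_7d = derive_ratio(user_clicks_7d, user_sessions_7d)
```

Because both inputs are already computed and stored, the derived feature does not need to re-read or re-aggregate the underlying data sources.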