ML Features
Features
A Feature in Machine Learning is an individual measurable property or characteristic of a phenomenon. Features are used for:
Creating Training Data
Inference i.e. making predictions
Introduction
Operationalising Machine Learning is complex. Even before getting to the stage of Training Data, ML teams need to work on multiple aspects.
1. Scheduling Features
For most Machine Learning problems, features vary with time. For example, a user's spend in the last 7 days keeps changing as time passes. Data Scientists therefore often need to compute their features on a schedule. This requires scheduling feature jobs as DAGs using workflow management platforms such as Airflow or Luigi.
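To illustrate why a time-windowed feature has to be recomputed on a schedule, here is a minimal sketch in plain Python. The event data, the function names (user_spend_7d, daily_run_dates) and the once-a-day cadence are assumptions made for illustration; in practice a workflow platform would trigger the runs.

```python
from datetime import date, timedelta

# Hypothetical raw events: (user_id, event_date, amount_spent).
EVENTS = [
    ("u1", date(2024, 1, 1), 10.0),
    ("u1", date(2024, 1, 5), 20.0),
    ("u1", date(2024, 1, 9), 5.0),
]


def user_spend_7d(events, user_id, run_date):
    """Sum a user's spend over the 7 days ending on run_date (inclusive)."""
    window_start = run_date - timedelta(days=6)
    return sum(
        amount
        for uid, event_date, amount in events
        if uid == user_id and window_start <= event_date <= run_date
    )


def daily_run_dates(start, end):
    """Yield one run date per day, mimicking a once-a-day schedule."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# The same feature yields different values depending on when it is computed.
spend_by_run_date = {
    run_date: user_spend_7d(EVENTS, "u1", run_date)
    for run_date in daily_run_dates(date(2024, 1, 5), date(2024, 1, 12))
}
```

The value for the same user differs from one run date to the next, which is why the computation has to be repeated on a schedule rather than performed once.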
More often than not, ML teams have additional requirements that accompany scheduling.
1a. Monitoring
Users benefit greatly from being able to
visualize historical feature job runs and lineage
go through logs
see job configuration details
1b. Alerting
Scheduled feature jobs can fail for multiple reasons, such as missing upstream data or unavailable computational resources. Teams want to be alerted in such cases and to be able to re-run feature jobs with ease.
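One way to picture alerting and re-runs is a small wrapper that retries a job and raises an alert only when every attempt fails. This is a sketch under assumptions: run_with_alerts, the alert callback and the example jobs are hypothetical names, not part of any particular scheduler.

```python
def run_with_alerts(job, run_date, max_attempts=3, alert=print):
    """Run job(run_date); retry on failure and alert if every attempt fails."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return job(run_date)
        except Exception as exc:  # broad on purpose: any failure counts
            errors.append(f"attempt {attempt}: {exc}")
    alert(f"feature job failed for {run_date}: " + "; ".join(errors))
    return None


# Hypothetical job that fails twice (e.g. upstream data is late), then succeeds.
calls = {"count": 0}


def flaky_job(run_date):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("upstream data not available")
    return f"computed {run_date}"


# Hypothetical job that never succeeds.
def broken_job(run_date):
    raise RuntimeError("no compute resources")


alerts = []  # collected instead of printed, so they can be inspected
result = run_with_alerts(flaky_job, "2024-01-05", alert=alerts.append)
failed = run_with_alerts(broken_job, "2024-01-06", max_attempts=2, alert=alerts.append)
```

Passing the alert function as a parameter keeps the retry logic independent of the notification channel (email, chat, pager and so on).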
1c. Pipeline Best Practices
Additional effort has to be invested to make sure feature jobs are idempotent, i.e. running a job several times for the same period produces the same result as running it once. Idempotence makes pipelines self-correcting, since re-running a failed job does not duplicate data.
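The difference between an idempotent and a non-idempotent write can be sketched as follows. The in-memory list and dict stand in for a real table, and both function names are hypothetical.

```python
def append_write(table, run_date, rows):
    """Non-idempotent: re-running the same date duplicates its rows."""
    table.extend((run_date, row) for row in rows)


def overwrite_partition(table, run_date, rows):
    """Idempotent: replace the whole partition for run_date."""
    table[run_date] = list(rows)


rows = [("u1", 30.0), ("u2", 12.5)]

# Appending twice (for example, a retry after a partial failure) duplicates data.
appended = []
append_write(appended, "2024-01-05", rows)
append_write(appended, "2024-01-05", rows)

# Overwriting the partition twice leaves exactly one copy.
partitioned = {}
overwrite_partition(partitioned, "2024-01-05", rows)
overwrite_partition(partitioned, "2024-01-05", rows)
```

Keying the output by run date and replacing it wholesale is one common way to get idempotence; other approaches, such as upserts on a unique key, achieve the same effect.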
1d. Backfills
Once a feature has been scheduled to run, i.e. has been deployed, a Data Scientist may want its feature jobs to run for a set of days in the past. Such runs are referred to as backfills, a common concept in Data Engineering pipelines. Backfills let Data Scientists replay existing feature jobs for past dates with minimal configuration changes and without having to define the feature logic from scratch.
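Conceptually, a backfill replays an unchanged job over a range of past dates. A minimal sketch, assuming a daily job that takes its run date as the only argument (backfill and example_job are hypothetical names):

```python
from datetime import date, timedelta


def backfill(feature_job, start, end):
    """Run feature_job once per day from start to end (inclusive)."""
    results = {}
    current = start
    while current <= end:
        results[current] = feature_job(current)
        current += timedelta(days=1)
    return results


def example_job(run_date):
    # Hypothetical stand-in for an already-deployed feature job.
    return f"user_spend_7d@{run_date.isoformat()}"


# Replay the existing job for three past dates; the job itself is unchanged.
backfilled = backfill(example_job, date(2024, 1, 1), date(2024, 1, 3))
```

Because only the date range is supplied, the feature logic does not need to be redefined. Backfills are safe to repeat only if the underlying job is idempotent.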
2. Online Retrieval and Mitigating Train-Serve Skew
While features are crucial for model training, they are just as critical for model predictions. Assume a model for a recommendation system was trained using multiple features, one of which was the number of clicks on grocery items on an E-commerce website in the last 7 days (user_grocery_clicks_7d). This feature will also be used to rank items the next time users visit the website. This gives rise to 2 considerations for ML teams:
Train-Serve Skew: Train-serve skew occurs when the distribution of feature data at inference time differs from the distribution present in the training data. This inconsistency between training and serving degrades model performance.
Low-latency retrieval: Model predictions impact user experience. In the scenario above, we would like to recommend products to users in ~100 ms. This requires that feature values for user_grocery_clicks_7d can be fetched with sub-second latency.
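Both considerations can be sketched together: the same feature logic produces the training values and the values served online, and the online store is a key-value lookup. This is an illustration under assumptions; the in-memory dict stands in for a real online store, the click events are assumed to be already restricted to the last 7 days, and count_grocery_clicks and get_online_feature are hypothetical names.

```python
def count_grocery_clicks(click_events, user_id):
    """Shared feature logic: number of grocery clicks for one user."""
    return sum(
        1
        for uid, category in click_events
        if uid == user_id and category == "grocery"
    )


# Hypothetical click events, assumed to cover only the last 7 days.
CLICKS = [
    ("u1", "grocery"),
    ("u1", "electronics"),
    ("u1", "grocery"),
    ("u2", "grocery"),
]

# Offline: values used to build training data.
training_rows = {uid: count_grocery_clicks(CLICKS, uid) for uid in ("u1", "u2")}

# Online: the same values, keyed for constant-time lookup at inference.
online_store = {
    ("user_grocery_clicks_7d", uid): value for uid, value in training_rows.items()
}


def get_online_feature(store, feature_name, user_id, default=0):
    """Fetch one feature value by key; fall back to a default if absent."""
    return store.get((feature_name, user_id), default)


served_value = get_online_feature(online_store, "user_grocery_clicks_7d", "u1")
```

Because training and serving read values produced by a single definition, the two cannot drift apart through separately maintained logic, which is one common source of skew.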
3. Re-usability & Feature Sharing
Different teams or Data Scientists can end up building the same feature multiple times. Over time, this leads to duplicated pipelines and unnecessary compute and storage costs (offline and online). For an organization, this reduces productivity and increases time to production, since Data Scientists build features from scratch instead of re-using existing ones.
Working with Features
Canso allows users to define features declaratively. At a high level, defining a feature involves 3 considerations:
1. Feature Metadata
Feature metadata includes
the name of the feature
a human-readable description for easier understanding
the data-source on top of which this feature needs to be calculated
the datatype
owners of the feature
2. Feature Logic
Feature Logic is the transformation used to compute the feature. Transformations include commonly used aggregations such as SUM, MIN and MAX, as well as row-level transformations.
3. Feature Scheduling Details
Scheduling details include a feature's
computation schedule - common schedules are once a day, once an hour etc.
feature compute start time, i.e. the time from which the feature computation should begin
whether or not the feature's computed values should be ingested into an online store
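The three considerations above can be pictured as a single declarative record. The sketch below is not Canso's API: the FeatureDefinition class, its field names and the allowed aggregation set are assumptions chosen only to mirror the lists in this section.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed for illustration; the text names SUM, MIN and MAX as examples.
ALLOWED_AGGREGATIONS = {"SUM", "MIN", "MAX"}


@dataclass(frozen=True)
class FeatureDefinition:
    # 1. Feature metadata
    name: str
    description: str
    data_source: str
    data_type: str
    owners: tuple
    # 2. Feature logic
    aggregation: str
    column: str
    # 3. Feature scheduling details
    schedule: str          # e.g. "daily" or "hourly"
    start_time: datetime   # when computation should begin
    online: bool = False   # ingest computed values into an online store?

    def __post_init__(self):
        # Reject aggregations outside the assumed supported set.
        if self.aggregation not in ALLOWED_AGGREGATIONS:
            raise ValueError(f"unsupported aggregation: {self.aggregation}")


user_spend_7d = FeatureDefinition(
    name="user_spend_7d",
    description="Total amount spent by a user in the last 7 days",
    data_source="orders",
    data_type="float",
    owners=("data-science-team",),
    aggregation="SUM",
    column="amount",
    schedule="daily",
    start_time=datetime(2024, 1, 1),
    online=True,
)
```

Keeping metadata, logic and schedule in one immutable record makes a definition easy to persist, review and re-use.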
Register a Feature
Once defined, a Feature needs to be registered. Canso persists the feature's metadata, logic and scheduling details for future reference and re-use.
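As a rough mental model, registration stores a definition under its name so that others can look it up instead of rebuilding it. The FeatureRegistry class below is a hypothetical in-memory sketch and does not reflect how Canso persists features.

```python
class FeatureRegistry:
    """Hypothetical in-memory registry keyed by feature name."""

    def __init__(self):
        self._features = {}

    def register(self, name, definition):
        """Store a definition; refuse to silently overwrite an existing one."""
        if name in self._features:
            raise ValueError(f"feature already registered: {name}")
        self._features[name] = dict(definition)

    def get(self, name):
        """Return the stored definition for re-use."""
        return self._features[name]

    def list_names(self):
        """Return all registered feature names in sorted order."""
        return sorted(self._features)


registry = FeatureRegistry()
registry.register(
    "user_spend_7d",
    {"aggregation": "SUM", "column": "amount", "schedule": "daily"},
)

# A second team finds the existing feature instead of building it again.
existing = registry.get("user_spend_7d")
```

Rejecting a duplicate name is one possible policy; a real registry might instead version definitions.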
Deploy a Feature
When users deploy a feature, Canso creates a DAG for the feature job. This DAG is scheduled automatically and starts running according to the user-defined schedule. The DAG computes the feature (also referred to as materialization) and persists the materialized values. If the user specifies that ingestion is needed, the materialized values are also ingested into an online store. Online ingestion is enabled through a setting specified while defining a Feature.
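The materialization step described here can be sketched as a single function: compute the values, persist them offline, and optionally ingest them into an online store. Everything in this sketch is an assumption for illustration (the materialize function, its online parameter and the dict-based stores); it is not the DAG that Canso generates.

```python
def materialize(feature_name, compute, run_date, offline_store, online_store, online=False):
    """Compute a feature for run_date, persist it, and optionally ingest it online."""
    values = compute(run_date)  # expected shape: {entity_id: value}
    # Always persist the materialized values, keyed by feature and run date.
    offline_store[(feature_name, run_date)] = dict(values)
    if online:
        # Online store keeps only the latest value per entity.
        for entity_id, value in values.items():
            online_store[(feature_name, entity_id)] = value
    return values


def compute_spend(run_date):
    # Hypothetical stand-in for the feature logic.
    return {"u1": 30.0, "u2": 12.5}


offline, online = {}, {}
materialize("user_spend_7d", compute_spend, "2024-01-05", offline, online, online=True)

offline_only, untouched_online = {}, {}
materialize("user_spend_7d", compute_spend, "2024-01-05", offline_only, untouched_online)
```

With the flag left at its default, only the offline copy is written, which matches the idea that online ingestion is opt-in.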
Types of Features
Canso divides Features into 2 categories, Raw and Derived. Raw Features are transformations or aggregations on data sources. Derived Features are defined on top of existing Raw Features.
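The relationship between the two categories can be illustrated with a small example. The feature names and values are hypothetical: two Raw Features are aggregations over a data source, and a Derived Feature is computed from them without touching the source again.

```python
# Hypothetical materialized Raw Features: {feature_name: {user_id: value}}.
raw_features = {
    "user_spend_7d": {"u1": 30.0, "u2": 12.5, "u3": 0.0},
    "user_orders_7d": {"u1": 3, "u2": 1, "u3": 0},
}


def avg_order_value_7d(raw, user_id):
    """Derived Feature: average spend per order, built on two Raw Features."""
    orders = raw["user_orders_7d"][user_id]
    if orders == 0:
        return 0.0  # avoid dividing by zero for users with no orders
    return raw["user_spend_7d"][user_id] / orders


derived = {
    uid: avg_order_value_7d(raw_features, uid)
    for uid in raw_features["user_orders_7d"]
}
```

Defining the derived value on top of existing features means a change to a Raw Feature's logic flows through to everything built on it.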