ML Features

Features

A Feature in Machine Learning is an individual measurable property or characteristic of a phenomenon. Features are used for

  • Creating Training Data

  • Inference i.e. making predictions

Introduction

Operationalising Machine Learning is complex. Even before getting to the stage of Training Data, ML teams need to work on multiple aspects.

1. Scheduling Features

For most Machine Learning problems, features vary with time. For example, a user's spend in the last 7 days keeps changing as time passes. Therefore, Data Scientists often need to compute their features on a schedule. This requires scheduling feature jobs as DAGs using workflow management platforms such as Airflow or Luigi.
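As an illustration, a minimal Airflow DAG for a daily feature job might look like the sketch below. The feature name user_spend_7d and the compute_user_spend_7d function are hypothetical placeholders, not part of Canso.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def compute_user_spend_7d(ds, **_):
    # Hypothetical feature logic: aggregate each user's spend over the
    # 7 days ending on the run's execution date (`ds`), then persist it.
    ...


with DAG(
    dag_id="user_spend_7d_feature",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # recompute the feature once a day
    catchup=False,
) as dag:
    PythonOperator(
        task_id="materialize_user_spend_7d",
        python_callable=compute_user_spend_7d,
    )
```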

More often than not, ML teams have additional requirements that accompany scheduling.

1a. Monitoring

Users benefit greatly from being able to

  • visualize historical feature job runs and lineage

  • go through logs

  • see job configuration details

1b. Alerting

Scheduled feature jobs can fail for multiple reasons: missing upstream data, unavailable computational resources, etc. Teams want to be alerted in such cases and to be able to re-run feature jobs with ease.
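With Airflow, for example, much of this can be expressed as a few DAG-level settings. The sketch below is illustrative; the owner, retry counts, and email address are placeholders.

```python
from datetime import timedelta

# Hypothetical defaults for a feature DAG: retry the job automatically a
# couple of times, then notify the owning team by email if it still fails.
default_args = {
    "owner": "ml-platform",                # placeholder owner
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "email": ["ml-oncall@example.com"],    # placeholder address
    "email_on_failure": True,
}

# Passed when constructing the DAG, e.g. DAG(..., default_args=default_args).
```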

1c. Pipeline Best Practices

Additional effort has to be invested to make sure feature jobs are idempotent. Idempotence lets pipelines self-correct: re-running a job after a failure replaces its output instead of duplicating data.
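A common way to achieve idempotence is to key each run's output by its run date and overwrite that partition on re-runs, so a retried job replaces its previous output instead of appending to it. A minimal sketch, assuming a pandas DataFrame and a hypothetical file-based offline store:

```python
import os

import pandas as pd


def write_feature_partition(df: pd.DataFrame, run_date: str) -> None:
    """Write one day's feature values to a date-keyed partition.

    Re-running the job for the same run_date overwrites the partition,
    so retries and backfills never duplicate rows.
    """
    # Hypothetical output location; in practice this is the offline store.
    path = f"/data/features/user_spend_7d/dt={run_date}"
    os.makedirs(path, exist_ok=True)
    df.to_parquet(os.path.join(path, "part.parquet"), index=False)
```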

1d. Backfills

Once a feature has been scheduled to run, i.e. has been deployed, a Data Scientist may want its feature jobs to run for a set of days in the past. This is referred to as a backfill, a common concept in Data Engineering pipelines. Backfills help Data Scientists replay existing feature jobs for past dates with minimal configuration changes and without having to define the feature logic from scratch.
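Because a daily feature job like the one sketched earlier is parameterized by run date and idempotent, a backfill amounts to replaying the same job over a range of past dates. A sketch, with run_feature_job as a placeholder for the real job:

```python
from datetime import date, timedelta


def run_feature_job(run_date: str) -> None:
    # Placeholder for the actual daily feature job: compute the feature for
    # `run_date` and persist it with an idempotent, date-keyed write
    # (e.g. write_feature_partition from the sketch above).
    ...


# Backfill: replay the same job for the last 30 days.
start = date.today() - timedelta(days=30)
for offset in range(30):
    run_date = (start + timedelta(days=offset)).isoformat()
    run_feature_job(run_date)
```

Workflow managers also expose this directly; Airflow, for instance, ships with a backfill command that re-runs a DAG over a date range.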

2. Need for Online Retrieval and Mitigating Train-Serve Skew

While features are crucial for model training, they are critical for model predictions as well. Assume a model for a recommendation system was trained using multiple features, one of which was user_grocery_clicks_7d, the number of clicks on grocery items on an E-commerce website in the last 7 days. This feature will also be used to rank items the next time users visit the website. This gives rise to 2 considerations for ML teams:

  1. Train-Serve Skew: Training-serving skew happens when the feature data distribution at prediction/inference time differs from the distribution present in the training data. Lack of consistency between training and serving degrades model performance.

  2. Low-latency retrieval: Model predictions impact user experience. In the scenario above, we would like to recommend products to users in ~100 ms. This mandates that feature values for user_grocery_clicks_7d can be queried/fetched at sub-second latencies, as sketched below.
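One common pattern for meeting both needs (described here as a general pattern, not as a statement about Canso's internals) is to ingest the materialized feature values into a key-value store such as Redis, keyed by entity, so that serving is a single low-latency lookup. Writing the online values from the same pipeline that produces the training data is also what keeps the serving-time distribution consistent with training. A sketch with hypothetical key names:

```python
import json

import redis

# Hypothetical online store; host/port are placeholders.
store = redis.Redis(host="localhost", port=6379, decode_responses=True)


def ingest_online(user_id: str, value: int) -> None:
    # Ingestion side: the scheduled feature job writes the latest value.
    store.set(f"user_grocery_clicks_7d:{user_id}", json.dumps(value))


def fetch_online(user_id: str) -> int | None:
    # Serving side: a single GET, typically sub-millisecond, which keeps the
    # end-to-end recommendation call well within the ~100 ms budget.
    raw = store.get(f"user_grocery_clicks_7d:{user_id}")
    return json.loads(raw) if raw is not None else None
```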

3. Re-usability & Feature Sharing

Different teams/Data Scientists can end up building the same feature multiple times. Over time, this leads to duplicated pipelines and unnecessary compute and storage costs (offline and online). For an organization, this reduces productivity and increases time to production, since Data Scientists build features from scratch instead of re-using existing ones.

Working with Features

Canso allows users to define features in a declarative manner. At a high level, defining a feature involves 3 considerations (a combined sketch follows the scheduling details below).

1. Feature Metadata

Feature metadata includes

  • the name of the feature

  • a human-readable description for easier understanding

  • the data source on top of which the feature needs to be calculated

  • the datatype

  • owners of the feature

2. Feature Logic

Feature Logic is the transformation that is used to compute the feature. Transformations include commonly used aggregations such as SUM, MIN, and MAX, or row-level transformations.
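As a concrete illustration, the 7-day click count discussed earlier is just a windowed SUM over an events table. A pandas sketch with hypothetical column names:

```python
import pandas as pd


def user_grocery_clicks_7d(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """SUM aggregation: grocery clicks per user over the 7 days ending at `as_of`.

    `events` is assumed to have `user_id`, `is_grocery_click`, and `event_time` columns.
    """
    window = events[
        (events["event_time"] > as_of - pd.Timedelta(days=7))
        & (events["event_time"] <= as_of)
    ]
    return (
        window.groupby("user_id", as_index=False)["is_grocery_click"]
        .sum()
        .rename(columns={"is_grocery_click": "user_grocery_clicks_7d"})
    )
```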

3. Feature Scheduling Details

Scheduling details include a feature's

  • computation schedule - common schedules are once a day, once an hour etc.

  • feature compute start time i.e. the time since when the feature computation should begin

  • whether or not the feature's computed values should be ingested to an online store
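Put together, a declarative feature definition bundles all three considerations. The sketch below is purely illustrative and does not reproduce Canso's actual SDK; every class and argument name is a hypothetical stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureDefinition:  # hypothetical container, not Canso's API
    # 1. Feature metadata
    name: str
    description: str
    data_source: str
    dtype: str
    owners: list[str] = field(default_factory=list)
    # 2. Feature logic
    transform: str = ""           # e.g. "SUM(amount) over the last 7 days"
    # 3. Scheduling details
    schedule: str = "@daily"      # computation schedule
    start_time: str = ""          # when computation should begin
    online: bool = False          # ingest computed values to an online store?


user_spend_7d = FeatureDefinition(
    name="user_spend_7d",
    description="Total user spend over the trailing 7 days",
    data_source="transactions",
    dtype="float",
    owners=["ml-platform"],
    transform="SUM(amount) over the last 7 days",
    schedule="@daily",
    start_time="2024-01-01",
    online=True,
)
```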

Register a Feature

A Feature, once defined, needs to be registered. Canso persists the feature's metadata, logic, and scheduling details for future reference and re-use.

Deploy a Feature

When users deploy a feature, Canso creates a DAG for the feature job. This DAG is scheduled automatically and starts running on the user's defined schedule. These DAGs compute the feature (also referred to as materialization), and the materialized values are persisted. If the user specifies that ingestion is needed, the materialized values are also ingested to an online store. To enable online ingestion, set

online=True

while defining a Feature.

Types of Features

Canso divides Features into 2 categories: Raw and Derived. Raw Features are transformations or aggregations on data sources. Derived Features are defined on top of existing Raw Features.
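For example, user_grocery_clicks_7d and a hypothetical user_total_clicks_7d would both be Raw Features (aggregations over a clickstream source), while their ratio could be a Derived Feature built entirely from them. A small illustrative sketch:

```python
def grocery_click_share_7d(grocery_clicks_7d: float, total_clicks_7d: float) -> float:
    """Derived Feature: defined only in terms of existing Raw Features."""
    return grocery_clicks_7d / total_clicks_7d if total_clicks_7d else 0.0
```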
