Canso - ML Platform

ML Features


Last updated 11 months ago


Features

A Feature is an individual measurable property or characteristic of a phenomenon. Features are used for:

  • Creating Training Data

  • Inference, i.e. making predictions

Introduction

Operationalising Machine Learning is complex. Even before getting to the stage of Training Data, ML teams need to work on multiple aspects.

1. Scheduling Features

For most Machine Learning problems, features vary with time. For example, a user's spend in the last 7 days will keep changing over time. Therefore, Data Scientists often need to calculate their features on certain schedules. This requires scheduling features as DAGs using workflow management platforms such as Airflow, Luigi, etc.
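To see why such features must be recomputed on a schedule, consider a minimal sketch in plain Python. The event log, user ids, and function name below are hypothetical and not part of the Canso API; the point is only that the same feature evaluates differently depending on when it is computed.

```python
from datetime import datetime, timedelta

# Hypothetical event log: (user_id, timestamp, amount). In practice this
# would come from a data source such as a warehouse table.
EVENTS = [
    ("u1", datetime(2024, 1, 1), 50.0),
    ("u1", datetime(2024, 1, 5), 30.0),
    ("u1", datetime(2024, 1, 20), 20.0),
]

def spend_last_7d(user_id: str, as_of: datetime) -> float:
    """Sum a user's spend in the 7 days ending at `as_of`.

    The result depends on `as_of`, which is why time-windowed features
    must be recomputed on a schedule rather than calculated once.
    """
    window_start = as_of - timedelta(days=7)
    return sum(
        amount
        for uid, ts, amount in EVENTS
        if uid == user_id and window_start < ts <= as_of
    )

# The same feature yields different values at different points in time.
print(spend_last_7d("u1", datetime(2024, 1, 6)))   # 80.0 (Jan 1 + Jan 5)
print(spend_last_7d("u1", datetime(2024, 1, 21)))  # 20.0 (only Jan 20)
```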

More often than not, ML teams have additional requirements that accompany scheduling.

1a. Monitoring

Users benefit greatly from being able to

  • visualize historical feature job runs and lineage

  • go through logs

  • see job configuration details

1b. Alerting

Scheduled feature jobs can fail for multiple reasons: lack of upstream data, unavailability of computational resources, etc. Teams want to be alerted in such cases and to be able to re-run feature jobs with ease.

1c. Pipeline Best Practices

1d. Backfills

Once a feature has been scheduled to run, i.e. has been deployed, a Data Scientist may want these feature jobs to run for a set of days in the past. This is referred to as a backfill, a common concept in Data Engineering pipelines. Backfills let Data Scientists replay existing feature jobs for past dates with minimal configuration changes and without having to define the feature logic from scratch.
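A backfill can be pictured as replaying the same daily job over a range of past execution dates: only the execution date changes, never the feature definition. The sketch below is illustrative, not the Canso API.

```python
from datetime import date, timedelta

def backfill_dates(start: date, end: date) -> list:
    """Enumerate the past run dates a daily feature job should replay.

    Each returned date corresponds to one re-run of the existing
    feature logic, as-of that historical date.
    """
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

# Replay a daily feature job for the first week of January.
runs = backfill_dates(date(2024, 1, 1), date(2024, 1, 7))
print(len(runs))  # 7 daily runs, one per past execution date
```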

2. Need for Online Retrieval and Mitigating Train-Serve Skew

While features are crucial for model training, they are critical for model predictions as well. Assume a model for a recommendation system was trained using multiple features, one of which was user_grocery_clicks_7d, the clicks on grocery items on an E-commerce website in the last 7 days. This feature will also be used to rank items the next time users visit the website. This gives rise to 2 considerations for ML Teams:

  1. Train-Serve Skew: Training-serving skew happens when the feature data distribution at prediction/inference time differs from the distribution present in the training data. Lack of consistency between training and serving results in degraded model performance.

  2. Low-latency retrieval: Model predictions impact user experience. In the scenario above, we would like to recommend products to users in ~100 ms. This mandates that feature values for user_grocery_clicks_7d can be queried/fetched with sub-second latency.
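Conceptually, an online store addresses both considerations: serving reads the same precomputed values that training used, via a fast key-value lookup keyed by entity id, instead of recomputing the aggregation at request time. A minimal sketch, with hypothetical names and data:

```python
# Hypothetical online store contents: entity id -> feature values.
# In practice this would be a low-latency store such as a key-value
# database, populated by the same materialization jobs used offline.
online_store = {
    "user_42": {"user_grocery_clicks_7d": 17},
}

def get_online_features(user_id: str) -> dict:
    """Fetch precomputed feature values for one entity.

    A dictionary lookup stands in for the online store read; the key
    property is that no aggregation runs on the serving path.
    """
    return online_store.get(user_id, {})

features = get_online_features("user_42")
print(features["user_grocery_clicks_7d"])  # 17, fetched without recomputation
```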

3. Re-usability & Feature Sharing

Different teams/Data Scientists can end up building the same feature multiple times. Over time, this leads to duplicated pipelines and unnecessary compute and storage costs (offline and online). For an organization, this reduces productivity and increases time to production, since Data Scientists build features from scratch instead of re-using existing ones.

Working with Features

Canso allows users to define features in a Declarative Manner. At a high level, defining a feature involves 3 considerations:

1. Feature Metadata

Feature metadata includes

  • the name of the feature

  • a human-readable description for easier understanding

  • the data source on top of which this feature needs to be calculated

  • the datatype

  • owners of the feature

2. Feature Logic

Feature Logic is the transformation that is used to compute the feature. Transformations include commonly used aggregations such as SUM, MIN, MAX, etc., as well as row-level transformations.

3. Feature Scheduling Details

Scheduling details include a feature's

  • computation schedule: common schedules are once a day, once an hour, etc.

  • feature compute start time, i.e. the time from which the feature computation should begin

  • whether or not the feature's computed values should be ingested into an online store
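Putting the three parts together, a declarative feature definition might look like the sketch below. The class, field names, and values are illustrative only and do not reflect the actual Canso client API.

```python
from dataclasses import dataclass

@dataclass
class BatchFeature:
    """Illustrative container for the 3 parts of a feature definition."""
    # 1. Feature metadata
    name: str
    description: str
    data_source: str
    dtype: str
    owners: list
    # 2. Feature logic
    transform: str
    # 3. Feature scheduling details
    schedule: str          # e.g. a cron expression for "once a day"
    start_time: str        # when feature computation should begin
    online: bool = False   # ingest materialized values to an online store?

user_grocery_clicks_7d = BatchFeature(
    name="user_grocery_clicks_7d",
    description="Clicks on grocery items in the last 7 days",
    data_source="clickstream_events",       # hypothetical source name
    dtype="int",
    owners=["ml-platform@example.com"],
    transform="SUM(clicks) over the trailing 7 days",
    schedule="0 0 * * *",                   # once a day
    start_time="2024-01-01",
    online=True,                            # enable online ingestion
)
print(user_grocery_clicks_7d.name)
```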

Register a Feature

A Feature, once defined, needs to be registered. Canso persists the feature's metadata, logic, and scheduling details for future reference and re-use.

Deploy a Feature

When users deploy a feature, Canso creates a DAG for the feature job. This DAG is automatically scheduled and starts running based on the user-defined schedule. These DAGs compute a feature (also referred to as materialization) and the materialized values are persisted. If the user specifies that ingestion is needed, the materialized values are ingested into an online store as well. To enable online ingestion, set

online=True

while defining a Feature.

Types of Features

Canso divides Features into 2 categories: Raw and Derived. Raw Features are transformations or aggregations on data sources. Derived Features are defined on top of existing Raw Features.
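The distinction can be sketched in a few lines of Python; the feature names and logic below are hypothetical, not actual Canso features.

```python
# Raw features aggregate a data source directly.
def raw_clicks_7d(events) -> int:
    return sum(e["clicks"] for e in events)

def raw_impressions_7d(events) -> int:
    return sum(e["impressions"] for e in events)

# A Derived feature is defined on top of existing Raw features,
# here a click-through rate built from two Raw aggregations.
def derived_ctr_7d(events) -> float:
    impressions = raw_impressions_7d(events)
    return raw_clicks_7d(events) / impressions if impressions else 0.0

events = [
    {"clicks": 3, "impressions": 100},
    {"clicks": 1, "impressions": 50},
]
print(derived_ctr_7d(events))  # 4 clicks / 150 impressions
```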

Note: additional effort has to be invested to make sure feature jobs are idempotent. Idempotence ensures self-correction since it prevents duplication of data when pipelines fail.
