Health Metrics for Features in the Data Plane

Last updated 7 months ago

Table of Contents

  1. Introduction

  2. System Architecture Overview

  3. New Features and Improvements

    3.1 Canso Agent Proxy

    3.2 Airflow Job Health Metrics

    3.3 Spark Streaming Health Metrics

  4. Technical Details

    4.1 Control Plane

    4.2 Data Plane

    4.3 Communication and Data Flow

  5. Benefits and Impact

  6. Future Roadmap

1. Introduction

Welcome to the latest release of our Health Metrics Collection for features. This release marks a significant milestone in our efforts to provide robust, scalable, and efficient monitoring capabilities for complex distributed systems. Our focus has been on enhancing the observability of data plane architectures, with particular emphasis on Airflow jobs and Spark streaming applications.

2. System Architecture Overview

Our system is built on a Control Plane and Data Plane architecture, designed to provide comprehensive monitoring while maintaining a clear separation of concerns.

  • Control Plane: Centralized management and monitoring hub

    • Houses the RabbitMQ message broker

    • Will include a metrics database in future releases

    • Responsible for processing and analyzing collected metrics

  • Data Plane: Client-side infrastructure

    • Hosts Airflow jobs and Spark streaming applications

    • New Canso Agent Proxy for efficient metrics collection

3. New Features and Improvements

3.1 Canso Agent Proxy

We've introduced a new component called the Canso Agent Proxy, significantly enhancing our metrics collection capabilities without impacting core functionalities.

Key Features:

  • Deployed as a separate pod within the Canso Agent Helm chart

  • Runs a Flask background scheduler for automated metric collection

  • Operates independently from the main Canso Agent pod

Technical Details:

  • Implemented using Python Flask

  • Utilizes the APScheduler library for task scheduling

  • Communicates with Airflow and Prometheus for metric collection

Benefits:

  • Separation of concerns: Metric collection doesn't interfere with deployment tasks

  • Improved reliability and scalability of the monitoring system

  • Flexible configuration options for collection intervals

3.2 Airflow Job Health Metrics

We've implemented a robust system for collecting and reporting Airflow job health metrics.

Key Features:

  • Collects metrics every 5 minutes

  • Utilizes Airflow's REST API for data retrieval

  • Captures comprehensive information about DAG runs and task states

Technical Details:

  • Interacts with Airflow API endpoints such as /api/v1/dags and /api/v1/dags/{dag_id}/dagRuns

  • Processes API responses to extract relevant health information

  • Structures data into a standardized metric format before publishing

Metrics Collected:

  • Number of active DAGs

  • Success/failure rates of DAG runs

  • Average duration of DAG runs

  • Task-level statistics (success rates, durations, etc.)
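A minimal sketch of this collection step, assuming basic-auth access to Airflow's stable v1 REST API; the service URL, credentials, and `order_by` parameter are placeholders/assumptions, not the proxy's actual configuration:

```python
# Sketch: fetch recent DAG runs from the Airflow REST API and
# compute a success rate for them.
import requests

def success_rate(dag_runs):
    """Fraction of DAG runs whose terminal state is 'success'."""
    if not dag_runs:
        return 0.0
    succeeded = sum(1 for run in dag_runs if run.get("state") == "success")
    return succeeded / len(dag_runs)

def fetch_dag_runs(base_url, dag_id, auth, limit=25):
    """Fetch the most recent runs for one DAG via /api/v1/dags/{dag_id}/dagRuns."""
    resp = requests.get(
        f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        params={"limit": limit, "order_by": "-execution_date"},
        auth=auth,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("dag_runs", [])

# Example usage (placeholder URL and credentials):
# runs = fetch_dag_runs("http://airflow-webserver:8080", "daily_features",
#                       ("user", "password"))
# print(success_rate(runs))
```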

3.3 Spark Streaming Health Metrics

Our new release includes advanced monitoring capabilities for Spark streaming jobs.

Key Features:

  • Collects metrics every 1 minute

  • Leverages Prometheus for efficient metric gathering

  • Focuses on critical Spark driver health indicators

Technical Details:

  • Uses HTTPS calls to the Prometheus query API

  • Employs specific PromQL queries to extract relevant Spark metrics

  • Processes and transforms Prometheus data into our standardized metric format

Metrics Collected:

  • Streaming query progress (input rate, process rate, etc.)

  • Streaming state information (active queries, waiting batches, etc.)
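A hedged sketch of the Prometheus side, using the instant-query endpoint of the Prometheus HTTP API; the Prometheus URL and any metric names passed in are assumptions, not the queries the proxy actually runs:

```python
# Sketch: run a PromQL instant query and flatten the result vector.
import requests

PROM_URL = "http://prometheus-server:9090"  # assumed in-cluster service name

def query_prometheus(promql):
    """Run an instant query via /api/v1/query and return the result vector."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    body = resp.json()
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]

def parse_samples(result):
    """Turn instant-query samples into (labels, value) pairs.

    Each sample looks like {"metric": {...}, "value": [<ts>, "<value>"]}.
    """
    return [(sample["metric"], float(sample["value"][1])) for sample in result]
```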

4. Technical Details

4.1 Control Plane

The Control Plane serves as the centralized hub for metric aggregation and analysis.

Components:

  • RabbitMQ message broker

    • Configured for high availability and durability

    • Uses topic exchanges for flexible routing of metrics

  • Future: Metrics database (e.g., TimescaleDB or InfluxDB)

    • Will provide long-term storage and querying capabilities

Data Flow:

  1. Receives metrics from Data Plane components via RabbitMQ

  2. Processes incoming messages for immediate alerting or visualization

  3. Stores processed metrics in the database (future feature)
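The topic-exchange routing mentioned above could look like the following with the `pika` client. The exchange name and the routing-key scheme are illustrative assumptions; the point is that dotted keys let Control Plane consumers bind with patterns such as `*.airflow.*` or `acme.#`.

```python
# Sketch: publish a metric to a RabbitMQ topic exchange.
import json

EXCHANGE = "canso.metrics"  # assumed topic exchange name

def routing_key(client_id, source, metric):
    """Build a dotted routing key, e.g. 'acme.airflow.dag_success_rate'."""
    return f"{client_id}.{source}.{metric}"

def publish_metric(channel, client_id, source, metric, payload):
    """Publish one metric as a persistent JSON message."""
    import pika  # imported lazily so the routing helper stays dependency-free
    channel.basic_publish(
        exchange=EXCHANGE,
        routing_key=routing_key(client_id, source, metric),
        body=json.dumps(payload),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )
```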

4.2 Data Plane

The Data Plane represents the client-side infrastructure where actual workloads run.

Components:

  • Airflow cluster

    • Runs batch processing jobs

    • Exposes REST API for metric collection

  • Spark cluster

    • Executes streaming jobs

    • Monitored via Prometheus

  • Canso Agent

    • Main pod: Handles deployment of features and AI agents

    • Proxy pod: Responsible for metric collection and reporting

Interaction:

  • Canso Agent Proxy interacts with Airflow API and Prometheus

  • Collected metrics are securely transmitted to the Control Plane

4.3 Communication and Data Flow

  1. Canso Agent Proxy initiates metric collection at specified intervals

  2. Metrics are collected from Airflow API and Prometheus

  3. Collected data is transformed into a standardized format

  4. Metrics are published to RabbitMQ in the Control Plane

  5. Control Plane services consume metrics for processing and storage
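The steps above can be sketched as a standardized envelope built at step 3 and serialized for RabbitMQ at step 4; the field names here are illustrative assumptions, not the proxy's actual schema:

```python
# Sketch of a uniform metric envelope shared by Airflow and Spark collectors.
import json
import time

def make_metric_envelope(source, name, value, labels=None):
    """Wrap one raw reading in a uniform format before publishing."""
    return {
        "schema_version": 1,
        "source": source,          # "airflow" or "spark"
        "metric": name,            # e.g. "dag_success_rate"
        "value": value,
        "labels": labels or {},    # e.g. {"dag_id": "daily_features"}
        "collected_at": int(time.time()),
    }

envelope = make_metric_envelope(
    "airflow", "dag_success_rate", 0.96, {"dag_id": "daily_features"}
)
payload = json.dumps(envelope)  # message body for the RabbitMQ publish
```

A fixed envelope like this keeps Control Plane consumers independent of which collector produced a reading.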

5. Benefits and Impact

  • Improved Visibility: Gain deep insights into the health and performance of both Airflow and Spark jobs

  • Proactive Management: Early detection of issues enables faster response times

  • Scalability: Architecture supports monitoring across multiple client infrastructures

  • Minimal Overhead: Separate proxy ensures core functionalities remain unaffected

6. Future Roadmap

  • Implementation of a metrics database in the Control Plane

  • Advanced analytics and machine learning for predictive maintenance

  • Expansion of metric collection to cover additional components

For any questions, concerns, or support needs, please don't hesitate to reach out to us via the Canso Community.
