Health Metrics for Features in the Data Plane
1. Introduction
Welcome to the latest release of our Health Metrics Collection for features. This release marks a significant milestone in our efforts to provide robust, scalable, and efficient monitoring capabilities for complex distributed systems. Our focus has been on enhancing the observability of data plane architectures, with particular emphasis on Airflow jobs and Spark streaming applications.
2. System Architecture Overview
Our system is built on a Control Plane and Data Plane architecture, designed to provide comprehensive monitoring while maintaining a clear separation of concerns.
Control Plane: Centralized management and monitoring hub
Houses the RabbitMQ message broker
Will include a metrics database in future releases
Responsible for processing and analyzing collected metrics
Data Plane: Client-side infrastructure
Hosts Airflow jobs and Spark streaming applications
New Canso Agent Proxy for efficient metrics collection
3. New Features and Improvements
3.1 Canso Agent Proxy
We've introduced a new component called the Canso Agent Proxy, significantly enhancing our metrics collection capabilities without impacting core functionalities.
Key Features:
Deployed as a separate pod within the Canso Agent Helm chart
Runs a Flask background scheduler for automated metric collection
Operates independently from the main Canso Agent pod
Technical Details:
Implemented using Python Flask
Utilizes the APScheduler library for task scheduling
Communicates with Airflow and Prometheus for metric collection
Benefits:
Separation of concerns: Metric collection doesn't interfere with deployment tasks
Improved reliability and scalability of the monitoring system
Flexible configuration options for collection intervals
3.2 Airflow Job Health Metrics
We've implemented a robust system for collecting and reporting Airflow job health metrics.
Key Features:
Collects metrics every 5 minutes
Utilizes Airflow's REST API for data retrieval
Captures comprehensive information about DAG runs and task states
Technical Details:
Interacts with Airflow API endpoints such as /api/v1/dags and /api/v1/dags/{dag_id}/dagRuns
Processes API responses to extract relevant health information
Structures data into a standardized metric format before publishing
Metrics Collected:
Number of active DAGs
Success/failure rates of DAG runs
Average duration of DAG runs
Task-level statistics (success rates, durations, etc.)
3.3 Spark Streaming Health Metrics
Our new release includes advanced monitoring capabilities for Spark streaming jobs.
Key Features:
Collects metrics every 1 minute
Leverages Prometheus for efficient metric gathering
Focuses on critical Spark driver health indicators
Technical Details:
Uses HTTPS calls to the Prometheus query API
Employs specific PromQL queries to extract relevant Spark metrics
Processes and transforms Prometheus data into our standardized metric format
Metrics Collected:
Streaming query progress (input rate, process rate, etc.)
Streaming state information (active queries, waiting batches, etc.)
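The Prometheus-based collection described above might look like the following sketch. The /api/v1/query endpoint is Prometheus's standard instant-query API; the label name used as the result key is a placeholder, and the actual PromQL queries the proxy runs are not reproduced here.

```python
# Hedged sketch: pulling Spark driver metrics through the Prometheus
# HTTP query API and flattening the instant-vector response.
import json
from urllib import parse, request

def query_prometheus(base_url, promql):
    # GET /api/v1/query?query=<promql> (Prometheus HTTP API)
    url = f"{base_url}/api/v1/query?" + parse.urlencode({"query": promql})
    with request.urlopen(url) as resp:
        return json.load(resp)

def extract_values(response):
    """Flatten an instant-vector response into {label value -> float}."""
    if response.get("status") != "success":
        return {}
    out = {}
    for sample in response["data"]["result"]:
        # "query_name" is an assumed label; real Spark metrics may key
        # on different labels.
        key = sample["metric"].get("query_name", "unknown")
        out[key] = float(sample["value"][1])  # value is [timestamp, "string"]
    return out
```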
4. Technical Details
4.1 Control Plane
The Control Plane serves as the centralized hub for metric aggregation and analysis.
Components:
RabbitMQ message broker
Configured for high availability and durability
Uses topic exchanges for flexible routing of metrics
Future: Metrics database (e.g., TimescaleDB or InfluxDB)
Will provide long-term storage and querying capabilities
Data Flow:
Receives metrics from Data Plane components via RabbitMQ
Processes incoming messages for immediate alerting or visualization
Stores processed metrics in the database (future feature)
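Topic-exchange routing as described above could be sketched like this. The exchange name and the "metrics.&lt;client&gt;.&lt;source&gt;" routing-key scheme are illustrative assumptions, not the shipped configuration; the channel is duck-typed so any object with a basic_publish method (such as a pika channel) works.

```python
# Hedged sketch: publishing metrics to a topic exchange with a
# hypothetical routing-key scheme.
import json

EXCHANGE = "metrics"  # topic exchange (assumed name)

def routing_key(client_id, source):
    # e.g. "metrics.acme.airflow"; consumers can bind patterns
    # like "metrics.*.airflow" for flexible routing.
    return f"metrics.{client_id}.{source}"

def publish_metric(channel, client_id, source, metric):
    """Publish one metric; `channel` is any object exposing
    basic_publish (e.g. a pika BlockingConnection channel)."""
    channel.basic_publish(
        exchange=EXCHANGE,
        routing_key=routing_key(client_id, source),
        body=json.dumps(metric),
    )
```

Topic exchanges let the Control Plane add new consumers (alerting, storage, visualization) per metric source without touching the publishers.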
4.2 Data Plane
The Data Plane represents the client-side infrastructure where actual workloads run.
Components:
Airflow cluster
Runs batch processing jobs
Exposes REST API for metric collection
Spark cluster
Executes streaming jobs
Monitored via Prometheus
Canso Agent
Main pod: Handles deployment of features and AI agents
Proxy pod: Responsible for metric collection and reporting
Interaction:
Canso Agent Proxy interacts with Airflow API and Prometheus
Collected metrics are securely transmitted to the Control Plane
4.3 Communication and Data Flow
Canso Agent Proxy initiates metric collection at specified intervals
Metrics are collected from Airflow API and Prometheus
Collected data is transformed into a standardized format
Metrics are published to RabbitMQ in the Control Plane
Control Plane services consume metrics for processing and storage
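The "standardized format" in step 3 above is not specified in these notes; the envelope below (schema version, source, collection timestamp, raw payload) is one plausible shape, shown only to make the transform step concrete.

```python
# Hedged sketch of a standardized metric envelope; field names are
# assumptions, not the proxy's actual schema.
import time

def to_standard_metric(source, payload, now=None):
    """Wrap raw collector output in a common envelope before it is
    published to RabbitMQ."""
    return {
        "schema_version": 1,
        "source": source,  # e.g. "airflow" or "spark"
        "collected_at": now if now is not None else int(time.time()),
        "payload": payload,
    }
```

A shared envelope lets Control Plane consumers route and decode Airflow and Spark metrics uniformly even as the per-source payloads differ.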
5. Benefits and Impact
Improved Visibility: Gain deep insights into the health and performance of both Airflow and Spark jobs
Proactive Management: Early detection of issues enables faster response times
Scalability: Architecture supports monitoring across multiple client infrastructures
Minimal Overhead: Separate proxy ensures core functionalities remain unaffected
6. Future Roadmap
Implementation of a metrics database in the Control Plane
Advanced analytics and machine learning for predictive maintenance
Expansion of metric collection to cover additional components
For any questions, concerns, or support needs, please don't hesitate to reach out to the Canso Community.