Data Sinks

A Data Sink in Canso is a reference to where processed data is stored. Data Sinks can be used by multiple Batch and Streaming features to save their outputs. Currently, Canso supports two types of data sinks: S3 for offline storage and Redis for online storage.

Introduction

Data Sinks provide a standardized way to store processed data, enabling:

- Efficient storage and retrieval of processed data.
- Reusability across different batch and streaming features.
- Simplified handling of data storage configurations for Data Scientists.
- Consistent methods for saving feature and pre-processing table outputs.

Data Sink Types

Offline Data Sink Attributes (S3)

These attributes define how processed data is stored and accessed in Data Sinks for offline storage.

| Attribute | Description | Example |
| --- | --- | --- |
| name | Unique name of the data sink | processed_sales_orders |
| description | Description of what the data sink contains | Processed sales orders stored for analysis |
| owner | Team that owns the data sink | ['[email protected]', '[email protected]'] |
| bucket | Bucket name | mycompany_processed_data |
| leading_key | Fixed key component | processed_txns/users |
| file_type | File type | CSV, PARQUET |
| metadata | Additional metadata about the offline sink | {"output_mode": "append", "processing_time": "120 seconds", "output_partitions": 20} |

Online Data Sink Attributes (Redis)

These attributes define how a Data Sink for online storage is configured and used for low-latency retrieval.

| Attribute | Description | Example |
| --- | --- | --- |
| name | Unique name of the data sink | user_session_data |
| description | Description of what the data sink contains | Real-time user session data for quick access |
| owner | Team that owns the data sink | ['[email protected]', '[email protected]'] |
| host | Redis host | redis://192.168.1.123:6379 |
| metadata | Additional metadata about the online sink | {"output_mode": "append", "processing_time": "120 seconds", "output_partitions": 20} |
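The host attribute in the example above is written as a URL (scheme, address, port). As a minimal sketch of how such a value could be split into its parts before connecting, the helper below uses only Python's standard library. The function name `parse_redis_host` is hypothetical and not part of Canso, and falling back to port 6379 (the Redis default) when no port is given is an assumption.

```python
from urllib.parse import urlparse


def parse_redis_host(host: str) -> tuple[str, int]:
    """Split a redis:// URL into (hostname, port).

    Hypothetical helper for illustration; not a Canso API.
    """
    parsed = urlparse(host)
    if parsed.scheme != "redis" or parsed.hostname is None:
        raise ValueError(f"expected a redis:// URL, got: {host!r}")
    # Assumption: use the default Redis port when the URL omits one.
    return parsed.hostname, parsed.port or 6379


print(parse_redis_host("redis://192.168.1.123:6379"))  # ('192.168.1.123', 6379)
```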

Example object of Sinks
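As an illustration only, the sketch below models the two sink types with plain Python dataclasses whose fields mirror the attribute tables above. The class names `S3DataSink` and `RedisDataSink` are hypothetical and may not match the objects Canso actually exposes, and the owner addresses are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class S3DataSink:
    """Offline sink; fields follow the offline attribute table (hypothetical class)."""
    name: str
    description: str
    owner: list[str]
    bucket: str
    leading_key: str
    file_type: str
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class RedisDataSink:
    """Online sink; fields follow the online attribute table (hypothetical class)."""
    name: str
    description: str
    owner: list[str]
    host: str
    metadata: dict = field(default_factory=dict)


offline_sink = S3DataSink(
    name="processed_sales_orders",
    description="Processed sales orders stored for analysis",
    owner=["data-team@example.com"],  # placeholder address
    bucket="mycompany_processed_data",
    leading_key="processed_txns/users",
    file_type="PARQUET",
    metadata={"output_mode": "append", "processing_time": "120 seconds", "output_partitions": 20},
)

online_sink = RedisDataSink(
    name="user_session_data",
    description="Real-time user session data for quick access",
    owner=["data-team@example.com"],  # placeholder address
    host="redis://192.168.1.123:6379",
)

print(offline_sink.name, online_sink.name)
```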

Working with Data Sinks

Once a Data Sink is defined, it can be:

- Registered for reusability across different teams.
- Referenced by Batch and Streaming features to save their outputs.
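The register-then-reference flow described above can be pictured with a small in-memory registry. This is a sketch under stated assumptions, not Canso's registration mechanism: `SinkRegistry`, its methods, and the feature names are all hypothetical, and features are represented as plain dictionaries that point at a sink by name.

```python
class SinkRegistry:
    """Toy in-memory registry keyed by sink name (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._sinks: dict[str, dict] = {}

    def register(self, name: str, config: dict) -> None:
        # Sink names are described as unique, so reject duplicates.
        if name in self._sinks:
            raise ValueError(f"data sink already registered: {name}")
        self._sinks[name] = config

    def get(self, name: str) -> dict:
        return self._sinks[name]


registry = SinkRegistry()
registry.register(
    "processed_sales_orders",
    {"bucket": "mycompany_processed_data", "leading_key": "processed_txns/users", "file_type": "PARQUET"},
)

# Two features (hypothetical names) reuse the same registered sink by name.
batch_feature = {"name": "daily_sales_total", "sink": "processed_sales_orders"}
streaming_feature = {"name": "live_sales_count", "sink": "processed_sales_orders"}

for feature in (batch_feature, streaming_feature):
    sink = registry.get(feature["sink"])
    print(feature["name"], "->", sink["bucket"])
```

Referring to the sink by name keeps storage details in one place, so a change to the bucket or key only has to be made in the sink definition.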
