Data Sources
A Data Source in Canso is a reference to raw data that is either generated and owned by users or defined by users and owned by Canso's Pre-processing pipelines. Currently, Canso supports tabular data.
Introduction
Data Sources provide an abstraction over datasets owned by users, enabling:
- A standardized way of declaring raw data for feature and pre-processing table calculations.
- A uniform user experience when defining features, allowing Data Scientists to focus on feature logic without worrying about underlying data, DB connections, or access.
- Reusability in defining and materializing features and pre-processing jobs.
- Improved understanding of raw data.
Data Sources are of two types:
- Batch Data Sources
- Streaming Data Sources
Data Source Types
Batch Data Sources
Batch Data Sources are typically Data Warehouses (BigQuery, Redshift, etc.) or Object Storages (S3, GCS, etc.). Canso currently supports only the S3 Data Source.
Batch Data Source Attributes (S3 & GCS)
S3 and GCS Batch Data Sources can be described by the following attributes. These attributes define how data is stored and accessed in object storage, allowing for standardized and reusable data processing.
| Attribute | Description | Example |
| --- | --- | --- |
| `data_source_name` | Unique name of the data source | `raw_us_orders` |
| `bucket` | Bucket name | `mycompany_data` |
| `base_key` | Fixed key component | `raw_txns/orders/us` |
| `varying_key_suffix_format` | Varying, time-based key component suffixed to the fixed key component | `%Y-%m-%d/%H`, `%Y-%m-%d/%H-%M` |
| `varying_key_suffix_freq` | Frequency of `varying_key_suffix_format` | `30min`, `3H`, `12H`, `1D` |
| `time_offset` | Optional time offset in seconds | `3600` |
| `file_type` | File type | `CSV`, `PARQUET` |
| `description` | Description of what the data source contains | Daily raw orders placed by users in the US |
| `owner` | Team that owns the data source | `['data_engg@yugen.ai', 'sales@yugen.ai']` |
| `schema_path` | Path where PySpark and BQ schemas of the data are persisted | - |
| `event_timestamp_field` | Column indicating the event time | `ordered_at` |
| `event_timestamp_format` | Format of the `event_timestamp_field` | PySpark-supported date/time formats |
| `created_at` | Time when the Data Source was created | `datetime(2021, 1, 1, 0, 0, 0)` |
The path/key to the data files for a data source is obtained by combining the following attributes (see the sketch after this list):
- `bucket`
- `base_key`
- `varying_key_suffix_format`
- `varying_key_suffix_freq`
- `time_offset`
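As an illustration of how these attributes compose, the sketch below enumerates object-storage prefixes for a time range. This is a minimal, hypothetical reconstruction (the `data_keys` helper and the hard-coded parsing of the `3H` frequency are ours, not Canso's); the actual resolution logic lives inside Canso's pipelines.

```python
from datetime import datetime, timedelta

# Example values taken from the attribute table above.
bucket = "mycompany_data"
base_key = "raw_txns/orders/us"
varying_key_suffix_format = "%Y-%m-%d/%H"
suffix_freq = timedelta(hours=3)       # corresponds to varying_key_suffix_freq = "3H"
time_offset = timedelta(seconds=3600)  # corresponds to time_offset = 3600

def data_keys(start: datetime, end: datetime):
    """Enumerate object-storage prefixes covering [start, end)."""
    ts = start + time_offset
    while ts < end + time_offset:
        yield f"s3://{bucket}/{base_key}/{ts.strftime(varying_key_suffix_format)}"
        ts += suffix_freq

for key in data_keys(datetime(2021, 1, 1, 0), datetime(2021, 1, 1, 12)):
    print(key)
# s3://mycompany_data/raw_txns/orders/us/2021-01-01/01
# s3://mycompany_data/raw_txns/orders/us/2021-01-01/04
# ... and so on, every 3 hours
```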
Canso uses the concept of Data Spans, which supports the various ways in which a user's data is stored in Object Storage. To understand more about how Data Spans are used, see Data Spans.
Streaming Data Sources
Canso supports Kafka as a data source. This can be used for real-time data processing and feature generation. Kafka data sources enable:
- Real-time data ingestion.
- Stream processing for immediate feature extraction.
- Continuous updates to machine learning models based on live data streams.
Streaming Data Source Attributes (Kafka)
| Attribute | Description | Example |
| --- | --- | --- |
| `name` | Unique name of the streaming data source | `user_activity_stream` |
| `description` | Description of what the streaming data source contains | Real-time user activity data from the app |
| `owners` | Tenants that own the data source | `['data_engg@yugen.ai', 'app_analytics@yugen.ai']` |
| `topic` | Kafka topic from which the data is read | `user_activity_topic` |
| `schema` | Schema definition of the streaming data | `{"user_id": "STRING", "activity_type": "STRING", "timestamp": "TIMESTAMP"}` |
| `timestamp_field` | Field indicating the event time | `timestamp` |
| `timestamp_format` | Format of the `timestamp_field` | `yyyy-MM-dd HH:mm:ssXXX` |
| `bootstrap_server` | Kafka bootstrap server | `12.345.678.910:9092` |
| `read_configs` | Configuration settings for reading from Kafka | `{"datasource": "streaming_data_source", "watermark_delay_threshold": "10 seconds", "starting_timestamp": None, "starting_offsets_by_timestamp": {}, "starting_offsets": "earliest", "fail_on_data_loss": False, "include_headers": True}` |
| `cloud_provider` | Cloud provider where Kafka is hosted | `AWS` |
Special notes on attributes
`read_configs`: the standard set of attributes we support when registering a data source. Users will also be able to provide some of these configurations at the time of feature registration; this option is not currently enabled, but we will be adding support for it soon.
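To make the `read_configs` entries concrete, here is a minimal sketch of how they could map onto the standard Spark Structured Streaming Kafka source options. The option names are Spark's; the mapping from Canso's `read_configs` keys is our assumption based on the names in the table above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-read-sketch").getOrCreate()

# read_configs values from the table above, translated into the standard
# Spark Structured Streaming Kafka source options.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "12.345.678.910:9092")  # bootstrap_server
    .option("subscribe", "user_activity_topic")                # topic
    .option("startingOffsets", "earliest")                     # starting_offsets
    .option("failOnDataLoss", "false")                         # fail_on_data_loss
    .option("includeHeaders", "true")                          # include_headers
    .load()
)

# watermark_delay_threshold bounds how late events may arrive before
# they are dropped from windowed aggregations.
events = (
    stream.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .withWatermark("timestamp", "10 seconds")
)
```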
Working with Data Sources
Once a DataSource has been defined, it can be:
- Registered for reusability across different teams
- Referenced to create ML features
Example objects of Data Sources
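As a stand-in for Canso-specific examples, the sketch below shows what a batch Data Source object might look like, using the attribute names and example values from the table above. The `BatchDataSource` dataclass is a hypothetical placeholder, not the actual class the Canso SDK exposes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BatchDataSource:
    """Hypothetical stand-in for the Canso SDK's batch data source class."""
    data_source_name: str
    bucket: str
    base_key: str
    varying_key_suffix_format: str
    varying_key_suffix_freq: str
    file_type: str
    description: str
    owner: List[str]
    event_timestamp_field: str
    time_offset: int = 0

# Field values mirror the examples in the batch attribute table above.
raw_us_orders = BatchDataSource(
    data_source_name="raw_us_orders",
    bucket="mycompany_data",
    base_key="raw_txns/orders/us",
    varying_key_suffix_format="%Y-%m-%d/%H",
    varying_key_suffix_freq="3H",
    file_type="CSV",
    description="Daily Raw orders placed by users in the US",
    owner=["data_engg@yugen.ai", "sales@yugen.ai"],
    event_timestamp_field="ordered_at",
    time_offset=3600,
)
```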
Tool Tips
Go to Top ⬆️
Go back to README.md ⬅️
Move forward to see data-sinks.md ➡️