👋 Introduction

Overview

Canso is a Managed Data and Feature Platform for operationalizing Machine Learning initiatives. Its goal is to let ML teams (Data Engineers, Data Scientists, ML Engineers) define their requirements in a declarative, standardized manner via a concise DSL, without writing custom code for features, DAGs, etc. or managing infrastructure. This enables ML teams to:

  • Iterate fast, i.e. move from development to production in hours or days as opposed to weeks

  • Promote reliability, i.e. build standardized ML pipelines

Canso's core focus is on user experience and speed of iteration, without compromising on reliability. With Canso, users can:

  • Define data sources from which features are created and computed.

  • Specify data sinks where processed data is stored after a successful ML pipeline run.

  • Define Machine Learning features in a standardized manner on top of existing data sources and deploy them. These features can be used for both model training and model inference. Canso currently supports Raw, Derived and Streaming features.

  • Register and deploy features to execute the ML pipeline.
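To make the declarative idea above concrete, here is a minimal sketch in plain Python. It is not the gru package or the Canso DSL: the class names (`DataSourceSpec`, `DataSinkSpec`, `FeatureSpec`), their fields, the `validate` helper, and the example bucket path are all hypothetical, chosen only to show how data sources, data sinks and features might be described as data and checked for consistency before anything is deployed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Feature types named in the text above.
FEATURE_TYPES = {"raw", "derived", "streaming"}


@dataclass(frozen=True)
class DataSourceSpec:
    """Hypothetical description of where input data lives."""
    name: str
    kind: str       # e.g. "s3"
    location: str   # placeholder path, not a real bucket


@dataclass(frozen=True)
class DataSinkSpec:
    """Hypothetical description of where processed data is stored."""
    name: str
    kind: str       # "online" or "offline"


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical description of a feature."""
    name: str
    feature_type: str                 # one of FEATURE_TYPES
    sink: str                         # name of a DataSinkSpec
    source: Optional[str] = None      # name of a DataSourceSpec (raw/streaming)
    depends_on: Tuple[str, ...] = ()  # names of other features (derived)


def validate(sources: List[DataSourceSpec],
             sinks: List[DataSinkSpec],
             features: List[FeatureSpec]) -> List[str]:
    """Return a list of problems; an empty list means the specs are consistent."""
    source_names = {s.name for s in sources}
    sink_names = {s.name for s in sinks}
    feature_names = {f.name for f in features}
    problems = []
    for f in features:
        if f.feature_type not in FEATURE_TYPES:
            problems.append(f"{f.name}: unknown feature type {f.feature_type!r}")
        if f.sink not in sink_names:
            problems.append(f"{f.name}: unknown sink {f.sink!r}")
        if f.feature_type == "derived":
            # Derived features are built on other features, not directly on a source.
            for dep in f.depends_on:
                if dep not in feature_names:
                    problems.append(f"{f.name}: unknown dependency {dep!r}")
        elif f.source not in source_names:
            problems.append(f"{f.name}: unknown source {f.source!r}")
    return problems


# Example specs (all names are made up for illustration).
sources = [DataSourceSpec("orders", "s3", "s3://example-bucket/orders/")]
sinks = [DataSinkSpec("offline_store", "offline")]
features = [
    FeatureSpec("order_amount", "raw", "offline_store", source="orders"),
    FeatureSpec("order_amount_7d_avg", "derived", "offline_store",
                depends_on=("order_amount",)),
]
problems = validate(sources, sinks, features)
print(problems)
```

Keeping the definitions as plain data is what allows a platform to validate them up front; the actual Canso DSL may structure this quite differently.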

User Experience

Getting Started

1. Install Gru Package

Installing the gru package requires a username and a PAT as the password.

  • A Personal Access Token (PAT) is a kind of key that authenticates a user across all applications they have access to.

2. Create Yugen client

3. Define an S3 Data Source

4. Register Data Source

5. Define a Raw Feature

6. Register Raw Feature

7. Dry run Raw Feature

8. Deploy Raw Feature

9. Define a Derived Feature

10. Register Derived Feature

11. Dry run Derived Feature

12. Deploy Derived Feature

13. Define Pre-Processing Transform

14. Register Pre-Processing Transform

15. Dry run Pre-Processing Transform

16. Deploy Pre-Processing Transform

17. Define Training Data

18. Register Training Data

19. Deploy Training Data

20. Define Infrastructure Data

21. Register Infrastructure Data

22. Deploy Infrastructure Data
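The steps above follow a repeating pattern: features and pre-processing transforms go through define, register, dry run and deploy, while training data and infrastructure data go through define, register and deploy. The sketch below models that ordering as a small state machine. It is an illustration only and does not use the gru client: the `Artifact` class, the `Stage` enum and the rule that a dry run must precede deployment are assumptions made for the example, not documented Canso behavior.

```python
from enum import Enum


class Stage(Enum):
    DEFINED = "defined"
    REGISTERED = "registered"
    DRY_RUN_PASSED = "dry_run_passed"
    DEPLOYED = "deployed"


class Artifact:
    """Hypothetical stand-in for anything that is defined, registered and deployed."""

    def __init__(self, name: str, requires_dry_run: bool = True):
        self.name = name
        self.requires_dry_run = requires_dry_run
        self.stage = Stage.DEFINED  # constructing the object is the "define" step

    def _expect(self, stage: Stage) -> None:
        # Reject out-of-order calls such as deploying before registering.
        if self.stage is not stage:
            raise RuntimeError(
                f"{self.name}: expected stage {stage.value}, found {self.stage.value}"
            )

    def register(self) -> None:
        self._expect(Stage.DEFINED)
        self.stage = Stage.REGISTERED

    def dry_run(self) -> None:
        self._expect(Stage.REGISTERED)
        self.stage = Stage.DRY_RUN_PASSED

    def deploy(self) -> None:
        # Assumption: artifacts with a dry-run step must pass it before deploying.
        required = Stage.DRY_RUN_PASSED if self.requires_dry_run else Stage.REGISTERED
        self._expect(required)
        self.stage = Stage.DEPLOYED


# Steps 5-8: a raw feature goes through all four stages.
raw_feature = Artifact("raw_feature")
raw_feature.register()
raw_feature.dry_run()
raw_feature.deploy()

# Steps 17-19: training data has no dry-run step in the list above.
training_data = Artifact("training_data", requires_dry_run=False)
training_data.register()
training_data.deploy()
```

Calling `deploy()` on an artifact that has not been registered raises a `RuntimeError`, which is the kind of early failure a dry-run step is meant to surface before anything reaches production.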

Roadmap

DataSources

Batch

Streaming

DataSinks

Online DataSinks

Online data sinks offer real-time data storage with fast write operations. They ensure low-latency access to data, making them suitable for applications that require immediate data retrieval and updates, such as feature retrieval for ML predictions. Currently, Canso supports Redis cache for storing data online.

Offline DataSinks

Offline data sinks provide durable and scalable storage for batch-processed and historical data. They support large volumes of data with high reliability, making them ideal for data warehousing and archival storage. Currently, Canso supports S3 for storing data offline.
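The difference between the two sink types can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: `OnlineSink` uses a dictionary in place of a key-value cache such as Redis, `OfflineSink` uses a list in place of object storage such as S3, and neither class nor any method name comes from Canso.

```python
class OnlineSink:
    """In-memory stand-in for a key-value store: keeps only the latest values per entity."""

    def __init__(self):
        self._latest = {}

    def write(self, entity_id, features):
        # Overwrite any previous values so reads always see the freshest features.
        self._latest[entity_id] = dict(features)

    def read(self, entity_id):
        # Single-key lookup, the access pattern used at prediction time.
        return self._latest.get(entity_id)


class OfflineSink:
    """In-memory stand-in for append-only storage: keeps every historical row."""

    def __init__(self):
        self._rows = []

    def write(self, entity_id, features, timestamp):
        # Append rather than overwrite, so history is preserved for training.
        self._rows.append({"entity_id": entity_id, "timestamp": timestamp, **features})

    def history(self, entity_id):
        return [row for row in self._rows if row["entity_id"] == entity_id]


online = OnlineSink()
offline = OfflineSink()

# Two successive feature values for the same (made-up) entity.
for ts, amount in [(1, 10.0), (2, 25.0)]:
    online.write("user_1", {"order_amount": amount})
    offline.write("user_1", {"order_amount": amount}, timestamp=ts)

latest = online.read("user_1")
past = offline.history("user_1")
```

After both writes, the online sink holds only the most recent value for `user_1`, while the offline sink retains both rows, which mirrors the split between low-latency serving and historical storage described above.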

Batch

Streaming

Online Feature Store
