Data Spans

For object-based data sources (S3, GCS), Canso uses the concept of Data Spans to load data from specific keys. The underlying raw data for a data source can have different directory structures.

Introduction

For example, consider underlying data laid out in the following directory tree:

mycompany_bucket/raw_events/orders/
|---2023-01-01
|   |   |---abc1.parquet
|   |   |---abc2.parquet
|---2023-01-02
|   |   |---abc3.parquet
|   |   |---abc4.parquet
|---2023-01-03
|   |   |---abc5.parquet
...
...
|---2023-03-31
|   |   |---abcm.parquet
|   |   |---abcn.parquet

So, the complete path to abc1.parquet is

s3://mycompany_bucket/raw_events/orders/2023-01-01/abc1.parquet

Internally, Canso interprets this Data Source as below -

┌───────────────────────────────────────────────────────────────────────────────────────┐
|    ┌─────────────────────┐    ┌──────────────────────┐       ┌──────────────┐         |
|    |                     |    |                      |       |              |         |
|    |  mycompany_bucket/  |    |  raw_events/orders/  |       |  2023-01-01  |         |
|    |                     |    |                      |       |              |         |
|    └─────────────────────┘    └──────────────────────┘       └──────────────┘         |
|    <-------bucket------->     <-------base_key------->   <---varying_key_suffix--->   |
└───────────────────────────────────────────────────────────────────────────────────────┘
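
The decomposition above can be sketched in a few lines of Python. This is only an illustration of how the three parts combine into one prefix, not Canso's implementation: the function name build_span_prefix is hypothetical, and the s3:// scheme is taken from the example path.

```python
def build_span_prefix(bucket: str, base_key: str, varying_key_suffix: str,
                      scheme: str = "s3") -> str:
    """Join bucket, base_key and varying_key_suffix into one object-store prefix."""
    # Strip stray slashes so each part can be given with or without them.
    parts = [bucket.strip("/"), base_key.strip("/"), varying_key_suffix.strip("/")]
    return f"{scheme}://" + "/".join(parts) + "/"


prefix = build_span_prefix("mycompany_bucket/", "raw_events/orders/", "2023-01-01")
print(prefix)                   # s3://mycompany_bucket/raw_events/orders/2023-01-01/
print(prefix + "abc1.parquet")  # full path to a single file under that prefix
```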

The varying_key_suffix has 2 components -

  1. Format - e.g.

    • %Y-%m-%d for 2023-01-01,

    • d=%Y-%m-%d/t=%H-%M for d=2023-01-01/t=00-00, d=2023-01-01/t=00-15, d=2023-01-01/t=06-30

  2. Frequency - e.g.

    • 30min

    • 1H

    • 1D

Therefore, in this particular case, varying_key_suffix_format and varying_key_suffix_freq will be as follows -

┌──────────────────────────────────────────┐
|   varying_key_suffix_format = %Y-%m-%d   |
|   varying_key_suffix_freq = 1D           |
└──────────────────────────────────────────┘
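
To make the two components concrete, the sketch below renders a timestamp with each of the formats mentioned earlier and converts a frequency string into a time step. It assumes the format uses strftime-style codes (which the % directives suggest); parse_freq and its unit table are hypothetical helpers, not part of Canso.

```python
import re
from datetime import datetime, timedelta

# Hypothetical mapping from the frequency units shown above to timedelta keywords.
_UNITS = {"min": "minutes", "H": "hours", "D": "days"}


def parse_freq(freq: str) -> timedelta:
    """Turn a frequency string such as '30min', '1H' or '1D' into a timedelta."""
    match = re.fullmatch(r"(\d+)(min|H|D)", freq)
    if match is None:
        raise ValueError(f"unsupported frequency: {freq!r}")
    count, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(count)})


ts = datetime(2023, 1, 1, 6, 30)
print(ts.strftime("%Y-%m-%d"))            # 2023-01-01
print(ts.strftime("d=%Y-%m-%d/t=%H-%M"))  # d=2023-01-01/t=06-30
print(parse_freq("1D"))                   # 1 day, 0:00:00
```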

How Data Spans are used

Based on the values of varying_key_suffix_format and varying_key_suffix_freq provided, Canso internally generates paths that will then be read while materializing features.

So, for the above example, the following paths are generated -

s3://mycompany_bucket/raw_events/orders/2023-01-01/
s3://mycompany_bucket/raw_events/orders/2023-01-02/
s3://mycompany_bucket/raw_events/orders/2023-01-03/
s3://mycompany_bucket/raw_events/orders/2023-01-04/
s3://mycompany_bucket/raw_events/orders/2023-01-05/
s3://mycompany_bucket/raw_events/orders/2023-01-06/
s3://mycompany_bucket/raw_events/orders/2023-01-07/
s3://mycompany_bucket/raw_events/orders/2023-01-08/
...
s3://mycompany_bucket/raw_events/orders/2023-03-31/
s3://mycompany_bucket/raw_events/orders/2023-04-01/
...
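
A listing like the one above can be reproduced by stepping through time at the configured frequency and rendering each step with the suffix format. The following is a minimal sketch under that assumption; generate_span_paths and its parameters are illustrative names, and the start and end bounds are chosen only for the demo.

```python
from datetime import datetime, timedelta
from typing import Iterator


def generate_span_paths(prefix: str, suffix_format: str, step: timedelta,
                        start: datetime, end: datetime) -> Iterator[str]:
    """Yield one path per step in [start, end], rendered with suffix_format."""
    current = start
    while current <= end:
        yield f"{prefix}{current.strftime(suffix_format)}/"
        current += step


paths = list(generate_span_paths(
    prefix="s3://mycompany_bucket/raw_events/orders/",
    suffix_format="%Y-%m-%d",   # varying_key_suffix_format
    step=timedelta(days=1),     # varying_key_suffix_freq = 1D
    start=datetime(2023, 1, 1),
    end=datetime(2023, 1, 3),
))
print("\n".join(paths))
```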

Now, say a feature avg_user_spend_l7d is registered and deployed with a daily schedule frequency, i.e. it gets calculated at the beginning of each day and the AVG is based on the last 7 days of data. Canso will automatically calculate the paths based on the feature's execution time, load data from those paths, and perform the feature computation on the loaded data. To see more examples of how Features are computed, see Feature Materialization.
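
As a rough illustration of that path calculation, the sketch below lists the daily prefixes a 7-day lookback might cover for a given execution time. It assumes the window is the 7 full days before the execution date; the exact window boundaries Canso uses are not stated here, and lookback_paths is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import List

BASE = "s3://mycompany_bucket/raw_events/orders/"


def lookback_paths(execution_time: datetime, days: int = 7) -> List[str]:
    """Daily prefixes for the `days` days preceding the execution date."""
    # Assumption: the execution day itself is excluded from the window.
    day_start = execution_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return [
        BASE + (day_start - timedelta(days=back)).strftime("%Y-%m-%d") + "/"
        for back in range(days, 0, -1)  # oldest day first
    ]


window = lookback_paths(datetime(2023, 1, 8))
print(window[0])   # s3://mycompany_bucket/raw_events/orders/2023-01-01/
print(window[-1])  # s3://mycompany_bucket/raw_events/orders/2023-01-07/
```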

Here's another example of Data Spans

┌────────────────────────────────────────────────────┐
|   varying_key_suffix_format = d=%Y-%m-%d/t=%H-%M   |
|   varying_key_suffix_freq = 30min                  |
└────────────────────────────────────────────────────┘

Canso will generate the following keys in this scenario -

s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=02-00/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-00/
...

Canso also supports an optional time_offset argument, which can be used to displace the paths formed above.

For example, a time_offset = 10*60 (10 mins) applied to the example above will generate the following keys -

...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-50/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-50/
...
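
The keys listed above match the 30-minute boundaries moved 10 minutes earlier (00-30 becomes 00-20, 01-00 becomes 00-50, and so on). The sketch below reproduces them under that reading; the direction of the shift is inferred from the listing rather than from Canso's code, and offset_keys is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import List


def offset_keys(first_boundary: datetime, periods: int, step: timedelta,
                time_offset: int, suffix_format: str) -> List[str]:
    """Render `periods` boundaries, each shifted back by time_offset seconds."""
    # Assumption: time_offset is subtracted from each boundary.
    shift = timedelta(seconds=time_offset)
    return [
        (first_boundary + i * step - shift).strftime(suffix_format)
        for i in range(periods)
    ]


keys = offset_keys(
    first_boundary=datetime(2023, 1, 1, 0, 30),
    periods=4,
    step=timedelta(minutes=30),          # varying_key_suffix_freq = 30min
    time_offset=10 * 60,                 # 10 minutes, in seconds
    suffix_format="d=%Y-%m-%d/t=%H-%M",  # varying_key_suffix_format
)
for key in keys:
    print("s3://mycompany_bucket/raw_events/orders/" + key + "/")
```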
