Data Spans
For object-based Data Sources (S3, GCS), Canso uses the concept of Data Spans for loading data from specific keys. The underlying raw data for a data source can have different directory structures.
Introduction
For example, consider the underlying data to be present in the following directory tree -
mycompany_bucket/raw_events/orders/
|---2023-01-01
|   |---abc1.parquet
|   |---abc2.parquet
|---2023-01-02
|   |---abc3.parquet
|   |---abc4.parquet
|---2023-01-03
|   |---abc5.parquet
...
...
|---2023-03-31
|   |---abcm.parquet
|   |---abcn.parquet
So, the complete path to abc1.parquet is
s3://mycompany_bucket/raw_events/orders/2023-01-01/abc1.parquet
Internally, Canso interprets this Data Source as below -
+-------------------------------------------------------------------------------+
| +---------------------+  +----------------------+  +------------------------+ |
| |                     |  |                      |  |                        | |
| | mycompany_bucket/   |  | raw_events/orders/   |  | 2023-01-01             | |
| |                     |  |                      |  |                        | |
| +---------------------+  +----------------------+  +------------------------+ |
| <--------bucket------->  <-------base_key------->  <---varying_key_suffix---> |
+-------------------------------------------------------------------------------+
The varying_key_suffix has 2 components -
Format - e.g. %Y-%m-%d for 2023-01-01, or d=%Y-%m-%d/t=%H-%M for d=2023-01-01/t=00-00, d=2023-01-01/t=00-15, d=2023-01-01/t=06-30
Frequency - e.g. 30min, 1H, 1D
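The two components can be illustrated with a minimal Python sketch. This is not Canso's implementation: the format strings above look like standard strftime directives, so the sketch assumes they can be rendered with datetime.strftime. The helper names render_suffix and parse_freq are hypothetical, and the unit table is an assumption that covers only the example frequencies listed above.

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit table: covers only the example frequencies 30min, 1H and 1D.
_UNITS = {
    "min": timedelta(minutes=1),
    "H": timedelta(hours=1),
    "D": timedelta(days=1),
}


def parse_freq(freq: str) -> timedelta:
    """Turn a frequency string such as '30min' into a timedelta."""
    match = re.fullmatch(r"(\d+)(min|H|D)", freq)
    if match is None:
        raise ValueError(f"unsupported frequency: {freq!r}")
    count, unit = match.groups()
    return int(count) * _UNITS[unit]


def render_suffix(ts: datetime, suffix_format: str) -> str:
    """Render the varying key suffix for one timestamp using strftime directives."""
    return ts.strftime(suffix_format)


print(render_suffix(datetime(2023, 1, 1), "%Y-%m-%d"))
print(render_suffix(datetime(2023, 1, 1, 6, 30), "d=%Y-%m-%d/t=%H-%M"))
print(parse_freq("30min"))
```

In this sketch the format decides how a single timestamp becomes a key suffix, and the frequency decides how far apart consecutive timestamps are.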
Therefore, in this particular case, varying_key_suffix_format and varying_key_suffix_freq will be as follows -
+--------------------------------------+
| varying_key_suffix_format = %Y-%m-%d |
| varying_key_suffix_freq   = 1D       |
+--------------------------------------+
How Data Spans are used
Based on the values of varying_key_suffix_format and varying_key_suffix_freq provided, Canso internally generates paths that will then be read while materializing features.
So, for the above example, the following is generated -
s3://mycompany_bucket/raw_events/orders/2023-01-01/
s3://mycompany_bucket/raw_events/orders/2023-01-02/
s3://mycompany_bucket/raw_events/orders/2023-01-03/
s3://mycompany_bucket/raw_events/orders/2023-01-04/
s3://mycompany_bucket/raw_events/orders/2023-01-05/
s3://mycompany_bucket/raw_events/orders/2023-01-06/
s3://mycompany_bucket/raw_events/orders/2023-01-07/
s3://mycompany_bucket/raw_events/orders/2023-01-08/
...
s3://mycompany_bucket/raw_events/orders/2023-03-31/
s3://mycompany_bucket/raw_events/orders/2023-04-01/
...
Now, say a feature avg_user_spend_l7d is registered and deployed with a daily schedule frequency, i.e. it gets calculated at the beginning of each day and the AVG is based on the last 7 days of data. Canso will automatically calculate the paths based on the feature's execution time, load data from those paths, and perform the feature computation on the loaded data. To see more examples of how Features are computed, see Feature Materialization.
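The path calculation for avg_user_spend_l7d can be sketched roughly as follows. This is an illustration, not Canso's implementation: the function name lookback_paths is hypothetical, and the sketch assumes the window covers the 7 full days before the execution day (excluding the execution day itself). The source does not state the exact boundary handling.

```python
from datetime import datetime, timedelta


def lookback_paths(bucket, base_key, suffix_format, execution_time, days=7):
    """Return daily key prefixes for the `days` days before execution_time.

    Assumption: the execution day itself is excluded from the window.
    """
    # Align the execution time to the start of its day.
    day_start = execution_time.replace(hour=0, minute=0, second=0, microsecond=0)
    root = f"s3://{bucket.strip('/')}/{base_key.strip('/')}"
    paths = []
    for offset in range(days, 0, -1):
        suffix = (day_start - timedelta(days=offset)).strftime(suffix_format)
        paths.append(f"{root}/{suffix}/")
    return paths


paths = lookback_paths(
    bucket="mycompany_bucket/",
    base_key="raw_events/orders/",
    suffix_format="%Y-%m-%d",
    execution_time=datetime(2023, 1, 8),
)
for path in paths:
    print(path)
```

Under these assumptions, a run at the beginning of 2023-01-08 would read the prefixes for 2023-01-01 through 2023-01-07.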
Here's another example of Data Spans
+------------------------------------------------+
| varying_key_suffix_format = d=%Y-%m-%d/t=%H-%M |
| varying_key_suffix_freq   = 30min              |
+------------------------------------------------+
Canso will generate the following keys in this scenario -
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=02-00/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-00/
...
Canso also supports an optional time_offset argument, which can be used to displace the paths formed above.
For example, a time_offset = 10*60 (10 mins) on the example above will generate the following keys -
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-50/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-50/
...
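The 30min example and the effect of time_offset can be sketched in Python as below. This is a rough illustration under stated assumptions, not Canso's implementation: the function name generate_keys and its start/end parameters are hypothetical. The sketch assumes time_offset is given in seconds and is subtracted from each generated timestamp. The listing above suggests this (t=00-30 becomes t=00-20), but the source does not state it explicitly.

```python
from datetime import datetime, timedelta


def generate_keys(prefix, suffix_format, freq, start, end, time_offset=0):
    """Yield key prefixes from start (inclusive) to end (exclusive), stepping by freq.

    Assumption: time_offset is in seconds and is subtracted from each timestamp.
    """
    shift = timedelta(seconds=time_offset)
    current = start
    while current < end:
        yield f"{prefix}{(current - shift).strftime(suffix_format)}/"
        current += freq


prefix = "s3://mycompany_bucket/raw_events/orders/"
keys = list(
    generate_keys(
        prefix,
        "d=%Y-%m-%d/t=%H-%M",
        timedelta(minutes=30),
        start=datetime(2023, 1, 1, 0, 30),
        end=datetime(2023, 1, 1, 2, 30),
        time_offset=10 * 60,
    )
)
for key in keys:
    print(key)
```

With time_offset left at 0, the same call produces the undisplaced 30-minute keys (t=00-30, t=01-00, and so on).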