Data Spans
For object-based Data Sources (S3, GCS), Canso uses the concept of Data Spans to load data from specific keys. The underlying raw data for a data source can have different directory structures.
Introduction
For example, consider underlying data laid out in the following directory tree:
mycompany_bucket/raw_events/orders/
|---2023-01-01
|   |---abc1.parquet
|   |---abc2.parquet
|---2023-01-02
|   |---abc3.parquet
|   |---abc4.parquet
|---2023-01-03
|   |---abc5.parquet
...
...
|---2023-03-31
|   |---abcm.parquet
|   |---abcn.parquet
So, the complete path to abc1.parquet is
s3://mycompany_bucket/raw_events/orders/2023-01-01/abc1.parquet
Internally, Canso interprets this Data Source as below -
+--------------------+  +----------------------+  +------------------------+
| mycompany_bucket/  |  | raw_events/orders/   |  | 2023-01-01             |
+--------------------+  +----------------------+  +------------------------+
<-------bucket------->  <-------base_key------->  <---varying_key_suffix--->
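The decomposition above can be sketched in Python. This is only an illustration of the three parts, not Canso's internal code: the `SpanPath` type and the `split_span_path` helper are hypothetical names, and the sketch assumes the caller already knows the `base_key`.

```python
from typing import NamedTuple


class SpanPath(NamedTuple):
    """Hypothetical container for the three parts shown in the diagram."""
    bucket: str
    base_key: str
    varying_key_suffix: str


def split_span_path(uri: str, base_key: str) -> SpanPath:
    """Split an object URI into bucket, base_key and varying_key_suffix."""
    # "s3://bucket/key..." -> scheme, then "bucket/key..."
    _scheme, _, rest = uri.partition("://")
    bucket, _, key = rest.partition("/")

    # Normalise base_key so it always ends with exactly one "/".
    base = base_key.strip("/") + "/"
    if not key.startswith(base):
        raise ValueError(f"{uri!r} is not under base_key {base_key!r}")

    # Everything between base_key and the file name is the varying suffix.
    remainder = key[len(base):]
    suffix, _, _filename = remainder.rpartition("/")
    return SpanPath(bucket, base, suffix)


parts = split_span_path(
    "s3://mycompany_bucket/raw_events/orders/2023-01-01/abc1.parquet",
    base_key="raw_events/orders/",
)
print(parts)
```

Because the suffix is taken as everything between the base key and the file name, the same helper also handles nested suffixes such as d=2023-01-01/t=00-00.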
The varying_key_suffix has 2 components -
Format - e.g. %Y-%m-%d for 2023-01-01, or d=%Y-%m-%d/t=%H-%M for d=2023-01-01/t=00-00, d=2023-01-01/t=00-15, d=2023-01-01/t=06-30
Frequency - e.g. 30min, 1H, 1D
Therefore, in this particular case, varying_key_suffix_format and varying_key_suffix_freq will be as follows -
+--------------------------------------+
| varying_key_suffix_format = %Y-%m-%d |
| varying_key_suffix_freq   = 1D       |
+--------------------------------------+
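The format strings look like strftime-style directives. Assuming that is how they are interpreted (this page does not say so explicitly), Python's standard `datetime.strftime` reproduces the example suffixes. The `render_suffix` helper is a hypothetical name used only for illustration.

```python
from datetime import datetime


def render_suffix(ts: datetime, suffix_format: str) -> str:
    """Render a timestamp as a key suffix using strftime directives."""
    # Literal characters such as "d=" and "/t=" pass through unchanged.
    return ts.strftime(suffix_format)


daily = render_suffix(datetime(2023, 1, 1), "%Y-%m-%d")
intraday = render_suffix(datetime(2023, 1, 1, 6, 30), "d=%Y-%m-%d/t=%H-%M")

# The same format string can be used to parse a suffix back into a timestamp.
parsed = datetime.strptime("d=2023-01-01/t=00-15", "d=%Y-%m-%d/t=%H-%M")

print(daily)
print(intraday)
print(parsed)
```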
How Data Spans are used
Based on the values of varying_key_suffix_format and varying_key_suffix_freq provided, Canso internally generates the paths that will be read while materializing features.
So, for the above example, the following paths are generated -
s3://mycompany_bucket/raw_events/orders/2023-01-01/
s3://mycompany_bucket/raw_events/orders/2023-01-02/
s3://mycompany_bucket/raw_events/orders/2023-01-03/
s3://mycompany_bucket/raw_events/orders/2023-01-04/
s3://mycompany_bucket/raw_events/orders/2023-01-05/
s3://mycompany_bucket/raw_events/orders/2023-01-06/
s3://mycompany_bucket/raw_events/orders/2023-01-07/
s3://mycompany_bucket/raw_events/orders/2023-01-08/
...
s3://mycompany_bucket/raw_events/orders/2023-03-31/
s3://mycompany_bucket/raw_events/orders/2023-04-01/
...
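A minimal sketch of how such a list of paths could be derived from a format and a frequency. This is not Canso's implementation: `parse_freq` and `generate_span_paths` are hypothetical helpers, only the three frequency units shown on this page (min, H, D) are handled, and the end of the range is treated as exclusive by assumption.

```python
import re
from datetime import datetime, timedelta

# Map the frequency units used on this page to timedelta keyword arguments.
_UNITS = {"min": "minutes", "H": "hours", "D": "days"}


def parse_freq(freq: str) -> timedelta:
    """Turn a string such as '30min', '1H' or '1D' into a timedelta."""
    match = re.fullmatch(r"(\d+)(min|H|D)", freq)
    if match is None:
        raise ValueError(f"unsupported frequency: {freq!r}")
    count, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(count)})


def generate_span_paths(bucket, base_key, suffix_format, freq, start, end):
    """List one path per step of `freq` in the half-open range [start, end)."""
    step = parse_freq(freq)
    prefix = f"s3://{bucket}/{base_key.strip('/')}/"
    paths = []
    current = start
    while current < end:
        paths.append(f"{prefix}{current.strftime(suffix_format)}/")
        current += step
    return paths


paths = generate_span_paths(
    bucket="mycompany_bucket",
    base_key="raw_events/orders/",
    suffix_format="%Y-%m-%d",
    freq="1D",
    start=datetime(2023, 1, 1),
    end=datetime(2023, 1, 4),
)
for path in paths:
    print(path)
```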
Now, say a feature avg_user_spend_l7d is registered and deployed with a daily schedule frequency, i.e. it gets calculated at the beginning of each day and the AVG is based on the last 7 days of data. Canso will automatically calculate the paths based on the feature's execution time, load data from those paths, and perform the feature computation on the loaded data. To see more examples of how Features are computed, see Feature Materialization.
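To make the lookback concrete, the sketch below lists the daily paths a 7-day window would cover for a given execution time. It assumes the window consists of the 7 full days before the execution day and excludes the execution day itself; the exact window boundaries Canso uses are not stated here, and `lookback_paths` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def lookback_paths(
    execution_time: datetime,
    lookback_days: int,
    prefix: str = "s3://mycompany_bucket/raw_events/orders/",
    suffix_format: str = "%Y-%m-%d",
) -> list:
    """Daily paths for the `lookback_days` full days before the execution day."""
    # Truncate to midnight so a run at any time of day sees the same window.
    day_start = execution_time.replace(hour=0, minute=0, second=0, microsecond=0)
    # Oldest day first: execution day minus 7, minus 6, ..., minus 1.
    days = [day_start - timedelta(days=n) for n in range(lookback_days, 0, -1)]
    return [f"{prefix}{day.strftime(suffix_format)}/" for day in days]


paths = lookback_paths(datetime(2023, 1, 8), lookback_days=7)
for path in paths:
    print(path)
```

Under this assumption, a run at the beginning of 2023-01-08 reads the seven directories from 2023-01-01 through 2023-01-07.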
Here's another example of Data Spans
+------------------------------------------------+
| varying_key_suffix_format = d=%Y-%m-%d/t=%H-%M |
| varying_key_suffix_freq   = 30min              |
+------------------------------------------------+
Canso will generate the following keys in this scenario -
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=02-00/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-00/
...
Canso also supports an optional time_offset argument, which can be used to displace the paths formed above.
For example, a time_offset = 10*60 (10 minutes, expressed in seconds) on the example above will generate the following keys -
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-50/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-50/
...
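The displacement can be sketched as follows. The direction of the shift is an assumption: subtracting the offset from each 30-minute boundary is the reading under which a 10-minute offset reproduces the keys listed above (for instance 00:30 minus 10 minutes gives t=00-20), but the sign convention is not stated on this page. `offset_span_paths` is a hypothetical helper, not a Canso API.

```python
from datetime import datetime, timedelta


def offset_span_paths(start, end, freq, time_offset_seconds, prefix, suffix_format):
    """Generate one path per `freq` step in [start, end), shifted by the offset."""
    shift = timedelta(seconds=time_offset_seconds)
    paths = []
    current = start
    while current < end:
        # Assumption: the offset moves each boundary back in time.
        shifted = current - shift
        paths.append(f"{prefix}{shifted.strftime(suffix_format)}/")
        current += freq
    return paths


paths = offset_span_paths(
    start=datetime(2023, 1, 1, 0, 30),
    end=datetime(2023, 1, 1, 2, 30),
    freq=timedelta(minutes=30),
    time_offset_seconds=10 * 60,
    prefix="s3://mycompany_bucket/raw_events/orders/",
    suffix_format="d=%Y-%m-%d/t=%H-%M",
)
for path in paths:
    print(path)
```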