For Object-based Data Sources (S3, GCS), Canso uses the concept of Data Spans for loading data from specific keys. The underlying raw data for a data source can have different directory structures.
Introduction
For example, consider the underlying data to be present in the following directory tree -
mycompany_bucket/raw_events/orders/
|---2023-01-01
| | |---abc1.parquet
| | |---abc2.parquet
|---2023-01-02
| | |---abc3.parquet
| | |---abc4.parquet
|---2023-01-03
| | |---abc5.parquet
...
...
|---2023-03-31
| | |---abcm.parquet
| | |---abcn.parquet
So, the complete path to abc1.parquet is
s3://mycompany_bucket/raw_events/orders/2023-01-01/abc1.parquet
Internally, Canso interprets this Data Source as below -
+-----------------------------------------------------------------------------+
| +---------------------+ +----------------------+ +------------------------+ |
| |                     | |                      | |                        | |
| |  mycompany_bucket/  | |  raw_events/orders/  | |       2023-01-01       | |
| |                     | |                      | |                        | |
| +---------------------+ +----------------------+ +------------------------+ |
| <-------bucket--------> <-------base_key-------> <---varying_key_suffix---> |
+-----------------------------------------------------------------------------+
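The varying_key_suffix has 2 components -
Format - the datetime pattern of the suffix, e.g. d=%Y-%m-%d/t=%H-%M for d=2023-01-01/t=00-00, d=2023-01-01/t=00-15, d=2023-01-01/t=06-30
Frequency - the interval between consecutive suffixes, e.g. 1D or 30min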
Therefore, in this particular case, varying_key_suffix_format and varying_key_suffix_freq will be as follows -
+--------------------------------------+
| varying_key_suffix_format = %Y-%m-%d |
| varying_key_suffix_freq   = 1D       |
+--------------------------------------+
How Data Spans are used
Based on the values of varying_key_suffix_format and varying_key_suffix_freq provided, Canso internally generates paths that will then be read while materializing features.
So, for the above example, the following paths are generated -
s3://mycompany_bucket/raw_events/orders/2023-01-01/
s3://mycompany_bucket/raw_events/orders/2023-01-02/
s3://mycompany_bucket/raw_events/orders/2023-01-03/
s3://mycompany_bucket/raw_events/orders/2023-01-04/
s3://mycompany_bucket/raw_events/orders/2023-01-05/
s3://mycompany_bucket/raw_events/orders/2023-01-06/
s3://mycompany_bucket/raw_events/orders/2023-01-07/
s3://mycompany_bucket/raw_events/orders/2023-01-08/
...
s3://mycompany_bucket/raw_events/orders/2023-03-31/
s3://mycompany_bucket/raw_events/orders/2023-04-01/
...
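The path generation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Canso's implementation: the function name generate_span_paths and all of its parameters are hypothetical, and the frequency is passed as a timedelta rather than as a string such as 1D.

```python
from datetime import datetime, timedelta


def generate_span_paths(bucket, base_key, suffix_format, freq, start, end):
    """Hypothetical helper: build one path per `freq` step in [start, end]."""
    paths = []
    current = start
    while current <= end:
        # Render the varying key suffix for this step, e.g. "2023-01-01".
        suffix = current.strftime(suffix_format)
        paths.append(f"s3://{bucket}/{base_key}/{suffix}/")
        current += freq
    return paths


paths = generate_span_paths(
    bucket="mycompany_bucket",
    base_key="raw_events/orders",
    suffix_format="%Y-%m-%d",
    freq=timedelta(days=1),
    start=datetime(2023, 1, 1),
    end=datetime(2023, 1, 3),
)
for path in paths:
    print(path)
```

With these inputs the loop yields one path per day from 2023-01-01 through 2023-01-03, matching the first three entries of the listing above.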
Now, say a feature avg_user_spend_l7d is registered and deployed with a daily schedule frequency, i.e. it gets calculated at the beginning of each day and the AVG is based on the last 7 days of data. Canso will automatically calculate the paths based on the feature's execution time. It will then load data from those paths and perform the feature computation on the loaded data. To see more examples of how Features are computed, see Feature Materialization.
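To make the path selection for a 7-day feature concrete, here is a minimal sketch. It assumes that "the last 7 days" means the 7 full days before the execution day, which the text above does not state explicitly. The helper name lookback_paths and its parameters are hypothetical and are not part of Canso's API.

```python
from datetime import datetime, timedelta


def lookback_paths(execution_time, lookback_days, base_uri, suffix_format="%Y-%m-%d"):
    """Hypothetical helper: daily paths for the `lookback_days` full days
    before `execution_time` (assumption: the execution day is excluded)."""
    run_day = execution_time.replace(hour=0, minute=0, second=0, microsecond=0)
    # Oldest day first: run_day - lookback_days, ..., run_day - 1.
    days = [run_day - timedelta(days=n) for n in range(lookback_days, 0, -1)]
    return [f"{base_uri}/{day.strftime(suffix_format)}/" for day in days]


l7d_paths = lookback_paths(
    execution_time=datetime(2023, 1, 8),
    lookback_days=7,
    base_uri="s3://mycompany_bucket/raw_events/orders",
)
for path in l7d_paths:
    print(path)
```

Under that assumption, a run at the beginning of 2023-01-08 would read the seven daily paths from 2023-01-01 through 2023-01-07.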
Here's another example of Data Spans -
+------------------------------------------------+
| varying_key_suffix_format = d=%Y-%m-%d/t=%H-%M |
| varying_key_suffix_freq   = 30min              |
+------------------------------------------------+
Canso will generate the following keys in this scenario -
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=02-00/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-00/
...
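The frequency values shown on this page (1D, 30min) are short strings. The source does not say how Canso parses them, so the sketch below uses a small hypothetical parser, parse_freq, that handles only those two units, and then renders the first few suffixes for the 30min example.

```python
import re
from datetime import datetime, timedelta

# Assumed unit mapping; only the units that appear on this page are covered.
_UNITS = {"D": "days", "min": "minutes"}


def parse_freq(freq):
    """Hypothetical parser for strings such as "1D" or "30min"."""
    match = re.fullmatch(r"(\d+)(D|min)", freq)
    if match is None:
        raise ValueError(f"unsupported frequency: {freq!r}")
    count, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(count)})


step = parse_freq("30min")
start = datetime(2023, 1, 1)
# Render the first four suffixes on the 30-minute grid.
keys = [(start + i * step).strftime("d=%Y-%m-%d/t=%H-%M") for i in range(4)]
for key in keys:
    print(key)
```

Each suffix is appended to the bucket and base_key in the same way as in the daily example.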
Canso also supports an optional time_offset argument, which can be used to displace the paths formed above.
For example, a time_offset = 10*60 (10 mins) on the example above will generate the following keys -
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-50/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-50/
...
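The keys listed above are what results when each 30-minute slot is moved back by 10 minutes (for instance, t=00-30 becomes t=00-20). The sketch below reproduces that behaviour under two assumptions that the source does not confirm: that time_offset is expressed in seconds, and that it is subtracted from each slot. The helper name offset_keys is hypothetical.

```python
from datetime import datetime, timedelta

SUFFIX_FORMAT = "d=%Y-%m-%d/t=%H-%M"


def offset_keys(first_slot, freq, count, time_offset_seconds):
    """Hypothetical helper: shift each slot back by `time_offset_seconds`.

    Subtracting the offset is an assumption chosen to match the listing above.
    """
    offset = timedelta(seconds=time_offset_seconds)
    slots = (first_slot + i * freq for i in range(count))
    return [(slot - offset).strftime(SUFFIX_FORMAT) for slot in slots]


shifted = offset_keys(
    first_slot=datetime(2023, 1, 1, 0, 30),
    freq=timedelta(minutes=30),
    count=4,
    time_offset_seconds=10 * 60,
)
for key in shifted:
    print(key)
```

If the offset were instead added to each slot, the same inputs would give t=00-40, t=01-10 and so on, so it is worth checking the sign convention against a real deployment before relying on it.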