# Data Spans

For object-based Data Sources (S3, GCS), Canso uses the concept of Data Spans to load data from specific keys. The underlying raw data for a data source can follow different directory structures.

## Introduction

For example, consider underlying data laid out in the following directory tree:

```console
mycompany_bucket/raw_events/orders/
|---2023-01-01
|   |   |---abc1.parquet
|   |   |---abc2.parquet
|---2023-01-02
|   |   |---abc3.parquet
|   |   |---abc4.parquet
|---2023-01-03
|   |   |---abc5.parquet
...
...
|---2023-03-31
|   |   |---abcm.parquet
|   |   |---abcn.parquet
```

So, the complete path to `abc1.parquet` is

```console
s3://mycompany_bucket/raw_events/orders/2023-01-01/abc1.parquet
```

Internally, Canso interprets this Data Source as below -

```
┌───────────────────────────────────────────────────────────────────────────────────────┐
|    ┌─────────────────────┐    ┌──────────────────────┐       ┌──────────────┐         |
|    |                     |    |                      |       |              |         |
|    |  mycompany_bucket/  |    |  raw_events/orders/  |       |  2023-01-01  |         | 
|    |                     |    |                      |       |              |         | 
|    └─────────────────────┘    └──────────────────────┘       └──────────────┘         | 
|    <-------bucket------->     <-------base_key------->   <---varying_key_suffix--->   |
└───────────────────────────────────────────────────────────────────────────────────────┘
```

The `varying_key_suffix` has 2 components -

1. Format - e.g.
   * `%Y-%m-%d` for `2023-01-01`,
   * `d=%Y-%m-%d/t=%H-%M` for `d=2023-01-01/t=00-00`, `d=2023-01-01/t=00-15`, `d=2023-01-01/t=06-30`
2. Frequency - e.g.
   * `30min`
   * `1H`
   * `1D`

Therefore, in this particular case, `varying_key_suffix_format` and `varying_key_suffix_freq` will be as follows -

```
┌──────────────────────────────────────────┐
|   varying_key_suffix_format = %Y-%m-%d   |
|   varying_key_suffix_freq = 1D           |
└──────────────────────────────────────────┘
```
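The format strings above look like standard `strftime` directives. As a minimal sketch, assuming that is how they are interpreted (this page does not confirm it), a timestamp can be rendered into a suffix with Python's standard library:

```python
from datetime import datetime

# An arbitrary timestamp chosen for illustration.
ts = datetime(2023, 1, 1, 6, 30)

# Assumption: varying_key_suffix_format uses strftime-style directives.
daily_suffix = ts.strftime("%Y-%m-%d")
partitioned_suffix = ts.strftime("d=%Y-%m-%d/t=%H-%M")

print(daily_suffix)        # 2023-01-01
print(partitioned_suffix)  # d=2023-01-01/t=06-30
```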

### How Data Spans are used

Based on the values of `varying_key_suffix_format` and `varying_key_suffix_freq` provided, Canso internally generates paths that will then be read while materializing features.

So, for the above example, the following is generated -

```
s3://mycompany_bucket/raw_events/orders/2023-01-01/
s3://mycompany_bucket/raw_events/orders/2023-01-02/
s3://mycompany_bucket/raw_events/orders/2023-01-03/
s3://mycompany_bucket/raw_events/orders/2023-01-04/
s3://mycompany_bucket/raw_events/orders/2023-01-05/
s3://mycompany_bucket/raw_events/orders/2023-01-06/
s3://mycompany_bucket/raw_events/orders/2023-01-07/
s3://mycompany_bucket/raw_events/orders/2023-01-08/
...
s3://mycompany_bucket/raw_events/orders/2023-03-31/
s3://mycompany_bucket/raw_events/orders/2023-04-01/
...
```
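The path generation above can be approximated with a short sketch. This is not Canso's implementation: `generate_span_paths` and `FREQ_TO_DELTA` are hypothetical names, and the sketch assumes the frequency strings map to fixed time steps and that the format uses `strftime` directives.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from the frequency strings on this page to time steps.
FREQ_TO_DELTA = {
    "30min": timedelta(minutes=30),
    "1H": timedelta(hours=1),
    "1D": timedelta(days=1),
}


def generate_span_paths(bucket, base_key, suffix_format, suffix_freq, start, end):
    """Return one path per frequency step between start and end (inclusive)."""
    step = FREQ_TO_DELTA[suffix_freq]
    paths = []
    current = start
    while current <= end:
        suffix = current.strftime(suffix_format)
        paths.append(f"s3://{bucket}/{base_key}/{suffix}/")
        current += step
    return paths


paths = generate_span_paths(
    bucket="mycompany_bucket",
    base_key="raw_events/orders",
    suffix_format="%Y-%m-%d",
    suffix_freq="1D",
    start=datetime(2023, 1, 1),
    end=datetime(2023, 1, 3),
)
for path in paths:
    print(path)
```

With these arguments the sketch prints the first three paths from the listing above.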

Now, suppose a feature `avg_user_spend_l7d` is registered and deployed with a daily schedule frequency, i.e. it is calculated at the beginning of each day and the AVG is based on the last 7 days of data. Canso automatically calculates the paths based on the feature's execution time, loads data from those paths, and performs the feature computation on the loaded data. For more examples of how Features are computed, see [Feature Materialization](broken://pages/9IXUX5R37OdWQ47JVgPJ#Feature-Materialization).
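To illustrate how an execution time could translate into a set of paths, here is a rough sketch for the daily layout. The helper `lookback_paths` is hypothetical, and the window boundaries are an assumption: the sketch takes the 7 full days before the execution date and excludes the execution day itself, which may differ from what Canso actually does.

```python
from datetime import datetime, timedelta


def lookback_paths(execution_time, lookback_days,
                   prefix="s3://mycompany_bucket/raw_events/orders"):
    """Return daily paths for the `lookback_days` days before the execution date."""
    # Truncate the execution time to midnight of the run date.
    run_date = execution_time.replace(hour=0, minute=0, second=0, microsecond=0)
    # Assumption: the window ends the day before the run date.
    days = [run_date - timedelta(days=n) for n in range(lookback_days, 0, -1)]
    return [f"{prefix}/{day.strftime('%Y-%m-%d')}/" for day in days]


# A run shortly after midnight on 2023-01-08 reads 2023-01-01 through 2023-01-07.
window = lookback_paths(datetime(2023, 1, 8, 0, 5), lookback_days=7)
for path in window:
    print(path)
```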

Here's another example of Data Spans

```
┌────────────────────────────────────────────────────┐
|   varying_key_suffix_format = d=%Y-%m-%d/t=%H-%M   |
|   varying_key_suffix_freq = 30min                  |
└────────────────────────────────────────────────────┘
```

Canso will generate the following keys in this scenario -

```
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=02-00/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-00/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-30/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-00/
...
```
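As a quick check on the listing above, a 30-minute frequency yields 48 keys per day. The sketch below enumerates them for a single day, again assuming `strftime` semantics for the format; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

prefix = "s3://mycompany_bucket/raw_events/orders"
suffix_format = "d=%Y-%m-%d/t=%H-%M"
step = timedelta(minutes=30)  # stands in for varying_key_suffix_freq = 30min

day_start = datetime(2023, 1, 1)
steps_per_day = timedelta(days=1) // step  # integer number of steps in a day

keys = [
    f"{prefix}/{(day_start + i * step).strftime(suffix_format)}/"
    for i in range(steps_per_day)
]

print(len(keys))  # 48
print(keys[0])    # s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-00/
print(keys[-1])   # s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-30/
```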

Canso also supports an optional `time_offset` argument, which can be used to displace the paths formed above.

For example, a `time_offset = 10*60` (10 minutes, expressed in seconds) applied to the example above will generate the following keys:

```
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=01-50/
...
...
s3://mycompany_bucket/raw_events/orders/d=2023-01-01/t=23-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=00-50/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-20/
s3://mycompany_bucket/raw_events/orders/d=2023-01-02/t=01-50/
...
```
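The keys listed above are consistent with each generated timestamp being moved back by `time_offset` seconds. That direction is inferred from the example rather than stated on this page, so treat the following as a sketch under that assumption; `apply_time_offset` is a hypothetical helper, not a Canso API.

```python
from datetime import datetime, timedelta


def apply_time_offset(timestamps, time_offset_seconds):
    """Shift each timestamp back by the offset (direction is an assumption)."""
    shift = timedelta(seconds=time_offset_seconds)
    return [ts - shift for ts in timestamps]


# Four consecutive 30-minute timestamps starting at 00:30 on 2023-01-01.
base = [datetime(2023, 1, 1, 0, 30) + i * timedelta(minutes=30) for i in range(4)]

shifted = apply_time_offset(base, 10 * 60)
suffixes = [ts.strftime("d=%Y-%m-%d/t=%H-%M") for ts in shifted]

for suffix in suffixes:
    print(suffix)
```

Under this assumption the sketch prints the `t=00-20`, `t=00-50`, `t=01-20` and `t=01-50` suffixes for `d=2023-01-01`.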


