Merged

Dev #17

23 changes: 23 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,23 @@
name: CI

on:
  push:
    branches: ["main", "evolution"]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -e ".[dev,bdc]"

      - name: Run tests
        run: pytest -q
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,6 +5,8 @@ ENV/

outputs

catalog.db

site

# Python Cache
315 changes: 187 additions & 128 deletions README.md
@@ -1,152 +1,211 @@
# DisSCube

> **⚠️ Project Status: Early Stage (Alpha)**
> This project is currently in its initial development phase. APIs and data structures are subject to frequent changes as we evolve the core engine.

DisSCube is a high-performance spatial data cube engine designed for land change modeling within the **DisSModel** ecosystem. It provides a bridge between raw geospatial data (Rasters, Vectors, Points) and multidimensional analysis ready for statistical and dynamic models (like TerraME).

## 🚀 Key Features

- **FillCell Operators**: Legacy logic of TerraView FillCell for robust data aggregation.
- **Raster**: Majority, Mean, Max, Min, and Sum resampling.
- **Vector**: Presence (Boolean), Count, and area-weighted strategies.
- **Proximity**: High-performance Euclidean distance transforms.
- **Temporal Backend**: Support for multi-period variables. Derived products can have temporal validity windows, allowing models to load dynamic drivers.
- **Snapped Grids**: Automatic alignment of local grids (e.g., State-level) to national meshes (e.g., BDC 5km) to ensure pixel-perfect interoperability.
- **Master Grid Architecture**: Native support for **Brazil Data Cube (BDC)** master grids (SM, MD, LG) and custom 100m grids with tiled processing.
- **SQLite Catalog**: High-performance, concurrent registry for spatial metadata and variable provenance.
- **Multidimensional Storage**: Uses **Zarr** and **Xarray** for efficient storage of high-resolution spatial variables.

## 🛠 Architecture

DisSCube is built on a decoupled architecture that separates spatial definitions (Grids) from data partitions (Tiles).

### 1. Data Processing Pipeline
The core engine follows a sequential pipeline where each stage transforms the `PipelineContext`.

```mermaid
graph LR
S[SpatialSource] --> N[Normalizer]
N --> GA[GridAligner]
GA --> AG[Aggregator]
AG --> VW[VariableWriter]
VW --> C[(SQLite Catalog)]
VW --> Z[[Zarr Storage]]

subgraph "Pipeline Execution"
N
GA
AG
VW
end
```

### 2. Temporal Awareness
Derivations can be associated with a time window. `to_lucc_data` automatically stacks these slices into a 3D DataArray `(time, y, x)`, allowing CA models to query drivers by year.

```mermaid
graph TD
D1[Derivation 2000-2010] --> VW
D2[Derivation 2011-2025] --> VW
VW --> DB[(Catalog)]
DB --> LC[to_lucc_data]
LC --> RB[RasterBackend]
RB --> CA[Cellular Automata Model]
```

### 3. Entity Model
The catalog maintains the relationships between sources, derivations, and the physical assets.

```mermaid
classDiagram
class GridSpec {
+id: str
+crs: str
+resolution: float
+bbox: list
}
class SpatialSource {
+id: str
+format: raster|vector
+asset_url: str
+bbox: list
+time: int?
}
class SpatialDerivation {
+source_id: str
+grid_id: str
+valid_from: str?
+valid_until: str?
+spec_hash() str
}
class DerivedVariable {
+id: str
+grid_id: str
+tile_id: str
+spec_hash: str
+times: list[int]
+asset_url: str
}

GridSpec "1" -- "0..*" DerivedVariable : defines space
SpatialSource "1" -- "0..*" DerivedVariable : raw material
SpatialDerivation "1" -- "0..*" DerivedVariable : generates
```

## 📖 Quick Start

### Installation
> **Status: Alpha. The main pipeline APIs are stable; declarative models are still evolving.**

DisSCube is the spatial data cube engine of the **DisSModel** ecosystem. It converts raw geospatial sources (rasters, vectors) into derived variables aligned to LUCC (Land Use and Cover Change) modeling grids, ready for Cellular Automata models and spatio-temporal analyses.

## Core concept

```
SpatialSource → Derivation → Variable → DerivedVariable (Zarr)
```

A **source** (`SpatialSource`) goes through a **derivation** (`SpatialDerivation` or `Derivation`) that applies an **operator** to a **grid** (`GridSpec`), producing a **derived variable** that is registered in the SQLite catalog and stored in Zarr.

## Installation

```bash
# Clone the repository
git clone https://github.com/dissmodel/disscube.git
git clone https://github.com/DisSModel/disscube.git
cd disscube

# Set up environment
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m venv .venv && source .venv/bin/activate
pip install -e .
```

### Basic Usage
## Basic usage

### 1. Initialize the catalog and register a grid

```python
from disscube.client import CubeClient
from disscube.models import Variable, SpatialDerivation
from disscube.utils.grids import register_local_grid

# Initialize client (Now using SQLite)
cube = CubeClient(catalog="catalog.db", store="./data/")

# Define a derivation for a specific BDC Tile
derivation = SpatialDerivation(
    source_id="urban_centers",
    grid_id="BDC_100m", # Master Grid
grid = register_local_grid(
    cube,
    name="AC",
    bbox_geo=(-73.99, -11.15, -66.62, -7.11),
    resolution=5_000.0,
)
```

### 2. Register a source

```python
from disscube.models import SpatialSource

cube.register_spatial_source(SpatialSource(
    id="mapbiomas_2020",
    name="MapBiomas Acre 2020",
    format="raster",
    asset_url="data/raw/mapbiomas_2020.tif",
    crs="EPSG:4326",
    time=2020,
))
```

### 3. Derive: declarative mode (recommended)

```python
from disscube.derivation import Derivation

d = Derivation(
    target="forest_pct",
    source_id="mapbiomas_2020",
    operator="percentage",
    class_code=3,
    role="driver",
    valid_from="2020",
    valid_until="2020",
)

cube.derive_declarative(d, grid_id="AC/5km")
```

### 4. Derive: direct mode

```python
from disscube.models import SpatialDerivation, Variable

cube.derive(SpatialDerivation(
    source_id="mapbiomas_2020",
    grid_id="AC/5km",
    role="driver",
    variables=[Variable(name="dist_sedes", operator="min_distance")]
    variables=[Variable(name="forest_pct", operator="percentage", class_code=3)],
    valid_from="2020",
    valid_until="2020",
))
```

### 5. Load the result

```python
da = cube.load("forest_pct", grid_id="AC/5km")
print(da.shape) # (rows, cols)
```

### 6. Integrate with DisSModel

```python
backend = cube.to_lucc_data(
    ["forest_pct", "dist_roads"],
    grid_id="AC/5km",
    period=("2015", "2020"),
)
```

# Execute pipeline for a specific partition
cube.derive(derivation, tile_id="009002")
## Available operators

| Operator | Type | Resampling | Requires `class_code` |
|---|---|---|---|
| `mean` | zonal | average | no |
| `sum` | zonal | sum | no |
| `std` | zonal | nearest | no |
| `min` | zonal | min | no |
| `max` | zonal | max | no |
| `majority` | zonal | mode | no |
| `minority` | zonal | mode | no |
| `percentage` | zonal | mode | **yes** |
| `attribute` | zonal | nearest | no |
| `presence` | zonal | nearest | no |
| `min_distance` | proximity | nearest | no |
| `count` | proximity | nearest | no |

## Pipeline

# Load data with tile disambiguation
res = cube.load("dist_sedes", tile_id="009002")
print(res.shape)
```
SpatialSource
Normalizer: validates / loads a GeoDataFrame (vector) or opens a raster
GridAligner: reprojects per variable with the operator's correct Resampling
Aggregator: delegates to operator.compute() → one xr.DataArray per variable
VariableWriter: persists Zarr + registers the DerivedVariable in the catalog
```

## Storage layout

```
data/derived/{grid_id}/{partition}/{spec_hash}/{variable_name}.zarr
```

- `partition` = `tile_id`, or `global` for derivations without a tile.
- `spec_hash` = SHA-256 of the derivation (source + grid + variables + temporal window).
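
A minimal sketch of how such a content-addressed path could be built. The README only states that the hash is a SHA-256 over source, grid, variables, and temporal window; the canonical JSON serialization, the function names `spec_hash` and `asset_path`, and the dictionary layout of the variables are assumptions made for illustration, not DisSCube's actual implementation.

```python
import hashlib
import json


def spec_hash(source_id, grid_id, variables, valid_from=None, valid_until=None):
    """Hypothetical helper: hash the fields that define a derivation.

    Assumption: the fields are serialized as canonical JSON (sorted keys,
    variables sorted by name) so that equal specs always hash equally.
    """
    payload = {
        "source_id": source_id,
        "grid_id": grid_id,
        "variables": sorted(variables, key=lambda v: v["name"]),
        "valid_from": valid_from,
        "valid_until": valid_until,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def asset_path(grid_id, spec, variable_name, tile_id=None):
    """Hypothetical helper: fill in the storage template shown above."""
    partition = tile_id if tile_id is not None else "global"
    return f"data/derived/{grid_id}/{partition}/{spec}/{variable_name}.zarr"


variables = [{"name": "forest_pct", "operator": "percentage", "class_code": 3}]
h = spec_hash("mapbiomas_2020", "AC/5km", variables, "2020", "2020")
print(asset_path("AC/5km", h, "forest_pct"))
```

Because the temporal window is part of the hashed payload, two derivations that differ only in `valid_from`/`valid_until` land in different directories and never overwrite each other.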

## Project structure

```
disscube/
├── client/ CubeClient, the public entry point
├── models/ GridSpec, SpatialSource, SpatialDerivation, Variable…
├── derivation.py Declarative Derivation (front end over SpatialDerivation)
├── operators/ Operators as classes (self-registration via __init_subclass__)
│ ├── base.py Operator ABC + OPERATOR_REGISTRY
│ ├── zonal.py mean, sum, majority, percentage, attribute, presence…
│ └── proximity.py min_distance, count
├── pipeline/ Stages: Normalizer → GridAligner → Aggregator → Writer
├── catalog/ CatalogStore (Protocol) + SQLite and JSON implementations
├── storage/ AssetStore (fsspec: local and S3)
└── utils/grids.py register_local_grid, register_simulation_grids
```

## Adding a new operator

Create a subclass of `Operator` in any file that is imported at startup:

```python
from rasterio.warp import Resampling
from disscube.operators.base import Operator

class WeightedMeanOperator(Operator):
    name = "weighted_mean"
    _resampling = Resampling.average

    def compute(self, data, var, grid):
        # data is an xr.DataArray (raster) or a GeoDataFrame (vector)
        ...
```

The operator is registered automatically and accepted by `Derivation` / `SpatialDerivation` with no other changes.
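
For readers unfamiliar with the mechanism, here is a standalone sketch of self-registration through `__init_subclass__`. It reuses the names `Operator` and `OPERATOR_REGISTRY` from the project tree, but the bodies are assumptions written for illustration; the real `disscube.operators.base` may differ, and the toy `compute` below works on plain lists rather than on an `xr.DataArray`.

```python
from abc import ABC, abstractmethod

# Assumed shape of the registry: operator name -> operator class.
OPERATOR_REGISTRY = {}


class Operator(ABC):
    name = None  # concrete subclasses set a unique string

    def __init_subclass__(cls, **kwargs):
        # Runs once for every subclass definition, so importing the module
        # that defines the subclass is enough to register it.
        super().__init_subclass__(**kwargs)
        if cls.name is not None:
            OPERATOR_REGISTRY[cls.name] = cls

    @abstractmethod
    def compute(self, data, var, grid):
        ...


class WeightedMeanOperator(Operator):
    name = "weighted_mean"

    def compute(self, data, var, grid):
        # Illustrative only: data is a (values, weights) pair of lists.
        values, weights = data
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)


op = OPERATOR_REGISTRY["weighted_mean"]()
print(op.compute(([1.0, 3.0], [1.0, 1.0]), var=None, grid=None))
```

The practical consequence is the one stated above: a module that defines an operator must actually be imported, otherwise the class body never runs and the name never reaches the registry.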

## Known limitations

The limitations below are scope decisions of the current version, not bugs. They are documented so that users and reviewers understand what is implemented versus what is planned.

**In-memory, single-tile processing**
Each call to `derive()` loads the full data of one tile into memory. There is no lazy (Dask) or distributed processing. For continental-scale grids (e.g. `BR/1km`), use the tile loop: each tile is processed and saved independently.

**Vector aggregation by rasterization (not area-weighted)**
Operators on vector sources (`majority`, `percentage`, `attribute`, `presence`, `minority`) convert geometries to raster before aggregating pixels. The coverage fraction of each cell is estimated by pixel counting, not by computing intersection area. For more accurate proportional coverage, use a raster source at a substantially finer resolution than the target cell.
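
To make the pixel-counting estimate concrete, the following sketch computes a per-cell percentage from an already rasterized class grid. The function name, the nested-list input, the integer `factor` between fine pixels and target cells, and the 0 to 100 output scale are all assumptions for illustration; this is not the code of the `percentage` operator.

```python
def percentage_by_pixel_count(pixels, class_code, factor):
    """Hypothetical helper: share of `class_code` pixels in each target cell.

    pixels: 2D list of class codes at fine resolution.
    factor: number of fine pixels per target cell along each axis.
    """
    rows = len(pixels) // factor
    cols = len(pixels[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Count matching fine pixels inside the factor x factor block.
            hits = sum(
                1
                for i in range(factor)
                for j in range(factor)
                if pixels[r * factor + i][c * factor + j] == class_code
            )
            row.append(100.0 * hits / (factor * factor))
        out.append(row)
    return out


fine = [
    [3, 3, 0, 0],
    [3, 0, 0, 0],
]
print(percentage_by_pixel_count(fine, class_code=3, factor=2))
```

With `factor=2` each cell can only take the values 0, 25, 50, 75, or 100, which illustrates why a finer source resolution gives a better approximation of the true intersection area.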

## 📁 Storage Structure
**Tile disambiguation in `load()`**
`CubeClient.load(name)` without `tile_id` silently returns the first result when multiple tiles of the same variable exist on the same grid. An explicit error or automatic mosaicking is planned. **Always specify `tile_id` in multi-tile workloads.**
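
Until the planned explicit error exists, a caller can guard against the silent first-match behavior on its own side. The sketch below is a hypothetical helper, not part of `CubeClient`: it assumes catalog search results are available as dictionaries with a `tile_id` key, which is an illustrative simplification of `DerivedVariable`.

```python
def pick_unique(matches, name, tile_id=None):
    """Hypothetical guard: return exactly one catalog match or raise."""
    if tile_id is not None:
        matches = [m for m in matches if m["tile_id"] == tile_id]
    if not matches:
        raise KeyError(f"no derived variable {name!r} for tile {tile_id!r}")
    if len(matches) > 1:
        tiles = sorted(str(m["tile_id"]) for m in matches)
        raise ValueError(f"{name!r} is ambiguous across tiles {tiles}; pass tile_id")
    return matches[0]


results = [
    {"id": "dist_sedes", "tile_id": "009002"},
    {"id": "dist_sedes", "tile_id": "009003"},
]
print(pick_unique(results, "dist_sedes", tile_id="009002"))
```

Calling `pick_unique(results, "dist_sedes")` without a `tile_id` raises `ValueError` instead of quietly returning the first tile.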

Derived data is stored hierarchically to optimize access and prevent collisions:
`data/derived/{grid_id}/{tile_id}/{spec_hash}/{variable_name}.zarr`
**`SpatialRelation` plays no role in the pipeline**
The `SpatialRelation` model is persisted in the catalog, but no pipeline stage uses the relations during derivation, which is why they are **excluded from the `spec_hash`**. Including them would make the cache key sensitive to metadata that does not affect the result, breaking the reproducibility guarantee. Integration with hierarchical grid strategies is reserved for a future version.

## 🔍 Tools
**`purity_threshold` is reserved**
The `purity_threshold` field in `Derivation` is included in the `spec_hash` but is not applied to the output: purity masking is not implemented. Setting `purity_threshold` changes the cache key without changing the result.

- `zarr_to_tif.py`: Export any Zarr variable to GeoTIFF for QGIS.
```bash
python tools/zarr_to_tif.py data/derived/.../var.zarr output.tif
```
- `inspect_raster.py`: Check CRS and bounds of raw rasters.
- `list_grids.py`: List all master and local grids in the SQLite catalog.
**No STAC integration**
The `valid_from`/`valid_until` and `bbox` fields in `Derivation` follow STAC naming conventions, but no STAC client, catalog, or export logic is implemented in this module.

## 📄 License
## License

This project is part of the DisSModel ecosystem. See the LICENSE file for details.
Part of the DisSModel ecosystem. See `LICENSE` for details.
Binary file removed catalog.db
Binary file not shown.
3 changes: 2 additions & 1 deletion disscube/__init__.py
@@ -1,4 +1,5 @@
from disscube.client import CubeClient
from disscube.models import GridSpec, SpatialSource, SpatialDerivation, Variable, DerivedVariable
from disscube.derivation import Derivation

__all__ = ["CubeClient", "GridSpec", "SpatialSource", "SpatialDerivation", "Variable", "DerivedVariable"]
__all__ = ["CubeClient", "GridSpec", "SpatialSource", "SpatialDerivation", "Variable", "DerivedVariable", "Derivation"]
4 changes: 4 additions & 0 deletions disscube/catalog/json_store.py
@@ -43,6 +43,10 @@ def save_derived(self, derived: DerivedVariable) -> None:
        self._data["derived"][derived.id] = derived.model_dump()
        self._save()

    def delete_derived(self, derived_id: str) -> None:
        self._data["derived"].pop(derived_id, None)
        self._save()

    def search_derived_variables(self, grid_id: str | None = None, role: str | None = None, tile_id: str | None = None) -> List[DerivedVariable]:
        results = []
        for d in self._data["derived"].values():
1 change: 1 addition & 0 deletions disscube/catalog/protocol.py
@@ -11,6 +11,7 @@ def get_spatial_source(self, source_id: str) -> Optional[SpatialSource]: ...
    def list_spatial_sources(self) -> List[SpatialSource]: ...

    def save_derived(self, derived: DerivedVariable) -> None: ...
    def delete_derived(self, derived_id: str) -> None: ...
    def search_derived_variables(self, grid_id: str | None = None, role: str | None = None, tile_id: str | None = None) -> List[DerivedVariable]: ...
    def get_derived_by_hash(self, spec_hash: str) -> Optional[DerivedVariable]: ...
