datasus-etl

Python API

The installer is the convenient way to get started. For integration with pipelines, notebooks, and custom workflows, install the same code from PyPI and call it directly:

pip install datasus-etl

Configuration

Every public entry point takes a PipelineConfig, a pydantic model with sane defaults. Build one with PipelineConfig.create(...):

from datasus_etl.config import PipelineConfig

config = PipelineConfig.create(
    source="sihsus",
    start_date="2023-01-01",
    end_date="2023-12-31",
    ufs=["SP", "RJ"],
    data_dir="./data",
)
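
Because the config is a pydantic model, the resolved values can be inspected or saved alongside a run. A minimal sketch, assuming pydantic v2 (model_dump); on v1 the equivalent call is .dict(), and the output path is illustrative:

import json

# Dump the resolved configuration (pydantic v2 API; use config.dict() on v1)
print(config.model_dump())

# Persist it next to the data for reproducibility
with open("./data/pipeline_config.json", "w") as f:
    json.dump(config.model_dump(mode="json"), f, indent=2)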

Running a pipeline

from datasus_etl.pipeline.sihsus_pipeline import SihsusPipeline

pipeline = SihsusPipeline(config)
pipeline.run()
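
Because the config is plain data, these two calls compose into larger workflows, for example one run per UF to keep individual downloads small. A sketch using only the calls shown above; whether repeated runs into the same data_dir accumulate cleanly is an assumption worth verifying for your version:

from datasus_etl.config import PipelineConfig
from datasus_etl.pipeline.sihsus_pipeline import SihsusPipeline

# One run per UF; assumes successive runs can share the same data_dir
for uf in ["SP", "RJ", "MG"]:
    config = PipelineConfig.create(
        source="sihsus",
        start_date="2023-01-01",
        end_date="2023-12-31",
        ufs=[uf],
        data_dir="./data",
    )
    SihsusPipeline(config).run()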

Progress reporting

Every pipeline accepts a progress callback. It's called with a ProgressEvent describing the current stage — useful for wiring into your own dashboard or Jupyter widget.

def on_progress(event):
    print(f"[{event.stage}] {event.message} — {event.pct:.0%}")

pipeline.run(progress=on_progress)
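
The same callback shape is easy to wire into other frontends. For instance, here is a sketch that drives a tqdm progress bar with the stage, message, and pct fields shown above (tqdm is assumed to be installed separately):

from tqdm import tqdm

bar = tqdm(total=100, desc="sihsus")

def on_progress(event):
    # pct is a fraction between 0 and 1, as the :.0% format above implies
    bar.n = int(event.pct * 100)
    bar.set_description(f"{event.stage}: {event.message}")
    bar.refresh()

pipeline.run(progress=on_progress)
bar.close()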

Querying the parquet store

After a run, the Parquet files are partitioned by UF under the resolved storage root. Read them with any tool that speaks Parquet:

import polars as pl

df = pl.scan_parquet("./data/datasus_db/sihsus/**/*.parquet").collect()
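
If the partition directories follow the hive uf=... convention (an assumption about the layout, not something this page guarantees), polars can surface the partition as a column and push a filter down to it, skipping the other UFs entirely:

import polars as pl

sp_only = (
    pl.scan_parquet(
        "./data/datasus_db/sihsus/**/*.parquet",
        hive_partitioning=True,  # exposes uf=... directories as a column, if present
    )
    .filter(pl.col("uf") == "SP")  # assumes the partition column is named uf
    .collect()
)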

Or via DuckDB directly, reusing the views the app registers:

from datasus_etl.storage.duckdb_query_engine import DuckDBQueryEngine

engine = DuckDBQueryEngine(data_dir="./data")
print(engine.execute("SELECT uf, count(*) FROM sihsus GROUP BY uf").fetch_df())