# Python API
The installer is the convenient path. For integration with pipelines, notebooks, and custom workflows, install the same code from PyPI and call it directly.
```bash
pip install datasus-etl
```

## Configuration
Every public API takes a `PipelineConfig`, a pydantic model with sane defaults. Build one with `PipelineConfig.create(...)`:
```python
from datasus_etl.config import PipelineConfig

config = PipelineConfig.create(
    source="sihsus",
    start_date="2023-01-01",
    end_date="2023-12-31",
    ufs=["SP", "RJ"],
    data_dir="./data",
)
```
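The `create(...)` call above takes ISO-8601 date strings. When those values come from user input rather than literals, a quick stdlib check before building the config fails faster than a mid-pipeline error; this is only a sketch, independent of whatever validation `PipelineConfig` itself performs:

```python
from datetime import date


def check_date_range(start: str, end: str) -> tuple[date, date]:
    """Parse ISO-8601 strings and make sure the range is ordered."""
    start_d = date.fromisoformat(start)
    end_d = date.fromisoformat(end)
    if start_d > end_d:
        raise ValueError(f"start_date {start} is after end_date {end}")
    return start_d, end_d


# Raises on a malformed or inverted range instead of failing mid-download:
check_date_range("2023-01-01", "2023-12-31")
```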
## Running a pipeline

```python
from datasus_etl.pipeline.sihsus_pipeline import SihsusPipeline

pipeline = SihsusPipeline(config)
pipeline.run()
```
## Progress reporting

Every pipeline accepts a progress callback. It is called with a `ProgressEvent` describing the current stage, which is useful for wiring into your own dashboard or Jupyter widget.
```python
def on_progress(event):
    print(f"[{event.stage}] {event.message} — {event.pct:.0%}")

pipeline.run(progress=on_progress)
```
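Beyond printing, the callback can feed any sink. The small collector below keeps the latest event per stage, which is enough to drive a text status line or an `ipywidgets` progress bar; the `ProgressEvent` here is a local stand-in mirroring only the `stage`/`message`/`pct` fields used above:

```python
from dataclasses import dataclass, field


@dataclass
class ProgressEvent:  # stand-in with the fields used in the example above
    stage: str
    message: str
    pct: float


@dataclass
class ProgressTracker:
    """Remembers the latest event per stage; render() gives a one-line summary."""
    latest: dict = field(default_factory=dict)

    def __call__(self, event: ProgressEvent) -> None:
        self.latest[event.stage] = event

    def render(self) -> str:
        return " | ".join(
            f"{stage}: {ev.pct:.0%}" for stage, ev in self.latest.items()
        )


tracker = ProgressTracker()
tracker(ProgressEvent("download", "fetching SP", 0.4))
tracker(ProgressEvent("download", "fetching RJ", 0.8))
tracker(ProgressEvent("convert", "to parquet", 0.1))
print(tracker.render())  # download: 80% | convert: 10%
```

Because the tracker is callable with a single event argument, it drops in where `on_progress` went: `pipeline.run(progress=tracker)`.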
## Querying the Parquet store

After a run, the Parquet files are partitioned by UF under the resolved storage root. Read them with any tool that speaks Parquet:
```python
import polars as pl

df = pl.scan_parquet("./data/datasus_db/sihsus/**/*.parquet").collect()
```
Or via DuckDB directly, reusing the views the app registers:

```python
from datasus_etl.storage.duckdb_query_engine import DuckDBQueryEngine

engine = DuckDBQueryEngine(data_dir="./data")
print(engine.execute("SELECT uf, count(*) FROM sihsus GROUP BY uf").fetch_df())
```