datasus-etl

Architecture.

Opinionated notes on the parts of the system most likely to surprise a reader skimming the code.

Storage paths — one rule

storage/paths.py::resolve_parquet_dir is the only function that decides where parquet lives. Every other caller delegates here. The rules, in order:

  • If base_dir.name is already datasus_db or parquet (case-insensitive), append only the subsystem.
  • If a legacy {base_dir}/parquet/{subsystem} exists, return it unchanged (pre-0.1 layout).
  • Otherwise, return {base_dir}/datasus_db/{subsystem}.
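The three rules can be sketched as below. This is an illustration rather than the project's actual code: the (base_dir, subsystem) signature and the _ROOT_NAMES constant are assumptions; only the rule order comes from the list above.

```python
from pathlib import Path

# Directory names that already act as the parquet root (rule 1).
_ROOT_NAMES = {"datasus_db", "parquet"}


def resolve_parquet_dir(base_dir, subsystem):
    """Return the parquet directory for one subsystem (assumed signature)."""
    base = Path(base_dir)

    # Rule 1: base_dir already is the root, so append only the subsystem.
    if base.name.lower() in _ROOT_NAMES:
        return base / subsystem

    # Rule 2: keep using the legacy pre-0.1 layout when it exists on disk.
    legacy = base / "parquet" / subsystem
    if legacy.exists():
        return legacy

    # Rule 3: default layout.
    return base / "datasus_db" / subsystem
```

Checking the directory name first is what prevents the double nesting: a caller that passes a path already ending in datasus_db never gets a second datasus_db appended.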

The original bug that motivated this module produced a double-nested {base}/datasus_db/datasus_db/ layout; storage/migration.py detects and migrates it at CLI startup.
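A minimal sketch of what such a migration could look like, assuming the repair is to move each entry of the inner datasus_db up one level. The function name fix_double_nesting and the skip-on-collision policy are hypothetical; the real logic in storage/migration.py may differ.

```python
import shutil
from pathlib import Path


def fix_double_nesting(base_dir):
    """Move {base}/datasus_db/datasus_db/* up one level; return moved names."""
    outer = Path(base_dir) / "datasus_db"
    inner = outer / "datasus_db"
    if not inner.is_dir():
        return []  # nothing to migrate

    moved = []
    for child in sorted(inner.iterdir()):
        target = outer / child.name
        if target.exists():
            continue  # never overwrite data that is already in place
        shutil.move(str(child), str(target))
        moved.append(child.name)

    # Remove the stray directory only once it is empty.
    if not any(inner.iterdir()):
        inner.rmdir()
    return moved
```

The function is idempotent: a second call finds no inner directory and returns an empty list, which makes it safe to run at every startup.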

SIM filename parsing — a landmine

datasets/sim/config.py::parse_filename disambiguates CID9 vs. CID10 by stem length, not by prefix:

  • DOUFYYYY.dbc — 8-char stem → CID10 (1996+)
  • DORUFYY.dbc — 7-char stem → CID9 (1979–1995)

A previous implementation checked "starts with DOR" first. Because the CID10 stems for those states also begin with DOR (DORJ2010, for example), that check silently dropped every CID10 death record from RJ, RN, RO, RR, and RS.
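The length-first rule can be sketched as follows. The return shape (uf, year, cid) and the mapping of two-digit years to 19YY are assumptions made for illustration; the latter follows from the 1979–1995 range given for CID9.

```python
from pathlib import Path


def parse_filename(name):
    """Return (uf, year, cid) for a SIM file name (assumed return shape)."""
    stem = Path(name).stem.upper()
    if not stem.startswith("DO"):
        raise ValueError(f"unexpected SIM file name: {name}")

    # Decide by length first: DORJ2010 has 8 chars even though it starts with DOR.
    if len(stem) == 8:
        uf, year = stem[2:4], stem[4:8]
        if not year.isdigit():
            raise ValueError(f"bad year in {name}")
        return uf, int(year), "CID10"

    # Only a 7-char stem may be read as DOR + UF + YY.
    if len(stem) == 7 and stem[2] == "R":
        uf, yy = stem[3:5], stem[5:7]
        if not yy.isdigit():
            raise ValueError(f"bad year in {name}")
        return uf, 1900 + int(yy), "CID9"  # assumption: CID9 years are 19YY

    raise ValueError(f"unexpected SIM file name: {name}")
```

With this ordering, DORJ2010.dbc resolves to RJ under CID10, while DORRJ79.dbc resolves to RJ under CID9.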

Web UI bundling

The React SPA lives under web-ui/ and is built with Vite into src/datasus_etl/web/static/, a gitignored directory. The wheel build picks up the built artifacts through a custom hatch build hook in hatch_build.py.

Tkinter in a subprocess

The folder picker runs in a one-shot subprocess via the hidden CLI subcommand datasus _pick-folder. This keeps tkinter on its own main thread (mandatory on macOS) without blocking the uvicorn worker, and the same invocation works for both pip install and Nuitka-compiled binaries.
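A rough sketch of both sides of this arrangement. The function names pick_folder and pick_folder_child are hypothetical, and passing the chosen path back on stdout is an assumed protocol; only the datasus _pick-folder command comes from the text above.

```python
import asyncio

# Command named in the docs; the stdout protocol below is an assumption.
PICK_FOLDER_CMD = ("datasus", "_pick-folder")


def pick_folder_child():
    """Child side: show the dialog on this process's main thread."""
    import tkinter
    from tkinter import filedialog

    root = tkinter.Tk()
    root.withdraw()  # hide the empty root window
    path = filedialog.askdirectory()
    root.destroy()
    print(path or "")  # empty line when the user cancels


async def pick_folder(cmd=PICK_FOLDER_CMD):
    """Parent side: await the child without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE
    )
    stdout, _ = await proc.communicate()
    path = stdout.decode().strip()
    # Treat a failed child or an empty answer as "no folder chosen".
    return path if proc.returncode == 0 and path else None
```

Because the parent only awaits a process, the web server keeps serving other requests while the dialog is open, and tkinter never shares a thread with the event loop.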

Release pipeline

A push to main that modifies the root VERSION file triggers the release workflow. It validates the semver string, runs Nuitka on three matrix runners (Windows x64, macOS arm64, Linux x64), packages each build into a native installer, publishes a GitHub Release, and redeploys this site with the new download links. Intel Macs (x86_64) are not part of the native matrix: GitHub's macos-13 runner is deprecated and has extreme queue times, so Intel Mac users install from source with pip install . instead.
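The semver validation step might look something like the sketch below. The is_semver helper is hypothetical and the pattern is a simplified reading of Semantic Versioning 2.0.0 (it does not reject leading zeros in numeric pre-release identifiers); the workflow's real check is not shown in this document.

```python
import re

# Simplified SemVer 2.0.0 pattern: MAJOR.MINOR.PATCH[-prerelease][+build].
_SEMVER = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
)


def is_semver(text):
    """Return True when the stripped text is a well-formed version string."""
    return _SEMVER.fullmatch(text.strip()) is not None
```

Failing fast on a malformed VERSION file is cheaper than discovering the problem after three Nuitka builds have already run.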

The workflow lives at .github/workflows/release.yml. The Nuitka flags (one line per dependency, each there because the build failed without it) live in scripts/build_nuitka.py.

Versioning

The root VERSION file is the single source of truth. Hatch reads it to set wheel metadata (no hand-edited pyproject.toml), the build hook re-emits src/datasus_etl/__version__.py to match, and the UI footer plus datasus --version both read __version__.
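As an illustration of the re-emit step, here is a minimal sketch of a function a build hook could call. The name emit_version_module and the exact contents of the generated file are assumptions; the paths are the ones named above.

```python
from pathlib import Path


def emit_version_module(repo_root):
    """Rewrite src/datasus_etl/__version__.py from the root VERSION file."""
    root = Path(repo_root)
    version = (root / "VERSION").read_text(encoding="utf-8").strip()

    target = root / "src" / "datasus_etl" / "__version__.py"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumed file format: a single __version__ assignment.
    target.write_text(f'__version__ = "{version}"\n', encoding="utf-8")
    return version
```

Generating the module at build time means the package never carries a version string that someone has to remember to edit by hand.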