Architecture
Opinionated notes on the parts of the system most likely to surprise a reader skimming the code.
Storage paths — one rule
storage/paths.py::resolve_parquet_dir
is the only function that decides where parquet lives. Every
other caller delegates here. The rules, in order:
- If base_dir.name is already datasus_db or parquet (case-insensitive), append only the subsystem.
- If a legacy {base_dir}/parquet/{subsystem} exists, return it unchanged (pre-0.1 layout).
- Otherwise, return {base_dir}/datasus_db/{subsystem}.
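The three rules can be restated as a short sketch. This is an illustration, not the code in storage/paths.py: the signature (a base_dir path plus a subsystem string) and the _ROOT_NAMES constant are assumptions made for the example.

```python
from pathlib import Path

# Hypothetical constant: directory names that already denote a storage root.
_ROOT_NAMES = {"datasus_db", "parquet"}


def resolve_parquet_dir(base_dir: Path, subsystem: str) -> Path:
    """Illustrative restatement of the three rules, checked in order."""
    base_dir = Path(base_dir)

    # Rule 1: base_dir is already a storage root, so append only the subsystem.
    if base_dir.name.lower() in _ROOT_NAMES:
        return base_dir / subsystem

    # Rule 2: a pre-0.1 layout exists on disk, so keep using it unchanged.
    legacy = base_dir / "parquet" / subsystem
    if legacy.exists():
        return legacy

    # Rule 3: the default layout.
    return base_dir / "datasus_db" / subsystem
```

Because rule 1 is checked first, passing a path that already ends in datasus_db cannot produce a second datasus_db component.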
The original bug that motivated this module produced a
double-nested {base}/datasus_db/datasus_db/ layout;
storage/migration.py
detects and migrates it at CLI startup.
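A minimal sketch of what detecting and flattening the double-nested layout might involve. The function name flatten_double_nesting and the policy of skipping entries whose target already exists are assumptions for illustration; the real logic in storage/migration.py may differ.

```python
import shutil
from pathlib import Path
from typing import List


def flatten_double_nesting(base_dir: Path) -> List[Path]:
    """Move children of {base}/datasus_db/datasus_db/ up one level.

    Returns the new locations of the moved entries.
    """
    outer = Path(base_dir) / "datasus_db"
    inner = outer / "datasus_db"
    if not inner.is_dir():
        return []  # layout is already correct; nothing to migrate

    moved = []
    for child in sorted(inner.iterdir()):
        target = outer / child.name
        if target.exists():
            # Assumption: never overwrite data already in the correct layout.
            continue
        shutil.move(str(child), str(target))
        moved.append(target)

    # Remove the inner directory only if every entry was moved out of it.
    if not any(inner.iterdir()):
        inner.rmdir()
    return moved
```

Running it a second time is a no-op, which is the property a startup-time migration needs.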
SIM filename parsing — a landmine
datasets/sim/config.py::parse_filename
disambiguates CID9 vs. CID10 by stem length, not by prefix:
- DOUFYYYY.dbc: 8-char stem → CID10 (1996+)
- DORUFYY.dbc: 7-char stem → CID9 (1979–1995)
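Checking "starts with DOR" first (a previous
implementation) silently dropped every CID10 death record from
RJ, RN, RO, RR, and RS: those state codes begin with R, so their
CID10 filenames (for example DORJ2010.dbc) also start with DOR.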
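A sketch of length-based disambiguation, assuming the two filename shapes listed above. The SimFile tuple and the return shape are hypothetical; only the stem-length rule comes from these notes.

```python
from pathlib import PurePath
from typing import NamedTuple


class SimFile(NamedTuple):
    """Hypothetical result type for this example."""
    cid: str   # "CID9" or "CID10"
    uf: str    # two-letter state code
    year: int  # four-digit year


def parse_filename(name: str) -> SimFile:
    stem = PurePath(name).stem.upper()

    # Decide by stem length first; the prefix alone is ambiguous for
    # states whose code begins with R.
    if len(stem) == 8 and stem.startswith("DO"):
        # DOUFYYYY: two-letter UF followed by a four-digit year.
        return SimFile("CID10", stem[2:4], int(stem[4:8]))
    if len(stem) == 7 and stem.startswith("DOR"):
        # DORUFYY: two-letter UF followed by a two-digit year (1979-1995).
        return SimFile("CID9", stem[3:5], 1900 + int(stem[5:7]))

    raise ValueError(f"unrecognised SIM filename: {name!r}")
```

With this ordering, DORJ2010.dbc parses as a CID10 file for RJ, while DORRJ85.dbc parses as a CID9 file for RJ.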
Web UI bundling
The React SPA lives under web-ui/ and is built with
Vite into src/datasus_etl/web/static/ — a gitignored
directory. The wheel packaging picks up the built artifacts via
a hatch custom build hook at
hatch_build.py.
Tkinter in a subprocess
The folder picker runs in a one-shot subprocess via the hidden
CLI subcommand datasus _pick-folder. This keeps
tkinter on its own main thread (mandatory on macOS) without
blocking the uvicorn worker, and the same invocation works for
both pip install and Nuitka-compiled binaries.
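The pattern can be sketched in two halves: a child that owns tkinter, and a parent that awaits the child's output. The function names, the stdout protocol (one line holding the chosen path, empty on cancel), and the use of asyncio are assumptions for illustration; the actual subcommand may communicate differently.

```python
import asyncio
from typing import List, Optional


def run_picker_child() -> None:
    """Body of a hidden subcommand: show the dialog, print the choice."""
    # Imported lazily so the parent (server) process never loads tkinter.
    import tkinter
    from tkinter import filedialog

    root = tkinter.Tk()
    root.withdraw()  # hide the empty root window; only the dialog shows
    chosen = filedialog.askdirectory()  # falsy when the user cancels
    root.destroy()
    print(chosen or "")


async def pick_folder(cmd: List[str]) -> Optional[str]:
    """Run the picker command and return the chosen path, or None."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE
    )
    out, _ = await proc.communicate()  # yields to the event loop while waiting
    if proc.returncode != 0:
        return None
    path = out.decode().strip()
    return path or None
```

A request handler would then await something like pick_folder(["datasus", "_pick-folder"]); the dialog blocks only the child process, and the event loop stays free to serve other requests.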
Release pipeline
A push to main that modifies the root
VERSION file triggers the release workflow. It
validates the semver, runs Nuitka on three matrix runners (Windows
x64, macOS arm64, Linux x64), packages each into a native
installer, publishes a GitHub Release, and re-deploys this site
with the new download links. Intel Macs (x86_64) are not part of
the native matrix: GitHub's macos-13 runner has been deprecated
and suffers extreme queue times, so Intel Mac users install from
source by running pip install . in a checkout.
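For the semver validation step, a check along these lines would be enough to reject a malformed VERSION before any runner starts. The pattern is a simplified reading of the Semantic Versioning grammar, and whether the real workflow accepts pre-release or build suffixes is an assumption here.

```python
import re

# Simplified semver pattern: MAJOR.MINOR.PATCH without leading zeros,
# plus optional "-prerelease" and "+build" suffixes. It is looser than
# the full grammar on the suffix contents.
SEMVER = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?"
    r"(?:\+[0-9A-Za-z.-]+)?"
)


def is_semver(text: str) -> bool:
    """Return True if the text (for example the contents of VERSION) is semver-shaped."""
    # strip() tolerates the trailing newline most editors add to the file.
    return SEMVER.fullmatch(text.strip()) is not None
```

Failing fast on the version string is cheaper than discovering a bad tag after three Nuitka builds have finished.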
The workflow lives at
.github/workflows/release.yml. The Nuitka flags (one line per dependency, each there
because the build failed without it) live in
scripts/build_nuitka.py.
Versioning
The root VERSION file is the single source of truth.
Hatch reads it to set wheel metadata (no hand-edited version in
pyproject.toml), the build hook re-emits
src/datasus_etl/__version__.py to match, and the UI
footer plus datasus --version both read
__version__.
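The re-emit step might look roughly like the following. The helper name emit_version_module and the exact contents written to the file are assumptions; the real hook in hatch_build.py may generate the module differently.

```python
from pathlib import Path


def emit_version_module(repo_root: Path) -> str:
    """Write src/datasus_etl/__version__.py from the root VERSION file."""
    repo_root = Path(repo_root)
    version = (repo_root / "VERSION").read_text(encoding="utf-8").strip()

    target = repo_root / "src" / "datasus_etl" / "__version__.py"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumed file format: a single module-level string assignment.
    target.write_text(f'__version__ = "{version}"\n', encoding="utf-8")
    return version
```

Generating the module at build time means the packaged value always matches VERSION, so the UI footer and the CLI report the same string.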