datasus-etl

Subsystem schemas.

Each supported subsystem has a class describing its raw column list, output parquet schema, filename parser, and enrichment SQL.

SIHSUS

Source of truth: datasets/sihsus/. Raw DBC files follow the DATASUS RDUFYYMM naming convention: RD (reduced hospital admission record), UF (two-letter state code), YY (two-digit year), MM (month). The output parquet is partitioned by UF.
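
As an illustration of the naming convention, the sketch below splits an RDUFYYMM stem into its parts. It is not the project's parser: the names SihsusFile and parse_sihsus_stem are hypothetical, and the century pivot used to expand the two-digit year is an assumption.

```python
import re
from typing import NamedTuple


class SihsusFile(NamedTuple):
    """Parts of an RDUFYYMM stem (hypothetical helper type)."""
    uf: str
    year: int
    month: int


# RD + two-letter state code + two-digit year + two-digit month.
_RD_PATTERN = re.compile(r"RD([A-Z]{2})(\d{2})(\d{2})")


def parse_sihsus_stem(stem: str, pivot: int = 70) -> SihsusFile:
    """Parse a stem such as 'RDSP2401'.

    Assumption: two-digit years >= pivot belong to the 1900s,
    the rest to the 2000s. The real parser may decide differently.
    """
    match = _RD_PATTERN.fullmatch(stem.upper())
    if match is None:
        raise ValueError(f"not an RDUFYYMM stem: {stem!r}")
    uf, yy, mm = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in stem: {stem!r}")
    year = 1900 + yy if yy >= pivot else 2000 + yy
    return SihsusFile(uf, year, mm)
```

The UF component is what the output parquet is partitioned by, so a parser along these lines would have to run before any rows are written.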

Notable transformations

  • Municipality codes (6 chars) are joined against the IBGE 5,571-municipality registry and exposed alongside the canonical 7-char code.
  • CID-10 codes are normalized to uppercase and stripped.
  • Dates stored as YYYYMMDD strings are parsed into proper DATE columns (DT_INTER, DT_SAIDA, …).
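
The CID-10 and date rules above can be sketched in plain Python. The project applies them through enrichment SQL, so this is only an illustration of the logic. The function names are hypothetical, "stripped" is assumed to mean removal of surrounding whitespace, and mapping blank or unparseable values to None is also an assumption.

```python
from datetime import date, datetime
from typing import Optional


def normalize_cid(raw: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace and uppercase a CID-10 code."""
    if raw is None:
        return None
    cleaned = raw.strip().upper()
    # Assumption: blank values are treated as missing.
    return cleaned or None


def parse_yyyymmdd(raw: Optional[str]) -> Optional[date]:
    """Parse a YYYYMMDD string into a date; None if it is not a valid date."""
    if raw is None:
        return None
    text = raw.strip()
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a calendar date, e.g. 20240231.
        return None
```

For example, normalize_cid(" j18.9 ") gives "J18.9", and parse_yyyymmdd("20240131") gives date(2024, 1, 31).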

SIM

Source of truth: datasets/sim/. Raw filenames include both CID9 (DORUFYY, 1979–1995) and CID10 (DOUFYYYY, 1996+) variants. The parser disambiguates on stem length: CID9 stems have 7 characters, CID10 stems have 8.
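
A minimal sketch of length-based disambiguation follows. SimFile and parse_sim_stem are hypothetical names, not the project's API. Checking the length first matters because a CID10 stem for Roraima, such as DORR2020, also begins with the letters DOR.

```python
import re
from typing import NamedTuple


class SimFile(NamedTuple):
    """Parts of a SIM stem (hypothetical helper type)."""
    uf: str
    year: int
    cid_revision: int


def parse_sim_stem(stem: str) -> SimFile:
    """Parse DORUFYY (CID9, 7 chars) or DOUFYYYY (CID10, 8 chars)."""
    stem = stem.upper()
    if len(stem) == 7:
        match = re.fullmatch(r"DOR([A-Z]{2})(\d{2})", stem)
        if match:
            # CID9 files cover 1979-1995, so the century is always 19xx.
            return SimFile(match.group(1), 1900 + int(match.group(2)), 9)
    elif len(stem) == 8:
        match = re.fullmatch(r"DO([A-Z]{2})(\d{4})", stem)
        if match:
            return SimFile(match.group(1), int(match.group(2)), 10)
    raise ValueError(f"unrecognized SIM stem: {stem!r}")
```

With this sketch, parse_sim_stem("DORSP85") is read as CID9 for SP in 1985, while parse_sim_stem("DORR2020") is read as CID10 for RR in 2020.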

Notable transformations

  • IDADE (original 3-digit coded age) is expanded into IDADE_ANOS, IDADE_MESES, etc.
  • Multiple CID columns are combined into a single CID_ARRAY to simplify "any of these causes" queries.
  • Categorical columns that encode yes/no values are standardized to real BOOLEAN columns.
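
The IDADE expansion and the CID array can be sketched as below. The sketch assumes the usual SIM age coding, where the first digit is the unit (0 minutes, 1 hours, 2 days, 3 months, 4 years, 5 years past 100) and the last two digits are the amount; confirm this against the DATASUS data dictionary before relying on it. The function names and the English unit labels are hypothetical, and the project's enrichment SQL may shape its output differently.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed SIM coding of the first IDADE digit.
_UNITS = {"0": "minutes", "1": "hours", "2": "days", "3": "months", "4": "years"}


def decode_idade(raw: Optional[str]) -> Optional[Tuple[str, int]]:
    """Decode a 3-digit IDADE value into (unit, amount)."""
    if raw is None:
        return None
    text = raw.strip()
    if len(text) != 3 or not text.isdigit():
        return None
    unit, amount = text[0], int(text[1:])
    if unit == "5":
        # Assumed meaning: 100 years plus the amount.
        return ("years", 100 + amount)
    if unit in _UNITS:
        return (_UNITS[unit], amount)
    # Any other unit digit is treated as unknown.
    return None


def build_cid_array(values: Iterable[Optional[str]]) -> List[str]:
    """Collect the non-empty CID values of one record, in column order."""
    result = []
    for value in values:
        if value is not None and value.strip():
            result.append(value.strip().upper())
    return result


def mentions_any(cid_array: List[str], targets: Iterable[str]) -> bool:
    """True if the record lists at least one of the target causes."""
    wanted = set(targets)
    return any(code in wanted for code in cid_array)
```

A single array column turns an "any of these causes" filter into one membership test instead of a chain of OR conditions over every CID column.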