Subsystem schemas.
Each supported subsystem has a class describing its raw column list, output parquet schema, filename parser, and enrichment SQL.
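As a rough illustration of how those four pieces might sit together, here is a minimal sketch. The class name `SubsystemSchema`, its field names, and the example values are all hypothetical and are not taken from the codebase; the real definitions live under datasets/.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class SubsystemSchema:
    """Hypothetical container for the four pieces each subsystem describes."""

    name: str
    raw_columns: Sequence[str]             # columns expected in the raw file
    parquet_schema: Mapping[str, str]      # output column -> type name
    parse_filename: Callable[[str], dict]  # file stem -> metadata (UF, period, ...)
    enrichment_sql: str                    # SQL applied after the raw load


# Illustrative values only (assumptions, not the project's real schema).
example = SubsystemSchema(
    name="sihsus",
    raw_columns=("DT_INTER", "DT_SAIDA"),
    parquet_schema={"DT_INTER": "DATE", "DT_SAIDA": "DATE"},
    parse_filename=lambda stem: {"uf": stem[2:4]},
    enrichment_sql="SELECT * FROM raw",
)
```

Keeping the parser and the SQL on the same object as the column list means adding a subsystem is a matter of defining one such object, at least under this assumed layout.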
SIHSUS
Source of truth: datasets/sihsus/. Raw DBC files follow the DATASUS RDUFYYMM naming convention: RD (reduced hospital admission), then the two-letter state (UF), a two-digit year, and a two-digit month. The output parquet is partitioned by UF.
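A filename parser for the RDUFYYMM convention could look like the sketch below. The function name `parse_rd_stem`, the `RdFile` tuple, and in particular the century pivot for two-digit years are assumptions made for illustration; the actual parser in datasets/sihsus/ may resolve years differently.

```python
import re
from typing import NamedTuple


class RdFile(NamedTuple):
    uf: str
    year: int
    month: int


# RD + two-letter UF + two-digit year + two-digit month.
_RD_STEM = re.compile(r"^RD([A-Z]{2})(\d{2})(\d{2})$")


def parse_rd_stem(stem: str, pivot: int = 90) -> RdFile:
    """Parse a stem such as 'RDSP2301' (hypothetical helper).

    Two-digit years >= pivot are read as 19YY, the rest as 20YY.
    The pivot value is an assumption, not a documented rule.
    """
    match = _RD_STEM.match(stem.upper())
    if match is None:
        raise ValueError(f"not an RDUFYYMM stem: {stem!r}")
    uf, yy, mm = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in stem: {stem!r}")
    year = 1900 + yy if yy >= pivot else 2000 + yy
    return RdFile(uf, year, mm)
```

For example, `parse_rd_stem("RDSP2301")` yields UF `SP`, year 2023, month 1 under the assumed pivot.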
Notable transformations
- Municipality codes (6 chars) are joined against the IBGE 5,571-municipality registry and exposed alongside the canonical 7-char code.
- CID-10 codes are normalized to uppercase and stripped.
- Dates stored as YYYYMMDD strings are parsed into proper DATE columns (DT_INTER, DT_SAIDA, …).
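The three transformations above can be sketched in plain Python as follows. The helper names, the one-entry registry, and the choice to return None for malformed input are assumptions made for illustration; the project applies these steps in its enrichment SQL, which is not reproduced here.

```python
from datetime import date, datetime
from typing import Optional

# Toy stand-in for the IBGE registry: 6-char code -> canonical 7-char code.
# A single real entry (São Paulo) is enough to show the lookup.
REGISTRY = {"355030": "3550308"}


def canonical_municipality(code6: str) -> Optional[str]:
    """Look up the 7-char code for a 6-char municipality code."""
    return REGISTRY.get(code6.strip())


def normalize_cid10(code: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace and uppercase; empty values become None."""
    if code is None:
        return None
    cleaned = code.strip().upper()
    return cleaned or None


def parse_yyyymmdd(value: Optional[str]) -> Optional[date]:
    """Parse 'YYYYMMDD' into a date; malformed input yields None (assumed policy)."""
    if value is None:
        return None
    text = value.strip()
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None
```

Returning None rather than raising keeps one bad record from aborting a whole file, at the cost of hiding data-quality problems; whether the real pipeline makes the same choice is not stated in this section.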
SIM
Source of truth: datasets/sim/. Raw filenames include both CID9 (DORUFYY, 1979–1995) and CID10 (DOUFYYYY, 1996+) variants; the parser disambiguates on stem length (7 characters for DORUFYY, 8 for DOUFYYYY).
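A minimal sketch of disambiguating on stem length, assuming the two patterns are exactly DOR + UF + YY (7 characters) and DO + UF + YYYY (8 characters). The function name `parse_sim_stem` and the returned dictionary keys are hypothetical.

```python
def parse_sim_stem(stem: str) -> dict:
    """Classify a SIM file stem as CID9 or CID10 by its length (hypothetical helper)."""
    s = stem.upper()
    if len(s) == 7 and s.startswith("DOR"):
        uf, yy = s[3:5], s[5:7]
        if not yy.isdigit():
            raise ValueError(f"invalid year in stem: {stem!r}")
        # CID9 files cover 1979-1995, so a two-digit year is always 19YY.
        return {"uf": uf, "year": 1900 + int(yy), "cid": 9}
    if len(s) == 8 and s.startswith("DO"):
        uf, yyyy = s[2:4], s[4:8]
        if not yyyy.isdigit():
            raise ValueError(f"invalid year in stem: {stem!r}")
        return {"uf": uf, "year": int(yyyy), "cid": 10}
    raise ValueError(f"unrecognized SIM stem: {stem!r}")
```

Checking the length first matters because a prefix test alone is ambiguous: a CID10 stem for a state whose UF begins with R, such as `DORJ2010`, also starts with `DOR`.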
Notable transformations
- IDADE (original 3-digit coded age) is expanded into IDADE_ANOS, IDADE_MESES, etc.
- Multiple CID columns are combined into a single CID_ARRAY to simplify "any of these causes" queries.
- Categorical boolean encodings are standardized to real BOOLEAN columns.
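The sketch below illustrates the three SIM transformations in plain Python. It rests on assumptions that this section does not confirm: that the first digit of IDADE is a unit code (3 = months, 4 = years, 5 = years beyond 100) followed by a two-digit value, and that categorical flags use "1" for yes and "2" for no. Both should be checked against the DATASUS data dictionary. All function names are hypothetical.

```python
from typing import Iterable, List, Optional


def expand_idade(code: Optional[str]) -> dict:
    """Expand a 3-digit coded age; only month and year units are covered here."""
    out = {"IDADE_ANOS": None, "IDADE_MESES": None}
    text = (code or "").strip()
    if len(text) != 3 or not text.isdigit():
        return out
    unit, value = text[0], int(text[1:])
    if unit == "4":        # assumed: value is in years
        out["IDADE_ANOS"] = value
    elif unit == "5":      # assumed: value is years beyond 100
        out["IDADE_ANOS"] = 100 + value
    elif unit == "3":      # assumed: value is in months, so under one year
        out["IDADE_ANOS"] = 0
        out["IDADE_MESES"] = value
    return out             # other units are left as None in this sketch


def combine_cids(values: Iterable[Optional[str]]) -> List[str]:
    """Collect the non-empty CID codes from several columns, preserving order."""
    cleaned = ((v or "").strip().upper() for v in values)
    return [c for c in cleaned if c]


def to_bool(value: Optional[str], true: str = "1", false: str = "2") -> Optional[bool]:
    """Map an assumed yes/no code to a real boolean; anything else becomes None."""
    text = (value or "").strip()
    if text == true:
        return True
    if text == false:
        return False
    return None
```

With the codes collected into one array, an "any of these causes" query reduces to a membership test, for example `any(c in wanted for c in combine_cids(row))`.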