INFOSIGA-SP is the São Paulo State Traffic Accident Information and Management System, maintained by DETRAN-SP. It publishes, as open data, detailed records of every traffic crash in the state of São Paulo from 2015 onward.
The infosigasp package wraps the download and import of
those records. It takes care of the things that make the raw files
awkward to read directly, such as the semicolon (;) field delimiter and
dates written as DD/MM/YYYY.
INFOSIGA-SP organises its data into three linked tables:
infosiga_datasets()
#> # A tibble: 3 × 4
#> dataset description grain keys
#> <chr> <chr> <chr> <chr>
#> 1 sinistros Traffic crash events recorded in the state of Sao Paulo. one … id_s…
#> 2 pessoas People (victims) involved in traffic crashes. one … id_p…
#> 3 veiculos Vehicles involved in traffic crashes. one … id_v…

sinistros: crash events. One row per recorded event, with the date,
time, location (including latitude/longitude), road attributes and a
breakdown of how many vehicles and victims were involved, by type and
severity.
pessoas: people (victims). One row per person involved, with
demographic attributes, injury severity and, for fatalities, the date
and place of death.
veiculos: vehicles. One row per vehicle involved, with make/model,
manufacturing and model years, colour and type.

All three share the id_sinistro key, so they can be
joined together; pessoas and veiculos
additionally share id_veiculo.
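The key relationships can be illustrated with a minimal sketch on made-up miniature tables. Only id_sinistro, id_veiculo, municipio and gravidade_lesao are names taken from this document; tipo_veiculo and all values are hypothetical. Base R merge() is used here so the example needs no extra packages.

```r
# Made-up miniature tables; only the key columns mirror the real data.
sinistros <- data.frame(
  id_sinistro = c(1, 2),
  municipio   = c("SAO PAULO", "CAMPINAS")
)
veiculos <- data.frame(
  id_sinistro  = c(1, 1, 2),
  id_veiculo   = c(10, 11, 20),
  tipo_veiculo = c("AUTOMOVEL", "MOTOCICLETA", "AUTOMOVEL")  # hypothetical column
)
pessoas <- data.frame(
  id_sinistro     = c(1, 1, 2),
  id_veiculo      = c(10, 11, 20),
  gravidade_lesao = c("LEVE", "FATAL", "GRAVE")
)

# Attach crash-level attributes to each person via id_sinistro.
pessoas_sinistros <- merge(pessoas, sinistros, by = "id_sinistro", all.x = TRUE)

# Attach each person's vehicle via both shared keys.
pessoas_veiculos <- merge(
  pessoas, veiculos,
  by = c("id_sinistro", "id_veiculo"),
  all.x = TRUE
)
```

all.x = TRUE keeps every person even when no match is found, which is the safer default if some people (pedestrians, for instance) turn out to have no vehicle record.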
The main entry point is read_infosiga(). The first call
downloads the source archive (about 120 MB) into a per-user cache;
subsequent calls read from that cache, so you only pay the download cost
once.
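A first import might look like the sketch below. It assumes the package is installed and that the dataset name is passed as the first argument, as in the read_infosiga("pessoas") calls later in this document.

```r
library(infosigasp)

# First call: downloads the archive (about 120 MB) into the cache, then imports.
# Later calls skip the download and read from the cache.
sinistros <- read_infosiga("sinistros")

dim(sinistros)
```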
You can restrict the import to specific years with the
year argument, which filters on the year of the crash
(ano_sinistro):
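A sketch of a year-restricted import follows. The year argument is named above; that it accepts a vector of several years is an assumption based on the phrase "specific years", so check ?read_infosiga before relying on it.

```r
library(infosigasp)

# Keep only crashes whose ano_sinistro is 2023.
sinistros_2023 <- read_infosiga("sinistros", year = 2023)

# Several years at once (assumes `year` accepts a vector).
sinistros_recent <- read_infosiga("sinistros", year = 2022:2023)
```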
By default read_infosiga() returns a
processed dataset (clean = TRUE). The
processing never renames columns, recodes category labels or drops rows
– it only fixes types and source artefacts so the data is ready for
analysis. The full, ordered list of steps lives in
?clean_infosiga; in brief:
Dates are parsed to Date and times to hms (in both modes), and the
ano_mes_* year-month columns (published as "YYYY/MM") become
first-of-month Date values.
Surrounding whitespace is trimmed from text columns (for example,
nacionalidade ships as "BRASILEIRA "); untrimmed, those values break
grouping and joins.
The "NAO DISPONIVEL" (“not available”) marker is replaced by NA
(trimming runs first, so space-padded markers are caught).
Columns with a natural order become ordered factors:
    dia_da_semana: Domingo < … < Sábado (the Brazilian week starts on Sunday);
    turno: MADRUGADA < MANHA < TARDE < NOITE;
    gravidade_lesao (victims): LEVE < GRAVE < FATAL;
    faixa_etaria_demografica / faixa_etaria_legal: age bands in order.
The tp_sinistro_* columns ("S" / empty) become logical, so you can
sum() or filter them directly. The categorical tp_sinistro_primario is
left as text.
tempo_sinistro_obito (days from crash to death) becomes integer, and
the spurious trailing ".0" the export appends to numero_logradouro
("193.0") is stripped.
latitude/longitude pairs outside the São Paulo state bounding box
(mis-encoded values and (0, 0) placeholders, about 7% of records) are
set to NA as a pair. No rows are dropped; pass clean = FALSE for the
raw coordinates.

Because dia_da_semana is an ordered factor, a weekday
tabulation comes out in calendar order rather than alphabetically:
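The effect can be seen without downloading anything. The sketch below uses made-up observations and illustrative weekday labels; the labels in the real data may be spelled differently.

```r
# Illustrative weekday labels, Sunday first.
dias <- c("Domingo", "Segunda", "Terça", "Quarta", "Quinta", "Sexta", "Sábado")

# A few made-up observations, deliberately out of order.
obs <- c("Sexta", "Domingo", "Sábado", "Sexta", "Quarta", "Domingo", "Sexta")

# As plain text, table() sorts the labels alphabetically.
table(obs)

# As an ordered factor, table() follows the level order and keeps empty days.
dia_da_semana <- factor(obs, levels = dias, ordered = TRUE)
table(dia_da_semana)

# Ordered factors also support comparisons, e.g. Thursday or later in the week.
sum(dia_da_semana >= "Quinta")
```

With the imported data the same idea applies directly, for example by passing the dia_da_semana column of sinistros to table().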
If you would rather have the data exactly as published — every text
column as a character vector, with "NAO DISPONIVEL" and the
source’s whitespace padding preserved verbatim — pass
clean = FALSE:
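A raw import might look like this sketch, which assumes the package is installed and the archive is downloadable or already cached.

```r
library(infosigasp)

# Raw import: text columns stay character, padding and "NAO DISPONIVEL" are kept.
sinistros_raw <- read_infosiga("sinistros", clean = FALSE)

# In raw mode dia_da_semana is plain text rather than an ordered factor.
class(sinistros_raw$dia_da_semana)
```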
You can also apply the same processing to a raw import after the fact
with clean_infosiga():
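The sketch below assumes that clean_infosiga() takes the raw tibble as its first argument. The exact signature is not shown in this document, so consult ?clean_infosiga in case further arguments (such as the dataset name) are required.

```r
library(infosigasp)

# Import exactly as published.
raw <- read_infosiga("sinistros", clean = FALSE)

# Assumed call shape: raw tibble in, processed tibble out.
sinistros <- clean_infosiga(raw)
```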
The package ships a small sample of each dataset so you can inspect the columns without any network access:
sample_path <- system.file("extdata", "sinistros_sample.csv", package = "infosigasp")
sample <- readr::read_delim(
  sample_path,
  delim = ";",
  show_col_types = FALSE
)
dim(sample)
#> [1] 100 48
names(sample)
#> [1] "id_sinistro" "tipo_registro"
#> [3] "data_sinistro" "ano_sinistro"
#> [5] "mes_sinistro" "dia_sinistro"
#> [7] "hora_sinistro" "ano_mes_sinistro"
#> [9] "dia_da_semana" "turno"
#> [11] "logradouro" "numero_logradouro"
#> [13] "tipo_via" "tipo_local"
#> [15] "latitude" "longitude"
#> [17] "cod_ibge" "municipio"
#> [19] "regiao_administrativa" "administracao"
#> [21] "conservacao" "circunscricao"
#> [23] "tp_sinistro_primario" "qtd_pedestre"
#> [25] "qtd_bicicleta" "qtd_motocicleta"
#> [27] "qtd_automovel" "qtd_onibus"
#> [29] "qtd_caminhao" "qtd_veic_outros"
#> [31] "qtd_veic_nao_disponivel" "qtd_gravidade_fatal"
#> [33] "qtd_gravidade_grave" "qtd_gravidade_leve"
#> [35] "qtd_gravidade_ileso" "qtd_gravidade_nao_disponivel"
#> [37] "tp_sinistro_atropelamento" "tp_sinistro_colisao_frontal"
#> [39] "tp_sinistro_colisao_traseira" "tp_sinistro_colisao_lateral"
#> [41] "tp_sinistro_colisao_transversal" "tp_sinistro_colisao_outros"
#> [43] "tp_sinistro_choque" "tp_sinistro_capotamento"
#> [45] "tp_sinistro_engavetamento" "tp_sinistro_tombamento"
#> [47] "tp_sinistro_outros" "tp_sinistro_nao_disponivel"

Once imported, the data are ordinary tibbles, so any tidyverse (or base R) workflow applies. For example, counting traffic fatalities per year from the victims dataset:
library(dplyr)

deaths_by_year <- read_infosiga("pessoas") |>
  filter(gravidade_lesao == "FATAL") |>
  count(ano_obito, name = "deaths") |>
  arrange(ano_obito)
deaths_by_year

Or fatalities broken down by the type of victim (driver, passenger, pedestrian):
read_infosiga("pessoas") |>
  filter(gravidade_lesao == "FATAL") |>
  count(tipo_de_vitima, sort = TRUE)

Because sinistros carries latitude and longitude as
numeric columns, crash locations can be mapped directly or aggregated by
municipality (municipio / cod_ibge).
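As a sketch of both uses, the following counts crashes per municipality and extracts the points that still have coordinates after processing. It assumes the package and dplyr are installed; the column names are those listed in the sample above.

```r
library(dplyr)
library(infosigasp)

sinistros <- read_infosiga("sinistros")

# Crashes per municipality, largest counts first.
crashes_by_city <- sinistros |>
  count(cod_ibge, municipio, name = "crashes", sort = TRUE)

# Points ready for mapping: drop crashes whose coordinates were set to NA.
crash_points <- sinistros |>
  filter(!is.na(latitude), !is.na(longitude)) |>
  select(id_sinistro, latitude, longitude)
```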
The download lives in an operating-system specific cache directory:
infosiga_cache_dir()
#> [1] "/github/home/.cache/R/infosigasp"
infosiga_cache_list()
#> character(0)

The archive is refreshed monthly by DETRAN-SP. To pull the latest version, force a re-download:
To reclaim disk space, clear the cache:
You can point the cache somewhere else for a session (or permanently
via your .Rprofile) with the
infosigasp.cache_dir option:
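Setting the option is a plain options() call. The directory below is a hypothetical location under tempdir(), chosen only so the sketch is safe to run; for a permanent setting you would use a stable path of your own.

```r
# Hypothetical location; any writable directory should do.
cache_path <- file.path(tempdir(), "infosigasp-cache")
dir.create(cache_path, showWarnings = FALSE, recursive = TRUE)

# Applies to the current session; the same call in .Rprofile makes it permanent.
options(infosigasp.cache_dir = cache_path)

getOption("infosigasp.cache_dir")
```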
INFOSIGA-SP distributes a field-by-field data dictionary (PDF, in
Portuguese). infosiga_dictionary() downloads it and returns
the paths to the extracted files:
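A minimal sketch, assuming infosiga_dictionary() can be called without arguments and returns the paths as a character vector:

```r
library(infosigasp)

# Downloads the dictionary and returns the paths of the extracted files.
dict_files <- infosiga_dictionary()
dict_files
```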
Data are published by DETRAN-SP under a Creative Commons Attribution 4.0 licence. When you publish results based on these data, please cite INFOSIGA-SP / DETRAN-SP as the source: https://infosiga.detran.sp.gov.br/.