---
title: "Getting started with infosigasp"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with infosigasp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# Network access and a ~120 MB download are required for the live examples, so
# code that hits the network is shown but not evaluated when the vignette is
# built. Set this to TRUE to run them locally.
run_live <- FALSE
```

## Overview

[INFOSIGA-SP](https://infosiga.detran.sp.gov.br/) is the São Paulo State Traffic
Accident Information and Management System, maintained by DETRAN-SP. It
publishes, as open data, detailed records of every traffic crash in the state of
São Paulo from 2015 onward.

The `infosigasp` package wraps the download and import of those records. It
takes care of the things that make the raw files awkward to read directly:

- the files are encoded in **Latin-1** (ISO-8859-1), not UTF-8;
- fields are separated by **semicolons** (`;`);
- decimal numbers (such as coordinates) use a **comma** decimal mark;
- dates are formatted **`DD/MM/YYYY`**;
- each dataset is split across **two files** (2015--2021 and 2022 onward).

```{r setup}
library(infosigasp)
```

## The three datasets

INFOSIGA-SP organises its data into three linked tables:

```{r datasets}
infosiga_datasets()
```

- **`sinistros`**: crash *events*. One row per recorded event, with the date,
  time, location (including latitude/longitude), road attributes and a
  breakdown of how many vehicles and victims were involved, by type and
  severity.
- **`pessoas`**: *people* (victims). One row per person involved, with
  demographic attributes, injury severity and, for fatalities, the date and
  place of death.
- **`veiculos`**: *vehicles*. One row per vehicle involved, with make/model,
  manufacturing and model years, colour and type.

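
Because the tables are linked, a common first step is to attach crash-level attributes to each victim. The block below is a minimal offline sketch, not package code: the two tibbles (`sinistros_toy`, `pessoas_toy`) and every value in them are made up for illustration, and only the column names `id_sinistro`, `municipio` and `gravidade_lesao` come from this vignette. It assumes dplyr is installed.

```r
library(dplyr)

# Made-up rows for illustration only; the real tables have many more columns.
sinistros_toy <- tibble(
  id_sinistro = c(1L, 2L),
  municipio   = c("CAMPINAS", "SANTOS")
)

pessoas_toy <- tibble(
  id_sinistro     = c(1L, 1L, 2L),
  gravidade_lesao = c("LEVE", "FATAL", "GRAVE")
)

# Keep one row per person and attach the attributes of the crash they were in.
pessoas_com_local <- left_join(pessoas_toy, sinistros_toy, by = "id_sinistro")
pessoas_com_local
```

With real imports the same call should apply once both tables have been read, although it is worth checking that the key has the same type in both before joining.
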

All three share the `id_sinistro` key, so they can be joined together; `pessoas`
and `veiculos` additionally share `id_veiculo`.

## Reading data

The main entry point is `read_infosiga()`. The first call downloads the source
archive (about 120 MB) into a per-user cache; subsequent calls read from that
cache, so you only pay the download cost once.

```{r read, eval = run_live}
sinistros <- read_infosiga("sinistros")
sinistros
```

You can restrict the import to specific years with the `year` argument, which
filters on the year of the crash (`ano_sinistro`):

```{r read-year, eval = run_live}
recent <- read_infosiga("sinistros", year = 2022:2025)
```

### Processed vs. raw data

By default `read_infosiga()` returns a **processed** dataset (`clean = TRUE`).
The processing never renames columns, recodes category labels or drops rows --
it only fixes types and source artefacts so the data is ready for analysis. The
full, ordered list of steps lives in `?clean_infosiga`; in brief:

- **Dates and times.** Full dates are parsed to `Date` and times to `hms` (in
  both modes), and the `ano_mes_*` year-month columns (published as
  `"YYYY/MM"`) become first-of-month `Date` values.
- **Whitespace.** Leading/trailing whitespace is trimmed from text columns. The
  source pads some fields to a fixed width (`nacionalidade` ships as
  `"BRASILEIRA "`); untrimmed, those values break grouping and joins.
- **Missing values.** The `"NAO DISPONIVEL"` ("not available") marker is
  replaced by `NA` (trimming runs first, so space-padded markers are caught).
- **Ordered factors.** Ordinal columns sort and plot in their natural order
  instead of alphabetically:
    - `dia_da_semana`: `Domingo` < ... < `Sábado` (the Brazilian week starts on
      Sunday);
    - `turno`: `MADRUGADA` < `MANHA` < `TARDE` < `NOITE`;
    - `gravidade_lesao` (victims): `LEVE` < `GRAVE` < `FATAL`;
    - `faixa_etaria_demografica` / `faixa_etaria_legal`: age bands in order.
- **Crash-type flags.** The binary `tp_sinistro_*` columns (`"S"` / empty)
  become **logical**, so you can `sum()` or filter them directly. The
  categorical `tp_sinistro_primario` is left as text.
- **Numeric strings.** `tempo_sinistro_obito` (days from crash to death) becomes
  **integer**, and the spurious trailing `".0"` the export appends to
  `numero_logradouro` (`"193.0"`) is stripped.
- **Coordinates.** `latitude`/`longitude` outside the São Paulo state bounding
  box -- mis-encoded values and `(0, 0)` placeholders, about 7% of records --
  are set to `NA` as a pair. No rows are dropped; pass `clean = FALSE` for the
  raw coordinates.

```{r clean, eval = run_live}
sinistros <- read_infosiga("sinistros")
levels(sinistros$dia_da_semana)
```

Because `dia_da_semana` is an ordered factor, a weekday tabulation comes out in
calendar order rather than alphabetically:

```{r wday, eval = run_live}
table(sinistros$dia_da_semana)
```

If you would rather have the data exactly as published, with every text column
as a character vector and `"NAO DISPONIVEL"` and the source's whitespace padding
preserved verbatim, pass `clean = FALSE`:

```{r raw, eval = run_live}
raw <- read_infosiga("sinistros", clean = FALSE)
```

You can also apply the same processing to a raw import after the fact with
`clean_infosiga()`:

```{r clean-fn, eval = run_live}
processed <- clean_infosiga(raw, "sinistros")
```

### A peek at the structure without downloading

The package ships a small sample of each dataset so you can inspect the columns
without any network access:

```{r sample}
sample_path <- system.file("extdata", "sinistros_sample.csv",
                           package = "infosigasp")
sample <- readr::read_delim(
  sample_path,
  delim = ";",
  show_col_types = FALSE
)
dim(sample)
names(sample)
```

## A short analysis

Once imported, the data are ordinary tibbles, so any tidyverse (or base R)
workflow applies.

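
As an offline warm-up that needs no download, the sketch below builds a made-up tibble that mimics two of the processed column types described earlier: an ordered `turno` factor and a logical crash-type flag. The flag name `tp_sinistro_exemplo` is hypothetical (it merely stands in for one of the `tp_sinistro_*` columns), the values are invented, and dplyr is assumed to be installed.

```r
library(dplyr)

# Level order as described for the processed `turno` column.
turno_levels <- c("MADRUGADA", "MANHA", "TARDE", "NOITE")

toy <- tibble(
  turno = factor(
    c("NOITE", "MANHA", "NOITE", "TARDE", "MADRUGADA"),
    levels = turno_levels,
    ordered = TRUE
  ),
  # Hypothetical flag name standing in for one of the `tp_sinistro_*` columns.
  tp_sinistro_exemplo = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Groups come out in factor-level order; summing a logical counts the TRUEs.
por_turno <- toy |>
  group_by(turno) |>
  summarise(
    sinistros = n(),
    com_flag  = sum(tp_sinistro_exemplo)
  )
por_turno
```

The same verbs carry over unchanged to the real tables.
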
For example, counting traffic fatalities per year from the victims dataset:

```{r fatalities, eval = run_live}
library(dplyr)

deaths_by_year <- read_infosiga("pessoas") |>
  filter(gravidade_lesao == "FATAL") |>
  count(ano_obito, name = "deaths") |>
  arrange(ano_obito)

deaths_by_year
```

Or fatalities broken down by the type of victim (driver, passenger,
pedestrian):

```{r by-victim, eval = run_live}
read_infosiga("pessoas") |>
  filter(gravidade_lesao == "FATAL") |>
  count(tipo_de_vitima, sort = TRUE)
```

Because `sinistros` carries latitude and longitude as numeric columns, crash
locations can be mapped directly or aggregated by municipality
(`municipio` / `cod_ibge`).

## Managing the cache

The download lives in an operating-system specific cache directory:

```{r cache}
infosiga_cache_dir()
infosiga_cache_list()
```

The archive is refreshed monthly by DETRAN-SP. To pull the latest version,
force a re-download:

```{r refresh, eval = run_live}
infosiga_download(overwrite = TRUE)
```

To reclaim disk space, clear the cache:

```{r clear, eval = run_live}
infosiga_cache_clear()
```

You can point the cache somewhere else for a session (or permanently via your
`.Rprofile`) with the `infosigasp.cache_dir` option:

```{r cache-opt, eval = FALSE}
options(infosigasp.cache_dir = "~/data/infosiga")
```

## The official data dictionary

INFOSIGA-SP distributes a field-by-field data dictionary (PDF, in Portuguese).
`infosiga_dictionary()` downloads it and returns the paths to the extracted
files:

```{r dictionary, eval = run_live}
pdfs <- infosiga_dictionary()
basename(pdfs)
```

## Citing the data

Data are published by DETRAN-SP under a Creative Commons Attribution 4.0
licence. When you publish results based on these data, please cite
INFOSIGA-SP / DETRAN-SP as the source.