| Title: | Download and Import Traffic Crash Data from 'INFOSIGA-SP' |
|---|---|
| Description: | Provides a programmatic interface to the open data published by the Sao Paulo State Traffic Accident Information and Management System ('INFOSIGA-SP'), maintained by the Sao Paulo State Department of Motor Vehicles ('DETRAN-SP'). Functions download and import tidy data frames of traffic crash events ('sinistros'), victims ('pessoas') and vehicles ('veiculos') from 2015 onward, handling the source encoding, decimal marks, date formats and on-disk caching. See <https://infosiga.detran.sp.gov.br/> for the original data portal. |
| Authors: | Vinicius Oike [aut, cre, cph] (ORCID: <https://orcid.org/0009-0005-8015-9189>) |
| Maintainer: | Vinicius Oike <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-21 15:16:17 UTC |
| Source: | https://github.com/viniciusoike/infosigasp |
Applies the standard processing that read_infosiga() performs by default
(clean = TRUE). Use this directly only when you imported a dataset with
clean = FALSE (the raw version) and want to process it afterwards.
clean_infosiga(data, dataset = c("sinistros", "pessoas", "veiculos"))
data |
A data frame imported with |
dataset |
Which dataset |
The processing is deliberately light: it standardises missing values, fixes source formatting artefacts and assigns meaningful types to columns whose published representation is inconvenient (ordinal text, binary flags, year-month strings). It never renames columns, recodes category labels or drops rows, so the result stays a faithful, analysis-ready view of the source.
The following steps are applied, in order. Every step is idempotent, so
clean_infosiga() can be called again on an already-processed dataset
without changing it.
Whitespace. Leading and trailing whitespace is trimmed from every
text column. Some source fields are space-padded to a fixed width (for
example nacionalidade is published as "BRASILEIRA "); without
trimming, comparisons, grouping and joins on those columns silently fail.
Missing values. The literal "NAO DISPONIVEL" ("not available")
marker is replaced by NA in every text column. Trimming happens first
so that space-padded markers are also caught.
Ordered factors. Ordinal columns are converted to ordered factors with their natural order:
dia_da_semana: Domingo < ... < Sabado (the Brazilian week
starts on Sunday).
turno: MADRUGADA < MANHA < TARDE < NOITE.
gravidade_lesao (in pessoas): LEVE < GRAVE < FATAL.
faixa_etaria_demografica, faixa_etaria_legal (in pessoas):
age bands in increasing order.
Year-month dates. Year-month columns (ano_mes_sinistro,
ano_mes_obito), published as "YYYY/MM" strings, are parsed to
first-of-month Date values, matching the Date class already used for
the full-date columns.
Crash-type flags (sinistros). The binary tp_sinistro_*
columns – which mark whether a crash involved a given event type and are
published as "S" (yes) or empty (no) – become logical (TRUE /
FALSE). The categorical tp_sinistro_primario (the primary crash type,
e.g. "COLISAO") is not a flag and is left as text.
Days to death (pessoas). tempo_sinistro_obito, the number of
days between the crash and the victim's death (published as a numeric
string), becomes integer.
Street numbers (sinistros). numero_logradouro is kept as text
(house numbers may contain letters), but a spurious trailing ".0" from
the source export ("193.0") is stripped to "193".
Coordinates (sinistros). latitude/longitude are validated as
a pair against the bounding box of the state of Sao Paulo. Points outside
the box – mis-encoded values and (0, 0) "null island" placeholders –
have both coordinates set to NA. This affects roughly 7% of records;
no rows are dropped. Use clean = FALSE if you need the raw coordinates.
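The steps above can be sketched in base R on a toy data frame. This is an illustration only, not the package's implementation: the helper `clean_sketch()`, the sample rows, the flag column name `tp_sinistro_atropelamento` and the bounding-box limits are all assumptions made for the example.

```r
# Approximate bounding box for Sao Paulo state (assumed values for illustration).
sp_bbox <- list(lat = c(-25.5, -19.5), lon = c(-53.5, -44.0))
turno_levels <- c("MADRUGADA", "MANHA", "TARDE", "NOITE")

# Hypothetical stand-in for a few of the cleaning steps, applied in order.
clean_sketch <- function(df) {
  # Trim whitespace first, then turn the "NAO DISPONIVEL" marker into NA,
  # so that space-padded markers are caught as well.
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x <- trimws(x)
    x[!is.na(x) & x == "NAO DISPONIVEL"] <- NA_character_
    x
  })

  # Ordinal text becomes an ordered factor.
  df$turno <- factor(df$turno, levels = turno_levels, ordered = TRUE)

  # "YYYY/MM" strings become first-of-month dates.
  if (is.character(df$ano_mes_sinistro)) {
    df$ano_mes_sinistro <- as.Date(
      paste0(df$ano_mes_sinistro, "/01"), format = "%Y/%m/%d"
    )
  }

  # "S" or missing becomes TRUE or FALSE.
  flag <- df$tp_sinistro_atropelamento
  if (!is.logical(flag)) {
    df$tp_sinistro_atropelamento <- !is.na(flag) & flag == "S"
  }

  # Strip a spurious trailing ".0" but keep the column as text.
  df$numero_logradouro <- sub("\\.0$", "", df$numero_logradouro)

  # Validate coordinates as a pair; points outside the box lose both values.
  inside <- !is.na(df$latitude) & !is.na(df$longitude) &
    df$latitude >= sp_bbox$lat[1] & df$latitude <= sp_bbox$lat[2] &
    df$longitude >= sp_bbox$lon[1] & df$longitude <= sp_bbox$lon[2]
  df$latitude[!inside] <- NA_real_
  df$longitude[!inside] <- NA_real_

  df
}

# Made-up rows that exercise each step.
raw <- data.frame(
  nacionalidade = c("BRASILEIRA   ", "NAO DISPONIVEL  ", "BRASILEIRA"),
  turno = c("NOITE", "MANHA", "NAO DISPONIVEL"),
  ano_mes_sinistro = c("2022/03", "2021/11", "2015/01"),
  tp_sinistro_atropelamento = c("S", NA, "S"),
  numero_logradouro = c("193.0", "12A", NA),
  latitude = c(-23.55, 0, -23.60),
  longitude = c(-46.63, 0, 46.60),
  stringsAsFactors = FALSE
)

once <- clean_sketch(raw)
twice <- clean_sketch(once)  # a second pass should change nothing
str(once)
```

Each step in the sketch either tests the column type before converting or performs an operation that is harmless to repeat, which is one way to obtain the idempotency described above.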
Nominal text columns (such as municipio, tipo_via or sexo) are left as
character vectors. Numeric columns that are already well typed – notably
idade (the victim's age, in pessoas) – are passed through unchanged and
are not range-checked: missing ages are NA, and ages of 0 (infants)
are kept. In the current upstream data idade ranges from 0 to about 102,
but the package does not enforce any bound, so validate it yourself if your
analysis is sensitive to outliers.
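Because idade is not range-checked, any plausibility rule has to live in your own code. A minimal sketch, assuming a toy vector of ages and a hypothetical upper bound of 110 years (neither comes from the package):

```r
# Toy ages; in practice this would be the idade column of the pessoas dataset.
idade <- c(0, 34, NA, 102, 250)

# Hypothetical plausibility bound: adjust to your own analysis.
max_age <- 110

# Flag implausible values instead of dropping rows; missing ages stay NA
# and ages of 0 (infants) are kept.
implausible <- !is.na(idade) & (idade < 0 | idade > max_age)
idade_checked <- replace(idade, implausible, NA)

summary(idade_checked)
```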
A tibble with the same columns as data, with the
processing described in Details applied.
read_infosiga(), which calls this function when clean = TRUE.
# Process the bundled raw sample
raw <- readr::read_delim(
  system.file("extdata", "pessoas_sample.csv", package = "infosigasp"),
  delim = ";", show_col_types = FALSE
)
clean <- clean_infosiga(raw, "pessoas")
levels(clean$gravidade_lesao)
INFOSIGA-SP ships its data as a single compressed archive of roughly 120 MB (over 700 MB once uncompressed). To avoid repeated downloads, infosigasp stores the archive in a per-user cache directory and reuses it across sessions. These functions inspect and manage that cache.
infosiga_cache_dir()
infosiga_cache_list()
infosiga_cache_clear(confirm = interactive())
confirm |
Logical. If |
The cache location defaults to the operating-system specific user cache
directory returned by tools::R_user_dir("infosigasp", "cache").
You can override it for the current session with the infosigasp.cache_dir
option, e.g. options(infosigasp.cache_dir = "~/my-cache"), or permanently
through your .Rprofile.
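The lookup described above can be sketched as follows. `resolve_cache_dir()` is a hypothetical helper written for this example, and the package may resolve the path differently; only the option name and the `tools::R_user_dir()` default are taken from the text.

```r
# Hypothetical helper: the option wins, otherwise fall back to the
# per-user cache directory for the package.
resolve_cache_dir <- function() {
  getOption(
    "infosigasp.cache_dir",
    default = tools::R_user_dir("infosigasp", which = "cache")
  )
}

# Default location (platform dependent).
resolve_cache_dir()

# Override for the current session, then restore the previous setting.
old <- options(infosigasp.cache_dir = file.path(tempdir(), "my-cache"))
resolve_cache_dir()
options(old)
```

Keeping the value returned by `options()` and passing it back restores whatever was set before, which is useful inside scripts and tests.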
infosiga_cache_dir() returns the cache directory path (a string). It is
a pure accessor with no side effects: the directory itself is created
lazily the first time data is written (e.g. by infosiga_download()), so
the reported path may not yet exist.
infosiga_cache_list() returns a character vector of cached file paths
(possibly empty).
infosiga_cache_clear() invisibly returns the paths it removed.
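Since the directory is created lazily, code that consumes these return values should tolerate a path that does not exist yet. A small base R sketch of that pattern, using a stand-in path under `tempdir()` and a placeholder file rather than the real cache:

```r
# Stand-in for the reported cache path (not the real cache location).
cache_dir <- file.path(tempdir(), "infosigasp-demo-cache")
unlink(cache_dir, recursive = TRUE)  # start from a clean state

# The path is known, but nothing exists on disk yet.
existed_before <- dir.exists(cache_dir)

# Listing must cope with the missing directory and return an empty vector.
cached <- if (dir.exists(cache_dir)) {
  list.files(cache_dir, full.names = TRUE)
} else {
  character(0)
}

# The directory is created only when something is about to be written.
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
file.create(file.path(cache_dir, "dados_infosiga.zip"))  # empty placeholder
cached_after <- list.files(cache_dir, full.names = TRUE)
basename(cached_after)
```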
# Where does infosigasp cache its files?
infosiga_cache_dir()

# What is currently cached?
infosiga_cache_list()
Returns a small tibble describing the datasets that read_infosiga() can
import, including their grain (what one row represents) and key columns.
infosiga_datasets()
A tibble with columns dataset, description,
grain and keys.
infosiga_datasets()
Downloads the official INFOSIGA-SP data dictionary, a set of PDF documents (one per dataset) describing every column and its accepted values. The archive is saved to the cache and the extracted PDF paths are returned.
infosiga_dictionary(
  dest = file.path(infosiga_cache_dir(), "dictionary"),
  overwrite = FALSE,
  quiet = FALSE
)
dest |
Directory in which to extract the PDF files. Defaults to a
|
overwrite |
Logical. Re-download even if the dictionary archive is
already cached. Defaults to |
quiet |
Logical. Suppress progress messages. Defaults to |
A character vector of paths to the extracted PDF files, invisibly.
## Not run:
pdfs <- infosiga_dictionary()

# Open the dictionary for the crash-events dataset
browseURL(grep("sinistros", pdfs, value = TRUE))
## End(Not run)
Downloads the consolidated INFOSIGA-SP data archive (dados_infosiga.zip)
from DETRAN-SP into the local cache. Most users do not need to call this
directly: read_infosiga() downloads the archive on demand. Use this
function when you want to pre-fetch the data (for example, before going
offline) or to force a refresh.
infosiga_download(overwrite = FALSE, quiet = FALSE, timeout = 3600)
overwrite |
Logical. If |
quiet |
Logical. If |
timeout |
Download timeout in seconds. The archive is large (around
120 MB), so the default temporarily raises |
The archive is updated monthly by DETRAN-SP and accumulates all records
from 2015 onward. The download URL can be overridden with the
infosigasp.zip_url option, which may be a character vector of mirror URLs
tried in order until one succeeds. The default is the official DETRAN-SP
endpoint followed by a GitHub-release mirror that serves a point-in-time
snapshot when the official portal is unavailable. Override the option to add
your own mirror or for testing.
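Trying a vector of URLs in order until one succeeds can be sketched like this. `download_first_available()` is a hypothetical helper, not a package function, and the two URLs are placeholders; the `fetch` argument exists only so the logic can be exercised without network access.

```r
# Hypothetical helper: try each URL in turn and stop at the first success.
# `fetch` defaults to utils::download.file(), which returns 0 on success.
download_first_available <- function(urls, dest, fetch = utils::download.file) {
  for (url in urls) {
    ok <- tryCatch(
      {
        status <- fetch(url, dest, mode = "wb", quiet = TRUE)
        identical(as.integer(status), 0L)
      },
      error = function(e) FALSE,
      warning = function(w) FALSE
    )
    if (ok) return(invisible(url))
  }
  stop("All download URLs failed.", call. = FALSE)
}

# Offline demonstration with a fake downloader: the first URL fails,
# the second one succeeds.
fake_fetch <- function(url, destfile, ...) {
  if (grepl("official", url)) stop("portal unavailable")
  writeLines("archive bytes", destfile)
  0L
}

dest <- tempfile(fileext = ".zip")
used <- download_first_available(
  c("https://official.example/dados_infosiga.zip",
    "https://mirror.example/dados_infosiga.zip"),
  dest,
  fetch = fake_fetch
)
used
```

With the package itself, the equivalent configuration would be a character vector assigned to the infosigasp.zip_url option, listing the URLs in the order they should be tried.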
Because DETRAN-SP overwrites the archive in place each month under the same
file name, a cached copy can become stale silently. When a cached archive is
reused that is older than the infosigasp.stale_days option (30 days by
default; set to Inf to disable), a warning suggests refreshing it. The age
is taken from the cached file's modification time.
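An age check based on modification time might look like the sketch below. `is_stale()` is a hypothetical helper; only the option name, its 30-day default and the use of the file's modification time come from the text above.

```r
# Hypothetical helper: is a cached file older than the allowed number of days?
is_stale <- function(path,
                     stale_days = getOption("infosigasp.stale_days", 30)) {
  age_days <- as.numeric(
    difftime(Sys.time(), file.mtime(path), units = "days")
  )
  age_days > stale_days
}

# A freshly created placeholder file is not stale.
cached <- tempfile(fileext = ".zip")
file.create(cached)
fresh <- is_stale(cached, stale_days = 30)

# Pretend the file was downloaded 45 days ago.
Sys.setFileTime(cached, Sys.time() - 45 * 24 * 60 * 60)
old <- is_stale(cached, stale_days = 30)

# An infinite threshold disables the check.
never <- is_stale(cached, stale_days = Inf)

c(fresh = fresh, old = old, never = never)
```

Because the age comes from the modification time, copying or touching the cached file resets the clock even though its contents are unchanged.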
The path to the cached archive, invisibly.
read_infosiga() to import the data, and infosiga_cache_dir()
to locate the cache.
## Not run:
# Pre-fetch the archive into the cache
infosiga_download()

# Force a refresh after a monthly update
infosiga_download(overwrite = TRUE)
## End(Not run)
Downloads (if necessary) and imports one of the three INFOSIGA-SP datasets as a tidy tibble. The source archive is cached locally, so the first call triggers a download and subsequent calls read from disk.
read_infosiga(
  dataset = c("sinistros", "pessoas", "veiculos"),
  clean = TRUE,
  year = NULL,
  download_if_missing = TRUE,
  quiet = FALSE,
  ...
)
dataset |
Which dataset to import. One of:
|
clean |
Logical. If |
year |
Optional integer vector used to filter rows by year of the
crash ( |
download_if_missing |
Logical. If |
quiet |
Logical. If |
... |
Additional arguments passed to |
Source files are encoded in Latin-1 (ISO-8859-1), use ; as the field
separator, , as the decimal mark and DD/MM/YYYY dates. read_infosiga()
handles all of these and returns UTF-8 text, Date columns and numeric
coordinates. Each dataset is distributed across two period files inside the
archive (2015-2021 and 2022 onward); they are read and row-bound
transparently.
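To make these conventions concrete, the sketch below writes two tiny Latin-1 files and reads them back with base R. It is a stand-in for illustration: the rows are made up, the helper names are hypothetical, and the package itself appears to rely on readr rather than `read.csv2()`.

```r
# Write made-up rows in Latin-1, mimicking the source conventions.
write_latin1 <- function(lines, path) {
  writeLines(
    iconv(enc2utf8(lines), from = "UTF-8", to = "latin1"),
    path,
    useBytes = TRUE
  )
}

header <- "id_sinistro;data_sinistro;municipio;latitude"
f1 <- tempfile(fileext = ".csv")  # stands in for the 2015-2021 file
f2 <- tempfile(fileext = ".csv")  # stands in for the 2022-onward file
write_latin1(c(header, "1;05/03/2019;S\u00c3O PAULO;-23,55"), f1)
write_latin1(c(header, "2;17/11/2023;JUNDIA\u00cd;-23,19"), f2)

# Hypothetical reader for one period file.
read_period <- function(path) {
  # read.csv2() defaults to sep = ";" and dec = ",".
  df <- read.csv2(path, encoding = "latin1", stringsAsFactors = FALSE)
  # Convert text to UTF-8 and parse DD/MM/YYYY dates.
  df$municipio <- iconv(df$municipio, from = "latin1", to = "UTF-8")
  df$data_sinistro <- as.Date(df$data_sinistro, format = "%d/%m/%Y")
  df
}

# Read both period files and row-bind them into one data frame.
sinistros <- do.call(rbind, lapply(c(f1, f2), read_period))
str(sinistros)
```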
By default (clean = TRUE) the result is then processed by
clean_infosiga(): text columns are whitespace-trimmed, the
"NAO DISPONIVEL" ("not available") marker becomes NA, ordinal columns
(dia_da_semana, turno, gravidade_lesao, the age bands) become
ordered factors, the ano_mes_* year-month strings are parsed to
first-of-month Dates, the binary tp_sinistro_* crash-type flags become
logical, tempo_sinistro_obito becomes integer, and
latitude/longitude values outside the bounding box of Sao Paulo state
are set to NA. See clean_infosiga() for the complete, ordered list.
Pass clean = FALSE to obtain the raw data exactly as published – every
text column kept as a character vector, with "NAO DISPONIVEL" and the
source's fixed-width whitespace padding preserved verbatim.
A small fraction of rows in the source contain data-quality issues (for
example, an unescaped ; inside a street name, or mis-encoded coordinates).
Any value that cannot be parsed to its declared column type is set to NA
and recorded by readr::problems(). Empty fields are read as NA in both
modes. In the raw data (clean = FALSE) the crash-type flag columns
(tp_sinistro_*) hold "S" when the flag applies and NA otherwise; with
clean = TRUE they are converted to logical.
A tibble with one row per record. The columns
keep the original INFOSIGA-SP names (in Portuguese); see the official data
dictionary, available via infosiga_dictionary(). The three datasets can be
joined on id_sinistro (and id_veiculo, where present).
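A join along these keys can be sketched with base R `merge()` on toy data frames. The rows are made up, and the presence of id_veiculo in the victims table as well as the column `tipo_veiculo` are assumptions for this example; check the data dictionary for the actual columns.

```r
# Toy tables keyed the way the text describes (made-up values).
sinistros <- data.frame(
  id_sinistro = c(1, 2),
  municipio = c("SAO PAULO", "CAMPINAS")
)
pessoas <- data.frame(
  id_sinistro = c(1, 1, 2),
  id_veiculo = c(10, 11, 20),
  gravidade_lesao = c("LEVE", "GRAVE", "FATAL")
)
veiculos <- data.frame(
  id_sinistro = c(1, 1, 2),
  id_veiculo = c(10, 11, 20),
  tipo_veiculo = c("AUTOMOVEL", "MOTOCICLETA", "AUTOMOVEL")  # assumed column
)

# One row per victim, with the attributes of the crash attached.
vitimas <- merge(pessoas, sinistros, by = "id_sinistro", all.x = TRUE)

# One row per victim, with the attributes of their vehicle attached.
vitimas_veiculos <- merge(
  pessoas, veiculos,
  by = c("id_sinistro", "id_veiculo"),
  all.x = TRUE
)

vitimas
vitimas_veiculos
```

Using `all.x = TRUE` keeps every victim even when the matching crash or vehicle record is missing, so unmatched rows show up as NA instead of disappearing.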
infosiga_download(), infosiga_cache_dir(),
infosiga_dictionary().
## Not run:
# Import all crash events, processed (downloads the archive on first use)
sinistros <- read_infosiga("sinistros")
levels(sinistros$dia_da_semana)

# Only victims from 2022 and 2023
vitimas <- read_infosiga("pessoas", year = 2022:2023)

# The raw data, exactly as published
raw <- read_infosiga("sinistros", clean = FALSE)
## End(Not run)

# A bundled sample (no download required) illustrates the structure:
sample_path <- system.file(
  "extdata", "sinistros_sample.csv", package = "infosigasp"
)
if (nzchar(sample_path)) head(readr::read_delim(sample_path, ";"))