3  Check Column Consistency

3.1 Problem description

The idea behind this was simply to bind rows of every spreadsheet and check for warnings and errors. If column types were different errors had to be risen and formats would be subject of warnings.

The first step of checking the spreadsheets consisted in trying to read all of them and binding their rows. By doing this, if the column type was not the same, we then had to set the columns with the same type. Also, by reading each spreadsheet, read_excel may warns us if something is wrong.

3.2 Problem solving

We defined for each column over the 5 spreadsheets which type all of them were. We populated a table directly in Excel and then we read using the following commands:

Code
col_types <- list()

path <- "support/column_types.xlsx"

sheet_names <- readxl::excel_sheets(path)

for (i in 1:length(sheet_names)) {
  col_types[[sheet_names[i]]] <- readxl::read_excel(path = path, sheet = i)
}

# Showing an example of the output
col_types[1]
$Underpasses
# A tibble: 17 × 2
   Column              Type   
   <chr>               <chr>  
 1 Infrastructure_type text   
 2 Structure_ID        text   
 3 Structure_type      text   
 4 Structure_cell      text   
 5 Structure_shape     text   
 6 Structure_photo     text   
 7 Structure_age       numeric
 8 Structure_height    numeric
 9 Structure_length    numeric
10 Structure_width     numeric
11 Waterbody_width     numeric
12 Latitude            numeric
13 Longitude           numeric
14 UTM Zone            text   
15 X (Easting)         numeric
16 Y (Northing)        numeric
17 Datum               text   

The following process has to be repeated for all the spreadsheets, however here we illustrate using the “Underpasses” spreadsheet. In order to collect the messages provided by R we adapted the read_excel function with a purrr::quietly function, that is more appropriate to show warnings and messages for each file/author. We enchain the customized read_sheet function that provides the full paths of all .xlsx files available to the purrr::quietly function.

In the sequence, we are able to collect the warnings (in this case, under_warns), check and correct them directly on the Excel files. The majority of the errors are misspelled text, decimal markers, date and time separators and so on…

After we zeroing the warnings, we are able to check if R successfully bind the rows from each author. If no warnings or errors are raised, it means that we reached our goal.

Code
#Underpasses ----
source("R/FUNCTIONS.R")

under_q <- purrr::quietly(function(file) {
  readxl::read_excel(
    file,
    col_types = col_types[["Underpasses"]]$Type,
    sheet = "Underpasses",
    na = c("NA", "na"),
    col_names = TRUE
  ) |>
    janitor::remove_empty("rows")
})

under_all_outputs <- read_sheet(path = "Example", results = FALSE) |>
  purrr::map(under_q)

under_warns <- under_all_outputs |>
  purrr::map(\(x) purrr::pluck(x, "warnings")) |>
  purrr::compact()

under_results <- under_all_outputs |>
  purrr::map(\(x) purrr::pluck(x, "result")) |>
  dplyr::bind_rows(.id = "Dataset")

under_results |>
  print(n = 10)
# A tibble: 294 × 19
   Dataset  Infrastructure_type Structure_ID    Structure_type    Structure_cell
   <chr>    <chr>               <chr>           <chr>             <chr>         
 1 Example1 Rodovia             P1 (iguaçu)     Ponte             <NA>          
 2 Example1 Rodovia             <NA>            Bueiro de concre… <NA>          
 3 Example1 Rodovia             P2 (mauricio)   Ponte             <NA>          
 4 Example1 Rodovia             BC2 (drenagem)  Bueiro de concre… <NA>          
 5 Example1 Rodovia             BCS1            Bueiro com plata… <NA>          
 6 Example1 Rodovia             P3 (varzea)     Ponte             <NA>          
 7 Example1 Rodovia             P4 (fazenda)    Ponte             <NA>          
 8 Example1 Rodovia             P5 (sapezal)    Ponte             <NA>          
 9 Example1 Rodovia             P6 (passa três) Ponte             <NA>          
10 Example1 Rodovia             P7 (lourenço)   Ponte             <NA>          
# ℹ 284 more rows
# ℹ 14 more variables: Structure_shape <chr>, Structure_photo <chr>,
#   Structure_age <dbl>, Structure_height <dbl>, Structure_length <dbl>,
#   Structure_width <dbl>, Waterbody_width <dbl>, Latitude <dbl>,
#   Longitude <dbl>, `UTM Zone` <chr>, `X (Easting)` <dbl>,
#   `Y (Northing)` <dbl>, Datum <chr>, structure_length <dbl>