2  Creating custom functions

2.1 Presentation

This project utilizes a set of custom, reusable functions that are shared across multiple data quality scripts. The purpose of this approach is to build a more robust and efficient validation framework.

Key benefits include:

  • Consistency: Guarantees that the same validation logic is applied uniformly in every check.

  • Maintainability: Allows us to update or fix logic in a single location, with changes automatically applied everywhere the function is used.

  • Efficiency: Avoids code duplication and speeds up the development of new checks by leveraging existing, tested code.

2.2 Functions

2.2.1 read_sheet

The function is used in the following chapters: 3 - Check Column Consistency, 4 - Check Blank Cells, 5 - Check Species Validity, 6 - Check Structures, 7 - Check ID on Camera, 8 - Check Date Consistency, 9 - Check Duplicated Camera on Structure, 10 - Check Measurement Columns, 11 - Check Coordinates Parameters, 12 - Check Structure Location, 13 - Check Species Record Criteria, 14 - Check Photos.

One of the first steps of our data paper was to create functions that are common to several of the checks we perform.

The primary function handles reading the Excel files and the spreadsheets containing the data. Since some checks need only the names and paths of the datasets, not their contents, we created a single function covering both situations. It is called read_sheet.

The function has the following arguments:

  • path: the folder where the Excel files are stored. The default is "Excel", as this is the folder used in this project.

  • sheet: the spreadsheet (tab) to read. The default is NULL because we may not want the full results, in which case there is no need to read any spreadsheet.

  • na: defines which strings are read as NA. The default is "".

  • results: whether to read the files or to return only the paths of the datasets, as needed by some customized functions (see Column Consistency). The default is TRUE.

  • set_col_types: when TRUE we force R to read the spreadsheets with specific column types (taken from a support Excel file, column_types.xlsx); when we need R to read all contents as plain text, we set it to FALSE and every column is read as "text". The default is TRUE.

  • recurse: whether to search for Excel files recursively inside the folder. The default is TRUE.

Code
read_sheet <- function(
  path = "Excel",
  sheet = NULL,
  na = "",
  results = TRUE,
  set_col_types = TRUE,
  recurse = TRUE
) {
  excel <- list.files(
    path = path,
    pattern = "^\\w.+xlsx$",
    full.names = TRUE,
    recursive = recurse
  )

  names <- excel |>
    stringr::str_split("/|\\.") |>
    purrr::map_vec(\(x) dplyr::nth(x, -2))

  load <- excel |>
    purrr::set_names(names)

  if (!results) {
    return(load)
  }

  if (set_col_types) {
    column_types <- set_column_types(sheet = sheet)
  } else {
    column_types <- "text"
  }

  result <- load |>
    purrr::map(function(file) {
      df <- withCallingHandlers(
        readxl::read_xlsx(
          path = file,
          sheet = sheet,
          na = na,
          col_names = TRUE,
          col_types = column_types
        )
      ) |>
        janitor::remove_empty("rows")

      names(df) <- df |>
        janitor::clean_names() |>
        colnames() |>
        stringr::str_to_sentence()

      return(df)
    })

  return(result)
}

To read the spreadsheets we first list the files in the folder, keeping only Excel files. We ask R to return the full names, which include the full path of each file, and we can optionally search subfolders as well (the recurse argument).

The next step splits each full path into pieces and takes the second-to-last element, which is the file name without its extension and serves as the name of the dataset (in our files, named after the dataset's main author). We then name every full path with its dataset name. If we are reading the Excel files from a customized function (results = FALSE), we stop here and return the named paths. Otherwise, we read every Excel file in the list, considering the sheet informed. In the results, column names follow a specific format: an initial capital letter followed by lowercase letters, with underscores separating words.
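To make the name extraction concrete, here is the splitting step applied to two hypothetical paths (the folder layout and file names are invented for illustration):

```r
# Hypothetical paths following the "Excel" folder layout
paths <- c("Excel/Silva/Silva_data.xlsx", "Excel/Souza/Souza_data.xlsx")

dataset_names <- paths |>
  stringr::str_split("/|\\.") |>             # split on "/" and "."
  purrr::map_vec(\(x) dplyr::nth(x, -2))     # second-to-last piece = file stem

dataset_names
# "Silva_data" "Souza_data"
```

Because the file stem is taken from the end of the path, the extraction works the same whether or not the files sit in subfolders.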


2.2.2 unique_id

The function is used in the following chapters: 9 - Check Duplicated Camera on Structure.

The function takes two arguments:

  • x: a data frame from the Camera_trap sheet.

  • sep: a string (default "_") used to separate the original Camera_id from an appended letter when duplicates are found.

Internally, it adds a helper column rowid to preserve the original row order, then groups the data by Structure_id and Camera_id and counts how many times each pair appears, storing the result in a column called double. Within each group it saves the original Camera_id as Camera_id_orig, assigns a sequential index Dup_id, and constructs Dup_form_name: if double equals 1 it keeps the original Camera_id; otherwise it appends the separator and a letter from LETTERS based on Dup_id. After ungrouping, it selects the helper columns (rowid, Camera_id_orig, Dup_form_name, double) and left-joins them back onto the original data by rowid, then updates Camera_id to Dup_form_name when present, relocates Camera_id_orig immediately after Camera_id, and drops the temporary columns.

The practical effect is to ensure that, within each Structure_id/Camera_id group, any repeated camera IDs become unique by appending letters, while preserving the original ID in Camera_id_orig and returning the full updated data frame.

Code
unique_id <- function(x, sep = "_") {
  x_with_id <- x |>
    tibble::rowid_to_column()

  x_with_id |>
    dplyr::group_by(Structure_id, Camera_id) |>
    dplyr::add_count(Structure_id, Camera_id, name = "double") |>
    dplyr::mutate(
      Camera_id_orig = Camera_id,
      Dup_id = dplyr::row_number(Camera_id),
      Dup_form_name = dplyr::if_else(
        condition = double == 1,
        true = Camera_id,
        false = stringr::str_c(Camera_id, sep, LETTERS[Dup_id])
      )
    ) |>
    dplyr::ungroup() |>
    dplyr::select(rowid, Camera_id_orig, Dup_form_name, double) |>
    dplyr::left_join(x_with_id, ., by = "rowid") |>
    dplyr::mutate(
      Camera_id = dplyr::if_else(
        !is.na(Dup_form_name),
        Dup_form_name,
        Camera_id
      )
    ) |>
    dplyr::relocate(Camera_id_orig, .after = Camera_id) |>
    dplyr::select(-Dup_form_name, -rowid)
}
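As a minimal sketch of the renaming rule (rows invented for illustration; the suffix logic mirrors the function above without the rowid bookkeeping):

```r
x <- tibble::tibble(
  Structure_id = c("S1", "S1", "S1"),
  Camera_id = c("CAM01", "CAM01", "CAM02")
)

renamed <- x |>
  dplyr::group_by(Structure_id, Camera_id) |>
  dplyr::mutate(
    double = dplyr::n(),
    Camera_id_new = dplyr::if_else(
      double == 1,
      Camera_id,
      stringr::str_c(Camera_id, "_", LETTERS[dplyr::row_number()])
    )
  ) |>
  dplyr::ungroup()

renamed$Camera_id_new
# "CAM01_A" "CAM01_B" "CAM02"
```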

2.2.3 dttm_update

The function is used in the following chapters: 7 - Check ID on Camera, 8 - Check Date Consistency, 9 - Check Duplicated Camera on Structure, 13 - Check Species Record Criteria.

The function takes three arguments:

  • x: a data frame containing date and time columns.

  • date_col: the name (string) of the column that stores the date part.

  • time_col: the name (string) of the column that stores the time part.

Internally, we update the original date column using lubridate's internal update_datetime() function (accessed with :::), combining the existing date with the hours, minutes, and seconds extracted from the time column via lubridate::hour(), lubridate::minute(), and lubridate::second().

The practical effect is to merge the date information from one column with the time information from another, producing a complete datetime object in the specified date column and returning the updated data frame.

Code
dttm_update <- function(x, date_col, time_col) {
  date_sym <- rlang::sym(date_col)
  time_sym <- rlang::sym(time_col)

  x |>
    dplyr::mutate(
      !!date_sym := lubridate:::update_datetime(
        !!date_sym,
        hour = lubridate::hour(!!time_sym),
        minute = lubridate::minute(!!time_sym),
        second = lubridate::second(!!time_sym)
      )
    )
}
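Since update_datetime() is not exported by lubridate, the same combination can be sketched with the exported update() method (values invented for illustration):

```r
library(lubridate)

date_part <- ymd_hms("2021-05-10 00:00:00", tz = "UTC")
time_part <- hms("14:30:05")  # a period holding only the time of day

# Set the time-of-day fields of the datetime from the time column
combined <- update(
  date_part,
  hour = hour(time_part),
  minute = minute(time_part),
  second = second(time_part)
)

format(combined, "%Y-%m-%d %H:%M:%S")
# "2021-05-10 14:30:05"
```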

2.2.4 add_epsg

The function is used in the following chapters: 12 - Check Structure Location.

The function takes a single argument:

  • x: a data frame coming from the structures spreadsheets (overpasses, underpasses).

Since researchers provided their structures' locations in several different geographic coordinate systems, we must standardize them before performing some of the tests. We chose EPSG codes because the sf package handles them well and they are easier to type and maintain than long projection strings. We looked up on epsg.io the code for every combination of coordinate system and datum that appeared in our datasets and encoded the numbers in the function.

Code
add_epsg <- function(x) {
  result <- x |>
    dplyr::mutate(
      epsg = dplyr::case_when(
        # geodetic
        type == "Geodetic" & Datum == "WGS84" ~ 4326L,
        type == "Geodetic" & Datum == "SIRGAS2000" ~ 4674L,
        type == "Geodetic" & Datum == "Corrego_Alegre" ~ 5524L,
        type == "Geodetic" & Datum == "SAD69" ~ 4618L,
        # projected WGS84 / UTM
        type == "Projected" & Datum == "WGS84" & hemis == "N" ~
          32600L + as.integer(zone),
        type == "Projected" & Datum == "WGS84" & hemis == "S" ~
          32700L + as.integer(zone),
        # projected SIRGAS2000 / UTM (south)
        type == "Projected" & Datum == "SIRGAS2000" & hemis == "S" ~
          31960L + as.integer(zone),
        # any other combination
        TRUE ~ NA_integer_
      )
    )

  return(result)
}
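As a quick sanity check of the arithmetic (rows invented; the case_when mirrors the function body): WGS 84 / UTM zone 23S is EPSG:32723 and SIRGAS 2000 / UTM zone 22S is EPSG:31982.

```r
epsg_demo <- tibble::tibble(
  type = c("Geodetic", "Projected", "Projected"),
  Datum = c("WGS84", "WGS84", "SIRGAS2000"),
  hemis = c(NA, "S", "S"),
  zone = c(NA, "23", "22")
) |>
  dplyr::mutate(
    epsg = dplyr::case_when(
      type == "Geodetic" & Datum == "WGS84" ~ 4326L,
      type == "Projected" & Datum == "WGS84" & hemis == "S" ~
        32700L + as.integer(zone),
      type == "Projected" & Datum == "SIRGAS2000" & hemis == "S" ~
        31960L + as.integer(zone),
      TRUE ~ NA_integer_
    )
  )

epsg_demo$epsg
# 4326 32723 31982
```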

2.2.5 set_feature_from_infrastructure

The function is used in the following chapters: 12 - Check Structure Location.

The function takes a single argument:

  • x: a data frame containing the infrastructure type field to be converted to OpenStreetMap (OSM) features.

Each dataset recorded what kind of infrastructure (pipelines, railways, or roads) its structures monitor. Since the infrastructure names were written in different languages and do not match the feature names used to query OSM databases, we developed this function. It simply creates a feature column in the data frame, turning blank values into NA (this field is mandatory) and mapping the remaining infrastructure types to "man_made" (pipelines), "railway" (any rail infrastructure), and "highway" (any road infrastructure).

Code
set_feature_from_infrastructure <- function(x) {
  result <- x |>
    dplyr::mutate(
      feature = dplyr::case_when(
        is.na(Infrastructure_type) ~ NA_character_,
        Infrastructure_type %in% c("Ducto", "Gasoduto") ~ "man_made",
        stringr::str_detect(Infrastructure_type, "Ferro") ~ "railway",
        TRUE ~ "highway"
      )
    )

  return(result)
}
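The classification rules can be tried on a few invented labels (the case_when below reproduces the function body on sample data):

```r
feature_demo <- tibble::tibble(
  Infrastructure_type = c("Gasoduto", "Ferrovia", "Rodovia", NA)
) |>
  dplyr::mutate(
    feature = dplyr::case_when(
      is.na(Infrastructure_type) ~ NA_character_,
      Infrastructure_type %in% c("Ducto", "Gasoduto") ~ "man_made",
      stringr::str_detect(Infrastructure_type, "Ferro") ~ "railway",
      TRUE ~ "highway"
    )
  )

feature_demo$feature
# "man_made" "railway" "highway" NA
```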

2.2.6 calc_nearest_osm_dist

The function is used in the following chapters: 12 - Check Structure Location.

This function takes the following arguments:

  • x: a simple features (sf) data frame containing the geographic points (point geometry) to be analyzed.

  • feature: The type of infrastructure to search for on OSM. Accepted values are “highway”, “railway”, or “man_made” (for pipelines). Defaults to “highway”.

  • crs_metric: The EPSG code for a projected coordinate reference system (CRS) in meters, used to ensure accurate distance calculations. Defaults to 3857 (Web Mercator).

  • buffer: The distance in meters used to create a search area (buffer) around the input points, which optimizes the query to OSM. Defaults to 1000 meters.

  • thresh: A distance threshold in meters to flag points that are considered “far” from the nearest infrastructure feature. Defaults to 50 meters.

It is often necessary to verify that the monitoring points of a structure are correctly located relative to the infrastructure they are supposed to be monitoring in the real world. For instance, a wildlife crossing structure on a specific highway should indeed be close to that highway on a reference map. This function automates the verification using OpenStreetMap (OSM) as the geospatial database.

The function calculates the distance from each input point to the nearest infrastructure feature found on OSM. It first defines a search area (a “buffer”) around the set of points to make the OSM query more efficient. Next, it searches for the line features (roads, railways, etc.) within that area, calculates the shortest distance in meters from each point to one of these lines, and finally enriches the original data frame with the calculated distance and metadata from the nearest OSM feature (such as its ID, name, and type).

Code
calc_nearest_osm_dist <- function(
  x,
  feature = c("highway", "railway", "man_made"),
  crs_metric = 3857,
  buffer = 1000,
  thresh = 50
) {
  feature <- match.arg(feature)

  # 1) Define a bbox with a buffer so we do not query the whole world
  bb <- x |>
    sf::st_transform(crs_metric) |>
    sf::st_bbox() |>
    sf::st_as_sfc() |>
    sf::st_buffer(buffer) |>
    sf::st_transform(4326) |>
    sf::st_bbox()

  # 2) Query OSM for lines of the chosen type
  OSM_query <- osmdata::opq(bbox = bb) |>
    osmdata::add_osm_feature(
      key = feature,
      # for highways: motorway, primary, etc.
      value = if (feature == "highway") {
        c(
          "motorway",
          "trunk",
          "primary",
          "secondary",
          "tertiary",
          "unclassified",
          "residential"
        )
      } else if (feature == "railway") {
        c("rail", "narrow_gauge", "disused", "abandoned")
      } else {
        c("pipeline", "goods_conveyor")
      }
    )

  OSM_lines <- OSM_query |>
    osmdata::osmdata_sf() |>
    purrr::pluck("osm_lines")

  if (is.null(OSM_lines)) {
    cli::cli_alert("There are no features within the buffer")
    return(NULL)
  }

  OSM_lines_sf <- OSM_lines |>
    tibble::rownames_to_column("id_OSM") |>
    tibble::rowid_to_column("rowid") |>
    sf::st_as_sf()

  # 3) Transform everything to a metric CRS (Web Mercator or a local UTM)
  pts_m <- sf::st_transform(x, crs_metric)
  lines_m <- sf::st_transform(OSM_lines_sf, crs_metric)

  # 4) For each point, find the index of the nearest line
  idx_nearest <- sf::st_nearest_feature(pts_m, lines_m)

  df <- x |>
    dplyr::mutate(idx_nearest = idx_nearest) |>
    dplyr::inner_join(
      OSM_lines_sf |> sf::st_drop_geometry(),
      by = c("idx_nearest" = "rowid")
    )

  # 5) Compute the distance "by element": each point to its nearest line
  dists <- sf::st_distance(pts_m, lines_m[idx_nearest, ], by_element = TRUE)

  # 6) Return the original df with the distance in meters appended
  final_data <- df |>
    dplyr::mutate(
      distance_to = as.numeric(dists),
      out_thresh = distance_to > thresh
    ) |>
    dplyr::select(
      dplyr::any_of(c(
        "Dataset",
        "Infrastructure_type",
        "Structure_id",
        "id_OSM",
        feature,
        "name",
        "source",
        "feature",
        "surface",
        "distance_to",
        "out_thresh"
      ))
    ) |>
    dplyr::rename(feature_type = !!feature)

  # Assemble the S3 result object
  result <- list(
    data = final_data,

    bbox_buffer = bb |>
      sf::st_as_sfc() |>
      tibble::enframe("id", "geometry") |>
      sf::st_as_sf(),

    OSM_lines = OSM_lines_sf |>
      dplyr::filter(rowid %in% unique(idx_nearest))
  )
  class(result) <- "nearest_OSM_dist"
  return(result)
}
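The nearest-feature and by-element distance steps can be sketched with synthetic geometries, independently of any OSM query (coordinates invented; a metric CRS is used so distances come out in meters):

```r
library(sf)

# Two points and two vertical lines in Web Mercator (EPSG:3857)
pts <- st_sfc(st_point(c(0, 0)), st_point(c(100, 0)), crs = 3857)
lines <- st_sfc(
  st_linestring(rbind(c(10, -50), c(10, 50))),    # line at x = 10
  st_linestring(rbind(c(120, -50), c(120, 50))),  # line at x = 120
  crs = 3857
)

# Index of the nearest line for each point, then element-wise distances
idx <- st_nearest_feature(pts, lines)
dists <- st_distance(pts, lines[idx], by_element = TRUE)

as.numeric(dists)
# 10 20
```

The first point is 10 m from the line at x = 10; the second point is closer to the line at x = 120 (20 m) than to the one at x = 10 (90 m), so the nearest index differs per point.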

2.2.7 validate_source

The function is used in the following chapters: 14 - Check Photos.

This function takes the following arguments:

  • source: the media source to be processed. Accepted values are “ct”, “under”, or “over”.

  • allowed: a character vector listing valid source values. Defaults to c("ct", "under", "over").

It performs a simple validation step to ensure that the requested source is one of the allowed options. If an invalid value is supplied, the function aborts immediately with a clear message. This early guardrail prevents the matching pipeline from running with an unsupported source and keeps downstream functions from handling unexpected inputs.

Code
validate_source <- function(source, allowed = c("ct", "under", "over")) {
  if (length(source) != 1 || !source %in% allowed) {
    cli::cli_abort(
      "Source {source} is not an accepted value. Accepted values are {allowed}."
    )
  }
}

2.2.8 load_source_data

The function is used in the following chapters: 14 - Check Photos.

The function takes a single argument:

  • source: the short code identifying which Excel sheet to load (“ct”, “under”, or “over”).

It maps the requested source to the correct Excel sheet name using dplyr::case_match and then calls read_sheet with recurse = FALSE to load the spreadsheet. If the source is not mapped, the function aborts. By centralizing the mapping and enforcing non-recursive reads, later steps can rely on consistent inputs for each run.

Code
load_source_data <- function(source) {
  source <- as.character(source)

  sheet_name <- dplyr::case_match(
    source,
    "ct" ~ "Camera_trap",
    "under" ~ "Underpasses",
    "over" ~ "Overpasses"
  )

  if (is.na(sheet_name)) {
    cli::cli_abort("Source {source} is not mapped to a sheet.")
  }

  read_sheet(
    path = "Example/12",
    recurse = FALSE,
    sheet = sheet_name,
    na = c("NA", "-")
  )
}

2.2.9 ensure_dataset_names_match

The function is used in the following chapters: 14 - Check Photos.

This function takes the following arguments:

  • datasets: the list of data frames loaded from the Excel sheets.

  • media_list: a named list of media inventories split by dataset.

  • first_take: a logical flag (default TRUE) that enforces strict name matching only on the first run.

It compares the dataset names present in the Excel sheets with the names discovered in the media folders. If first_take is TRUE and any dataset exists on one side but not the other, the function aborts. This guard ensures the matching logic only proceeds when both the spreadsheets and disk inventory refer to the same set of datasets, while allowing later runs to continue after partial processing.

Code
ensure_dataset_names_match <- function(
  datasets,
  media_list,
  first_take = TRUE
) {
  datasets_not_in_common <- dplyr::setdiff(names(datasets), names(media_list))

  if (first_take == TRUE) {
    if (length(datasets_not_in_common) != 0) {
      cli::cli_abort(
        "There are different datasets between media folder and Excel sheets."
      )
    }
  }
}

2.2.10 extract_filenames_on_sheet

The function is used in the following chapters: 14 - Check Photos.

This function takes the following arguments:

  • df: a data frame for a single dataset loaded from the Excel sheet.

  • source: the media source (“ct”, “under”, or “over”) that determines which filename column to read.

It selects the appropriate filename column (Camera_vision_photo for camera traps, Structure_photo for under/overpasses), removes missing values, keeps distinct entries, and returns a vector of expected filenames. Centralizing this extraction keeps the rest of the matching code agnostic to column naming differences between sources.

Code
extract_filenames_on_sheet <- function(df, source) {
  column <- if (source == "ct") "Camera_vision_photo" else "Structure_photo"
  col_sym <- rlang::sym(column)

  df |>
    dplyr::distinct(dplyr::across(dplyr::all_of(column))) |>
    dplyr::filter(!is.na(!!col_sym)) |>
    dplyr::pull(!!col_sym)
}
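For example, on a toy sheet with a duplicate and a missing entry (file names invented), the same steps yield the distinct non-missing names:

```r
df <- tibble::tibble(
  Structure_photo = c("bridge_01.jpg", NA, "bridge_01.jpg", "bridge_02.jpg")
)

# same extraction steps as extract_filenames_on_sheet(df, source = "over")
col_sym <- rlang::sym("Structure_photo")

expected_files <- df |>
  dplyr::distinct(dplyr::across(dplyr::all_of("Structure_photo"))) |>
  dplyr::filter(!is.na(!!col_sym)) |>
  dplyr::pull(!!col_sym)

expected_files
# "bridge_01.jpg" "bridge_02.jpg"
```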

2.2.11 media_candidates

The function is used in the following chapters: 14 - Check Photos.

This function takes the following arguments:

  • dataset_media: a tibble of media files for a single dataset.

  • source: the target source folder being processed.

It filters out any files that already live under the source-specific subfolder, returning only media that still need to be evaluated. This avoids re-processing or re-copying files that were placed correctly in previous runs.

Code
media_candidates <- function(dataset_media, source) {
  dataset_media |>
    dplyr::filter(!stringr::str_detect(value, glue::glue("\\/{source}\\/")))
}

2.2.12 stringdist_table

The function is used in the following chapters: 14 - Check Photos.

The function takes the following arguments:

  • filenames_on_sheet: a character vector of expected filenames from the Excel sheet.

  • media: a character vector of candidate filenames found on disk.

It computes a Levenshtein distance matrix between sheet filenames and media filenames using stringdist::stringdistmatrix, converts it to a tibble, and flags rows where any distance is zero as match_exactly. This structured table feeds the candidate-building logic for exact and near matches.

Code
stringdist_table <- function(filenames_on_sheet, media) {
  stringdist::stringdistmatrix(
    filenames_on_sheet,
    media,
    method = "lv",
    useNames = "strings"
  ) |>
    as.data.frame() |>
    tibble::rownames_to_column("sheet") |>
    dplyr::as_tibble() |>
    dplyr::mutate(
      match_exactly = dplyr::if_any(dplyr::where(is.numeric), ~ . == 0)
    )
}
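The Levenshtein counts behind the table can be illustrated with base R's utils::adist(), which computes the same edit distances (file names invented):

```r
# One sheet name against three candidate files on disk
d <- utils::adist(
  "IMG_0001.jpg",
  c("IMG_0001.jpg", "img_0001.jpg", "IMG_0002.jpg")
)

as.integer(d)
# 0 3 1  (exact match; three case changes; one substituted digit)
```

A distance of 0 corresponds to the match_exactly flag; small positive distances (1 to 5) feed the partial-match heuristic later on.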

2.2.13 build_match_candidates

The function is used in the following chapters: 14 - Check Photos.

The function takes a single argument:

  • df_stringdist: the tibble returned by stringdist_table.

It pivots the distance matrix to long format, computes heuristic flags (same name without extension, same name ignoring capitalization, and partial matches with distance between 1 and 5), orders the results, and returns a candidate table ready for deduplication. This step enriches the raw distances with simple, interpretable cues.

Code
build_match_candidates <- function(df_stringdist) {
  df_stringdist |>
    tidyr::pivot_longer(
      cols = -c(sheet, match_exactly),
      names_to = "file",
      values_to = "stringdist"
    ) |>
    dplyr::mutate(
      match_file_no_extension = sheet == tools::file_path_sans_ext(file),
      match_file_diff_capitalization = stringr::str_to_upper(sheet) ==
        stringr::str_to_upper(file),
      match_partially = dplyr::if_any(
        dplyr::where(is.numeric),
        ~ dplyr::between(., 1, 5)
      )
    ) |>
    dplyr::relocate(match_exactly, .after = stringdist) |>
    dplyr::arrange(dplyr::desc(match_file_no_extension), file)
}

2.2.14 dedupe_matches

The function is used in the following chapters: 14 - Check Photos.

The function takes the following arguments:

  • match_candidates: the long-form table produced by build_match_candidates.

  • files_to_exclude: an optional character vector of filenames to drop from consideration.

  • processed_files: an optional character vector of sheet entries already handled in a previous run (for example, files already placed under the current source for the dataset).

It first removes excluded files, then identifies sheet/file pairs whose names match exactly once the extension is stripped or once capitalization is ignored. Using those signals, it marks ambiguous rows for removal so that only the most plausible pairs remain; entries listed in processed_files are dropped as well. The output keeps the candidate set clean by eliminating duplicate or already-handled mappings that could mislead later steps.

Code
dedupe_matches <- function(
  match_candidates,
  files_to_exclude = character(),
  processed_files = character()
) {
  filtered_candidates <- match_candidates |>
    dplyr::filter(!file %in% files_to_exclude)

  dup_sheet_file <- filtered_candidates |>
    dplyr::filter(
      match_file_no_extension == TRUE |
        match_file_diff_capitalization == TRUE
    ) |>
    dplyr::select(sheet, file)

  filtered_candidates |>
    dplyr::mutate(
      keep = dplyr::case_when(
        sheet %in% processed_files ~ "REMOVE",
        match_file_no_extension == FALSE &
          match_file_diff_capitalization == FALSE &
          (sheet %in% dup_sheet_file$sheet |
            file %in% dup_sheet_file$file) ~
          "REMOVE",
        TRUE ~ "KEEP"
      )
    ) |>
    dplyr::filter(keep == "KEEP") |>
    dplyr::select(-keep)
}

2.2.15 copy_exact_matches

The function is used in the following chapters: 14 - Check Photos.

The function takes the following arguments:

  • df_stringdist: the distance table from stringdist_table.

  • media_without_source: a tibble of media files that are not yet in the source-specific folder.

  • target_dir: the destination directory where exact matches should be copied.

It filters for exact matches, ensures the target directory exists, and copies the matched files to their destination while preserving timestamps. The function returns a tibble listing the files moved and their full paths; if no exact matches exist, it returns an empty tibble so that the calling code can continue safely.

Code
copy_exact_matches <- function(
  df_stringdist,
  media_without_source,
  target_dir
) {
  media_match_exactly <- df_stringdist |>
    dplyr::filter(match_exactly == TRUE)

  if (nrow(media_match_exactly) == 0) {
    return(tibble::tibble(file = character(), full_path_to_copy = character()))
  }

  if (!dir.exists(target_dir)) {
    dir.create(target_dir, recursive = TRUE)
  }

  files_found_in_sheet <- df_stringdist |>
    dplyr::filter(match_exactly == TRUE) |>
    dplyr::pull(sheet) |>
    tibble::enframe(value = "file") |>
    dplyr::inner_join(media_without_source, by = "file") |>
    dplyr::mutate(full_path_to_copy = glue::glue("{target_dir}/{file}"))

  file.copy(
    from = files_found_in_sheet$value,
    to = files_found_in_sheet$full_path_to_copy,
    overwrite = FALSE,
    copy.date = TRUE
  )

  files_found_in_sheet
}
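The create-then-copy pattern used above can be exercised in isolation with a temporary directory (paths invented for illustration):

```r
# Set up a fake media folder with one file
src_dir <- file.path(tempdir(), "media_demo")
target_dir <- file.path(src_dir, "ct")
dir.create(src_dir, showWarnings = FALSE)

src_file <- file.path(src_dir, "IMG_0001.jpg")
file.create(src_file)

# Ensure the destination exists, then copy preserving timestamps
if (!dir.exists(target_dir)) {
  dir.create(target_dir, recursive = TRUE)
}

file.copy(
  from = src_file,
  to = file.path(target_dir, "IMG_0001.jpg"),
  overwrite = FALSE,
  copy.date = TRUE
)

file.exists(file.path(target_dir, "IMG_0001.jpg"))
# TRUE
```

With overwrite = FALSE, repeated runs are safe: files already present at the destination are simply skipped.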

2.2.16 process_dataset

The function is used in the following chapters: 14 - Check Photos.

The function takes the following arguments:

  • df: a data frame for a single dataset coming from the Excel sheet.

  • dataset_name: the dataset identifier used to access media in media_list.

  • source: the source being processed (“ct”, “under”, or “over”).

  • media_list: the named list containing all media inventories split by dataset.

It orchestrates the per-dataset workflow: extract expected filenames, gather candidate media, compute string distances, copy exact matches, build candidates, and deduplicate them while excluding already moved files. If a dataset has no filenames or no candidate media, it returns an empty tibble so the caller can skip it cleanly.

Code
process_dataset <- function(df, dataset_name, source, media_list) {
  cli::cli_alert_info("Processing dataset {dataset_name}.")

  filenames_on_sheet <- extract_filenames_on_sheet(df, source)

  if (length(filenames_on_sheet) == 0) {
    return(tibble::tibble())
  }

  media_without_source <- media_candidates(media_list[[dataset_name]], source)

  if (nrow(media_without_source) == 0) {
    return(tibble::tibble())
  }

  media <- media_without_source |>
    dplyr::distinct(file) |>
    dplyr::pull(file)

  df_stringdist <- stringdist_table(filenames_on_sheet, media)

  files_copied <- copy_exact_matches(
    df_stringdist,
    media_without_source,
    glue::glue("Example/Media/{dataset_name}/{source}")
  )

  match_candidates <- build_match_candidates(df_stringdist)

  dedupe_matches(match_candidates, files_to_exclude = files_copied$file)
}

2.2.17 check_match_media

The function is used in the following chapters: 14 - Check Photos.

The function takes the following arguments:

  • source: the source to check (“ct”, “under”, or “over”).

  • first_take: a logical flag (default TRUE) passed to ensure_dataset_names_match to enforce strict name checks on the initial run.

It validates the source, loads the corresponding sheet, confirms dataset names (optionally strict on the first run), filters out empty datasets, and then processes each dataset with process_dataset. Non-empty results are bound together and tagged with the source, producing a consolidated tibble of match candidates for that source.

Code
check_match_media <- function(source = NULL, first_take = TRUE) {
  validate_source(source)
  cli::cli_alert("Starting source {source}")

  datasets <- load_source_data(source)
  ensure_dataset_names_match(datasets, media_list, first_take = first_take)

  datasets_with_content <- datasets |>
    purrr::keep(~ nrow(.x) > 0)

  res <- purrr::imap(
    datasets_with_content,
    ~ process_dataset(.x, .y, source, media_list)
  )

  res |>
    purrr::keep(~ nrow(.x) > 0) |>
    dplyr::bind_rows(.id = "dataset") |>
    dplyr::mutate(source = source, .after = dataset)
}

2.2.18 save_partial_matches

The function is used in the following chapters: 14 - Check Photos.

The function takes a single argument:

  • result: the list of match results returned by run_check_match_media.

It filters each result to keep only partial matches (non-exact but within the partial-match threshold), nests them by dataset, orders them, and writes them to an Excel workbook under Example/Output/12 named with the current date. The function returns the filtered, nested list so it can be reused without re-reading the file.

Code
save_partial_matches <- function(result) {
  partial_res <- result |>
    purrr::map(
      ~ .x |>
        dplyr::filter(
          match_exactly == FALSE,
          match_partially == TRUE
        )
    ) |>
    dplyr::bind_rows() |>
    tidyr::nest(.by = dataset) |>
    dplyr::arrange(dataset) |>
    dplyr::mutate(data = purrr::set_names(data, dataset)) |>
    dplyr::pull(data)

  partial_res |>
    openxlsx2::write_xlsx(
      stringr::str_glue(
        "Example/Output/12/check_names_photos_{lubridate::today()}.xlsx"
      ),
      as_table = TRUE,
      overwrite = TRUE
    )

  return(partial_res)
}
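The nest-and-name step that produces one worksheet per dataset can be sketched as follows (rows invented; the Excel write is omitted):

```r
nested <- tibble::tibble(
  dataset = c("B", "A", "A"),
  file = c("z.jpg", "x.jpg", "y.jpg")
) |>
  tidyr::nest(.by = dataset) |>
  dplyr::arrange(dataset) |>
  dplyr::mutate(data = purrr::set_names(data, dataset)) |>
  dplyr::pull(data)

names(nested)
# "A" "B"
```

The result is a named list of tibbles, which is exactly the shape openxlsx2::write_xlsx() expects when writing one sheet per list element.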

2.2.19 cleanup_media_root

The function is used in the following chapters: 14 - Check Photos.

The function takes the following arguments:

  • result: the list returned by run_check_match_media.

  • media_tbl: the tibble containing the media inventory (media_files in the notebook).

  • sources: a character vector of sources processed in the run.

It identifies files that were matched exactly but still reside in the root media folders (i.e., not under a source-specific subfolder) and removes them. By cleaning up these stray files after copies are made, the media tree stays organized and only contains properly classified content.

Code
cleanup_media_root <- function(result, media_tbl, sources) {
  sources_regex <- glue::glue_collapse(sources, "|")

  files_to_delete <- result |>
    purrr::map(
      ~ .x |>
        dplyr::filter(
          match_exactly == TRUE,
        ) |>
        dplyr::distinct(sheet, .keep_all = TRUE) |>
        dplyr::select(dataset, file = sheet)
    ) |>
    dplyr::bind_rows() |>
    dplyr::inner_join(media_tbl, by = c("dataset", "file")) |>
    dplyr::filter(
      !stringr::str_detect(value, glue::glue("\\/{sources_regex}\\/"))
    ) |>
    dplyr::pull(value)

  file.remove(files_to_delete)
}

2.2.20 run_check_match_media

The function is used in the following chapters: 14 - Check Photos.

The function takes the following arguments:

  • sources: a character vector of sources to process. Defaults to c("ct", "under", "over").

  • media_files: the media inventory tibble to reference when cleaning up (media_files in the notebook).

  • first_take: a logical flag (default TRUE) passed through to check_match_media to enforce strict dataset name checks on the initial run.

  • cleanup: a logical flag (default FALSE) controlling whether to remove stray files from the media root after copying exact matches.

It runs check_match_media for each source, optionally removes stray files with cleanup_media_root, writes the partial-match report via save_partial_matches, and returns the filtered partial-match list. Wrapping the workflow in this function makes the entire matching process repeatable with a single call.

Code
run_check_match_media <- function(
  sources = c("ct", "under", "over"),
  media_files,
  first_take = TRUE,
  cleanup = FALSE
) {
  result <- purrr::map(purrr::set_names(sources), function(source) {
    check_match_media(source, first_take = first_take)
  })

  if (cleanup == TRUE) {
    cleanup_media_root(result, media_files, sources)
  }

  final_result <- save_partial_matches(result)

  return(final_result)
}