14  Check Photos

14.1 Problem Description

We verify whether the delivered photos and videos match the expected records in the Excel sheets. We standardize the media structure, copy matched files into source-specific folders (ct, under, over), generate a partial-match report, and optionally remove media files that remain in the root without a correct destination.

14.2 Problem Solving

Pipeline: (1) load helpers and check folder names versus Excel sheets; (2) stage raw media into a standard structure; (3) inventory media under Media while separating already-processed files (ct, under, over); (4) apply matching functions between sheets and files; (5) export partial matches and (optionally) clean stray files.

14.2.1 Common steps

These steps lay the groundwork: we load shared utilities, list the incoming folders, and immediately compare their names with the Excel sheet names. If something is off, we find out before spending time on matching. That early sanity check keeps the rest of the workflow focused on pairing files rather than fixing misconfigured inputs.

14.2.1.1 Load helper functions

Here we simply source R/FUNCTIONS.R to bring in shared helpers (read_sheet, string utilities, logging). By centralizing those functions, the notebook stays lean and consistent with the R script. Nothing else happens besides preparing the environment for the next chunks.

Code
source("R/FUNCTIONS.R")

14.2.1.2 Check folder names vs. Excel sheets

Next, we list the folders under Example/12 and compare those names to the Excel sheet names. Using waldo::compare, any mismatch is surfaced right away, so we can align folder and sheet names before matching files. This symmetry avoids silent skips caused by small naming drifts.

Code
folders <- list.dirs(path = "Example/12", recursive = FALSE)

names_folders <- folders |>
  stringr::str_split_i(pattern = "\\/", 3)

names_excel <- read_sheet(
  path = "Example/12",
  results = FALSE,
  recurse = FALSE
) |>
  names()

names_folders |>
  waldo::compare(names_excel)
✔ No differences

14.2.2 Stage raw media

Here we scan every file under the incoming folders, keep only images or videos, and infer dataset and media type from the path and MIME type. Then we create the standardized destination folders and copy the files there. Executing this chunk once sets up a clean media tree for the later matching steps.

Code
files <- list.files(folders, full.names = TRUE, recursive = TRUE)

df <- tibble::tibble(file = files) |>
  dplyr::mutate(type = mime::guess_type(file)) |>
  dplyr::filter(stringr::str_detect(type, "^image/|^video/")) |>
  dplyr::mutate(
    dataset = stringr::str_split_i(file, pattern = "\\/", 3),
    media_type = stringr::str_split_i(type, pattern = "\\/", 1),
    folder = stringr::str_glue("Example/Media/{dataset}/{media_type}/"),
    filename = stringr::str_glue("{folder}{basename(file)}")
  )

folders_to_create <- df$folder |>
  unique()

for (folder in folders_to_create) {
  dir.create(path = folder, recursive = TRUE, showWarnings = FALSE)
}

files_to_copy <- df$file
folder_target <- df$filename

file.copy(from = files_to_copy, to = folder_target)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE
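
The MIME-based filter above can be sketched with base `grepl` (standing in for `stringr::str_detect`), using a few illustrative MIME strings:

```r
# Illustrative MIME types, not read from real files
types <- c("image/jpeg", "video/mp4", "application/pdf", "image/png")

# Keep only image/video entries, mirroring the filter in the chunk above
keep <- grepl("^image/|^video/", types)
types[keep]
#> [1] "image/jpeg" "video/mp4"  "image/png"
```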

14.2.3 Inventory media under Media

Now we read everything under Example/Media, capturing the dataset, media type, and filename from the path. We then split files already placed under ct, under, or over into media_files_types_list and remove them from the main working list. The result, media_list, is the on-disk ground truth for files still needing a destination, which we compare against the Excel expectations.

Code
media_files <- list.files(
  path = "Example/Media",
  recursive = TRUE,
  full.names = TRUE
) |>
  dplyr::as_tibble() |>
  dplyr::mutate(
    dataset = stringr::word(value, 3, 3, sep = "\\/"),
    media_type = stringr::word(value, 4, 4, sep = "\\/"),
    file = basename(value)
  )

media_files_types <- media_files |>
  dplyr::filter(media_type %in% c("ct", "under", "over"))

media_files_types_list <- media_files_types |>
  (\(df) split(df, df$dataset))() |>
  purrr::map(\(df) split(df, df$media_type))

media_anti_join <- media_files |>
  dplyr::anti_join(media_files_types, by = "value")
Code
media_list <- split(media_anti_join, media_anti_join$dataset)
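
The two-level split used above can be illustrated with a toy data frame (rows and values are made up):

```r
# Toy inventory rows, mimicking the dataset/media_type columns above
toy <- data.frame(
  dataset = c("Example0", "Example0", "Example1"),
  media_type = c("ct", "under", "ct")
)

# Split by dataset first, then split each piece by media type,
# as done for media_files_types_list
by_dataset <- split(toy, toy$dataset)
nested <- lapply(by_dataset, function(d) split(d, d$media_type))

names(nested)
names(nested$Example0)
```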

14.2.4 Matching functions

In this step, we define the helpers that drive validation, data loading, matching, and cleanup. Each function focuses on one responsibility so the main flow stays readable. We also explain how we detect exact vs. near matches using stringdist.
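
The distance idea can be previewed with base `utils::adist`, which also computes Levenshtein distance (`stringdist::stringdistmatrix` with `method = "lv"` generalizes this to a full sheet-by-file matrix). The first pair below comes from the partial-match report at the end of the chapter:

```r
# Same name, different capitalization of the extension: three substitutions
utils::adist("PNSB02.JPG", "PNSB02.jpg")  # 3

# Identical strings: distance 0, i.e. an exact match
utils::adist("PNSB02.JPG", "PNSB02.JPG")  # 0
```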

14.2.4.1 Validate allowed sources and load the right sheet

We begin by validating the requested source and mapping it to the proper Excel sheet with case_match. If the source is unknown, we stop immediately; otherwise we load the sheet so the rest of the code can stay agnostic to sheet naming. We also disable recursive reading to avoid loading unintended files. This keeps the pipeline predictable from the first step.

Code
validate_source <- function(source, allowed = c("ct", "under", "over")) {
  if (!source %in% allowed) {
    cli::cli_abort(
      "Source {source} is not an accepted value. Accepted values are: {toString(allowed)}."
    )
  }
}

load_source_data <- function(source) {
  source <- as.character(source)

  sheet_name <- dplyr::case_match(
    source,
    "ct" ~ "Camera_trap",
    "under" ~ "Underpasses",
    "over" ~ "Overpasses"
  )

  if (is.na(sheet_name)) {
    cli::cli_abort("Source {source} is not mapped to a sheet.")
  }

  read_sheet(
    path = "Example/12",
    recurse = FALSE,
    sheet = sheet_name,
    na = c("NA", "-")
  )
}

14.2.4.2 Check dataset-name consistency

Here we confirm that every dataset present in the Excel sheets also exists in the media inventory. The first_take flag lets us enforce this strictly on the initial run, while allowing later runs to proceed even if some datasets were already processed. That safeguard keeps disk contents and spreadsheet expectations aligned, without blocking follow-up executions.

Code
ensure_dataset_names_match <- function(
  datasets,
  media_list,
  first_take = TRUE
) {
  datasets_not_in_common <- dplyr::setdiff(names(datasets), names(media_list))

  if (first_take == TRUE) {
    if (length(datasets_not_in_common) != 0) {
      cli::cli_abort(
        "There are different datasets between media folder and Excel sheets."
      )
    }
  }
}
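
The check boils down to a set difference over names; a toy sketch (dataset names are illustrative):

```r
# Datasets expected by the Excel sheets vs. datasets found on disk
excel_datasets <- c("Example0", "Example1")
media_datasets <- c("Example0")

# A non-empty difference is what triggers the abort on a first take
setdiff(excel_datasets, media_datasets)  # "Example1"
```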

14.2.4.3 Extract filenames from the sheet

Continuing, we pull the expected filenames from the appropriate column (Camera_vision_photo for ct, Structure_photo otherwise). We keep only distinct, non-missing values so the matching targets are clean. Using tidy evaluation (col_sym) lets us pick the right column without hard-coding elsewhere.

Code
extract_filenames_on_sheet <- function(df, source) {
  column <- if (source == "ct") "Camera_vision_photo" else "Structure_photo"
  col_sym <- rlang::sym(column)

  df |>
    dplyr::distinct(dplyr::across(dplyr::all_of(column))) |>
    dplyr::filter(!is.na(!!col_sym)) |>
    dplyr::pull(!!col_sym)
}
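
In base R terms, the helper reduces to deduplicating one column and dropping NAs; a toy equivalent (column name and values are illustrative):

```r
# Toy sheet with a duplicate and a missing value
sheet <- data.frame(
  Structure_photo = c("PNSB02.JPG", "PNSB02.JPG", NA, "PNSB05.JPG")
)

# Distinct, non-missing targets, as extract_filenames_on_sheet() returns
targets <- unique(sheet$Structure_photo)
targets[!is.na(targets)]  # "PNSB02.JPG" "PNSB05.JPG"
```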

14.2.4.4 Prepare media candidates

Now we prune media files that are already sitting in a source-specific folder. By filtering out any path containing /{source}/, we restrict the next steps to files that still need placement. That keeps us from re-processing what was matched before.

Code
media_candidates <- function(dataset_media, source) {
  dataset_media |>
    dplyr::filter(!stringr::str_detect(value, glue::glue("\\/{source}\\/")))
}

14.2.4.5 Compute string distances (exact and near matches)

Here we compare the sheet filenames to the media filenames using Levenshtein distance (stringdistmatrix). A distance of zero signals an exact match; small distances flag near matches we might want to inspect. We keep the result as a tibble and add a logical match_exactly column that downstream steps can consume.

Code
stringdist_table <- function(filenames_on_sheet, media) {
  stringdist::stringdistmatrix(
    filenames_on_sheet,
    media,
    method = "lv",
    useNames = "strings"
  ) |>
    as.data.frame() |>
    tibble::rownames_to_column("sheet") |>
    dplyr::as_tibble() |>
    dplyr::mutate(
      match_exactly = dplyr::if_any(dplyr::where(is.numeric), ~ . == 0)
    )
}

14.2.4.6 Build and prune match candidates

Next we reshape the distance table to long form, calculate simple heuristics (same name without extension, case-insensitive match, distance between 1 and 5), and order the candidates. Then we filter out ambiguous pairs that collide on sheet/file combinations. We also remove entries already present under ct, under, or over when those folders exist, preventing duplicated work. The goal is to surface solid candidates while trimming noisy duplicates.

Code
build_match_candidates <- function(df_stringdist) {
  df_stringdist |>
    tidyr::pivot_longer(
      cols = -c(sheet, match_exactly),
      names_to = "file",
      values_to = "stringdist"
    ) |>
    dplyr::mutate(
      match_file_no_extension = sheet == tools::file_path_sans_ext(file),
      match_file_diff_capitalization = stringr::str_to_upper(sheet) ==
        stringr::str_to_upper(file),
      match_partially = dplyr::if_any(
        dplyr::where(is.numeric),
        ~ dplyr::between(., 1, 5)
      )
    ) |>
    dplyr::relocate(match_exactly, .after = stringdist) |>
    dplyr::arrange(dplyr::desc(match_file_no_extension), file)
}

dedupe_matches <- function(
  match_candidates,
  files_to_exclude = character(),
  processed_files = character()
) {
  filtered_candidates <- match_candidates |>
    dplyr::filter(!file %in% files_to_exclude)

  dup_sheet_file <- filtered_candidates |>
    dplyr::filter(
      match_file_no_extension == TRUE |
        match_file_diff_capitalization == TRUE
    ) |>
    dplyr::select(sheet, file)

  filtered_candidates |>
    dplyr::mutate(
      keep = dplyr::case_when(
        match_file_no_extension == FALSE &
          match_file_diff_capitalization == FALSE &
          sheet %in% dup_sheet_file$sheet ~
          "REMOVE",
        match_file_no_extension == FALSE &
          match_file_diff_capitalization == FALSE &
          file %in% dup_sheet_file$file ~
          "REMOVE",
        sheet %in% processed_files ~ "REMOVE",
        TRUE ~ "KEEP"
      )
    ) |>
    dplyr::filter(keep == "KEEP") |>
    dplyr::select(-keep)
}

14.2.4.7 Copy exact matches

Here we handle the zero-distance hits: we create the destination folder if it does not exist, map sheet entries to real files, and copy them while preserving timestamps. If no exact matches are found, we return an empty tibble so the flow continues gracefully.

Code
copy_exact_matches <- function(
  df_stringdist,
  media_without_source,
  target_dir
) {
  media_match_exactly <- df_stringdist |>
    dplyr::filter(match_exactly == TRUE)

  if (nrow(media_match_exactly) == 0) {
    return(tibble::tibble(file = character(), full_path_to_copy = character()))
  }

  if (!dir.exists(target_dir)) {
    dir.create(target_dir, recursive = TRUE)
  }

  files_found_in_sheet <- media_match_exactly |>
    dplyr::pull(sheet) |>
    tibble::enframe(value = "file") |>
    dplyr::inner_join(media_without_source, by = "file") |>
    dplyr::mutate(full_path_to_copy = glue::glue("{target_dir}/{file}"))

  file.copy(
    from = files_found_in_sheet$value,
    to = files_found_in_sheet$full_path_to_copy,
    overwrite = FALSE,
    copy.date = TRUE
  )

  files_found_in_sheet
}
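
The copy step can be exercised safely with temporary files (paths below are illustrative, not the real media tree):

```r
# Create a throwaway source file and a target folder under tempdir()
src <- file.path(tempdir(), "PNSB01.JPG")
file.create(src)

target_dir <- file.path(tempdir(), "under")
if (!dir.exists(target_dir)) {
  dir.create(target_dir, recursive = TRUE)
}

# Same flags as copy_exact_matches(): never overwrite, keep timestamps
file.copy(
  from = src,
  to = file.path(target_dir, basename(src)),
  overwrite = FALSE,
  copy.date = TRUE
)  # TRUE on first run

file.exists(file.path(target_dir, "PNSB01.JPG"))
```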

14.2.4.8 Process each dataset

For each dataset and source, we extract the expected filenames, gather candidate media, compute distances, copy any exact matches, and return a curated candidate table with the copied files excluded. This per-dataset step stitches together all the helper logic into a single pass.

Code
process_dataset <- function(df, dataset_name, source, media_list) {
  cli::cli_alert_info("Processing dataset {dataset_name}.")

  filenames_on_sheet <- extract_filenames_on_sheet(df, source)

  if (length(filenames_on_sheet) == 0) {
    return(tibble::tibble())
  }

  media_without_source <- media_candidates(media_list[[dataset_name]], source)

  if (nrow(media_without_source) == 0) {
    return(tibble::tibble())
  }

  media <- media_without_source |>
    dplyr::distinct(file) |>
    dplyr::pull(file)

  df_stringdist <- stringdist_table(filenames_on_sheet, media)

  files_copied <- copy_exact_matches(
    df_stringdist,
    media_without_source,
    glue::glue("Example/Media/{dataset_name}/{source}")
  )

  match_candidates <- build_match_candidates(df_stringdist)

  dedupe_matches(match_candidates, files_to_exclude = files_copied$file)
}

14.2.4.9 Orchestrate matching per source

Finally, check_match_media validates the source, loads the sheet, checks dataset names, iterates over non-empty datasets, and binds the results while tagging the source. Each source flows through the same validated routine, keeping the main pipeline slim and repeatable.

Code
check_match_media <- function(source = NULL, first_take = TRUE) {
  validate_source(source)
  cli::cli_alert("Starting source {source}")

  datasets <- load_source_data(source)
  ensure_dataset_names_match(datasets, media_list, first_take = first_take)

  datasets_with_content <- datasets |>
    purrr::keep(~ nrow(.x) > 0)

  res <- purrr::imap(
    datasets_with_content,
    ~ process_dataset(.x, .y, source, media_list)
  )

  res |>
    purrr::keep(~ nrow(.x) > 0) |>
    dplyr::bind_rows(.id = "dataset") |>
    dplyr::mutate(source = source, .after = dataset)
}

14.2.5 Run the full matching

To wrap up, we run check_match_media for each source, export an Excel report of partial matches, and (optionally) clean files that remain at the root (image or video) but should now sit under ct, under, or over. Reporting stays inside save_partial_matches, while housekeeping lives in cleanup_media_root and can be toggled on when needed. The wrapper also accepts first_take so the initial run can enforce dataset-name consistency.

Code
save_partial_matches <- function(result) {
  partial_res <- result |>
    purrr::map(
      ~ .x |>
        dplyr::filter(
          match_exactly == FALSE,
          match_partially == TRUE
        )
    ) |>
    dplyr::bind_rows() |>
    tidyr::nest(.by = dataset) |>
    dplyr::arrange(dataset) |>
    dplyr::mutate(data = purrr::set_names(data, dataset)) |>
    dplyr::pull(data)

  partial_res |>
    openxlsx2::write_xlsx(
      stringr::str_glue(
        "Example/Output/12/check_names_photos_{lubridate::today()}.xlsx"
      ),
      as_table = TRUE,
      overwrite = TRUE
    )

  return(partial_res)
}

cleanup_media_root <- function(result, media_tbl, sources) {
  sources_regex <- glue::glue_collapse(sources, "|")

  files_to_delete <- result |>
    purrr::map(
      ~ .x |>
        dplyr::filter(match_exactly == TRUE) |>
        dplyr::distinct(sheet, .keep_all = TRUE) |>
        dplyr::select(dataset, file = sheet)
    ) |>
    dplyr::bind_rows() |>
    dplyr::inner_join(media_tbl, by = c("dataset", "file")) |>
    dplyr::filter(
      !stringr::str_detect(value, glue::glue("\\/{sources_regex}\\/"))
    ) |>
    dplyr::pull(value)

  file.remove(files_to_delete)
}

run_check_match_media <- function(
  sources = c("ct", "under", "over"),
  media_tbl = media_files,
  first_take = TRUE,
  cleanup = FALSE
) {
  result <- purrr::map(purrr::set_names(sources), function(source) {
    check_match_media(source, first_take = first_take)
  })

  if (cleanup == TRUE) {
    cleanup_media_root(result, media_tbl, sources)
  }

  final_result <- save_partial_matches(result)

  return(final_result)
}

run_check_match_media(first_take = TRUE)
$Example0
# A tibble: 2 × 8
  source sheet      file        stringdist match_exactly match_file_no_extension
  <chr>  <chr>      <chr>            <dbl> <lgl>         <lgl>                  
1 under  PNSB02.JPG PNSB02.jpg           3 FALSE         FALSE                  
2 under  PNSB05.JPG PNSB05a.JPG          1 FALSE         FALSE                  
# ℹ 2 more variables: match_file_diff_capitalization <lgl>,
#   match_partially <lgl>

$Example1
# A tibble: 14 × 8
   source sheet            file  stringdist match_exactly match_file_no_extens…¹
   <chr>  <chr>            <chr>      <dbl> <lgl>         <lgl>                 
 1 ct     DSCF0028 - fram… DSCF…          4 FALSE         TRUE                  
 2 ct     DSCF0033 - fram… DSCF…          4 FALSE         TRUE                  
 3 ct     DSCF0149 - fram… DSCF…          4 FALSE         TRUE                  
 4 ct     PTDC0006 - fram… PTDC…          4 FALSE         TRUE                  
 5 ct     VD_00005 - fram… VD_0…          4 FALSE         TRUE                  
 6 ct     VD_00007 - fram… VD_0…          4 FALSE         TRUE                  
 7 ct     VD_00011 - fram… VD_0…          4 FALSE         TRUE                  
 8 ct     VD_00019 - fram… VD_0…          4 FALSE         TRUE                  
 9 ct     VD_00131 - fram… VD_0…          5 FALSE         FALSE                 
10 ct     VD_00092 - fram… VD_0…          2 FALSE         FALSE                 
11 ct     VD_00131 - fram… VD_0…          1 FALSE         FALSE                 
12 over   Fupala.JPEG      Fupa…          4 FALSE         FALSE                 
13 over   Manecão.jpeg     Mane…          2 FALSE         FALSE                 
14 over   São Paulo.jpeg   São …          2 FALSE         FALSE                 
# ℹ abbreviated name: ¹​match_file_no_extension
# ℹ 2 more variables: match_file_diff_capitalization <lgl>,
#   match_partially <lgl>