14 Check Photos

14.1 Problem Description

This chapter verifies that the delivered photos and videos match the records expected in the Excel sheets. To do so, we standardize the media folder structure, copy the files we find into dedicated folders (ct, under, over), generate a partial-match report, and remove media left in the root that have no valid destination.

14.2 Problem Solving

Pipeline: (1) use helper functions and check folder names versus Excel sheets; (2) stage raw media into a standard structure; (3) inventory media under Media while separating already-processed files (ct, under, over); (4) apply matching functions between sheets and files; (5) export partial matches and (optionally) clean stray files.

14.2.1 Common steps

These steps get the ground ready: we load shared utilities, list the incoming folders, and immediately compare their names with the Excel sheets. If something is off, we find out before spending time on matching. That early sanity check keeps the rest of the workflow focused on pairing files, not fixing misconfigured inputs.

14.2.1.1 Check folder names vs. Excel sheets

We list the folders under Example/12 and compare their names with the Excel sheet names. waldo::compare surfaces any mismatch immediately, so folder and sheet names can be aligned before matching files. Keeping the two sets of names identical prevents a dataset from being silently skipped because of a small naming difference.

Code
folders <- list.dirs(path = "Example/12", recursive = FALSE)

names_folders <- folders |>
  stringr::str_split_i(pattern = "\\/", 3)

names_excel <- datapaperchecks::read_sheet(
  path = "Example/12",
  results = FALSE,
  recurse = FALSE
) |>
  names()

names_folders |>
  waldo::compare(names_excel)
✔ No differences
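
When the comparison does report differences, it can help to see which names exist on only one side. The following is a minimal sketch in base R; the two name vectors are hypothetical stand-ins, not the real contents of Example/12.

```r
# Hypothetical names; in the chapter these come from list.dirs() and the Excel sheets
names_folders_demo <- c("Example0", "Example1", "Example_2")
names_excel_demo <- c("Example0", "Example1", "Example2")

# Folders that have no sheet with the same name
folders_without_sheet <- setdiff(names_folders_demo, names_excel_demo)

# Sheets that have no folder with the same name
sheets_without_folder <- setdiff(names_excel_demo, names_folders_demo)

folders_without_sheet
sheets_without_folder
```

Both vectors being empty corresponds to the "No differences" case, as long as the names are also in the same order.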

14.2.2 Stage raw media

Here we scan every file under the incoming folders, keep only images or videos, and infer dataset and media type from the path and MIME type. Then we create the standardized destination folders and copy the files there. Executing this chunk once sets up a clean media tree for the later matching steps.

Code
files <- list.files(folders, full.names = TRUE, recursive = TRUE)

df <- tibble::tibble(file = files) |>
  dplyr::mutate(type = mime::guess_type(file)) |>
  dplyr::filter(stringr::str_detect(type, "^(image|video)/")) |>
  dplyr::mutate(
    dataset = stringr::str_split_i(file, pattern = "\\/", 3),
    media_type = stringr::str_split_i(type, pattern = "\\/", 1),
    folder = stringr::str_glue("Example/Media/{dataset}/{media_type}/"),
    filename = stringr::str_glue("{folder}{basename(file)}")
  )

folders_to_create <- df$folder |>
  unique()

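# Create each destination folder; showWarnings = FALSE keeps re-runs quiet
for (folder in folders_to_create) {
  dir.create(path = folder, recursive = TRUE, showWarnings = FALSE)
}

files_to_copy <- df$file
# Full destination path (folder plus file name) for each source file
target_files <- df$filename

file.copy(from = files_to_copy, to = target_files)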
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE
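
Because the destination keeps only the base name of each file, two source files with the same name in different subfolders of one dataset would map to the same target. The sketch below reproduces the path derivation in base R on hypothetical paths and MIME types (none of them come from the real delivery) and lists such collisions before anything is copied.

```r
# Hypothetical incoming files with their MIME types (illustrative only)
incoming <- data.frame(
  file = c(
    "Example/12/Example0/cam1/IMG_01.JPG",
    "Example/12/Example0/cam2/IMG_01.JPG",
    "Example/12/Example1/VD_01.mp4",
    "Example/12/Example1/notes.txt"
  ),
  type = c("image/jpeg", "image/jpeg", "video/mp4", "text/plain"),
  stringsAsFactors = FALSE
)

# Keep only images and videos
staged <- incoming[grepl("^(image|video)/", incoming$type), ]

# The third path component is the dataset; the MIME prefix is the media type
parts <- strsplit(staged$file, "/", fixed = TRUE)
staged$dataset <- vapply(parts, function(p) p[3], character(1))
staged$media_type <- sub("/.*$", "", staged$type)
staged$target <- file.path(
  "Example/Media", staged$dataset, staged$media_type, basename(staged$file)
)

# Targets that more than one source file would be copied to
collisions <- unique(staged$target[duplicated(staged$target)])
collisions
```

With its default overwrite = FALSE, file.copy returns FALSE for a file whose target already exists, so a FALSE in the copy result is worth checking against a list like this one.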

14.2.3 Inventory media under Media

Now we read everything under Example/Media, capturing the dataset, media type, and filename from the path. We then split files already placed under ct, under, or over into media_files_types_list and remove them from the main working list. The result, media_list, is the on-disk ground truth for files still needing a destination, which we compare against the Excel expectations.

Code
media_files <- list.files(
  path = "Example/Media",
  recursive = TRUE,
  full.names = TRUE
) |>
  dplyr::as_tibble() |>
  dplyr::mutate(
    dataset = stringr::word(value, 3, 3, sep = "\\/"),
    media_type = stringr::word(value, 4, 4, sep = "\\/"),
    file = basename(value)
  )

media_files_types <- media_files |>
  dplyr::filter(media_type %in% c("ct", "under", "over"))

media_files_types_list <- media_files_types |>
  (\(df) split(df, df$dataset))() |>
  purrr::map(\(df) split(df, df$media_type))

# Drop the already-processed files; the full path in `value` identifies each file
media_anti_join <- dplyr::anti_join(media_files, media_files_types, by = "value")

media_list <- split(media_anti_join, media_anti_join$dataset)
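
To make the shape of these objects concrete, here is a small base R sketch of the same two-level split on hypothetical paths. The object names ending in _demo, processed, and pending are illustrative and do not appear elsewhere in the chapter.

```r
# Hypothetical inventory (illustrative paths only)
media_demo <- data.frame(
  value = c(
    "Example/Media/Example0/ct/a.JPG",
    "Example/Media/Example0/under/b.JPG",
    "Example/Media/Example1/ct/c.JPG",
    "Example/Media/Example1/image/d.JPG"
  ),
  stringsAsFactors = FALSE
)

# Third and fourth path components are the dataset and the media type
parts <- strsplit(media_demo$value, "/", fixed = TRUE)
media_demo$dataset <- vapply(parts, function(p) p[3], character(1))
media_demo$media_type <- vapply(parts, function(p) p[4], character(1))

# Separate files already placed under ct, under, or over from the rest
is_processed <- media_demo$media_type %in% c("ct", "under", "over")
processed <- media_demo[is_processed, ]
pending <- media_demo[!is_processed, ]

# Nested list: dataset -> media type -> rows
processed_nested <- lapply(
  split(processed, processed$dataset),
  function(d) split(d, d$media_type)
)

names(processed_nested)
names(processed_nested$Example0)
pending$value
```

The nested list can then be indexed as processed_nested$Example1$ct, which mirrors how media_files_types_list is organized by dataset and then by folder type.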

14.2.4 Matching functions

The matching workflow now uses functions from datapaperchecks directly (validation, loading, candidate creation, exact-copy handling, and report export). This keeps chapter code focused on data preparation, while package code centralizes matching logic and maintenance.
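
As a rough illustration of the kind of name comparison involved, the sketch below checks two sheet/file name pairs that appear in this chapter's report output. It is not the implementation used by datapaperchecks: compare_names and its column names are hypothetical, and the package's flags may be defined differently. It relies only on base R, using utils::adist for the Levenshtein edit distance.

```r
# Hypothetical helper: compare expected names (sheet) with names on disk (file)
compare_names <- function(sheet, file) {
  data.frame(
    sheet = sheet,
    file = file,
    # Identical strings, including case and extension
    exact = sheet == file,
    # Identical once case is ignored
    same_ignoring_case = tolower(sheet) == tolower(file),
    # Pairwise Levenshtein distance between each sheet name and its file name
    edit_distance = mapply(
      function(a, b) as.numeric(utils::adist(a, b)[1, 1]),
      sheet, file,
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

checks <- compare_names(
  sheet = c("PNSB02.JPG", "PNSB05.JPG"),
  file = c("PNSB02.jpg", "PNSB05a.JPG")
)
checks
```

A small edit distance with no exact match is what makes a pair a candidate for manual review rather than an automatic copy.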

14.2.5 Run the full matching

We run datapaperchecks::run_check_match_media() with the media objects created in this chapter. The function processes ct, under, and over, writes the partial-match report to Example/Output/12, and returns the partial candidates by dataset.

Code
result <- datapaperchecks::run_check_match_media(
  media_list = media_list,
  media_files = media_files,
  media_files_types_list = media_files_types_list,
  first_take = TRUE
)

result
$Example0
# A tibble: 2 × 8
  source sheet      file        stringdist match_exactly match_file_no_extension
  <chr>  <chr>      <chr>            <dbl> <lgl>         <lgl>                  
1 under  PNSB02.JPG PNSB02.jpg           3 FALSE         FALSE                  
2 under  PNSB05.JPG PNSB05a.JPG          1 FALSE         FALSE                  
# ℹ 2 more variables: match_file_diff_capitalization <lgl>,
#   match_partially <lgl>

$Example1
# A tibble: 13 × 8
   source sheet            file  stringdist match_exactly match_file_no_extens…¹
   <chr>  <chr>            <chr>      <dbl> <lgl>         <lgl>                 
 1 ct     DSCF0028 - fram… DSCF…          4 FALSE         TRUE                  
 2 ct     DSCF0033 - fram… DSCF…          4 FALSE         TRUE                  
 3 ct     DSCF0149 - fram… DSCF…          4 FALSE         TRUE                  
 4 ct     PTDC0006 - fram… PTDC…          4 FALSE         TRUE                  
 5 ct     VD_00005 - fram… VD_0…          4 FALSE         TRUE                  
 6 ct     VD_00007 - fram… VD_0…          4 FALSE         TRUE                  
 7 ct     VD_00011 - fram… VD_0…          4 FALSE         TRUE                  
 8 ct     VD_00019 - fram… VD_0…          4 FALSE         TRUE                  
 9 ct     VD_00131 - fram… VD_0…          5 FALSE         FALSE                 
10 ct     VD_00092 - fram… VD_0…          2 FALSE         FALSE                 
11 ct     VD_00131 - fram… VD_0…          1 FALSE         FALSE                 
12 over   9 irmãos.JPEG    9 ir…          4 FALSE         FALSE                 
13 over   Fupala.JPEG      Fupa…          4 FALSE         FALSE                 
# ℹ abbreviated name: ¹​match_file_no_extension
# ℹ 2 more variables: match_file_diff_capitalization <lgl>,
#   match_partially <lgl>
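
For review, it can be convenient to flatten the per-dataset list into one table. The following is a sketch in base R on a hypothetical stand-in (result_demo) that keeps only two of the columns shown above; it assumes the returned object is a named list of data frames, as the printed output suggests.

```r
# Hypothetical stand-in for the returned list (one data frame per dataset)
result_demo <- list(
  Example0 = data.frame(source = c("under", "under"), stringdist = c(3, 1)),
  Example1 = data.frame(source = c("ct", "over"), stringdist = c(4, 4))
)

# Number of partial candidates per dataset
candidates_per_dataset <- vapply(result_demo, nrow, integer(1))

# Add the dataset name to each table, then stack them
combined <- do.call(
  rbind,
  Map(
    function(d, name) {
      d$dataset <- name
      d
    },
    result_demo, names(result_demo)
  )
)

# Closest names first, with plain row numbers
combined <- combined[order(combined$stringdist), ]
rownames(combined) <- NULL

candidates_per_dataset
combined
```

Sorting by distance puts the near-identical names at the top, which are usually the quickest to confirm or fix in the sheets.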