5 Check Species Validity

5.1 Problem description

Databases frequently include a list of species taxa. Because most of these lists are filled in by hand, taxa names are expected to contain typos and to follow different conventions for denoting unidentified species. It is therefore necessary to check and correct the taxa names using services designed for that purpose.

Since we are checking crossings recorded by camera traps, it is normal that some taxa could not be identified to species level. This makes the task harder, as we have to consider the lowest possible taxonomic level. We therefore need solutions that handle the several ways in which the species field can be filled in.

5.2 Problem solving

5.2.1 Common steps

To solve this issue, we repeat some of the basic first steps from the previous checks: we use our custom read_sheet function, which returns the full paths of all available .xlsx files, and then read the species sheet from every file.

source("R/FUNCTIONS.R")

spreadsheets <- read_sheet(path = "Example", results = FALSE)

sp_full <- purrr::map(.x = spreadsheets, function(file) {
  readxl::read_excel(
    file,
    sheet = "Species_records_camera",
    na = c("NA", "na"),
    col_types = c("guess", "guess", "guess", "date", "guess", "guess", "guess"),
    col_names = TRUE
  )
})

head(sp_full[[1]]) # show head of first file

# A tibble: 6 × 7
  Structure_ID   Camera_ID Species       Record_date         Record_time        
  <chr>          <chr>     <chr>         <dttm>              <dttm>             
1 P1 (iguaçu)    cam1      Cavia sp.     2017-05-09 00:00:00 1899-12-31 03:59:00
2 P3 (varzea)    cam1      Aramides sar… 2017-05-01 00:00:00 1899-12-31 08:37:00
3 P3 (varzea)    cam1      Leopardus sp. 2017-05-05 00:00:00 1899-12-31 21:09:00
4 BC2 (drenagem) cam2      Didelphis au… 2018-07-30 00:00:00 1899-12-31 04:18:00
5 BC2 (drenagem) cam2      Didelphis au… 2018-07-30 00:00:00 1899-12-31 20:02:00
6 BC2 (drenagem) cam2      Didelphis au… 2018-07-31 00:00:00 1899-12-31 00:10:00
# ℹ 2 more variables: Record_criteria <lgl>, Behavior <chr>
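
The read_sheet helper is defined in R/FUNCTIONS.R and is not reproduced in this chapter. As a rough idea of what such a helper might do, the sketch below collects the full paths of all .xlsx files under a folder using base R only. The function name list_xlsx_files, the recursive search, and the choice to name each path after its file are assumptions made for illustration, not the actual implementation.

```r
# Hypothetical stand-in for read_sheet(); the real helper lives in R/FUNCTIONS.R
list_xlsx_files <- function(path) {
  files <- list.files(
    path,
    pattern = "\\.xlsx$",
    full.names = TRUE,
    recursive = TRUE,
    ignore.case = TRUE
  )
  # skip Excel lock files such as "~$Example1.xlsx"
  files <- files[!startsWith(basename(files), "~$")]
  # name each path after its file so that later lists are keyed by dataset
  names(files) <- tools::file_path_sans_ext(basename(files))
  files
}
```

Called as list_xlsx_files("Example"), it would return a named character vector that can be passed to purrr::map() in the same way as spreadsheets above.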

5.2.2 Specific steps

5.2.2.1 Full species

We start by keeping only valid (full) species. Here we consider only two-word terms whose second word is not sp, spp, ni or a similar placeholder commonly used to designate unidentified species.

species_all_check <- sp_full |>
  purrr::map(function(x) {
    x |>
      dplyr::distinct(Species) |> # unique values for Species
      dplyr::mutate(Species = stringr::str_squish(Species)) |> # trim and collapse whitespace
      dplyr::filter(
        stringr::str_count(Species, " ") == 1,
        !stringr::str_detect(stringr::word(Species, 2, 2), "\\."),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^sp$"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^sp(?=\\.)"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^spp(?=\\.)"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "\\("),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^ni$"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^NI$"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^NID$"),
        !stringr::str_detect(Species, "\\/")
      ) |>
      dplyr::arrange(Species) |> # alphabetical order
      dplyr::pull() # vector
  })

head(species_all_check[[1]]) # show head of first list

[1] "Aramides saracura" "Ardea alba"        "Cabassous tatouay"
[4] "Caracara plancus"  "Cerdocyon thous"   "Coendou spinosus"
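
To make the filter rules easier to follow, the sketch below restates them as a single base R predicate and applies it to a few made-up entries. The function name is_full_species and the example strings are hypothetical. Unlike the filter above, this version also rejects a bare "spp" without a period, which is an assumption about the intended behaviour.

```r
# Hypothetical helper: TRUE when an entry looks like a full binomial name
is_full_species <- function(x) {
  x <- gsub("\\s+", " ", trimws(x)) # trim and collapse whitespace
  parts <- strsplit(x, " ", fixed = TRUE)
  # second word of each entry, or NA when there is none
  second <- vapply(
    parts,
    function(p) if (length(p) >= 2) p[2] else NA_character_,
    character(1)
  )
  two_words <- lengths(parts) == 2
  placeholder <- grepl("^(sp|spp|ni|NI|NID)$", second) | # unidentified markers
    grepl("[.(]", second) # abbreviations ("sp.") or parenthesised notes
  has_slash <- grepl("/", x, fixed = TRUE) # "either/or" identifications
  two_words & !placeholder & !has_slash
}

examples <- c(
  "Cerdocyon thous", "Cavia sp.", "Leopardus  sp", "Mammalia",
  "Didelphis NI", "Mazama (juvenile)", " Ardea  alba "
)
data.frame(entry = examples, full_species = is_full_species(examples))
```

Only the first and last entries are kept as full species; the others fall through to the imprecise taxa treated later in this chapter.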

Because whitespace is normalized only after the distinct values are taken, the same name may appear more than once if it was originally typed with extra or trailing spaces. We therefore check each list for names that occur more than once.

species_all_check |>
  purrr::map(function(x) {
    counts <- table(x)

    counts[counts > 1]
  }) |>
  purrr::keep(~ any(.x > 1))

named list()
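
An empty named list() means that no dataset contained such duplicates. For comparison, the toy example below (made-up entries, base R only) shows what the check would flag when the same name is typed with different spacing:

```r
raw <- c("Ardea alba", "Ardea  alba", "Ardea alba ", "Cerdocyon thous")

# distinct values first, then whitespace normalisation, as in the chunk above
squished <- gsub("\\s+", " ", trimws(unique(raw)))

counts <- table(squished)
counts[counts > 1] # "Ardea alba" is reported with a count of 3
```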

With the full species list in hand, we use the Global Names Verifier API (https://verifier.globalnames.org/) to check the species names. We chose to verify them against the Integrated Taxonomic Information System (ITIS), which corresponds to data_sources=3 in the API address. We use the httr package to send the requests to the API.

For each dataset, we checked all full species names. At the end of the code chunk, we unnested the columns bestResult and scoreDetails, which the Global Names Verifier returns as nested data frames. We then compiled the results for all species of each dataset into a single data frame.

list_check_globalnames <- list()

for (dataset in names(species_all_check)) {
  species <- species_all_check[[dataset]]

  message(stringr::str_glue("Starting dataset {dataset}"))

  for (sp in species) {
    sp_ <- stringr::str_replace(sp, " ", "_")

    result <- httr::GET(stringr::str_glue(
      "https://verifier.globalnames.org/api/v1/verifications/{sp_}?data_sources=3"
    )) # API address for the verification request

    list_check_globalnames[[dataset]][[sp_]] <- jsonlite::fromJSON(rawToChar(
      result$content
    ))[["names"]] # keep only the "names" element, stored by dataset and species name
  }
  # bind the per-species results into a single data frame, unnesting the data frame columns
  list_check_globalnames[[dataset]][["all_results"]] <- list_check_globalnames[[
    dataset
  ]] |>
    dplyr::bind_rows() |>
    tidyr::unnest(cols = c(bestResult), names_repair = "unique") |>
    tidyr::unnest(cols = c(scoreDetails), names_repair = "unique") |>
    tibble::as_tibble()
}

list_check_globalnames[[1]][["all_results"]]

# A tibble: 96 × 110
   id    name  cardinality matchType...4 dataSourceId...5 dataSourceTitleShort…¹
   <chr> <chr>       <int> <chr>                    <int> <chr>                 
 1 270f… Aram…           2 Exact                        3 ITIS                  
 2 7714… Arde…           2 Exact                        3 ITIS                  
 3 cba5… Caba…           2 Exact                        3 ITIS                  
 4 fe40… Cara…           2 Exact                        3 ITIS                  
 5 5b42… Cerd…           2 Exact                        3 ITIS                  
 6 53c8… Coen…           2 Exact                        3 ITIS                  
 7 fc6f… Colu…           2 Exact                        3 ITIS                  
 8 4345… Cryp…           2 Exact                        3 ITIS                  
 9 3b28… Cuni…           2 Exact                        3 ITIS                  
10 a089… Cyan…           2 Exact                        3 ITIS                  
# ℹ 86 more rows
# ℹ abbreviated name: ¹dataSourceTitleShort...6
# ℹ 104 more variables: curation...7 <chr>, recordId...8 <chr>,
#   outlink...9 <chr>, entryDate...10 <chr>, sortScore...11 <dbl>,
#   matchedNameID...12 <chr>, matchedName...13 <chr>,
#   matchedCardinality...14 <int>, matchedCanonicalSimple...15 <chr>,
#   matchedCanonicalFull...16 <chr>, currentRecordId...17 <chr>, …
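
The request address built inside the loop can be isolated in a small helper, which makes it easier to inspect before any call is sent. The sketch below is hypothetical (the name build_verifier_url is not part of the original code); it keeps the underscore convention used above and additionally percent-encodes any remaining special characters with utils::URLencode(), which is an assumption about what the service expects.

```r
# Hypothetical helper: build the verification address for one name
build_verifier_url <- function(name, data_source = 3) {
  name_ <- gsub(" ", "_", name, fixed = TRUE) # same convention as the loop above
  name_ <- utils::URLencode(name_, reserved = TRUE) # encode accents, slashes, etc.
  paste0(
    "https://verifier.globalnames.org/api/v1/verifications/",
    name_,
    "?data_sources=", data_source # 3 = ITIS
  )
}

build_verifier_url("Ardea alba")
```

The returned string can be passed to httr::GET() in the same way as the address built in the loop.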

The next step consisted of creating a single data frame with all the species from all the datasets. We extracted the all_results element from each dataset and then stacked them into one data frame.

list_sp <- list_check_globalnames |>
  purrr::map("all_results") |>
  dplyr::bind_rows(.id = "dataset") |>
  janitor::clean_names() |>
  dplyr::mutate(query = stringr::str_replace_all(name, "_", " "), .after = name)

head(list_sp)

# A tibble: 6 × 178
  dataset  id              name  query cardinality match_type_4 data_source_id_5
  <chr>    <chr>           <chr> <chr>       <int> <chr>                   <int>
1 Example1 270f98a4-840c-… Aram… Aram…           2 Exact                       3
2 Example1 771418ff-c606-… Arde… Arde…           2 Exact                       3
3 Example1 cba54653-8408-… Caba… Caba…           2 Exact                       3
4 Example1 fe40eafd-8adb-… Cara… Cara…           2 Exact                       3
5 Example1 5b42a2c7-767f-… Cerd… Cerd…           2 Exact                       3
6 Example1 53c82c2e-a688-… Coen… Coen…           2 Exact                       3
# ℹ 171 more variables: data_source_title_short_6 <chr>, curation_7 <chr>,
#   record_id_8 <chr>, outlink_9 <chr>, entry_date_10 <chr>,
#   sort_score_11 <dbl>, matched_name_id_12 <chr>, matched_name_13 <chr>,
#   matched_cardinality_14 <int>, matched_canonical_simple_15 <chr>,
#   matched_canonical_full_16 <chr>, current_record_id_17 <chr>,
#   current_name_id_18 <chr>, current_name_19 <chr>,
#   current_cardinality_20 <int>, current_canonical_simple_21 <chr>, …

Since we want only the errors, we filtered the column match_type_4 to keep every row in which the result was not “Exact”. That is, every species for which the query and the result were not exactly the same term was selected for further evaluation.

sp_with_errors <- list_sp |>
  dplyr::filter(match_type_4 != "Exact")

head(sp_with_errors)

# A tibble: 6 × 178
  dataset  id              name  query cardinality match_type_4 data_source_id_5
  <chr>    <chr>           <chr> <chr>       <int> <chr>                   <int>
1 Example1 46334ffe-8d29-… Guer… Guer…           2 PartialExact                3
2 Example1 b6551cb7-7323-… Não_… Não …           0 NoMatch                    NA
3 Example1 acaac2da-d4a3-… Subu… Subu…           0 NoMatch                    NA
4 Example3 c6eeb584-9cf4-… Dico… Dico…           2 PartialExact                3
5 Example3 81a8d8a4-a90b-… Paux… Paux…           2 PartialExact                3
6 Example4 15638204-393b-… Sylv… Sylv…           2 Fuzzy                       3
# ℹ 171 more variables: data_source_title_short_6 <chr>, curation_7 <chr>,
#   record_id_8 <chr>, outlink_9 <chr>, entry_date_10 <chr>,
#   sort_score_11 <dbl>, matched_name_id_12 <chr>, matched_name_13 <chr>,
#   matched_cardinality_14 <int>, matched_canonical_simple_15 <chr>,
#   matched_canonical_full_16 <chr>, current_record_id_17 <chr>,
#   current_name_id_18 <chr>, current_name_19 <chr>,
#   current_cardinality_20 <int>, current_canonical_simple_21 <chr>, …
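
Before reviewing the flagged names one by one, it can help to see how many rows fall under each match type in each dataset. The toy data frame below only reuses the dataset and match_type_4 values visible in the output above; the object names flagged and review_counts are hypothetical.

```r
flagged <- data.frame(
  dataset = c(
    "Example1", "Example1", "Example1",
    "Example3", "Example3", "Example4"
  ),
  match_type_4 = c(
    "PartialExact", "NoMatch", "NoMatch",
    "PartialExact", "PartialExact", "Fuzzy"
  )
)

# rows per dataset and match type
review_counts <- table(flagged$dataset, flagged$match_type_4)
review_counts
```

With the real data, the same call on sp_with_errors$dataset and sp_with_errors$match_type_4 would give the full picture.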

After checking the list of species not considered “Exact”, we found that some of them must be whitelisted, since we are sure that the name is valid (for example, after checking the List of Brazilian Mammals from the Brazilian Mastozoological Society). They can be appended to the names that were considered “Exact”. This is the last step of the “Full species” inspection.

sp_whitelist <- list_sp |>
  dplyr::filter(match_type_4 == "Exact") |>
  dplyr::pull(query) |>
  append(c("Guerlinguetus brasiliensis", "Guerlinguetus ingrami")) |> # manually add species that we know are valid but that the API does not match exactly
  unique() |>
  sort()

head(sp_whitelist)

[1] "Alouatta macconnelli" "Ameiva ameiva"        "Aotus nigriceps"     
[4] "Aramides saracura"    "Ardea alba"           "Ateles chamek"

5.2.2.2 Imprecise taxa

First, we have to keep the terms that were not part of the full species query, since every term that was not considered a full species still has to be evaluated.

non_species_all_check <- purrr::map2(
  sp_full,
  species_all_check,
  function(x, y) {
    x |>
      dplyr::distinct(Species) |>
      dplyr::mutate(Species = stringr::str_squish(Species)) |>
      dplyr::pull(Species) |>
      setdiff(y)
  }
) |>
  purrr::compact()

head(non_species_all_check[[1]])

[1] "Cavia sp."      "Leopardus sp."  "Didelphis sp."  "Dasyprocta sp."
[5] "Dasypus sp."    "Mammalia"
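
The two ideas in this chunk, setdiff() to keep what is left over and purrr::compact() to drop datasets with nothing left, can be seen on a toy example. The entries below are made up, and Filter() is used as a base R stand-in for purrr::compact().

```r
all_terms <- list(
  Example1 = c("Cavia sp.", "Cerdocyon thous", "Mammalia"),
  Example2 = c("Ardea alba", "Cerdocyon thous")
)
full_species <- list(
  Example1 = "Cerdocyon thous",
  Example2 = c("Ardea alba", "Cerdocyon thous")
)

# terms of each dataset that were not treated as full species
leftover <- Map(setdiff, all_terms, full_species)

# drop datasets with no leftover terms (here Example2)
leftover <- Filter(length, leftover)
leftover
```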

We follow the same approach as for the full species, this time for the terms that are not full species. At the end, we create a data frame with the API results for all of these terms, including queries that involve terms such as “NI” or “spp”.

list_non_sp_with_errors <- list()

for (dataset in names(non_species_all_check)) {
  species <- non_species_all_check[[dataset]]

  message(stringr::str_glue("Starting dataset {dataset}"))

  for (sp in species) {
    sp_ <- sp |>
      stringr::str_remove_all("[[:punct:]]") |>
      stringr::str_replace_all(pattern = " ", replacement = "_")

    result <- httr::GET(stringr::str_glue(
      "https://verifier.globalnames.org/api/v1/verifications/{sp_}?data_sources=3"
    )) # API address for the verification request

    list_non_sp_with_errors[[dataset]][[sp_]] <- jsonlite::fromJSON(rawToChar(
      result$content
    ))[["names"]] # keep only the "names" element, stored by dataset and term
  }
  # bind the per-term results into a single data frame
  list_non_sp_with_errors[[dataset]][[
    "all_results"
  ]] <- list_non_sp_with_errors[[dataset]] |>
    dplyr::bind_rows() |>
    tibble::as_tibble()
}

non_sp_with_errors <- list_non_sp_with_errors |>
  purrr::map("all_results") |>
  dplyr::bind_rows(.id = "dataset") |>
  janitor::clean_names() |>
  dplyr::mutate(query = stringr::str_replace_all(name, "_", " "), .after = name)

head(non_sp_with_errors)

# A tibble: 6 × 11
  dataset  id          name  query cardinality match_type best_result$dataSour…¹
  <chr>    <chr>       <chr> <chr>       <int> <chr>                       <int>
1 Example1 252d16aa-b… Cavi… Cavi…           0 Exact                           3
2 Example1 12eef028-a… Leop… Leop…           0 Exact                           3
3 Example1 79ee7bf1-e… Dide… Dide…           0 Exact                           3
4 Example1 0d57b139-b… Dasy… Dasy…           0 Exact                           3
5 Example1 1f7764d3-8… Dasy… Dasy…           0 Exact                           3
6 Example1 9fd2fe86-2… Mamm… Mamm…           1 Exact                           3
# ℹ abbreviated name: ¹best_result$dataSourceId
# ℹ 30 more variables: best_result$dataSourceTitleShort <chr>, $curation <chr>,
#   $recordId <chr>, $outlink <chr>, $entryDate <chr>, $sortScore <dbl>,
#   $matchedNameID <chr>, $matchedName <chr>, $matchedCardinality <int>,
#   $matchedCanonicalSimple <chr>, $matchedCanonicalFull <chr>,
#   $currentRecordId <chr>, $currentNameId <chr>, $currentName <chr>,
#   $currentCardinality <int>, $currentCanonicalSimple <chr>, …
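
The cleaning step applied to each term before the request is worth looking at in isolation, because removing all punctuation also merges names separated by a slash. The sketch below reproduces the same two substitutions in base R on made-up entries; the function name clean_query is hypothetical.

```r
# Hypothetical helper mirroring the two stringr calls in the loop above
clean_query <- function(x) {
  x <- gsub("[[:punct:]]", "", x) # drop ".", "(", ")", "/" and other punctuation
  gsub(" ", "_", x, fixed = TRUE) # spaces become underscores
}

clean_query(c("Cavia sp.", "Mazama (juvenile)", "Puma concolor/Panthera onca"))
```

The last entry becomes Puma_concolorPanthera_onca, so entries containing a slash may be better split into separate queries before cleaning.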

The last step is to put together a full list of problems/errors, regardless of whether they come from full species or imprecise taxa. In this step we use sp_whitelist to exclude the terms that we consider correct.

sp_with_errors |>
  dplyr::select(
    dataset,
    query,
    matched_canonical_simple,
    match_type = match_type_4
  ) |>
  dplyr::filter(!query %in% sp_whitelist) |>
  dplyr::bind_rows(non_sp_with_errors) |>
  dplyr::arrange(dataset)

# A tibble: 177 × 12
   dataset  query      matched_canonical_si…¹ match_type id    name  cardinality
   <chr>    <chr>      <chr>                  <chr>      <chr> <chr>       <int>
 1 Example1 Não ident… <NA>                   NoMatch    <NA>  <NA>           NA
 2 Example1 Subulo go… <NA>                   NoMatch    <NA>  <NA>           NA
 3 Example1 Cavia sp   <NA>                   Exact      252d… Cavi…           0
 4 Example1 Leopardus… <NA>                   Exact      12ee… Leop…           0
 5 Example1 Didelphis… <NA>                   Exact      79ee… Dide…           0
 6 Example1 Dasyproct… <NA>                   Exact      0d57… Dasy…           0
 7 Example1 Dasypus sp <NA>                   Exact      1f77… Dasy…           0
 8 Example1 Mammalia   <NA>                   Exact      9fd2… Mamm…           1
 9 Example1 Aramides … <NA>                   Exact      2c8c… Aram…           0
10 Example1 Cricetidae <NA>                   Exact      5561… Cric…           1
# ℹ 167 more rows
# ℹ abbreviated name: ¹matched_canonical_simple
# ℹ 5 more variables: best_result <df[,27]>, data_sources_num <int>,
#   data_sources_ids <list>, curation <chr>, best_results <list>