remove_duplicates removes duplicate observations from metadata data frames obtained from suwo queries.

Usage

remove_duplicates(
  metadata,
  same_repo = FALSE,
  cores = getOption("mc.cores", 1),
  pb = getOption("pb", TRUE),
  repo_priority = c("Xeno-Canto", "GBIF", "iNaturalist", "Macaulay Library", "Wikiaves",
    "Observation")
)

Arguments

metadata

Data frame in which possible duplicates have been labeled with the function find_duplicates. The data frame must contain the column 'duplicate_group' returned by find_duplicates.

same_repo

Logical argument indicating whether observations labeled as duplicates that belong to the same repository should be removed. Default is FALSE, in which case all of them are retained. Retaining them is useful because observations from the same repository are often not true duplicates (e.g. different recordings uploaded to Xeno-Canto with the same date, time and location by the same user), but rather have not been documented with enough precision to be told apart. If TRUE, only one of the duplicated observations from the same repository will be retained in the output data frame.

cores

Numeric vector of length 1. Controls whether parallel computing is applied by specifying the number of cores to be used. Default is 1 (i.e. no parallel computing). Can be set globally for the current R session via the "mc.cores" option (e.g. options(mc.cores = 2)). Note that some repositories might not support parallel queries from the same IP address, as they might be identified as a denial-of-service attack.

pb

Logical argument to control whether a progress bar is shown. Default is TRUE. Can be set globally for the current R session via the "pb" option (e.g. options(pb = TRUE)).

repo_priority

Character vector indicating the priority of repositories when selecting which observation to retain when duplicates are found. Default is c("Xeno-Canto", "GBIF", "iNaturalist", "Macaulay Library", "Wikiaves", "Observation"), which gives priority to repositories in which media downloading is more straightforward (Xeno-Canto and GBIF).
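The defaults of 'cores' and 'pb' are resolved with getOption, as shown in the Usage section, so setting the corresponding options changes them for the whole session. The snippet below is a minimal base R sketch of that lookup; the variable names are only illustrative and the function itself is not called.

```r
# Set the session-wide options and keep the previous values
old_opts <- options(mc.cores = 2, pb = FALSE)

# These are the expressions used as argument defaults in the Usage section
cores_default <- getOption("mc.cores", 1)
pb_default <- getOption("pb", TRUE)

print(cores_default) # 2
print(pb_default)    # FALSE

# Restore the previous values
options(old_opts)
```

Arguments passed explicitly in the call take precedence over these session-wide options.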

Value

A single data frame containing the subset of rows in 'metadata' that remain after duplicated observations are removed.

Details

This function removes duplicate observations identified with the function find_duplicates. When duplicates are found, one observation from each group of duplicates is retained in the output data frame. The function gives priority to repositories in which media downloading is more straightforward (Xeno-Canto and GBIF), but this can be modified with the argument 'repo_priority'. However, if multiple observations from the same repository are labeled as duplicates, by default (same_repo = FALSE) all of them are retained in the output data frame. This is useful because observations from the same repository are often not true duplicates (e.g. different recordings uploaded to Xeno-Canto with the same date, time and location by the same user), but rather have not been documented with enough precision to be told apart. If same_repo = TRUE, only one of the duplicated observations from the same repository is retained in the output data frame.
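The selection rule described above can be illustrated with a small base R sketch. This is not the package's implementation: sketch_remove_duplicates, the toy data, and the column names 'key' and 'repository' are hypothetical. It also assumes that rows not flagged as duplicates have NA in 'duplicate_group', and that when a group contains several observations from the highest-priority repository, those are the ones affected by 'same_repo'.

```r
# Toy metadata; 'key' and 'repository' are assumed column names,
# 'duplicate_group' mimics the column returned by find_duplicates
toy <- data.frame(
  key = c("a", "b", "c", "d", "e", "f"),
  repository = c("GBIF", "Xeno-Canto", "Xeno-Canto", "iNaturalist", "GBIF", "GBIF"),
  duplicate_group = c(1, 1, 1, 2, 2, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper illustrating the priority-based selection
sketch_remove_duplicates <- function(
    metadata,
    same_repo = FALSE,
    repo_priority = c("Xeno-Canto", "GBIF", "iNaturalist",
                      "Macaulay Library", "Wikiaves", "Observation")) {
  # Rank each row by the position of its repository (lower = higher priority);
  # repositories missing from repo_priority are ranked last
  repo_rank <- match(metadata$repository, repo_priority)
  repo_rank[is.na(repo_rank)] <- length(repo_priority) + 1

  grp <- metadata$duplicate_group
  keep <- is.na(grp) # assumption: NA means the row was not flagged as a duplicate

  for (g in unique(grp[!is.na(grp)])) {
    idx <- which(grp == g)
    # Rows from the highest-priority repository present in this group
    best <- idx[repo_rank[idx] == min(repo_rank[idx])]
    # With same_repo = TRUE keep only one of them
    if (same_repo) best <- best[1]
    keep[best] <- TRUE
  }
  metadata[keep, , drop = FALSE]
}

default_keys <- sketch_remove_duplicates(toy)$key
print(default_keys) # "b" "c" "e" "f"

same_repo_keys <- sketch_remove_duplicates(toy, same_repo = TRUE)$key
print(same_repo_keys) # "b" "e" "f"

gbif_first_keys <- sketch_remove_duplicates(
  toy, repo_priority = c("GBIF", "Xeno-Canto")
)$key
print(gbif_first_keys) # "a" "e" "f"
```

In the first call both Xeno-Canto rows of group 1 survive because same_repo = FALSE; changing 'repo_priority' in the last call makes the GBIF row the one retained instead.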

Author

Marcelo Araya-Salas (marcelo.araya@ucr.ac.cr)

Examples

if (FALSE) { # \dontrun{
# get metadata from 2 repos
gb <- query_gbif(species = "Turdus rufiventris", format = "sound")
xc <- query_xenocanto(species = "Turdus rufiventris")

# combine metadata
merged_metadata <- merge_metadata(xc, gb)

# find duplicates
label_dup_metadata <- find_duplicates(metadata = merged_metadata)

# remove duplicates
dedup_metadata <- remove_duplicates(label_dup_metadata)
} # }