remove_duplicates
removes duplicate observations, previously labeled with find_duplicates, from metadata data frames obtained with suwo queries.
Arguments
- metadata
Data frame with possible duplicates labeled by the function find_duplicates. The data frame must have the column 'duplicate_group' returned by find_duplicates.
- same_repo
Logical argument indicating whether only one of the observations labeled as duplicates that belong to the same repository should be retained. Default is FALSE, in which case all of them are retained. If TRUE, only one of the duplicated observations from the same repository will be retained in the output data frame. The default is useful because observations from the same repository can be expected not to be true duplicates (e.g. different recordings uploaded to Xeno-Canto with the same date, time and location by the same user), but rather observations that have not been documented with enough precision to be told apart.
- cores
Numeric vector of length 1. Controls whether parallel computing is applied by specifying the number of cores to be used. Default is 1 (i.e. no parallel computing). Can be set globally for the current R session via the "mc.cores" option (e.g. options(mc.cores = 2)). Note that some repositories might not support parallel queries from the same IP address, as these might be identified as a denial-of-service attack.
- pb
Logical argument to control whether a progress bar is shown. Default is TRUE. Can be set globally for the current R session via the "pb" option (e.g. options(pb = TRUE)).
- repo_priority
Character vector indicating the priority of repositories when selecting which observation to retain when duplicates are found. Default is c("Xeno-Canto", "GBIF", "iNaturalist", "Macaulay Library", "Wikiaves", "Observation"), which gives priority to repositories in which media downloading is more straightforward (Xeno-Canto and GBIF).
Value
A single data frame containing the subset of rows of 'metadata' retained after duplicate removal: observations that were not labeled as duplicates, plus the observation(s) retained from each group of duplicates.
Details
This function removes duplicate observations identified with the function find_duplicates. When duplicates are found, one observation from each group of duplicates is retained in the output data frame. However, if multiple observations from the same repository are labeled as duplicates, by default (same_repo = FALSE) all of them are retained in the output data frame. This is useful because observations from the same repository can be expected not to be true duplicates (e.g. different recordings uploaded to Xeno-Canto with the same date, time and location by the same user), but rather observations that have not been documented with enough precision to be told apart. This behavior can be modified: if same_repo = TRUE, only one of the duplicated observations from the same repository will be retained in the output data frame. The function gives priority to repositories in which media downloading is more straightforward (Xeno-Canto and GBIF), but this can be modified with the argument 'repo_priority'.
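The selection rule described above can be illustrated with a toy base R sketch. This is not the suwo implementation, only one possible reading of the rule. The helper name dedup_sketch, the 'key' column, and the toy data are hypothetical. The sketch also assumes that the repository name is stored in a column called 'repository' and that observations outside any duplicate group carry NA in 'duplicate_group'; neither detail is stated on this page.

```r
# Toy sketch of the selection rule; NOT the suwo implementation.
dedup_sketch <- function(metadata,
                         same_repo = FALSE,
                         repo_priority = c("Xeno-Canto", "GBIF", "iNaturalist",
                                           "Macaulay Library", "Wikiaves",
                                           "Observation")) {
  # Assumption: rows outside any duplicate group have NA in 'duplicate_group'
  is_dup <- !is.na(metadata$duplicate_group)
  keep <- !is_dup

  for (g in unique(metadata$duplicate_group[is_dup])) {
    idx <- which(is_dup & metadata$duplicate_group == g)
    # Position of each row's repository in the priority vector;
    # repositories that are not listed are ranked last
    prio <- match(metadata$repository[idx], repo_priority,
                  nomatch = length(repo_priority) + 1L)
    best <- idx[prio == min(prio)]
    # same_repo = TRUE keeps a single row; FALSE keeps every row that
    # comes from the top-priority repository within the group
    if (same_repo) best <- best[1]
    keep[best] <- TRUE
  }

  metadata[keep, , drop = FALSE]
}

# Hypothetical toy data: group 1 has two Xeno-Canto rows and one GBIF row,
# group 2 has one iNaturalist row and one GBIF row, row "d" is not a duplicate
toy <- data.frame(
  key = c("a", "b", "c", "d", "e", "f"),
  repository = c("Xeno-Canto", "Xeno-Canto", "GBIF",
                 "GBIF", "iNaturalist", "GBIF"),
  duplicate_group = c(1, 1, 1, NA, 2, 2),
  stringsAsFactors = FALSE
)

dedup_sketch(toy)$key                                        # "a" "b" "d" "f"
dedup_sketch(toy, same_repo = TRUE)$key                      # "a" "d" "f"
dedup_sketch(toy, repo_priority = c("GBIF", "Xeno-Canto"))$key  # "c" "d" "f"
```

With the default settings both Xeno-Canto rows of group 1 survive, since they come from the same top-priority repository. Setting same_repo = TRUE reduces them to one, and changing repo_priority shifts the choice to the GBIF row instead.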
Author
Marcelo Araya-Salas (marcelo.araya@ucr.ac.cr)
Examples
if (FALSE) { # \dontrun{
# get metadata from 2 repos
gb <- query_gbif(species = "Turdus rufiventris", format = "sound")
xc <- query_xenocanto(species = "Turdus rufiventris")
# combine metadata
merged_metadata <- merge_metadata(xc, gb)
# find duplicates
label_dup_metadata <- find_duplicates(metadata = merged_metadata)
# remove duplicates
dedup_metadata <- remove_duplicates(label_dup_metadata)
} # }