space_similarity
Estimate pairwise similarities of phenotype spaces
Usage
space_similarity(
formula,
data,
cores = 1,
method = "mcp.overlap",
pb = TRUE,
outliers = 0.95,
pairwise.scale = FALSE,
distance.method = "Euclidean",
seed = NULL,
...
)
Arguments
- formula
An object of class "formula" (or one that can be coerced to that class). Must follow the form group ~ dim1 + dim2, where dim1 and dim2 are the dimensions of the phenotype space and group refers to the group labels.
- data
Data frame containing columns for the dimensions of the phenotypic space (numeric) and a categorical or factor column with group labels.
- cores
Numeric vector of length 1. Controls whether parallel computing is applied by specifying the number of cores to be used. Default is 1 (i.e. no parallel computing).
- method
Character vector of length 1. Controls the (dis)similarity metric used to compare the phenotypic sub-spaces of two groups at a time. Seven built-in metrics are available, which quantify either pairwise sub-space overlap ('similarity') or pairwise distance between bi-dimensional sub-spaces ('dissimilarity'):
density.overlap: proportion of the phenotypic sub-space areas that overlap, taking into account the irregular densities of the sub-spaces. Two groups that share their higher-density areas will be more similar than groups that only share their lower-density areas. Two values are supplied, as the proportion of the space of A that overlaps B is not necessarily the same as the proportion of B that overlaps A. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 6 observations.
mean.density.overlap: similar to 'density.overlap' but the two values are merged into a single pairwise mean overlap. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 6 observations.
mcp.overlap: proportion of the phenotypic sub-space areas that overlap, in which areas are calculated as the minimum convex polygon of all observations for each sub-space. Two values are supplied, as the proportion of the space of A that overlaps B is not necessarily the same as the proportion of B that overlaps A. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 5 observations.
mean.mcp.overlap: similar to 'mcp.overlap' but the two values are merged into a single pairwise mean overlap. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 5 observations.
proportional.overlap: proportion of the joint area of both sub-spaces that overlaps (overlapped area / total area of both groups). Sub-space areas are calculated as the minimum convex polygon. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 5 observations.
distance: mean Euclidean pairwise distance between all observations of the compared sub-spaces. Dissimilarity metric (higher values mean less similar). The minimum sample size (per group) is 1 observation.
centroid.distance: Euclidean distance between the centroids of the compared sub-spaces. Dissimilarity metric (higher values mean less similar). The minimum sample size (per group) is 1 observation.
In addition, machine learning classification models can be used to quantify dissimilarity as a measure of how discriminable two groups are. These models can use more than two dimensions to represent phenotypic spaces. The following classification models can be used: "AdaBag", "avNNet", "bam", "C5.0", "C5.0Cost", "C5.0Rules", "C5.0Tree", "gam", "gamLoess", "glmnet", "glmStepAIC", "kernelpls", "kknn", "lda", "lda2", "LogitBoost", "msaenet", "multinom", "nnet", "null", "ownn", "parRF", "pcaNNet", "pls", "plsRglm", "pre", "qda", "randomGLM", "rf", "rFerns", "rocc", "rotationForest", "rotationForestCp", "RRF", "RRFglobal", "sda", "simpls", "slda", "smda", "snn", "sparseLDA", "svmLinear2", "svmLinearWeights", "treebag", "widekernelpls" and "wsrf". See https://topepo.github.io/caret/train-models-by-tag.html for details on each of these models. Additional arguments can be passed using '...'. Note that some machine learning methods can significantly affect com
- pb
Logical argument to control if progress bar is shown. Default is TRUE.
- outliers
Numeric vector of length 1. A value between 0 and 1 controlling the proportion of outlier observations to be excluded. Outliers are determined as those farthest away from the sub-space centroid. Ignored when using machine learning methods.
- pairwise.scale
Logical argument to control if pairwise phenotypic spaces are scaled (i.e. z-transformed) prior to similarity estimation. If TRUE, similarities are decoupled from the size of the global phenotypic space, which is useful for comparing similarities coming from different phenotypic spaces. Default is FALSE. Not available for 'density.overlap', 'mean.density.overlap' or any machine learning model.
- distance.method
Character vector of length 1 indicating the method to be used for measuring distances (hence only applicable when distances are calculated). Available distance measures are: "Euclidean" (default), "Manhattan", "supremum", "Canberra", "Wave", "divergence", "Bray", "Soergel", "Podani", "Chord", "Geodesic" and "Whittaker". If a similarity measure is used similarities are converted to distances.
- seed
Integer used to set the random number generator (RNG) state so that results from stochastic machine learning methods are replicable.
- ...
Additional arguments to be passed to train.
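The two distance-based metrics described under the method argument can be illustrated with a minimal base R sketch. This is not the package's internal code: the toy data frame and the object names (toy, a, b, centroid_dist, mean_pairwise_dist) are made up for illustration, and the sketch assumes that 'distance' averages over between-group pairs of observations only.

```r
# Toy data: two groups in a two-dimensional space (values are invented)
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  dim1 = c(0, 1, 2, 4, 5, 6),
  dim2 = c(0, 1, 0, 3, 4, 3)
)

# Coordinates of each group as numeric matrices
a <- as.matrix(toy[toy$group == "A", c("dim1", "dim2")])
b <- as.matrix(toy[toy$group == "B", c("dim1", "dim2")])

# 'centroid.distance' analogue: Euclidean distance between group centroids
centroid_dist <- sqrt(sum((colMeans(a) - colMeans(b))^2))

# 'distance' analogue (assumption: between-group pairs only):
# mean Euclidean distance over every A-B pair of observations
all_d <- as.matrix(dist(rbind(a, b), method = "euclidean"))
cross_d <- all_d[seq_len(nrow(a)), nrow(a) + seq_len(nrow(b))]
mean_pairwise_dist <- mean(cross_d)

centroid_dist
mean_pairwise_dist
```

In this sketch the mean pairwise distance can never be smaller than the centroid distance, which is one reason the two metrics can rank group pairs differently when groups differ in spread.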
Value
A data frame containing the similarity metric for each pair of groups. If the similarity metric is not symmetric (e.g. the proportional area of A that overlaps B is not necessarily the same as the area of B that overlaps A, see space_similarity), separate columns are supplied for the two comparisons.
Details
The function quantifies pairwise similarity between phenotypic sub-spaces. The built-in methods quantify similarity as the overlap (similarity, or machine learning based discriminability) or distance (dissimilarity) between groups. Machine learning methods implemented in the caret package function train are available to assess the similarity of spaces as the proportion of observations that are incorrectly classified. In this case group overlaps are the class-wise errors (if available), while the mean overlap is calculated as 1 - model accuracy.
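The relationship between classification results and overlap described above can be sketched in base R. The labels below are hypothetical and no model is fitted; the sketch only shows the arithmetic of turning a confusion matrix into a mean overlap (1 - accuracy) and class-wise errors, under the assumption that class-wise error means the proportion of each group's observations that are misclassified.

```r
# Hypothetical true and predicted group labels (not produced by a real model)
truth <- factor(c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"))
pred <- factor(
  c("G1", "G1", "G1", "G2", "G2", "G2", "G1", "G1"),
  levels = levels(truth)
)

# Confusion matrix: rows are true groups, columns are predicted groups
conf_mat <- table(truth = truth, predicted = pred)

# Mean overlap as 1 - accuracy
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
mean_overlap <- 1 - accuracy

# Class-wise error (assumed definition): share of each true group misclassified
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)

mean_overlap
class_error
```

Two groups that a classifier separates perfectly would give an overlap of 0 under this arithmetic, while groups that cannot be told apart push the overlap toward the error rate expected by chance.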
References
Araya-Salas, M. & K. Odom. 2022. PhenotypeSpace: an R package to quantify and compare phenotypic trait spaces. R package version 0.1.0.
Author
Marcelo Araya-Salas (marcelo.araya@ucr.ac.cr)
Examples
{
# load data
data("example_space")
# get proportion of space that overlaps
prop_overlaps <- space_similarity(
formula = group ~ dimension_1 + dimension_2,
data = example_space,
method = "proportional.overlap")
# get symmetric triangular matrix
rectangular_to_triangular(prop_overlaps)
# get minimum convex polygon overlap for each group (non-symmetric)
mcp_overlaps <- space_similarity(
formula = group ~ dimension_1 + dimension_2,
data = example_space,
method = "mcp.overlap")
# convert to non-symmetric triangular matrix
rectangular_to_triangular(mcp_overlaps, symmetric = FALSE)
# check available distance measures
summary(proxy::pr_DB)
# get Euclidean distances (default)
area_dist <- space_similarity(
formula = group ~ dimension_1 + dimension_2,
data = example_space,
method = "distance",
distance.method = "Euclidean")
# get Canberra distances
area_dist <- space_similarity(
formula = group ~ dimension_1 + dimension_2,
data = example_space,
method = "distance",
distance.method = "Canberra")
## using machine learning classification methods
# check if caret package and needed dependencies are available
rlang::check_installed("caret")
rlang::check_installed("randomForest")
# random forest 3 dimension data, using 5 repeats and repeated CV resampling
# extract data subset
sub_data <- example_space[example_space$group %in% c("G1", "G2", "G3"), ]
# set method parameters
ctrl <- caret::trainControl(method = "repeatedcv", repeats = 5)
# get similarities ("overlap")
space_similarity(
formula = group ~ dimension_1 + dimension_2 + dimension_3,
data = sub_data,
method = "rf",
trControl = ctrl,
tuneLength = 4,
seed = 123
)
# Single C5.0 Tree using boot resampling
ctrl <- caret::trainControl(method = "boot")
space_similarity(
formula = group ~ dimension_1 + dimension_2,
data = sub_data,
method = "C5.0Tree",
trControl = ctrl,
tuneLength = 3
)
}
#> Loading required package: lattice
#> group.1 group.2 overlap
#> 1 G1 G2 0.009488302
#> 2 G1 G3 0.046777361
#> 3 G2 G3 0.136665576
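Because each built-in metric has a minimum per-group sample size (listed under the method argument), it can help to check group sizes before calling the function. The following is a minimal base R sketch on invented group labels; min_n, toy_groups and too_small are illustrative names and are not part of the package.

```r
# Minimum per-group sample sizes as stated in the method argument description
min_n <- c(
  density.overlap = 6, mean.density.overlap = 6,
  mcp.overlap = 5, mean.mcp.overlap = 5, proportional.overlap = 5,
  distance = 1, centroid.distance = 1
)

# Invented group labels: 7, 5 and 2 observations per group
toy_groups <- c(rep("G1", 7), rep("G2", 5), rep("G3", 2))
n_per_group <- table(toy_groups)

# Groups with too few observations for the chosen metric
too_small <- names(n_per_group)[n_per_group < min_n[["mcp.overlap"]]]
too_small
```

With real data, the same check could be run on the grouping column of the data frame passed to the data argument.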