Standardize community data matrix
mm_standardize.Rd
This function standardizes a given data matrix using different methods such as total sum scaling, max normalization, frequency scaling, standardization, presence-absence transformation, chi-square transformation, Hellinger transformation, log transformation, and others.
Arguments
- data
A numeric matrix or data frame to be standardized.
- method
A character string specifying the standardization method (see Details). Available methods are:
"total": Divides each entry by the total sum in the given margin.
"max": Divides each entry by the maximum value in the given margin.
"frequency": Frequency transformation.
"normalize": Normalization by Euclidean norm.
"range": Standardizes by range (min-max scaling).
"rank": Converts values to ranks.
"rrank": Relative rank transformation.
"standardize": Standardization (z-score normalization).
"pa": Presence-absence transformation (binary).
"chi.square": Chi-square standardization.
"hellinger": Hellinger transformation.
"log": Log transformation.
"clr": Centered log-ratio transformation.
"rclr": Robust centered log-ratio transformation.
"alr": Additive log-ratio transformation.
- margin
An integer specifying the margin for standardization:
1: Rows
2: Columns
- range_global
A matrix specifying the range for standardization (optional, used with the "range" method).
- logbase
The base for logarithmic transformation (default is 2).
- na.rm
Logical. If TRUE, missing values (NA) are removed before calculations.
- ...
Additional arguments passed to transformation functions.
Details
The function provides the following standardization methods for community data:
"total": Divides by margin total (default margin = 1).
"max": Divides by margin maximum (default margin = 2).
"frequency": Divides by margin total and multiplies by the number of non-zero items, so that the average of non-zero entries is one (Oksanen 1983; default margin = 2).
"normalize": Scales data so that the sum of squares along the specified margin equals one (default margin = 1).
"range": Standardizes values into the range [0, 1] (default margin = 2). If all values are constant, they will be transformed to 0.
"rank", "rrank": "rank" replaces abundance values by their increasing ranks, leaving zeros unchanged. "rrank" is similar but uses relative ranks with a maximum of 1 (default margin = 1).
"standardize": Scales x to zero mean and unit variance (default margin = 2).
"pa": Converts x to presence/absence scale (0/1).
"chi.square": Divides by row sums and the square root of column sums, then adjusts for the square root of the matrix total (Legendre & Gallagher 2001). When used with Euclidean distance, the distances should be similar to Chi-square distances in correspondence analysis (default margin = 1).
"hellinger": Computes the square root of method = "total" (Legendre & Gallagher 2001).
"log": Logarithmic transformation suggested by Anderson et al. (2006): $$\log_b (x) + 1$$ for \(x > 0\), where \(b\) is the base of the logarithm. Zeros remain unchanged. Higher bases give less weight to quantities and more to presences.
"alr": Additive log ratio (ALR) transformation (Aitchison 1986). Reduces skewness and compositional bias. Requires positive values; pseudocounts can be added. The transformation is defined as: $$alr = [\log(x_1 / x_D), ..., \log(x_{D-1} / x_D)]$$ where the denominator sample \(x_D\) can be chosen arbitrarily.
"clr": Centered log ratio (CLR) transformation (Aitchison 1986). Common in microbial ecology (Gloor et al. 2017). Only supports positive data; pseudocounts can be used to handle zeros. The transformation is defined as: $$clr = \log(x / g(x)) = \log x - \log g(x)$$ where \(x\) is a single value, and \(g(x)\) is the geometric mean of \(x\).
"rclr": Robust CLR transformation. Unlike CLR, this method allows zeros without requiring pseudocounts. It divides values by the geometric mean of observed (non-zero) features, preserving zeros (Martino et al. 2019). The transformation is defined as: $$rclr = \log(x / g(x > 0))$$ where \(x\) is a single value, and \(g(x > 0)\) is the geometric mean of sample-wide values \(x\) that are positive (\(x > 0\)).
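Several of the definitions above can be written out in a few lines of base R. The block below is only a minimal sketch on a made-up matrix, not the code used by mm_standardize; the object names and the pseudocount of 1 for "clr" are assumptions chosen for illustration.

```r
# Toy community matrix: rows are sites, columns are species (made-up counts).
x <- matrix(c(4, 0, 6,
              1, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("site1", "site2"), c("sp1", "sp2", "sp3")))

# "total" with margin = 1: divide each row by its row sum.
total <- sweep(x, 1, rowSums(x), "/")

# "hellinger": square root of the "total" result.
hellinger <- sqrt(total)

# "log" (Anderson et al. 2006): log_b(x) + 1 for x > 0, zeros unchanged.
logbase <- 2
logged <- x
logged[x > 0] <- log(x[x > 0], base = logbase) + 1

# "clr" needs strictly positive data, so a pseudocount is added first.
# The value 1 is an arbitrary choice for this sketch.
xp <- x + 1
clr <- log(xp) - rowMeans(log(xp))  # log x - log g(x), row by row

# "rclr": the geometric mean uses only the positive entries of each row,
# and zeros are kept as zeros.
rclr <- t(apply(x, 1, function(row) {
  pos <- row > 0
  out <- numeric(length(row))
  out[pos] <- log(row[pos]) - mean(log(row[pos]))
  out
}))
dimnames(rclr) <- dimnames(x)
```

In this sketch the rows of total sum to one, the rows of clr sum to zero, and every zero in x is still zero in logged and rclr.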
Standardization, as contrasted to transformation, means that the entries are transformed relative to other entries.
All methods have a default margin. margin=1 means rows (sites in a normal data set) and margin=2 means columns (species in a normal data set).
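The effect of the margin can be seen by applying the same operation along rows and along columns. This is a base R sketch with invented numbers; it imitates the "total" method with sweep() and is not the package's own implementation.

```r
# Two sites (rows) by three species (columns), made-up counts.
x <- matrix(c(2, 2, 0,
              1, 3, 4),
            nrow = 2, byrow = TRUE)

# margin = 1: each row (site) is divided by its own total.
by_rows <- sweep(x, 1, rowSums(x), "/")

# margin = 2: each column (species) is divided by its own total.
by_cols <- sweep(x, 2, colSums(x), "/")

rowSums(by_rows)  # every site sums to 1
colSums(by_cols)  # every species sums to 1
```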
The vegan command wisconsin is a shortcut to the common Wisconsin double standardization, where species (margin=2) are first standardized by maxima (max) and then sites (margin=1) by site totals (total).
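The two steps of the Wisconsin double standardization can be sketched in base R as follows. The matrix is invented, and the code only mirrors the description above rather than reproducing any package function.

```r
# Two sites (rows) by three species (columns), made-up counts.
x <- matrix(c(10, 0, 5,
               2, 4, 1),
            nrow = 2, byrow = TRUE)

# Step 1: species (columns) are divided by their maxima.
by_max <- sweep(x, 2, apply(x, 2, max), "/")

# Step 2: sites (rows) are divided by their totals.
wis <- sweep(by_max, 1, rowSums(by_max), "/")

wis
```

After the second step every site sums to one, while the first step has already put all species on a common 0 to 1 scale.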
Most standardization methods will give nonsense results with negative data entries, which normally should not occur in community data. If there are empty sites or species (or constant ones with method = "range"), many standardizations will change these into NaN.
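The NaN behaviour for empty sites follows directly from dividing zero by zero. The base R sketch below uses an invented matrix with one empty site and shows one possible workaround, dropping empty rows before standardizing; whether that is appropriate depends on the analysis.

```r
# The second site has no records at all.
x <- matrix(c(3, 1,
              0, 0),
            nrow = 2, byrow = TRUE)

# "total" by rows: the empty site becomes 0 / 0, which is NaN in R.
total <- sweep(x, 1, rowSums(x), "/")
is.nan(total[2, ])

# One option: keep only sites with at least one record.
keep <- rowSums(x) > 0
total_ok <- sweep(x[keep, , drop = FALSE], 1, rowSums(x)[keep], "/")
```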
The vegan function decobackstand can be used to transform standardized data back to the original. This is not possible for all standardizations and may not be implemented for all cases where it would be possible. There are round-off errors and back-transformation is not exact, so it is wise not to overwrite the original data. With zap=TRUE, original zeros should be exact.
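The idea behind back-transformation can be illustrated for the "total" method in base R. This sketch assumes the row totals were stored by hand before standardizing; it does not show how decobackstand itself keeps or uses that information.

```r
# Made-up counts; the row totals are saved before standardizing.
x <- matrix(c(3, 0, 7,
              1, 1, 1),
            nrow = 2, byrow = TRUE)
totals <- rowSums(x)

std <- sweep(x, 1, totals, "/")      # forward: divide by row totals
back <- sweep(std, 1, totals, "*")   # backward: multiply by the same totals

# Compare with a tolerance, because floating-point round-off can make
# the recovered values differ slightly from the originals.
all.equal(back, x)
```

Comparing with all.equal() rather than identical() reflects the round-off caveat above, and keeping x untouched follows the advice not to overwrite the original data.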
Note
This function is adapted from the decostand function in the vegan R package, with modifications for improved handling.
References
Aitchison, J. (1986) The Statistical Analysis of Compositional Data. London, UK: Chapman & Hall.
Anderson, M.J., Ellingsen, K.E. & McArdle, B.H. (2006) Multivariate dispersion as a measure of beta diversity. Ecology Letters 9, 683–693.
Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G. & Barceló-Vidal, C. (2003) Isometric logratio transformations for compositional data analysis. Mathematical Geology 35, 279–300.
Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V. & Egozcue, J.J. (2017) Microbiome Datasets Are Compositional: And This Is Not Optional. Frontiers in Microbiology 8, 2224.
Legendre, P. & Gallagher, E.D. (2001) Ecologically meaningful transformations for ordination of species data. Oecologia 129, 271–280.
Martino, C., Morton, J.T., Marotz, C.A., Thompson, L.R., Tripathi, A., Knight, R. & Zengler, K. (2019) A novel sparse compositional technique reveals microbial perturbations. mSystems 4, 1.
Oksanen, J. (1983) Ordination of boreal heath-like vegetation with principal component analysis, correspondence analysis and multidimensional scaling. Vegetatio 52, 181–189.
Examples
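# Example usage with sample data
library(maimer)
# Read the example camera-trap records shipped with the package
cam_data <- read.csv(system.file('penessoulou_season1.csv', package = 'maimer'))
# Build a site-by-species community matrix from the records
cam_data <- cam_data %>%
  mm_to_community(site_column = camera, species_column = species,
                  size_column = number, values_fill = 0)
# Standardize the species columns (2 to 11) by margin totals
standardized_data <- mm_standardize(data = cam_data[, 2:11], method = "total")
standardized_data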
#> # A tibble: 13 × 10
#> `Syncerus caffer` `Lepus crawshayi` `Erythrocebus patas`
#> <dbl> <dbl> <dbl>
#> 1 0.981 0 0.0169
#> 2 0 0.375 0.5
#> 3 0.430 0 0.522
#> 4 0.941 0 0
#> 5 0 0 0.0714
#> 6 0.480 0 0.341
#> 7 0 0 1
#> 8 0.247 0 0.306
#> 9 0 0 0
#> 10 0 0 0
#> 11 0 0 0
#> 12 0 0 0
#> 13 0.0240 0 0.974
#> # ℹ 7 more variables: `Tragelaphus scriptus` <dbl>,
#> # `Chlorocebus aethiops` <dbl>, `Canis adustus` <dbl>,
#> # `Mellivora capensis` <dbl>, `Sylvicapra grimmia` <dbl>,
#> # `Thryonomys swinderianus` <dbl>, `Genetta genetta` <dbl>