Title: Calculation and Interpretation of Data Cube Indicator Uncertainty
Description: This R package provides functions to explore data cubes using robustness measures and cross-validation techniques. It can also be used for uncertainty calculation using the bootstrap resampling method, and functionality is provided for efficient interpretation and visualisation of uncertainty related to indicators based on biodiversity data cubes.
Authors: Ward Langeraert [aut, cre] (<https://orcid.org/0000-0002-5900-8109>, Research Institute for Nature and Forest (INBO)), Toon Van Daele [aut] (<https://orcid.org/0000-0002-1362-853X>, Research Institute for Nature and Forest (INBO)), Research Institute for Nature and Forest (INBO) [cph], European Union's Horizon Europe Research and Innovation Programme (ID No 101059592) [fnd]
Maintainer: Ward Langeraert <[email protected]>
License: MIT + file LICENSE
Version: 0.3.0
Built: 2025-02-17 12:41:32 UTC
Source: https://github.com/b-cubed-eu/dubicube
This function adds classified effects to a dataframe as ordered factor variables by comparing the confidence intervals with a reference and thresholds.
```r
add_effect_classification(
  df,
  cl_columns,
  threshold,
  reference = 0,
  coarse = TRUE
)
```
| Argument | Description |
|---|---|
| `df` | A dataframe containing summary data of confidence limits. Two columns are required containing the lower and upper limits, indicated by the `cl_columns` argument. Other columns are optional. |
| `cl_columns` | A vector of 2 column names in `df` indicating respectively the lower and upper confidence limits (e.g. `c("lcl", "ucl")`). |
| `threshold` | A vector of either 1 or 2 thresholds. A single threshold will be transformed into `reference + c(-abs(threshold), abs(threshold))`. |
| `reference` | The null hypothesis value to compare confidence intervals against. Defaults to 0. |
| `coarse` | Logical, defaults to `TRUE`. If `TRUE`, a coarse classification is added to the dataframe as well. |
This function is a wrapper around `effectclass::classify()` and `effectclass::coarse_classification()` from the effectclass package (Onkelinx, 2023). They classify effects in a stable and transparent manner.
| Symbol | Fine effect / trend | Coarse effect / trend | Rule |
|---|---|---|---|
| `++` | strong positive effect / strong increase | positive effect / increase | confidence interval above the upper threshold |
| `+` | positive effect / increase | positive effect / increase | confidence interval above reference and contains the upper threshold |
| `+~` | moderate positive effect / moderate increase | positive effect / increase | confidence interval between reference and the upper threshold |
| `~` | no effect / stable | no effect / stable | confidence interval between thresholds and contains reference |
| `-~` | moderate negative effect / moderate decrease | negative effect / decrease | confidence interval between reference and the lower threshold |
| `-` | negative effect / decrease | negative effect / decrease | confidence interval below reference and contains the lower threshold |
| `--` | strong negative effect / strong decrease | negative effect / decrease | confidence interval below the lower threshold |
| `?+` | potential positive effect / potential increase | unknown effect / unknown | confidence interval contains reference and the upper threshold |
| `?-` | potential negative effect / potential decrease | unknown effect / unknown | confidence interval contains reference and the lower threshold |
| `?` | unknown effect / unknown | unknown effect / unknown | confidence interval contains the lower and upper threshold |
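To make the rules concrete, the following base-R sketch applies the fine classification to a single confidence interval. It is for illustration only (the package itself delegates to effectclass); the helper name `classify_fine` and the `lower`/`upper` threshold arguments are made up for this example.

```r
# Illustrative sketch of the fine classification rules in the table above
# (not the effectclass implementation). `lcl`/`ucl` are the confidence
# limits, `lower`/`upper` the thresholds around the reference.
classify_fine <- function(lcl, ucl, reference = 0, lower = -1, upper = 1) {
  if (lcl > upper)                     return("++") # strong increase
  if (lcl > reference && ucl > upper)  return("+")  # increase
  if (lcl > reference && ucl <= upper) return("+~") # moderate increase
  if (ucl < lower)                     return("--") # strong decrease
  if (ucl < reference && lcl < lower)  return("-")  # decrease
  if (ucl < reference && lcl >= lower) return("-~") # moderate decrease
  if (lcl >= lower && ucl <= upper)    return("~")  # stable
  if (lcl >= lower && ucl > upper)     return("?+") # potential increase
  if (ucl <= upper && lcl < lower)     return("?-") # potential decrease
  "?"                                               # unknown
}

classify_fine(lcl = 0.2, ucl = 1.5)  # "+"
classify_fine(lcl = -0.4, ucl = 0.6) # "~"
```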
The returned value is a modified version of the original input dataframe `df` with additional columns `effect_code` and `effect` containing respectively the effect symbols and descriptions as ordered factor variables. In case of `coarse = TRUE` (the default), the columns `effect_code_coarse` and `effect_coarse` containing the coarse classification effects are also added.
Onkelinx, T. (2023). effectclass: Classification and visualisation of effects [Computer software]. https://inbo.github.io/effectclass/
Other uncertainty functions: `bootstrap_cube()`, `calculate_bootstrap_ci()`
```r
# Example dataset
ds <- data.frame(
  mean = c(0, 0.5, -0.5, 1, -1, 1.5, -1.5, 0.5, -0.5, 0),
  sd = c(1, 0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25, 0.5)
)
ds$lcl <- qnorm(0.05, ds$mean, ds$sd)
ds$ucl <- qnorm(0.95, ds$mean, ds$sd)

add_effect_classification(
  df = ds,
  cl_columns = c("lcl", "ucl"),
  threshold = 1,
  reference = 0,
  coarse = TRUE
)
```
This function generates `samples` bootstrap replicates of a statistic applied to a data cube. It resamples the data cube and computes a statistic `fun` for each bootstrap replicate, optionally comparing the results to a reference group (`ref_group`).
```r
bootstrap_cube(
  data_cube,
  fun,
  ...,
  grouping_var,
  samples = 1000,
  ref_group = NA,
  seed = NA,
  progress = FALSE
)
```
| Argument | Description |
|---|---|
| `data_cube` | A data cube object (class 'processed_cube' or 'sim_cube', see `b3gbi::process_cube()`), or the dataframe from its `$data` slot (as used in the example below). |
| `fun` | A function which, when applied to `data_cube`, returns the statistic(s) of interest. |
| `...` | Additional arguments passed on to `fun`. |
| `grouping_var` | A string specifying the grouping variable(s) for the bootstrap analysis. The output of `fun(data_cube)` returns a row per group. |
| `samples` | The number of bootstrap replicates. A single positive integer. Default is 1000. |
| `ref_group` | A string indicating the reference group to compare the statistic with. Default is `NA`, meaning no reference group is used. |
| `seed` | A positive numeric value setting the seed for random number generation to ensure reproducibility. If `NA` (the default), no seed is used. |
| `progress` | Logical. Whether to show a progress bar. Set to `TRUE` to display a progress bar; default is `FALSE`. |
Bootstrapping is a statistical technique used to estimate the distribution of a statistic by resampling with replacement from the original data (Davison & Hinkley, 1997; Efron & Tibshirani, 1994). In the case of data cubes, each row is sampled with replacement. Below are the common notations used in bootstrapping:
Original Sample Data: $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$

The initial set of observed data points. Here, $n$ is the sample size. This corresponds to the number of cells in a data cube or the number of rows in tabular format.

Statistic of Interest: $\theta$

The parameter or statistic being estimated, such as the mean $\bar{X}$, the variance $\sigma^2$, or a biodiversity indicator. Let $\hat{\theta}$ denote the estimated value of $\theta$ calculated from the complete dataset $\mathbf{X}$.

Bootstrap Sample: $\mathbf{X}^* = \{X_1^*, X_2^*, \ldots, X_n^*\}$

A sample of size $n$ drawn with replacement from the original sample $\mathbf{X}$. Each $X_i^*$ is drawn independently from $\mathbf{X}$. A total of $B$ bootstrap samples are drawn from the original data. Common choices for $B$ are 1000 or 10,000 to ensure a good approximation of the distribution of the bootstrap replications (see further).

Bootstrap Replication: $\hat{\theta}^*_b$

The value of the statistic of interest calculated from the $b$-th bootstrap sample $\mathbf{X}^*_b$. For example, if $\theta$ is the sample mean, $\hat{\theta}^*_b = \bar{X}^*_b$.

Bootstrap Statistics:

- Bootstrap Estimate of the Statistic: the average of the bootstrap replications,
  $$\hat{\theta}_{\text{boot}} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b$$
- Bootstrap Bias: the difference between the average bootstrap estimate and the original estimate, which indicates how much the bootstrap estimate deviates from the original sample estimate,
  $$\text{Bias}_{\text{boot}} = \hat{\theta}_{\text{boot}} - \hat{\theta}$$
- Bootstrap Standard Error: the standard deviation of the bootstrap replications, which estimates the variability of the statistic,
  $$\text{SE}_{\text{boot}} = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \hat{\theta}_{\text{boot}} \right)^2}$$
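As a self-contained illustration of these quantities (independent of dubicube), the bootstrap estimate, bias, and standard error of a sample mean can be computed by resampling with replacement:

```r
# Illustrative example of the quantities defined above, with the sample
# mean as statistic of interest
set.seed(123)
x <- rnorm(50, mean = 2)                 # original sample (n = 50)
theta_hat <- mean(x)                     # estimate on the full dataset

B <- 1000
theta_star <- replicate(B, mean(sample(x, replace = TRUE)))  # replicates

est_boot  <- mean(theta_star)            # bootstrap estimate
bias_boot <- est_boot - theta_hat        # bootstrap bias
se_boot   <- sd(theta_star)              # bootstrap standard error
```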
A dataframe containing the bootstrap results with the following columns:
- `sample`: sample ID of the bootstrap replicate
- `est_original`: the statistic based on the full dataset per group
- `rep_boot`: the statistic based on a bootstrapped dataset (bootstrap replicate)
- `est_boot`: the bootstrap estimate (mean of bootstrap replicates per group)
- `se_boot`: the standard error of the bootstrap estimate (standard deviation of the bootstrap replicates per group)
- `bias_boot`: the bias of the bootstrap estimate per group
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and their Application (1st ed.). Cambridge University Press. doi:10.1017/CBO9780511802843
Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap (1st ed.). Chapman and Hall/CRC. doi:10.1201/9780429246593
Other uncertainty functions: `add_effect_classification()`, `calculate_bootstrap_ci()`
```r
# Get example data
# install.packages("remotes")
# remotes::install_github("b-cubed-eu/b3gbi")
library(b3gbi)
cube_path <- system.file(
  "extdata", "denmark_mammals_cube_eqdgc.csv",
  package = "b3gbi"
)
denmark_cube <- process_cube(
  cube_path,
  first_year = 2014,
  last_year = 2020
)

# Function to calculate statistic of interest
# Mean observations per year
mean_obs <- function(data) {
  out_df <- aggregate(obs ~ year, data, mean) # Calculate mean obs per year
  names(out_df) <- c("year", "diversity_val") # Rename columns
  return(out_df)
}
mean_obs(denmark_cube$data)

# Perform bootstrapping
bootstrap_mean_obs <- bootstrap_cube(
  data_cube = denmark_cube$data,
  fun = mean_obs,
  grouping_var = "year",
  samples = 1000,
  seed = 123,
  progress = FALSE
)
head(bootstrap_mean_obs)
```
This function calculates confidence intervals for a dataframe containing bootstrap replicates based on different methods, including percentile (`perc`), bias-corrected and accelerated (`bca`), normal (`norm`), and basic (`basic`).
```r
calculate_bootstrap_ci(
  bootstrap_samples_df,
  grouping_var,
  type = c("perc", "bca", "norm", "basic"),
  conf = 0.95,
  aggregate = TRUE,
  data_cube = NA,
  fun = NA,
  ...,
  ref_group = NA,
  jackknife = ifelse(is.element("bca", type), "usual", NA),
  progress = FALSE
)
```
| Argument | Description |
|---|---|
| `bootstrap_samples_df` | A dataframe containing the bootstrap replicates, where each row represents a bootstrap sample. As returned by `bootstrap_cube()`. |
| `grouping_var` | A string specifying the grouping variable(s) used for the bootstrap analysis. This variable is used to split the dataset into groups for separate confidence interval calculations. |
| `type` | A character vector specifying the type(s) of confidence intervals to compute. Options are `"perc"` (percentile), `"bca"` (bias-corrected and accelerated), `"norm"` (normal), and `"basic"`. Default is all four types. |
| `conf` | A numeric value specifying the confidence level of the intervals. Default is `0.95` (95% confidence level). |
| `aggregate` | Logical. If `TRUE` (default), the function returns a summarised dataframe with confidence limits per group. If `FALSE`, the confidence limits are added to the original `bootstrap_samples_df` dataframe. |
| `data_cube` | Only used when `type = "bca"`. The data cube object or dataframe used to create `bootstrap_samples_df` (see `bootstrap_cube()`). |
| `fun` | Only used when `type = "bca"`. The function used to calculate the statistic(s) of interest in `bootstrap_cube()`. |
| `...` | Additional arguments passed on to `fun`. |
| `ref_group` | Only used when `type = "bca"`. A string indicating the reference group used in `bootstrap_cube()`. Default is `NA`. |
| `jackknife` | Only used when `type = "bca"`. The jackknife algorithm used to estimate the acceleration. Default is `"usual"` (the negative jackknife) when `"bca"` is selected; see the Details section for the jackknife estimators. |
| `progress` | Logical. Whether to show a progress bar for jackknifing. Set to `TRUE` to display a progress bar; default is `FALSE`. |
We consider four different types of intervals (with confidence level $1 - 2\alpha$). The choice of confidence interval types and their calculation is in line with the boot package in R (Canty & Ripley, 1999) to ensure ease of implementation. They are based on the definitions provided by Davison & Hinkley (1997, Chapter 5) (see also DiCiccio & Efron, 1996; Efron, 1987).
Percentile: Uses the percentiles of the bootstrap distribution.

$$CI_{\text{perc}} = \left[ \hat{\theta}^*_{(\alpha)}, \; \hat{\theta}^*_{(1-\alpha)} \right]$$

where $\hat{\theta}^*_{(\alpha)}$ and $\hat{\theta}^*_{(1-\alpha)}$ are the $\alpha$ and $1-\alpha$ percentiles of the bootstrap distribution, respectively.
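As a minimal illustration (a toy sample-mean example, not the dubicube implementation), the percentile interval follows directly from `quantile()`:

```r
# Toy sample-mean example: bootstrap replicates and the percentile
# interval (alpha = 0.025 for a 95% interval)
set.seed(123)
x <- rnorm(50, mean = 2)          # original sample
theta_hat <- mean(x)              # estimate on the full data
theta_star <- replicate(1000, mean(sample(x, replace = TRUE)))

alpha <- 0.025
ci_perc <- quantile(theta_star, probs = c(alpha, 1 - alpha))
```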
Bias-Corrected and Accelerated (BCa): Adjusts the percentile interval for bias and acceleration.

Bias refers to the systematic difference between the observed statistic from the original dataset and the centre of the bootstrap distribution of the statistic. The bias-correction term is calculated as follows:

$$\hat{z}_0 = \Phi^{-1}\left( \frac{\#(\hat{\theta}^*_b < \hat{\theta})}{B} \right)$$

where $\#$ is the counting operator and $\Phi^{-1}$ the inverse cumulative density function of the standard normal distribution.

Acceleration quantifies how sensitive the variability of the statistic is to changes in the data:

- $a = 0$: the statistic's variability does not depend on the data (e.g., symmetric distribution)
- $a > 0$: small changes in the data have a large effect on the statistic's variability (e.g., positive skew)
- $a < 0$: small changes in the data have a smaller effect on the statistic's variability (e.g., negative skew)

The acceleration term is calculated as follows:

$$\hat{a} = \frac{\sum_{i=1}^{n} I_i^3}{6 \left( \sum_{i=1}^{n} I_i^2 \right)^{3/2}}$$

where $I_i$ denotes the influence of data point $x_i$ on the estimation of $\theta$. $I_i$ can be estimated using jackknifing. Examples are (1) the negative jackknife, $I_i = (n-1)(\hat{\theta} - \hat{\theta}_{-i})$, and (2) the positive jackknife, $I_i = (n+1)(\hat{\theta}_{-i} - \hat{\theta})$ (Frangos & Schucany, 1990). Here, $\hat{\theta}_{-i}$ is the estimated value leaving out the $i$-th data point $x_i$. The boot package also offers infinitesimal jackknife and regression estimation. Implementation of these jackknife algorithms can be explored in the future.

The bias and acceleration estimates are then used to calculate adjusted percentiles:

$$\alpha_1 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha})} \right), \qquad \alpha_2 = \Phi\left( \hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha})} \right)$$

So, we get

$$CI_{\text{bca}} = \left[ \hat{\theta}^*_{(\alpha_1)}, \; \hat{\theta}^*_{(\alpha_2)} \right]$$
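Continuing the toy example above, the following is a hand-rolled sketch of the BCa steps (bias-correction term, acceleration via the negative jackknife, adjusted percentiles); in practice `calculate_bootstrap_ci()` or the boot package performs this calculation:

```r
# Bias-correction term z0
z0 <- qnorm(mean(theta_star < theta_hat))

# Acceleration via the negative jackknife: influence of each observation
theta_jack <- vapply(seq_along(x), function(i) mean(x[-i]), numeric(1))
infl <- (length(x) - 1) * (theta_hat - theta_jack)
a <- sum(infl^3) / (6 * sum(infl^2)^(3 / 2))

# Adjusted percentiles and the BCa interval
z_alpha <- qnorm(c(alpha, 1 - alpha))
adj <- pnorm(z0 + (z0 + z_alpha) / (1 - a * (z0 + z_alpha)))
ci_bca <- quantile(theta_star, probs = adj)
```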
Normal: Assumes the bootstrap distribution of the statistic is approximately normal.

$$CI_{\text{norm}} = \left[ \hat{\theta} - \text{Bias}_{\text{boot}} - \text{SE}_{\text{boot}} \cdot z_{1-\alpha}, \; \hat{\theta} - \text{Bias}_{\text{boot}} + \text{SE}_{\text{boot}} \cdot z_{1-\alpha} \right]$$

where $z_{1-\alpha}$ is the $1-\alpha$ quantile of the standard normal distribution.
Basic: Centers the interval using percentiles.

$$CI_{\text{basic}} = \left[ 2\hat{\theta} - \hat{\theta}^*_{(1-\alpha)}, \; 2\hat{\theta} - \hat{\theta}^*_{(\alpha)} \right]$$

where $\hat{\theta}^*_{(\alpha)}$ and $\hat{\theta}^*_{(1-\alpha)}$ are the $\alpha$ and $1-\alpha$ percentiles of the bootstrap distribution, respectively.
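And, continuing the same toy example, the normal and basic intervals from the same replicates:

```r
# Normal and basic intervals from the bootstrap replicates `theta_star`
bias_boot <- mean(theta_star) - theta_hat   # bootstrap bias
se_boot   <- sd(theta_star)                 # bootstrap standard error
z <- qnorm(1 - alpha)

ci_norm  <- c(theta_hat - bias_boot - se_boot * z,
              theta_hat - bias_boot + se_boot * z)

ci_basic <- c(2 * theta_hat - quantile(theta_star, 1 - alpha),
              2 * theta_hat - quantile(theta_star, alpha))
```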
A dataframe containing the bootstrap results with the following columns:
- `est_original`: the statistic based on the full dataset per group
- `est_boot`: the bootstrap estimate (mean of bootstrap replicates per group)
- `se_boot`: the standard error of the bootstrap estimate (standard deviation of the bootstrap replicates per group)
- `bias_boot`: the bias of the bootstrap estimate per group
- `int_type`: the interval type
- `ll`: the lower limit of the confidence interval
- `ul`: the upper limit of the confidence interval
- `conf`: the confidence level of the interval

When `aggregate = FALSE`, the dataframe contains the columns from `bootstrap_samples_df` with one row per bootstrap replicate.
Canty, A., & Ripley, B. (1999). boot: Bootstrap Functions (Originally by Angelo Canty for S) [Computer software]. https://CRAN.R-project.org/package=boot
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and their Application (1st ed.). Cambridge University Press. doi:10.1017/CBO9780511802843
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3). doi:10.1214/ss/1032280214
Efron, B. (1987). Better Bootstrap Confidence Intervals. Journal of the American Statistical Association, 82(397), 171–185. doi:10.1080/01621459.1987.10478410
Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap (1st ed.). Chapman and Hall/CRC. doi:10.1201/9780429246593
Frangos, C. C., & Schucany, W. R. (1990). Jackknife estimation of the bootstrap acceleration constant. Computational Statistics & Data Analysis, 9(3), 271–281. doi:10.1016/0167-9473(90)90109-U
Other uncertainty functions: `add_effect_classification()`, `bootstrap_cube()`
```r
# Get example data
# install.packages("remotes")
# remotes::install_github("b-cubed-eu/b3gbi")
library(b3gbi)
cube_path <- system.file(
  "extdata", "denmark_mammals_cube_eqdgc.csv",
  package = "b3gbi"
)
denmark_cube <- process_cube(
  cube_path,
  first_year = 2014,
  last_year = 2020
)

# Function to calculate statistic of interest
# Mean observations per year
mean_obs <- function(data) {
  out_df <- aggregate(obs ~ year, data, mean) # Calculate mean obs per year
  names(out_df) <- c("year", "diversity_val") # Rename columns
  return(out_df)
}
mean_obs(denmark_cube$data)

# Perform bootstrapping
bootstrap_mean_obs <- bootstrap_cube(
  data_cube = denmark_cube$data,
  fun = mean_obs,
  grouping_var = "year",
  samples = 1000,
  seed = 123,
  progress = FALSE
)
head(bootstrap_mean_obs)

# Calculate confidence limits
# Percentile interval
ci_mean_obs1 <- calculate_bootstrap_ci(
  bootstrap_samples_df = bootstrap_mean_obs,
  grouping_var = "year",
  type = "perc",
  conf = 0.95,
  aggregate = TRUE
)
ci_mean_obs1

# All intervals
ci_mean_obs2 <- calculate_bootstrap_ci(
  bootstrap_samples_df = bootstrap_mean_obs,
  grouping_var = "year",
  type = c("perc", "bca", "norm", "basic"),
  conf = 0.95,
  aggregate = TRUE,
  data_cube = denmark_cube$data, # Required for BCa
  fun = mean_obs,                # Required for BCa
  progress = FALSE
)
ci_mean_obs2
```
This function performs leave-one-out (LOO) or k-fold (experimental) cross-validation (CV) on a biodiversity data cube to assess the performance of a specified indicator function. It partitions the data by a specified variable, calculates the specified indicator on training data, and compares it with the true values to evaluate the influence of one or more categories on the final result.
```r
cross_validate_cube(
  data_cube,
  fun,
  ...,
  grouping_var,
  out_var = "taxonKey",
  crossv_method = c("loo", "kfold"),
  k = ifelse(crossv_method == "kfold", 5, NA),
  max_out_cats = 1000,
  progress = FALSE
)
```
| Argument | Description |
|---|---|
| `data_cube` | A data cube object (class 'processed_cube' or 'sim_cube', see `b3gbi::process_cube()`), or the dataframe from its `$data` slot (as used in the example below). |
| `fun` | A function which, when applied to `data_cube`, returns the statistic(s) of interest. |
| `...` | Additional arguments passed on to `fun`. |
| `grouping_var` | A string specifying the grouping variable(s) for `fun`. The output of `fun(data_cube)` returns a row per group. |
| `out_var` | A string specifying the column by which the data should be left out iteratively. Default is `"taxonKey"`, such that cross-validation is performed by leaving out one species at a time. |
| `crossv_method` | Method of data partitioning. If `"loo"` (default), each unique category of `out_var` is left out once. If `"kfold"` (experimental), the categories of `out_var` are split into `k` folds that are left out in turn. |
| `k` | Number of folds (an integer). Used only if `crossv_method = "kfold"`. Default is 5. |
| `max_out_cats` | An integer specifying the maximum number of unique categories in `out_var` to leave out iteratively. Default is 1000. This can be increased if needed, at the cost of longer computation time. |
| `progress` | Logical. Whether to show a progress bar. Set to `TRUE` to display a progress bar; default is `FALSE`. |
This function assesses the influence of each category in `out_var` on the indicator value by iteratively leaving out one category at a time, similar to leave-one-out cross-validation. K-fold CV works in a similar fashion but is experimental and will not be covered here.
Original Sample Data: $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$

The initial set of observed data points, where there are $s$ different categories in `out_var` and $n$ total samples across all categories (= the sample size). $n$ corresponds to the number of cells in a data cube or the number of rows in tabular format.

Statistic of Interest: $\theta$

The parameter or statistic being estimated, such as the mean $\bar{X}$, the variance $\sigma^2$, or a biodiversity indicator. Let $\hat{\theta}$ denote the estimated value of $\theta$ calculated from the complete dataset $\mathbf{X}$.

Cross-Validation (CV) Sample: $\mathbf{X}_{-j}$

The full dataset $\mathbf{X}$ excluding all samples belonging to category $j$. This subset is used to investigate the influence of category $j$ on the estimated statistic $\hat{\theta}$.

CV Estimate for Category $j$: $\hat{\theta}_{-j}$

The value of the statistic of interest calculated from $\mathbf{X}_{-j}$, which excludes category $j$. For example, if $\theta$ is the sample mean, $\hat{\theta}_{-j} = \bar{X}_{-j}$.

Error Measures:

- The Error is the difference between the statistic estimated without category $j$ ($\hat{\theta}_{-j}$) and the statistic calculated on the complete dataset ($\hat{\theta}$):
  $$\text{Error}_j = \hat{\theta}_{-j} - \hat{\theta}$$
- The Relative Error is the absolute error, normalised by the true estimate $\hat{\theta}$ and a small error term $\epsilon$ to avoid division by zero:
  $$\text{RelError}_j = \frac{\left| \hat{\theta}_{-j} - \hat{\theta} \right|}{\hat{\theta} + \epsilon}$$
- The Percent Error is the relative error expressed as a percentage:
  $$\text{PercError}_j = \text{RelError}_j \times 100\%$$

Summary Measures:

- The Mean Relative Error (MRE) is the average of the relative errors over all categories:
  $$\text{MRE} = \frac{1}{s} \sum_{j=1}^{s} \text{RelError}_j$$
- The Mean Squared Error (MSE) is the average of the squared errors:
  $$\text{MSE} = \frac{1}{s} \sum_{j=1}^{s} \text{Error}_j^2$$
- The Root Mean Squared Error (RMSE) is the square root of the MSE:
  $$\text{RMSE} = \sqrt{\text{MSE}}$$
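The following standalone sketch illustrates these error and summary measures for a leave-one-category-out loop with the sample mean as statistic. It is not the dubicube implementation, and the column names `cat` and `obs` are hypothetical:

```r
# Minimal leave-one-category-out sketch with the sample mean as statistic
set.seed(123)
data <- data.frame(cat = rep(letters[1:5], each = 10),
                   obs = rpois(50, lambda = 4))

theta_hat <- mean(data$obs)                    # estimate on full data
cats <- unique(data$cat)
theta_cv <- vapply(cats,
                   function(j) mean(data$obs[data$cat != j]),
                   numeric(1))                 # estimate leaving out category j

eps        <- .Machine$double.eps
error      <- theta_cv - theta_hat             # error per category
rel_error  <- abs(error) / (theta_hat + eps)   # relative error
perc_error <- rel_error * 100                  # percent error

mre  <- mean(rel_error)                        # mean relative error
mse  <- mean(error^2)                          # mean squared error
rmse <- sqrt(mse)                              # root mean squared error
```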
A dataframe containing the cross-validation results with the following columns:
- `id_cv`: cross-validation ID
- the grouping variable `grouping_var` (e.g., year)
- the category left out during each cross-validation iteration (the specified `out_var` with suffix '_out' in lower case)
- the computed statistic values for both the training dataset (`rep_cv`) and the full dataset (`est_original`)
- error metrics: error (`error`), squared error (`sq_error`), absolute error (`abs_error`), relative error (`rel_error`), and percent error (`perc_error`)
- error metrics summarised by `grouping_var`: mean relative error (`mre`), mean squared error (`mse`), and root mean squared error (`rmse`)

See the Details section on how these error metrics are calculated.
```r
# Get example data
# install.packages("remotes")
# remotes::install_github("b-cubed-eu/b3gbi")
library(b3gbi)
cube_path <- system.file(
  "extdata", "denmark_mammals_cube_eqdgc.csv",
  package = "b3gbi"
)
denmark_cube <- process_cube(
  cube_path,
  first_year = 2014,
  last_year = 2020
)

# Function to calculate statistic of interest
# Mean observations per year
mean_obs <- function(data) {
  out_df <- aggregate(obs ~ year, data, mean) # Calculate mean obs per year
  names(out_df) <- c("year", "diversity_val") # Rename columns
  return(out_df)
}
mean_obs(denmark_cube$data)

# Perform leave-one-species-out CV
cv_mean_obs <- cross_validate_cube(
  data_cube = denmark_cube$data,
  fun = mean_obs,
  grouping_var = "year",
  out_var = "taxonKey",
  crossv_method = "loo",
  progress = FALSE
)
head(cv_mean_obs)
```