Introduction

Flexible discriminant analysis (FDA) is a general methodology for multigroup non-linear classification. It is a classification model that combines a non-parametric regression model, e.g. multivariate adaptive regression splines (MARS), with linear discriminant analysis.

The first step of an FDA is a non-parametric regression, which uses optimal scoring to transform the response variable so that the data are in a better form for linear separation. It builds multiple regression models, so-called basis functions (BF), across the range of predictor values. In this procedure, the range of predictor values is partitioned into several groups or categories.
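To make the idea of basis functions more concrete, the following is a minimal sketch of the mirrored "hinge" functions that a MARS-style regression places at a knot, which is how the predictor range ends up partitioned. The function name hinge_pair, the temperature values and the knot position are hypothetical and chosen only for illustration; a real MARS fit selects its knots from the data.

```python
def hinge_pair(x, knot):
    """Return the two mirrored hinge basis functions of a MARS-style model at a knot."""
    return max(0.0, x - knot), max(0.0, knot - x)


# Hypothetical predictor values (e.g. mean annual temperature) and an assumed knot.
temperatures = [10.0, 15.0, 20.0, 25.0, 30.0]
knot = 20.0

# Each value is described by how far it lies above or below the knot.
basis = [hinge_pair(t, knot) for t in temperatures]

for t, (above, below) in zip(temperatures, basis):
    print(f"x={t:5.1f}  max(0, x-knot)={above:4.1f}  max(0, knot-x)={below:4.1f}")
```

Each hinge is zero on one side of the knot and linear on the other, so a weighted sum of such functions gives a piecewise-linear response over the predictor range.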


In the second step of an FDA, the groups identified in the first step are used to run a linear discriminant analysis (LDA). LDA focuses on maximising the separability among groups while minimising the variance within each group.
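The separability criterion can be illustrated with a small sketch that computes the ratio of between-group to within-group variation for scores along a single axis. This only evaluates the criterion for given values; it does not search for the best axis as LDA does. The function name fisher_ratio and the example values are hypothetical.

```python
from statistics import mean


def fisher_ratio(groups):
    """Between-group sum of squares divided by within-group sum of squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the overall mean.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return between / within


# Hypothetical scores of two groups along one axis.
well_separated = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
overlapping = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]

print(fisher_ratio(well_separated))
print(fisher_ratio(overlapping))
```

An axis along which the groups are well separated yields a much larger ratio than one along which they overlap.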

 


The first axis that LDA creates accounts for the most variation between the groups. The second axis accounts for the second most variation between the groups, and so on for each further axis. Each axis is a linear combination of the environmental predictors. For simplicity, only a two-dimensional graph with two axes is displayed at one time.


Advantages

  • Works well with a large number of predictor variables
  • Automatically detects interactions between variables
  • Efficient and fast, despite its complexity
  • Robust to outliers

Limitations

  • Strongly sensitive to configuration settings
  • Susceptible to overfitting
  • More difficult to understand and interpret than other methods

Assumptions

No assumptions are made about the distributions of the environmental variables. However, they should not be highly correlated with one another because this could cause problems with the estimation.

Requires absence data

Yes.

 

Configuration options

BCCVL uses the ‘fda’ function from the ‘mda’ package, as implemented in biomod2. The user can set the following configuration options:


nb_run_eval

The original dataset is split in two: one subset to calibrate the model and another to evaluate it. This process can be repeated ‘N’ times, which is referred to here as n-fold cross-validation.

data_split

The proportion of data used for model calibration; the remainder is used for evaluation. Allows robust tests when independent data are not available.
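How nb_run_eval and data_split interact can be sketched as a repeated random split. This is an illustration only, not the biomod2 implementation: the function name repeated_splits is hypothetical, data_split is assumed to be given as a percentage, and the split is drawn at random over all records without regard to presences and absences.

```python
import random


def repeated_splits(n_records, data_split, nb_run_eval, seed=0):
    """Return nb_run_eval (calibration, evaluation) index pairs."""
    rng = random.Random(seed)
    # Assumption: data_split is a percentage of the records.
    n_calib = round(n_records * data_split / 100)
    runs = []
    for _ in range(nb_run_eval):
        indices = list(range(n_records))
        rng.shuffle(indices)
        calibration = sorted(indices[:n_calib])
        evaluation = sorted(indices[n_calib:])
        runs.append((calibration, evaluation))
    return runs


# 10 records, 80% used for calibration, repeated 3 times.
runs = repeated_splits(n_records=10, data_split=80, nb_run_eval=3)
for calibration, evaluation in runs:
    print("calibrate on", calibration, "evaluate on", evaluation)
```

Every run uses each record exactly once, either for calibration or for evaluation, and a different random partition is drawn for each repetition.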

prevalence 

Allows more or less weight to be given to particular observations. If this option is left as NULL (default), each observation (presence or absence) has the same weight, independent of the number of presences and absences. If the value is set below 0.5, absences are given more weight, whereas a value above 0.5 gives more weight to presences.

However, when pseudo-absence data have been generated, the prevalence weight defaults to 0.5, as pseudo-absences should not be given more weight than presences. The model will not run if prevalence is set to, for example, 0.7 when pseudo-absences are used.
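One plausible way to read the prevalence setting is as the share of the total weight that is carried by the presences. The sketch below implements that reading; it is an assumption made for illustration and not necessarily how biomod2 computes its weights. The function name prevalence_weights and the example labels are hypothetical.

```python
def prevalence_weights(labels, prevalence=None):
    """Return one weight per observation (1 = presence, 0 = absence)."""
    if prevalence is None:
        # Default: every observation has the same weight.
        return [1.0] * len(labels)
    n_presences = sum(1 for y in labels if y == 1)
    n_absences = len(labels) - n_presences
    if n_presences == 0 or n_absences == 0:
        raise ValueError("both presences and absences are required")
    # Assumption: presences together carry `prevalence` of the total weight,
    # absences together carry the rest.
    w_presence = prevalence / n_presences
    w_absence = (1.0 - prevalence) / n_absences
    return [w_presence if y == 1 else w_absence for y in labels]


# Hypothetical dataset with 2 presences and 6 absences.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
weights = prevalence_weights(labels, prevalence=0.5)
presence_share = sum(w for w, y in zip(weights, labels) if y == 1)
print(weights, presence_share)
```

Under this reading, a prevalence of 0.5 balances the total weight of presences and absences even though absences outnumber presences in the data.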

var_import

Number of permutations used to estimate the importance of each variable. If this value is larger than 0, the algorithm will produce a file called ‘variabImprortance.Full.csv’, in which high values mean that the predictor variable has a high importance, whereas a value close to 0 corresponds to no importance.
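The permutation idea can be sketched as follows: one predictor is shuffled, the model predicts again, and the importance is taken as one minus the correlation between the original and the permuted predictions, averaged over the permutations. The scoring rule is assumed here for illustration, and toy_model, rows and the function names are hypothetical stand-ins for a fitted model and its data.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equally long lists."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def permutation_importance(predict, rows, column, n_permutations, seed=0):
    """Mean of 1 - correlation(reference, permuted) over n_permutations."""
    rng = random.Random(seed)
    reference = [predict(r) for r in rows]
    scores = []
    for _ in range(n_permutations):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Replace only the chosen column with its shuffled values.
        permuted_rows = [r[:column] + [v] + r[column + 1:]
                         for r, v in zip(rows, shuffled)]
        permuted = [predict(r) for r in permuted_rows]
        scores.append(1.0 - pearson(reference, permuted))
    return mean(scores)


def toy_model(row):
    """Hypothetical model that depends only on the first predictor."""
    return 2.0 * row[0]


rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
importance_0 = permutation_importance(toy_model, rows, 0, n_permutations=5)
importance_1 = permutation_importance(toy_model, rows, 1, n_permutations=5)
print(importance_0, importance_1)
```

Shuffling a predictor that the model ignores leaves the predictions unchanged, so its score stays at 0, while shuffling an influential predictor lowers the correlation and raises the score.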

rescale_all_models

If true, all model predictions are scaled with a binomial GLM, i.e. to values between 0 and 1. For ‘FDA’ and ‘ANN’, categorical models need to be scaled. In this case, it is recommended to scale all computed models to ensure comparable projections. However, scaling is not advised in other cases, as it reduces the amplitude of the projection scale.
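Rescaling with a binomial GLM can be pictured as fitting a logistic regression of the observed presences and absences on the raw model output and using the fitted probabilities as the rescaled predictions. The sketch below fits such a model by plain gradient ascent; it is a simplified illustration under that assumption, not the biomod2 routine, and all names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_binomial_glm(scores, observed, n_iter=2000, learning_rate=0.1):
    """Fit observed ~ scores with a logit link by gradient ascent."""
    intercept, slope = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_intercept = grad_slope = 0.0
        for x, y in zip(scores, observed):
            p = sigmoid(intercept + slope * x)
            grad_intercept += y - p
            grad_slope += (y - p) * x
        # Step along the average gradient of the log-likelihood.
        intercept += learning_rate * grad_intercept / n
        slope += learning_rate * grad_slope / n
    return intercept, slope


def rescale(scores, intercept, slope):
    """Map raw scores to fitted probabilities between 0 and 1."""
    return [sigmoid(intercept + slope * x) for x in scores]


# Hypothetical raw model outputs and observed presences (1) / absences (0).
raw_scores = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
observed = [0, 0, 1, 0, 1, 0, 1, 1]

intercept, slope = fit_binomial_glm(raw_scores, observed)
rescaled = rescale(raw_scores, intercept, slope)
print([round(p, 3) for p in rescaled])
```

The rescaled values keep the ranking of the raw scores but are compressed into the interval between 0 and 1, which is why the amplitude of the projection scale is reduced.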

do_full_models

If true, models are calibrated and evaluated with the whole dataset.

method

The regression method used in optimal scoring. The default is Multivariate Adaptive Regression Splines (MARS).


 

References

  • Hallgren W., Santana F., Low-Choy S., Zhao Y., Mackey B. (2019). Species distribution models can be highly sensitive to algorithm configuration. Ecological Modelling, Vol. 408. doi.org/10.1016/j.ecolmodel.2019.108719
  • Hastie T., Tibshirani R., Buja A. (1994). Flexible Discriminant Analysis by Optimal Scoring. Journal of the American Statistical Association, Vol. 89, No. 428.
  • Hastie T., Tibshirani R., Friedman J. (2009). The elements of statistical learning: data mining, inference and prediction. 2nd edition, Springer.
  • Reynès C., Sabatier R., Molinari N. (2006). Choice of B-splines with free parameters in the flexible discriminant analysis context. Computational Statistics & Data Analysis, Vol. 51.
  • Thuiller W., Georges D., Gueguen M., Engler R., Breiner F. (2021). biomod2: Ensemble Platform for Species Distribution Modeling. R package version 3.5.1. https://CRAN.R-project.org/package=biomod2