Introduction
Flexible discriminant analysis (FDA) is a general method for multigroup, non-linear classification. It combines non-parametric regression (e.g. multivariate adaptive regression splines, MARS) with linear discriminant analysis.
The first step of an FDA is a non-parametric regression, which uses optimal scoring to transform the response variable so that the data are in a better form for linear separation. It builds multiple regression models, so-called basis functions (BFs), across the range of predictor values; in this procedure, the range of predictor values is partitioned into several groups/categories.
In the second step of an FDA, the groups identified in the first step are used to run a linear discriminant analysis (LDA). LDA focuses on maximising the separability among groups while minimising the variance within each group.
The first axis that LDA creates (environmental predictor 1) accounts for the most variation between the groups, the second axis (environmental predictor 2) accounts for the second most, and so on until every predictor is ranked. For simplicity, only a two-dimensional graph with two predictors (axes) is displayed at a time.
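The two steps described above can be reproduced outside EcoCommons with the mda package in R, which implements FDA by optimal scoring (Hastie et al. 1994). The sketch below is illustrative only, assuming a simulated training set; the object names, predictor columns and class labels are hypothetical, not part of the EcoCommons workflow.

```r
# Minimal FDA sketch using the mda package, with MARS as the optimal-scoring
# regression method, mirroring the two steps described above.
library(mda)

# Hypothetical training data: 'occ' is the grouping (response) variable,
# 'bio1' and 'bio12' are environmental predictors.
set.seed(123)
train <- data.frame(
  occ   = factor(sample(c("present", "absent"), 200, replace = TRUE)),
  bio1  = rnorm(200, mean = 20, sd = 5),
  bio12 = rnorm(200, mean = 1000, sd = 250)
)

# Step 1 (non-parametric regression via optimal scoring, method = mars) and
# step 2 (the discriminant analysis) are both handled internally by fda().
fda_fit <- fda(occ ~ bio1 + bio12, data = train, method = mars)

# Class predictions and posterior probabilities for new sites.
predict(fda_fit, newdata = train[1:5, ], type = "class")
predict(fda_fit, newdata = train[1:5, ], type = "posterior")
```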
Advantages
- Works well with a large number of predictor variables
- Automatically detects interactions between variables
- Efficient and fast, despite its complexity
- Robust to outliers
Limitations
- Strong sensitivity to configuration settings
- Susceptible to overfitting
- More difficult to understand and interpret than other methods
- The response (grouping) variable can be categorical, but the independent variables must be continuous and are assumed to be normally distributed.
Assumptions
No assumptions are made about the distributions of the environmental variables. However, they should not be highly correlated with one another, because collinearity can cause problems with the estimation.
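A quick way to screen for this is to inspect pairwise correlations among the candidate predictors before fitting the model. The snippet below is a generic sketch; the data frame of predictor values and the 0.7 cut-off are illustrative assumptions, not EcoCommons defaults.

```r
# Screen environmental predictors for strong pairwise correlation before modelling.
# 'env' is a hypothetical data frame of predictor values at the training sites.
set.seed(123)
env <- data.frame(
  bio1  = rnorm(200, 20, 5),
  bio5  = rnorm(200, 30, 4),
  bio12 = rnorm(200, 1000, 250)
)

cor_mat <- cor(env, method = "spearman")

# Flag predictor pairs whose absolute correlation exceeds an illustrative 0.7 cut-off.
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, 1]],
  var2 = colnames(cor_mat)[high[, 2]],
  r    = cor_mat[high]
)
```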
Requires absence data
Yes.
Configuration options
EcoCommons allows the user to set model arguments as specified below.
| Argument | Description |
| --- | --- |
| random_seed | Seed used for generating random values. Using the same seed value (e.g. 123) ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation. |
| nb_run_eval | The original dataset is split in two: one part to calibrate and another to evaluate the model. This process can be repeated 'N' times, which is called n-fold cross-validation. (default = 10) |
| data_split | The proportion of data used for model calibration; the remainder is used for evaluation. Allows robust tests when independent data are not available. (default = 100) |
| prevalence | Allows giving more or less weight to particular observations. If this option is kept at NULL (default), each observation (presence or absence) has the same weight, independent of the number of presences and absences. A value below 0.5 gives more weight to absences, whereas a value above 0.5 gives more weight to presences. However, when pseudo-absence data have been generated, the weight (prevalence) defaults to 0.5, as pseudo-absences should not be given more weight than presences. The model will not run if prevalence is set to, for example, 0.7 when pseudo-absence data are used. |
| var_import | Number of permutations used to estimate the importance of each variable. If this value is larger than 0, the algorithm produces an output called 'variableImportance.Full.csv', in which high values mean that the predictor variable has a high importance, whereas a value close to 0 corresponds to no importance. (default = 0) |
| rescale_all_models | If TRUE, all model predictions are scaled with a binomial GLM, i.e. to values between 0 and 1. For 'FDA' and 'ANN', categorical model outputs need to be scaled; in this case it is recommended to scale all computed models to ensure comparable projections. It is not advised in other cases, as it reduces the amplitude of the projection scale. (default = FALSE) |
| do_full_models | Calibrate and evaluate the models with the whole dataset. (default = TRUE) |
| method | The regression method used in optimal scoring. The default is multivariate adaptive regression splines. (default = MARS) |
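These arguments broadly correspond to options of the biomod2 package (Thuiller et al. 2021) that EcoCommons builds on. The sketch below is a hedged illustration of how such a call might look in biomod2 3.5.x; 'myBiomodData' is a placeholder for a BIOMOD.formated.data object created beforehand with BIOMOD_FormatingData(), and argument names may differ in other biomod2 versions.

```r
# Illustrative biomod2 (v3.5.x) call roughly matching the EcoCommons settings above.
# 'myBiomodData' is a hypothetical BIOMOD.formated.data object built with
# BIOMOD_FormatingData() from occurrence records and environmental layers.
library(biomod2)

set.seed(123)  # random_seed

fda_options <- BIOMOD_ModelingOptions(
  FDA = list(method = "mars")  # regression method used in optimal scoring
)

fda_models <- BIOMOD_Modeling(
  data              = myBiomodData,
  models            = "FDA",
  models.options    = fda_options,
  NbRunEval         = 10,     # nb_run_eval
  DataSplit         = 100,    # data_split
  Prevalence        = NULL,   # prevalence
  VarImport         = 0,      # var_import
  rescal.all.models = FALSE,  # rescale_all_models
  do.full.models    = TRUE    # do_full_models
)
```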
References
- Hallgren, W., Santana, F., Low-Choy, S., Zhao, Y., & Mackey, B. (2019). Species distribution models can be highly sensitive to algorithm configuration. Ecological Modelling, 408, 108719.
- Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer.
- Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible Discriminant Analysis by Optimal Scoring. Journal of the American Statistical Association, 89(428), 1255–1270.
- Reynès, C., Sabatier, R., & Molinari, N. (2006). Choice of B-splines with free parameters in the flexible discriminant analysis context. Computational Statistics & Data Analysis, 51(3), 1765–1778.
- Thuiller, W., Georges, D., Gueguen, M., Engler, R., & Breiner, F. (2021). biomod2: Ensemble Platform for Species Distribution Modeling (3.5.1). https://CRAN.R-project.org/package=biomod2
Additional Reading
- Arenas-Castro, S., Gonçalves, J., Alves, P., Alcaraz-Segura, D., & Honrado, J. P. (2018). Assessing the multi-scale predictive ability of ecosystem functional attributes for species distribution modelling. PLOS ONE, 13(6), e0199292.
- Phillips, N. D., Reid, N., Thys, T., Harrod, C., Payne, N. L., Morgan, C. A., White, H. J., Porter, S., & Houghton, J. D. R. (2017). Applying species distribution modelling to a data poor, pelagic fish complex: The ocean sunfishes. Journal of Biogeography, 44(10), 2176–2187.
- Quillfeldt, P., Engler, J. O., Silk, J. R. D., & Phillips, R. A. (2017). Influence of device accuracy and choice of algorithm for species distribution modelling of seabirds: A case study using black-browed albatrosses. Journal of Avian Biology, 48(12), 1549–1555.
- Zhang, Z., Xu, S., Capinha, C., Weterings, R., & Gao, T. (2019). Using species distribution model to predict the impact of climate change on the potential distribution of Japanese whiting Sillago japonica. Ecological Indicators, 104, 333–340.