Introduction 

Flexible discriminant analysis (FDA) is a general methodology for non-linear classification into multiple groups. It is a classification model that combines a non-parametric regression model, e.g. multivariate adaptive regression splines (MARS), with linear discriminant analysis (LDA).

The first step of an FDA is a non-parametric regression, which uses optimal scoring to transform the response variable so that the data are in a better form for linear separation. It builds multiple regression models, so-called basis functions (BF), across the range of predictor values. In this procedure, the range of predictor values is partitioned into several groups or categories.
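As a rough illustration of what basis functions look like when MARS is the regression method, the Python sketch below expands one predictor value into mirrored hinge functions, which are the piecewise-linear building blocks MARS uses. The knot positions and the function names are hypothetical and chosen only for this example; a real MARS fit selects its knots from the data.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) or, mirrored, max(0, knot - x)."""
    return max(0.0, direction * (x - knot))


def basis_expand(x, knots):
    """Expand one predictor value into a pair of mirrored hinges per knot."""
    features = []
    for knot in knots:
        features.append(hinge(x, knot, +1))  # active above the knot
        features.append(hinge(x, knot, -1))  # active below the knot
    return features


# Hypothetical knots that partition a temperature predictor into three ranges.
temperature_knots = [10.0, 20.0]
row = basis_expand(15.0, temperature_knots)
```

Each knot splits the predictor range in two, so a small set of knots partitions the range into segments, each with its own linear slope.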


In the second step of an FDA, the groups identified in the first step are used to run a linear discriminant analysis. Linear discriminant analysis focuses on maximising the separability among groups, while minimising the variance within each group.


The first axis that LDA creates (environmental predictor 1) accounts for the most variation between the groups. The second axis (environmental predictor 2) accounts for the second most variation between the groups. This continues until every predictor is ranked. For simplicity, only a two-dimensional graph with two predictors (axes) is displayed at a time.
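The idea of maximising separation between groups while minimising variance within them can be sketched for a single axis as the ratio of between-group to within-group sum of squares. This is a simplified, one-dimensional illustration in Python, not the full multivariate LDA computation; the function name and the example values are made up for this sketch.

```python
from statistics import mean


def fisher_ratio(groups):
    """Between-group over within-group sum of squares along one axis."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Spread of the group means around the overall mean.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Spread of the observations around their own group mean.
    within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return between / within


# Made-up values of two groups projected onto two candidate axes.
well_separated = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
overlapping = [[1.0, 5.0, 9.0], [2.0, 6.0, 10.0]]

ratio_good = fisher_ratio(well_separated)
ratio_poor = fisher_ratio(overlapping)
```

An axis with a higher ratio separates the groups better, which is the sense in which the first axis ranks above the second.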

 
 

Advantages 

  • Works well with a large number of predictor variables 
  • Automatically detects interactions between variables 
  • It is an efficient and fast algorithm, despite its complexity 
  • Robust to outliers 
     
     

Limitations 

  • Strong sensitivity to configuration setting 
  • Susceptible to overfitting 
  • More difficult to understand and interpret than other methods 
  • The response variable or grouping variable can be categorical, but independent variables are continuous, assumed to be normal. 
     
     

Assumptions 


No assumptions are made about the distributions of the environmental variables. However, they should not be highly correlated with one another because this could cause problems with the estimation. 
 

 

Requires absence data 

Yes. 

  

Configuration options  

EcoCommons allows the user to set model arguments as specified below. 


random_seed 

Seed used for generating random values. Using the same seed value, e.g. 123, ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation.
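The effect of a fixed seed can be shown with a minimal Python sketch. It uses Python's standard random number generator purely as an illustration and does not reproduce the generator EcoCommons uses internally; the function name is hypothetical.

```python
import random


def draw_sample(seed, n=5):
    """Draw n pseudo-random values from a generator initialised with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first_run = draw_sample(123)
second_run = draw_sample(123)   # same seed: identical values
other_seed = draw_sample(124)   # different seed: different values
```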

nb_run_eval 

The original dataset is split in two: one part to calibrate the model and another to evaluate it. You can repeat this process ‘N’ times. This is called n-fold cross-validation. (default = 10)

data_split 

The proportion of data used for model calibration. Allows robust tests when independent data is not available. (default = 100) 
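How nb_run_eval and data_split interact can be sketched as a repeated random split. The Python function below is a hypothetical illustration that borrows the option names as parameter names; it assumes data_split is a percentage and is not the code EcoCommons runs.

```python
import random


def repeated_splits(n_records, data_split=80, nb_run_eval=10, random_seed=123):
    """Return nb_run_eval random (calibration, evaluation) index splits."""
    rng = random.Random(random_seed)
    n_calib = round(n_records * data_split / 100)  # data_split as a percentage
    runs = []
    for _ in range(nb_run_eval):
        indices = list(range(n_records))
        rng.shuffle(indices)
        calibration = sorted(indices[:n_calib])
        evaluation = sorted(indices[n_calib:])
        runs.append((calibration, evaluation))
    return runs


runs = repeated_splits(n_records=50, data_split=80, nb_run_eval=3)
```

Under this assumption, a data_split of 100 leaves the evaluation part empty, so every record is used for calibration.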

prevalence  

Gives more or less weight to particular observations. If this option is kept at NULL (default), each observation (presence or absence) has the same weight, independent of the number of presences and absences. If the value is set below 0.5, absences are given more weight, whereas a value above 0.5 gives more weight to presences.

However, when pseudo-absence data have been generated, the weights (prevalence) default to 0.5, as you should not give a higher value to pseudo-absences than to presences. For example, the model will not run if prevalence is set to 0.7 while pseudo-absences are in use.
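One way to picture the weighting is sketched below in Python. It assumes that prevalence is the share of the total weight assigned to presences and that the weights are normalised to sum to 1; this is an illustrative assumption, not a description of the exact formula used by the underlying modelling package. The function name and the labels are hypothetical.

```python
def observation_weights(labels, prevalence=None):
    """Weights for presence (1) and absence (0) records.

    With prevalence=None every record gets the same weight. Otherwise the
    presences share `prevalence` of the total weight and the absences share
    the remainder (an assumption made for this sketch).
    """
    if prevalence is None:
        return [1.0] * len(labels)
    n_presence = sum(labels)
    n_absence = len(labels) - n_presence
    w_presence = prevalence / n_presence
    w_absence = (1.0 - prevalence) / n_absence
    return [w_presence if y == 1 else w_absence for y in labels]


labels = [1, 1, 0, 0, 0, 0, 0, 0]  # 2 presences, 6 absences
weights = observation_weights(labels, prevalence=0.5)
presence_share = sum(w for w, y in zip(weights, labels) if y == 1)
```

With a prevalence of 0.5 the two presences together carry the same total weight as the six absences.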

var_import 

Number of permutations to estimate the importance of each variable. If this value is larger than 0, the algorithm will produce an object called ‘variabImprortance.Full.csv’, in which high values mean that the predictor variable has a high importance, whereas a value close to 0 corresponds to no importance. (default = 0) 
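Permutation-based importance can be sketched as follows in Python. The sketch assumes that importance is scored as 1 minus the correlation between the original predictions and the predictions made after shuffling one variable, averaged over the permutations; the toy model, the data, and all function names are hypothetical.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def permutation_importance(predict, rows, column, n_perm=3, seed=123):
    """Mean of 1 - correlation(original, permuted predictions) over n_perm shuffles."""
    rng = random.Random(seed)
    reference = [predict(r) for r in rows]
    scores = []
    for _ in range(n_perm):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted_rows = [r[:column] + [v] + r[column + 1:]
                         for r, v in zip(rows, shuffled)]
        permuted = [predict(r) for r in permuted_rows]
        scores.append(1.0 - pearson(reference, permuted))
    return mean(scores)


def toy_model(row):
    """Hypothetical model that depends only on the first predictor."""
    return 2.0 * row[0]


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(30)]
importance_used = permutation_importance(toy_model, rows, column=0)
importance_unused = permutation_importance(toy_model, rows, column=1)
```

Shuffling the predictor the toy model ignores leaves its predictions unchanged, so that predictor scores close to 0, while shuffling the predictor it relies on yields a clearly higher score.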

rescale_all_models 

If true, all model predictions are rescaled with a binomial GLM, i.e. to values between 0 and 1. For ‘FDA’ and ‘ANN’, categorical models need to be scaled. In this case, it is recommended to scale all computed models to ensure comparable projections. However, it is not advised in other cases, as it reduces the projection scale amplitude. (default = FALSE)

do_full_models 

Calibrate & evaluate models with the whole dataset (default = TRUE) 

method 

The regression method used in optimal scaling. The default is Multivariate Adaptive Regression Splines (default = MARS)

 
  



References 

Additional Reading 

  • Arenas-Castro, S., Gonçalves, J., Alves, P., Alcaraz-Segura, D., & Honrado, J. P. (2018). Assessing the multi-scale predictive ability of ecosystem functional attributes for species distribution modelling. PLOS ONE, 13(6), e0199292.  
  • Phillips, N. D., Reid, N., Thys, T., Harrod, C., Payne, N. L., Morgan, C. A., White, H. J., Porter, S., & Houghton, J. D. R. (2017). Applying species distribution modelling to a data poor, pelagic fish complex: The ocean sunfishes. Journal of Biogeography, 44(10), 2176–2187.  
  • Quillfeldt, P., Engler, J. O., Silk, J. R. D., & Phillips, R. A. (2017). Influence of device accuracy and choice of algorithm for species distribution modelling of seabirds: A case study using black-browed albatrosses. Journal of Avian Biology, 48(12), 1549–1555.  
  • Zhang, Z., Xu, S., Capinha, C., Weterings, R., & Gao, T. (2019). Using species distribution model to predict the impact of climate change on the potential distribution of Japanese whiting Sillago japonica. Ecological Indicators, 104, 333–340.