Introduction 

This model is essentially the same as Boosted Regression Trees (BRT); it is simply run through a different package in R. 

These models combine two techniques: decision tree algorithms and boosting methods. Generalized Boosting Models repeatedly fit many decision trees to improve the accuracy of the model. For each new tree, a random subset of the data is selected, and the observations are weighted so that data that were poorly modelled by the previous trees have a higher probability of being selected. This means that after the first tree is fitted, the model takes the prediction error of that tree into account when fitting the next tree, and so on. By building each new tree on the shortcomings of the trees already fitted, the model continuously improves its accuracy. This sequential approach is what distinguishes boosting from other ensemble methods. 
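
As an illustration only, the core of this sequential fitting can be sketched in a few lines of R. The sketch below uses residual-based boosting for a continuous response, which is the same idea gbm applies to presence/absence data through the bernoulli deviance; the data frame dat, its response y and the predictors x1 and x2 are hypothetical placeholders.

```r
# Minimal sketch of the boosting idea for a continuous response,
# assuming a data frame 'dat' with response y and predictors x1, x2
# (all hypothetical). The gbm package does this internally.
library(rpart)

shrinkage <- 0.01                           # contribution of each tree
n_trees   <- 1000
pred      <- rep(mean(dat$y), nrow(dat))    # start from the overall mean

for (i in seq_len(n_trees)) {
  resid <- dat$y - pred                     # error left by the previous trees
  tree  <- rpart(resid ~ x1 + x2, data = dat,
                 control = rpart.control(maxdepth = 2))  # a small tree
  pred  <- pred + shrinkage * predict(tree, dat)         # shrunken update
  # (a real implementation would also store each tree so it can
  #  predict on new data)
}
```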

Generalized Boosting Models have two important parameters that need to be specified by the user. 

  • Interaction depth (= tree complexity in BRT): this controls the number of splits in each tree. A value of 1 produces trees with a single split, which means the model does not account for interactions between environmental variables; a value of 2 allows two splits (and therefore two-way interactions), and so on. 
  • Shrinkage (= learning rate in BRT): this determines the contribution of each tree to the growing model. A small shrinkage value means that many trees need to be built. 

 
 

These two parameters together determine the number of trees required for optimal prediction. The aim is to find the combination of parameters that minimises the prediction error while producing a model with at least 1000 trees. The optimal values depend on the size of your dataset: for datasets with fewer than 500 occurrence points, it is best to fit simple trees (interaction depth = 2 or 3) with a shrinkage rate small enough to allow the model to grow at least 1000 trees, as in the sketch below. 
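
As a hedged example of how this tuning could be explored directly with the gbm package in R (the occ_data object, its presence column and the bioclim predictor names are placeholders; the platform handles this step for you), cross-validation can be used to check that the chosen shrinkage and interaction depth yield an optimal model of at least 1000 trees:

```r
# Hypothetical presence/absence data frame 'occ_data' with a binary
# column 'presence' and climate predictors bio1, bio12, bio15.
library(gbm)

fit <- gbm(presence ~ bio1 + bio12 + bio15,
           data              = occ_data,
           distribution      = "bernoulli",  # binary response
           n.trees           = 2500,
           interaction.depth = 3,            # simple trees
           shrinkage         = 0.005,        # small learning rate
           cv.folds          = 3)

best_iter <- gbm.perf(fit, method = "cv")    # optimal number of trees
best_iter                                    # aim for at least 1000
```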

Generalized Boosting Models are a powerful algorithm: they work very well with large datasets, or when the number of environmental variables is large relative to the number of observations, and they are very robust to missing values and outliers. 

 
 

Advantages 

  • Can be used with a variety of response types (binomial, Gaussian, Poisson) 
  • Stochastic, which improves predictive performance 
  • The best fit is automatically detected by the algorithm 
  • Model represents the effect of each predictor after accounting for the effects of other predictors 
  • Robust to missing values and outliers 
     

Limitations 

  • Needs at least two predictor variables to run 

 
 

Assumptions 

There are no formal distributional assumptions: generalized boosting models are non-parametric and can therefore handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal. 
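
For instance, in a sketch assuming a hypothetical occ_data data frame, categorical predictors can be supplied to gbm directly as R factors, without any dummy coding:

```r
# Unordered or ordered factors are handled directly
# ('soil_type' and 'presence' are hypothetical column names).
library(gbm)

occ_data$soil_type <- factor(occ_data$soil_type)

fit <- gbm(presence ~ bio1 + soil_type,
           data         = occ_data,
           distribution = "bernoulli",
           n.trees      = 1000,
           shrinkage    = 0.01)
```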

 
 

Requires absence data 

Yes. 

 
Configuration options  

EcoCommons allows the user to set model arguments as specified below. 

random_seed  

Seed used for generating random values. Using the same seed value (e.g. 123) ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as random subsampling of the data or cross-validation. 

Number of repetitions (nb_run_eval) 

Integer value, corresponding to the number of repetitions to be done for calibration/validation splitting. (default = 10) 

Data split percentage (data_split) 

Numeric value between 0 and 100, corresponding to the percentage of data used to calibrate the models (calibration/validation splitting). (default = 100) 

prevalence 

Allows the user to give more or less weight to particular observations. Default = NULL: each observation (presence or absence) has the same weight. If the value is < 0.5, absences are given more weight; if the value is > 0.5, presences are given more weight. (algorithm parameter) 

Variable importance (var_import) 

Integer value, corresponding to the number of permutations to be done for each variable to estimate variable importance. (default = 0) 

Scale models (rescale_all_models) 

A logical value defining whether all model predictions should be scaled with a binomial GLM or not. (default = FALSE) 

Evaluate all models (do_full_models) 

A logical value defining whether models calibrated and evaluated over the whole dataset should be computed or not. (default = TRUE) 

Distribution (distributions) 

 

Distribution of the response variable. ‘bernoulli’ should be used if the response has only 2 unique values. If the response is a factor, multinomial is assumed; otherwise, if the response has class "Surv", coxph is assumed; otherwise, gaussian. (default = bernoulli) 

 

Number of trees (n_trees) 

 

Total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. (default = 2500) 

 

Maximum depth (interactions_depth) 

 

Integer specifying the maximum depth of each tree. A value of 1 implies an additive model; a value of 2 implies a model with up to 2-way interactions. (default = 7) 

 

Minimum observations (n_minobsinnode) 

 

Integer specifying the minimum number of observations in the terminal nodes of the trees. Note that this is the actual number of observations, not the total weight. (default = 5) 

 

Shrinkage (shrinkage) 

 

The shrinkage applied to each tree in the expansion. Also known as the learning rate or step-size reduction; values of 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. (default = 0.01) 

 

Bag fraction (bag_fraction) 

 

The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit; if bag.fraction < 1, running the same model twice will result in similar but different fits. (default = 0.5) 

 

Training fraction (train_fraction) 

 

The first train.fraction * nrow(data) observations are used to fit the gbm, and the remainder are used for computing out-of-sample estimates of the loss function. (default = 1) 

 

Cross-validation (cv_folds) 

 

Number of cross-validation folds to perform. If cv.folds > 1, then gbm, in addition to the usual fit, will perform a cross-validation and calculate an estimate of generalization error, returned in cv.error. (default = 3) 
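
As a summary, the sketch below shows roughly how the settings above map onto the underlying gbm::gbm() call with their default values. EcoCommons fits the model through biomod2, so this is an approximation rather than the platform's exact code; training_data and the presence column are placeholders.

```r
library(gbm)

set.seed(123)                               # random_seed

fit <- gbm(presence ~ .,                    # presence/absence vs. all predictors
           data              = training_data,
           distribution      = "bernoulli", # distributions
           n.trees           = 2500,        # n_trees
           interaction.depth = 7,           # interactions_depth
           n.minobsinnode    = 5,           # n_minobsinnode
           shrinkage         = 0.01,        # shrinkage
           bag.fraction      = 0.5,         # bag_fraction
           train.fraction    = 1,           # train_fraction
           cv.folds          = 3)           # cv_folds
```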

 

 

References 

  • De’ath, G. (2007). Boosted trees for ecological modeling and prediction. Ecology, 88(1), 243–251.  
  • Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802–813.  
  • Franklin, J. (2010). Mapping species distributions: spatial inference and prediction. Cambridge University Press. 
  • Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 172–181. 
  • Thuiller, W., Lafourcade, B., & Araujo, M. (2012). Presentation manual for BIOMOD. Laboratoire d'Écologie Alpine, Université Joseph Fourier, Grenoble, France. 

Additional Reading 

  • Arenas-Castro, S., Gonçalves, J., Alves, P., Alcaraz-Segura, D., & Honrado, J. P. (2018). Assessing the multi-scale predictive ability of ecosystem functional attributes for species distribution modelling. PLOS ONE, 13(6), e0199292.  
  • Feuda, R., Bannikova, A. A., Zemlemerova, E. D., Di Febbraro, M., Loy, A., Hutterer, R., Aloise, G., Zykov, A. E., Annesi, F., & Colangelo, P. (2015). Tracing the evolutionary history of the mole, Talpa europaea, through mitochondrial DNA phylogeography and species distribution modelling. Biological Journal of the Linnean Society, 114(3), 495–512. 
  • Greiser, C., Hylander, K., Meineri, E., Luoto, M., & Ehrlén, J. (2020). Climate limitation at the cold edge: Contrasting perspectives from species distribution modelling and a transplant experiment. Ecography, 43(5), 637–647.  
  • Quillfeldt, P., Engler, J. O., Silk, J. R. D., & Phillips, R. A. (2017). Influence of device accuracy and choice of algorithm for species distribution modelling of seabirds: A case study using black-browed albatrosses. Journal of Avian Biology, 48(12), 1549–1555.  
  • Rose, P. M., Kennard, M. J., Moffatt, D. B., Sheldon, F., & Butler, G. L. (2016). Testing three species distribution modelling strategies to define fish assemblage reference conditions for stream bioassessment and related applications. PLOS ONE, 11(1), e0146728.  
  • Zhang, Z., Xu, S., Capinha, C., Weterings, R., & Gao, T. (2019). Using species distribution model to predict the impact of climate change on the potential distribution of Japanese whiting Sillago japonica. Ecological Indicators, 104, 333–340.