Tree-based models partition the data into increasingly homogeneous groups of presence or absence based on their relationship to a set of environmental variables, the predictor variables. The single classification tree is the most basic form of a decision tree model. As the name suggests, classification trees resemble a tree and consist of three different types of nodes, connected by directed edges (branches):
- Root node: no incoming branches - this represents the undivided data at the top
- Internal nodes: have exactly 1 incoming branch, and 2 or more outgoing branches
- Leaf nodes (= terminal nodes): have exactly 1 incoming branch, and no outgoing branches
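The three node types can be told apart purely by their incoming and outgoing branches. The following is a minimal sketch of that idea; the `Node` class, its `kind` method, and the example split labels are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a tree; `label` is a free-text description (hypothetical)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add_child(self, child: "Node") -> "Node":
        # A directed edge (branch) from this node to the child.
        child.parent = self
        self.children.append(child)
        return child

    def kind(self) -> str:
        if self.parent is None:
            return "root"          # no incoming branch
        return "internal" if self.children else "leaf"


# Made-up example tree with two splits.
root = Node("all records")
wet = root.add_child(Node("rainfall >= 800"))
dry = root.add_child(Node("rainfall < 800: absence"))
wet.add_child(Node("temperature < 25: presence"))
wet.add_child(Node("temperature >= 25: absence"))

print([n.kind() for n in (root, wet, dry)])  # ['root', 'internal', 'leaf']
```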
Classification Tree Analysis consists of three steps:
Growing: calibration of the tree starts with the complete dataset as one group, forming the root node. The tree is then grown by repeatedly splitting the data into increasingly homogeneous groups. Each split is based on the environmental variable that best divides the data into two groups, where at least one of the groups is very homogeneous. A group that still contains a mix of presence and absence records is not homogeneous and needs to be split further. Splitting continues until a stopping criterion from the second step is met.
Stopping: the splitting process stops when a predefined criterion is met. Splitting stops when no further split is possible because all remaining observations in a group have similar values of the predictor variables, so that no further improvement to the model can be made. It can also stop when the number of observations in a terminal node would fall below a predefined minimum, or when a maximum number of splits in the tree is reached.
Pruning: reducing the complexity of the tree to avoid overfitting of the data. This is achieved by keeping only the most important splits.
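The growing and stopping steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: homogeneity is measured with Gini impurity (the text does not name a criterion), the function names `gini`, `best_split` and `grow` are hypothetical, the records are made up, and pruning is not shown.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 = absence, 1 = presence)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels, min_bucket=1):
    """Return (variable index, threshold, weighted impurity), or None."""
    best = None
    n = len(rows)
    for var in range(len(rows[0])):
        values = sorted(set(r[var] for r in rows))
        # Candidate thresholds lie midway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[var] < thr]
            right = [y for r, y in zip(rows, labels) if r[var] >= thr]
            if len(left) < min_bucket or len(right) < min_bucket:
                continue  # stopping rule: a terminal node would be too small
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (var, thr, score)
    return best


def grow(rows, labels, depth=0, max_depth=3, min_bucket=1):
    """Split recursively until a group is pure or a stopping rule applies."""
    majority = int(sum(labels) * 2 >= len(labels))
    if gini(labels) == 0.0 or depth >= max_depth:
        return {"predict": majority}
    split = best_split(rows, labels, min_bucket)
    if split is None:
        return {"predict": majority}
    var, thr, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[var] < thr]
    right = [(r, y) for r, y in zip(rows, labels) if r[var] >= thr]
    return {
        "var": var,
        "thr": thr,
        "left": grow([r for r, _ in left], [y for _, y in left],
                     depth + 1, max_depth, min_bucket),
        "right": grow([r for r, _ in right], [y for _, y in right],
                      depth + 1, max_depth, min_bucket),
    }


# Made-up records: (rainfall, temperature) with presence (1) / absence (0).
rows = [(400, 20), (500, 28), (900, 18), (1000, 22), (1100, 30), (1200, 29)]
labels = [0, 0, 1, 1, 0, 0]
tree = grow(rows, labels)
print(tree)
```

In this toy dataset the first split falls on temperature (variable index 1), because it isolates a pure group of absences; the remaining mixed group is then split on rainfall.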
Although classification trees provide a very useful tool to visualize the hierarchical effects of multiple environmental variables on species occurrence, they are often criticized for being unstable and having low prediction accuracy. This has led to the development of other methods that build upon classification trees, such as random forests and boosted regression trees.
Advantages:
- Simple to understand and interpret
- Can handle both numerical and categorical data
- Identify hierarchical interactions between predictors
- Characterize threshold effects of predictors on species occurrence
- Robust to missing values and outliers
Limitations:
- Less effective for linear or smooth species responses due to the stepwise approach
- Requires large datasets to detect patterns, especially with many predictors
- Very unstable: small changes in the data can change the tree considerably
No formal distributional assumptions: classification trees are non-parametric and can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.
Requires absence data? Yes: the tree is fitted to both presence and absence records.
EcoCommons allows the user to set model arguments as specified below.
Seed used for generating random values. Using the same seed value (e.g. 123) ensures that running the same model with the same data and settings generates the same result, despite stochastic processes such as machine learning or cross-validation.
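The effect of a fixed seed can be illustrated with a small sketch. The helper `shuffled_records` is hypothetical; it only shows that the same seed reproduces the same random ordering.

```python
import random


def shuffled_records(n_records, seed):
    """Return record indices in a random order determined by the seed."""
    rng = random.Random(seed)
    idx = list(range(n_records))
    rng.shuffle(idx)
    return idx


first = shuffled_records(10, seed=123)
second = shuffled_records(10, seed=123)
print(first == second)  # True: same seed, same order
```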
Number of repetitions (nb_run_eval)
Integer value, corresponding to the number of repetitions to be done for calibration/validation splitting (default = 10)
Data split percentage (data_split)
Numeric value between 0 and 100, corresponding to the percentage of data used to calibrate the models (calibration/validation splitting) (default = 100)
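A rough sketch of how the number of repetitions and the data split percentage might interact is given below. The function `repeated_splits` is hypothetical and assumes simple random splitting; the platform's own splitting procedure may differ.

```python
import random


def repeated_splits(n_records, data_split=70, nb_run_eval=3, seed=123):
    """Return nb_run_eval pairs of (calibration, validation) record indices."""
    rng = random.Random(seed)
    n_calib = round(n_records * data_split / 100)
    runs = []
    for _ in range(nb_run_eval):
        idx = list(range(n_records))
        rng.shuffle(idx)
        # The first n_calib shuffled records calibrate the model,
        # the remainder are held back for validation.
        runs.append((sorted(idx[:n_calib]), sorted(idx[n_calib:])))
    return runs


runs = repeated_splits(20, data_split=70, nb_run_eval=3)
for calibration, validation in runs:
    print(len(calibration), len(validation))  # 14 6 on each line
```

Under this assumption, a data split of 100 leaves no records for validation, so every record is used for calibration.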
Weighted response weights (Prevalence)
Allows the user to give more or less weight to particular observations. By default, each observation (presence or absence) has the same weight. If the value is < 0.5, absences are given more weight; if the value is > 0.5, presences are given more weight (default = NULL)
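One way to read the prevalence setting is as the share of the total weight that presences should carry. The sketch below rests on that assumption and is not taken from the platform's code; `prevalence_weights` is a hypothetical helper that keeps absence weights at 1 and rescales presence weights.

```python
def prevalence_weights(labels, prevalence):
    """Weights such that presences carry `prevalence` of the total weight.

    labels: list of 0/1 values (0 = absence, 1 = presence).
    """
    n_pres = sum(labels)
    n_abs = len(labels) - n_pres
    if not 0 < prevalence < 1 or n_pres == 0 or n_abs == 0:
        raise ValueError("need 0 < prevalence < 1 and both classes present")
    # Absences keep weight 1; presences are rescaled (assumed formulation).
    w_pres = (prevalence * n_abs) / ((1 - prevalence) * n_pres)
    return [w_pres if y == 1 else 1.0 for y in labels]


# Made-up data: 2 presences and 6 absences.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
weights = prevalence_weights(labels, prevalence=0.5)
print(weights)  # [3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

With a value of 0.5, the two presences together weigh as much as the six absences; lowering the value shifts the total weight towards absences.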
Variable importance (var_import)
Integer value, corresponding to the number of permutations to be done for each variable to estimate variable importance (default = 0)
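Permutation-based variable importance can be sketched as follows. This assumes one common formulation: shuffle a single variable, predict again, and report one minus the correlation between the original and the permuted predictions. The functions `pearson` and `variable_importance`, the toy model, and the data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def variable_importance(predict, rows, var, n_perm=3, seed=123):
    """Mean of 1 - cor(reference, permuted predictions) over n_perm shuffles."""
    rng = random.Random(seed)
    reference = [predict(r) for r in rows]
    scores = []
    for _ in range(n_perm):
        column = [r[var] for r in rows]
        rng.shuffle(column)
        # Rebuild each record with the shuffled value of one variable.
        shuffled = [r[:var] + (v,) + r[var + 1:] for r, v in zip(rows, column)]
        permuted = [predict(r) for r in shuffled]
        scores.append(1 - pearson(reference, permuted))
    return sum(scores) / len(scores)


# Toy model that only uses rainfall (index 0) and ignores temperature (index 1).
def toy_model(record):
    return 1.0 if record[0] >= 800 else 0.0


rows = [(400, 20), (500, 28), (900, 18), (1000, 22), (1100, 30), (1200, 29)]
print(variable_importance(toy_model, rows, var=0))
print(variable_importance(toy_model, rows, var=1))
```

Shuffling a variable the model ignores leaves the predictions unchanged, so its importance is zero; the more the predictions change, the higher the score.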
Scale models (rescale_all_models)
A logical value defining whether all model predictions should be scaled with a binomial GLM or not (default = FALSE)
Evaluate all models (do_full_models)
A logical value defining whether models calibrated and evaluated over the whole dataset should be computed or not (default = TRUE)
Method to be used. "class" for a classification tree or
Minimum bucket (control_minbucket)
The minimum number of observations in any terminal node. (default = 1)
Complexity parameter (control_cp)
Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. The user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it. (default = 0.01)
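The complexity rule can be illustrated with a small check. The sketch assumes that lack of fit is measured as the number of misclassified records and that the improvement is expressed relative to the error of the root node; the function `split_is_attempted` and the numbers are hypothetical.

```python
def split_is_attempted(root_error, error_before, error_after, cp=0.01):
    """True if the split improves the relative fit by at least cp."""
    improvement = (error_before - error_after) / root_error
    return improvement >= cp


# Made-up counts: the root node misclassifies 200 records.
print(split_is_attempted(200, 50, 45))  # True: improvement of 0.025
print(split_is_attempted(200, 50, 49))  # False: improvement of 0.005
```

A larger cp therefore yields a smaller tree, because fewer candidate splits clear the threshold.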
Maximum competitor splits (control_maxcompete)
The number of competitor splits retained in the output. It is useful to know not just which split was chosen, but which variable came in second, third, etc. (default = 4)
Maximum surrogate splits (control_maxsurrogate)
The number of surrogate splits retained in the output. If this is set to zero the compute time will be reduced, since approximately half of the computational time (other than setup) is used in the search for surrogate splits. (default = 5)
Surrogate splits (control_usesurrogate)
How to use surrogates in the splitting process. 0 means display only; an observation with a missing value for the primary split rule is not sent further down the tree. 1 means use surrogates, in order, to split subjects missing the primary variable; if all surrogates are missing the observation is not split. For value 2, if all surrogates are missing, then send the observation in the majority direction. (default = 2)
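The three settings can be sketched as a routing function for a single observation. This is a simplified illustration: the function `route` is hypothetical, every split is assumed to send values below its threshold to the left, and the variable names are made up.

```python
def route(obs, primary, surrogates, usesurrogate, majority):
    """Return 'left', 'right', or None if the observation is not sent on.

    primary and each surrogate are (variable name, threshold) pairs.
    """
    var, thr = primary
    if obs.get(var) is not None:
        return "left" if obs[var] < thr else "right"
    if usesurrogate == 0:
        return None  # surrogates are for display only
    for s_var, s_thr in surrogates:  # try surrogates in order
        if obs.get(s_var) is not None:
            return "left" if obs[s_var] < s_thr else "right"
    # All surrogates missing: only option 2 falls back to the majority.
    return majority if usesurrogate == 2 else None


primary = ("rainfall", 800)
surrogates = [("elevation", 500)]
partly_missing = {"rainfall": None, "elevation": 650}
all_missing = {"rainfall": None, "elevation": None}

print(route(partly_missing, primary, surrogates, 0, "left"))  # None
print(route(partly_missing, primary, surrogates, 1, "left"))  # right
print(route(all_missing, primary, surrogates, 1, "left"))     # None
print(route(all_missing, primary, surrogates, 2, "left"))     # left
```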
Number of cross-validations. (default = 10)
Best surrogate (control_surrogatestyle)
Controls the selection of a best surrogate. If set to 0, the program uses the total number of correct classifications for a potential surrogate variable; if set to 1, it uses the percent correct, calculated over the non-missing values of the surrogate. The first option more severely penalizes covariates with a large number of missing values. (default = 0)
Maximum depth (control_maxdepth)
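The difference between the two options shows up when a surrogate has many missing values. In this sketch, `surrogate_score` is a hypothetical helper and the split directions are made up; "correct" means the surrogate sends a record the same way as the primary split.

```python
def surrogate_score(primary_dirs, surrogate_dirs, style):
    """Score a surrogate: count of agreements (0) or share of non-missing (1)."""
    agree = sum(
        1 for p, s in zip(primary_dirs, surrogate_dirs)
        if s is not None and s == p
    )
    if style == 0:
        return agree
    non_missing = sum(1 for s in surrogate_dirs if s is not None)
    return agree / non_missing if non_missing else 0.0


primary = ["L", "L", "R", "R", "L", "R"]
complete = ["L", "L", "R", "L", "L", "R"]     # 5 of 6 agree, none missing
sparse = ["L", None, None, "R", None, None]   # 2 of 2 agree, 4 missing

print(surrogate_score(primary, complete, 0), surrogate_score(primary, sparse, 0))
print(surrogate_score(primary, complete, 1), surrogate_score(primary, sparse, 1))
```

With option 0 the complete surrogate wins (5 against 2 correct); with option 1 the sparse surrogate wins, because all of its non-missing values agree with the primary split.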
Set the maximum depth of any node of the final tree, with the root node counted as depth 0. (default = 30)