To design a species distribution model, think carefully about its three components: the species data, the environmental data, and the algorithm. The term algorithm refers to the actual method used to determine the probability of occurrence based on a set of environmental data. Which algorithm is most suitable depends on the type of data available, but it also works the other way around: the preferred algorithm determines the optimal sampling method, for example for generating pseudo-absence data.
Step 1: Define your question.
In order to design a species distribution model, think carefully about the question you want to answer. This is the most important part of the research, especially if you hope to publish your results or to present results that lead to helpful actions for biodiversity. A clearly defined question helps you avoid unnecessary steps by identifying which steps are actually needed.
Step 2: Think about what kind of species occurrence and environmental data are needed to find that answer.
Remember: ‘garbage in, garbage out’ means that if you just throw some data into a model, the result won’t be very meaningful. With the enormous amount of data available online, and tools like virtual labs that make it easier to run a species distribution model, it is good to take a step back and evaluate the input to make sure the results will be reliable.
Data Availability and Accuracy:
What data is available about your species of interest, and is this data accurate? While many organizations that provide data have their own procedures for checking and validating data, if you are using data from an open online source, it is your own responsibility to check its quality. A few simple checks include looking for duplicated records or outliers. Also check species names: there could be alternative common names for your species across countries.
It is important that you check whether there are any anomalies in the occurrence or environmental dataset, or whether there is any sampling bias that needs to be taken into account such as the geographical coverage of the data.
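The simple checks mentioned above (duplicated records and outliers) might be sketched as follows. This is a minimal illustration in plain Python, assuming occurrence records are `(species, longitude, latitude)` tuples. The function names, the rounding precision, the 1.5 × IQR outlier rule, and the coordinates are all illustrative assumptions, not an EcoCommons procedure or real observations.

```python
from statistics import quantiles

def drop_duplicates(records, decimals=4):
    """Keep the first record for each (species, lon, lat) after rounding coordinates."""
    seen = set()
    unique = []
    for species, lon, lat in records:
        # Normalise the name so trivial spelling variants count as the same species
        key = (species.strip().lower(), round(lon, decimals), round(lat, decimals))
        if key not in seen:
            seen.add(key)
            unique.append((species, lon, lat))
    return unique

def flag_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Made-up records for illustration only
records = [
    ("Aquila audax", 146.10, -34.20),
    ("Aquila audax", 146.10, -34.20),   # exact duplicate
    ("aquila audax ", 147.50, -33.90),  # same species, untidy name
    ("Aquila audax", 148.00, -35.10),
    ("Aquila audax", 146.80, -34.70),
    ("Aquila audax", 147.20, -34.40),
    ("Aquila audax", 12.00, -34.50),    # suspicious longitude
]
clean = drop_duplicates(records)
suspect = flag_outliers([lon for _, lon, _ in clean])
```

A flagged record is only a candidate for inspection; whether it is a genuine error or a valid but unusual observation still needs a judgement based on what is known about the species.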
How large does the occurrence dataset need to be? How many occurrence points are necessary for a good performance of the model?
The accuracy of the algorithm depends on the sample size of the available data. For some common species, such as the wedge-tailed eagle, there may be datasets with tens of thousands of records. But a rare species of conservation interest, such as the Richmond frog, may have far fewer records. The optimal number of occurrence records is related to the geographic range of the species.
In general, models tend to be less accurate for species that have a broad geographic range and that are tolerant to a range of environmental conditions compared to species with smaller geographic ranges and limited environmental tolerances.
So even if the species is rare and there are only a few occurrence records, if its geographical range is small, it is likely that the suitable environmental conditions for it are accurately sampled with fewer points compared to a species with a larger range.
Generally, the minimum necessary number of occurrence records is about 30, and algorithms that use only presence data are less affected by small sample sizes.
Step 3: Factors likely to influence the distribution of species.
Before starting the search for environmental data, think about which factors are likely to influence the distribution of the species of interest. Although some algorithms are able to handle a large number of predictor variables, it is always good to remain critical about which variables are included in the model.
Do a bit of research to get to know the species, and choose predictors that directly affect the distribution of the species. For example, if the species is sensitive to very high or low temperatures, make sure that temperature-related variables are included in your model.
Not sure which factors influence the species? First run a model with lots of predictors. The outcome of the model will show the response curves for each environmental variable which you can use as a guide to select the most important predictors to run a subsequent, more refined model.
In this example, the response curves for soil type and radiation show a flat line, which means that they did not influence the probability of occurrence, and thus you could choose to leave these variables out in the next model. Be aware that most algorithms take into account interactions between variables and thus adding or leaving out variables can change the outcome of the model. This again highlights the importance of doing some research when you design your species distribution model.
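As a rough illustration of using response curves to screen predictors, the sketch below flags variables whose curve is essentially flat. It assumes each response curve is available as a list of predicted probabilities across the range of one variable; the function name, the tolerance, and the numbers are hypothetical.

```python
def flat_predictors(response_curves, tolerance=0.05):
    """Return names of predictors whose predicted probability varies by less than `tolerance`."""
    return sorted(
        name
        for name, probs in response_curves.items()
        if max(probs) - min(probs) < tolerance
    )

# Made-up response curves: predicted probability of occurrence along each variable
response_curves = {
    "temperature": [0.05, 0.30, 0.80, 0.40, 0.10],
    "rainfall": [0.10, 0.25, 0.50, 0.70, 0.75],
    "soil_type": [0.42, 0.42, 0.42, 0.42, 0.42],
    "radiation": [0.40, 0.40, 0.41, 0.40, 0.40],
}
candidates_to_drop = flat_predictors(response_curves)
```

Because of the interactions noted above, a flat curve is a prompt to reconsider a variable, not proof that it is irrelevant; rerun the model and compare before removing it for good.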
Step 4: What kind of algorithm is needed to find the answer.
The fourth aspect of a species distribution model is choosing the algorithm that will be used to associate species occurrences with environmental conditions. There are a lot of different algorithms available to model species distributions. EcoCommons focuses on four main groups: geographic, profile, statistical regression and machine learning models. This categorisation is not set in stone, and can be a bit arbitrary, as many machine learning models are based on regression techniques that are also used in statistical regression models.
Geographic models only use presence data, and do not use environmental data. They function in geographic space, and can thus be graphically visualised with latitude and longitude on the axes. These models use simple algorithms that predict that a species is present at sites within a certain shape or distance around the occurrence points. So in this example (a convex hull), the model draws a shape around the outermost occurrence points and predicts that a species can be present anywhere within that shape, here indicated in green. Because geographic models do not take into account the environmental conditions of occurrence sites, they are often not considered as true species distribution models. But they provide a good method to get a quick idea of the spatial extent of a species.
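The convex hull idea can be sketched in a few lines of plain Python. This is a minimal sketch, assuming occurrence points are `(x, y)` pairs treated as planar coordinates (real longitude and latitude would usually be projected first); the hull is built with Andrew's monotone chain method, and the function names and points are illustrative.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other, so drop it
    return lower[:-1] + upper[:-1]

def predicts_presence(hull, site):
    """True if `site` lies inside or on the boundary of the counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False  # fewer than 3 vertices enclose no area
    return all(cross(hull[i], hull[(i + 1) % n], site) >= 0 for i in range(n))

# Made-up occurrence points
occurrences = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
hull = convex_hull(occurrences)
inside = predicts_presence(hull, (2, 2))
outside = predicts_presence(hull, (5, 1))
```

Note that no environmental variable appears anywhere in the code, which is exactly why such models are not regarded as true species distribution models.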
Profile models are the most basic true species distribution models. Like geographic models, they use only occurrence data, but they also use environmental data. They therefore function in environmental space, and the axes of the graph represent the environmental variables that are used to predict the probability of occurrence. The best known profile model is Bioclim, which is regarded as the first species distribution model. Bioclim constructs a bounding box around the minimum and maximum values of each environmental variable, and it predicts that the species can be present in all locations that fall within those boundaries. Profile models have a few limitations: they can only handle continuous environmental variables, and they do not take into account interactions between the variables. However, they are useful for exploring which factors influence a species if this information is not available beforehand.
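The bounding-box idea described above can be sketched as follows. This is a simplified illustration of an envelope model based only on the minimum and maximum of each variable, not the Bioclim implementation itself (actual implementations may differ, for example by scoring sites rather than returning yes or no). The function names and values are assumptions.

```python
def fit_envelope(presence_env):
    """Record the min and max of each environmental variable across presence sites."""
    variables = presence_env[0].keys()
    return {
        var: (
            min(site[var] for site in presence_env),
            max(site[var] for site in presence_env),
        )
        for var in variables
    }

def inside_envelope(envelope, site):
    """True if every variable at `site` falls within the fitted min-max range."""
    return all(low <= site[var] <= high for var, (low, high) in envelope.items())

# Made-up environmental values at presence locations
presence_env = [
    {"temperature": 18.0, "rainfall": 900.0},
    {"temperature": 22.0, "rainfall": 1200.0},
    {"temperature": 20.0, "rainfall": 1000.0},
]
envelope = fit_envelope(presence_env)
suitable = inside_envelope(envelope, {"temperature": 21.0, "rainfall": 950.0})
unsuitable = inside_envelope(envelope, {"temperature": 25.0, "rainfall": 1000.0})
```

Each variable is checked independently, which shows why this kind of model cannot represent interactions between variables.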
Statistical regression models need both presence and absence data. Absence data can either be true absence data or be represented by ‘made up’ data, which we call pseudo-absence data. These models also use environmental data, and the algorithms use all the data available to estimate the coefficients of the environmental variables, and they construct a function that best describes the effect of those variables on species occurrence. Statistical regression models can handle both continuous and categorical predictors and also include interactions between those variables.
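To make "estimating the coefficients" concrete, here is a minimal sketch of one common regression approach, logistic regression, fitted by gradient descent in plain Python. It assumes a single standardised predictor and presence (1) or absence (0) labels; the data, learning rate, and function names are illustrative, and real analyses would normally use an established statistics package.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(presence) = sigmoid(b0 + b1 * x) by batch gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y
            g0 += error
            g1 += error * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Made-up data: standardised temperature and presence (1) / absence (0)
temperature = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
presence = [0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(temperature, presence)
p_warm = sigmoid(b0 + b1 * 2.0)
p_cold = sigmoid(b0 + b1 * -2.0)
```

The fitted coefficient `b1` describes the direction and strength of the effect of the predictor: a positive value means the probability of occurrence increases with temperature in this made-up dataset.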
Machine learning models comprise many different approaches that all use environmental data. Most algorithms use both presence and absence data, except for the popular Maxent technique, which uses presence data in combination with background data. A variety of machine learning models are based on decision trees, ranging from single classification trees to more complex tree-based models such as Random Forests and Boosted Regression Trees. Artificial Neural Networks are another machine learning approach.
Step 5: How to choose which algorithm to use in your species distribution model.
There are quite a lot of different algorithms available to model species distributions. Which ones to choose depends on your question, on the available data, and on your understanding of your species and its ecology.
There is no straightforward answer to this question, as it depends on a lot of different things. Although it is almost impossible to recommend one method over another, below is a short overview of some limitations and assumptions of the models that might guide you in the design of your particular species distribution model.
1. The data that is available or that you want to use might limit some of the algorithm options:
- If you don’t have any environmental data available that is relevant to your species, you are limited to a geographic model, which mostly gives just an indication of the range of a species.
- If you do have data on environmental conditions, then you can design a true species distribution model.
2. If you only have presence data, you can choose to run a simple profile model, such as Bioclim.
- An alternative if you only have presence data, is Maxent, which is a presence-background model that contrasts the environmental conditions of presence locations with all available locations.
- Alternatives to presence only and presence-background models are presence-absence models with either true absence or pseudo-absence data. These can either be statistical regression models or machine learning models.
3. Each of the algorithms has its own assumptions and limitations with regard to the input data. Remember:
- Profile models are not able to include categorical predictors or interactions. They generally show poorer performance compared to presence-absence or presence-background models.
- Statistical models tend to be more sensitive to outliers and missing data compared to the machine learning models. But machine learning models are more sensitive to overfitting the data.
- Maxent has an inbuilt process to avoid overfitting.
- An advantage of machine learning models is that they are able to handle large datasets.
- However, if you don’t have many occurrence points available, Maxent or a statistical model might work better.
4. The choice of model also depends a lot on what you want as a user. Remember that the interpretation of the output differs between the models:
- Maxent and Bioclim work from an environment perspective, and they test the suitability of the environment for presence of a species.
- Statistical and machine learning models take a species perspective, and test the probability of occurrence in locations with particular environmental conditions.
- The expertise of the user. Although some tools might make it easier to design species distribution models, it is important that you understand what you are modelling.
- Some models might perform very well, but are more complex to understand and interpret.
- Some models need to be tuned by setting the configuration options to specific values depending on the datasets. In those cases, just running an algorithm with the default configuration options might not give an optimal result.
- Not everyone has the time or resources available to learn new techniques, and thus each user has to think about what they are capable of.
- Keep in mind that modelling is a complex topic that needs some time and investment to fully comprehend.
- Whether the modelling tools are freely available or not.
- Whether you have access to the computational infrastructure that you sometimes need for running large models, and visualising the output.
5. Pseudo-absence data is generated by the model, so it doesn’t represent true observations in the field. This means it will likely introduce some kind of error into the model, so:
- Think carefully about how to generate this data with regard to two aspects:
- the number of points that you generate, and
- the method that you use.
- Researchers have provided general guidelines (for example Barbet-Massin et al. 2012). For statistical models, the recommendation is 10,000 pseudo-absence points generated randomly in the study area. For machine learning models, the recommendation is to generate as many pseudo-absence points as there are occurrence points, in locations with environmental conditions that contrast with those at the occurrence points.
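The two pseudo-absence strategies in the guideline above can be sketched as follows. This is a minimal illustration, assuming a rectangular study extent and, for the contrasting strategy, a single environmental variable where "contrasting" is taken to mean outside the range observed at the occurrence points. The function names, the extent, and the values are hypothetical.

```python
import random

def random_pseudo_absences(extent, n, seed=0):
    """Draw n random (lon, lat) points uniformly within a rectangular study extent."""
    rng = random.Random(seed)
    min_lon, min_lat, max_lon, max_lat = extent
    return [
        (rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
        for _ in range(n)
    ]

def contrasting_pseudo_absences(candidates, presence_values, n, seed=0):
    """Sample up to n candidate sites whose value lies outside the range seen at presences."""
    rng = random.Random(seed)
    low, high = min(presence_values), max(presence_values)
    outside = [c for c in candidates if not (low <= c["value"] <= high)]
    return rng.sample(outside, min(n, len(outside)))

# Hypothetical study extent: (min_lon, min_lat, max_lon, max_lat)
extent = (140.0, -38.0, 150.0, -28.0)

# Statistical models: 10,000 random points in the study area
statistical_pa = random_pseudo_absences(extent, 10_000)

# Machine learning models: as many points as presences, in contrasting conditions
presence_values = [18.0, 20.0, 22.0, 19.5]  # made-up temperatures at occurrence points
candidates = [{"site": i, "value": float(v)} for i, v in enumerate(range(5, 35))]
ml_pa = contrasting_pseudo_absences(candidates, presence_values, len(presence_values))
```

Fixing the random seed makes the generated points reproducible, which helps when comparing model runs.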
It is highly recommended that you do some research on recent developments and recommendations for the algorithms you are interested in.
So in summary there is not one perfect algorithm for all of your research questions. But all these options give you the opportunity to design a species distribution model suitable for the species of interest and the study area.
"Make sure that you think about the criteria and assumptions and justify why you choose a particular algorithm".
Remember: In a tool such as EcoCommons, you can easily run multiple algorithms and compare their output, so if you’re not sure which one fits your data best, you can select more than one. If you use multiple models, you might get slightly different results, and it is always good to take all of these results into account before you draw your final conclusions about the distribution of your species of interest.
- Barbet-Massin M, Jiguet F, Albert CH & Thuiller W (2012) Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3(2), 327-338.
- Beaumont LJ, Graham E, Duursma DE, et al. (2016) Which species distribution models are more (or less) likely to project broad-scale, climate-induced shifts in species ranges? Ecological Modelling, 342, 135-146.
- Elith J, Graham CH, Anderson RP, et al. (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29(2), 129-151.
- Elith J & Leathwick JR (2009) Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics, 40(1), 677.
- Franklin J (2010) Mapping species distributions: spatial inference and prediction. Cambridge University Press.
- Guisan A & Zimmermann NE (2000) Predictive habitat distribution models in ecology. Ecological modelling, 135(2), 147-186.
- Guisan A & Thuiller W (2005) Predicting species distribution: offering more than simple habitat models. Ecology letters, 8(9), 993-1009.
- Pearson RG (2010) Species’ distribution modeling for conservation educators and practitioners. Lessons in conservation, 3, 54-89.
EcoCommons Australia (2022). [online] Brisbane: EcoCommons Australia. Available at: https://support.ecocommons.org.au/support/solutions/articles/6000162160-designing-an-sdm/
This module was originally published on https://bccvl.org.au/training/ and has been adapted to the EcoCommons platform. It was created under attribution to:
Huijbers CM, Richmond SJ, Low-Choy SJ, Laffan SW, Hallgren W, Holewa H (2016) SDM Online Open Course, module 4: design a species distribution model. Biodiversity and Climate Change Virtual Laboratory, http://www.bccvl.org.au/training/.