Selection of absence, pseudo-absence, or background data can have profound impacts on your results. For instance, MaxEnt requires background data, which is often 10,000 randomly selected locations from within your study area extent. However, a growing number of papers show that “targeted background” data, or background points selected from a “bias layer”, produce better results than 10,000 randomly selected points. Geographic models and profile models, such as a Surface Range Envelope, do not require absence data. The remaining statistical models and machine learning algorithms do require absence or pseudo-absence data.
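One way to read “background points selected from a bias layer” is as weighted sampling of grid cells, where cells with more survey effort are more likely to be drawn. The sketch below assumes the bias layer is a small grid of non-negative weights and uses the Efraimidis-Spirakis method for weighted sampling without replacement. The function name `sample_background` and the grid layout are illustrative assumptions, not part of any particular SDM package.

```python
import heapq
import random


def sample_background(bias, n_points, seed=None):
    """Draw n_points distinct (row, col) cells with probability tied to bias weight.

    bias is a list of rows of non-negative weights (hypothetical bias layer).
    Cells with a weight of zero are never selected.
    """
    rng = random.Random(seed)
    keyed = []
    for r, row in enumerate(bias):
        for c, weight in enumerate(row):
            if weight > 0:
                # Efraimidis-Spirakis key: larger weights tend to give larger keys.
                keyed.append((rng.random() ** (1.0 / weight), (r, c)))
    if n_points > len(keyed):
        raise ValueError("not enough cells with a positive bias weight")
    return [cell for _, cell in heapq.nlargest(n_points, keyed)]


# Toy 2 x 2 bias layer: one heavily surveyed cell, one lightly surveyed cell.
toy_bias = [[0, 1],
            [5, 0]]
background_cells = sample_background(toy_bias, n_points=2, seed=0)
```

With a real bias layer you would draw many more cells (for example the 10,000 background points mentioned above) and convert the cell indices back to coordinates.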
True absence data
A true absence location is one where the observer is certain the target species was not present. Data with this level of certainty are rare, and most analysts assume a species is absent when an area is repeatedly surveyed and the target species is not observed. In the context of a species distribution model (SDM), we assume that the reason a species is absent relates to the environmental conditions at that location. However, this assumption can be false for a variety of reasons. For example, a species may not be able to move into areas with suitable conditions, a species may not remain in suitable areas all year (i.e. migratory species), or the presence of other species may exclude the target species from areas that are otherwise suitable. For this reason, most SDMs are acknowledged to be measuring the potential distribution of a species, not its actual distribution. Nonetheless, a model trained on “true” absence data will often outperform a model trained on “pseudo-absence” data.
In general, comprehensive surveys can supply true absence data when sites have been visited one or more times and surveyors used high-quality detection methods suitable for the species. The number of surveys required to achieve 95% detection probability will vary by species and terrain. For example, to record true absences of a species that is only active at night, surveys should be carried out at night, and conclusions about absences cannot be drawn if surveys were only conducted during the daytime. Further, one nocturnal species may require only 5 surveys to be confident it is absent, while another may require 20 visits to reach the same confidence. Such surveys, however, are time-consuming, and therefore true absence data are rarely available across broad geographic areas for any species.
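The link between per-visit detectability and the number of visits can be made concrete with a simple calculation. The sketch below assumes that visits are independent and that each visit detects a species that is present with the same probability `p_detect`; under those assumptions, the chance of at least one detection in n visits is 1 - (1 - p)^n. The detection probabilities used here are illustrative values, not estimates for any real species.

```python
import math


def visits_needed(p_detect, confidence=0.95):
    """Smallest n such that 1 - (1 - p_detect)**n >= confidence.

    Assumes independent visits with a constant per-visit detection probability.
    """
    if not 0.0 < p_detect < 1.0:
        raise ValueError("p_detect must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_detect))


# Hypothetical species: one easy to detect, one hard to detect.
easy = visits_needed(0.50)   # 5 visits
hard = visits_needed(0.14)   # 20 visits
```

Under this simple model, a species detected on half of all visits needs 5 surveys for 95% confidence, while one detected on only 14% of visits needs 20.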
If true absence data are not available for your species of interest, but you want to use an algorithm that requires absence data, you can use pseudo-absence data. Again, keep in mind that models trained on pseudo-absence data are often less accurate than models trained on true absence data.
The most common way to generate pseudo-absence data is to simply select random locations within your study area, with the same number of random points as presence points. There are many problems with this, not the least of which is that it is statistically inappropriate. Additionally, most occurrence datasets are full of sampling bias in both geographic and environmental space. Selecting random points when there is high sampling bias leads the model to assume that areas beyond those sampled heavily are absence locations, which is often untrue. The resulting model will predict that your species occurs in the areas you sampled, but will often miss areas where the species occurs but which were not sampled adequately.
One easy way to limit the impact of sampling bias is to select random points that are at least a specified distance from each occurrence point but no further away than a specified maximum distance. For example, you may require each point to be at least one kilometer away from your occurrence points, because the home range of your species is on average 3 km², and then set the maximum distance to no more than 5 km from an occurrence record. If that outer distance is kept relatively close to the occurrence points, then your pseudo-absence locations roughly match your sampling bias, which increases the chances of your model focusing on real differences between where the species is and is not present. This is what the “disk” pseudo-absence strategy does.
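The minimum and maximum distance idea can be sketched with simple rejection sampling. The example assumes presence records are given as (x, y) pairs in a projected coordinate system measured in kilometers; the function name `disk_pseudo_absences` and its parameters are hypothetical and are not the platform's own implementation. Each candidate is drawn from a ring around a randomly chosen presence point and kept only if it is at least `min_km` from every presence point.

```python
import math
import random


def disk_pseudo_absences(presences, n_points, min_km, max_km,
                         seed=None, max_tries=100_000):
    """Sample points between min_km and max_km of the presence records."""
    if not presences:
        raise ValueError("at least one presence point is required")
    if not 0 <= min_km < max_km:
        raise ValueError("need 0 <= min_km < max_km")
    rng = random.Random(seed)
    points = []
    tries = 0
    while len(points) < n_points:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all points; relax the distances")
        cx, cy = rng.choice(presences)
        # Taking the square root makes the draw uniform over the ring's area.
        radius = math.sqrt(rng.uniform(min_km ** 2, max_km ** 2))
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        # Reject candidates that fall too close to any presence record.
        if all(math.hypot(x - px, y - py) >= min_km for px, py in presences):
            points.append((x, y))
    return points


# Hypothetical presence records (km), with the 1 km and 5 km limits from the text.
presence_xy = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
pseudo_absences = disk_pseudo_absences(presence_xy, 50, min_km=1.0, max_km=5.0, seed=1)
```

Because each candidate is centered on a randomly chosen presence record, areas with many records receive more pseudo-absence points, which is how this strategy mirrors the sampling bias in the occurrence data.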
Often, we might have thousands of locations where other similar species were recorded, but the target species was not. If we have ~10,000 such locations, we might consider these to be “targeted background” points in a MaxEnt model. If there are many more of these “targeted background” points than presence locations, but nowhere near 10,000, it might be better to consider a zero-inflated GLM to differentiate between presence and pseudo-absence data. If the number of “targeted background” points is roughly similar to the number of presences, these data could be treated as pseudo-absence data. While not ideal, this would be a better strategy than randomly selecting background points.
Two aspects of generating pseudo-absence data can be customized: the number of pseudo-absence points generated and the generation method. The optimum settings for both can differ among algorithms, so it is worth investigating the best options for the algorithm of your choice. For example, Barbet-Massin et al. (2012) compared the performance of a variety of algorithms with different methods.
Number of pseudo-absence points
With regard to the number of pseudo-absence points generated, it is often advised to consider the ratio to the number of presence points. This ratio is also called the prevalence, which refers to the proportion of occupied locations relative to the number of absence points. Prevalence has been shown to influence model accuracy, which highlights the importance of selecting an appropriate ratio.
The default ratio is set to 1:1 (pseudo-absence:presence), which generates the same number of pseudo-absence points as there are presence points.
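The effect of the ratio on the training data can be illustrated with a small calculation. The two helper functions below are hypothetical, and the sketch assumes that prevalence means the share of presence records among all training records (presences plus pseudo-absences), which is one common reading of the term.

```python
def pseudo_absence_count(n_presence, ratio=1.0):
    """Number of pseudo-absence points for a pseudo-absence:presence ratio."""
    if n_presence < 0 or ratio <= 0:
        raise ValueError("n_presence must be >= 0 and ratio must be > 0")
    return round(n_presence * ratio)


def training_prevalence(n_presence, n_pseudo_absence):
    """Share of presence records among all training records (assumed definition)."""
    total = n_presence + n_pseudo_absence
    if total == 0:
        raise ValueError("no training records")
    return n_presence / total


# Hypothetical dataset with 250 presence records.
n_default = pseudo_absence_count(250)                 # 1:1 ratio, 250 points
prev_default = training_prevalence(250, n_default)    # 0.5
n_large = pseudo_absence_count(250, ratio=10)         # 10:1 ratio, 2500 points
prev_large = training_prevalence(250, n_large)        # about 0.09
```

A 1:1 ratio gives a training prevalence of 0.5, whereas ten pseudo-absence points per presence lowers it to roughly 0.09.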
Pseudo-absence generation methods
We offer three different methods to generate pseudo-absence data:
Random (default): pseudo-absence points are randomly generated in a predefined geographical area, anywhere except for locations where presence has been recorded. The geographical area is either the extent of the environmental/climate layers, or the area defined in the geographical constraint tab of the SDM experiment.
Min-max radius (referred to as 'disk'): this method generates pseudo-absence points only within a delimited geographical distance from recorded presence points, defined by a minimum and maximum radius around each presence location. Setting a minimum distance ensures that pseudo-absence points are not generated too close to a presence record, where you can assume that the environmental conditions would be too similar. Setting a maximum distance ensures that pseudo-absence points are not generated in inappropriate locations, which may result in over-prediction.
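The random method described above can be sketched on a regular grid. The example assumes that “locations where presence has been recorded” means grid cells that contain at least one presence record, and that the study area is a rectangular extent given as (xmin, ymin, xmax, ymax). The function name and arguments are hypothetical and do not reproduce the platform's implementation.

```python
import math
import random


def random_pseudo_absences(extent, cell_size, presences, n_points, seed=None):
    """Pick n_points distinct cell centers that contain no presence record."""
    xmin, ymin, xmax, ymax = extent
    n_cols = math.ceil((xmax - xmin) / cell_size)
    n_rows = math.ceil((ymax - ymin) / cell_size)

    def cell_of(x, y):
        # Row and column index of the cell containing (x, y), counted from (xmin, ymin).
        return (int((y - ymin) // cell_size), int((x - xmin) // cell_size))

    occupied = {cell_of(x, y) for x, y in presences}
    candidates = [(r, c) for r in range(n_rows) for c in range(n_cols)
                  if (r, c) not in occupied]
    if n_points > len(candidates):
        raise ValueError("not enough unoccupied cells in the extent")
    rng = random.Random(seed)
    chosen = rng.sample(candidates, n_points)
    # Report each chosen cell by the coordinates of its center.
    return [(xmin + (c + 0.5) * cell_size, ymin + (r + 0.5) * cell_size)
            for r, c in chosen]


# Toy 4 x 4 study area with two hypothetical presence records.
points = random_pseudo_absences((0, 0, 4, 4), 1, [(0.5, 0.5), (2.2, 3.7)], 5, seed=42)
```

Restricting `extent` to a smaller rectangle plays the same role as a geographical constraint: pseudo-absence points can only fall inside the area you define.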
In general, for classification techniques (Classification Tree, Random Forest, Boosted Regression Tree), Barbet-Massin et al. (2012) recommended using the same number of pseudo-absence points as presence points (1:1 ratio), generated in locations whose environmental conditions contrast with those at the presence points.
Barbet-Massin M, Jiguet F, Albert CH, Thuiller W (2012) Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3(2), 327-338.
Chefaoui RM, Lobo JM (2008) Assessing the effects of pseudo-absences on predictive distribution model performance. Ecological Modelling, 210(4), 478-486.
Lobo JM, Jiménez-Valverde A, Hortal J (2010) The uncertain nature of absences and their importance in species distribution modelling. Ecography, 33(1), 103-114.
Phillips SJ, Dudík M, Elith J, Graham CH, Lehmann A, Leathwick J, Ferrier S (2009) Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications, 19(1), 181-197.
VanDerWal J, Shoo LP, Graham C, Williams SE (2009) Selecting pseudo-absence data for presence-only distribution modeling: How far should you stray from what you know? Ecological Modelling, 220(4), 589-594.