The BioDT Recreational Potential Model for Scotland
Developers’ Report
This report provides a high-level, chronological overview of development work carried out by two of the authors (JMR, MT) between November 2024 and June 2025. The intention is to provide some context and explanation for posterity. We also make several recommendations for further development work. Most items are accompanied by one or more links to issues and/or pull requests on GitHub, which often contain further details and references to specific lines of code.
1 Background
1.1 Existing code
The Recreational Potential model has had a complicated history of development involving several individuals working independently in different tools and languages (QGIS, Python and R).
As of November 2024, the model existed as an interactive Shiny app written in the R language. The actual computation of RP was deeply intertwined with Shiny app code and pre-processing of raster data, although much of it was factored out into functions.
The user could load, edit and save personas, which were stored in a single `.csv` file. They could then choose between a selection of geographical areas to run the model, corresponding to a collection of shapefiles stored alongside the code.
As well as the persona `.csv` and shapefiles, the model required input data in the form of raster files (`.tif` extension) covering Scotland at 20 metre resolution. For the most part (with the exception of the slope layer) these rasters were integer-valued, with different integers corresponding to different categories of land cover. For example, each pixel in the `Water_Lakes` layer could take any of the following values:
| Raster Value | Corresponding Feature |
|---|---|
| 5 | Pond |
| 1 | Lochan |
| 4 | Small Loch |
| 2 | Medium Loch |
| 3 | Large Loch |
| 6 | Major Loch |
| 0 or NA | No feature present |
Both code and data were held on NERC DataLabs, a browser-based service for environmental data science (resembling JupyterHub and Google Colab) in which JASMIN users can run instances of RStudio and publish Shiny apps.
Parts of the code were being tracked using git, with the remote living in a private (but now public) repository within the BioDT GitHub organisation, although the most up-to-date code on DataLabs was several weeks ahead of the most recent commit.
1.2 SPEAK
The objective of the SPEAK study was to gain insights into the RP model from the perspective of users. The study recruited approximately 40 members of the public, who would be given access to the RP app to use in the planning stage of recreational activities. The participants would then go and do an activity, and afterwards fill out a post-activity survey. Participants would be provided with GPS-enabled Garmin watches, giving the researchers the means to track participants’ paths during their recreational activities.
1.3 BioDT
The requirements of BioDT were quite different from those of SPEAK. The BioDT consortium have collaborated on an R Shiny app which attempts to bring the various models, or ‘prototype Digital Twins’ (pDTs), together through a single user interface.

The intended workflow was as follows:
- Users would interact with the main BioDT app, providing the information needed to run one or more of the available models (e.g. the Biodiversity model and the Recreational Potential model, which make up the Cultural Ecosystem Services pDT).
- A request would then be sent to run the model(s), with inputs provided by the user, on the LUMI HPC cluster. Since access to LUMI is restricted, the request is made via an intermediate layer called Lexis.
- This would trigger a Slurm job submission to run a containerised version of the model(s) on LUMI.
- The results would be transferred back, again via Lexis, for the user to view in the app.
Thus, from the BioDT perspective, embedding the RP model within another Shiny app was unhelpful. Instead, the RP model needed to be runnable via a command line script taking (i) a persona and (ii) a target region as command-line arguments. This script would then be containerised to facilitate running on LUMI.
1.4 Initial scope & development objectives
The main issues motivating the move to bring in additional software developers were:
- Access to the RP app required an account on DataLabs, which could not be provided to members of the public.
- The RP app was too slow: very small regions (e.g. Bush estate) took minutes to run, and larger regions (Cairngorms) took hours.
- Potential users were limited to computing RP within the pre-determined regions, which did not cover all of Scotland.
- In its existing form, i.e. as a Shiny app, the model could not be incorporated in the main BioDT app.
Hence, the original development objectives were:
- Make the app accessible to participants in the SPEAK study.
- Improve the execution speed and ability to scale to larger areas.
- Provide a means for users to choose the region of interest.
- Disaggregate the RP computation from the Shiny app, and reformulate it so that it could be run on LUMI via the BioDT app.
1.5 Expanded scope & revised development objectives
Initially a few weeks were allocated for this work, and the two of us worked independently on small, self-contained tasks. MT worked on removing the dependence on the legacy `raster` package and replacing it with the `terra` package, which is significantly faster, has better memory management, and supports lazy loading of large rasters. This led to tangible improvements in speed. JMR worked on #1, which turned out to be a red herring.
After gaining familiarity with the codebase and a better understanding of objective 4, it became clear that the most straightforward way to meet objectives 2–4 simultaneously was to fully rewrite the code from scratch. A full rewrite would be the most effective way to improve the speed and scalability (objective 2) and the flexibility to add or modify functionality (including objective 3). Some outstanding issues could also be neatly circumvented.
2 Overview of work done
2.1 Reconfiguration of git repository & developer setup
Although it is possible to give multiple users access to a DataLabs notebook, simultaneous access by multiple users is not currently supported. Furthermore, since all users share the same user space, only one user may cache their git configuration and GitHub credentials.
The first task was therefore to move development off DataLabs and use GitHub for version control in a multi-developer setup. The most recent (as of 22/11/2024) version of the code and data was downloaded from DataLabs and added without modification to a new ‘orphan’ git branch — see commit a795456.
After some discussion, the old `main` branch was moved to `old-main` to preserve the history, and tagged `2024-model` to facilitate rediscovery in future.¹ The new branch was renamed `main` and another branch `develop` was created by branching from `main`. `develop` was assigned as the default branch, to align with a workflow where contributions are merged via Pull Requests to `develop`, which is periodically merged with `main` after ensuring that everything is functional and self-consistent.
Run-time calls to `install.packages` were removed and instead the `renv` package was used to compile a list of version-pinned dependencies (see comments in #1 for further explanation).
Pull Requests merged:
- A working Singularity container with minimal modifications to the existing code #1
- Introduce formatR #13 (later replaced by styler)
Issues closed:
2.2 Publishing existing app to Posit Connect
The SPEAK study was scheduled to launch in early February 2025. Hence, there was a quite urgent need to provide a version of the RP app that could be accessed by study participants through their browser. This led to the decision to prioritise publishing the RP app in its existing form (i.e. objective 1), and postpone development work towards the other objectives (2–4).
The initial expectation was that publishing the app through DataLabs would be quick and simple. However, it quickly became clear that this would not be feasible because DataLabs does not give deployed apps write access to the filesystem. This was a problem for us since files were being written to disk during the creation of personas and at various points during the calculation of Recreational Potential. To get around the persona issue we would have needed to very quickly reformulate the app to use remote storage such as Google Sheets. However, there were other places where data was being written to disk that would have been more complicated to fix.
Fortunately we were able to instead publish the app to UKCEH’s Posit Connect server, which does allow applications to access the filesystem.
Probably the easiest way to publish to Posit Connect is to install the R package `rsconnect` and run the following commands from the root of the repository:

```r
rsconnect::addServer("https://connect-apps.ceh.ac.uk", name="connect-apps.ceh.ac.uk")
rsconnect::connectApiUser(server="connect-apps.ceh.ac.uk", account="<username>", apiKey="<api key>")
rsconnect::deployApp(appDir=".")
```
This will take a long time if you have a lot of data to upload. For me it took up to an hour to upload ~2GB data.
You will want to add the directory `rsconnect/` to your `.gitignore` file.
See the UKCEH Posit Connect documentation for further guidance (requires an account with UKCEH Posit Connect).
There were a few minor additions required before we could make the app available to SPEAK study participants. First, the app needed to be accessible to study participants but ideally not anyone else. We agreed that it was sufficient to have a single, global password for the app which would be shared with the participants. For this we used an R package called `shinymanager`, which checks the username/password combination provided by the user against the correct values set as shell environment variables.
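For illustration only, a minimal sketch of this pattern is shown below; the environment variable names are hypothetical, not those used in the deployed app.

```r
library(shiny)
library(shinymanager)

# Credentials are read from environment variables set on the server,
# so no secrets are stored in the code or repository
credentials <- data.frame(
  user     = Sys.getenv("APP_USERNAME"),
  password = Sys.getenv("APP_PASSWORD"),
  stringsAsFactors = FALSE
)

# Wrap the existing UI in a login screen
ui <- secure_app(fluidPage(titlePanel("Recreational Potential")))

server <- function(input, output, session) {
  # Check the submitted username/password against the credentials above
  secure_server(check_credentials = check_credentials(credentials))
}

shinyApp(ui, server)
```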
UKCEH requires that user-facing apps meet a set of criteria before they can be published:
- Adhere to UKCEH branding guidelines.
- Include a link to the UKCEH privacy notice.
- Include a contact email.
Satisfying these requirements is especially straightforward for Shiny apps, since a template theme is available at github.com/NERC-CEH/UKCEH_shiny_theming. However, in our case applying this theme caused some buttons to become invisible, which had to be fixed with some custom CSS as detailed in this Issue.
Pull Requests merged:
Issues closed:
2.3 Full rewrite of the RP model
The original aim of #18 was to refactor the existing code to make it possible to run the RP model via a script, independently of the Shiny app. As explained earlier, however, the decision was taken to rewrite the code from scratch. A number of subtle bugs in the original code that led to incorrect outputs (#22, #23) were also fixed during the rewrite.
In the process of arriving at a mathematical definition of the RP model (Section 1, Technical Supplement), it became clear that the ‘dynamic’ part — i.e. the calculations involving user-provided inputs — could be separated from the ‘non-dynamic’ part, which dealt with quantities that only needed to be computed once, ahead of time, and stored for later use.
Hence, the revised design split the code into three independent parts:
- ‘Non-dynamic’ raster pre-processing, i.e. production of input data (independent of user).
- ‘Dynamic’ calculation of Recreational Potential (based on user-provided persona and region of interest).
- Implementation of the Shiny app.
The dynamic part was recast as a sum of pre-computed raster layers weighted by the persona scores. Concretely, this consisted of reading the rasters from the disk, and then performing the following steps:
1. For each of the four components:
   1. Multiply each layer by a scalar.
   2. Sum the layers.
   3. Rescale the result to the unit interval.
2. Sum the components.
3. Rescale the result to the unit interval.
Steps 1.1 and 1.2 are better treated as a single step: an independent multiply-and-sum at each spatial location. This is close to the simplest mathematical operation one can do on a computer, and is both extremely quick and highly parallelisable.
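In symbols (illustrative notation only; the precise definitions are in the Technical Supplement), the calculation for a component \(c\) and the final RP value is roughly

\[ V_c(x) = \operatorname{rescale}_{[0,1]}\Big( \sum_{i} s_{c,i}\, F_{c,i}(x) \Big), \qquad \mathrm{RP}(x) = \operatorname{rescale}_{[0,1]}\Big( \sum_{c} V_c(x) \Big), \]

where \(s_{c,i}\) is the persona score for item \(i\) of component \(c\), and \(F_{c,i}(x)\) is the corresponding layer value at location \(x\).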
In the 2024 model, the raster layers took integer values, so a layer might look something like this:
\[ C_i = \begin{bmatrix} \ddots \\ & 1 & 0 & 0 \\ & 2 & 1 & 0 \\ & 2 & 1 & 3 & \\ & & & & \ddots \end{bmatrix} \]
These integers were really labels for different categories corresponding to the features present in the layer.
These values were then remapped to user scores, which involves an indexing operation for each layer, effectively `values(layer) <- mapping[values(layer)]`. This isn’t particularly inefficient or problematic in and of itself, but it requires some due diligence, such as ensuring that the data is genuinely integer-typed (i.e. not float) and avoiding operations that assume continuous data, such as interpolation (e.g. as part of a reprojection).
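A minimal sketch of this style of lookup, using hypothetical values for the `Water_Lakes` layer; the idiomatic `terra` route is a two-column ‘from/to’ reclassification table passed to `terra::classify`.

```r
library(terra)

# Toy categorical raster: integer codes 0-6, with 0/NA meaning 'no feature present'
water_lakes <- rast(nrows = 3, ncols = 3,
                    vals = c(0, 1, 5, 2, 0, 3, 4, 6, NA))

# Hypothetical persona scores for categories 1 (Lochan) ... 6 (Major Loch)
persona_scores <- c(2, 7, 7, 8, 1, 9)

# Reclassify: each integer category becomes its score; anything else becomes NA
rcl <- cbind(from = 1:6, to = persona_scores)
scored <- classify(water_lakes, rcl, others = NA)
```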
In order to pre-compute the proximity layers (#21) we needed to switch to a one-hot representation, where each feature type gets its own layer that is binary-valued — the feature is either present or not.
\[ C_i \overset{\text{one hot}}{\longrightarrow} ( I_{i0}, I_{i1}, I_{i2}, I_{i3}, \ldots) \]
\[ \begin{aligned} I_{i0} &= \begin{bmatrix} \ddots \\ & 0 & 1 & 1 \\ & 0 & 0 & 1 \\ & 0 & 0 & 0 & \\ & & & & \ddots \end{bmatrix} \, , \quad I_{i1} = \begin{bmatrix} \ddots \\ & 1 & 0 & 0 \\ & 0 & 1 & 0 \\ & 0 & 1 & 0 & \\ & & & & \ddots \end{bmatrix} \, , \\ I_{i2} &= \begin{bmatrix} \ddots \\ & 0 & 0 & 0 \\ & 1 & 0 & 0 \\ & 1 & 0 & 0 & \\ & & & & \ddots \end{bmatrix} \, , \quad I_{i3} = \begin{bmatrix} \ddots \\ & 0 & 0 & 0 \\ & 0 & 0 & 0 \\ & 0 & 0 & 1 & \\ & & & & \ddots \end{bmatrix} \, . \end{aligned} \]
If we run with this and give all 87 feature types their own layer, the recreational potential can be computed in a single vectorised step as a weighted sum of layers.
The splitting of the original model into a one-time pre-processing stage and a dynamical stage is the essential reason for the multiple orders of magnitude improvement in speed and memory use realised by this code rewrite. All of the slower, more memory-intensive computations could be performed up-front.
Further gains resulted from a number of factors made possible through the rewrite:
- Eliminated intermediate raster i/o operations that were previously required to keep the peak memory requirements within the total amount of memory available.
- Replaced indexing operations with multiply operations, made possible by using a one-hot representation for the input rasters, instead of integer-valued categorical rasters.
- Replaced loops with vectorised operations using `terra::app` and related functions.
As an example, the main dynamic part of the model — the calculation of a component — is essentially reduced to the following lines of code:
```r
raster <- load_raster(                             # <1>
  file.path(data_dir, paste0(component, ".tif")),
  bbox
)
scores <- persona[names(raster)]                   # <2>
result <- terra::app(raster, function(features) {  # <3>
  sum(scores * features, na.rm = TRUE)
})
```

1. Lazily load the raster, cropping to a region given by `bbox`.
2. Extract the scores for this component from the `persona` dataframe.
3. Compute the weighted sum.
`terra::app` automatically parallelises this multiply-and-sum operation across the grid; each ‘pixel’ is computed in a separate thread. Even better, when `terra::app` is applied to a simple operator such as `sum`, it actually calls a fast C++ implementation rather than a pure R one (see the terra documentation).

This is also a very memory-light operation; `terra` only has to load one pixel at a time per parallel thread. This means that at any given instant only \(N_\text{threads} \times N_\text{layers}\) values need to be loaded in memory, irrespective of how many pixels there are in total.
Pull Requests merged:
Issues closed:
- Peak memory use is too high #4
- Running the model without the Shiny app #6
- Separation of input data and configuration, outputs and intermediate artefacts #7
- Remove redundant packages #12
- App crashes if name of persona `.csv` does not match variable contained therein #16
- Persona csv need only contain a list of numbers #20
- Proximity rasters can be pre-computed #21
- Proximity should be computed using unique features instead of non-unique persona scores #22
- Mismatch between integer-integer lookup table and double-valued raster data breaks the raster reclassification step #23
- Input rasters should be integer-typed #24
- Missing values in rasters should be consistently NA, not NaN #26
- One-hot encode the input raster #27
2.4 Full rewrite of the Shiny app
Although the original plan was to deploy the existing app for the SPEAK study, and only then move on to look at improvements in speed and scale, some factors outside of our immediate control meant that the model rewrite (#18) was essentially complete before the original app was shared with participants.
This opened up the tantalising possibility of using the new model back-end, which was significantly faster and able to work with arbitrary regions of interest rather than the fixed set of shapefiles, for the SPEAK study. However, the new back-end required a new front-end Shiny app.
At this point I (JMR) wish to be transparent about my use of Generative AI tools for explaining and generating code. For a number of reasons this seemed a ‘good’ time to lean on Generative AI: the tight deadline (a few days), the low stakes (the user interface cannot be ‘incorrect’ and need not be well-written), and the fact that I had never written a Shiny app before (or any R at all, for that matter), combined to influence my decision. I used the GPT-4 Turbo model through the Edinburgh Language Model (ELM) interface provided by EDINA at the University of Edinburgh.
In this case I found Generative AI to be very helpful; I was able to learn R Shiny principles and syntax ‘on the fly’ through exposure to lots of generated examples. This stands in contrast to other frustrating and unproductive experiences I have had with Generative AI in domains where I have more existing knowledge.
Part of the reason that Generative AI works well here is that R Shiny is extremely popular so there is an abundance of existing R Shiny code available on the internet. I wish to acknowledge the countless individuals whose work (or hobby) shared online has contributed to the generated outputs that I used during this work.
The rewritten app was released to SPEAK study participants in the second week of February. Several issues and possible UI improvements were identified by users in the early period (see list of Issues below). These were addressed and the app redeployed.
In particular, we draw attention to #49, a serious issue in which layers in the input rasters were jumbled up, effectively mislabelled. The origins of this issue were multiple and quite subtle, involving undocumented and unexpected behaviour in `terra::rast`. We discuss this in more detail in Section 3.2.1, as well as recommending some changes to ensure that this cannot happen undetected in future.
To ensure that the specific version of the app used in this study could be identified in future, the final deployment (February 25th) was tagged `SPEAK-prerelease` on GitHub.
Pull Requests merged:
- Rewrite shiny app #28
- App improvements #30
- Posit Connect #37
- Final-final app updates before study launch #38
- Fix data production #46
- FAQs texts #47
- Post-launch tweaks #54
- Fix bug in appending personas to existing file #58
- Improve handling of bbox checks #79
Issues closed:
- App hangs when output or temporary directory doesn’t exist #3
- User documentation for the Shiny app #8
- App crashes if name of persona `.csv` does not match variable contained therein #16
- Should be possible to download + upload persona files from the app #31
- Indicate computations in progress using `waiter` tool #32
- Slider for lower cutoff value, below which is transparent #33
- Toggle greyscale on background map #34
- Subheaders for item groups #36
- Users should not be able to overwrite existing personas #42
- Fail gracefully for cases of zero recreational potential #44
- Use Scotland shapefile instead of bbox for checking if user area selection out of bounds #45
- Local path network data missing #48
- Data production jumbled up layers #49
- Mapping of csv rows onto sliders is broken #55
2.5 Script & container
As explained previously, the key requirement for BioDT was to factor out the RP model calculation into a command-line script that takes the two user-provided inputs as arguments:
- A full persona, given as a `.csv` file.
- A region in which to calculate RP, given as a quadruple of ‘bounding box’ coordinates.
The main substantive task was to configure the parsing of command-line inputs. We decided against using third-party packages (optparse, argparse), sticking instead with base R functionality to reduce the future maintenance burden. The result is a very simple command-line interface, lacking nice-to-have features such as argument validation, but sufficient for a ‘minimum working example’ to be handed over to the BioDT team.
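A minimal sketch of the base-R approach is shown below; the argument names and ordering are illustrative, and the actual interface is defined in `scripts/cli/main.R`.

```r
# Expected usage (illustrative):
#   Rscript main.R <persona.csv> <xmin> <xmax> <ymin> <ymax>
args <- commandArgs(trailingOnly = TRUE)

if (length(args) != 5) {
  stop("Usage: Rscript main.R <persona.csv> <xmin> <xmax> <ymin> <ymax>", call. = FALSE)
}

persona_file <- args[1]
bbox <- as.numeric(args[2:5])  # xmin, xmax, ymin, ymax

if (!file.exists(persona_file)) stop("Persona file not found: ", persona_file, call. = FALSE)
if (anyNA(bbox)) stop("Bounding box coordinates must be numeric.", call. = FALSE)
```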
We then produced a Singularity container `.def` file based on the `rocker/r-ver` image (preferred over `r-base`), which is itself based on the Ubuntu image. The build stage involves (i) installing required C and C++ libraries using `apt-get`, (ii) installing the `biodt.recreation` package from GitHub, and (iii) downloading the data. Once the container is built, one can run `singularity run app.sif ARGS` in order to execute `Rscript scripts/cli/main.R ARGS` inside the container.
We also created a container for the input data production script, anticipating that this would need to be run on LUMI or wherever else we could access a high-memory compute node for the calculation of distances (see Section 2.7).
Pull Requests merged:
- Run the Recreational Potential model as a script #67
- Add singularity container for CLI and data production scripts #90
Issues closed:
2.6 Packaging and repository management
We adopted the increasingly common practice of organising the code into a package. This has a number of advantages compared with a stand-alone script, since the standardised setup allows us to leverage tools for installing, documenting and testing the code, and managing its dependencies on other packages.
Packaging in R is not as mature as in other languages, and there were some false starts during the process of familiarising ourselves with the setup and development workflow. In particular, we initially intended to create a separate package for the Shiny app, which would depend on the model package, but later returned to a single-package layout for simplicity and maintenance reasons. It was simply too awkward to develop the model code and the Shiny app code simultaneously.
We added `roxygen2`-style documentation for all the functions, which produces nicely formatted Markdown documentation in RStudio. We utilised the `lintr` and `styler` packages, which help keep the formatting consistent as well as catching syntax bugs. Maintaining a consistent format is useful for ensuring that meaningful changes in commits are not obscured by formatting changes. We added a couple of simple tests which are run using `testthat`, but due to the time pressure of developing a working version of the model for the SPEAK study we did not develop any further tests, which is unfortunate. We configured pre-commit hooks which run the linter, styler, tests and documentation generation, but also provided a script that performs the same steps in case the developer does not want to install pre-commit.
Finally, we added some vignettes that demonstrate basic usage, although this could be expanded. Vignettes are run during the build stage of the package. This meant they could not use the full input data since (i) it is unacceptable to have to download 3GB of data just to build the package, and (ii) we did not want the package to contain 3GB of data itself. To get around this, we included some small rasters in the package (`inst/extdata/rasters/Bush`) and wrote some helper functions for accessing them.
**The `inst/` directory**

Apparently it has become standard practice to use the `inst/` directory to distribute additional elements such as scripts and Shiny apps along with the package. This is the practice we have adopted here. There are really just two important things to know about the `inst/` directory:
First, its contents are copied to the top level of the package when the package is built. Hence, the top level of the built package will contain `examples/`, `extdata/` and `scripts/` (the directory containing shell scripts is not included in the package).

Second, in order for file paths to work when the package is installed, one must omit `inst/` from the path. But this means the path will not work during development.

The way to regain consistency between the two situations (installed versus not installed) is to use `system.file`. There are several examples of this in the code, including the one below, where paths to files or directories under `inst/` needed to be constructed.
```r
get_preset_persona_file <- function() {
  system.file("extdata", "personas", "presets.csv",
    package = "biodt.recreation", mustWork = TRUE
  )
}
```
Pull Requests merged:
- Add pre-commit hooks #73
- Tidy up the main model package #76
- Tidy up Shiny App #78
- Single package layout (take 2) #80 (first attempt in #74)
- Improve examples and vignettes #81
- Bring data production into the main package #88
Issues closed:
2.7 Proximity contributions and pair-wise distances
The 2024 version of the Shiny app contained a step, applying to each road/track/path feature (`FIPS_I` component) and river/lake feature (`Water` component), which involved a determination of the distance from each pixel to the nearest pixel in which that feature was present.²
This step, which after the rewrite became the final stage of the raster pre-processing pipeline, presented challenges when scaling it up from small regions to the whole of Scotland at 20 metre squared resolution.
As a result, the version used in the SPEAK study did not include these proximity contributions. Not everyone was displeased by this — there was generally an acknowledgement that the proximity part of the model needed some more thought. Unfortunately, when linear features (width ~1 pixel) were reprojected onto the Leaflet map this led to some quite confusing results (#61).
2.7.1 Why is this any different from the other data production stages?
Because every single other processing step is entirely local - all of the information required to generate an output value for a pixel at coordinate \(x\) is contained in the input pixels at coordinate \(x\). This makes it trivial to process the data in chunks, and memory costs scale linearly with the area of the chunks.
In contrast, a pairwise distance calculation is non-local, since the output value at \(x\) depends on the entire set of input values, i.e. at all coordinates.
Hence, when one calls `terra::distance` on a Scotland-wide raster, one immediately encounters segmentation faults, indicating that the C++ backend to `terra` has attempted to write to a location in memory that does not exist (because it could not be allocated).
On my (JMR) desktop I have 16GB RAM. With `terra::terraOptions(memfrac = 0.5)` there is a bit under 8GB to play with. From experimenting I found I could process Scotland in slices of 1/8 the total size. On DataLabs there is double the RAM, and I could process 1/4-sized slices, which was reassuring. Extrapolating, this gives 64GB as a ballpark figure for the RAM required to process all of Scotland in one go. LUMI (or indeed any other HPC facility) would provide a compute node with the required RAM. Unfortunately, our project’s allocation on LUMI expired before we managed to attempt this.
Code for some of these attempts can be found in the `legacy.R` file.
2.7.2 Attempt 1: cutoff distance parameter in terra::distance
As of version 1.8.21, we can set a cutoff distance `maxdist` in `terra::distance`, beyond which distances are assigned `NA`. This reduces the function from being global to being dependent only on pixels within the cutoff distance. However, based on a cursory look at the code for this cutoff (which was introduced into `terra` just two weeks earlier in this commit!), I could not see evidence of sophisticated memory management based on cutoff distances — it appears to just mask the output.
2.7.3 Attempt 2: Create a buffer region around features
Since the model is only sensitive to distances within around 500m, any pixel greater than 500m from the nearest feature may as well not be computed and just be set to `NA`. This can be achieved by computing a buffer region around each feature and restricting `terra::distance` to the buffer region only.

We made two attempts to compute a buffer, using `terra::focal` and `terra::buffer`, in the notebooks `focal.qmd` and `buffer.qmd` here. Unfortunately the use of `terra::buffer` on a raster runs into the exact same memory issues as the distance calculation; both `buffer` and `distance` call `GDALComputeProximity` under the hood (see this comment).
The use of `terra::focal`, which applies a function over a moving window (e.g. with radius 500m) across the raster, was successful at reducing memory requirements, but took an extremely long time.
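For reference, the moving-window idea looks roughly like this (illustrative parameters and variable names; the actual notebooks are `focal.qmd` and `buffer.qmd`):

```r
library(terra)

# 'features' is assumed to be a binary (1 = present, 0/NA = absent) raster
w <- focalMat(features, d = 500, type = "circle")   # circular window of radius 500 m

# For each cell, sum the (weighted) feature values within the window;
# a value > 0 means there is at least one feature within ~500 m
nearby <- focal(features, w = w, fun = sum, na.rm = TRUE, na.policy = "all")
buffer_mask <- nearby > 0

# The expensive distance calculation would then only be needed
# for cells where buffer_mask is TRUE
```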
In the case of linear features such as roads, paths and rivers, it would have been significantly more efficient to compute a buffer region using the original vector layers (#51). Unfortunately we were not able to obtain access to these layers at the time.
2.7.4 Attempt 3: Replace with convolution
The outputs of `terra::distance` (distances in metres) are passed through a logistic function to map them to the unit interval. There is nothing particularly important about computing distances accurately; it just matters that the outputs of the composition of the distance and logistic function fall monotonically from 1 to 0 over the right range. It seemed reasonable, therefore, to swap this pair of functions for a single convolution, e.g. with a Gaussian kernel. This would reduce memory requirements for the same reason as computing the buffer using a sliding window reduces memory requirements; only small chunks of the raster need to be loaded into memory at any given time.
However, the outputs of a convolution were qualitatively different from the distance + logistic. This is most obvious when one considers extended bodies such as lakes; the output of the convolution in the centre of the lake will be different than at the borders, whereas the output of distance + logistic is constant over the lake surface. The result is also different when two features are present in the convolutional window.
2.7.5 Attempt 4: Compute distance in overlapping subdomains
Since `terra::distance` is a global function, it cannot be decomposed into independent subdomains to be computed sequentially. However, since we only care about distances under 500m, domain decomposition is possible provided the domains overlap by at least this distance.

The original Scotland-wide raster was divided into 20 overlapping spatial windows with 500m overlapping buffers. After the distance step was complete within each window, the windows were stitched together, discarding the overlaps. The distance raster was also truncated at 500m, setting any values above this to `NA`, which simply passes through the logistic function and ensures that we don’t waste disk space storing high-precision floats that are functionally equivalent to zero.
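A condensed sketch of this approach is shown below; function and variable names are illustrative, and the working implementation lives in the data production scripts and the PRs listed below. The full extent is split into horizontal slices, each extended by 500m, processed independently, truncated, and stitched back together.

```r
library(terra)

# 'features' is assumed to be NA where the feature is absent (as terra::distance expects)
distance_by_slices <- function(features, n_slices = 20, overlap = 500, maxdist = 500) {
  full <- ext(features)
  ys <- seq(ymin(full), ymax(full), length.out = n_slices + 1)

  slices <- lapply(seq_len(n_slices), function(i) {
    # Extend the slice by `overlap` metres so distances near the seams are correct
    padded <- ext(xmin(full), xmax(full),
                  max(ys[i] - overlap, ymin(full)),
                  min(ys[i + 1] + overlap, ymax(full)))
    d <- distance(crop(features, padded))   # distance to nearest feature, within the slice
    d[d > maxdist] <- NA                    # truncate: beyond 500 m contributes nothing
    crop(d, ext(xmin(full), xmax(full), ys[i], ys[i + 1]))  # discard the overlap
  })

  merge(sprc(slices))                       # stitch the slices back together
}
```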
Pull Requests merged:
- Fix data production #46 (note - this PR tried several approaches to reducing the memory requirements, but did not reach a satisfactory solution.)
- adding tiling example #92
- Running scotland #95
- Updated rasters bush #96
Issues closed:
3 Discussion and recommendations
3.1 Outstanding issues
At the time of writing there are around 20 open issues on the GitHub repository.
3.2 Re-engineering the data production process
The data production process has significant room for improvement in terms of robustness and reproducibility. Some suggested changes are:
- As far as is feasible, the manual manipulations in QGIS should be replaced by deterministic scripts written in (for example) R.
- Data should be reprojected onto a consistent grid at the earliest possible opportunity.
- The projection CRS should be the same as for Leaflet maps, since that is where all raster data is eventually destined. This would avoid additional coordinate transformations during rendering and prevent the smearing of linear features and edges as described in #61.
- Layers containing linear features such as roads and rivers should not be rasterised until the last step, if at all. If nothing else, the proximity step would benefit from these layers remaining as vectors (#51).
- Layers should be constructed as binary presence/absence rasters from the beginning; one can skip the creation of categorical rasters altogether.
- The size of rasters should be reducible using the approaches described in #93.
- Layer names should be changed to be more descriptive (see below).
3.2.1 Use descriptive names for the features / layers
The final recommendation is something we feel duty bound to elaborate on, since it relates to a very subtle and serious issue that affected earlier iterations of the code, and ultimately meant we had to re-deploy the app two weeks into the SPEAK study. This discussion closely follows that in #49 and #50.
Currently the feature/layer names are in the format `{Component}_{Dataset}_{i}`. These names appear in the `Name` column of `config.csv`. Through the mapping `Name <-> Raster_Val`, these names are assigned to the layers of the one-hot-encoded rasters in the `one_hot_layer` function. After this point, the layer names are propagated through each operation. In `app.R` they are essentially used to index rows of `config.csv`, e.g. to get the `Description` out to put next to the sliders.
The main problem which gave rise to #49 is that this naming convention mirrors the default names assigned by `terra` at various points in the data production, except we have no guarantee on what the trailing index `{i}` will be. Essentially, the fallback behaviour of `terra` is to assign layers the correct names, but not necessarily in the correct order.

This means that if, for whatever reason, layer names are not being preserved by `terra` operations, or the assignment of names to a raster is not working as expected, we will not see any error raised. Rather, we will see (if we are looking closely) that the layer names have been scrambled once we go to run the model.
To give a concrete example that caused a headache, `terra::rast(list_of_rasters)` appears to silently drop the layer names of the `list_of_rasters`, and instead generate default layer names that, in our case, were the same as the names we expected to see, but in a different order.

In another example, we failed to notice that layer names were not being set correctly in the one-hot encoding step. Later on, the same layer names were reassigned during this stacking step, based on the order in which those layers appear in the raster, which implied a different order from the one we expected. This meant that `FIPS_I_RoadsTracks` features were reassigned the names `FIPS_I_RoadsTracks_1`, `FIPS_I_RoadsTracks_2`, and so on.
To be clear, this is (mostly) our fault, not the fault of `terra` — we could and should use different layer names! However, the fact that layer names sometimes do and sometimes don’t propagate, and names are silently dropped and regenerated with no warning message, seems like asking for trouble.

Changing the naming convention would not require much effort. Indeed, if we just edited `config.csv` before running the data production script, everything should sort itself out, with the exception of some particular logic in `app.R` that assumes the layer name starts with `{Component}_{Dataset}`.
A simple fix would be to do nothing more than replace the integers `{i}` with a descriptor. For example, `FIPS_I_RoadsTracks_1` could become `FIPS_I_RoadsTracks_Motorway`, and `Water_Lakes_1` could become `Water_Lakes_Pond`.
If nothing else, this would cause an error if ever these names were not propagated and instead replaced by default names, which is extremely important.
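Descriptive names could be complemented by an explicit check after any operation that is expected to preserve them. A hypothetical helper (not currently part of the package) might look like:

```r
# Stop immediately if a raster's layer names differ (in content or order)
# from those expected, e.g. the Name column of config.csv
check_layer_names <- function(raster, expected_names) {
  actual <- names(raster)
  if (!identical(actual, expected_names)) {
    stop(
      "Unexpected layer names (wrong names or wrong order).\n",
      "  expected: ", paste(expected_names, collapse = ", "), "\n",
      "  actual:   ", paste(actual, collapse = ", ")
    )
  }
  invisible(raster)
}
```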
3.3 Make the model more interpretable
It is important to appreciate that, although the RP model produces numerical outputs,³ the purpose of the model is not to quantify, predict or even understand an objective physical world, but to serve as a qualitative input to an individual’s assessment of where to recreate.
The design objectives of the model should therefore include the maximisation of a user’s ability to discern suitable locations to recreate. Thus, the mathematical construction of the model, like the user interface, is just another design choice to be iteratively refined based on user feedback from stakeholder engagement studies such as SPEAK.
With this point in mind, I recommend changes to the RP model that would take it in the direction of greater simplicity, transparency and interpretability to both users and researchers.
- Do not rescale the four components to \([0, 1]\); instead, let RP be a simple weighted sum of all 87 items.
- Do not rescale the final result either, or at least make it a global rescaling by dividing by the maximum value of RP possible, not just the maximum value in the chosen region. Note that the colour mapping can still be adjusted to make sure different values are clearly differentiable.
- Replace the logistic function in the proximity part with a simple linear decrease over a fixed length scale. That is, a straight line from \((0, 1)\) to \((d_{\text{max}}, 0)\) where \(d_{\text{max}}\) is a distance explicitly chosen, perhaps even by the user as a ‘hyper-parameter’ of the model. It would be a simple extension to make \(d_\text{max}\) specific to individual items, e.g. larger for major lochs than for cycle paths.
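In symbols, the proposed replacement for the logistic function would simply be

\[ f(d) = \max\!\left(0,\; 1 - \frac{d}{d_{\text{max}}}\right), \]

which equals 1 at the feature itself and falls linearly to 0 at distance \(d_{\text{max}}\).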
3.3.1 Why not rescale the components?
Regarding point 1, the idea behind rescaling the four components before summing them was so that they “contribute equally” to the RP value, despite containing different numbers of items — the Landscape (`SLSRA`) component contains 39 items whilst the Infrastructure (`FIPS_I`) component only contains 10.
The problem with this is not that the argument is wrong per se, but that the outcome is undesirable. It means that, regardless of how the user configures their persona, these four components will always contribute equally to the RP value.
To illustrate the problem we can take an extreme example. Let’s say the user really loves water and scores all lakes and rivers with 10s and 9s, but they have little interest in paths and actively dislike roads, so score them 1s and 0s, respectively. They would reasonably expect rivers and lakes to appear with the highest RP values on the map, and paths and roads to be lowest. However, the paths (scored with 1s) will actually contribute more to the RP value than the rivers (scored with 9s) and therefore show up as having a higher RP value. Why is this? Ignoring the possibility that different features overlap, after rescaling paths will still be effectively scored 1, whereas rivers will be effectively scored 0.9.
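Using the numbers from this example, and supposing the minimum in the region is zero while the per-component maxima come from a path (scored 1) and a major loch (scored 10) respectively, the rescaling gives

\[ \tilde{V}_{\text{FIPS\_I}}(\text{path pixel}) = \frac{1}{1} = 1, \qquad \tilde{V}_{\text{Water}}(\text{river pixel}) = \frac{9}{10} = 0.9, \]

so the path pixel ends up contributing more to RP than the river pixel, despite being scored 1 versus 9.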
Put differently, this rescaling of the components means that only within-component differences in the scores are translated into differences in RP value. The fact that 9 (rivers) is much larger than 1 (paths) in absolute terms is irrelevant, due to the rescaling.
When the RP value is just a simple weighted sum, these confusing outcomes are avoided and the relationship between the scores and the RP values becomes much more transparent.
This is reason enough to remove the rescaling step, but there is another angle to this, which is that most pixels only have a handful of features ‘present’ within them. One does not often find roads overlapping lakes overlapping lowlands — they are mutually exclusive. If one wanted to rescale or normalise the RP values, the maximum number of overlapping features is a far more relevant quantity than the number of features within a component.
Finally, removing the rescaling step takes the possibility of encountering a divide-by-zero off the table. Recall from the Technical Supplement that this occurs when the maximum and minimum values for a component in the region of interest are equal. This is not unlikely to occur if the user selects a small region of interest, or decides to score an entire component with the same values — e.g. zero, if the user intends to nullify contributions from a component. Currently the model does not crash in these situations since we defined some safe handling of `NA` values, but it is not ideal.
3.4 Changes to the app
3.4.1 UI changes
Anecdotal evidence suggests that participants in the SPEAK study found the individual components more useful than their aggregate, the RP layer. It would not be difficult to provide them with the option to toggle individual items (e.g. cycle paths) on the map, as well as their persona-weighted sum. Some users may prefer to omit items by deselecting a checkbox rather than navigating to the persona tab and giving them a score of 0.
As well as providing control at the item level, the app should also contain information about the origin of the data users are seeing.
Linear features should be vector layers rather than rasters. This would avoid pixelation of roads, cycle paths etc.
The UI could also display the latitude–longitude coordinates or a what3words address at the cursor position. This would help users look up the relevant area on another service such as Komoot or AllTrails.
3.4.2 Storage of personas
As things stand, personas are stored in `.csv` files on the server side. This means the app requires write access to the server’s filesystem.
This is convenient, since users do not have to upload their own persona files. However, it is not a scalable solution.
Future work should take a different approach to handling users’ personas, such as providing the option to link the app to Google Sheets.
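As a rough illustration of what this could look like (hypothetical sheet ID and helper names, using the googlesheets4 package):

```r
library(googlesheets4)

# Authenticate with a service account so no interactive login is needed
gs4_auth(path = "service-account.json")
sheet_id <- "<google-sheet-id>"

save_persona <- function(persona_df) {
  # Append one persona (a single row of scores) to the shared sheet
  sheet_append(sheet_id, persona_df)
}

load_personas <- function() {
  read_sheet(sheet_id)
}
```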
Removing the requirement for write access would also allow UKCEH to publish the app on the in-house DataLabs service, which may be preferable to Posit Connect.
3.5 Towards a Digital Twin
The RP model is intended to form half of the Cultural Ecosystem Services ‘Prototype Digital Twin’, but the relationship between the RP model and a vision of a Digital Twin is not always transparent (#53).
The following is quoted from the BioDT website (accessed 19/02/2025):
The BioDT consortium adheres to the following definition of a digital twin (DT), released by the Digital Twin Consortium:
A digital twin is a virtual representation of real-world entities and processes, synchronized at a specified frequency and fidelity.
Here, fidelity refers to the level of precision captured by the DT in comparison with its physical counterpart.
It is difficult to make the case that the RP model in its current form fits this definition; one would have to argue that customising a persona amounts to “synchronising at a specified frequency and fidelity” a “virtual representation of real-world entities and processes.”
From the author’s perspective, there are two planned enhancements that will move the RP model in the direction of a Digital Twin:
- Combine the RP model with the biodiversity model as described in (Rolph et al., 2024).
- Update the raster data at a “specified frequency” to “synchronise” it with the “real-world entities” it is representing.
However, meeting this definition of Digital Twin cannot be an objective in and of itself; it must be an outcome that is realised in pursuit of other objectives. The availability of near real-time data does not justify real-time updates to the RP model. Rather, the frequency with which to update these datasets should be determined by balancing the needs of users with the monetary and environmental cost of the data acquisition and processing pipeline.
In cases where the utility of real-time data updates is clear, it may be worth considering whether integrating the BioDT app with third party data services may be more efficient. For example, providing users with a link to a local weather forecast is perfectly sufficient; there would be no need to embed real-time weather forecasts in the app itself. Third-party services could provide information as to the availability of parking, toilets, places to eat, and other things that people may need to plan their activity.
4 Concluding remarks
In the author’s opinion, one of the key contributions from this period of development work was to write down a concrete definition of the RP model (Technical Supplement, Section 1). From this flowed the substantial improvements to the design and efficiency of the code, and the ability to verify that it is working as expected.
The kernel of the RP model is a sum of 87 items representing presence/absence of, or proximity to, a landscape feature, weighted by the corresponding persona scores. In the author’s opinion, this simplicity is a good thing! Perhaps we do not need to expend vast amounts of energy, water and money to compute complex metrics for Recreational Potential. Considering the objective is for users to be able to easily run the model many, many times, both the business case and the environmental case for a small computational footprint are strong. Were I to take this project forward I would lean into the simplicity and aim to provide users with the most relevant information in a transparent manner.
If the intention is to scale up the BioDT app, or the CES component, to make it widely available to members of the public, it may be wise to clearly demarcate the role of researchers — i.e. the science content, stakeholder engagement — and to then outsource development work to professional app developers (i.e. not the author!). It may be fruitful to look for possible integrations with popular apps used by recreationalists, and perhaps consider presenting the CES app as an ‘add-on’ to existing apps rather than a competitor.
Access to code & data
Code for the Recreational Potential model is available as an R Package hosted on GitHub at https://github.com/BioDT/uc-ces-recreation. These reports relate specifically to the version 1.0 release, which is also available on Zenodo at https://doi.org/10.5281/zenodo.15705544.
The model requires some input data that is too large to be bundled in with the package, but which can be downloaded from Dropbox after package installation using `biodt.recreation::download_data()`. This data is also included in the Zenodo listing.
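For convenience, a typical setup might look like the following (assuming the `remotes` package is available; see the repository README for authoritative instructions):

```r
# Install the package directly from GitHub, then fetch the input data
remotes::install_github("BioDT/uc-ces-recreation")
biodt.recreation::download_data()
```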
The source for these documents and this website can be found at https://github.com/BioDT/ces-recreation-reports.
Funding
Funding for OpenNESS came from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 308428, OpenNESS Project (Operationalisation of Natural Capital and Ecosystem Services: From Concepts to Real-world Applications.)
Funding for BioDT came from the European Union’s Horizon Europe Research and Innovation Programme under grant agreement No 101057437 (BioDT project, https://doi.org/10.3030/101057437). Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.
Funding for SPEAK came from the Natural Environment Research Council – Growing Shoots Partnership and application co-creation bursary. NE/Y005805/1 _Growing Shoots.
CRediT statement
JMR: software; visualisation; writing. MT: software; visualisation; writing. CA: conceptualisation; data curation; funding acquisition; software. JD: conceptualisation; funding acquisition; project administration. SR: funding acquisition; software.
Acknowledgements
We wish to acknowledge significant contributions to a previous version of the Recreational Potential model by Will Bolton. JMR would also like to thank Tomáš Martinovič for assisting with understanding the requirements of BioDT.
We are very grateful to all the participants of the SPEAK project which this output is based on including: Alastair Leaver, Alice MacSporran, Brian Cassidy, Chris Pollock, David Giles, Jean Cowie, Joanna Gilliatt, Justyna Olszewska, Laura Taylor, Maddi Bunker, Rebecca MacLennan, S Mayes, Shaila Rao, Steve MacKinnon and Tom Gebbie, Active Stirling. We are equally grateful to an additional 18 participants who preferred to remain anonymous.
Correspondence
- {jand,chan}@ceh.ac.uk for enquiries relating to the BioDT or SPEAK projects, or ongoing and future work.
- chan@ceh.ac.uk for enquiries relating to the data sources and QGIS processing.
- {joemar,madtig}@ceh.ac.uk for enquiries relating to the code and anything else appearing in the technical supplement and/or developer’s report.
- simrol@ceh.ac.uk for enquiries relating to the biodiversity component of the BioDT project.
References
Footnotes
1. The `2024-model` tag on GitHub is a snapshot of all of the code that was on DataLabs. This included multiple versions of the RP model. The ‘working version’ at that time was kept under the `R_ShinyVersion_ReducedModel/` subdirectory.↩︎
2. In truth, while this was the intention, it was not quite what was being calculated — see #22.↩︎
3. The RP model constructs a continuous scale by applying operations \((+, \times)\) on ordinal (persona scores) and nominal (feature presence/absence) data. In a strict mathematical sense this is not ‘allowed’, although it is common practice in many fields, with an a posteriori justification based on self-consistency or simply alignment with qualitative expectations.↩︎
Citation
@report{ukceh2025,
author = {Marsh Rossney, Joe and Tigli, Maddalena and Andrews,
Christopher and Dick, Jan and Rolph, Simon},
publisher = {Zenodo},
title = {Reports on the {BioDT} {Recreational} {Potential} {Model} for
{Scotland}},
version = {1.0},
date = {2025-06},
url = {https://doi.org/10.5281/zenodo.15715070},
doi = {10.5281/zenodo.15715070},
langid = {en},
abstract = {These reports describe the BioDT Recreational Potential
Model for Scotland (1.0), at the end of the BioDT project in June
2025.}
}