“Learning algorithms” are a class of computational tool designed to infer information from a data set, and then apply that information predictively. They are particularly well suited to complex pattern recognition, or to situations where a mathematical relationship needs to be modelled but where the underlying processes are not well understood, are too expensive to compute, or where signals are over-printed by other effects. If a representative set of examples of the relationship can be constructed, a learning algorithm can assimilate its behaviour, and may then serve as an efficient, approximate computational implementation thereof. A wide range of applications in geomorphometry and Earth surface dynamics may be envisaged, ranging from classification of landforms through to prediction of erosion characteristics given input forces. Here, we provide a practical overview of the various approaches that lie within this general framework, review existing uses in geomorphology and related applications, and discuss some of the factors that determine whether a learning algorithm approach is suited to any given problem.

The human brain has a remarkable capability for identifying patterns in complex, noisy data sets, and then applying this knowledge to solve problems or negotiate new situations. The research field of “learning algorithms” (or “machine learning”) centres around attempts to replicate this ability via computational means, and is a cornerstone of efforts to create “artificial intelligence”. The fruits of this work may be seen in many different spheres – learning algorithms feature in everything from smartphones to financial trading. As we shall discuss in this paper, they can also prove useful in scientific research, providing a route to tackling problems that are not readily solved by conventional approaches. We will focus particularly on applications falling within geomorphometry and Earth surface dynamics, although the fundamental concepts are applicable throughout the geosciences, and beyond.

This paper does not attempt to be comprehensive. It is impossible to list every problem that could potentially be tackled using a learning algorithm, or to describe every technique that might somehow be said to involve “learning”. Instead, we aim to provide a broad overview of the possibilities and limitations associated with these approaches. We also hope to highlight some of the issues that ought to be considered when deciding whether to approach a particular research question by exploring the use of learning algorithms.

The artificial intelligence literature is vast, and can be confusing. The
field sits at the interface of computer science, engineering, and statistics:
each brings its own perspective, and sometimes uses different language to
describe essentially the same concept. A good starting point is the book by

Learning algorithms are computational tools, and a number of software
libraries are available which provide users with a relatively straightforward
route to solving practical problems. Notable examples include

This paper begins with a brief overview of the general framework within which learning algorithms operate. We then introduce three fundamental classes of problem that are often encountered in geomorphological research, and which seem particularly suited to machine learning solutions. Motivated by this, we survey some of the major techniques within the field, highlighting some existing applications in geomorphology and related fields. Finally, some of the practical considerations that affect implementation of these techniques are discussed, and we highlight some issues that should be noted when considering exploring learning algorithms further.

Fundamentally, a learning algorithm is a set of rules that are designed to find and exploit patterns in a data set. This is a familiar process when patterns are known (or assumed) to take a certain form – consider fitting a straight line to a set of data points, for example – but the power of most learning algorithm approaches lies in their ability to handle complex, arbitrary structures in data. Traditionally, a distinction is drawn between “supervised” learning algorithms – which are aimed at training the system to recognise known patterns, features, or classes of object – and “unsupervised” learning, aimed at finding patterns in the data that have not previously been identified or that are not well defined. Supervised learning typically involves optimising some pre-defined measure of the algorithm's performance – perhaps minimising the difference between observed values of a quantity and those predicted by the algorithm – while in unsupervised learning, the goal is usually for the system to reach a mathematically stable state.

At a basic level, most learning algorithms can be regarded as “black boxes”: they take data in, and then output some quantity based upon that data. The detail of the relationship between inputs and outputs is governed by a number of adjustable parameters, and the “learning” process involves tuning these to yield the desired performance. Thus, a learning algorithm typically operates in two modes: a learning or “training” phase, where internal parameters are iteratively updated based on some “training data”, and an “operational” mode in which these parameters are held constant, and the algorithm outputs results based on whatever it has learned. Depending on the application, and the type of learning algorithm, training may operate as “batch learning” – where the entire data set is assimilated in a single operation – or as “online learning”, where the algorithm is shown individual data examples sequentially and updates its model parameters each time. This may be particularly useful in situations where data collection is ongoing, and it is therefore desirable to be able to refine the operation of the system based on this new information.
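This two-phase pattern can be sketched in a few lines of code. The example below is purely illustrative: a toy linear model whose adjustable parameters are updated example-by-example (in the spirit of “online learning”), and which is then used operationally with those parameters held constant. All data and parameter names are invented for the purpose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "training data": y depends linearly on x, plus noise.
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.05, size=200)

# Training phase ("online learning"): examples are presented one at a
# time, and the adjustable parameters (w, b) are updated after each.
w, b = 0.0, 0.0
rate = 0.1  # learning rate
for epoch in range(50):
    for xi, yi in zip(x, y):
        error = (w * xi + b) - yi
        w -= rate * error * xi
        b -= rate * error

# Operational phase: the parameters are now held constant, and the
# trained system simply maps new inputs to outputs.
def predict(x_new):
    return w * x_new + b

print(predict(0.5))  # close to the true value, 3.0 * 0.5 + 1.0 = 2.5
```

Batch learning would instead estimate the parameters from the entire data set in a single operation – for this toy model, an ordinary least-squares fit.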

In the context of learning algorithms, a “data set” is generally said to
consist of numerous “data vectors”. For our purposes, each data vector
within the data set will usually correspond to the same set of physical
observations made at different places in space or time. Thus, a data set might
consist of numerous stream profiles, or different regions extracted from a
lidar-derived digital elevation model (DEM). It is possible to combine
multiple, diverse physical observations into a single data vector: for
example, it might be desirable to measure both the cross section and
variations in flow rate across streams, and regard both as part of the same
data vector. It is important to ensure that all data vectors constituting a
given data set are obtained and processed in a similar manner, so that any
difference between examples can be attributed solely to physical factors. In
practice, pre-processing and “standardising” data to enhance features that
are likely to prove “useful” for the desired task can also significantly
impact performance: we return to this in Sect.

Broadly speaking, we see three classes of problem where learning algorithms can be particularly useful in geomorphology: classification and cataloguing; cluster analysis and dimension reduction; and regression and interpolation. All represent tasks that can be difficult to implement effectively via conventional means, and which are fundamentally data-driven. However, there can be considerable overlap between all three, and many applications will not fit neatly into one category.

Classification problems are commonplace in observational science, and provide the canonical application for supervised learning algorithms. In the simplest case, we have a large collection of observations of the same type – perhaps cross sections across valleys – and we wish to assign each to one of a small number of categories (for example, as being of glacial or riverine form). In general, this kind of task is straightforward to an experienced human eye. However, it may be difficult to codify the precise factors that the human takes into account, preventing their implementation as computer code: simple rules break down in the face of the complexities inherent to real data from the natural world. With a learning algorithm approach, the user typically classifies a representative set of examples by hand, so that each data vector is associated with a “flag” denoting the desired classification. The learning algorithm then assimilates information about the connection between observations and classification, seeking to replicate the user's choices as closely as possible. Once this training procedure has been completed, the system can be used operationally to classify new examples, in principle without the need for further human involvement.
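As a hedged illustration of this workflow – not drawn from any real survey – the sketch below trains a minimal “nearest centroid” classifier on hand-flagged data vectors, and then uses it operationally on unseen examples. The two-element “profiles” are synthetic stand-ins for real measurements.

```python
import numpy as np

# Hypothetical training set: each row is a data vector (e.g. a valley
# cross section reduced to two summary measurements), and each entry
# of `flags` is the user's hand-assigned class: 0 = riverine, 1 = glacial.
profiles = np.array([[0.9, 0.2], [1.0, 0.3], [0.8, 0.1],   # riverine-like
                     [0.2, 0.9], [0.3, 1.0], [0.1, 0.8]])  # glacial-like
flags = np.array([0, 0, 0, 1, 1, 1])

# Training: summarise each class by the mean of its labelled examples.
centroids = np.array([profiles[flags == c].mean(axis=0) for c in (0, 1)])

# Operational use: a new, unlabelled example is assigned to whichever
# class summary it lies closest to.
def classify(vector):
    distances = np.linalg.norm(centroids - vector, axis=1)
    return int(np.argmin(distances))

print(classify(np.array([0.85, 0.25])))  # -> 0 (riverine)
print(classify(np.array([0.15, 0.95])))  # -> 1 (glacial)
```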

Beyond an obvious role as a labour-saving device, automated systems may enable users to explore how particular factors affect classification. It is straightforward to alter aspects of data processing, or the labelling of training examples, and then re-run the classification across a large data set. Another advantage lies in the repeatable nature of the classification: observations for a new area, or obtained at a later date, can be processed in exactly the same manner as the original data set, even if personnel differ. It is also possible to use multiple data sets simultaneously when performing the classification – for example, identification of certain topographic features may be aided by utilising high-resolution local imagery, plus lower-resolution data showing the surrounding region, or land use classification may benefit from using topography together with satellite imagery.

It is sometimes claimed that it is possible to somehow “interrogate” the
learning algorithm so as to discover its internal state and understand which
aspects of the data are used to make a particular classification. This
information could offer new insights into the physical processes underpinning
a given problem. For the simplest classifiers, this may be possible, but in
general we believe it ought to be approached with some scepticism.
Classification systems are complex, and subtle interactions between their
constituent parts can prove important. Thus, simplistic analysis may prove
misleading. A more robust approach, where feasible, would involve classifying
synthetic (artificial) data and exploring how parameters controlling the
generation of this affect results

Conventionally, classification problems assume that all examples presented to the system can be assigned to one category or another. A closely related problem, which we choose to call “cataloguing”, involves searching a large data set for examples of a particular feature – for example, locating moraines or faults in regional-scale topographic data. This introduces additional challenges: each occurrence of the feature should be detected only once, and areas of the data set that do not contain the desired feature may nevertheless vary considerably in their characteristics. As a result, cataloguing problems may require an approach that differs from other classification schemes.

Examples of classification problems in geomorphology where machine learning
techniques have been applied include classifying elements of urban
environments

Classification problems arise when the user has prior knowledge of the
features they wish to identify within a given data set. However, in many cases
we may not fully understand how a given process manifests itself in
observable phenomena. Cluster analysis and dimension reduction techniques
provide tools for “discovering” structure within data sets, by identifying
features that frequently occur, and by finding ways to partition a data set
into two or more parts, each with a particular character. An accessible
overview of the conceptual basis for cluster analysis, as well as a survey of
available approaches, can be found in

In many applications, data vectors are overparameterised: the representations
used for observable phenomena have more degrees of freedom than the
underlying physical system. For example, local topography might be
represented as a grid of terrain heights. If samples are taken every 10 m, then 11 samples span a distance of

One particularly important application of dimension reduction lies in visualisation. Where individual data vectors are high-dimensional, it may be difficult to devise effective means of plotting them in (typically) two dimensions. This makes it hard to appreciate the structure of a data set, and how different examples relate to one another. In order to tackle this problem, learning algorithms may be used to identify a two-dimensional representation of the data set that somehow preserves the higher-dimensional relationships between examples. This process involves identifying correlations and similarities between individual data vectors, and generally does not explicitly incorporate knowledge of the underlying physical processes. Thus, the coordinates of each example within the low-dimensional space may not have any particular physical significance, but examples that share common characteristics will yield closely spaced points. It may then be possible to identify visual patterns, and hence discover relationships within the high-dimensional data.
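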

Geomorphological applications of cluster analysis include identifying flow
directions from glacial landscapes

The third class of problem involves learning relationships between physical parameters, in order to make predictions or to infer properties. Very often, it is known that one set of observable phenomena are closely related to a different set – but the details of that relationship may be unknown, or it may be too complex to model directly. However, if it is possible to obtain sufficient observations where both sets of phenomena have been measured, a learning algorithm can be used to represent the link, and predict one given the other – for example, an algorithm might take measurements of soil properties and local topography and then output information about expected surface run-off rates. Alternatively, the same training data could be used to construct a system that infers the soil parameters given topography and run-off measurements. This may be useful when there are fewer measurements available for one of the physical parameters, perhaps because it is harder or more expensive to measure: once trained on examples where this parameter has been measured, the algorithm can be used to estimate its value in other locations based on the more widely available parameters.

Questions of this sort may be framed deterministically – so that the system provides a single prediction – or statistically, where the solution is presented as a probability distribution describing the range of possible outcomes. The choice of approach will depend upon the nature of the underlying problem, and upon the desired use of the results. In general, probabilistic approaches are desirable, since they provide a more realistic characterisation of the system under consideration – deterministic approaches can be misleading when more than one solution is compatible with available data, or where uncertainties are large. However, in some cases it may be difficult to interpret and use information presented as a probability distribution. For completeness, we observe that most learning algorithms have their roots in statistical theory, and even when used “deterministically”, the result is formally defined within a statistical framework.

In geomorphology, a common application of machine learning for regression and
interpolation is to link widely available remote sensing measurements with
underlying parameters of interest that cannot be measured directly: for
example, sediment and chlorophyll content of water from colour measurements

Summary of methods discussed in this paper. For
each of the “popular techniques” discussed in Sect.

The aforementioned problems can be tackled in almost any number of ways: there is rarely a single “correct” approach to applying learning algorithms to any given question. As will become clear, once a general technique has been selected, there remains a considerable array of choices to be made regarding its precise implementation. Usually, there is no clear reason to make one decision instead of another – often, the literature describes some “rule of thumb”, but its underlying rationale may not always be obvious. A certain amount of trial and error is generally required to obtain optimal results with a learning algorithm. This should perhaps be borne in mind when comparisons are drawn between different approaches: although many studies can be found in the literature that conclude that one method outperforms another for a given problem, it is unlikely that this has been demonstrated to hold for all possible implementations of the two methods. It is also worth noting that the relationship between performance and computational demands may differ between algorithms: a method that gave inadequate performance on a desktop computer a decade ago may nevertheless excel given the vastly increased resources of a modern, high-performance machine.

In what follows, we outline a selection of common methods, with an emphasis
on conveying general principles rather than providing precise formal
definitions. There is no particular rationale underpinning the methods we
choose to include here, beyond a desire to cover a spectrum of important
approaches. Other authors would undoubtedly make a different selection (for
example, see

A decision tree is a system that takes a data vector, and processes it via a
sequence of

Evolution of a decision tree for land use data
based on normalised parameters for vegetation index, seasonal colour
variability, and topographic roughness.

Typically, each data vector contains a number of “elements” – distinct
observations, perhaps made at different points in space or time, or of
different quantities relevant to the phenomenon of interest. Each vector is
also associated with a particular “desired outcome” – the classification or
state that the tree should output when given that example. Basic decision
tree generation algorithms aim to identify a test that can be applied to any
one element, which separates desired outcomes as cleanly as possible
(e.g. Fig.
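A minimal sketch of this split-selection step follows; the Gini impurity score used here is one common choice among several, and the data are invented for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: zero when a group contains only one outcome."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(data, outcomes):
    """Find the element index and threshold whose test
    `vector[element] <= threshold` separates outcomes most cleanly."""
    best = (None, None, np.inf)
    n = len(outcomes)
    for element in range(data.shape[1]):
        for threshold in np.unique(data[:, element]):
            left = outcomes[data[:, element] <= threshold]
            right = outcomes[data[:, element] > threshold]
            # Weighted impurity of the two resulting branches.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (element, threshold, score)
    return best

# Hypothetical data: element 1 cleanly separates the two outcomes,
# while element 0 does not.
data = np.array([[5.0, 0.1], [6.0, 0.2], [5.5, 0.8], [4.5, 0.9]])
outcomes = np.array([0, 0, 1, 1])
element, threshold, score = best_split(data, outcomes)
print(element, threshold, score)  # -> 1 0.2 0.0
```

A full tree-generation algorithm would then recurse, applying the same search within each branch.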

Tree generation assumes that the data are perfect, and will therefore continue adding branch points until all training data can be classified as desired. When real-world data sets – which invariably contain errors and “noise” – are used, this tends to result in overly complex trees, with many branches. This phenomenon is known as “overfitting”, and tends to result in a tree with poor generalisation performance: when used to process previously unseen examples, the system does not give the desired outcome as often as one might hope. It is therefore usual to adopt some sort of “pruning” strategy, by which certain branches are merged or removed. Essentially, this entails prioritising simple trees over perfect performance for training data; a variety of techniques exist, and the choice will probably be application-specific.

Another approach to this issue, and to the fact that in many problems the
number of data elements vastly exceeds the number of possible outcomes, lies
in the use of “random forests”

A recent example of an application of random forests to Earth surface data
can be found in

The

By far the most well-known technique for cluster analysis,

The algorithm is readily understood, and is illustrated in
Fig.

In order to implement this, it is necessary to define what the word
“distance” means in the context of comparing any two data vectors. There
are a number of possible definitions, but it is most common to use the
“Euclidean” distance: the square root of the sum of the squared differences
between corresponding elements of the two vectors. Thus, if
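For concreteness, the sketch below combines this distance measure with the iterative clustering procedure described above (presented here in its familiar k-means form); the data are synthetic, and the two-cluster setting and initialisation are assumptions made purely for illustration.

```python
import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared element-wise differences.
    return np.sqrt(np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
# Two well-separated groups of synthetic two-element data vectors.
data = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                  rng.normal(3.0, 0.3, size=(20, 2))])

# Iterative clustering: assign each vector to its nearest cluster
# centre, then move each centre to the mean of its assigned vectors.
centres = data[[0, -1]].copy()  # crude initialisation from the data
for _ in range(10):
    labels = np.array([np.argmin([euclidean(v, c) for c in centres])
                       for v in data])
    centres = np.array([data[labels == k].mean(axis=0) for k in (0, 1)])

print(np.sort(centres[:, 0]))  # one centre near 0.0, the other near 3.0
```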

In its basic form, the

Clustering algorithms such as

Principal component analysis of a simple data set. Dominant principal component shown in red; secondary principal component shown in blue. Line lengths are proportional to the weight attached to each component. It is apparent that the principal components align with the directions in which the data set shows most variance.

Often, the different observations comprising a given data vector are
correlated, and thus not fully independent: for example, topographic heights
at adjacent sites are likely to be reasonably similar to one another, and an
imprint of topography will often be found in other data sets such as soil
thickness or temperature. For analysis purposes, it is often desirable to
identify the patterns common to multiple data elements, and to transform
observations into a form where each parameter, or component, is uncorrelated
from the others. Principal component analysis (PCA) provides one of the most
common techniques for doing so, and has its roots in the work of

Numerical algorithms for performing PCA are complex, and there is usually
little need for the end user to understand their intricacies. In general
terms, PCA involves finding the “direction” in which the elements of a
data set exhibit the greatest variation, and then repeating this with the
constraint that each successive direction considered must be at right angles
(orthogonal) to those already found. Although easiest to visualise in two or
three dimensions (see Fig.

Image reconstruction using principal
components. PCA has been performed on a data set containing 1000 square
“patches” of bathymetric data, each representing an area of dimension
150 km

Thus, the outcome of PCA is a set of orthogonal directions (referred to as
principal components) ordered by their importance in explaining a given
data set: in a certain sense, this can be regarded as a new set of co-ordinate
axes against which data examples may be measured. The principal components
may be regarded as a set of weighted averages of different combinations of
the original parameters, chosen to best describe the data. Often, much of the
structure of a data set can be expressed using only the first few principal
components, and PCA can therefore be used as a form of
dimensionality-reduction (see Fig.
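One way to sketch this numerically is via the singular value decomposition, which is one standard route to PCA; the five-element data vectors below are synthetic, constructed so that a single direction dominates their variation.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 200 five-element vectors whose variation is
# dominated by a single underlying direction, plus a little noise.
t = rng.normal(size=(200, 1))
direction = np.array([[1.0, 2.0, 0.5, -1.0, 0.3]])
data = t @ direction + rng.normal(scale=0.1, size=(200, 5))

# PCA via the singular value decomposition: centre the data, then
# obtain orthogonal directions ordered by the variance they explain.
centred = data - data.mean(axis=0)
_, s, components = np.linalg.svd(centred, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
print(explained.round(2))  # the first component dominates

# Dimensionality reduction: project onto the first component only,
# then reconstruct; most of the structure survives the compression.
scores = centred @ components[:1].T
reconstruction = scores @ components[:1]
```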

One example of this can be found in

Schematic of a simple neural
network (a “multi-layer perceptron”). The network takes an

Perhaps the most varied and versatile class of algorithm discussed in this
paper is the neural network. As the name suggests, these were originally
developed as a model for the workings of the brain, and they can be applied
to tackling a wide range of problems. It has been shown

A neural network is constructed from a large number of interconnected
“neurons”. Each neuron is a processing unit that takes a number of numerical
inputs, computes a weighted sum of these, and then uses this result to
compute an output value. The behaviour of the neuron can therefore be
controlled by altering the weights used in this computation. By connecting
together many neurons, with the outputs from some being used as inputs to
others, a complex system – or network – can be created, as in
Fig.
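A single neuron, and a small hand-weighted network built from three of them, can be sketched as follows. The logistic sigmoid activation and the XOR task are illustrative choices only, not prescriptions; in practice the weights would be found by a training algorithm rather than set by hand.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single neuron: a weighted sum of the inputs, passed through
    a nonlinear "activation" function (here the logistic sigmoid)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, inputs) + bias)))

# A tiny network of three neurons that computes XOR – a pattern no
# single neuron can represent on its own.
def network(x1, x2):
    h1 = neuron([x1, x2], weights=[10.0, 10.0], bias=-5.0)   # ~ OR
    h2 = neuron([x1, x2], weights=[10.0, 10.0], bias=-15.0)  # ~ AND
    return neuron([h1, h2], weights=[10.0, -10.0], bias=-5.0)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(network(a, b)))  # reproduces XOR: 0, 1, 1, 0
```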

Such a brief description glosses over the richness of approaches within
“neural networks”: choices must be made regarding how individual neurons
behave, how they are connected together (the “network architecture”), and
how the training algorithm operates. These are generally somewhat
application-dependent, and may dictate how effectively the network performs.
The simplest form of neural network, sometimes called a “multi-layer
perceptron” (MLP), consists of neurons arranged in two or three “layers”,
with the outputs from neurons in one layer being used as the inputs for the
next layer (see Fig.

In recent years, attention has increasingly focussed on “deep learning”,
where many more layers of neurons are used. This has proven effective as a
means to “discover” structure in large, complex data sets and represent
these in a low-dimensional form

The modern concept of the support vector machine (SVM) stems from the work of

However, in most realistic cases, the data set cannot be cleanly categorised
using linear boundaries: all possible linear decision boundaries will
misclassify some data points. To handle this scenario, the SVM approach uses
a mathematical trick to create nonlinear decision boundaries in a way that is
computationally tractable. Data are first mapped into a higher-dimensional
“feature space” (the opposite of dimensionality reduction); in this space,
the data can then be separated using linear boundaries (see
Fig.
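The essence of the mapping trick can be sketched without the full SVM machinery (the real algorithm additionally maximises the margin, which is omitted here). The one-dimensional data and the (x, x²) feature map below are invented for illustration.

```python
import numpy as np

# One-dimensional data that no single threshold can separate:
# class 1 sits in the middle, class 0 on either side.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
labels = np.array([0, 0, 1, 1, 1, 0, 0])

# Map into a two-dimensional "feature space" (x, x**2). In this
# space the classes become separable by a straight line.
features = np.column_stack([x, x ** 2])

# A linear decision boundary in feature space: the horizontal line
# x**2 = 2. Back in the original coordinate, this corresponds to
# the decidedly nonlinear rule |x| < sqrt(2).
predicted = (features[:, 1] < 2.0).astype(int)
print(np.array_equal(predicted, labels))  # -> True
```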

Again, landslide susceptibility assessment offers one aspect of geomorphology
where SVMs have found significant application

Segmenting data sets with linear decision boundaries. In the original one dimension (top), it is not possible to separate red squares from blue circles. However, by mapping this data set into an artificial second dimension, it becomes possible to draw a linear “decision boundary” that distinguishes the two classes. The support vector machine provides a technique for achieving this in such a way that the “margin” between the boundary and the nearest data points is maximised (as shown by the dotted lines).

The self-organising map (SOM). A data set consists of
numerous data points (blue), and is to be represented by a one-dimensional
SOM (red). The SOM consists of a number of nodes, with a well-defined spatial
relationship (here, depicted by lines connecting points). Initially, all SOM
nodes are associated with random locations in data space

The concept of the self-organising map (SOM) stems from the work of

To create an SOM, we start with a map consisting of a number of “nodes”,
often arranged as a regular grid, so that it is possible to define a
“distance” between any two nodes, and hence identify the set of nodes that
lie within a certain radius of a given node, known as its “neighbourhood”.
Each node is associated with a random “codebook vector” with the same
dimensionality as the data (Fig.
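A rough sketch of the training loop for a one-dimensional SOM follows; the learning-rate and neighbourhood-radius schedules are arbitrary illustrative choices, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(0.0, 1.0, size=(500, 2))  # synthetic data vectors

# A one-dimensional SOM: ten nodes in a row, each holding a random
# codebook vector with the same dimensionality as the data.
n_nodes = 10
codebook = rng.uniform(0.0, 1.0, size=(n_nodes, 2))

for step, vector in enumerate(data):
    # Identify the best-matching node: the one whose codebook vector
    # lies closest (in the data space) to the presented example.
    best = np.argmin(np.linalg.norm(codebook - vector, axis=1))
    # Move that node, and its neighbours on the grid, towards the
    # example; learning rate and neighbourhood radius shrink over time.
    rate = 0.5 * (1.0 - step / len(data))
    radius = 3.0 * (1.0 - step / len(data)) + 0.5
    for node in range(n_nodes):
        influence = np.exp(-((node - best) ** 2) / (2.0 * radius ** 2))
        codebook[node] += rate * influence * (vector - codebook[node])
```

Because neighbouring nodes are dragged along together, the trained codebook vectors tend to preserve the grid's spatial ordering within the data space.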

Once the SOM is trained, various approaches exist to enable visualisation of
the codebook vectors, with the goal of highlighting any underlying structure
within the data set. One common approach is to try to identify clusters in the
data set by examining the distances between the codebook vectors for
neighbouring nodes, often by plotting the SOM grid coloured according to
these distances (sometimes described as depicting the “U-matrix”).
Alternatively, a method can be used to distort the grid in such a way that
when plotted in 2-D, the distance between nodes is proportional to the
distance between their codebook vectors; one common technique for this is
“Sammon's mapping”

Potential applications in geomorphology are numerous.

To conclude this section, we mention the concept of Bayesian inference. This
is a much broader topic than any of the methods discussed so far; indeed, in
many cases these methods are themselves formally derived from Bayesian
concepts. Bayes' theorem

In many cases, it is possible to estimate or compute the various
probabilities required to implement Bayes' theorem, and thus it is possible
to make probabilistic assessments. This is often useful: for example, hazard
assessment is generally better framed in terms of the chance, or risk, of an
event, rather than attempting to provide deterministic predictions. An
extensive discussion of Bayesian analysis can be found in

As a simple example, suppose we are interested in classifying land use from
satellite imagery. Grassland will appear as a green pixel 80 % of the time,
although it may also be brown: thus,
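Taking the stated 80 % figure, and inventing the remaining probabilities purely for illustration, Bayes' theorem gives the probability that a green pixel is in fact grassland:

```python
# Hypothetical numbers for the land-use example. From the text,
# P(green | grassland) = 0.8; the remaining values are assumptions
# made up for this sketch.
p_green_given_grass = 0.8
p_green_given_other = 0.3   # assumed: other land types may also be green
p_grass = 0.4               # assumed prior fraction of grassland

# Bayes' theorem:
#   P(grass | green) = P(green | grass) * P(grass) / P(green)
p_green = (p_green_given_grass * p_grass
           + p_green_given_other * (1.0 - p_grass))
p_grass_given_green = p_green_given_grass * p_grass / p_green
print(round(p_grass_given_green, 3))  # -> 0.64
```

Note that the posterior probability depends strongly on the assumed prior: a green pixel is only moderately diagnostic of grassland when other green land types are common.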

Most examples in geomorphology again come from landslide susceptibility:

Each of the techniques discussed in the previous section – and the wide variety of alternatives not mentioned here – has its strengths and weaknesses, and a number of practical issues may need to be considered when implementing a solution to a particular problem. Here, we discuss some topics that may be relevant across a range of different approaches, and which may affect the viability of a learning algorithm solution in any given case.

Unsurprisingly, the training data used when implementing a learning algorithm
can have a major impact upon results: everything the system “knows” is
derived entirely from these examples. Any biases or deficiencies within the
training data will therefore manifest themselves in the performance of the
trained system. This is not, in itself, necessarily problematic – indeed, the
landform cataloguing system introduced by

One particular issue that can arise stems from the fact that the learning algorithm lacks the trained researcher's sense of context: it has no preconception that certain structures in the data are more or less significant than others. For example, suppose a system is developed to classify valley profiles as being formed by either a river or a glacier. The training data for this system would consist of a sequence of hand-classified valley profiles, each expressed as a vector of topographic measurements. If, for the sake of example, all glacial examples happen to be drawn from low-elevation regions, and all riverine examples from high-elevation regions, it is likely that the system would learn to treat elevation as an important factor during classification.

A second potential pitfall arises from the statistical basis underpinning
most learning algorithms. Typically, the extent to which a particular feature
or facet of the data set will be “learnt” depends on its prevalence within
the training examples as a whole. This can make it difficult for a system to
recognise and use information that occurs infrequently in the training data,
since it gets “drowned out” by more common features. Again, considering the
problem of valley classification, if 99 % of training examples are glacial,
the system is likely to learn to disregard its inputs and simply classify
everything as glacial, since this results in a very low error rate; for best
results, both types should occur in roughly equal proportions within the
training set. As before, this should be regarded as a natural property of the
learning process, rather than as being inherently problematic; indeed, it can
be exploited as a tool for “novelty detection”, allowing unusual features
within a data set to be identified
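The imbalance effect can be demonstrated with a deliberately trivial classifier:

```python
# With 99 % of training examples labelled "glacial", a classifier
# that ignores its inputs entirely still achieves a 99 % hit rate.
labels = ["glacial"] * 99 + ["riverine"] * 1

def lazy_classifier(_example):
    return "glacial"          # disregard the input completely

correct = sum(lazy_classifier(None) == lab for lab in labels)
print(correct / len(labels))  # -> 0.99
```

Any genuine learning algorithm must outperform this baseline before its output can be considered informative.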

To avoid problems, it is important to choose training data with care, and to
develop strategies for evaluating and monitoring the performance of the
trained system. It is often beneficial to invest time in finding the best way
to represent a given data type, so as to accentuate the features of interest,
and remove irrelevant variables. This process is sometimes referred to as
“feature selection”

For “supervised” learning algorithms, where a pre-defined relationship or structure is to be learnt, it is possible to assess performance using a second set of user-selected data – often referred to as a “test” or “monitoring” data set. These test data are intended to provide an independent set of examples of the phenomena of interest, allowing quantitative measures of effectiveness to be evaluated; this strategy is sometimes referred to as “cross-validation”. It is important to do this using examples separate from the training set in order to ensure that the system's “generalisation performance” is measured: we want to be sure that the algorithm has learned properties that can be applied to new cases, rather than learning features specific to the particular examples used during training. As an analogy: a dog may learn to respond to particular commands, but only when issued in a very specific manner; it cannot then be said to properly understand a spoken language.
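A sketch of this train/test separation, using a toy nearest-centroid rule on synthetic data (all values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic labelled data: two noisy classes in two dimensions.
data = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                  rng.normal(2.0, 0.5, size=(100, 2))])
labels = np.repeat([0, 1], 100)

# Hold back an independent test set before any training occurs.
order = rng.permutation(len(data))
train, test = order[:150], order[150:]

# Train (here, a simple nearest-centroid rule) on the training split only.
centroids = np.array([data[train][labels[train] == c].mean(axis=0)
                      for c in (0, 1)])

# Generalisation performance is then measured on examples the system
# never saw during training.
distances = np.linalg.norm(data[test][:, None, :] - centroids, axis=2)
predictions = np.argmin(distances, axis=1)
accuracy = np.mean(predictions == labels[test])
print(accuracy)  # high, as the two classes are well separated
```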

The metric by which performance is assessed is likely to be
situation-dependent. In general, a supervised learning algorithm will be
designed to optimise certain “error measures”, and these are likely to
provide a good starting point. Nevertheless, other statistics may also prove
useful. For classification systems, analysis of “receiver operating
characteristics” (ROCs) such as hit and false positive rates may be
instructive.
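For a binary classifier, a point on the ROC curve is simply the pair (hit rate, false positive rate) at a given decision threshold. A minimal sketch of the computation (function and argument names are illustrative):

```python
def roc_point(predicted, actual, positive):
    """Return (hit rate, false positive rate) for one set of predictions.
    Hit rate: fraction of true positives correctly identified.
    False positive rate: fraction of true negatives wrongly flagged."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    n_pos = sum(1 for a in actual if a == positive)
    n_neg = len(actual) - n_pos
    hit_rate = tp / n_pos if n_pos else 0.0
    fp_rate = fp / n_neg if n_neg else 0.0
    return hit_rate, fp_rate
```

Sweeping the classifier's decision threshold and plotting the resulting points traces out the full ROC curve.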

Assessing performance in unsupervised learning is more challenging, as we may not have prior expectations against which to measure results: fundamentally, it may be challenging to define what “good performance” should signify. In many cases, application-specific statistics may be helpful – for example, in cluster analysis it is possible to calculate the standard deviation of each cluster, quantifying how tightly each is defined – and again, the researcher's sense of plausibility may provide some insight. It may also prove instructive to repeat training using a different subset of examples, to assess how stable results are with respect to variations in the training data: a structure or grouping that appears consistently is more likely to be real and significant than one that is very dependent on the precise examples included in the training set.
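The cluster-tightness statistic mentioned above can be computed directly: for each cluster, the root-mean-square distance of its members from their centroid. A minimal sketch (names are illustrative; points are tuples of coordinates):

```python
import math

def cluster_spreads(points, assignments):
    """Per-cluster RMS distance from the cluster centroid: a small value
    indicates a tightly defined cluster."""
    clusters = {}
    for p, c in zip(points, assignments):
        clusters.setdefault(c, []).append(p)
    spreads = {}
    for c, pts in clusters.items():
        dim = len(pts[0])
        centroid = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        sq = [sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in pts]
        spreads[c] = math.sqrt(sum(sq) / len(pts))
    return spreads
```

Comparing such statistics across repeated runs, each trained on a different subset of the data, provides the stability check described above.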

The phenomenon of “overtraining” or “overfitting”, which typically arises
in the context of supervised learning algorithms, has already been alluded
to. It occurs when an iterative training procedure is carried out for too
many iterations: at some point, the system usually begins to learn
information that is specific to the training examples, rather than being
general. This tends to reduce the performance of the system when subsequently
applied to unseen data. It can often be detected by monitoring the
algorithm's generalisation performance using an independent set of examples
as training proceeds: this enables the training procedure to be terminated
once generalisation performance begins to decrease. In certain cases,
post-training strategies can be used to reduce the degree of over-fitting:
the example of “pruning” decision trees has already been mentioned. It has
also been shown that “ensemble methods” may be useful (indeed, “random
forests” provide one example of an ensemble method), whereby multiple
instances of the same learning algorithm are (over-)trained, from different
randomised starting points, and their outputs are then averaged in some manner.
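The monitoring strategy described above is commonly termed “early stopping”, and the basic loop is easily sketched. In the following (purely illustrative) version, `train_step` performs one training iteration and `validation_error` evaluates current error on the independent monitoring set; training halts once no improvement has been seen for `patience` iterations:

```python
def train_with_early_stopping(train_step, validation_error,
                              max_iterations=1000, patience=5):
    """Run training iterations, tracking error on a held-out monitoring set,
    and stop once generalisation performance ceases to improve."""
    best_error = float("inf")
    best_iteration = 0
    for i in range(max_iterations):
        train_step()
        err = validation_error()
        if err < best_error:
            best_error, best_iteration = err, i
        elif i - best_iteration >= patience:
            break  # monitoring error has risen for `patience` iterations
    return best_iteration, best_error
```

In a real application one would also retain the model parameters from the best iteration, rather than merely its index.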

Another strategy that is adopted is to add random noise to the training data. The rationale here is that training examples typically exhibit an underlying pattern or signal of interest, overprinted by other processes and observational errors. In order to learn the structure of interest, we wish to desensitise our training procedure to the effects of this overprinting. If we can define a “noise model” that approximates the statistical features of the unwanted signal, adding random realisations of this to each training example allows us to limit the extent to which the algorithm learns to rely on such features: by making their appearance random, we make them less “useful” to the algorithm. Returning again to the example of valley classification, local variations in erosion and human intervention might be modelled as correlated Gaussian noise on each topographic measurement. During training, each example is used multiple times, with different noise on each occasion; in theory, this results in only the gross features of the valley profile being taken into account for classification purposes. However, it may be challenging to identify and construct appropriate noise models in many realistic cases.
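A minimal sketch of this augmentation strategy follows; for simplicity it adds independent (uncorrelated) Gaussian noise to each element of an example, whereas the correlated noise model discussed above would require a full covariance matrix. All names are illustrative:

```python
import random

def noisy_copies(example, n_copies, noise_scale, seed=0):
    """Generate `n_copies` of one training example, each perturbed by
    independent Gaussian noise of standard deviation `noise_scale`."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, noise_scale) for x in example]
            for _ in range(n_copies)]
```

Each copy is then presented to the algorithm as a distinct training example, so that the noise component carries no consistent information.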

It is worth noting here that a similar strategy may prove useful in other cases where it is desirable to desensitise systems to particular aspects of the data. For example, spatial observations are typically reported on a grid, aligned with geographic coordinates. However, natural phenomena typically do not display any such alignment, and orientation information may be irrelevant in many cases. If 2-D spatial information is used in a particular case, it may be desirable to make use of multiple, randomly rotated copies of each training set example. This allows an effectively larger training set to be created, and reduces the chance that features are treated differently depending on their alignment.
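For gridded data, the simplest such augmentation uses the four 90-degree rotations of each patch, which require no interpolation; rotation by arbitrary angles would demand resampling of the grid. A minimal sketch (the function name is illustrative):

```python
def rotations(grid):
    """Return the four 90-degree rotations of a 2-D gridded patch,
    represented as a list of rows."""
    out = [grid]
    current = grid
    for _ in range(3):
        # Rotate 90 degrees clockwise: reverse row order, then transpose.
        current = [list(row) for row in zip(*current[::-1])]
        out.append(current)
    return out
```

Presenting all four orientations of each example during training quadruples the effective training set and discourages the algorithm from attaching significance to grid alignment.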

As with any analysis technique, results obtained using learning algorithms can be misleading if not treated carefully. This is especially true where a technique invites treatment as a “black box”, and where the mechanism by which it operates is not always easily understood. The great strength of artificial intelligence is that it enables a computer to mimic the experienced researcher – but this is also a potential drawback, tending to distance the researcher from their data. In some sense, this is an inevitable consequence of the ever-increasing quantity of data available to researchers – but there is a risk that it leads to subtleties being missed, or results interpreted wrongly due to misapprehensions surrounding computational processing. To minimise the risk of these issues arising, it is important that researchers develop heuristics and procedures that enable “intelligent” systems to be monitored. For example, users of automated data classification systems should monitor the statistical distributions of classification outputs, and investigate any deviations from the norm; it is also desirable to spot-check classifications. This is particularly true in settings where there is a risk that new examples lie outside the scope of the training set – perhaps where data are drawn from new geographic regions, or were collected at a different time of year.
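One simple realisation of such monitoring is to compare the distribution of classification outputs against a reference distribution (for example, that observed on the test set), and raise an alarm when they diverge. A minimal sketch, with illustrative names:

```python
def distribution_drift(reference_counts, observed_counts):
    """Largest absolute change in class proportion between a reference
    output distribution and the current one; large values suggest the
    new data may lie outside the scope of the training set."""
    classes = set(reference_counts) | set(observed_counts)
    ref_total = sum(reference_counts.values())
    obs_total = sum(observed_counts.values())
    return max(abs(reference_counts.get(c, 0) / ref_total
                   - observed_counts.get(c, 0) / obs_total)
               for c in classes)
```

A threshold on this statistic can trigger manual spot-checking of recent classifications.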

Learning algorithms have immense potential in enabling exploration of large,
complex data sets and in automating complicated tasks that would otherwise
have to be done manually. However, developing a learning algorithm
system – especially one targeted at classification, regression, or
interpolation – can also be time-consuming and resource-intensive.
Computational demands may also be significant: although many applications
require no more than a standard laptop or desktop computer, some may only be
viable with access to large-scale parallel computing resources. By way of an
illustration for the more computationally intensive end of the spectrum:
training the learning algorithm used to catalogue seamounts in

In this paper, we have attempted to provide a general survey of the field of learning algorithms, and how they might be useful to researchers working in the fields of geomorphometry and Earth surface dynamics. These fields benefit from extensive, large-scale, feature-rich data sets, bringing opportunities and challenges in equal measure. Although currently dominated by a few specific topics in the field, such as landslide hazard assessment, the use of artificial intelligence to help explore, process, and interpret geomorphological data is almost certain to be an increasingly significant aspect of research in coming years.

An increased use of learning algorithms in geomorphological communities is likely to require developments in computational infrastructure. There are obvious requirements for access to appropriate hardware and software resources, as well as skill development. In particular, larger problems or more complex algorithms may make the use of parallel computing and other high-performance computing techniques essential. In addition, in light of the potentially substantial development cost of implementing some of the more complex learning algorithms, it is worth trying to plan for a flexible implementation. Most of these approaches can, in principle, be implemented in a fairly general framework, allowing the same underlying algorithm to be applied to many problems.

The computational implications of both parallelisation and generalisation are beyond the scope of this review, but one area of particular relevance to the geomorphology community concerns data input and output, and hence data format: the ability to reuse an algorithm across multiple data types is a key element of flexibility in a field with such a diversity of measurements. The ability to handle large file sizes and support efficient data access, potentially in parallel, is also an important consideration. As these techniques develop, this places increasing importance on the development and use of robust community standards for data formats. File frameworks, which allow the development of multiple specialist file formats all adhering to a common set of rules, may be particularly valuable in combining consistency from an algorithmic point of view with flexibility to accommodate varied data.

However, learning algorithms are not a panacea. “Traditional” approaches to
data analysis will remain important for the foreseeable future, and are
well suited to tackling many problems; we do not advocate their wholesale
replacement. In addition, it is important that the use of more advanced
computational techniques is not allowed to become a barrier between
researchers and data; it is almost certain that the nature and practice of
“research” will need to evolve to accommodate increasing use of these
technologies. Some interesting perspectives on these issues may be found
in – for example – an issue of

We are grateful to the associate editor, John Hillier, and to Niels Anders, J. J. Becker, Ian Evans, and Evan Goldstein for reviews and comments on the initial draft of this manuscript. We also thank Jeannot Trampert for numerous useful discussions. A. P. Valentine is supported by the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013)/ERC grant agreement no. 320639. Edited by: J. K. Hillier