Convergence refers to the limit of a process and can be a useful analytical tool when evaluating the expected performance of an optimization algorithm.

It can also be a useful empirical tool when exploring the learning dynamics of an optimization algorithm, and machine learning algorithms trained using an optimization algorithm, such as deep learning neural networks. This motivates the investigation of learning curves and techniques, such as early stopping.

If optimization is a process that generates candidate solutions, then convergence represents a stable point at the end of the process when no further changes or improvements are expected. **Premature convergence** refers to a failure mode for an optimization algorithm where the process stops at a stable point that does not represent a globally optimal solution.

In this tutorial, you will discover a gentle introduction to premature convergence in machine learning.

After completing this tutorial, you will know:

- Convergence refers to the stable point found at the end of a sequence of solutions via an iterative optimization algorithm.
- Premature convergence refers to a stable point found too soon, perhaps close to the starting point of the search, and with a worse evaluation than expected.
- Greediness of an optimization algorithm provides a control over the rate of convergence of an algorithm.

Let’s get started.

## Tutorial Overview

This tutorial is divided into three parts; they are:

- Convergence in Machine Learning
- Premature Convergence
- Addressing Premature Convergence

## Convergence in Machine Learning

Convergence generally refers to the values of a process that have a tendency in behavior over time.

It is a useful idea when working with optimization algorithms.

Optimization refers to a type of problem that requires finding a set of inputs that result in the maximum or minimum value from an objective function. Optimization is an iterative process that produces a sequence of candidate solutions until ultimately arriving upon a final solution at the end of the process.

This behavior or dynamics of the optimization algorithm arriving on a stable-point final solution is referred to as convergence, e.g. the convergence of the optimization algorithms. In this way, convergence defines the termination of the optimization algorithm.

Local descent involves iteratively choosing a descent direction and then taking a step in that direction and repeating that process until convergence or some termination condition is met.

— Page 13, Algorithms for Optimization, 2019.

**Convergence**: Stop condition for an optimization algorithm where a stable point is located and further iterations of the algorithm are unlikely to result in further improvement.

We might measure and explore the convergence of an optimization algorithm empirically, such as using learning curves. Additionally, we might also explore the convergence of an optimization algorithm analytically, such as a convergence proof or average case computational complexity.

Strong selection pressure results in rapid, but possibly premature, convergence. Weakening the selection pressure slows down the search process …

— Page 78, Evolutionary Computation: A Unified Approach, 2002.

Optimization, and the convergence of optimization algorithms, is an important concept in machine learning for those algorithms that fit (learn) on a training dataset via an iterative optimization algorithm, such as logistic regression and artificial neural networks.

As such, we may choose optimization algorithms that result in better convergence behavior than other algorithms, or spend a lot of time tuning the convergence dynamics (learning dynamics) of an optimization algorithm via the hyperparameters of the optimization (e.g. learning rate).

Convergence behavior can be compared, often in terms of the number of iterations of an algorithm required until convergence, to the objective function evaluation of the stable point found at convergence, and combinations of these concerns.

## Premature Convergence

Premature convergence refers to the convergence of a process that has occurred too soon.

In optimization, it refers to the algorithm converging upon a stable point that has worse performance than expected.

Premature convergence typically afflicts complex optimization tasks where the objective function is non-convex, meaning that the response surface contains many different good solutions (stable points), perhaps with one (or a few) best solutions.

If we consider the response surface of an objective function under optimization as a geometrical landscape and we are seeking a minimum of the function, then premature optimization refers to finding a valley close to the starting point of the search that has less depth than the deepest valley in the problem domain.

For problems that exhibit highly multi-modal (rugged) fitness landscapes or landscapes that change over time, too much exploitation generally results in premature convergence to suboptimal peaks in the space.

— Page 60, Evolutionary Computation: A Unified Approach, 2002.

In this way, premature convergence is described as finding a locally optimal solution instead of the globally optimal solution for an optimization algorithm. It is a specific failure case for an optimization algorithm.

**Premature Convergence**: Convergence of an optimization algorithm to a worse than optimal stable point that is likely close to the starting point.

Put another way, convergence signifies the end of the search process, e.g. a stable point was located and further iterations of the algorithm are not likely to improve upon the solution. Premature convergence refers to reaching this stop condition of an optimization algorithm at a less than desirable stationary point.

## Addressing Premature Convergence

Premature convergence may be a relevant concern on any reasonably challenging optimization task.

For example, a majority of research into the field of evolutionary computation and genetic algorithms involves identifying and overcoming the premature convergence of the algorithm on an optimization task.

If selection focuses on the most-fit individuals, the selection pressure may cause premature convergence due to reduced diversity of the new populations.

— Page 139, Computational Intelligence: An Introduction, 2nd edition, 2007.

Population-based optimization algorithms, like evolutionary algorithms and swarm intelligence, often describe their dynamics in terms of the interplay between selective pressures and convergence. For example, strong selective pressures result in faster convergence and likely premature convergence. Weaker selective pressures may result in a slower convergence (greater computational cost) although perhaps locate a better or even global optima.

An operator with a high selective pressure decreases diversity in the population more rapidly than operators with a low selective pressure, which may lead to premature convergence to suboptimal solutions. A high selective pressure limits the exploration abilities of the population.

— Page 135, Computational Intelligence: An Introduction, 2nd edition, 2007.

This idea of selective pressure is helpful more generally in understanding the learning dynamics of optimization algorithms. For example, an optimization that is configured to be too greedy (e.g. via hyperparameters such as the step size or learning rate) may fail due to premature convergence, whereas the same algorithm that is configured to be less greedy may overcome premature convergence and discover a better or globally optimal solution.

Premature convergence may be encountered when using stochastic gradient descent to train a neural network model, signified by a learning curve that drops exponentially quickly then stops improving.

The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set.

— Page 153, Deep Learning, 2016.

The fact that fitting neural networks are subject to premature convergence motivates the use of methods such as learning curves to monitor and diagnose issues with the convergence of a model on a training dataset, and the use of regularization, such as early stopping, that halts the optimization algorithm prior to finding a stable point comes at the expense of worse performance on a holdout dataset.

As such, much research into deep learning neural networks is ultimately directed at overcoming premature convergence.

Empirically, it is often found that ‘tanh’ activation functions give rise to faster convergence of training algorithms than logistic functions.

— Page 127, Neural Networks for Pattern Recognition, 1995.

This includes techniques such as work on weight initialization, which is critical because the initial weights of a neural network define the starting point of the optimization process, and poor initialization can lead to premature convergence.

The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.

— Page 301, Deep Learning, 2016.

This also includes the vast number of variations and extensions of the stochastic gradient descent optimization algorithm, such as the addition of momentum so that the algorithm does not overshoot the optima (stable point), and Adam that adds an automatically adapted step size hyperparameter (learning rate) for each parameter that is being optimized, dramatically speeding up convergence.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Tutorials

### Books

### Articles

## Summary

In this tutorial, you discovered a gentle introduction to premature convergence in machine learning.

Specifically, you learned:

- Convergence refers to the stable point found at the end of a sequence of solutions via an iterative optimization algorithm.
- Premature convergence refers to a stable point found too soon, perhaps close to the starting point of the search, and with a worse evaluation than expected.
- Greediness of an optimization algorithm provides a control over the rate of convergence of an algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.