# Indiscriminate crossover in genetic algorithm is bad

It is quite well-established that the crossover operator in genetic algorithm (GA), albeit vital for opening doors to new search spaces, also suffers from a myriad of problems. When chromosomes are ordered, crossing over may disrupt existing order thus necessitating additional repairing. In a previous post, I also discussed an issue where crossover is possibly detrimental when the chromosome has complex internal structures. This time, I am going to explore another potential source of pitfall, and try to propose a solution.

## The existence of sub-populations

Let’s consider the largest counterpart to genetic algorithm: Nature. By the principle of survival of the fittest, life has evolved over eons leading to theoretically a very fit population. But paradoxically, instead of converging to a single optimum where near uniform entities overwhelm our planet (no, humans have yet to fall into this category), earth’s life forms are surprisingly diverse. Each specie is exceedingly adept at surviving in their own niche. There is life that thrives at the boiling heat of deep-sea vents, and at freezing temperatures around Arctic ice. In the realm of mathematical optimization, those species can be viewed as candidates congregating at various local minima in the search space of the fittest individual. Since in nature combining genomes from disparate species rarely result in viable life form, much less a fitter one, it could be surmised that the crossover operation in genetic algorithm is subject to similar restrictions. When the crossover operation is applied to two highly different candidates, the resulting chromosome could be completely nonsensical, and quickly eliminated after a few generations. This defeats the purpose of crossing over, which attempts to mix features from fit parents to get fitter offspring.

As an example, suppose GA is used to find an optimum of this fitness function:

This is a two-parameter function, hence GA is locating the highest points in two-dimensional space. Here’s the 3D plot:

And the contour plot:

It is apparent that the two optimum points lie in the middle of area 1 and 4 respectively. However, they actually belong to different sub-populations. A crossover operation on points taken from both area 1 and area 4 will result in a point that lies in area 2 or 3, which has low fitness scores. E.g. point [1.5, 4.5] combined with [4.0, 1.0] will either yield [1.5, 1.0] (in area 3) or [4.0, 4.5] (in area 2). This shows that crossover operation on different species, or sub-populations, has the potential to produce low-performing offspring. Crossover done on points from the same sub-population, on the other hand, does not pose this issue. E.g. [0.5, 5.0] combined with [0.8, 5.6] may give [0.5, 5.6], which still has a pretty high fitness score.

In fact, if the population is uniformly distributed over the search space, each crossover operation has 12.5% probability of generating poor offspring from superior parents. This probability soon skyrockets to 50% as the population starts to converge to the two areas of high fitness. *This means that half of the computational effort is wasted evaluating and eliminating poor candidates that shouldn’t have been generated in the first place!*

## Curse of dimensionality

The situation quickly deteriorates as the number of dimensions increases. Consider the following scenario: the search space is an N-dimension space and candidates are N-element vectors whose values are to be optimized. The fitness function is set to be:

where **A _{i} **represents the i-th element in the candidate vector

**A**. The function evaluates the square difference between components of the vector. Hence the optimal solution is an hyper-ridge where all components have the same value ([1,1,1], [2,2,2] etc.).

Here the crossover operation becomes increasingly an impediment to the progress of search as dimensionality increases. I produced a graph of number of steps required to reach a fairly optimal solution (fitness > -0.1) against number of dimensions:

From the graph, it can be seen that initially pure mutation combine with crossover has the upper hand given the additional diversity it provides, but as number of dimensions increases, crossover became increasingly susceptible to producing inferior offspring. The population is replete with unfit candidates produced by crossing over parents from incompatible sub-population, hence greatly reducing the possibility of any candidate accidentally stumbling upon an optimum solution via mutation.

## A patch to crossover?

Now that it is clear that applying crossover indiscriminately is detrimental to genetic search, is there any way to prevent this, or to regulate crossover? By again looking to nature, from which the inspiration for GA is originally obtained, I could think of a rough idea: clustering. Just like different species in nature never mate, you don’t want to apply crossover to candidates from different sub-populations. One way to identify the sub-populations is to run a clustering algorithm on the candidate vectors, possibly putting candidates with small Euclidean distance to one another other into one category, after which crossover is only performed on candidates from the same category. But this introduces more parameters (e.g. cut-off point for distance) into the optimization processes, which necessitates more tuning. A cluster too uniform would be useless since no new elements can ever be introduced by crossing over near identical ones. A cluster too diverse may again pose unnecessary risks of producing individuals of low fitness.

A self-organizing clustering process would be nice. Specifically, instead of pessimistically assuming that objects belonging to different clusters do not play well, it is better to perform a random sampling over the results of crossover applied to candidates from two clusters, thus giving a rough estimate of the two clusters’ compatibility with each other. A graph of compatibility relationships can be established, and clusters deemed compatible by experimentation can now be merged into one larger cluster, from which elements will be selected for crossover. This effectively prevents indiscriminate crossover to a large extent, while preserving population diversity.