How to find outliers: the formula and methods

This simple guide on how to find outliers explains what an outlier is, how to detect outliers, and how to deal with them.


If you are doing analysis for business, you will occasionally be faced with outliers that risk skewing the data. For this reason, it is important to know how to find outliers and how to deal with them.

Outliers are extreme values that stand out from the other values in the data set and are clearly isolated from the remaining points in the data space. The results of data analyses, for example a cluster analysis, can be completely distorted by a few outliers. It is therefore very important to know how to find outliers and how to deal with them properly.

The record’s ‘purple nose’ is an outlier. Adding the ‘nose’ to ‘eyes’ or ‘mouth’ would distort the natural shape of the clusters.

An outlier within a data set through which a linear best-fit line is to be drawn leads to the so-called leverage effect. The outlier has a disproportionate effect on the resulting regression line and should therefore be excluded from the fit.
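The leverage effect can be illustrated with a small sketch. The data below are made up for illustration, and the use of NumPy's `polyfit` for the best-fit line is an assumption, not something the original article prescribes. Ten points lie exactly on a line with slope 2, and one distant point is then added.

```python
import numpy as np

# Hypothetical data: ten points lying exactly on the line y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Add one illustrative outlier far to the right of and far below the line
x_out = np.append(x, 30.0)
y_out = np.append(y, 0.0)

# Least-squares straight-line fits; polyfit returns [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y, 1)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
```

In this example, the single added point is enough to turn the fitted slope from 2 into a slightly negative value, even though the other ten points are unchanged.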

There are various methods of recognizing the deviation of a data point from the other data and assessing it as ‘strong enough’. The decision as to whether the point under consideration is an outlier is mostly at the discretion of the user.

Outliers are not necessarily incorrect or inaccurately recorded values. Under certain circumstances they are correct and precise values that simply run contrary to expectations. This makes it difficult to distinguish outliers from normal data.

How to find outliers: Examples of outliers

The difficulty of detecting outliers meant, for example, that the ozone hole over Antarctica went undiscovered for years. The software that measured the ozone over Antarctica automatically removed the supposed outliers, which were precisely the data that indicated the decline of the ozone layer.

The ozone hole had been measured for many years. However, the measurements were interpreted as “wrong” and classified as outliers. Because of this, the effects of the ozone hole were not fully captured.

Something similar happened in technology.

The de Havilland Comet was the world’s first jet-propelled airliner. This aircraft revolutionized passenger flights, as travel times were cut in half and flights became very comfortable (quiet and low-vibration).

Between 1953 and 1954, three of these aircraft crashed. There were no survivors. Up until then, none of the planes had shown any abnormalities that could explain the crashes. The loss of the first aircraft was attributed to the weather, so a systematic error was ruled out.

Types of outliers

In principle, an outlier can be assigned to one of the following groups:

The value is simply a coincidence. With all scattered data, extreme values can appear purely by chance.
The value is a transposed number (for example, 31 was entered instead of 13 during data transfer).
Incorrect data was entered by mistake, measurement signals were mixed up, or there is a measurement error.
There is a plausible physical reason why this value differs from the rest of the values.

Strictly speaking, the first three points are not outliers, but extreme values. There are basically two ways of reacting to outliers and extreme values.

On the one hand, the data evaluation can be carried out using methods that are robust against outliers. Examples are the median instead of the mean, or the quartiles (for instance the interquartile range) instead of the standard deviation.
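A minimal sketch of why these measures are called robust is shown here. The measurement values are invented for illustration, and taking the distance between the first and third quartile (the interquartile range) as the quartile-based measure of spread is an assumption.

```python
import numpy as np

# Hypothetical measurements; the last value is an illustrative outlier
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 55.0])

# Location: the mean is pulled towards the outlier, the median is not
mean = values.mean()
median = np.median(values)

# Spread: sample standard deviation versus interquartile range (Q3 - Q1)
std = values.std(ddof=1)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

print(f"mean = {mean:.2f}, median = {median:.2f}")
print(f"std  = {std:.2f}, IQR    = {iqr:.2f}")
```

With these numbers, the median stays at the level of the bulk of the data and the interquartile range stays small, while the mean and the standard deviation are both dominated by the single extreme value.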

On the other hand, potential outliers can be found with the help of statistical methods. More precisely, these methods identify values that deviate statistically significantly from the remaining values. Whether such a value is actually an outlier must then be clarified by the analyst through a separate investigation.

It is often assumed that the sample values are randomly distributed according to the normal distribution (or, for fatigue strength, according to the log-normal distribution). A value can then be declared an outlier if it deviates statistically significantly from this distribution.

How to find outliers: detection

Outliers can be identified with the help of outlier tests. These tests are mostly based on statistical measures that check the extent to which a data point meets the expectations for the data set.

There are a number of such outlier tests that have various advantages and disadvantages and are more or less suitable depending on the nature of the data set. The following tests can be found in the literature:

Grubbs’ outlier test
Outlier test according to Hampel
Walsh outlier test
Dean-Dixon test
Nalimov test
2σ method

In the following we want to take a closer look at the 2σ method.

2σ method

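In a data set with $n$ data points from $\mathbb{R}^p$, a point $x_k$ ($k = 1, \dots, n$) is classified as an outlier as soon as at least one component $x_k^{(i)}$ ($i = 1, \dots, p$) deviates from the mean value $\bar{x}^{(i)}$ of that component by more than twice its standard deviation $\sigma^{(i)}$, that is, as soon as $|x_k^{(i)} - \bar{x}^{(i)}| > 2\sigma^{(i)}$.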

The 2σ method is easy to implement. However, this test cannot distinguish incorrect data, i.e. real outliers, from ‘exotic’, correct but unusual data.
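The rule described above could be sketched as follows. The function name `two_sigma_outliers`, the sample data, and the choice of the sample standard deviation (`ddof=1`) are assumptions made for illustration. The mean and standard deviation are computed separately for each component.

```python
import numpy as np

def two_sigma_outliers(data):
    """Flag rows that have at least one component lying more than two
    standard deviations away from that component's mean.

    Returns (row_mask, component_mask), both boolean arrays.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)             # mean of each component
    sigma = data.std(axis=0, ddof=1)     # sample std (an assumption)
    component_mask = np.abs(data - mean) > 2.0 * sigma
    row_mask = component_mask.any(axis=1)
    return row_mask, component_mask

# Hypothetical data set: n = 8 points with p = 2 components each
data = np.array([
    [1.0, 10.0],
    [1.1, 10.2],
    [0.9,  9.9],
    [1.0, 10.1],
    [1.2,  9.8],
    [0.8, 10.0],
    [1.0, 10.1],
    [1.1, 25.0],   # second component is far from the others
])

row_mask, component_mask = two_sigma_outliers(data)
print(data[row_mask])   # rows classified as outliers
```

Here only the last row is flagged, and only because of its second component. Note that the outlier itself inflates both the mean and the standard deviation it is compared against, which is one reason the result should be treated as a hint rather than a verdict.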

Dealing with outliers

If outliers have been identified with the help of an outlier test, the way they are treated afterwards has a great influence on the results of the data analysis. Depending on how much the data set is to be changed, one of the following steps is taken:

Correction of the affected component $x_k^{(i)}$ of the outlier $x_k$
Removal of the affected component $x_k^{(i)}$ of the outlier $x_k$
Removal of the entire outlier $x_k$

Removing the entire outlier $x_k$ is a common choice. In practice, however, this has the disadvantage that a large part of the original data set may be removed, so that too little data may remain for the analysis. The remedy is to remove only those components of the outlier that actually deviate too much from the other data.
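The difference between the two removal options can be sketched as follows. The data and the `flagged` mask are hypothetical stand-ins for the result of an outlier test, and representing a removed component as NaN is an assumption made for illustration.

```python
import numpy as np

# Hypothetical data set with p = 2 components per point
data = np.array([
    [1.0, 10.0],
    [1.1, 10.2],
    [0.9, 99.0],   # second component flagged by the outlier test
    [1.0, 10.1],
])

# Hypothetical outlier-test result: True marks a flagged component
flagged = np.array([
    [False, False],
    [False, False],
    [False, True],
    [False, False],
])

# Option A: remove the entire outlier, i.e. the whole row
rows_kept = data[~flagged.any(axis=1)]

# Option B: remove only the affected component by marking it as missing
components_removed = np.where(flagged, np.nan, data)

# NaN-aware statistics skip the removed component but keep the rest of the row
col_means = np.nanmean(components_removed, axis=0)

print(rows_kept.shape)   # one row fewer than the original
print(col_means)
```

With option B, the unflagged first component of the third point still contributes to the analysis, which is the saving the paragraph above describes.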

If one does not want to reduce the collected data set, it is advisable to correct those components of the outlier that, according to the outlier test used, deviate too strongly from the other data. The affected component can be corrected, for example, by replacing it with

the maximum or minimum value
the global mean
the nearest neighbor value
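The three replacement strategies could look like this for a single component. Several details are assumptions made for illustration: the data and the `flagged` mask are made up, the maximum, minimum, and mean are computed from the unflagged values only, and "nearest neighbor" is interpreted as the closest unflagged entry in record order.

```python
import numpy as np

# Hypothetical values of one component, with two flagged entries
values = np.array([10.0, 10.2, 99.0, 10.1, 9.9, -50.0, 10.0])
flagged = np.array([False, False, True, False, False, True, False])

good = values[~flagged]   # values that passed the outlier test

# 1) Maximum or minimum value: clip to the range of the unflagged values
clipped = np.clip(values, good.min(), good.max())

# 2) Global mean: replace flagged entries by the mean of the unflagged values
mean_filled = np.where(flagged, good.mean(), values)

# 3) Nearest neighbor: copy the closest unflagged entry in record order
good_idx = np.flatnonzero(~flagged)
nn_filled = values.copy()
for i in np.flatnonzero(flagged):
    nearest = good_idx[np.argmin(np.abs(good_idx - i))]
    nn_filled[i] = values[nearest]

print(clipped)
print(mean_filled)
print(nn_filled)
```

Which of the three is appropriate depends on the data: clipping keeps the direction of the deviation, the mean erases it, and the neighbor value only makes sense when adjacent records are expected to be similar.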

Depending on the purpose of the data exploration, it is extremely important to treat the identified outliers ‘properly’. A decision about what ‘right’ means must be made by the user in each case.

Editorial Team